A Robust Attentional Framework for License Plate Recognition in the Wild

06/06/2020 ∙ by Linjiang Zhang, et al.

Recognizing car license plates in natural scene images is an important yet still challenging task in realistic applications. Many existing approaches perform well for license plates collected under constrained conditions, e.g., shot from frontal and horizontal view angles and under good lighting conditions. However, their performance drops significantly in an unconstrained environment that features rotation, distortion, occlusion, blurring, shading or extreme dark or bright conditions. In this work, we propose a robust framework for license plate recognition in the wild. It is composed of a tailored CycleGAN model for license plate image generation and an elaborately designed image-to-sequence network for plate recognition. On one hand, the CycleGAN-based plate generation engine alleviates the exhausting human annotation work. A massive amount of training data can be obtained with a more balanced character distribution and various shooting conditions, which helps to boost the recognition accuracy to a large extent. On the other hand, the 2D attention-based license plate recognizer with an Xception-based CNN encoder is capable of recognizing license plates with different patterns under various scenarios accurately and robustly. Without using any heuristic rules or post-processing, our method achieves state-of-the-art performance on four public datasets, which demonstrates the generality and robustness of our framework. Moreover, we release a new license plate dataset, named "CLPD", with 1200 images from all 31 provinces in mainland China. The dataset is available at: https://github.com/wangpengnorman/CLPD_dataset.




I Introduction

License Plate (LP) recognition in the wild is a fundamental problem in intelligent transportation systems. It can be used in a variety of applications including self-driving vehicles, traffic control and surveillance.

LP numbers provide a link to a large body of information, including ownership, vehicle condition and driving record. Therefore, LP recognition in the wild can play a key role in road safety, traffic control and law enforcement. Although the recognition accuracy is acceptable for images shot under constrained conditions, recognizing license plates in complex environments is still far from satisfactory, especially for images photographed in dark, glare, occluded, rainy, snowy, tilted or blurred scenarios as shown in Figure 1.

Fig. 1: Examples of license plates successfully recognized by our proposed algorithm. (a) Dark Illumination; (b) Extremely bright or uneven; (c) Large horizontal tilt degree; (d) Large vertical tilt degree; (e) Images taken on a snowy or rainy day; (f) Mixture of bad conditions.

With the advance of deep neural networks, numerous methods have been proposed in recent years for license plate recognition, with Convolutional Neural Networks (CNNs) used for feature extraction, followed by Connectionist Temporal Classification (CTC), number classifiers [2], etc. for character reading. These methods perform well for regular license plates (e.g., nearly horizontal). When the license plate images are tilted or bent, an extra rectification step is required before recognition [3].

This paper tackles the task of license plate recognition in unconstrained scenarios. A robust framework is proposed to handle license plate recognition in both regular and challenging cases effectively. Our proposed license plate recognizer is composed of a 30-layer lightweight Xception for feature extraction and a 2D-attention based decoding module for character sequence recognition. Without extra processing such as image rectification or character segmentation, the proposed model is capable of recognizing license plates in both regular and irregular patterns under various practical scenarios. Unlike current methods, which treat a license plate as a one-dimensional sequence, our method uses 2D attention that treats the license plate image as a two-dimensional signal. Trained in a weakly supervised manner, the proposed model is able to approximately localize the corresponding characters on license plates during decoding, regardless of the appearance of license plate patterns.

Many license plate datasets are collected from one region, which causes bias in the datasets. For example, Xu et al. [2] introduce a license plate dataset, CCPD, which contains about 290K real-world license plate images in various complex situations, as shown in Figure 1. However, since most of the images are photographed in one city, the first two characters in the license plates are mostly the same, which may bias the trained model. In order to obtain a robust model that can be generally used for recognizing license plates from different regions, a CycleGAN model is tailored here to mimic real scenarios and generate different kinds of license plate images, such as those in dark or strong lighting conditions, containing shadows, etc. Moreover, license plates with various province characters can be synthesized, which alleviates the exhausting human annotation work to a large extent and enables a more general license plate recognition model. Our framework is evaluated on four public datasets. The competitive performance demonstrates the robustness of our framework. Moreover, we also collect a new license plate dataset with images from all provinces in China, named “CLPD”. It enables a more comprehensive evaluation of current plate recognition methods, and promotes research on more practical models.

It should be noted that the focus of this work is license plate recognition. So we simply train the off-the-shelf YOLOv2 detector [4] here to obtain bounding boxes of license plates.

The main contributions of this paper can be summarized as follows:

1. We design a robust method for license plate recognition in natural scene images. It is made up of a tailored Xception module and an encoder-decoder module. We optimize the recognition framework by using a 2D attention mechanism. It is able to extract local features for individual characters in a weakly supervised manner, without needing character-level annotations. Compared to existing license plate recognition approaches, our method does not need an extra module to handle the irregularity of license plates or to segment each character for recognition.

2. A tailored CycleGAN is proposed to synthesize license plates under various scenarios, including adding shadows, glare or darkness, perspective transformation, etc. With this engine we can generate license plate images with less data bias, and thus obtain models with better generalization abilities.

3. We build a new dataset, named CLPD. It covers a large variety of photographing conditions, vehicle types and region codes, which provides a more comprehensive evaluation benchmark for plate recognition algorithms and promotes a more practical model design.

Fig. 2: Overview of the proposed architecture for LP recognition in complex scenarios. We extract license plates via a well-trained YOLOv2. The detected bounding box is fed into a 30-layer Xception network to obtain a global feature vector. An LSTM model is adopted to decode the obtained image feature into license plate numbers. We also extract an intermediate feature map from an intermediate layer of Xception, which provides local features during the character decoding process.

II Related Work

In this section, we present a concise introduction to related work on license plate recognition, scene text recognition, generative adversarial networks and license plate datasets.

II-A License Plate Recognition

Existing methods for license plate recognition can be divided into two categories: segmentation-based [3, 5, 6, 7, 8] and non-segmentation-based methods [1, 2, 9]. Segmentation-based methods generally segment the license plate into characters and then recognize individual characters with OCR models [3, 10, 11]. Bulan et al. [11] perform segmentation and OCR jointly using a probabilistic inference method based on hidden Markov models (HMMs), where the most likely character sequence is determined by the Viterbi algorithm. Segmentation-based methods rely heavily on the segmentation performance, which is very susceptible to the environment, including strong or weak lighting, bad weather, blurring, etc., and this results in low recognition accuracy even with a strong recognizer.

Recent methods are mostly segmentation-free. For example, Li et al. [1] propose to treat a license plate as a character sequence. Sequential features are encoded by CNNs and Bidirectional RNNs (BRNNs), and decoded by CTC without character separation. The CNN features are extracted from a well-trained CNN classifier, and the model cannot be trained end-to-end. RPnet, proposed by Xu et al. [2], extracts ROI features from several different convolutional layers, and feeds the combined features to a series of classifiers for recognition. The number of classifiers is determined by the number of characters in the license plate, which limits its generalization ability across regions. Li et al. [9] later propose a unified network that is able to localize license plates and recognize the letters at the same time in a single forward pass. Similarly, the region features are encoded by BRNNs and decoded by CTC, which limits its ability to handle oriented LPs. Compared to previous work, our method uses a 2D attention-based encoder-decoder framework, where characters can be approximately localized by 2D attention regardless of LP image appearance, which enables its application to arbitrarily oriented LPs.

II-B Scene Text Recognition

License plate recognition can be regarded as a special case of general scene text recognition, although the two tasks have different characteristics. Characters in license plates usually use the same font within one region. There is no language model hidden in a license plate, and no strong relationship with contextual semantic information. In contrast, general scene text has great variability in fonts. A language lexicon exists and the text content is often highly relevant to the objects or scenes of the image. Xie et al. [12] propose a novel method where aggregation cross-entropy (ACE) is used for sequence recognition, replacing the commonly used CTC loss owing to its inconvenience in processing 2D problems. A multi-object rectified attention network (MORAN) for scene text recognition is proposed by Luo et al. [13], which contains a multi-object rectification network (MORN) and an attention-based sequence recognition network (ASRN). The image is rectified by MORN and then input to ASRN for recognition. Shi et al. [14] put forward a system in which a flexible Thin-Plate Spline transformation is used to adaptively rectify a text image. A recognition model then predicts a character sequence directly from the rectified image. Li et al. [15] use a 2D attention-based encoder-decoder framework for irregular text recognition, which is very similar to our work. However, in our framework, a tailored CycleGAN is added for synthetic license plate generation, which can reduce data bias and improve model generalization ability.

II-C Generative Adversarial Networks

With the invention of Generative Adversarial Networks (GANs) [16], many improved models have emerged, such as Deep Convolutional GANs (DCGANs) [17], Conditional GAN [18], Cycle-Consistent Adversarial Networks (CycleGAN) [19], Wasserstein GANs (WGAN) [20], etc. Zhu et al. [19] propose CycleGAN, which learns the mapping between an input image and an output image using a training set of unaligned image pairs. In order to transfer the style of one image set to another, a cycle consistency loss is introduced. Based on this model, we propose an improved algorithm to generate synthetic license plate images in more complex environments, which further improves the accuracy of license plate recognition. Wang et al. [21] adopt CycleWGAN to generate license plate images for improving recognition performance. Images simulating different shooting conditions are generated simultaneously. BRNN+CTC is used for plate recognition, which does not take oriented license plates into consideration either. In contrast, we use a tailored CycleGAN to generate license plates under different conditions separately, which can lead to better recognition performance.

II-D Datasets of License Plate

Most datasets for license plate detection and recognition are collected from one area, and the types of license plate are monotonous (e.g., only containing civilian cars, no buses or trucks). Images are taken under similar conditions, such as at highway toll stations and parking lots. Hence those datasets cannot verify the robustness of a model.

Silva et al. [3] collect a dataset named CD-HARD, which covers some difficult situations, including tilting. However, because of the small number of images, the test result is susceptible to tricks. PKUData [22] captures images through a road surveillance camera, and includes a variety of license plate types and different lighting conditions. Unfortunately, all license plates are horizontal and taken from one province, so they share the same province code. Models trained on PKUData cannot be used to recognize license plates from other regions. The AOLP [23] database consists of 2049 images of Taiwan license plates. This dataset is categorized into three subsets according to different levels of difficulty and photographing conditions. CCPD is currently the largest license plate dataset with about 290k images, and is divided into multiple subsets such as tilt, difficulty, glare, and distance according to license plate conditions, which contributes greatly to the community. Nevertheless, most of the images are from one city too, which limits the ability of the trained model to recognize license plates from other areas. In this work, we propose to synthesize license plates with CycleGAN so as to make up for this deficiency. A new dataset named CLPD is introduced, which includes license plates from different provinces, to evaluate recognition models comprehensively.

Fig. 3: The tailored Xception architecture. “Conv” stands for convolutional layers, with output channels and kernel sizes presented. The stride and padding of all convolutional layers share a single setting, and no padding is used for max-pooling layers.

III Model

We introduce our proposed model in this section. As presented in Figure 2, the whole LP recognition model consists of two main parts: a tailored Xception network for feature extraction and a 2D-attention based RNN model for character decoding.

III-A The Convolutional Image Encoder

A 30-layer Xception encoder is tailored from the original Xception [24] framework to fit our application, whose details are presented in Figure 3. The convolutional parts of our model are based entirely on depthwise separable convolution layers [25]. The convolutional layers are structured into modules, all of which have linear residual connections except for the first and the last one. The term “ResSeparableConv” stands for a stack of three separable convolution layers with an identity residual connection.

The entry flow downsamples the spatial size and increases the number of feature channels using interleaved separable convolutions and max-pooling layers. In the middle flow, we adopt repeated ResSeparableConv blocks to extract deep features that contain higher-level representations, while the spatial size and channel number are fixed. In the exit flow, we extract a middle-level feature map as context for the attention network, together with a final holistic feature vector.
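To make the "lightweight" claim for depthwise separable convolutions concrete, the following sketch compares the weight count of a standard convolution with that of a depthwise-plus-pointwise pair. The layer sizes are hypothetical values chosen for illustration and are not taken from the architecture in Figure 3; the function names are likewise illustrative.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def separable_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depthwise k x k convolution followed by a pointwise 1 x 1 convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer size, chosen only for illustration.
c_in, c_out, k = 128, 256, 3
standard = standard_conv_params(c_in, c_out, k)
separable = separable_conv_params(c_in, c_out, k)
print(f"standard: {standard}, separable: {separable}, ratio: {separable / standard:.3f}")
```

For a 3 x 3 kernel the separable form needs roughly one ninth of the weights once the channel count is large, which is the main reason such encoders stay small.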

III-B The Recurrent Sequence Decoder

RNNs are widely used in translation, image captioning and scene text recognition tasks. Here we extend them to license plate recognition. With a two-dimensional attention mechanism integrated, there is no need to rectify irregular license plate images or to segment out each character for recognition. The proposed model can handle LPs in arbitrary shapes.

A multi-layer LSTM is adopted as the sequence decoder. As shown in Figure 2, the holistic feature vector is fed into the LSTM at the first time step, which provides overall information about the input image. A “START” token is then input into the model at the next time step. From then on, the output of the previous time step is fed into the LSTM until the “END” token is produced. The inputs of the LSTM are one-hot vectors embedded by a linear transformation.

At each time step, the LSTM cell computes the current hidden state from the previous hidden state and the embedded input character. During inference, the input character is the output of the previous step, while during training the ground-truth character is used directly. The cell computation additionally involves a linear transformation and the output of the 2D-attention module.

The attention output is computed from the feature vector at each position of the intermediate feature map and the hidden state at the current time step, using linear transformation matrices to be learned. This yields an attention weight for every location, and the weighted sum of image features gives the local feature of the character to be decoded at the current time step. The schematic of the 2D attention mechanism is illustrated in Figure 4.
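As a rough illustration of the 2D attention step, the sketch below implements plain additive attention over an H x W grid of feature vectors in pure Python. It is an assumption about the general form, not the paper's exact formulation: the names `V`, `h`, `W_v`, `W_h` and `w_e` are hypothetical, all weights are random, and any refinements in the original equations (which are not recoverable from the text here) are omitted.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attention_2d(V, h, W_v, W_h, w_e):
    """Additive attention over an H x W grid of feature vectors.

    V   : H x W grid (nested lists) of D-dimensional feature vectors
    h   : decoder hidden state (length S)
    W_v : A x D matrix, W_h : A x S matrix, w_e : length-A vector
    Returns the attention weights (H x W) and the weighted feature sum (length D).
    """
    proj_h = matvec(W_h, h)
    scores = []
    for row in V:
        for v in row:
            proj_v = matvec(W_v, v)
            hidden = [math.tanh(a + b) for a, b in zip(proj_v, proj_h)]
            scores.append(sum(w * t for w, t in zip(w_e, hidden)))
    # Softmax over all H*W positions jointly (subtract the max for stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    flat = [e / total for e in exps]
    width = len(V[0])
    alpha = [flat[i * width:(i + 1) * width] for i in range(len(V))]
    # Weighted sum of the feature vectors: the local feature for this step.
    dim = len(V[0][0])
    glimpse = [0.0] * dim
    for i, row in enumerate(V):
        for j, v in enumerate(row):
            for d in range(dim):
                glimpse[d] += alpha[i][j] * v[d]
    return alpha, glimpse


# Toy sizes and random weights, for illustration only.
rng = random.Random(0)
H, W, D, S, A = 2, 3, 4, 5, 6
V = [[[rng.uniform(-1, 1) for _ in range(D)] for _ in range(W)] for _ in range(H)]
h = [rng.uniform(-1, 1) for _ in range(S)]
W_v = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(A)]
W_h = [[rng.uniform(-1, 1) for _ in range(S)] for _ in range(A)]
w_e = [rng.uniform(-1, 1) for _ in range(A)]
alpha, glimpse = attention_2d(V, h, W_v, W_h, w_e)
print("attention weights sum to", round(sum(sum(r) for r in alpha), 6))
```

Because the softmax runs over both spatial axes at once, the weights can concentrate on a character wherever it sits in the plate image, which is what removes the need for rectification.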

Fig. 4: The schematic of the 2D attention mechanism. Its inputs are the feature map of the image obtained by Xception (as shown in Figure 3) and the hidden state of each time step in decoding.
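The decoding procedure of this section (holistic feature first, then a “START” token, then feeding each prediction back until “END”) can be sketched as a greedy loop. The `step` callable is a hypothetical stand-in for one LSTM step plus attention, and `make_toy_step` is a toy replacement for a trained decoder; neither comes from the paper.

```python
START, END = "<START>", "<END>"


def greedy_decode(step, holistic_feature, max_len=10):
    """Run the decoder until END is produced or max_len characters are emitted.

    `step(state, token)` is a hypothetical callable standing in for one LSTM
    step plus attention; it returns (new_state, predicted_token).
    """
    # First step: the holistic image feature initialises the decoder state.
    state, _ = step(None, holistic_feature)
    token = START
    chars = []
    for _ in range(max_len):
        # Feed the previous output (or START) back into the decoder.
        state, token = step(state, token)
        if token == END:
            break
        chars.append(token)
    return "".join(chars)


def make_toy_step(plate):
    """Toy stand-in for a trained decoder: it simply replays `plate`."""
    def step(state, token):
        if state is None:          # holistic-feature step, no character emitted
            return 0, None
        nxt = plate[state] if state < len(plate) else END
        return state + 1, nxt
    return step


print(greedy_decode(make_toy_step("A12345"), holistic_feature=[0.0]))
```

Since the loop stops on “END” rather than after a fixed count, the same procedure covers plates of different lengths.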

IV AsymCycleGAN for LP Image Generation

As mentioned above, it is difficult to manually collect LP images from a variety of regions, which makes most existing LP datasets heavily biased towards specific regional identifiers. In this section, we introduce a method for generating high-quality synthetic LP images using OpenCV and a tailored CycleGAN model (termed AsymCycleGAN). With this approach, we are able to construct a balanced training set and reduce the reliance on manually collected data.
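One way to obtain a balanced distribution of region codes is to choose the plate text before any image is rendered. The sketch below is an assumption about how that could be done, not the authors' generator: it cycles through the 31 mainland province abbreviations so each appears equally often, and it assumes the common seven-character layout (province character, one letter, five alphanumerics, with I and O excluded). Rendering the text with OpenCV and passing it through the GAN are not shown.

```python
import random
from collections import Counter

# The 31 mainland province abbreviations used as the first plate character.
PROVINCES = "京津沪渝冀豫云辽黑湘皖鲁新苏浙赣鄂桂甘晋蒙陕吉闽贵粤青藏川宁琼"
# Assumed alphabet: the letters I and O are left out to avoid confusion with 1 and 0.
LETTERS = "ABCDEFGHJKLMNPQRSTUVWXYZ"
ALNUM = LETTERS + "0123456789"


def sample_plate_strings(n, seed=0):
    """Return n plate strings whose province characters are evenly distributed."""
    rng = random.Random(seed)
    plates = []
    for i in range(n):
        province = PROVINCES[i % len(PROVINCES)]  # round-robin keeps provinces balanced
        city = rng.choice(LETTERS)
        tail = "".join(rng.choice(ALNUM) for _ in range(5))
        plates.append(province + city + tail)
    rng.shuffle(plates)  # avoid a fixed province order in the training stream
    return plates


plates = sample_plate_strings(3100)
counts = Counter(p[0] for p in plates)
print(len(counts), min(counts.values()), max(counts.values()))
```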

IV-A The Architecture of AsymCycleGAN

CycleGAN is an approach to translating an image from a source domain $X$ to a target domain $Y$ in the absence of paired training examples. In this work, the source domain is composed of fake LP images generated by OpenCV and the target domain is made up of real LP images. There are four learnable modules in CycleGAN: two mapping functions $G: X \rightarrow Y$ and $F: Y \rightarrow X$, and two discriminators $D_X$ and $D_Y$. The loss function of the standard CycleGAN can be expressed as follows:

$$\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda \mathcal{L}_{cyc}(G, F), \tag{3}$$

where $\mathcal{L}_{GAN}$ represents the adversarial loss and $\mathcal{L}_{cyc}$ denotes the cycle-consistency loss:

$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x}\big[\lVert F(G(x)) - x \rVert_1\big] + \mathbb{E}_{y}\big[\lVert G(F(y)) - y \rVert_1\big]. \tag{4}$$

In our case, what we need is the mapping function $G$ to generate real images from synthetic images. $X \rightarrow Y \rightarrow X$ can be roughly regarded as generating a noisy image from a clean one and then removing the noise, while $Y \rightarrow X \rightarrow Y$ is the opposite process. Note that in the process of $Y \rightarrow X \rightarrow Y$, the noise in $y$ removed by $F$ is in theory difficult for $G$ to recover exactly, as one clean image can be associated with multiple real images with different noise. To this end, we replace the original cycle-consistency loss (4) with

$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x}\big[\lVert F(G(x)) - x \rVert_1\big], \tag{5}$$

where the term with respect to $y$ is removed. We term the modified CycleGAN model AsymCycleGAN, as its cycle-consistency loss is asymmetric. The architecture of the proposed AsymCycleGAN model is shown in Figure 5.
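The effect of dropping one cycle term can be seen in a toy example. The sketch below assumes the usual L1 form of the cycle-consistency loss and uses hand-written stand-ins for the two generators: `G` maps a clean synthetic image towards the real domain and `F` maps a real image back to a clean one. The generators, the four-pixel "images" and the function names are all hypothetical.

```python
def l1(a, b):
    """Mean absolute difference between two equally sized flat images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_loss(G, F, synthetic_batch, real_batch, asymmetric=True):
    """Cycle-consistency loss; the real -> synthetic -> real term is dropped when asymmetric."""
    forward = sum(l1(F(G(x)), x) for x in synthetic_batch) / len(synthetic_batch)
    if asymmetric:
        return forward
    backward = sum(l1(G(F(y)), y) for y in real_batch) / len(real_batch)
    return forward + backward


# Toy generators: G lowers contrast (a stand-in for adding real-world degradation),
# F snaps every pixel back to a clean 0/1 level (a stand-in for removing it).
G = lambda x: [0.8 * v + 0.1 for v in x]
F = lambda y: [float(round(v)) for v in y]

synthetic = [[0.0, 1.0, 1.0, 0.0]]
real = [[0.2, 0.7, 0.95, 0.05]]

print("asymmetric:", cycle_loss(G, F, synthetic, real, asymmetric=True))
print("symmetric: ", cycle_loss(G, F, synthetic, real, asymmetric=False))
```

In this toy setting the clean image is recovered exactly, so the forward term vanishes, while the backward term stays positive because `F` discards noise that `G` has no way to reproduce. The symmetric loss would therefore keep penalizing the generators for something they cannot fix.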

Fig. 5: The architecture of the proposed AsymCycleGAN model. The source domain consists of synthetic LP images generated by OpenCV, and the target domain consists of real LP images.

IV-B AsymCycleGAN Generation Results

As in CycleGAN, training our proposed AsymCycleGAN model only requires two sets of unaligned images: synthetic and real. As shown in Figure 6, the synthetic images are generated using OpenCV, while the real images are sampled from the CCPD dataset [2]. To generate different types of real images, we further divide the CCPD images into two subsets with different illumination conditions: dark and bright. The resulting data, which consists of synthetic LPs generated by OpenCV and real-life license plate images in dark or glare environments, is used to train both the standard CycleGAN and our asymmetric CycleGAN model. The AsymCycleGAN model is trained with a preset learning rate and number of epochs. The images generated by CycleGAN and asymmetric CycleGAN are shown in Figure 6. Moreover, we also add shadows to the synthetic images so as to imitate real environments; the generated images are presented in Figure 6 (e).

Fig. 6: Various algorithms for generating license plate images. (a) Synthetic LPs generated by OpenCV; (b) The examples of LPs generated by CycleGAN model [19] ; (c) The examples of LPs generated by our asymmetric CycleGAN model; (d) Real LPs from CCPD-DB; (e) Shadowed Image.

V The Proposed LP Dataset

In this section, we introduce a new LP dataset named CLPD (China License Plate Dataset), for a more comprehensive evaluation of LP detection and recognition algorithms. We describe how it is collected (Section V-A) and compare it with other datasets (Section V-B).

V-A Data Collection

The LP images in the proposed CLPD dataset are collected from a variety of real-scene image sources: for example, searched from the Internet, taken by mobile phones or captured by car driving recorders. All faces shown in the images are blurred for privacy reasons. When taking LP photos, we also diversify the photographing angles, shooting times, resolutions and backgrounds so as to cover different conditions. The proposed dataset includes multiple vehicle types, such as trucks, cars, police cars and new energy vehicles. Note that new energy vehicles in China have license plates with eight characters, while other vehicles have seven-character license plates. We also allow occluded license plates that have fewer than seven visible characters. The variation in license plate length increases the recognition difficulty as well, and makes rule-based recognition methods infeasible. The bounding boxes and license plate characters are annotated manually. In summary, the CLPD dataset contains 1200 LP images from all 31 provinces in mainland China. Some examples are shown in Figure 7. To our knowledge, our proposed LP dataset is the only one that covers all mainland China provinces with real photographed images.

Fig. 7: Sample images in our proposed CLPD dataset. Each license plate is manually annotated with a bounding box and its license number.

V-B Dataset Comparison

As presented in Table I, we compare our proposed dataset with other LP datasets in several aspects. Although our dataset is small, it contains the largest number of region codes. As we collect LP images from multiple sources, the image sizes are not fixed, in contrast to other datasets. Furthermore, AOLP, CCPD and our CLPD contain tilted images, while PKUData does not. Finally, our dataset contains LPs from different types of vehicles, including police cars, new energy cars and trucks, which further increases the diversity of LP styles.

AOLP [23] PKUData [22] CCPD [2] CLPD (ours)
#Region Codes
LP size
Var in vehicle type
TABLE I: A comparison of available datasets for LP detection and recognition. LP size is the average size of all license plate areas in a dataset.

VI Experiments

In this section, we conduct extensive experiments to compare our license plate recognition method with the state-of-the-art recognition methods. To demonstrate the effectiveness of the proposed model, plenty of experiments are performed on different license plate datasets.

VI-A Datasets

CCPD [2] is currently the largest publicly available License Plate (LP) dataset, providing about 290k Chinese LP images with detailed annotations. This dataset is separated into different groups according to the difficulty of identification, the illumination on the LP area, the distance from the license plate when photographing, the degree of horizontal and vertical tilt, and the weather (rainy, snowy or foggy). Each category includes 10k to 20k images. CCPD-Base is split in half: one half is used for training and the other half for testing. The other sub-datasets (CCPD-DB, CCPD-FN, CCPD-Rotate, CCPD-Weather, CCPD-Challenge) are also used for testing.

The AOLP [23] database consists of 2049 images of Taiwan license plates. This dataset is categorized into three subsets according to complexity levels and photographing conditions: Access Control (AC), Traffic Law Enforcement (LE) and Road Patrol (RP). Since we do not have any other images of Taiwan license plates, we use any two of these subsets for training and the remaining one for testing, similar to previous practice [1, 9, 26].
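The leave-one-subset-out protocol on AOLP can be written down in a few lines. The subset sizes below are the image counts listed in the header of Table VII; the variable and function names are illustrative only.

```python
# Image counts per AOLP subset, as listed in the header of Table VII.
AOLP_SUBSETS = {"AC": 681, "LE": 757, "RP": 611}


def leave_one_out_splits(subsets):
    """Yield (train_names, test_name, n_train, n_test) for each held-out subset."""
    for test_name, n_test in subsets.items():
        train_names = [name for name in subsets if name != test_name]
        n_train = sum(subsets[name] for name in train_names)
        yield train_names, test_name, n_train, n_test


splits = list(leave_one_out_splits(AOLP_SUBSETS))
for train_names, test_name, n_train, n_test in splits:
    print(f"train on {'+'.join(train_names)} ({n_train} images), test on {test_name} ({n_test} images)")
```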

PKUData [22], released by Yuan et al., provides images for license plate detection. The license plate labels are not annotated, so we labeled the images in this dataset ourselves. A subset of the images is randomly selected for training and the rest are used for testing.

CLPD is our proposed LP dataset, which contains 1200 images across all 31 provinces in mainland China, with different vehicle types included. The images in the newly proposed CLPD dataset are all real and cover a large variety of photographing conditions, vehicle types and region codes. They are used only for testing, to verify the practicality of LP recognition models.

VI-B Implementation Details

In this work, we mainly focus on license plate recognition. To obtain the bounding boxes of license plates, a YOLOv2 [4] detector is trained on the training set of CCPD and evaluated on the CCPD test sets under a fixed IoU threshold. For fair comparison, we use the same evaluation criteria as in [2]: an LP recognition result is correct if and only if the IoU between the detection and the ground truth exceeds the required threshold and all characters of the LP are correctly recognized (including the region code).
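The correctness criterion above combines a localisation test with an exact string match. A minimal sketch follows, assuming axis-aligned boxes given as `(x1, y1, x2, y2)`. The threshold is left as a parameter because its value is not recoverable from the text here, and the value 0.5 used in the example call is purely illustrative.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred_box, gt_box, pred_text, gt_text, iou_threshold):
    """A result counts only if the box overlaps enough AND every character matches."""
    return iou(pred_box, gt_box) > iou_threshold and pred_text == gt_text


# Illustrative call; the threshold value here is an assumption.
print(is_correct((10, 10, 110, 40), (12, 11, 112, 42), "A12345", "A12345", iou_threshold=0.5))
```

A single wrong character makes the whole plate count as an error, so this metric is considerably stricter than per-character accuracy.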

The recognition network is trained with cross-entropy loss and the ADAM optimizer without any pre-training. Training uses a fixed batch size; the learning rate starts from an initial value and is multiplied by a constant factor at regular iteration intervals until it reaches a lower bound. The heights of input images in a batch are fixed, while the widths are calculated according to the aspect ratios of the original images. All experiments are conducted on an NVIDIA GTX1080Ti GPU with 11GB memory.
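The learning-rate schedule described above is a step decay with a floor. The sketch below shows that shape; the concrete numbers in the example call are hypothetical, since the original values are not recoverable from the text here.

```python
def step_decay_lr(iteration, base_lr, factor, step_size, min_lr):
    """Multiply base_lr by `factor` once every `step_size` iterations, never going below min_lr."""
    lr = base_lr * factor ** (iteration // step_size)
    return max(lr, min_lr)


# Hypothetical settings, for illustration only.
for it in (0, 999, 1000, 5000, 50000):
    print(it, step_decay_lr(it, base_lr=1e-3, factor=0.5, step_size=1000, min_lr=1e-5))
```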

VI-C Ablation Studies

To analyze our proposed framework in detail, in this section, we evaluate it with different settings on CCPD dataset.

VI-C1 Effect of CNN structures

In order to analyze the impact of CNN capacity, we first experiment with different numbers of CNN channels and layers. As shown in Table II, using more CNN channels indeed improves license plate recognition accuracy until the performance saturates. Experimental results with different numbers of convolutional layers are shown in Table III. The 30-layer Xception performs better than models with fewer layers, but the performance does not improve significantly when the depth is increased further. Hereinafter, we use the 30-layer Xception with the channel number at which the performance saturates.

VI-C2 Effect of inaccurate bounding box

Secondly, we test the recognition performance with detected and ground-truth bounding boxes respectively, to demonstrate the robustness of our algorithm. Note that the detected bounding boxes may not enclose the license plates as exactly as the ground truth does. This experiment is conducted to show the effect of bounding box variance on recognition performance. As shown in Table IV, the recognition accuracy drops only slightly when using detected bounding boxes, with the “Challenge” subset being the one exception, which validates the robustness of our algorithm to inaccurate bounding boxes. One possible reason is that the adopted 2D attention mechanism makes our algorithm less dependent on accurate bounding boxes: at each character decoding step, the attention module extracts the most relevant local feature for each character in 2D space, instead of relying on heuristic rules for character separation.

VI-C3 Effect of synthetic data

The last ablation study is on the effectiveness of the generated synthetic data. Here we also compare the performance obtained with different GAN models. We train our model with different numbers of real images (20k, 50k and 100k) and synthetic images, and then test the performance on the CCPD-DB dataset. As shown in Table V, using the synthetic data generated by our proposed AsymCycleGAN offers larger improvements than using the data generated by the original CycleGAN, which demonstrates the superiority of our proposed AsymCycleGAN.

In addition, when comparing the improvements by using different number of real images, it can be found that the synthetic data plays a more important role when the real data size is smaller.

We can also see that the improvement is reduced when a smaller number of synthetic images is used. Note that generating synthetic images is very cheap: they do not need human annotation and the generation speed is fast (about 1K/min). So we can easily employ massive synthetic data for training, to improve the accuracy of LP recognition algorithms.

CNN Channels Base DB FN Rotate Tilt Weather Challenge
TABLE II: Recognition accuracy (%) with different CNN channels on CCPD subsets. The selected channel number is adopted in later experiments.
Layers Base DB FN Rotate Tilt Weather Challenge
TABLE III: Comparing the recognition accuracy (%) with different layers of Xception on CCPD subsets. We choose the optimal 30-layer Xception for feature extraction.
Bounding Box Base DB FN Rotate Tilt Weather Challenge
by Detection
Ground truth
TABLE IV: Recognition accuracy (%) using different bounding boxes on sub-datasets of CCPD. The experimental results show only a small gap when using detected bounding boxes, which demonstrates the robustness of our algorithm.
Training Data CCPD-DB Improvement
Real (20k)
Real (20k) + CycleGAN (20k)
Real (20k) + AsymCycleGAN (20k)
Real (20k) + CycleGAN (200k)
Real (20k) + AsymCycleGAN (200k)
Real (50k)
Real (50k) + CycleGAN (50k)
Real (50k) + AsymCycleGAN (50k)
Real (50k) + CycleGAN (200k)
Real (50k) + AsymCycleGAN (200k)
Real (100k)
Real (100k) + CycleGAN (100k)
Real (100k) + AsymCycleGAN (100k)
Real (100k) + CycleGAN (200k)
Real (100k) + AsymCycleGAN (200k)
TABLE V: The recognition accuracy (%) on CCPD-DB, with different numbers of real images and different GAN models adopted. Using synthetic data generated by our proposed AsymCycleGAN offers better performance. The advantage is even more obvious when only a small number of real images is available.

VI-D Experiments on Existing Benchmarks

VI-D1 Results on CCPD

Model Overall Base DB FN Rotate Tilt Weather Challenge Test time
#Images () () () () () () () ms
Ren et al. (2015) [27]
Liu et al. (2016) [28]
Joseph et al. (2016) [4]
Li et al. (2017) [9]
Zherzdev et al. (2018) [29]
Xu et al. (2018) [2]
Zhang et al. (2019) [30, 29]
Luo et al. (2019) [13]
Wang et al. (2020) [31]
Ours (Real Data Only)
Ours (Real + Synthetic data)
TABLE VI: LP recognition accuracy (%) on each CCPD test set (number of images in parentheses). We achieve the highest recognition accuracy compared with other algorithms, especially on the datasets with rotated and challenging license plates.

It can be seen from Table VI that our algorithm outperforms other algorithms in terms of the overall AP and on most subsets, using the same real training data. The only exception is that the method of Luo et al. [13] is better than ours on the Rotate and Tilt subsets. The reason may be that Luo et al. [13] adopt an STN-based [32] technique which is specifically designed for rotated images. Note that our algorithm can also benefit from this technique, and the accuracy on the Rotate and Tilt subsets is expected to improve further.

Our algorithm shows significant superiority on subsets with irregular LP images, such as “Rotate”, “Weather” and “Challenge”, which again demonstrates the robustness of our model to the deformation of license plates. Moreover, adding synthetic images generated by our AsymCycleGAN consistently raises the recognition accuracy further on all subsets (a gain of Overall). The increment is even more pronounced when LPs are rotated or tilted (raising on CCPD-Rotate and on CCPD-Tilt). The main reason is that random perspective transformation and rotation are applied to the synthesized data, which makes it a good complement to the real data.
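The random perspective transformation and rotation mentioned above can be illustrated with a minimal sketch. It is not the authors' synthesis code: the function names, the jitter range `max_shift`, the angle range `max_angle_deg` and the plate size are all assumptions chosen for illustration. The sketch rotates the four plate corners, jitters them, and solves for the homography that maps the original corners to the perturbed ones; an image-warping routine would then resample the plate with that matrix.

```python
import numpy as np

def homography_from_points(src, dst):
    """Solve for the 3x3 homography H that maps four src points to four dst points."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h1*x + h2*y + h3) / (h7*x + h8*y + 1), and similarly for v.
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.extend([u, v])
    h = np.linalg.solve(np.array(rows, dtype=float), np.array(rhs, dtype=float))
    return np.append(h, 1.0).reshape(3, 3)

def random_plate_warp(width, height, rng, max_shift=0.15, max_angle_deg=20.0):
    """Sample a homography that rotates the plate and jitters its corners.

    max_shift and max_angle_deg are hypothetical ranges, not values from the paper.
    """
    corners = np.array([[0, 0], [width, 0], [width, height], [0, height]], dtype=float)
    # Rotate the corners about the plate centre by a random angle.
    angle = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg))
    rot = np.array([[np.cos(angle), -np.sin(angle)],
                    [np.sin(angle), np.cos(angle)]])
    centre = corners.mean(axis=0)
    rotated = (corners - centre) @ rot.T + centre
    # Jitter each corner independently to imitate a change of viewpoint.
    jitter = rng.uniform(-max_shift, max_shift, size=(4, 2)) * np.array([width, height])
    targets = rotated + jitter
    return homography_from_points(corners, targets), corners, targets

def apply_homography(H, points):
    """Map (N, 2) points through H using homogeneous coordinates."""
    pts = np.hstack([points, np.ones((len(points), 1))]) @ H.T
    return pts[:, :2] / pts[:, 2:3]

rng = np.random.default_rng(0)
H, src, dst = random_plate_warp(140, 40, rng)
```

Sampling a fresh matrix for every synthetic plate gives each training image a different tilt and viewpoint, which is the kind of variation the real data lacks.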

Fig. 8: Visualization of 2D attention weights at each decoding timestep. Results indicate that the 2D-attention model can handle challenging cases.

We select some extremely distorted images and visualize the 2D attention heat maps for each decoded character in Figure 8. The results show that, even for heavily tilted images, the 2D attention model can localize the character being decoded and extract the corresponding features for recognition. It should be noted that the attention module does not require additional character-level annotations. It is trained in a weakly supervised manner with the cross-entropy loss on whole-plate recognition.
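To make the idea of a per-timestep 2D attention map concrete, the following is a minimal sketch of one decoding step using generic additive attention over a feature map. It is an illustration under assumptions, not necessarily the exact formulation used in our decoder: the parameter names `W_f`, `W_h` and `w`, and all dimensions, are hypothetical.

```python
import numpy as np

def attention_2d(features, hidden, W_f, W_h, w):
    """Additive attention over an (H, W, C) feature map for one decoding step.

    features: (H, W, C) encoder output; hidden: (D,) decoder state.
    W_f: (C, A), W_h: (D, A), w: (A,) are hypothetical learned parameters.
    Returns the (H, W) attention map and the (C,) attended feature vector.
    """
    H, W, C = features.shape
    flat = features.reshape(H * W, C)
    # One score per spatial position, conditioned on the decoder state.
    scores = np.tanh(flat @ W_f + hidden @ W_h) @ w
    scores = scores - scores.max()  # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    weights = exp_scores / exp_scores.sum()  # softmax over all H*W positions
    glimpse = weights @ flat  # attention-weighted sum of the features
    return weights.reshape(H, W), glimpse

rng = np.random.default_rng(0)
feat = rng.standard_normal((6, 20, 32))   # hypothetical feature-map size
state = rng.standard_normal(64)           # hypothetical decoder state size
attn, glimpse = attention_2d(
    feat, state,
    rng.standard_normal((32, 16)),
    rng.standard_normal((64, 16)),
    rng.standard_normal(16),
)
```

Because the softmax runs over both spatial axes, the map can follow characters along a tilted or curved baseline, and the only supervision it needs is the loss on the predicted character sequence.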

VI-D2 Results on AOLP

Model  AC  LE  RP
#Images  (681)  (757)  (611)
Li et al. (2016) [1]
Li et al. (2017) [9]
Wu et al. (2018) [26]
TABLE VII: Recognition accuracy (%) on sub-datasets of AOLP. Our approach performs better than other methods on all three subsets.

In this section, we compare our model with other state-of-the-art methods on the AOLP dataset. For a fair comparison, we do not use any synthetic data during model training; perspective transformation is employed for data augmentation. The results in Table VII show that our approach performs better than other methods on all three subsets. In particular, our method yields accuracy increments of on AC, on LE and on RP over the second-best results. Note that the RP subset is mainly composed of oriented or distorted license plates, and it is on this subset that our method obtains the largest performance gain. This result further demonstrates the effectiveness of our model in recognizing irregular license plates.

VI-D3 Results on PKUData

For PKUData, we randomly sample three-fifths of the images for training and use the remaining two-fifths for testing. For a fair comparison, we re-train the model proposed in [2] on the same training data. An open API called Sighthounds [33] is tested as well, although its training data is unknown to us. We evaluate the LP recognition accuracy in two settings, i.e., with and without the region code (a Chinese character) considered. The model in Sighthounds [33] does not support region code recognition, so we only report its accuracy without the region code. The recognition results are shown in Table VIII. Our model outperforms that in [2] by about when only real data is adopted, and surpasses Sighthounds [33] if synthetic training data is added. Compared with the improvement on the CCPD dataset, the accuracy gain from synthetic data is even larger here (about ), because PKUData contains only a limited number of real training images. This demonstrates the usefulness of our synthesis engine when training data is scarce.

Dataset PKUData CLPD
Criterion ACC ACC w/o RC ACC ACC w/o RC
Masood et al. (2017) [33] - -
Xu et al. (2018) [2]
Ours (Real Data Only)
Ours (Real + Synthetic Data)
TABLE VIII: Recognition accuracy (ACC, %) and recognition accuracy without region code (ACC w/o RC, %) on PKUData and CLPD. For ACC w/o RC, a recognition is considered correct if all characters except the first one (the region code) are correctly recognized.
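The two criteria in Table VIII can be sketched as follows. This is a minimal illustration, assuming that each plate is a string whose first character is the region code; the function name `plate_accuracy` and the sample plates are made up for the example and are not taken from either dataset.

```python
def plate_accuracy(predictions, ground_truths, ignore_region_code=False):
    """Return whole-plate accuracy in percent.

    A plate counts as correct only if every compared character matches.
    With ignore_region_code=True the first character (assumed to be the
    region code) is excluded from the comparison.
    """
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths must have equal length")
    if not ground_truths:
        return 0.0
    skip = 1 if ignore_region_code else 0
    correct = sum(p[skip:] == g[skip:] for p, g in zip(predictions, ground_truths))
    return 100.0 * correct / len(ground_truths)

# Made-up example: the second plate has a wrong region code only,
# and the third confuses the final "0" with the letter "O".
preds = ["皖A12345", "京B88888", "沪C0000O"]
truth = ["皖A12345", "津B88888", "沪C00000"]
acc = plate_accuracy(preds, truth)
acc_wo_rc = plate_accuracy(preds, truth, ignore_region_code=True)
```

In this example only the first plate is fully correct, while the second also counts as correct once the region code is ignored, so ACC w/o RC is never lower than ACC.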

VI-E Experiments on Our CLPD Dataset

As mentioned above, our proposed CLPD dataset is much more diverse than existing LP datasets, which provides a platform for evaluating current algorithms comprehensively. We train the proposed model on the CCPD-Base dataset and test it on CLPD. The experimental results in Table VIII show the advantage of our model: it achieves the highest accuracy whether or not the region code is considered. Adding synthetic data increases the accuracy further when the region code is considered, which benefits from the more balanced region code distribution in our synthetic data that the proposed engine can easily produce. Some experimental results are visualized in Figure 9.

We also present some failure cases in Figure 10. As license plates follow no specific language rule, some similar characters are rather difficult to distinguish, such as “4” and “A”, “8” and “B”, and “0”, “D” and “O”. Images with extreme blur or occlusion also cannot be recognized.

Fig. 9: Detection and recognition results on CLPD using YOLOv2 and our recognition model. With the addition of synthetic data, the model is able to recognize license plates from different provinces under various scenarios.
Fig. 10: Examples of LPs that are incorrectly recognized by the proposed method. The ground truth is shown in parentheses.

VII Conclusion

In this paper, we present a robust model for license plate recognition in unconstrained environments. The proposed model is built upon an Xception CNN module for feature extraction and a 2D-attention based RNN module for sequence decoding. To handle the shortage or imbalance of real training data, CycleGAN is tailored to generate synthetic LP images with different deformation styles and more balanced region codes, which provides a simple yet effective way to complement the available real data. Extensive experimental results indicate the superiority of our method, especially on distorted license plates or with limited training data. An LP dataset that contains images captured in different ways from various regions is collected so as to evaluate LP recognition methods more comprehensively.

We use an LSTM-based sequence decoder for license plate recognition, which cannot be trained in parallel over time steps. In future work, a transformer-like decoder may be explored to accelerate training.