Practical License Plate Recognition in Unconstrained Surveillance Systems with Adversarial Super-Resolution

10/10/2019 ∙ by Younkwan Lee, et al. ∙ Gwangju Institute of Science and Technology

Although license plate (LP) recognition has advanced significantly, most current applications are still limited to ideal environments where training data are carefully annotated in constrained scenes. In this paper, we propose a novel license plate recognition method that handles unconstrained real-world traffic scenes. To overcome these difficulties, we combine adversarial super-resolution (SR) with one-stage character segmentation and recognition. Built on a deep convolutional network based on VGG-net, our method offers a simple but effective training procedure. Moreover, we introduce GIST-LP, a challenging LP dataset whose image samples are collected from unconstrained surveillance scenes. Experimental results on the AOLP and GIST-LP datasets show that our method, without any scene-specific adaptation, outperforms current LP recognition approaches in accuracy, and that our SR results are visually clearer and easier to read than the original data.


1 Introduction

License plate recognition (LPR) is a fundamental and essential process for identifying vehicles and extends to a variety of real-world applications. LPR methods have been widely studied over the last decade and are of particular interest in intelligent transport systems (ITS) applications such as access control [Chinomi et al., 2008], road traffic monitoring [Noh et al., 2016, Pu et al., 2013, Song and Jeon, 2016, Lee et al., 2017, Yoon et al., 2018], and traffic law enforcement [Zhang et al., 2011]. Since all license plate recognition methods deal with letters and numbers in images, they are closely related to image classification [Simonyan and Zisserman, 2014, Russakovsky et al., 2015] and text localization [Anagnostopoulos et al., 2006].

Figure 1: Example from the GIST-LP dataset. Poor resolution and plate variation are common challenges in license plate recognition.

Conventional LPR methods typically include two stages: character localization and character recognition. These methods are largely designed for unrealistically constrained scenarios: high-resolution, unrotated frontal or rear images. Unlike this ideal situation, however, many traffic surveillance cameras around the world operate under unconstrained conditions: they produce poor-resolution images and tilted license plates, as shown in Figure 1. Despite considerable progress in computer vision, existing methods may fail to recognize license plates in such environments because they do not account for unconstrained conditions. We identify three limitations: first, many license plate samples constitute only an incomplete text search space; second, the projection angle of a sample can be tilted with respect to the image plane by up to 30 degrees, interfering with character extraction; third, poor text localization often results in erroneous outputs.

Based on these findings, we propose a novel deep convolutional neural network-based method for better LPR.

Adversarial Super-Resolution We propose an adversarial super-resolution (SR) method with a generator and a discriminator network operating over the image area. Modern SR methods [Dong et al., 2014] commonly target the pixel-wise average as the optimization goal, minimizing the mean squared error (MSE) between the super-resolved image and the ground truth, which leads to a smoothing effect, especially across text. Instead, we follow the generator network of [Ledig et al., 2017], which optimizes a minimax game, avoiding the smoothing effect and providing a sharpening effect instead. Combined with the SR generator, we introduce a new loss function that encourages the discriminator to count characters and to distinguish SR from high-resolution (HR) samples concurrently. The character counting result from the discriminator network serves as a conditional term that helps improve character recognition performance in the one-stage recognition module.

Reconstruction Auto-Encoder A license plate projected onto the image plane is often tilted horizontally or vertically, so we reconstruct the samples to straighten them. To address this issue, we utilize a convolutional auto-encoder network whose objective function is the difference between the tilted image and the straightened image. This serves as a preprocessing step for correct character extraction.

One-Stage Recognition

We do not use the commonly used character segmentation and localization process. Instead, we propose a unified one-stage character localization and recognition approach. One-stage recognition is not only more intuitive but also more accurate than segmentation, which requires a precise estimate of each pixel's class. Our one-stage method divides the input image into a 1 × S grid and detects the LP at three different scales, incorporating a conditional term. The character localization result from each grid cell is naturally unified with character classification.

In summary, our key contributions are:

  • We show that, for unconstrained real-world surveillance cameras, the adversarial SR module and the AE-based reconstruction module greatly improve recognition performance, by 2.57% (AOLP) and 8.06% (GIST-LP) compared with state-of-the-art methods.

  • The one-stage method combined with the conditional term, instead of a two-stage method (character detection followed by classification), reduces localization and classification errors.

  • We collected a dataset of challenging license plate samples under unconstrained conditions, accompanied by text annotations (1,800 samples, 50 different license plates).

Figure 2: The proposed license plate recognition pipeline.

2 Related Work

2.1 License Plate Recognition

Traditionally, most proposed LPR methods consist of two stages: semantic segmentation of the exact character region and recognition of the characters. These methods generally utilize discriminative features such as edge, color, shape, and texture, but do not show good results. Edge-based methods [Kim et al., 2000, Zhang et al., 2006] and geometric features [Wang and Lee, 2003] assume the presence of characters in the license plate. Many color-based methods [Shi et al., 2005, Chen et al., 2009] use the color combination of the license plate and the characters.

However, since two-stage methods are not only slow to run but also take longer to converge during training due to their double networks, one-stage, segmentation-free approaches [Zherzdev and Gruzdev, 2018, Cheang et al., 2017, Li and Shen, 2016, Wang et al., ] that perform segmentation and recognition at once have been proposed. Most segmentation-free models exploit deeply learned features, which outperform traditional methods on classification tasks thanks to deep convolutional neural networks (DCNN) [Simonyan and Zisserman, 2014, He et al., 2016] and data-driven approaches [Russakovsky et al., 2015]. The core idea of these methods is to extract features directly, without a sliding window, for LPR. For example, Sergey et al. [Zherzdev and Gruzdev, 2018] adopted a lightweight convolutional neural network trained in an end-to-end way. In work using an RNN module, Teik Koon et al. [Cheang et al., 2017] proposed a unified CNN-RNN model that feeds the entire image as input, on the assumption that the context of the whole image yields more accurate classification than sliding-window approaches. Hui et al. [Li and Shen, 2016] utilized a cascade framework combining a DCNN and an LSTM, and Xinlong et al. [Wang et al., ] combined a DCNN with a bidirectional LSTM for sequence labeling.

2.2 Adversarial Learning

The generative adversarial network (GAN) [Goodfellow et al., 2014, Radford et al., 2015] is a powerful framework for training deep generative models, which aim to learn the probability distribution of the input data. Originally, GANs were proposed to generate more realistic fake images [Frid-Adar et al., 2018], but recent research shows that the adversarial technique can be used to build task-specific training algorithms, e.g., for generative tasks such as super-resolution [Nguyen et al., , Ledig et al., 2017, Lee et al., 2018], style transfer [Zhu et al., 2017, Li et al., 2017], and natural-language processing [Rajeswar et al., 2017], and for discriminative tasks such as human pose estimation [Chou et al., 2017, Peng et al., 2018].

3 Proposed Method

In this section, we describe the details of the proposed end-to-end pipeline for LPR. The schematic of the method is illustrated in Figure 2. We first introduce the adversarial network that super-resolves the input image and reconstructs its output. Then, the details of the proposed one-stage character recognition network are presented, which recognizes characters on the license plate and locates individual text regions without character segmentation. Finally, we describe the training process used to find the optimal parameters of our model.

Figure 3: The proposed Auto-Encoder based reconstruction sub-network structure.

3.1 Adversarial Network Architecture

Adversarial learning techniques have been widely used in many tasks [Frid-Adar et al., 2018, Zhu et al., 2017, Rajeswar et al., 2017, Chou et al., 2017], providing boosted performance through adversarial data or features. In the vanilla GAN [Goodfellow et al., 2014], a minimax game is trained by alternately updating a generator sub-network and a discriminator sub-network. The value function of the generator and the discriminator is defined as:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big] \qquad (1)$$

where $x$ is an observation from the real data distribution $p_{data}(x)$ and $z$ is drawn from a random distribution $p_z(z)$. The two sub-networks have conflicting goals: each minimizes its own cost and maximizes the opponent's cost. The outcome of the minimax game is that the distribution $p_g$ generated by the generator exactly matches the data distribution $p_{data}$, at which point the discriminator can no longer distinguish samples drawn from the generator from real data. For a fixed generator, the optimal discriminator is:

$$D^{*}(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)} \qquad (2)$$

In a similar way, we modify the minimax value function of the vanilla GAN for SR: the generator, consisting of an HR generator and a reconstruction network, creates an HR image from an LP image, while the discriminator is trained to distinguish the fake HR image produced by the generator from the real HR image. This adversarial SR process is defined as follows:

$$\min_{\theta_G} \max_{\theta_D} \; \mathbb{E}_{I^{HR} \sim p_{train}(I^{HR})}\big[\log D_{\theta_D}(I^{HR})\big] + \mathbb{E}_{I^{LR} \sim p_G(I^{LR})}\big[\log\big(1 - D_{\theta_D}(G_{\theta_G}(I^{LR}))\big)\big] \qquad (3)$$

where $I^{HR}$ is the high-resolution image, $I^{LR}$ is the low-resolution image, and $\theta_G$ and $\theta_D$ denote the parameters of the feed-forward CNN $G$ and the discriminator $D$, respectively.

Generator Network. Different from [Goodfellow et al., 2014], our generator network is composed of two sub-networks, as shown in Figure 2: (1) an HR generator and (2) a convolutional auto-encoder for reconstruction. The former is a series of convolutional layers and fractionally-strided convolution layers (i.e., upsampling layers) inspired by [Ledig et al., 2017]. We use two upsampling layers (each upsampling by a factor of 2), as proposed by Radford et al. [Radford et al., 2015], and acquire a 4× enhanced image from them.
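
As an illustration, the following is a minimal TensorFlow/Keras sketch of such an HR generator; the residual-block count, filter widths, and kernel sizes are assumptions for illustration, not the paper's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_hr_generator(n_res_blocks=4):
    """Minimal HR-generator sketch: conv layers plus two fractionally-
    strided (transposed) conv layers, each upsampling 2x, for 4x SR."""
    inp = layers.Input(shape=(None, None, 3))
    x = layers.Conv2D(64, 9, padding='same', activation='relu')(inp)
    skip = x
    for _ in range(n_res_blocks):  # residual body, as in [Ledig et al., 2017]
        y = layers.Conv2D(64, 3, padding='same', activation='relu')(x)
        y = layers.BatchNormalization()(y)
        y = layers.Conv2D(64, 3, padding='same')(y)
        y = layers.BatchNormalization()(y)
        x = layers.Add()([x, y])
    x = layers.Conv2D(64, 3, padding='same')(x)
    x = layers.Add()([x, skip])
    # two 2x fractionally-strided convolutions -> 4x enhanced output
    x = layers.Conv2DTranspose(64, 3, strides=2, padding='same', activation='relu')(x)
    x = layers.Conv2DTranspose(64, 3, strides=2, padding='same', activation='relu')(x)
    out = layers.Conv2D(3, 9, padding='same', activation='tanh')(x)
    return tf.keras.Model(inp, out, name='hr_generator')
```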

In addition to this network, we include a reconstruction sub-network to refine the resolution-enhanced image. Given the 4× super-resolved output, the proposed network corrects slight distortions in the image through denoising-style learning. We employ a convolutional neural network (CNN) as both encoder and decoder, as shown in Figure 3. The encoder and decoder consist of the same number of convolutional layers, but the encoder adds MaxPooling2D layers for spatial down-sampling, while the decoder adds UpSampling2D layers, both with BatchNormalization [Ioffe and Szegedy, 2015].
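
A minimal sketch of this auto-encoder is given below; the number of stages and the filter counts are assumptions, while the MaxPooling2D/UpSampling2D/BatchNormalization structure follows the text.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_reconstruction_ae():
    """Reconstruction auto-encoder sketch (cf. Figure 3): a symmetric
    encoder/decoder trained in a denoising manner to straighten plates."""
    inp = layers.Input(shape=(None, None, 3))
    x = inp
    for filters in (64, 128):  # encoder: conv + MaxPooling2D down-sampling
        x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling2D()(x)
    for filters in (128, 64):  # decoder: conv + UpSampling2D up-sampling
        x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
        x = layers.BatchNormalization()(x)
        x = layers.UpSampling2D()(x)
    out = layers.Conv2D(3, 3, padding='same', activation='tanh')(x)
    return tf.keras.Model(inp, out, name='reconstruction_ae')
```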

Discriminator Network. Figure 2 shows the architecture of the discriminator network and its output components. Inspired by VGG19 [Simonyan and Zisserman, 2014], we follow the same network structure. To discriminate exact object regions, we split all the fully-connected layers into two parallel branches producing two outputs: (1) the number of characters in the image, as a counting result, and (2) an HR vs. SR decision.
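
The two-branch head might be wired as in the following sketch; the dense-layer widths and the maximum character count are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_discriminator(max_chars=8, in_shape=(224, 224, 3)):
    """Discriminator sketch: VGG19 backbone with the fully-connected stage
    split into two parallel branches (character count, HR vs. SR)."""
    backbone = tf.keras.applications.VGG19(
        include_top=False, weights='imagenet', input_shape=in_shape)
    x = layers.Flatten()(backbone.output)
    # branch 1: how many characters are in the image (counting result)
    cnt = layers.Dense(1024, activation='relu')(x)
    count_out = layers.Dense(max_chars, activation='softmax', name='count')(cnt)
    # branch 2: real HR image vs. generated SR image
    adv = layers.Dense(1024, activation='relu')(x)
    adv_out = layers.Dense(1, activation='sigmoid', name='hr_vs_sr')(adv)
    return tf.keras.Model(backbone.input, [count_out, adv_out])
```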

3.2 Character Recognition Network Architecture

In this section, we describe the details of the proposed character recognition approach, where localization and recognition are integrated into one stage. We employ YOLO v3 [Redmon and Farhadi, 2018] as our detection network. To achieve scale invariance, it detects characters at three scales, obtained by downsampling the image dimensions by factors of 32, 16, and 8, without MaxPooling2D layers. Unlike the previous model [Redmon and Farhadi, 2017], this, together with residual skip connections, allows better localization and recognition of small characters, which suits license plates, where characters are mostly small.

The shape of the detection kernel is 1 × 1 × (B × (5 + C)), where B is the number of bounding boxes per cell, 5 is the sum of the four bounding-box attributes (coordinates (x, y), width, and height) and one object confidence score, and C is the number of classes. In our method, we set B = 3, and C is 66 (10 numbers (0-9), 26 English letters, and 30 Korean letters), resulting in a 1 × 1 × 213 kernel.
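
The kernel depth follows directly from these definitions, as the short check below shows.

```python
import tensorflow as tf

B, C = 3, 66           # boxes per cell; 10 digits + 26 English + 30 Korean letters
depth = B * (5 + C)    # per box: (x, y, w, h) + 1 objectness score + C class scores
assert depth == 213    # matches the 1 x 1 x 213 detection kernel

# the detection head at each scale is then a 1 x 1 convolution:
head = tf.keras.layers.Conv2D(filters=depth, kernel_size=1)
```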

Furthermore, we add the counting information output by the discriminator as a conditional term in our character recognition model. The last layer of the recognition model takes the previous layer's output and this counting term as inputs. We demonstrate that our recognition model can be extended to a sophisticated model that accurately counts and localizes any character in any input; this is discussed further in Section 4.4.
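
One plausible wiring of this conditional term, sketched under the assumption that the discriminator's count distribution is broadcast over the detection grid and concatenated with the previous layer's output (the paper does not spell out the exact mechanism):

```python
import tensorflow as tf

def condition_on_count(features, count_probs):
    """Broadcast the character-count distribution over the spatial grid and
    concatenate it with the feature map feeding the final detection layer.
    (An assumed wiring, for illustration only.)"""
    h, w = tf.shape(features)[1], tf.shape(features)[2]
    cond = count_probs[:, tf.newaxis, tf.newaxis, :]  # (N, 1, 1, K)
    cond = tf.tile(cond, [1, h, w, 1])                # (N, H, W, K)
    return tf.concat([features, cond], axis=-1)
```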

3.3 Training

In this section, we discuss the objectives used to optimize our adversarial network and one-stage recognition network. Let $I^{LR}$, $I^{HR}$, and $I^{SR}$ denote a low-resolution image, a high-resolution image, and an SR image, respectively. Given a training dataset $\{(I^{LR}_i, I^{HR}_i)\}_{i=1}^{N}$, our goal is to learn an adversarial model that predicts the SR image from a low-resolution image, and a recognition model that predicts each character's class and location from the SR image.

Pixel-wise loss To push the generated plate image toward the high-resolution ground truth, our generator network is optimized with the MSE loss over pixel values between the images generated from the small, blurry plate images and the corresponding high-resolution ground truth, calculated as follows:

$$\mathcal{L}_{MSE} = \frac{1}{N} \sum_{i=1}^{N} \big\| G_2\big(G_1(I^{LR}_i)\big) - I^{HR}_i \big\|_2^2 \qquad (4)$$

where $G_1$ denotes the HR generator, $G_2$ denotes the reconstruction network, and $\theta_{G_1}$ and $\theta_{G_2}$ are the parameters of the generator network.
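
In code, Eq. (4) amounts to the following sketch, with g1 and g2 standing for the HR generator and reconstruction models sketched earlier:

```python
import tensorflow as tf

def pixel_wise_mse(lr_images, hr_images, g1, g2):
    """Eq. (4) sketch: MSE between G2(G1(I_LR)) and the HR ground truth."""
    sr = g2(g1(lr_images))
    return tf.reduce_mean(tf.square(sr - hr_images))
```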

Adversarial loss To give the generated image a sharpening effect, in contrast to the smoothing effect of the MSE loss, we define the adversarial loss as:

$$\mathcal{L}_{adv} = \frac{1}{N} \sum_{i=1}^{N} -\log D_{\theta_D}\big(G_2(G_1(I^{LR}_i))\big) \qquad (5)$$

The adversarial loss amplifies the photo-realistic effect and trains the generator to deceive the discriminator.
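
A sketch of Eq. (5), using the non-saturating form and the two-branch discriminator from Section 3.1 (function names are assumptions):

```python
import tensorflow as tf

def adversarial_loss(discriminator, sr_images, eps=1e-8):
    """Eq. (5) sketch: reward SR images the discriminator scores as HR."""
    _, hr_vs_sr = discriminator(sr_images)  # second output: HR-vs-SR score
    return tf.reduce_mean(-tf.math.log(hr_vs_sr + eps))
```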

Reconstruction loss To make the quality of the images generated by the generator more photo-realistic, we propose a reconstruction loss that corrects changes in the generated image topology that would interfere with detection. It is defined as follows:

$$\mathcal{L}_{rec} = \frac{1}{N} \sum_{i=1}^{N} \big\| G_2\big(G_1(I^{LR}_i)\big) - I^{HR}_i \big\|_1 \qquad (6)$$

The reconstruction loss is calculated as the L1 difference between the output of the reconstruction network $G_2$ and the straightened high-resolution target $I^{HR}$.
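
As code, the L1 term of Eq. (6) is simply (sketch):

```python
import tensorflow as tf

def reconstruction_loss(sr_images, hr_images):
    """Eq. (6) sketch: L1 difference between the reconstruction output
    G2(G1(I_LR)) and the straightened HR target."""
    return tf.reduce_mean(tf.abs(sr_images - hr_images))
```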

Classification loss The classification loss plays the roles of both a character counting task and a discrimination task. More specifically, the discriminator takes an image as input and classifies it into two outputs: real HR image vs. fake SR image, and the number of characters, respectively. The multi-task loss is calculated as follows:

$$\mathcal{L}_{cls} = -\frac{1}{N} \sum_{i=1}^{N} \Big[\, y_i \log D_{adv}(I_i) + (1 - y_i)\log\big(1 - D_{adv}(I_i)\big) + \log D_{cnt}(c_i \mid I_i) \Big] \qquad (7)$$

where $D_{adv}$ and $D_{cnt}$ are the two parallel branches of the discriminator, $c_i$ is the ground-truth number of characters (so $D_{cnt}(c_i \mid I_i)$ is the predicted probability of the correct count), and the label $y_i$ is 1 if $I_i$ is a real HR image and 0 if it is a generated SR image.
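
Under our reconstructed notation, the multi-task loss of Eq. (7) could be computed as in this sketch (binary cross-entropy for HR vs. SR plus cross-entropy over character counts):

```python
import tensorflow as tf

def classification_loss(discriminator, images, is_hr, char_counts):
    """Eq. (7) sketch: is_hr is 1 for real HR images, 0 for SR images;
    char_counts holds the ground-truth number of characters per image."""
    count_probs, hr_vs_sr = discriminator(images)
    bce = tf.keras.losses.binary_crossentropy(is_hr, hr_vs_sr)
    cce = tf.keras.losses.sparse_categorical_crossentropy(char_counts, count_probs)
    return tf.reduce_mean(bce + cce)
```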

4 Experimental Results

4.1 Setup

All reported implementations are based on TensorFlow as the learning framework, and our method runs on an NVIDIA TITAN X GPU. First, we use YOLO v3 pre-trained on COCO [Lin et al., 2014] as our one-stage recognition model and train on license plate images by fine-tuning its network parameters.

To avoid premature convergence of the discriminator network, the generator network is updated more frequently than in the original setup. In addition, a higher learning rate is applied to the training of the generator. For stable training, we use the gradient clipping trick [Pascanu et al., 2013] and the Adam optimizer [Kingma and Ba, 2014] with a high momentum term. For the discriminator network, we use the VGG-19 [Simonyan and Zisserman, 2014] model pre-trained on ImageNet as our backbone and split all the fully-connected layers into two parallel branches, one for counting and one for HR/SR discrimination. The weights of all parallel fully-connected layers are initialized from a zero-mean Gaussian distribution with a small standard deviation, and the biases of all layers are initialized to a constant. All models are trained on the loss function for the first 10 epochs with a fixed initial learning rate, after which we train with a further reduced learning rate for the remaining epochs. Finally, batch normalization [Ioffe and Szegedy, 2015] is used in all layers of the generator and discriminator, except the last layer of the generator and the first layer of the discriminator.
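
The training schedule described above might look like the following sketch; the learning rates, update ratio, and clipping norm are placeholders (the exact values were not preserved in this text), not the authors' settings.

```python
import tensorflow as tf

# placeholder hyperparameters, not the authors' exact values
g_opt = tf.keras.optimizers.Adam(learning_rate=2e-4, beta_1=0.9)  # higher LR for G
d_opt = tf.keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.9)
D_UPDATE_EVERY = 2  # generator is updated more frequently than discriminator

def train_step(step, lr_batch, hr_batch, generator, reconstructor,
               discriminator, g_loss_fn, d_loss_fn, clip_norm=1.0):
    """One alternating update with the gradient-clipping trick
    [Pascanu et al., 2013]."""
    with tf.GradientTape() as tape:
        sr = reconstructor(generator(lr_batch))
        g_loss = g_loss_fn(sr, hr_batch, discriminator)
    g_vars = generator.trainable_variables + reconstructor.trainable_variables
    grads, _ = tf.clip_by_global_norm(tape.gradient(g_loss, g_vars), clip_norm)
    g_opt.apply_gradients(zip(grads, g_vars))

    if step % D_UPDATE_EVERY == 0:  # discriminator updated less often
        with tf.GradientTape() as tape:
            d_loss = d_loss_fn(discriminator, sr, hr_batch)
        d_vars = discriminator.trainable_variables
        grads, _ = tf.clip_by_global_norm(tape.gradient(d_loss, d_vars), clip_norm)
        d_opt.apply_gradients(zip(grads, d_vars))
```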

4.2 Dataset

Figure 4: Samples from the unconstrained surveillance cameras in GIST-LP dataset.

AOLP: This dataset [Hsu et al., 2013] includes 2,049 images of Taiwan license plates collected from unconstrained surveillance scenes. It is divided into three subsets based on application parameters: access control (AC) with 681 samples, traffic law enforcement (LE) with 757 samples, and road patrol (RP) with 611 samples. 100 samples per subset are used for training, and the remaining 581 (AC) / 657 (LE) / 511 (RP) samples are used for testing. AC has a narrow range of variation, while LE and RP cover a wider range of conditions, making them the more challenging subsets. In addition, the RP samples, collected by mobile cameras, exhibit larger pan and orientation changes than the LE samples, which were collected by road cameras with fixed viewing angles.

GIST-LP: We collected and annotated a new dataset, GIST-LP, for LPR. It targets images captured by surveillance cameras in unconstrained scenes; we do not require the license plate to be large or frontal. We used traffic surveillance cameras with a spatial resolution of 1920 × 1080 pixels and annotated the characters, including Korean letters (30 categories) and numbers (0-9, 10 categories), for all license plate images. In total, 1,800 license plates appear in 1,569 frames. The characters are usually small, blurred, or tilted, but not occluded. The dataset includes a bounding box and a text class (Korean letters and numbers) for each character.

Method                                           AC       LE       RP       Avg
[Anagnostopoulos et al., 2006]                   92.00%   88.00%   91.00%   86.34%
[Jiao et al., 2009]                              90.00%   86.00%   90.00%   88.51%
[Smith, 2007]                                    96.00%   83.00%   83.00%   87.31%
[Hsu et al., 2013]                               95.00%   93.00%   94.00%   94.17%
Baseline (YOLO v3) [Redmon and Farhadi, 2018]    94.66%   89.04%   89.04%   90.90%
Ours without pixel-wise MSE loss                 97.24%   94.67%   94.91%   95.60%
Ours without reconstruction loss                 96.21%   88.89%   94.32%   92.91%
Ours without adversarial loss                    95.18%   87.67%   93.93%   92.00%
Ours without classification loss                 96.39%   94.98%   96.48%   95.88%
Ours                                             97.59%   95.89%   96.87%   96.74%
Table 1: Comparison of our method with other state-of-the-art methods on the AOLP dataset.
Method                                           Performance
RCNN based on VGG-16 [Girshick et al., 2014]     74.44%
RCNN based on ZFNET [Girshick et al., 2014]      72.11%
Faster-RCNN [Ren et al., 2015]                   86.77%
Baseline (YOLO v3) [Redmon and Farhadi, 2018]    84.16%
Ours without pixel-wise MSE loss                 91.78%
Ours without reconstruction loss                 89.00%
Ours without adversarial loss                    87.72%
Ours without classification loss                 90.78%
Ours                                             93.83%
Table 2: Comparison of our method with other state-of-the-art methods on the GIST-LP dataset.
Figure 5: Examples from the GIST-LP dataset. Qualitative sample images of recognition results: the first column shows the original plates, the second column shows the character localization results, and the third shows the recognition results.

4.3 Comparison with Other Methods

Figure 6: Examples from the AOLP dataset [Hsu et al., 2013]. Poor resolution and background clutter are common challenges in character recognition.

In the experiments on AOLP, we compare our method with state-of-the-art license plate recognition approaches [Anagnostopoulos et al., 2006, Jiao et al., 2009, Smith, 2007, Hsu et al., 2013]. The results are listed in Table 1, reported as recognition accuracy, which requires both text localization and classification to be correct at the same time. Our method obtains the highest performance (96.74%) across all subsets and outperforms the state-of-the-art LPR approaches by more than 2.5%. It is also important to note that, under fairly tilted conditions, our method operates consistently and successfully detects the characters where the baseline fails. Furthermore, an interesting finding in Figure 6 (b, c) is that adding the adversarial loss highlights the positive features while suppressing irrelevant ones, further improving detection at night and under confusing conditions. Based on these observations, our proposed method performs at least as well as the others and outperforms all other methods in most cases.

For the LPR experiments on GIST-LP, we compare our method with [Girshick et al., 2014, Ren et al., 2015] and follow the standard metric (recognition accuracy) of GIST-LP. GIST-LP contains many tiny license plates, which makes accurate character detection difficult. Accordingly, we found that the state-of-the-art method [Redmon and Farhadi, 2018], which does not account for tiny and blurred plates, performed poorly. Our method mitigates the influence of these conditions and recognizes these license plates successfully. Even under such challenging conditions, our LPR performance still achieves 93.83%, surpassing all other state-of-the-art LPR approaches, as shown in Table 2.

4.4 Ablation Study

In the proposed method, the loss functions of the adversarial network attend to different regions, each with its own role. To inspect their influence on character recognition performance, we removed one loss function from the objective at a time and compared the result with the complete objective function. At the extreme, comparing the baseline with the full objective function shows a considerable gap (5.84% / 9.67%) in Tables 1 and 2.

Removing any one loss function from the overall objective causes a considerable performance drop. First, even though the MSE loss is not ideal for tiny objects because of its smoothing effect, removing it degrades performance by up to 1.14% (AOLP) / 2.05% (GIST-LP), since it drives the up-scaling super-resolution. The reconstruction loss affects the correct straightening of tilted plates, because the SR performance of the generator depends somewhat on the tilt angle of the license plate; it contributes about 3.83% (AOLP) and 4.83% (GIST-LP) to performance. Next, we observe that the adversarial loss yields the sharpened super-resolved result of the minimax game and thus strongly influences detection performance, as shown in Figure 6: it accounts for almost 6.11% on GIST-LP, which has relatively more tiny plates than AOLP (Table 2), and nearly 4.74% on AOLP (Table 1). Finally, removing the classification loss shows its significant impact on character recognition performance: it accounts for an improvement of 0.86% (AOLP) and 3.05% (GIST-LP). This proves that our two parallel fully-connected classification layers benefit both the text localization of the detector and the SR performance of the generator. It also demonstrates that the counting term, used as conditional data, helps explore the space of character localization as thoroughly as possible.

4.5 Qualitative Results

As shown in Figure 6, we give additional examples of clear LPs generated by the proposed generator network from tiny ones. On thorough investigation of the generated images, we find that our method learns strong priors through the proposed GAN loss functions by focusing on the plate contour and specific letters and numbers, as shown in Figure 6 (a). This implies that the proposed losses yield visually much clearer LPs and can help solve this ill-posed problem. The SR module can thus capture tiny LPs without hallucination, which suggests the proposed architecture helps reduce false negatives.

5 Conclusions

In this paper, we propose a new GAN-based method to recognize characters in unconstrained license plates. We design a novel network that directly generates a clear SR image from a blurry small one, with the up-sampling sub-network and the reconstruction sub-network trained in an end-to-end way. Moreover, we introduce an extra classification branch to the discriminator network, which distinguishes HR from SR and predicts the character count simultaneously. Furthermore, the adversarial loss drives the generator network to restore clearer SR images. Our experiments on the AOLP and GIST-LP datasets demonstrate substantial improvements over previous state-of-the-art methods.

Acknowledgements

This work was supported by the Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korean government (MSIP) (B0101-16-0525, Development of global multi-target tracking and event prediction techniques based on real-time large-scale video analysis).

References