Adversarial Generation of Training Examples: Applications to Moving Vehicle License Plate Recognition

07/11/2017 ∙ by Xinlong Wang, et al. ∙ The University of Adelaide

Generative Adversarial Networks (GAN) have attracted much research attention recently, leading to impressive results for natural image generation. However, to date little success has been observed in using GAN-generated images for improving classification tasks. Here we attempt to explore, in the context of car license plate recognition, whether it is possible to generate synthetic training data using GAN to improve recognition accuracy. With a carefully-designed pipeline, we show that the answer is affirmative. First, a large-scale image set is generated using the generator of GAN, without manual annotation. Then, these images are fed to a deep convolutional neural network (DCNN) followed by a bidirectional recurrent neural network (BRNN) with long short-term memory (LSTM), which performs feature learning and sequence labelling. Finally, the pre-trained model is fine-tuned on real images. Our experimental results on a few data sets demonstrate the effectiveness of using GAN images: an improvement of 7.5 percentage points over a strong baseline when only a limited amount of annotated real data is available. We show that the proposed framework achieves competitive recognition accuracy on challenging test datasets. We also leverage depthwise separable convolutions to construct a lightweight convolutional RNN, which is about half the size and 2x faster on CPU. Combining this framework and the proposed pipeline, we make progress in performing accurate recognition on mobile and embedded devices.

I Introduction

As GANs [6, 27] have been developed to generate compelling natural images, we attempt to explore whether GAN-generated images can aid training in the same way that real images do. GANs generate images through an adversarial training procedure that learns the real data distribution. Most previous work focuses on using GANs to manipulate images for computer graphics applications [14, 29]. However, to date little success has been observed in using GAN-generated images to improve classification tasks. To our knowledge, we are among the first to exploit GAN-generated images for supervised recognition tasks and gain remarkable improvements. For many tasks, labelled data are limited and hard to collect, not to mention that training data with manual annotation typically costs tremendous effort. This problem becomes worse in the deep learning era, as deep learning methods are in general much more data-hungry. Applications in these fields are hampered when training data for the task are scarce. Inspired by the excellent photo-realistic quality of images generated by GANs, we explore their effectiveness in supervised learning tasks.

More specifically, in this work we study the problem of using GAN images in the context of license plate recognition (LPR). License plate recognition is an important research topic and a crucial part of intelligent transportation systems (ITS). A license plate is a principal identifier of a vehicle. This binding relationship creates many applications using license plate recognition, such as traffic control, electronic payment systems and criminal investigation. Most research on LPR has focused on using a fixed camera to capture car license images. However, the task of using a moving camera to capture license plate images of moving cars can be much more challenging, and is not well studied in the literature. Moreover, the lack of a large labelled training set for this task means data-hungry deep learning methods do not perform well.

Recent advances in deep learning algorithms bring multiple choices which can achieve great performance on specific datasets. A notable cascade framework [18] applies a sliding window for feature extraction, performing LPR without segmentation. Inspired by the work of [33, 18] and CNNs' strong capability of learning informative representations from images, we combine a deep convolutional neural network and a recurrent neural network in a single framework, and recognize a license plate without segmentation; this views a license plate as a whole object and learns its inner rules from data. Independent of the deep models being applied, achieving high recognition accuracy requires a large amount of annotated license plate image data. However, due to privacy issues, image data of license plates are hard to collect. Moreover, manual annotation is expensive and time-consuming. What is more, LPR is a task with regional characteristics: license plates differ across countries and regions. For many regions, it is even harder to collect sufficient samples to build a robust and high-performance model. To tackle this problem, we propose a novel pipeline which leverages GAN-generated images in training. The experiments on Chinese license plates show its effectiveness, especially when training data are scarce.

To achieve this goal, first, we use computer graphic scripts to synthesize images of license plates, following the fonts, colors and composition rules. Note that these synthetic images are not photo realistic, as it is very difficult to model the complex distribution of real license plate images by using hand-crafted rules only. This step is necessary as we use these synthetic images as the input to the GAN. Thus the GAN model refines the image style to make images look close to real images. At the same time the content, which is the actual vehicle plate number, is kept. This is important as we need the plate number—the ground-truth label—for supervised classification in the sequel. We then train a CycleGAN [44] model to generate realistic images of license plates with labels, by learning the mapping from synthetic images to real images. The main reason for using CycleGAN is that, we do not have paired input-output examples for training the GAN. As such, methods like pixel-to-pixel GAN [14] are not applicable. CycleGAN can learn to translate between domains without paired input-output examples. An assumption of CycleGAN is that there is an underlying relationship between the two domains. In our case, this relationship is two different renderings of the same license plate: synthetic graphics versus real photos. We rely on CycleGAN to learn the translation from synthetic graphics to real photos.

Finally, we train the convolutional recurrent neural network [33] on the GAN images, and fine-tune it on real images, as illustrated in Fig. 1(b), in the manner of the curriculum learning paradigm [4].

The core insight behind this is that we exploit the license numbers from synthetic license plates and the photo-realistic style from real license plates, combining knowledge-based (i.e., hand-crafted rules) and learning-based, data-driven approaches. Applying the proposed pipeline, we achieve significantly improved recognition accuracy on moderate-sized training sets of real images. Specifically, the model trained on 25% of the training data (50,000 images) achieves an impressive accuracy of 96.3%, while the accuracy is 96.1% when trained on the whole training set (200,000 images) without our approach.

Besides the proposed framework, we consider semi-supervised learning using unlabeled images generated by a deep convolutional GAN (DCGAN) [27] as an add-on, for the sake of comparison between our approach and the existing semi-supervised method using GAN images.

To run the deep LPR models on mobile devices, we replace the standard convolutions with depthwise separable convolutions [34] and construct a lightweight convolutional RNN. This architecture efficiently decreases the model size and speeds up inference, with only a small reduction in accuracy.

The main contributions of this work can be summarized as follows.

  • We propose a pipeline that applies GAN-generated images to supervised learning as auxiliary data. Furthermore, we follow a curriculum learning strategy to train the system with gradually more complex training data. In the first step, we use a large number of GAN-generated images, whose appearance is simpler than that of real images, and a large lexicon of license plate numbers (recall that vehicle license plate numbers/letters follow certain rules or grammars; our system learns these combinatorial rules with the BRNN on the large number of GAN images). At the same time, the first step finds a good initialization of the appearance model for the second step. By employing a relatively small set of real-world images, the system gradually learns how to handle complex appearance patterns.

  • We show the effectiveness of the Wasserstein distance loss in CycleGAN training, which improves data variety and convergence. The original CycleGAN uses a least-squares adversarial loss, which is more stable than the standard negative log-likelihood loss. We show that the Wasserstein distance loss, as in the Wasserstein GAN (WGAN) [3], works better still for CycleGAN.

  • We analyze when and why GAN images help in supervised learning.

  • We build a small and efficient LightCRNN, which has half the model size and runs 2x faster on CPU.

The rest of the paper is arranged as follows. In Section 2 we briefly review the related work. In Section 3 we describe the details of the networks used in our approach. Experimental results are provided in Section 4, and conclusions are drawn in Section 5.

II Related Work

In this section, we discuss the relevant work on license plate recognition, generative adversarial networks and data generation for training.

(a) The CycleGAN Model
(b) Our proposed Pipeline
Fig. 1: (a) The architecture of the CycleGAN model, which is essential for our framework. There are two generators (or mapping functions): G: X → Y and F: Y → X. X and Y represent the synthetic and real images respectively. The adversarial loss (L_GAN) is used to evaluate the performance of each generator and its corresponding discriminator. Besides, the cycle consistency loss (L_cyc) is employed to evaluate whether F(G(x)) and G(F(y)) are consistent with x and y. (b) The pipeline of the proposed approach. There are two components: a generative adversarial model and a convolutional recurrent neural network. The “Synthetic Data” represent the labeled images generated by our scripts.

II-A License Plate Recognition

Existing methods for license plate recognition (LPR) can be divided into two categories: segmentation-based [23, 7, 9] and segmentation-free [18]. Segmentation-based methods first segment the license plate into individual characters, and then recognize each segmented character using a classifier. Segmentation algorithms mainly consist of projection-based [23, 9] and connected-component-based [2, 16] approaches. After segmentation, template-matching-based [28, 5] and learning-based [38, 21, 16] algorithms can be used to tackle this character-level classification task.

Learning-based algorithms, including support vector machines [38], hidden Markov models (HMM) [21] and neural networks [16, 35], are more robust than template-matching-based methods since they extract discriminative features. However, the segmentation process loses information about the inner rules of license plates, and the segmentation performance has a significant influence on the recognition performance. Li and Shen [18] proposed a cascade framework using deep convolutional neural networks and LSTMs for license plate recognition without segmentation, where a sliding window is applied to extract a feature sequence. Our method is also a segmentation-free approach, based on the framework proposed in [33], where a deep CNN is applied directly for feature extraction without a sliding window, and a bidirectional LSTM network is used for sequence labelling.

II-B Generative Adversarial Networks

The generative adversarial network [6] trains a generative model and a discriminative model simultaneously via an adversarial process. Deep convolutional generative adversarial networks (DCGANs) [27] provide a stable architecture for training GANs. The conditional GAN [22] generates images with specific class labels by conditioning both the generator and the discriminator. Beyond class labels, GANs can also be conditioned on text descriptions [29] and images [14], which enables text-to-image or image-to-image translation.

Zhu et al. [44] proposed CycleGAN, which learns a mapping between two domains without paired images and upon which our model builds. In order to use unpaired images for training, CycleGAN introduces the cycle consistency loss to fulfill the idea that “if we translate from one domain to another and back again, we must arrive where we started”.

The Wasserstein GAN (WGAN) [3] work designed a training algorithm that provides techniques to improve the stability of learning and prevent mode collapse. Beyond that, GANs have also achieved impressive results in image inpainting [25], representation learning [32] and 3D object generation [39].

However, to date, few results have been reported demonstrating the effectiveness of GAN images in supervised learning. We propose to use CycleGAN [44] together with techniques from WGAN to generate labeled images, and show that these generated images indeed help improve recognition performance.

II-C Data Generation for Training

Typically, supervised deep networks achieve acceptable performance only when a large number of labeled examples is available. The performance often improves as the dataset size increases. However, in many cases, it is hard or even impossible to obtain a large amount of labeled data.

Synthetic data have been used to achieve great performance in text localisation [10] and scene text recognition [15], without manual annotation. Additional synthetic training data [19] yield improvements in person detection [40], font recognition [36] and semantic segmentation [30]. However, such knowledge-based approaches, which hard-code knowledge about what real images look like, are fragile, as the generated examples often look like toys to a discriminative model when compared to real images. Zheng et al. [43] use unlabeled samples generated by a vanilla DCGAN for semi-supervised learning, which slightly improves person re-identification performance. In this work, we combine the knowledge-based approach and the learning-based approach to generate labeled license plates from the generator of a GAN for supervised training. For comparison, we also perform semi-supervised learning using unlabeled GAN images.

III Network Overview

In this section, the pipeline of the proposed method is described. As shown in Fig. 1, we train the GAN model using synthetic images and real images simultaneously. We then use the generated images to train a convolutional BRNN model. Finally, we fine-tune this model using real images. Below we describe the two components, namely the GAN and the convolutional RNN, in detail.

III-A Generative Adversarial Network

A generative adversarial network generally consists of two sub-networks: a generator and a discriminator. The adversarial training process is a minimax game: each sub-network aims to minimize its own cost and maximize the other's. This adversarial process leads to a converged state in which the generator outputs realistic images and the discriminator extracts deep features. The GAN framework we use in this pipeline is the Cycle-Consistent Adversarial Network (CycleGAN) [44] with the Wasserstein distance loss as in the Wasserstein GAN (WGAN) [3].

CycleGAN As our goal is to learn a mapping from synthetic images to real images, namely from domain X to domain Y, we use two generators, G: X → Y and F: Y → X, and two discriminators, D_X and D_Y. The core insight is that we not only want G(x) to look like a sample from Y, but also want F(G(x)) ≈ x, an additional restriction used to prevent the well-known problem of mode collapse, where all input images map into the same output image. As illustrated in Fig. 1(a), the mapping functions G and F should both be cycle-consistent. We follow the settings in [44]. The generator network consists of two stride-2 convolutions, six residual blocks and two fractionally-strided convolutions with stride 1/2, i.e., deconvolutions that upsample the feature maps. For the discriminator networks, we use PatchGANs [14], which, instead of classifying the entire image, classify image patches as fake or real. The objective is defined as the sum of the adversarial losses and the cycle consistency loss:

\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda \mathcal{L}_{cyc}(G, F) \qquad (1)

Here \mathcal{L}_{GAN}(G, D_Y, X, Y) evaluates the adversarial least-squares loss of the mapping function G and the discriminator D_Y:

\mathcal{L}_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}\big[(D_Y(y) - 1)^2\big] + \mathbb{E}_{x \sim p_{data}(x)}\big[D_Y(G(x))^2\big] \qquad (2)

G and D_Y play an adversarial game under (2): while G aims to generate images G(x) that look similar to real images (pushing D_Y(G(x)) towards 1), D_Y tries to distinguish between the generated samples G(x) and the real images y; the same holds for F and D_X. The cycle consistency loss is defined as below:

\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}\big[\lVert F(G(x)) - x \rVert_1\big] + \mathbb{E}_{y \sim p_{data}(y)}\big[\lVert G(F(y)) - y \rVert_1\big] \qquad (3)

Optimizing \mathcal{L}_{cyc} brings F(G(x)) back to x and G(F(y)) back to y. The weight \lambda in (1) represents the strength of the cycle consistency. Further details of CycleGAN can be found in [44].
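To make the objective concrete, below is a minimal sketch of the loss terms in Equations (1)–(3), written with TensorFlow and assuming the generators G, F and discriminators D_X, D_Y are callables mapping batched image tensors to tensors; the function names and defaults are ours, not the authors' code.

```python
import tensorflow as tf

def lsgan_loss(discriminator, real, fake):
    # Least-squares adversarial loss of Eq. (2): the discriminator pushes real
    # images towards 1 and generated images towards 0, while the generator
    # pushes its outputs towards 1.
    d_loss = tf.reduce_mean(tf.square(discriminator(real) - 1.0)) \
             + tf.reduce_mean(tf.square(discriminator(fake)))
    g_loss = tf.reduce_mean(tf.square(discriminator(fake) - 1.0))
    return d_loss, g_loss

def cycle_consistency_loss(G, F, x, y):
    # Cycle consistency loss of Eq. (3): F(G(x)) should reconstruct x and
    # G(F(y)) should reconstruct y, measured with an L1 penalty.
    return tf.reduce_mean(tf.abs(F(G(x)) - x)) + tf.reduce_mean(tf.abs(G(F(y)) - y))

def generator_objective(G, F, D_X, D_Y, x, y, lam=10.0):
    # Eq. (1): the two adversarial terms plus the weighted cycle term,
    # as minimized by the generators G and F (lam follows the paper's lambda = 10).
    _, g_xy = lsgan_loss(D_Y, y, G(x))
    _, g_yx = lsgan_loss(D_X, x, F(y))
    return g_xy + g_yx + lam * cycle_consistency_loss(G, F, x, y)
```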

Wasserstein GAN We apply the techniques proposed in [3] to CycleGAN and obtain CycleWGAN. In the training of GANs, the way of measuring the distance between two distributions plays a crucial role. A poor distance measure makes training hard and can even cause mode collapse. The Earth-Mover (EM) distance, a.k.a. the Wasserstein distance, is used when deriving the loss, as it is more sensible and converges better. The EM distance between two distributions \mathbb{P}_r and \mathbb{P}_g is defined as below:

W(\mathbb{P}_r, \mathbb{P}_g) = \inf_{\gamma \in \Pi(\mathbb{P}_r, \mathbb{P}_g)} \mathbb{E}_{(x, y) \sim \gamma}\big[\lVert x - y \rVert\big] \qquad (4)

which can be interpreted as the minimum “mass” that must be moved to transform the distribution \mathbb{P}_g into the distribution \mathbb{P}_r. To make it tractable, we apply the Kantorovich-Rubinstein duality and get

W(\mathbb{P}_r, \mathbb{P}_g) = \sup_{\lVert f \rVert_L \le 1} \mathbb{E}_{x \sim \mathbb{P}_r}[f(x)] - \mathbb{E}_{x \sim \mathbb{P}_g}[f(x)] \qquad (5)

where the supremum is taken over all 1-Lipschitz functions f. To solve the problem

\sup_{\lVert f \rVert_L \le 1} \mathbb{E}_{x \sim \mathbb{P}_r}[f(x)] - \mathbb{E}_{x \sim \mathbb{P}_g}[f(x)], \qquad (6)

we can find a parameterized family of functions \{f_w\}_{w \in \mathcal{W}} that are all K-Lipschitz for some constant K, so that the supremum in (6) is approximated up to the constant K. Then this problem is transformed to

\max_{w \in \mathcal{W}} \mathbb{E}_{x \sim \mathbb{P}_r}[f_w(x)] - \mathbb{E}_{x \sim \mathbb{P}_g}[f_w(x)] \qquad (7)

In order to enforce the Lipschitz constraint, i.e., that the parameters lie in a compact space, we clamp the weights to a fixed range after each gradient update. In the training of the GAN, the discriminator plays the role of f_w above.

Specifically, we make four modifications to the CycleGAN described above:

  • Replace the original adversarial loss (2) with:

    \mathcal{L}_{WGAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}\big[D_Y(y)\big] - \mathbb{E}_{x \sim p_{data}(x)}\big[D_Y(G(x))\big] \qquad (8)

    where the critic D_Y tries to maximize this Wasserstein estimate while the generator G tries to minimize it. The same applies to \mathcal{L}_{WGAN}(F, D_X, Y, X).

  • Update the discriminators n_critic times before each update of the generators.

  • Clamp the parameters of discriminators to a fixed range after each gradient update.

  • Use RMSProp [31] for optimization, instead of Adam [17].
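A rough sketch of how these four modifications fit together during training, assuming Keras-style models with trainable_variables; the critic update count of 5 matches the value reported later in the paper, whereas the clip range and learning rate follow common WGAN defaults and are assumptions.

```python
import tensorflow as tf

def wgan_critic_loss(D_Y, real_y, fake_y):
    # Negative of Eq. (8): minimizing this makes the critic assign higher
    # scores to real images than to generated ones.
    return tf.reduce_mean(D_Y(fake_y)) - tf.reduce_mean(D_Y(real_y))

def update_critic(G, D_Y, x_batches, y_batches, opt, clip=0.01, n_critic=5):
    # Modification 2: several critic updates per generator update.
    for x, y in zip(x_batches[:n_critic], y_batches[:n_critic]):
        with tf.GradientTape() as tape:
            loss = wgan_critic_loss(D_Y, y, G(x))
        grads = tape.gradient(loss, D_Y.trainable_variables)
        opt.apply_gradients(zip(grads, D_Y.trainable_variables))
        # Modification 3: clamp critic weights to a fixed range.
        for w in D_Y.trainable_variables:
            w.assign(tf.clip_by_value(w, -clip, clip))

# Modification 4: RMSProp instead of Adam for both critics and generators.
critic_optimizer = tf.keras.optimizers.RMSprop(learning_rate=5e-5)
```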

Layer Type Configurations
Bidirectional-LSTM #hidden units: 256
Bidirectional-LSTM #hidden units: 256
ReLU -
BatchNormalization -
Convolution #filters:512, k:, s:1, p:0
ReLU -
BatchNormalization -
Convolution #filters:512, k:, s:1, p:0
MaxPooling s:, p:
ReLU -
Convolution #filters:512, k:, s:1, p:1
ReLU -
BatchNormalization -
Convolution #filters:512, k:, s:1, p:1
MaxPooling s:, p:
ReLU -
Convolution #filters:256, k:, s:1, p:1
ReLU -
BatchNormalization -
Convolution #filters:256, k:, s:1, p:1
MaxPooling p:, s:2
ReLU -
Convolution #filters:128, k:, s:1, p:1
MaxPooling p:, s:2
ReLU -
Convolution #filters:64, k:, s:1, p:1
Input RGB images
TABLE I:

Configuration of the convolutional RNN model. “k”, “s” and “p” represent kernel size, stride and padding size.

Name Width Height Time Style Provinces Training set size Test set size
Dataset-1 Day, Night Blue, Yellow 30/31 203,774 9,986
Dataset-2 Day, Night Blue, Yellow 4/31 45,139 5,925
TABLE II: Composition of the real datasets. The two datasets are collected separately but both come from open environments.

III-B Convolutional Recurrent Neural Network

The framework that we use is a combination of deep convolutional neural networks and recurrent neural networks based on [33]. The recognition procedure consists of three components: sequence feature learning, sequence labelling and sequence decoding, corresponding one-to-one to the convolutional layers, recurrent layers and decoding layer of the convolutional RNN architecture. The overall configuration is presented in Table I.

Sequence feature learning CNNs have demonstrated an impressive ability to extract deep features from images [37]. An 8-layer CNN model is applied to extract sequence features, as shown in Table I. Batch normalization is applied in the 3rd, 5th, 7th and 8th layers, and rectified linear units follow all 8 layers. All three channels of the RGB images are used: images are resized to a fixed size before being fed to the network. After the 8-layer extraction, a sequence of feature vectors is output as the informative representation of the raw image and the input to the recurrent layers.

Sequence labelling RNNs have achieved great success in sequential problems such as handwriting recognition and language translation. To learn contextual cues in license numbers, a deep bidirectional recurrent neural network is built on top of the CNN. To avoid gradient vanishing, LSTM units are employed instead of vanilla RNN units. The LSTM is a special type of RNN unit consisting of a memory cell and gates. The form of LSTM in [41] is adopted in this paper, where an LSTM contains a memory cell and four multiplicative gates, namely the input, input modulation, forget and output gates. A fully-connected layer with 68 neurons is added after the last BiLSTM layer, for the 68 label classes—31 Chinese characters, 26 letters, 10 digits and “blank”. Finally, the feature sequence is transformed into a sequence of probability distributions y = (y^1, y^2, ..., y^T), where T is the sequence length of the input feature vectors and the superscript t can be interpreted as a time step. y^t is a probability distribution over the label set at time t: y^t_k is interpreted as the probability of observing label k at time t, where the label set contains all 68 labels, as illustrated in Fig. 2.
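For concreteness, a minimal Keras sketch of the recurrent labelling head described above (two BiLSTM layers of 256 hidden units and a 68-way per-time-step projection); the input feature dimension of 512 is taken from the last convolutional layer of Table I, and the remaining details are assumptions rather than the authors' exact implementation.

```python
import tensorflow as tf

def sequence_labelling_head(feature_dim=512, num_classes=68):
    # Two stacked bidirectional LSTMs (256 hidden units each, as in Table I),
    # followed by a per-time-step softmax over the 68 labels
    # (31 Chinese characters, 26 letters, 10 digits and "blank").
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(None, feature_dim)),
        tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(256, return_sequences=True)),
        tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(256, return_sequences=True)),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
```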

Sequence decoding Once we have the sequence of probability distributions, we transform it into a character string using best path decoding [8]. The probability of a specific path π given an input x is defined as below:

p(\pi \mid x) = \prod_{t=1}^{T} y^{t}_{\pi_t} \qquad (9)

As described above, y^t_{\pi_t} means the probability of observing label \pi_t at time t, which is visualized in Fig. 2. For best path decoding, the output is the labelling of the most probable path:

l^{*} \approx \mathcal{B}(\pi^{*}), \quad \pi^{*} = \arg\max_{\pi} p(\pi \mid x) \qquad (10)

Here we define the operation \mathcal{B} as removing all blanks and repeated labels from the path; for example, \mathcal{B}(a–ab–) = aab, with “–” denoting blank. As there is no tractable, globally optimal decoding algorithm for this problem, best path decoding is an approximate method that assumes the most probable path corresponds to the most probable labelling: it simply takes the most active output at each time step as \pi^{*}_t and then maps \pi^{*} onto l^{*} via \mathcal{B}. More sophisticated decoding methods such as dynamic programming may lead to better results.
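The decoding step amounts to a few lines of code; in this sketch the per-time-step probabilities are a NumPy array and the blank label is assumed to be the last of the 68 classes, following the class ordering described for Fig. 2.

```python
import numpy as np

def best_path_decode(probs, blank=67):
    # probs: array of shape (T, 68) holding y^t_k for every time step t.
    # Best path: take the most active label at each time step, then apply B:
    # collapse runs of repeated labels and drop blanks.
    path = np.argmax(probs, axis=1)
    decoded, prev = [], None
    for label in path:
        if label != prev and label != blank:
            decoded.append(int(label))
        prev = label
    return decoded
```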

Training method We train the networks with stochastic gradient descent (SGD). The labelling loss is derived using Connectionist Temporal Classification (CTC) [8]. The optimization algorithm Adadelta [42] is then applied, as it converges fast and requires no manual setting of a learning rate.
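A hedged sketch of this training objective, using TensorFlow's built-in CTC loss and Adadelta; the tensor shapes and the blank index (last of the 68 classes) follow our reading of the text and are assumptions rather than the authors' exact code.

```python
import tensorflow as tf

def ctc_training_loss(labels, logits, label_lengths, logit_lengths):
    # labels: (batch, max_label_len) int32 plate labels; logits: (batch, T, 68)
    # unnormalized per-time-step scores. blank_index=-1 uses the last class
    # (67) as "blank", matching the class ordering described in the text.
    loss = tf.nn.ctc_loss(
        labels=labels,
        logits=logits,
        label_length=label_lengths,
        logit_length=logit_lengths,
        logits_time_major=False,
        blank_index=-1,
    )
    return tf.reduce_mean(loss)

# Adadelta requires no manually tuned learning-rate schedule.
optimizer = tf.keras.optimizers.Adadelta()
```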

Fig. 2: License plate recognition confidence map. The top is the license plate to be predicted. The color map shows the recognition probabilities from the linear projection of the outputs of the last LSTM layer. The recognition probabilities over the 68 classes are shown vertically (with class order from top to bottom: 0-9, A-Z, Chinese characters and “blank”). Confidence increases from green to white.
Fig. 3: Some samples of Dataset-1. The license plates in Dataset-1 come from 30 of all 31 provinces. All of them are captured from real traffic monitoring scenes.
Fig. 4: Some samples of Dataset-2. All license plates in Dataset-2 are collected from four provinces, through different channels than Dataset-1.

IV Experiments

IV-A Datasets

We collect two datasets in which images are captured from a wide variety of real traffic monitoring scenes under various viewpoints, blurring and illumination. As listed in Table II, Dataset-1 contains a training set of 203,774 plates and a test set of 9,986 plates; Dataset-2 contains a training set of 45,139 plates and a test set of 5,925 plates. Some sample images are shown in Fig. 3 and Fig. 4. For Chinese license plates, the first character is a Chinese character that represents the province. While there are 31 province abbreviations in total, Dataset-1 covers 30 of them and Dataset-2 covers 4, which means the source areas of our data cover a wide range of ground-truth license plates. Our datasets are very challenging and comprehensive due to the extreme diversity of character patterns, such as different resolutions, illumination and character distortions. Note that these two datasets come from different sources and do not follow the same distribution.

IV-B Evaluation Criterion

In this work, we evaluate model performance in terms of recognition accuracy and character recognition accuracy, which are evaluated at the license plate level and the character level, respectively. Recognition accuracy is defined as the number of license plates for which every character is correctly recognized, divided by the total number of license plates:

\mathrm{RA} = \frac{\#\{\text{license plates with all characters correctly recognized}\}}{\#\{\text{license plates}\}} \qquad (11)

Character recognition accuracy is defined as the number of correctly recognized characters divided by the number of all characters:

\mathrm{CRA} = \frac{\#\{\text{correctly recognized characters}\}}{\#\{\text{all ground-truth characters}\}} \qquad (12)

Besides, we compute the top-k recognition accuracy as well. Top-k recognition accuracy is defined as the number of license plates whose ground-truth label is among the top k predictions, divided by the total number of license plates. The recognition accuracy defined above is a special case of top-k recognition with k equal to 1. Benefiting from the LSTMs' sequence output, we can easily obtain the top-k recognition accuracy by performing beam search decoding on the output logits. The top-k recognition accuracy makes sense in criminal investigation: an accurate list of candidates can be provided when searching for a specific license plate, which means a high top-k accuracy promises a low miss rate.
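The two metrics and their top-k variant amount to simple counting; a small sketch over predicted and ground-truth plate strings, with positional character alignment being our assumption.

```python
def recognition_accuracy(predictions, ground_truths):
    # RA, Eq. (11): a plate counts as correct only if every character matches.
    correct = sum(p == g for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)

def character_recognition_accuracy(predictions, ground_truths):
    # CRA, Eq. (12): correctly recognized characters over all ground-truth
    # characters; here characters are compared position by position.
    hits = sum(pc == gc
               for p, g in zip(predictions, ground_truths)
               for pc, gc in zip(p, g))
    total = sum(len(g) for g in ground_truths)
    return hits / total

def top_k_accuracy(candidate_lists, ground_truths, k=5):
    # Top-k RA: the ground truth appears among the k best beam-search outputs.
    correct = sum(g in c[:k] for c, g in zip(candidate_lists, ground_truths))
    return correct / len(ground_truths)
```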

(a) Script Images
(b) CycleGAN Images
(c) CycleWGAN Images
(d) Real Images
Fig. 5: (a) The examples of license plates generated by our scripts (simple graphics with hand-crafted rules; in other words, colors and character deformation are hard-coded). Note that these images are generated with labels. (b) The examples of license plates generated by CycleGAN [44] model, using synthetic images as input. (c) The examples of license plates generated by CycleWGAN model, in which WGAN [3] techniques are applied. (d) Real license plates from Dataset-1.

IV-C Implementation Details

The recognition framework The convolutional RNN framework is similar to that of [33] and [18]. We implement it using TensorFlow [1]. The images are first resized to a fixed size and then fed to the recognition framework. We modify the last fully-connected layer to have 68 neurons according to the 68 label classes—31 Chinese characters, 26 letters, 10 digits and “blank”. Note that the decoding method we use above, whether top-1 or top-k, is based on a greedy strategy: we get the best path—the most probable path—rather than the most probable labelling. Graves et al. [8] find an almost-optimal labelling using prefix search decoding, which achieves a slight improvement over best path decoding at the cost of expensive computation and decoding time. Following [33, 18], we adopt best path decoding in inference.

Synthetic data We generate 200,000 synthetic license plates as our SynthDataset using computer graphics scripts, following the fonts, composition rules and colors, and adding Gaussian noise, motion blurring and affine deformation. The composition rules follow the standard for Chinese license plates; for example, the second character must be a letter and there can be at most two letters (excluding “I” and “O”) among the last five characters, as depicted in Fig. 6. Some synthetic license plates are shown in Fig. 5(a).
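As an illustration of the composition rules only (not the authors' actual scripts, which additionally render fonts, colors, noise and deformation), a plate-number sampler might look like the following; the province list is truncated to three of the 31 abbreviations for brevity.

```python
import random

PROVINCES = ["沪", "苏", "粤"]  # three of the 31 province abbreviations
LETTERS = list("ABCDEFGHJKLMNPQRSTUVWXYZ")  # "I" and "O" excluded
DIGITS = list("0123456789")

def random_plate_number():
    # Composition rules of Fig. 6: a province character, then a letter, then
    # five characters drawn from digits and letters with at most two letters.
    while True:
        tail = [random.choice(LETTERS + DIGITS) for _ in range(5)]
        if sum(c in LETTERS for c in tail) <= 2:
            return random.choice(PROVINCES) + random.choice(LETTERS) + "".join(tail)
```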

GAN training and testing We first train the CycleGAN model on 4,500 synthetic blue license plates from SynthDataset and 4,500 real blue license plates from the training set of Dataset-1, as depicted in Fig. 1. We follow the settings in [44]. The λ in Equation (1) is set to 10 in our experiments. All the images are scaled and cropped to a fixed size and randomly flipped for data augmentation. We use Adam with β1 = 0.5, β2 = 0.999 and a learning rate of 0.0002. We save the model after each epoch and stop training after 30 epochs. When testing, we use the last 20 models to generate 100,000 blue license plates. The same goes for yellow license plates. Finally, we obtain 200,000 license plates that acquire their license numbers from SynthDataset and their realistic style from Dataset-1. When applying the training method of WGAN, the number of discriminator updates per generator update, n_critic, is set to 5, and the parameters of the discriminators are clamped to a fixed range after each gradient update. Note that only the training set of Dataset-1 is used when training the GAN models. For data augmentation, the 200,000 images are randomly cropped to obtain 800,000 images before being fed to the convolutional RNN models.

RA (top-1) CRA
SVM + ANN [20] 68.2 82.5
CNN + LSTMs 96.1 98.9
TABLE III: Comparison of models using different frameworks on Dataset-1. Recognition accuracy (%) and character recognition accuracy (%) are listed.
RA (top-1) CRA CRA-C CRA-NC
Random 0.0 0.8 0.0 0.9
Script 4.4 30.0 20.0 31.7
CycleGAN 34.6 82.8 41.3 89.8
CycleWGAN 61.3 90.6 66.2 94.8
TABLE IV: Comparison of models that use different generated images for supervised learning without real images. The performance is evaluated on test set of Dataset-1. “Random” represents the randomly initialized model. Recognition accuracy (%) and character recognition accuracy (%) are listed. “CRA-C” is the recognition accuracy (%) of Chinese character which is the first character, and “CRA-NC” is the recognition accuracy (%) of letters and digits which are the last six characters.
Training Data (×1,000 images) Methods RA (top-1) CRA top-3 top-5
9 Baseline 86.1 94.9 90.1 92.3
Script 90.2 97.0 94.6 95.6
CycleWGAN 93.6 98.4 96.8 97.4
50 Baseline 93.1 97.9 96.4 97.2
Script 95.2 98.8 97.7 98.1
CycleWGAN 96.3 99.2 98.3 98.8
200 (All) Baseline 96.1 98.9 98.0 98.5
Script 96.7 99.1 98.6 98.8
CycleWGAN 97.6 99.5 98.9 99.2
TABLE V: Comparison of using different volumes of the real data and synthetic data generated by different methods on Dataset-1, i.e., the scripts and the CycleWGAN model. The generated images are used for training first; then the specified proportion of Dataset-1 is fed to the model. The performance is evaluated on the test set of Dataset-1. The CRNN baseline trained on Dataset-1 only is also provided. Recognition accuracy (%), character recognition accuracy (%), top-3 recognition accuracy (%) and top-5 recognition accuracy (%) are listed.
RA (top-1) CRA top-3 top-5
Baseline 94.5 98.4 97.6 98.1
CycleWGAN 96.2 99.2 98.7 99.1
TABLE VI: Comparison with the baseline on Dataset-2. We fine-tune the model on Dataset-2 after pre-training on CycleWGAN images. Recognition accuracy (%), character recognition accuracy (%), top-3 recognition accuracy (%) and top-5 recognition accuracy (%) are shown.
Fig. 6: Standard Chinese license plate. The first character is a Chinese character representing the region. The second must be a letter. The last five characters can be letters (excluding “I” and “O”) or digits, but there should be no more than two letters.

IV-D Evaluation

The convolutional RNN baseline model We train the baseline models on the two datasets directly. As shown in Table V, we achieve a top-1 recognition accuracy of 96.1% and a character recognition accuracy of 98.9% on Dataset-1, without additional data. This baseline already yields high accuracy, especially the top-3 and top-5 accuracies, which are 98.0% and 98.5%. On Dataset-2, the recognition accuracy and character recognition accuracy are 94.5% and 98.4%, as shown in Table VI.

For comparison, and to demonstrate the effectiveness of the CRNN framework on license plate recognition, we also evaluate the performance of EasyPR [20] on the test set of Dataset-1. EasyPR is a popular open-source package that uses support vector machines to locate plates and a single-hidden-layer neural network for character-level recognition; it is a segmentation-based method. As shown in Table III, our approach yields superior accuracy.

Synthetic data only To directly evaluate the effectiveness of the data generated by different methods, we train the models using synthetic data generated by script, CycleGAN and CycleWGAN respectively. Then accuracy evaluations are performed, which are listed in Table IV.

When we only use the SynthDataset generated by the scripts for training, the recognition accuracy on the test set of Dataset-1 is just 4.4%. To rule out the effect of random initialization, we test a model with random initial weights, which produces 0.0% recognition accuracy and 0.8% character recognition accuracy. This means the synthetic images generated by the scripts do help. However, compared with the real license plates in Fig. 3, our script-generated license plates are overly simple.

The CycleGAN images achieve a recognition accuracy of 34.6%. The improvement from 4.4% to 34.6% demonstrates that the distribution generated by CycleGAN is much closer to the real distribution, i.e., the generator performs a good mapping from synthetic data to real data under the regularization of the discriminator. Some samples are shown in Fig. 5(b). However, we observe that the outputs of the generator tend to converge to a specific style, which indicates some degree of mode collapse [6, 27, 13], the well-known problem in GAN training.

As shown in Fig. 5(c), the CycleWGAN images show more varied styles of texture and color. Some of them are nearly indistinguishable from real images. Although to the naked eye the images' quality may be slightly lower, similar to the observation in [3], CycleWGAN shows an aggressive rendering ability that prevents mode collapse. When we use the CycleWGAN images, we obtain 61.3% RA, a relative improvement of roughly 1300% over the script images. Note that the recognition accuracy on the non-Chinese characters (the last six characters) already reaches 94.8% without using any real data, as letters and digits are much easier than Chinese characters.

The results are consistent with the appearance of the images in Fig. 5. We conclude that the trained model is more accurate if the synthetic data are more realistic.

Curriculum learning To further explore the significance of these synthetic images, we then fine-tune the pre-trained models on Dataset-1. Thus, we follow a curriculum learning strategy to train the system with gradually more complex training data: in the first step, we use the large number of synthetic images, whose appearance is simpler than that of real images, and a large lexicon of license plate numbers; at the same time, the first step finds a good initialization of the appearance model for the second step, which uses real images.

As shown in Table V, with the help of CycleWGAN images, we obtain an impressive improvement of 1.5 pp over such a strong baseline.

To provide more general insight into the ability of GANs to generate synthetic training data, we adopt a cross-dataset strategy in which the CycleWGAN is trained on data from Dataset-1 and produces images that are used to train a model for Dataset-2. The model pre-trained on CycleWGAN images is then fine-tuned on Dataset-2 with the same procedure. We observe an improvement of 1.7 pp (from 94.5% to 96.2%) in Table VI. The experimental results on Dataset-1 and Dataset-2 demonstrate the effectiveness: even over a strong baseline, these CycleWGAN images yield remarkable improvements.

Smaller datasets, more significant improvements What if we only have a small real dataset? This is a common problem faced by some specialized tasks, especially those related to personal privacy.

We sample a small set of 9,000 ground-truth labeled license plate images from Dataset-1. For the first method, we train our recognition networks directly on these 9,000 license plates. The second method is the pipeline proposed in Section 1: we use this dataset to train a GAN model and generate synthetic license plates, pre-train the recognition model on them, and then fine-tune the pre-trained model on the 9,000 real images.

As shown in Table V, when we add the 800,000 GAN images to the network's training, our method significantly improves LPR performance. We observe an impressive improvement of 7.5 pp (from 86.1% to 93.6%). When using 50,000 real images for the fine-tuning instead of 9,000, we also observe a solid improvement of 3.2 pp (from 93.1% to 96.3%). It is clear that the improvement is more remarkable with a smaller training set of real images. What is more, the model trained on the 9,000-image training set plus GAN images delivers better performance than the model trained on 50,000 real images alone; the same comparison holds between the 50k and 200k rows in Table V. This means we can train a better model on a smaller real training set using this method, validating the effectiveness of the proposed pipeline.

Fig. 7: License plate samples generated by a DCGAN model trained on part of Dataset-1. Through the all-in-one method, they are mixed with real license plates to regularize the CRNN model.
RA (top-1) CRA
Dataset-1-9k 86.1 94.9
Dataset-1-9k + All in one 78.1 92.8
Dataset-1-50k 93.1 97.9
Dataset-1-50k + All in one 92.6 97.7
Dataset-1-200k 96.1 98.9
Dataset-1-200k + All in one 96.0 98.9
TABLE VII: Comparison of mixing different amounts of real images with DCGAN images. For the all-in-one method, 50,000 DCGAN images are mixed with the real images. Recognition accuracy (%) and character recognition accuracy (%) are shown.

Unlabeled DCGAN images To compare the effectiveness of labeled CycleWGAN images and unlabeled images generated by DCGAN [27], we train a DCGAN model to generate unlabeled images, and apply the all-in-one [32, 24, 43] method to them for semi-supervised learning. All-in-one is an approach that treats all generated unlabeled data as one class.

DCGAN provides a stable architecture for training GANs but can only generate unlabeled images. It consists of one generator and one discriminator. We follow the settings in [27]. For the generator, we project and reshape a 100-dim random vector and then apply four fractionally-strided convolutions to generate a license plate image. For the discriminator, a fully-connected layer following four stride-2 convolutions is used to perform a binary classification of whether the image is real or fake. We train the DCGAN model on 46,400 images from Dataset-1, and then use the model to generate 50,000 images by feeding in 100-dim random vectors whose entries are drawn from a fixed range. Some samples are shown in Fig. 7.

For all of our license plates, there are 68 different characters, i.e., 68 classes. We create a new character class, and every character in the generated license plates is assigned to this class. As in [43], these generated images and ground-truth images are mixed and trained on simultaneously. As shown in Table VII, we explore the effect of adding these 50,000 images to Dataset-1-9k, Dataset-1-50k and Dataset-1-200k. The results show that the DCGAN images fail to obtain improvements, which is the opposite of the results in [43]. We believe this is because the bias present in the task of person re-identification does not exist in LPR.

Fig. 8: Random images consisting of a random value at each pixel. They are input to a trained model to produce the conditional distribution, which can be viewed as the knowledge of the model.

IV-E Interpretation

Thus far we have empirically demonstrated the effectiveness of CycleGAN-generated images for recognition. However, from a theoretical perspective, the effect of GAN-generated images on real learning tasks is not clear. Thus, we perform some visualizations to interpret the effectiveness of GAN images.

The knowledge of a model can be viewed as the conditional distribution it produces over outputs given an input [11, 26]. Inspired by this insight, we input 5,000 randomly generated images to the models, then average the outputs of the last fully-connected layer as a confidence map to approximate the output distribution. Some samples of random images are shown in Fig. 8. The confidence map is illustrated in Fig. 2, where y^t_k can be interpreted as the probability of observing label k at time t, as described in Section 3.2.
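A sketch of this probing procedure, assuming a Keras model that maps an image batch to per-time-step label distributions of shape (1, T, 68); the input resolution is a placeholder, since the exact size is not given in the text.

```python
import numpy as np

def model_knowledge_map(model, num_images=5000, height=64, width=192):
    # Feed images with a random value at each pixel to the trained model and
    # average the per-time-step label distributions; the mean approximates the
    # conditional output distribution, i.e. the "knowledge" of the model.
    outputs = []
    for _ in range(num_images):
        noise = np.random.uniform(0.0, 1.0, size=(1, height, width, 3)).astype("float32")
        outputs.append(model.predict(noise, verbose=0))  # shape (1, T, 68)
    return np.concatenate(outputs, axis=0).mean(axis=0)  # shape (T, 68)
```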

Here, we mainly answer this question: why do these labeled GAN images yield further improvements? First, the model trained on script-generated images only is visualized, as shown in Fig. 11(a). We observe that the bottom of the first column is blue while the top is green, which means that it is more probable to observe Chinese characters in the position of a license plate's first character. On the contrary, the remaining positions tend to be digits and letters.

When we examine the model trained on CycleWGAN images only, the composition rules of license plates are clearer, as illustrated in Fig. 11(b). Besides, the columns next to the first column show probabilities concentrated in the middle positions, which indicates that the second character tends to be a letter. These observations match the ground-truth rules explicitly. The above results indicate that the knowledge of composition rules hard-coded in the scripts is learned by the model and helps in sequence labelling.

Once the model pre-trained on CycleWGAN images is fine-tuned on real images instead of script images, these general rules remain clear. However, the model trained directly on real images learns non-general features, e.g., a very high probability of “9” caused by the many “9”s in the training set, as depicted in Fig. 11(d). These non-general features cause over-fitting. Besides, we examined the images that account for the improvements: about 35% of them involve the first character (the Chinese character), and about one quarter involve confusing pairs like “0” & “D”, “8” & “B” and “5” & “S”. This means that the CycleWGAN images help extract discriminative features of license plates.

In a word, the knowledge and priors hard-coded in the scripts do help; the GAN images soften this knowledge and bring the output distribution closer to the ground-truth distribution, which leads to further improvements. This is consistent with the experimental results in Section 4.4. When the training data are large, the knowledge in the additional GAN images helps less, because the real training set is sufficiently large to carry most of this knowledge.

IV-F Model Compression

To further meet the requirements of inference on mobile and embedded devices, we replace the standard convolutions with depth-wise separable convolutions [34] and adopt the width multiplier hyper-parameter proposed in [12]. A depth-wise separable convolution contains a depth-wise convolution and a 1×1 (point-wise) convolution. In the depth-wise convolution, a single filter is applied to each input channel separately. Then a point-wise convolution, i.e., a 1×1 convolution that combines the different channels, is applied to create new features. This factorized convolution efficiently reduces the model size and computation.
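A minimal Keras sketch of one such factorized block, with the width multiplier applied to the output depth; the kernel size and the placement of batch normalization and ReLU are assumptions, since the tables omit the exact kernel sizes.

```python
import tensorflow as tf

def depthwise_separable_block(filters, kernel_size=3, stride=1, alpha=1.0):
    # A depthwise convolution filters each input channel separately, then a
    # 1x1 pointwise convolution mixes channels into int(alpha * filters) new
    # features; alpha is the width multiplier.
    return tf.keras.Sequential([
        tf.keras.layers.DepthwiseConv2D(kernel_size, strides=stride,
                                        padding="same", use_bias=False),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Conv2D(int(alpha * filters), 1, use_bias=False),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
    ])
```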

With stride one and padding, a standard convolution filtering an input feature map F outputs a feature map O of the same spatial size. Here, H and W represent the height and width of the feature map, and M and N are the depths of the input and output feature maps. With the convolution kernel K of size D_K × D_K × M × N, where D_K denotes the spatial dimension of the kernel, we can compute the output feature map as:

O_{k,l,n} = \sum_{i,j,m} K_{i,j,m,n} \cdot F_{k+i-1,\, l+j-1,\, m} \qquad (13)

During this forward computation, the computation cost is defined as below:

D_K \cdot D_K \cdot M \cdot N \cdot H \cdot W \qquad (14)

As for the depth-wise convolution, the temporary output feature map is computed as:

\hat{O}_{k,l,m} = \sum_{i,j} \hat{K}_{i,j,m} \cdot F_{k+i-1,\, l+j-1,\, m} \qquad (15)

Then we perform a 1×1 convolution to produce the final output feature map of this depth-wise separable convolution. Overall, the computation cost of the depth-wise separable convolution is:

D_K \cdot D_K \cdot M \cdot H \cdot W + M \cdot N \cdot H \cdot W \qquad (16)

To explore the trade-off between accuracy and inference speed, we introduce a hyper-parameter α called the width multiplier. At each layer, we multiply both the input depth and the output depth by α. Then the computation cost of the depth-wise separable convolution becomes:

D_K \cdot D_K \cdot \alpha M \cdot H \cdot W + \alpha M \cdot \alpha N \cdot H \cdot W \qquad (17)

Dividing (16) by (14), we get the ratio of computation costs:

\frac{D_K \cdot D_K \cdot M \cdot H \cdot W + M \cdot N \cdot H \cdot W}{D_K \cdot D_K \cdot M \cdot N \cdot H \cdot W} = \frac{1}{N} + \frac{1}{D_K^2} \qquad (18)

If we use the default width multiplier of 1.0 and depth-wise separable convolutions, the computation cost becomes about 9 times less than that of the original standard convolution when the output depth is large.
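As a quick numeric check of Eq. (18), assuming 3×3 kernels (the kernel sizes are not spelled out in the tables) and a 512-channel output as in the deeper layers:

```python
def separable_cost_ratio(d_k, n):
    # Eq. (18): (depthwise-separable cost) / (standard convolution cost),
    # which simplifies to 1/N + 1/D_K^2.
    return 1.0 / n + 1.0 / (d_k * d_k)

# With D_K = 3 and N = 512 the ratio is about 0.113, i.e. roughly 9x fewer
# multiply-accumulates than the standard convolution.
print(separable_cost_ratio(d_k=3, n=512))
```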

Based on this efficient depth-wise separable convolution and the existing CRNN framework, we construct a lightweight convolutional recurrent neural network, denoted “LightCRNN”. Note that only the convolutions are modified, while the parameters and computation cost of the LSTMs and the fully-connected layer are unchanged. The entire architecture is given in Table VIII.

Layer Type Configurations
Bidirectional-LSTM #hidden units: 256
Bidirectional-LSTM #hidden units: 256
Conv #filters:512, k:, s:1, p:0
Depthwise Conv k:, s:1, p:0
Conv #filters:512, k:, s:1, p:0
Depthwise Conv k:, s:1, p:0
MaxPooling s:, p:
Conv #filters:512, k:, s:1, p:0
Depthwise Conv k:, s:1, p:1
MaxPooling s:, p:
Conv #filters:256, k:, s:1, p:0
Depthwise Conv k:, s:2, p:1
Conv #filters:128, k:, s:1, p:0
Depthwise Conv k:, s:2, p:1
Conv #filters:64, k:, s:1, p:1
Input RGB images
TABLE VIII: Configuration of the LightCRNN model. “k”, “s” and “p” represent kernel size, stride and padding size. Each convolution is followed by batch normalization and ReLU nonlinearities.
Model RA (top-1) Size (MB) Speed (FPS)
LightCRNN 96.5 40.3 13.9
1.2× LightCRNN 97.0 44.2 11.5
1.2× LightCRNN + CycleWGAN images 98.6 44.2 11.5
TABLE IX: Comparison of different models. All are trained on Dataset-1 and tested on its test set. Recognition accuracy (%), model size (MB) and inference speed (FPS) are listed. The proposed architecture significantly decreases the model size and speeds up inference.

We evaluate the proposed LightCRNN on Dataset-1 without a GPU. The following experiments are carried out on a workstation with a 2.40 GHz Intel(R) Xeon(R) E5-2620 CPU and 64 GB RAM. The LightCRNN models are trained and tested using TensorFlow [1]. Only a single core is used when we perform forward inference. As shown in Table IX, the LightCRNN not only decreases the model size from 71.4 MB to 40.3 MB and speeds up inference from 7.2 FPS to 13.9 FPS, but also increases the recognition accuracy, probably because the LightCRNN reduces over-fitting. To adjust the trade-off between latency and accuracy, we apply a width multiplier of 1.2. This 1.2× LightCRNN brings another improvement of 0.5 pp. We then apply the proposed pipeline, which leverages the CycleWGAN image data and yields a further improvement of 1.6 pp.

Thus, the combination of the proposed pipeline and architecture brings us closer to accurate recognition on mobile and embedded devices.

Fig. 9: Some samples of Dataset-3. The videos are shot while both the source and the target objects are moving.
Fig. 10: Some samples of CycleWGAN images for moving LPR. The CycleWGAN models are trained using script images and the training set of Dataset-3.
RA (top-1) CRA
Baseline 89.4 97.6
CycleWGAN + 92.1 98.0
TABLE X: Results on the test set of Dataset-3. Fine-tuning the model pre-trained on CycleWGAN images improves over the baseline. Recognition accuracy (%) and character recognition accuracy (%) are listed.

IV-G Moving LPR

We further validate our pipeline on more complicated data. We shoot from a moving car and capture the license plates of other vehicles. Both the viewpoint and the target object are moving fast and irregularly. This data is much harder than normal monitoring data, which is collected at a fixed location and angle. The LPR task thus becomes more challenging; we denote it the task of moving LPR, which has wide applications in driving assistance, surveillance, mobile robotics, etc.

Moving LPR has two main bottlenecks. First, this harder task needs more examples, and large-scale data are expensive to collect, as one cannot simply export data from monitoring cameras but has to shoot videos in various regions to meet the requirement of diversity. Second, the models need to be more efficient when the systems are deployed in vehicles or drones. Our proposed pipeline eases the first bottleneck, and the proposed LightCRNN addresses the second.

These images are collected in Suzhou, a typical Chinese city. They cover various situations, such as night, highways, crossroads and so on. Some samples are shown in Fig. 9. After the images are annotated and cropped, we obtain 22,026 license plates, which form Dataset-3. 2,000 of them are split off as the test set of Dataset-3.

Considering the trade-off between accuracy and latency, we adopt the 1.2× LightCRNN in the following experiments. We first train a model using the training set of Dataset-3 and evaluate it on the corresponding test set, obtaining a recognition accuracy of 89.4%. Then the proposed pipeline is applied. First, we train the CycleWGAN using script images and the training set of Dataset-3, each of about 20,000 images. Then we feed the script images to the trained generators and obtain 800,000 synthetic CycleWGAN images. Some samples are shown in Fig. 10. The recognition model is pre-trained on the CycleWGAN images and fine-tuned on the training set of Dataset-3. Thus, we obtain an improvement of 2.7 pp, as listed in Table X. This indicates that our method works well in even harder LPR scenarios.

V Conclusion

In this paper, we integrate GAN images into supervised learning, specifically license plate recognition. Using our method, large-scale realistic license plates with arbitrary license numbers are generated by a CycleGAN model equipped with WGAN techniques. The convolutional RNN baseline network is trained on GAN images first, and then on real images, in a curriculum learning manner.

Experimental results demonstrate that the proposed pipeline brings significant improvements on different license plate datasets. The significance of GAN-generated data is magnified when real annotated data are limited. Furthermore, the question of why and when these GAN images help in supervised learning is addressed using clear visualizations. We also propose a lightweight convolutional RNN model using depth-wise separable convolutions, to perform fast and efficient inference.

Acknowledgment

The work was supported by the Shanghai Natural Science Foundation (No. 17ZR1431500).

References