End-to-End License Plate Recognition Pipeline for Real-time Low Resource Video Based Applications

by Alif Ashrafee, et al.

Automatic License Plate Recognition systems aim to provide an end-to-end solution for detecting, localizing, and recognizing license plate characters from vehicles appearing in video frames. However, deploying such systems in the real world requires real-time performance in low-resource environments. In our paper, we propose a novel two-stage detection pipeline paired with Vision API that aims to provide real-time inference speed along with consistently accurate detection and recognition performance. We used a haar-cascade classifier as a filter on top of our backbone MobileNet SSDv2 detection model. This reduces inference time by only focusing on high-confidence detections and using them for recognition. We also impose a temporal frame separation strategy to identify multiple vehicle license plates in the same clip. Furthermore, since there are no publicly available Bangla license plate datasets, we created an image dataset and a video dataset containing license plates in the wild. We trained our models on the image dataset, achieved an AP(0.5) score of 86, and observed reasonable detection and recognition performance (82.7% detection rate and 60.8% F1 score).




1 Introduction

In recent years there have been many advancements in developing Automatic License Plate Recognition (ALPR) systems, and their necessity is apparent. In Bangladesh, there are over 4.47 million registered vehicles [vehicles]. Yet parking facilities are inadequate, and tracking cars entering or leaving parking lots is done manually and is generally disorganized. Due to the large volume of vehicles in metropolitan areas and the lack of optimized parking systems, it is difficult to find a vacant parking slot. Furthermore, cars have to wait in long queues while they are manually logged one by one, a process that can be very time-consuming for all involved parties. Hence, the demand for automated parking management systems has skyrocketed. In a parking system, by registering the license plate number of a car as it enters and exits, we can automatically compute a parking fee from the time interval. ALPR systems can also automate numerous other real-life application scenarios. They can track guest vehicles in a private parking area; a similar tracking process can be used by traffic police to identify vehicles parked in no-parking areas or to track vehicles involved in criminal activities. Toll collection and gas station management systems can be automated using similar methods. Examples of Bangla license plates can be seen in Figure 1.

Figure 1: Example of Bangladeshi License Plates. Source: [plates]

Given the large variety of application scenarios, ALPR systems have to be robust in detecting license plates under various conditions (unclear characters, plate variations, occlusions, illumination changes) and accurately recognize the characters, all while fulfilling the key criterion of real-time inference speed. Furthermore, if there are temporally separate instances of different vehicles in the same video, the system has to correctly differentiate between them and store them separately. In such a system, each frame of a video needs to be processed to detect license plates, but this comes at the cost of slower inference speed. Our proposed method contains a two-stage detection module. The first stage is a haar-cascade classifier [soo2014object] that works as a wake-up mechanism to quickly discard frames with no license plates. The second stage is a MobileNet SSDv2 [chiu2020mobilenet] detection model. The detection module stores only the 3 best cropped images for a single vehicle. Our pipeline also takes into account any interval between the appearance of two vehicles so it can identify and separately store the result of individual vehicles. Even with all these features, our proposed pipeline achieves real-time inference speed with satisfactory detection and recognition performance.

In summary, our contribution includes the creation of two datasets and their corresponding annotations: a Bangla license plate image dataset for training purposes, and a video dataset that includes license plates in the wild for evaluating our proposed architecture. To our knowledge, this is the first annotated dataset of this type. We introduce a pipeline that includes a novel two-stage detection module with the following features:

  • Wake-up mechanism for reducing inference time

  • Best frame selection strategy

  • Temporally separate vehicle instance identification

This method is highly optimized for low resource server-side run-time environments, thus providing a versatile solution to ALPR system deployment in various applications.

2 Related Works

2.1 Real Time Object Detection

Object detection has been a core concept and a challenging problem in Computer Vision since its early days. Paul Viola et al. [viola2001rapid] introduced cascade classifiers trained using AdaBoost to detect faces in real-time. With the breakthrough of deep learning in recent years, many advancements have been made in this field. Many object detection networks [redmon2016you, ren2015faster, Howard2017MobileNetsEC] use convolutional neural networks to detect objects with performance close to human accuracy. Fast YOLO [shafiee2017fast] produces an optimized architecture evolving the YoloV2 network in order to provide accurate detections in real-time. MobileNet SSDv2 [chiu2020mobilenet] reduces the computational complexity of object detection using a lightweight architecture based on MobileNetV2 [mobilenetv2], providing real-time object detection on embedded devices. Object detection in videos is even more challenging, since each frame has to be processed, which increases processing time. Shengyu Lu et al. [lu2019real] achieved real-time performance in video by using image preprocessing to eliminate the influence of the image background, with Fast YOLO as their object detection model. Han et al. [han2016seq] use temporal information to improve weaker detections in the same clip by using higher-scoring detections in nearby frames, which helps to detect objects consistently across frames. For license plate detection, Montazolli et al. [8097294] use a cascaded Fast YOLO architecture to detect larger regions around the license plate and then extract the license plate from those patches, achieving high precision and recall on a Brazilian license plate dataset. Hsu et al. [8078493] used a modified YoloV2 model for in-the-wild license plate detection, attaining a high FPS rate on a high-end GPU. Xie et al. [article] take rotated license plates into consideration and proposed rotation angle prediction for multi-directional license plate detection.

Figure 2: Proposed Methodology in 4 stages

2.2 License Plate Recognition

For an end-to-end automatic license plate recognition system, the characters in the license plate need to be extracted and predicted accurately, which is vital in many applications. Montazolli et al. [8097294] proposed a YOLO-based model to recognize the characters within a cropped license plate region; however, the recognition performance was poor, with less than 65% of license plates correctly recognized. Li et al. [LI201814] performed character recognition without applying any segmentation. They modeled it as a sequence labeling problem where sequential features were extracted using a CNN in a sliding-window manner. To label the sequence, they used bi-directional recurrent neural networks (BRNNs) with Long Short-Term Memory (LSTM), and finally a Connectionist Temporal Classification (CTC) network for sequence decoding. Zhuang et al. [zhuang2018towards] proposed semantic segmentation with a character counting method to perform license plate character recognition. They used a simplified DeepLabV2 network for semantic segmentation, performed Connected Component Analysis (CCA) for character generation, and used Inception-v3 and AlexNet for character classification and counting. License plates can become blurred due to the fast motion of vehicles or when the extracted license plate patch is too small and needs to be enlarged; many papers tackle this issue to boost recognition performance. Lu et al. [lu2016robust] deblur the license plate using a novel scheme based on sparse representation to identify the blur kernel. Svoboda et al. [svoboda2016cnn] implemented a text-deblurring CNN to reconstruct blurred license plates. For Bangla license plates, character recognition is an even more complex issue due to the language's conjunct consonants and grapheme roots [chatterjee2019bengali]. Roy et al. [roy2016license] proposed a boundary-based contour algorithm to detect the license plate in the vehicle region. Since Bangla license plates contain information in two rows, they used horizontal projection with thresholding to separate the rows and vertical projection with thresholding to separate the letters and numbers; finally, they applied template matching to recognize the characters. Dhar et al. [dhar2018system] introduced CNNs to extract features for character recognition. They also verified the shape of the license plate using the distance-to-borders (DtBs) method and corrected the horizontal tilt of plates using extrema points. Saif et al. [saif2019automatic] leverage the advantage of YOLO in that it can detect and localize objects simultaneously in an image using a single CNN. They fed the cropped license plate images after detection to a YOLOv3 model for segmenting and recognizing the characters and numbers. Their results, however, do not indicate real-time performance, as they achieved only 9 frames per second on a Tesla K80 GPU while detecting and recognizing license plates in a video.

3 Machine Learning Life Cycle

There are currently no publicly available Bangla license plate datasets. Furthermore, to our knowledge, there are no active Bangla ALPR systems that perform real-time processing on video footage. We therefore had to work on all portions of the Machine Learning Life Cycle: Data Collection and Preparation, Machine Learning Modeling, and Deployment and Interaction through a graphical user interface.

3.1 Data Collection and Preparation

To train all our models for license plate detection we created our own Bangla license plate image dataset, and to evaluate the performance of our model in real-time we collected, annotated, and prepared a video dataset containing license plates in the wild.

3.1.1 License Plate Image Dataset

We created our own dataset by taking pictures of cars including their license plate from different distances and angles around Bangladesh and manually annotated them by drawing bounding boxes around the license plate regions. In total, we annotated 1000 images and split the whole dataset into an 80-20 percent train-test split.

3.1.2 License Plate Video Dataset

We collected 79 video clips containing 98 license plates using crowd-sourcing, where each clip includes single or multiple cars from around Bangladesh. We annotated the content of each license plate and the number of license plates present in each video to evaluate our entire model performance and speed from detection to recognition.

Figure 3: End-to-End Pipeline for Interaction

3.2 Machine Learning Modeling

We mainly focus on achieving real-time inference for each frame while also retaining satisfactory performance. As mentioned, feeding each frame to a large model is a drawback for our case because larger models have slower inference speeds. To tackle this issue, we first trained a lightweight haar-cascade classifier using video footage. Haar-cascade classifiers are trained using positive and negative examples [opencv]. We extracted frames from the footage and labeled the frames accordingly for training. Frames with a license plate in them are labeled positive and frames containing only backgrounds are considered negative samples. A cascade classifier consists of multiple stages, where each stage is an ensemble of weak classifiers. Each stage is trained using boosting techniques that provide the ability to train a highly accurate classifier by taking a weighted average of the weak classifier results [viola2001rapid]. Cascade classifiers divide the image into sub-windows. In the early stages, the weak classifiers are used to discard the negative sub-windows as fast as possible and detect all the positive instances to achieve low false-positive rates [matlab]. Since the vast majority of windows do not contain license plates, this provides a very fast solution to the problem of inference time for each frame. Conversely, true positives are rare and worth taking the time to verify.

For the second stage of our architecture, we use a more robust object detection model MobileNet SSDv2, which is a single shot detector. To train this model, we fed our custom dataset of license plate images along with their corresponding annotations to it. In the overall pipeline, the cascade classifier acts as a wakeup mechanism which is the primary filter for determining the presence of a license plate and if any license plate is detected by the cascade in any frame, that frame is then fed to the MobileNet SSDv2 module which acts as a second filter. MobileNet can draw more accurate bounding boxes [deepa2019comparison] which we then use to localize the position of the license plate.
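The control flow of this two-stage wake-up mechanism can be sketched as follows. This is a minimal illustration rather than the authors' code: `cascade_detect` and `ssd_detect` are hypothetical stand-ins for the haar-cascade and MobileNet SSDv2 detectors (in practice, e.g., OpenCV's `CascadeClassifier.detectMultiScale` and a `cv2.dnn` forward pass).

```python
# Sketch of the two-stage "wake-up" detection loop: run the cheap
# cascade on every frame, and run the expensive SSD only on frames
# the cascade flags as containing a plate.

def two_stage_detect(frames, cascade_detect, ssd_detect):
    """Return (frame_index, boxes) pairs for frames where both stages fire."""
    detections = []
    for idx, frame in enumerate(frames):
        if not cascade_detect(frame):   # stage 1: fast reject
            continue                    # most frames exit here cheaply
        boxes = ssd_detect(frame)       # stage 2: accurate localization
        if boxes:
            detections.append((idx, boxes))
    return detections

if __name__ == "__main__":
    # Toy stand-ins for the two detectors, for illustration only.
    frames = ["bg", "bg", "plate", "blurry-plate", "bg"]
    cascade = lambda f: "plate" in f
    ssd = lambda f: [(10, 20, 60, 30)] if f == "plate" else []
    print(two_stage_detect(frames, cascade, ssd))  # [(2, [(10, 20, 60, 30)])]
```

Only frames passing the cheap first stage ever reach the slower model, which is where the inference-time savings come from.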

Our third stage handles multiple instances of license plates in the same video. We store each cropped instance of a detected license plate along with its corresponding confidence value in a dictionary. Since one vehicle appears in multiple consecutive frames, storing every frame would leave many redundant frames per vehicle, which is not preferable. We assume that whenever there is an interval of one second in which no license plate is found, the previous car has passed. We then look into the dictionary and, based on the highest confidence values, store the best 3 cropped instances of the license plate. We follow this process for each subsequent vehicle in the same video.
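The grouping logic of this stage can be sketched in plain Python, assuming a 24 FPS video so that a one-second interval corresponds to a 24-frame gap. The names and data layout here are illustrative, not the authors' implementation:

```python
# Split detections into per-vehicle groups whenever no plate is seen
# for `gap` consecutive frames (one second at 24 FPS), then keep the
# `keep` highest-confidence crops for each vehicle.

def group_best_crops(detections, gap=24, keep=3):
    """detections: list of (frame_idx, confidence, crop), sorted by frame."""
    vehicles, current, last_idx = [], [], None
    for idx, conf, crop in detections:
        if last_idx is not None and idx - last_idx > gap:
            vehicles.append(current)    # the previous car has passed
            current = []
        current.append((conf, crop))
        last_idx = idx
    if current:
        vehicles.append(current)
    # For each vehicle, keep the `keep` best crops by confidence.
    return [[c for _, c in sorted(v, reverse=True)[:keep]] for v in vehicles]
```

For example, detections at frames 0-3 followed by one at frame 60 are split into two vehicles, with the three highest-confidence crops retained for the first.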

For the final stage, the detected license plates are retrieved from the storage directory, enlarged, and passed to the Vision API [vision] recognition module that returns the content of the license plate in string format.

Pipeline Precision (%) Recall (%) F1 Score (%) Detection Rate (%) FPS
YoloV3 Tiny 60.2 55.6 57.1 75.5 11.1
Cascade + YoloV3 Tiny 56.3 51.6 53.1 69.4 27.1
SSDv2 66.1 58.4 60.5 98.0 17.7
Cascade + SSDv2 63.6 59.3 60.8 82.7 27.2
Table 1: Comparative result analysis of different pipelines

We can summarize our proposed architecture in the following four stages, as shown in figure 2.

  • Stage 1: We trained a cascade object detection model using positive and negative frames from video footage.

  • Stage 2: Secondly, we use MobileNet SSDv2, a single-shot object detection module. We feed our custom dataset of license plate images, along with their corresponding annotations, to the module in order to train it.

  • Stage 3: We feed our custom video dataset to the system, where a frame containing a license plate is first detected by the cascade detector and only then passed to our MobileNet module. The cascade detector acts as a wake-up mechanism for the MobileNet in order to increase efficiency. In the case of multiple cars, we accept that two license plates are different only when there is a gap of 24 or more frames between the two detected license plates.

  • Stage 4: The detected license plate is then cropped out, enlarged, and passed to the Vision API recognition module that returns the content of the license plate in string form.

3.3 Deployment and Interaction

We developed the whole system with user interaction and ease of use in mind. Since we aim to make ALPR systems more accessible, we built the system as a web app using Flask [grinberg2018flask] framework so that it can be easily deployed. The end-to-end pipeline of our system architecture has been demonstrated in figure 3.

On the left-hand side of the figure, we can see the interactive GUI components of the system. It starts with an interface where the user can upload video footage. Upon receiving a video as a request, the Flask framework sends it to the backend, shown on the right-hand side of the figure. On the backend, each frame is first processed by the cascade classifier, which separates out only the frames containing license plates and feeds them forward to the backbone detection model, MobileNet SSDv2, which then detects and draws bounding boxes around the license plates. All of this processing is done in real-time, and visual feedback of the detections is shown to the user on-screen. For each temporally separate vehicle appearing in the video, MobileNet SSDv2 stores the 3 best cropped license plate instances in the output directory. On the front-end side, the user can now view the detected plates. While fetching the stored license plates, Flask also sends a request to the Vision API to fetch the OCR result of each plate. Finally, the cropped license plates along with their corresponding OCR outputs are shown to the user. We also provide users the option of saving the OCR outputs in a database or deleting a particular result.

4 Experimental Analysis

Figure 4: License plates that are missed by Cascade + SSDv2 but are successfully detected by standalone SSDv2
Figure 5: OCR predictions from plates detected by standalone SSDv2 but missed by Cascade + SSDv2

4.1 Experimental Setup

We trained the MobileNet SSDv2 [chiu2020mobilenet] model using a Tesla T4 GPU on Google Colaboratory. We set the batch size to 24, image size to 300*300, intersection over union (IOU) threshold to 0.35, a score threshold, learning rate to 0.004, momentum to 0.9, and decay to 0.9, with the RMSProp optimizer. We ran the training for 50 epochs.

For testing, all experiments were conducted on a single thread of an Intel Core i5-7200U mobile CPU with 8GB of RAM. Each test video was resized to 480*480 pixels to keep inference times consistent. The YoloV3 tiny model takes a blob of size 320*320 as input, with a scale factor of 0.00392 to scale the values, a mean of (0, 0, 0) subtracted from the values, and swapRB set to True to swap the red and blue channels. Non-max suppression (NMS) is applied to the bounding boxes detected from the blob with a score threshold of 0.1 and an NMS threshold of 0.4, also known as the IoU threshold. Finally, another thresholding step is performed so that only frames with confidence higher than 0.1 are stored in the dictionary for a particular vehicle. Among all the frames stored in the dictionary for a single vehicle, the 3 frames with the highest confidence values are saved to the output directory.
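The NMS step described above can be sketched in plain Python with the stated thresholds (score 0.1, IoU 0.4). In practice this would be done with `cv2.dnn.NMSBoxes`; the version below is a simplified greedy variant for illustration only.

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h, ...) boxes."""
    ix = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def nms(boxes, score_thr=0.1, iou_thr=0.4):
    """Greedy NMS over (x, y, w, h, score) boxes: discard low-score
    boxes, then keep each remaining box unless it overlaps a kept one
    by at least iou_thr."""
    boxes = sorted((b for b in boxes if b[4] >= score_thr),
                   key=lambda b: b[4], reverse=True)
    kept = []
    for b in boxes:
        if all(iou(b, k) < iou_thr for k in kept):
            kept.append(b)
    return kept
```

Two nearly coincident boxes collapse to the higher-scoring one, while distant detections survive independently.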

In the case of MobileNet SSDv2, a blob of 300*300 is extracted from each frame to feed into the network. Here, we only set the swapRB parameter to True. From the acquired predictions, we only take boxes with confidence values higher than 0.5 to be stored in the dictionary. Finally, the best 3 predictions are saved as output based on their confidence values.

For the cascade classifier, we give the entire 300*300-pixel frame as input, with a scale factor of 1.1, which specifies how much the image size is reduced at each image scale; min neighbors set to 10, which determines how many neighbors each candidate rectangle must have to be retained; and min size set to 45*45, the lowest allowed size for a detection.
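To give a feel for how the scale factor and minimum size interact, the sketch below counts the pyramid levels a detector would evaluate if the detection window grows by the scale factor from the minimum size until it exceeds the frame. This is an approximation for illustration, not OpenCV's exact implementation:

```python
def num_scales(frame=300, min_size=45, scale_factor=1.1):
    """Count detection-window sizes from min_size up to the frame size,
    growing by scale_factor at each level."""
    n, size = 0, min_size
    while size <= frame:
        n += 1
        size *= scale_factor
    return n

print(num_scales())  # 20 levels for a 300px frame, 45px window, factor 1.1
```

A larger scale factor means fewer levels and a faster but coarser scan, which is part of why the cascade stage is so cheap per frame.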

4.2 Ablation Study

We conducted 4 experiments to determine which pipeline works best for our task. The pipeline configurations and their corresponding results are outlined in Table 1. Finally, we observe the obtained results and determine how and why a particular component leads to that particular result.


YoloV3 is an object detection framework that produces state-of-the-art detection results on the COCO dataset [lin2014microsoft]. However, since our goal is to provide a real-time solution with minimal hardware resources, we used the more lightweight YoloV3 tiny model, which decreases the depth of the convolutional layers, increasing the running speed significantly compared to the original YoloV3 network [adarsh2020yolo]. This works as our backbone detection model, and we consider it our baseline.

For the second pipeline, we used the same YoloV3 tiny model as our backbone detection model but with an additional cascade classifier as a wake-up mechanism, because YoloV3 tiny alone has a much slower FPS rate, incapable of real-time inference. Using the cascade classifier as a filter helps to quickly discard frames with no license plates. From Table 1 we can observe that this approach greatly reduces computational complexity and achieves a very high FPS (27.1), more than a 100% gain over the baseline FPS (11.1). Although this modification achieves real-time performance, it still does not have a satisfactory detection rate or OCR accuracy.

Figure 6: Unreadable license plate missed by Cascade + SSDv2 but detected by standalone SSDv2
Figure 7: OCR predictions from plates detected by standalone SSDv2 that are not favorable for Vision API

Aiming for greater detection and recognition performance, for our third pipeline we used a trained MobileNet SSDv2 as the main backbone detection model instead of YoloV3 tiny. Since there is a trade-off between performance and speed in detection models, MobileNet is an optimal choice for the backbone, as discussed in [deepa2019comparison, adarsh2020yolo]. We also removed the cascade component, since our experiments showed that it fails to identify license plates that are far away in the videos; the standalone MobileNet, in contrast, can detect distant plates easily. Consequently, from Table 1 we can observe a drastic improvement in the detection rate (98.0%). Nevertheless, in contrast to the detection rate, the F1 score (60.5%) does not improve as substantially compared to the baseline (57.1%) and the second pipeline (53.1%). This is because standalone MobileNet SSDv2 successfully detects many plates that are small and far away, but Vision API fails to make accurate predictions for such Bangla license plates. This experiment shows that using MobileNet SSDv2 as the backbone component instead of YoloV3 tiny boosts both detection and recognition performance. While the speed of this pipeline (17.7 FPS) is somewhat better than the baseline (11.1 FPS), real-time performance is still hindered: the average FPS drops by almost 10 FPS compared to the second pipeline (27.1), since without the cascade component MobileNet must process each and every frame, which is much slower. Therefore, this pipeline is not adequate for real-time inference scenarios, as we consider real-time videos to have at least 24 FPS.

For our final pipeline, we kept MobileNet SSDv2 as the backbone detection model and added the cascade component on top of it, as in the second pipeline. This yields a much higher FPS rate (27.2) even with the MobileNet SSDv2 backbone, while also attaining a higher F1 score (60.8%) than the standalone MobileNet SSDv2 (60.5%). Even though the detection rate of this pipeline (82.7%) is much lower than that of the third pipeline (98.0%), it still achieves a higher F1 score because the cascade component only feeds forward frames with clearly visible license plates to the MobileNet SSDv2 component, which are in turn more favorable for the Vision API. As a result, despite detecting fewer plates than the third pipeline, this pipeline provides more accurate OCR results than all the other pipelines because of the added layer of consistency that the cascade classifier provides.

4.3 Result Analysis

4.3.1 Training Results

We trained our MobileNet SSDv2 model on the license plate image dataset and obtained the results outlined in Table 2. We achieved an AP(0.5) score of 86.2%.

Model AP AP50 AP75 APs APm APl
MobileNet 0.47 0.86 0.47 0.13 0.46 0.55
Table 2: Training results

4.3.2 Evaluation Metrics

The metrics used to calculate the recognition performance are determined by the Levenshtein distance [levenshtein1965levenshtein] which measures the difference between two strings based on the number of insertions, deletions, and substitutions that have to be performed on the target string in order to match it with the reference string. According to the methods discussed in [karpinski2018metrics], we calculated the corresponding precision, recall, and F1 Score for recognition of each license plate. We applied pre-processing on both the ground truth and our OCR outputs such that they contain only the Bangla character set [alam2020large] and numeric digits, converting the grapheme roots to individual consonants, and discarding any other character so that the Levenshtein distance is not affected by noise.
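A hedged sketch of such Levenshtein-based scoring follows; the exact formulation in [karpinski2018metrics] may differ in detail. Here the number of correct characters is estimated as the reference length minus the edit distance:

```python
def levenshtein(a, b):
    """Edit distance between strings a and b via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def ocr_scores(pred, ref):
    """Precision, recall, and F1 for one OCR prediction vs. a reference."""
    dist = levenshtein(pred, ref)
    correct = max(len(ref) - dist, 0)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

A perfect prediction scores (1.0, 1.0, 1.0); each insertion, deletion, or substitution lowers the estimated number of correct characters.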

Figure 8: OCR predictions from plates detected by standalone SSDv2 that are not favorable for Vision API are shown on the left, whereas detections by Cascade + SSDv2 that produce better OCR predictions are shown on the right

4.3.3 Quantitative Results

We evaluated our baseline and the other proposed pipelines on the license plate video dataset and outlined the results in Table 1. From the ablation study, it can be observed that the SSDv2 and Cascade + SSDv2 pipelines outperform the other pipelines in all aspects. Both pipelines have their respective merits and demerits. The detection accuracy of standalone SSDv2 is superior (98%) at the cost of losing real-time performance, while Cascade + SSDv2 achieves much faster FPS rates (27.2 on average) but with a lower detection rate. The recognition performance of both pipelines using Vision API is similar (about 61% F1 score). The Bangla alphabet is made up of 11 vowels, 7 consonants, and 168 grapheme roots. This makes Bangla a very complex language, with over 13,000 combinations compared to only 250 in English [chatterjee2019bengali]. As a result, license plates often contain a variety of conjunct-consonant characters, which makes character recognition difficult. Furthermore, we resized all videos to 480*480 pixels, which made the extracted license plate patches extremely small. When enlarged for OCR, these patches become blurred, for which Vision API fails to recognize most characters.

4.3.4 Qualitative Results

It is apparent from table 1 that the SSDv2 pipeline has a much higher detection rate than the Cascade + SSDv2 pipeline. In many scenarios, standalone SSDv2 can detect plates that are missed by the Cascade + SSDv2 pipeline since the cascade classifier is less sensitive to plates that are less visible, further away, or tilted. A few of these examples can be seen in figure 4. In a few of these cases, the frames extracted by standalone SSDv2 pipeline also yield highly accurate OCR predictions from Vision API as seen from the examples in figure 5.

However, our experiments show that most of the time, the plates that are missed by Cascade + SSDv2 but detected by the standalone SSDv2 pipeline are not very useful in terms of OCR prediction. This is because SSDv2 has a tendency to detect plates that are further away with high confidence. These outputs are often too small or too blurry and evidently unreadable. Such detections certainly give the SSDv2 pipeline an edge over the Cascade + SSDv2 pipeline in terms of detection rate, but the OCR results are unsatisfactory, as seen in figures 6 and 7. We can also observe that the detections made by the Cascade + SSDv2 pipeline are much more reliable, as the cascade only detects plates that are clearly perceivable, yielding good OCR outputs compared to the detections made by standalone SSDv2. These detections and OCR predictions can be seen in figure 8.

5 Conclusion and Future Works

In this paper, we propose a pipeline that can reliably detect license plates from a video stream. We proposed a method to store only the 3 best-detected instances for a particular vehicle. Our pipeline also includes a method for individually storing temporally separate instances of different vehicles appearing in the same video. As seen from the various experiments, each of the pipelines has its own trade-offs: MobileNet SSDv2 provides a very high detection rate at the cost of slower inference speed, whereas Cascade + SSDv2 has the fastest speed and the highest recognition accuracy among all the pipelines but a much lower detection rate. As future work, we want to deploy our system in various real-life environments and verify its day-to-day performance. We also want to apply preprocessing mechanisms to the detected plates and build our own Bangla OCR system to improve recognition performance beyond what the Vision API currently provides.