
QuickBrowser: A Unified Model to Detect and Read Simple Object in Real-time

Many real-life use cases, such as barcode scanning or billboard reading, require detecting objects and reading their contents. Existing methods commonly localize object regions first, then determine the layout, and finally classify content units. However, for simple, fixed-structure objects like license plates, this approach is overkill and slow to run. This work aims to solve this detect-and-read problem in a lightweight way by integrating multi-digit recognition into a one-stage object detection model. Our unified method not only eliminates the duplication in feature extraction (once for localizing, once again for classifying) but also provides useful contextual information around object regions for classification. Additionally, our choice of backbones and our modifications to the architecture, loss function, data augmentation and training make the method robust, efficient and fast. We also built a public benchmark dataset of diverse real-life 1D barcodes for reliable evaluation, which we collected, annotated and checked carefully. Experimental results demonstrate the method's efficiency on the barcode problem: it outperforms industrial tools in both detection and decoding rates at a real-time frame rate and a VGA-similar resolution. As expected, it also performs well on the license-plate recognition task (on the AOLP dataset), significantly outperforming the current state-of-the-art method in recognition rate and inference time.



I Introduction

Deep convolutional neural networks have recently shown remarkable performance in vision-related tasks such as image classification, object detection, and segmentation. Building on those successes, many applications now use these fundamental advances to tackle practical, useful jobs, although their scope and accuracy are still limited. One typical use case is looking for objects and reading their contents. For example, a person walking down a street browses billboards and panels and reads their contents for a next step, such as calling a retrieved phone number or visiting a website by its URL. As a norm, most machine approaches to this kind of application are step-by-step: an object detector finds the regions of interest (ROI), and another model then reads the object. This approach has proved robust in many problems, especially when the items are highly varied, non-standardized, and have complex structured content, such as clothing labels or shop signs. However, applying this template to simple, standardized objects is inflexible and slow. Two typical examples of simple items are one-dimensional barcodes and license plates.

Barcode scanning in supermarkets and warehouses is daily, time-consuming, and cumbersome work for both customers and staff. Devices that help workers pick tagged packages correctly would be a beneficial application scenario. In this work, we mainly discuss camera-based barcode scanning (as opposed to laser-based solutions). Approaches proposed over the past 20 years have steadily improved in performance and speed; however, they are still limited and have not yet reached real-time speed. Briefly, existing works fall into 2 types: localization-focused and decoding-focused, which are also the 2 sub-tasks of the problem as conventional approaches see it. Localization-focused works try to achieve a better detection rate under various real conditions such as blur, low light, and other kinds of noise. Decoding-focused works, on the other hand, concentrate on correctly translating the localized (or preprocessed) regions from the localization task into encoded sequences.

Both sub-tasks were quite challenging and required handcrafted feature extraction, as in [14, 22, 31], until the surge of deep convolutional neural networks (DNNs). Among DNN-era approaches, [8] uses a CNN-based multi-label classifier to decode all digits of a barcode sequence at once and proved robust even on challenging inputs, whereas [11] focused mainly on detection, adopting a DNN-based object detector to reach a high detection rate that surpasses all previous works [19, 15, 30, 3, 4]. Although recent works have nearly conquered each sub-task, the end-to-end task joining the two sub-tasks remains lengthy and slow because of the pipeline structure and the duplication in feature extraction.

Similarly, license plates in most countries follow a standardized design with a limited maximum number of characters and symbols, and Automatic License Plate Recognition (ALPR) has many useful applications in traffic control and parking management. Conventional ALPR consists of three steps: object detection, segmentation, and optical character recognition (OCR), as in [26, 16]. In 2019, [2] proposed a faster methodology that uses an ensemble of multiple tiny alphanumeric detectors and combines their sub-results into final results. [2] gains speed by skipping segmentation, but it also adds a logical aggregation of those tiny model outputs. Though the tiny detectors work well independently, they produce many noise characters (characters outside the plate). Moreover, the final aggregation takes considerable time to validate characters, dismiss noise, arrange character order, and group sub-results into the final bounding boxes.

Therefore, in this study, we propose a unified model to deal with the end-to-end detect-and-read-content problem for simple structured objects, with the following contributions:

  • Our method, QuickBrowser, successfully integrates multi-digit classification into DNN one-stage object detection, along with a modified architecture, loss function, data augmentation, and training strategy, making it fast, computationally efficient, extensible, and robust even under harsh conditions and with small objects.

  • We extended and upgraded the [6] dataset into a more standard barcode dataset by collecting more samples, annotating, and double-checking. The dataset has 3 subsets: a main EAN-13 single-barcode set with a clear train/validation/test split, an EAN-13 multi-barcode set, and a small EAN-8 set to confirm the model's extensibility. The dataset is public to encourage further research.

  • Lastly, our experimental results on the extended barcode dataset and the AOLP license plate dataset demonstrate the method's efficiency: it outperforms industrial software and existing works in detection rate, recognition rate, and inference time, while our fastest models achieve real-time and near-real-time speed at practical resolutions.

II Related Work

Regarding barcodes, a typical example of the simple objects mentioned above, there are two development flows: localization and decoding. In the first flow, [19] introduced the first multiple-barcode, rotation-tolerant recognition method in 2011. The work uses traditional filters to segment out barcode regions, enhances the stripes, and rotates the regions to horizontal before passing them to a regular decoder. Though the method worked well on plain paper (lottery tickets), it was still slow and reported no results on other materials. In 2012, [15] proposed a method that uses morphological operations to segment out barcode regions under various conditions with good performance. In 2013, [30] focused on dealing with blur, using the structure matrix and saturation from the HSV color system to detect better, but at the cost of speed. More recently, [4] offered a faster method for blurred input based on the Line Segment Detector, following their previous work [3], which used Maximal Stable Extremal Regions and was shown to be sensitive to blur. Lastly, [11] in 2017 was the first to apply DNN object detection and, as expected, achieved the best detection rate on small datasets at that time. [11] also predicted the rotation angle relative to horizontal with another model and fed cropped, rotated barcode patches into a traditional decoder. A common point of most previous works is that they still need a decoder to complete their final goal.

In the decoding flow, works have appeared sparsely since the 1990s. Early works [14, 22] achieved their goal with techniques such as the Hough transform and wavelet-based peak location on simple inputs. In 2008, [31] proposed a scanline-based approach that worked well only on slightly rotated (±15 degrees) barcodes; stronger rotation or distortion was problematic. Similar to [15], [34] also tackled out-of-focus input, using a multi-layer perceptron to find an adaptive threshold that sharpens blurry patches before a regular decoder reads them. Recently, [32] proposed a method based heavily on scanlines and on hand-crafted features and analysis for each challenging condition. Despite its high detection and recognition rates, it is hard to extend to other 1D barcode types. More recently, [8] in 2017 leveraged a CNN to directly extract features and simultaneously predict all digits of a barcode sequence given a localized region. The idea behind this method was originally introduced in [10], so we call it multi-digit classification. Compared with previous methods, the DNN-based approach is more straightforward and data-driven than case-by-case analysis. One notable case, where no single line crosses all the code bars, is problematic for scanline-based approaches but can be learned and overcome by a DNN classifier. [6] followed the same CNN basis but improved it by exploiting the self-validation feature of the barcode and test-time augmentation. As in the first flow, decoders require a locator because they are weak on inputs with diverse backgrounds. This loose coupling has advantages in some cases, but at the cost of speed and decoder dependence. Thus our method, whose application scope focuses more on speed, is tightly coupled by its integrated design.

Another example of simple objects to which we apply our method is vehicle license plate detection and recognition. The topic has attracted numerous studies [1, 28, 13, 2]. Plate detection predicts bounding boxes for the plate regions in an image taken in the wild. Conventional approaches can be grouped as edge-based, color-based, texture-based, and character-based. Edge-based methods [1] highlight the edges of black characters against the plate's white background and of the plate frame against the vehicle body. Similarly, [28] reduces noise after edge-based extraction of the license plate. Despite fast processing, this approach is sensitive to unwanted edges. Color-based approaches emphasize the difference between vehicle body color and plate color; for example, [5] proposed an HSI color model to detect candidate regions, which are later filtered by histogram. Color-based methods can detect inclined and deformed license plates; however, they fail if the vehicle body and plate colors are similar or under illumination changes in outdoor locations. Unlike the edge or color approaches, [13] proposed a texture-based method that extracts dense sets and rectangular shapes similar to a plate shape and then verifies them with a Gaussian mixture model and the Expectation-Maximization algorithm. Features extracted this way are more discriminative, but at the cost of computation and time. Besides, the features are still handcrafted, which means extra work when new plate types and shapes are added. Finally, character-based approaches powered by CNNs detect the objects using the string of characters extracted directly from the image, such as [2], which we mentioned in the Introduction. This last type of approach is promising because of its automatic, sophisticated feature learning and because it removes a separate step for reading plate sequences. Although [2] took the state-of-the-art place on the AOLP dataset in 2019, it has some limitations: maintaining a set of tiny models is inconvenient in deployment; adapting YOLO to small objects with sliding windows is quite time-consuming; and the final aggregation takes a nontrivial amount of time, as summarised in the penultimate part of the Introduction.

From a modern view, simple object localization (i.e., detection) is just a simplified case of generic object detection. DNN-based object detectors have evolved gradually toward higher accuracy and larger scale. There are mainly two types of DNN-based models: one-stage and two-stage. Typical two-stage models are the R-CNN family [9], where detection happens in two stages: first, the model proposes a set of regions of interest by selective search or a region proposal network (bounding box candidates can be very numerous); then the classifier part classifies the region candidates. On the other hand, one-stage models like SSD [21] or YOLO [23] skip the proposal stage and detect directly on a dense sampling of possible locations. This is quicker and simpler, but potentially at the cost of accuracy. Methods proposed or improved in recent years are increasingly powerful and complicated and can be overkill for simple cases. YOLO's approach is straightforward yet effective, fast, and widely adopted; hence, we adopt this head in our architecture.

III Methodology

As mentioned before, a common limitation of most existing approaches is that they break the end-to-end detect-and-read-content task into 2 stages, which slows down the full process and discards valuable features extracted in the first step (localization). Therefore, we focus on solving this end-to-end problem by unifying multi-digit classification (as in [10]) and DNN one-stage object detection. To the best of our knowledge, this is the first work to incorporate multi-digit CNN classification into one-stage object detection for simple objects.

III-A Unified Recognition

Without loss of generality, we opt for the YOLO [23] head due to its simplicity and suitability for the task. In contrast to YOLO versions 2 and 3 [24, 25] or SSD [21], we consider anchor boxes (i.e., box priors) unnecessary: barcodes are omnidirectional (i.e., readers must detect and read them regardless of rotation) and stretchable, so their height-width ratios are not clustered as they are in datasets like COCO [20]. Furthermore, recent models try to handle fully overlapping objects (e.g., by predicting multiple objects per cell) at a higher computational cost, whereas overlapped barcodes are non-decodable and cases like plate-on-plate either do not exist or are deliberate violations of the law.

Similar to YOLO [23], our system models the task as a regression problem. The input is divided into an $S \times S$ grid, and a cell is responsible for detecting an object if that object's center lies inside the cell. Each cell predicts $B$ bounding boxes and confidence scores for those boxes. Specifically, each of the $B$ bounding boxes predicts 5 figures: $x, y, w, h$ (scaled, relative to the cell by offsetting) and a confidence score. However, in classification, instead of predicting only one or a few object classes, we predict in the same way as the multi-digit CNN in [10]. In other words, we read an object's content by viewing it as a single structured sequence and simultaneously predicting all digits up to the maximum (or fixed) sequence length, plus the real length of the sequence (if the length is not fixed). Each digit is predicted through conditional class probabilities (e.g., the 1D barcode EAN-13 has 13 digits, each taking one of the 10 values from 0 to 9, with an additional class if necessary). Furthermore, since objects never fully overlap, we make the grid denser and design each cell to represent only one sequence regardless of the number of boxes $B$.
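To make the prediction layout concrete, the sketch below counts the values one grid cell would emit under one plausible layout: 5 figures per box, one class distribution per digit slot, and an optional length distribution. The function name, the argument names, the box count used in the example, and the layout itself are illustrative assumptions, not the authors' implementation.

```python
def cell_output_size(num_boxes, max_digits, classes_per_digit, length_classes=0):
    """Number of values predicted by one grid cell (assumed layout)."""
    box_values = 5 * num_boxes  # 5 figures per box, as described above
    digit_values = max_digits * classes_per_digit  # one distribution per digit slot
    return box_values + digit_values + length_classes


# EAN-13: 13 digit slots with 10 possible values (0-9) each, fixed length.
# num_boxes=2 is a hypothetical choice for illustration only.
ean13_cell = cell_output_size(num_boxes=2, max_digits=13, classes_per_digit=10)
```

Because each cell represents only one sequence, the digit term does not grow with the number of boxes in this layout.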

Finally, at test time, we take the product of the box confidence and the arithmetic mean of all variables' max probabilities; this final confidence reflects both how well the predicted box locates the object and how confident the model is in predicting the sequence correctly. Boxes with their final confidences are then passed to non-maximum suppression (NMS) with a threshold to remove overlapping boxes that predict the same object. After NMS, sequences that satisfy validity (e.g., a barcode satisfies its checksum, or the predicted plate length agrees with the length variable) are reported, while invalid ones with high confidence are still listed as unreadable objects needing adjustments such as zooming in or holding still.
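The test-time scoring and suppression described above can be sketched as follows. This is a minimal illustration that assumes axis-aligned `(x1, y1, x2, y2)` boxes and a plain greedy NMS; the function names are hypothetical, and the default thresholds simply reuse the 0.2 and 0.3 values reported in the experimental setup.

```python
def final_confidence(box_conf, digit_probs):
    """Box confidence times the arithmetic mean of each digit's max probability."""
    max_probs = [max(p) for p in digit_probs]
    return box_conf * sum(max_probs) / len(max_probs)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, conf_thresh=0.2, iou_thresh=0.3):
    """Greedy NMS: indices of kept boxes, highest final confidence first."""
    order = sorted(
        (i for i, s in enumerate(scores) if s >= conf_thresh),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep
```

Validity checks (checksum, length agreement) would then be applied to the sequences of the kept boxes.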

III-B Network Architecture

Overall, the implemented model has 3 parts: backbone, neck, and head, as shown in Figure 1. We take the feature-extraction parts of ResNet [12] and MobileNetV2 [27] as backbones instead of Darknet19 [24], because many papers report that ResNet backbones (with skip connections and batch normalization) give better accuracy, while MobileNetV2 pays more attention to speed while still maintaining good accuracy.

On top of that, we put a stack of Dilated Bottleneck blocks (proposed in [33]) in a similar way to DetNet [18], which was shown to maintain spatial resolution and enlarge the receptive field; this helps when smaller objects are dominated by bigger ones. This is valuable since object content textures (e.g., barcode bars) are individually small in wild images. Note that this stack slows down the model a bit by increasing its complexity. Finally, a batch normalization followed by an adaptive max pooling and a 2D convolution are attached to the neck to normalize the output and ensure its shape is as required regardless of the input size.
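To illustrate why adaptive pooling fixes the output shape regardless of the input size, here is a simplified one-dimensional sketch in plain Python. It uses a common binning rule (bin `i` spans from `floor(i*n/out)` to `ceil((i+1)*n/out)`); the function name is hypothetical, and the real model applies a 2D layer from a deep learning framework rather than this code.

```python
import math


def adaptive_max_pool_1d(values, out_size):
    """Max-pool a non-empty sequence into exactly out_size bins."""
    n = len(values)
    pooled = []
    for i in range(out_size):
        start = (i * n) // out_size               # floor(i * n / out_size)
        end = math.ceil((i + 1) * n / out_size)   # exclusive end of the bin
        pooled.append(max(values[start:end]))
    return pooled


# Inputs of different lengths are reduced to the same number of outputs.
short = adaptive_max_pool_1d([1, 5, 2, 4, 3, 9], 3)
longer = adaptive_max_pool_1d([1, 2, 3, 4, 5, 6, 7], 3)
```

The bin width adapts to the input length, so the grid fed to the head keeps a fixed size.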

Fig. 1: A general architecture for end-to-end recognition

This type of model not only accelerates the end-to-end process by reusing features extracted by the DNN but also gives better predictions, as it provides the classification part of the model with supportive contextual information around the regions of interest. Intuitively, humans also sometimes have trouble recognizing very small objects that are cropped without the context around them, or objects imaged under non-uniform conditions (non-uniform motion blur or non-uniform cover).

III-C Training


We use the YOLO [23] loss function, referred to as expression (1). Its first 2 lines represent the loss of the bounding box regression part (the differences between ground truth and prediction for the box's center coordinates on the first line and the box's size on the second line), while the third line describes the loss of the object confidence part. However, in the last part, the classification loss by mean squared error, we add a weight on the importance of predicting the content sequence correctly. Experiments showed that this weight is useful in boosting the model's recognition rate; however, we have to set and adjust this hyper-parameter carefully, along with the learning rate decay schedule and the real/synthesized mixing ratio during training, to achieve the best results.
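For reference, a sketch of expression (1), assuming it follows the original YOLO loss [23] with an extra weight on the last term and a sum over digit slots. The symbols $\lambda_{\text{seq}}$ (classification weight), $N$ (number of digit slots), and $d$ (digit index) are placeholder names introduced here; the exact form used in the experiments may differ.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
  \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
  \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left(C_i - \hat{C}_i\right)^2
  + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left(C_i - \hat{C}_i\right)^2 \\
&+ \lambda_{\text{seq}} \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{d=1}^{N} \sum_{k \in \text{classes}}
  \left( p_{i,d}(k) - \hat{p}_{i,d}(k) \right)^2
\end{aligned}
```

Here $\mathbb{1}_{ij}^{\text{obj}}$ selects the box $j$ in cell $i$ that is responsible for an object, and $p_{i,d}(k)$ is the probability that digit slot $d$ in cell $i$ takes class $k$.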

III-D Data Augmentation & Synthesizing Data

Despite the similar structure, the data augmentation applied to each object type should be chosen based on the object's characteristics. For example, barcodes are omnidirectional and sensitive to horizontal loss (which makes them non-decodable), so besides common augmentations (such as blurring, adjusting brightness/hue/saturation, flipping horizontally, adding noise), we use various barcode-centric augmentations: random cropping/shifting without cutting off bounding boxes, random rotation over the full circle (360°), stretching, synthesized elastic effects [29] (to resemble a wrinkled barcode on a plastic bag), and random shearing both horizontally and vertically. License plates, on the other hand, are different because Arabic numerals are sensitive to strong rotation or flipping (e.g., 6 becomes 9 after a 180° rotation, and 3 becomes an unrecognizable symbol).
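As an illustration of random cropping that never cuts off a bounding box, the sketch below samples a crop window that always contains the tightest rectangle around all boxes. The function name and the uniform sampling are assumptions made for this example; the augmentation used in training may sample differently.

```python
import random


def random_crop_keeping_boxes(img_w, img_h, boxes, rng=random):
    """Sample a crop (x1, y1, x2, y2) that fully contains every box.

    boxes is a non-empty list of (x1, y1, x2, y2) tuples in pixel coordinates.
    Returns the crop window and the boxes shifted into the crop's frame.
    """
    # Tightest rectangle around all boxes; the crop may not cut into it.
    min_x = min(b[0] for b in boxes)
    min_y = min(b[1] for b in boxes)
    max_x = max(b[2] for b in boxes)
    max_y = max(b[3] for b in boxes)
    x1 = rng.uniform(0, min_x)
    y1 = rng.uniform(0, min_y)
    x2 = rng.uniform(max_x, img_w)
    y2 = rng.uniform(max_y, img_h)
    # Shift boxes into the coordinate frame of the crop.
    shifted = [(bx1 - x1, by1 - y1, bx2 - x1, by2 - y1)
               for bx1, by1, bx2, by2 in boxes]
    return (x1, y1, x2, y2), shifted
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible.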

Because the number of real samples is limited, synthesizing simple objects, which is not complicated, is extremely beneficial for helping model training converge. In the barcode case, numerous random barcode sequences that comply with the naming convention are generated and encoded into barcode form. They are then perspective-transformed (3D rotated) or randomly augmented before being pasted onto random images. Similarly, a large number of synthesized license plates are created by placing randomized sequences on random vehicle plate templates in a font similar to the real plates' font. Those plates undergo random perspective transforms and noise-adding operations to make them more realistic. The detailed processes and their effects are described in the Experiments part.
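Generating random but valid EAN-13 sequences can be sketched as below, using the standard EAN-13 check digit rule (weights 1 and 3 alternating from the left over the first 12 digits). The function names are hypothetical, and the sketch ignores prefix allocation rules, which a real generator may or may not respect.

```python
import random


def ean13_check_digit(first12):
    """Check digit for the 12 leading digits (weights 1, 3, 1, 3, ... from the left)."""
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return (10 - total % 10) % 10


def random_ean13(rng=random):
    """Random 13-digit sequence whose last digit is a valid check digit."""
    body = [rng.randint(0, 9) for _ in range(12)]
    return body + [ean13_check_digit(body)]


def is_valid_ean13(digits):
    """True if the sequence has 13 digits and a consistent check digit."""
    return len(digits) == 13 and ean13_check_digit(digits[:12]) == digits[12]
```

The same `is_valid_ean13` check can serve as the checksum validity test applied to predicted sequences after NMS.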

IV Experiments

IV-A Datasets

We prepared a trustable real dataset for the barcode recognition that has 3 standard subsets: training set, validation set, and test set as usual. The dataset was extended from the [6] work, and our new collecting campaign (also carefully double-check and annotate bounding box) from real image captures and 2 crawling online on product tagging websites: & [17]. Those images are EAN-13 single-barcode and various in size and format. Totally we have samples comprised of samples for testing, samples for training, and ones for validation. We limit our primary test dataset at single-barcode images because most crawling samples are single-barcode and the LabelImage tool we used to annotate is not ready for writing multiple decoded sequences. Eventually, we also prepared 2 small sets of 20 multi-barcode images ( barcodes total) and 90 EAN-8 samples ( for training, for evaluation) for supporting demonstrations.

Barcodes are synthesized with a variety of height (), width (), quiet zone (), valid background and foreground colors, gap widths, and no numerals printed to force models to learn strip pattern, not numerals. We paste barcodes per image with random rotations to one of 1000 images (from VOC dataset [7]) randomly. The use of randomization each iteration and non-barcode (0) inputs are made to ensure the model generalization and reduce the false-positive. We often set synthetic samples for training from scratch and increase this to tune models.

For license plate reading, we used the AOLP dataset [13] for its reliability and challenging conditions. The dataset consists of 2049 images of car/motorbike license plates, collected at different locations, times, traffic densities, and weather conditions. It has 3 categories: Access Control (AC), Law Enforcement (LE), and Road Patrol (RP). Each category has unique characteristics, and the LE and RP conditions are tougher to deal with. Specifically, we use the same train-test split as the state-of-the-art paper [2]: 581 samples of AC, 657 samples of LE, and 511 of RP for training, while 300 items (100 each) from AC, LE, and RP are used for evaluation. Synthesized samples have similar features to the real set: -digit sequences with dash (-) symbol in between, plate templates with random cut (for obscured -digit plate), center-aligned, various colors, and various scale.

IV-B Experimental Setups

We chose Dynamsoft, Cognex, Inlite, and Zxing as competitors in evaluating barcode sets. The first 3 software are trustworthy and used a lot in the industry (worth thousands of dollars). While Cognex is evaluated via an online demo website, other official binaries were tested on our computer. No test-time augmentation is used. The training processes were made on Intel Core i7 9700, NVIDIA Titan RTX 24GB from scratch (no pretrain), while evaluations were done on Intel Core i7 4700 and GTX 1060 6GB. We use mostly batch size of or , and the initial learning rate of . We reconfigure these hyper-params and

to achieve the best results. In decoding the output tensor, confidence threshold and NMS threshold are set at 0.2 and 0.3, respectively. The standard input size is

, grid size , number of boxes

. For the higher input resolution, we slice it into standard-size patches by a stride, make predictions on patches in parallel, then combine those predictions by patch positions and apply NMS a second time to get final results. Test samples are normalized and padding at the size of

, while the speed test is conducted at the batch size of and resolution ( VGA) to simulate as practical video input.
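The patch-based handling of higher-resolution inputs can be sketched as follows: compute patch origins with a stride, run the model on each patch, then shift patch-local boxes back to image coordinates before the second NMS. The function names, and the patch size and stride values used in any call, are hypothetical; only the slice, predict, combine, and NMS flow comes from the text.

```python
def patch_origins(length, patch, stride):
    """Offsets along one axis so that patches of size `patch` cover [0, length)."""
    if length <= patch:
        return [0]
    origins = list(range(0, length - patch + 1, stride))
    if origins[-1] != length - patch:
        origins.append(length - patch)  # extra patch flush with the far border
    return origins


def slice_image(width, height, patch, stride):
    """All (x, y) top-left patch origins for an image of the given size."""
    return [(x, y)
            for y in patch_origins(height, patch, stride)
            for x in patch_origins(width, patch, stride)]


def to_image_coords(box, origin):
    """Map a patch-local (x1, y1, x2, y2) box back to full-image coordinates."""
    ox, oy = origin
    return (box[0] + ox, box[1] + oy, box[2] + ox, box[3] + oy)
```

With a stride smaller than the patch size, neighbouring patches overlap, so an object near a patch border is seen whole in at least one patch and duplicates are removed by the second NMS.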

The license plate dataset is inputted at for both training and evaluation by only padding operation as we avoid resizing because many numbers on plates already very tiny and distorted characters could change ground-truth such as (B, 8), (4, 9) or (1, I). Other setups and tuning hyper-parameters are nearly the same as the barcode problem.

IV-C Evaluation

In this work, we adopt: standard Mean Average Precision (mAP at a fixed threshold) to evaluate detection performance; recognition rate to measure the accuracy of predicting each entire sequence correctly, i.e., the fraction of test sequences that are predicted exactly; prediction time (milliseconds) per input; the number of floating-point operations (FLOPs); and the number of parameters the model has (i.e., how lightweight the model is).
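A minimal sketch of the recognition rate as described above, counting a sample as correct only when the whole sequence matches. The function name and the string representation of sequences are assumptions for illustration.

```python
def recognition_rate(predicted, ground_truth):
    """Fraction of samples whose entire sequence is predicted exactly."""
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have equal length")
    if not ground_truth:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, ground_truth))
    return correct / len(ground_truth)


# One wrong digit makes the whole sample count as a miss.
rate = recognition_rate(["4006381333931", "5901234123457"],
                        ["4006381333931", "5901234123450"])
```

This is stricter than per-digit accuracy, which is why recognition rates in the tables sit below the detection mAP.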

Table I shows the best models for each backbone alongside the industrial tools. MobileNet2[:X]_Y means we take only the part of MobileNetV2 up to layer X and attach Det blocks with Y channels inside. Clearly, the ResNet backbone takes first place but is relatively slow, with a bigger model size. On the other hand, MobileNet2[:15]_256 not only achieves an excellent result, higher than all the tools, but also has a very fast inference speed, whereas its shallower variants MobileNet2[:6]_256 and MobileNet2[:6]_512 use features with too little abstraction, leading to poor performance. We also see how useful the Det blocks are (the first 5 models use them), since models without them (_no_Detnet) could not score higher. However, increasing the number of channels inside the Det blocks does not help, so this number must be set carefully (256 is the default setting used in [18]). As anticipated, the Darknet19 backbone (used originally in YOLOv2 [24]; it has batch normalization but no Det blocks) did fairly well in detection; however, it did not demonstrate adequate performance in recognition because of object smallness in this dataset. Last but not least, we also used MobileNet2[:15]_256 for a speed evaluation on the 2000-sample EAN-13 test set at a VGA-similar resolution, and the result is 25.8 ms/image, i.e., a smooth, real-time 38.76 FPS.
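As a quick check of the speed figure above, per-image latency converts to frame rate as follows (a worked conversion, not an additional measurement):

```latex
\text{FPS} = \frac{1000\ \text{ms/s}}{25.8\ \text{ms/image}} \approx 38.76\ \text{images/s}
```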

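| Group | Model | mAP | Recog. Rate | Avg. Time/img | GFLOPs | Params |
|---|---|---|---|---|---|---|
| Ours | ResNet50 | 94.38% | 86.90% | 127 ms | 17.23 | 26.98 M |
| Ours | MobileNet2[:15]_256 | 91.62% | 75.35% | 49 ms | 1.45 | 3.2 M |
| Ours | MobileNet2[:15]_512 | 85.15% | 72.10% | 55 ms | 2.31 | 7.62 M |
| Ours | MobileNet2[:6]_256 | 83.46% | 66.00% | 67 ms | 7.17 | 2.48 M |
| Ours | MobileNet2[:6]_512 | 81.43% | 61.70% | 74 ms | 20.81 | 6.83 M |
| Ours | ResNet50_no_Detnet | 84.19% | 77.80% | 91 ms | 17.05 | 26.09 M |
| Ours | MobileNet2_no_Detnet | 82.93% | 65.65% | 53 ms | 1.57 | 3.84 M |
| Ours | Darknet19 | 87.17% | 69.65% | 87 ms | 17.13 | 23.71 M |
| Tools | Zxing | 26.90% | 26.90% | 53 ms | - | - |
| Tools | Dynamsoft | 76.55% | 76.30% | 375 ms | - | - |
| Tools | Cognex | 67.15% | 66.75% | 115 ms | - | - |
| Tools | Inlite | 51.35% | 50.55% | 218 ms | - | - |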
TABLE I: Model and Tool results on the Main (2000 single-barcode) test set

To demonstrate the model's ability on multi-barcode images and its extensibility to a new barcode type, we also evaluated on the multi-barcode dataset and on the EAN-8 barcode set, with results in Table II. On the multi-barcode set, our best model (ResNet50 with Detnet) gets 95.98% mAP and an 81.8% recognition rate, outperforming the commercial counterpart, which reaches only 63.6% even though we set it to maximum effort. The extensibility of the method is also shown by MobileNet2[:15]_256 beating Zxing (the fastest commercial tool): after training purely on synthetic samples, it still achieved 81.1% on the 90 EAN-8 test samples.

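| Set | Model | mAP | Correct/Total | Recog. Rate |
|---|---|---|---|---|
| Multi set | ResNet50 (our best) | 95.98% | 63/77 | 81.8% |
| Multi set | Dynamsoft (com. best) | 68.7% | 49/77 | 63.64% |
| EAN-8 set | MobileNet2 (our fastest) | 87.99% | 73/90 | 81.1% |
| EAN-8 set | Zxing (com. fastest) | 45.56% | 41/90 | 45.56% |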
TABLE II: Multi-barcode and EAN-8 evaluation

For the license plate recognition task, we conducted experiments on 2 backbones, ResNet50 and MobileNet2[:15], with normal DetNet blocks. Besides the overall evaluation, we measure each of the 3 AOLP categories separately to see how the models react to each condition and where they show weaknesses to improve in future work. As Table III shows, the ResNet-backed model again wins overall in both mAP (99.35%) and recognition rate (92.21%). Although the lightweight MobileNetV2-backed model comes second in mAP, it not only does very well in recognition rate but is also the fastest model, at only 52.54 ms per image, compared with the state-of-the-art paper (SOTA) [2] on this dataset (note that the setup of [2] is not the same as ours: we used a GTX 1060, which is only 10% faster than the GTX 970 that [2] used). Both of our models run inference very fast, although still not at a real-time rate (25-30 fps). As expected, LE and RP are challenging conditions, so our models predicted less accurately in both detection and full-plate classification. We might need to learn more about those conditions and provide better data augmentations that fit them.

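| Category | Metric | ResNet50 | MobileNet2 | [2] |
|---|---|---|---|---|
| AC | mAP | 100% | 100% | - |
| AC | recog. rate | 97.00% | 95.00% | - |
| LE | mAP | 98.11% | 98.15% | - |
| LE | recog. rate | 91.67% | 87.96% | - |
| RP | mAP | 100.00% | 99.00% | - |
| RP | recog. rate | 89.00% | 85.00% | - |
| Overall | mAP | 99.35% | 99.02% | 99.00% |
| Overall | recog. rate | 92.21% | 89.29% | 78.00% |
| | GFLOPs | 35.13 | 2.93 | - |
| | Params (M) | 27.2 | 3.4 | - |
| | Time (ms) | 70.86 | 52.54 | 800 - 1000 |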
TABLE III: Results on AOLP dataset

IV-D Ablation Study

Regarding the influence of different data augmentations, Table IV shows that rotation and the shearing effect are crucial for training this model: without them, the recognition rate drops clearly (to 57.90% and 78.85% respectively in Table IV). On the other hand, training without blurring, Gaussian noise, or elastic effects reduces model performance only slightly. We suspect this is because those conditions already exist in the real dataset.

Returning to the Training subsection of the Methodology, where we discussed the classification weight, we hypothesize that this hyper-parameter should be small when training starts from scratch. The reason is that the model should first focus on learning where each object is located rather than on classifying what each object is. A bigger weight would dominate the loss function and hinder spatial convergence to the correct grid cell, and consequently the model would not classify the digits of the sequence correctly either. Therefore, only after a certain degree of convergence (i.e., once the model has learned to locate correctly) should we increase the hyper-parameter to emphasize classifying digits correctly. The empirical results in Table IV support our hypothesis: with either ResNet or MobileNetV2, setting the weight too high from the beginning (the _w5 rows) confuses the model so that it cannot converge; keeping it low for the whole process (the _w1 rows) also does not produce superior models; and raising the value after a certain degree of convergence (the _w1_5 rows) results in better models.
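The two-phase weighting strategy can be sketched as a simple step schedule. The values 1 and 5 are inferred from the `_w1`, `_w5`, and `_w1_5` model suffixes in Table IV, and `switch_epoch` is a hypothetical parameter: the text only says the weight is raised after a certain degree of convergence, not when.

```python
def classification_weight(epoch, switch_epoch, low=1.0, high=5.0):
    """Weight on the sequence classification loss term.

    Stays low while the model is still learning to localize, then steps up
    to emphasize predicting the digits correctly.
    """
    return low if epoch < switch_epoch else high


# Example schedule over 6 epochs with a hypothetical switch at epoch 3.
schedule = [classification_weight(e, switch_epoch=3) for e in range(6)]
```

In practice the switch point could also be triggered by a validation metric such as detection mAP rather than a fixed epoch.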

| Model | mAP | Recog. rate |
|---|---|---|
| ResNet50_w1 | 91.87% | 82.20% |
| ResNet50_w5 | 34.28% | 0.20% |
| ResNet50_w1_5 | 94.38% | 86.90% |
| MobileNet2[:15]_256_w1 | 96.60% | 69.85% |
| MobileNet2[:15]_256_w5 | 46.33% | 0.00% |
| MobileNet2[:15]_256_w1_5 | 91.62% | 75.35% |
| ResNet50_w1_no_elastic | 92.95% | 80.95% |
| ResNet50_w1_no_rotate | 86.62% | 57.90% |
| ResNet50_w1_no_shear | 93.85% | 78.85% |
| ResNet50_w1_no_blur | 91.23% | 81.15% |
| ResNet50_w1_no_noise | 94.38% | 81.55% |

TABLE IV: Data augmentation and effect ablation study

Finally, throughout this research we kept in mind that even humans can sometimes be absolutely sure that a small object in an image is something they know, and yet be unable to see clearly what is on it. The same can happen to a machine, so we analyzed the correctly decoded cases against the non-decodable ones (wrong predictions included) to answer the question: how small must a barcode be in an image before the model has difficulty decoding it correctly? The answer also tells us that if a barcode is too small compared to the whole image, we should be careful with the model outcome, or we had better move the camera closer to the detected location to obtain a more trustworthy prediction. As Figure 2 shows, when the barcode area is larger than 0.02 of the whole image, our best model (ResNet50) decodes over 90% correctly. (This analysis was done on the 2000-sample main test set.)
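The practical rule suggested by this analysis (treat predictions on very small barcodes with caution) could be applied with a simple area-ratio check. The sketch below assumes an axis-aligned detection box in pixel coordinates; the function names are hypothetical, and the 0.02 threshold is the value read from Figure 2 for the ResNet50 model, so it may not transfer to other models or datasets.

```python
# Threshold taken from the Figure 2 discussion (barcode area / image area).
MIN_AREA_RATIO = 0.02


def box_area_ratio(box, image_width, image_height):
    """Fraction of the image covered by an (x_min, y_min, x_max, y_max) box."""
    x_min, y_min, x_max, y_max = box
    width = max(0.0, x_max - x_min)
    height = max(0.0, y_max - y_min)
    return (width * height) / (image_width * image_height)


def needs_closer_capture(box, image_width, image_height, min_ratio=MIN_AREA_RATIO):
    """True when the detected barcode is too small to trust the decoded digits."""
    return box_area_ratio(box, image_width, image_height) < min_ratio


# Example on a 640x480 (VGA) frame: 2% of the frame is 6144 pixels.
small_box = (100, 100, 200, 150)   # 100 x 50 = 5000 px, under the threshold
large_box = (100, 100, 260, 160)   # 160 x 60 = 9600 px, over the threshold
print(needs_closer_capture(small_box, 640, 480))
print(needs_closer_capture(large_box, 640, 480))
```

A caller could use the flag to prompt the user to move the camera closer before accepting the decoded sequence.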

Fig. 2: Barcode area on image ratio and decodability

V Conclusion

In conclusion, in this study we proposed QuickBrowser to quickly detect and read simple object content by integrating multi-digit classification into DNN one-stage object detection, with a modified architecture, loss function, data augmentation, and training procedure that make it lightweight, fast, and robust even under most real-life conditions. We also upgraded an existing dataset to a rigorous benchmark dataset for barcode scanning by collecting more samples, annotating, and double-checking. The dataset has a main EAN-13 single-barcode set with a clear train/validation/test split, an EAN-13 multi-barcode set, and an EAN-8 set to confirm the model's extensibility. The dataset is public to promote further research. Lastly, our experimental results on the extended barcode dataset and the AOLP license plate dataset demonstrated the method's efficiency by outperforming industrial software and existing works in terms of detection rate, recognition rate, and inference duration. At the same time, our fastest models achieve real-time and near-real-time rates at realistic resolutions.

However, our methodology also has a few limitations, such as the grid limitation, the maximum sequence length, and the restriction to simple object structures. We plan to address these drawbacks in future work through a more in-depth analysis of the datasets.


  • [1] C. E. Anagnostopoulos, I. E. Anagnostopoulos, I. D. Psoroulas, V. Loumos, and E. Kayafas (2008) License plate recognition from still images and video sequences: a survey. IEEE Transactions on intelligent transportation systems 9 (3), pp. 377–391. Cited by: §II.
  • [2] R. Chen et al. (2019) Automatic license plate recognition via sliding-window darknet-yolo deep learning. Image and Vision Computing 87, pp. 47–56. Cited by: §I, §II, §IV-A, §IV-C, TABLE III.
  • [3] C. Creusot and A. Munawar (2015) Real-time barcode detection in the wild. In 2015 IEEE winter conference on applications of computer vision, pp. 239–245. Cited by: §I, §II.
  • [4] C. Creusot and A. Munawar (2016) Low-computation egocentric barcode detector for the blind. In 2016 IEEE International Conference on Image Processing (ICIP), pp. 2856–2860. Cited by: §I, §II.
  • [5] K. Deb and K. Jo (2008) HSI color based vehicle license plate detection. In 2008 International Conference on Control, Automation and Systems, pp. 687–691. Cited by: §II.
  • [6] T. Do, Y. Tolcha, T. J. Jun, and D. Kim (2020) Smart inference for multidigit convolutional neural network based barcode decoding. arXiv preprint arXiv:2004.06297. Cited by: 2nd item, §II, §IV-A.
  • [7] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. Cited by: §IV-A.
  • [8] F. Fridborn (2017) Reading barcodes with neural networks. Cited by: §I, §II.
  • [9] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587. Cited by: §II.
  • [10] I. J. Goodfellow, Y. Bulatov, J. Ibarz, S. Arnoud, and V. Shet (2013) Multi-digit number recognition from street view imagery using deep convolutional neural networks. arXiv preprint arXiv:1312.6082. Cited by: §II, §III-A, §III.
  • [11] D. K. Hansen, K. Nasrollahi, C. B. Rasmussen, and T. B. Moeslund (2017) Real-time barcode detection and classification using deep learning.. In IJCCI, pp. 321–327. Cited by: §I, §II.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §III-B.
  • [13] G. Hsu, J. Chen, and Y. Chung (2012) Application-oriented license plate recognition. IEEE transactions on vehicular technology 62 (2), pp. 552–561. Cited by: §II, §IV-A.
  • [14] E. Joseph and T. Pavlidis (1994) Bar code waveform recognition using peak locations. IEEE Transactions on Pattern Analysis and Machine Intelligence 16 (6), pp. 630–640. Cited by: §I, §II.
  • [15] M. Katona and L. G. Nyúl (2012) A novel method for accurate and efficient barcode detection with morphological operations. In 2012 Eighth International Conference on Signal Image Technology and Internet Based Systems, pp. 307–314. Cited by: §I, §II, §II.
  • [16] T. Kumar, S. Gupta, and D. S. Kushwaha (2016) An efficient approach for automatic number plate recognition for low resolution images. In Proceedings of the Fifth International Conference on Network, Communication and Computing, pp. 53–57. Cited by: §I.
  • [17] G. Lazzari, Y. Jaquet, D. J. Kebaili, L. Symul, and M. Salathé (2018) FoodRepo: an open food repository of barcoded food products. Frontiers in nutrition 5, pp. 57. Cited by: §IV-A.
  • [18] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun (2018) Detnet: design backbone for object detection. In Proceedings of the European conference on computer vision (ECCV), pp. 334–350. Cited by: §III-B, §IV-C.
  • [19] D. Lin, M. Lin, and K. Huang (2011) Real-time automatic recognition of omnidirectional multiple barcodes and dsp implementation. Machine Vision and Applications 22 (2), pp. 409–419. Cited by: §I, §II.
  • [20] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §III-A.
  • [21] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §II, §III-A.
  • [22] R. Muniz, L. Junco, and A. Otero (1999) A robust software barcode reader using the hough transform. In Proceedings 1999 International Conference on Information Intelligence and Systems (Cat. No. PR00446), pp. 313–319. Cited by: §I, §II.
  • [23] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. Cited by: §II, §III-A, §III-A, §III-C.
  • [24] J. Redmon and A. Farhadi (2017) YOLO9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7263–7271. Cited by: §III-A, §III-B, §IV-C.
  • [25] J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §III-A.
  • [26] A. Safaei, H. L. Tang, and S. Sanei (2016) Real-time search-free multiple license plate recognition via likelihood estimation of saliency. Computers & Electrical Engineering 56, pp. 15–29. Cited by: §I.
  • [27] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520. Cited by: §III-B.
  • [28] H. Sheng, C. Li, Q. Wen, and Z. Xiong (2009) Real-time anti-interference location of vehicle license plates using high-definition video. IEEE Intelligent Transportation Systems Magazine 1 (4), pp. 17–23. Cited by: §II.
  • [29] P. Y. Simard, D. Steinkraus, J. C. Platt, et al. (2003) Best practices for convolutional neural networks applied to visual document analysis.. In Icdar, Vol. 3. Cited by: §III-D.
  • [30] G. Sörös and C. Flörkemeier (2013) Blur-resistant joint 1d and 2d barcode localization for smartphones. In Proceedings of the 12th International Conference on Mobile and Ubiquitous Multimedia, pp. 1–8. Cited by: §I, §II.
  • [31] S. Wachenfeld, S. Terlunen, and X. Jiang (2008) Robust recognition of 1-d barcodes using camera phones. In 2008 19th International Conference on Pattern Recognition, pp. 1–4. Cited by: §I, §II.
  • [32] H. Yang, L. Chen, Y. Chen, Y. Lee, and Z. Yin (2016) Automatic barcode recognition method based on adaptive edge detection and a mapping model. Journal of Electronic Imaging 25 (5), pp. 053019. Cited by: §II.
  • [33] F. Yu, V. Koltun, and T. Funkhouser (2017) Dilated residual networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 472–480. Cited by: §III-B.
  • [34] A. Zamberletti, I. Gallo, M. Carullo, and E. Binaghi (2010) Decoding 1-d barcode from degraded images using a neural network. In International Conference on Computer Vision, Imaging and Computer Graphics, pp. 45–55. Cited by: §II.