
Fast, Accurate Barcode Detection in Ultra High-Resolution Images

Object detection in Ultra High-Resolution (UHR) images has long been a challenging problem in computer vision due to the varying scales of the targeted objects. When it comes to barcode detection, resizing UHR input images to smaller sizes often leads to the loss of pertinent information, while processing them directly is highly inefficient and computationally expensive. In this paper, we propose using semantic segmentation to achieve fast and accurate detection of barcodes of various scales in UHR images. Our pipeline involves a modified Region Proposal Network (RPN) on images of size greater than 10k×10k and a newly proposed Y-Net segmentation network, followed by a post-processing workflow for fitting a bounding box around each segmented barcode mask. The end-to-end system has a latency of 16 milliseconds, which is 2.5× faster than YOLOv4 and 5.9× faster than Mask R-CNN. In terms of accuracy, our method outperforms YOLOv4 and Mask R-CNN by a mAP of 5.5% and 47.1% respectively. We also release the generated synthetic barcode dataset and its accompanying code.




1 Introduction

Barcodes are digital signs often made of adjacent and alternating black and white smaller rectangles that have become an intrinsic part of human society. In administration, for example, they are used to encode, save, and retrieve various users’ information. At grocery stores, they are used to track sales and inventories. More interestingly in e-commerce, they are used to track and speed up processing time in warehouses and fulfillment centers.

In classical signal processing, filters used for detection are image-specific since input images are not all necessarily acquired with the same illumination, brightness, angle, or camera. Consequently, adaptive image processing algorithms are required, which can impact detection accuracy [10]

. In addition, because classical signal processing methods often run on Central Processing Units, they tend to be much slower compared with deep learning implementations that are easily optimized on Graphics Processing Units (GPUs).

Figure 1: Proposed approach: the modified RPN is followed by Y-Net and the bounding box extractor.

Over the years, a number of methods have been proposed to detect barcodes using classical signal processing [10, 6, 9, 13, 3], but nearly all of them take too long to process Ultra High-Resolution (UHR) images. More specifically, [3] used parallel segment detectors, improving on their previous work [2] of finding imaginary perpendicular lines in Hough space with maximally stable extremal regions to detect barcodes. Katona et al. [9] used morphological manipulation for barcode detection, but this method did not generalize well, as different barcode types have varying performances. Similarly, [4] proposed using x and y derivative differences, but varying input images yielded different outputs, and such an operation on UHR images would be highly inefficient.

With neural networks, though there has been much improvement in barcode detection tasks, few works have addressed fast and accurate detection in UHR images. Zamberletti et al. [14] paved the way for using neural networks to detect barcodes by investigating Hough spaces. This was followed by [5], which adapted the You Only Look Once (YOLO) detector to find barcodes in Low Resolution (LR) images, but the YOLO algorithm is known to perform poorly on long-shaped objects such as Code 39 barcodes. Instance segmentation methods such as Mask R-CNN [8] perform better on larger images, but on smaller images the output Regions of Interest (RoIs) do not align well with long, 1D-barcode structures. This is because Mask R-CNN typically predicts masks at a fixed low resolution irrespective of object size, thereby generating "wiggly" artifacts on some barcode predictions and losing spatial resolution. In the same way, dedicated object detection pipelines such as YOLOv4 [1], though they perform well at lower Intersection over Union (IoU) thresholds, lose accuracy at higher IoU thresholds. Among methods using segmentation on LR images as a means for detection, [15] also tends not to perform well at higher IoU thresholds.

In this paper, we propose a pipeline for detecting barcodes using deep neural networks, shown in Fig. 1, which consists of two stages trained separately. When compared with classical signal processing methods, neural networks not only provide a faster inference time but also yield higher accuracy because they learn meaningful filters for optimal feature extraction. As seen in Fig. 1, in the first stage we expand on the Region Proposal Network (RPN) introduced in Faster R-CNN [11] to extract high-definition regions of potential barcode locations. This stage significantly reduces the inference computation that would otherwise be required in the second stage. In the second stage, we introduce Y-Net, a semantic segmentation network that detects all instances of barcodes in a given RoI image output by the first stage. We then apply morphological operations on the predicted masks to separate and extract the corresponding bounding boxes, as shown in Fig. 2.

One of the limitations of existing work on barcode detection is the insufficient number of training examples. The ArTe-Lab 1D Medium Barcode Dataset [14] and the WWU Muenster Barcode Database [12] are two examples of existing available datasets. They contain 365 and 595 images respectively, with corresponding ground truth masks. Most of the samples in the ArTe-Lab dataset have only one EAN13 barcode per sample image, and few samples in the Muenster database have more than one barcode instance in a given image. To address this dataset availability problem, we have released 100,000 UHR and 100,000 LR synthetic barcode images along with their corresponding ground truth bounding boxes and masks to facilitate further studies. The outline of this paper is as follows: in Section 2, we describe the details of our approach; in Section 3, we summarize our experimental results; and in Section 4, we conclude with future work.

2 Proposed Approach

As seen in Fig. 1, our proposed method consists of three stages: modified Region Proposal Network, Y-Net segmentation network, and bounding box extraction.

2.1 Modified Region Proposal Network

Region proposals have been influential in computer vision, and more so when it comes to object detection in UHR images. In UHR images, barcodes are commonly clustered in a small region of the image. To filter out most of the non-barcode background, we modified the RPN introduced in Faster R-CNN [11] to propose barcode regions for the subsequent stages. The UHR input image is first downsampled to an LR image, and the RPN is trained to identify barcode blobs in the LR image. Once a bounding box is placed around each identified blob, the proposed bounding box is remapped to the input UHR image by a perspective transformation, and the resulting regions are cropped out. The LR input resolution to the RPN is chosen to be as small as possible without discarding pertinent information, since too low a resolution loses barcode detail. Non-Max Suppression (NMS) is used on the predictions to select the most probable regions.
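The remapping and NMS steps above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: a simple axis-aligned scaling stands in for the full perspective transformation, and the helper names are our own.

```python
import numpy as np

def remap_boxes_to_uhr(boxes_lr, lr_size, uhr_size):
    """Scale (x1, y1, x2, y2) LR proposals back to UHR pixel coordinates."""
    sy = uhr_size[0] / lr_size[0]
    sx = uhr_size[1] / lr_size[1]
    boxes = boxes_lr.astype(np.float64).copy()
    boxes[:, [0, 2]] *= sx  # x coordinates
    boxes[:, [1, 3]] *= sy  # y coordinates
    return boxes

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy Non-Max Suppression: keep the highest-scoring box, drop overlaps."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection of the kept box with all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]
    return keep
```

Because boxes are proposed in LR space and only then remapped, the expensive per-region work in the second stage runs on a handful of UHR crops rather than the full image.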

Figure 2: Sample outputs of our pipeline; yellow - segmented barcode pixels; purple - segmented background pixels; boxes - bounding box extracted; (a) synthetic barcode image; (b) real barcode image; (c) prediction results on (a); (d) prediction results on (b).

2.2 Y-Net Segmentation Network

Figure 3: Y-Net Architecture.

As depicted in Fig. 3, Y-Net is made up of three main modules distributed across two branches: a Regular Convolution Module, shown in blue, which constitutes the left branch; and a Pyramid Pooling Module, shown in brown, along with a Dilated Convolution Module, shown in orange, which after concatenation and convolution constitute the right branch.

The Regular Convolution Module takes the output images of the RPN as input and consists of convolutional and pooling layers. It starts with 64-channel kernels and doubles the channel count at each layer, alternating between convolution and max-pooling until a small feature map is reached. This module allows the model to learn general pixel-wise information anywhere in the input image.

The Dilated Convolution Module takes advantage of the fact that barcodes are made of alternating black and white rectangles to learn sparse features in their structure. The motivation for this module comes from the fact that dilated convolution operators play a significant role in the "algorithme à trous" for biorthogonal wavelet decomposition [7]; the discontinuities of alternating patterns and the sharp edges in barcodes are therefore learned more accurately by such filters. In addition, they provide a multiresolution, multiscale decomposition, as they allow the kernels to widen their receptive fields with dilation rates from 1 up to 16. Here too the RPN output image is used as input, and we maintain 32-channel kernels throughout the module while the spatial dimensions are gradually reduced with a stride of 2 until a small feature map is obtained.
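One reason dilation rates from 1 up to 16 are attractive is that stacking them grows the receptive field rapidly while keeping the parameter count fixed. A small sketch (assuming 3×3 kernels and a stride of 2 per layer; the exact layer configuration is not fully specified in the text) computes the effective kernel sizes and the stacked receptive field with the standard recurrence:

```python
def effective_kernel(k, d):
    """Effective extent of a k x k kernel with dilation rate d."""
    return d * (k - 1) + 1

def receptive_field(kernel_sizes, strides):
    """Receptive field of stacked conv layers via the standard recurrence."""
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump  # each layer adds (k - 1) * cumulative stride
        jump *= s
    return rf

dilations = [1, 2, 4, 8, 16]
kernels = [effective_kernel(3, d) for d in dilations]  # 3x3 kernels widen to 33x33
field = receptive_field(kernels, [2] * len(kernels))
```

Under these assumptions, five 3×3 layers already cover a receptive field several hundred pixels wide, which is what lets the module relate bars that are far apart in a barcode.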

The Pyramid Pooling Module allows the model to learn global information about potential barcode locations at different scales, and its layers are concatenated with those of the Dilated Convolution Module in order to preserve the features extracted by both modules.

The resulting feature maps from the right branch are then added to the output of the Regular Convolution Module, which allows for the correction of features that would have been missed by either branch. In other words, the output of each branch constitutes a residual correction for the other, refining the result at each node (shown in white). The nodes are then up-sampled and concatenated with transposed-convolution feature maps (shown in red and yellow) of the corresponding dimensions. Throughout the network, we use ReLU as the non-linearity after each layer and add regularization to account for possible over-fitting during training. On all datasets, we use 80% of the samples for training, 10% for validation, and the remaining 10% for testing. We use one NVIDIA Tesla V100 GPU for training. Since this is a segmentation network classifying pixels as background or barcode, we use binary cross-entropy as the loss function.

2.3 Bounding Box Extraction

Since some images contain barcodes that are very close to each other, their Y-Net outputs reflect the same configuration, which makes extracting individual barcode bounding boxes complex, as shown in Fig. 4(a). To separate them effectively, we perform erosion, contour extraction, and bounding box expansion with a pixel correction margin. As shown in Fig. 4(b), the erosion stage widens the gaps between segmented barcodes that may be separated by one or more pixels. The resulting mask is then used to infer individual barcode bounding boxes in the contour extraction stage (Fig. 4(c)) through border following. A pixel correction margin recovers the original bounding box dimensions during the expansion stage, as shown in Fig. 4(d). This post-processing stage of our pipeline has an average processing time of 1.5 milliseconds (ms), as it consists of a set of efficient Python matrix operations for extracting bounding boxes from the predicted masks.
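A minimal sketch of this erode/extract/expand post-processing, using SciPy's morphology utilities rather than the authors' exact implementation (the function and parameter names here are ours):

```python
import numpy as np
from scipy import ndimage

def extract_boxes(mask, erosion_iters=2, margin=2):
    """Erode to split touching masks, label components, expand boxes by a margin."""
    eroded = ndimage.binary_erosion(mask.astype(bool), iterations=erosion_iters)
    labels, _ = ndimage.label(eroded)
    boxes = []
    for sl in ndimage.find_objects(labels):
        # Expand each component's bounding slice by the pixel correction margin,
        # clamped to the image bounds.
        y0 = max(sl[0].start - margin, 0)
        x0 = max(sl[1].start - margin, 0)
        y1 = min(sl[0].stop + margin, mask.shape[0])
        x1 = min(sl[1].stop + margin, mask.shape[1])
        boxes.append((x0, y0, x1, y1))
    return boxes
```

Note that the correction margin should match the amount of erosion so that the expanded boxes line up with the original, un-eroded masks.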


| Method | AP | AP@.50 | AP@.75 | AP (small) | AP (medium) | AR@.50 | AR@.75 | AR@.90 | AR@.95 | Latency (ms) |
|---|---|---|---|---|---|---|---|---|---|---|
| Mask R-CNN [8] | .466 | .985 | .317 | .340 | .489 | .990 | .740 | .279 | .023 | 94.8 |
| YOLOv4 [1] | .882 | .990 | .989 | .815 | .897 | 1. | 1. | .995 | .873 | 40.5 |
| Ours | .937 | .990 | .990 | .903 | .945 | 1. | 1. | 1. | .972 | 16.0 |


Table 1: Average Precision (maximum of 100 detections) and Average Recall (maximum of 10 detections) computed using the MS COCO API.


Muenster Dataset:

| Method | DR | Precision | Recall | mIoU |
|---|---|---|---|---|
| Creusot et al. [3] | .982 | - | - | - |
| Hansen et al. [5] | .991 | - | - | .873 |
| Namane et al. [10] | .966 | - | - | .882 |
| Zharkov et al. [15] | .980 | .777 | .990 | .842 |
| Ours | 1. | .984 | 1. | .921 |

ArTe-Lab Dataset:

| Method | DR | Precision | Recall | mIoU |
|---|---|---|---|---|
| Creusot et al. [3] | .989 | - | - | - |
| Hansen et al. [5] | .926 | - | - | .816 |
| Namane et al. [10] | .930 | - | - | .860 |
| Zharkov et al. [15] | .989 | .814 | .995 | .819 |
| Ours | 1. | .974 | 1. | .934 |


Table 2: Detection Rate (DR), Precision, Recall, and mean IoU (mIoU) at an IoU threshold of 0.5 on the Muenster and ArTe-Lab datasets.


| Method | Px Acc | Px mIoU | Px Prec | Px Rec |
|---|---|---|---|---|
| Mask R-CNN [8] | .993 | .990 | .989 | .890 |
| Ours | 1. | 1. | .999 | .999 |


Table 3: Pixel-wise Metrics

3 Datasets and Results

For the synthetic dataset, we use the treepoem and random-word Python libraries to generate UHR and LR barcode images, covering Code 39, Code 93, Code 128, UPC, EAN, PDF417, ITF, Data Matrix, AZTEC, and QR codes among others. We model the number of barcodes in a given image with a Poisson process, and a combination of perspective transforms makes the barcodes vary in shape and position from one image to another. We also add random black blobs at random locations on the original UHR and LR canvases. The real UHR barcode dataset, obtained from an industry partner, consists of 3.8 million grayscale UHR images and could not be released for confidentiality reasons. Additionally, the Muenster and ArTe-Lab datasets are used with data augmentation schemes for more samples.
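The Poisson placement scheme can be sketched as follows. The canvas size, mean count, and box size are illustrative assumptions, and the commented-out treepoem call indicates where an actual barcode image would be rendered (treepoem requires Ghostscript):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_layout(canvas_hw, mean_barcodes=3.0, box_hw=(60, 120)):
    """Sample a Poisson-distributed number of barcode placements on a canvas."""
    n = max(int(rng.poisson(mean_barcodes)), 1)  # at least one barcode per image
    h, w = canvas_hw
    bh, bw = box_hw
    boxes = []
    for _ in range(n):
        y = int(rng.integers(0, h - bh))
        x = int(rng.integers(0, w - bw))
        boxes.append((x, y, x + bw, y + bh))
        # A real generator would paste a rendered barcode here, e.g.:
        # img = treepoem.generate_barcode("code128", text)  # needs Ghostscript
    return boxes
```

Sampled boxes double as the ground-truth bounding box annotations, which is what makes large-scale synthetic labeling essentially free.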

For the RPN, we accumulate the number of ground truth bounding boxes that fall inside the proposed regions and divide it by the total number of ground truth bounding boxes. Our implementation yields an accuracy of 98.03% on the synthetic dataset at 10 ms per image and 96.8% on the real dataset at 13 ms per image, while the baseline [11] yields the same accuracies with an average latency of over 2.5 seconds (s) per image on both datasets.

For Y-Net, we use the Microsoft (MS) COCO API and pixel-wise metrics to evaluate against [8, 1]. By default, the MS COCO API evaluates on small, medium, and large area objects, but in our application the largest detected barcode area is medium. Since Y-Net is a segmentation network and does not output confidence scores for each segmented barcode, we propose using pseudo scores: the ratio of the total number of nonzero pixels in a predicted mask to the total number of nonzero pixels in the corresponding ground truth mask at the location of a given object.
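A minimal sketch of the proposed pseudo score; clipping to 1 is our assumption, so that the score behaves like a confidence in [0, 1]:

```python
import numpy as np

def pseudo_score(pred_mask, gt_mask, box):
    """Nonzero predicted pixels over nonzero ground-truth pixels inside a box."""
    x0, y0, x1, y1 = box
    pred = np.count_nonzero(pred_mask[y0:y1, x0:x1])
    gt = np.count_nonzero(gt_mask[y0:y1, x0:x1])
    if gt == 0:
        return 0.0
    return min(pred / gt, 1.0)  # clip so the score behaves like a confidence
```

The COCO evaluator expects a per-detection score for ranking, so supplying this ratio in place of a classifier confidence is what makes the AP/AR comparison in Table 1 possible.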

Table 1 shows the AP and AR values of the models on the synthetic dataset. As seen, our pipeline outperforms [8] and [1] by a mAP of 47.1% and 5.5%, and by an AP at IoU 0.75 of 67.3% and 0.1%, respectively. Also shown in Table 1 is an AR improvement of 94.9% and 9.9% over [8] and [1] respectively at the highest IoU threshold, which highlights that Y-Net continues to yield better results even at higher IoU thresholds. Both our approach and [1] achieve an AR of 100% at IoU 0.5 and outperform [8] by 1%. For small area barcodes, Y-Net outperforms [8] and [1] by an AP of 56.3% and 8.8%, and for medium area barcodes, Y-Net displays an AP increase of 45.6% and 4.8% over [8] and [1] respectively. In addition, Table 3 reveals that Y-Net has a much better semantic segmentation performance than [8]. Table 1 also shows that Y-Net runs at least 2.5× faster than the faster of [8] and [1] on LR images.

Similarly, we use the Detection Rate (DR), mIoU, Precision, and Recall as described in [10, 3, 5, 15] on the ArTe-Lab and Muenster datasets, and as can be seen in Table 2, our method outperforms previous works on all of the mentioned metrics. This indicates that our bounding box extraction algorithm detects accurate bounding boxes as expected. However, while it successfully separates barcodes that are relatively close to each other, it has limitations when barcodes overlap, as shown in Fig. 4(e). In those occlusion scenarios, the algorithm tends to group the overlapping barcodes into one bounding box instead of separate bounding boxes, as shown in Fig. 4(f).
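The Detection Rate at a 0.5 IoU threshold can be computed as follows; this is a sketch with hypothetical helper names, following the common definition rather than any particular reference implementation:

```python
def box_iou(a, b):
    """IoU of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def detection_rate(pred_boxes, gt_boxes, thresh=0.5):
    """Fraction of ground-truth boxes matched by some prediction at IoU >= thresh."""
    if not gt_boxes:
        return 0.0
    hits = sum(any(box_iou(p, g) >= thresh for p in pred_boxes) for g in gt_boxes)
    return hits / len(gt_boxes)
```

Under this definition, the grouped box produced in the occlusion case of Fig. 4(f) can still count as a detection for one of the overlapping barcodes while missing the other, which is why occlusion lowers DR.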

Figure 4: (a) Y-Net output; (b) Y-Net output after erosion; (c) extracted bounding boxes (red) and ground truth bounding boxes (green) on the eroded output; (d) final bounding boxes after the pixel correction margin on the Y-Net output; (e) Y-Net output for occluded barcode scenarios; (f) final extracted bounding boxes are grouped after the pixel correction margin due to overlapping barcodes in the input image.

4 Conclusion

In this paper, we showed that barcodes can be detected efficiently and accurately using Y-Net on UHR images. With pseudo scores as confidence scores, our approach outperforms existing detection pipelines at a much lower latency. In future work, we aim to extend this method to multi-class detection of small objects in UHR images and videos in a weakly supervised fashion.


  • [1] A. Bochkovskiy, C. Wang, and H. M. Liao (2020) YOLOv4: optimal speed and accuracy of object detection. External Links: 2004.10934 Cited by: §1, Table 1, §3, §3.
  • [2] C. Creusot and A. Munawar (2015) Real-time barcode detection in the wild. In IEEE Winter Conference on Applications of Computer Vision, pp. 239–245. Cited by: §1.
  • [3] C. Creusot and A. Munawar (2016) Low-computation egocentric barcode detector for the blind. In 2016 IEEE International Conference on Image Processing (ICIP), pp. 2856–2860. Cited by: §1, Table 2, §3.
  • [4] O. Gallo and R. Manduchi (2011) Reading 1d barcodes with mobile phones using deformable templates. IEEE Transactions on Pattern Analysis and Machine Intelligence 33, pp. 1834–1843. Cited by: §1.
  • [5] D. K. Hansen, K. Nasrollahi, C. B. Rasmussen, and T. B. Moeslund (2017) Real-time barcode detection and classification using deep learning.. IJCCI 1, pp. 321–327. Cited by: §1, Table 2, §3.
  • [6] L. Hock, H. Hanaizumi, and E. Ohbuchi (2004) Barcode readers using the camera device in mobile phones. In 2004 International Conference on Cyberworlds, pp. 260–265. Cited by: §1.
  • [7] M. Holschneider, R. Kronland-Martinet, J. Morlet, and Ph. Tchamitchian (1987) A real-time algorithm for signal analysis with the help of the wavelet transform. In Wavelets: Time-Frequency Methods and Phase Space. Proceedings of the International Conference. Cited by: §2.2.
  • [8] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask R-CNN. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988. External Links: Document Cited by: §1, Table 1, Table 3, §3, §3.
  • [9] M. Katona and L. G. Nyúl (2013) Efficient 1d and 2d barcode detection using mathematical morphology. In Mathematical Morphology and Its Applications to Signal and Image Processing, pp. 464–475. Cited by: §1.
  • [10] A. Namane and M. Arezki (2017) Fast real time 1d barcode detection from webcam images using the bars detection method. In Proceedings of the World Congress on Engineering (WCE), Vol. 1. Cited by: §1, §1, Table 2, §3.
  • [11] S. Ren, K. He, R. Girshick, and J. Sun (2017) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6), pp. 1137–1149. External Links: Document Cited by: §1, §2.1, §3.
  • [12] S. Wachenfeld, S. Terlunen, and X. Jiang (2010) Robust 1-d barcode recognition on camera phones and mobile product information display. In Mobile Multimedia Processing: Fundamentals, Methods, and Applications, X. Jiang (Ed.), pp. 53–69. External Links: ISBN 978-3-642-12349-8, Document, Link Cited by: §1.
  • [13] G. Sörös and C. Flörkemeier (2013) Blur-resistant joint 1d and 2d barcode localization for smartphones. In Proceedings of the 12th International Conference on Mobile and Ubiquitous Multimedia, MUM ’13. Cited by: §1.
  • [14] A. Zamberletti, I. Gallo, M. Carullo, and E. Binaghi (2010) Neural image restoration for decoding 1-d barcodes using common camera phones. Vol. 1, pp. 5–11. Cited by: §1, §1.
  • [15] A. Zharkov and I. Zagaynov (2019) Universal barcode detector via semantic segmentation. 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 837–843. Cited by: §1, Table 2, §3.