RPC Dataset Leaderboard
Automatic Check-Out (ACO) has received increasing interest in recent years. An important component of an ACO system is visual item counting, which recognizes the categories and counts of the items chosen by the customers. However, the training of such a system is challenged by a domain adaptation problem: the training data are images of isolated items, while the testing images show collections of items. Existing methods address this problem with data augmentation using synthesized images, but image synthesis produces unrealistic images that harm the training process. In this paper, we propose a new data priming method to solve the domain adaptation problem. Specifically, we first use pre-augmentation data priming, in which we remove the distracting background from the training images and select images with realistic view angles by a pose pruning method. In the post-augmentation step, we train a data priming network using detection and counting collaborative learning, and select the more reliable images from the testing data to train the final visual item tallying network. Experiments on the large-scale Retail Product Checkout (RPC) dataset demonstrate the superiority of the proposed method, i.e., we achieve 80.51% checkout accuracy, outperforming the baseline methods.
Data Priming Network for Automatic Check-Out - ACMMM 2019
The recent success of the Amazon Go system has invigorated interest in Automatic Check-Out (ACO) in supermarkets and grocery stores. With ACO, customers do not need to put items on the conveyor belt and wait in line for a store assistant to scan them. Instead, they can simply collect the chosen items, and an AI-based system will produce the categories and counts of these items and automatically process the purchase. A successful ACO system will revolutionize the way we do our shopping and will have a significant impact on our daily life in the coming years.
The bedrock of an ACO system is visual item counting, which takes images of shopping items as input and outputs a tally of the different categories. With the recent successes of deep learning, deep neural networks are the tool of choice for this task. The training of deep neural networks is predicated on the availability of large annotated datasets. However, unlike other tasks in computer vision such as object detection and recognition, the training of a deep neural network for visual item counting faces the special challenge of domain shift. Specifically, the training data are usually images of individual items under different viewing angles, which are collected using an isolated item sitting on a turntable. As such, the training images may have a distribution different from that of images of shopping items piled together on a surface, see Figure 1. The visual item counting algorithm needs to adapt to the difference between the source domain (images of isolated objects) and the target domain (images of collections of objects).
Existing work attempts to solve this problem with data augmentation. First, images of collections of objects are generated by randomly overlaying images of individual objects. To improve the realism of the target images, the CycleGAN method is used to render realistic shadows and boundaries. However, such a scheme has a serious drawback. The synthesized testing images have a low level of realism due to some unrealistic poses. Such unrealistic images are then used to train deep neural network models, which degrades the final detection accuracy.
In this work, we propose a new strategy, termed data priming, to solve the challenging domain adaptation problem in visual item counting. Instead of simply increasing the data volume by data augmentation as in the previous method, we improve the relevancy of the augmented data in two steps. In the pre-augmentation data priming step, we extract the foreground region from the training images of isolated objects using a coarse-to-fine saliency detection method. Then, we develop a pose pruning method that keeps only images whose configurations are consistent with the target domain as candidates for generating synthesized images of checked-out items with realistic poses. In the post-augmentation data priming step, we construct a data priming network with two heads, one for counting the total number of items and the other for detecting individual objects. Trained on the synthesized images, the data priming network is used to determine the reliability of testing data by detection and counting collaborative learning. The reliable testing data are then selected to train the visual item tallying network. Experiments on the large-scale Retail Product Checkout (RPC) dataset demonstrate a significant improvement in checkout accuracy of the proposed method over the baseline method.
The main contributions of this work are three-fold.
First, we develop a simple and effective pose pruning method to select synthesized checkout samples with realistic poses for training data.
Second, we propose the data priming network by detection and counting collaborative learning to determine the reliability of testing data to guide the training of visual item tallying network.
Third, experiments on the RPC dataset show that our proposed method achieves favorable performance compared to the baselines.
In this section, we review previous works that are relevant to the proposed method.
obtain the saliency map based on the multi-scale features extracted from CNN models. Hu et al. propose a saliency detection method based on the compactness hypothesis, which assumes that salient regions are more compact than the background in terms of both color layout and texture layout. Liu et al. develop a two-stage deep network, where a coarse prediction map is produced and then refined hierarchically and progressively by a recurrent CNN. Tang and Wu develop multiple single-scale fully convolutional networks integrated with chained connections to generate saliency prediction results from coarse to fine. Recently, Hou et al. take full advantage of multi-level and multi-scale features extracted from fully convolutional networks, and introduce short connections to the skip-layer structures within the holistically-nested edge detector.
Data augmentation is a common method used in deep network training to deal with a shortage of training data. Recently, generative models including the variational auto-encoder (VAE) [8, 9] and generative adversarial networks (GANs) [10, 2] have been used to synthesize images similar to those in realistic scenes for data augmentation. Oord et al. propose a new conditional image generation method based on the Pixel-CNN structure. It can be conditioned on feature vectors obtained from descriptive labels or tags, or on latent embeddings created by other networks. In , a layered VAE model with disentangled latent variables is proposed to generate images from visual attributes. Different from VAE, Goodfellow et al. estimate generative models via an adversarial process between two models, where the generative model captures the data distribution and the discriminative model estimates the probability that a sample came from the training data rather than from the generative model. Recently, the CycleGAN model was proposed to learn the mapping between an input image and an output image in different styles.
In training deep learning models, many factors can cause a shift between the domains of the training and testing data, which degrades performance. Domain adaptation uses labeled data in a source domain to handle testing data in a target domain. Recently, several domain adaptation methods have been proposed for visual data. In , the authors learn deep features that are not only discriminative for the main learning task on the source domain but also invariant with respect to the shift between the domains. Saito et al. propose an asymmetric tri-training method for unsupervised domain adaptation, where unlabeled samples are assigned pseudo-labels and the neural networks are trained as if these were true labels. In , a novel Manifold Embedded Distribution Alignment method is proposed to learn a domain-invariant classifier under the principle of structural risk minimization while performing dynamic distribution alignment. The work of adapts Faster R-CNN with both image-level and instance-level domain adaptation components to reduce the domain discrepancy. Qi et al. propose a covariant multimodal attention based multimodal domain adaptation method that adaptively fuses the attended features of different modalities.
The Supermarket Produce Dataset includes product categories of fruits and vegetables, with images in diverse scenes. However, this dataset is not very challenging and does not reflect the difficult aspects of real-life checkout images. SOIL-47 contains product categories, where each category has images taken from different horizontal views. Grozi-120 contains grocery product categories in natural scenes, including images from the web and from the store. Similar to Grozi-120, the Grocery Products Dataset is proposed for grocery product recognition. It consists of grocery products with training images and testing images. The training images are downloaded from the web, and the testing images are collected in natural shelf scenarios. The Freiburg Groceries Dataset collects images of grocery classes using four different smartphone cameras at various stores, apartments and offices in Freiburg, Germany, rather than collecting them from the web. Specifically, the training set consists of images that contain one or more instances of one class, while the testing set contains images of cluttered scenes, each containing objects of multiple classes. Besides, in , the MVTec D2S dataset is proposed for instance-aware semantic segmentation in an industrial domain. It consists of images of object categories with pixel-wise labels.
Different from the aforementioned datasets, the RPC dataset is the largest grocery product dataset to date in terms of both product categories and images. Each image is obtained for a particular instance of a type of product with different appearances and shapes, and the products are divided into sub-categories such as puffed food, instant drink, dessert, gum, milk, personal hygiene and stationery. Specifically, single-product images are taken in an isolated environment as training exemplar images. To capture multiple views of a single product, four cameras are used to cover the top, horizontal, and views of the exemplar on a turntable. Each camera takes photos at fixed angular intervals as the turntable rotates. The resolution of the captured image is . Then, several random products are placed on a white board, and a camera mounted on top takes photos with a resolution of pixels to generate checkout images. Based on the number of products, the testing images are categorized into three difficulty levels, i.e., easy ( categories and instances), medium ( categories and instances), and hard ( categories and instances), each containing images. The dataset provides three different types of annotations for the testing checkout images:
shopping lists that provide the category and count of each item in the checkout image,
point-level annotations that provide the center position and the category of each item in the checkout image,
bounding boxes that provide the location and category of each item.
In this section, we present in detail our data priming scheme for data augmentation in the training of the visual item tallying network for an automatic check-out system. As mentioned in the Introduction, our method has two steps. In the pre-augmentation step, we process the training images of isolated items to remove those with irrelevant poses, which improves the synthesized images. In the post-augmentation step, we introduce a data priming network that helps to sift the images used to train the visual item tallying network.
Since the training images are obtained with exemplar items captured on the turntable, they contain background that distracts the visual item tallying network from the object during training. To remove background noise, we develop a coarse-to-fine saliency based refinement method. Specifically, we first extract the contour of the object using the method of , and remove the edges whose confidence score is below a threshold. Then any holes inside the contour are filled and small isolated regions are removed using mathematical morphology operations such as dilation and erosion. As a last step, we use a median filter to smooth the edges of the masks. A qualitative example of coarse mask generation is shown in Figure 2. Given the coarse masks, we employ the saliency detection model to extract fine masks with detailed contours of the object. The saliency model is a deep neural network trained on the MSRA-B salient object database. The deep neural network is then fine-tuned on the generated coarse masks of the exemplars. We use the resulting masks to extract the foreground objects used in the synthesis of testing checkout images.
Since the testing image contains multiple objects while the training image contains only a single object, we use the segmented isolated items to create synthesized checkout images. However, not all poses of the isolated items are viable in checkout images. For example, it is difficult to put bag-like products on the checkout table with the view from bottom to top, as shown in Figure 3. To remove these inappropriate poses of exemplars, we propose a simple metric based on the ratio of areas, computed from the area of the item mask captured by each view in each category. If the ratio is less than a pre-set threshold, the area of this pose is too small for the item to rest stably on the checkout table, i.e., the pose is unrealistic. Otherwise, we regard the pose as realistic.
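The pose pruning rule can be illustrated with a minimal sketch. The text does not preserve the exact normalization of the area ratio, so the code below assumes that each view's mask area is divided by the largest mask area among the views of the same category. The function name `prune_poses`, the example areas and the threshold value `0.5` are all hypothetical.

```python
from typing import Dict, List


def prune_poses(mask_areas: Dict[str, List[float]], threshold: float) -> Dict[str, List[int]]:
    """Return, per category, the indices of the views kept as realistic poses.

    Assumption (not confirmed by the text): the ratio of a view is its mask
    area divided by the largest mask area among the views of its category.
    """
    kept: Dict[str, List[int]] = {}
    for category, areas in mask_areas.items():
        largest = max(areas) if areas else 0.0
        if largest <= 0.0:
            # No usable mask for this category.
            kept[category] = []
            continue
        # A ratio below the threshold marks the pose as unrealistic.
        kept[category] = [v for v, a in enumerate(areas) if a / largest >= threshold]
    return kept


# Hypothetical mask areas (in pixels) for three views of a bag-like item
# and two views of a can.
areas = {"bag_snack": [1200.0, 1100.0, 300.0], "can_drink": [800.0, 760.0]}
kept = prune_poses(areas, threshold=0.5)
print(kept)
```

In this toy input, the third view of the bag-like item covers only a quarter of the largest view's area, so it is discarded as an unstable pose.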
After obtaining the selected segmented items, we synthesize the checkout images using the method in . Specifically, segmented items are randomly selected and freely placed (i.e., with random angles and scales) on a prepared background image such that the occlusion rate of each instance stays below a given bound. Thus the synthesized images are similar to the checkout images in terms of item placement.
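As a rough illustration of the placement step, the sketch below samples a random angle, scale and position for each item and rejects a placement if it would push the occlusion rate of an already placed item above a bound. It is a simplification under stated assumptions, not the synthesis method referenced above: items are approximated by axis-aligned boxes, the rotation angle is sampled but ignored in the overlap test, and the occluded area is accumulated as a sum of pairwise overlaps, which is an upper bound. All names and numeric values are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Placement:
    x: float
    y: float
    w: float
    h: float
    angle: float
    scale: float
    occluded: float = 0.0  # upper bound on the area covered by later items


def overlap_area(a: Placement, b: Placement) -> float:
    """Intersection area of two axis-aligned boxes."""
    dx = min(a.x + a.w, b.x + b.w) - max(a.x, b.x)
    dy = min(a.y + a.h, b.y + b.h) - max(a.y, b.y)
    return max(dx, 0.0) * max(dy, 0.0)


def place_items(sizes: Sequence[Tuple[float, float]], canvas: Tuple[float, float],
                angle_range: Tuple[float, float], scale_range: Tuple[float, float],
                max_occlusion: float, rng: random.Random,
                max_tries: int = 100) -> List[Placement]:
    placed: List[Placement] = []
    for w, h in sizes:
        for _ in range(max_tries):
            scale = rng.uniform(*scale_range)
            sw, sh = w * scale, h * scale
            if sw > canvas[0] or sh > canvas[1]:
                continue  # scaled item does not fit on the background
            cand = Placement(rng.uniform(0.0, canvas[0] - sw),
                             rng.uniform(0.0, canvas[1] - sh),
                             sw, sh, rng.uniform(*angle_range), scale)
            covers = [overlap_area(cand, p) for p in placed]
            # The new item lies on top, so it can only occlude earlier items.
            if all((p.occluded + c) / (p.w * p.h) <= max_occlusion
                   for p, c in zip(placed, covers)):
                for p, c in zip(placed, covers):
                    p.occluded += c
                placed.append(cand)
                break
    return placed


# Hypothetical settings: five 100x100 items on a 1000x1000 background.
rng = random.Random(0)
layout = place_items([(100.0, 100.0)] * 5, (1000.0, 1000.0),
                     (0.0, 360.0), (0.5, 1.0), 0.5, rng)
print(len(layout))
```

Rejection sampling keeps the sketch short; an item that cannot be placed within `max_tries` attempts is simply skipped.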
We can train a deep neural network for visual item tallying using the rendered synthesized checkout images. However, the rendered images still differ in their characteristics from the actual checkout images. To solve this problem, we propose the Data Priming Network (DPNet), which selects reliable testing samples using a detection and counting collaborative learning strategy to guide the training of the visual item tallying network.
The goal of visual item tallying in ACO is to predict the count and the category of the items in the checkout image. To this end, we introduce a data priming network to select reliable checkout images to facilitate the training. Specifically, the data priming network consists of three components, i.e., a base network, a counter head and a detector head. The proposed network structure is shown in Figure 4. The base network outputs the features shared by the two heads and is implemented using the ResNet-101 backbone with the Feature Pyramid Network (FPN) architecture. Based on the shared features, the counter head predicts the total number of instances using the predicted density map, while the detector head recognizes the location and category of the instances. Starting from the last feature maps of the base network, the counter head consists of several dilated convolutional layers, which extract deeper features without losing resolution, followed by a convolutional layer as the output layer, similar to . Notably, the feature maps are first down-sampled using a strided dilated convolutional layer to reduce computational complexity. The detector head includes fully connected layers to calculate the regression and classification losses from multi-scale feature maps.
The loss function of the proposed network consists of terms for the counter and detector heads. For the counter head, we use the Euclidean distance to measure the difference between the ground-truth density map and the estimated density map. For the detector head, we use the standard cross-entropy loss for classification and the smooth L1 loss for regression. The overall loss sums these terms over the images in a batch. The counter term compares the estimated and ground-truth densities at every location of the input image; both maps are smaller than the input image by a fixed factor. The classification term compares the predicted and ground-truth class labels of every detection, where the label set includes a background class. An indicator function, which equals one for objects and zero for background, gates the regression term, so that we only consider the regression loss of objects; this term compares the regression vectors that represent the parameterized coordinates of the predicted and ground-truth bounding boxes. A weighting factor balances the two terms.
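One way to write a loss consistent with this description is given below. The notation is our own, chosen for illustration: $x_i$ is the $i$-th input image in a batch of size $N$, $\hat{\Theta}$ and $\Theta$ are the estimated and ground-truth densities at location $l$, $\hat{p}_{i,j}$ and $p_{i,j}$ are the predicted and ground-truth class labels of detection $j$ (with $p_{i,j} = 0$ for background), and $\hat{t}_{i,j}$ and $t_{i,j}$ are the corresponding regression vectors. Placing the balancing factor $\lambda$ on the detector term, and omitting normalization constants, are assumptions.

```latex
\mathcal{L} = \sum_{i=1}^{N} \Big(
    \sum_{l} \big\| \hat{\Theta}(l \mid x_i) - \Theta(l \mid x_i) \big\|_2^2
    + \lambda \sum_{j} \big(
        \mathcal{L}_{\mathrm{cls}}(\hat{p}_{i,j}, p_{i,j})
        + \mathbb{1}[p_{i,j} > 0] \, \mathcal{L}_{\mathrm{reg}}(\hat{t}_{i,j}, t_{i,j})
    \big)
\Big)
```

Here $\mathcal{L}_{\mathrm{cls}}$ denotes the cross-entropy loss and $\mathcal{L}_{\mathrm{reg}}$ the smooth L1 loss, as stated in the text.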
To train the DPNet, we need to generate ground-truth density maps. Using the center locations of the extracted item masks, we generate ground-truth density maps for the rendered images following the strategy in . First, we blur the center of each instance using a normalized Gaussian kernel. Then, we generate the ground truth by considering the spatial distribution of all instances in the rendered image. For the detector, both the locations and the labels of the instances simply come from the exemplars in the synthesized images.
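The density map construction can be sketched in a few lines. The version below places one Gaussian per item center and normalizes it so that each item contributes exactly one unit of mass, which makes the sum of the map equal to the item count. The fixed `sigma` and `radius`, and the renormalization of kernels clipped at the image border, are simplifying assumptions; the referenced strategy may instead adapt the kernel to the spatial distribution of the instances.

```python
import math
from typing import List, Sequence, Tuple


def density_map(height: int, width: int, centers: Sequence[Tuple[int, int]],
                sigma: float = 2.0, radius: int = 6) -> List[List[float]]:
    """Build a density map whose sum equals the number of in-image centers.

    `centers` holds (x, y) pixel coordinates of the item centers.
    """
    dmap = [[0.0] * width for _ in range(height)]
    for cx, cy in centers:
        cells = []
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                x, y = cx + dx, cy + dy
                if 0 <= x < width and 0 <= y < height:
                    weight = math.exp(-(dx * dx + dy * dy) / (2.0 * sigma ** 2))
                    cells.append((y, x, weight))
        # Normalize over the cells inside the image so the item sums to 1
        # even when the kernel is clipped at the border.
        total = sum(w for _, _, w in cells)
        for y, x, w in cells:
            dmap[y][x] += w / total
    return dmap


# Three hypothetical item centers on a 32x32 map, one of them in a corner.
dmap = density_map(32, 32, [(5, 5), (20, 10), (0, 31)])
count = sum(map(sum, dmap))
print(round(count, 6))
```

Because every kernel is normalized, integrating the map recovers the number of items, which is the quantity the counter head is trained to reproduce.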
We train the network using detection and counting collaborative learning; the whole procedure is presented in Algorithm 1. First, we train the entire network on the source training set. Here both the counter and the detector are optimized by Eq. (2). Then, we select reliable testing data for which the number of items estimated by the counter head is equal to the number of high-confidence detections produced by the detection head after the NMS operation. In this condition, the estimated number of items is the rounded sum of the estimated density over all locations in the sample, and a detection is counted when its probability exceeds the confidence threshold. Finally, after removing the counter head, the network is fine-tuned on the selected reliable testing data from the target domain and serves as the visual item tallying network.
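The reliability check amounts to comparing two integers per testing sample. The sketch below assumes that the detections passed in have already gone through NMS, that a detection counts when its score is strictly greater than the threshold, and that rounding is to the nearest integer with halves rounded up. The function names and the threshold value `0.95` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

DensityMap = Sequence[Sequence[float]]


def is_reliable(density: DensityMap, scores: Sequence[float],
                conf_threshold: float) -> bool:
    """True if the rounded density sum equals the number of confident detections."""
    counted = int(math.floor(sum(map(sum, density)) + 0.5))
    detected = sum(1 for s in scores if s > conf_threshold)
    return counted == detected


def select_reliable(samples: Sequence[Tuple[DensityMap, Sequence[float]]],
                    conf_threshold: float) -> List[int]:
    """Return the indices of the samples that pass the reliability check."""
    return [i for i, (density, scores) in enumerate(samples)
            if is_reliable(density, scores, conf_threshold)]


# Two toy samples: (estimated density map, detection scores after NMS).
samples = [
    ([[0.9, 1.0, 1.0]], [0.99, 0.97, 0.96, 0.30]),  # counter ~3, detector 3
    ([[1.2, 1.2]], [0.99, 0.98, 0.97]),             # counter ~2, detector 3
]
reliable = select_reliable(samples, conf_threshold=0.95)
print(reliable)
```

Only the samples on which the two heads agree are kept for fine-tuning; disagreement is treated as a sign that at least one head is wrong on that image.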
The proposed method is implemented in PyTorch. The setting for the CycleGAN model is similar to that of . Each GPU processes a fixed number of images per mini-batch, and the number of detections per image is fixed. We train the DPNet using the SGD optimization algorithm with weight decay and momentum. A constant balancing factor is used in Eq. (2). For both the counter head and the detection head, each with its own initial learning rate, the learning rate is kept for the first 120k iterations and then decayed by a constant factor for the next 40k iterations. All experiments are conducted on a workstation with Nvidia TITAN Xp GPUs.
|Clutter mode|Methods|cAcc (↑)|ACD (↓)|mCCD (↓)|mCIoU (↑)|mAP50 (↑)|mmAP (↑)|
|Easy|Render (DPNet(w/o PP))|79.82%|0.31|0.05|95.84%|98.33%|82.05%|
|Easy|Render (DPNet(w/o DP))|85.38%|0.23|0.03|96.82%|98.72%|83.10%|
|Easy|Render (DPNet(w/o DPC))|84.46%|0.23|0.03|96.92%|97.93%|83.22%|
|Easy|Syn+Render (DPNet(w/o DP))|86.58%|0.21|0.03|97.12%|98.62%|83.47%|
|Medium|Render (DPNet(w/o PP))|58.76%|0.74|0.06|94.10%|97.55%|76.05%|
|Medium|Render (DPNet(w/o DP))|70.90%|0.49|0.04|95.90%|98.16%|77.22%|
|Medium|Render (DPNet(w/o DPC))|69.85%|0.50|0.04|95.95%|97.24%|77.09%|
|Medium|Syn+Render (DPNet(w/o DP))|73.20%|0.46|0.04|96.24%|98.19%|77.69%|
|Hard|Render (DPNet(w/o PP))|44.58%|1.20|0.07|93.25%|96.86%|73.62%|
|Hard|Render (DPNet(w/o DP))|56.25%|0.84|0.05|95.28%|97.67%|74.88%|
|Hard|Render (DPNet(w/o DPC))|52.80%|0.86|0.05|95.17%|96.51%|74.77%|
|Hard|Syn+Render (DPNet(w/o DP))|59.05%|0.77|0.04|95.71%|97.77%|75.45%|
|Averaged|Render (DPNet(w/o PP))|60.98%|0.75|0.06|94.05%|97.29%|75.89%|
|Averaged|Render (DPNet(w/o DP))|70.80%|0.52|0.04|95.86%|97.93%|77.07%|
|Averaged|Render (DPNet(w/o DPC))|69.03%|0.53|0.04|95.82%|96.96%|77.09%|
|Averaged|Syn+Render (DPNet(w/o DP))|72.83%|0.48|0.04|96.17%|97.94%|77.56%|
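To evaluate the performance of the proposed method, we use several metrics following . First, the counting error for a specific category in an image is defined as
$$CD_{i,k} = \left| P_{i,k} - GT_{i,k} \right|,$$
where $P_{i,k}$ and $GT_{i,k}$ indicate the predicted count and the ground-truth item number of the $k$-th category in the $i$-th image, respectively. The error over all $K$ categories for the $i$-th image is calculated as
$$CD_i = \sum_{k=1}^{K} CD_{i,k}.$$
Checkout Accuracy (cAcc) is the primary metric for ranking in the ACO task, which is the accuracy when the complete product list is predicted correctly. Over $N$ images, it is calculated as
$$cAcc = \frac{1}{N} \sum_{i=1}^{N} \delta\left( CD_i = 0 \right),$$
where $\delta(\cdot) = 1$ if its argument is true, and $0$ otherwise. The range of the cAcc score is from $0$ to $1$. For example, if $cAcc = 1$, all items are accurately predicted, i.e., $CD_i = 0$ for every image.
Mean Category Intersection over Union (mCIoU) measures the compatibility between the predicted product list and the ground truth. It is defined as
$$mCIoU = \frac{1}{K} \sum_{k=1}^{K} \frac{\sum_{i=1}^{N} \min\left( GT_{i,k}, P_{i,k} \right)}{\sum_{i=1}^{N} \max\left( GT_{i,k}, P_{i,k} \right)}.$$
The range of the mCIoU score is from $0$ to $1$.
Different from cAcc, which only records whether an image contains any counting error, Average Counting Distance (ACD) indicates the average number of counting errors for each image, i.e.,
$$ACD = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} CD_{i,k}.$$
Moreover, the Mean Category Counting Distance (mCCD) is used to calculate the average ratio of counting errors for each category, i.e.,
$$mCCD = \frac{1}{K} \sum_{k=1}^{K} \frac{\sum_{i=1}^{N} CD_{i,k}}{\sum_{i=1}^{N} GT_{i,k}}.$$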
On the other hand, following the evaluation protocols of MS COCO and the ILSVRC 2015 challenge, we use the mean Average Precision (mAP) metrics (i.e., mAP50 and mmAP) to evaluate the performance of the detector. Specifically, mAP50 is computed at the single Intersection over Union (IoU) threshold of 0.50 over all item categories, while mmAP is computed by averaging over all IoU thresholds (i.e., in the range [0.50, 0.95] with a uniform step size of 0.05) of all item categories.
The authors of  provide four baselines for comparison. Specifically, a detector is trained to recognize the items based on the following four kinds of training data.
Single. We train the FPN detector  using training images of isolated items based on the bounding box annotations.
Render. To reduce the domain gap, we employ CycleGAN to translate the synthesized images into the checkout image domain for detector training, resulting in more realistic rendered images.
Syn+Render. We train the detector based on both synthesized and rendered images.
Comparisons with the baseline methods are presented in Table I. More visual examples for the different difficulty levels are shown in Figure 6. The Single method fails at every level because of the huge gap between the exemplars and the checkout images, confirming the problem of domain difference. By combining segmented exemplars into synthesized checkout images, the checkout accuracy is improved in the averaged level. Moreover, a significant boost is achieved by training the detector on rendered images. This is because the GAN method can effectively mimic realistic checkout images in terms of lighting conditions and shadow patterns. Compared to the aforementioned Render baseline method, our DPNet achieves a higher cAcc score in the averaged level when trained only on rendered images. Given the Syn+Render data, the checkout accuracy is further improved at the easy, medium and hard levels compared with the Syn+Render baseline method. This indicates the effectiveness of our approach.
We further perform experiments to study the effect of the different modules of the proposed method by constructing three variants, i.e., DPNet(w/o DPC), DPNet(w/o DP) and DPNet(w/o PP). DPNet(w/o DPC) indicates that the DPNet selects reliable testing data without the counter head. In this case, the reliability checking condition in Eq. (3) is rewritten without the counter term, based on the least number of items in a checkout image (easy mode). DPNet(w/o DP) indicates that we do not use the DPNet for domain adaptation, i.e., the detector is trained on the rendered data only. DPNet(w/o PP) denotes the method that further removes the pose pruning module from DPNet(w/o DP). For a fair comparison, we use the same parameter settings and input size in evaluation. We use all testing checkout images to conduct the experiments.
The Render baseline method uses the Saliency+CRF model to obtain the masks of exemplars. As presented in Table I, our DPNet(w/o PP) method achieves better checkout accuracy on the rendered data. This may be attributed to the better segmentation results obtained by the coarse-to-fine strategy of our DPNet(w/o PP) method.
If we remove the pose pruning module, the checkout accuracy decreases: DPNet(w/o PP) reaches 60.98% compared with 70.80% for DPNet(w/o DP) in the averaged level of Table I. This noticeable performance drop validates the importance of the pose pruning module, which removes the synthesized images that include items with unrealistic poses (see Figure 3).
From Table I, our proposed DPNet achieves better results than its variant DPNet(w/o DP). The increase in checkout accuracy indicates that the data priming method effectively adapts the data from the source domain to the target domain. Besides, DPNet(w/o DPC) performs even slightly worse than DPNet(w/o DP), i.e., 69.03% vs. 70.80% checkout accuracy in the averaged level. Reliable testing data cannot be determined with confidence based on the detection head alone, which results in a large share of unreliable samples among the selected testing data. In contrast, we can select correct reliable testing data based on the proposed DPNet with both counter and detection heads. Moreover, we randomly select images from both the source and target domains and calculate their deep features. We then visualize the distributions of the two domains using the t-Distributed Stochastic Neighbor Embedding (t-SNE) technique, as shown in Figure 7. It can be seen that the distribution of the source domain is similar to that of the target domain after CycleGAN translation. Moreover, the two domains are visually more consistent if we further perform detection and counting collaborative learning.
Similar to the trend in the baseline methods (Render vs. Syn+Render), the performance is consistently improved when training the network using both synthesized and rendered data. Specifically, the “Syn+Render (DPNet)” configuration achieves a higher cAcc score than the “Render (DPNet)” configuration.
In this paper, we propose a new data priming network for automatic check-out. Different from previous domain adaptation methods, we construct both counter and detector heads to measure the reliability of testing images from the target domain. Then, the detector of the target branch can learn a target-discriminative representation from the reliable testing samples using detection and counting collaborative learning, resulting in robust performance. The experiment on the RPC dataset shows that our method significantly surpasses the previous baseline methods in checkout accuracy in the averaged level. For future work, we would like to further study other potential options for the data priming network, including heads for other types of attributes.
J. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in ICCV, 2017, pp. 2242–2251.
Y. Ganin and V. S. Lempitsky, “Unsupervised domain adaptation by backpropagation,” in ICML, 2015, pp. 1180–1189.
Y. Li, X. Zhang, and D. Chen, “CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes,” in CVPR, 2018, pp. 1091–1100.
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li, “ImageNet large scale visual recognition challenge,” IJCV, vol. 115, no. 3, pp. 211–252, 2015.