At least 1.5 million preventable adverse drug events (ADEs) occur each year in the U.S. Medication errors related to ADEs can occur while writing or filling prescriptions, or even when taking or managing medications. For example, pharmacists must verify thousands of pills dispensed daily in the pharmacy, a process that is largely manual and prone to error. Chronically ill and elderly patients often separate pills from their original prescription bottles, which can lead to confusion and misadministration. Many errors are preventable; however, pills with visually similar characteristics are difficult to identify or distinguish, increasing error potential.
The pill images available for vision-based approaches fall into two categories: reference and consumer images (Figure 1). The reference images are taken with controlled lighting and backgrounds, and with professional equipment. For most of the pills, one image per side (two images per pill type) is available from the National Institutes of Health (NIH) Pillbox dataset. The consumer images are taken in real-world settings with varying lighting, backgrounds, and equipment. Building pill-image datasets, especially for prescription medications, is costly and faces additional regulatory obstacles, as prescription medications require a clinician’s order to be dispensed. Labeled datasets are scarce, and even unlabeled images are hard to collect. This setup requires the model to learn representations from very few training examples. Existing benchmarks are either not publicly available or small in scale, which makes them unsuitable for developing real-world pill identification systems.
Over the years, deep learning has achieved unprecedented performance on image recognition tasks, driven by both large-scale labeled datasets [41, 66] and modeling improvements [18, 30, 43]; however, fine-grained visual categorization (FGVC), which requires distinguishing subtle differences among visually similar categories, remains challenging. FGVC tasks mainly include natural categories (e.g., birds [50, 52], dogs, and plants [38, 56]) and man-made objects (e.g., cars [29, 58] and airplanes). In this work, we target the pill identification task, an under-explored FGVC task that is nonetheless important and frequent in healthcare settings. The distributions of pill shape and color are skewed toward certain categories (Figure 2), highlighting the importance of distinguishing subtle differences such as materials, imprinted text, and symbols.
The main contribution of this paper is ePillID, a new pill identification benchmark with a real-world low-shot recognition setting. Leveraging two existing NIH datasets, our benchmark is composed of 13k images representing 8184 appearance classes (two sides for 4092 pill types). This is a low-shot fine-grained challenge because (1) for most of the appearance classes there exists only one image and (2) many pills have extremely similar appearances. Furthermore, we empirically evaluate various approaches on the benchmark to serve as baselines. The baseline models include standard image classification approaches and metric-learning-based approaches. Finally, we present an error analysis to motivate future research directions.
2 Related Work
Image-based Pill Identification:
In 2016, the NIH held a pill image recognition challenge and released a dataset composed of 1k pill types. The challenge winner proposed a deep similarity-based approach and developed a mobile-ready version using knowledge-distillation techniques. Following the competition, classification-based approaches were applied, with higher recognition performance reported [12, 48, 57]. Aside from deep neural networks, various approaches with feature engineering have been proposed [6, 11, 31, 37, 57]. For instance, the Hu moment was applied [11, 31] because of its rotation-invariant properties. Other methods [7, 47, 62, 61] were proposed to generate imprint features, recognizing the importance of imprints for pill identification. These methods achieved remarkable success; however, the lack of benchmarks has hindered the development of methods for real-world low-shot settings. Wong et al. created a 5284-image dataset of 400 pill types with a QR-like board to rectify geometric and color distortions. Yu et al. collected 12,500 images of 2500 pill categories. Unfortunately, neither of these datasets is publicly available.
Fine-Grained Visual Categorization (FGVC):
In many FGVC datasets [5, 3, 25, 9], the number of categories is not extremely large, often fewer than 200. Recent large-scale datasets [17, 14, 51] offer large numbers of categories with many images (e.g., training images for 5089 categories) with challenging long-tailed distributions. Compared to other FGVC benchmarks, the data distribution of the ePillID benchmark imposes a low-shot setting (one image for most of the classes) with a large number of classes (8k appearance classes). Among the many algorithms [59, 55, 65] proposed for FGVC tasks, bilinear models [28, 32, 64] achieved remarkable performance by capturing higher-order interactions between feature channels. B-CNN is one of the first such approaches; it obtains full bilinear features by calculating the outer product at each location of the feature map, followed by pooling across all locations. However, the full bilinear features can be very high dimensional (e.g., over 250k when the input has 512 channels). Compact Bilinear Pooling (CBP) addresses the dimensionality issue by approximating bilinear features with only a few thousand dimensions, and it was shown to outperform B-CNN in few-shot scenarios. Another line of work is metric learning [40, 10, 45], where an embedding space that captures semantic similarities among classes and images is learned. Metric learning has also been used successfully in few- and low-shot settings [54, 46], making it suitable for our ePillID benchmark.
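To make the dimensionality argument concrete, the following is a minimal sketch of full bilinear pooling in plain Python. It assumes the feature map is given as a list of per-location channel vectors and that the pooling step is an average; the function name `bilinear_pool` and the toy input are illustrative and are not taken from the B-CNN implementation. Normalization steps that are often applied afterwards are omitted.

```python
def bilinear_pool(feature_map):
    """Average the outer product x x^T over all spatial locations.

    feature_map: list of locations, each a list of C channel activations.
    Returns a flattened vector of length C * C.
    """
    num_locations = len(feature_map)
    channels = len(feature_map[0])
    pooled = [0.0] * (channels * channels)
    for x in feature_map:
        for i in range(channels):
            for j in range(channels):
                pooled[i * channels + j] += x[i] * x[j]
    return [value / num_locations for value in pooled]


# Toy feature map with 2 locations and 3 channels (values are made up).
toy_map = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
features = bilinear_pool(toy_map)

# The output size grows quadratically with the channel count:
# 512 channels give 512 * 512 = 262144 dimensions ("over 250k").
full_dim_512 = 512 * 512
```

With 3 channels the pooled vector has 9 entries; the same quadratic growth is what motivates compact approximations such as CBP.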
3 ePillID Benchmark
We construct a new pill identification benchmark, ePillID, by leveraging the NIH challenge dataset and the NIH Pillbox dataset. We use the challenge dataset, which offers consumer images, as a base dataset and extend it with the reference images from the Pillbox dataset. In total, the ePillID dataset includes 3728 consumer images for 1920 appearance classes (two sides for 960 pill types) and 8184 reference images (two sides for 4092 pill types). This requires a fine-grained low-shot setup, where models have access to one reference image for each of the 8184 appearance classes; however, there exist only a few consumer images, covering 1920 appearance classes (Figure 3).
The consumer images are split into 80% training and 20% holdout sets in such a way that the pill types are mutually exclusive. The training set is further split on pill types for 4-fold cross-validation. The models have access to reference images for all of the 4092 pill types, but consumer images are unavailable for most pill types during training. To evaluate the performance in situations where both front- and back-sides are available as input, we construct two-sided queries by enumerating all possible consumer image pairs for each pill type.
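The split and the two-sided query construction can be sketched as follows. This is only an illustration under stated assumptions: the record fields (`image_id`, `pill_type`, `side`), the helper names, and the choice to apply the holdout fraction to pill types with a seeded shuffle are all hypothetical, since the text does not specify how the split is drawn.

```python
import itertools
import random


def split_by_pill_type(records, holdout_fraction=0.2, seed=0):
    """Split records so that no pill type appears in both sets."""
    pill_types = sorted({r["pill_type"] for r in records})
    random.Random(seed).shuffle(pill_types)
    num_holdout = max(1, round(len(pill_types) * holdout_fraction))
    holdout_types = set(pill_types[:num_holdout])
    train = [r for r in records if r["pill_type"] not in holdout_types]
    holdout = [r for r in records if r["pill_type"] in holdout_types]
    return train, holdout


def two_sided_queries(records):
    """Enumerate every (front, back) image pair within each pill type."""
    by_type = {}
    for r in records:
        by_type.setdefault(r["pill_type"], {"front": [], "back": []})
        by_type[r["pill_type"]][r["side"]].append(r["image_id"])
    queries = []
    for pill_type, sides in sorted(by_type.items()):
        for front, back in itertools.product(sides["front"], sides["back"]):
            queries.append((pill_type, front, back))
    return queries


# Hypothetical records: 5 pill types, each with 2 front and 1 back image.
records = [
    {"image_id": f"{t}_{side}{k}", "pill_type": t, "side": side}
    for t in ["A", "B", "C", "D", "E"]
    for side, count in (("front", 2), ("back", 1))
    for k in range(count)
]
train, holdout = split_by_pill_type(records)
queries = two_sided_queries(holdout)
```

Splitting on pill types rather than on images is what keeps the holdout types unseen as consumer images during training.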
For each query, a model produces an ordered list of pill types with confidence scores. Note that, for the experiments with both sides of a pill, a query consists of a pair of front and back images. We use Mean Average Precision (MAP) and Global Average Precision (GAP) to evaluate model performance. For MAP, the average precision score is calculated separately for each query, and the mean over queries is reported. MAP measures the ability to predict the correct pill types given queries. For GAP, all the query and pill-type pairs are treated independently, and the average precision score is calculated globally. GAP measures both the ranking performance and the consistency of the confidence scores, i.e., the ability to use a common threshold across different queries. We also calculate MAP@1 and GAP@1, where only the top pill type per query is considered.
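The difference between the two metrics can be illustrated with a small sketch in plain Python. The helper names and the toy scores are hypothetical, and ties are broken by input order, which is an assumption rather than something the text specifies.

```python
def average_precision(scores, labels):
    """Average precision of binary labels ranked by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    num_positives = sum(labels)
    if num_positives == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / num_positives


def mean_average_precision(queries):
    """MAP: average precision per query, then the mean over queries."""
    return sum(average_precision(s, l) for s, l in queries) / len(queries)


def global_average_precision(queries):
    """GAP: pool all (query, pill type) pairs and rank them together."""
    all_scores = [score for s, _ in queries for score in s]
    all_labels = [label for _, l in queries for label in l]
    return average_precision(all_scores, all_labels)


# Two toy queries over three pill types: (confidence scores, 0/1 labels).
toy_queries = [
    ([0.9, 0.6, 0.1], [1, 0, 0]),   # correct type first, high scores
    ([0.1, 0.3, 0.05], [0, 1, 0]),  # correct type first, low scores
]
map_score = mean_average_precision(toy_queries)
gap_score = global_average_precision(toy_queries)
```

In this toy case both queries rank the correct type first, so MAP is 1.0, while GAP is lower because the top score of the second query falls below a wrong candidate of the first query. This is the sense in which GAP also rewards scores that are comparable across queries.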
Table 1: Baseline results on the ePillID benchmark for plain classification (both-sides input) and multi-head metric learning (both-sides and single-side input). Means and standard deviations of the holdout metrics from the 4-fold cross-validation are reported in percentages.
We first introduce our baseline approaches, then present quantitative and qualitative results.
4.1 Baseline Models
We use ResNet and DenseNet models as base networks pretrained on ImageNet for initial weights. In addition to using the global average pooling layer as features, we evaluate two bilinear methods, B-CNN and CBP, applied to the final pooling layer of the base network. For CBP, we use the Tensor Sketch projection with 8192 dimensions, which was suggested for reaching close to maximum accuracy. We insert a 1x1 convolutional layer before the pooling layer to reduce the dimensionality to 256.
As a first set of baselines, we train the models with the standard softmax cross-entropy loss. For regularization, we add a dropout layer with probability 0.5 before the final classification layer. We use the appearance classes as target labels during training, and take the max of the softmax scores to calculate pill-type confidence scores. For the two-sided evaluation, the mean confidence score is used as the score between a two-sided input and a pill type.
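The score aggregation for the plain classification baselines can be sketched as follows. The mapping `class_to_type`, the function names, and the logits are hypothetical; the sketch assumes each pill type owns one appearance class per side and that the two-sided score averages the per-image pill-type scores.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pill_type_scores(logits, class_to_type):
    """Max softmax score over the appearance classes of each pill type."""
    scores = {}
    for prob, pill_type in zip(softmax(logits), class_to_type):
        scores[pill_type] = max(prob, scores.get(pill_type, 0.0))
    return scores


def two_sided_scores(front_logits, back_logits, class_to_type):
    """Mean of the per-image pill-type scores for a front/back pair."""
    front = pill_type_scores(front_logits, class_to_type)
    back = pill_type_scores(back_logits, class_to_type)
    return {t: (front[t] + back[t]) / 2.0 for t in front}


# Hypothetical setup: 4 appearance classes = 2 pill types x 2 sides.
class_to_type = ["pill_A", "pill_A", "pill_B", "pill_B"]
front_logits = [2.0, 0.0, 0.0, 0.0]  # made-up classifier outputs
back_logits = [0.0, 2.0, 0.0, 0.0]
scores = two_sided_scores(front_logits, back_logits, class_to_type)
best_type = max(scores, key=scores.get)
```

Taking the max over a pill type's appearance classes lets either side of the pill account for a single query image.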
Multi-head Metric Learning:
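As another set of baselines, we employ a combination of four losses to learn an embedding space optimized for fine-grained low-shot recognition. The setup is a multi-task training procedure:
$$\mathcal{L} = \lambda_{\mathrm{SCE}}\,\mathcal{L}_{\mathrm{SCE}} + \lambda_{\mathrm{CSCE}}\,\mathcal{L}_{\mathrm{CSCE}} + \lambda_{\mathrm{triplet}}\,\mathcal{L}_{\mathrm{triplet}} + \lambda_{\mathrm{contrastive}}\,\mathcal{L}_{\mathrm{contrastive}}$$
where $\mathcal{L}_{\mathrm{SCE}}$ indicates softmax cross-entropy, $\mathcal{L}_{\mathrm{CSCE}}$ cosine-softmax loss (ArcFace), $\mathcal{L}_{\mathrm{triplet}}$ triplet loss, and $\mathcal{L}_{\mathrm{contrastive}}$ contrastive loss. The loss weights $\lambda_{\mathrm{SCE}}$, $\lambda_{\mathrm{CSCE}}$, $\lambda_{\mathrm{triplet}}$, and $\lambda_{\mathrm{contrastive}}$ are chosen empirically with a ResNet50 model (Section 4.1). To compute the loss for every mini-batch, the triplet and contrastive losses require additional sampling and pairing procedures. We apply online hard-example mining to find informative negatives for the triplets and pairs evaluated by these losses, respectively. The trained model is used to generate embeddings for the query consumer images and all the reference images. We calculate cosine similarities between the query and reference embeddings and use them as confidence scores.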
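The inference step, in which cosine similarities between embeddings serve as confidence scores, can be sketched in plain Python. The embeddings below are made up, and the helper names are illustrative; how similarities to individual reference images are aggregated into pill-type scores is not covered here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_references(query_embedding, reference_embeddings):
    """Return (reference id, similarity) pairs, most similar first."""
    scored = [
        (ref_id, cosine_similarity(query_embedding, emb))
        for ref_id, emb in reference_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: -pair[1])


# Made-up 3-dimensional embeddings; real ones come from the trained model.
references = {
    "pill_A_front": [1.0, 0.0, 0.0],
    "pill_B_front": [0.0, 1.0, 0.0],
    "pill_C_front": [0.6, 0.8, 0.0],
}
query = [0.9, 0.1, 0.0]
ranking = rank_references(query, references)
```

Because scoring only needs reference embeddings, pill types without any consumer training images can still be ranked, which is what makes this formulation a natural fit for the low-shot setting.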
The images are cropped and resized to a fixed resolution. Extensive data augmentation is employed to mimic consumer-image-like variations, including rotation and perspective transformation. The Adam optimizer is used, and the learning rate is halved whenever a plateau in the validation GAP score is detected. The model hyper-parameters are chosen to optimize the average validation GAP score using the 4-fold cross-validation. The loss weights for the metric learning are determined based on the ResNet50 experiment (Table 2). Training uses mini-batches of 48 images on a machine equipped with an Intel Xeon E5-2690 CPU, 112GB RAM, one NVIDIA Tesla P40 GPU, CUDA 8.0, and PyTorch 0.4.1.
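The learning-rate schedule described above can be sketched as a small framework-free class. The `patience` value, the placeholder initial rate, and the validation GAP sequence are all assumptions made for illustration; the text only states that the rate is halved when the validation GAP plateaus.

```python
class HalveOnPlateau:
    """Halve the learning rate when a maximized metric stops improving."""

    def __init__(self, initial_lr, patience=2):
        self.lr = initial_lr
        self.patience = patience  # assumed value; not given in the text
        self.best = float("-inf")
        self.stale_epochs = 0

    def step(self, validation_gap):
        """Update with the latest validation GAP and return the new rate."""
        if validation_gap > self.best:
            self.best = validation_gap
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr *= 0.5
                self.stale_epochs = 0
        return self.lr


# Placeholder initial rate and made-up validation GAP values per epoch.
scheduler = HalveOnPlateau(initial_lr=1.0, patience=2)
history = [scheduler.step(gap) for gap in [0.50, 0.60, 0.59, 0.60, 0.70]]
```

In PyTorch, a comparable behavior is likely obtainable with `torch.optim.lr_scheduler.ReduceLROnPlateau` using `mode="max"` and `factor=0.5`, although the exact configuration used for the experiments is not stated.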
Table 2: Loss weights for the contrastive, triplet, SCE, and CSCE losses and the resulting validation GAP (ResNet50).
4.2 Quantitative Results
In Table 1, we report the baseline results on the ePillID benchmark. The plain classification baselines performed much worse than the metric learning baselines, suggesting the difficulty of the low-shot fine-grained setting. The performance differences among the models are consistent with the ImageNet pretraining performance. The multi-head metric learning models performed remarkably well, achieving over 95% MAP and 90% GAP@1. In most cases, bilinear pooling methods outperformed their global average pooling counterparts, showing the representation power of the bilinear features. As an ablation study, we report single-side experiment results, i.e., with only one image per query. The ResNet152 metric learning approach achieved over 85% MAP and 82% GAP@1; however, the gap relative to the both-sides results indicates that both sides are required for accurate identification.
4.3 Qualitative Results
Figure 4 depicts qualitative comparisons on examples from the ePillID holdout dataset with confidence scores. In (a), the plain classification approaches misclassified the query, whereas the metric learning approaches identified it successfully, even with the challenging lighting and background variations in the query images. In (b), only the metric learning approach with CBP identified the pill correctly, suggesting that CBP was effective for capturing the small difference in the imprinted text. For (c) and (d), all four models failed to identify the correct types. In (c), the consumer images are affected by lighting variations on the shiny pill material. In (d), the pill types share extremely similar appearances, except for a one-character difference in the imprinted text.
We introduced ePillID, a low-shot fine-grained benchmark for pill image recognition. To our knowledge, this is the first publicly available benchmark that can be used to develop pill identification systems in a real-world low-shot setting. We empirically evaluated various baseline models on the benchmark. The multi-head metric learning approach performed remarkably well; however, our error analysis suggests that these models still cannot reliably distinguish confusing pill types. In the future, we plan to integrate optical character recognition (OCR) models. OCR integration has been explored for storefront and product FGVC tasks [23, 2], and recent advances in scene text recognition are promising [24, 16, 34]; however, existing OCR models are unlikely to perform reliably on pills as they stand. Challenging differences include low-contrast imprinted text, irregular-shaped layouts, lexicon, and pill materials such as capsules and gels. Finally, we plan to extend the benchmark with more pill types and images as we collect more data. By releasing this benchmark, we hope to support further research on this under-explored yet important task in healthcare.
-  (2009) Medication errors: what they are, how they happen, and how to avoid them. QJM: An International Journal of Medicine 102 (8), pp. 513–521. Cited by: §1.
-  (2018) Integrating scene text and visual appearance for fine-grained image classification. IEEE Access 6, pp. 66322–66335. Cited by: §5.
-  (2019) The iwildcam 2019 challenge dataset. arXiv preprint arXiv:1907.07617. Cited by: §2.
-  (2018) Benchmark analysis of representative deep neural network architectures. IEEE Access 6, pp. 64270–64277. Cited by: §4.2.
Food-101–mining discriminative components with random forests. In ECCV, pp. 446–461. Cited by: §2.
-  (2012) Automatic identification of prescription drugs using shape distribution models. In Proceedings of the IEEE International Conference on Image Processing, pp. 1005–1008. Cited by: §2.
-  (2013) A new accurate pill recognition system using imprint information. In Proceedings of the International Conference on Machine Vision, Vol. 9067, pp. 906711. Cited by: §2.
-  (2005) Learning a similarity metric discriminatively, with application to face verification. In CVPR, Vol. 1, pp. 539–546. Cited by: §4.1.
Large scale fine-grained categorization and domain-specific transfer learning. In CVPR, pp. 4109–4118. Cited by: §2.
-  (2016) Fine-grained categorization and dataset bootstrapping using deep metric learning with humans in the loop. In CVPR, pp. 1153–1162. Cited by: §2.
-  (2014) Helpmepills: a mobile pill recognition tool for elderly persons. Procedia Technology 16, pp. 1523–1532. Cited by: §2.
-  (2019) Fast and accurate medication identification. npj Digital Medicine 2 (1), pp. 10. Cited by: §2.
ArcFace: additive angular margin loss for deep face recognition. In CVPR, Cited by: §4.1.
-  (2019) Butterflies and Moths. Note: CVPR Workshop, https://sites.google.com/view/fgvc6/competitions/butterflies-moths-2019 Cited by: §2.
-  (2016) Compact bilinear pooling. In CVPR, pp. 317–326. Cited by: §2, §4.1.
-  (2017) ICDAR 2017 robust reading challenge on coco-text. In International Conference on Document Analysis and Recognition, Vol. 1, pp. 1435–1443. Cited by: §5.
-  (2019) iMaterialist Challenge. Note: CVPR Workshop, https://github.com/malongtech/imaterialist-product-2019 Cited by: §2.
-  (2016) Deep Residual Learning for Image Recognition. In CVPR, Cited by: §1, §4.1.
-  (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §2.
Visual pattern recognition by moment invariants. IRE transactions on information theory 8 (2), pp. 179–187. Cited by: §2.
-  (2017) Densely connected convolutional networks. In CVPR, pp. 4700–4708. Cited by: §4.1.
-  P. Aspden, J. Wolcott, J. L. Bootman, and L. R. Cronenwett (Eds.) (2007) Preventing medication errors. The National Academies Press, Washington, DC. Cited by: §1.
-  (2016) Words matter: scene text for image classification and retrieval. IEEE Transactions on Multimedia 19 (5), pp. 1063–1076. Cited by: §5.
-  (2015) ICDAR 2015 competition on robust reading. In International Conference on Document Analysis and Recognition, pp. 1156–1160. Cited by: §5.
-  (2019) FoodX-251: a dataset for fine-grained food classification. arXiv preprint arXiv:1907.06167. Cited by: §2.
-  (2011) Novel dataset for fgvc: stanford dogs. In CVPR Workshop, Vol. 1. Cited by: §1.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
-  (2017) Low-rank bilinear pooling for fine-grained classification. In CVPR, pp. 365–374. Cited by: §2.
-  (2013) 3d object representations for fine-grained categorization. In ICCV Workshop, pp. 554–561. Cited by: §1.
Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105. Cited by: §1.
-  (2012) Pill-id: matching and retrieval of drug pill images. Pattern Recognition Letters 33 (7), pp. 904–910. Cited by: §2.
-  (2018) Towards faster training of global covariance pooling networks by iterative matrix square root normalization. In CVPR, pp. 947–955. Cited by: §2.
-  (2015) Bilinear cnn models for fine-grained visual recognition. In ICCV, pp. 1449–1457. Cited by: §2, §4.1.
-  (2018) Fots: fast oriented text spotting with a unified network. In CVPR, pp. 5676–5685. Cited by: §5.
-  (2013) Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151. Cited by: §1.
-  (2016) Pillbox. Note: https://pillbox.nlm.nih.gov Cited by: §1, §3.
-  (2018) CoforDes: an invariant feature extractor for the drug pill identification. In IEEE International Symposium on Computer-Based Medical Systems, pp. 30–35. Cited by: §2.
Automated flower classification over a large number of classes. Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729. Cited by: §1.
A family of contextual measures of similarity between distributions with application to image retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2358–2365. Cited by: §3.
-  (2015) Fine-grained visual categorization via multi-stage metric learning. In CVPR, pp. 3716–3724. Cited by: §2.
-  (2015) Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252. Cited by: §1, §4.1.
-  (2015) FaceNet: a unified embedding for face recognition and clustering. In CVPR, pp. 815–823. Cited by: §4.1.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1.
Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §4.1.
-  (2018) Multi-attention multi-class constraint for fine-grained image recognition. In ECCV, pp. 805–821. Cited by: §2.
-  (2018) Learning to compare: relation network for few-shot learning. In CVPR, Cited by: §2.
Pill image binarization for detecting text imprints. In International Joint Conference on Computer Science and Software Engineering, pp. 1–6. Cited by: §2.
-  (2015) Going deeper with convolutions. In CVPR, pp. 1–9. Cited by: §2.
-  (2016) Pill image recognition challenge. Note: https://pir.nlm.nih.gov/challenge/ Cited by: §3.
-  (2015) Building a bird recognition app and large scale dataset with citizen scientists: the fine print in fine-grained dataset collection. In CVPR, pp. 595–604. Cited by: §1.
-  (2018) The inaturalist species classification and detection dataset. In CVPR, pp. 8769–8778. Cited by: §2.
-  (2011) The caltech-ucsd birds-200-2011 dataset. Cited by: §1.
-  (2014) Learning fine-grained image similarity with deep ranking. In CVPR, pp. 1386–1393. Cited by: §2.
-  (2019) Multi-similarity loss with general pair weighting for deep metric learning. In CVPR, Cited by: §2.
-  (2018) Learning a discriminative filter bank within a cnn for fine-grained recognition. In CVPR, pp. 4148–4157. Cited by: §2.
-  (2016) Cataloging public objects using aerial and street-level images-urban trees. In CVPR, pp. 6014–6023. Cited by: §1.
-  (2017) Development of fine-grained pill identification algorithm using deep convolutional network. Journal of biomedical informatics 74, pp. 130–136. Cited by: §2.
-  (2015) A large-scale car dataset for fine-grained categorization and verification. In CVPR, pp. 3973–3981. Cited by: §1.
-  (2018) Learning to navigate for fine-grained classification. In ECCV, pp. 420–435. Cited by: §2.
-  (2016) The national library of medicine pill image recognition challenge: an initial report. In IEEE Applied Imagery Pattern Recognition Workshop, pp. 1–9. Cited by: §2.
-  (2015) Accurate system for automatic pill recognition using imprint information. IET Image Processing 9 (12), pp. 1039–1047. Cited by: §2.
-  (2014) Pill recognition using imprint information by two-step sampling distance sets. In Proceedings of the International Conference on Pattern Recognition, pp. 3156–3161. Cited by: §2.
-  (2017) MobileDeepPill: A Small-Footprint Mobile Deep Learning System for Recognizing Unconstrained Pill Images. In Proceedings of the ACM International Conference on Mobile Systems, Applications, and Services, Niagara Falls, NY, USA, pp. 56–67. Cited by: §2.
-  (2019) Learning deep bilinear transformation for fine-grained image representation. In Advances in Neural Information Processing Systems, pp. 4279–4288. Cited by: §2.
-  (2019) Looking for the devil in the details: learning trilinear attention sampling network for fine-grained image recognition. In CVPR, pp. 5012–5021. Cited by: §2.
-  (2014) . In Advances in Neural Information Processing Systems, pp. 487–495. Cited by: §1.