While electronic commerce continues to make great strides, in-store purchases are likely to remain important in the coming years: 91% of purchases are still made in physical stores (forrester; phy_shopping) and 82% of millennials prefer to shop in these stores (millennial). However, a significant pain point for in-store shopping is the checkout queue: customer satisfaction drops significantly when queuing delays exceed four minutes (checkout_time). To address this, retailers have deployed self-checkout systems (which can increase instances of shoplifting (selfcheckout_time; selfcheckout_theft1; selfcheckout_theft2)) and expensive vending machines.
The most recent innovation is cashier-free shopping, in which a networked sensing system automatically (a) identifies a customer who enters the store, (b) tracks the customer through the store, and (c) recognizes what they purchase. Customers are then billed automatically for their purchases, and do not need to interact with a human cashier or a vending machine, or scan items by themselves. Over the past year, several large online retailers like Amazon and Alibaba (amazon; taobao) have piloted a few stores with this technology, and cashier-free stores are expected to take off in the coming years (cashierfreetrend; retaileradopt). Besides addressing queue wait times, cashier-free shopping is expected to reduce instances of theft, and provide retailers with rich behavioral analytics.
A shopper only needs to grab items and go.
Not much is publicly known about the technology behind cashier-free shopping, other than that stores need to be completely redesigned (amazon; taobao; bingobox), which can require significant capital investment (§2). In this paper, we ask: Is cashier-free shopping viable without having to completely redesign stores? To this end, we observe that many stores already have, or will soon have, the hardware necessary to design a cashier-free shopping system: cameras deployed for in-store security, sensor-rich smart shelves (smart_shelves) that are being deployed by large retailers (smallbig) to simplify asset tracking, and RFID tags being deployed on expensive items to reduce theft. Our paper explores the design and implementation of a practical cashier-free shopping system called Grab using this infrastructure, and quantifies its performance.
Grab needs to accurately identify and track customers, and associate each shopper with items he or she retrieves from shelves. It must be robust to visual occlusions resulting from multiple concurrent shoppers, and to concurrent item retrieval from shelves where different types of items might look similar, or weigh the same. It must also be robust to fraud, specifically to attempts by shoppers to confound identification, tracking, or association. Finally, it must be cost-effective and have good performance in order to achieve acceptable accuracy: specifically, we show that, for vision-based tasks, slower than 10 frames/sec processing can reduce accuracy significantly (§4).
An obvious way to architect Grab is to use deep neural networks (DNNs) for each individual task in cashier-free shopping, such as identification, pose tracking, gesture tracking, and action recognition. However, these DNNs are still relatively slow and many of them cannot process frames at faster than 5-8 fps. Moreover, even if they have high individual accuracy, their effective accuracy would be much lower if they were cascaded together.
Grab’s architecture is based on the observation that, for cashier-free shopping, we can use a single vision capability (body pose detection) as a building block to perform all of these tasks. A recently developed DNN library, OpenPose (openpose), accurately estimates body “skeletons” in a video at high frame rates.
Grab’s first contribution is to develop a suite of lightweight identification and tracking algorithms built around these skeletons (§3.1). Grab uses the skeletons to accurately determine the bounding boxes of faces to enable feature-based face detection. It uses skeletal matching, augmented with color matching, to accurately track shoppers even when their faces might not be visible, or even when the entire body might not be visible. It augments OpenPose’s elbow-wrist association algorithm to improve the accuracy of tracking hand movements, which are essential to determining when a shopper may pick up items from a shelf.
Grab’s second contribution is to develop fast sensor fusion algorithms to associate a shopper’s hand with the item that the shopper picks up (§3.2). For this, Grab uses a probabilistic assignment framework: from cameras, weight sensors and RFID receivers, it determines the likelihood that a given shopper picked up a given item. When multiple concurrent such actions occur, it uses an optimization framework to associate hands with items.
Grab’s third contribution is to improve the cost-effectiveness of the overall system by multiplexing multiple cameras on a single GPU (§3.3). It achieves this by avoiding running OpenPose on every frame, and instead using a lightweight feature tracker to track the joints of the skeleton between successive frames.
Using data from a pilot deployment in a retail store, we show (§4) that Grab has 93% precision and 91% recall even when nearly 40% of shopper actions were adversarial. Grab needs to process video data at 10 fps or faster, below which accuracy drops significantly: a DNN-only design cannot achieve this capability (§4.4). Grab needs all three sensing modalities, and all of its optimizations: removing an optimization, or a sensor, can drop precision and recall by 10% or more. Finally, Grab’s design enables it to multiplex up to 4 cameras per GPU with negligible loss of precision.
2. Approach and Challenges
Cashier-free shopping systems. A cashier-free shopping system automatically determines, for every customer in a shop, what items the customer has picked from the shelves, and directly bills each customer for those items. Cashier-free shopping is achieved using a networked system containing several sensors that together perform three distinct functions: identifying each customer, tracking each customer through the shop, and identifying every item pickup or item dropoff on a shelf to accurately determine which items the customer leaves the store with.
These systems have several requirements. First, they must be non-intrusive in the sense that they must not require customers to wear or carry sensors or any form of electronic identification, since these can detract from the shopping experience. Second, they must be robust to real-world conditions (Figure 1): they must distinguish between items that are visually similar or have other similarities (such as weight), and they must be robust to occlusion. Third, they must be robust to fraud: specifically, they must be robust to attempts by shoppers to circumvent or tamper with sensors used to identify customers, items and the association between customers and items. Finally, they must be cost-effective: they should leverage existing in-store infrastructure to the extent possible, while also being computationally efficient in order to minimize computing infrastructure investments.
Today’s cashier-free shopping systems. Despite widespread reports of cashier-free shopping deployments (amazon; taobao; bingobox; std_cog), not much is known about the details of their design, but they appear to fall into three broad categories.
Vision-Only. This class of systems, exemplified by (std_cog_forbes; std_cog), identifies customers and items, and tracks customers, only using cameras. It trains a deep learning model to recognize customers and objects, and uses this to bill them. However, such a system can fail to distinguish between items that look similar (Figure 1(a)), especially when these items are small in the image (occupy a few pixels), or items that are occluded by other objects (Figure 1(c)) or by the customer.
Vision and Weight. Amazon Go (amazon) uses both cameras and weight sensors on shelves, where the weight sensor can be used to identify when an item is removed from a shelf even if it is occluded from a camera. One challenge such a system faces is the ability to discriminate between items of similar weight (Figure 1(b)). Moreover, their design requires a significant redesign of the store: user check-in gates, an array of cameras on the ceiling and the shelf, and additional sensors at the exit (go-nyt; go-geek). Finally, Amazon Go also reportedly encounters issues when shoppers put back items randomly (amzn_prob).
Vision and RFID. The third class of approaches, used by Taobao Cafe (taobao) and Bingo Box (bingobox), does not track shoppers within the store, but uses vision to identify customers and RFID scanners at a checkout gate that read all the items being carried by the customer. An RFID tag must be attached to each object, and users have to queue at the checkout gate. This approach has drawbacks as well: RFID tags can be expensive relative to the price of some items (rfid_tag), and RFID readers are known to have trouble when scanning tags that are stacked, blocked, or attached to conductors (nikitin2006performance; lu2009performance; nekoogar2011ultra).
Approach and Challenges. While all of these approaches are non-intrusive, it is less clear how well they satisfy other requirements: robustness to real-world conditions and to fraud, and cost-effectiveness. In this paper, with the goal of understanding how well these requirements can be met in practice, we explore the design, implementation, and evaluation of a cashier-free shopping system called Grab, which combines the three technologies described above (vision, weight scales, and RFID). At a high level, Grab combines advances in machine vision with lightweight sensor fusion algorithms to achieve its goals. It must surmount four distinct challenges: (a) how to identify customers in a lightweight yet robust manner; (b) how to track customers through a store even when the customer is occluded by others or the customer’s face is not visible in a camera; (c) how to determine when a customer has picked up an item, and which item the customer has picked up, and to make this determination robust to concurrent item retrievals, customers putting back items, and customers attempting to game the system in various ways; (d) how to meet these challenges in a way that minimizes investments in computing infrastructure.
3. Grab Design
Grab addresses these challenges by building upon a vision-based keypoint-based pose tracker DNN for identification and tracking, together with a probabilistic sensor fusion algorithm for recognizing item pickup actions. These ensure a completely non-intrusive design where shoppers are not required to scan item codes or pass through checkout gates while shopping. Grab consists of four major components (Figure 2).
Identity tracking recognizes shoppers’ identities and tracks their movements within the store. It includes efficient and accurate face and body pose detection and tracking, adapted to work well in occluded environments, and to deal with corner cases in body pose estimation that can increase error in item pickup detection (§3.1).
Action recognition uses a probabilistic algorithm to fuse vision, weight and RFID inputs to determine item pickup or dropoff actions by a customer (§3.2). This algorithm is designed to be robust to real-world conditions and to theft. When multiple users pick up the same type of item simultaneously, the algorithm must determine which customer takes how many items. It must be robust to customers concealing items and to attempts to tamper with the sensors (e.g., replacing an expensive item with an identically weighted item).
GPU multiplexing enables processing multiple cameras on a single GPU (§3.3). DNN-based video processing usually requires a dedicated GPU for each video stream for reasonable performance. Retail stores need tens of cameras, and Grab contains performance optimizations that permit it to multiplex the processing of multiple streams on a single GPU, thereby reducing cost.
Grab also has a fourth, offline component, registration. Customers must register once online before their first store visit. Registration involves taking a video of the customer to enable matching the customer subsequently (§3.1), in addition to obtaining information for billing purposes. If the identity tracking component detects a customer who has not registered, she may be asked to register before buying items from the store.
3.1. Identity tracking
Identity tracking consists of two related sub-components (Figure 3). Shopper identification determines who the shopper is among registered users. A related sub-component, shopper tracking, determines (a) where the shopper is in the store at each instant of time, and (b) what the shopper is doing at each instant.
Requirements and Challenges. In designing Grab, we require first that customer registration be fast, even though it is performed only once: ideally, a customer should be able to register and immediately commence shopping. Identity tracking requires not just identifying the customer, but also detecting each person’s pose, such as hand position and head position. These tasks have been individually studied extensively in the computer vision literature. More recently, with advances in deep learning, researchers in computer vision have developed different kinds of DNNs for people detection (ren2015faster; liu2016ssd; yolo), face detection (zhang2016joint; dlib) and hand gesture recognition (tang2015real; chen2016deep).
Each of these detectors performs reasonably well: e.g., people detectors can process 35 frames per second (fps), face detectors can process 30 fps, and hand recognizers can run at 12 fps. However, Grab requires all of these components. Dedicating a GPU for each component is expensive: recall that a store may have several cameras (Grab proposes to re-purpose surveillance cameras for visual recognition tasks, §1), and using one GPU per detection task per camera is undesirable (§2) as it would require significant investment in computing infrastructure.
The other option is to run these on a single GPU per camera, but this would result in lower frame rates. Lower frame rates can miss shopper actions: as §4.4 shows, at frame rates lower than 10 fps, Grab’s precision and recall can drop dramatically. This highlights a key challenge we face in this paper: designing fast end-to-end identity tracking algorithms that do not compromise accuracy.
Approach. In this paper, we make the following observation: we can build end-to-end identity tracking using a state-of-the-art pose tracker. Specifically, we use, as a building block, a keypoint based body pose tracker, called OpenPose (openpose). Given an image frame, OpenPose detects keypoints for each human in the image. Keypoints identify distinct anatomical structures in the body (Figure 4(a)) such as eyes, ears, nose, elbows, wrists, knees, hips etc. We can use these skeletons for identification, tracking and gesture recognition. OpenPose requires no pose calibration (unlike, say, the Kinect (bin2011study)), so it is attractive for our setting, and is fast, achieving up to 15 fps for body pose detection. (OpenPose also has modes where it can detect faces and hand gestures using many more keypoints than in Figure 4(a), but using these reduces the frame rate dramatically, and also has lower accuracy for shoppers far away from the camera).
However, since OpenPose fundamentally operates only on a single frame, Grab needs to add identification, tracking and gesture recognition algorithms on top of OpenPose to continuously identify and track shoppers and their gestures. The rest of this section describes these algorithms.
Shopper Identification. Grab uses fast feature-based face recognition to identify shoppers. While prior work has explored other approaches to identification such as body features (bai2017scalable; chen2017beyond; zhao2017spindle) or clothing color (lu2001color), we use faces because (a) face recognition has been well-studied by vision researchers and we are likely to see continued improvements, (b) faces are more robust for identification than clothing color, and (c) face features have the highest accuracy in large datasets (§5).
Feature-based face recognition. When a user registers, Grab takes a video of their face, extracts features, and builds a fast classifier using these features. To identify shoppers, Grab does not directly use a face detector on the entire image because traditional Haar-based detectors (lienhart2002extended) can be inaccurate, and recent DNN-based face detectors such as MTCNN (zhang2016joint) can be slow. Instead, Grab identifies a face’s bounding box using keypoints from OpenPose, specifically, the five keypoints of the face from the nose, eyes, and ears (Figure 4(b)). Then, it extracts features from within the bounding box and applies the trained classifier.
Grab must (a) enable fast training of the classifier, since this step is part of the registration process and registration is required to be fast (§3.1), and (b) robustly detect the bounding box for different facial orientations relative to the camera to avoid classification inaccuracy.
Fast Classification. Registration is performed once for each customer. During registration, Grab extracts features from the customer’s face. To do this, we evaluated several face feature extractors (baltruvsaitis2016openface; facefeature; schroff2015facenet), and ultimately selected ResNet-34’s feature extractor (facefeature), which produces a 128-dimension feature vector and performs best in both speed and accuracy (§4.6).
With these features, we can identify faces by comparing feature distances, build classifiers, or train a neural network. After experimenting with these options, we found that a k nearest neighbor (kNN) classifier, in which each customer is trained as a new class, worked best among these choices (§4.6). Grab builds one kNN-based classifier for all customers and uses it across all cameras.
Tightening the face bounding box. During normal operation, Grab extracts facial features after drawing a bounding box (derived from OpenPose keypoints) around each customer’s face. Grab infers the face’s bounding box width using the distance between two ears, and the height using the distance from nose to neck. This works well when the face points towards the camera (Figure 4(c)), but can result in an inaccurate bounding box when a customer faces slightly away from the camera (Figure 4(d)). This inaccuracy can degrade classification performance.
To obtain a tighter bounding box, we estimate head pitch and yaw using the keypoints. Consider the line between the nose and neck keypoints: the distance of each eye and ear keypoint to this axis can be used to estimate head yaw. Similarly, the distance of the nose and neck keypoints to the axis between the ears can be used to estimate pitch. Using these, we can tighten the bounding box significantly (Figure 4(e)). To improve detection accuracy (§4) when a customer’s face is not fully visible in the camera, we also use face alignment (baltruvsaitis2016openface), which estimates the frontal view of the face.
Shopper Tracking. A user’s face may not always be visible in every frame, since customers may intentionally or otherwise turn their back to the camera. However, Grab needs to be able to identify the customer in frames where the customer’s face is not visible, for which it uses tracking. Grab assumes the use of existing security cameras, which, if placed correctly, make it unlikely that a customer can evade all cameras at all times (put another way, if the customer is able to do this, the security system’s design is faulty).
Skeleton-based Tracking. Existing human trackers use bounding box based approaches (milan2016mot16; leal2015motchallenge; tang2017multiple; wojke2017simpl; ristani2014tracking), which can perform poorly in in-store settings with partial or complete occlusions (Figure 5(a)). We quantify this in §4.4 with the state-of-the-art bounding box based tracker, DeepSort (wojke2017simpl), but Figure 5 demonstrates this visually.
Instead, we use the skeleton generated by OpenPose to develop a tracker that uses geometric properties of the body frame. We use the term track to denote the movements of a distinct customer (whose face may or may not have been identified). Suppose OpenPose identifies a skeleton in a frame: the goal of the tracker is to associate the skeleton with an existing track if possible. Grab uses the following to track customers. It tries to align each keypoint in the skeleton with the corresponding keypoint in the last seen skeleton in each track, and selects that track whose skeleton is the closest match (the sum of match errors is smallest). Also, as soon as it is able to identify the face, Grab associates the customer’s identity with the track (to be robust to noise, Grab requires that the customer’s face is identified in 3 successive frames). To work well, the tracking algorithm needs to correctly handle partial and complete occlusions.
Dealing with Partial Occlusions. When a shopper’s body is not completely visible (e.g., because she is partially obscured by another customer, Figure 5(b)), OpenPose can only generate a subset of the key points. In this case, Grab matches only on the visible subset. However, with significant occlusions, very few key points may be visible. In this case, Grab attempts to increase matching confidence using the color histogram of the visible upper body area. However, if the two matching approaches (color and skeletal) conflict with each other, Grab skips matching attempts until subsequent frames when this process is repeated.
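The skeletal matching step, restricted to the keypoints visible in both skeletons, can be sketched as follows. This is a minimal illustration, not Grab’s implementation: the dictionary representation of a skeleton, the use of the mean (rather than the sum) of match errors so that skeletons with different numbers of visible keypoints stay comparable, and the `max_error` gate are all assumptions made for the example. The color-histogram fallback is omitted.

```python
import math

def match_error(skel_a, skel_b):
    """Mean Euclidean distance over keypoints present in both skeletons.

    Each skeleton is a dict {keypoint_name: (x, y)}; occluded keypoints
    are simply absent. Returns None when no keypoint is shared.
    """
    shared = skel_a.keys() & skel_b.keys()
    if not shared:
        return None
    total = sum(math.dist(skel_a[k], skel_b[k]) for k in shared)
    return total / len(shared)

def assign_track(skeleton, tracks, max_error=50.0):
    """Return the id of the closest track, or None if none is close enough.

    tracks maps track_id -> last seen skeleton of that track.
    max_error (in pixels) is a hypothetical gate, not a value from the paper.
    """
    best_id, best_err = None, max_error
    for track_id, last_seen in tracks.items():
        err = match_error(skeleton, last_seen)
        if err is not None and err < best_err:
            best_id, best_err = track_id, err
    return best_id
```

A skeleton with only a nose and neck visible is still matched to the track whose last skeleton places those two keypoints nearby; a skeleton far from every track returns `None`, which a caller could treat as the start of a new track.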
Dealing with Complete Occlusions. In some cases, a shopper may be completely obscured by another. Grab uses lazy tracking (Figure 6) in this case. When an existing track disappears in the current frame, Grab checks if, in the previous frame, the track was close to the edge of the image, in which case it assumes the customer has moved out of the camera’s field of view and deletes the track. Otherwise, it marks the track as blocked. When the customer reappears in a subsequent frame, it reactivates the blocked track.
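The lazy tracking rule can be sketched as a small state update. The `Track` class, the `margin` used to decide that a position is "close to the edge", and the function names are hypothetical choices for this example; the paper does not specify them.

```python
from dataclasses import dataclass

@dataclass
class Track:
    track_id: int
    last_xy: tuple           # last seen position in pixels
    state: str = "active"    # "active" or "blocked"

def near_edge(xy, width, height, margin=20):
    """True if the position lies within `margin` pixels of the image border."""
    x, y = xy
    return x < margin or y < margin or x > width - margin or y > height - margin

def update_missing(tracks, missing_ids, width, height):
    """Handle tracks that have no matching skeleton in the current frame."""
    for tid in missing_ids:
        track = tracks[tid]
        if near_edge(track.last_xy, width, height):
            del tracks[tid]           # shopper left the field of view
        else:
            track.state = "blocked"   # shopper is presumably occluded

def reactivate(track, xy):
    """Resume a blocked track when the shopper reappears."""
    track.last_xy = xy
    track.state = "active"
```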
Shopper Gesture Tracking. Grab must recognize the arms of each shopper in order to determine which item he or she purchases (§3.2). OpenPose has a built-in limb association algorithm, which associates shoulder joints to elbows, and elbows to wrists. We have found that this algorithm is a little brittle in our setting: it can miss an association (Figure 7(a)), or mis-associate part of a limb of one shopper with another (Figure 7(b)).
How limb association in OpenPose works. OpenPose first uses a DNN to associate with each pixel a confidence value of it being part of an anatomical keypoint (e.g., an elbow, or a wrist). During image analysis, OpenPose also generates vector fields (called part affinity fields (cao2017realtime)) for upper-arms and forearms whose vectors are aligned in the direction of the arm. Having generated keypoints, OpenPose then estimates, for each pair of keypoints, a measure of alignment between an arm’s part affinity field, and the line between the keypoints (e.g., elbow and wrist). It then uses a bipartite matching algorithm to associate the keypoints.
Improving limb association robustness. One source of brittleness in OpenPose’s limb association is the fact that the pixels for the wrist keypoint are conflated with pixels in the hand (Figure 7(a)). This likely reduces the part affinity alignment, causing limb association to fail. To address this, for each keypoint, we filtered outlier pixels by removing pixels whose distance from the medoid (park2009simple) was greater than the 85th percentile.
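A minimal sketch of this outlier filter is shown below. It assumes the candidate pixels of a keypoint are available as a list of `(x, y)` tuples, uses a brute-force medoid (quadratic in the number of pixels), and uses the nearest-rank definition of the percentile; all three are choices made for the example rather than details taken from the paper.

```python
import math

def medoid(points):
    """Return the point with the smallest total distance to all other points."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

def filter_outliers(points, percentile=85):
    """Drop points farther from the medoid than the given distance percentile."""
    center = medoid(points)
    dists = sorted(math.dist(p, center) for p in points)
    # Nearest-rank percentile: the smallest distance covering `percentile` % of points.
    rank = max(1, math.ceil(percentile / 100 * len(dists)))
    cutoff = dists[rank - 1]
    return [p for p in points if math.dist(p, center) <= cutoff]
```

For a tight 3x3 cluster of pixels plus one stray pixel far away, the medoid is the cluster center and only the stray pixel is removed.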
The second source of brittleness is that OpenPose’s limb association treats each limb independently, resulting in cases where the keypoint from one person’s elbow may get associated with another person’s wrist (Figure 7(b)). To avoid this failure mode, we modify OpenPose’s limb association algorithm to treat one person’s forearms or upper-arms as a pair (Figure 8). To identify forearms (or upper-arms) as belonging to the same person, we measure the Euclidean distance between color histograms belonging to the two forearms, and treat them as a pair if the distance is less than an empirically-determined threshold. Mathematically, we formulate this as an optimization problem:

$$
\begin{aligned}
\max_{i,j} \quad & \sum_{i \in E} \sum_{j \in W} A_{i,j} z_{i,j} \\
\text{s.t.} \quad & \sum_{j \in W} z_{i,j} \le 1 \quad \forall i \in E \\
& \sum_{i \in E} z_{i,j} \le 1 \quad \forall j \in W \\
& ED(F(i,j), F(i',j')) < thresh \quad \forall j, j' \in W,\; i, i' \in E
\end{aligned}
$$

where $E$ and $W$ are the sets of elbow and wrist joints, and $A_{i,j}$ is the alignment measure between the $i$-th elbow and the $j$-th wrist, while $z_{i,j}$ is an indicator variable indicating connectivity between the elbow and the wrist. The third constraint models whether two elbows belong to the same body, using the Euclidean distance between the color histograms of the body color. This formulation reduces to a max-weight bipartite matching problem, and we solve it with the Hungarian algorithm (kuhn1955hungarian).
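To make the matching objective concrete, the sketch below finds the elbow-to-wrist assignment with the largest total alignment score. It is not the Hungarian algorithm: for readability it enumerates all assignments with `itertools.permutations`, which is only practical for the handful of joints in a frame. It also omits the color-histogram pairing constraint, and assumes non-negative scores and no more elbows than wrists.

```python
from itertools import permutations

def match_elbows_to_wrists(alignment):
    """Brute-force max-weight matching of elbows to wrists.

    alignment[i][j] is the alignment score between elbow i and wrist j.
    Returns (best_total, pairs), where pairs is a list of (elbow, wrist)
    indices and each elbow and each wrist is used at most once.
    """
    n_elbows = len(alignment)
    n_wrists = len(alignment[0]) if alignment else 0
    best_total, best_pairs = float("-inf"), []
    for wrists in permutations(range(n_wrists), n_elbows):
        total = sum(alignment[i][j] for i, j in enumerate(wrists))
        if total > best_total:
            best_total, best_pairs = total, list(enumerate(wrists))
    return best_total, best_pairs
```

With scores `[[0.9, 0.8], [0.85, 0.1]]`, matching each elbow to its individually best wrist is infeasible (both prefer wrist 0), and resolving the conflict greedily yields a total of 1.0; the joint optimum pairs elbow 0 with wrist 1 and elbow 1 with wrist 0 for a total of 1.65, which illustrates why the association is solved jointly rather than limb by limb.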
3.2. Shopper Action Recognition
When a shopper is being continuously tracked, and their hand movements accurately detected, the next step is to recognize hand actions, specifically to identify item(s) which the shopper picks up from a shelf. Vision-based hand tracking alone is insufficient for this in the presence of multiple shoppers concurrently accessing items under variable lighting conditions. Grab leverages the fact that many retailers are installing smart shelves (smart_shelves; smart_shelves2) to deter theft. These shelves have weight sensors and are equipped with RFID readers. Weight sensors cannot distinguish between items of similar weight, while not all items are likely to have RFID tags for cost reasons. So, rather than relying on any individual sensor, Grab fuses detections from cameras, weight sensors, and RFID tags to recognize hand actions.
Modeling the sensor fusion problem. In a given camera view, at any instant, multiple shoppers might be reaching out to pick items from shelves. Our identity tracker (§3.1) tracks hand movement; the goal of the action recognition problem is to associate each shopper’s hand with the item he or she picked up from the shelf. We model this association between a shopper’s hand $k$ and item $m$ as a probability $p_{k,m}$ derived from fusing cameras, weight sensors, and RFID tags (Figure 9). $p_{k,m}$ is itself derived from association probabilities for each of the devices, in a manner described below. Given these probabilities, we then solve the association problem using a maximum weight bipartite matching. In the following paragraphs, we discuss details of each of these steps.
Proximity event detection. Before determining association probabilities, we need to determine when a shopper’s hand approaches a shelf. This proximity event is determined using the identity tracker module’s gesture tracking (§3.1). Knowing where the hand is, Grab uses image analysis to determine when a hand is close to a shelf. For this, Grab requires an initial configuration step, where store administrators specify camera view parameters (mounting height, field of view, resolution etc.), and which shelf/shelves are where in the camera view. Grab uses a threshold pixel distance from hand to the shelf to define proximity, and its identity tracker reports start and finish times for when each hand is within the proximity of a given shelf (a proximity event).
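A proximity event detector of this kind can be sketched as a threshold over per-frame hand-to-shelf distances. The input format (a time-ordered list of `(timestamp, distance_px)` samples for one hand and one shelf) and the function name are assumptions for the example.

```python
def proximity_events(samples, threshold_px):
    """Return (start, end) intervals during which the hand is near the shelf.

    samples: time-ordered list of (timestamp, distance_px) for one hand
    and one shelf. A hand is "near" when distance_px <= threshold_px.
    """
    events = []
    start = last = None
    for t, dist in samples:
        if dist <= threshold_px:
            if start is None:
                start = t          # event begins
            last = t               # extend the current event
        elif start is not None:
            events.append((start, last))
            start = None           # event ends
    if start is not None:
        events.append((start, last))
    return events
```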
In some cases, the hand may not be visible. In these cases, Grab estimates proximity using the skeletal keypoints identified by OpenPose (§3.1). Specifically, Grab knows, from the initial configuration step, the camera position (including its height), its orientation, and its field of view. From this, and simple geometry, it can estimate the pixel position of any point on the visible floor. In particular, it can estimate the pixel location of a shopper’s ankle joint (Figure 10), and use this to estimate the distance to a shelf. When the ankle joint is occluded, we extrapolate its position from the visible part of the skeleton to estimate the position.
Association probabilities from the camera. When a proximity event starts, Grab starts tracking the hand and any item in the hand. It uses the color histogram of the item to classify the item. To ensure robust classification, Grab (Figure 11(a)) (a) performs background subtraction to remove other items that may be visible and (b) eliminates the hand itself from the item by filtering out pixels whose color matches typical skin colors. Grab extracts a 384-dimension color histogram from the remaining pixels.
During an initial configuration step, Grab requires store administrators to specify which objects are on which shelves. Grab then builds, for each shelf (a single shelf might contain 10-15 different types of items), a feature-based kNN classifier (chosen both for speed and accuracy). Then, during actual operation, when an item is detected, Grab runs this classifier on its features. The classifier outputs an ordered list of matching items, with associated match probabilities. Grab uses these as the association probabilities from the camera. Thus, for each hand $k$ and each item $m$, Grab outputs the camera-based association probability.
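One simple way a kNN classifier can produce an ordered list of items with match probabilities is to report each item’s share of the k nearest training samples. The sketch below takes that approach; the vote-share definition of probability, the Euclidean distance between histograms, and the value of `k` are assumptions, since the text does not say how Grab derives the probabilities.

```python
import math
from collections import Counter

def knn_probabilities(query, labeled, k=3):
    """Rank item labels by their share of the k nearest training samples.

    labeled: list of (feature_vector, item_label) pairs for one shelf.
    Returns a dict {label: probability}, ordered from most to least likely.
    """
    nearest = sorted(labeled, key=lambda fl: math.dist(query, fl[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return {label: count / len(nearest) for label, count in votes.most_common()}
```

The toy vectors used to exercise this function have three dimensions; Grab’s color histograms have 384.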
Association probabilities from weight sensors. In principle, a weight sensor can determine the reduction in total weight when an item is removed from the shelf. Then, knowing which shopper’s hand was closest to the shelf, we can associate the shopper with the item. In practice, this association needs to consider real-world behaviors. First, if two shoppers concurrently remove two items of different weights (say a can of Pepsi and a peanut butter jar), the algorithm must be able to identify which shopper took which item. Second, if two shoppers are near the shelf, and two cans of Pepsi were removed, the algorithm must be able to determine if a single shopper took both, or each shopper took one. To increase robustness to these, Grab breaks this problem down into two steps: (a) it associates a proximity event to dynamics in scale readings, and (b) then associates scale dynamics to items by detecting weight changes.
Associating proximity events to scale dynamics. Weight scales sample readings at 30 Hz. At these rates, we have observed that, when a shopper picks up an item or deposits an item on a shelf, there is a distinct “bounce” (a peak when an item is added, or a trough when removed) because of inertia (Figure 11(b)). Given the time interval of this peak or trough and the time interval of the proximity event, we determine the association probability between the proximity event and the peak or trough as the ratio of the intersection of the two intervals to their union. As Figure 11(b) shows, if two shoppers pick up items at almost the same time, our algorithm is able to distinguish between them. Moreover, to prevent shoppers from attempting to confuse Grab by temporarily activating the weight scale with a finger or hand, Grab filters out scale dynamics where there is a high frequency of weight change.
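The intersection-over-union ratio between a proximity event and a peak or trough can be computed directly from their start and end times. In this sketch both are represented as `(start, end)` tuples in seconds, which is an assumed representation.

```python
def interval_iou(a, b):
    """Intersection over union of two time intervals given as (start, end)."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0
```

For example, a proximity event spanning seconds 1 to 3 and a trough spanning seconds 2 to 4 overlap for one second out of a three-second union, giving an association probability of 1/3; intervals that do not overlap score 0.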
Associating scale dynamics to items. The next challenge is to measure the weight of the item removed or deposited. Even when there are multiple concurrent events, the 30 Hz sampling rate ensures that the peaks and troughs of two concurrent actions are likely distinguishable (as in Figure 11(b)). In this case, we can estimate the weight of each item from the sensor reading at the beginning of the peak or trough $w_s$ and the reading at the end $w_e$. Thus $|w_s - w_e|$ is an estimate of the item weight $w$. Now, from the configuration phase, we know the weights of each type of item on the shelf. Define $\delta_j$ as $|w - w_j|$ where $w_j$ is the known weight of the $j$-th type of item in the shelf. Then, we say that the probability that the item removed or deposited was the $j$-th item is given by $\frac{1/\delta_j}{\sum_i 1/\delta_i}$. This definition accounts for noise in the scale (the estimates for $w$ might be slightly off) and for the fact that some items may be very similar in weight.
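As an illustration of how a measured weight change can be turned into per-item probabilities, the sketch below assumes that an item type’s probability is proportional to the inverse of the distance between the measured weight and that type’s known weight, normalized over the item types on the shelf. The `eps` guard against an exact match (zero distance) and the gram units are additions for the example.

```python
def item_probabilities(reading_start, reading_end, known_weights, eps=1e-6):
    """Probability of each item type given scale readings around a peak or trough.

    known_weights: dict {item_name: weight} from the configuration phase.
    eps is a hypothetical guard that avoids division by zero on exact matches.
    """
    measured = abs(reading_start - reading_end)
    inverse = {
        name: 1.0 / max(abs(measured - weight), eps)
        for name, weight in known_weights.items()
    }
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}
```

With a measured change of 370 g and known weights of 375 g and 510 g, almost all of the probability goes to the 375 g item, while two item types of nearly equal weight would split it almost evenly.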
Combining these association probabilities. From these steps, we get two association probabilities: one associating a proximity event to a peak or trough, another associating the peak or trough to an item type. Grab multiplies these two to get the probability, according to the weight sensor, that hand $k$ picked item $m$.
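A minimal sketch of the two weight-sensor steps follows. The exact mapping from weight difference to probability is not reproduced here, so the sketch assumes inverse-distance weighting normalized over item types; the function names, the `eps` guard, and the item weights are likewise hypothetical.

```python
def item_type_probabilities(reading_start, reading_end, known_weights, eps=1e-6):
    """Map a measured weight change to a probability for each item type.

    Assumption: probability is proportional to 1 / |measured - known weight|,
    normalized over all item types on the shelf.
    """
    measured = abs(reading_start - reading_end)
    inverse = {
        name: 1.0 / (abs(measured - weight) + eps)  # eps avoids division by zero
        for name, weight in known_weights.items()
    }
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def weight_sensor_probability(p_event_to_bounce, p_bounce_to_item):
    """Combine the two association probabilities by multiplication."""
    return p_event_to_bounce * p_bounce_to_item


# Hypothetical shelf configuration (weights in grams).
known = {"soda_can": 390.0, "peanut_butter": 510.0, "chips": 60.0}
probs = item_type_probabilities(2000.0, 1605.0, known)  # a 395 g drop
combined = weight_sensor_probability(0.5, probs["soda_can"])
print(probs, combined)
```

Under this assumed weighting, items of nearly identical weight split the probability mass between them, which is why the weight sensor alone cannot separate them.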
Association probabilities from RFID tag. For items which have an RFID tag, it is trivial to determine which item was taken (unlike with weight or vision sensors), but it is still challenging to associate proximity events with the corresponding items. For this, we leverage the fact that the tag’s RSSI becomes weaker as it moves away from the RFID reader. Figure 11(c) illustrates an experiment where we moved an item repeatedly closer and further away from a reader; notice how the changes in the RSSI closely match the distance to the reader. In smart shelves, the RFID reader is mounted on the back of the shelf, so that when an object is removed, its tag’s RSSI decreases. To determine the probability that a given hand caused this decrease, we use probability-based Dynamic Time Warping (bautista2013probability), which matches the time series of hand movements with the RSSI time series and assigns a probability which measures the likelihood of association between the two. We use this as the association probability derived from the RFID tag.
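To make the matching step concrete, the sketch below aligns a hand-movement series with an RSSI series using classic dynamic time warping. It is not the probability-based DTW variant cited above: the use of hand-to-shelf distance as the movement signal, the min-max normalization, and the `exp(-cost)` conversion of costs into probabilities are all assumptions made for illustration.

```python
import math


def normalize(series):
    """Min-max scale a series to [0, 1]; a constant series maps to zeros."""
    lo, hi = min(series), max(series)
    span = hi - lo
    return [(x - lo) / span if span > 0 else 0.0 for x in series]


def dtw_cost(a, b):
    """Classic DTW alignment cost between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def hand_probabilities(hand_distances, rssi):
    """Assumed scoring: lower DTW cost gives a higher probability."""
    # RSSI falls as the tag leaves the shelf, so compare against negated RSSI.
    signal = normalize([-x for x in rssi])
    weights = {
        hand: math.exp(-dtw_cost(normalize(dist), signal))
        for hand, dist in hand_distances.items()
    }
    total = sum(weights.values())
    return {hand: w / total for hand, w in weights.items()}


# Hypothetical traces: hand_a pulls away from the shelf, hand_b stays put.
rssi = [-50, -51, -55, -60, -66, -70]
hands = {
    "hand_a": [0.10, 0.15, 0.30, 0.50, 0.70, 0.90],
    "hand_b": [0.80, 0.80, 0.80, 0.80, 0.80, 0.80],
}
rfid_probs = hand_probabilities(hands, rssi)
print(rfid_probs)
```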
Putting it all together. In the last step, Grab formulates an assignment problem to determine which hand to associate with which item. First, it determines a time window consisting of a set of overlapping proximity events. Over this window, it uses the association probabilities from each sensor to define a composite probability $p_{k,m}$ between the $k$-th hand and the $m$-th item: $p_{k,m}$ is a weighted sum of the three probabilities from each sensor (described above), with the weights being empirically determined.
Then, Grab formulates the assignment problem as an optimization problem:
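$$\max \sum_{k,m} p_{k,m}\, z_{k,m} \quad \text{subject to} \quad \sum_{k \in H} z_{k,m} \le 1 \;\; \forall m \in I, \qquad \sum_{l \in I_t} z_{k,l} \le u_l \;\; \forall k \in H$$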
where $H$ is the set of hands, $I$ is the set of items, $I_t$ is the set of item types, and $z_{k,m}$ is an indicator variable that determines if hand $k$ picked up item $m$. The first constraint models the fact that each item can be removed or deposited by one hand, and the second models the fact that sometimes shoppers can pick up more than one item with a single hand: $u_l$ is a statically determined upper bound on the number of items of the $l$-th item type that a shopper can pick up using a single hand (e.g., it may be physically impossible to pick up more than 3 bottles of a specific type of shampoo). This formulation is a max-weight bipartite matching problem, which we can optimally solve using the Hungarian algorithm (kuhn1955hungarian).
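The assignment step can be illustrated on a toy instance. The sketch below is not the Hungarian algorithm: for brevity it enumerates every assignment, which is only practical for a handful of hands and items. The sensor weights in `composite`, the hand and item names, and the probabilities are all hypothetical.

```python
from itertools import product


def composite(p_vision, p_weight, p_rfid, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of per-sensor probabilities (weights are placeholders)."""
    return weights[0] * p_vision + weights[1] * p_weight + weights[2] * p_rfid


def best_assignment(p, item_type, cap):
    """Exhaustively search for the max-weight hand-to-item assignment.

    p[hand][item] is the composite probability, item_type[item] gives the
    item's type, and cap[type] bounds how many items of that type a single
    hand may hold. Each item goes to at most one hand (or to none).
    """
    hands, items = list(p), list(item_type)
    best_score, best = -1.0, None
    for choice in product([None] + hands, repeat=len(items)):
        counts, score, feasible = {}, 0.0, True
        for item, hand in zip(items, choice):
            if hand is None:
                continue
            key = (hand, item_type[item])
            counts[key] = counts.get(key, 0) + 1
            if counts[key] > cap[item_type[item]]:
                feasible = False
                break
            score += p[hand][item]
        if feasible and score > best_score:
            best_score, best = score, dict(zip(items, choice))
    return best, best_score


p = {
    "h1": {"can_1": 0.8, "can_2": 0.7, "jar_1": 0.1},
    "h2": {"can_1": 0.3, "can_2": 0.6, "jar_1": 0.9},
}
types = {"can_1": "can", "can_2": "can", "jar_1": "jar"}

one_per_hand, score_one = best_assignment(p, types, {"can": 1, "jar": 1})
two_per_hand, score_two = best_assignment(p, types, {"can": 2, "jar": 1})
print(one_per_hand, score_one)
print(two_per_hand, score_two)
```

Raising the per-hand bound for cans from one to two changes the answer from "each shopper took one can" to "one shopper took both", which is the ambiguity the second constraint is meant to resolve.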
3.3. GPU Multiplexing
Because retailer margins can be small, Grab needs to minimize overall costs. The computing infrastructure (specifically, GPUs) is an important component of this cost. In what we have described so far, each camera in the store needs a GPU.
Grab, however, enables multiple cameras to be multiplexed on one GPU. It does this by avoiding running OpenPose on every frame. Instead, Grab uses a tracker to track joint positions from frame to frame: these tracking algorithms are fast and do not require the use of the GPU. Specifically, suppose Grab runs OpenPose on a given frame. On that frame, it computes ORB (ORB) features around every joint (Figure 12(a)): ORB features can be computed faster than previously proposed features like SIFT and SURF. Then, for each joint, it identifies the position of the joint in the subsequent frame by matching ORB features between the two frames. Using this, it can reconstruct the skeleton in the subsequent frame without running OpenPose on that frame.
Grab uses this to multiplex a GPU over different cameras. It runs OpenPose on a frame from each camera in a round-robin fashion. If a frame arrives from one camera while Grab is running OpenPose on a frame from another camera, Grab runs feature-based tracking on the newly arrived frame. Using this technique, we show that Grab is able to scale to 4 cameras on one GPU without significant loss of accuracy (§4).
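The round-robin policy can be sketched as a simple schedule. This sketch assumes, for simplicity, that all cameras deliver frames in lockstep and that the GPU completes exactly one OpenPose inference per tick; the function name and the "openpose"/"track" labels are hypothetical.

```python
def schedule(num_cameras, num_ticks):
    """Return, for each tick, how every camera's frame is processed.

    Exactly one camera per tick gets the GPU (OpenPose); frames from the
    other cameras fall back to CPU feature-based tracking.
    """
    plan = []
    for tick in range(num_ticks):
        gpu_camera = tick % num_cameras  # round-robin turn
        plan.append({
            cam: "openpose" if cam == gpu_camera else "track"
            for cam in range(num_cameras)
        })
    return plan


plan = schedule(num_cameras=4, num_ticks=8)
for tick, row in enumerate(plan):
    print(tick, row)
```

With four cameras, each camera gets a fresh OpenPose skeleton every fourth tick and relies on tracked joints in between.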
We now evaluate the end-to-end accuracy of Grab and explore the impact of each of our optimizations on overall performance. (Demo video of Grab: https://vimeo.com/245274192)
4.1. Grab Implementation
Weight-sensing Module. To mimic weight scales on smart shelves, we built scales costing $6, with fiberglass boards and 2 kg, 3 kg, and 5 kg pressure sensors. The sensor output is converted by the SparkFun HX711 load cell amplifier (hx711) to digital serial signals. An Arduino Uno Micro Control Unit (MCU) (uno) (Figure 13(a)-left) batches data from the ADCs and sends it to a server. The MCU has nine sets of serial Tx and Rx so it can collect data from up to nine sensors simultaneously. The sensors have a precision of around 5-10 g, with an effective sampling rate of 30 Hz. (The HX711 can sample at 80 Hz, but the Arduino MCU, when used with several weight scales, limits the sampling rate to 30 Hz.)
RFID-sensing Module. For RFID, we use the SparkFun RFID modules with antennas and multiple UHF passive RFID tags (sparkfun) (Figure 13(a)-right). The module can read up to 150 tags per second and its maximum detection range is 4 m with an antenna. The RFID module interfaces with the Arduino MCU to read data from tags.
Video input. We use IP cameras (ipcam) for video recording. In our experiments, the cameras are mounted on merchandise shelves and they stream 720p video over Ethernet. We also tried webcams, which achieved performance (detection recall and precision) similar to that of the IP cameras.
Identity tracking and action recognition. These modules are built on top of the OpenPose (openpose) library’s skeleton detection algorithm. As discussed earlier, we use a modified limb association algorithm. Our other algorithms are implemented in Python, and interface with OpenPose using a boost.python wrapper. Our implementation has over 4K lines of code.
4.2. Methodology, Metrics, and Datasets
In-store deployment. To evaluate Grab, we collected traces from an actual deployment in a retail store. For this trace collection, we installed the sensors described above in two shelves in the store. First, we placed two cameras at the ends of an aisle so that they could capture both the people’s pose and the items on the shelves. Then, we installed weight scales on each shelf. Each shelf contains multiple types of items, and all instances of a single item were placed on a single shelf at the beginning of the experiment (during the experiment, we asked users to move items from one shelf to another to try to confuse the system, see below). In total, our shelves contained 19 different types of items. Finally, we placed the RFID reader’s antenna behind the shelf, and we attached RFID tags to all instances of 8 types of items.
Trace collection. We then recorded five hours' worth of sensor data from 41 users who registered their faces with Grab. We asked these shoppers to test the system in whatever way they wished (Figure 13(b)). The shoppers selected from among the 19 different types of items, and interacted with the items (either removing or depositing them) a total of 307 times. Our cameras saw an average of 2.1 shoppers and a maximum of 8 shoppers in a given frame. In total, we collected over 10 GB of video and sensor data, which we use to analyze Grab's performance.
Adversarial actions. During the experiment, we also asked shoppers to perform three kinds of adversarial actions. (1) Item-switching: The shopper takes two items of similar color or similar weight and then puts one back, or takes one item and puts it on a different scale; (2) Hand-hiding: The shopper hides the hand from the camera and grabs the item; (3) Sensor-tampering: The shopper presses the weight scale with their hand. Of the 307 recorded actions, nearly 40% were adversarial: 53 item-switching, 34 hand-hiding, and 31 sensor-tampering actions.
Metrics. To evaluate Grab’s accuracy, we use precision and recall. In our context, precision is the ratio of true positives to the sum of true positives and false positives. Recall is the ratio of true positives to the sum of true positives and false negatives. For example, suppose a shopper picks items A, B, and C, but Grab shows that she picks items A, B, D, and E. A and B are correctly detected so the true positives are 2, but C is missing and is a false negative. The customer is wrongly associated with D and E so there are 2 false positives. In this example, recall is 2/3 and precision is 2/4.
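The worked example above can be checked with a short computation. This is a minimal sketch: the function name is hypothetical, and it treats a shopper's items as a set, which assumes no item appears twice in a basket.

```python
def precision_recall(actual, predicted):
    """Compute precision and recall for one shopper's item set."""
    actual, predicted = set(actual), set(predicted)
    true_pos = len(actual & predicted)
    false_pos = len(predicted - actual)   # items wrongly billed
    false_neg = len(actual - predicted)   # items missed
    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    recall = true_pos / (true_pos + false_neg) if actual else 0.0
    return precision, recall


# The shopper picked A, B, and C; the system reported A, B, D, and E.
precision, recall = precision_recall({"A", "B", "C"}, {"A", "B", "D", "E"})
print(precision, recall)
```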
4.3. Accuracy of Grab
Overall precision and recall. Figure 14(a) shows the precision and recall of Grab, and quantifies the impact of using different combinations of sensors: using vision only (V Only), weight only (W only), RFID only (R only) or all possible combinations of two of these sensors. Across our entire trace, Grab achieves a recall of nearly 94% and a precision of over 91%. This is remarkable, because in our dataset nearly 40% of the actions are adversarial (§4.2). We dissect Grab failures below and show how these are within the loss margins that retailers face today due to theft or faulty equipment.
Using only a single sensor to compute the association probabilities (§3.2) degrades recall by 12-37% and precision by 16-36% (Figure 14(a)); in these configurations, cameras are still used for identity tracking and proximity event detection. This illustrates the importance of fusing readings from multiple sensors for associating proximity events with items (§3.2). The biggest loss of accuracy comes from using only the vision sensors to detect items. RFID sensors perform the best, since RFID can accurately determine which item was selected. (In general, since RFID is expensive, not all objects in a store will have RFID tags. In our deployment, a little less than half of the item types were tagged, and these numbers are calculated only for tagged items.) Even so, an RFID-only deployment has 12% lower recall and 16% lower precision. Of the sensor combinations, using weight and RFID sensors together comes closest to the recall performance of the complete system, losing only about 3% in recall, but 10% in precision.
Adversarial actions. Figure 14(b) shows precision and recall for only those actions in which users tried to switch items. In these cases, Grab is able to achieve nearly 90% precision and recall, while the best single sensor (RFID) has 7% lower recall and 13% lower precision, and the best 2-sensor combination (weight and RFID) has 5% lower precision and recall. As expected, using a vision sensor or weight sensor alone has unacceptable performance because the vision sensor cannot distinguish between items that look alike and the weight sensor cannot distinguish items of similar weight.
Figure 14(c) shows precision and recall for only those actions in which users tried to hide the hand from the camera when picking up items. In these cases, Grab estimates proximity events from the proximity of the ankle joint to the shelf (§3.2) and achieves a precision of 80% and a recall of 85%. In the future, we hope to explore cross-camera fusion to be more robust to these kinds of events. Of the single sensors, weight and RFID both have more than 24% lower recall and precision than Grab. Even the best double sensor combination has 12% lower recall and 20% lower precision.
Finally, Figure 14(d) shows precision and recall only for those actions in which the user tried to tamper with the weight sensors. In these cases, Grab is able to achieve nearly 87% recall and 80% precision. RFID, the best single sensor, has more than 10% lower precision and recall, while, predictably, vision and RFID have the best double sensor performance, with 5% lower recall and comparable precision to Grab.
In summary, although Grab has slightly lower precision and recall for the adversarial cases (which algorithmic improvements could raise), its overall precision and recall on a trace with nearly 40% adversarial actions are over 91%. When we analyze only the non-adversarial actions, Grab has a precision of 95.8% and a recall of 97.2%.
Taxonomy of Grab failures. Grab is unable to recall 19 of the 307 events in our trace. These failures fall into two categories: those caused by identity tracking, and those caused by action recognition. Five of the 19 failures are caused by wrong face identification (2), false pose detection (2) (Figure 13(c)), or errors in pose tracking (1). The remaining failures are all caused by inaccuracy in action recognition, and fall into three categories. First, Grab uses color histograms to detect items (§3.2), but these can be sensitive to lighting conditions (e.g., a shopper takes an item from one shelf and puts it on another where the lighting is slightly different) and occlusion (e.g., a shopper deposits an item into a group of other items which partially occlude it). Incomplete background subtraction can also reduce the accuracy of item detection. Second, our weight scales were robust to noise but sometimes still could not distinguish between items of similar, but not identical, weight. Third, our RFID-to-proximity event association failed at times when the tag's RFID signal disappeared for a short time from the reader, possibly because the tag was temporarily occluded by other items. Each of these failure types indicates directions for future work on Grab.
Contextualizing the results. From the precision/recall results, it is difficult to know if Grab is within the realm of feasibility for use in today's retail stores. Grab's failures fall into two categories: Grab associates the wrong item with a shopper, or it associates an item with the wrong shopper. The first can result in inventory loss, the second in overcharging a customer. A survey of retailers (nrf) estimates the inventory loss rate (if a store's total sales are $100, but $110 worth of goods were taken from the store, the inventory loss rate is 10%) in today's stores to be 1.44%. In our experiments, Grab's failures result in only 0.79% inventory loss. Another study (shopper_cost_rate) suggests that faulty scanners can result in up to 3% overcharges on average, per customer. In our experiments, we see a 2.8% overcharge rate. These results are encouraging and suggest that Grab may be within the realm of feasibility, but larger scale experiments are needed to confirm this. Additional investments in sensors and cameras, and algorithm improvements, could further improve Grab's accuracy.
4.4. The Importance of Efficiency
Grab is designed to process data in near real-time so that customers can be billed automatically as soon as they leave the store. For this, computational efficiency is important to lower cost (§4.5), but also to achieve high processing rates in order to maintain accuracy.
Impact of lower frame rates. If Grab is unable to achieve a high enough frame rate for processing video frames, it can have significantly lower accuracy. At lower frame rates, Grab can fail in three ways. First, a customer's face may not be visible at the beginning of the track in one camera. It usually takes several seconds before the camera can capture and identify the face. At lower frame rates, Grab may not capture frames where the shopper's face is visible to the camera, so it might take longer for it to identify the shopper. Figure 15(a) shows that this identification delay decreases with increasing frame rate, approaching sub-second times at about 10 fps. Second, at lower frame rates, the shopper moves a greater distance between frames, increasing the likelihood of identity switches, in which the tracking algorithm switches the identity of the shopper from one registered user to another. Figure 15(b) shows that the ratio of identity switches approaches negligible values only after about 8 fps. Finally, at lower frame rates, Grab may not be able to capture the complete movement of the hand towards the shelf, resulting in incorrect determination of proximity events and therefore reduced overall accuracy. Figure 15(c) shows that precision approaches 90% only above 10 fps. (In this and subsequent sections, we focus on precision, since it is lower than recall (§4.3), and so provides a better bound on Grab's performance.)
Infeasibility of a DNN-only architecture. In §3 we argued that, for efficiency, Grab could not use separate DNNs for different tasks such as identification, tracking, and action recognition. To validate this argument, we ran the state-of-the-art open-source DNNs for each of these tasks on our data set. These DNNs were at the top of the leader-boards for various recent vision challenge competitions (coco_challenge; mot_challenge; mpii). We computed both the average frame rate and the precision achieved by these DNNs on our data (Table 1).
For face detection, our accuracy measures the precision of face identification. The OpenFace (amos2016openface) DNN can process 15 fps and achieve a precision of 95%. For people detection, our accuracy measures the recall of bounding boxes between different frames. Yolo (yolo) can process at a high frame rate but achieves only 91% precision, while Mask-RCNN (he2017mask) achieves 97% precision, but at an unacceptable 5 fps. The DNNs for people tracking showed much worse behavior than Grab, which can achieve an identity switch rate of about 0.027 at 10 fps, while the best existing system, DeepSORT (wojke2017simpl), has a higher frame rate but a much higher identity switch rate. The fastest gesture recognition DNN is OpenPose (cao2017realtime) (whose body frame capabilities we use), but its performance is unacceptable, with low (77%) accuracy. The best gesture tracking DNN, PoseTrack (iqbal2016PoseTrack), has a very low frame rate.
Thus, today's DNN technology either has very low frame rates or low accuracy for individual tasks. Of course, DNNs might improve over time along both of these dimensions. However, even if, for each of the four tasks, DNNs can achieve, say, 20 fps and 95% accuracy, when we run these on a single GPU, we can at best achieve 5 fps, and an accuracy of $0.95^4 \approx 0.81$. By contrast, Grab is able to process a single camera on a single GPU at over 15 fps (Figure 16), achieving over 90% precision and recall (Figure 14(a)).
[Table 1: Frame rate and accuracy of state-of-the-art DNNs on our data set. The people tracking and gesture tracking sub-tables report FPS and average ID switches.]
4.5. GPU Multiplexing
In the results presented so far, Grab processes each camera on a separate GPU. The bottleneck in Grab is pose detection, which requires about 63 ms per frame: our other components require less than 7 ms each (Table 2).
In §3.3, we discussed an optimization that uses a fast feature tracker to multiplex multiple cameras on a single GPU. This technique can sacrifice some accuracy, and we are interested in determining the sweet spot between multiplexing and accuracy. Figure 16 quantifies the performance of our GPU multiplexing optimization. Figure 16(a) shows that Grab can support up to 4 cameras with a frame rate of 10 fps or higher with fast feature tracking; without it, only a single camera can be supported on the GPU (the horizontal line in the figure represents 10 fps). Figure 16(b) shows that, for up to 4 cameras, the precision can be maintained at nearly 90% (i.e., negligible loss of precision). Without fast feature tracking, multiplexing multiple cameras on a single GPU reduces the effective frame rate at which each camera can be processed, reducing accuracy for 4 cameras to under 60%. Thus, with GPU multiplexing using fast feature tracking, Grab can reduce the investment in GPUs by 4×.
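A back-of-the-envelope calculation shows why multiplexing fails without fast feature tracking. The sketch below uses only the 63 ms pose detection time reported above and assumes, as a simplification, that pose detection is the sole cost and that the GPU is shared evenly across cameras; the function name is hypothetical.

```python
POSE_DETECTION_MS = 63.0  # per-frame pose detection time reported above


def openpose_fps_per_camera(num_cameras, pose_ms=POSE_DETECTION_MS):
    """Upper bound on per-camera frame rate if every frame needs OpenPose."""
    return 1000.0 / (pose_ms * num_cameras)


for cameras in (1, 2, 4):
    print(cameras, round(openpose_fps_per_camera(cameras), 1))
```

Under these assumptions a single camera stays above 15 fps, while two or more cameras sharing the GPU fall below the 10 fps mark, so any remaining frames must be handled by the CPU tracker to keep the per-camera rate up.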
4.6. Evaluating Design Choices
In this section, we experimentally validate design choices and optimizations in identification and tracking.
[Table 2: Average processing time per frame (ms) for each Grab module.]
Identification. Customer identification in Grab consists of three steps (§3.1): face detection, feature extraction, and feature classification. For face detection, Grab adjusts the bounding box from OpenPose output. It could have used the default OpenPose output box or run a separate neural network for face detection. Table 3 shows that our design choice preserves detection accuracy while being an order of magnitude faster. For feature extraction, we compared our ResNet face features with another face feature (FaceNet), with a neural net generated body feature, and with a body color histogram. Table 4 shows that our approach has the highest accuracy. Finally, for feature classification, we tried three approaches: comparing features' cosine distance, using kNN, or using a simple neural network. Their re-training time, running speed, and accuracy are shown in Table 5 (fast re-training is essential to minimize the time the customer needs to wait between registration and shopping). We can see that kNN has the best accuracy, with a retraining overhead of 2 s and a classification overhead of less than 2 ms.
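As an illustration of the kNN classification step, the sketch below stores labeled feature vectors at registration time and classifies a new feature by majority vote among its nearest neighbors. The class name, the choice of k = 3, the Euclidean metric, and the toy 3-dimensional vectors (standing in for real face embeddings) are all assumptions.

```python
import math
from collections import Counter


class KnnIdentifier:
    """Toy k-nearest-neighbor classifier over face feature vectors."""

    def __init__(self, k=3):
        self.k = k
        self.samples = []  # list of (feature_vector, shopper_id)

    def register(self, shopper_id, features):
        """'Re-training' here is just storing the new shopper's features."""
        self.samples.extend((tuple(f), shopper_id) for f in features)

    def identify(self, feature):
        """Return the majority label among the k nearest stored features."""
        nearest = sorted(self.samples, key=lambda s: math.dist(s[0], feature))
        votes = Counter(label for _, label in nearest[: self.k])
        return votes.most_common(1)[0][0]


knn = KnnIdentifier(k=3)
knn.register("alice", [(1.0, 0.0, 0.0), (0.9, 0.1, 0.0), (0.95, 0.0, 0.05)])
knn.register("bob", [(0.0, 1.0, 0.0), (0.1, 0.9, 0.0), (0.0, 0.95, 0.1)])
print(knn.identify((0.8, 0.2, 0.0)))
```

In this toy version registration only appends samples, which hints at why a kNN classifier can be cheaper to update than a neural network that must be retrained.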
Table 3: Face detection speed and accuracy.
| Method | Speed (ms/img) | Accuracy (%) |
|---|---|---|
| Adjusted box (Grab) | 4.1 | 95.1 |
| Original box from pose | <1 | 83.0 |
| Box from DNN model | 93 | 95.1 |
Table 4: Identification accuracy by feature type.
| Feature | Accuracy (%) |
|---|---|
| Grab's model (ResNet) | 95.1 |
| Body deep feature | |
Table 5: Feature classification latency.
| Metric | Cosine Dist | kNN | Neural Net |
|---|---|---|---|
| Retraining latency (s) | 0 | 2.1 | 68.6 |
| Classification latency (ms) | 0.1* | 1.9 | 10.7 |
Tracking. Replacing our pose tracker with a bounding box tracker (wojke2017simpl) can result in below 50% precision. Removing the limb association optimization drops precision by about 11%, and removing the optimization that estimates proximity when the hand is not visible reduces precision by over 7%. Finally, removing lazy tracking, which permits accurate tracking even in the presence of occlusions can reduce precision by over 15%. Thus, each optimization is necessary to achieve high precision.
5. Related Work
We are not aware of published work on end-to-end design and evaluation of cashier-free shopping.
Commercial cashier-free shopping systems. Amazon Go was the first set of stores to permit cashier-free shopping. Several other companies have deployed demo stores, including Standard Cognition (std_cog), Taobao (taobao), and Bingobox (bingobox). Amazon Go and Standard Cognition use deep learning and computer vision to determine shopper-to-item association (amzn_no_rfid; go-geek; go-nyt; std_cog_forbes). Amazon Go does not use RFID (go-nyt; go-geek) but needs many ceiling-mounted cameras. Imagr (imagr) uses a camera-equipped cart to recognize the items put into the cart by the user. Alibaba and Bingobox use an RFID reader to scan all items held by the customer at a "checkout gate" (ali_tech; bingo_tech). Grab incorporates many of these elements in its design, but uses a judicious combination of complementary sensors (vision, RFID, weight scales).
Person identification. Person (re)-identification has used face features and body features. Body-feature-based re-identification (bai2017scalable; chen2017beyond; zhao2017spindle) can achieve a precision of up to 80%, which is insufficient for cashier-free shopping. Proprietary face-feature-based re-identification (face_id; yitu; sighthound_face; amazon_face_recog) can reach 99% precision. Recent academic research using face features has achieved an accuracy of more than 95% on public datasets, but such systems are either unavailable (hayat2017joint; chen2018face; zhao2018towards) or too slow (tran2017disentangled; masi2016pose). Grab uses fast feature-based face re-identification with comparable accuracy while using a pose tracker to accurately bound the face (§3.1).
People tracking. Bounding box based trackers (ristani2018features; wojke2017simpl; liu2018tar) can track shopper movement, but can be less effective in crowds (§4.4) since they do not detect limbs and hands. Some pose trackers (Iqbal_CVPR2017; insafutdinov2017) can do pose detection and tracking at the same time, but are too slow for Grab (§4.4), which uses a skeleton-based pose tracker both for identity tracking and gesture recognition.
Action detection. Action detection is an alternative approach to identifying shopping actions. Publicly available state-of-the-art DNN-based solutions (kalogeiton17iccv; sun2018optical; heilbron2017scc; dave2017predictive) have not yet been trained for shopping actions, so their precision and recall in our setting are low.
Item detection and tracking. Prior work has explored item identification using Google Glass (ha2014towards) but such devices are not widely deployed. RFID tag localization can be used for item tracking (shangguan2015relative; shangguan2017; jiang2018orientation) but that line of work does not consider frequent tag movements, tag occlusion, or other adversarial actions. Vision-based object detectors (he2017mask; chen2018domain; redmon2018yolov3) can be used to detect items, but need to be trained for shopping items and can be ineffective under occlusions and poor lighting (§4.3). Single-instance object detection scales better for training items but has low accuracy (karlinsky2017fine; held2016robust).
Cashier-free shopping systems can help improve the shopping experience, but pose significant design challenges. Grab is a cashier-free shopping system that uses a skeleton-based pose tracking DNN as a building block, but develops lightweight vision processing algorithms for shopper identification and tracking, and uses a probabilistic matching technique for associating shoppers with items they purchase. Grab achieves over 90% precision and recall in a data set with up to 40% adversarial actions, and its efficiency optimizations can reduce investment in computing infrastructure by up to 4×. Much future work remains, including obtaining results from longer-term deployments, improving the robustness of sensing in the face of adversarial behavior, and exploring cross-camera fusion to improve Grab's accuracy even further.