Seeding Deep Learning using Wireless Localization

by   Zhujun Xiao, et al.

Deep learning is often constrained by the lack of large, diverse labeled training datasets, especially images captured in the wild. We believe advances in wireless localization, working in unison with cameras, can produce automated labeling of targets in videos captured in the wild. Using pedestrian detection as a case study, we demonstrate the feasibility, benefits, and challenges of a possible solution using WiFi localization. To enable the automated generation of labeled training data, our work calls for new technical development on passive localization, mobile data analytics, and error-resilient ML models, as well as design issues in user privacy policies.




1 Introduction

Deep learning has produced game-changing applications wherever it has been applied, from image and face recognition, to self-driving cars, knowledge extraction and retrieval, and natural language processing and translation.

Despite numerous advances, one fundamental limitation remains: building accurate deep learning models often requires training on large, labeled datasets [21, 10, 39], e.g. Google’s InceptionV3 model was trained on 1.28M labeled images [1]. Building these training datasets, however, is often prohibitively costly in resources. In fact, researchers from multiple application areas report that development of DNN models has been impeded by the lack of large-scale labeled training data [47]. Examples range from sign language recognition [24], facial recognition [22], prognostics for industrial applications [41], to smart-city applications [28].

In this paper, we propose the development of a network-based system to help address this fundamental need for labeled training data. Our proposed system uses state-of-the-art wireless localization hardware, combined with digital imaging systems (cameras), to scalably produce self-annotated ground truth images for deep learning systems.

1.1 The Data Labeling Problem

Current sources of labeled training data have two issues. First, most existing datasets come from curated sources, e.g., facial images from government agencies [31, 40] or images of people or pets collected from photo archives. These are often inherently biased in lighting conditions, pose, or specific subjects, and these biases produce recognition failures when models are applied to raw data in the wild [36]. Recent work shows that models trained on videos captured in the wild achieve significantly higher accuracy than those trained on curated datasets [21]. Second, there are simply not enough large labeled training datasets captured in the wild [10, 39]. Addressing this shortage directly is extremely difficult, as it requires manually labeling images and videos, an extremely labor-intensive task. While some have proposed using generative models to produce training data for text-based applications [32], labeling of images and videos still relies on manual annotation by humans [29].

Another approach that can help is transfer learning,

where a “teacher model” trained by a trusted party with access to large-scale data can be shared with many users, who then use smaller local, targeted datasets to incrementally train top layers of the model, producing a “student model.” Today, transfer learning is recommended by deep learning frameworks like Google Cloud ML, Microsoft Cognitive Toolkit, and PyTorch from Facebook. That said, obtaining targeted, labeled training data is critical to the success of developing student models, and data labeling is still a critical problem even in the context of transfer learning. We illustrate this point using the following case study.

Figure 1: Manually labeling a pedestrian by a bounding box.
Figure 2: Impact of training data size on pedestrian detection accuracy.

A Case Study on Pedestrian Detection.    Camera-based pedestrian detection is a critical component for applications like self-driving vehicles and urban traffic management in smart cities. Labeling is an effort-intensive human task, where training data is created by annotators who mark the boundary of each pedestrian in images using a bounding box (Figure 1). The most popular dataset is the Caltech Pedestrian Dataset [14], which contains 2.3 hours of urban video footage collected by cars, and took 400 man-hours to label. We perform some illustrative experiments to demonstrate the impact of training data choice and size on system performance.

First, we look at the benefits of targeted training of deep learning models for pedestrian detection. The best performing pedestrian detection system today is the Regional Proposal Network (RPN) [44], a student model that builds on a generic object detection model called Faster R-CNN [33]. To understand the impact of targeted training even in a transfer learning context, we customize two versions of the RPN model, one trained using a generic object detection dataset (PASCAL VOC [16]), and one trained using the Caltech dataset. Both are of the same size. We test both models using two pedestrian datasets: the Caltech dataset (using portions of the dataset not included in training the model), and a pedestrian dataset collected by cars from Daimler Chrysler [15].

Test Set    Trained on PASCAL VOC    Trained on Caltech
Caltech     60.1%                    20.8%
Daimler     37.9%                    21.7%

We report the results in the table above, where each number is the log-average miss rate over False Positives Per Image (FPPI), following [14]. There are two takeaways. First, even with transfer learning and a tuned dataset, the miss rate remains significant. Second, customizing DNN models with task-specific training data (i.e., the Caltech training set) leads to significantly higher accuracy. This is consistent with prior empirical work [21, 35].
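For concreteness, the log-average miss rate used throughout this paper can be computed roughly as follows. This is a minimal sketch of the standard Caltech-style metric (miss rate averaged in log space over 9 FPPI reference points in [1e-2, 1e0]); the convention of sampling the curve at the nearest lower-FPPI point is our simplifying assumption.

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate):
    """Average the miss rate at 9 FPPI reference points log-spaced in
    [1e-2, 1e0], in log space. `fppi` and `miss_rate` describe one
    detector's FPPI/miss-rate curve, both sorted by increasing FPPI."""
    refs = np.logspace(-2.0, 0.0, 9)
    samples = []
    for r in refs:
        idx = np.flatnonzero(fppi <= r)
        # if the curve never reaches this FPPI, count a full miss
        samples.append(miss_rate[idx[-1]] if idx.size else 1.0)
    # exponentiate the mean of the logs (clip to avoid log(0))
    return float(np.exp(np.mean(np.log(np.maximum(samples, 1e-10)))))
```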

Next, we look at the impact of training set size on system accuracy. We look at different variants of student models trained using the Caltech dataset, with varying sizes of training data from 2,000 to 40,000 images. Results in Figure 2 show that the impact from training data size is huge. This further underlines that even when used in the training of student models in transfer learning, highly accurate models require large, labeled datasets (e.g. 40,000 images). Note that 1% improvement in miss rate is already significant for most vision tasks [44].

1.2 Self-annotating Image Generation System

Instead of manually labeling training datasets, we consider an alternative that automatically labels images captured in the wild. We ask the question: can we leverage the presence of ubiquitous wireless devices to automate the generation of labeled image datasets for training DNNs? For example, can a wireless infrastructure determine the location of a smartphone, and use that to annotate images of its owner from a synchronized camera?

We believe advances in localization have made it possible for wireless infrastructure to precisely compute the location of a passive wireless device even in outdoor areas, thus enabling automated labeling of images and videos in the wild. If successful, this would provide a cheap mechanism for generating large volumes of labeled data for tasks such as pedestrian detection, object recognition (of connected vehicles), and facial recognition (of wireless users).

There are additional benefits to using wireless device localization to label images. Interacting with a wireless device generally produces a device identifier that can be used to correlate the same device or user across different images, e.g., temporal correlation of moving users across images in a sequence, or geographic correlation of the same user in images taken from different perspectives. Identifiers also enable “user-driven” policies such as “opt-out” lists for users who wish to be excluded for privacy reasons.

Challenges.    A number of significant challenges remain. First, a data generation platform must impose minimal overhead on targets, e.g., users who agree to be captured in image form should remain passive and should not need specialized hardware or software. This is significantly more challenging than localizing active participants, e.g., users tagged with RFIDs. Second, localization results must be precise, e.g., at least sub-meter level, to enable accurate identification of targets on images. Third, the system must address issues of privacy and user consent.

Initial Feasibility Study.    In this paper, we perform an initial study to explore the feasibility of automatically producing image training data using a passive RF localization scheme. We refer to this new approach as RF labeling. Our system leverages IEEE 802.11mc with the fine timing measurement (FTM) chipset feature (supported by the Intel 8260 WiFi chip and Android P). FTM is designed to facilitate highly accurate localization by responding to remote probes at the hardware level, without requiring network-level connectivity or synchronization with an access point. Thus targets carrying FTM-capable smartphones or WiFi devices can be automatically labeled. While we use 802.11 FTM for our feasibility study, the core concept and our observations should generalize to other localization methods. Our work makes three key contributions.

  • Using pedestrian detection as a case study, we identify practical problems in deploying RF labeling. We identify four types of mismatch errors between human-labeled image data and RF labels, and discuss potential efforts to address each.

  • Our empirical measurements show that pedestrian detection requires high quality RF labels, beyond the precision of today’s 802.11 FTM hardware settings. We show via emulation that the quality of labels improves significantly with tuned hardware settings.

  • We recognize the seriousness of issues of participant privacy and consent, and present some initial discussion in §6.

1.3 Related Work

Our work differs from existing ML directions to reduce reliance on labeled training data, which we summarize below.

Automatic Annotation.    Some prior works annotate objects or gestures by physically tagging them with RFIDs [23] or magnetic sensors [18]. These require active participation by the target and significantly limit scalability and applicable uses. In contrast, our goal is to scalably produce labeled images of passive targets (e.g., people, pets with collars, vehicles that already carry WiFi devices). Our work also differs from automatic image annotation, which uses visual features (e.g., color, texture, shape) to generate image labels but requires complex generative models that are hard to build [43].

New Model Architecture.    Transfer learning and semi-supervised learning use local labeled data to adapt well-trained generic models to new scenarios. Self-taught learning and unsupervised feature learning learn features from unlabeled data, but still require a sizable amount of labeled data to train the classifier. Finally, weakly supervised learning [46] reduces labeling complexity by using coarse-grained labels (e.g., image-level labels without object bounding boxes), but has limited applicability. Our work takes a different (and complementary) perspective, i.e., removing labeling overhead via automation.

2 Automated RF Labeling

We take a different perspective in addressing visual data labeling. We propose to automate the process of labeling using wireless localization systems that can be deployed at the time of training data collection.

Feasibility.    Our key insight is that the locations of targets (e.g., pedestrians, cars, drones) on a 2D image can be derived from their physical 3D locations w.r.t. the camera. While human annotators locate targets directly on 2D images, RF localization can estimate each target’s 3D physical location and then label the target on the image by projecting its 3D location onto the 2D image. If the target’s physical size is known or can be estimated, we can use the same projection to build its bounding box on the image.

Figure 3 illustrates the process of RF labeling for pedestrian detection. The system takes as input the 3D location of each pedestrian (via RF localization) and the camera image captured at the same time. Using information about the camera (location, view angle) and the environment (road elevation, etc.), the system first projects each 3D location to a 2D point on the image as the target center. It then crafts a 3D body box based on the average human height and body aspect ratio, and projects it to a 2D bounding box based on the target’s 3D location (depth).
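This projection step can be sketched with a simple pinhole camera model. The intrinsics below (F_PX, CX, CY) are hypothetical placeholders, not the paper's calibration; the height and aspect-ratio constants are the averages cited in §3.

```python
# Hypothetical camera intrinsics: focal length (pixels), principal point
F_PX, CX, CY = 1000.0, 640.0, 360.0
AVG_HEIGHT_M = 1.76   # average pedestrian height [17]
ASPECT = 0.41         # average width/height ratio [14]

def rf_label(x, y, z):
    """Project a target's 3D location (camera coordinates, z = depth in
    meters) to a 2D bounding box (u, v, w_px, h_px) via the pinhole model."""
    u = F_PX * x / z + CX           # horizontal image coordinate of center
    v = F_PX * y / z + CY           # vertical image coordinate of center
    h_px = F_PX * AVG_HEIGHT_M / z  # apparent height shrinks with depth
    w_px = ASPECT * h_px            # width follows the average aspect ratio
    return u, v, w_px, h_px
```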

Figure 3: RF labeling on camera data.

Applicability.    RF labeling is not yet a universal solution for automated labeling. It can only label targets identifiable by RF localization, e.g., a person carrying a smartphone, but not a cat. Since many users do carry smartphones, it is particularly applicable to human-oriented vision tasks.

Another question is: “instead of labeling camera images, why not use RF localization in place of the camera for the target task (e.g., pedestrian detection)?” RF localization alone is insufficient because outdoor localization is in general device-based (device-free sensing/localization [7, 38] works only at short, indoor ranges with a few (5) targets, and is not applicable here), requiring targets to carry specific wireless devices, e.g., smartphones or wearables. Cameras, on the other hand, are ubiquitous, impose no requirements on targets, and can detect pedestrians even when they do not carry wireless devices. Thus cameras will remain the prevalent technology for many applications, e.g., smart-city video surveillance [5].

Similarly, RF labeling also differs from sensor fusion, e.g., RGB-W [8], which combines a smartphone’s wireless signal strength data with camera images to improve detection accuracy. Fusion does not address the fundamental problem of labeling and requires all users to carry specific wireless devices.

Deployment Requirements.    Practical RF labeling imposes three key requirements on the underlying localization system. To achieve sufficient coverage, the localization system needs to be passive, not requiring targets to actively communicate or synchronize with the system. It also needs to support a range similar to that of a camera (around 60 m for outdoor clear view [14]) and offer high precision, e.g., cm-level accuracy (as we will show later). Finally, it needs to be synchronized with the camera (at the level of the camera frame rate).

2.1 Benefits

RF labeling has unique advantages over human labeling.

Volume.    As data labeling is fully automated, the size, number and diversity of training data will no longer be constrained by human labor and can become arbitrarily large.

Labeling while collecting data.    RF labeling works in unison with training image capturing and thus the two tasks run simultaneously.

Adding Location/Depth to Images.    Each RF label includes the 3D physical location of the target, adding depth to 2D images. Human labeling cannot do so.

Temporal Tracking.    With consent, RF localization can track each identified target over time, producing fine-grained, context-rich labels on the target. For example, one can infer the moving speed and context of a target from its sequence of location data (e.g., standing, walking, running, biking), and use them to produce fine-grained labels.

Cross-perspective Correlation.    RF localization can extract hidden identities of targets that are hard to identify from camera data. For example, it can use captured WiFi MAC addresses as identity trackers (assuming no MAC randomization). After recognizing the same identity across cameras and time, one can correlate these images together to build a comprehensive view of the target.

2.2 Implications on ML Applications

Aside from boosting the number and size of labeled training data, RF labeling can also facilitate development of highly complex ML applications. Below are some examples.

3D Face Models.    Computer vision tasks like person re-identification and multi-view face detection and recognition face significant challenges, since they require training data on each target’s identity with images across frames, cameras, and locations. The labeling task is extremely difficult for human annotators, since viewpoint, background, and illumination can change significantly across images. With RF labeling, we can track targets by their device identities and automatically build a comprehensive view of each target across many images. This helps to create a large database of 3D, multi-view facial data for individual users.

Human Action Recognition.    A conventional method for building training data for human activity recognition is to ask volunteers to perform predefined actions in front of cameras [34], which clearly does not scale. RF labeling can automatically identify and label activities based on each target’s location data over time. For example, our initial experiments find it can separate bikers and runners from stationary users and walkers based on their moving speeds. It can also use target location (e.g., bike lane vs. sidewalk) to separate bikers and runners who move at similar speeds.
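This speed-based separation can be sketched as a toy classifier over a localization track. The speed thresholds below are illustrative assumptions, not values measured in our experiments.

```python
import math

def classify_motion(track, walk_max=2.0, run_max=4.0):
    """Coarse activity label from a localization track: a list of
    (t_seconds, x_m, y_m) samples. Thresholds are in m/s and
    purely illustrative."""
    if len(track) < 2:
        return "unknown"
    (t0, x0, y0), (t1, x1, y1) = track[0], track[-1]
    # straight-line speed over the whole track
    speed = math.hypot(x1 - x0, y1 - y0) / max(t1 - t0, 1e-9)
    if speed < 0.3:
        return "stationary"
    if speed < walk_max:
        return "walking"
    if speed < run_max:
        return "running"
    return "biking"
```

A real deployment would combine this with location context (e.g., bike lane vs. sidewalk), as discussed above, to separate targets moving at similar speeds.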

Abnormal Event Detection in Video Surveillance.    The physical location and trajectory can also be used to identify and label abnormal events for video surveillance. For example, one can create detailed labels when users stand or run in the middle of the street, or follow an unusual route.

Scene Recognition.    RF labeling can label physical objects, from vehicles (cars, buses and trucks), robots and drones that are equipped with RF devices, to doors with WiFi smart video doorbells. The same benefits of adding depth, temporal tracking and across-camera integration can be utilized to recognize and label scenes on images.

2.3 Limitations

RF labeling faces three key limitations when compared to human labeling. First, RF labeling cannot label existing camera datasets (that do not contain localization data). Second, RF labeling introduces the extra cost of deploying RF localization systems that synchronize with cameras. One can potentially mitigate this by leveraging smart-city devices being deployed in major cities or by (re)using cameras’ on-board RF radios. Third, RF labeling cannot recognize and label targets that do not have wireless devices, and errors in localization results will translate into erroneous data labels. We further discuss the third limitation in §3.

Figure 4: Four types of mismatch between camera and RF labels.
Figure 5: Partial RF coverage affects performance, but gets compensated by more images.
Figure 6: RF localization can potentially identify when a target gets blocked.

3 Challenge: Label Mismatch

The most fundamental challenge facing RF labeling is the potential mismatch between information captured by the camera and that generated by RF labeling. We categorize such mismatch into four types (Figure 4).

Type ①: Missing Labels due to Partial RF Coverage.    Not every target can be detected and localized by RF localization, e.g., device-based RF localization can only localize users who carry the required RF device and are in range. Therefore, RF labeling could miss some targets.

We expect that these missing labels will have minimal impact on ML performance because they can be compensated by using more training data for labeling (a key feature of RF labeling). Of course, this assumes RF labeling either does not impose any bias on targets, or any such bias can be addressed by the DNN model itself, and that training cost does not grow drastically with the number of images.

Using the pedestrian detection example, we study the impact of missing labels and test our hypothesis. Given an RF coverage rate (in %), we create RF labels by randomly sampling the labels in the Caltech dataset. Figure 5 plots the log-average miss rate as a function of the number of training images, for coverage rates of 10% and 30%. We also plot, as the baseline, the performance of using 10k human-labeled images. We see that for a given number of training images, reducing RF coverage does degrade application performance, but this can be easily compensated by adding more training images. With 10k training images, switching from human labeling to RF labeling with 30% coverage increases the miss rate from 17.3% to 28%, which drops back to 18% after adding 10k more training images and to 17.3% with 20k more.
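The coverage emulation described above amounts to independently subsampling ground-truth labels. A minimal sketch (the function name and data layout are our own, not from the evaluation code):

```python
import random

def subsample_labels(labeled_images, coverage):
    """Emulate partial RF coverage: keep each ground-truth label
    independently with probability `coverage` (e.g., 0.1 or 0.3).
    `labeled_images` is a list of per-image label lists."""
    return [[b for b in boxes if random.random() < coverage]
            for boxes in labeled_images]
```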

We also study the impact of potential bias imposed by RF labeling. For pedestrian detection, RF labeling cannot label children since they do not normally carry wireless devices. We verified that today’s pedestrian detection model [44] can detect children accurately even when all labeled pedestrians are adults. This is because the DNN model treats children as a scaled down version of adults.

Type ②: Extraneous Labels due to Camera Occlusion.    Since RF signals can often penetrate or travel “around” obstacles, RF labeling may locate and label targets behind obstacles, while the camera captures only the obstacle or parts of the target, i.e., camera occlusion. Since both full and partial occlusions should not be used during model training [14], RF labels of occluded targets need to be identified and removed.

To detect camera occlusion, one potential direction is to analyze RF signals and localization results over time. Intuitively, obstacle blockage leads to a large degradation in signal strength, and the use of NLoS paths produces much longer range estimates. Figure 6 shows an example where a pedestrian was blocked by an obstacle for a period of time. These artifacts appear as “anomalies” in the time sequence of signal measurements and localization results, which can be used to detect occlusion.
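One simple, hypothetical detector along these lines flags time steps where signal strength drops sharply below a trailing-window baseline; the window size and drop threshold are illustrative assumptions, not calibrated values.

```python
def flag_occlusions(rss_dbm, window=5, drop_db=10.0):
    """Flag indices in an RSS time series (dBm) where the signal drops
    sharply below the trailing-window average, a heuristic cue that the
    target moved behind an obstacle (NLoS)."""
    flags = []
    for i in range(len(rss_dbm)):
        if i == 0:
            flags.append(False)
            continue
        lo = max(0, i - window)
        baseline = sum(rss_dbm[lo:i]) / (i - lo)  # trailing average
        flags.append(baseline - rss_dbm[i] >= drop_db)
    return flags
```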

Type ③: Missing Size Information.    While a human annotator will draw a bounding box around each detected target, RF labeling does not offer such information. For pedestrian detection, this can be overcome by first establishing an estimate of the target size (e.g., the average human height of 1.76m [17] and the average aspect ratio of 0.41 [14]), and then projecting the physical bounding box to the camera bounding box based on the target’s location relative to the camera. This is a reasonable estimate, because today’s pedestrian detection models [44] also resize bounding boxes based on the same average aspect ratio (0.41).

A related challenge is how to distinguish between targets that carry the same type of RF devices, e.g., human with WiFi vs. machines/vehicles with WiFi. One can analyze the MAC address or the physical location and trajectory data obtained via localization. For example, human users will most likely travel on sidewalks while vehicles stay in their lanes. Another direction is to develop outdoor RF sensing techniques like passive RF imaging [6, 20, 48] to capture more information of the target.

Type ④: Noisy Labels due to Localization Errors.    In this case, a target is captured by both RF localization and the camera, but its RF label deviates from the true camera label due to localization errors (and bounding box estimation errors). Localization errors include both angular and depth errors. The angular error shifts a label in the image, while the depth error changes the bounding box size.

4 Initial Noisy Label Analysis

While the first three types of mismatch result from inherent differences between camera and RF systems, noisy labels result from the specific RF localization design. In this section, we analyze their impact empirically using a specific RF localization system.

We note that many of today’s localization systems do not meet the requirements of RF labeling (§2). Active localization requires either placing tags on targets or network connectivity and synchronization with targets. Device-free systems are largely limited in range and in the number of targets (5). Most device-based passive localization systems, including GPS, WiFi-assisted, and RSS-based systems, achieve at most meter-level accuracy in outdoor scenarios, e.g., 5 m median error using GPS [26].

Passive Localization via IEEE 802.11 FTM.    Instead, we consider a recent development in passive RF localization: IEEE 802.11mc with fine timing measurements (FTM) [4, 12], or 802.11 FTM in short. It uses time-of-flight (ToF) for ranging and trilateration for localization. In our case, each AP broadcasts FTM beacons, and targets (with FTM enabled) automatically respond to these beacons. The AP then estimates the distance by measuring the round-trip time (RTT). (Unlike ToF designs based on FMCW (e.g., [7, 27, 37, 42]), 802.11 FTM operates on simple sine waves; its accuracy depends on the precision of hardware timing and RTT estimation.) 802.11 FTM is a hardware feature and does not require network-level connectivity or synchronization (it estimates RTT locally, like [30]). By enabling PHY-layer timing with picosecond-level accuracy, it can offer sub-meter outdoor ranging accuracy. Because it uses ToF, its accuracy is location independent, as long as the signal can reach the target. Finally, 802.11 FTM is already supported by off-the-shelf WiFi chipsets (Intel 8260, $20, range 100m [11]) and Google Android P [2].
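The ToF ranging and trilateration pipeline can be sketched as follows. `ftm_range` converts an RTT into a one-way distance, and `trilaterate` solves the textbook linearized three-anchor system; this is a generic method for illustration, not the chipset's internal algorithm, and the turnaround-time handling is simplified.

```python
import math

C_LIGHT = 299_792_458.0  # speed of light, m/s

def ftm_range(rtt_seconds, turnaround_seconds=0.0):
    """One-way distance from a round-trip-time measurement, after
    subtracting the responder's (reported) turnaround time."""
    return C_LIGHT * (rtt_seconds - turnaround_seconds) / 2.0

def trilaterate(p1, d1, p2, d2, p3, d3):
    """2D position from three anchors pi = (x, y) and ranges di,
    via the standard linearized system (subtracting circle equations)."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    a = 2 * (x2 - x1); b = 2 * (y2 - y1)
    c = d1**2 - d2**2 - x1**2 + x2**2 - y1**2 + y2**2
    d = 2 * (x3 - x2); e = 2 * (y3 - y2)
    f = d2**2 - d3**2 - x2**2 + x3**2 - y2**2 + y3**2
    det = a * e - b * d  # assumes non-collinear anchors
    return (c * e - b * f) / det, (a * f - c * d) / det
```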

Config   # of TXs   Sample Rate   Median Error (cm)   95% Error (cm)
S0       2          256           132.0               462.8
S1       4          2048          31.8                93.8
S2       6          2048          24.6                63.8
S3       6          5012          16.2                42.0
Figure 7: 802.11 FTM localization under different configurations. S0 is our testbed, S1–S3 are projections.
Figure 8: Impact of noisy labels on pedestrian detection w/ 10k and 40k training images.

Emulation of Noisy Labels.    To study the impact of noisy labels on today’s large-scale camera datasets, we take an emulation approach, first building an empirical model of 802.11 FTM’s localization errors using testbed measurements, then “injecting” the errors onto labels of existing datasets.

Our measurements used three Dell XPS 13 laptops with the Intel 8260 chipset (2 as transmitters and 1 as target) placed on typical streets. After analyzing 10,000 measurements with varying target locations, we found that the ranging error follows a (folded) t location-scale distribution [3] with zero mean and 0.54m standard deviation, and the localization error follows a gamma distribution. Both distributions remain invariant (with 91% confidence) despite changes in weather, target orientation and location, distance to transmitters, and hardware. We also confirmed that the error models match the testbed data.

The above model assumes 2 transmitters and a sampling rate of 256 beacons per localization instance (a hardcoded limit). We also simulated more sophisticated hardware configurations by increasing the number of transmitters and the sampling rate, and found that they do not change the model distributions, only the parameters. Figure 7 summarizes the model parameters for four different configurations.

Given the error models, we emulate RF labeling on the Caltech Pedestrian dataset, producing noisy labels for pedestrian detection. This requires us to first recover the 3D physical location (especially depth) of each labeled pedestrian from the image data, which we approximate using the pinhole camera model [19], assuming a pedestrian height of 1.76m [17] (with 10% random variation). We then inject our localization error into each instance and project it back onto the same 2D image to produce noisy labels.
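A simplified sketch of this noise-injection step, using Gaussian stand-ins for the measured t location-scale and gamma error models; the focal length and error sigmas here are illustrative, not the paper's fitted parameters.

```python
import math
import random

def inject_noise(label, depth_m, f_px=1000.0,
                 ang_sigma_deg=0.5, depth_sigma_m=0.54):
    """Perturb a clean 2D label (u, v, w, h in pixels) the way RF
    localization errors would: an angular error shifts the box across
    the image, a depth error rescales it."""
    u, v, w, h = label
    # angular error -> horizontal pixel shift (pinhole geometry)
    du = f_px * math.tan(math.radians(random.gauss(0.0, ang_sigma_deg)))
    # depth error -> apparent-size rescaling (farther means smaller box)
    noisy_depth = max(depth_m + random.gauss(0.0, depth_sigma_m), 0.1)
    scale = depth_m / noisy_depth
    return u + du, v, w * scale, h * scale
```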

Impact on Pedestrian Detection.    Figure 8 compares the log-average miss rate of models trained with 10k and 40k RF-labeled images for different configurations (S0–S3 in Figure 7), and that of a model trained with 10k human-labeled images (i.e., noise-free labels).

We make two key observations. First, for pedestrian detection, the current DNN model is sensitive to noisy labels produced by RF labeling. Under our basic FTM configuration (S0, 1.32m median error), the miss rate rises to more than 70%, compared to 17.3% under noise-free labels. It drops back to 20% after substantially improving hardware and infrastructure (S3, 16.2cm median error) and adding more labeled data. Second, we found that angular error is the dominant factor in performance degradation (compared to depth error). That is, placing a bounding box at the wrong 2D location on the image causes far more damage than wrongly sizing the bounding box.

5 Addressing Noisy Labels

Our results show that the most immediate challenge facing RF labeling is noisy labels. We now discuss three orthogonal and complementary directions to address this problem. Further research efforts are needed in these areas.

Advancing Outdoor Localization.    For pedestrian detection, RF labeling requires precise outdoor localization (on the order of tens of centimeters) at camera-scale range. Today’s solutions were never designed with this level of accuracy in mind. The straightforward path to advancing outdoor localization is to motivate industry to increase AP density and upgrade RF hardware, e.g., increasing the 802.11 FTM beacon rate from 256 to 5012 per localization instance (S3 in Figure 7), or switching to directional mmWave radios for localization. We can emulate some of these upgrades today, e.g., using multiple 802.11 FTM chipsets to emulate higher beacon rates, or adapt localization methods to focus on minimizing angular errors.

Filtering Out Noisy Labels.    Our second approach is using data analysis to identify “bad” localization instances, and ignore the corresponding labels. As a result, each RF labeled image will miss some labels, which can be compensated by labeling more images (see §3).

There are two potential methods for identifying bad localization instances, depending on the data used for analysis. The first directly identifies bad instances by looking at raw localization data. Recent work achieves this by applying unsupervised feature clustering on large-scale WiFi and cellular RSS localization datasets [25]. The system can effectively identify and remove bad localization instances (e.g., by percentile). It would be interesting to study whether the same approach can be used on FTM localization data.

The second and complementary approach is to cross-validate each RF label using its corresponding visual content, e.g., the image content inside the bounding box. Intuitively, an accurate label will create a bounding box around an object, which “stands out” from the surrounding background. In computer vision, this is captured by a metric called objectness score, which measures how likely a bounding box contains an object [9, 13, 33, 49]. This method, however, cannot differentiate between types of objects (e.g. pedestrians vs. trash bins). One can partially compensate by considering a sequence of images and leveraging temporal correlation of pedestrian movement. For example, recent work has leveraged view synthesis [45] to estimate depth and motion of targets, which can be combined with localization results for cross-validation.
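The objectness-based filtering above reduces to scoring each RF label's box and dropping low scorers. A minimal sketch, where `objectness` is any box-scoring callable in [0, 1] (e.g., backed by a pretrained proposal network); the function name and threshold are our own stand-ins, not a specific model from the text.

```python
def filter_rf_labels(boxes, objectness, threshold=0.5):
    """Keep only RF labels whose bounding box looks like an object
    according to an external objectness scorer. Dropped labels become
    'missing labels', which can be compensated by labeling more images."""
    return [b for b in boxes if objectness(b) >= threshold]
```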

Error-Resilient DNN Models.    Our third approach is to apply architectural modifications to existing ML/DL models so that they can tolerate noisy labels. This is a well-studied topic in the ML community, especially for classification tasks.

RF labeling brings new opportunities in this domain, since our RF localization can simultaneously provide multiple forms of labels at different accuracy levels, i.e., the minimum number of pedestrians in the image (most accurate), the depth of each pedestrian, and the 3D physical location of each pedestrian (least accurate). Our RF data analysis can also attach a confidence score to each label [32]. How to build robust models for these new scenarios is an interesting open research question.
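One simple way such a model might exploit per-label confidence scores is a confidence-weighted training objective. The sketch below is purely illustrative (the function name and the idea of a normalized weighted sum are our assumptions, not a proposal from the paper); a real model would fold this into its loss during training.

```python
def weighted_label_loss(losses, confidences):
    """Combine per-label-form losses (e.g., pedestrian count, depth,
    3D location) into one objective, weighting each term by the
    confidence score produced by RF data analysis."""
    total = sum(c * l for c, l in zip(confidences, losses))
    return total / sum(confidences)

# Toy example: three label forms with decreasing accuracy/confidence.
loss = weighted_label_loss(losses=[0.2, 0.5, 1.0],
                           confidences=[1.0, 0.6, 0.3])
```

Low-confidence labels (e.g., noisy 3D locations) thus contribute less to the gradient than high-confidence ones (e.g., pedestrian counts).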

6 Discussion and Conclusion

We believe advances in localization will make it possible to precisely compute the location of a passive wireless device, thus enabling automated labeling of some targets in images (and other datasets). Using a case study on pedestrian detection, we demonstrate the feasibility, benefits, and challenges of such a concept. Our work calls for new technical developments on passive localization, mobile data analytics, and error-resilient ML models, as well as privacy protection during ML training. Compared with ongoing efforts in the ML community, this approach tackles the hard challenge of training-data labeling from a different (and complementary) perspective, i.e., removing labeling overhead via automation.

Privacy Opt-out via Device-based RF Labeling.    For passive annotated imaging to move forward, participant consent and privacy are critical issues that must be addressed comprehensively. We believe device-based RF labeling can help address such privacy concerns. Since only targets carrying a specific wireless device will be recognized and labeled by the system, a user can specify her privacy constraints to the labeling system based on her device identity, e.g., the MAC address (using 802.11mc probing). A user can opt out completely or only at specific locations and time periods (since RF labeling knows user location and time). Such privacy protection cannot be implemented with manual labeling. Finally, while this new feature offers an initial start on user privacy protection, significant research effort is still needed to fully address participant privacy and consent.
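The opt-out mechanism above can be sketched as a rule check run before a detection is turned into a label. This is a minimal sketch under our own assumptions: `OptOutRule`, `should_label`, and the rectangular-region/hour-range rule format are hypothetical, and a real system would also handle MAC randomization.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class OptOutRule:
    """A user's opt-out, keyed on device MAC; None means 'always/anywhere'."""
    mac: str
    region: Optional[Tuple[float, float, float, float]] = None  # (x0, y0, x1, y1)
    hours: Optional[Tuple[int, int]] = None                     # (start_h, end_h)

def should_label(mac, pos, hour, rules):
    """Return False if any opt-out rule covers this device at this
    location and time; such detections are dropped before labeling."""
    for r in rules:
        if r.mac != mac:
            continue
        in_region = r.region is None or (
            r.region[0] <= pos[0] < r.region[2]
            and r.region[1] <= pos[1] < r.region[3])
        in_hours = r.hours is None or r.hours[0] <= hour < r.hours[1]
        if in_region and in_hours:
            return False
    return True

# Example: one user opts out during working hours (9:00-17:00), anywhere.
rules = [OptOutRule(mac="aa:bb:cc:dd:ee:ff", hours=(9, 17))]
```

Because the check happens at labeling time, the opted-out user's detections never enter the training set at all, which, as noted above, has no analog in manual labeling.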


  • [1]
  • [2] Previewing Android P.
  • [3] Statistics: t location-scale distribution.
  • [4] Part 11: Wireless LAN medium access control (MAC) and physical layer (PHY) specifications. IEEE P802.11-REVmc (2016).
  • [5] Smart city surveillance: Singapore’s camera system stands as a potent deterrent., 2017.
  • [6] Adib, F., Hsu, C., Mao, H., Katabi, D., and Durand, F. Capturing the human figure through a wall. ACM Transactions on Graphics 34, 6 (2015).
  • [7] Adib, F., Kabelac, Z., and Katabi, D. Multi-person localization via RF body reflections. In Proc. of NSDI (2015).
  • [8] Alahi, A., Haque, A., and Fei-Fei, L. RGB-W: When vision meets wireless. In Proc. of ICCV (2015).
  • [9] Alexe, B., Deselaers, T., and Ferrari, V. What is an object? In Proc. of CVPR (2010).
  • [10] Andriluka, M., Iqbal, U., Milan, A., Insafutdinov, E., Pishchulin, L., Gall, J., and Schiele, B. Posetrack: A benchmark for human pose estimation and tracking. In Proc. of CVPR (2018).
  • [11] Au, E. The Latest Progress on IEEE 802.11mc and IEEE 802.11ai Standards. IEEE Vehicular Technology Magazine 11, 3 (2016).
  • [12] Banin, L., Schatzberg, U., and Amizur, Y. Wifi ftm and map information fusion for accurate positioning. In Proc. of IPIN (2016).
  • [13] Cheng, M., Zhang, Z., Lin, W., and Torr, P. Bing: Binarized normed gradients for objectness estimation at 300fps. In Proc. of CVPR (2014).
  • [14] Dollar, P., Wojek, C., Schiele, B., and Perona, P. Pedestrian detection: An evaluation of the state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence 34, 4 (2012).
  • [15] Enzweiler, M., and Gavrila, D. M. Monocular pedestrian detection: Survey and experiments. IEEE Transactions on Pattern Analysis & Machine Intelligence 12 (2008).
  • [16] Everingham, M., Van Gool, L., Williams, C. K., Winn, J., and Zisserman, A. The Pascal visual object classes (voc) challenge. International journal of computer vision 88 (2010).
  • [17] Fryar, C. D., Gu, Q., Ogden, C. L., and Flegal, K. M. Anthropometric reference data for children and adults: United states, 2011-2014. Vital and Health Statistics Series (2016).
  • [18] Garcia-Hernando, G., Yuan, S., Baek, S., and Kim, T.-K. First-person hand action benchmark with RGB-D videos and 3D hand pose annotations. In Proc. of CVPR (2018).
  • [19] Hartley, R. I., and Zisserman, A. Multiple View Geometry in Computer Vision. Cambridge University Press, 2004.
  • [20] Huang, D., Nandakumar, R., and Gollakota, S. Feasibility and limits of Wi-Fi imaging. In Proc. of SenSys (2014).
  • [21] Kang, D., Emmons, J., Abuzaid, F., Bailis, P., and Zaharia, M. Noscope: optimizing neural network queries over video at scale. Proceedings of the VLDB Endowment 10 (2017).
  • [22] Kemelmacher-Shlizerman, I., Seitz, S. M., Miller, D., and Brossard, E. The megaface benchmark: 1 million faces for recognition at scale. In Proc. of CVPR (2016).
  • [23] Kim, H., and Chang, S. RFID assisted image annotation system for a portable digital camera. In Proc. of ICCAS (2010).
  • [24] Kim, T., Keane, J., Wang, W., Tang, H., Riggle, J., Shakhnarovich, G., Brentari, D., and Livescu, K. Lexicon-free fingerspelling recognition from video: Data, models, and signer adaptation. Computer Speech & Language 46 (2017).
  • [25] Li, Z., Nika, A., Zhang, X., Zhu, Y., Yao, Y., Zhao, B. Y., and Zheng, H. Identifying value in crowdsourced wireless signal measurements. In Proc. of WWW (2017).
  • [26] Liu, X., Nath, S., and Govindan, R. Gnome: A practical approach to nlos mitigation for GPS positioning in smartphones. In Proc. of MobiSys (2018).
  • [27] Ma, Y., Selby, N., and Adib, F. Minding the billions: Ultra-wideband localization for deployed RFID tags. In Proc. of MobiCom (2017).
  • [28] Mallapuram, S., Ngwum, N., Yuan, F., Lu, C., and Yu, W. Smart city: The state of the art, datasets, and evaluation platforms. In Proc. of ICIS (2017).
  • [29] Papadopoulos, D. P., Uijlings, J. R., Keller, F., and Ferrari, V. Extreme clicking for efficient object annotation. In Proc. of ICCV (2017).
  • [30] Peng, C., Shen, G., Zhang, Y., Li, Y., and Tan, K. BeepBeep: A High Accuracy Acoustic Ranging System Using COTS Mobile Devices. In Proc. of SenSys (2007).
  • [31] Phillips, P. J., Wechsler, H., Huang, J., and Raussa, P. J. The FERET database and evaluation procedure for face-recognition algorithms. Image and Vision Computing 16, 5 (1998).
  • [32] Ratner, A., Bach, S. H., Ehrenberg, H. R., Fries, J. A., Wu, S., and Ré, C. Snorkel: Rapid training data creation with weak supervision. CoRR (2017).
  • [33] Ren, S., He, K., Girshick, R., and Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proc. of NIPS (2015).
  • [34] Schuldt, C., Laptev, I., and Caputo, B. Recognizing human actions: a local SVM approach. In Proc. of ICPR (2004).
  • [35] Shen, H., Han, S., Philipose, M., and Krishnamurthy, A. Fast video classification via adaptive cascading of deep models. CoRR abs/1704.02463 (2017).
  • [36] Torralba, A., and Efros, A. A. Unbiased look at dataset bias. In Proc. of CVPR (2011).
  • [37] Vasisht, D., Kumar, S., and Katabi, D. Decimeter-level localization with a single WiFi access point. In Proc. of NSDI (2016).
  • [38] Wang, J., Xiong, J., Chen, X., Jiang, H., Balan, R. K., and Fang, D. Tagscan: Simultaneous target imaging and material identification with commodity RFID devices. In Proc. of MobiCom (2017).
  • [39] Wang, W., Shen, J., Guo, F., Cheng, M.-M., and Borji, A. Revisiting video saliency: A large-scale benchmark and a new model. In Proc. of CVPR (2018).
  • [40] Wiskott, L., Krüger, N., Kuiger, N., and von der Malsburg, C. Face recognition by elastic bunch graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence 19, 7 (1997).
  • [41] Xi, Z., and Zhao, X. Data driven prognostics with lack of training data sets. In Proc. of IDETC/CIE (2015).
  • [42] Xiong, J., Sundaresan, K., and Jamieson, K. Tonetrack: Leveraging frequency-agile radios for time-based indoor wireless localization. In Proc of MobiCom (2015).
  • [43] Zhang, D., Islam, M. M., and Lu, G. A review on automatic image annotation techniques. Pattern Recognition 45 (2012).
  • [44] Zhang, L., Lin, L., Liang, X., and He, K. Is faster R-CNN doing well for pedestrian detection? In Proc. of ECCV (2016).
  • [45] Zhou, T., Brown, M., Snavely, N., and Lowe, D. G. Unsupervised learning of depth and ego-motion from video. In Proc. of CVPR (2017).
  • [46] Zhou, Z. A brief introduction to weakly supervised learning. National Science Review (2017).
  • [47] Zhu, X., Vondrick, C., Ramanan, D., and Fowlkes, C. C. Do we need more training data or better models for object detection? In Proc. of BMVC (2012).
  • [48] Zhu, Y., Zhu, Y., Zhao, B. Y., and Zheng, H. Reusing 60GHz radios for mobile radar imaging. In Proc. of MobiCom (2015).
  • [49] Zitnick, L., and Dollar, P. Edge boxes: Locating object proposals from edges. In Proc. of ECCV (2014).