One Size Does Not Fit All: Multi-Scale, Cascaded RNNs for Radar Classification

Edge sensing with micro-power pulse-Doppler radars is an emergent domain in monitoring and surveillance with several smart city applications. Existing solutions for the clutter versus multi-source radar classification task are limited in terms of either accuracy or efficiency, and in some cases, struggle with a trade-off between false alarms and recall of sources. We find that this problem can be resolved by learning the classifier across multiple time-scales. We propose a multi-scale, cascaded recurrent neural network architecture, MSC-RNN, comprised of an efficient multi-instance learning (MIL) Recurrent Neural Network (RNN) for clutter discrimination at a lower tier, and a more complex RNN classifier for source classification at the upper tier. By controlling the invocation of the upper RNN with the help of the lower tier conditionally, MSC-RNN achieves an overall accuracy of 0.972. Our approach holistically improves the accuracy and per-class recalls over ML models suitable for radar inferencing. Notably, we outperform cross-domain handcrafted feature engineering with time-domain deep feature learning, while also being up to ∼3× more efficient than a competitive solution.




1. Introduction

With the rapid growth in deployment of Internet of Things (IoT) sensors in smart cities, the need and opportunity for computing increasingly sophisticated sensing inferences on the edge has also grown. This has motivated several advances in designing resource-efficient sensor inferences, particularly those based on machine learning and especially deep learning. The designs, however, encounter a basic tension between achieving efficiency and preserving predictive performance, which motivates a reconsideration of state-of-the-art techniques. In this paper, we consider a canonical inference pattern, namely discriminating clutter from several types of sources, in the context of a radar sensor. This sort of (N+1)-class classification problem, where N is the number of source types, has a variety of smart city applications, where diverse clutter is the norm. These include triggering streetlights smartly, monitoring active transportation users (pedestrians, cyclists, and scooters), crowd counting, assistive technology for safety, and property surveillance. As an example, streetlights should be smartly triggered on for pedestrians but not for environmental clutter such as trees moving in the wind. Similarly, property owners should be notified only upon a legitimate intrusion but not for passing animals.

The radar is well suited to the smart city context as it is privacy preserving, in contrast to cameras. Moreover, it consumes low power (15 mW), because of which it can be deployed at operationally relevant sites with little dependence on infrastructure, using, for instance, a small panel solar harvester or even a modest sized battery, as shown in Figure 1. Experiences with deploying sensors in visionary smart city projects such as Chicago’s Array of Things (Catlett et al., 2017; UrbanCCD, ) and Sounds of New York City (Bello et al., 2019) have shown that wired deployments on poles tend to be slow and costly, given constraints of pole access rights, agency coordination, and labor unions, and can sometimes be in suboptimal locations. Using a low-power sensor that is embedded wirelessly or simply plugged in to existing platforms while imposing only a nominal power cost simplifies smart city deployment.

(a) Micro-power PDR system
(b) Solar harvested Signpost platform supporting low power sensors (Adkins and others, 2018)
Figure 1. The micro-power pulse-Doppler radar (PDR) device can be independently deployed or interfaced with existing multi-sensor smart city platforms such as Signpost (figure adapted from (Adkins and others, 2018), Copyright ©2019 ACM, Inc.)

Table 1 illustrates an efficiency-accuracy trade-off for the canonical inference pattern with N = 2 source types, wherein clutter is distinguished from human and other (i.e., non-human) sources. The more accurate deep models, the Convolutional Neural Network (CNN) (Krizhevsky et al., 2012) and the Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997), that we machine-learned for this 3-class classifier from a reference dataset are significantly less efficient, in terms of speed and therefore power consumption. In contrast, the more efficient shallow solution, Support Vector Machine (SVM), is significantly less accurate. While the SVM classifier has been implemented to operate in near real-time on the Cortex-M3 single-microcontroller processor in the device depicted in Figure 1(a), neither the CNN nor the LSTM per se yields a near real-time implementation. To implement deep models in near real-time on the M3, we therefore consider model optimization with recent state-of-the-art techniques such as fast gated RNNs (FastGRNN) (Kusupati et al., 2018) and Early-exit Multi-Instance RNNs (EMI-LSTM and EMI-FastGRNN) (Dennis et al., 2018). However, Table 1 illustrates that the trade-off remains: the best accuracy we achieve, namely with the FastGRNN, comes at significantly lower efficiency than the best efficiency achieved, namely with EMI-FastGRNN, whose accuracy is in turn significantly worse.

ML Model | Accuracy | FLOPS | Real-time?
SVM (15 features) | 0.85 | 37K | Yes
LSTM | 0.89 | 100K | No
CNN (1s FFT) | 0.91 | 1.3M | No
EMI-LSTM | 0.90 | 20K | Yes
FastGRNN | 0.96 | 35K | Yes
EMI-FastGRNN | 0.88 | 8K | Yes
Table 1. Trade-offs in accuracy and runtime efficiency for the 3-class radar problem (window length 1 s; feature computation overhead ignored for SVM; dataset and machine architecture details are in Section 5)

Problem Statement. In this work, we investigate alternative optimizations of deep models for the above classification task that achieve both high accuracy and speed. In doing so, we do not wish to sacrifice recall for the sake of high precision. For instance, radar sensing applications require that the clutter recall be very high so that there are minimal false alarms. However, a solution that restricts false alarms at the cost of detectability (i.e., low source recall, where a source could be either human or non-human) would be undesirable, as it would have limited applicability in the smart city contexts discussed above.

Solution Overview. The (N+1)-class radar problem, with N source types and clutter as the additional class, conflates discrimination between classes that are conceptually different. In other words, discriminating clutter from sources has a different complexity from that of disambiguating source types. This insight generalizes when the sources themselves are related by a hierarchical ontology, wherein different levels of source types involve concepts of correspondingly different complexity of discrimination. By way of example, in the 3-class clutter vs. human vs. non-human classification problem, discriminating clutter from sources turns out to be simpler than discriminating the more subtle differences between the source types. Using the same machine architecture for 3 classes of discrimination leads to the accuracy-efficiency trade-off, as the last two rows of Table 1 indicate. A more complex architecture suffices for discriminating among source types accurately, whereas a simpler architecture more efficiently suffices for discriminating clutter from sources, but hurts the accuracy of discriminating between source types.

We, therefore, address the problem at hand with an architecture that decomposes the classification inference into different hierarchical sub-problems. For the 3-class problem, these are: (a) Clutter vs. Sources, and (b) Humans vs. Non-humans given Sources. For each sub-problem we choose an appropriate learning architecture; given the results of Table 1, both architectures are forms of RNN, albeit with learning at different time-scales. The lower tier for (a) uses a short time-scale RNN, the Early-exit Multi-Instance RNN (EMI-FastGRNN) (Dennis et al., 2018; Kusupati et al., 2018), whereas the higher tier for (b) uses a longer time-scale RNN, a FastGRNN (Kusupati et al., 2018), which operates at the level of windows (contiguous, fixed-length snippets extracted from the time-series) as opposed to short instances within the window. The upper tier uses the features created by the lower tier as its input; for loss minimization, both tiers are jointly trained. To further improve the efficiency, we observe that source type discrimination needs to occur only when a source is detected, and that clutter may be the norm in several application contexts. Hence, the less efficient classifier for (b) is invoked only when (a) discriminates a source: we refer to this as cascading between tiers. The joint training loss function is refined to emulate this cascading. We call this architecture Multi-Scale, Cascaded RNNs (MSC-RNN).

Contributions. Our proposed architecture exploits conditional inferencing at multiple time-scales to jointly achieve superior sensing and runtime efficiency over state-of-the-art alternatives. To the best of our knowledge, this approach is novel to deep radar systems. For the particular case of the 3-class problem, MSC-RNN performs as follows on the Cortex-M3:

Accuracy | Clutter Recall | Human Recall | Non-human Recall | FLOPS
0.972 | 1 | 0.92 | 0.967 | 9K

Its accuracy and per-class recalls are mostly better than, and in the remaining cases competitive with, the models in Table 1. Likewise, its efficiency is competitive with that of EMI-FastGRNN, the most efficient of all models, while substantially outperforming it in terms of sensing quality. We also validate that this MSC-RNN solution is superior to its shallow counterparts not only as a whole, but at each individual tier as well. The data and training code for this project are open-sourced at (Roy et al., ). Other salient findings from our work are summarized as follows:

  1. Even with deep feature learning purely in the time-domain, MSC-RNN surprisingly outperforms handcrafted feature engineering in the amplitude, time, and spectral domains for the source separation sub-problem. Further, this is achieved with a 1.75-3× improvement in the featurization overhead.

  2. The Tier 1 component of MSC-RNN, which classifies legitimate sources from clutter, improves detectability by up to compared to popular background rejection mechanisms in radar literature, even when the false alarm rate is controlled to be ultra-low.

  3. MSC-RNN seems to tolerate the data imbalance among its source types better than other compared RNN models. In particular, it enhances the non-dominant human recall by up to 20%, while simultaneously maintaining or improving the dominant non-human recall and overall accuracy.

Organization. We present related research in Section 2 and outline the basics of micro-power radar sensing in Section 3. In Section 4, we detail the various components in our solution and discuss the training and inference pipelines. We provide evaluation and prototype implementation details in Sections 5 and 6 respectively. We conclude and motivate future research in Section 7.

2. Related Work

Shallow Radar Sensing. Micro-Doppler features have been used in myriad applications ranging from classification (He et al., 2014; Liu et al., 2011; Kim and Ling, 2008) to regression (He and Arora, 2014). Most of these applications employ the short-time Fourier transform (STFT) representation for analyzing micro-Doppler signatures. Although shallow classifiers can be computationally cheaper than deep solutions, the spectrogram generation over a sliding window for the STFT incurs significant computational overhead for real-time applications on single microcontroller devices. In order to decrease this overhead for feature extraction, different feature extraction methods like linear predictive coding (Jesus Javier and Kim, 2014), discrete-cosine coefficients (Molchanov et al., 2011), log-Gabor filters with principal component analysis (Lam Phung et al., 2015), and empirical mode decomposition (P. Fairchild and Narayanan, 2014) have been investigated in the past. We, on the other hand, use a deep learning approach that learns relevant features from raw time-series data, and avoid spectrogram computation altogether. Feature engineering requires sophisticated domain knowledge, is not assured to be efficient per se, and may not transfer well to solutions for other research problems. Moreover, selection of relevant and non-redundant features requires care for sensing to be robust (Roy et al., 2017).

Deep Radar Sensing. In recent years, there has been significant use of deep learning for radar applications. Most works use spectrogram-based input (Kim and Toomajian, 2016; Kim and Moon, 2015; Jokanovic et al., 2016) with deep architectures like CNNs/autoencoders. The authors of (Mendis et al., 2016) digitize the radio receiver’s signal and generate a unique spectral correlation function for the Deep Belief Network to learn signatures from. The pre-processing needed in these applications and the resulting model sizes make them unsuitable for single microcontroller devices. We use raw time-series data in conjunction with variants of RNNs to achieve a faster and more efficient solution.

Efficient RNN. The ability of RNNs to learn temporal features has made them ubiquitous in various sequence modeling tasks. RNNs, albeit theoretically powerful, often fail to reach the best performance due to instability in their training resulting from the exploding and vanishing gradient problem (EVGP) (Pascanu et al., 2013). Gated RNNs like LSTM (Hochreiter and Schmidhuber, 1997) and GRU (Cho et al., 2014) have been proposed to circumvent EVGP and achieve the desired accuracy for the given task. A drawback of LSTM and GRU is their model size and compute overhead, which makes them unattractive for near real-time single microcontroller implementations. Recently, FastGRNN (Kusupati et al., 2018) has been proposed to achieve prediction accuracies comparable to LSTM and GRU while ensuring that the learned models are smaller than 10 KB for diverse tasks. Our proposed hierarchical classifier solution is based on this architecture.

Multi-Instance Learning and Early Classification. MIL is a weakly supervised learning technique that is used to label sub-instances of a window. MIL has found use in applications from vision (Wu et al., 2015) to natural language processing (NLP) (Kotzias et al., 2014). It enables a reduction in the computational overhead of sequential models like RNNs by localizing the appropriate activity signature in a given noisy and coarsely-labeled time-series, along with early detection or rejection of a signal (Dennis et al., 2018). We use it as our lower tier classifier for clutter versus source discrimination.

Multi-Scale RNN. One of the early attempts to learn structure in temporally-extended sequences involved using reduced temporal sequences (Hinton, 1990) to make detectability over long temporal intervals feasible in recurrent networks (Mozer, 1992; Schmidhuber, 1992). With the resurgence of RNNs, multi-scale RNNs can discover the latent hierarchical multi-scale structure of sequences (Chung et al., 2016). While they have been traditionally used to capture long-term dependencies, we use them to design a computationally efficient system. We use different scales of temporal windows for the lower and upper tier RNNs. By conditioning the upper tier classifier, which works on longer windows and is hence bulkier, on the lower tier, we make sure that it is invoked only when necessary, i.e., when the lower tier predicts a source.

Compression Techniques. Sparsity, low-rank, and quantization have been proven to be effective ways of compressing deep architectures like RNNs (Wang et al., 2017; Ye et al., 2018) and CNNs (Han et al., 2016; Kumari et al., 2019). Many other compression methods, like householder reflectors in Spectral-RNN (Zhang et al., 2018) and Kronecker factorization in KRU (Jose et al., 2017), have been proposed, which are complementary to the solution proposed in this paper. We incorporate low-rank representation, Q15 quantization, and piecewise-linear approximation (Kusupati et al., 2018) to make MSC-RNN realizable on Cortex-M3 microcontrollers.

3. Radar and Classifier Models

3.1. Micro-power Radar Model

The monostatic PDR sensor depicted in Figure 1 has a bandwidth of nearly MHz and a center frequency at about GHz. It is a short-range radar with an anisotropic radiation pattern yielding a maximum detection range of m. Sensing itself consumes 15 mW of power, not counting the inference computation on the associated microcontroller. The radar response is low pass filtered to 100 Hz; hence the output is typically sampled at rates over 200 Hz. The output signal from the radar is a complex time-series with In-phase (I) and Quadrature (Q) components. When a source moves within the detection range, in addition to the change in received power, the phase of this signal changes according to the direction of motion. Consequently, its relative displacement can be estimated with high accuracy (typically, sub-cm scale), and a rich set of features can be derived by tracking its phase evolution over time.
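The phase tracking described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names and the `wavelength_m` parameter are assumptions introduced here, the sign convention for direction of motion depends on the hardware, and the conversion uses the standard two-way (monostatic) relation in which a phase change of 2π corresponds to half a wavelength of motion.

```python
import cmath
import math

def unwrapped_phase(iq_samples):
    """Accumulate wrapped phase differences between successive complex I-Q samples."""
    phases = [cmath.phase(s) for s in iq_samples]
    total = [0.0]
    for prev, cur in zip(phases, phases[1:]):
        delta = cur - prev
        # Wrap each step into [-pi, pi) so a jump across the branch cut is not counted.
        delta = (delta + math.pi) % (2 * math.pi) - math.pi
        total.append(total[-1] + delta)
    return total

def relative_displacement_m(iq_samples, wavelength_m):
    """Convert unwrapped phase to displacement; wavelength_m is a caller-supplied assumption."""
    return [p * wavelength_m / (4 * math.pi) for p in unwrapped_phase(iq_samples)]

# Synthetic source whose phase advances by 0.5 rad per sample for 100 steps.
samples = [cmath.exp(1j * 0.5 * n) for n in range(101)]
track = unwrapped_phase(samples)
```

Clutter that moves back and forth tends to produce phase steps that cancel in this running sum, whereas a source moving steadily through the scene produces a sum that keeps growing in magnitude.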

3.2. Classifier Architectures

3.2.1. Input and Feature Representation

The radar classifier system uses the aforementioned complex time-series as input. Extant end-to-end architectures for micro-power radar sensing mostly eschew deep feature learning for cheap handcrafted feature engineering in the amplitude, time, and spectral domains (He and Arora, 2014; Roy et al., 2017). However, these solutions incur significant featurization overhead; this is exemplified in Table 2 on 1-second snippets extracted from the complex time-series. Even ignoring the SVM computation latency, it can be seen that the main computation bottleneck is this incremental overhead which results in >30% duty cycle on the Cortex-M3, of which 10% constitutes the FFT overhead alone.

Component | Latency (ms)
FFT | 80
Incremental feature computation | 212
SVM inference (700 SVs) | 55
Table 2. Computation overheads in a shallow (SVM) radar solution on Cortex-M3 (10 features, 1s windows)
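As a quick check of the duty-cycle discussion, the latencies in Table 2 can be divided by the window length. The sketch below assumes that each latency is incurred once per 1-second window; the function and constant names are introduced here for illustration only.

```python
# Latencies from Table 2, in milliseconds per 1-second window.
FFT_MS = 80
INCREMENTAL_FEATURES_MS = 212
SVM_INFERENCE_MS = 55
WINDOW_MS = 1000

def duty_cycle(latency_ms, window_ms=WINDOW_MS):
    """Fraction of each window that the processor spends on the given work."""
    return latency_ms / window_ms

featurization = duty_cycle(FFT_MS + INCREMENTAL_FEATURES_MS)
fft_only = duty_cycle(FFT_MS)
total = duty_cycle(FFT_MS + INCREMENTAL_FEATURES_MS + SVM_INFERENCE_MS)
```

With the tabulated values, featurization (FFT plus incremental features) comes to about 0.29 of each window and the FFT alone to 0.08, roughly in line with the rounded figures quoted in the text; adding SVM inference brings the total to about 0.35.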

3.2.2. Deep Classifier Architecture

Deep radar classifier systems such as CNNs (or even some RNNs) convert the raw time-series to STFT, and hence also maintain this steady overhead in input representation. In the interest of designing resource efficient solutions, in this work, we instead focus on being competitive with all-domain featurization using purely time-domain learning.

3.2.3. Shallow Classifier Architecture

As shown in Figure 2, a prototypical shallow radar classifier system consists of four subsystems: (i) a displacement detector for discriminating clutter vs. sources, (ii) an incremental featurizer, (iii) an end inference engine that discriminates source types, and (iv) a composition manager that handles their interactions. The displacement detector is a simple module that thresholds unwrapped phase over incoming windows of radar data ( s or 1 s) to detect legitimate source displacements in the scene, filtering in-situ clutter that tends to yield self-canceling phase unwraps. When a source displacement is speculatively detected, the featurizer is invoked until the current displacement ends or a pre-specified time limit is reached. The final feature vector is fed to an end classifier such as a Support Vector Machine (Roy et al., 2017). Note that incremental feature computation overhead is the primary impediment to realizing efficiency in these systems; hence, techniques like replacing the heavy SVM classifier with the much lighter Bonsai (Kumar et al., 2017), or observing longer displacements to run inference infrequently, do not alleviate this problem.

Figure 2. SVM classifier data and control planes; control signal-response pairs are color coded

While we preserve this ontological hierarchy in our solution, we replace this simple “ensemble” with a principled 2-tier RNN approach in the time-domain. In the next sections, we present our proposed architecture and discuss how deep feature learning can be used to successfully resolve the above issues.

4. 2-Tier Deep Classifier Architecture

MSC-RNN is a multi-scale, cascaded architecture that uses EMI-FastGRNN as the lower tier clutter discriminator and FastGRNN as the upper tier source classifier. While EMI-FastGRNN efficiently localizes the source signature in a clutter prone time-series ensuring smaller sequential inputs along with early classification, FastGRNN reduces the per-step computational overhead over much heavier alternatives such as LSTM. We begin with the relevant background for each of these components.

4.1. Candidate Classifiers

FastGRNN. FastRNN (Kusupati et al., 2018) provably stabilizes RNN training by helping to avoid EVGP, using only two additional scalars over the traditional RNN. FastGRNN is built over FastRNN and extends the scalars of FastRNN to vector gates while maximizing computation reuse. FastGRNN also ensures its parameter matrices are low-rank, sparse, and byte quantized to ensure very small models and very fast computation. FastGRNN is shown to match the accuracies of state-of-the-art RNNs (LSTM and GRU) across various tasks like keyword spotting, activity recognition, sentiment analysis, and language modeling while being up to 45x faster.

Let $X = [x_1, x_2, \dots, x_T]$ be the input time-series, where $x_t \in \mathbb{R}^d$. The traditional RNN’s hidden vector $h_t$ captures long-term dependencies of the input sequence: $h_t = \tanh(W x_t + U h_{t-1} + b)$. Typically, learning $U$ and $W$ is difficult due to the gradient instability. FastGRNN (Figure 3(a)) uses a scalar controlled peephole connection for every coordinate of $h_t$:

\[ z_t = \sigma(W x_t + U h_{t-1} + b_z), \]
\[ \tilde{h}_t = \tanh(W x_t + U h_{t-1} + b_h), \]
\[ h_t = (\zeta (1 - z_t) + \nu) \odot \tilde{h}_t + z_t \odot h_{t-1}. \]

Here, $0 \le \zeta, \nu \le 1$ are trainable parameters, and $\odot$ represents the vector Hadamard product.
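To make the recurrence concrete, here is a minimal pure-Python sketch of one FastGRNN step, following the update published by Kusupati et al. (2018). It uses dense, unquantized matrices as plain lists; the low-rank, sparse, and quantized parameterization is deliberately left out, and all function names are chosen here for illustration rather than taken from the EdgeML code.

```python
import math

def matvec(matrix, vector):
    """Dense matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fastgrnn_step(x_t, h_prev, W, U, b_z, b_h, zeta, nu):
    """One step: z = sigmoid(Wx + Uh + b_z), h~ = tanh(Wx + Uh + b_h),
    h = (zeta * (1 - z) + nu) * h~ + z * h_prev (all element-wise)."""
    # W x_t + U h_prev is computed once and shared by the gate and the candidate.
    pre = [wx + uh for wx, uh in zip(matvec(W, x_t), matvec(U, h_prev))]
    z = [sigmoid(p + bz) for p, bz in zip(pre, b_z)]
    h_tilde = [math.tanh(p + bh) for p, bh in zip(pre, b_h)]
    return [(zeta * (1.0 - zi) + nu) * hc + zi * hp
            for zi, hc, hp in zip(z, h_tilde, h_prev)]

def fastgrnn_run(xs, W, U, b_z, b_h, zeta, nu):
    """Run the cell over a sequence, starting from a zero hidden state."""
    h = [0.0] * len(U)
    for x_t in xs:
        h = fastgrnn_step(x_t, h, W, U, b_z, b_h, zeta, nu)
    return h
```

The shared pre-activation is the source of the computation reuse mentioned above: the gate and the candidate state differ only in their bias and nonlinearity.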

(a) FastGRNN (cascaded)
(b) EMI-FastGRNN (always on)
Figure 3. FastGRNN & EMI-FastGRNN (images from (Kusupati et al., 2018; Dennis et al., 2018))

EMI-RNN. Time-series signals, when annotated, are rarely precise and are often coarsely labeled due to various factors like human errors and smaller time frames of the activities themselves. EMI-RNN (Dennis et al., 2018) tackles the problem of signal localization using MIL, by splitting the i-th data window into instances of a fixed width (Figure 3(b)). The algorithm alternates between training the classifier and re-labeling the data based on the learned classifier until convergence. A simple thresholding scheme is applied to refine the instances: in each iteration, the k consecutive instances with maximum prediction sum for the class label are found. Only these instances are included in the training set for the next iteration. Here, k is a hyperparameter that intuitively represents the number of instances expected to cover the source signature. In the end, EMI-RNN produces precise signal signatures which are much smaller than the raw input, thus reducing the computation and memory overhead over the traditional sequential techniques. EMI-RNN also ensures early detection of noise or keywords, thereby removing the need to go through the entire signal before making a decision. When combined, EMI-FastGRNN provides very small models along with very fast inference for time-series classification tasks. Code for FastGRNN (Kusupati et al., 2018) and EMI-RNN (Dennis et al., 2018) is part of Microsoft Research India’s EdgeML repository (Dennis et al., ).
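The instance refinement step, finding the run of consecutive instances with the largest prediction sum, can be sketched with a sliding-window sum. This is an illustrative simplification under stated assumptions: `scores` holds hypothetical per-instance probabilities for the window's labeled class, `k` is the number of consecutive instances to keep, and the additional thresholding details of the published algorithm are omitted.

```python
def best_consecutive_instances(scores, k):
    """Return (start, total) for the k consecutive instances with the largest score sum."""
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of instances")
    window = sum(scores[:k])
    best_start, best_sum = 0, window
    for start in range(1, len(scores) - k + 1):
        # Slide by one instance: add the entering score, drop the leaving one.
        window += scores[start + k - 1] - scores[start - 1]
        if window > best_sum:
            best_start, best_sum = start, window
    return best_start, best_sum

# Hypothetical per-instance probabilities for the labeled class of one window.
scores = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1]
start, total = best_consecutive_instances(scores, k=3)
```

In this toy example the selected run starts at index 2, so only instances 2 through 4 would keep the window's label in the next training iteration.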

4.2. MSC-RNN Design

While EMI-RNN is by itself equipped to handle multi-class classification efficiently, we find its accuracy and non-dominant source recall to be sub-optimal for the radar time-series, especially at smaller hidden dimensions and shorter window lengths. FastGRNN, on the other hand, is a relatively heavier solution to be used as a continuously running 3-class discriminator. To redress this trade-off, we make the following observations:

  1. clutter, which yields self-canceling phase, can be rejected at a relatively shorter time-scale,

  2. disambiguating source types from their complex returns is a harder problem requiring a potentially longer window of observation, and

  3. the common case in a realistic deployment constitutes clutter; legitimate displacements are relatively few.

Figure 4. MSC-RNN architecture – the lower EMI-FastGRNN runs continuously, while the higher FastGRNN is invoked only for legitimate displacements

MSC-RNN, therefore, handles the two sub-problems at different time-scales of featurization (see Figure 4): the lower tier, an EMI-FastGRNN, discriminates sources from clutter at the level of short instances, while the upper one, a windowed FastGRNN, discriminates source types at the level of longer windows. Further, the upper tier is invoked only when a source is discriminated by the lower tier and operates on the instance-level embeddings generated by the latter.
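The efficiency argument for cascading can be written as a one-line expected-cost calculation. The sketch below is a back-of-the-envelope model, not a measurement: the cost values and source probability in the example are hypothetical placeholders in arbitrary units, and the model ignores any overhead of switching between tiers.

```python
def expected_cost(lower_cost, upper_cost, p_source):
    """Average per-window cost when the upper tier runs only on detected sources."""
    if not 0.0 <= p_source <= 1.0:
        raise ValueError("p_source must be a probability")
    return lower_cost + p_source * upper_cost

def always_on_cost(lower_cost, upper_cost):
    """Per-window cost if both tiers ran on every window."""
    return lower_cost + upper_cost

# Hypothetical units: the upper tier costs 4x the lower tier, sources occur in 10% of windows.
cascaded = expected_cost(1.0, 4.0, 0.1)
uncascaded = always_on_cost(1.0, 4.0)
```

When clutter dominates, `p_source` is small and the average cost stays close to that of the lower tier alone, which is the motivation for invoking the upper tier conditionally.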

4.2.1. Joint Training and Inference

The training of the lower tier inherits from that of EMI-training. We recap its training algorithm (Dennis et al., 2018), which occurs in two phases, the MI phase and the EMI phase. In the MI phase, where the source boundaries are refined in a clutter-prone window, the following objective function is optimized:

Here, represents the loss function of FastGRNN, and the classifier is based on the final time-step in an instance. In the EMI phase, which incorporates the early stopping, the loss is obtained by replacing the previous loss function with the sum of the classifier loss at every step: , where is the fully connected layer and the output at step . The overall training proceeds in several rounds, where the switch to the EMI loss function is typically made halfway in.
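The difference between the two phases can be illustrated for a single instance. The sketch below is not the full objective from Dennis et al. (2018), which also involves selecting instances within each window; it only contrasts a loss read from the final time-step with a loss summed over every time-step, using a fully connected layer on the hidden state followed by softmax cross-entropy. All names and the choice of cross-entropy are assumptions made for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]

def fully_connected(h, weights, bias):
    """Map a hidden state to class logits."""
    return [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(weights, bias)]

def step_loss(h, label, weights, bias):
    """Cross-entropy of the classifier applied to one hidden state."""
    return -log_softmax(fully_connected(h, weights, bias))[label]

def mi_phase_loss(hidden_states, label, weights, bias):
    """MI phase: the classifier reads only the final time-step of the instance."""
    return step_loss(hidden_states[-1], label, weights, bias)

def emi_phase_loss(hidden_states, label, weights, bias):
    """EMI phase: sum the classifier loss over every time-step of the instance."""
    return sum(step_loss(h, label, weights, bias) for h in hidden_states)
```

Penalizing every step encourages the hidden state to become predictive early in the instance, which is what allows inference to stop before the instance has been fully consumed.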

(a) Public park
(b) Indoor amphitheater
(c) Parking garage bldg.
(d) Building foyer
Figure 5. Some locations where source and clutter data was collected for experiments

For training the upper tier, in keeping with the divide-and-conquer paradigm of MSC-RNN, the upper tier FastGRNN cell should only learn to separate the source types, while ignoring instances of training data that are clutter. Therefore, we devise a conditional training strategy that captures the cascading behavior. To achieve this, the standard cross-entropy loss function of the upper tier is modified as:

where represents the upper classifier, and represents the instance-level embedding vector from EMI-RNN with a hidden dimension of (here, represents the feature dimension for the radar time-series). Intuitively, this means that the upper loss is unaffected by clutter points, and thus the tiers can be kept separate.
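The idea that clutter points leave the upper loss unaffected can be sketched as a masked cross-entropy. This is an illustration of the conditioning only, under assumptions introduced here: the label encoding, the averaging over source points, and the function names are hypothetical, and the exact form of the authors' loss may differ.

```python
import math

CLUTTER = 0  # hypothetical label encoding: 0 = clutter, 1 = human, 2 = non-human

def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the target class."""
    return -math.log(probs[target_index])

def upper_tier_loss(source_probs, labels):
    """Mean cross-entropy over source points only.

    source_probs[i] is [p_human, p_nonhuman] from the upper classifier;
    points labeled as clutter are skipped and contribute nothing.
    """
    losses = [cross_entropy(p, y - 1)
              for p, y in zip(source_probs, labels) if y != CLUTTER]
    return sum(losses) / len(losses) if losses else 0.0

# The first point is clutter, so its predicted probabilities do not matter.
loss_a = upper_tier_loss([[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]], [0, 1, 2])
loss_b = upper_tier_loss([[0.99, 0.01], [0.9, 0.1], [0.2, 0.8]], [0, 1, 2])
```

Because the clutter point is masked out, `loss_a` and `loss_b` are identical, so no gradient from clutter data would reach the upper tier through this loss.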

Input: Multi-instance training data, the number of rounds
  Freeze FastGRNN, unfreeze EMI-FastGRNN
  repeat
     Train EMI-FastGRNN()
  until convergence
  Freeze EMI-FastGRNN, unfreeze FastGRNN
  repeat
     Train FastGRNN(), minimizing the upper tier loss
  until convergence
  Unfreeze both EMI-FastGRNN and FastGRNN
  for each round do
     if the current round is the halfway round then
        Switch the lower tier loss from the MI loss to the EMI loss
     end if
     repeat
        Train MSC-RNN(), minimizing the sum of the lower and upper tier losses
     until convergence
  end for
Algorithm 1 MSC-RNN training algorithm

The training algorithm for MSC-RNN is outlined in Algorithm 1. The two tiers are first separately initialized using their respective loss functions, and in the final phase, both are jointly trained to minimize the sum of their losses. Inference is simple: the instance-level EMI-RNN stops early with a decision of “Source” when a probability threshold is crossed; consecutive positives constitute a discrimination for which the cascade is activated.
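The inference flow can be sketched as follows. This is a schematic under assumptions, not the deployed implementation: `lower_tier` and `upper_tier` are hypothetical callables standing in for the trained EMI-FastGRNN and FastGRNN, `threshold` and `min_consecutive` are illustrative parameter names, and early exit within an instance is abstracted into the lower tier's returned probability.

```python
def msc_rnn_infer(instances, lower_tier, upper_tier, threshold, min_consecutive):
    """Cascaded inference over the instances of one window.

    lower_tier(instance) -> (p_source, embedding)
    upper_tier(embeddings) -> source type label
    """
    embeddings = []
    run = 0            # current streak of consecutive "Source" instances
    triggered = False
    for instance in instances:
        p_source, embedding = lower_tier(instance)
        embeddings.append(embedding)
        run = run + 1 if p_source >= threshold else 0
        if run >= min_consecutive:
            triggered = True
    if not triggered:
        return "clutter"  # the upper tier is never invoked for clutter
    return upper_tier(embeddings)
```

Since the upper tier consumes embeddings that the lower tier has already computed, the extra work for a detected source is limited to the upper tier itself.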

5. Comparative & Tier-wise Evaluation

5.1. Datasets

Table 3(a) lists the radar source and clutter datasets collected in various indoor and outdoor environments, which are used in this work. Some of these locations are documented in Figure 5; small or cramped indoor spaces such as office cubicles have been avoided to prevent the radar returns from being adversely affected by multi-path effects and because they are not central to the smart city scenarios. A partial distribution of displacement durations is provided in Figure 6(a). Each data collection has associated with it the corresponding ground truth, recorded with motion-activated trail cameras or cellphone video cameras, with which the radar data was correlated offline to “cut” and label the source displacement snippets appropriately. (The radar dataset, which we have open-sourced, does not include individually identifiable information of living individuals and is thus not considered research with human subjects per 45 CFR 46.102(e)(1)(ii).) The datasets have been balanced in the number of human and non-human displacement points where possible, and windowed into snippets of 1, 1.5, and 2 seconds, which correspond to 256, 384, and 512 I-Q sample pairs respectively. We note that due to the duration of collections and differences in average displacement lengths, etc., humans are underrepresented in these datasets compared to the other labels. Table 3(b) shows the number of training, validation, and test points for each of these window lengths on a roughly 3:1:1 split. Currently, only the cattle set has multiple concurrent targets; efforts to expand our datasets with target as well as radar type variations are ongoing.

Environment | Data Type | Count
Building foyer | Human, Gym ball | 52, 51
Indoor amphitheater | Human, Gym ball | 49, 41
Parking garage bldg. | Human | 268
Parking lot | Human, Car | 50, 41
Indoor soccer field | Human, Gym ball | 90, 82
Large classroom | Human, Gym ball | 48, 50
Cornfield | Human, Dog | 117, 85
Cattle shed | Cow | 319
Playground | Clutter | 45 mins
Parking garage bldg. | Clutter | 45 mins
Public park | Clutter | 45 mins
Garden | Clutter | 45 mins
Lawn | Clutter | 20 mins
(a) Source displacement counts and clutter durations

Window Len. (s) | #Training Windows | #Validation Windows | #Testing Windows
1 | 17055 | 5685 | 5685
1.5 | 11217 | 3739 | 3739
2 | 8318 | 2773 | 2773
(b) Windowed data from (a) showing number of training, validation, and test points
Table 3. Radar evaluation datasets

5.2. Evaluation Methodology

Our proposed architecture is compared with existing shallow radar solutions that use feature handcrafting in the amplitude, phase and spectral domains, as well as with other MIL RNNs. In all cases involving RNNs, the radar data is represented purely in the time-domain. The models chosen for this evaluation are:

  1. 2-tier SVM with phase unwrapped displacement detection. Phase unwrapping (Goldstein et al., 1988) is a widely used technique in radar displacement detection due to its computational efficiency. The idea is to construct the relative trajectory of a source by accumulating differences in successive phase measurements, whereby clutter can be filtered out. We contrast MSC-RNN with a two-tier solution proposed in (Roy et al., 2017), which uses a robust variant of phase unwrapping with adaptive filtering of clutter samples.

  2. 3-class SVM. A clutter vs human vs non-human SVM solution that uses feature handcrafting.

  3. EMI-FastGRNN. An EMI version of FastGRNN (Section 4).

  4. EMI-LSTM. An EMI version of the LSTM. Note that this is a much heavier architecture than the former, and should not be regarded as suitable for a microcontroller device.
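The phase-unwrapping idea behind the first baseline can be illustrated with a minimal sketch. This is not the robust variant of (Roy et al., 2017): it only accumulates wrapped differences between successive phase measurements, and converts the net phase change to a radial distance using the standard two-way relation for a monostatic radar. The function names and the `wavelength_m` parameter are hypothetical, and the synthetic samples are for illustration only.

```python
import math

def unwrapped_phase(iq_samples):
    """Cumulative unwrapped phase (radians) for a sequence of (I, Q) pairs."""
    phases = [math.atan2(q, i) for i, q in iq_samples]
    total = 0.0
    trajectory = [0.0]
    for prev, cur in zip(phases, phases[1:]):
        # Wrap each successive difference into [-pi, pi) before accumulating.
        delta = (cur - prev + math.pi) % (2.0 * math.pi) - math.pi
        total += delta
        trajectory.append(total)
    return trajectory

def relative_displacement_m(iq_samples, wavelength_m):
    """Radial displacement implied by the net unwrapped phase change."""
    # Two-way propagation: a radial displacement d changes the phase
    # by 4 * pi * d / wavelength. wavelength_m is an assumed parameter.
    return unwrapped_phase(iq_samples)[-1] * wavelength_m / (4.0 * math.pi)

# Synthetic source: the phase advances by 0.5 rad per sample (steady motion).
source = [(math.cos(0.5 * n), math.sin(0.5 * n)) for n in range(40)]
# Synthetic clutter: the phase jitters back and forth with no net movement.
clutter = [(math.cos(0.3 * (-1) ** n), math.sin(0.3 * (-1) ** n)) for n in range(41)]

print(unwrapped_phase(source)[-1])   # net phase accumulated by the source
print(unwrapped_phase(clutter)[-1])  # net phase accumulated by the clutter
```

In this toy case the source accumulates a large net phase while the back-and-forth clutter cancels out, which is the property a displacement threshold exploits.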

Since shallow featurization incurs high incremental overhead, real-time micro-power radar solutions typically avoid techniques such as PCA (Candès et al., 2011), logistic regression (James et al., 2013), or low-dimensional projection (Kumar et al., 2017). Instead, the 15 best features are selected using the Max. Relevance, Min. Redundancy (mRMR) algorithm (Peng et al., 2005). For the MIL experiments, the windowed data from Table 3(b) is further reshaped into instances of length 48×2 samples with a fixed stride of 16×2, where 2 refers to the number of features (I and Q components of radar data). For example, for 1 second windows, the shape of the training data for MIL experiments is (17055, 14, 48, 2), and the shape of the corresponding instance-level one-hot labels is (17055, 14, 3). In the interest of fairness and also to avoid a combinatorial exploration of architectural parameters, we present results at fixed hidden sizes of 16, 32, and 64. For MSC-RNN, the lower tier’s output (embedding) dimension and upper tier’s hidden dimension are kept equal; however, in practice, it is easy to parameterize them differently since the former only affects the latter’s input dimension.
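The reshaping of windows into overlapping instances can be sketched as follows. This is a minimal illustration in plain Python, assuming each window is a list of (I, Q) pairs; the function names are hypothetical. It reproduces the 14 instances per 1 second window quoted above; the counts for 1.5 and 2 second windows follow from the same formula and are derived here rather than stated in the text.

```python
def num_instances(window_len, instance_len=48, stride=16):
    """Number of full instances obtained by sliding over one window."""
    return (window_len - instance_len) // stride + 1

def window_to_instances(window, instance_len=48, stride=16):
    """Split one window of (I, Q) pairs into overlapping instances."""
    if len(window) < instance_len:
        raise ValueError("window is shorter than one instance")
    starts = range(0, len(window) - instance_len + 1, stride)
    return [window[s:s + instance_len] for s in starts]

# Toy 1 second window: 256 (I, Q) pairs with placeholder values.
window = [(n, -n) for n in range(256)]
instances = window_to_instances(window)

# Per-window shape (instances, samples per instance, features).
print(len(instances), len(instances[0]), len(instances[0][0]))
# Instance counts for the 1, 1.5, and 2 second windows.
print([num_instances(n) for n in (256, 384, 512)])
```

Stacking the per-window result over all 17055 training windows gives the (17055, 14, 48, 2) shape described above.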

Figure 6. Source detected duration CDF for the data in Table 3(a), and how the EMI hyperparameter in 2-class EMI affects their detection (1 second windows). Panels: (a) displacement duration CDF (partial); (b) impact of the EMI hyperparameter on EMI recalls.

5.2.1. Hyperparameters

Table 4 lists the hyperparameter combinations used in our experiments. For the upper-tier source discrimination comparison in Section 5.3.3, FastGRNN is also allowed to select its optimum input length from 16, 32, and 64 samples. The selection of the EMI hyperparameter merits some discussion, in that it controls the extent of “strictness” we assign to the definition of displacement. A higher value makes it more difficult for a current window to be classified as a source unless the feature of interest is genuinely compelling. As expected, this gives a trade-off between clutter and source recall, as illustrated in Figure 6(b). As explained in Section 1, controlling for false positives is extremely important in radar sensing contexts such as intrusion detection. Hence, we empirically set this hyperparameter to 10, the smallest value that gives a clutter recall of 0.999 or higher in our windowed datasets.
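The strictness trade-off can be illustrated with a small sketch. It assumes, as one plausible reading of the EMI inference rule of (Dennis et al., 2018), that a window is declared a source only when at least `k` consecutive instances are predicted as source; the name `k`, the function, and the toy predictions are all hypothetical and are not taken from the experiments.

```python
def window_is_source(instance_preds, k):
    """True if at least k consecutive instances are predicted as source (1)."""
    run = 0
    for pred in instance_preds:
        run = run + 1 if pred == 1 else 0
        if run >= k:
            return True
    return False

# Toy per-instance predictions for 14-instance (1 second) windows.
sustained_source = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]  # run of 10
brief_clutter_blip = [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # run of 2

for k in (2, 10, 11):
    print(k, window_is_source(sustained_source, k),
          window_is_source(brief_clutter_blip, k))
```

A small threshold accepts the brief blip (hurting clutter recall), while a threshold longer than the sustained run rejects the genuine source (hurting source recall), mirroring the trade-off in Figure 6(b).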

| Model | Hyperparameter | Values |
|---|---|---|
| EMI/FastGRNN | Batch Size | 64, 128 |
| | Hidden Size | 16, 32, 64 |
| | Gate Nonlinearity | sigmoid, tanh |
| | Update Nonlinearity | sigmoid, tanh |
| | Keep prob. (EMI-LSTM) | 0.5, 0.75, 1.0 |
| | Optimizer | Adam |

Table 4. Training hyperparameters used

5.3. Results

5.3.1. Comparative Classifier Performance

We compare the inference accuracy and recalls of MSC-RNN with the RNN and shallow solutions outlined in Section 5.2. Recall that we have purposefully devised a purely time-domain solution for source discrimination for efficiency reasons, since FFT computation is one of the main components of featurization overhead. Figure 7 compares MSC-RNN with engineered features in the amplitude, time, and spectral domains that are optimized for micro-power radar classification. For the two-tier SVM, the source recalls for increasing window sizes are inferred from Figure 8 (discussed in Section 5.3.3). We find that MSC-RNN significantly outperforms the 2-tier SVM solution in terms of human and non-human recalls, even with features learned from the raw time-series. Similarly, for the 3-class case, our solution provides much more stable noise robustness and is generally superior even to the much heavier SVM solution.

(a) Human Recall
(b) Non-human Recall
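| Len. (s) | Accuracy (3-class SVM) | Accuracy (MSC-RNN) | Clutter Recall (3-class SVM) | Clutter Recall (MSC-RNN) |
|---|---|---|---|---|
| 1 | 0.851 | 0.944 | 0.758 | 0.999 |
| 1.5 | 0.934 | 0.954 | 0.996 | 0.999 |
| 2 | 0.959 | 0.972 | 0.999 | 1.000 |

(c) Accuracy and Clutter Recall (3-class SVM and MSC-RNN)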
Figure 7. Classification comparison of purely time-domain FastGRNN with two SVM solutions: (a) a 2-tier system using a phase unwrapped clutter rejector as the lower tier, and (b) a 3-class SVM. Both use 15 high information features handcrafted in the amplitude, time, and spectral domains

Figure 9 contrasts our model with 3-class EMI-FastGRNN and EMI-LSTM, for fixed hidden sizes of 16, 32, and 64 respectively. It can be seen that MSC-RNN outperforms the monolithic EMI algorithms on all three metrics of accuracy, non-human and human recalls (with one exception for EMI-LSTM). Notably, cascading significantly enhances the non-dominant class recall over the other methods, especially for larger hidden sizes, and therefore offers better resilience to the source type imbalance in radar datasets.

5.3.2. Runtime Efficiency Comparison - MSC-RNN vs. Feature Handcrafting.

Table 5 lists the runtime duty cycle estimates of MSC-RNN versus shallow SVM alternatives in two deployment contexts with realistic clutter conditions, supported by usage statistics of a popular biking trail in Columbus, OH [35]. While the 2-tier SVM understandably has the lowest duty cycle due to a cheap lower tier, it is not a competitive solution, as established in Section 5.3.1. The 3-class SVM, on the other hand, is dominated by the feature computation overhead. While the 48×2 MSC-RNN formulation is about 1.75× as efficient as using handcrafted features, it is possible to reduce instance-level computations even further by using longer input vectors and reducing the number of iterations. As an example, MSC-RNN with a 16-dimensional input vector is about 3× more efficient than feature engineering.

| Architecture | Est. Duty Cycle (Cortex-M3), 97% Clutter | Est. Duty Cycle (Cortex-M3), 98% Clutter |
|---|---|---|
| MSC-RNN (Inp. dim.=2) | 21.00% | 20.00% |
| MSC-RNN (Inp. dim.=16) | 10.87% | 10.7% |
| 2-Tier SVM | 2.05% | 1.7% |
| 3-Class SVM | 35.00% | 35.00% |

Table 5. Estimated featurization duty cycle comparison on ARM Cortex-M3
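The efficiency factors quoted for MSC-RNN can be checked directly against the duty cycles in Table 5. The short sketch below simply divides the 3-class SVM duty cycle by each MSC-RNN duty cycle; the dictionary and function names are hypothetical, and the only data used are the table entries.

```python
# Estimated duty cycles (%) from Table 5: (97% clutter, 98% clutter).
DUTY_CYCLE_PCT = {
    "MSC-RNN (inp. dim. 2)": (21.00, 20.00),
    "MSC-RNN (inp. dim. 16)": (10.87, 10.7),
    "2-Tier SVM": (2.05, 1.7),
    "3-Class SVM": (35.00, 35.00),
}

def efficiency_gain(baseline, candidate):
    """Ratio of baseline to candidate duty cycle, per clutter scenario."""
    pairs = zip(DUTY_CYCLE_PCT[baseline], DUTY_CYCLE_PCT[candidate])
    return tuple(round(b / c, 2) for b, c in pairs)

print(efficiency_gain("3-Class SVM", "MSC-RNN (inp. dim. 2)"))
print(efficiency_gain("3-Class SVM", "MSC-RNN (inp. dim. 16)"))
```

The ratios come out at roughly 1.7 to 1.75 for the 2-dimensional input and a little over 3 for the 16-dimensional input, consistent with the figures given in the text.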

5.3.3. Tier-wise Evaluation

We next compare the lower-tier and upper-tier classifiers individually to their shallow counterparts in the 2-tier SVM solution.

Tier 1 Classifier. Figure 8 compares the probabilities of missed detects versus displacement durations for the 3-outof-4 displacement detector and the EMI component of our solution with 2-second windows, at hidden sizes of 16, 32, and 64 (for a principled approach to choosing parameters for the former, refer to Appendix A). It can be seen that, for the shortest cut length of 1.5 s in the dataset, the detection probability is improved by up to 1.5× (1.6×) over the 3-outof-4 detector with false alarm rates of 1/week and 1/month respectively, even when the false alarm rate (1 − test clutter recall) of EMI is 0, which translates to a false alarm rate of 1 per year. Further, the EMI detector converges to 0 false detects for displacements of 2.5 s or longer, and is therefore able to reliably detect walks 2.6× shorter than the previous solution. Therefore, it is possible to restrict false positives much below 1/month while significantly improving detectability over the M-outof-N solution. Since the clutter and source datasets span various backgrounds (Figure 5), MSC-RNN offers superior cross-environmental robustness.

Figure 8. Comparison of miss probabilities versus displacement durations of Tier 1 classifier vs. 3-outof-4 phase unwrapped displacement detector (window length: 2 seconds)

Tier 2 Classifier. We now show that the gains of MSC-RNN over the 2-tier SVM solution are not, in fact, contingent on the quality of the latter's underlying displacement detector. For this experiment, we train a 2-class FastGRNN on embeddings derived from the lower-layer EMI-FastGRNN. Table 6 compares its performance with that of the upper-tier SVM from the 2-tier solution when trained with the best 15 cross-domain features obtained from the raw radar samples. It can be seen that the purely time-domain FastGRNN still generally outperforms the 2-class SVM on all three metrics of accuracy, human recall, and non-human recall. Thus, it is possible to replace feature engineering with deep feature learning and enjoy the dual benefits of improved sensing and runtime efficiency for this class of radar applications.

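| Window Len. (s) | Accuracy (SVM) | Accuracy (FastGRNN) | Human Recall (SVM) | Human Recall (FastGRNN) | Non-human Recall (SVM) | Non-human Recall (FastGRNN) |
|---|---|---|---|---|---|---|
| 1 | 0.93 | 0.93 | 0.90 | 0.90 | 0.93 | 0.94 |
| 1.5 | 0.93 | 0.93 | 0.90 | 0.93 | 0.95 | 0.95 |
| 2 | 0.93 | 0.96 | 0.86 | 0.96 | 0.96 | 0.97 |

Table 6. Independent of the Tier 1 classifier, the Tier 2 source-type classifier outperforms the SVM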
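The conditional invocation that underlies these tier-wise results can be sketched as a simple control flow: the lower tier runs on every window, and the upper tier is called only on windows the lower tier flags as a source, using the lower tier's embedding as input. The sketch below uses hypothetical stand-in functions (`toy_lower_tier`, `toy_upper_tier`) with arbitrary energy rules in place of the trained EMI-FastGRNN and FastGRNN models; it illustrates the cascade only, not the classifiers.

```python
def cascade_classify(window, lower_tier, upper_tier):
    """Run the lower tier always; invoke the upper tier only for sources."""
    is_source, embedding = lower_tier(window)
    if not is_source:
        return "clutter"
    return upper_tier(embedding)

# Hypothetical stand-ins for the trained tiers, for illustration only.
upper_calls = []

def toy_lower_tier(window):
    energy = sum(i * i + q * q for i, q in window) / len(window)
    return energy > 1.0, [energy]

def toy_upper_tier(embedding):
    upper_calls.append(embedding)
    return "human" if embedding[0] > 10.0 else "non-human"

windows = [
    [(0.1, 0.1)] * 8,  # low energy: treated as clutter by the toy rule
    [(2.0, 1.0)] * 8,  # energy 5: source, "non-human" under the toy rule
    [(4.0, 3.0)] * 8,  # energy 25: source, "human" under the toy rule
]
labels = [cascade_classify(w, toy_lower_tier, toy_upper_tier) for w in windows]
print(labels, len(upper_calls))
```

Because the upper tier is skipped for clutter windows, its cost is paid only for the fraction of windows that contain a source, which is why high clutter fractions translate into low duty cycles.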

Figure 9. Sensing performance comparison of MSC-RNN with EMI-FastGRNN and EMI-LSTM. Panels (a)-(i): accuracy, human recall, and non-human recall at hidden sizes H=16, H=32, and H=64.

6. Low-power Implementation

The radar sensor described in Figure 1(a) uses an ARM Cortex-M3 microcontroller with 96 KB of RAM and 4 MB of flash storage. It runs eMote (The Samraksh Company [42]), a low-jitter near real-time operating system with a small footprint. We emphasize that energy-efficient compute, not working memory or storage, is the bigger concern for efficient real-time operation. Hence, we take several measures to implement the multi-scale RNN so that it runs at a low duty cycle on the device. These include low-rank representation of hidden states, Q15 quantization, and piecewise-linear approximations of non-linear functions. The last of these in particular ensures that all the computations can be performed with integer arithmetic when the weights and inputs are quantized; for example, the sigmoid and tanh non-linearities (Table 4) are each replaced by a piecewise-linear approximation. The underlying linear algebraic operations are implemented using the CMSIS-DSP library [9]. While advanced ARM processors such as the Cortex-M4 offer floating point support, it should be noted that sparse, low-rank matrices and quantization techniques remain beneficial in general for efficiency reasons.
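To make the integer-arithmetic point concrete, the sketch below emulates Q15 fixed-point values with Python integers and applies clipped-linear ("hard") versions of tanh and sigmoid. The specific approximations used here, clip(x, -1, 1) and clip((x + 1) / 2, 0, 1), are an assumption based on common practice and are not taken from the text; all function names are hypothetical, and Python stands in for the C code that would run on the microcontroller.

```python
Q15_ONE = 1 << 15  # In Q15, the integer 32768 represents 1.0.
Q15_MAX = Q15_ONE - 1
Q15_MIN = -Q15_ONE

def saturate(value):
    """Clamp an integer to the representable Q15 range."""
    return max(Q15_MIN, min(Q15_MAX, value))

def to_q15(x):
    """Quantize a float in [-1, 1) to a saturated Q15 integer."""
    return saturate(int(round(x * Q15_ONE)))

def from_q15(value):
    """Convert a Q15 integer back to a float, for inspection only."""
    return value / Q15_ONE

def q15_mul(a, b):
    """Multiply two Q15 integers using only integer operations."""
    return saturate((a * b) >> 15)

def hard_tanh_q15(value):
    """Assumed piecewise-linear tanh: clip(x, -1, 1), in Q15."""
    return saturate(value)

def hard_sigmoid_q15(value):
    """Assumed piecewise-linear sigmoid: clip((x + 1) / 2, 0, 1), in Q15."""
    return max(0, min(Q15_MAX, (value + Q15_ONE) >> 1))

half = to_q15(0.5)
print(from_q15(q15_mul(half, half)))     # 0.5 * 0.5 in fixed point
print(from_q15(hard_sigmoid_q15(0)))     # midpoint of the sigmoid
print(from_q15(hard_tanh_q15(3 * Q15_ONE)))  # saturates just below 1.0
```

Each activation reduces to an add, a shift, and two comparisons, so no floating-point unit or lookup table is needed.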

7. Conclusion and Future Work

In this work, we introduce multi-scale, cascaded RNNs for radar sensing, and show how leveraging the ontological decomposition of a canonical classification problem into clutter vs. source classification, followed by source type discrimination on an on-demand basis, can improve both sensing quality and runtime efficiency over alternative systems. Learning discriminators at the time-scales relevant to their respective tasks, and jointly training the discriminators while being cognizant of the cascading behavior between them, yields the desired improvement. The extension of MSC-RNNs to more complicated sensing contexts is a topic of future work. Of interest are regression-based radar “counting” problems such as occupancy estimation or active transportation monitoring, where the competitiveness of MSC-RNN with architectures such as TCNs (Bai et al., 2018) could be insightful. We also believe that MSC-RNN could apply to alternative sensing for smart cities and built environments where the sources have intrinsic ontological hierarchies, such as in urban sound classification (Bello et al., 2019).


We thank our shepherd, Zheng Yang, and the anonymous reviewers for their comments. We are indebted to Don Dennis, Prateek Jain, and Harsha Vardhan Simhadri at Microsoft Research India for their suggestions and feedback. The computation for this work was supported by the Ohio Supercomputer Center (Center, 1987) project PAS1090, the IIT Delhi HPC facility, and Azure services provided by Microsoft Research Summer Workshop 2018: Machine Learning on Constrained Devices.


  • J. Adkins et al. (2018) The signpost platform for city-scale sensing. In IPSN, pp. 188–199.
  • S. Bai, J. Z. Kolter, et al. (2018) An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.
  • J. P. Bello, C. Silva, et al. (2019) SONYC: a system for monitoring, analyzing, and mitigating urban noise pollution. CACM 62 (2), pp. 68–77. ISSN 0001-0782.
  • E. J. Candès, X. Li, et al. (2011) Robust principal component analysis? JACM 58 (3), pp. 11.
  • C. E. Catlett, P. H. Beckman, et al. (2017) Array of things: a scientific research instrument in the public way: platform design and early lessons learned. In SCOPE, pp. 26–33.
  • O. S. Center (1987) Ohio supercomputer center.
  • K. Cho, B. Van Merriënboer, et al. (2014) On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259.
  • J. Chung, S. Ahn, et al. (2016) Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704.
  • [9] CMSIS-dsp software library.
  • [10] EdgeML: Machine learning for resource-constrained edge devices.
  • D. Dennis, C. Pabbaraju, et al. (2018) Multiple instance learning for efficient sequential data classification on resource-constrained devices. In NIPS, pp. 10975–10986.
  • R. M. Goldstein, H. A. Zebker, et al. (1988) Satellite radar interferometry: two-dimensional phase unwrapping. Radio science 23 (4), pp. 713–720.
  • S. Han, H. Mao, et al. (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. ICLR.
  • J. He and A. Arora (2014) A regression-based radar-mote system for people counting. In PerCom, pp. 95–102.
  • J. He, D. Roy, et al. (2014) Mote-scale human-animal classification via micropower radar. In SenSys, pp. 328–329.
  • G. E. Hinton (1990) Mapping part-whole hierarchies into connectionist networks. Artificial Intelligence 46 (1-2), pp. 47–75.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780.
  • G. James, D. Witten, et al. (2013) An introduction to statistical learning. Vol. 112, Springer.
  • R. Jesus Javier and Y. Kim (2014) Application of linear predictive coding for human activity classification based on micro-doppler signatures. GRSL, IEEE 11, pp. 1831–1834.
  • B. Jokanovic, M. Amin, et al. (2016) Radar fall motion detection using deep learning. In RADAR, pp. 1–6. ISSN 2375-5318.
  • C. Jose, M. Cisse, et al. (2017) Kronecker recurrent units. arXiv preprint arXiv:1705.10142.
  • Y. Kim and H. Ling (2008) Human activity classification based on micro-doppler signatures using an artificial neural network. In AP-S, pp. 1–4.
  • Y. Kim and T. Moon (2015) Human detection and activity classification based on micro-doppler signatures using deep convolutional neural networks. GRSL, IEEE 13, pp. 1–5.
  • Y. Kim and B. Toomajian (2016) Hand gesture recognition using micro-doppler signatures with convolutional neural network. IEEE Access 4, pp. 1–1.
  • D. Kotzias, M. Denil, et al. (2014) Deep multi-instance transfer learning. arXiv preprint arXiv:1411.3128.
  • A. Krizhevsky, I. Sutskever, et al. (2012) Imagenet classification with deep convolutional neural networks. In NIPS, pp. 1097–1105.
  • A. Kumar, S. Goyal, et al. (2017) Resource-efficient machine learning in 2 kb ram for the internet of things. In ICML, pp. 1935–1944.
  • S. Kumari, D. Roy, et al. (2019) EdgeL3: compressing l3-net for mote scale urban noise monitoring. In IPDPSW, pp. 877–884.
  • A. Kusupati, M. Singh, et al. (2018) FastGRNN: a fast, accurate, stable and tiny kilobyte sized gated recurrent neural network. In NIPS, pp. 9030–9041.
  • S. Lam Phung, F. H. C. Tivive, et al. (2015) Classification of micro-doppler signatures of human motions using log-gabor filters. IET Radar, Sonar & Navigation 9.
  • L. Liu, M. Popescu, et al. (2011) Automatic fall detection based on doppler radar motion signature. In PervasiveHealth, Vol. 222, pp. 222–225.
  • G. Mendis, T. Randeny, et al. (2016) Deep learning based doppler radar for micro uas detection and classification. In MILCOM, pp. 924–929.
  • P. Molchanov, J. Astola, et al. (2011) Ground moving target classification by using dct coefficients extracted from micro-doppler radar signatures and artificial neuron network. In MRRS, pp. 173–176.
  • M. C. Mozer (1992) Induction of multiscale temporal structure. In NIPS, pp. 275–282.
  • [35] Olentangy trail usage, columbus, oh.
  • D. P. Fairchild and R. Narayanan (2014) Classification of human motions using empirical mode decomposition of human micro-doppler signatures. Radar, Sonar & Navigation, IET 8, pp. 425–434.
  • R. Pascanu, T. Mikolov, et al. (2013) On the difficulty of training recurrent neural networks. In ICML, pp. 1310–1318.
  • H. Peng, F. Long, et al. (2005) Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on pattern analysis and machine intelligence 27 (8), pp. 1226–1238.
  • [39] MSC-RNN: Multi-Scale, Cascaded RNNs for Radar Classification.
  • D. Roy, C. Morse, et al. (2017) Cross-environmentally robust intruder discrimination in radar motes. In MASS, pp. 426–434.
  • J. Schmidhuber (1992) Learning complex, extended sequences using the principle of history compression. Neural Computation 4 (2), pp. 234–242.
  • [42] The Samraksh Company. .NOW with eMote.
  • [43] UrbanCCD. Array of things.
  • Z. Wang, J. Lin, et al. (2017) Accelerating recurrent neural networks: a memory-efficient approach. IEEE Transactions on VLSI Systems 25 (10), pp. 2763–2775.
  • J. Wu, Y. Yu, et al. (2015) Deep multiple instance learning for image classification and auto-annotation. In CVPR, pp. 3460–3469.
  • J. Ye, L. Wang, et al. (2018) Learning compact recurrent neural networks with block-term tensor decomposition. In CVPR, pp. 9378–9387.
  • J. Zhang, Q. Lei, et al. (2018) Stabilizing gradients for deep neural networks via efficient svd parameterization. arXiv preprint arXiv:1803.09327.

Appendix A Parameter Selection for M-outof-N Displacement Detector

Figure 10. Shallow displacement detector parameter selection using the datasets from Table 3(a); here, M=3 and N=4. Panels: (a) clutter threshold selection for 1 FA/week and 1 FA/month; (b) relaxation of per-window threshold through aggregation.

We discuss, in a principled manner, the parameter selection process for the unwrapped-phase displacement detector (Roy et al., 2017) referenced in Figures 7 and 8. Figure 10(a) shows the cumulative distribution of unwrapped phase changes of environmental clutter per integration window, translated into real distance units, in various environments from the clutter datasets in Table 3(a). The data is extrapolated using linear fitting on a logarithmic scale to estimate the phase thresholds required to satisfy false alarm rates of 1 per week and 1 per month respectively (derived using Bernoulli probabilities). We see that the unwrapped thresholds for 1 false alarm per week and per month correspond to 0.3 m and 0.32 m respectively. In this analysis, we fix the IQ rejection parameter at 0.9, which gives us the most lenient thresholds. Figure 10(b) illustrates the CCDFs of phase displacements for all target types (humans, gym balls, dogs, cattle, and slow-moving vehicles) in our dataset combined, calculated per window. With thresholds set according to the previous analysis, the probability of false negatives per window is still significant. In practice, the algorithm improves detection by basing its decision on M-outof-N sliding windows, where the miss probability improves since the threshold per window is now a fraction of the original threshold. For 1 false alarm per week (month), the displacement threshold for the M-outof-N detector (M=3, N=4) reduces to 0.22 m (0.24 m) per window, with an improved miss probability of 0.59 (0.62).
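The M-outof-N decision rule can be sketched as follows. This is a minimal illustration, not the detector of (Roy et al., 2017): it flags a detection whenever at least M of the last N per-window displacements reach the per-window threshold. The function names and the toy displacement values are hypothetical. The second function gives the probability of at least M hits in N windows under the simplifying assumption that windows are independent with the same per-window hit probability, which overlapping sliding windows only approximate.

```python
from math import comb

def m_out_of_n_detections(displacements_m, threshold_m, m=3, n=4):
    """For each run of n consecutive windows, report whether >= m exceed the threshold."""
    hits = [abs(d) >= threshold_m for d in displacements_m]
    return [sum(hits[end - n:end]) >= m for end in range(n, len(hits) + 1)]

def prob_at_least_m_of_n(p, m=3, n=4):
    """P(at least m of n windows hit), assuming independent windows (an assumption)."""
    return sum(comb(n, j) * p ** j * (1.0 - p) ** (n - j) for j in range(m, n + 1))

# Toy per-window displacements in metres, checked against the 0.22 m threshold.
displacements = [0.05, 0.25, 0.30, 0.10, 0.26, 0.02, 0.01]
print(m_out_of_n_detections(displacements, threshold_m=0.22))
print(prob_at_least_m_of_n(0.5))
```

Only the second group of four windows contains three displacements at or above the threshold, so it is the only one flagged; isolated exceedances are ignored, which is what allows the per-window threshold to be relaxed.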