1. Introduction
Motivated by the needs of IoT applications, this paper presents a principled way of designing deep neural networks that learn (from IoT sensing signals) features inspired by the fundamental properties of the underlying domain of measurements; namely, properties of physical signals. By IoT applications, we refer to those where sensors measure physical quantities, generating (possibly complex and multidimensional) time-series data that typically reflect some underlying physical process. The human brain (whose wiring inspires the structure of conventional neural networks) extracts features well-suited for external perceptual tasks, which explains the great success of such networks at those tasks. In contrast, the internal physical processes underlying sensor measurements in IoT systems have properties (such as physical inertia, characteristics of wireless signal propagation, and signal resonance) that depend more on signal frequency, motivating feature extraction in the frequency domain. It is no coincidence that much of the classical signal processing literature works by first transforming time-series data to the frequency domain. To help capture signatures of internal physical processes the way a brain captures their externally perceived properties, this paper develops a new neural network block designed specifically for learning in the frequency domain.

The design of neural network structures greatly influences the efficiency of signal modelling and the ease of extraction of hidden patterns. Convolutional neural networks (CNNs) for image recognition, for example, align perfectly with biological studies of the visual cortex
(Hubel and Wiesel, 1968) and with domain knowledge in digital image processing (Gonzalez et al., 2002). We thus ask a fundamental question: what structures are well-suited for the domain of physical sensor measurements, which we henceforth call the domain of IoT?

Previous research on customizing deep learning models to the needs of IoT applications (Yao et al., 2018e; Lane et al., 2015; Yao et al., 2017a) mainly focused on designing neural network structures that integrate conventional deep learning components, such as convolutional and recurrent layers, to extract spatial and temporal properties of inputs. On the other hand, since the physics of measured phenomena are best expressed in the frequency domain, decades of research on signal processing developed powerful techniques for time-frequency analysis of signals, including motion sensor signals (Stisen et al., 2015; Hemminki et al., 2013), radio frequency signals (Wang et al., 2015; Pu et al., 2013), acoustic signals (Gupta et al., 2012; Chen et al., 2014), and visible light signals (Li et al., 2016). A popular transform that maps time-series measurements to the frequency domain is the Short-Time Fourier Transform (STFT). We therefore propose a new neural network model, namely, Short-Time Fourier Neural Networks (STFNets), that operates directly in the frequency domain.
One potential approach for learning in the frequency domain might simply be to convert sensing signals into the frequency domain first, and then apply conventional neural network components, possibly extending them to support operations on complex numbers so they can represent frequency-domain quantities (Trabelsi et al., 2017). These approaches miss two key opportunities for improvement, described below, that we take advantage of in this work. As a result, our work leads to more accurate results, as shown in the evaluation section. The two reasons that account for our improvements are as follows.
First, different from traditional neural networks, where the internal representations constitute features with no physical meaning, the internal representations in STFNet leverage frequency-domain semantics that encode time and frequency information. All operations and learnable parameters we propose are explicitly made compatible with the basic properties of spectral data, and align corresponding frequency and time components. In our design, we categorize spectral manipulations into three main types: filtering, convolution, and pooling. Filtering refers to general spectral filtering and global template matching; convolution refers to local motif detection, including shift detection and local template detection; and pooling refers to dimension reduction over the frequency domain. We then design spectral-compatible parameters and operating rules for each of these three manipulation categories, which show superior performance in our evaluations compared to the application of conventional neural networks in the domain of complex numbers.
Second, transforming signals to the frequency domain is governed by the uncertainty principle (Smith, 2007). The transformed representation cannot achieve both a high frequency resolution and a high time resolution at the same time. In STFT, the time-frequency resolution is controlled by the length of the sliding window (the length of the part of the time series being converted at a time). With a longer window, we obtain a finer-grained frequency representation; however, we then cannot achieve a time resolution smaller than the window size. The uncertainty principle causes a dilemma in traditional time-frequency analysis: one often needs to guess the best time-frequency resolution by trial and error. In STFNet, we circumvent this dilemma by simultaneously computing multiple STFTs with different time-frequency resolutions. The representations with different time-frequency resolutions then mutually enhance one another in a data-driven manner. The network automatically learns the best resolution or resolutions, where the most useful features are present. STFNet defines a formal way to extract features from multiple time-frequency transformations with the same set of spectral-compatible operations and parameters, which greatly reduces model complexity while improving accuracy.
We demonstrate the effectiveness of STFNets through extensive experiments with various sensing modalities. During the evaluation, we focus on device-based and device-free human activity recognition with a broad range of sensing modalities, including motion sensors (accelerometer and gyroscope), WiFi, ultrasound, and visible light. The experimental results validate the design choices of STFNets and illustrate their superior accuracy compared to state-of-the-art deep learning frameworks for IoT applications.
Broadly speaking, the main contributions of this paper to the general research landscape of deep learning and IoT are twofold:

STFNet presents a principled way of designing neural networks that reveal the key properties of physical processes underlying sensing signals from the time-frequency perspective.

STFNet unveils the benefit of incorporating domain-specific analytic modelling and transformation techniques into neural network design.
The rest of the paper is organized as follows. Section 2 introduces related work on deep learning in the context of mobile sensing as well as deep learning for spectral representations. We introduce the detailed technical design of STFNet in Section 3. The evaluation is presented in Section 4. Finally, we discuss the results in Section 5 and conclude in Section 6.
2. Related Work
The impressive achievements in image classification using deep neural networks at the turn of the decade (Krizhevsky et al., 2012)
precipitated a re-emergence of interest in deep learning. Deep neural networks have achieved significant accuracy improvements in a broad spectrum of areas, including computer vision (Simonyan and Zisserman, 2014; He et al., 2016), natural language processing (Collobert et al., 2011; Bahdanau et al., 2014), and network analysis (Perozzi et al., 2014; Kipf and Welling, 2016).

Recent efforts applied deep learning in the context of IoT. In order to improve the predictive accuracy of IoT applications, researchers employed deep learning to model complicated sensing tasks (Lane et al., 2015; Yao et al., 2017a). In order to improve system efficiency at executing neural networks on low-end IoT devices, efforts have been made to compress model parameters and/or structures in a manner that entails (almost) no accuracy loss (Yao et al., 2017b; Yao et al., 2018a; Bhattacharya and Lane, 2016; Han et al., 2015). Recent work in the context of IoT also addressed mathematical foundations for quantifying confidence in deep learning predictions to support mission-critical applications. The work produced deep neural networks that offer well-calibrated uncertainty estimates in results (Yao et al., 2018c, d; Gal and Ghahramani, 2015, 2016). Finally, the challenge of insufficient labeling of IoT data was addressed by introducing semi-supervised approaches for deep learning that allow neural network training using mostly unlabeled data (Yao et al., 2018b). However, none of the aforementioned IoT-inspired efforts addressed the customization of learning machinery to a different signal space inspired by the physics of measured processes; namely, the frequency domain.

To fill the above gap, recent work in machine learning focused on extending deep neural networks to complex numbers and spectral representations. Trabelsi et al. propose deep complex networks, investigating complex-valued neural network structures (Trabelsi et al., 2017). However, they mainly concentrate on the problems of initialization, normalization, and activation functions when extending real-valued operations directly into the complex-valued domain. Their designs focus more on complex-valued representations than spectral representations, and do not take the properties of spectral data into consideration. Rippel et al. study spectral representations for convolutional neural networks (Rippel et al., 2015). However, their study focuses on spectral parametrization of standard CNNs, instead of designing operations customized for spectral data. In addition, their work treats input data fully from the frequency perspective instead of the time-frequency perspective. Yao et al. propose a neural network that takes short-time Fourier-transformed data as inputs (Yao et al., 2017a). Yet their design uses traditional CNNs and RNNs, combining the real and imaginary parts of complex-valued inputs as additional features.

To the best of our knowledge, STFNet is the first work that integrates neural networks with traditional time-frequency analysis, and designs fundamental spectral-compatible operations for Fourier-transformed representations. Our study shows that the approach leads to improved accuracy compared to the state of the art. It implies that integrating neural networks with domain-inspired transformation techniques (in our case, the Fourier Transform of physical time-series signals) projects input signals into a space that significantly facilitates the learning process.
3. Short-Time Fourier Neural Networks
We introduce the technical details of STFNets in this section. We separate the technical descriptions into six parts. In the first two subsections, we provide some background followed by a high-level overview of STFNet components, including (i) hologram interleaving, (ii) STFNet-filtering, (iii) STFNet-convolution, and (iv) STFNet-pooling. In the remaining four subsections, we describe the technical details of each of these components, respectively.
3.1. Background and STFNet Overview
IoT devices sample the physical environment, generating time-series data. The Discrete Fourier Transform (DFT) is a mathematical tool that converts $N$ samples over time (with a sampling rate of $f$) into $N$ components in frequency (with a frequency step of $f/N$). The more samples are selected, the finer the component resolution in frequency.
We can always transform the whole sequence of data with DFT, achieving a high frequency resolution. However, we then lose information on signal evolution over time, i.e., the time resolution. In order to solve this problem, the Short-Time Fourier Transform (STFT) divides a longer signal into shorter segments of equal length and computes the DFT separately on each segment. By sacrificing a certain degree of frequency resolution, STFT regains the time resolution to some extent. In choosing the segment length, there arises a fundamental tradeoff between the attainable time and frequency resolution, which is called the uncertainty principle (Smith, 2007). For the purpose of learning to predict a given output, the optimal tradeoff point depends on the time and frequency granularity of the features that best determine the outputs we want to reproduce. The goal of STFNets is thus to learn frequency-domain features that predict the output, while at the same time learning the best resolution tradeoff point at which the relevant features exist.
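To make the tradeoff concrete, the following sketch (our own illustration, not code from the paper) transforms the same 512-sample signal with two rectangular, non-overlapping STFT window widths; the wider window yields more frequency bins per chunk but fewer time chunks:

```python
import numpy as np

fs = 100.0                          # assumed sampling rate (Hz)
t = np.arange(512) / fs
x = np.sin(2 * np.pi * 5.0 * t)     # a 5 Hz tone

def stft_rect(x, width):
    """Rectangular-window, non-overlapping STFT (sliding step == width)."""
    chunks = x.reshape(-1, width)            # (num_chunks, width)
    return np.fft.rfft(chunks, axis=1)       # (num_chunks, width // 2 + 1)

narrow = stft_rect(x, 32)    # fine time resolution, coarse frequency bins
wide   = stft_rect(x, 128)   # coarse time resolution, fine frequency bins

print(narrow.shape, wide.shape)   # (16, 17) (4, 65)
```

The total amount of information is comparable, but resolution is traded between the two axes, which is exactly the dilemma the time-frequency hologram addresses.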
The building component of an STFNet is an STFNet block, shown in Figure 1. An STFNet block is the layer equivalent in our neural network; the larger network would normally be composed by stacking such layers. Within each block, STFNet circumvents the uncertainty principle by computing multiple STFT representations with different time-frequency resolutions. Collectively, these representations constitute what we call the time-frequency hologram, and we call an individual time-frequency signal representation a hologram representation. The representations are then used to mutually enhance each other by filling in the missing frequency components in each.
Candidate frequency-domain features are then extracted from these enhanced representations via general spectral manipulations that come in two flavors: filtering and convolution. They represent global and local feature extraction operations, respectively. The filtering and convolution kernels are learnable, making each STFNet layer a building block for spectral manipulation and learnable frequency-domain feature extraction. In addition, we also design a new mechanism, called pooling, for frequency-domain dimensionality reduction in STFNets. Combinations of features extracted using the above manipulations then pass through activation functions and an inverse STFT to produce (filtered) outputs in the time domain. Stacking STFNet blocks has the effect of producing progressively sharper (i.e., higher-order) filters that shape the frequency-domain signal representation into more relevant and more fine-tuned features.
Figure 2 gives an example of an STFNet block that accepts as input a two-dimensional time-series signal (e.g., 2D accelerometer data). Each dimension is then transformed to the frequency domain at four different resolutions using STFT, generating four different internal nodes, each representing the signal in the frequency domain at a different time-frequency resolution. Collectively, the four representations constitute the hologram. In the next step, mutual enhancements are performed, improving all representations. Each representation then undergoes a variety of alternative spectral manipulations (called "filters" in the figure). Two filters are shown in the figure for each dimension. The parameters of these filters are the weights multiplied by the frequency components of the filter input; a different weight per component. These parameters are what the network learns. Note that a filter does not change the time-frequency resolution of the corresponding input. Filter outputs of the same time-frequency resolution are then combined additively across all dimensions and passed through a nonlinear activation function (as in a conventional convolutional neural network). An inverse STFT brings each such combined output back to the time domain, where it becomes an input to the next STFNet block. (Alternatively, the inverse STFT can be applied after dimension combination and before the activation function.) Hence, each output time series is produced by applying spectral manipulation and fusion to one particular time-frequency resolution of all input time series. Once converted to the time domain, however, the output time series can be resampled in the next block at different time-frequency resolutions again. The goal of STFNet is to learn the weighting of different frequency components within each filter in each block, such that the produced features best predict the final network outputs.
3.2. STFNet Block Fundamentals
In this subsection, we introduce the formulation of our design elements within each STFNet block. In the rest of this paper, all vectors are denoted by bold lowercase letters (e.g., $\mathbf{x}$ and $\mathbf{y}$), while matrices and tensors are represented by bold uppercase letters (e.g., $\mathbf{X}$ and $\mathbf{Y}$). For a vector $\mathbf{x}$, the $j$-th element is denoted by $\mathbf{x}[j]$. For a tensor $\mathbf{X}$, the $t$-th matrix along the third axis is denoted by $\mathbf{X}[\cdot, \cdot, t]$, and other slicing notations are defined similarly. We use calligraphic letters to denote sets (e.g., $\mathcal{X}$ and $\mathcal{Y}$). For a set $\mathcal{X}$, $|\mathcal{X}|$ denotes its cardinality.

We denote the input to the STFNet block as $\mathbf{X} \in \mathbb{R}^{\tau \times d}$, where the $d$-dimensional input time series is divided into windows of $\tau$ samples. We call $\tau$ the signal length and $d$ the signal dimension. Since we concentrate on sensing signals, we assume that all raw and internally manipulated sensing signals are real-valued in the time domain.
As shown in Figure 1, the input signal first goes through a multi-resolution short-time Fourier transform (Multi_STFT), a compound of traditional short-time Fourier transforms (STFT), to provide a time-frequency hologram of the signal. STFT breaks the original signal up into chunks with a sliding window $w[n]$, where the sliding window with width $N$ has non-zero values only for $0 \le n < N$. Each chunk is then discrete-Fourier transformed,

$$\mathcal{F}_{N,S}(\mathbf{x})[t, k] = \sum_{n=0}^{N-1} \mathbf{x}[tS + n]\, w[n]\, e^{-i 2\pi k n / N} \qquad (1)$$

where $\mathcal{F}_{N,S}$ denotes the short-time Fourier transform with width $N$ and sliding step $S$; $t$ indexes the $T$ time chunks; and $k$ indexes the $M$ frequency components. Since the input signal is real-valued, its discrete Fourier transform is conjugate symmetric. Therefore, we only need $\lfloor N/2 \rfloor + 1$ frequency components to represent the signal, i.e., $M = \lfloor N/2 \rfloor + 1$. In this paper, we focus on sliding chunks with a rectangular window and no overlaps to simplify the formulation, i.e., $w[n] = 1$ for $0 \le n < N$ and $S = N$. We therefore abbreviate the short-time Fourier transform $\mathcal{F}_{N,N}$ as $\mathcal{F}_N$.
The Multi_STFT operation is composed of multiple short-time Fourier transforms with different window widths $N \in \mathcal{N}$. The window width $N$ determines the time-frequency resolution of the STFT: a larger $N$ provides better frequency resolution, while a smaller $N$ provides better time resolution. In this paper, we set the window widths to be power-of-two multiples of the smallest width, i.e., $\mathcal{N} = \{N_0 \cdot 2^m : 0 \le m < |\mathcal{N}|\}$, to simplify the design later. We can thus formulate Multi_STFT as:

$$\text{Multi\_STFT}(\mathbf{x}) = \{\mathcal{F}_N(\mathbf{x}) : N \in \mathcal{N}\} \qquad (2)$$
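Under the rectangular, non-overlapping assumptions above, Multi_STFT can be sketched in a few lines (the window widths and function names are ours, chosen for illustration):

```python
import numpy as np

def stft_rect(x, width):
    # non-overlapping rectangular-window STFT; rfft keeps the
    # width // 2 + 1 non-redundant components of a real signal
    return np.fft.rfft(x.reshape(-1, width), axis=1)

def multi_stft(x, widths=(32, 64, 128)):
    # one hologram representation per window width
    return {w: stft_rect(x, w) for w in widths}

x = np.random.randn(512)
hologram = multi_stft(x)
shapes = {w: rep.shape for w, rep in hologram.items()}
print(shapes)   # {32: (16, 17), 64: (8, 33), 128: (4, 65)}
```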
Next, according to Figure 1, the multi-resolution representations go into the hologram interleaving component, which enables the representations to compensate and balance their time-frequency resolutions with each other. The technical details of the hologram interleaving component are introduced in Section 3.3.
The STFNet block then manipulates the multiple hologram representations with the same set of spectral-compatible operations, including STFNet-filtering, STFNet-convolution, and STFNet-pooling. We formulate these operations in Sections 3.4, 3.5, and 3.6, respectively.
Finally, the STFNet block converts the manipulated frequency representations back into the time domain with the inverse short-time Fourier transform. The resulting representations from different views of the hologram are weighted and merged as the input "signal" for the next block. Since we merge the output representations from different views of the hologram, we reduce the output feature dimension of the STFNet-filtering and convolution operations by a factor of $|\mathcal{N}|$ to prevent dimension explosion.
3.3. STFNet Hologram Interleaving
In this subsection, we introduce the formulation of hologram interleaving. Due to the Fourier uncertainty principle, the representations in the time-frequency hologram have either high time resolution or high frequency resolution. Hologram interleaving uses the representations with high time resolution to instruct the representations with low time resolution, highlighting the important components over time. This is done in two steps:

Revealing the mathematical relationship of aligned time-frequency components among different representations in the time-frequency hologram.

Updating the original relationship in a data-driven manner through neural-network attention components.
We start from the definition of the time-frequency hologram, generated by Multi_STFT as defined in (2). Recall that the window width set is defined as $\mathcal{N} = \{N_0 \cdot 2^m\}$, $0 \le m < |\mathcal{N}|$. Without loss of generality, Figure 3 illustrates the multi-resolution short-time Fourier transformed representations of a one-dimensional input signal at three consecutive window widths.
In order to find the relationship of aligned time-frequency components, we start with the frequency-component dimension. Since different representations only change the window width of the STFT but not the sampling frequency of the input signal, their frequency components all represent frequencies from $0$ to $f/2$ (the Nyquist frequency) with step $f/N$. We can therefore first obtain the relationship of frequency steps among different representations,

$$\frac{f}{N_1} = 2^{m} \cdot \frac{f}{N_2}, \qquad N_2 = 2^{m} N_1 \qquad (3)$$

Therefore, a low frequency-resolution representation (with window width $N_1$) can find a frequency-equivalent counterpart for each of its frequency components, namely every $2^m$-th frequency component of a high frequency-resolution representation (with window width $N_2 = 2^m N_1$). The upper part of Figure 3 provides a simple illustration of this relationship. In the following analysis, we use the original index and the corresponding frequency interchangeably to recall a frequency component from the time-frequency hologram.
Next, we analyze the relationship over the time-chunk dimension when two representations have frequency-equivalent components. Note that the time chunks are generated by sliding a rectangular window without overlap. Based on (1), for representations having window widths $N_1$ and $N_2$ ($N_2 = 2^m N_1$),

$$\mathcal{F}_{N_2}(\mathbf{x})[t, k_2] = \sum_{p=0}^{2^m - 1} \mathcal{F}_{N_1}(\mathbf{x})[2^m t + p,\; k_1], \qquad \frac{k_1}{N_1} = \frac{k_2}{N_2} \qquad (4)$$

Therefore, given an equivalent frequency component, a time component in the low time-resolution representation (with window width $N_2$) is the sum of the aligned time components of the high time-resolution representation (with window width $N_1$). As a toy example in Figure 3, the first row of the middle tensor is equal to the sum of the first two rows of the left tensor at their shared frequencies; a row of the right tensor is equal to the sum of four rows of the left tensor at their shared frequencies, and to the sum of two rows of the middle tensor at the frequencies those two representations share; etc.
According to the analysis above, the high frequency-resolution representations lose their fine-grained time resolution at certain frequencies by summing the corresponding frequency components over a range of time. The high time-resolution representations, however, preserve this information.
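Relation (4) is easy to verify numerically: at a shared frequency $k_1/N_1 = k_2/N_2$, one wide-window DFT coefficient equals the sum of the corresponding narrow-window coefficients. A standalone check, with arbitrary example sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(128)

N1, N2 = 32, 128          # N2 / N1 = 4 narrow chunks per wide chunk
k1, k2 = 3, 12            # k2 / N2 == k1 / N1: the same physical frequency

# four width-32 chunks vs. one width-128 chunk over the same samples
narrow = np.fft.fft(x.reshape(-1, N1), axis=1)   # shape (4, 32)
wide = np.fft.fft(x)                             # shape (128,)

# the wide coefficient is exactly the sum of the aligned narrow ones
assert np.allclose(wide[k2], narrow[:, k1].sum())
```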
The idea of hologram interleaving is to replace the sum operation in the high frequency-resolution representation with a weighted merge operation that highlights the important information over time. For a given frequency component, the merging weights are learnt from the most fine-grained information preserved in the time-frequency hologram. In this paper, we implement the weighted merge operation as a simple attention module. For a merging input $\mathbf{Z} \in \mathbb{C}^{k \times d}$, where $k$ is the number of elements to be merged, the merge operation is formulated as:

$$\mathbf{a} = \mathrm{softmax}\big(|\mathbf{Z}|\,\mathbf{W}\big), \qquad \bar{\mathbf{z}} = k \sum_{i=1}^{k} \mathbf{a}[i]\, \mathbf{Z}[i, \cdot] \qquad (5)$$

where $|\cdot|$ is the element-wise magnitude operation for a complex-number vector, and $\mathbf{W}$ is the learnable weight matrix. Notice that the final merged result is rescaled by the factor $k$ to imitate the "sum" property of the Fourier transform.
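A minimal sketch of the merge in Eq. (5), with assumed shapes: `Z` holds the $k$ complex time components to merge (one row each), and `W` maps their magnitudes to a scalar attention score per row:

```python
import numpy as np

def softmax(z, axis=0):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def weighted_merge(Z, W):
    """Hedged sketch of the attention merge: Z is (k, d) complex, W is an
    assumed (d, 1) real weight matrix. Attention scores come from the
    element-wise magnitudes |Z|; the result is rescaled by k so that the
    merge imitates the plain sum of the Fourier transform."""
    k = Z.shape[0]
    scores = np.abs(Z) @ W            # (k, 1) real attention logits
    a = softmax(scores, axis=0)       # attention over the k components
    return k * (a * Z).sum(axis=0)    # (d,) merged complex vector

Z = np.random.randn(4, 8) + 1j * np.random.randn(4, 8)
out = weighted_merge(Z, np.random.randn(8, 1))
print(out.shape)   # (8,)
```

With an all-zero `W` the attention is uniform, so the $k$-rescaled merge reduces exactly to the plain sum in relation (4).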
3.4. STFNet-Filtering Operation
Starting from this subsection, we introduce our three spectral-compatible operations in STFNet. Each subsection includes two main parts: 1) the basic formulation of the proposed spectral-compatible operation, and 2) the extension of the single-resolution operation to multi-resolution data.
Spectral filtering is a widely used operation in time-frequency analysis. The STFNet-filtering operation replaces the traditional manually designed spectral filter with a learnable weight matrix that is updated during training. Although spectral filtering is equivalent to time-domain convolution according to the convolution theorem (https://en.wikipedia.org/wiki/Convolution_theorem), the filtering operation helps to handle multi-resolution time-frequency analysis, and facilitates parameterization and modelling. We denote the input tensor as $\mathbf{X} \in \mathbb{C}^{T \times M \times d}$, where $T$ is the number of time chunks, $M$ the number of frequency components, and $d$ the input feature dimension. The STFNet-filtering operation is formulated as:

$$\mathbf{Y}[t, u, \cdot] = \mathbf{X}[t, u, \cdot]\, \mathbf{W}[u, \cdot, \cdot] \qquad (6)$$

where $\mathbf{W} \in \mathbb{C}^{M \times d \times d'}$ is the learnable weight matrix, $d'$ the output feature dimension, and $\mathbf{Y} \in \mathbb{C}^{T \times M \times d'}$ the output representation.
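In tensor form, Eq. (6) is a per-frequency complex linear map, which a single einsum expresses (the concrete shapes below are illustrative):

```python
import numpy as np

# Sketch of the filtering operation: a learnable, per-frequency complex
# weight matrix applied to every time chunk.
T, M, d_in, d_out = 8, 17, 6, 4
X = np.random.randn(T, M, d_in) + 1j * np.random.randn(T, M, d_in)
W = np.random.randn(M, d_in, d_out) + 1j * np.random.randn(M, d_in, d_out)

# for each frequency u: Y[t, u, :] = X[t, u, :] @ W[u, :, :]
Y = np.einsum('tui,uij->tuj', X, W)
print(Y.shape)   # (8, 17, 4)
```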
The function of the STFNet-filtering operation is to provide a set of learnable global frequency template matchings over time. However, it is not straightforward to extend the matching operation to representations with different time-frequency resolutions. Although we could create multiple weight matrices, one per frequency resolution, doing so would introduce unnecessary complexity and redundancy.
STFNet-filtering solves this problem by interpolating the frequency components in the weight matrix. As we mentioned in Section 3.3, data in the hologram with different frequency resolutions cover the same frequency range (from $0$ to $f/2$) but with different frequency steps ($f/N$). Therefore, the STFNet-filtering operation has only one weight matrix $\mathbf{W}$ with $M_0$ frequency components. When the operation input has $M$ frequency components with $M < M_0$, we can subsample the frequency components in $\mathbf{W}$. When $M > M_0$, we interpolate the frequency components of $\mathbf{W}$. STFNet provides two kinds of interpolation methods: 1) linear interpolation and 2) spectral interpolation.

Linear interpolation generates each missing frequency component in the extended weight matrix $\tilde{\mathbf{W}}$ from the two neighbouring frequency components in $\mathbf{W}$:

$$\tilde{\mathbf{W}}[u, \cdot, \cdot] = (1 - \alpha)\, \mathbf{W}[u_1, \cdot, \cdot] + \alpha\, \mathbf{W}[u_2, \cdot, \cdot] \qquad (7)$$

where $u_1$ and $u_2$ are the neighbouring components and $\alpha$ is the fractional position of $u$ between them.
Spectral interpolation utilizes the relationship between the discrete-time Fourier transform (DTFT) and the discrete Fourier transform (DFT). For a time-limited signal (with length $N$), the DTFT regards it as infinite-length data with zeros outside the time-limited range, while the DFT regards it as periodic data. As a result, the DTFT generates a continuous function over the frequency domain, while the DFT generates a discrete function. The DFT can therefore be regarded as a sampling of the DTFT with step $2\pi/N$. In order to increase the frequency resolution of $\mathbf{W}$, we can refine the sampling step from $2\pi/N$ to $2\pi/N'$ with $N' > N$, which is called spectral interpolation. Spectral interpolation can be done through zero padding in the time domain (Smith, 2007),

$$\tilde{\mathbf{W}} = \mathrm{DFT}\big(\mathrm{ZeroPad}_{N' - N}\big(\mathrm{IDFT}(\mathbf{W})\big)\big) \qquad (8)$$

where $\mathrm{ZeroPad}_{N'-N}(\cdot)$ denotes padding $N' - N$ zeros at the end of the sequence, and $\mathrm{IDFT}$ denotes the inverse discrete Fourier transform. Note that if we pad infinitely many zeros to the IDFT result, the DFT turns into the DTFT. A simple illustration of the STFNet-filtering operation is shown in Figure 4.
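The zero-padding route of Eq. (8) can be checked numerically: padding in the time domain samples the same DTFT on a finer grid, so every original frequency bin reappears exactly (the sizes here are arbitrary):

```python
import numpy as np

N, pad_factor = 16, 4
x = np.random.randn(N)
X = np.fft.fft(x)                    # N frequency samples of the DTFT

# zero-pad to length 4N in time, then transform: 4N finer-grid samples
X_fine = np.fft.fft(np.pad(x, (0, (pad_factor - 1) * N)))

# bin u of the coarse spectrum == bin 4u of the fine spectrum
assert np.allclose(X, X_fine[::pad_factor])
```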
3.5. STFNet-Convolution Operation
In this subsection, we introduce the design of the STFNet-convolution operation. Unlike the filtering operation, which handles global pattern matching, we still need a convolution operation to deal with local motifs in the frequency domain. We denote the input tensor as $\mathbf{X} \in \mathbb{C}^{T \times M \times d}$, where $T$ is the number of time chunks, $M$ the number of frequency components, and $d$ the input feature dimension. The convolution operation involves two steps: 1) padding the input data, and 2) convolving it with a kernel weight matrix $\mathbf{W} \in \mathbb{C}^{k \times d \times d'}$, where $k$ is the kernel size along the frequency axis and $d'$ is again the output feature dimension.

Without the padding step, the output of the convolution operation would shrink the number of frequency components, which may break the underlying structure and information in the frequency domain. Therefore, we need to pad extra "frequency components" to keep the shape of the output tensor unchanged compared to that of the input. In deep learning research, padding zeros is common practice. Zero padding is reasonable for inputs such as images and time-domain signals, where it adds no information in the padded range. However, padding zero-valued frequency components introduces additional information in the frequency domain.
Therefore, the STFNet-convolution operation uses spectral padding for time-frequency analysis. According to the definition of the DFT, the transformed data is periodic in the frequency domain. In addition, if the original signal is real-valued, then the transformed data is conjugate symmetric within each period. Previously, we cut the number of frequency components of a length-$N$ signal down to $\lfloor N/2 \rfloor + 1$ to reduce redundancy. In spectral padding, we add these frequency components back according to the rule

$$\mathbf{X}[t, N - u, \cdot] = \overline{\mathbf{X}[t, u, \cdot]} \qquad (9)$$

where $\overline{(\cdot)}$ denotes complex conjugation. In addition, the number of components padded before and after the input tensor is the same as in conventional padding techniques.
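The conjugate-symmetry rule behind spectral padding, Eq. (9), holds exactly for any real signal and is simple to verify:

```python
import numpy as np

N = 32
x = np.random.randn(N)          # any real-valued signal
X = np.fft.fft(x)

# X[N - u] == conj(X[u]) for u = 1 .. N-1: reversing the non-DC bins
# gives the conjugate of the original bins
assert np.allclose(X[1:][::-1], np.conj(X[1:]))
```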
Then we can define the basic convolution operation in STFNet:

$$\mathbf{Y} = \mathrm{SpectralPad}(\mathbf{X}) \ast \mathbf{W} \qquad (10)$$

where $\mathrm{SpectralPad}(\cdot)$ denotes our spectral padding operation, and $\ast$ denotes the convolution operation.
Next, we discuss how to share the kernel weight matrix across multi-resolution data. Besides interpolating the kernel weight matrix as in (7) and (8), we propose another solution for the STFNet-convolution operation. The convolution operation cares more about patterns at relative positions in the frequency domain. Therefore, instead of providing additional kernel details at fine-grained frequency resolutions, we can simply ensure that the convolution kernel is applied with the same frequency spacing to representations with different frequency resolutions. This idea can be implemented with dilated convolution (Yu and Koltun, 2015). If $\mathbf{W}$ is applied to an input tensor with $M$ frequency components, then for an input tensor with $2^m M$ frequency components ($m > 0$), the dilation rate is set to $2^m$. A simple illustration of STFNet-convolution with the dilated configuration is shown in Figure 5.
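A small sketch of the dilation idea (our own minimal 1-D dilated convolution, not a framework call): the same 3-tap kernel applied with rate 1 to a coarse representation and rate 4 to a 4x finer one covers the same physical frequency span:

```python
import numpy as np

def dilated_conv1d(x, kernel, rate):
    """Minimal dilated 1-D convolution (valid mode, no padding): the
    kernel taps are spaced `rate` bins apart, so the same kernel spans
    the same physical frequency range on finer-resolution input."""
    k = len(kernel)
    span = (k - 1) * rate + 1
    return np.array([
        sum(kernel[j] * x[i + j * rate] for j in range(k))
        for i in range(len(x) - span + 1)
    ])

kernel = np.array([1.0, -2.0, 1.0])      # a second-difference kernel
coarse = np.arange(17, dtype=float)      # e.g. 17 frequency bins
fine = np.arange(65, dtype=float)        # 4x finer resolution -> rate 4

print(dilated_conv1d(coarse, kernel, 1).shape)   # (15,)
print(dilated_conv1d(fine, kernel, 4).shape)     # (57,)
```

Both calls respond identically to the same spectral shape: on these linear ramps the second-difference output is zero at either resolution.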
3.6. STFNet-Pooling Operation
In order to provide a dimension reduction method for sensing series within STFNet, we introduce the STFNet-pooling operation. STFNet-pooling truncates the spectral information over time according to a predefined frequency pattern. Filtering, a widely used processing technique, zeroes unwanted frequency components in the signal. Various filtering techniques have been designed, including low-pass, high-pass, and band-pass filtering, which serve as templates for STFNet-pooling. Instead of zeroing unwanted frequency components, however, STFNet-pooling removes the unwanted components and concatenates the remaining pieces. For applications with domain knowledge about the signal-to-noise ratio over the frequency domain, a specific pooling strategy can be designed. In this paper, we focus on low-pass STFNet-pooling as an illustrative example.
To extend the STFNet-pooling operation to multiple resolutions while preserving spectral information, we make sure that all representations are truncated at the same cutoff frequency according to their own frequency resolutions. A simple example of the low-pass STFNet-pooling operation is shown in Figure 6, where the three tensors are truncated at the same cutoff frequency.
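A sketch of low-pass STFNet-pooling across two resolutions (the helper name and shapes are ours): both representations keep only the bins up to the same physical cutoff, so the truncation point scales with the frequency resolution:

```python
import numpy as np

def lowpass_pool(X, cutoff_hz, fs):
    """Keep only the bins at or below the cutoff and drop the rest
    (instead of zeroing them, as a filter would). X is (chunks, bins)
    from a width-N rfft; the bin spacing is fs / N."""
    N = 2 * (X.shape[1] - 1)               # rfft bin count -> window width
    keep = int(cutoff_hz / (fs / N)) + 1   # bins at or below the cutoff
    return X[:, :keep]

fs = 100.0
narrow = np.zeros((16, 17), dtype=complex)   # width-32 STFT of a 512-sample signal
wide   = np.zeros((4, 65), dtype=complex)    # width-128 STFT of the same signal

print(lowpass_pool(narrow, 25.0, fs).shape)  # (16, 9)
print(lowpass_pool(wide, 25.0, fs).shape)    # (4, 33)
```

Both outputs stop at 25 Hz; the finer-resolution tensor simply keeps proportionally more bins below that cutoff.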
4. Evaluation
In this section, we evaluate STFNet with diverse sensing modalities. We focus on device-based and device-free human activity recognition with motion sensors (accelerometer and gyroscope), WiFi, ultrasound, and visible light. We first introduce the experimental settings, including data collection and baseline algorithms. Next, we show the performance metrics of a leave-one-user-out evaluation of human activity recognition with the different modalities. Finally, we analyze the effectiveness of STFNet through several ablation studies.
4.1. Experimental Settings
In this subsection, we first introduce detailed information about the dataset we used or collected for each evaluation task. We then specify how the performance of each task is tested.
Motion Sensor: In this experiment, we recognize human activity with motion sensors on smart devices. We use the dataset collected by Stisen et al. (Stisen et al., 2015). This dataset contains readings from two motion sensors (accelerometer and gyroscope), recorded while users executed activities scripted in no specific order, carrying smartwatches and smartphones. The dataset covers 9 volunteers and 6 activities (biking, sitting, standing, walking, climbing stairs up, and climbing stairs down). We align the two sensor readings, linearly interpolate them at 100 Hz, and segment them into non-overlapping data samples with a time interval of 5.12 s. Each data sample is therefore a $512 \times 6$ matrix, where both the accelerometer and the gyroscope have readings on the $x$, $y$, and $z$ axes.
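The preprocessing arithmetic above (resampling to 100 Hz, cutting 5.12 s windows, 6 channels) can be sketched as follows; the synthetic timestamps and values simply stand in for real recordings:

```python
import numpy as np

fs, window_s = 100.0, 5.12                  # parameters from the text
raw_t = np.sort(np.random.uniform(0, 30.0, 2500))   # irregular timestamps (s)
raw_v = np.random.randn(2500, 6)                    # 6 channels (acc + gyro, xyz)

# resample every channel onto a uniform 100 Hz grid
grid = np.arange(0, 30.0, 1.0 / fs)
uniform = np.stack(
    [np.interp(grid, raw_t, raw_v[:, c]) for c in range(6)], axis=1
)

# cut non-overlapping 5.12 s windows: 512 samples x 6 channels each
n = int(fs * window_s)                      # 512 samples per window
num_windows = len(grid) // n
samples = uniform[: num_windows * n].reshape(num_windows, n, 6)
print(samples.shape)   # (5, 512, 6)
```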
WiFi: In this experiment, we make use of Channel State Information (CSI) to analyze human activities. CSI refers to the known channel properties of a communication link, which can be affected by the presence of humans and their activities. We employ 11 volunteers (including both men and women) as subjects and collect CSI data from 6 different rooms in two different buildings. In particular, we build a WiFi infrastructure, which includes a transmitter (a wireless router) and two receivers. We use the tool from (Halperin et al., 2011) to report CSI values of 30 OFDM subcarriers. The experiment covers 6 activities (wiping the whiteboard, walking, moving a suitcase, rotating the chair, sitting, as well as standing up and sitting down). We linearly interpolate the CSI data with a uniform sampling period and downsample the measurements to 100 Hz. Then we segment the downsampled CSI data into non-overlapping data samples with a time interval of 5.12 s. Therefore, each data sample is a matrix, where each CSI measurement has readings from 30 subcarriers.
Ultrasound: In this experiment, we conduct human activity recognition based on ultrasound. We employ 12 volunteers as subjects to conduct the 6 different activities. The activity data are collected from 6 different rooms in two different buildings. The transmitter is an iPad with an ultrasound generator app installed, which emits an ultrasound signal of approximately 19 kHz. The receiver is a smartphone, and we use the installed recorder app to collect the sound waves. We demodulate the received signal with a carrier frequency of 19 kHz and downsample the measurement to 100 Hz. Then we segment the downsampled ultrasound data into non-overlapping data samples with a time interval of 5.12 s. Therefore, each sample is a matrix.
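The demodulation step can be sketched as follows. This is our own simplified version, which mixes the received audio down by the 19 kHz carrier and uses block averaging as a crude low-pass/decimation stage; the paper does not specify its exact filter, so treat the details as assumptions:

```python
import numpy as np

def demodulate(signal, fs, carrier_hz=19_000, out_fs=100):
    """Sketch of coherent demodulation followed by decimation. Block
    averaging stands in for a proper low-pass filter here."""
    n = np.arange(len(signal))
    # mix down: a complex carrier shifts the 19 kHz band to baseband
    baseband = signal * np.exp(-2j * np.pi * carrier_hz * n / fs)
    # crude low-pass + decimation: average blocks of fs/out_fs samples
    step = fs // out_fs
    trimmed = baseband[: len(baseband) // step * step]
    return trimmed.reshape(-1, step).mean(axis=1)

fs = 48_000
t = np.arange(fs) / fs                       # 1 s of audio
received = np.cos(2 * np.pi * 19_000 * t)    # pure 19 kHz tone
env = demodulate(received, fs)               # 100 Hz baseband envelope
```

For the pure tone above, the recovered baseband magnitude is the constant 0.5 expected from mixing a real cosine with a complex carrier.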
Visible light: In this experiment, we capture human activity in a visible light system. We build an optical system using photoresistors to capture in-air body gestures, detecting the illuminance changes (lux) caused by body interactions. In the experiment, there are three light conditions (natural mode, warm mode, and cool mode) and 4 hand gestures (drawing an anticlockwise circle, drawing a clockwise circle, drawing a cross, and shaking the hand from side to side). We employ 6 volunteers as subjects, and each performs 20 trials of every gesture under a given lighting condition. We linearly interpolate and downsample the measurements to 25 Hz. Then we segment the data into non-overlapping data samples with a time interval of 5.12 s. Therefore, each sample is a matrix, where each measurement contains readings from 6 CdS cells.
Table 1. Structures of the models in comparison, with two sensor inputs.

  STFNet-Filter/Conv            | DeepSense/ComplexNet
  ------------------------------+-----------------------------------------------
  Sensor Data 1, Sensor Data 2  | Chunked Sensor Data 1, Chunked Sensor Data 2
  STFNet1-1, STFNet1-2          | Conv Layer1-1, Conv Layer1-2
  STFNet2-1, STFNet2-2          | Conv Layer2-1, Conv Layer2-2
  STFNet3-1, STFNet3-2          | Conv Layer3-1, Conv Layer3-2
  STFNet-pooling                | Max pooling
  STFNet4                       | Conv Layer4
  STFNet5                       | Conv Layer5
  STFNet6                       | Conv Layer6
  Averaging                     | GRU
  Softmax                       | Softmax
Testing: Throughout the evaluation, to illustrate the generalization ability of STFNet and the baseline models, we perform leave-one-user-out cross validation for every task. In each fold, we choose the data from one user as testing data and use the data from the remaining users as training data. We then compare the performance of the models according to their accuracy and F1 score with confidence intervals.
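A minimal sketch of the leave-one-user-out protocol (our illustration, not the authors' evaluation code) is the following fold generator:

```python
import numpy as np

def leave_one_user_out(user_ids):
    """Yield (held_out_user, train_idx, test_idx) triples, one fold per
    user, mirroring the evaluation protocol described above."""
    users = np.unique(user_ids)
    for held_out in users:
        test = np.flatnonzero(user_ids == held_out)   # one user's samples
        train = np.flatnonzero(user_ids != held_out)  # everyone else
        yield held_out, train, test

user_ids = np.array([0, 0, 1, 1, 1, 2])  # toy sample-to-user mapping
folds = list(leave_one_user_out(user_ids))
```

Accuracy and F1 are then computed per fold, and their confidence intervals summarize variation across held-out users.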
4.2. Models in Comparison
To evaluate whether our proposed STFNet component is better at decoding information and extracting features from sensing inputs than conventional deep learning components (i.e., convolutional and recurrent layers), we substitute components in the state-of-the-art neural network structure for IoT applications with STFNet. Throughout the evaluation, we choose DeepSense as the state-of-the-art structure, which has shown significant improvements on various sensing tasks (Yao et al., 2017a). The structures of the five models in comparison, with two sensor inputs, are illustrated in Table 1. Detailed information about the compared models is listed as follows:

STFNet-Filter: This model integrates the proposed STFNet component into the DeepSense structure. Within the STFNet component, we use the STFNet-filtering operation designed in Section 3.4. The intuition behind the DeepSense structure is to first perform local processing within each sensor and then perform global fusion over multiple sensors. In this model, we replace all convolutional layers used in local/global sensor data processing with our time-frequency analyzing component, STFNet. Since our model already incorporates time-domain analysis within the STFNet component through multi-resolution processing, we replace the Gated Recurrent Units (GRU) with simple feature averaging over time at the end.

STFNet-Conv: This model is almost the same as STFNet-Filter, except that it uses the STFNet-convolution operation designed in Section 3.5.

DeepSense-Freq: This model is the original DeepSense (Yao et al., 2017a). It divides the input sensing data into chunks and processes each chunk with the DFT, treating the real and imaginary parts of the discrete-Fourier-transformed time chunks as additional feature dimensions. This is the state-of-the-art deep learning model for sensing data modelling and IoT applications.

DeepSense-Time: This model is almost the same as DeepSense-Freq, except that it directly takes the chunked raw sensing data, without the DFT, as input.

ComplexNet: This model is a complex-valued neural network (Trabelsi et al., 2017) that can operate on complex-valued inputs. Instead of using the simple CNN and RNN structures originally proposed (Trabelsi et al., 2017), we cheat in their favor by using the DeepSense structure, which improves performance on all tasks. The network inputs are chunked sensing data processed with the DFT.
4.3. Effectiveness
In this section, we discuss the effectiveness of our proposed STFNet based on extensive experiments across diverse sensing modalities, compared with other state-of-the-art deep learning models.
As mentioned in Section 4.1, all models are evaluated through leave-one-user-out cross validation, with accuracy and F1 score accompanied by confidence intervals. STFNet-based models (STFNet-Filter and STFNet-Conv) take a set of sliding window sizes for the multi-resolution short-time Fourier transform. We choose one window set for activity recognition with motion sensors, WiFi, and ultrasound, and another for activity recognition with visible light. DeepSense-based models (DeepSense-Freq and DeepSense-Time) need a single sliding window for chunking input signals. In the evaluation, we cheat in their favor by choosing the best-performing window size according to the accuracy metric. In addition, we consistently configure the STFNet-filtering operation with linear interpolation and the STFNet-convolution operation with spectral padding. We show further evaluations of multi-resolution operations and the effects of diverse operation settings in Section 4.4.
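A multi-resolution STFT with non-overlapping windows can be sketched as below; since the actual window-size sets are not reproduced here, the sizes in the example are illustrative only, and the helper names are ours:

```python
import numpy as np

def multires_stft(x, window_sizes):
    """Multi-resolution STFT sketch: compute non-overlapping short-time
    DFTs of the same signal at several window sizes. Longer windows trade
    time resolution for frequency resolution."""
    reps = {}
    for w in window_sizes:
        n_frames = len(x) // w
        frames = x[: n_frames * w].reshape(n_frames, w)
        reps[w] = np.fft.rfft(frames, axis=1)  # one-sided spectra
    return reps

# 8 cycles over 512 samples: exactly 1 cycle per 64-sample window
x = np.sin(2 * np.pi * 8 * np.arange(512) / 512)
reps = multires_stft(x, window_sizes=(32, 64, 128))
```

Each window size yields a (frames x bins) tensor; the model then consumes all resolutions jointly rather than committing to a single time-frequency trade-off.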
4.3.1. Motion Sensors
For device-based activity recognition with motion sensors, there are 9 users. The accuracy and F1 score with confidence intervals for leave-one-user-out cross validation are illustrated in Figure 7.
STFNet-based models, i.e., STFNet-Filter and STFNet-Conv, outperform all baseline models by a large margin. The lower bound of the confidence interval of STFNet-Filter and STFNet-Conv is even better than the upper bound of the confidence interval of DeepSense-Freq and DeepSense-Time. STFNet-Filter performs better than STFNet-Conv in this experiment, indicating that different activities have distinct global profiling patterns in the frequency domain of motion sensor readings, even across different users. STFNet-Filter is able to learn accurate global frequency profiles, which makes it the top-performing model in this task. In addition, compared to ComplexNet, STFNet-based models show clear improvements. Therefore, merely using a complex-valued neural network for sensing signals is far from enough; the multi-resolution processing and the spectral-compatible operations are all crucial designs.
4.3.2. WiFi
For device-free activity recognition with the WiFi signal, there are 11 users. The accuracy and F1 score with confidence intervals for leave-one-user-out cross validation are illustrated in Figure 8. STFNet-based models still outperform all others by a clear margin, illustrating the effectiveness of the principled design of STFNet from the time-frequency perspective. DeepSense-Freq outperforms DeepSense-Time in this experiment, which means that even applying a time-frequency transformation merely as preprocessing can help. The complex-valued network, ComplexNet, performs worse than its real-valued counterpart, DeepSense-Freq. This indicates that blindly processing time-frequency representations without preserving their physical meanings can even hurt the final performance. STFNet-Conv performs better than STFNet-Filter in the WiFi experiment, indicating that local shifts in the frequency domain are more representative of the diverse activities profiled with WiFi CSI.
4.3.3. Ultrasound
There are 12 users in the device-free activity recognition with ultrasound experiment. The accuracy and F1 score with confidence intervals for leave-one-user-out cross validation are illustrated in Figure 9. STFNet-based models still significantly outperform all other baselines. An interesting observation is that ComplexNet performs even worse than both DeepSense-Freq and DeepSense-Time, which again validates the importance of designing neural networks for sensing signals with multi-resolution processing as well as preserving the time and frequency information.
4.3.4. Visible Light
There are 6 users in the experiment of device-free activity recognition with visible light. The accuracy and F1 score with confidence intervals are illustrated in Figure 10. Except for DeepSense-Time, all models achieve an accuracy of approximately 90% or higher. STFNet-based models still do the best. There is no significant difference between STFNet-Filter and STFNet-Conv, which indicates that measured visible light readings have quite clean representations in the frequency domain.
4.4. Ablation Studies
In the previous section, we illustrated the performance of STFNet compared to other state-of-the-art baselines. In this section, we focus mainly on the STFNet design itself. We conduct several ablation studies by deleting one design feature from STFNet at a time.
4.4.1. Multi-Resolution vs. Single-Resolution
First, we validate the effectiveness of our design of multi-resolution processing in the STFNet block. As shown in Figure 1, this includes the multi-resolution STFT, hologram interleaving, and weight-sharing techniques in the STFNet-filtering and STFNet-convolution operations. In this experiment, we add two more baseline models, STFNet-Single-Filter and STFNet-Single-Conv, generated by removing the multi-resolution processing from STFNet-Filter and STFNet-Conv respectively. These two models pick the best-performing window size according to the accuracy metric. The results for all four tasks are illustrated in Figure 11, where DeepSense-Freq serves as a decent performance lower bound. The design of multi-resolution processing significantly impacts the performance of STFNet: STFNet-Single-Filter and STFNet-Single-Conv show clear performance degradation compared to their multi-resolution counterparts. In addition, STFNet-Single-Filter and STFNet-Single-Conv still consistently outperform DeepSense-Freq by a clear margin. This is because our other designed operations, including STFNet-filtering, STFNet-convolution, and STFNet-pooling, still facilitate learning in the time-frequency domain.
4.4.2. Spectral Padding vs. Zero Padding
Next, we validate our design of spectral padding in the STFNet-convolution operation, as shown in Figure 5. In this experiment, we add a new baseline algorithm, STFNet-Conv-zPad, by replacing spectral padding with traditional zero padding in STFNet-Conv. The accuracy and F1 score on all four tasks are shown in Figure 12. Here, DeepSense-Freq is still treated as a performance lower bound. By comparing STFNet-Conv-zPad and STFNet-Conv, we can see that spectral padding consistently helps improve model performance. In most cases, the improvement is limited. However, in the case of visible light, spectral padding significantly improves both accuracy and F1 score. Therefore, designing neural networks that preserve the time-frequency semantics of sensing signals is an important rule to follow.
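The difference between the two padding schemes can be illustrated with a 1-D convolution along the frequency axis. The sketch below interprets spectral padding as wrapping the spectrum around, exploiting the periodicity of the DFT; the paper's exact padding rule may differ, and this simplification ignores the conjugate symmetry of one-sided spectra:

```python
import numpy as np

def conv_along_freq(spectrum, kernel, padding="spectral"):
    """Hypothetical 1-D convolution along the frequency axis. 'spectral'
    padding wraps the spectrum around its ends; 'zero' pads with zeros,
    which injects artificial spectral content at the borders."""
    p = len(kernel) // 2
    if padding == "spectral":
        padded = np.concatenate([spectrum[-p:], spectrum, spectrum[:p]])
    else:  # zero padding
        padded = np.concatenate([np.zeros(p), spectrum, np.zeros(p)])
    return np.convolve(padded, kernel, mode="valid")

spec = np.arange(8.0)                     # toy spectrum magnitudes
smooth = np.ones(3) / 3                   # 3-tap averaging kernel
out_spectral = conv_along_freq(spec, smooth, "spectral")
out_zero = conv_along_freq(spec, smooth, "zero")
```

Both outputs keep the original number of bins, but at the boundaries zero padding pulls the result toward zero, while the wrap-around keeps the border bins consistent with the periodic spectrum.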
4.4.3. Linear Interpolation vs. Spectral Interpolation
Then, we compare our two designs of the weight interpolation method in the STFNet-filtering operation, linear interpolation and spectral interpolation, as shown in Figure 4. The STFNet-Filter defined in Section 4.2 uses linear interpolation, so we rename it STFNet-Filter-LinearInpt in this experiment. We add a new baseline model, STFNet-Filter-SpectralInpt, by using spectral interpolation instead of linear interpolation in STFNet-Filter. The results on all four tasks are illustrated in Figure 13. In general, the performance of the two design choices is almost the same; most of the time, linear interpolation performs slightly better. We therefore recommend linear interpolation, since it is also easier to implement.
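As a sketch of what linearly interpolating shared filter weights looks like, the toy example below stretches a coarse frequency-domain filter onto a finer bin grid with np.interp; the filter is real-valued for simplicity (the actual filters are complex), and all names are ours:

```python
import numpy as np

def share_weights_linear(base_filter, target_bins):
    """Weight sharing across resolutions via linear interpolation: a
    filter learned on a coarse frequency grid is stretched to a finer
    grid, so every resolution applies (approximately) the same frequency
    response. Illustrative sketch only."""
    src = np.linspace(0.0, 1.0, len(base_filter))   # coarse bin positions
    dst = np.linspace(0.0, 1.0, target_bins)        # fine bin positions
    return np.interp(dst, src, base_filter)

base = np.array([1.0, 0.5, 0.0])       # coarse 3-bin low-pass-like filter
fine = share_weights_linear(base, 5)   # same response on a 5-bin grid
```

Because both grids span the same normalized frequency range, the interpolated filter attenuates the same physical frequencies at every resolution.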
4.4.4. STFNet Pooling vs. Mean/Max Pooling
Finally, we validate our design of STFNet-pooling (the low-pass design), as shown in Figure 6. In this experiment, we add two new baseline algorithms, STFNet-Filter-mPad and STFNet-Conv-mPad, by replacing STFNet-pooling in STFNet-Filter and STFNet-Conv with traditional max/mean pooling in the time domain (choosing whichever has better accuracy). The results are illustrated in Figure 14. In all settings, STFNet-pooling shows better performance, and in some cases the improvement is significant. We believe that STFNet-pooling could achieve even better performance if given the detailed signal-to-noise ratio over the frequency domain for each specific sensor; we could then employ other pooling strategies instead of the low-pass design.
5. Discussion
This paper provides a principled way of designing neural networks for sensing signals inspired by the fundamental nature of the underlying physical processes. STFNet operates directly in the frequency domain, in which the measured physical phenomena are best exposed. We propose three types of learnable frequency manipulations that operate on multi-resolution representations while preserving the underlying time-frequency information. Although extensive experiments have illustrated the superior performance of STFNet, further research is needed to better understand design choices for neural networks from the time-frequency perspective.
One challenge is to explore the possibility of integrating neural networks with other time-frequency transformations. In this paper, STFNet focuses on the short-time Fourier transform, which is the most basic one; traditional time-frequency analysis offers plenty of other transformation basis functions. How can we naturally integrate them with neural networks while keeping the underlying physical meaning of the transformed representations? How can we choose or design the most suitable transformation basis functions that meet the corresponding mathematical requirements? Answers to these questions can greatly impact the way researchers design neural networks for sensing signal processing.
Another challenge is to empower the frequency manipulations to exhibit heterogeneous behaviors over time. In STFNet, all designed operations are learnable frequency manipulations that perform identically over time. In order to fully exploit the potential of time-frequency analysis, further research is needed on designing time-varying time-frequency manipulations that adapt to current temporal patterns.
Furthermore, a better experimental and theoretical understanding is needed of the basic settings of neural networks that support computation in the time-frequency domain. For traditional real-valued neural networks, researchers have good intuitions about the basic configurations of initialization, activation functions, dropout and normalization techniques, and optimization methods. However, for neural networks in the time-frequency domain, our understanding is limited. Although the research community has started to study the basic settings of neural networks with complex values (Trabelsi et al., 2017), the current understanding remains preliminary. Time-frequency analysis can involve operations in both the real and complex domains, and the underlying time-frequency information within the internal representations can make the related studies even more complicated. We believe that this understanding will greatly facilitate the future design of deep learning systems for IoT.
In addition, outside the IoT context, there exists a large number of transformations and dimension-reduction techniques, such as SVD and PCA, that have had great impact in revealing useful features of complex phenomena. Our study of deep learning with the STFT suggests that integrating deep neural networks with other common transformations may facilitate learning in domains where such transformations reveal essential features of the input signal domain. Future work is needed to explore this conjecture.
6. Conclusion
In this paper, we introduced STFNet, a principled way of designing neural networks from the time-frequency perspective. STFNet endows time-frequency analysis with additional flexibility and capability. Beyond merely parameterizing frequency manipulations with deep neural networks, we bring two key insights into the design of STFNet. On one hand, STFNet leverages and preserves the frequency-domain semantics that encode time and frequency information. On the other hand, STFNet circumvents the uncertainty principle through multi-resolution transformation and processing. Evaluations show that STFNet consistently outperforms the state-of-the-art deep learning models by a clear margin under diverse sensing modalities, and that our two design insights contribute significantly to the improvement. The designs and evaluations of STFNet unveil the benefits of incorporating domain-specific modeling and transformation techniques into neural network design.
Acknowledgements.
Research reported in this paper was sponsored in part by NSF under grants CNS 1618627 and CNS 1320209 and in part by the Army Research Laboratory under Cooperative Agreements W911NF-09-2-0053 and W911NF-17-2-0196. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory, NSF, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

References
 Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
 Bhattacharya and Lane (2016) Sourav Bhattacharya and Nicholas D Lane. 2016. Sparsification and separation of deep learning layers for constrained resource inference on wearables. In Proceedings of the 14th ACM Conference on Embedded Network Sensor Systems CD-ROM. ACM, 176–189.
 Chen et al. (2014) Ke-Yu Chen, Daniel Ashbrook, Mayank Goel, Sung-Hyuck Lee, and Shwetak Patel. 2014. AirLink: sharing files between multiple devices using in-air gestures. In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 565–569.
 Collobert et al. (2011) Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12, Aug (2011), 2493–2537.
 Gal and Ghahramani (2015) Yarin Gal and Zoubin Ghahramani. 2015. Dropout as a Bayesian approximation. arXiv preprint arXiv:1506.02157 (2015).
 Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In Advances in neural information processing systems. 1019–1027.
 Gonzalez et al. (2002) Rafael C Gonzalez, Richard E Woods, et al. 2002. Digital image processing.
 Gupta et al. (2012) Sidhant Gupta, Daniel Morris, Shwetak Patel, and Desney Tan. 2012. Soundwave: using the doppler effect to sense gestures. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 1911–1914.
 Halperin et al. (2011) Daniel Halperin, Wenjun Hu, Anmol Sheth, and David Wetherall. 2011. Tool release: Gathering 802.11 n traces with channel state information. ACM SIGCOMM Computer Communication Review 41, 1 (2011), 53–53.
 Han et al. (2015) Song Han, Huizi Mao, and William J Dally. 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149 (2015).
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
 Hemminki et al. (2013) Samuli Hemminki, Petteri Nurmi, and Sasu Tarkoma. 2013. Accelerometer-based transportation mode detection on smartphones. In Proceedings of the 11th ACM Conference on Embedded Networked Sensor Systems. ACM, 13.
 Hubel and Wiesel (1968) David H Hubel and Torsten N Wiesel. 1968. Receptive fields and functional architecture of monkey striate cortex. The Journal of physiology 195, 1 (1968), 215–243.
 Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105.
 Lane et al. (2015) Nicholas D Lane, Petko Georgiev, and Lorena Qendro. 2015. DeepEar: robust smartphone audio sensing in unconstrained acoustic environments using deep learning. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 283–294.
 Li et al. (2016) Tianxing Li, Qiang Liu, and Xia Zhou. 2016. Practical human sensing in the light. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services. ACM, 71–84.
 Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 701–710.
 Pu et al. (2013) Qifan Pu, Sidhant Gupta, Shyamnath Gollakota, and Shwetak Patel. 2013. Whole-home gesture recognition using wireless signals. In Proceedings of the 19th annual international conference on Mobile computing & networking. ACM, 27–38.
 Rippel et al. (2015) Oren Rippel, Jasper Snoek, and Ryan P Adams. 2015. Spectral representations for convolutional neural networks. In Advances in neural information processing systems. 2449–2457.
 Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
 Smith (2007) Julius Orion Smith. 2007. Mathematics of the discrete Fourier transform (DFT): with audio applications. Julius Smith.
 Stisen et al. (2015) Allan Stisen, Henrik Blunck, Sourav Bhattacharya, Thor Siiger Prentow, Mikkel Baun Kjærgaard, Anind Dey, Tobias Sonne, and Mads Møller Jensen. 2015. Smart devices are different: Assessing and mitigating mobile sensing heterogeneities for activity recognition. In Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems. ACM, 127–140.
 Trabelsi et al. (2017) Chiheb Trabelsi, Olexa Bilaniuk, Ying Zhang, Dmitriy Serdyuk, Sandeep Subramanian, João Felipe Santos, Soroush Mehri, Negar Rostamzadeh, Yoshua Bengio, and Christopher J Pal. 2017. Deep complex networks. arXiv preprint arXiv:1705.09792 (2017).
 Wang et al. (2015) Wei Wang, Alex X Liu, Muhammad Shahzad, Kang Ling, and Sanglu Lu. 2015. Understanding and modeling of wifi signal based human activity recognition. In Proceedings of the 21st annual international conference on mobile computing and networking. ACM, 65–76.
 Yao et al. (2017a) Shuochao Yao, Shaohan Hu, Yiran Zhao, Aston Zhang, and Tarek Abdelzaher. 2017a. Deepsense: A unified deep learning framework for time-series mobile sensing data processing. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 351–360.
 Yao et al. (2018a) Shuochao Yao, Yiran Zhao, Huajie Shao, ShengZhong Liu, Dongxin Liu, Lu Su, and Tarek Abdelzaher. 2018a. Fastdeepiot: Towards understanding and optimizing neural network execution time on mobile and embedded devices. In Proceedings of the 16th ACM Conference on Embedded Networked Sensor Systems. ACM, 278–291.
 Yao et al. (2018c) Shuochao Yao, Yiran Zhao, Huajie Shao, Aston Zhang, Chao Zhang, Shen Li, and Tarek Abdelzaher. 2018c. Rdeepsense: Reliable deep mobile computing models with uncertainty estimations. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1, 4 (2018), 173.
 Yao et al. (2018b) Shuochao Yao, Yiran Zhao, Huajie Shao, Chao Zhang, Aston Zhang, Shaohan Hu, Dongxin Liu, Shengzhong Liu, Lu Su, and Tarek Abdelzaher. 2018b. Sensegan: Enabling deep learning for internet of things with a semisupervised framework. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 2, 3 (2018), 144.
 Yao et al. (2018d) Shuochao Yao, Yiran Zhao, Huajie Shao, Chao Zhang, Aston Zhang, Dongxin Liu, Shengzhong Liu, Lu Su, and Tarek Abdelzaher. 2018d. Apdeepsense: Deep learning uncertainty estimation without the pain for iot applications. In 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS). IEEE, 334–343.
 Yao et al. (2018e) Shuochao Yao, Yiran Zhao, Aston Zhang, Shaohan Hu, Huajie Shao, Chao Zhang, Lu Su, and Tarek Abdelzaher. 2018e. Deep Learning for the Internet of Things. Computer 51, 5 (2018), 32–41.
 Yao et al. (2017b) Shuochao Yao, Yiran Zhao, Aston Zhang, Lu Su, and Tarek Abdelzaher. 2017b. Deepiot: Compressing deep neural network structures for sensing systems with a compressorcritic framework. In Proceedings of the 15th ACM Conference on Embedded Network Sensor Systems. ACM, 4.
 Yu and Koltun (2015) Fisher Yu and Vladlen Koltun. 2015. Multiscale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015).