# Deep Learning for Sensor-based Activity Recognition: A Survey

Sensor-based activity recognition seeks high-level knowledge about human activities from large volumes of low-level sensor readings. Conventional pattern recognition approaches have made tremendous progress in the past years. However, those methods often rely heavily on heuristic hand-crafted feature extraction, which could hinder their generalization performance. Additionally, existing methods perform poorly on unsupervised and incremental learning tasks. Recent advances in deep learning make it possible to perform automatic high-level feature extraction and thus achieve promising performance in many areas. Since then, deep learning based methods have been widely adopted for sensor-based activity recognition tasks. This paper surveys recent advances in deep learning based sensor-based activity recognition. We summarize the existing literature from three aspects: sensor modality, deep model, and application. We also present detailed insights on existing work and propose grand challenges for future research.


## Research Highlights

• We survey deep learning based HAR in sensor modality, deep model, and application.

• We comprehensively discuss the insights of deep learning models for HAR tasks.

• We extensively investigate why deep learning can improve the performance of HAR.

• We also summarize the public HAR datasets frequently used for research purposes.

• We present some grand challenges and feasible solutions for deep learning based HAR.

## 1 Introduction

Human activity recognition (HAR) plays an important role in people’s daily life because it can learn high-level knowledge about human activity from raw sensor inputs. Successful HAR applications include home behavior analysis (Vepakomma et al., 2015), video surveillance (Qin et al., 2016), gait analysis (Hammerla et al., 2016), and gesture recognition (Kim and Toomajian, 2016). There are mainly two types of HAR: video-based HAR and sensor-based HAR (Cook et al., 2013). Video-based HAR analyzes videos or images of human motion captured by a camera, while sensor-based HAR focuses on motion data from smart sensors such as accelerometers, gyroscopes, Bluetooth, and sound sensors. Owing to the thriving development of sensor technology and pervasive computing, sensor-based HAR is becoming more popular and widely used, with privacy well protected. Therefore, in this paper, our main focus is on sensor-based HAR.

HAR can be treated as a typical pattern recognition (PR) problem. Conventional PR approaches have made tremendous progress on HAR by adopting machine learning algorithms such as decision trees, support vector machines, naive Bayes, and hidden Markov models (Lara and Labrador, 2013). In some controlled environments where only a small amount of labeled data is available or certain domain knowledge is required (e.g., for some diseases), conventional PR methods are fully capable of achieving satisfactory results. However, in most daily HAR tasks, those methods may rely heavily on heuristic hand-crafted feature extraction, which is usually limited by human domain knowledge (Bengio, 2013). Furthermore, only shallow features can be learned by those approaches (Yang et al., 2015), leading to undermined performance on unsupervised and incremental tasks. Due to those limitations, the performance of conventional PR methods is restricted in terms of classification accuracy and model generalization.

Recent years have witnessed the fast development of deep learning, which achieves unparalleled performance in many areas such as visual object recognition, natural language processing, and logic reasoning (LeCun et al., 2015). Unlike traditional PR methods, deep learning can largely relieve the effort of designing features and can learn much higher-level and more meaningful features by training an end-to-end neural network. In addition, the deep network structure is better suited to unsupervised and incremental learning. Therefore, deep learning is well suited to HAR and has been widely explored in existing work (Lane et al., 2015; Alsheikh et al., 2016; Plötz et al., 2011).

Although some surveys have been conducted on deep learning (LeCun et al., 2015; Schmidhuber, 2015; Bengio, 2013) and on HAR (Lara and Labrador, 2013; Bulling et al., 2014) separately, no survey has focused specifically on the intersection of these two areas. To the best of our knowledge, this is the first article to present recent advances in deep learning based HAR. We hope this survey can provide a helpful summary of existing work and present potential future research directions.

The rest of this paper is organized as follows. In Section 2, we briefly introduce sensor-based activity recognition and explain why deep learning can improve its performance. In Sections 3, 4, and 5, we review recent advances in deep learning based HAR from three aspects: sensor modality, deep model, and application, respectively. We also introduce several benchmark datasets. Section 6 presents a summary of and insights on existing work. In Section 7, we discuss some grand challenges and feasible solutions. Finally, Section 8 concludes the paper.

## 2 Background

### 2.1 Sensor-based Activity Recognition

HAR aims to understand human behaviors, which enables computing systems to proactively assist users based on their requirements (Bulling et al., 2014). Formally speaking, suppose a user is performing some kinds of activities belonging to a predefined activity set $A$:

$$A = \{A_i\}_{i=1}^{m} \tag{1}$$

where $m$ denotes the number of activity types. There is a sequence of sensor readings that captures the activity information

$$s = \{d_1, d_2, \ldots, d_t, \ldots, d_n\} \tag{2}$$

where $d_t$ denotes the sensor reading at time $t$.

We need to build a model $\mathcal{F}$ to predict the activity sequence based on the sensor reading $s$

$$\hat{A} = \{\hat{A}_j\}_{j=1}^{n} = \mathcal{F}(s), \quad \hat{A}_j \in A \tag{3}$$

while the true activity sequence (ground truth) is denoted as

$$A^* = \{A^*_j\}_{j=1}^{n}, \quad A^*_j \in A \tag{4}$$

where $n$ denotes the length of the sequence and $n \geq m$.

The goal of HAR is to learn the model $\mathcal{F}$ by minimizing the discrepancy between the predicted activity $\hat{A}$ and the ground truth activity $A^*$. Typically, a positive loss function $\mathcal{L}(\mathcal{F}(s), A^*)$ is constructed to reflect their discrepancy. $\mathcal{F}$ usually does not directly take $s$ as input, and it usually assumes that there is a projection function $\Phi$ that projects the sensor reading data $d_i \in s$ to a $d$-dimensional feature vector $\Phi(d_i) \in \mathbb{R}^d$. To that end, the goal turns into minimizing the loss function $\mathcal{L}(\mathcal{F}(\Phi(d_i)), A^*)$.

Fig. 1 presents a typical flowchart of HAR using conventional PR approaches. First, raw signal inputs are obtained from several types of sensors (smartphones, watches, Wi-Fi, Bluetooth, sound etc.). Second, features are manually extracted from those readings based on human knowledge (Bao and Intille, 2004), such as the mean, variance, DC, and amplitude in traditional machine learning approaches (Hu et al., 2016). Finally, those features serve as inputs to train a PR model to make activity inference in real HAR tasks.
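The manual feature extraction step of this conventional pipeline can be sketched in a few lines. The following is a minimal illustration, not the procedure of any cited work: the function name `handcrafted_features`, the toy window, and the exact definitions of the DC component (zero-frequency FFT magnitude divided by the window length) and amplitude (peak-to-peak range) are assumptions chosen for clarity.

```python
import numpy as np

def handcrafted_features(window):
    """Per-axis statistical features for one window of shape (T, C)."""
    window = np.asarray(window, dtype=float)
    mean = window.mean(axis=0)                       # mean of each axis
    var = window.var(axis=0)                         # variance of each axis
    # DC component (assumed definition): zero-frequency FFT magnitude / T
    dc = np.abs(np.fft.rfft(window, axis=0)[0]) / window.shape[0]
    # Amplitude (assumed definition): peak-to-peak range of each axis
    amplitude = window.max(axis=0) - window.min(axis=0)
    return np.concatenate([mean, var, dc, amplitude])

# Toy window: 4 samples from a hypothetical 3-axis accelerometer
window = np.array([[0.0, 1.0, 9.8],
                   [1.0, 1.0, 9.6],
                   [2.0, 1.0, 9.8],
                   [1.0, 1.0, 9.6]])
features = handcrafted_features(window)
print(features.shape)  # (12,)
```

The resulting fixed-length vector is what a conventional classifier such as a decision tree or support vector machine would consume. Every feature here was chosen by hand, which is the dependence on human knowledge that the following section discusses.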

### 2.2 Why Deep Learning?

Conventional PR approaches have made tremendous progress in HAR (Bulling et al., 2014). However, there are several drawbacks to conventional PR methods.

First, features are always extracted in a heuristic, hand-crafted way that relies heavily on human experience or domain knowledge. This human knowledge may help in certain task-specific settings, but in more general environments and tasks it lowers the chance of building a successful activity recognition system and lengthens the time needed to build one.

Second, only shallow features can be learned according to human expertise (Yang et al., 2015). Those shallow features often refer to statistical information such as mean, variance, frequency, and amplitude. They can only be used to recognize low-level activities like walking or running, and can hardly be used to infer high-level or context-aware activities (Yang, 2009). For instance, having coffee is more complex and nearly impossible to recognize using only shallow features.

Third, conventional PR approaches often require a large amount of well-labeled data to train the model. However, most activity data remain unlabeled in real applications. Thus, the performance of these models is undermined in unsupervised learning tasks (Bengio, 2013). In contrast, existing deep generative networks (Hinton et al., 2006) are able to exploit the unlabeled samples for model training.

Moreover, most existing PR models mainly focus on learning from static data, while activity data in real life arrive as a stream, requiring robust online and incremental learning.

Deep learning tends to overcome those limitations. Fig. 2 shows how deep learning works for HAR with different types of networks. Compared to Fig. 1, the feature extraction and model building procedures are often performed simultaneously in deep learning models. The features can be learned automatically through the network instead of being manually designed. In addition, a deep neural network can extract high-level representations in its deep layers, which makes it more suitable for complex activity recognition tasks. When faced with a large amount of unlabeled data, deep generative models (Hinton et al., 2006) are able to exploit the unlabeled data for model training. Moreover, deep learning models trained on a large-scale labeled dataset can usually be transferred to new tasks where there are few or no labels.

In the following sections, we mainly summarize the existing work based on the pipeline of HAR: (a) sensor modality, (b) deep model, and (c) application.

## 3 Sensor Modality

Although some HAR approaches can be generalized to all sensor modalities, most of them are specific to certain types. Following (Chavarriaga et al., 2013), we mainly classify those modalities into three types: body-worn sensors, object sensors, and ambient sensors. Table 1 briefly outlines all the modalities.

### 3.1 Body-worn Sensor

Body-worn sensors are one of the most common modalities in HAR. These sensors, such as the accelerometer, magnetometer, and gyroscope, are worn by the users. Acceleration and angular velocity change with human body movements, so they can be used to infer human activities. Those sensors can often be found on smartphones, watches, bands, glasses, and helmets.

Body-worn sensors have been widely used in deep learning based HAR (Chen and Xue, 2015; Plötz et al., 2011; Zeng et al., 2014; Jiang and Yin, 2015; Yang et al., 2015). Among those works, the accelerometer is the most commonly adopted. The gyroscope and magnetometer are also frequently used together with the accelerometer. Those sensors are often exploited to recognize activities of daily living (ADL) and sports. Instead of extracting statistical and frequency features from the movement data, the original signal is directly used as input to the network.

### 3.2 Object Sensor

Object sensors are usually placed on objects to detect the movement of a specific object (Chavarriaga et al., 2013). Unlike body-worn sensors, which capture human movements, object sensors are mainly used to detect the movement of certain objects in order to infer human activities. For instance, an accelerometer attached to a cup can be used to detect the activity of drinking water. Radio frequency identifier (RFID) tags are typically used as object sensors and deployed in smart home environments (Vepakomma et al., 2015; Yang et al., 2015; Fang and Hu, 2014) and for medical activities (Li et al., 2016b; Wang et al., 2016a). RFID can provide more fine-grained information for more complex activity recognition.

It should be noted that object sensors are used less than body-worn sensors because they are difficult to deploy. In addition, the combination of object sensors with other sensor types is emerging as a way to recognize more high-level activities (Yang, 2009).

### 3.3 Ambient Sensor

Ambient sensors are used to capture the interaction between humans and the environment. They are usually embedded in users’ smart environment. There are many kinds of ambient sensors such as radar, sound sensors, pressure sensors, and temperature sensors. Different from object sensors which measure the object movements, ambient sensors are used to capture the change of the environment.

Several studies used ambient sensors to recognize daily activities and hand gestures (Lane et al., 2015; Wang et al., 2016a; Kim and Toomajian, 2016). Most of the work was tested in smart home environments. As with object sensors, the deployment of ambient sensors is difficult. In addition, ambient sensors are easily affected by the environment, and only certain types of activities can be robustly inferred.

### 3.4 Hybrid Sensor

Some work combined different types of sensors for HAR. As shown in (Hayashi et al., 2015), combining acceleration with acoustic information could improve the accuracy of HAR. Ambient sensors are also used together with object sensors, so that both the object movements and the environment state can be recorded. (Vepakomma et al., 2015) designed a smart home environment called A-Wristocracy, where a large number of fine-grained and complex activities of multiple occupants can be recognized through body-worn, object, and ambient sensors. Combining sensors captures rich information about human activities, which may also make this approach practical for real smart home systems in the future.

## 4 Deep Model

In this section, we investigate the deep learning models used in HAR tasks. Table 2 lists all the models.

### 4.1 Deep Neural Network

Deep neural network (DNN) is developed from the artificial neural network (ANN). A traditional ANN often contains very few hidden layers (shallow), while a DNN contains more (deep). With more layers, a DNN is more capable of learning from large data. DNN usually serves as the dense layers of other deep models. For example, in a convolutional neural network, several dense layers are often added after the convolution layers. In this part, we mainly focus on DNN as a single model, while in other sections we will discuss the dense layer.

(Vepakomma et al., 2015) first extracted hand-engineered features from the sensors and then fed those features into a DNN model. Similarly, (Walse et al., 2016) performed PCA before using DNN. In those works, DNN only served as a classification model after hand-crafted feature extraction, so they may not generalize well, and the networks were rather shallow. (Hammerla et al., 2016) used a 5-hidden-layer DNN to perform automatic feature learning and classification with improved performance. Those works indicate that, when the HAR data is multi-dimensional and activities are more complex, more hidden layers can help the model train well since their representation capability is stronger (Bengio, 2013). However, more details should be considered in certain situations to help the model fine-tune better.

### 4.2 Convolutional Neural Network

Convolutional Neural Network (ConvNets, or CNN) leverages three important ideas: sparse interactions, parameter sharing, and equivariant representations (LeCun et al., 2015). After convolution, there are usually pooling and fully-connected layers, which perform classification or regression tasks.

CNN is able to extract features from signals and has achieved promising results in image classification, speech recognition, and text analysis. When applied to time series classification like HAR, CNN has two advantages over other models: local dependency and scale invariance. Local dependency means that nearby signals in HAR are likely to be correlated, while scale invariance means that the learned features are invariant to different paces or frequencies. Due to the effectiveness of CNN, most of the surveyed work focused on this area.

When applying CNN to HAR, there are several aspects to be considered: input adaptation, pooling, and weight-sharing.

1) Input adaptation. Unlike images, most HAR sensors produce time series readings such as acceleration signals, which are temporal, multi-dimensional 1D readings. Input adaptation is necessary before applying CNN to those inputs. The main idea is to adapt the inputs to form a virtual image. There are mainly two types of adaptation: model-driven and data-driven.

• The data-driven approach treats each dimension as a channel and then performs 1D convolution on each of them. After convolution and pooling, the outputs of each channel are flattened into unified DNN layers. A very early work is (Zeng et al., 2014), where each dimension of the accelerometer was treated as one channel, like the RGB channels of an image, and the convolution and pooling were performed separately. (Yang et al., 2015) further proposed to unify and share weights in a multi-sensor CNN by using 1D convolution in the same temporal window. Along this line, (Chen and Xue, 2015) resized the convolution kernel to obtain the best kernel for HAR data. Other similar works include (Hammerla et al., 2016; Sathyanarayana et al., 2016; Pourbabaee et al., 2017). This data-driven approach treats the 1D sensor reading as a 1D image, which is simple and easy to implement. Its disadvantage is that it ignores the dependencies between dimensions and sensors, which may affect the performance.

• The model-driven approach resizes the inputs into a virtual 2D image so that a 2D convolution can be applied. This approach usually involves non-trivial input tuning techniques. (Ha et al., 2015) combined all dimensions to form an image, while (Jiang and Yin, 2015) designed a more complex algorithm to transform the time series into an image. In (Singh et al., 2017), pressure sensor data was transformed into an image via modality transformation. Other similar works include (Ravi et al., 2016; Li et al., 2016b). This model-driven approach can make use of the temporal correlation of the sensors. However, mapping a time series to an image is a non-trivial task and requires domain knowledge.

2) Pooling. The convolution-pooling combination is common in CNN, and most approaches performed max or average pooling after convolution (Ha et al., 2015; Kim and Toomajian, 2016; Pourbabaee et al., 2017). Apart from avoiding overfitting, pooling can also speed up the training process on large data (Bengio, 2013).

3) Weight-sharing. Weight sharing (Zebin et al., 2016; Sathyanarayana et al., 2016) is an efficient method to speed up the training process on a new task. (Zeng et al., 2014) utilized a relaxed partial weight sharing technique, since signals appearing in different units may behave differently. (Ha and Choi, 2016) adopted CNN-pf and CNN-pff structures to investigate the performance of different weight-sharing techniques. Those studies show that partial weight-sharing could improve the performance of CNN.
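To make the two input adaptation styles and the convolution-pooling combination concrete, the following NumPy sketch processes one toy window. It is only an illustration under stated assumptions: the helper names `conv1d_valid` and `max_pool1d`, the fixed kernel (a real CNN learns its kernels), and the plain transpose used as the "virtual image" are hypothetical simplifications and do not reproduce any of the cited architectures. For brevity, the same kernel is reused for every channel.

```python
import numpy as np

def conv1d_valid(signal, kernel):
    """Valid-mode 1D cross-correlation of one channel, as computed in CNN layers."""
    k = len(kernel)
    return np.array([np.dot(signal[i:i + k], kernel)
                     for i in range(len(signal) - k + 1)])

def max_pool1d(x, size):
    """Non-overlapping max pooling; samples that do not fill a pool are dropped."""
    n = (len(x) // size) * size
    return x[:n].reshape(-1, size).max(axis=1)

# Toy window: 8 time steps from a hypothetical 3-axis accelerometer, shape (T, C)
window = np.arange(24, dtype=float).reshape(8, 3)
kernel = np.array([1.0, 0.0, -1.0])   # illustrative only, not a learned filter

# Data-driven adaptation: each axis is a channel with its own 1D convolution,
# followed by pooling; the per-channel outputs are then flattened together.
per_channel = [max_pool1d(conv1d_valid(window[:, c], kernel), size=2)
               for c in range(window.shape[1])]
flattened = np.concatenate(per_channel)   # would feed the dense layers

# Model-driven adaptation (simplest possible form): stack the axes as rows
# of a 2D "virtual image" on which a 2D convolution could then operate.
virtual_image = window.T                  # shape (C, T)

print(flattened.shape, virtual_image.shape)  # (9,) (3, 8)
```

In the data-driven branch no kernel ever sees two axes at once, which is the ignored cross-dimension dependency noted above. A 2D kernel sliding over `virtual_image` would span several axes, at the cost of having to decide how the rows are arranged.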

### 4.3 Autoencoder

An autoencoder learns a latent representation of the input values through its hidden layers, which can be considered an encoding-decoding procedure. The purpose of an autoencoder is to learn more advanced feature representations via an unsupervised learning scheme. A stacked autoencoder (SAE) is a stack of several autoencoders, in which every layer is treated as a basic autoencoder. After several rounds of training, the learned features are stacked with labels to form a classifier.

(Almaslukh et al., 2017; Wang et al., 2016a) used SAE for HAR, where they first adopted greedy layer-wise pre-training (Hinton et al., 2006) and then performed fine-tuning. In comparison, (Li et al., 2014) investigated the sparse autoencoder by adding KL divergence and noise to the cost function, which indicates that adding sparsity constraints could improve the performance of HAR. The advantage of SAE is that it can perform unsupervised feature learning for HAR, which could be a powerful tool for feature extraction. However, SAE depends heavily on its layers and activation functions, which makes it hard to find the optimal solution.
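The encoding-decoding procedure can be illustrated with a single, very small autoencoder trained by plain gradient descent. This is a sketch under assumptions, not the setup of the cited works: the random toy data, the layer sizes, the tanh encoder with a linear decoder, and the learning rate are all chosen only for illustration, and no sparsity term or stacking is included.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unlabeled data: 200 hypothetical windows flattened to 6 values each
X = rng.normal(size=(200, 6))
X = X - X.mean(axis=0)

n_hidden = 3
W_enc = rng.normal(scale=0.1, size=(6, n_hidden))   # encoder weights
W_dec = rng.normal(scale=0.1, size=(n_hidden, 6))   # decoder weights

def reconstruction_loss(X, W_enc, W_dec):
    H = np.tanh(X @ W_enc)      # encode: latent representation
    X_hat = H @ W_dec           # decode: reconstruction of the input
    return float(np.mean((X_hat - X) ** 2))

initial_loss = reconstruction_loss(X, W_enc, W_dec)

learning_rate = 0.05
for _ in range(500):
    H = np.tanh(X @ W_enc)
    X_hat = H @ W_dec
    # Gradient of 0.5 * squared reconstruction error, averaged over samples
    err = (X_hat - X) / X.shape[0]
    grad_dec = H.T @ err
    grad_H = err @ W_dec.T
    grad_enc = X.T @ (grad_H * (1.0 - H ** 2))   # backprop through tanh
    W_dec -= learning_rate * grad_dec
    W_enc -= learning_rate * grad_enc

final_loss = reconstruction_loss(X, W_enc, W_dec)
print(final_loss < initial_loss)
```

No labels are used at any point, which is what makes the learned code `H` an unsupervised feature. Under the same assumptions, a stacked variant would train a second autoencoder on `H`, and so on layer by layer, before fine-tuning with labels.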

### 4.4 Restricted Boltzmann Machine

Restricted Boltzmann machine (RBM) is a bipartite, fully-connected, undirected graph consisting of a visible layer and a hidden layer (Hinton et al., 2006). The stacked RBM is called deep belief network (DBN) by treating every two consecutive layers as an RBM. DBN/RBM is often followed by fully-connected layers.

In pre-training, most work applied a Gaussian RBM in the first layer and binary RBMs in the remaining layers (Plötz et al., 2011; Hammerla et al., 2015; Lane et al., 2015). For multi-modal sensors, (Radu et al., 2016) designed a multi-modal RBM where an RBM is constructed for each sensor modality and the outputs of all the modalities are then unified. (Li et al., 2016a) added pooling after the fully-connected layers to extract the important features. (Fang and Hu, 2014) used a contrastive gradient (CG) method to update the weights in fine-tuning, which helps the network search and converge quickly in all directions. (Zhang et al., 2015b) further implemented RBM on a mobile phone for offline training, indicating that RBM can be very lightweight. Similar to the autoencoder, RBM/DBN can also perform unsupervised feature learning for HAR.

### 4.5 Recurrent Neural Network

Recurrent neural network (RNN) is widely used in speech recognition and natural language processing by utilizing the temporal correlations between neurons. LSTM (long short-term memory) cells are often combined with RNN, where the LSTM cells serve as the memory units.

Few works have used RNN for HAR tasks (Hammerla et al., 2016; Inoue et al., 2016; Edel and Köppe, 2016; Guan and Ploetz, 2017), where the learning speed and resource consumption are the main concerns. (Inoue et al., 2016) first investigated several model parameters and then proposed a relatively good model that can perform HAR with high throughput. (Edel and Köppe, 2016) proposed a binarized-BLSTM-RNN model, in which the weight parameters, inputs, and outputs of all hidden layers are binary values. The main focus of RNN based HAR models is to cope with resource-constrained environments while still achieving good performance.

### 4.6 Hybrid Model

A hybrid model is a combination of several deep models.

One emerging hybrid model is the combination of CNN and RNN. (Ordóñez and Roggen, 2016; Yao et al., 2017) provided good examples of how to combine CNN and RNN. It is shown in (Ordóñez and Roggen, 2016) that the performance of ‘CNN + recurrent dense layers’ is better than that of ‘CNN + dense layers’. Similar results are also shown in (Singh et al., 2017). The reason is that CNN is able to capture the spatial relationship, while RNN can make use of the temporal relationship. Combining CNN and RNN could enhance the ability to recognize different activities that have varied time spans and signal distributions. Other work combined CNN with models such as SAE (Zheng et al., 2016) and RBM (Liu et al., 2016). In those works, CNN performs feature extraction, and the generative models can help speed up the training process. In the future, we expect there will be more research in this area.

## 5 Applications

HAR is usually not the final goal of an application, but it serves as an important step in many applications such as skill assessment and smart home assistants. In this section, we survey deep learning based HAR from the application perspective.

### 5.1 Featured Applications

Most of the surveyed work focused on recognizing activities of daily living (ADL) and sports (Zeng et al., 2014; Chen and Xue, 2015; Ronao and Cho, 2016; Ravì et al., 2017). Those activities consist of simple movements and are easily captured by body-worn sensors. Some research studied aspects of people’s lifestyle such as sleep (Sathyanarayana et al., 2016) and respiration (Khan et al., 2017; Hannink et al., 2017). The detection of such activities often requires object and ambient sensors such as WiFi and sound, which are rather different from those used for ADL.

It is a developing trend to apply HAR to health and disease issues. Some pioneering work has been done with regard to Parkinson’s disease (Hammerla et al., 2015), trauma resuscitation (Li et al., 2016a, b) and paroxysmal atrial fibrillation (PAF) (Pourbabaee et al., 2017). Disease issues are always related to the change of certain body movements or functions, so they can be detected using corresponding sensors.

Under those circumstances, the association between disease and activity should be given more consideration. It is important to use the appropriate sensors. For instance, Parkinson’s disease is often related to freezing of gait, which can be captured by inertial sensors attached to shoes (Hammerla et al., 2015).

Beyond health and disease, the recognition of high-level activities helps to learn richer information for HAR. Movement, behavior, environment, emotion, and thought are critical parts of recognizing high-level activities. However, most work only focused on body movements in smart homes (Vepakomma et al., 2015; Fang and Hu, 2014), which is not enough to recognize high-level activities. For instance, (Vepakomma et al., 2015) combined activity and environment signals to recognize activities in a smart home, but the activities are constrained to body movements, without information on user emotion and state, which are also important. In the future, we expect there will be more research in this area.

### 5.2 Benchmark Datasets

We extensively explore the benchmark datasets for deep learning based HAR. Basically, there are two types of data acquisition schemes: self data collection and public datasets.

• Self data collection: Some work performed their own data collection (e.g. (Chen and Xue, 2015; Zhang et al., 2015b; Bhattacharya and Lane, 2016; Zhang et al., 2015a)). Very detailed efforts are required for self data collection, and it is rather tedious to process the collected data.

• Public datasets: There are already many public HAR datasets that are adopted by most researchers (e.g. (Plötz et al., 2011; Ravi et al., 2016; Hammerla et al., 2016)). By summarizing existing literature, we present several widely used public datasets in Table 3.

## 6 Summary and Discussion

Table 4 presents all the surveyed work in this article. We can make several observations based on the table.

1) Sensor deployment and preprocessing. Choosing suitable sensors is critical for successful HAR. In the surveyed literature, body-worn sensors are the most common modality and the accelerometer is the most used sensor. The reasons are twofold. First, many wearable devices such as smartphones and watches are equipped with an accelerometer, which is easy to access. Second, the accelerometer is able to recognize many types of daily activities, since most of them are simple body movements. Compared to body-worn sensors, object and ambient sensors are better at recognizing activities related to context and environment, such as having coffee. Therefore, it is suggested to use body-worn sensors (mostly accelerometer+gyroscope) for ADL and sports activities. If the activities carry some semantic meaning beyond simple body movements, it is better to combine object and ambient sensors. In addition, there are few public datasets for object and ambient sensors, probably because of privacy issues and the difficulty of deploying the data collection system. We expect there will be more open datasets for those sensors.

Sensor placement is also important. Most body-worn sensors are placed on the dominant wrist, waist, and the dominant hip pocket. This placement strategy can help to recognize most common daily activities. However, when it comes to object and ambient sensors, it is critical to deploy them in a non-invasive way. Those sensors are not usually interacting with users directly, so it is critical to collect the data naturally and non-invasively.

Before using deep models, the raw sensor data need to be preprocessed accordingly. There are two important aspects. The first is the sliding window: the continuous input stream should be cut into individual windows whose length is set according to the sampling rate. This procedure is similar to that of conventional PR approaches. The second is channels. Different sensor modalities can be treated as separate channels, and each axis of a sensor can also be a channel. Using multiple channels could enhance the representation capability of the deep model, since it can reflect the hidden knowledge of the sensor inputs.
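Both preprocessing aspects can be sketched together. The example below is a minimal illustration: the 50 Hz sampling rate, the 2-second window, the 50% overlap, and the function name `sliding_windows` are assumptions made for the example, not values recommended by the surveyed work.

```python
import numpy as np

def sliding_windows(signal, window_size, step):
    """Cut a (T, C) recording into windows of shape (N, window_size, C)."""
    signal = np.asarray(signal)
    starts = range(0, signal.shape[0] - window_size + 1, step)
    return np.stack([signal[s:s + window_size] for s in starts])

# Hypothetical setup: 50 Hz sampling, 2-second windows, 50% overlap
sampling_rate = 50
window_size = 2 * sampling_rate       # 100 samples per window
step = window_size // 2               # hop of 50 samples

# Toy recording: 10 s of 3-axis accelerometer and 3-axis gyroscope data
rng = np.random.default_rng(0)
accelerometer = rng.normal(size=(500, 3))
gyroscope = rng.normal(size=(500, 3))

# Channels: every axis of every sensor becomes one channel
recording = np.concatenate([accelerometer, gyroscope], axis=1)   # (500, 6)
windows = sliding_windows(recording, window_size, step)
print(windows.shape)  # (9, 100, 6)
```

Each of the nine windows is one input sample for the deep model, with six channels. This sketch assumes the recording is at least one window long and that both sensors are already time-aligned at the same rate.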

2) Model selection. There are several deep models surveyed in this article. Then, a natural question arises: which model is the best for HAR? (Hammerla et al., 2016) did an early work by investigating the performance of DNN, CNN and RNN through 4,000 experiments on some public HAR datasets. We combine their work and our explorations to draw some conclusions: RNN and LSTM are recommended to recognize short activities that have natural order, while CNN is better at inferring long-term repetitive activities (Hammerla et al., 2016). The reason is that RNN could make use of the time-order relationship between sensor readings, and CNN is more capable of learning deep features contained in recursive patterns. For multi-modal signals, it is better to use CNN since the features can be integrated through multi-channel convolutions (Zeng et al., 2014; Zheng et al., 2014; Ha et al., 2015). While adapting CNN, data-driven approaches are better than model-driven approaches as the inner properties of the activity signal can be exploited better when the input data are transformed into the virtual image. Multiple convolutions and poolings also help CNN perform better. RBM and autoencoders are usually pre-trained before being fine-tuned. Multi-layer RBM or SAE is preferred for more accurate recognition.

Technically, no model outperforms all the others in all situations, so it is recommended to choose models based on the scenario. To better illustrate the performance of some deep models, Table 5 compares the results of existing work on the public datasets in Table 3. (Footnote: OPP 1, OPP 2, Skoda, and UCI smartphone follow the protocols in (Hammerla et al., 2016), (Plötz et al., 2011), (Zeng et al., 2014), and (Ronao and Cho, 2016), respectively. OPP 1 used weighted f1-score; OPP 2, Skoda, and UCI smartphone used accuracy.) In the Skoda and UCI Smartphone protocols, CNN achieves the best performance. In the two OPPORTUNITY protocols, DBN and RNN outperform the others. This confirms that no model can achieve the best performance in all tasks. Moreover, hybrid models tend to perform better than single models (DeepConvLSTM in OPPORTUNITY 1 and Skoda). For a single model, CNN with shifted inputs (Fourier transform) generates better results compared to shifted kernels.
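Since the protocols above report either accuracy or weighted f1-score, a small sketch may help show how the two metrics can differ on imbalanced activity data. It assumes the usual support-weighted definition of the f1-score (per-class f1 weighted by the share of true samples in that class); the labels and predictions are made up for illustration and are not results from any surveyed paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Per-class f1-score, weighted by each class's share of the true labels."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp
        denominator = 2 * tp + fp + fn
        # f1 = 2*tp / (2*tp + fp + fn); defined as 0 when the class is never hit.
        f1 = 2 * tp / denominator if denominator else 0.0
        total += (count / len(y_true)) * f1
    return total


# Hypothetical imbalanced data: the "null" class dominates, and the model
# never predicts the rarer "walk" class.
y_true = ["null"] * 8 + ["walk"] * 2
y_pred = ["null"] * 10

print(accuracy(y_true, y_pred))
print(weighted_f1(y_true, y_pred))
```

In this toy case the weighted f1-score is lower than the accuracy, because the missed minority class contributes an f1 of zero. This is one reason why numbers obtained under different protocols are not directly comparable.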

## 7 Grand Challenges

Despite the progress in previous work, there are still challenges for deep learning based HAR. In this section, we present those challenges and propose some feasible solutions.

A. Online and mobile deep activity recognition. Two critical issues are related to deep HAR: online deployment and mobile application. Although some existing work has adopted deep HAR on smartphones (Lane et al., 2015) and watches (Bhattacharya and Lane, 2016), it is still far from online and mobile deployment, because the model is often trained offline on some remote server and the mobile device only utilizes a trained model. This approach is neither real-time nor friendly to incremental learning. There are two approaches to tackle this problem: reducing the communication cost between the mobile device and the server, and enhancing the computing ability of mobile devices.

B. More accurate unsupervised activity recognition. The performance of deep learning still relies heavily on labeled samples. Acquiring sufficient activity labels is expensive and time-consuming. Thus, unsupervised activity recognition is urgently needed.

• Take advantage of the crowd. The latest research indicates that exploiting the knowledge from the crowd will facilitate the task (Prelec et al., 2017). Crowd-sourcing takes advantage of the crowd to annotate the unlabeled activities. Rather than only acquiring labels passively, researchers could also develop more elaborate, privacy-aware ways to collect useful labels.

• Deep transfer learning. Transfer learning performs data annotation by leveraging labeled data from other auxiliary domains (Pan and Yang, 2010; Cook et al., 2013; Wang et al., 2017). There are many factors related to human activity, which can be exploited as auxiliary information using deep transfer learning. Problems such as how to share weights between networks, how to exploit knowledge between activity-related domains, and how to find more relevant domains remain to be resolved.

C. Flexible models to recognize high-level activities. More complex high-level activities need to be recognized, beyond simple daily activities. It is difficult to determine the hierarchical structure of high-level activities because they contain more semantic and context information. Existing methods often ignore the correlation between signals, and thus cannot obtain good results.

• Hybrid sensor. Elaborate information provided by the hybrid sensor is useful for recognizing fine-grained activities (Vepakomma et al., 2015). Special attention should be paid to the recognition of fine-grained activities by exploiting the collaboration of hybrid sensors.

• Exploit context information. Context is any information that can be used to characterize the situation of an entity (Abowd et al., 1999). Context information such as Wi-Fi, Bluetooth, and GPS can be used to infer more environmental knowledge about the activity. Exploiting rich context information will greatly help to recognize user state as well as more specific activities.

D. Light-weight deep models. Deep models often require lots of computing resources, which are not available on wearable devices. In addition, the models are often trained offline and cannot be executed in real time. However, less complex models such as shallow NN and conventional PR methods could not achieve good performance. Therefore, it is necessary to develop light-weight deep models to perform HAR.

• Combination of human-crafted and deep features. Recent work indicated that human-crafted and deep features together could achieve better performance (Plötz et al., 2011). Some prior knowledge about the activity will greatly contribute to more robust feature learning in deep models (Stewart and Ermon, 2017). Researchers should consider applying both kinds of features to HAR, combining human experience with machine intelligence.

• Collaboration of deep and shallow models. Deep models have powerful learning abilities, while shallow models are more efficient. The collaboration of those two models has the potential to perform both accurate and light-weight HAR. Several issues such as how to share the parameters between deep and shallow models are to be addressed.

E. Non-invasive activity sensing. Traditional activity collection strategies need to be updated with more non-invasive approaches. Non-invasive approaches tend to collect information and infer activity without disturbing the subjects, and they require more flexible computing resources.

• Opportunistic activity sensing with deep learning. Opportunistic sensing could dynamically harness the non-continuous activity signal to accomplish activity inference (Chen et al., 2016a). In this scenario, back propagation of deep models should be well-designed.

F. Beyond activity recognition: assessment and assistance. Recognizing activities is often the initial step in many applications. For instance, professional skill assessment is required in fitness exercises, and smart home assistants play an important role in healthcare services. There is some early work on climbing assessment (Khan et al., 2015). With the advancement of deep learning, more applications should be developed that go beyond recognition alone.

## 8 Conclusion

Human activity recognition is an important research topic in pattern recognition and pervasive computing. In this paper, we survey the recent advances in deep learning approaches for sensor-based activity recognition. Compared to traditional pattern recognition methods, deep learning reduces the dependency on human-crafted feature extraction and achieves better performance by automatically learning high-level representations of the sensor data. We highlight the recent progress in three important categories: sensor modality, deep model, and application. Subsequently, we summarize and discuss the surveyed research in detail. Finally, several grand challenges and feasible solutions are presented for future research.

## Acknowledgments

This work is supported in part by the National Key R&D Program of China (No. 2017YFB1002801), NSFC (No. 61572471), and the Science and Technology Planning Project of Guangdong Province (No. 2015B010105001). The authors thank the reviewers for their valuable comments.

## References

• Abowd et al. (1999) Abowd, G.D., Dey, A.K., Brown, P.J., Davies, N., Smith, M., Steggles, P., 1999. Towards a better understanding of context and context-awareness, in: International Symposium on Handheld and Ubiquitous Computing, Springer. pp. 304–307.
• Almaslukh et al. (2017) Almaslukh, B., AlMuhtadi, J., Artoli, A., 2017. An effective deep autoencoder approach for online smartphone-based human activity recognition. International Journal of Computer Science and Network Security 17, 160.
• Alsheikh et al. (2016) Alsheikh, M.A., Selim, A., Niyato, D., Doyle, L., Lin, S., Tan, H.P., 2016. Deep activity recognition models with triaxial accelerometers. AAAI workshop .
• Bao and Intille (2004) Bao, L., Intille, S.S., 2004. Activity recognition from user-annotated acceleration data, in: International Conference on Pervasive Computing, Springer. pp. 1–17.
• Bengio (2013) Bengio, Y., 2013. Deep learning of representations: Looking forward, in: International Conference on Statistical Language and Speech Processing, Springer. pp. 1–37.
• Bhattacharya and Lane (2016) Bhattacharya, S., Lane, N.D., 2016. From smart to deep: Robust activity recognition on smartwatches using deep learning, in: 2016 IEEE International Conference on Pervasive Computing and Communication Workshops (PerCom Workshops), IEEE. pp. 1–6.
• Bulling et al. (2014) Bulling, A., Blanke, U., Schiele, B., 2014. A tutorial on human activity recognition using body-worn inertial sensors. ACM Computing Surveys (CSUR) 46, 33.
• Chavarriaga et al. (2013) Chavarriaga, R., Sagha, H., Calatroni, A., Digumarti, S.T., Tröster, G., Millán, J.d.R., Roggen, D., 2013. The opportunity challenge: A benchmark database for on-body sensor-based activity recognition. Pattern Recognition Letters 34, 2033–2042.
• Chen et al. (2016a) Chen, Y., Gu, Y., Jiang, X., Wang, J., 2016a. Ocean: a new opportunistic computing model for wearable activity recognition, in: UbiComp, ACM. pp. 33–36.
• Chen and Xue (2015) Chen, Y., Xue, Y., 2015. A deep learning approach to human activity recognition based on single accelerometer, in: Systems, Man, and Cybernetics (SMC), 2015 IEEE International Conference on, IEEE. pp. 1488–1492.
• Chen et al. (2016b) Chen, Y., Zhong, K., Zhang, J., Sun, Q., Zhao, X., 2016b. Lstm networks for mobile human activity recognition .
• Cheng and Scotland (2017) Cheng, W.Y., Scotland, A.e.a., 2017. Human activity recognition from sensor-based large-scale continuous monitoring of parkinson’s disease patients, in: Connected Health: Applications, Systems and Engineering Technologies (CHASE), 2017 IEEE/ACM International Conference on, pp. 249–250.
• Cook et al. (2013) Cook, D., Feuz, K.D., Krishnan, N.C., 2013. Transfer learning for activity recognition: A survey. Knowledge and information systems 36, 537–556.
• Edel and Köppe (2016) Edel, M., Köppe, E., 2016. Binarized-blstm-rnn based human activity recognition, in: Indoor Positioning and Indoor Navigation (IPIN), 2016 International Conference on, IEEE. pp. 1–7.
• Fang and Hu (2014) Fang, H., Hu, C., 2014. Recognizing human activity in smart home using deep learning algorithm, in: Chinese Control Conference (CCC), pp. 4716–4720.
• Gjoreski et al. (2016) Gjoreski, H., Bizjak, J., Gjoreski, M., Gams, M., 2016. Comparing deep and classical machine learning methods for human activity recognition using wrist accelerometer, in: IJCAI-16 workshop on Deep Learning for Artificial Intelligence (DLAI).
• Guan and Ploetz (2017) Guan, Y., Ploetz, T., 2017. Ensembles of deep lstm learners for activity recognition using wearables. arXiv preprint arXiv:1703.09370 .
• Ha and Choi (2016) Ha, S., Choi, S., 2016. Convolutional neural networks for human activity recognition using multiple accelerometer and gyroscope sensors, in: Neural Networks (IJCNN), 2016 International Joint Conference on, IEEE. pp. 381–388.
• Ha et al. (2015) Ha, S., Yun, J.M., Choi, S., 2015. Multi-modal convolutional neural networks for activity recognition, in: Systems, Man, and Cybernetics (SMC), 2015 IEEE International Conference on, IEEE. pp. 3017–3022.
• Hammerla et al. (2015) Hammerla, N.Y., Fisher, J., Andras, P., Rochester, L., Walker, R., Plötz, T., 2015. Pd disease state assessment in naturalistic environments using deep learning, in: Twenty-Ninth AAAI Conference on Artificial Intelligence.
• Hammerla et al. (2016) Hammerla, N.Y., Halloran, S., Ploetz, T., 2016. Deep, convolutional, and recurrent models for human activity recognition using wearables, in: IJCAI.
• Hannink et al. (2017) Hannink, J., Kautz, T., Pasluosta, C.F., Gaßmann, K.G., et al., 2017. Sensor-based gait parameter extraction with deep convolutional neural networks. IEEE journal of biomedical and health informatics 21, 85–93.
• Hayashi et al. (2015) Hayashi, T., Nishida, M., Kitaoka, N., Takeda, K., 2015. Daily activity recognition based on dnn using environmental sound and acceleration signals, in: Signal Processing Conference (EUSIPCO), 23rd European, pp. 2306–2310.
• Hinton et al. (2006) Hinton, G.E., Osindero, S., Teh, Y.W., 2006. A fast learning algorithm for deep belief nets. Neural computation 18, 1527–1554.
• Hu et al. (2016) Hu, L., Chen, Y., Wang, S., Wang, J., Shen, J., Jiang, X., Shen, Z., 2016. Less annotation on personalized activity recognition using context data, in: UIC, pp. 327–332.
• Inoue et al. (2016) Inoue, M., Inoue, S., Nishida, T., 2016. Deep recurrent neural network for mobile human activity recognition with high throughput. arXiv preprint arXiv:1611.03607 .
• Jiang and Yin (2015) Jiang, W., Yin, Z., 2015. Human activity recognition using wearable sensors by deep convolutional neural networks, in: MM, ACM. pp. 1307–1310.
• Khan et al. (2015) Khan, A., Mellor, S., Berlin, E., Thompson, R., McNaney, R., Olivier, P., Plötz, T., 2015. Beyond activity recognition: skill assessment from accelerometer data, in: UbiComp, ACM. pp. 1155–1166.
• Khan et al. (2017) Khan, U.M., Kabir, Z., Hassan, S.A., Ahmed, S.H., 2017. A deep learning framework using passive wifi sensing for respiration monitoring. arXiv preprint arXiv:1704.05708 .
• Kim and Li (2017) Kim, Y., Li, Y., 2017. Human activity classification with transmission and reflection coefficients of on-body antennas through deep convolutional neural networks. IEEE Transactions on Antennas and Propagation 65, 2764–2768.
• Kim and Toomajian (2016) Kim, Y., Toomajian, B., 2016. Hand gesture recognition using micro-doppler signatures with convolutional neural network. IEEE Access .
• Lane and Georgiev (2015) Lane, N.D., Georgiev, P., 2015. Can deep learning revolutionize mobile sensing?, in: Proceedings of the 16th International Workshop on Mobile Computing Systems and Applications, ACM. pp. 117–122.
• Lane et al. (2015) Lane, N.D., Georgiev, P., Qendro, L., 2015. Deepear: robust smartphone audio sensing in unconstrained acoustic environments using deep learning, in: UbiComp, ACM. pp. 283–294.
• Lara and Labrador (2013) Lara, O.D., Labrador, M.A., 2013. A survey on human activity recognition using wearable sensors. IEEE Communications Surveys & Tutorials 15, 1192–1209.
• LeCun et al. (2015) LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521, 436–444.
• Lee et al. (2017) Lee, S.M., Yoon, S.M., Cho, H., 2017. Human activity recognition from accelerometer data using convolutional neural network, in: Big Data and Smart Computing (BigComp), IEEE International Conference on, pp. 131–134.
• Li et al. (2016a) Li, X., Zhang, Y., Li, M., Marsic, I., Yang, J., Burd, R.S., 2016a. Deep neural network for rfid based activity recognition, in: Wireless of the Students, by the Students, and for the Students (S3) Workshop with MobiCom.
• Li et al. (2016b) Li, X., Zhang, Y., Marsic, I., Sarcevic, A., Burd, R.S., 2016b. Deep learning for rfid-based activity recognition, in: Proceedings of the 14th ACM Conference on Embedded Network Sensor Systems CD-ROM, ACM. pp. 164–175.
• Li et al. (2014) Li, Y., Shi, D., Ding, B., Liu, D., 2014. Unsupervised feature learning for human activity recognition using smartphone sensors, in: Mining Intelligence and Knowledge Exploration. Springer, pp. 99–107.
• Liu et al. (2016) Liu, C., Zhang, L., Liu, Z., Liu, K., Li, X., Liu, Y., 2016. Lasagna: towards deep hierarchical understanding and searching over mobile sensing data, in: Proceedings of the 22nd Annual International Conference on Mobile Computing and Networking, ACM. pp. 334–347.
• Mohammed and Tashev (2017) Mohammed, S., Tashev, I., 2017. Unsupervised deep representation learning to remove motion artifacts in free-mode body sensor networks, in: Wearable and Implantable Body Sensor Networks (BSN), 2017 IEEE 14th International Conference on, IEEE. pp. 183–188.
• Morales and Roggen (2016) Morales, F.J.O., Roggen, D., 2016. Deep convolutional feature transfer across mobile activity recognition domains, sensor modalities and locations, in: Proceedings of the 2016 ACM International Symposium on Wearable Computers, ACM. pp. 92–99.
• Murad and Pyun (2017) Murad, A., Pyun, J.Y., 2017. Deep recurrent neural networks for human activity recognition. Sensors 17, 2556.
• Ordóñez and Roggen (2016) Ordóñez, F.J., Roggen, D., 2016. Deep convolutional and lstm recurrent neural networks for multimodal wearable activity recognition. Sensors 16, 115.
• Pan and Yang (2010) Pan, S.J., Yang, Q., 2010. A survey on transfer learning. Knowledge and Data Engineering, IEEE Transactions on 22, 1345–1359.
• Panwar et al. (2017) Panwar, M., Dyuthi, S.R., Prakash, K.C., Biswas, D., Acharyya, A., Maharatna, K., Gautam, A., Naik, G.R., 2017. Cnn based approach for activity recognition using a wrist-worn accelerometer, in: Engineering in Medicine and Biology Society (EMBC), 2017 39th Annual International Conference of the IEEE, IEEE. pp. 2438–2441.
• Plötz et al. (2011) Plötz, T., Hammerla, N.Y., Olivier, P., 2011. Feature learning for activity recognition in ubiquitous computing, in: IJCAI, p. 1729.
• Pourbabaee et al. (2017) Pourbabaee, B., Roshtkhari, M.J., Khorasani, K., 2017. Deep convolution neural networks and learning ecg features for screening paroxysmal atrial fibrillation patients. IEEE Trans. on Systems, Man, and Cybernetics .
• Prelec et al. (2017) Prelec, D., Seung, H.S., McCoy, J., 2017. A solution to the single-question crowd wisdom problem. Nature 541, 532–535.
• Qin et al. (2016) Qin, J., Liu, L., Zhang, Z., Wang, Y., Shao, L., 2016. Compressive sequential learning for action similarity labeling. IEEE Transactions on Image Processing 25, 756–769.
• Radu et al. (2016) Radu, V., Lane, N.D., Bhattacharya, S., Mascolo, C., Marina, M.K., Kawsar, F., 2016. Towards multimodal deep learning for activity recognition on mobile devices, in: UbiComp, ACM. pp. 185–188.
• Ravi et al. (2016) Ravi, D., Wong, C., Lo, B., Yang, G.Z., 2016. Deep learning for human activity recognition: A resource efficient implementation on low-power devices, in: Wearable and Implantable Body Sensor Networks (BSN), 2016 IEEE 13th International Conference on, IEEE. pp. 71–76.
• Ravì et al. (2017) Ravì, D., Wong, C., Lo, B., Yang, G.Z., 2017. A deep learning approach to on-node sensor data analytics for mobile or wearable devices. IEEE journal of biomedical and health informatics 21, 56–64.
• Ronao and Cho (2015a) Ronao, C.A., Cho, S.B., 2015a. Deep convolutional neural networks for human activity recognition with smartphone sensors, in: International Conference on Neural Information Processing, Springer. pp. 46–53.
• Ronao and Cho (2015b) Ronao, C.A., Cho, S.B., 2015b. Evaluation of deep convolutional neural network architectures for human activity recognition with smartphone sensors, in: Proc. of the KIISE Korea Computer Congress, pp. 858–860.
• Ronao and Cho (2016) Ronao, C.A., Cho, S.B., 2016. Human activity recognition with smartphone sensors using deep learning neural networks. Expert Systems with Applications 59, 235–244.
• Sathyanarayana et al. (2016) Sathyanarayana, A., Joty, S., Fernandez-Luque, L., Ofli, F., Srivastava, J., Elmagarmid, A., Taheri, S., Arora, T., 2016. Impact of physical activity on sleep: A deep learning based exploration. arXiv preprint arXiv:1607.07034 .
• Schmidhuber (2015) Schmidhuber, J., 2015. Deep learning in neural networks: An overview. Neural Networks 61, 85–117.
• Singh et al. (2017) Singh, M.S., Pondenkandath, V., Zhou, B., Lukowicz, P., Liwicki, M., 2017. Transforming sensor data to the image domain for deep learning-an application to footstep detection. arXiv preprint arXiv:1701.01077 .
• Stewart and Ermon (2017) Stewart, R., Ermon, S., 2017. Label-free supervision of neural networks with physics and domain knowledge., in: AAAI, pp. 2576–2582.
• Vepakomma et al. (2015) Vepakomma, P., De, D., Das, S.K., Bhansali, S., 2015. A-wristocracy: Deep learning on wrist-worn sensing for recognition of user complex activities, in: 2015 IEEE 12th International Conference on Wearable and Implantable Body Sensor Networks (BSN), IEEE. pp. 1–6.
• Walse et al. (2016) Walse, K.H., Dharaskar, R.V., Thakare, V.M., 2016. Pca based optimal ann classifiers for human activity recognition using mobile sensors data, in: Proceedings of First International Conference on Information and Communication Technology for Intelligent Systems: Volume 1, Springer. pp. 429–436.
• Wang et al. (2016a) Wang, A., Chen, G., Shang, C., Zhang, M., Liu, L., 2016a. Human activity recognition in a smart home environment with stacked denoising autoencoders, in: International Conference on Web-Age Information Management, Springer. pp. 29–40.
• Wang et al. (2017) Wang, J., Chen, Y., Hao, S., Feng, W., Shen, Z., 2017. Balanced distribution adaptation for transfer learning, in: The IEEE International conference on data mining (ICDM), pp. 1129–1134.
• Wang et al. (2016b) Wang, J., Zhang, X., Gao, Q., Yue, H., Wang, H., 2016b. Device-free wireless localization and activity recognition: A deep learning approach. IEEE Transactions on Vehicular Technology .
• Yang et al. (2015) Yang, J.B., Nguyen, M.N., San, P.P., Li, X.L., Krishnaswamy, S., 2015. Deep convolutional neural networks on multichannel time series for human activity recognition, in: IJCAI, Buenos Aires, Argentina, pp. 25–31.
• Yang (2009) Yang, Q., 2009. Activity recognition: Linking low-level sensors to high-level intelligence., in: IJCAI, pp. 20–25.
• Yao et al. (2017) Yao, S., Hu, S., Zhao, Y., Zhang, A., Abdelzaher, T., 2017. Deepsense: A unified deep learning framework for time-series mobile sensing data processing, in: WWW, pp. 351–360.
• Zebin et al. (2016) Zebin, T., Scully, P.J., Ozanyan, K.B., 2016. Human activity recognition with inertial sensors using a deep learning approach, in: SENSORS, pp. 1–3.
• Zeng et al. (2014) Zeng, M., Nguyen, L.T., Yu, B., Mengshoel, O.J., Zhu, J., Wu, P., Zhang, J., 2014. Convolutional neural networks for human activity recognition using mobile sensors, in: Mobile Computing, Applications and Services (MobiCASE), 2014 6th International Conference on, IEEE. pp. 197–205.
• Zhang et al. (2015a) Zhang, L., Wu, X., Luo, D., 2015a. Human activity recognition with hmm-dnn model, in: Cognitive Informatics & Cognitive Computing (ICCI* CC), 2015 IEEE 14th International Conference on, IEEE. pp. 192–197.
• Zhang et al. (2015b) Zhang, L., Wu, X., Luo, D., 2015b. Real-time activity recognition on smartphones using deep neural networks, in: UIC, IEEE. pp. 1236–1242.
• Zhang et al. (2015c) Zhang, L., Wu, X., Luo, D., 2015c. Recognizing human activities from raw accelerometer data using deep neural networks, in: IEEE 14th International Conference on Machine Learning and Applications (ICMLA), pp. 865–870.
• Zhang et al. (2017a) Zhang, S., Ng, W.W., Zhang, J., Nugent, C.D., 2017a. Human activity recognition using radial basis function neural network trained via a minimization of localized generalization error, in: International Conference on Ubiquitous Computing and Ambient Intelligence, Springer. pp. 498–507.
• Zhang et al. (2017b) Zhang, Y., Li, X., Zhang, J., et al., 2017b. Car-a deep learning structure for concurrent activity recognition, in: IPSN, pp. 299–300.
• Zheng et al. (2014) Zheng, Y., Liu, Q., Chen, E., Ge, Y., Zhao, J.L., 2014. Time series classification using multi-channels deep convolutional neural networks, in: International Conference on Web-Age Information Management, Springer. pp. 298–310.
• Zheng et al. (2016) Zheng, Y., Liu, Q., Chen, E., Ge, Y., Zhao, J.L., 2016. Exploiting multi-channels deep convolutional neural networks for multivariate time series classification. Frontiers of Computer Science 10, 96–112.