Asymmetric Residual Neural Network for Accurate Human Activity Recognition

by   Jun Long, et al.
Central South University

Human Activity Recognition (HAR) using deep neural networks has become a hot topic in human-computer interaction. Machines can effectively identify naturalistic human activities by learning from a large collection of sensor data. Activity recognition is not only an interesting research problem but also has many real-world practical applications. Building on the success of residual networks in automatically learning high-level representations, we propose a novel Asymmetric Residual Network, named ARN. ARN is implemented using two paths built on the same framework: (1) a short time window, which is used to capture spatial features, and (2) a long time window, which is used to capture fine temporal features. The long time window path can be made very lightweight by reducing its channel capacity while still learning useful temporal representations for activity recognition. In this paper, we mainly focus on proposing a new model to improve the accuracy of HAR. To demonstrate the effectiveness of the ARN model, we carried out extensive experiments on benchmark datasets (i.e., OPPORTUNITY and UniMiB-SHAR) and compared it with several conventional and state-of-the-art learning-based methods. We then discuss the influence of the network parameters on performance to provide insights into its optimization. Results from our experiments show that ARN is effective in recognizing human activities from wearable sensor datasets.


I Introduction

Human Activity Recognition (HAR) is very important for human-computer interaction and is an indispensable part of many current real-world applications. To improve awareness in human-computer interaction, the latent features in the data from on-body devices must be learned. HAR using wearable device data is at the core of intelligent assistive technology, owing to its many applications in smart homes [1], intelligent traffic control [2], medical/health assistance [3, 4], skill-based checks [5], and even the security field [6]. In particular, for elderly people who live in remote areas and need continuous monitoring, HAR can greatly increase their safety [7].

Figure 1: The basic motivation source of our proposed ARN.

Nowadays, accelerometers, gyroscopes and magnetic field sensors are widely used in smartphones (e.g., Apple iPhone, Samsung Galaxy, Huawei P/Mate) and smart bracelets (e.g., Apple Watch, Fitbit). With the increasing number of wearable sensors and Internet of Things (IoT) devices, there is a growing trend toward collecting users' activity data in real time. The key technologies in HAR include a sliding time window over time-series data captured with on-body sensors, manually designed feature extraction procedures, and a wide variety of supervised learning methods.

In the past years, researchers have made considerable progress in wearable activity recognition, using algorithms such as Logistic Regression, Decision Tree, and Hidden Markov Model [9]. Reference [7] carried out experiments to evaluate the recognition performance of supervised and unsupervised machine learning techniques. For task identification, many traditional methods (i.e., hand-crafted features and the codebook approach) are characterized by manual feature extraction and discover features with expert knowledge. The performance of those methods often depends on the quality of the obtained feature representation. However, manual feature extraction is not always possible in practice, especially when the structure of the input data is unknown because of a lack of expert knowledge. Compared with manual feature extraction methods, deep learning techniques can discover adequate features without expert knowledge and systematic exploration of the feature space. Deep learning techniques have achieved remarkable results in many fields, such as image recognition [10], speech recognition [11], and natural language processing [12]. Existing deep learning methods for HAR can be further divided into two categories: Deep Neural Networks (DNNs) [13] and Convolutional Neural Networks (CNNs). DNNs (which contain an input layer, at least two hidden layers and an output layer) can extract highly abstract features by stacking hidden layers. Stacking adds more possible connections between input and output neurons, which increases the ways in which learned features can be re-used. Researchers who have used DNN methods for HAR include [15], who investigated deep neural networks with wearable sensor data, and [16], who explored temporal deep neural networks for active biometric authentication.

Recently, many researchers have adopted CNNs to deploy human activity recognition systems, such as [6, 15, 17, 18]. CNNs can model the entire sequence by sharing weights from local to global, extract abstract features at hierarchical layers through a series of convolutional operations, and process the raw activity signals to capture latent features. CNNs are inspired by the discovery of visual cortical cells and retain the spatial information of the data through receptive fields. The power of CNNs is known to stem in large part from their ability to exploit symmetries through a combination of weight sharing and translation equivariance. Moreover, because they act as feature extractors, multiple convolution operators can be stacked to create a hierarchy of progressively more abstract features. Reference [14] proposed CNN-based approaches to automatically extract discriminative features for HAR. More and more studies use variants of CNNs to learn sensor-based data representations for human activity recognition and have achieved remarkable performance. The model in [19] consists of two or three temporal-convolution layers with a ReLU activation function, followed by a max-pooling layer and a soft-max classifier, and can be applied over all sensors simultaneously. Yang et al. [17] introduce four temporal-convolutional layers on a single sensor, followed by a fully-connected layer and a soft-max classifier, and show that deeper networks can find correlations between different sensors.

However, all the deep learning methods mentioned above rely on a single-path neural network and do not consider the spatial and temporal features of the data separately. Our work is inspired by the biological study of retinal ganglion cells in the primate visual system (Figure 1 illustrates the basic motivation source of our proposed ARN). In the visual system, Parvocellular cells (P-cells) provide good spatial detail and color, but their temporal resolution is very low. In addition, there are high-frequency Magnocellular cells (M-cells), which are very sensitive to temporal changes but not sensitive to spatial details and colors. Inspired by these facts, we propose a model that handles a set of activity data with an asymmetric network whose two paths are synchronized: a short time window captures spatial features and a long time window captures fine temporal features, corresponding to the P-cells and the M-cells, respectively. Our network is end-to-end and takes the original sensor data as input, so the data collected from the wearable device can be fed directly into the network. The ARN model is applicable to both supervised and unsupervised learning approaches because it is based on ResNet [20], which has been proven applicable to both. In this paper, we use a supervised learning method because it allows label information to bridge the heterogeneous gap.

We propose a novel Asymmetric Residual Network, named ARN. As a new kind of deep learning network, ARN divides the components of activity recognition into two parts: (1) a residual net using a short time window (i.e., 32 or 64); (2) a residual net using a long time window (i.e., 64 or 96). Here 32, 64 and 96 denote the length of the time window, where (Length) = (Time) x (Sampling Frequency). The last-layer representations of the two parts are concatenated, and the fused representation is then used for accurate activity recognition. The advantages of ARN over other existing methods are listed in Table 1. To the best of our knowledge, this is the first work that applies an asymmetric residual net to activity recognition.
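As a rough illustration of the window-length relation and of the two time windows, the sketch below cuts one recording into short and long windows. The recording, the stride of 16 samples, and the function names are hypothetical choices for this example and are not taken from the paper.

```python
def window_length(seconds, sampling_hz):
    """Length (in samples) = time x sampling frequency."""
    return int(round(seconds * sampling_hz))


def sliding_windows(sequence, length, stride):
    """Cut a list of samples into fixed-length, possibly overlapping windows."""
    return [sequence[start:start + length]
            for start in range(0, len(sequence) - length + 1, stride)]


# Hypothetical recording: 200 samples of a single sensor channel.
signal = [float(i) for i in range(200)]

short_len = 32  # short window, as used by the first residual net
long_len = 64   # long window, as used by the second residual net
short_windows = sliding_windows(signal, short_len, stride=16)
long_windows = sliding_windows(signal, long_len, stride=16)
```

Both sets of windows come from the same raw sequence, so the two paths see the same activity at different temporal extents.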

The main contributions of the paper are as follows:

  1. We propose a novel asymmetric neural network based on ResNet for HAR, termed ARN, which has two paths working separately on short and long slide windows. Our wide path is designed to capture global features but few spatial details, analogous to M-cells, and our narrow path is lightweight, similar to the small receptive field of P-cells.

  2. We design a network consisting of an asymmetric residual net that not only manages information flow effectively but also automatically learns an effective activity feature representation, while capturing the fine feature distribution of different activities from wearable sensor data.

  3. We compare the performance of our method with other relevant methods by carrying out extensive experiments on benchmark datasets. The results show our method outperforms other methods.

The remainder of this paper is structured as follows. In Section II, we briefly introduce the related works. In Section III, we highlight the motivation of our method and provide some theoretical analysis for its implementation. In Section IV, we present our experimental results and the corresponding analysis. Finally, Section V concludes the paper.

Method manual f. high-level f. spatial f. temporal f. unsupervised supervised
HC [21]
CBH [22]
CBS [23]
AE [24]
MLP [25]
CNN [14]
LSTM [26]
Hybrid [27]
ResNet [20]
ARN(This Work)
  • f. denotes features.

Table 1: Comparison of different HAR technologies

II Related Work

i Methods for Human Activity Recognition

This section introduces the feature extraction methods for HAR selected in our comparative study. There are two main directions for HAR methods: conventional recognition methods and learning-based methods. In conventional methods, features are extracted manually with expert knowledge. In learning-based methods, adequate features can be discovered without expert knowledge and systematic exploration of the feature space. The conventional methods in our comparative study include Hand-Crafted Features (HC) [21] and the Codebook approach (CB) [22]. The learning-based methods include the Autoencoders approach (AE) [24], Multi-Layer Perceptron (MLP) [25], Convolutional Neural Network (CNN) [14], Long-Short Term Memory Networks (LSTM) [26], Hybrid Convolutional and Recurrent Networks (Hybrid) [27], and Deep Residual Learning (ResNet) [20].

i.1 Hand-Crafted Features

HC comprises simple metrics computed on the data. It uses simple statistical values (e.g., standard deviation, mean, max, min, median) or frequency-domain correlation features based on the Fourier transform of the signal to analyze the time series of human activity recognition data. Because it is simple to set up and has a low computational cost, it is still used in some areas, but its accuracy cannot satisfy the requirements of modern AI games. In addition, for activity recognition tasks that involve complex high-level behaviors, identifying the relevant features through these traditional approaches is time-consuming.
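A minimal sketch of the statistical part of such hand-crafted features, computed on one window of a single sensor channel, could look as follows. The function name and the toy window are hypothetical, the population standard deviation is an arbitrary choice, and the frequency-domain features are left out.

```python
import statistics


def hand_crafted_features(window):
    """Simple statistical features for one window of one sensor channel."""
    return {
        "mean": statistics.fmean(window),
        "std": statistics.pstdev(window),  # population standard deviation
        "max": max(window),
        "min": min(window),
        "median": statistics.median(window),
    }


# Hypothetical window of five samples.
features = hand_crafted_features([1.0, 2.0, 3.0, 4.0, 10.0])
```

In practice such a vector would be computed per channel and per window, and the concatenated values passed to a conventional classifier.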

i.2 Codebook approach

CB consists of two consecutive steps. The first step is codebook construction, which builds a codebook by applying a clustering algorithm to a set of subsequences extracted from the original data sequence. Each cluster center is considered a codeword that represents a distinct subsequence. The second step is codeword assignment, which aims to build a feature vector associated with a data sequence. Subsequences are first extracted from the sequence, and each subsequence is then assigned to the most similar codeword. Finally, a histogram-based feature representing the distribution of all codewords is built from this information. During codebook construction, a set of subsequences is first extracted from the original data sequence by using a sliding time window approach with a fixed window size and sliding stride. Then, a k-means clustering [28] algorithm is applied to the subsequences to obtain clusters of similar subsequences. The similarity metric between two subsequences is the Euclidean distance. Finally, we obtain a codebook that consists of the resulting codewords.

During codeword assignment, we first extract subsequences from a sequence using the same sliding window approach as in codebook construction, and assign each subsequence to its most similar codeword. Then, a histogram of the frequencies of the codewords is built from this information. Finally, we obtain a probabilistic feature representation by normalizing the histogram.
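The two steps can be sketched in plain Python as below. This is only an illustration under stated assumptions: the toy signal, window size, stride, number of codewords, iteration count and all function names are hypothetical, and the k-means loop is a bare-bones version rather than the implementation used in the study.

```python
import math
import random


def subsequences(sequence, window, stride):
    """Extract sliding-window subsequences from a 1-D sequence."""
    return [sequence[i:i + window]
            for i in range(0, len(sequence) - window + 1, stride)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(codebook, sub):
    """Index of the most similar codeword (smallest Euclidean distance)."""
    return min(range(len(codebook)), key=lambda k: euclidean(codebook[k], sub))


def build_codebook(subs, n_codewords, n_iter=20, seed=0):
    """Plain k-means: the cluster centers become the codewords."""
    rng = random.Random(seed)
    codebook = [list(s) for s in rng.sample(subs, n_codewords)]
    for _ in range(n_iter):
        clusters = [[] for _ in codebook]
        for sub in subs:
            clusters[nearest(codebook, sub)].append(sub)
        for k, members in enumerate(clusters):
            if members:  # keep the old center if a cluster is empty
                codebook[k] = [sum(col) / len(members) for col in zip(*members)]
    return codebook


def hard_histogram(sequence, codebook, window, stride):
    """Normalized histogram of codeword frequencies (hard assignment)."""
    counts = [0] * len(codebook)
    subs = subsequences(sequence, window, stride)
    for sub in subs:
        counts[nearest(codebook, sub)] += 1
    return [c / len(subs) for c in counts]


# Hypothetical toy signal: a flat segment followed by a high segment.
signal = [0.0] * 20 + [5.0] * 20
codebook = build_codebook(subsequences(signal, 4, 2), n_codewords=2)
feature = hard_histogram(signal, codebook, window=4, stride=2)
```

The resulting `feature` has one entry per codeword and sums to one, which is what makes it usable as a probabilistic representation of the sequence.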

The approach described above can be called codebook with hard assignment (CBH) because each subsequence is assigned to a codeword deterministically. However, this approach may lack flexibility in uncertain situations where a subsequence is similar to two or more codewords. To solve this problem, we use a soft assignment variant (CBS). CBS exploits kernel density estimation [23] to perform a smooth assignment of subsequences to multiple codewords. It yields a feature that represents a smooth distribution of codewords and considers the similarity between all codewords and subsequences.
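One common way to realize such a soft assignment is to weight every codeword by a kernel of its distance to the subsequence. The sketch below assumes a Gaussian kernel with a hypothetical bandwidth `sigma`; the kernel choice, the bandwidth and the toy codebook are assumptions for illustration, not details given in the text.

```python
import math


def soft_histogram(subs, codebook, sigma=1.0):
    """Soft codeword assignment with a Gaussian kernel of bandwidth sigma."""
    feature = [0.0] * len(codebook)
    for sub in subs:
        # Kernel response of this subsequence to every codeword.
        weights = []
        for code in codebook:
            sq_dist = sum((x - y) ** 2 for x, y in zip(sub, code))
            weights.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
        total = sum(weights)
        # Each subsequence distributes a total mass of one over the codewords.
        for k, w in enumerate(weights):
            feature[k] += w / total
    return [f / len(subs) for f in feature]


# Hypothetical codebook with two codewords and three subsequences.
codebook = [[0.0, 0.0], [2.0, 2.0]]
subs = [[0.0, 0.0], [2.0, 2.0], [1.0, 1.0]]
feature = soft_histogram(subs, codebook)
```

The third subsequence lies exactly between the two codewords, so it contributes equally to both instead of being forced onto one of them as in hard assignment.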

Figure 2: Architecture of an AE for HAR. The symbols in the figure denote the length of the time window (in the input layer), the number of sensor channels, and the number of hidden layers.

i.3 Autoencoders approach

AE is a specific architecture that consists of an encoder and a decoder, as depicted in Fig 2. The encoder projects the input data into a feature space of lower dimension, while the decoder maps the encoded features back to the input space. After training, the AE can reproduce the input data at its output according to a loss function such as the Mean Squared Error.

i.4 Multi-Layer Perceptron

MLP is one of the simplest neural networks. An important feature of the MLP is that it has multiple layers. As shown in Fig 3, the first layer is called the input layer, the last layer the output layer, and the middle layers the hidden layers. The MLP does not prescribe the number of hidden layers, so an appropriate number can be chosen according to the task. Each neuron in a fully-connected layer takes the outputs of the previous layer as its inputs. Stacking layers can be seen as extracting features of an increasingly higher level of abstraction, and the output features of the neurons at each layer are calculated from the neurons at the previous layer.
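The layer-by-layer computation can be sketched as a forward pass over a flattened window. The layer sizes, the random weights, the ReLU activation and the function names below are hypothetical choices for illustration only.

```python
import random


def relu(value):
    return max(0.0, value)


def dense(inputs, weights, biases, activation):
    """One fully-connected layer: every neuron sees all previous outputs."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def mlp_forward(window, layers):
    """Flatten a (time x channels) window and pass it through the layers."""
    hidden = [value for timestep in window for value in timestep]
    for weights, biases in layers:
        hidden = dense(hidden, weights, biases, relu)
    return hidden


rng = random.Random(0)
# Hypothetical input: 4 time steps x 3 sensor channels.
window = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
sizes = [12, 8, 5]  # flattened input size, then two layers (hypothetical)
layers = [([[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)],
           [0.0] * n_out)
          for n_in, n_out in zip(sizes, sizes[1:])]
output = mlp_forward(window, layers)
```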

Figure 3: Architecture of an MLP for HAR. The symbols in the figure denote the length of the time window (in the input layer), the number of sensor channels, and the number of hidden layers. Input data are first flattened into a vector whose dimension is the window length times the number of sensor channels. The vector is used as data for the hidden layers. All layers are fully-connected.

i.5 Convolutional Neural Networks

CNNs can automatically extract features from raw sensor data without the need for specialized expert knowledge [29]. A standard convolutional neural network consists of convolutional layers, max-pooling layers, fully-connected (FC) layers and a Soft-Max layer. Instead of using predefined filters as in traditional feature extraction methods, CNNs can learn locally connected neurons that represent data-specific filters. As CNNs share the weights of neurons, they have far fewer parameters than traditional neural networks [30].

Convolutional layers are an important component of CNNs. Using several convolution filters (or kernels), which aim to learn feature representations of the raw input, complex operations can be easily performed by the convolution operation in the convolutional layer. The dimension of the filters (or kernels) is determined by the input dimension. A convolution kernel is a function that generalizes a linear model for the underlying local patch. It works well for abstraction when instances of latent concepts are linearly separable. In each convolutional layer, neurons of the current layer are connected to the neurons of the previous layer through a feature mapping operation. Thus, the feature maps of the upper layer are obtained from the convolved results of the previous layer by applying an element-wise nonlinear activation function. So, the value of the $j$-th feature map in the $(l+1)$-th layer, $a_j^{l+1}$, is calculated by:

$$a_j^{l+1} = \sigma\left(b_j^{l} + \sum_{f=1}^{F^{l}} K_{jf}^{l} * a_f^{l}\right)$$

where $F^{l}$ is the total number of feature maps in the $l$-th layer, $K_{jf}^{l}$ is the kernel that connects feature map $f$ to feature map $j$, $*$ denotes the convolution operation, and $b^{l}$ is a bias vector. $\sigma$ is the activation function, which improves the performance of CNNs. The most notable non-linear activation function is ReLU, which is defined as $\sigma(x) = \max(0, x)$. The ReLU activation allows networks to compute much faster than sigmoid or tanh activation functions, induces sparsity in the hidden neurons, and makes it easier for networks to obtain sparse representations. ReLU may output zero values that affect backpropagation, but many research results have shown that ReLU [31] works much better than sigmoid and tanh [32].
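A convolutional layer followed by ReLU can be sketched for 1-D temporal signals as below. As in most deep learning code, the kernel is applied without flipping (cross-correlation) and without padding; the toy input, the kernel values and the function names are hypothetical.

```python
def relu(value):
    return max(0.0, value)


def temporal_conv(feature_maps, kernels, biases):
    """Valid 1-D convolution over time followed by ReLU.

    feature_maps: list of input maps, each a list of samples over time.
    kernels[j][i]: kernel linking input map i to output map j.
    biases[j]: bias of output map j.
    """
    outputs = []
    for j, bias in enumerate(biases):
        k_len = len(kernels[j][0])
        out_len = len(feature_maps[0]) - k_len + 1
        out_map = []
        for t in range(out_len):
            total = bias
            # Sum the contributions of all input feature maps.
            for i, in_map in enumerate(feature_maps):
                total += sum(kernels[j][i][m] * in_map[t + m] for m in range(k_len))
            out_map.append(relu(total))
        outputs.append(out_map)
    return outputs


# One input map and one difference kernel: rising edges pass, falling edges are zeroed.
maps = [[0.0, 1.0, 3.0, 2.0, 0.0]]
kernels = [[[-1.0, 1.0]]]
result = temporal_conv(maps, kernels, biases=[0.0])
# result == [[1.0, 2.0, 0.0, 0.0]]
```

The two zeros in the output show the sparsity that ReLU induces: negative responses are discarded rather than passed on.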

Pooling layers, which follow the convolutional layers, are another component of CNNs. In a pooling layer, a pooling operation reduces the number of neuron connections between neighboring convolutional layers, thus reducing computational complexity.
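For a 1-D feature map, max-pooling can be sketched as follows. The window size and stride of 2 are hypothetical defaults chosen for the example.

```python
def max_pool(feature_map, size=2, stride=2):
    """Keep the largest value in each window of a 1-D feature map."""
    return [max(feature_map[i:i + size])
            for i in range(0, len(feature_map) - size + 1, stride)]


pooled = max_pool([1.0, 3.0, 2.0, 5.0, 4.0, 0.0])
# pooled == [3.0, 5.0, 4.0]
```

Six values become three, so the next layer has half as many inputs to connect to.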

Fully-connected layers, which unfold the 2-D feature matrix into a 1-D feature vector for the classification task, contain about 90% of the parameters of the entire CNN.

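The loss function plays an important role in different classification tasks. The most common loss function is soft-max. Given a training set $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, where $x^{(i)}$ is the $i$-th input patch and $y^{(i)} \in \{1, \dots, K\}$ is its target label among the $K$ labels, the prediction of the $j$-th class for the $i$-th input, $z_j^{(i)}$, is transformed with the soft-max function:

$$p_j^{(i)} = \frac{e^{z_j^{(i)}}}{\sum_{k=1}^{K} e^{z_k^{(i)}}}$$

Soft-max normalizes the predictions to a probability distribution over all classes. The soft-max loss is expressed as follows:

$$\mathcal{L}_{\text{softmax}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{K} 1\{y^{(i)} = j\} \log p_j^{(i)}$$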
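The soft-max transformation and the corresponding loss can be sketched as below. The logits and labels are hypothetical, labels are 0-based in the code, and subtracting the maximum logit is a standard numerical-stability trick that does not change the result.

```python
import math


def softmax(logits):
    """Normalize raw class scores into a probability distribution."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_loss(batch_logits, labels):
    """Mean negative log-probability of the target class."""
    losses = [-math.log(softmax(logits)[label])
              for logits, label in zip(batch_logits, labels)]
    return sum(losses) / len(losses)


# Hypothetical logits for a 3-class problem.
probs = softmax([2.0, 1.0, 0.1])
loss = softmax_loss([[2.0, 1.0, 0.1], [0.0, 0.0, 0.0]], labels=[0, 2])
```

Only the probability assigned to the target class enters the loss, which is the effect of the indicator in the loss expression.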


Regularization is required in CNNs. Overfitting is an unavoidable problem in convolutional neural networks, but it can be effectively reduced by regularization. As a means of regularization, dropout prevents different neurons in a network from depending on each other and forces the network to be more accurate even in the absence of certain information.
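Dropout can be sketched as randomly zeroing activations during training. The version below is the common "inverted" variant, which rescales the surviving units so that the expected activation is unchanged; this variant, the rate of 0.5 and the function name are assumptions for illustration, not details stated in the text.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero units with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
hidden = [1.0] * 10  # hypothetical activations of one layer
dropped = dropout(hidden, rate=0.5, rng=rng)
unchanged = dropout(hidden, rate=0.5, rng=rng, training=False)
```

At test time the layer is left untouched, so the full network is used for prediction.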

Figure 4: Architecture of an LSTM for HAR. The symbols in the figure denote the length of the time window (in the input layer), the number of sensor channels, the element-wise multiplication, and the element-wise addition. Each LSTM layer is composed of LSTM cells. The output of the last LSTM layer is passed to a dense layer and a Soft-Max layer.

i.6 Recurrent Neural Networks and Long-Short Term Memory Networks

Recurrent Neural Networks (RNNs) are a specific architecture in which the connections between neurons form directed cycles and the output of a neuron depends on the state of the network at previous timestamps. RNNs can find patterns with long-term dependencies because they memorize the information extracted from past data. In practice, however, a phenomenon called gradient explosion or vanishing greatly affects the performance of RNNs. The vanishing or exploding gradient problem refers to the derivative of the error function with respect to the network weights becoming very large or close to zero [33]. This problem adversely affects the weight update performed by the back-propagation algorithm. LSTM is designed to solve the vanishing or exploding gradient problem in RNNs. LSTM extends RNNs with memory cells and remembers information over time by storing it in an internal memory. The internal state can be updated and erased depending on the input and the state at the previous time step. As shown in Fig 4, this mechanism is achieved by introducing an internal processor called a cell. A cell contains three gates, called the input gate ($i_t$), output gate ($o_t$) and forget gate ($f_t$). $c_t$ is the cell state. Gates are used to regulate the information update to the cell state. Their equations are given below:

$$\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}$$

where $\odot$ represents the element-wise multiplication of two vectors. $x_t$ refers to the input vector to the LSTM cell at time $t$ and $h_t$ is the hidden state vector. $\sigma$ designates the activation function of the gates. $W$, $U$ and $b$ are the matrices of weights and biases.
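One time step of an LSTM cell can be sketched as below. The code follows the widely used standard formulation (sigmoid gates, tanh candidate update), which may differ in detail from the variant used in the paper; the parameter layout, the sizes and the function names are hypothetical.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def affine(W, x, U, h, b):
    """Compute W x + U h + b for list-based matrices and vectors."""
    return [sum(w * xi for w, xi in zip(W_row, x))
            + sum(u * hi for u, hi in zip(U_row, h)) + bias
            for W_row, U_row, bias in zip(W, U, b)]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; params maps a gate name to its (W, U, b)."""
    def gate(name, activation):
        W, U, b = params[name]
        return [activation(v) for v in affine(W, x, U, h_prev, b)]

    i = gate("i", sigmoid)    # input gate
    f = gate("f", sigmoid)    # forget gate
    o = gate("o", sigmoid)    # output gate
    g = gate("c", math.tanh)  # candidate cell update
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


# Hypothetical sizes: 2 input channels, 1 hidden unit, all-zero parameters.
zero = ([[0.0, 0.0]], [[0.0]], [0.0])
params = {name: zero for name in ("i", "f", "o", "c")}
h, c = lstm_step([0.3, -0.2], h_prev=[0.0], c_prev=[1.0], params=params)
```

With all-zero parameters every gate equals 0.5 and the candidate update is 0, so the forget gate simply halves the previous cell state. This makes the role of each gate easy to check by hand.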

i.7 Hybrid Convolutional and Recurrent Networks

Hybrid comprises convolutional, LSTM and softmax layers, as depicted in Fig 5. Convolutional layers can extract features from the input data and create a hierarchy of progressively more abstract features by stacking several convolutional operators. LSTM includes a memory to model temporal dependencies in time series problems. Therefore, the combination of CNN and LSTM can capture time dependencies on the features extracted by the convolutional operations.

Figure 5: Architecture of a hybrid model for human activity recognition. The symbols in the figure denote the length of the time window (in the input layer), the number of sensor channels, and, for the convolutional layers, the kernels in each layer and the length of the kernels. The convolutional layers extract features from the input data and provide abstract representations of the data in feature maps. LSTM layers learn temporal dynamics from the data.

i.8 Deep Residual Learning

ResNet is designed to address the degradation problem, in which the training accuracy saturates or even decreases as the network depth increases. Unlike an ordinary convolutional neural network, ResNet has many stacked residual blocks, in which identity mappings are added to connect input and output. A residual block with identity mapping can be expressed in a general form:

$$y_l = h(x_l) + \mathcal{F}(x_l, \mathcal{W}_l), \qquad x_{l+1} = f(y_l)$$

where $x_{l+1}$ and $x_l$ are the output and input of the $l$-th unit, $\mathcal{F}$ is a residual function and $\mathcal{W}_l$ are the parameters of the unit. $h(x_l) = x_l$ is an identity mapping and $f$ is a ReLU activation function. The key idea of ResNet is to learn the additive residual function $\mathcal{F}$ with respect to $h(x_l)$. ResNet performs an element-wise addition of the input and output by attaching a shortcut connection. This simple addition can increase the training speed of the model and improve the training effect without adding parameters to the network. As the network depth increases, this simple structure is a good solution to the degradation problem.
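The shortcut connection can be sketched with a toy residual block. Here two small linear maps stand in for the convolutional layers of the residual function; that substitution, the weights and the function names are hypothetical simplifications for illustration.

```python
def relu_vec(values):
    return [max(0.0, v) for v in values]


def linear(weights, values):
    return [sum(w * v for w, v in zip(row, values)) for row in weights]


def residual_block(x, w1, w2):
    """Output = ReLU(x + F(x)), with F = linear -> ReLU -> linear."""
    residual = linear(w2, relu_vec(linear(w1, x)))
    y = [xi + ri for xi, ri in zip(x, residual)]  # identity shortcut, no extra parameters
    return relu_vec(y)


x = [1.0, -2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
# With all-zero weights the residual vanishes and the block reduces to ReLU(x).
out_zero = residual_block(x, zeros, zeros)
out_identity = residual_block(x, identity, identity)
```

Because the input is always passed through unchanged, a block whose residual function has learned nothing still forwards its input, which is the intuition behind the remedy for degradation in deep stacks.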

III Asymmetric Residual Network

In this section, we present our proposed model, which has a narrow path (see Sec ii) and a wide path (see Sec iii) whose outputs are concatenated and sent to the fully-connected layer. The loss function is introduced in Sec v.

Figure 6: The proposed ARN architecture for human activity recognition. The symbols in the raw data layer denote the lengths of the time windows corresponding to the narrow path and the wide path, respectively, and the number of sensor channels. In the conv. layers, the symbols denote the kernels in each layer and the length of the kernels. In the res. layers, each block doubles the number of feature maps of its input signal, which is processed by a preset number of building blocks for residual learning.

i Network Architecture

As shown in Fig 6, convolutional layers and residual layers in our architecture are used to model the recognition task. In convolutional layers, the general activity features are extracted from raw sensor data. In residual layers, the special features can be extracted from general features and the special features are used for human activity recognition.

The convolutional layers (i.e., conv. layers) of our architecture consist of 1 layer, including 64 sliding windows (filters) whose size is , a batch normalization layer, and a ReLU layer with use of a pooling layer [29]. The residual layers contain four “blocks”. The details of the residual block are shown in Table 2 and the value of are set to , respectively.

ii Narrow Path

The narrow path can be any convolutional model that works on sequence data as a spatiotemporal volume. For example, Reference [34] introduced Two-Stream Inflated 3D Convolutional Networks, in which the filters and pooling kernels of very deep image classification convolutional networks are expanded into 3D; Reference [35] introduced spatiotemporal ResNets as a combination of two-stream convolutional networks and ResNets; and Reference [36] introduced non-local operations as a generic family of building blocks for capturing long-range dependencies. The key concept in our narrow path is a short slide time window that scans the sequential activity data. We know from [37] that feature learning methods can get good performance when . Therefore, a typical window length we study is 32 [37]. Denoting the number of sensor channels as , the raw clip length is . The function of this path is to feed compact information into the net in order to capture spatial features.

iii Wide Path

In parallel to the narrow path, the wide path is another convolutional model with a long slide time window. The two paths operate on the same raw activity data sequences, so the wide path uses a slide time window that is a fixed multiple of the narrow path's window. A typical value is  [38] in our experiments. The presence of this multiple is the key to the NarrowWide concept: it explicitly indicates that the two paths work on different time windows. Our wide path feeds a long sequence of activity data into the net in order to pursue global functionality throughout the net hierarchy. Our wide path is distinguished from existing methods in that it can use a significantly lower channel capacity to achieve good accuracy for the ARN model. The low channel capacity can also be interpreted as a weaker ability to represent spatial semantics. Our wide path not only has a long slide time window but also pursues high-dimensional features throughout the network hierarchy, maintaining temporal fidelity as much as possible.

iv Lateral Concatenation

Our lateral concatenation fuses from the narrow path to the wide path. We denote the representation shape of the narrow path as , the representation shape of the wide path is . The output of the lateral concatenation is fused into the narrow path by concatenation. Therefore, the shape of the concatenation layer is .
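As a rough sketch of the fusion step, the code below pools the final feature maps of each path over time and joins the two resulting vectors, in line with the global average pooling and concatenation stages listed in Table 2. The channel counts, map lengths and function names are hypothetical and chosen only to make the shapes visible.

```python
def global_average_pool(feature_maps):
    """Average each feature map over time, giving one value per channel."""
    return [sum(fmap) / len(fmap) for fmap in feature_maps]


def lateral_concat(narrow_maps, wide_maps):
    """Pool each path separately, then join the two vectors end to end."""
    return global_average_pool(narrow_maps) + global_average_pool(wide_maps)


# Hypothetical shapes: narrow path 4 channels x 8 steps, wide path 2 channels x 16 steps.
narrow = [[float(c)] * 8 for c in range(4)]
wide = [[10.0 + c] * 16 for c in range(2)]
fused = lateral_concat(narrow, wide)
# fused == [0.0, 1.0, 2.0, 3.0, 10.0, 11.0]
```

Pooling over time first removes the dependence on the window length, so the two paths can be concatenated even though their temporal extents differ; the fused vector has as many entries as the two paths have channels combined.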

v Loss Function

In order to train classification models, classification objectives (such as logistic loss and softmax loss) have been widely explored. For accurate human activity recognition, predicted labels that differ from the ground truth cannot contribute to the update of the network parameters. For depth estimates, predictions that are close to the ground-truth labels also help to update the network parameters. In this work, we employ the softmax loss for training the human activity recognition model. For each training sequence $x$, the probability of each label $k \in \{1, \dots, K\}$ in our model, where $K$ is the number of classes, is computed via softmax:

$$p(k \mid x) = \frac{\exp(z_k)}{\sum_{i=1}^{K} \exp(z_i)}$$

where $z_i$ are the logits or unnormalized log probabilities. Here, the $z_k$ are computed by adding a fully-connected layer on top of the sequence data embedding $e$, i.e., $z_k = w_k^{\top} e + b_k$, where $w_k$ and $b_k$ are the weights and bias for the target label $k$, respectively. Let $q(k \mid x)$ denote the ground-truth distribution over classes for this training example such that $\sum_{k} q(k \mid x) = 1$. The cross-entropy loss for the example is computed as:

$$\ell = -\sum_{k=1}^{K} q(k \mid x) \log p(k \mid x)$$

Stage narrow path wide path
Max-pooling /2 /2
concate global average pool global average pool
fc. 512 512
Table 2: The layer parameters of the ARN model.

IV Experiments

To demonstrate the performance of our proposed ARN method, we carried out extensive experiments on two widely used benchmark datasets, i.e., OPPORTUNITY and UniMiB-SHAR.

i Dataset

Human activity features are usually unique and cyclical, and natural human activities include walking, running, jumping and so on. Therefore, dataset construction should consider activity data that cover a variety of natural human activities.

We use benchmark datasets to validate the model performance, and use different action sequences to verify whether they belong to the same person. There are many benchmark activity datasets, such as OPPORTUNITY [39], WISDM [40], UniMiB-SHAR [41], MHEALTH [42], PAMAP2 [43] datasets. In this paper, we evaluate our method by using the following two datasets.

The OPPORTUNITY dataset has been widely used in many studies. It contains four subjects performing 17 different (morning) Activities of Daily Living (ADLs) in a sensor-rich environment, as listed in Tables 3 and 4. The data were acquired at a sampling frequency of 30 Hz using 7 wireless body-worn inertial measurement units (IMUs), each consisting of a 3D accelerometer, a 3D gyroscope and a 3D magnetic sensor, as well as 12 additional 3D accelerometers placed on the back, arms, ankles and hips, accounting for a total of 145 different sensor channels. During the data collection process, each subject performed five ADL sessions and one drill session. During each ADL session, named “ADL1” to “ADL5”, subjects were asked to perform the activities naturally. During the drill sessions, subjects performed 20 repetitions of each of the 17 ADLs of the dataset. The dataset contains about 6 hours of information in total, and the data are labeled on a timestamp level. The dataset was used in an open activity recognition challenge in which participants competed to achieve the highest recognition performance. In our experiment, the training and testing sets have 63 dimensions (36-D on the hand, 9-D on the back and 18-D on the ankle, respectively).

The UniMiB-SHAR dataset was collected from 30 healthy subjects (6 male and 24 female) using the 3D accelerometer of a Samsung Galaxy Nexus I9250 with Android OS version 5.1.1. It contains 11771 samples of both human activities and falls performed by the 30 subjects, whose ages range from 18 to 60. The data are sampled at a constant rate of 50 Hz and split into 17 different activity classes: 9 safe activities and 8 dangerous activities (e.g., a falling action), as shown in Tables 3 and 5. Unlike the OPPORTUNITY dataset, this dataset does not have a NULL class and remains relatively balanced. It allows researchers to work toward more robust features and classification schemes. In our experiments, the training and testing sets have 3 dimensions.

Dataset | OPPORTUNITY | UniMiB-SHAR
Types of sensors | Custom Bluetooth wireless accelerometers, Sun SPOTs and InertiaCube3, Ubisense localisation system, a custom-made magnetic field sensor | A Bosch BMA220 acceleration sensor
Numbers of sensors | 72 | 3
Numbers of samples | 473K | 11771
Acquisition periods | 10-20 (min) | 0.6 (min)
Table 3: The details of experimental datasets
Class | Proportion | Class | Proportion
Open Door 1/2 | 1.87%/1.26% | Open Fridge | 1.60%
Close Door 1/2 | 6.15%/1.54% | Close Fridge | 0.79%
Open Dishwasher | 1.85% | Close Dishwasher | 1.32%
Open Drawer 1/2/3 | 1.09%/1.64%/0.94% | Clean Table | 1.23%
Close Drawer 1/2/3 | 0.87%/1.69%/2.04% | Drink from Cup | 1.07%
Toggle Switch | 0.78% | NULL | 72.28%
Table 4: Classes and proportions of the OPPORTUNITY dataset
| Class | Proportion | Class | Proportion |
|---|---|---|---|
| StandingUpfromSitting | 1.30% | Walking | 14.77% |
| StandingUpfromLaying | 1.83% | Running | 16.86% |
| LyingDownfromStanding | 2.51% | Going Up | 7.82% |
| Jumping | 6.34% | Going Down | 11.25% |
| F(alling) Forward | 4.49% | F and Hitting Obstacle | 5.62% |
| F Backward | 4.47% | Syncope | 4.36% |
| F Right | 4.34% | F with ProStrategies | 4.11% |
| F Backward SittingChair | 3.69% | F Left | 4.54% |
| Sitting Down | 1.70% | | |

Table 5: Classes and proportions of the UniMiB-SHAR dataset

The OPPORTUNITY and UniMiB-SHAR datasets were both collected in real environments. Each has its own characteristics and uses different sensors: the UniMiB-SHAR dataset contains only accelerometer data, which has a low power cost, whereas the OPPORTUNITY dataset combines accelerometer, gyroscope and magnetic sensor data and can therefore provide accurate limb orientation.

ii Baseline

We compared the proposed ARN method against several classic and state-of-the-art activity recognition methods, which we roughly divide into two categories. The conventional recognition methods include HC [21], CBH [22] and CBS [23]. The learning-based methods include AE [24], MLP [25], CNN [14], LSTM [26], Hybrid [27] and ResNet [20]. For the conventional methods, we use hand-crafted features; readers can find more details in [37]. For the learning-based methods, we use raw activity data as input. Following [37], the hyper-parameters of the learning-based baseline models for the OPPORTUNITY and UniMiB-SHAR datasets are provided in Table 6, except for ResNet, whose hyper-parameters are those used in one of the paths of the proposed ARN model.

| Model | Parameters | Parameter format |
|---|---|---|
| AE | 5000 | dense units |
| MLP | 2000 | dense units |
| CNN | ((11,1),(1,1),50,(2,1)) | ((kernel size), (sliding stride), number of kernels, (pool)) |
| LSTM | (64,600) | (number of cells, output-dim) |
| Hybrid | ((11,1),(1,1),50,(2,1)) | ((kernel size), (sliding stride), number of kernels, (pool)) |

Table 6: Hyper-parameters of the learning-based methods on the OPPORTUNITY dataset and UniMiB-SHAR dataset.

iii Implementation and Setting

Our ARN model is implemented in TensorFlow, a system that transfers complex data structures to artificial intelligence neural networks for analysis and processing. The computing platform is equipped with two Intel E5-2600 CPUs, 128 GB of RAM, and an NVIDIA TITAN Xp GPU with 12 GB of memory. The model is trained using the ADADELTA gradient descent algorithm with default parameters (i.e., an initial learning rate of 1) for 50 epochs. The batch size is set to 128. The hyper-parameters of the proposed model are provided in a separate table.
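To make the optimizer setting concrete, the following is a minimal pure-Python sketch of the ADADELTA update rule applied to a one-dimensional toy objective. It is not the TensorFlow optimizer used for ARN: the function name `adadelta_minimize`, the toy objective, and the values `rho=0.95` and `eps=1e-6` are assumptions for illustration; only the learning rate of 1 comes from the text.

```python
import math


def adadelta_minimize(grad, x0, steps, lr=1.0, rho=0.95, eps=1e-6):
    """Run `steps` ADADELTA updates on a scalar parameter and return it.

    rho and eps are assumed illustrative values, not settings from the paper.
    """
    x = x0
    avg_sq_grad = 0.0   # running average of squared gradients
    avg_sq_delta = 0.0  # running average of squared parameter updates
    for _ in range(steps):
        g = grad(x)
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g * g
        # Step size is the ratio of the two root-mean-square terms.
        delta = -math.sqrt(avg_sq_delta + eps) / math.sqrt(avg_sq_grad + eps) * g
        avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta * delta
        x += lr * delta
    return x


# Toy objective f(x) = (x - 3)^2 with gradient 2 * (x - 3) (hypothetical example).
def toy_grad(x):
    return 2.0 * (x - 3.0)


x_after_1 = adadelta_minimize(toy_grad, x0=0.0, steps=1)
x_after_50 = adadelta_minimize(toy_grad, x0=0.0, steps=50)
print(x_after_1, x_after_50)
```

The per-parameter step size adapts from the running averages, which is why the learning rate can stay at 1 without manual tuning.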


Sliding Time Window Size: The length of the sliding window is an important hyper-parameter of the proposed model. For the baseline methods, we carried out comparative studies using window lengths of 32, 64 and 96 (see Table 7). For the proposed model, we use 32 or 64 as the window length of the narrow path and 64 or 96 as the window length of the wide path, respectively.
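The windowing step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the stride value, and the choice to align the narrow and wide windows at their final timestep are assumptions that the text does not specify.

```python
def sliding_windows(samples, length, stride):
    """Split a sequence of per-timestep readings into fixed-length windows."""
    return [samples[start:start + length]
            for start in range(0, len(samples) - length + 1, stride)]


def paired_windows(samples, narrow, wide, stride):
    """Build (narrow, wide) window pairs that end at the same timestep.

    The end-aligned pairing is an assumption made for this sketch.
    """
    pairs = []
    for end in range(wide, len(samples) + 1, stride):
        pairs.append((samples[end - narrow:end], samples[end - wide:end]))
    return pairs


# Hypothetical signal: 200 timesteps of a single channel.
signal = list(range(200))

baseline_windows = sliding_windows(signal, length=64, stride=32)
arn_pairs = paired_windows(signal, narrow=32, wide=96, stride=32)
print(len(baseline_windows), len(arn_pairs))
```

Each baseline consumes windows of one length, while the two-path model would receive one short and one long window per training example.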

iv Performance Measure

ADL datasets are often highly unbalanced. The OPPORTUNITY dataset is extremely imbalanced, as the NULL class represents more than 70% of the recorded data (72.28% in Table 4). For this dataset, the overall classification accuracy is not an appropriate measure of performance, because the recognition rate of the majority classes can skew the performance statistics to the detriment of the least represented classes. As a result, many previous studies use an evaluation metric that is independent of the class distribution: the $F_1$-score. The $F_1$-score combines two measures, the precision $p$ and the recall $r$: $p$ is the number of correct positive examples divided by the number of all positive examples returned by the classifier, and $r$ is the number of correct positive results divided by the number of all positive samples. The $F_1$-score is the harmonic mean of $p$ and $r$, with a best value of 1 and a worst value of 0. In this paper, we use an additional evaluation metric to make the comparison with these studies easier: the weighted $F_1$-score (the sum of the class $F_1$-scores, weighted by the class proportion):

$$F_w = \sum_i 2\, w_i \, \frac{p_i \cdot r_i}{p_i + r_i}$$

where $w_i = n_i / N$, $n_i$ is the number of samples in class $i$, and $N$ is the total number of samples.
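As a worked illustration of the weighted $F_1$-score, the sketch below computes the per-class precision, recall and $F_1$, then weights each class by its share of the true labels. The function name `weighted_f1` and the toy labels are hypothetical and are not taken from the paper's code.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Sum of per-class F1-scores, each weighted by the class proportion n_i / N."""
    total = len(y_true)
    class_counts = Counter(y_true)  # n_i for every class present in y_true
    score = 0.0
    for cls, n_i in class_counts.items():
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        predicted = sum(1 for p in y_pred if p == cls)
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / n_i
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n_i / total) * f1
    return score


# Toy imbalanced example: 8 NULL samples and 2 "Open Door" samples.
y_true = ["NULL"] * 8 + ["Open Door"] * 2
always_null = ["NULL"] * 10

accuracy = sum(t == p for t, p in zip(y_true, always_null)) / len(y_true)
print(accuracy, round(weighted_f1(y_true, always_null), 4))
```

In this toy case, a classifier that always predicts NULL reaches 80% accuracy but a weighted $F_1$-score of only about 0.71, because the minority class contributes an $F_1$ of 0.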

| Method | Time window | OPPORTUNITY | UniMiB-SHAR |
|---|---|---|---|
| HC [21] | 32 | 84.95 | 22.83 |
| | 64 | 85.56 | 22.19 |
| | 96 | 85.69 | 21.96 |
| CBH [22] | 32 | 84.37 | 64.51 |
| | 64 | 85.21 | 65.03 |
| | 96 | 84.66 | 64.36 |
| CBS [23] | 32 | 85.53 | 67.54 |
| | 64 | 86.01 | 67.97 |
| | 96 | 85.39 | 67.36 |
| AE [24] | 32 | 82.87 | 68.37 |
| | 64 | 84.54 | 68.24 |
| | 96 | 83.39 | 68.39 |
| MLP [25] | 32 | 87.32 | 73.33 |
| | 64 | 87.34 | 75.36 |
| | 96 | 86.65 | 74.82 |
| CNN [14] | 32 | 87.51 | 74.01 |
| | 64 | 88.03 | 73.04 |
| | 96 | 87.62 | 73.36 |
| LSTM [26] | 32 | 85.33 | 69.24 |
| | 64 | 86.89 | 69.49 |
| | 96 | 86.21 | 68.81 |
| Hybrid [27] | 32 | 87.91 | 73.19 |
| | 64 | 88.17 | 73.22 |
| | 96 | 87.67 | 72.26 |
| ResNet [20] | 32 | 88.91 | 76.19 |
| | 64 | 89.17 | 76.22 |
| | 96 | 87.67 | 75.26 |
| ARN | 32-96 (n)-(w) | 90.29 | 76.39 |

Table 7: Weighted $F_1$-score performances of different methods on the OPPORTUNITY and UniMiB-SHAR datasets. (n) and (w) denote the narrow and wide path, respectively.

v Results and discussions

In this section, we present and discuss the results. To get insight into how these methods are applied to the domain, we show the performance of these methods and evaluate some key parameters.

| Method | Time window (n)-(w) | OPPORTUNITY | UniMiB-SHAR |
|---|---|---|---|
| ARN_1 | 32-64 | 90.21 | 77.23 |
| ARN_2 | 32-96 | 90.29 | 76.39 |
| ARN_3 | 64-96 | 90.19 | 76.04 |

Table 8: Weighted $F_1$-score comparison of ARN with different combinations of sliding window lengths on the OPPORTUNITY and UniMiB-SHAR datasets. (n) and (w) denote the narrow and wide path, respectively.

The weighted $F_1$-scores of all models on OPPORTUNITY and UniMiB-SHAR are listed in Table 7. The results on these datasets show that the proposed ARN method outperforms all other methods against which it was compared. Compared to the best conventional recognition method, CBS, our ARN method achieves absolute boosts of 4.98% and 14.65% on the OPPORTUNITY dataset and the UniMiB-SHAR dataset, respectively. In addition, most of the learning-based recognition methods outperform the conventional recognition methods. In particular, for the OPPORTUNITY dataset, the Hybrid method achieves the best performance among the learning-based baselines other than ResNet; compared to the Hybrid method, our ARN method achieves a boost of 2.4%. For the UniMiB-SHAR dataset, the MLP method achieves the best performance among the learning-based baselines other than ResNet; compared to the MLP method, our ARN method achieves a boost of 2.48%. We also compared against a single-path residual network, i.e., ResNet: our ARN achieves absolute boosts of 1.26% and 1.32% on the OPPORTUNITY dataset and the UniMiB-SHAR dataset, respectively.

From Table 7, we can observe that the gap between the learning-based methods and the conventional methods is larger on the UniMiB-SHAR dataset than on the OPPORTUNITY dataset. A likely reason is that the OPPORTUNITY dataset has more sensor channels than the UniMiB-SHAR dataset. By carefully comparing the results, we found that our proposed method shows a larger performance improvement on the UniMiB-SHAR dataset than on the OPPORTUNITY dataset. This suggests that our method is effective even on a dataset with few sensor channels.

We also observe from Table 7 that the length of the sliding time window affects activity recognition performance. A short time window contains too little information. As the time window grows, it contains more information and the accuracy improves accordingly. However, using an even longer sliding time window does not yield better recognition performance [DBLP:journals/sensors/LiSNKG47]. Most methods perform best in human activity recognition tasks when the window length is 64. The reason is that longer frames potentially contain data related to a higher number of activities, making their majority labeling less accurate.
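The effect of window length on majority labeling can be illustrated with a small sketch. The label sequence below is synthetic and the helper name `majority_label` is hypothetical; the sketch only shows how the share of the dominant label can fall as a window spans more activities, and it makes no claim about the actual datasets.

```python
from collections import Counter


def majority_label(window_labels):
    """Return the most frequent label in a window and its share of the window."""
    label, count = Counter(window_labels).most_common(1)[0]
    return label, count / len(window_labels)


# Synthetic timestamp-level labels: the activity changes every 40 timesteps.
labels = ["walk"] * 40 + ["sit"] * 40 + ["stand"] * 40

for length in (32, 64, 96):
    label, share = majority_label(labels[:length])
    print(length, label, round(share, 3))
```

In this synthetic sequence the window of length 32 is pure, whereas the windows of length 64 and 96 mix two and three activities, so a single majority label describes them less faithfully.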

vi Hyperparameters Evaluation

Impact of the length of the narrow and wide path selection: The proposed model has two paths: a narrow path (i.e., with a short sliding time window) and a wide path (i.e., with a long sliding time window). To verify the impact of different combinations of sliding window lengths on the results, we ran comparison experiments with three combinations (i.e., 32-64, 32-96 and 64-96), denoted ARN_1, ARN_2 and ARN_3, respectively. We carried out the experiments on the two datasets. The weighted $F_1$-score results are shown in Table 8. From Table 8, we can observe that on the OPPORTUNITY dataset, ARN_2 outperforms ARN_1 and ARN_3, with a weighted $F_1$-score of 90.29%. On the UniMiB-SHAR dataset, ARN_1 outperforms ARN_2 and ARN_3, with a weighted $F_1$-score of 77.23%. The performance gap between the three experiments is very small. This indicates that ARN is a stable model that is not sensitive to the lengths of the sliding time windows.
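The size of the gap between the three configurations can be checked directly from Table 8. The short sketch below copies the reported scores and computes the best variant and the spread (maximum minus minimum) per dataset; the helper name `best_and_spread` is hypothetical.

```python
# Weighted F1-scores (in %) copied from Table 8.
table8 = {
    "ARN_1 (32-64)": {"OPPORTUNITY": 90.21, "UniMiB-SHAR": 77.23},
    "ARN_2 (32-96)": {"OPPORTUNITY": 90.29, "UniMiB-SHAR": 76.39},
    "ARN_3 (64-96)": {"OPPORTUNITY": 90.19, "UniMiB-SHAR": 76.04},
}


def best_and_spread(results, dataset):
    """Return the best-scoring variant and the max-min spread for one dataset."""
    scores = {name: row[dataset] for name, row in results.items()}
    best = max(scores, key=scores.get)
    spread = max(scores.values()) - min(scores.values())
    return best, round(spread, 2)


for dataset in ("OPPORTUNITY", "UniMiB-SHAR"):
    print(dataset, best_and_spread(table8, dataset))
```

The spread is 0.10 points on OPPORTUNITY and 1.19 points on UniMiB-SHAR, so the choice of window combination matters somewhat more on the smaller accelerometer-only dataset.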

V Conclusions

In this paper, we propose a novel asymmetric residual network for activity recognition using wearable device data, named ARN. To improve the accuracy of activity recognition, our method consists of two paths. The first path uses a short time window to capture spatial features, and the second path uses a long time window to capture fine temporal features. Unlike other learning-based methods, ARN considers the spatial and temporal features of the data at the same time. It can effectively manage information flow and automatically learn activity feature representations. ARN is an end-to-end network, so the data collected by the wearable device can be input directly into the network. Comprehensive experiments on two benchmark human activity recognition datasets demonstrate that ARN outperforms the compared baselines. The method therefore has good application prospects. However, technologies such as biometric identification, fingerprint recognition and iris recognition have achieved more than 98% accuracy and are widely used in daily life. Compared with these practical applications of recognition, the accuracy of ARN does not yet reach that level. Therefore, there is still considerable room for improvement.

In future work, we may investigate how to extract more fine-grained features using an attention strategy, and study dynamic data fusion algorithms that maximize the retention of data features in order to obtain higher recognition accuracy. In addition, we will recognize dangerous human activities, although recognizing these activities involves extracting many fine-grained features. Therefore, we may take advantage of the memory mechanism to design a memory-augmented neural network [45] that can learn to find supporting pre-stored clues (i.e., representations of historical activities or others).