Attention-Based Deep Learning Framework for Human Activity Recognition with User Adaptation

by   Davide Buffelli, et al.
Università di Padova

Sensor-based human activity recognition (HAR) requires predicting the action of a person based on sensor-generated time series data. HAR has attracted major interest in the past few years, thanks to the large number of applications enabled by modern ubiquitous computing devices. While several techniques based on hand-crafted feature engineering have been proposed, the current state-of-the-art is represented by deep learning architectures that automatically obtain high-level representations and that use recurrent neural networks (RNNs) to extract temporal dependencies in the input. RNNs have several limitations, in particular in dealing with long-term dependencies. We propose a novel deep learning framework, TrASenD, based on a purely attention-based mechanism, that overcomes the limitations of the state-of-the-art. We show that our proposed attention-based architecture is considerably more powerful than previous approaches, with an average increment of more than 7% on the F1 score over the previous best performing model. Furthermore, we consider the problem of personalizing HAR deep learning models, which is of great importance in several applications. We propose a simple and effective transfer-learning based strategy to adapt a model to a specific user, providing an average increment of 6% on the F1 score on the predictions for that user. Our extensive experimental evaluation demonstrates the significantly superior capabilities of our proposed framework over the current state-of-the-art and the effectiveness of our user adaptation technique.



1 Introduction

Sensor-based human activity recognition (HAR) is a time series classification task that involves predicting the movement or action of a person (e.g. walking, running, etc.) based on sensor data. HAR has many practical applications, such as fitness tracking, video surveillance, and gesture recognition. Despite being a well studied and mature problem, HAR has been a very active research area in recent years, due to the rise of ubiquitous computing enabled by smartphones, wearables, and Internet-of-Things devices (Nweke et al., 2018; Yao et al., 2018; Bianchi et al., 2019).

Several previously proposed approaches tackled the problem with hand-crafted features (Figo et al., 2010; Stisen et al., 2015). These kinds of approaches, based on trial and error, require considerable human effort and time, and are not guaranteed to generalize well to unseen subjects. Deep learning enables automatic feature extraction and can hierarchically compose features to obtain high-level representations, which have more discriminative power than handcrafted features based on human expertise. These properties make deep learning models more robust and better at generalizing, and make deep learning the state-of-the-art technique for HAR (Nweke et al., 2018; Wang et al., 2019). In particular, the state-of-the-art is given by the DeepSense framework (Yao et al., 2017), with an architecture based on convolutional neural networks (CNNs) and recurrent neural networks (RNNs).

While RNNs have been used in several domains to capture sequential relationships, they still have some shortcomings, in particular in terms of learning from long input sequences (Hochreiter et al., 2001; Cho et al., 2014). An attractive strategy to enhance, or replace, RNNs is provided by attention models, first introduced in encoder-decoder neural networks in the context of natural language processing (NLP) (Bahdanau et al., 2015). The main idea behind attention mechanisms is to act as a memory-access mechanism that allows the decoder to selectively access the most important parts of the input sequence based on the current context. Attention models alleviate RNNs' difficulties in learning from long input sequences, and successive developments have led to NLP models based solely on attention mechanisms (Vaswani et al., 2017). To the best of our knowledge, the use of pure attention models in deep learning architectures to extract temporal dependencies in multimodal data, such as multi-sensor HAR data, has not been explored.

The human activity recognition task is highly “personal”, in the sense that a single smartphone or smartwatch is usually used by just one person, and the style of walking, running or climbing stairs is peculiar to each individual. It is then natural to aim at the development of deep learning techniques that can be adapted to a specific user. However, the exploration of personalized deep learning models for HAR has been hitherto ignored.

1.1 Our Contribution

We expand the deep learning approaches for HAR with a new purely attention-based framework, TrASenD, that builds upon the state-of-the-art while significantly outperforming it on three different HAR datasets. TrASenD builds on the observation that RNNs do not provide the best way to capture the temporal relationships in the data. We also consider other variants of DeepSense, designed by replacing RNNs with more powerful attention-enhanced RNN mechanisms to capture temporal dependencies, and we show that, while they do perform better than DeepSense, they still perform worse than our purely attention-based TrASenD. In addition, we propose a personalization framework to adapt the model to a specific user over time, increasing the accuracy of the predictions for the user. To achieve this result we use a lightweight transfer learning approach that continues the training of only a small portion of the model with data acquired from the user. We empirically show that this approach significantly improves the performance of the model on data from a specific user.

Our contributions can be summarized as follows:

  • We make use of a purely attention-based mechanism to develop a novel deep learning framework, TrASenD, for multimodal temporal data. TrASenD builds on the state-of-the-art frameworks (Yao et al., 2017, 2019) and significantly enhances them by replacing RNNs with a pure attention architecture inspired by the Transformer (Vaswani et al., 2017).

  • We extensively evaluate TrASenD against the current state-of-the-art and some of its variants that we design. We show that TrASenD significantly outperforms other methods on 3 different HAR datasets, with an average increment of more than 7% on the F1 score over the previous best performing model. We also test the impact of data augmentation on the performance of the models, showing that it plays an important role in their generalization capabilities.

  • We propose a new transfer learning technique to adapt a model to a specific user, in order to exploit the “personal” nature of the HAR task. We obtain an average increment of 6% on the F1 score on the predictions for that user.

  • We empirically prove the effectiveness of our personalization technique. In fact, we show that it is capable of significantly improving the performance of every model we analyze, on each dataset. Furthermore, we empirically demonstrate the learning capabilities of the proposed transfer learning approach.

2 Related Work

We now present the previous work related to our contributions. We first present the most relevant work on deep learning approaches for HAR (Section 2.1), then on attention mechanisms (Section 2.2), and finally on transfer learning and personalization in HAR (Section 2.3).

2.1 Deep Learning for HAR

Following the taxonomy defined in recent surveys (Nweke et al., 2018; Wang et al., 2019), deep learning techniques for sensor-based HAR fall into three main categories. The first category includes architectures composed of RNNs only (e.g. Guan and Plötz, 2017; Inoue et al., 2017). The second category includes architectures based on CNNs only, and can be further divided into two subcategories of models: Data Driven and Model Driven (Wang et al., 2019). Data Driven models (e.g. Hammerla et al., 2016; Sathyanarayana et al., 2016; Pourbabaee et al., 2018) use CNNs directly on the raw data coming from the sensors (each dimension of the data is seen as a channel). On the other hand, Model Driven approaches (e.g. Li et al., 2016b; Ravi et al., 2016; Singh et al., 2017) first preprocess the data to get a grid-like structure, and then use CNNs. The third category is represented by those models that use both CNNs and RNNs (Ordóñez and Roggen, 2016; Singh et al., 2017; Yao et al., 2017, 2019; Ma et al., 2019). Finally, other deep learning techniques used for HAR are autoencoders (Wang et al., 2016; Almaslukh et al., 2017) and Restricted Boltzmann Machines (Hammerla et al., 2015; Li et al., 2016a; Radu et al., 2016).

DeepSense (Yao et al., 2017) is a deep learning framework for HAR that belongs to the third category, and constitutes the state-of-the-art for HAR. DeepSense starts with a CNN to extract features from intervals of data obtained from different sensors. The features extracted from different sensors in a time interval are then concatenated, and other convolutional layers are used to extract higher-level multi-sensor features. Subsequently, an RNN layer (in particular, a stacked GRU layer) is used to learn temporal dependencies between the features extracted at different time intervals. A final layer is then easily customizable to adapt the framework for classification, regression or segmentation tasks.

The authors of DeepSense recently proposed a new version of the framework, SADeepSense (Yao et al., 2019), where they introduce a self-attention mechanism that automatically balances the contributions of multiple sensor inputs. SADeepSense maintains the same architecture as the original DeepSense framework, and adds an attention module to balance the contribution of different sensors based on their sensing quality. Additionally, in the RNN layer, another attention module is used to selectively attend to the most meaningful timesteps. This approach differs significantly from ours: the self-attention module of SADeepSense is used to address the heterogeneity in the sensing quality of multiple sensors and to select the most relevant timesteps for the final prediction, while TrASenD employs a purely attention-based mechanism directly as a means to extract temporal dependencies in the data. Furthermore, SADeepSense retains the stacked GRU layer of the original DeepSense framework, while our approach replaces the GRU layer entirely. Another recently proposed architecture based on the DeepSense framework, which adopts an attention strategy similar to that of SADeepSense, is AttnSense (Ma et al., 2019).

2.2 Attention Models

Attention models were first introduced in encoder-decoder neural networks in the context of natural language processing (NLP) (Bahdanau et al., 2015). The main idea behind attention mechanisms is to allow the decoder to selectively access the most important parts of the input sequence based on the current context. This technique serves as a memory-access mechanism, and overcomes RNNs' difficulties in learning from long input sequences. Attention has then been used for image captioning in an architecture that made use of both CNNs and RNNs (Xu et al., 2015). Since then, attention models have become very popular in the deep learning community as an effective and powerful tool to enhance the capabilities of RNNs (e.g. Luong et al., 2015; Chaudhari et al., 2019; Toshevska and Kalajdziski, 2019). Furthermore, Vaswani et al. (2017) introduced the Transformer architecture, which is the current state-of-the-art for NLP, and completely replaces RNNs with an attention-only mechanism to model temporal relationships.

In HAR, attention models have only been used in addition to an RNN (as described in Section 2.1), and not as a means to directly capture temporal dependencies, which is the approach we propose in TrASenD.

2.3 Transfer Learning and Personalization in HAR

Transfer learning is not new to HAR. In particular, transfer learning has been leveraged to compensate for the limited amount of labeled data when training a model for activity recognition in different environments or circumstances (Lopes et al., 2011; Cook et al., 2013).

A previous (non-deep learning) transfer learning approach for personalized HAR was proposed by Saeedi et al. (2018); it uses the Locally Linear Embedding (LLE) algorithm to construct activity manifolds, which are used to assign labels to unlabeled data that can in turn be used to develop a personalized model for the target user. Other approaches to personalized HAR rely on incremental learning (Siirtola et al., 2018), applied to classifiers that were not based on deep learning, and on Hidden Unit Contributions (Matsui et al., 2017), a small layer inserted in between CNNs and learned from user data. In our approach we use transfer learning to train a small portion of the neural network architecture on data provided by a specific user. We show empirically that this simple and easy to implement technique is in fact capable of adapting the framework to the user. Some preliminary work in this direction can be found in Rokni et al. (2018). We greatly expand on it by: providing quantitative results on the improvements given by this personalization process; comparing with state-of-the-art techniques; and applying the personalization procedure to multiple, different, deep learning architectures. We also present an empirical evaluation of the learning capabilities of the proposed transfer learning technique.

3 Data Preprocessing

In this section we present the preprocessing of the sensor measurements that is performed for TrASenD. (DeepSense (Yao et al., 2017) applies a similar procedure; however, we also report some details, like the interpolation of the measurements and the exact values of the parameters, that were not specified in Yao et al. (2017).) For each sensor $S^{(k)}$, $k \in \{1, \dots, K\}$, let matrix $V^{(k)}$ describe its measurements, and vector $u^{(k)}$ define the timestamp of each measurement. $V^{(k)}$ has size $d^{(k)} \times n^{(k)}$, where $d^{(k)}$ is the number of dimensions for each measurement from sensor $S^{(k)}$ (e.g., $d^{(k)} = 3$ for both accelerometer and gyroscope, as they measure data along the $x$, $y$, and $z$ axes) and $n^{(k)}$ is the number of measurements. $u^{(k)}$ has size $n^{(k)}$. For each sensor $S^{(k)}$, $k \in \{1, \dots, K\}$, the preprocessing procedure is defined as follows:

  • Split the input measurements $V^{(k)}$ and $u^{(k)}$ along time to generate a series of non-overlapping intervals with width $\tau$. These intervals define the set $W^{(k)} = \{(V^{(k)}_t, u^{(k)}_t)\}$, where $|W^{(k)}| = T$ and $t \in \{1, \dots, T\}$.

  • For each pair belonging to $W^{(k)}$, apply the Fourier transform and stack the inputs into a $d^{(k)} \times 2f \times T$ tensor $X^{(k)}$, where $f$ is the dimension of the frequency domain containing $f$ magnitude and phase pairs.

Finally, we group all the tensors in the set $X = \{X^{(k)}\}$, $k \in \{1, \dots, K\}$, which is then the input to our TrASenD framework.

In practice, we first divide the measurements into samples with a length of 5 seconds (with no overlap), and then apply the procedure with fixed values of the interval width $\tau$ (in seconds) and of the frequency dimension $f$. From now on, with the term timestep we refer to a given $\tau$-length interval. In order to deal with uneven sampling intervals that might appear in the data, we first interpolate the measurements in each $\tau$-length interval, sample evenly separated points, and then apply the Fourier transform to those points. The interpolation is linear and is performed along each measurement axis. The measurements in a 5-second sample of each sensor are passed to the architecture as a matrix of size $T \times$ features dimension, where $T$ is the number of timesteps in the sample and the features dimension collects the frequency-domain values (magnitude and phase pairs) of all measurement axes (each training and evaluation example is fed to the network with one matrix per sensor). Notice that applying a convolution operation with filters having a receptive field that spans a single row is like extracting features from each $\tau$-length interval separately.
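The procedure above can be sketched for a single sensor as follows. This is a minimal illustration, not the authors' code: the function name `preprocess_sensor`, its argument names, and the choice of placing the resampling grid at the start of each interval are assumptions, and the values of `tau` and `f` are left as parameters.

```python
import numpy as np

def preprocess_sensor(values, timestamps, sample_len, tau, f):
    """Turn one sample of one sensor into a (T, d * 2 * f) matrix.

    values:     (d, n) array, one row per measurement axis
    timestamps: (n,) increasing array of measurement times, in seconds
    sample_len: length of the sample in seconds
    tau:        width of each interval (timestep) in seconds
    f:          number of evenly separated points resampled per interval
    """
    d = values.shape[0]
    T = int(round(sample_len / tau))
    start = timestamps[0]
    rows = []
    for t in range(T):
        # f evenly separated points inside the t-th tau-length interval
        grid = start + t * tau + np.linspace(0.0, tau, f, endpoint=False)
        feats = []
        for axis in range(d):
            # linear interpolation along each measurement axis
            resampled = np.interp(grid, timestamps, values[axis])
            spectrum = np.fft.fft(resampled)
            # one (magnitude, phase) pair per frequency component
            pairs = np.stack([np.abs(spectrum), np.angle(spectrum)], axis=-1)
            feats.append(pairs.ravel())
        rows.append(np.concatenate(feats))
    return np.stack(rows)
```

Each row of the returned matrix corresponds to one timestep, so a convolution whose receptive field spans a single row only sees one interval at a time.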

Data Augmentation.

Similarly to Yao et al. (2017), for each training example we add 9 artificial examples obtained by adding noise (drawn from a normal distribution with zero mean and a sensor-specific variance, with one value for the accelerometer and another for the gyroscope). The idea behind this procedure is that the data generated by the sensors are already noisy, so having more samples with slightly different noise should make the network more robust to it. We analyze the impact of data augmentation in our experimental section.
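A minimal sketch of this augmentation step is given below. The function name `augment`, the `seed` argument, and the decision to return the original example together with its noisy copies are assumptions made for illustration; the variance is passed in by the caller and would differ per sensor.

```python
import numpy as np

def augment(example, variance, copies=9, seed=0):
    """Return the original example followed by `copies` noisy versions.

    Noise is zero-mean Gaussian with the given variance; the variance
    is a caller-supplied, sensor-specific value.
    """
    rng = np.random.default_rng(seed)
    std = np.sqrt(variance)
    noisy = [example + rng.normal(0.0, std, size=example.shape)
             for _ in range(copies)]
    return np.stack([example] + noisy)
```

With the default `copies=9`, every training example yields a stack of 10 arrays, the first of which is the unmodified input.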

4 Architecture

In this section we present our framework TrASenD. We start with a description of the architectural template defined by the DeepSense framework (Yao et al., 2017). (In Yao et al. (2017) the authors do not specify several architectural parameters, such as filter dimensions, strides, presence of padding, dropout probability, training optimizer, and learning rate; we refer to the parameters found in the authors' publicly available implementation.) We then present the unique characteristics of TrASenD and its redesigned temporal extraction strategy that is based purely on attention. Finally, we present two additional variants of TrASenD with the goal of studying different temporal extraction strategies not based purely on attention, but still more advanced than the stacked GRU layer of DeepSense.

4.1 DeepSense

DeepSense's architecture (Figure 1) can be divided into three parts: convolutional layers, recurrent layers, and output layer. The convolutional layers can be further divided into two subnetworks: an individual convolutional subnetwork for each sensor and a single merge convolutional subnetwork. Each individual convolutional subnetwork (one per sensor) takes as input a matrix with dimension $T \times$ features dimension (see Section 3) and is composed of three convolutional layers with 64 filters each. Intuitively, the filters of the first layer have a receptive field that covers three measurement points and a stride of one measurement point (after the Fourier transform each point is represented by two numbers: magnitude and phase). The second and third individual convolutional layers share the same filter dimension. The convolutions in all three layers are applied without padding and are followed by batch normalization (Ioffe and Szegedy, 2015) and a ReLU activation. Furthermore, dropout (Srivastava et al., 2014) is applied in between the layers. The outputs of the individual subnetworks are then concatenated, obtaining a tensor with a features dimension and a channels dimension (where features depends on the dimension of the filters at the previous layers, and channels is equal to the number of filters of the last individual convolutional layers), and passed to the merge convolutional subnetwork. This subnetwork is composed of three convolutional layers with 64 filters each, each layer with its own filter dimension, this time with padding. Again, after each layer, batch normalization and a ReLU activation are performed, with dropout in between layers.

Figure 1: Scheme of the DeepSense framework (Yao et al., 2017). Individual convolutional subnetworks and the merge convolutional subnetwork share weights across timesteps.

The recurrent layers are composed of two stacked GRU (Chung et al., 2014) layers with 120 cells each. Dropout and recurrent batch normalization (Cooijmans et al., 2017) are performed between the two layers. Then the mean of the outputs at each timestep is taken, and passed to the output layer.

Finally, the output layer is a simple dense layer with a number of units equal to the number of activities to predict. The softmax activation is used to get a probability distribution over the activities, and the cross-entropy is used as loss function:

$$\mathcal{L} = -\sum_{i=1}^{N} \sum_{j=1}^{C} y_{i,j} \log \hat{y}_{i,j}$$

where $N$ is the number of training examples, $C$ is the number of different classes, $y_{i,j}$ is the $j$-th element of the one-hot encoded ground truth for the $i$-th training example, and $\hat{y}_{i,j}$ is the $j$-th element of the output of the architecture (after softmax) for the $i$-th training example.
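The softmax output and the cross-entropy loss described above can be sketched as follows. The helper names and the small constant `eps` are illustrative assumptions; the loss is summed over the training examples here, and dividing by the number of examples would give the mean, which has the same minimizer.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between one-hot targets and predicted probabilities.

    y_true, y_pred: (N, C) arrays; eps guards against log(0).
    """
    return -np.sum(y_true * np.log(y_pred + eps))
```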

4.2 TrASenD

Recurrent Neural Networks (RNNs) present several problems, from the difficulty of learning long-term dependencies (Hochreiter et al., 2001; Cho et al., 2014) to their low computational efficiency. We propose a new framework, building on the architectural template defined in Section 4.1, that replaces the stacked-GRU recurrent layer with an attention-based technique that better exploits temporal dependencies in the data.

We first introduce the attention operator, which is at the core of our attention-based technique for the extraction of temporal dependencies, and then present in more detail the architecture of our proposed framework. Figure 2 (a) shows a scheme of the architecture of our temporal dependencies extractor.

Figure 2: (a) Scheme of TrASenD’s temporal information extraction block. (b) Scheme of the attention mechanism for TrASenD-CA. At a given timestep the high level features extracted from the merge convolutional subnetwork are first flattened and concatenated. The attention mechanism, considering the current state of the GRU layer, generates an attention weight for each feature which is then used to scale them. The sum of the scaled features represents the context vector which is concatenated to the original features and passed as input to the GRU.

4.2.1 Attention Operator

An attention operator takes as input three matrices: a Query matrix $Q$, a Key matrix $K$, and a Value matrix $V$, where each row of the matrices indicates the query, key, or value vector of a specific item (where item usually refers to a feature vector). The attention operator attends every query to every key and obtains a similarity score (also called attention score), which is used to obtain weights for all the value vectors (rows of the Value matrix). Following Vaswani et al. (2017), we obtain the similarity score using the scaled dot-product, and then the attention weights by applying softmax. Finally, the values are scaled with their respective attention weight. The whole process can be written as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$

where $d_k$ is the dimension of query and key vectors. The weights are such that, for every query, the values related to the keys with the highest similarity score are given a higher weight (i.e., more importance). In other words, the weights are used to give more attention to the values that are more pertinent to the given query. We talk about self-attention when the Query, Key, and Value matrices all refer to items of the same sequence. In a multi-headed mechanism, multiple different Query, Key, and Value matrices are created for each item and the attention operator is applied to all of them. The outputs of all the heads are then combined together.
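The attention operator can be sketched in a few lines of NumPy. The function name `attention` and the choice to return the weights alongside the output are illustrative assumptions; the computation itself is the scaled dot-product followed by a row-wise softmax.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention.

    Q: (n_q, d_k) queries, K: (n_k, d_k) keys, V: (n_k, d_v) values.
    Returns the attended values (n_q, d_v) and the weights (n_q, n_k).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # numerically stable softmax over the keys, one row per query
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```

When all keys are identical, every query gets uniform weights and the output reduces to the average of the value vectors, which is a convenient sanity check.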

4.2.2 Architecture

TrASenD follows the feature extraction procedure and the feed-forward output layer of DeepSense, but completely replaces the recurrent layers. In fact, we only use attention to extract temporal dependencies in the data, with a temporal information extractor layer inspired by the Transformer (Vaswani et al., 2017). In more detail, we create a temporal information extractor using an 8-headed self-attention mechanism. To pass the data to the temporal layer, we reshape the output of the merge convolutional subnetwork to have dimension $T \times$ features (where features depends on the size and the number of filters in the merge convolutional subnetwork). The features at different timesteps will be the input of the self-attention mechanism. Every sublayer of the temporal block produces an output with this same size, to allow residual connections.

We start by applying the positional embedding described by Vaswani et al. (2017) to introduce a notion of relative order between the features extracted at different timesteps. Then, for each head, we first multiply the input with 3 different learnable matrices to obtain the query, key, and value matrices (each row of these matrices represents the query, key, or value vector of a timestep). We then obtain the attention score using the scaled dot-product, setting the dimension of the values to be the same as that of the queries and keys. The outputs obtained from each head are then concatenated and multiplied by a learnable matrix to return to a matrix with the same dimension as the input. This matrix is then summed with the original inputs (creating a residual connection), and Layer Normalization (Ba et al., 2016) is applied. The data in each timestep is passed through a position-wise dense layer with ReLU activation (the same feedforward network is used for each timestep; this is equivalent to a one-dimensional convolutional layer over timesteps with kernel size 1). Finally, another residual connection with Layer Normalization is applied to obtain the output of the temporal information extraction block, which is then passed to the feedforward output layer. A scheme of the temporal information extraction block can be found in Figure 2 (a).
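The forward pass of the temporal block described above can be sketched as follows. This is an illustration under several assumptions rather than the authors' implementation: the parameter layout in `init_params`, the random initialization, the use of sinusoidal positional encodings with an even feature size, and Layer Normalization without learnable gain and bias are all simplifications chosen here.

```python
import numpy as np

def positional_encoding(T, d):
    """Sinusoidal positional encoding in the style of Vaswani et al. (2017)."""
    pos = np.arange(T)[:, None]
    i = np.arange(d)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

def layer_norm(x, eps=1e-6):
    """Normalize each row; learnable gain and bias are omitted for brevity."""
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def init_params(d, d_k, heads=8, seed=0):
    """Hypothetical parameter container with small random weights."""
    rng = np.random.default_rng(seed)
    w = lambda *shape: rng.normal(0.0, 0.1, size=shape)
    return {
        "heads": [(w(d, d_k), w(d, d_k), w(d, d_k)) for _ in range(heads)],
        "Wo": w(heads * d_k, d),   # projects concatenated heads back to size d
        "W1": w(d, d),             # position-wise dense layer
        "b1": np.zeros(d),
    }

def temporal_block(x, params):
    """x: (T, d) features, one row per timestep. Returns a (T, d) array."""
    T, d = x.shape
    h = x + positional_encoding(T, d)
    outs = []
    for Wq, Wk, Wv in params["heads"]:
        Q, K, V = h @ Wq, h @ Wk, h @ Wv
        outs.append(softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V)
    attn = np.concatenate(outs, axis=-1) @ params["Wo"]
    h = layer_norm(h + attn)                                 # residual + LayerNorm
    ff = np.maximum(0.0, h @ params["W1"] + params["b1"])    # ReLU dense layer
    return layer_norm(h + ff)                                # residual + LayerNorm
```

Because every sublayer keeps the `(T, d)` shape, both residual sums are well defined without any extra projection.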

4.3 Other Architectural Variants

We now present two variants of TrASenD where we replace the purely attention based temporal information extraction block, with other (simpler, but more advanced than regular RNNs) techniques to capture temporal dependencies in the input.


The first variant substitutes the pure attention temporal block with a bidirectional-RNN (BRNN) (Schuster and Paliwal, 1997). A BRNN generalizes the concept of RNNs by connecting two hidden layers of opposite directions to the same output (we continue using GRUs as forward and backward hidden layers). This allows the network to get information from past and future inputs simultaneously. At each timestep we now get the state of both forward and backward cells, so we concatenate them, and finally take the average of the concatenated outputs at each timestep and pass them to the output layer.


The second variant, TrASenD-CA, is inspired by the work of Xu et al. (2015): we use a GRU layer (we keep it with 120 cells) with an attention mechanism over the output features of the merge convolutional subnetwork. We first average the features extracted from the first $\tau$-length interval (first timestep) and pass the result through a dense layer to obtain the initial state for the GRU layer. We then use the following attention mechanism: at each timestep, we pass the features extracted by the CNN layers and the current state of the GRU through two different dense layers without applying any activation function. We then sum the two outputs and apply an activation function before passing the result to softmax to obtain the attention weights. Finally, the features are scaled with their attention weights. The sum of the scaled feature vectors forms the context vector, which is then concatenated to the original features for the current timestep and passed as input to the GRU. A scheme of this attention mechanism can be found in Figure 2 (b). The rest of the architecture remains unchanged.

4.4 Transfer Learning Personalization

To make the system capable of adapting to a specific user over time, we propose a simple transfer learning strategy. Transfer learning is a method where a model developed for a task is reused as the starting point to learn a model on a second task. The typical scenario in a transfer learning setting is to have a trained base network, which is repurposed by training on a target dataset. The idea is that the pre-trained weights in the base network can ease the training on the target dataset. We slightly depart from this scenario by extracting the output layer from a trained TrASenD model (and other proposed variants); that is, we are using transfer learning only on the output layer. In more detail, the data coming from the sensors is passed to the TrASenD architecture, up to the end of the temporal layer. The output layer becomes a separate network that receives the output of the temporal layer as input, and is trained with the data generated by the user. This can be implemented in a practical scenario by first using a model trained on one of the datasets and, after each prediction, asking the user to manually insert the activity they were performing. We then use these new data samples to retrain only the output layer, which is a single-layer dense network that can easily be trained on-device. This procedure allows the architecture to take advantage of the complex general feature extracting mechanism that reduces multimodal time series to a fixed-size vector, and to subsequently learn user-specific feature characteristics.
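Retraining only the output layer amounts to fitting a softmax classifier on the fixed outputs of the temporal layer. The sketch below illustrates this under stated assumptions: the function name, plain full-batch gradient descent, and the `lr` and `epochs` defaults are choices made here for illustration and are not taken from the paper.

```python
import numpy as np

def personalize_output_layer(W, b, feats, labels, lr=0.1, epochs=50):
    """Continue training a dense softmax output layer on one user's data.

    W: (d, C) weights and b: (C,) bias of the pre-trained output layer.
    feats: (n, d) outputs of the frozen temporal layer for the user's samples.
    labels: (n,) integer activity labels provided by the user.
    Returns updated copies of W and b; the rest of the model is untouched.
    """
    W, b = W.copy(), b.copy()
    onehot = np.eye(W.shape[1])[labels]
    for _ in range(epochs):
        logits = feats @ W + b
        logits -= logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # gradient of the mean cross-entropy with respect to the logits
        grad = (probs - onehot) / len(feats)
        W -= lr * feats.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b
```

Only `W` and `b` change, so the update involves a single matrix of size features by classes, which is what makes on-device training plausible.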

5 Experimental Evaluation

We present here the datasets and the procedure used to evaluate the performance of TrASenD, performing comparisons with multiple other methods for HAR, and to evaluate the proposed personalization process. Furthermore, we present an empirical study of the benefits of the data augmentation procedure.

5.1 Datasets

We present below the three human activity recognition datasets used in our tests. Our choices were based on the statistics shown in Table 3 of the survey by Wang et al. (2019): we considered the datasets that had data from at least 9 subjects (to better test generalization properties), with at least 2 different sensing modalities (to test the various methods on multimodal data), and then took the datasets with the largest number of samples. A summary of the chosen datasets can be found in Table 1.


HHAR (Stisen et al., 2015). The Heterogeneity Activity Recognition Data Set contains data from the accelerometer and gyroscope of 12 different devices (8 smartphones and 4 smartwatches) used by 9 different subjects while performing 6 activities. We only considered data coming from smartphones.

PAMAP2 (Reiss and Stricker, 2012, 2012). The Physical Activity Monitoring dataset contains data of 12 different physical activities, performed by 9 subjects wearing 3 inertial measurement units and a heart rate monitor. We only considered data coming from the inertial measurement units (IMUs), which were positioned in three different body areas (hand, chest, ankle) during the measurements. From each IMU we considered data measured by the first accelerometer, the gyroscope and the magnetometer. This provides a scenario with data coming from 9 input sensors.

USC-HAD (Zhang and Sawchuk, 2012). The University of Southern California Human Activity Dataset uses high-precision specialised hardware, and has a focus on the diversity of subjects, balancing the participants based on gender, age, height and weight. The dataset contains measurements from accelerometer and gyroscope obtained from 14 different subjects while performing 12 activities.

Dataset    Subjects   Activities   Input Sensors
HHAR       9          6            2
PAMAP2     9          12           9
USC-HAD    14         12           2
Table 1: Summary of the multi-modal HAR datasets used for our tests.

5.2 Baselines

We choose an extensive collection of deep learning and non-deep learning methods to compare to TrASenD and its variants. For all considered models, we use the implementation provided by the authors when it is available, and we implement it from scratch following the description from the papers when it is not. Unless otherwise specified, we use the model hyperparameters defined by the authors.

Deep Learning Baselines.

We test our algorithm against all the DeepSense-based architectures and against additional deep learning techniques. For the DeepSense-based architectures, we test against the original DeepSense (Yao et al., 2017) and the two latest attention-enhanced versions: SADeepSense (Yao et al., 2019) and AttnSense (Ma et al., 2019). We then consider DeepConvLSTM (Ordóñez and Roggen, 2016), which is a CNN+LSTM approach, and its attentive version proposed in (Murahari and Plötz, 2018), which we call DeepConvLSTM-Att. All the attention models considered thus far add an attention module to an RNN layer, whereas our algorithm TrASenD completely removes RNNs in favour of a purely attention-based temporal information extraction technique. We also provide results for a basic LSTM-based architecture (implemented with 2 LSTM layers, each with 256 cells, followed by a fully connected layer that outputs the predicted class). Finally, to cover other deep learning techniques, we consider MultiRBM (Radu et al., 2016), where a Restricted Boltzmann Machine (RBM) is used for each sensor, and a single final RBM merges the outputs for all sensors to obtain the predicted class.

Non-Deep Learning Baselines.

As non-deep learning baselines we considered a Random Forest (RF) classifier, one of the most used and most effective shallow classifiers for HAR (Stisen et al., 2015). We apply it first to the same raw frequency domain features fed to the deep learning approaches (denoted RF-FF), and then to the most used handcrafted frequency domain features (DC Component, Spectral Energy, and Information Entropy; denoted RF-HC).
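The three handcrafted features named for RF-HC can be illustrated with a short sketch. The exact definitions used in the experiments are not given in this section, so the code below assumes one common convention: the DC component is the window mean, the spectral energy is the sum of squared DFT magnitudes (excluding the DC term) divided by the window length, and the information entropy is the Shannon entropy of the normalized non-DC magnitudes. The function names are hypothetical.

```python
import cmath
import math


def dft_magnitudes(window):
    """Magnitudes of the discrete Fourier transform of a real-valued window."""
    n = len(window)
    magnitudes = []
    for k in range(n):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(window))
        magnitudes.append(abs(coeff))
    return magnitudes


def handcrafted_features(window):
    """Return (dc_component, spectral_energy, information_entropy).

    Assumed conventions (not taken from the paper):
    - DC component: mean of the window.
    - Spectral energy: sum of squared non-DC magnitudes / window length.
    - Information entropy: Shannon entropy (bits) of normalized non-DC magnitudes.
    """
    n = len(window)
    dc_component = sum(window) / n
    non_dc = dft_magnitudes(window)[1:]
    spectral_energy = sum(m * m for m in non_dc) / n
    total = sum(non_dc)
    if total < 1e-12:
        # A (numerically) constant signal has no spectral content to describe.
        information_entropy = 0.0
    else:
        probs = [m / total for m in non_dc if m > 0]
        information_entropy = -sum(p * math.log2(p) for p in probs)
    return dc_component, spectral_energy, information_entropy


# One full cosine period: energy concentrated in two symmetric frequency bins.
cosine = [math.cos(2 * math.pi * t / 8) for t in range(8)]
dc, energy, entropy = handcrafted_features(cosine)
```

For the cosine window, the energy sits in two mirrored bins of equal magnitude, so the entropy is close to one bit and the DC component is close to zero.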

5.3 Experimental Setup

For all tests we performed leave-one-user-out cross validation: we train on data from all subjects except one, and we use the data from the excluded subject as test set. We perform this procedure for each subject and then average the results.
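The leave-one-user-out protocol can be sketched as follows. This is a minimal illustration, not the code used for the experiments: `samples_by_user` and `train_and_score` are hypothetical names, and the scoring callback stands in for training a model on the training users and computing its F1 score on the held-out user.

```python
def leave_one_user_out(samples_by_user):
    """Yield (held_out_user, train_samples, test_samples), one fold per user."""
    for held_out in sorted(samples_by_user):
        train = [sample
                 for user, samples in samples_by_user.items()
                 if user != held_out
                 for sample in samples]
        test = list(samples_by_user[held_out])
        yield held_out, train, test


def cross_validated_score(samples_by_user, train_and_score):
    """Average the per-fold score returned by train_and_score(train, test)."""
    scores = [train_and_score(train, test)
              for _, train, test in leave_one_user_out(samples_by_user)]
    return sum(scores) / len(scores)


# Toy usage: three users, and a dummy "score" equal to the test-set size.
toy_data = {"a": [1, 2], "b": [3], "c": [4, 5]}
folds = list(leave_one_user_out(toy_data))
mean_score = cross_validated_score(toy_data, lambda train, test: len(test))
```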

To evaluate the personalization process we divide the data of each activity of the excluded user into two equal time-contiguous parts. One part is used to personalize the output layer after the model has been learned on all other users, and the other is used as test set. We also make sure to feed the data, both for training and validation, in time-contiguous samples (simulating the real-world personalization procedure described in Section 4.4).
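A possible way to build the two time-contiguous halves is sketched below. The sample layout `(timestamp, activity, features)` and the function name are assumptions made for illustration. The text does not say which half is used for personalization, so the sketch arbitrarily assigns the earlier half of each activity to personalization and the later half to testing.

```python
def split_user_data(samples):
    """Split one user's samples into (personalization, test) sets.

    samples: list of (timestamp, activity, features) tuples (assumed layout).
    For each activity, samples are ordered by time; the earlier half goes to
    personalization and the later half to testing (an assumed choice).
    """
    by_activity = {}
    for sample in samples:
        by_activity.setdefault(sample[1], []).append(sample)

    personalization, test = [], []
    for activity in sorted(by_activity):
        ordered = sorted(by_activity[activity], key=lambda s: s[0])
        half = len(ordered) // 2
        personalization.extend(ordered[:half])
        test.extend(ordered[half:])
    return personalization, test


# Toy usage with two activities recorded out of order.
toy_samples = [(3, "walk", "x3"), (1, "walk", "x1"), (2, "sit", "y2"),
               (4, "walk", "x4"), (0, "sit", "y0"), (2, "walk", "x2")]
personalization_set, test_set = split_user_data(toy_samples)
```

Keeping each half contiguous in time avoids leaking neighbouring windows of the same recording across the two sets.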

Due to the imbalance in the number of samples per class, we use the F1 score as the measure to quantify the performance of the models. All TrASenD models were implemented using TensorFlow (Abadi et al., 2016), and our code is publicly available.
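To illustrate why the F1 score is preferred under class imbalance, the sketch below computes a macro-averaged F1 score in plain Python. The text does not state which averaging of the per-class scores is used, so macro-averaging is an assumption here, and the function names are hypothetical.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 score of a single class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    labels = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, label) for label in labels) / len(labels)


# A classifier that always predicts the majority class: 90% accuracy,
# but a much lower macro F1 because the minority class is never recognized.
y_true = ["walk"] * 9 + ["sit"]
y_pred = ["walk"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```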

To ensure a fair comparison and to avoid “hyperparameter hacking”, we kept all the architecture hyperparameters (filter size, dropout probability, number of filters, number of GRU units, etc.; see Section 4) equal for each DeepSense-based model. Furthermore, for all models, the only optimized hyperparameter was the learning rate. To do so, we held out one user and ran the training and evaluation procedure on the HHAR dataset with different values of the learning rate. We then took the setting that gave the highest F1 score on that user’s data and used it for all datasets (no optimization for each different dataset). During training, we trained for 30 epochs for each user and took the model of the epoch with the highest performance. All TrASenD-based models were trained using the Adam optimizer (Kingma and Ba, 2014). The other methods were trained with the optimization technique suggested by their authors. For the personalization process, we retrain the output layer for 1 epoch (separately for each new data point) with TensorFlow’s default Adam optimizer parameters: learning rate $0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and the library’s default $\epsilon$.
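The per-data-point retraining of the output layer can be sketched without any deep learning library. The class below is a hypothetical, self-contained illustration: it treats the output layer as a softmax classifier over frozen features and applies one Adam step per labelled sample. The value `eps=1e-7` is an assumption (TensorFlow's default has differed across versions), and nothing here is taken from the authors' implementation.

```python
import math


class AdamSoftmaxLayer:
    """Softmax output layer updated with Adam on frozen input features."""

    def __init__(self, weights, biases, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
        self.w = [row[:] for row in weights]
        self.b = biases[:]
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.t = 0
        # First and second moment estimates; the last column belongs to the bias.
        self.m = [[0.0] * (len(row) + 1) for row in weights]
        self.v = [[0.0] * (len(row) + 1) for row in weights]

    def probabilities(self, x):
        logits = [sum(wi * xi for wi, xi in zip(row, x)) + b
                  for row, b in zip(self.w, self.b)]
        top = max(logits)  # subtract the maximum for numerical stability
        exps = [math.exp(z - top) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def loss(self, x, label):
        """Cross-entropy loss of one sample."""
        return -math.log(self.probabilities(x)[label])

    def update(self, x, label):
        """One Adam step on the cross-entropy loss of a single labelled sample."""
        self.t += 1
        probs = self.probabilities(x)
        inputs = list(x) + [1.0]  # the constant 1.0 drives the bias gradient
        for k, p in enumerate(probs):
            error = p - (1.0 if k == label else 0.0)
            for j, xj in enumerate(inputs):
                g = error * xj
                self.m[k][j] = self.beta1 * self.m[k][j] + (1 - self.beta1) * g
                self.v[k][j] = self.beta2 * self.v[k][j] + (1 - self.beta2) * g * g
                m_hat = self.m[k][j] / (1 - self.beta1 ** self.t)
                v_hat = self.v[k][j] / (1 - self.beta2 ** self.t)
                step = self.lr * m_hat / (math.sqrt(v_hat) + self.eps)
                if j < len(x):
                    self.w[k][j] -= step
                else:
                    self.b[k] -= step


# Toy usage: a 3-class layer over 2 frozen features, updated on one sample.
layer = AdamSoftmaxLayer([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]], [0.0, 0.0, 0.0])
features, label = [1.0, -2.0], 2
loss_before = layer.loss(features, label)
layer.update(features, label)
loss_after = layer.loss(features, label)
```

In a real setting, `features` would be the activations feeding the output layer of the trained network, which stay fixed during personalization.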

5.4 Results

Model HHAR PAMAP2 USC-HAD
RF-FF 0.569 0.512 0.417
RF-HC (Stisen et al., 2015) 0.575 0.501 0.474
MultiRBM (Radu et al., 2016) 0.647 0.589 0.598
LSTM 0.663 0.583 0.612
DeepConvLSTM (Ordóñez and Roggen, 2016) 0.701 0.633 0.658
DeepConvLSTM-Att (Murahari and Plötz, 2018) 0.735 0.647 0.682
DeepSense (Yao et al., 2017) 0.720 0.647 0.670
SADeepSense (Yao et al., 2019) 0.753 0.661 0.688
AttnSense (Ma et al., 2019) 0.762 0.657 0.685
TrASenD-BD 0.798 0.650 0.681
TrASenD-CA 0.797 0.659 0.687
TrASenD 0.848 0.723 0.702
Table 2: F1 score results of the considered models on three HAR datasets.

Table 2 summarizes the F1 score results for TrASenD and the other methods we considered on the three datasets. We can observe that TrASenD and its variants achieve a higher F1 score than DeepSense on all three datasets. Furthermore, TrASenD always achieves the highest performance by a large margin: its F1 score is, on average, more than 7% higher than that of the previous best performing model. These results confirm that our attention-based technique (without RNNs) is highly capable of extracting temporal dependencies. Most notably, TrASenD significantly outperforms the newer SADeepSense and AttnSense, whose performance is comparable to that of TrASenD-BD and TrASenD-CA, which are far from TrASenD’s.
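The average improvement can be checked directly from the values in Table 2. The sketch below assumes that the increment is measured as a relative gain over the best non-TrASenD model on each dataset, and that the three score columns follow the dataset order of Table 1. Neither assumption is stated explicitly in this section.

```python
# Best F1 score among the earlier models in each column of Table 2.
previous_best = {"HHAR": 0.762, "PAMAP2": 0.661, "USC-HAD": 0.688}
# F1 score of TrASenD in the same columns.
trasend = {"HHAR": 0.848, "PAMAP2": 0.723, "USC-HAD": 0.702}


def relative_gain(new, old):
    """Relative improvement of `new` over `old` (0.05 means 5%)."""
    return (new - old) / old


gains = {name: relative_gain(trasend[name], previous_best[name])
         for name in trasend}
average_gain = sum(gains.values()) / len(gains)

for name, gain in gains.items():
    print(f"{name}: {100 * gain:.1f}%")
print(f"average: {100 * average_gain:.1f}%")
```

Under these assumptions the average relative gain is about 7.6%, driven mostly by the first two columns.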

Model HHAR (NP) HHAR (P) PAMAP2 (NP) PAMAP2 (P) USC-HAD (NP) USC-HAD (P)
DeepSense (Yao et al., 2017) 0.720 0.775 0.647 0.693 0.670 0.712
SADeepSense (Yao et al., 2019) 0.753 0.790 0.661 0.699 0.688 0.749
AttnSense (Ma et al., 2019) 0.762 0.801 0.657 0.689 0.685 0.746
TrASenD-BD 0.798 0.821 0.650 0.699 0.681 0.748
TrASenD-CA 0.797 0.819 0.659 0.701 0.687 0.726
TrASenD 0.848 0.889 0.723 0.749 0.702 0.759
Table 3: F1 score results of the deep learning models with (P) and without (NP) personalization.

Table 3 presents the F1 score of the DeepSense-based models when evaluated on the datasets with and without personalization. The results confirm the effectiveness of our transfer learning personalization process, which gives an average increase of 6% on the F1 score and improves every combination of dataset and base architecture.
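The effect of personalization can be recomputed from Table 3. The sketch below assumes that each pair of columns holds the score without personalization followed by the score with personalization, and that the increase is a relative gain averaged over all model and dataset combinations.

```python
# (without personalization, with personalization) pairs from Table 3,
# three datasets per model.
table3 = {
    "DeepSense":   [(0.720, 0.775), (0.647, 0.693), (0.670, 0.712)],
    "SADeepSense": [(0.753, 0.790), (0.661, 0.699), (0.688, 0.749)],
    "AttnSense":   [(0.762, 0.801), (0.657, 0.689), (0.685, 0.746)],
    "TrASenD-BD":  [(0.798, 0.821), (0.650, 0.699), (0.681, 0.748)],
    "TrASenD-CA":  [(0.797, 0.819), (0.659, 0.701), (0.687, 0.726)],
    "TrASenD":     [(0.848, 0.889), (0.723, 0.749), (0.702, 0.759)],
}

# Relative gain of the personalized score over the non-personalized one.
personalization_gains = [(p - np_score) / np_score
                         for pairs in table3.values()
                         for np_score, p in pairs]
average_personalization_gain = sum(personalization_gains) / len(personalization_gains)
smallest_gain = min(personalization_gains)

print(f"average gain: {100 * average_personalization_gain:.1f}%")
print(f"smallest gain: {100 * smallest_gain:.1f}%")
```

Every entry improves with personalization, and the average relative gain comes out at roughly 6%.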

We can therefore conclude that pure attention based models are effective also outside of the Natural Language Processing scenario they were proposed for, and they are a very powerful technique that shows great results on sensor-based data.

5.4.1 Validating the Personalization Process

To show that training the output layer alone can significantly affect the performance of the network, we first train the full model of Section 4.1 on the HHAR dataset with randomly permuted labels, and then we perform the personalization process on correctly labeled data. The resulting F1 scores (on the test set) are 0.166 and 0.523, respectively. The model trained on data with randomly permuted labels has the performance of a uniform random classifier, as one would expect, and the personalization process significantly boosts the performance of the model. This result shows that re-training the output layer alone can largely affect the outcome of the model.
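As a sanity check on the value 0.166, consider a classifier that predicts each of the $K = 6$ HHAR activities uniformly at random. Under the simplifying assumption (ours, not stated in the text) that the classes are balanced, the precision $P_c$ and recall $R_c$ of every class $c$ both equal $1/K$, so the per-class F1 score is

\[
F1_c = \frac{2 P_c R_c}{P_c + R_c} = \frac{2 \cdot \frac{1}{K} \cdot \frac{1}{K}}{\frac{2}{K}} = \frac{1}{K} = \frac{1}{6} \approx 0.167,
\]

which closely matches the score reported for the model trained on permuted labels.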

Figure 3: Performance of the deep learning models on HHAR when trained with different numbers of augmented samples.

5.4.2 Impact of Data Augmentation

To assess the benefits of the data augmentation procedure, we evaluate all the deep learning models based on the DeepSense framework on HHAR with and without augmented data. The results, shown in Table 4, confirm that data augmentation is important for training a model that is more robust to noise: the F1 score increases significantly for every model. Figure 3 shows how the performance of the analyzed DeepSense variants changes when trained with different numbers of augmented samples. Interestingly, using 4 augmented samples for each real sample already provides an important performance gain. We also notice that TrASenD is always superior to the other architectures, and performs significantly better than the others even when trained without augmented samples. Furthermore, SADeepSense and TrASenD are the two architectures showing the smallest gap between the highest and lowest F1 score, confirming their superior generalization properties, with TrASenD achieving a higher overall F1 score.

Model F1 Score on Test Set (A) F1 Score on Test Set (NA)
DeepSense (Yao et al., 2017) 0.720 0.621
SADeepSense (Yao et al., 2019) 0.753 0.682
AttnSense (Ma et al., 2019) 0.762 0.687
TrASenD-BD 0.798 0.646
TrASenD-CA 0.797 0.638
TrASenD 0.848 0.761
Table 4: Performance on HHAR with (A) and without (NA) data augmentation.

6 Conclusions

In this paper we presented TrASenD, a new deep learning framework for multimodal time series, and proposed a transfer learning procedure to personalize the model to a specific user for human activity recognition tasks. TrASenD is designed to improve the extraction of temporal dependencies in the data by replacing RNNs with a purely attention-based temporal information extraction block. Our extensive experimental evaluation shows that TrASenD significantly outperforms the state-of-the-art and that, in general, replacing RNNs with attention-based strategies leads to significant improvements. In particular, we obtain an average increment of more than 7% on the F1 score over the previous best performing model. We also show the effectiveness of our simple personalization process, which yields an average increment of 6% on the F1 score on data from a specific user, and the impact of data augmentation.

The personalization procedure we propose may impact the user experience while using an application that implements our technique. In fact, asking too many times for feedback about the model’s predictions may not be feasible. Future research directions include the optimization of the personalization process to minimize the feedback required from the user, for example by using data augmentation or curriculum training techniques (Bengio et al., 2009).

Work partially supported by MIUR, the Italian Ministry of Education, University and Research, under PRIN Project n. 20174LF3T8 AHeAD (Efficient Algorithms for HArnessing Networked Data) and under the initiative “Departments of Excellence” (Law 232/2016), and by the grant STARS2017 from the University of Padova.


  • M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) Tensorflow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. Cited by: §5.3.
  • B. Almaslukh, J. AlMuhtadi, and A. Artoli (2017) An effective deep autoencoder approach for online smartphone-based human activity recognition. Int. J. Comput. Sci. Netw. Secur 17 (4), pp. 160–165. Cited by: §2.1.
  • J. Ba, R. Kiros, and G. E. Hinton (2016) Layer normalization. CoRR abs/1607.06450. Cited by: §4.2.2.
  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473. Cited by: §1, §2.2.
  • Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. Proceedings of the 26th Annual International Conference on Machine Learning - ICML ’09. External Links: Document, ISBN 9781605585161, Link Cited by: §6.
  • V. Bianchi, M. Bassoli, G. Lombardo, P. Fornacciari, M. Mordonini, and I. De Munari (2019) IoT wearable sensor and deep learning: an integrated approach for personalized human activity recognition in a smart home environment. IEEE Internet of Things Journal 6 (5), pp. 8553–8562. External Links: Document, ISSN 2372-2541, Link Cited by: §1.
  • S. Chaudhari, G. Polatkan, R. Ramanath, and V. Mithal (2019) An attentive survey of attention models. ArXiv abs/1904.02874. Cited by: §2.2.
  • K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio (2014) On the properties of neural machine translation: encoder–decoder approaches. Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. External Links: Document, Link Cited by: §1, §4.2.
  • J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR abs/1412.3555. Cited by: §4.1.
  • T. Cooijmans, N. Ballas, C. Laurent, and A. C. Courville (2017) Recurrent batch normalization. CoRR abs/1603.09025. Cited by: §4.1.
  • D. Cook, K. D. Feuz, and N. C. Krishnan (2013) Transfer learning for activity recognition: a survey. Knowledge and Information Systems 36 (3), pp. 537–556. External Links: Document, ISSN 0219-3116, Link Cited by: §2.3.
  • D. Figo, P. C. Diniz, D. R. Ferreira, and J. M. P. Cardoso (2010) Preprocessing techniques for context recognition from accelerometer data. Personal and Ubiquitous Computing 14 (7), pp. 645–662. External Links: Document, ISSN 1617-4917, Link Cited by: §1.
  • Y. Guan and T. Plötz (2017) Ensembles of deep lstm learners for activity recognition using wearables. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1 (2), pp. 1–28. External Links: Document, ISSN 2474-9567, Link Cited by: §2.1.
  • N. Y. Hammerla, J. Fisher, P. Andras, L. Rochester, R. Walker, and T. Plötz (2015) PD disease state assessment in naturalistic environments using deep learning. In AAAI, Cited by: §2.1.
  • N. Y. Hammerla, S. Halloran, and T. Plötz (2016) Deep, convolutional, and recurrent models for human activity recognition using wearables. In IJCAI, Cited by: §2.1.
  • S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber (2001) Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In A Field Guide to Dynamical Recurrent Neural Networks, S. C. Kremer and J. F. Kolen (Eds.), Cited by: §1, §4.2.
  • M. Inoue, S. Inoue, and T. Nishida (2017) Deep recurrent neural network for mobile human activity recognition with high throughput. Artificial Life and Robotics 23 (2), pp. 173–185. External Links: Document, ISSN 1614-7456, Link Cited by: §2.1.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32Nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, pp. 448–456. External Links: Link Cited by: §4.1.
  • D. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. International Conference on Learning Representations. Cited by: §5.3.
  • X. Li, Y. Zhang, M. Li, I. Marsic, J. Yang, and R. S. Burd (2016a) Deep neural network for rfid-based activity recognition. Proceedings of the Eighth Wireless of the Students, by the Students, and for the Students Workshop on - S3. External Links: Document, ISBN 9781450342551, Link Cited by: §2.1.
  • X. Li, Y. Zhang, I. Marsic, A. Sarcevic, and R. S. Burd (2016b) Deep learning for rfid-based activity recognition. Proceedings of the 14th ACM Conference on Embedded Network Sensor Systems CD-ROM - SenSys ’16. External Links: Document, ISBN 9781450342636, Link Cited by: §2.1.
  • A. P. Lopes, E. Santos, E. Valle, J. Almeida, and A. Araujo (2011) Transfer learning for human action recognition. 2011 24th SIBGRAPI Conference on Graphics, Patterns and Images. External Links: Document, ISBN 9781457716744, Link Cited by: §2.3.
  • T. Luong, H. Pham, and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1412–1421. External Links: Document, Link Cited by: §2.2.
  • H. Ma, W. Li, X. Zhang, S. Gao, and S. Lu (2019) AttnSense: multi-level attention mechanism for multimodal human activity recognition. Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence. External Links: Document, ISBN 9780999241141, Link Cited by: §2.1, §2.1, §5.2, Table 2, Table 3, Table 4.
  • S. Matsui, N. Inoue, Y. Akagi, G. Nagino, and K. Shinoda (2017) User adaptation of convolutional neural network for human activity recognition. 2017 25th European Signal Processing Conference (EUSIPCO). External Links: Document, ISBN 9780992862671, Link Cited by: §2.3.
  • V. S. Murahari and T. Plötz (2018) On attention models for human activity recognition. Proceedings of the 2018 ACM International Symposium on Wearable Computers - ISWC ’18. External Links: Document, ISBN 9781450359672, Link Cited by: §5.2, Table 2.
  • H. F. Nweke, Y. W. Teh, M. A. Al-garadi, and U. R. Alo (2018) Deep learning algorithms for human activity recognition using mobile and wearable sensor networks: state of the art and research challenges. Expert Systems with Applications 105, pp. 233 – 261. External Links: Document, ISSN 0957-4174, Link Cited by: §1, §1, §2.1.
  • F. Ordóñez and D. Roggen (2016) Deep convolutional and lstm recurrent neural networks for multimodal wearable activity recognition. Sensors 16 (1), pp. 115. External Links: Document, ISSN 1424-8220, Link Cited by: §2.1, §5.2, Table 2.
  • B. Pourbabaee, M. J. Roshtkhari, and K. Khorasani (2018) Deep convolutional neural networks and learning ecg features for screening paroxysmal atrial fibrillation patients. IEEE Transactions on Systems, Man, and Cybernetics: Systems 48 (12), pp. 2095–2104. External Links: Document, ISSN 2168-2232, Link Cited by: §2.1.
  • V. Radu, N. D. Lane, S. Bhattacharya, C. Mascolo, M. K. Marina, and F. Kawsar (2016) Towards multimodal deep learning for activity recognition on mobile devices. Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing Adjunct - UbiComp ’16. External Links: Document, ISBN 9781450344623, Link Cited by: §2.1, §5.2, Table 2.
  • D. Ravi, C. Wong, B. Lo, and G. Yang (2016) Deep learning for human activity recognition: a resource efficient implementation on low-power devices. 2016 IEEE 13th International Conference on Wearable and Implantable Body Sensor Networks (BSN). External Links: Document, ISBN 9781509030873, Link Cited by: §2.1.
  • A. Reiss and D. Stricker (2012) Introducing a new benchmarked dataset for activity monitoring. 2012 16th International Symposium on Wearable Computers. External Links: Document, ISBN 9781467315838, Link Cited by: §5.1.
  • A. Reiss and D. Stricker (2012) Creating and benchmarking a new dataset for physical activity monitoring. Proceedings of the 5th International Conference on PErvasive Technologies Related to Assistive Environments - PETRA ’12. External Links: Document, ISBN 9781450313001, Link Cited by: §5.1.
  • S. A. Rokni, M. Nourollahi, and H. Ghasemzadeh (2018) Personalized human activity recognition using convolutional neural networks. ArXiv abs/1801.08252. Cited by: §2.3.
  • R. Saeedi, K. Sasani, S. Norgaard, and A. H. Gebremedhin (2018) Personalized human activity recognition using wearables: a manifold learning-based knowledge transfer. 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). External Links: Document, ISBN 9781538636466, Link Cited by: §2.3.
  • A. Sathyanarayana, S. R. Joty, L. Fernández-Luque, F. Ofli, J. Srivastava, A. K. Elmagarmid, S. Taheri, and T. Arora (2016) Impact of physical activity on sleep: a deep learning based exploration. CoRR abs/1607.07034. Cited by: §2.1.
  • M. Schuster and K.K. Paliwal (1997) Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45 (11), pp. 2673–2681. External Links: Document, ISSN 1053-587X, Link Cited by: §4.3.
  • P. Siirtola, H. Koskimäki, and J. Röning (2018) Personalizing human activity recognition models using incremental learning. Cited by: §2.3.
  • M. S. Singh, V. Pondenkandath, B. Zhou, P. Lukowicz, and M. Liwickit (2017) Transforming sensor data to the image domain for deep learning — an application to footstep detection. 2017 International Joint Conference on Neural Networks (IJCNN). External Links: Document, ISBN 9781509061822, Link Cited by: §2.1.
  • N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, pp. 1929–1958. Cited by: §4.1.
  • A. Stisen, H. Blunck, S. Bhattacharya, T. S. Prentow, M. B. Kjærgaard, A. Dey, T. Sonne, and M. M. Jensen (2015) Smart devices are different: assessing and mitigating mobile sensing heterogeneities for activity recognition. Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems - SenSys ’15. External Links: Document, ISBN 9781450336314, Link Cited by: §1, §5.1, §5.2, Table 2.
  • M. Toshevska and S. Kalajdziski (2019) Exploring the attention mechanism in deep models: a case study on sentiment analysis. ICT Innovations 2019. Big Data Processing and Mining, pp. 202–211. External Links: Document, ISBN 9783030331108, ISSN 1865-0937, Link Cited by: §2.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, Cited by: 1st item, §1, §2.2, §4.2.1, §4.2.2, §4.2.2.
  • A. Wang, G. Chen, C. Shang, M. Zhang, and L. Liu (2016) Human activity recognition in a smart home environment with stacked denoising autoencoders. Lecture Notes in Computer Science, pp. 29–40. External Links: Document, ISBN 9783319471211, ISSN 1611-3349, Link Cited by: §2.1.
  • J. Wang, Y. Chen, S. Hao, X. Peng, and L. Hu (2019) Deep learning for sensor-based activity recognition: a survey. Pattern Recognition Letters 119, pp. 3–11. External Links: Document, ISSN 0167-8655, Link Cited by: §1, §2.1, §5.1.
  • K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei (Eds.), Proceedings of Machine Learning Research, Vol. 37, Lille, France, pp. 2048–2057. External Links: Link Cited by: §2.2, §4.3.
  • S. Yao, S. Hu, Y. Zhao, A. Zhang, and T. Abdelzaher (2017) DeepSense: a unified deep learning framework for time-series mobile sensing data processing. Proceedings of the 26th International Conference on World Wide Web - WWW ’17. External Links: Document, ISBN 9781450349130, Link Cited by: 1st item, §1, §2.1, §2.1, §3, Figure 1, §4, §5.2, Table 2, Table 3, Table 4, footnote 1, footnote 2.
  • S. Yao, Y. Zhao, H. Shao, D. Liu, S. Liu, Y. Hao, A. Piao, S. Hu, S. Lu, and T. F. Abdelzaher (2019) SADeepSense: self-attention deep learning framework for heterogeneous on-device sensors in internet of things applications. IEEE INFOCOM 2019 - IEEE Conference on Computer Communications. External Links: Document, ISBN 9781728105154, Link Cited by: 1st item, §2.1, §2.1, §5.2, Table 2, Table 3, Table 4.
  • S. Yao, Y. Zhao, A. Zhang, S. Hu, H. Shao, C. Zhang, L. Su, and T. Abdelzaher (2018) Deep learning for the internet of things. Vol. 51, IEEE Computer Society (English (US)). External Links: Document, ISSN 0018-9162 Cited by: §1.
  • M. Zhang and A. A. Sawchuk (2012) USC-had: a daily activity dataset for ubiquitous activity recognition using wearable sensors. Proceedings of the 2012 ACM Conference on Ubiquitous Computing - UbiComp ’12. External Links: Document, ISBN 9781450312240, Link Cited by: §5.1.