
TASKED: Transformer-based Adversarial learning for human activity recognition using wearable sensors via Self-KnowledgE Distillation

09/14/2022
by   Sungho Suh, et al.
DFKI GmbH

Wearable sensor-based human activity recognition (HAR) has emerged as a principal research area and is utilized in a variety of applications. Recently, deep learning-based methods have achieved significant improvements in the HAR field alongside the development of human-computer interaction applications. However, standard convolutional neural networks operate only on a local neighborhood, and correlations between sensors at different body positions are ignored. In addition, these methods still face significant challenges: performance degrades due to large gaps between the distributions of training and test data and due to behavioral differences between subjects. In this work, we propose TASKED, a novel Transformer-based Adversarial learning framework for human activity recognition using wearable sensors via Self-KnowledgE Distillation, which accounts for individual sensor orientations and for spatial and temporal features. The proposed method learns cross-domain embedding feature representations from the datasets of multiple subjects, using adversarial learning and maximum mean discrepancy (MMD) regularization to align the data distributions over multiple domains. In addition, we adopt teacher-free self-knowledge distillation to improve the stability of the training procedure and the performance of human activity recognition. Experimental results show that TASKED not only outperforms state-of-the-art methods on four real-world public HAR datasets (alone or combined) but also effectively improves subject generalization.


1 Introduction

Wearable sensor-based human activity recognition (HAR) has become an important field with various applications, such as health management Bachlin et al. (2009); Plötz et al. (2012), home behavior analysis Wen et al. (2016), user authentication Derawi et al. (2010), and exercise Direkoglu and O'Connor (2012); Sundholm et al. (2014). The main task of sensor-based HAR is to recognize which activity is being performed, such as sitting, walking, or hammering, given the data provided by on-body sensors such as the IMUs in smartphones and smartwatches. A large number of studies have been conducted on sensor-based HAR, which can be categorized into traditional machine learning models and deep learning-based methods. Traditional machine learning models represent feature vectors computed over sub-intervals of sensor data using hand-crafted features in the statistical and frequency domains and learn a mapping from feature vectors to activity labels using traditional machine learning classifiers such as random forests and support vector machines (SVM) Bao and Intille (2004); Chavarriaga et al. (2013); Kwon et al. (2018). In the case of deep learning approaches, the raw (or lightly pre-processed) sensor data is fed into a deep neural network, which alone performs the feature extraction and classification. Recently, deep learning methods have achieved state-of-the-art performance in HAR by using convolutional neural networks (CNN) as in Yang et al. (2015) or by combining them with long short-term memory networks (LSTM) as in Ordóñez and Roggen (2016). Even more recently, self-attention approaches have reached state-of-the-art results Mahmud et al. (2020); Khaertdinov et al. (2021).

Although the existing sensor-based HAR methods have achieved impressive results, they have a limitation when handling sensor data from diverse subjects (i.e., users) or when applying the trained models to unseen users. Many existing studies Cutting and Kozlowski (1977); Singh et al. (2017) have shown that different people perform the same activities in different ways due to their varied personal characteristics and behaviors; this makes user recognition possible but can cause significant performance degradation in activity recognition. It is observed in practice as a gap in performance when evaluation leaves out subjects instead of leaving out sessions.

The approaches to this problem can be roughly classified into two categories: classic approaches and deep learning-based methods. Classic approaches include directly selecting user-invariant classical features Saputri et al. (2014) or building one model per user Hong et al. (2015). These methods can achieve good performance, but they require labeled data to build models for all users, which is not realistic for many HAR applications as it may increase costs and deployment time, and selecting hand-crafted user-invariant features may not be feasible, as developing such features requires expertise in the target domain. In addition, this approach may weaken the overall performance depending on the quality of the available features.

Recently, deep learning has been applied to this problem by exploring multi-task or even adversarial learning. In Chen et al. (2020), the authors exploit user labels by combining HAR with subject identification in a multi-task learning framework that allows the model to focus on the relevant features for each user. This shows that taking user information into account can improve classification results, but it is not clear whether such models can generalize beyond the available training subjects, as they were not evaluated with users left out. Still, their work shows that deep learning models can clearly benefit from taking user information into account. Other works, such as Sheng and Huber (2020), have shown that differences in the environment can, to some extent, be mitigated by employing similarity-based multi-task learning, but they also did not evaluate their model with users left out, and their representation tends to create one cluster per subject with sub-clusters per activity, which may not favor generalization and is closer to per-user models.

In the other direction, Bai et al. (2020) adopted adversarial learning to generate a feature representation that is more robust to user variations. Instead of allowing the model to exploit user-specific information for classification, the model should avoid leaking it, as there is an adversary (a discriminator) whose objective is to separate subjects in the feature space. They achieved this by employing a Wasserstein Generative Adversarial Network (WGAN) Arjovsky et al. (2017) and Siamese networks and showed that their method could generalize to new subjects without sacrificing performance. This may have advantages beyond performance, as neural networks are known to leak subject information Iwasawa et al. (2017), and applying adversarial learning can mitigate those concerns.

While deep learning-based methods such as Bai et al. (2020) have achieved significant improvements, most of them still have limitations. Although adversarial learning improved the performance of activity recognition through generalization, it cannot measure the degree of generalization of the latent features during the training procedure. In the adversarial learning procedure, the feature representation modules, acting as generators, are trained to fool a discriminator, while the discriminator conducts binary classification to distinguish between the features produced by the representation modules for two randomly chosen subjects. Thus, it cannot measure the degree of generalization over all subjects and cannot generalize the feature representation over all subjects. Furthermore, their multi-view data representation module comprises three different networks to merge the sub-representations of the different views. Training this complex multi-view data representation module requires a large amount of annotated human activity data. However, it is challenging to collect a sufficiently large amount of data in personalized human activity recognition applications.

To overcome these problems, we propose a novel cross-subject adversarial learning framework for sensor-based human activity recognition. The proposed model is capable of learning a subject-independent embedding feature representation from multiple subjects and generalizing it to unseen target subjects by using an adversarial learning procedure. The adversarial learning between a feature extractor and a discriminator, which distinguishes the extracted features of multiple subjects, learns the distributions of the data of multiple subjects and extracts subject-invariant generalized features for activity recognition. To measure the degree of feature generalization and align the distributions among the multiple subjects, we use Maximum Mean Discrepancy (MMD) Gretton et al. (2006); Li et al. (2015); Long et al. (2015) regularization. The MMD regularization helps enhance the generalization ability of the proposed adversarial learning method by quantifying the feature generalization. However, the performance of human activity recognition and of feature generalization can differ depending on the architecture of the feature extractor network. The previous work Suh et al. (2022) proposed an encoder-decoder structure based on CNNs to preserve the characteristics of the original signals, which has been utilized in weakly supervised learning and feature representation Zeng et al. (2017); Varamin et al. (2018).

In this paper, we extend the previous work by using a transformer network architecture Dosovitskiy et al. (2020); Plizzari et al. (2021), which accounts for individual sensor orientations and for spatial and temporal features, for cross-subject human activity recognition. In addition, we adopt teacher-free self-knowledge distillation Yuan et al. (2020) to improve the stability of the training procedure and to balance the optimization between feature generalization and activity recognition. Through a series of experiments on four public real-world human activity recognition datasets, we demonstrate the effectiveness of the proposed method. In addition, we take the evaluation a step further by combining multiple datasets, and we discuss the overall potential of the proposed model to generalize across various domains.

The contributions of this paper can be summarized as follows.

  • A novel Transformer-based Adversarial learning framework for human activity recognition using wearable sensors via Self-KnowledgE Distillation (TASKED) is proposed.

  • We formulate the adversarial learning between the feature extractor and the subject discriminator and improve the generalization of the extracted features and the performance of activity recognition.

  • We use the MMD regularization to enhance the generalization of the feature representation and measure the degree of the generalization.

  • We adopt the self-knowledge distillation method to improve the stability of the training procedure and balance training optimization between feature generalization and activity recognition.

  • To validate the proposed method, we conducted experiments with four public benchmark datasets: Opportunity Chavarriaga et al. (2013), PAMAP2 Reiss and Stricker (2012), MHEALTH Banos et al. (2014), and RealDISP Baños et al. (2012). Through experiments on each dataset and across multiple datasets, we verify the advantages and effectiveness of the proposed method for feature generalization and activity recognition.

The rest of the paper is organized as follows. Section 2 introduces the related works. Section 3 provides the details of the proposed method. Section 4 presents quantitative experimental results on the four datasets and their combinations. Finally, Section 5 concludes the paper.

2 Related works

2.1 Sensor-based Human Activity Recognition

Many researchers have studied sensor-based HAR Dang et al. (2020). The task of sensor-based HAR can be considered as time-series classification, where sensor data is obtained from different types of sensor devices such as inertial measurement units (IMU), electrocardiography (ECG), electromyography (EMG), and heart rate monitors. Bulling et al. Bulling et al. (2014) introduced a general-purpose process framework for activity recognition, which treats individual frames of sensor data as statistically independent. Previous studies can be classified into classical HAR methods and deep learning-based methods. Classical HAR methods extract hand-crafted features to capture the data distributions of activities. The most frequently used hand-crafted features are time-domain features, such as mean, variance, and skewness, and frequency-domain features, such as the power spectral density Janidarmian et al. (2017). Anguita et al. Anguita et al. (2012) proposed a multi-class SVM model on a smartphone to recognize six locomotion activities. Hammerla et al. Hammerla et al. (2013) introduced an empirical cumulative density function (ECDF) feature to preserve the spatial information of the signal frames. Kwon et al. Kwon et al. (2018) extended it by adding temporal structure to the ECDF and showed improved activity recognition.
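As an illustration of this classical pipeline, the sketch below computes a few common time- and frequency-domain features per sliding window and feeds them to a random forest. The feature set, window shape, and classifier settings are illustrative assumptions rather than the configurations used in the cited works.

```python
import numpy as np
from scipy.stats import skew
from scipy.signal import welch
from sklearn.ensemble import RandomForestClassifier

def window_features(window, fs=50):
    """Simple statistical and spectral features per channel.

    window: array of shape (T, S) -- T samples over S sensor channels.
    """
    feats = [window.mean(axis=0), window.var(axis=0), skew(window, axis=0)]
    # Power spectral density, summarized by its mean per channel.
    _, psd = welch(window, fs=fs, axis=0, nperseg=min(64, len(window)))
    feats.append(psd.mean(axis=0))
    return np.concatenate(feats)

def extract_features(windows, fs=50):
    return np.stack([window_features(w, fs) for w in windows])

# Illustrative usage with random data standing in for segmented sensor windows.
rng = np.random.default_rng(0)
X_windows = rng.normal(size=(200, 100, 6))   # 200 windows, 100 samples, 6 channels
y = rng.integers(0, 5, size=200)             # 5 activity classes

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(extract_features(X_windows), y)
```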

Recently, deep learning-based HAR methods have been widely explored Plötz et al. (2011); Lane et al. (2015); Alsheikh et al. (2016), following the fast development and advancement of deep neural networks. Yang et al. Yang et al. (2015) proposed a human activity recognition method using CNNs, in which multiple convolution and pooling filters were applied along the temporal dimension to process the sensor data. Bhattacharya and Lane Bhattacharya and Lane (2016) introduced a sophisticated model optimization method for resource-constrained inference on wearable devices such as smartphones and smartwatches. In contrast to activity recognition methods using CNNs, recurrent deep learning methods Nakano and Chakraborty (2017) have also been researched in the field of HAR. The most prominently used models, LSTM units Hochreiter and Schmidhuber (1997), are recurrent neural networks with, in principle, unbounded memory at every computing node and have been used very successfully for HAR. Ordóñez and Roggen Ordóñez and Roggen (2016) proposed DeepConvLSTM, which combines the LSTM model with a number of CNN layers to learn sensor representations by capturing short-term and long-term temporal correlations for activity recognition. Alsheikh et al. Alsheikh et al. (2016) proposed a hybrid approach using a deep belief network as an emission matrix of a hidden Markov model to extract features from sequences of human activities. In Hammerla et al. (2016), appropriate training procedures for basic deep learning approaches, including temporal CNNs and deep LSTM networks, were analyzed through large-scale experimentation. Guan and Plötz Guan and Plötz (2017) proposed ensembles of deep LSTM networks, which combine multiple LSTM networks and achieved better performance than the previous works. More recently, the self-attention mechanism Vaswani et al. (2017) has been applied to activity recognition. Zeng et al. Zeng et al. (2018) applied the self-attention mechanism to LSTM networks to highlight the important parts of time-series signals. Similarly, Mahmud et al. Mahmud et al. (2020) proposed an activity recognition method employing self-attention, which has reached state-of-the-art results. However, such deep models often require high computation and memory resources. Furthermore, all these works assume that the data of the subjects in the training and test datasets follow the same distribution. In real-world activity recognition applications, different people perform the same activities in different ways, so there is a large gap between the data distributions of different subjects.

2.2 Subject-independent Human Activity Recognition

Generally, there are two ways to capture interpersonal variability: one is to increase the amount of training data from different subjects, and the other is to extract subject-independent features. The former is expensive, as it is often infeasible to collect and annotate data from many different people. Recently, transfer learning methods have been investigated to solve cross-domain HAR problems, including cross-sensor-modalities Morales and Roggen (2016), cross-locations Chiang et al. (2017), and cross-subjects Handiru and Prasad (2016); Zhao et al. (2020b). Domain adaptation is a particular branch of transfer learning that measures data distribution heterogeneity and aligns data distributions Cook et al. (2013). Deng et al. Deng et al. (2014) proposed a cross-person activity recognition method using a reduced kernel extreme learning machine on the source domain, which classifies the target samples and adds the high-confidence samples to the training dataset. Zhao et al. Zhao et al. (2011) introduced a transfer learning embedded decision tree algorithm that integrates a decision tree and the k-means clustering algorithm to recognize different personalized activities on mobile phones via model adaptation. Wang et al. Wang et al. (2018a) proposed a stratified transfer learning method that adopted pseudo-labeling of the unlabeled target data by measuring the MMD between the feature spaces of the source domain data and the pseudo-labeled target data. Khan et al. Khan et al. (2018) proposed a heterogeneous deep convolutional neural network (HDCNN) that used a feature matching approach to adapt a pre-trained network, trained with a supervised source domain dataset, to an unlabeled target dataset collected from a smartwatch, by minimizing the discrepancy between the two datasets after every convolutional and fully connected layer. They used the Kullback-Leibler (KL) divergence as a distance measure between the pre-trained network and the target domain feature extractor network. Faridee et al. Faridee et al. (2019) proposed the AugToAct framework, which directly aligns data distributions between source and target domains by combining augmentation transformations with deep semi-supervised learning to infer complex activities with minimal labels in both source and target domains. AugToAct performs domain adaptation similarly to HDCNN but employs the Jensen-Shannon (JS) divergence instead of the KL divergence to minimize the discrepancy between the two domains. Akbari and Jafari Akbari and Jafari (2019) extracted stochastic features by training a variational auto-encoder instead of a deterministic feature extractor and employed the same network architecture as HDCNN to apply the feature matching approach and adapt the model to a target environment. Zhao et al. Zhao et al. (2020a) proposed a local domain adaptation method for cross-domain HAR, which aligns the distributions of source and target domains by using MMD regularization. They first classified the activities into abstract clusters and mapped the original features into a low-dimensional subspace where the MMD between two clusters with the same label from different domains was minimized. The above domain adaptation approaches consider only a single source domain for the domain adaptation task.

Several studies have investigated multiple source domain adaptation for sensor-based activity recognition. Some approaches focus on explicitly identifying the source domain most relevant to the target domain among the multiple source domains, based on similarity measurements such as cosine similarity. Wang et al. Wang et al. (2018b) proposed a transfer neural network to perform knowledge transfer for activity recognition (TNNAR), which captures both the temporal and spatial relationships between activities. They explicitly select the most relevant domain from the multiple source domains based on cosine similarity and use the selected domain for domain adaptation to the target domain. Another group of approaches combines all the available source domain data and projects it into a lower-dimensional space, which is further processed by the classifier. Jeyakumar et al. Jeyakumar et al. (2019) proposed SenseHAR, a sensor fusion model that maps each device's raw sensor values to a shared low-dimensional latent space. They mitigated the heterogeneous data distribution and assigned labels to the unlabeled data. However, the above-mentioned approaches do not consider data from multiple source domains simultaneously and cannot capture the uncertainty within the classification tasks.

Recently, multi-task learning and generative adversarial network (GAN) Goodfellow et al. (2014) based methods have been introduced to address the problem of differing data distributions. Chen et al. Chen et al. (2020) proposed a deep multi-task learning-based activity and user recognition (METIER) model, which combines activity recognition and user recognition in a multi-task model. The model shares parameters between the activity recognition module and the user recognition module, and the activity recognition performance can be improved by the user recognition module through a mutual attention mechanism. Sheng and Huber Sheng and Huber (2020) proposed weakly supervised multi-task representation learning, which uses Siamese networks with a temporal convolutional network as the backbone model. Bai et al. Bai et al. (2020) introduced a discriminative adversarial multi-view network, which extracts multi-view features from temporal, spatial, and spatio-temporal views using CNNs, and generalizes the multi-view features by employing a WGAN and a Siamese network architecture to decrease the variation between the features extracted from different subjects.

Another general advance in deep neural networks, relevant to this work, is the use of attention and transformers Vaswani et al. (2017). These models have achieved state-of-the-art results in many areas, including natural language processing Devlin et al. (2018), vision Dosovitskiy et al. (2020), and skeleton-based HAR Plizzari et al. (2021). Other works, such as Mahmud et al. (2020), also use self-attention for wearable sensor-based HAR and obtain state-of-the-art results. Still, their model combines data from the different sensors present in the first layers using 1x1 convolutions, forgoing richer explicit attention across sensors further down the network.

3 Proposed method

3.1 Problem Formulation

In this section, we introduce a transformer-based adversarial learning framework for cross-subject sensor-based human activity recognition via self-knowledge distillation. The sensor-based activity recognition task is to use data collected by sensors at different positions on the human body to predict one of $N_c$ activity labels. Let $X = \{x_i\}_{i=1}^{N}$ be the time-series sensor data obtained by applying a sliding window of size $T$ over the $S$ available sensor channels, with $x_i \in \mathbb{R}^{T \times S}$ consisting of the sensor data for that window, $Y = \{y_i\}_{i=1}^{N}$ the corresponding activity label set, and $U = \{u_i\}_{i=1}^{N}$ the corresponding subject label set. We assume that the training data shares the same activities and sensor types as the test data, that the test subjects are unseen during training, and that data from a held-out training subject is used for validation. The goal of the proposed method is to generalize the features extracted from the source subjects to the target subjects and to improve the performance of the overall activity classification.
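A minimal sketch of the sliding-window segmentation just described; the label-per-window convention (using the last sample's label) and the variable names are assumptions, since the text does not state how window labels are assigned.

```python
import numpy as np

def sliding_windows(signal, labels, subjects, window_size, step):
    """Segment a continuous multi-channel recording into overlapping windows.

    signal:   (N, S) array of N time steps over S sensor channels
    labels:   (N,) per-sample activity labels
    subjects: (N,) per-sample subject labels
    Returns windows of shape (num_windows, window_size, S) and one activity
    and subject label per window.
    """
    X, y, u = [], [], []
    for start in range(0, len(signal) - window_size + 1, step):
        end = start + window_size
        X.append(signal[start:end])
        y.append(labels[end - 1])
        u.append(subjects[end - 1])
    return np.stack(X), np.array(y), np.array(u)

# Example with a PAMAP2-style setting: 2 s windows (200 samples), 0.5 s steps.
sig = np.zeros((10000, 36)); act = np.zeros(10000, dtype=int); sub = np.zeros(10000, dtype=int)
Xw, yw, uw = sliding_windows(sig, act, sub, window_size=200, step=50)
```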

3.2 Network Architecture with transformer

Figure 1: The overall network architecture of the proposed framework.

Our goal is to extract cross-subject generalized features to improve the performance of the activity recognition network. The proposed neural network architecture comprises three independent networks: a feature extractor, an activity classifier, and a subject discriminator. The feature extractor represents the proposed transformer network, and the activity classifier maps the features extracted from the feature extractor into the activity labels. The subject discriminator is trained to distinguish the subject label from which the embedding features originated. In contrast, the feature extractor is trained to fool the subject discriminator by providing features that cannot be discriminated between subjects by using an adversarial learning framework. To improve the stability of the training procedure and optimize the balance between the subject discriminator and the activity classifier, we train the proposed overall networks with the self-knowledge distillation method using the pre-trained feature extractor and activity classifier. The overall structure of the proposed framework with the three independent networks is shown in Figure 1.

The feature extractor aims to map the input space to a common embedding feature space, and we denote the feature extractor by $F$.

Our feature extractor works as follows: First, each sensor's data is processed by its own independent convolutional layer. This is done to bring all sensors to the same number of channels, but it has other advantages. Separate convolutional layers are beneficial for HAR Plötz et al. (2011) and make the architecture more adaptable, as one can add different ones for different sensor modalities or applications. Then, we concatenate all sensor representations, generating a tensor of size $T \times S \times C$, where $T$ is the size of our sliding window, $S$ is the number of sensors, and $C$ is the selected number of filters for each sensor's convolutional layer. The number of sensors and the window size vary per dataset, but we used $C = 32$ across all of them.

As transformers have obtained state-of-the-art results in many fields, including skeleton-based HAR Plizzari et al. (2021), we combine sensor representations using our spatial attention block, which is based on Plizzari et al. (2021). Our block (Figure 2) computes attention scores between sensors (spatial attention) and includes temporal convolutions that also reduce the temporal dimension as the number of channels increases. Notice that our spatial attention block includes only convolutions with a kernel size of one (network-in-network) for the attention part, while the temporal convolutions have bigger kernels. This is because spatial attention computes attention weights across sensors, regardless of time. In fact, our preliminary experiments demonstrated that network performance does not improve with bigger kernels for computing query, key, and value. Moreover, in order to avoid overfitting, our blocks use three types of regularization: batch normalization, dropout, and drop connect. Unlike dropout, which randomly zeros network outputs, drop connect randomly zeros network weights, in our case the weights of the attention matrix. This is done after the softmax, so it is followed by re-normalization so that the weights still sum up to one. We selected a transformer architecture for our feature extractor given its performance and the activities we want to recognize. In our experiments, we are not interested in information across windows (long-term dependencies). To handle those, one could add an LSTM layer at the end of the feature extractor or opt for a transformer variant that keeps track of such dependencies, such as that of Wang et al. (2019).
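The sketch below illustrates, in PyTorch, a spatial attention block in the spirit of the description above: 1x1 (network-in-network) convolutions produce queries, keys, and values, attention is computed across sensors independently at each time step, drop connect zeros entries of the attention matrix after the softmax and re-normalizes the rows, and a strided temporal convolution with a larger kernel reduces the temporal dimension. The layer sizes, kernel sizes, and drop rates are assumptions and do not reproduce the exact block in Figure 2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttentionBlock(nn.Module):
    """Attention across sensors (spatial), followed by a temporal convolution."""

    def __init__(self, in_ch, out_ch, drop=0.1, drop_connect=0.1, t_kernel=5, t_stride=2):
        super().__init__()
        # 1x1 (network-in-network) convolutions for query/key/value.
        self.q = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.k = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.v = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.drop_connect = drop_connect
        # Temporal convolution with stride reduces the temporal dimension.
        self.temporal = nn.Conv2d(out_ch, out_ch, kernel_size=(1, t_kernel),
                                  stride=(1, t_stride), padding=(0, t_kernel // 2))
        self.bn = nn.BatchNorm2d(out_ch)
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        # x: (batch, channels, sensors, time)
        q, k, v = self.q(x), self.k(x), self.v(x)
        # Attention scores between sensors, computed independently per time step.
        attn = torch.einsum('bcst,bcrt->bsrt', q, k) / (q.shape[1] ** 0.5)
        attn = attn.softmax(dim=2)
        if self.training and self.drop_connect > 0:
            # Drop connect: randomly zero attention weights, then re-normalize rows.
            mask = torch.rand_like(attn).ge(self.drop_connect).float()
            attn = attn * mask
            attn = attn / attn.sum(dim=2, keepdim=True).clamp_min(1e-6)
        out = torch.einsum('bsrt,bcrt->bcst', attn, v)
        return self.drop(F.relu(self.bn(self.temporal(out))))
```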

The activity classifier $C$ predicts the activity labels $\hat{y} \in \mathcal{Y}$ from the embedding features $z = F(x)$, where $\mathcal{Y}$ denotes the activity label space. In this paper, the activity classifier is designed with an average pooling layer over the time dimension and a 1D convolution layer. The loss of the activity classifier influences the feature extractor, as both are trained to minimize the supervised classification loss.

The subject discriminator $D$ distinguishes the subject label from which an embedding feature originated. The goal of the proposed method is to extract subject-invariant features from multiple domains. In other words, the feature extractor generates a common embedding feature space for multiple subjects. Although the training procedure with supervised activity classification encourages learning the data distribution of the activities, the extracted embedding features still contain subject-specific information. By adopting an adversarial scheme between the subject discriminator and the feature extractor, the subject discriminator is trained to distinguish the subject label from the embedding features, and the feature extractor is trained to fool the subject discriminator. A strong subject discriminator can push the feature extractor to generate embedding features that generalize across data from different subjects. The subject discriminator is composed of three discriminator blocks, each including a convolution layer, batch normalization, and dropout, followed by two fully connected layers. To align the distributions among the source and target subjects and further generalize the embedding feature representation, we use the MMD Gretton et al. (2006); Li et al. (2015); Long et al. (2015) regularization.
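A sketch of a subject discriminator matching the description above (three convolutional blocks with batch normalization, Leaky ReLU, and dropout, followed by two fully connected layers); the kernel size, channel progression, and the pooling before the fully connected layers are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class SubjectDiscriminator(nn.Module):
    """Predicts which subject an embedding feature sequence came from."""

    def __init__(self, in_ch, num_subjects, hidden=10, drop=0.3):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv1d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm1d(c_out),
                nn.LeakyReLU(0.2),
                nn.Dropout(drop),
            )
        self.blocks = nn.Sequential(block(in_ch, 32), block(32, 64), block(64, 128))
        self.fc1 = nn.Linear(128, hidden)
        self.fc2 = nn.Linear(hidden, num_subjects)

    def forward(self, z):
        # z: (batch, channels, time) embedding features from the feature extractor
        h = self.blocks(z).mean(dim=-1)      # global average pooling over time
        return self.fc2(F.relu(self.fc1(h)))
```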

In addition, the feature extractor and activity classifier are regularized using self-knowledge distillation from the pre-trained model to improve the stability of the training procedure and to balance the optimization between feature generalization and activity recognition. Following Yuan et al. (2020), knowledge distillation, seen from the perspective of label smoothing regularization, regularizes model training by replacing the one-hot labels with smoothed ones. We deploy this label smoothing regularization following the conceptual ideas of the teacher-free knowledge distillation technique in the training procedure of the proposed framework. Through this regularization, the feature extractor and activity classifier are trained by themselves and are prevented from being biased toward feature generalization. Detailed descriptions of the loss functions are given in the following subsection. The detailed structure of the proposed adversarial cross-subject networks is shown in Table 1.

Network Name Layers Act. Func. Input Tensor Output Dimension
Feature Extractor Input for each Sensor - - -
Conv block (1 per ) per sensor [Conv , st=1] ReLU Input for each Sensor
Concatenation - - The Conv blocks
Positional Encoding positional encoding as proposed in Wang and Liu (2021). - Concatenation
Spatial Attention 1 [Spatial Attention Block ,] - Positional Encoding
Spatial Attention 2 [Spatial Attention Block ,] - Spatial Attention 1
Spatial Attention 3 [Spatial Attention Block ,] - Spatial Attention 2
Average Pooling over - Spatial Attention 3
Activity Classifier Average Pooling over - - Average Pooling over 256
FC 1 Fully Connected - FC 1
Subject Discriminator Discriminator Block 1 Conv , st=2 Leaky ReLU Average Pooling over 32
Discriminator Block 2 Conv , st=2 Leaky ReLU Discriminator Block 1 64
Discriminator Block 3 Conv , st=2 Leaky ReLU Discriminator Block 2 128
FC 1 Fully Connected ReLU Discriminator Block 3 10
FC 2 Fully Connected - FC 1
Table 1: Detailed structure of the proposed TASKED framework. The fourth column indicates the activation function used in the layer, while $k$, $T$, and $S$ denote the kernel size, the window size, and the number of channels of the input sensor signals, respectively.
Figure 2: Structure of the attention block used in the transformer feature extractor.

3.3 Loss functions

As shown in Figure 1, we define three loss functions to train the three independent networks: 1) a classification loss $\mathcal{L}_{cls}$, 2) a domain loss $\mathcal{L}_{D}$, and 3) an MMD loss $\mathcal{L}_{MMD}$. Let $F$, $C$, and $D$ denote the feature extractor, the activity classifier, and the subject discriminator, respectively.

In the proposed network architecture, the feature extractor and activity classifier are trained by supervised learning with the given activity labels. The class distribution in human activity recognition datasets is often imbalanced. Training a network under imbalanced data conditions can have a detrimental effect on human activity recognition performance and produce biased classification results. To improve the performance under imbalanced data conditions, we adopt a weighted cross-entropy loss and a dice loss over the activity labels as the initial classification loss. The initial classification loss for the feature extractor and activity classifier is expressed as follows:

$$\mathcal{L}_{cls}^{init} = -\sum_{c=1}^{N_c} w_c\, y_c \log \hat{y}_c + \left(1 - \frac{2\sum_{c=1}^{N_c} (y \odot \hat{y})_c}{\sum_{c=1}^{N_c} y_c + \sum_{c=1}^{N_c} \hat{y}_c}\right) \qquad (1)$$

where the first term is the weighted cross-entropy classification loss, the second term is the dice classification loss, $w_c$ is the weight for each class label, $y$ is the one-hot vector of the ground truth of the activity label, $\hat{y} = C(F(x))$ is the output probability vector from the activity classifier, and $\odot$ denotes element-wise multiplication.
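A PyTorch sketch of how the initial classification loss of Equation 1 could be implemented; the batch-level dice formulation, the smoothing constant, and the inverse-frequency weighting hint are assumptions.

```python
import torch
import torch.nn.functional as F

def weighted_ce_dice_loss(logits, targets, class_weights, eps=1e-6):
    """Weighted cross-entropy plus dice loss for imbalanced activity labels.

    logits:        (batch, num_classes) raw classifier outputs
    targets:       (batch,) integer activity labels
    class_weights: (num_classes,) per-class weights, e.g. inverse class frequency
    """
    ce = F.cross_entropy(logits, targets, weight=class_weights)
    probs = logits.softmax(dim=1)
    one_hot = F.one_hot(targets, num_classes=logits.shape[1]).float()
    # Dice term over the batch: element-wise product of the one-hot ground
    # truth and the predicted probabilities.
    intersection = (one_hot * probs).sum()
    dice = 1.0 - (2.0 * intersection + eps) / (one_hot.sum() + probs.sum() + eps)
    return ce + dice
```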

The knowledge distillation technique transfers knowledge from a teacher model to a student model so that the performance of the student model is improved. The student model learns from a more informative source (the predictive probabilities of the teacher model) instead of just one-hot labels. Following Yuan et al. (2020), we first train the student model, consisting of the feature extractor and activity classifier, in the normal way to obtain a pre-trained model by minimizing Equation 1. Hinton et al. Hinton et al. (2015) proposed to use temperature scaling to soften the predictive probabilities for better distillation. Given the pre-trained models $F^p$ and $C^p$, the output prediction of the pre-trained network is expressed as follows:

$$p^p_\tau(k) = \frac{\exp\left(z^p_k / \tau\right)}{\sum_{j=1}^{N_c} \exp\left(z^p_j / \tau\right)} \qquad (2)$$

where $z^p$ denotes the output logit vector of the pre-trained model and $\tau$ denotes a temperature parameter. The idea of self-knowledge distillation is to regularize the student model against the pre-trained model by using the Kullback-Leibler (KL) divergence. Namely, the knowledge distillation (KD) regularization loss is defined as follows:

$$\mathcal{L}_{KD} = D_{KL}\left(p^p_\tau \,\|\, p_\tau\right) \qquad (3)$$

where $D_{KL}$ is the KL divergence, and $p^p_\tau$ and $p_\tau$ are the output probabilities of the pre-trained models ($F^p$, $C^p$) and the student models ($F$, $C$), respectively. Similar to the original knowledge distillation method, we minimize both the initial classification loss and the KD regularization loss between the predictions of the pre-trained and student models:

$$\mathcal{L}_{cls} = (1 - \alpha)\,\mathcal{L}_{cls}^{init} + \alpha\,\mathcal{L}_{KD} \qquad (4)$$

where $\alpha$ is a hyperparameter that balances the two terms.
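A sketch of the self-distillation objective of Equations 2 to 4, reusing weighted_ce_dice_loss from the sketch above; the tau-squared scaling of the KD term follows the common convention from Hinton et al. (2015) and, like the default alpha and tau values, is an assumption.

```python
import torch.nn.functional as F

def self_kd_loss(student_logits, teacher_logits, targets, class_weights,
                 alpha=0.5, tau=4.0):
    """Classification loss regularized by distillation from the pre-trained model.

    teacher_logits come from the frozen pre-trained feature extractor/classifier.
    """
    # Hard-label term: weighted cross-entropy + dice (Equation 1).
    hard = weighted_ce_dice_loss(student_logits, targets, class_weights)
    # Soft-label term: KL divergence between temperature-softened predictions.
    log_p_student = F.log_softmax(student_logits / tau, dim=1)
    p_teacher = F.softmax(teacher_logits / tau, dim=1)
    kd = F.kl_div(log_p_student, p_teacher, reduction='batchmean') * (tau ** 2)
    return (1.0 - alpha) * hard + alpha * kd
```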

Moreover, the goal of the subject discriminator is to distinguish which subject generated the embedding feature representations. The subject discriminator is thus trained to minimize the domain loss:

$$\mathcal{L}_{D} = -\sum_{j=1}^{N_s} u_j \log \left(D(F(x))\right)_j \qquad (5)$$

where $u$ is the one-hot ground truth of the subject label, $N_s$ is the number of subjects, and the equation is the standard cross-entropy classification loss.

Here, in addition to adversarial learning, we use the MMD regularization to align the distributions among different subjects and to further improve the generalization of the embedding features extracted by the feature extractor. The MMD is one of the most commonly used non-parametric methods to measure the distance between the distributions of two different domain datasets. The feature extractor $F$ produces the embedding features from the input signals, and we let $Z_i$ and $Z_j$ represent the embedding representations of two different subject domains. A mapping operation $\phi(\cdot)$ projects the representations of the two domains onto a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ Gretton et al. (2006), and the mean distance between the two domains is calculated in the RKHS. The MMD between two subject domains can be calculated using the function $\phi(\cdot)$ as follows:

$$\mathrm{MMD}^2(Z_i, Z_j) = \left\| \frac{1}{|Z_i|}\sum_{z \in Z_i} \phi(z) - \frac{1}{|Z_j|}\sum_{z' \in Z_j} \phi(z') \right\|_{\mathcal{H}}^2 \qquad (6)$$

The key to calculating the MMD is to find an appropriate mapping function $\phi(\cdot)$ that maps the two domains into the RKHS $\mathcal{H}$; the mean difference between the two data distributions after the mapping is then taken as their discrepancy. A characteristic kernel function $k(z, z') = \langle \phi(z), \phi(z') \rangle$ replaces the explicit mapping, and the MMD can be rewritten as follows:

$$\mathrm{MMD}^2(Z_i, Z_j) = \frac{1}{|Z_i|^2}\sum_{z, z' \in Z_i} k(z, z') - \frac{2}{|Z_i||Z_j|}\sum_{z \in Z_i}\sum_{z' \in Z_j} k(z, z') + \frac{1}{|Z_j|^2}\sum_{z, z' \in Z_j} k(z, z') \qquad (7)$$

Generally, a Gaussian kernel, which maps the data to an infinite-dimensional space, is used as the kernel function in the MMD algorithm. This MMD is based on a single kernel transformation. In this work, we adopt the multi-kernel MMD (MK-MMD) Long et al. (2015), an extension of the MMD in which the optimal kernel is obtained as a linear combination of multiple kernels. The kernel function is defined as a convex combination of positive semi-definite (PSD) kernels $\{k_m\}_{m=1}^{M}$. The total kernel is defined as follows:

$$k = \sum_{m=1}^{M} \beta_m k_m, \qquad \beta_m \ge 0, \quad \sum_{m=1}^{M} \beta_m = 1 \qquad (8)$$

where the total kernel $k$ is a weighted combination of the different kernels $k_m$, and $\beta_m$ is the coefficient (the weight of $k_m$) that ensures that the generated multi-kernel is characteristic.
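A sketch of the MK-MMD of Equations 6 to 8 using a bank of Gaussian kernels; the bandwidth schedule (median pairwise distance scaled by powers of two) and the equal kernel weights are assumptions, as the text only specifies that Gaussian kernels are used.

```python
import torch

def gaussian_kernel(x, y, num_kernels=5, mul_factor=2.0):
    """Sum of Gaussian kernels with bandwidths around the median pairwise distance."""
    n_x = x.shape[0]
    z = torch.cat([x, y], dim=0)
    d2 = torch.cdist(z, z).pow(2)
    bandwidth = d2.detach().median().clamp_min(1e-6)
    bandwidths = [bandwidth * mul_factor ** (i - num_kernels // 2) for i in range(num_kernels)]
    k = sum(torch.exp(-d2 / bw) for bw in bandwidths) / num_kernels
    return k[:n_x, :n_x], k[:n_x, n_x:], k[n_x:, n_x:]

def mk_mmd(x, y):
    """Squared MK-MMD between two batches of embedding features (Equation 7)."""
    k_xx, k_xy, k_yy = gaussian_kernel(x, y)
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()
```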

Unlike general domain adaptation, the goal of the proposed method is to generalize the features over multiple subject domains, taken either from only the source subjects or from both source and target subjects. Thus, the overall MMD regularization loss is described as follows:

$$\mathcal{L}_{MMD} = \sum_{i < j} \mathrm{MMD}^2(Z_i, Z_j) \qquad (9)$$

where the sum runs over all pairs of subject domains considered.

The goal of the feature extractor is to extract generalized features across different subject domains, preserve the characteristics of the original data, and learn class-discriminative feature representations. In other words, the feature extractor is trained to jointly minimize the classification and MMD losses and to maximize the domain loss for the adversarial learning, whereas the activity classifier and the subject discriminator are trained to minimize the classification loss and the domain loss, respectively. Finally, the objective functions of the proposed method are defined as follows:

$$\min_{F} \ \lambda_c \mathcal{L}_{cls} + \lambda_m \mathcal{L}_{MMD} - \lambda_d \mathcal{L}_{D}, \qquad \min_{C} \ \mathcal{L}_{cls}, \qquad \min_{D} \ \mathcal{L}_{D} \qquad (10)$$

where $\lambda_c$ controls the relative importance of activity classification, and $\lambda_d$ and $\lambda_m$ are hyperparameters that control the effect of domain generalization.

In summary, the classification loss $\mathcal{L}_{cls}$ improves the performance of activity recognition with the regularization of self-knowledge distillation, the MMD regularization term $\mathcal{L}_{MMD}$ measures and aligns the distribution distances among different subjects, and the domain loss $\mathcal{L}_{D}$ prevents the feature extractor from encoding subject-domain-specific information through the adversarial learning between the feature extractor and the subject discriminator.

3.4 Training Procedure

Figure 3: Training procedure, representing the two steps described in Section 3.4. Solid lines indicate that the network is being trained and dashed lines indicate that the parameters of the network are fixed.
0:  Require: batch size, Adam hyperparameters, and the loss hyperparameters ($\alpha$, $\tau$, $\lambda_c$, $\lambda_d$, $\lambda_m$).
1:  Input: training data $X$ with activity labels $Y$ and subject labels $U$, and target-subject data with subject labels.
2:  for the number of training iterations for step 1 do
3:     Sample a batch from the training dataset $X$ with the corresponding activity labels $Y$ and domain labels $U$.
4:     Update the activity classifier by minimizing Equation 1.
5:     Update the subject discriminator by minimizing Equation 5.
6:     Update the feature extractor by minimizing Equations 5 and 1.
7:  end for
8:  for the number of training iterations for step 2 do
9:     Sample a batch from the training dataset $X$ with the corresponding activity labels $Y$ and domain labels $U$.
10:    Update the subject discriminator by minimizing Equation 5.
11:    Update the activity classifier by minimizing Equation 4.
12:    Update the feature extractor by minimizing Equation 10.
13:    Sample a batch from the target dataset with the corresponding domain labels.
14:    Extract the embedding features of the target batch with the feature extractor.
15:    Update the subject discriminator by minimizing Equation 5.
16:    Update the feature extractor using Equations 9 and 5.
17:  end for
Algorithm 1: Training procedure for the TASKED framework. Default values are used for the hyperparameters.

The training procedure for the proposed TASKED framework consists of two steps in order to train the three independent neural networks stably. Figure 3 presents the training procedure with the detailed steps. The first step is multi-task learning as a pre-training step for activity and subject recognition. In this step, we jointly minimize the losses of activity classification and subject discrimination to train the feature extractor to capture the data distribution of each activity and the data distributions among subjects. At the same time, the subject discriminator is trained to minimize the discriminator loss (Equation 5) and the activity classifier is trained to minimize the classification loss (Equation 1). In the second step, we train the adversarial learning framework with MMD regularization and self-knowledge distillation. In this step, we compute the final classification loss (Equation 4) using the KD regularization between the pre-trained and the student model. While the feature extractor is trained to minimize the subject discriminator loss in the first step, it is trained to maximize the discriminator loss in the second step, following the adversarial learning scheme. The training details for the proposed adversarial feature extraction method are summarized in Algorithm 1.
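An abbreviated sketch of one update of step 2 in Algorithm 1, reusing the self_kd_loss and mk_mmd sketches above: the discriminator is updated to recognize subjects, and the feature extractor and classifier are then updated to minimize the classification, KD, and MMD losses while maximizing the discriminator loss. The loss weights, the pooling used before the MMD term, and the batching over subject pairs are simplifying assumptions.

```python
import itertools
import torch

def train_step2(batch, feat, clf, disc, feat_pre, clf_pre,
                opt_fc, opt_d, class_weights, lam_d=0.1, lam_m=1.0):
    """One adversarial update (step 2). feat_pre/clf_pre are the frozen pre-trained models."""
    x, y, u = batch                      # windows, activity labels, subject labels

    # (a) Update the subject discriminator to recognize subjects.
    z = feat(x).detach()
    opt_d.zero_grad()
    loss_d = torch.nn.functional.cross_entropy(disc(z), u)
    loss_d.backward()
    opt_d.step()

    # (b) Update feature extractor and classifier: minimize classification + KD + MMD,
    #     maximize the discriminator loss (adversarial term enters with a minus sign).
    opt_fc.zero_grad()
    z = feat(x)
    with torch.no_grad():
        teacher_logits = clf_pre(feat_pre(x))
    loss_cls = self_kd_loss(clf(z), teacher_logits, y, class_weights)
    pooled = z.mean(dim=-1)              # one vector per window for the MMD term
    loss_mmd = sum(mk_mmd(pooled[u == a], pooled[u == b])
                   for a, b in itertools.combinations(u.unique().tolist(), 2))
    loss_adv = torch.nn.functional.cross_entropy(disc(z), u)
    (loss_cls + lam_m * loss_mmd - lam_d * loss_adv).backward()
    opt_fc.step()
```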

4 Experimental results

4.1 Datasets and Evaluation Metrics

To evaluate the effectiveness of the proposed method for human activity recognition, four popular public HAR datasets were used: Opportunity Chavarriaga et al. (2013), PAMAP2 Reiss and Stricker (2012), MHEALTH Banos et al. (2014), and RealDISP Baños et al. (2012). They contain continuous data from various sensors for different human activities performed by different participants.

Dataset Subjects Activities Channels Frequency (Hz) Window size Sliding step
Opportunity - locomotion 4 5 113 30 64 16
Opportunity - gestures 4 18 113 30 64 16
PAMAP2 8 12 36 100 200 50
MHEALTH 10 12 23 50 200 50
RealDISP 17 33 81 50 120 60
Table 2: Evaluation dataset information for our experiments
  • Opportunity: The Opportunity dataset is for human activity recognition from wearable, object, and ambient sensors. It contains annotated recordings from four users with 7 inertial measurement units (IMU) and 12 on-body accelerometers. The dataset was recorded with 6 runs per user: 5 runs are activity of daily living (ADL) runs with a natural execution of daily activities, and the last run is a drill run with a scripted sequence of activities. The dataset contains a total of 6 hours of recordings. The provided sampling frequency of all IMU sensors is 30 Hz. The dataset comprises three types of sensors: body-worn sensors with 145 channels, object sensors with 60 channels, and ambient sensors with 37 channels. In this paper, we selected 113 channels, taking into account only the body-worn sensors, including the IMUs and accelerometers, following the setup of Ordóñez and Roggen (2016). We preprocessed all channels of sensor data to fill in missing values using linear interpolation and to normalize the data values per channel to the interval [0, 1], with manually set minimum and maximum values per channel as in Ordóñez and Roggen (2016). We used a sliding window size of 64 with a sliding step of 16, which is close to a two-second sliding window with a 0.5-second step size. We use two types of annotations from the dataset. One is modes of locomotion and postures, such as Stand, Walk, Sit, and Lie, annotated with five classes. The other is 18 mid-level gestures, such as Open Door, Close Door, and Clean Table.

  • PAMAP2: The PAMAP2 dataset was recorded from nine participants who were instructed to perform 18 activities of daily living. The dataset contains a total of more than 10 hours of recordings. One heart rate monitor channel and three IMUs were recorded, with the IMUs placed on the subject's chest, dominant wrist, and dominant ankle. The dataset has 52 channels, containing one channel of heart rate and 17 channels per IMU. The full IMU data is composed of 6 channels of acceleration, 3 channels of gyroscope, 3 channels of magnetometer, 4 channels of orientation, and one channel of temperature. In this work, we selected a total of 36 channels by removing the heart rate channel, the temperature channel per IMU, and the 4 orientation channels per IMU, since the orientation of the IMUs is mentioned as invalid in the data collection. Additionally, we removed the six activities classified as "Optional" in the dataset and the ninth subject, since the "Optional" activities were collected from only one subject and the ninth subject executed only one activity. Thus, a total of 12 activities, named "protocol" in the dataset, from 8 subjects are used in this work. We preprocessed all channels of the selected sensor data to fill in NaN values using linear interpolation. All channels were normalized to zero mean and unit variance per user. The IMU data was collected at a sampling frequency of 100 Hz, and we used a sliding window length of 200 (2 seconds) with a sliding step of 50 (0.5 seconds).

  • MHEALTH: The MHEALTH dataset contains basic body movements and vital signs recorded from 10 volunteers with diverse profiles. The volunteers carried out 12 physical activities such as climbing stairs, walking, waist bends forward, and cycling. Each activity was recorded for one minute or repeated 20 times. Three IMUs were placed on the subject's chest, right wrist, and left ankle to measure acceleration, rate of turn, and magnetic field orientation. Additionally, the chest sensor provides 2-lead ECG measurements, which can be used for basic heart monitoring or for checking the effects of exercise on the ECG. The provided sampling rate of all sensing modalities is 50 Hz. To evaluate the proposed method with the iterative leave-one-subject-out cross-validation procedure, we augmented the dataset with a sliding window length of 200 (4 seconds) and a step size of 50 (1 second), unlike other methods Nguyen et al. (2015); Sheng and Huber (2020), which used a sliding window length of 5 seconds and a step size of 2.5 seconds.

  • RealDISP: The RealDISP dataset consists of a total of 9 hours of daily activity data from 17 volunteers. The volunteers were instructed to perform 33 activities involving whole-body movements and body-part-specific activities such as walking, jogging, jumping, cycling, knee bending, waist bending, and rowing. Nine IMUs were distributed on the arms, wrists, calves, thighs, and back of the subject. Each sensor provides 3 channels of acceleration, 3 channels of gyroscope, 3 channels of magnetic field measurements, and 4 channels of quaternions for orientation. The sensor data was collected at a sampling rate of 50 Hz, and we used a sliding window length of 120 (2.4 seconds) with a sliding step of 60 (1.2 seconds).

The detailed information of the four datasets is summarized in Table 2. To evaluate the performance of the proposed model and how much the performance varies depending on the subject, we conduct two different experiments: 1) a leave-one-subject-out cross-validation procedure on a single dataset, and 2) a leave-one-subject-out cross-validation procedure across different datasets. Unlike the experimental settings in other methods Ordóñez and Roggen (2016); Chen et al. (2020), in the leave-one-subject-out cross-validation procedure all data from one subject are used as the test set and all data from another subject are used as the validation set, while all data from the remaining subjects are used as the training set. We used each of the first two subjects (that are not the test subject) as the validation set in turn. Thus, there are two iterations per test set, but with two different validation sets. For instance, if there are 10 subjects in the given dataset, the training and validation procedure is conducted 20 times. The second experiment is leave-one-subject-out across different datasets. In this scheme, the data of one subject from each dataset is used as the test set and the data of another subject from each dataset is used as the validation set, while the remaining data from each dataset are used as the training set. We selected common activity classes, sensor positions, and sensor data types among the different datasets. The evaluation was repeated two times for each test set.
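The fold construction described above can be sketched as follows; the function name and the ordering of the validation subjects are illustrative.

```python
def loso_folds(subject_ids):
    """Yield (train, validation, test) subject splits for the LOSO protocol above.

    Each subject is used once as the test set; for every test subject the first
    two remaining subjects serve, in turn, as the validation subject, so a
    dataset with 10 subjects yields 20 train/validation runs.
    """
    for test_subject in subject_ids:
        remaining = [s for s in subject_ids if s != test_subject]
        for val_subject in remaining[:2]:
            train = [s for s in remaining if s != val_subject]
            yield train, val_subject, test_subject

# Example: PAMAP2 with 8 subjects -> 16 runs.
for train_subjects, val_subject, test_subject in loso_folds(list(range(1, 9))):
    pass  # build datasets from the subject splits and train one model per fold
```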

To evaluate and compare the performance of the proposed method with others, we adopted three evaluation metrics that are used in various human activity recognition studies Ordóñez and Roggen (2016); Bai et al. (2020): accuracy (Acc), weighted F1-score (F_w), and macro F1-score (F_m):

$$\mathrm{Acc} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (11)$$
$$F_w = \sum_{c=1}^{N_c} w_c \cdot \frac{2\,TP_c}{2\,TP_c + FP_c + FN_c} \qquad (12)$$
$$F_m = \frac{1}{N_c} \sum_{c=1}^{N_c} \frac{2\,TP_c}{2\,TP_c + FP_c + FN_c} \qquad (13)$$

where $TP$, $FP$, $TN$, and $FN$ denote the true positive, false positive, true negative, and false negative values, $N_c$ denotes the number of classes, and $w_c = n_c / N$ is the proportion of samples of the $c$-th class, with $n_c$ being the number of samples of the $c$-th class and $N$ the total number of samples.
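These metrics can be computed with scikit-learn; the sketch below is one way to do so and matches the definitions in Equations 11 to 13.

```python
from sklearn.metrics import accuracy_score, f1_score

def evaluate(y_true, y_pred):
    """Accuracy, weighted F1, and macro F1 as used in the comparison tables."""
    return {
        "acc": accuracy_score(y_true, y_pred),
        "f1_weighted": f1_score(y_true, y_pred, average="weighted"),
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
    }
```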

4.2 Implementation Details

The experiments were all implemented using Python scripts in the PyTorch framework. Training procedures were conducted on a Linux system with four NVIDIA Quadro RTX 8000 GPUs. The hyperparameters for Equation 9 and the hyperparameters for Equation 2 and Equation 4 were tuned empirically; through various tests, we observed that the chosen parameter values yield the best performance. We used the Adam optimizer Kingma and Ba (2014). The batch size was 128 and the number of training epochs was 200. Early stopping with a patience of 20 epochs was used to avoid overfitting. For the MK-MMD, Gaussian kernels are used, and their number is set to 5.
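A minimal sketch of the optimization setup described above; the learning rate and beta values shown are placeholders, not the values used in the paper.

```python
import torch

def make_optimizer(model, lr=1e-4):   # lr is a placeholder; the exact value is not stated here
    return torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.999))

class EarlyStopping:
    """Stop training when the validation metric has not improved for `patience` epochs."""

    def __init__(self, patience=20):
        self.patience, self.best, self.bad_epochs = patience, float("-inf"), 0

    def step(self, val_metric):
        if val_metric > self.best:
            self.best, self.bad_epochs = val_metric, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience   # True -> stop training
```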

4.3 Comparison Results on a Single Dataset

The proposed TASKED framework was evaluated on the Opportunity, PAMAP2, MHEALTH, and RealDISP datasets. The three evaluation metrics were used to evaluate and compare the proposed method against deep-learning-based state-of-the-art methods, including multi-channel time-series convolutional neural networks (MC-CNN) Yang et al. (2015), DeepConvLSTM Ordóñez and Roggen (2016), the self-attention activity recognition method Mahmud et al. (2020), the METIER model Chen et al. (2020), and our previous method "Adversarial CNN" Suh et al. (2022). MC-CNN is a CNN-based model consisting of three convolutional layers, two pooling layers, and two fully connected layers. DeepConvLSTM is a combined CNN and LSTM model for activity recognition, which comprises four convolutional layers and two LSTM layers to learn both spatial and temporal correlations. The self-attention activity recognition method introduced a self-attention mechanism to improve the performance of wearable sensor-based human activity recognition. The METIER model solves the activity recognition and user recognition tasks jointly and transfers knowledge across them by softly sharing parameters between the activity recognition and user recognition networks and by employing a mutual attention mechanism so that each task highlights important features for the other. Lastly, the previous method, Adversarial CNN, proposed adversarial feature extraction for user-independent human activity recognition based on a CNN architecture. In addition, to show the effectiveness of the proposed adversarial learning with MMD regularization and self-knowledge distillation, we evaluate the proposed transformer network alone. This serves as an ablation study, highlighting how much of the improvement comes from the new transformer architecture as opposed to our proposed adversarial framework. For a fair comparison, we used the same experimental conditions, such as the split of the training and test datasets, the data preprocessing, the window size, and the step size, as addressed in the previous section.

The Opportunity dataset is normally evaluated using two of its label types: the modes of locomotion recognition task and the gesture recognition task. We conducted the comparison experiments and report the results for the two tasks separately.

Dataset Task Method Acc F_w F_m
Opportunity Locomotion MC-CNN 65.79 8.73 63.59 9.54 62.32 11.00
DeepConvLSTM 66.02 5.23 65.39 5.08 62.88 10.47
Self-Attention 61.71 21.66 60.17 22.21 55.27 20.32
METIER 73.97 3.84 73.62 3.84 74.24 3.98
Adversarial CNN 73.11 2.98 73.06 3.04 72.78 7.73
Transformer 64.19 10.97 62.80 13.19 61.95 8.24
TASKED 75.83 2.54 75.83 2.57 77.09 3.41
Gestures MC-CNN 69.47 5.96 68.20 4.79 27.86 7.39
DeepConvLSTM 71.37 4.23 71.23 3.37 35.59 7.36
Self-Attention 74.09 5.30 69.64 5.29 28.89 12.79
METIER 71.49 4.94 73.31 3.45 44.44 6.27
Adversarial CNN 76.25 2.42 75.54 2.13 44.65 4.54
Transformer 73.41 5.32 72.95 4.62 41.50 8.82
TASKED 80.20 3.55 79.75 3.40 54.40 6.66
PAMAP2 MC-CNN 76.50 15.77 75.12 17.48 67.84 16.28
DeepConvLSTM 66.43 18.74 64.69 20.19 57.06 17.03
Self-Attention 76.92 15.45 76.01 16.58 68.68 15.24
METIER 80.90 7.51 80.74 8.51 72.66 9.53
Adversarial CNN 80.91 15.39 80.79 16.33 73.73 16.05
Transformer 81.38 12.93 80.81 13.77 73.44 12.72
TASKED 83.04 11.35 82.93 12.35 75.21 11.74
MHEALTH MC-CNN 85.31 8.81 82.46 10.50 82.33 10.64
DeepConvLSTM 85.46 7.80 83.22 8.91 83.40 8.68
Self-Attention 86.42 10.41 83.85 12.16 84.33 11.58
METIER 87.80 6.29 86.02 7.59 86.70 7.18
Adversarial CNN 91.82 7.54 90.53 8.67 91.03 8.07
Transformer 91.92 6.77 90.77 8.29 90.97 7.55
TASKED 95.00 3.72 94.66 4.14 94.42 4.09
RealDISP MC-CNN 70.19 15.19 70.41 14.14 70.13 15.01
DeepConvLSTM 68.86 17.03 69.74 15.30 68.33 15.75
Self-Attention 85.21 16.04 85.93 14.15 85.41 15.16
METIER 82.39 15.22 83.16 12.52 82.67 14.05
Adversarial CNN 78.32 18.03 78.66 16.72 78.65 16.95
Transformer 83.93 19.02 84.51 17.69 84.96 18.03
TASKED 85.97 18.46 86.78 16.54 86.83 17.04
Table 3: Comparison results with the state-of-the-art methods on the Opportunity, PAMAP2, MHEALTH, and RealDISP datasets. The numbers are expressed in percent and represented as mean ± standard deviation.
Figure 4: Box-and-whisker plots of the comparison results with the state-of-the-art methods in terms of Acc, F_w, and F_m on (a) the Opportunity dataset for the modes of locomotion recognition task, (b) the Opportunity dataset for gesture recognition, (c) the PAMAP2 dataset, (d) the MHEALTH dataset, and (e) the RealDISP dataset.

Table 3 shows the quantitative evaluation results on the Opportunity, PAMAP2, MHEALTH, and RealDISP datasets. Because the Opportunity dataset for gesture recognition has 18 types of activities and is severely imbalanced, the results in terms of F_m are relatively lower and have a larger standard deviation than the results in terms of Acc and F_w. The results show that the proposed TASKED framework achieves significantly higher performance on the Opportunity dataset for both the locomotion and gesture recognition tasks in terms of all three measures. Furthermore, TASKED provides high-performance results with small inter-subject variation, whereas the state-of-the-art methods give comparatively high standard deviations. Comparing the proposed method with the previous method "Adversarial CNN" shows that the proposed transformer architecture and self-knowledge distillation improve the performance. For example, TASKED achieves improvements of 4.31, 9.75, 1.48, 3.39, and 8.18 percentage points over the Adversarial CNN in terms of F_m on the Opportunity dataset (for the locomotion and gesture tasks), PAMAP2, MHEALTH, and RealDISP datasets, respectively. In addition, comparing the proposed method with the Transformer, which is the proposed neural network architecture without the subject discriminator and adversarial learning, shows that the proposed adversarial learning and MMD regularization improve the performance of activity recognition.

In Figure 4 (a) and (b), we analyze the results of TASKED and the state-of-the-art methods with box-and-whisker plots on the Opportunity dataset in terms of Acc, F_w, and F_m. The bars signify the performance distribution between the 25% and 75% quantiles. The proposed method not only provides significantly better performance but also smaller performance variance on both tasks than the state-of-the-art methods.

Additionally, the performance variances on the PAMAP2 and RealDISP datasets are much larger than on the Opportunity and MHEALTH datasets because of the diversity between different subjects and the data quality. The proposed TASKED outperforms all the state-of-the-art methods on the MHEALTH dataset in terms of all three metrics. TASKED achieves improvements of 3.08, 3.89, and 3.39 percentage points over the best state-of-the-art method in terms of Acc, F_w, and F_m, respectively. Additionally, the standard deviation of the performance of TASKED is much smaller than that of the state-of-the-art methods. The comparison results demonstrate the superiority of TASKED. Figure 4 (c)-(e) shows the box-and-whisker plots on the PAMAP2, MHEALTH, and RealDISP datasets in terms of Acc, F_w, and F_m. The box-and-whisker plots also show that the proposed method achieves higher minimum performance and smaller performance ranges than the other methods.

4.4 Comparison Results across Different Datasets

Dataset Activities Method Acc F_w F_m
O+P+M 4 MC-CNN 79.55 3.04 79.15 3.19 78.78 5.21
DeepConvLSTM 80.78 2.00 80.59 2.11 79.29 4.36
METIER 83.20 2.05 83.22 2.04 83.92 2.53
Adversarial CNN 83.01 1.45 83.02 1.48 83.50 3.05
Transformer 82.03 2.44 82.06 2.48 82.94 2.77
TASKED 83.74 2.17 83.76 2.18 85.36 2.16
O+P+M+R 13 MC-CNN 76.82 3.28 76.49 3.56 70.71 6.02
DeepConvLSTM 76.74 3.77 76.92 3.70 70.63 7.01
METIER 82.56 2.27 82.55 2.34 81.40 4.48
Adversarial CNN 79.24 3.91 78.72 4.39 77.30 7.16
Transformer 79.84 2.88 79.87 2.93 78.42 5.77
TASKED 82.80 2.09 82.79 2.16 82.82 4.97
Table 4: Comparison results with the state-of-the-art methods across multiple datasets among the Opportunity, PAMAP2, MHEALTH, and RealDISP datasets with the leave-one-subject-out cross-validation scheme. O, P, M, and R denote Opportunity, PAMAP2, MHEALTH, and RealDISP, respectively. The numbers are expressed in percent and reported as the mean ± standard deviation.
Figure 5: Box-and-whisker plots of the comparison results with the state-of-the-art methods in terms of accuracy, weighted F1 score, and macro F1 score across (a) the Opportunity, PAMAP2, and MHEALTH datasets with 4 common activities, and (b) the Opportunity, PAMAP2, MHEALTH, and RealDISP datasets with 13 common activities.

To further show the effectiveness of TASKED, we performed experiments across multiple datasets with the leave-one-subject-out cross-validation scheme. We conducted two experiments with two different combinations of the four datasets: Opportunity, PAMAP2, MHEALTH, and RealDISP. First, we extracted the sensor positions and types common to the four datasets. The resulting common sensor data comprise 18 channels from three sensor positions and three sensor types: 3 channels of acceleration data from the sensor placed on the subject’s back or chest, 9 channels of acceleration, gyroscope, and magnetometer data from the sensor placed on the subject’s right hand, and 6 channels of acceleration and gyroscope data from the sensor placed on the subject’s left ankle. Since the four datasets have different sampling rates, we resampled all data to 50 Hz. We used a sliding window of 100 samples with a sliding step of 16 samples, i.e., a 2-second window with a 0.32-second step. All samples were normalized to zero mean and unit variance. The first combination comprises Opportunity, PAMAP2, and MHEALTH; the 4 activities common to these three datasets, namely lying, sitting, standing, and walking, are used in this work. The second combination uses all four datasets. There are 13 activities that appear in at least two of the datasets: lying, sitting, standing, walking, running, cycling, jogging, climbing stairs, knee bending, jumping front and back, waist bends forward, frontal elevation of arms, and rope jumping. We followed a leave-one-subject-out evaluation scheme: in each fold, the data of one subject from each dataset were used as the test set, the data of another subject from each dataset as the validation set, and the data of the remaining subjects as the training set.
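As a rough illustration of this preprocessing pipeline, the following sketch resamples a recording to the common 50 Hz rate, cuts it into 100-sample windows with a step of 16 samples, and normalizes the result. The function names are ours, and we assume each recording has already been reduced to the 18 common channels and stored as a (time, channels) array; this is not the authors' preprocessing code.

```python
# Sketch of the cross-dataset preprocessing described above (assumption:
# each recording is a float array of shape (time, 18) with the common
# sensor channels already selected; names below are illustrative).
import numpy as np
from scipy.signal import resample

TARGET_RATE = 50      # common sampling rate in Hz (100 samples = 2-second window)
WINDOW = 100          # sliding window length in samples
STEP = 16             # sliding step in samples, as stated in the text

def resample_recording(x, src_rate, target_rate=TARGET_RATE):
    """Resample a (time, channels) recording to the target sampling rate."""
    n_out = int(round(x.shape[0] * target_rate / src_rate))
    return resample(x, n_out, axis=0)

def sliding_windows(x, window=WINDOW, step=STEP):
    """Cut a (time, channels) recording into overlapping windows."""
    starts = range(0, x.shape[0] - window + 1, step)
    return np.stack([x[s:s + window] for s in starts])

def normalize(windows, eps=1e-8):
    """Zero-mean, unit-variance normalization per channel over all windows."""
    mean = windows.mean(axis=(0, 1), keepdims=True)
    std = windows.std(axis=(0, 1), keepdims=True)
    return (windows - mean) / (std + eps)

# Example: a 60-second recording sampled at 100 Hz with 18 common channels.
raw = np.random.randn(6000, 18)
windows = normalize(sliding_windows(resample_recording(raw, src_rate=100)))
print(windows.shape)  # -> (num_windows, 100, 18)
```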

Table 4 reports the comparison with the state-of-the-art methods for the two dataset-combination experiments. The results show that TASKED achieves better performance than the other state-of-the-art methods on both combinations in terms of accuracy, weighted F1 score, and macro F1 score. In addition, TASKED provides smaller inter-subject variation than the other methods. Comparing the proposed method with the Transformer without the subject discriminator and adversarial learning shows that the proposed adversarial learning and self-knowledge distillation improved the performance in terms of all three measurements and reduced the standard deviations. The proposed method achieved improvements over the Transformer of 1.71 and 2.96 percentage points in terms of accuracy, and of 2.42 and 4.40 percentage points in terms of the macro F1 score, on the two combinations, respectively. Even though the Transformer was trained without adversarial learning and multi-task learning, it provided significantly higher performance than MC-CNN and DeepConvLSTM and came close to the previous method "Adversarial CNN" and the multi-task learning method METIER. This shows that the proposed transformer architecture successfully extracts spatio-temporal representations from the time-series sensor data.

Figure 5 shows the box-and-whisker plots of the comparison results with the state-of-the-art methods in terms of accuracy, weighted F1 score, and macro F1 score on the two dataset combinations. The proposed TASKED framework provides the best performance with the smallest variance in terms of all three metrics.

5 Conclusion

In this paper, we have presented a novel TASKED framework for sensor-based HAR. The main objective of our model is to learn cross-domain feature representations by jointly optimizing adversarial learning between the transformer-based feature extractor and the subject discriminator. The transformer network architecture was designed to learn spatial and temporal representations from the time-series sensor data, and the adversarial learning scheme with an MK-MMD-based regularization method generalizes the feature distributions across different subject domains. In addition, the teacher-free self-knowledge distillation method was adopted to improve the stability of the training procedure and to prevent a bias toward the feature generalization enforced by the MMD loss and adversarial learning. The experiments demonstrated that the proposed method can extract spatial and temporal representations for HAR from various types of sensor data and generalize these representations across different subject domains. In particular, the average performance of the proposed method in terms of the macro F1 score was significantly higher than that of the state-of-the-art methods on the Opportunity, PAMAP2, MHEALTH, and RealDISP datasets. In summary, the proposed method improved accuracy, weighted F1 score, and macro F1 score by 3.08, 3.89, and 3.39 percentage points over the best state-of-the-art method, respectively. Furthermore, the experimental results across different datasets showed that TASKED learns spatial and temporal representations from different datasets and achieves the best performance among the compared state-of-the-art methods.
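For readers unfamiliar with the MMD regularizer mentioned above, the following is a minimal multi-kernel MMD loss with Gaussian kernels in the spirit of Gretton et al. (2006) and Long et al. (2015). It is a generic sketch with illustrative bandwidth values, not the exact implementation used in TASKED.

```python
# Minimal sketch of a multi-kernel MMD loss with Gaussian (RBF) kernels,
# usable as a regularizer between feature batches from two subject domains.
# Generic formulation; not the exact implementation used in TASKED.
import torch

def mk_mmd(source, target, bandwidths=(1.0, 2.0, 4.0, 8.0)):
    """source, target: (batch, feature_dim) tensors of extracted features."""
    n_s = source.size(0)
    x = torch.cat([source, target], dim=0)
    # Pairwise squared Euclidean distances between all feature vectors.
    dists = torch.cdist(x, x, p=2).pow(2)
    # Sum of Gaussian kernels with several bandwidths (the "multi-kernel" part).
    kernel = sum(torch.exp(-dists / (2.0 * b ** 2)) for b in bandwidths)
    k_ss = kernel[:n_s, :n_s].mean()   # source-source similarity
    k_tt = kernel[n_s:, n_s:].mean()   # target-target similarity
    k_st = kernel[:n_s, n_s:].mean()   # cross-domain similarity
    return k_ss + k_tt - 2.0 * k_st    # squared MMD estimate

# Example: feature batches from two different subject domains.
loss = mk_mmd(torch.randn(32, 128), torch.randn(32, 128))
print(loss.item())
```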

In the future, we plan to extend the proposed transformer architecture to cross-modal activity recognition, for example combining wearable sensors and vision. The proposed transformer architecture can learn spatial and temporal representations from different modalities and transfer knowledge between them. In addition, the proposed adversarial learning scheme can be applied to sensor-position-invariant HAR: it could assess the importance of each sensor, generalize features across different sensors, and help optimize the number and placement of sensors on the body. The overall goal is to enable learning across datasets, domains, and modalities.

Acknowledgments

The research reported in this paper was supported by the BMBF (German Federal Ministry of Education and Research) in the project VidGenSense (01IW21003).

References

  • A. Akbari and R. Jafari (2019) Transferring activity recognition models for new wearable sensors with deep generative domain adaptation. In Proceedings of the 18th International Conference on Information Processing in Sensor Networks, pp. 85–96. Cited by: §2.2.
  • M. A. Alsheikh, A. Selim, D. Niyato, L. Doyle, S. Lin, and H. P. Tan (2016) Deep activity recognition models with triaxial accelerometers. In 30th AAAI Conference on Artificial Intelligence, AAAI 2016, pp. 8–13. Cited by: §2.1.
  • D. Anguita, A. Ghio, L. Oneto, X. Parra, and J. L. Reyes-Ortiz (2012) Human activity recognition on smartphones using a multiclass hardware-friendly support vector machine. In International workshop on ambient assisted living, pp. 216–223. Cited by: §2.1.
  • M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214–223. Cited by: §1.
  • M. Bachlin, M. Plotnik, D. Roggen, I. Maidan, J. M. Hausdorff, N. Giladi, and G. Troster (2009) Wearable assistant for parkinson’s disease patients with the freezing of gait symptom. IEEE Transactions on Information Technology in Biomedicine 14 (2), pp. 436–446. Cited by: §1.
  • L. Bai, L. Yao, X. Wang, S. S. Kanhere, B. Guo, and Z. Yu (2020) Adversarial multi-view networks for activity recognition. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 4 (2), pp. 1–22. Cited by: §1, §1, §2.2, §4.1.
  • O. Baños, M. Damas, H. Pomares, I. Rojas, M. A. Tóth, and O. Amft (2012) A benchmark dataset to evaluate sensor displacement in activity recognition. In Proceedings of the 2012 ACM Conference on Ubiquitous Computing, pp. 1026–1035. Cited by: item 5, §4.1.
  • O. Banos, R. Garcia, J. A. Holgado-Terriza, M. Damas, H. Pomares, I. Rojas, A. Saez, and C. Villalonga (2014) MHealthDroid: a novel framework for agile development of mobile health applications. In International workshop on ambient assisted living, pp. 91–98. Cited by: item 5, §4.1.
  • L. Bao and S. S. Intille (2004) Activity recognition from user-annotated acceleration data. In International conference on pervasive computing, pp. 1–17. Cited by: §1.
  • S. Bhattacharya and N. D. Lane (2016) Sparsification and separation of deep learning layers for constrained resource inference on wearables. In Proceedings of the 14th ACM Conference on Embedded Network Sensor Systems CD-ROM, pp. 176–189. Cited by: §2.1.
  • A. Bulling, U. Blanke, and B. Schiele (2014) A tutorial on human activity recognition using body-worn inertial sensors. ACM Computing Surveys (CSUR) 46 (3), pp. 1–33. Cited by: §2.1.
  • R. Chavarriaga, H. Sagha, A. Calatroni, S. T. Digumarti, G. Tröster, J. d. R. Millán, and D. Roggen (2013) The opportunity challenge: a benchmark database for on-body sensor-based activity recognition. Pattern Recognition Letters 34 (15), pp. 2033–2042. Cited by: item 5, §1, §4.1.
  • L. Chen, Y. Zhang, and L. Peng (2020) Metier: a deep multi-task learning based activity and user recognition model using wearable sensors. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 4 (1), pp. 1–18. Cited by: §1, §2.2, §4.1, §4.3.
  • Y. Chiang, C. Lu, and J. Y. Hsu (2017) A feature-based knowledge transfer framework for cross-environment activity recognition toward smart home applications. IEEE Transactions on Human-Machine Systems 47 (3), pp. 310–322. Cited by: §2.2.
  • D. Cook, K. D. Feuz, and N. C. Krishnan (2013) Transfer learning for activity recognition: a survey. Knowledge and information systems 36 (3), pp. 537–556. Cited by: §2.2.
  • J. E. Cutting and L. T. Kozlowski (1977) Recognizing friends by their walk: gait perception without familiarity cues. Bulletin of the psychonomic society 9 (5), pp. 353–356. Cited by: §1.
  • L. M. Dang, K. Min, H. Wang, M. J. Piran, C. H. Lee, and H. Moon (2020) Sensor-based and vision-based human activity recognition: a comprehensive survey. Pattern Recognition 108, pp. 107561. Cited by: §2.1.
  • W. Deng, Q. Zheng, and Z. Wang (2014) Cross-person activity recognition using reduced kernel extreme learning machine. Neural Networks 53, pp. 1–7. Cited by: §2.2.
  • M. O. Derawi, C. Nickel, P. Bours, and C. Busch (2010) Unobtrusive user-authentication on mobile phones using biometric gait recognition. In 2010 Sixth International Conference on Intelligent Information Hiding and Multimedia Signal Processing, pp. 306–311. Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §2.2.
  • C. Direkoglu and N. E. O’Connor (2012) Team activity recognition in sports. In

    European Conference on Computer Vision

    ,
    pp. 69–83. Cited by: §1.
  • A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Cited by: §1, §2.2.
  • A. Z. M. Faridee, M. A. A. H. Khan, N. Pathak, and N. Roy (2019) AugToAct: scaling complex human activity recognition with few labels. In Proceedings of the 16th EAI International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services, pp. 162–171. Cited by: §2.2.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §2.2.
  • A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola (2006) A kernel method for the two-sample-problem. Advances in neural information processing systems 19, pp. 513–520. Cited by: §1, §3.2, §3.3.
  • Y. Guan and T. Plötz (2017) Ensembles of deep lstm learners for activity recognition using wearables. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1 (2), pp. 1–28. Cited by: §2.1.
  • N. Y. Hammerla, S. Halloran, and T. Plötz (2016) Deep, convolutional, and recurrent models for human activity recognition using wearables. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp. 1533–1540. Cited by: §2.1.
  • N. Y. Hammerla, R. Kirkham, P. Andras, and T. Ploetz (2013) On preserving statistical characteristics of accelerometry data using their empirical cumulative distribution. In Proceedings of the 2013 international symposium on wearable computers, pp. 65–68. Cited by: §2.1.
  • V. S. Handiru and V. A. Prasad (2016) Optimized bi-objective eeg channel selection and cross-subject generalization with brain–computer interfaces. IEEE Transactions on Human-Machine Systems 46 (6), pp. 777–786. Cited by: §2.2.
  • G. Hinton, O. Vinyals, J. Dean, et al. (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 2 (7). Cited by: §3.3.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §2.1.
  • J. Hong, J. Ramos, and A. K. Dey (2015) Toward personalized activity recognition systems with a semipopulation approach. IEEE Transactions on Human-Machine Systems 46 (1), pp. 101–112. Cited by: §1.
  • Y. Iwasawa, K. Nakayama, I. Yairi, and Y. Matsuo (2017) Privacy issues regarding the application of dnns to activity-recognition using wearables and its countermeasures by use of adversarial training.. In IJCAI, pp. 1930–1936. Cited by: §1.
  • M. Janidarmian, A. Roshan Fekr, K. Radecka, and Z. Zilic (2017) A comprehensive analysis on wearable acceleration sensors in human activity recognition. Sensors 17 (3), pp. 529. Cited by: §2.1.
  • J. V. Jeyakumar, L. Lai, N. Suda, and M. Srivastava (2019) SenseHAR: a robust virtual activity sensor for smartphones and wearables. In Proceedings of the 17th Conference on Embedded Networked Sensor Systems, pp. 15–28. Cited by: §2.2.
  • B. Khaertdinov, E. Ghaleb, and S. Asteriadis (2021) Contrastive self-supervised learning for sensor-based human activity recognition. In 2021 IEEE International Joint Conference on Biometrics (IJCB), pp. 1–8. Cited by: §1.
  • M. A. A. H. Khan, N. Roy, and A. Misra (2018) Scaling human activity recognition via deep learning-based domain adaptation. In 2018 IEEE international conference on pervasive computing and communications (PerCom), pp. 1–9. Cited by: §2.2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2.
  • H. Kwon, G. D. Abowd, and T. Plötz (2018) Adding structural characteristics to distribution-based accelerometer representations for activity recognition using wearables. In Proceedings of the 2018 ACM international symposium on wearable computers, pp. 72–75. Cited by: §1, §2.1.
  • N. D. Lane, P. Georgiev, and L. Qendro (2015) Deepear: robust smartphone audio sensing in unconstrained acoustic environments using deep learning. In Proceedings of the 2015 ACM international joint conference on pervasive and ubiquitous computing, pp. 283–294. Cited by: §2.1.
  • Y. Li, K. Swersky, and R. Zemel (2015) Generative moment matching networks. In International Conference on Machine Learning, pp. 1718–1727. Cited by: §1, §3.2.
  • M. Long, Y. Cao, J. Wang, and M. Jordan (2015) Learning transferable features with deep adaptation networks. In International conference on machine learning, pp. 97–105. Cited by: §1, §3.2, §3.3.
  • S. Mahmud, M. Tonmoy, K. K. Bhaumik, A. Rahman, M. A. Amin, M. Shoyaib, M. A. H. Khan, and A. A. Ali (2020) Human activity recognition from wearable sensor data using self-attention. arXiv preprint arXiv:2003.09018. Cited by: §1, §2.1, §2.2, §4.3.
  • F. J. O. Morales and D. Roggen (2016) Deep convolutional feature transfer across mobile activity recognition domains, sensor modalities and locations. In Proceedings of the 2016 ACM International Symposium on Wearable Computers, pp. 92–99. Cited by: §2.2.
  • K. Nakano and B. Chakraborty (2017) Effect of dynamic feature for human activity recognition using smartphone sensors. In 2017 IEEE 8th International Conference on Awareness Science and Technology (iCAST), pp. 539–543. Cited by: §2.1.
  • L. T. Nguyen, M. Zeng, P. Tague, and J. Zhang (2015) Recognizing new activities with limited training data. In Proceedings of the 2015 ACM International Symposium on Wearable Computers, pp. 67–74. Cited by: item 3.
  • F. J. Ordóñez and D. Roggen (2016) Deep convolutional and lstm recurrent neural networks for multimodal wearable activity recognition. Sensors 16 (1), pp. 115. Cited by: §1, §2.1, item 1, §4.1, §4.1, §4.3.
  • C. Plizzari, M. Cannici, and M. Matteucci (2021) Skeleton-based action recognition via spatial and temporal transformer networks. Computer Vision and Image Understanding 208, pp. 103219. Cited by: §1, §2.2, §3.2.
  • T. Plötz, N. Y. Hammerla, and P. L. Olivier (2011) Feature learning for activity recognition in ubiquitous computing. In Twenty-second international joint conference on artificial intelligence, pp. 1729–1734. Cited by: §2.1, §3.2.
  • T. Plötz, N. Y. Hammerla, A. Rozga, A. Reavis, N. Call, and G. D. Abowd (2012) Automatic assessment of problem behavior in individuals with developmental disabilities. In Proceedings of the 2012 ACM conference on ubiquitous computing, pp. 391–400. Cited by: §1.
  • A. Reiss and D. Stricker (2012) Introducing a new benchmarked dataset for activity monitoring. In 2012 16th international symposium on wearable computers, pp. 108–109. Cited by: item 5, §4.1.
  • T. R. D. Saputri, A. M. Khan, and S. Lee (2014) User-independent activity recognition via three-stage ga-based feature selection. International Journal of Distributed Sensor Networks 10 (3), pp. 706287. Cited by: §1.
  • T. Sheng and M. Huber (2020) Weakly supervised multi-task representation learning for human activity analysis using wearables. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 4 (2), pp. 1–18. Cited by: §1, §2.2, item 3.
  • M. S. Singh, V. Pondenkandath, B. Zhou, P. Lukowicz, and M. Liwickit (2017) Transforming sensor data to the image domain for deep learning—an application to footstep detection. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2665–2672. Cited by: §1.
  • S. Suh, V. F. Rey, and P. Lukowicz (2022) Adversarial deep feature extraction network for user independent human activity recognition. In 2022 IEEE International Conference on Pervasive Computing and Communications (PerCom), pp. 217–226. Cited by: §1, §4.3.
  • M. Sundholm, J. Cheng, B. Zhou, A. Sethi, and P. Lukowicz (2014) Smart-mat: recognizing and counting gym exercises with low-cost resistive pressure sensing matrix. In Proceedings of the 2014 ACM international joint conference on pervasive and ubiquitous computing, pp. 373–382. Cited by: §1.
  • A. A. Varamin, E. Abbasnejad, Q. Shi, D. C. Ranasinghe, and H. Rezatofighi (2018) Deep auto-set: a deep auto-encoder-set network for activity recognition using wearables. In Proceedings of the 15th EAI International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services, pp. 246–253. Cited by: §1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in neural information processing systems 30. Cited by: §2.1, §2.2.
  • J. Wang, Y. Chen, L. Hu, X. Peng, and S. Y. Philip (2018a) Stratified transfer learning for cross-domain activity recognition. In 2018 IEEE international conference on pervasive computing and communications (PerCom), pp. 1–10. Cited by: §2.2.
  • J. Wang, V. W. Zheng, Y. Chen, and M. Huang (2018b) Deep transfer learning for cross-domain activity recognition. In proceedings of the 3rd International Conference on Crowd Science and Engineering, pp. 1–8. Cited by: §2.2.
  • Z. Wang and J. Liu (2021) Translating math formula images to latex sequences using deep neural networks with sequence-level training. International Journal on Document Analysis and Recognition (IJDAR) 24 (1), pp. 63–75. Cited by: Table 1.
  • Z. Wang, Y. Ma, Z. Liu, and J. Tang (2019) R-transformer: recurrent neural network enhanced transformer. arXiv preprint arXiv:1907.05572. Cited by: §3.2.
  • J. Wen, J. Indulska, and M. Zhong (2016) Adaptive activity learning with dynamically available context. In 2016 IEEE International Conference on Pervasive Computing and Communications (PerCom), pp. 1–11. Cited by: §1.
  • J. Yang, M. N. Nguyen, P. P. San, X. L. Li, and S. Krishnaswamy (2015) Deep convolutional neural networks on multichannel time series for human activity recognition. In Twenty-fourth international joint conference on artificial intelligence, pp. 3995–4001. Cited by: §1, §2.1, §4.3.
  • L. Yuan, F. E. Tay, G. Li, T. Wang, and J. Feng (2020) Revisiting knowledge distillation via label smoothing regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3903–3911. Cited by: §1, §3.2, §3.3.
  • M. Zeng, H. Gao, T. Yu, O. J. Mengshoel, H. Langseth, I. Lane, and X. Liu (2018) Understanding and improving recurrent networks for human activity recognition by continuous attention. In Proceedings of the 2018 ACM international symposium on wearable computers, pp. 56–63. Cited by: §2.1.
  • M. Zeng, T. Yu, X. Wang, L. T. Nguyen, O. J. Mengshoel, and I. Lane (2017) Semi-supervised convolutional neural networks for human activity recognition. In 2017 IEEE International Conference on Big Data (Big Data), pp. 522–529. Cited by: §1.
  • J. Zhao, F. Deng, H. He, and J. Chen (2020a) Local domain adaptation for cross-domain activity recognition. IEEE Transactions on Human-Machine Systems 51 (1), pp. 12–21. Cited by: §2.2.
  • J. Zhao, L. Li, F. Deng, H. He, and J. Chen (2020b) Discriminant geometrical and statistical alignment with density peaks for domain adaptation. IEEE Transactions on Cybernetics. Cited by: §2.2.
  • Z. Zhao, Y. Chen, J. Liu, Z. Shen, and M. Liu (2011) Cross-people mobile-phone based activity recognition. In Twenty-second international joint conference on artificial intelligence, pp. 2545–2550. Cited by: §2.2.