Multi-task Self-Supervised Learning for Human Activity Detection

07/27/2019 · by Aaqib Saeed, et al. · TU Eindhoven

Deep learning methods are successfully used in applications pertaining to ubiquitous computing, health, and well-being. Specifically, the area of human activity recognition (HAR) has been transformed primarily by convolutional and recurrent neural networks, thanks to their ability to learn semantic representations from raw input. However, to extract generalizable features, massive amounts of well-curated data are required, which is notoriously challenging, hindered by privacy issues and annotation costs. Therefore, unsupervised representation learning is of prime importance to leverage the vast amount of unlabeled data produced by smart devices. In this work, we propose a novel self-supervised technique for feature learning from sensory data that does not require access to any form of semantic labels. We learn a multi-task temporal convolutional network to recognize transformations applied to an input signal. By exploiting these transformations, we demonstrate that simple auxiliary tasks of binary classification result in a strong supervisory signal for extracting useful features for the downstream task. We extensively evaluate the proposed approach on several publicly available datasets for smartphone-based HAR in unsupervised, semi-supervised, and transfer learning settings. Our method achieves performance levels superior to or comparable with fully-supervised networks, and it performs significantly better than autoencoders. Notably, for the semi-supervised case, the self-supervised features substantially boost the detection rate, attaining a kappa score between 0.7 and 0.8 with only 10 labeled examples per class. We obtain similarly impressive performance even if the features are transferred from a different data source. While this paper focuses on HAR as the application domain, the proposed technique is general and could be applied to a wide variety of problems in other areas.







1. Introduction

Over the last years, deep neural networks have been widely adopted for time-series and sensory data processing, achieving impressive performance in several application areas pertaining to pervasive sensing, ubiquitous computing, industry, health and well-being (Radu et al., 2018; Georgiev et al., 2017; Saeed and Trajanovski, 2017; Hannun et al., 2019; Liu et al., 2016; Yao et al., 2018). In particular, for smartphone-based human activity recognition (HAR), 1D convolutional and recurrent neural networks trained on raw labeled signals significantly improve the detection rate over traditional methods (Wang et al., 2018a; Hammerla et al., 2016; Morales and Roggen, 2016; Yang et al., 2015; Yao et al., 2018). Despite the recent advances in the field of HAR, learning representations from a massive amount of unlabeled data still presents a significant challenge. Obtaining large, well-curated activity recognition datasets is problematic due to a number of issues. First, smartphone data are privacy sensitive, which makes it hard to collect sufficient amounts of user-activity instances in a real-life setting. Second, the annotation cost and the time it takes to generate a large volume of labeled instances are prohibitive. Finally, the diversity of devices, types of embedded sensors, variations in phone usage, and different environments are further roadblocks to producing massive human-labeled data. To sum up, such an expensive and hard-to-scale process of gathering labeled data generated by smart devices makes it very difficult to apply supervised learning in this domain directly.

In light of these challenges, we pose the question of whether it is possible to learn semantic representations in an unsupervised way to circumvent the manual annotation of the sensor data with strong labels, e.g., activity classes. In particular, the goal is to extract features that are on par with those learned with fully-supervised methods (the fully-supervised network is the standard deep model that is trained in an end-to-end fashion directly with activity labels without any pre-training). There is an emerging paradigm for feature learning called self-supervised learning that defines auxiliary (also known as pretext or surrogate) tasks to solve, where labels are readily extractable from the data without any human intervention, i.e., self-supervised. The availability of strong supervisory signals from the surrogate tasks enables us to leverage objective functions as utilized in a standard supervised learning setting (Doersch and Zisserman, 2017). For instance, the vision community has proposed a considerable number of self-supervised tasks for advancing representation learning (also known as feature learning) from static images, videos, and audio (see Section 5). Most prominent among them are: colorization of grayscale images (Larsson et al., 2017; Zhang et al., 2017), predicting image rotations (Gidaris et al., 2018), solving jigsaw puzzles (Noroozi and Favaro, 2016), predicting the direction of video playback (Wei et al., 2018), temporal order verification (Misra et al., 2016), odd sequence detection (Fernando et al., 2017), audio-visual correspondence (Owens et al., 2016; Arandjelović and Zisserman, 2017), and curiosity-driven agents (Pathak et al., 2017). The presented methodology for sensor representation learning takes inspiration from these methods and leverages signal transformations to extract highly generalizable features for the downstream (or end-) task, i.e., HAR.

Figure 1. Illustration of the proposed multi-task self-supervised approach for feature learning. We train a temporal convolutional network for transformation recognition as a pretext task as shown in Step 1. The learned features are utilized by (or transferred to) the activity recognition model (Step 2) for improved detection rate with a small labeled dataset.

Our work is motivated by the success of jointly learning to solve multiple self-supervised tasks (Doersch and Zisserman, 2017; Caruana, 1997), and we propose to learn accelerometer representations (i.e., features) by training a temporal convolutional neural network (CNN) to recognize the transformations applied to the raw input signal. Particularly, we utilize a set of signal transformations (Um et al., 2017; Batista et al., 2011) that are applied to each input signal in the datasets; the transformed signals are then fed into the convolutional network along with the original data, and the network learns to differentiate among them. In this simple formulation, a group of binary classification tasks (i.e., recognizing whether a transformation such as permutation, scaling, or channel shuffling was applied to the original signal or not) act as surrogate tasks that provide a rich supervisory signal to the model. In order to extract highly generalizable features for the end-task of interest, it is essential to utilize transformations that exploit versatile invariances of the temporal data (further details are provided in Section 3). To this end, we utilize eight transformations to train a multi-task network for simultaneously recognizing each of them. A visual illustration of the proposed approach is given in Figure 1. In the pre-training phase, the network, consisting of a common trunk with a separate head for each task, is trained on self-supervised data; in the second step, the features learned by the shared layers are utilized by the HAR model. Importantly, we want to emphasize that in order for the convolutional network to recognize the transformations, it must learn to understand the core signal characteristics by acquiring knowledge of the underlying differences in the accelerometer signals for various activity categories.
We support this claim through an extensive evaluation of our method on six publicly available datasets in unsupervised, semi-supervised and transfer learning settings, where it achieves noticeable improvements in all the cases while not requiring manually labeled data for feature learning.

The main contributions of this paper are:

  • We propose to utilize self-supervision from large unlabeled data for human activity recognition.

  • We design a signal transformation recognition problem as a surrogate task for annotation-free supervision, which provides a strong training signal to the temporal convolutional network for learning generalizable features.

  • We demonstrate through extensive evaluation that the self-supervised features perform significantly better in the semi-supervised and transfer learning settings on several publicly available datasets. Moreover, we show that these features achieve performance that is superior to or comparable with the features learned via the fully-supervised approach (i.e., trained directly with activity labels).

  • We illustrate with SVCCA (Raghu et al., 2017), saliency mapping (Simonyan et al., 2013), and t-SNE (Maaten and Hinton, 2008) visualizations that the features extracted via self-supervision are very similar to those learned by the fully-supervised network.

  • Our method substantially reduces the labeled data requirement, effectively narrowing the gap between unsupervised and supervised representation learning.

The paper is organized as follows. Section 2 provides an overview of related paradigms and methodologies as background information. Section 3 introduces the proposed self-supervised representation learning framework for HAR. Section 4 presents an evaluation of our framework on publicly available datasets. Section 5 gives an overview of the related work. Finally, Section 6 concludes the paper and lists future directions for research.

2. Preliminaries

In this section, we provide a brief overview of multiple learning paradigms, including multi-task, transfer, semi-supervised and, importantly, representation learning. These paradigms either complement or serve as fundamental building blocks of our self-supervised framework for representation extraction and robust HAR under various settings.

2.1. Representation Learning

Representation (feature) learning is concerned with automatically extracting useful information from the data that can be effectively used for an impending machine learning problem such as classification. In the past, most of the effort was spent on developing (and manually engineering) feature extraction methods based on domain expertise to incorporate prior knowledge into the learning process. However, these methods are relatively limited, as they rely on human creativity to come up with novel features and lack the power to capture underlying explanatory factors in the milieu of low-level sensory input. To overcome these limitations and to automate the discovery of disentangled features, neural-network-based approaches have been widely utilized, such as autoencoders and their variants (Baldi, 2012). Deep neural networks are composed of multiple (parameterized) non-linear transformations that are trained through a supervised or unsupervised objective function with the aim of yielding useful representations. These techniques have achieved indisputable empirical success across a broad spectrum of problems (Krizhevsky et al., 2012; Taigman et al., 2014; Sutskever et al., 2014; Mohamed et al., 2012; Hannun et al., 2019; Aytar et al., 2016; Li et al., 2016; Radu et al., 2018), thanks to increasing dataset sizes and computing power. Nevertheless, representation learning still stands as a fundamental problem in machine intelligence and is an active area of research (see (Bengio et al., 2013) for a detailed survey).

2.2. Multi-task Learning

The goal of multi-task learning (MTL) is to enhance learning efficiency and accuracy through simultaneously optimizing multiple objectives based on shared representations and exploiting relations among the tasks (Caruana, 1997). It is widely utilized in several application domains within machine learning, such as natural language processing (Hashimoto et al., 2017), computer vision (Kendall et al., 2018), audio sensing (Georgiev et al., 2017), and well-being (Saeed and Trajanovski, 2017). In this learning setting, T supervised tasks, each with a dataset and a separate cost function, are made available. The multi-objective loss is then generally created through a weighted linear sum of the individual tasks' losses:

    L_total = Σ_{t=1}^{T} α_t · L_t

where α_t is the weight of task t and L_t is a task-specific loss function. It is important to note that MTL itself does not impose any restriction on the loss type of an individual task. Therefore, unsupervised and supervised tasks, or tasks having different cost functions, can be conveniently combined for learning representations.
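As a concrete illustration, the weighted linear combination of per-task losses can be computed as follows (a minimal NumPy sketch; the helper name and the example loss values are hypothetical, not from the paper):

```python
import numpy as np

def multitask_loss(task_losses, task_weights):
    """Weighted linear sum of per-task losses: L_total = sum_t alpha_t * L_t."""
    return float(np.dot(task_weights, task_losses))

# Example: three tasks with equal unit weights.
losses = np.array([0.7, 0.4, 0.9])
weights = np.array([1.0, 1.0, 1.0])
total = multitask_loss(losses, weights)  # 0.7 + 0.4 + 0.9 = 2.0
```

Down-weighting a task simply scales its contribution, which is how task importance is typically traded off in MTL.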

2.3. Transfer Learning

Transfer learning aims to develop methods for preserving and leveraging previously acquired knowledge to accelerate the learning of novel tasks. In recent years, it has shown remarkable improvement in performance on several very challenging problems, especially in areas where little labeled data are available, such as natural language understanding, object recognition, and activity recognition (Sharif Razavian et al., 2014; Morales and Roggen, 2016; Howard and Ruder, 2018). In this paradigm, the goal is to transfer (or reuse) the learned knowledge from a source domain D_S to a target domain D_T. More precisely, consider domains D_S and D_T with learning tasks T_S and T_T, respectively. The goal is to help improve the learning of a predictive function in D_T using the knowledge extracted from D_S and T_S, where D_S ≠ D_T and/or T_S ≠ T_T, meaning that the domains or tasks may differ. This learning formulation enables developing a high-quality model under different knowledge-transfer settings (such as features, instances, or weights) from existing labeled data of some related task or domain. For a detailed review of transfer learning, we refer the interested reader to (Pan et al., 2010).

2.4. Semi-supervised Learning

Semi-supervised learning provides a compelling framework for leveraging unlabeled data in cases where labeled data collection is expensive. It has been repeatedly shown that, given enough computational power and supervised data, deep neural networks can achieve human-level performance on a wide variety of problems (LeCun et al., 2015). However, the curation of large-scale datasets is very costly and time-consuming, as it requires either crowdsourcing or domain expertise, such as in the case of medical imaging. Likewise, for several practical problems, it is simply not possible to create a large enough labeled dataset (e.g., due to privacy issues) to learn a model of reasonable accuracy. Semi-supervised learning algorithms offer a compelling alternative to fully-supervised methods for jointly learning from few labeled and a large number of unlabeled instances. More specifically, given a labeled training set D_l of input-output pairs and an unlabeled instance set D_u, the broad aim is to produce a predictive function f_θ(x) that makes use not only of D_l but also of the underlying structure in D_u, where θ represents the learnable parameters of the model. For a concise review and realistic evaluation of various deep learning based semi-supervised techniques, see (Oliver et al., 2018).

2.5. Towards Self-supervision

Deep learning has been increasingly used for end-to-end HAR, with performance far superior to what can be achieved with traditional machine learning methods (Yang et al., 2015; Hammerla et al., 2016; Radu et al., 2018; Morales and Roggen, 2016; Saeed et al., 2018). However, learning from very few labeled data, i.e., few-shot and semi-supervised learning, remains an issue, as large labeled datasets are required to train a model of sufficient quality. Similarly, the utilization of previously learned knowledge from related data (or tasks) to rapidly solve a comparable problem is not addressed very well by existing methods (see Section 5 for more details). In this paper, we explore self-supervised feature learning for HAR that effectively utilizes unlabeled data. The exciting field of self-supervision is concerned with extracting supervisory signals from data without requiring any human intervention. The evolution of feature extraction methods from hand-crafted features towards self-supervised representations is illustrated in Figure 2. The input to each of the illustrated approaches is raw data, which is not shown for the sake of brevity.

Figure 2. Evolution of feature learning approaches from hand-crafted methods towards task discovery for self-supervision.

3. Approach

In this section, we present our self-supervised representation learning framework for HAR. First, we provide an overview of the methodology. Next, we discuss the various learning tasks (i.e., transformation classification) and their benefits for generic feature extraction from unlabeled data. Finally, we provide a detailed description of the network architecture, its implementation, and the optimization process.

3.1. Overview

The objective of our work is to learn general-purpose sensor representations with a temporal convolutional network in an unsupervised manner. To achieve this goal, we introduce a self-supervised deep network named Transformation Prediction Network (TPN), which simultaneously learns to solve multiple (signal) transformation recognition tasks, as shown in Figure 1. Specifically, the proposed multi-task TPN is trained to produce estimates of the transformations applied to the raw input signal. We define a set of K distinct transformations (or tasks) 𝒯 = {T_1, …, T_K}, where T_j is a function that applies a particular signal alteration technique to a temporal sequence x to yield a transformed version of the signal T_j(x). The network F, which has a common trunk and an individual head for each task, takes an input sequence and produces, for each task, the probability of the signal being a transformed version of the original. Note that, given a set of unlabeled signals D_u = {x_i} (e.g., of an accelerometer), we can automatically construct a self-supervised labeled dataset D′ of original and transformed instances with binary labels. Hence, given this set of training instances, the multi-task self-supervised training objective that a model must learn to solve is:

    min_θ  Σ_{j=1}^{K} α_j · [ −(1/N_j) Σ_{i=1}^{N_j} ( y_i^(j) log ŷ_i^(j) + (1 − y_i^(j)) log(1 − ŷ_i^(j)) ) ]

where ŷ_i^(j) is the predicted probability of instance i being a transformed version under task j, θ are the learnable parameters of the network, N_j represents the number of instances for task j (which can vary, but is equal across tasks in our case), and α_j is the loss weight of task j.

We emphasize that, although the network has a separate layer to differentiate between the original signal and each of the transformations, it can be extended in a straightforward manner to recognize multiple transformations applied to the same input signal, or for multi-label classification. In the following subsection, we explain the types of signal transformations used in this work.
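To make the construction of the self-supervised dataset concrete, the following NumPy sketch builds binary-labelled pairs for a single transformation-recognition task (label 0 for the original signal, 1 for its transformed version); the helper name is ours and negation serves as the example transformation:

```python
import numpy as np

def make_task_dataset(signals, transform):
    """Build (instance, label) pairs for one transformation-recognition task.
    Label 0 = original signal, 1 = transformed signal."""
    X, y = [], []
    for x in signals:
        X.append(x)
        y.append(0)
        X.append(transform(x))
        y.append(1)
    return np.stack(X), np.array(y)

rng = np.random.default_rng(0)
signals = [rng.standard_normal((128, 3)) for _ in range(4)]  # 4 windows, 3 axes
X, y = make_task_dataset(signals, lambda x: -x)              # negation task
```

Running this for each of the eight transformations yields one balanced binary dataset per task, all derived from the same unlabeled pool without any human annotation.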

3.2. Self-supervised Tasks: Signal Transformations

The aforementioned formulation requires signal transformations that define a multi-task classification problem enabling the convolutional model to learn disentangled semantic representations useful for downstream tasks, e.g., activity detection. We aimed for conceptually simple, yet diverse tasks that cover several invariances which commonly arise in temporal data (Batista et al., 2011). Intuitively, a diverse set of tasks should lead to a broad spectrum of features, which are more likely to span the feature-space domain needed for a general understanding of the signal's characteristics. In this work, we propose to utilize eight straightforward signal transformations (Batista et al., 2011; Um et al., 2017) for the self-supervision of the network. More specifically, when these transformations are applied to an input signal, they result in eight variants of it. As mentioned earlier, the temporal convolutional model is then trained jointly on all the tasks' data to solve the transformation recognition problem, which allows the model to extract high-level abstractions from the raw input sequence. The transformations utilized in this work are summarized below:

  • Noised: Given sensor readings of a fixed length, a possible transformation is the addition of random noise (or jitter) in the original signal. Heterogeneity of device sensors, software, and other hardware can cause variations (noisy samples) in the produced data. A model that is robust against noise will generalize better as it learns features that are invariant to minor corruption in the signal.

  • Scaled: A transformation that changes the magnitude of the samples within a window by multiplying them with a randomly selected scalar. A model capable of handling scaled signals produces better representations as it becomes invariant to changes in amplitude and offset.

  • Rotated: Robustness against arbitrary rotations applied on the input signal can achieve sensor-placement (orientation) invariance. This transformation inverts the sample signs (without changing the associated class-label) as frequently happens if the sensor (or device) is, for example, held upside down.

  • Negated: This simple transformation is an instance of both the scaled (scaling by −1) and rotated transformations. It negates the samples within a time window, resulting in a vertical flip or a mirror image of the input signal.

  • Horizontally Flipped: This transformation reverses the samples along the time-dimension, resulting in a complete mirror image of an original signal as if it were evolved in the opposite time direction.

  • Permuted: This transformation randomly perturbs the events within a temporal window through slicing and swapping different segments of the time-series to generate a new one, hence, facilitating the model to develop permutation invariance properties.

  • Time-Warped: This transformation locally stretches or warps a time-series through a smooth distortion of time intervals between the values (also known as local scaling).

  • Channel-Shuffled: For a multi-component signal such as a triaxial accelerometer, this transformation randomly shuffles the axial dimensions.
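As an illustration, a subset of these transformations can be sketched in a few lines of NumPy. Parameter values such as the noise level, scaling range, and number of segments are illustrative choices, not the paper's exact settings, and rotation and time-warping are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(42)

def noised(x, sigma=0.05):
    """Add random jitter to the signal."""
    return x + rng.normal(0.0, sigma, x.shape)

def scaled(x, low=0.7, high=1.3):
    """Multiply the window by a random scalar."""
    return x * rng.uniform(low, high)

def negated(x):
    """Vertical flip: scaling by -1."""
    return -x

def flipped(x):
    """Horizontal flip: reverse the samples along the time axis."""
    return x[::-1].copy()

def permuted(x, n_segments=4):
    """Slice the window into segments and swap them randomly."""
    segs = np.array_split(x, n_segments, axis=0)
    order = rng.permutation(len(segs))
    return np.concatenate([segs[i] for i in order], axis=0)

def channel_shuffled(x):
    """Randomly shuffle the axial dimensions (e.g., x/y/z of an accelerometer)."""
    return x[:, rng.permutation(x.shape[1])]

window = rng.standard_normal((128, 3))  # 128 samples x 3 accelerometer axes
```

Each function maps a window to a same-shaped variant, so originals and transformed versions can be fed to the network interchangeably.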

There are several benefits of utilizing transformations recognition as auxiliary tasks for feature extraction from unlabeled data.

Enabling the learning of generic representations: The primary motivation is that the above-defined pretext tasks enable the network to capture the core signal characteristics. More specifically, for the TPN to successfully recognize if the signal is transformed or not, it must learn to detect high-level semantics, sensor behavior under different device placements, time-shift of the events, varying amplitudes, and robustness against sensor noise, thus, contributing to solving the ultimate task of HAR.

Task diversification and elimination of low-level input artifacts: A clear advantage of using multiple self-supervised tasks, as opposed to a single one, is that it leads to a more diverse set of features that are invariant to low-level artifacts of the signals. Had we chosen to utilize signal reconstruction, e.g., with autoencoders, the model would learn to compress the input, but due to a weak supervisory signal (as compared to self-supervision), it may discover trivial features with no practical value for activity recognition or any other task of interest. We compare our approach against other methods in Section 4.3.

Transferring knowledge: Furthermore, with our approach, the unlabeled sensor data that are produced in huge quantity can be effectively utilized with no human intervention to pre-train a network that is suitable for semi-supervised and transfer learning settings. It is particularly of high value for training networks in a real-world setting, where very little or no supervision is available to learn a model of sufficient quality from scratch.

Other benefits: Our self-supervised method has numerous other benefits. It has a computational cost equivalent to that of supervised learning but with better convergence accuracy, making it a suitable candidate for continuous unsupervised representation learning in-the-wild. Moreover, our technique neither requires sophisticated pre-processing (apart from z-normalization) nor a specialized architecture (which would also require labeled data) to exploit invariances. We show in Section 4.3, through extensive evaluation, that the self-supervised models learn useful representations and dramatically improve performance over other learning strategies. Despite the simplicity of the proposed scheme, it allows utilizing data collected through a wide variety of devices from a diverse set of users.

3.3. Network Architecture and Implementation

We implement the TPN as a multi-branch temporal convolutional neural network with a common trunk (shared layers) and a distinct head (private layers) for each task, each head with a separate loss function. Hard parameter sharing is employed between all the task-specific layers to encourage strong weight utilization from the trunk. Figure 3 illustrates the TPN, which contains three 1D convolutional layers; the numbers of feature maps, kernel sizes, and strides are specified in the figure. Dropout is used after each of these layers, and L2 regularization is applied. Global max pooling is used after the last convolution layer to aggregate high-level discriminative features. Moreover, each task-specific head comprises a fully-connected layer followed by a sigmoidal output layer for binary classification. We use ReLU as the non-linearity in all the layers (except the output layers) and train the network with the Adam optimizer (Kingma and Ba, 2014) for a maximum of 30 epochs, unless stated otherwise. Furthermore, the activity recognition model has an architecture similar to the TPN, except for a fully-connected layer followed by a softmax output layer whose number of units depends on the activity detection task under consideration. Additionally, during the training of this model, we apply early stopping if the network fully converges on the training set, to avoid overfitting.
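To illustrate the trunk's core building blocks, the sketch below implements a minimal "valid" 1D convolution with ReLU followed by global max pooling in plain NumPy. The kernel size (24) and number of feature maps (32) are illustrative placeholders, not the paper's exact specification:

```python
import numpy as np

def conv1d(x, kernels, stride=1):
    """Minimal 'valid' 1D convolution with ReLU.
    x: (T, C_in) time-series window; kernels: (K, C_in, C_out)."""
    K, C_in, C_out = kernels.shape
    T = x.shape[0]
    out_len = (T - K) // stride + 1
    out = np.zeros((out_len, C_out))
    for t in range(out_len):
        patch = x[t * stride : t * stride + K]                 # (K, C_in)
        out[t] = np.tensordot(patch, kernels, axes=([0, 1], [0, 1]))
    return np.maximum(out, 0.0)                                # ReLU

def global_max_pool(x):
    """Aggregate over time, keeping the strongest response per feature map."""
    return x.max(axis=0)                                       # (C_out,)

rng = np.random.default_rng(1)
x = rng.standard_normal((400, 3))                              # 400 samples, 3 axes
h = conv1d(x, rng.standard_normal((24, 3, 32)))                # (377, 32)
z = global_max_pool(h)                                         # (32,) feature vector
```

The pooled vector z is the kind of fixed-length representation that the shared trunk hands to each task-specific head (and later to the activity recognizer).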

Figure 3. Detailed architectural specification of the transformation prediction and activity recognition networks. We propose a framework for self-supervised representation learning from unlabeled sensor data (such as an accelerometer). Various signal transformations are utilized to establish supervisory tasks, and the network is trained to differentiate between an original and a transformed version of the input. The three blocks of Conv + ReLU and Dropout layers, followed by a Global Max Pooling layer, are similar across both networks. However, the multi-task model has a separate head for each task. Likewise, the activity recognizer has an additional densely connected layer. The TPN is pre-trained on self-supervised data, and the learned weights are transferred (depicted by a dashed arrow) and kept frozen in the lower model, which is then trained to detect various activities.

The motivation for keeping the TPN architecture simple arises from the fact that we want to show that the performance gain does not come from the number of parameters (or layers) or from the utilization of other sophisticated techniques such as batch normalization, but rather from the self-supervised pre-training. Likewise, the choice of a multi-task learning setting, where each task has an additional private layer, lets the model push pretext-task-specific features to the last layers while the initial layers extract generic representations that are important for a wide variety of end-tasks. Moreover, our architectural specification allows for a straightforward extension with other related tasks, if needed, such as input reconstruction. Although we do not explore applying multiple transformations to the same sequence or training models for their recognition, the network design is intrinsically capable of performing this multi-label classification task.

Input: Unlabeled instance set D_u, labeled dataset D_l, task-specific loss weights α, numbers of epochs E_s and E_a
Output: Self-supervised network F, activity classification model A with C classes
initialize D′ to hold instance-label pairs for the multiple tasks in 𝒯
initialize F with parameters θ_F and A with parameters θ_A
// Labeled data generation for self-supervision
for each instance x in D_u do
       for each transformation T_j in 𝒯 do
             Insert (x, 0) and (T_j(x), 1) into the partition of D′ for task j
       end for
end for
for each epoch from 1 to E_s do
       Randomly sample a mini-batch of samples from D′ for all tasks
       Update θ_F by descending along its gradient
end for
Assign the learned parameters of the shared layers from F to A
Keep the transferred weights of network A frozen
for each epoch from 1 to E_a do
       Randomly sample a mini-batch of labeled activity recognition samples from D_l
       Update θ_A by descending along its gradient
end for
Gradient-based updates can use any standard gradient-based learning technique. We used Adam (Kingma and Ba, 2014) in all our experiments.
Algorithm 1 Multi-task Self-Supervised Learning

Our training process is summarized in Algorithm 1. For every instance, we first generate the transformed versions of the signal for the self-supervised pre-training of the network. At each training iteration of the TPN model, we feed the data from all tasks simultaneously, and the overall loss is calculated as a weighted sum of the losses of the different tasks. Once pre-training converges, we transfer the weights of the convolutional layers from the TPN to an activity recognition network for learning the final supervised task. Here, either all the transferred layers are kept frozen, or the last convolutional layer is fine-tuned, depending on the learning paradigm. Figure 3 depicts this process graphically, where shaded convolutional layers represent frozen weights, while the others are either trained from scratch or optimized further on the end-task. To avoid ambiguity, in the experiment section we explicitly mention whether the results are from a fully-supervised or a self-supervised (including fine-tuned) network.

4. Evaluation

In this section, we conduct an extensive evaluation of our approach on several publicly available datasets for human activity recognition (HAR) in order to determine the quality of learned representations, transferability of the features, and benefits of this in the low-data regime. The self-supervised tasks (i.e., transformation predictions) are utilized for learning rich sensor representations that are suitable for an end-task. We emphasize that achieving high performance on these surrogate tasks is not our focus.
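Since the results are reported as kappa scores, a minimal implementation of Cohen's kappa, the agreement measure corrected for chance agreement, may serve as a useful reference (the helper name is ours):

```python
import numpy as np

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected by chance from the marginal frequencies."""
    classes = np.unique(np.concatenate([y_true, y_pred]))
    p_o = np.mean(y_true == y_pred)
    p_e = sum(np.mean(y_true == c) * np.mean(y_pred == c) for c in classes)
    return (p_o - p_e) / (1.0 - p_e)

# Toy example with three activity classes and one misclassified instance.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 1, 2, 1])
k = cohen_kappa(y_true, y_pred)  # 0.75
```

A kappa of 0 corresponds to chance-level agreement and 1 to perfect agreement, which makes the 0.7-0.8 scores reported for the semi-supervised setting easy to interpret.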

Dataset No. of users No. of activity classes
HHAR 9 6
UniMiB 30 9
UCI HAR 30 6
MobiAct 67 11
WISDM 36 6
MotionSense 24 6
Table 1. Summary of datasets used in our evaluation. These datasets are selected based on the diversity of participants, device types and activity classes. Further details on the pre-processing of each data source and the number of users utilized are discussed in Section 4.1.

4.1. Datasets

We consider six publicly available datasets to cover a wide variety of device types, data collection protocols, and activity recognition tasks performed with smartphones in different environments. Some important aspects of the data are summarized in Table 1. Below, we give brief descriptions of every dataset summarizing its key points.

4.1.1. HHAR

The Heterogeneity Human Activity Recognition (HHAR) dataset (Stisen et al., 2015) contains signals from two sensors (accelerometer and gyroscope) of smartphones and smartwatches for six different activities, i.e., biking, sitting, standing, walking, stairs-up and stairs-down. The participants executed a scripted set of activities to obtain an equal class distribution. The subjects carried the smartphones in a tight pouch around their waist and wore the smartwatches on each arm. In total, a range of smart-device models from multiple manufacturers was used to cover a broad range of devices for sampling-rate heterogeneity analysis; the sampling rate of the signals varied significantly across phones.

4.1.2. UniMiB

This dataset (Micucci et al., 2017) contains triaxial accelerometer signals collected with a Samsung Galaxy Nexus smartphone at 50 Hz. Thirty subjects participated in the data collection process, forming a diverse sample of the population in terms of height, weight, age, and gender. Each subject placed the device in the front left pocket of her trousers for part of the experiment and in the right pocket for the remainder. We utilize the data of the nine activities of daily living (i.e., standing up from sitting, standing up from lying, walking, running, upstairs, jumping, downstairs, lying down from sitting, and sitting) in this paper.

4.1.3. UCI HAR

The UCI HAR dataset (Anguita et al., 2013) is obtained from a group of 30 volunteers with a waist-mounted Samsung Galaxy S II smartphone. The accelerometer and gyroscope signals are collected at 50 Hz while the subjects performed the following six activities: standing, sitting, lying down, walking, downstairs, and upstairs.

4.1.4. MobiAct

The MobiAct (second release) dataset (Chatzaki et al., 2016) contains signals from a smartphone's inertial sensors (accelerometer, gyroscope, and orientation) for several activities of daily living and several types of falls. It was collected with a Samsung Galaxy S3 smartphone from participants of different genders, age groups, and weights over a large number of trials. The device was placed in a trouser pocket freely selected by the subject, in any random orientation, to capture everyday usage of the phone. We use the data of the 67 participants who have samples for any of the following eleven activities: sitting, walking, jogging, jumping, stairs up, stairs down, stand to sit, sitting on a chair, sit to stand, car step-in, and car step-out.

4.1.5. WISDM

The dataset from the Wireless Sensor and Data Mining (WISDM) project (Kwapisz et al., 2011) was collected in a controlled study from 36 volunteers, who carried the cell phone in their pockets. The data were recorded for six activities (i.e., sit, stand, walk, jog, ascend stairs, and descend stairs) via an app developed for Android phones. The accelerometer signal was acquired every 50 ms (a sampling rate of 20 Hz). We use the data of all 36 users available in the raw data file.

4.1.6. MotionSense

The MotionSense dataset (Malekzadeh et al., 2018) comprises accelerometer, gyroscope, and altitude data from 24 participants of varying age, gender, weight, and height. It was collected using an iPhone 6s kept in the user's front pocket. The subjects performed six activities (i.e., walking, jogging, downstairs, upstairs, sitting, and standing) in multiple trials under similar environments and conditions. The original study aimed to infer physical and demographic attributes from time-series data in addition to detecting activities.

4.2. Data Preparation and Assessment Strategy

We applied minimal pre-processing to the accelerometer signals, as deep neural networks are very good at learning abstract representations directly from raw data (LeCun et al., 2015). We segmented the signals into fixed-size overlapping windows for all the datasets under consideration. The appropriate window size is a task-specific parameter and could be tuned or chosen based on prior knowledge for improved performance; here, we use the same window size across datasets, based on earlier exploration and to keep the experimental evaluation impartial towards the effect of this hyper-parameter. Next, we divide each dataset into training and test sets by randomly selecting a fixed proportion of the users for testing and the rest for training and validation, depending on the dataset size; the ceiling function is used to round the number of test users up (e.g., when selecting a fraction of the nine HHAR users for evaluation). The training users' data are further divided into a training portion for learning the network and a validation portion for hyper-parameter tuning. Importantly, we also evaluate our models through user-split-based k-fold cross-validation wherever appropriate. Finally, we normalize the data by applying z-normalization with summary statistics calculated from the training set, and we generate self-supervised data from the resulting unlabeled training set following the procedure explained in Section 3.3.
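As a concrete illustration, the segmentation and normalization steps above might be sketched as follows; the window size, overlap, and toy signal are placeholder values (the paper's exact settings are task-specific), and only the training-split statistics are reused for normalization:

```python
import numpy as np

def segment(signal, window_size=400, overlap=0.5):
    """Split a (time, channels) signal into fixed-size overlapping windows."""
    step = int(window_size * (1 - overlap))
    starts = range(0, len(signal) - window_size + 1, step)
    return np.stack([signal[s:s + window_size] for s in starts])

# z-normalization statistics must come from the training split only, and are
# then applied unchanged to validation and test data.
rng = np.random.default_rng(0)
train_signal = rng.standard_normal((5000, 3))     # toy accelerometer stream
mean, std = train_signal.mean(axis=0), train_signal.std(axis=0)
windows = (segment(train_signal) - mean) / std    # (num_windows, 400, 3)
```

The same `mean` and `std` would be reused when normalizing held-out users, which avoids leaking test statistics into training.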

Furthermore, due to the large size of the HHAR dataset, and to reduce the computational load, we randomly sample a subset of instances from each user's data to produce the transformed signals. Likewise, because of the relatively small size of UniMiB, we generate several times more transformed instances for it. We evaluate performance with Cohen's kappa and weighted versions of the precision, recall, and f-score metrics, which are robust to the inherently imbalanced nature of the datasets. It is important to highlight that we use a network architecture with the same configuration across all datasets, so that any improvement is indeed due to self-supervision and not due to architectural modifications.
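Cohen's kappa, the headline metric here, corrects raw accuracy for chance agreement, which is what makes it informative on imbalanced data; a minimal from-scratch version (equivalent in spirit to `sklearn.metrics.cohen_kappa_score`) looks like:

```python
import numpy as np

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for the agreement
    expected by chance, hence robust to imbalanced class distributions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(np.concatenate([y_true, y_pred]))
    p_observed = np.mean(y_true == y_pred)
    p_expected = sum(np.mean(y_true == c) * np.mean(y_pred == c)
                     for c in classes)
    return (p_observed - p_expected) / (1 - p_expected)
```

A kappa of 1 indicates perfect agreement, 0 indicates chance-level predictions, so a majority-class classifier on a skewed dataset scores near 0 even when its raw accuracy is high.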

4.3. Results

4.3.1. Quantifying the Quality of Learned Feature Hierarchies

We first evaluate our approach to determine the quality of the learned representations as a function of model depth (i.e., the layer from which the features come). This analysis helps in understanding whether features from different layers vary in quality with respect to their performance on an end-task and, if so, which layer should be utilized. To this end, we first pre-train our TPN in a self-supervised manner and learn classifiers on top of each of the three temporal convolution blocks independently, for several activity recognition datasets. These classifiers (see Figure 3) are trained in a supervised way while keeping the learned features fixed during optimization. Figure 4 provides kappa values on the test sets, averaged across 10 independent runs to be robust against differences in the weight initializations of the classifiers. We observe that for a majority of the datasets the model performance improves with increasing depth, apart from HHAR, where features from an earlier layer result in a better detection rate. This may be because the representation of the last layer becomes too specific to the transformation prediction task, or because we did not utilize the entire dataset for self-supervision. To be consistent, in the subsequent experiments we use features from the last convolutional layer for all the considered datasets. For a new task or recognition problem, we recommend performing a similar analysis to identify the layer/block of the network that gives optimal results on the particular dataset.

Figure 4. Evaluation of activity classification performance using the features learned based on self-supervision (per layer). We train an activity classifier on top of each of the three temporal convolution blocks that are pre-trained with self-supervision; the blocks have an increasing number of feature maps. The reported results are averaged over 10 independent runs (i.e., training an activity classifier from scratch each time).

4.3.2. Comparison against Fully-Supervised and Unsupervised Approaches

In this subsection, we assess the self-supervised representations learned with TPN against other unsupervised and fully-supervised techniques for feature learning. Table 2 summarizes the results with respect to four evaluation metrics (namely, precision, recall, f-score, and kappa) over 10 independent runs on the six datasets described earlier. For the Random Init. entries, we keep the convolutional network layers frozen during optimization and train only a classifier in a supervised manner. Likewise, for the Autoencoder, we keep the network architecture the same and pre-train it in an unsupervised way; afterward, the weights of the encoder are kept frozen, and a classifier is trained on top as usual. The Self-Supervised entries show the results for the convolutional network pre-trained with our proposed method, where a classifier is trained on top of the frozen network in a supervised fashion. Furthermore, the Self-Supervised (FT) entries report the performance of the network trained with self-supervision when the last convolution layer is fine-tuned along with the classifier during training on the activity recognition task. Training an activity classification model on top of randomly initialized convolutional layers performs poorly, as expected, which is evidence that the performance improvement is not due to the activity classifier alone. These results are followed by a widely used unsupervised learning method, the autoencoder. The self-supervised technique outperforms the existing methods and achieves results that are on par with the fully-supervised model. It is important to note that, for our proposed technique, only the classifier layers are randomly initialized and trained with activity-specific labels (the rest is transferred from the self-supervised network). We also observe that fine-tuning the last convolutional layer further improves classification performance on the downstream task for several datasets, such as UniMiB, HHAR, MobiAct, and UCI HAR. The results show that TPN can learn highly generalizable representations, thus reducing the performance gap between feature learning and the (end-to-end) supervised case. For a more rigorous evaluation, we also performed k-fold (user-split-based) cross-validation for every method on all the datasets. The results are provided in Table 4 of the appendix, which also shows that the self-supervised method reduces the performance gap with the supervised setting.

Method P R F K
Random Init. 0.3882±0.0557 0.3101±0.0409 0.2141±0.0404 0.1742±0.0488
Supervised 0.7624±0.0312 0.7353±0.0308 0.7276±0.0297 0.6816±0.0371
Autoencoder 0.7317±0.0451 0.6657±0.0663 0.6585±0.0724 0.5994±0.0784
Self-Supervised 0.7985±0.0155 0.7770±0.0199 0.7666±0.0234 0.7310±0.0243
Self-Supervised (FT) 0.8218±0.0256 0.7970±0.0211 0.7862±0.0187 0.7555±0.0250
(a) HHAR

Random Init. 0.4256±0.0468 0.3546±0.0370 0.2775±0.0491 0.2243±0.0474
Supervised 0.8276±0.0148 0.8096±0.0266 0.8097±0.0248 0.7815±0.0299
Autoencoder 0.5922±0.0191 0.5557±0.0232 0.5376±0.0339 0.4824±0.0275
Self-Supervised 0.8133±0.0077 0.7954±0.0140 0.7929±0.0160 0.7642±0.0162
Self-Supervised (FT) 0.8506±0.0070 0.8432±0.0049 0.8425±0.0054 0.8197±0.0050
(b) UniMiB

Random Init. 0.6189±0.0648 0.4392±0.0692 0.3713±0.0952 0.3133±0.0866
Supervised 0.9059±0.0133 0.8998±0.0139 0.8981±0.0148 0.8789±0.0168
Autoencoder 0.8314±0.0590 0.7877±0.1112 0.7772±0.1306 0.7425±0.1359
Self-Supervised 0.9100±0.0081 0.9011±0.0139 0.8987±0.0155 0.8803±0.0169
Self-Supervised (FT) 0.9057±0.0121 0.8970±0.0185 0.8946±0.0190 0.8754±0.0222
(c) UCI HAR

Random Init. 0.4749±0.1528 0.3452±0.1128 0.2813±0.0982 0.1915±0.1017
Supervised 0.9080±0.0066 0.8950±0.0167 0.8975±0.0133 0.8665±0.0202
Autoencoder 0.7493±0.0328 0.7581±0.0354 0.7293±0.0452 0.6772±0.0517
Self-Supervised 0.9095±0.0035 0.9059±0.0059 0.9060±0.0053 0.8795±0.0073
Self-Supervised (FT) 0.9194±0.0057 0.9102±0.0114 0.9117±0.0093 0.8855±0.0140
(d) MobiAct

Random Init. 0.5942±0.0599 0.3543±0.0770 0.3580±0.0837 0.2224±0.0656
Supervised 0.9024±0.0076 0.8657±0.0206 0.8764±0.0168 0.8211±0.0258
Autoencoder 0.6561±0.2775 0.6631±0.1623 0.6358±0.2355 0.5106±0.2880
Self-Supervised 0.8894±0.0096 0.8484±0.0269 0.8593±0.0225 0.7986±0.0334
Self-Supervised (FT) 0.8999±0.0111 0.8568±0.0375 0.8686±0.0314 0.8106±0.0466
(e) WISDM

Random Init. 0.5999±0.0956 0.5029±0.0931 0.4681±0.1105 0.3760±0.1176
Supervised 0.9164±0.0053 0.8993±0.0091 0.9027±0.0085 0.8763±0.0110
Autoencoder 0.8255±0.0132 0.8116±0.0195 0.8109±0.0169 0.7659±0.0226
Self-Supervised 0.8979±0.0073 0.8856±0.0087 0.8864±0.0083 0.8589±0.0106
Self-Supervised (FT) 0.9153±0.0088 0.8979±0.0092 0.9005±0.0094 0.8744±0.0112
(f) MotionSense
Table 2. Task Generalization: evaluating self-supervised representations for activity recognition. We compare the proposed self-supervised method for representation learning with fully-supervised and unsupervised approaches, using the same architecture across all experiments. The self-supervised TPN is trained to recognize the transformations applied to the input signal, while the activity classifier is trained on top of these learned features; the Self-Supervised (FT) entries provide results when the last convolution layer is fine-tuned. The Random Init. entries present results when the convolution layers are randomly initialized and kept frozen during the training of the classifier. The reported results are averaged over 10 independent runs to be robust against variations in the weight initialization and the optimization process.
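The frozen-feature protocol behind the Random Init., Autoencoder, and Self-Supervised rows can be sketched as follows; this is a toy stand-in, not the paper's TPN: a fixed random projection plays the role of the (pre-trained or random) encoder, and only a softmax head is trained on top of it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the encoder: its weights are never updated, so the classifier
# can only exploit whatever the "pre-training" already captured.
W_frozen = rng.standard_normal((3, 16))
encode = lambda x: np.maximum(x @ W_frozen, 0.0)   # frozen ReLU features

def train_linear_head(X, y, num_classes, lr=0.05, epochs=500):
    """Softmax classifier trained on frozen features (encoder untouched)."""
    F = encode(X)
    W = np.zeros((F.shape[1], num_classes))
    for _ in range(epochs):
        logits = F @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        p[np.arange(len(y)), y] -= 1.0             # softmax cross-entropy grad
        W -= lr * F.T @ p / len(y)
    return W

# Two well-separated toy "activity" classes in place of real sensor windows.
X = np.vstack([rng.normal(-2.0, 0.5, (50, 3)), rng.normal(2.0, 0.5, (50, 3))])
y = np.array([0] * 50 + [1] * 50)
W_head = train_linear_head(X, y, num_classes=2)
accuracy = float((np.argmax(encode(X) @ W_head, axis=1) == y).mean())
```

Fine-tuning (the FT rows) would additionally let gradients update the last encoder layer instead of keeping `W_frozen` fixed throughout.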

4.3.3. Assessment of Individual Self-Supervised Tasks in Contrast with Multiple Tasks

In Figure 5, we show a comparative performance analysis of the single self-supervised tasks against each other and, importantly, against the multi-task setting. This assessment helps us understand whether self-supervised features extracted by jointly learning to solve multiple tasks are any better (for activity classification) than those obtained by solving individual tasks independently, i.e., whether multi-task learning yields more useful sensor semantics. To this end, we pre-train a TPN on each of the self-supervised tasks separately and transfer the weights for learning an activity recognition classifier. We observe in all cases that learning representations via solving multiple tasks leads to far better performance on the end-task. This further highlights that the features learned through the various self-supervised tasks have different strengths and weaknesses, so merging multiple tasks results in a more diverse set of learned features. However, we notice that some tasks (such as Channel Shuffled, Permuted, and Rotated) consistently perform better than others across datasets, achieving high kappa scores on the different activity recognition problems. This highlights an important point: there may exist a subset of tasks that is sufficient to achieve a model of good quality. Furthermore, in Figure 11 of the appendix, we plot the kappa score achieved by a multi-task TPN on the transformation recognition tasks as a function of the number of training epochs. This analysis shows that task complexity varies greatly from one dataset to another and may help with the identification of trivial auxiliary tasks that could lead to non-generalizable features.

In addition to activity classification, for any learning task involving time-series sensor data (e.g., as encountered in various Internet of Things applications), we recommend extracting features by first solving the individual tasks and later focusing on the multi-task scenario, discarding low-performing tasks or assigning low weights to their respective loss functions. Another approach is to auto-tune the task-loss weights by taking the homoscedastic uncertainty of each task into account (Kendall et al., 2018).
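A sketch of such a weighted multi-task objective; the task names, loss values, and weights below are purely illustrative, and the last two lines show the uncertainty-based weighting of Kendall et al. with all learnable log-variances initialized to zero:

```python
import numpy as np

# Illustrative per-task losses for transformation-recognition tasks; a
# low-performing or trivial task can be down-weighted (or dropped with w = 0).
task_losses  = {"noised": 0.41, "scaled": 0.38, "rotated": 0.22,
                "permuted": 0.19, "channel_shuffled": 0.21}
task_weights = {"noised": 0.5, "scaled": 0.5, "rotated": 1.0,
                "permuted": 1.0, "channel_shuffled": 1.0}

# Fixed weighted sum: the simple "assign low weights" strategy.
total_loss = sum(task_weights[t] * task_losses[t] for t in task_losses)

# Homoscedastic-uncertainty weighting (Kendall et al., 2018): each task gets a
# learnable log-variance s_t; its weight exp(-s_t) is tuned by the optimizer,
# and the +s_t term keeps the weights from collapsing to zero.
log_vars = {t: 0.0 for t in task_losses}
uncertainty_loss = sum(np.exp(-log_vars[t]) * task_losses[t] + log_vars[t]
                       for t in task_losses)
```

In an actual training loop both `total_loss` and the `log_vars` would be recomputed per batch, with the log-variances treated as trainable parameters.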

Figure 5. Comparison of individual self-supervised tasks with the multi-task setting. The TPN is pre-trained for solving a particular task and the activity classifier is trained on top of the learned features. We report the evaluation metrics averaged over 10 independent runs, where F, K, P, and R refer to F-score, Kappa, Precision, and Recall, respectively. We observe that multi-task learning improves performance in all cases, with tasks such as Channel Shuffled, Permuted, and Rotated consistently performing better than the other tasks across datasets.

4.3.4. Effectiveness under Semi-Supervised Setting

Our proposed self-supervised feature learning method attains very high performance on different activity recognition datasets. This raises the question of whether the self-supervised representations can also boost performance in the semi-supervised learning setting; in particular, can we use them to perform activity detection with very little labeled data? To answer this, we evaluate the effectiveness of our approach for semi-supervised learning. Specifically, we initially train a TPN on an entire training set for transformation prediction. Subsequently, we learn a classifier on top of the last layer's feature maps with only a subset of the available accelerometer samples and their corresponding activity labels, training the activity classifier with a small, increasing number of examples per category (class), from a handful up to a hundred. Note that a few samples per class represent a real-world scenario of acquiring a (small) labeled dataset from human users with minimal interruption to their daily routines, making self-supervision from unlabeled data of great value. Likewise, we believe our analysis of learning with very few labeled instances across datasets is the first attempt at quantifying the amount of labeled data required to learn an activity recognizer of decent quality. For the self-supervised models, as earlier, we either keep the weights frozen or fine-tune only the last layer.
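The per-class sampling protocol described above can be sketched as follows (the data shapes and class count are placeholders):

```python
import numpy as np

def sample_per_class(X, y, n_per_class, rng):
    """Randomly draw the same number of labeled examples from every class,
    mimicking a small, balanced annotation budget per activity."""
    idx = np.concatenate([rng.choice(np.flatnonzero(y == c), n_per_class,
                                     replace=False)
                          for c in np.unique(y)])
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X = np.zeros((600, 400, 3))                 # toy windowed signals
y = np.repeat(np.arange(6), 100)            # six activity classes
X_small, y_small = sample_per_class(X, y, n_per_class=10, rng=rng)
```

Only `X_small`/`y_small` would be used to fit the classifier head, while the TPN itself has already been pre-trained on the full unlabeled set.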

Figure 6. Generalization of the self-supervised learned features under the semi-supervised setting. The TPN is pre-trained on an entire set of unlabeled data in a self-supervised manner, and the activity classifier is trained from scratch on an increasing number of labeled instances per class. The blue curve (baseline) depicts the performance when the entire network is trained in a standard supervised way, while the orange curve shows the performance when we keep the transferred layers frozen. The green curve illustrates the kappa score when the last layer is fine-tuned along with the training of a classifier on the available set of labeled instances. The reported results are averaged over 10 independent runs for each of the evaluated approaches. The results with weighted f-score are provided in Figure 12 of the Appendix.

In Figure 6, we plot the average kappa of 10 independent runs as a function of the number of available training examples. For each run, we randomly sample the desired training instances and train a model from scratch; the same instances are used for evaluating both the supervised baseline and our proposed method. The fully-supervised baseline (blue curve) shows the network performance when a model is trained only with the labeled data. The proposed self-supervised pre-training technique, in particular the version with fine-tuning of the last layer, tremendously improves the performance. The difference in performance between supervised and self-supervised feature learning is significant on the MotionSense, UCI HAR, MobiAct, and HHAR datasets in the low-data regime (i.e., with only a few labeled instances per class). More notably, we observe that pre-training helps more in the semi-supervised setting when the data are collected from a wide variety of devices, simulating a real-life setting. Finally, we highlight that a simple convolutional network is used in our experiments to show the feasibility of self-supervision from unlabeled data; we believe a deeper network trained on a bigger unlabeled dataset would further improve the quality of the learned representations for the semi-supervised setting.

4.3.5. Evaluating Knowledge Transferability

We have shown that representations learned by the self-supervised TPN consistently achieve the best performance compared to other unsupervised/supervised techniques, including in a semi-supervised setting. As we have so far utilized unlabeled data from the same source for self-supervised pre-training, a next logical question is: can we utilize a different (yet similar) data source for self-supervised representation extraction and still gain a performance improvement on a task of interest (also in a low-data regime)? In Table 3, we assess the performance of our unsupervised learned features across datasets and tasks by fine-tuning them on the HHAR, UniMiB, UCI HAR, WISDM, and MotionSense datasets. For self-supervised feature learning, we utilize the unlabeled MobiAct dataset, as it was collected from a diverse group of users performing twelve activities, the highest among the considered datasets both in terms of the number of users and of activities. This makes MobiAct a suitable candidate for transfer learning, as it encompasses all the activity classes in the other datasets. Of course, we do not utilize MobiAct's activity labels for self-supervised representation learning. We begin by pre-training a network on the MobiAct dataset and utilize the learned weights to initialize an activity recognition model; the latter model is then trained in a fully-supervised manner on the entire training set of a particular dataset (e.g., UniMiB). Compared with supervised training of the network from scratch, the weights learned with our technique from a different and completely unlabeled data source improve the performance in all cases. On WISDM and HHAR, our results are around three percentage points better in terms of kappa score. Similarly, on UniMiB we obtain a four percentage point improvement over the supervised model, i.e., the kappa score increases from 0.7815 to 0.8214.

Supervised (From Scratch) | Transfer (Self-Supervised)
Dataset P R F K | P R F K
HHAR 0.7624±0.0312 0.7353±0.0308 0.7276±0.0297 0.6816±0.0371 | 0.7816±0.0405 0.7617±0.0469 0.7549±0.0452 0.7130±0.0560
UniMiB 0.8276±0.0148 0.8096±0.0266 0.8097±0.0248 0.7815±0.0299 | 0.8557±0.0123 0.8444±0.0191 0.8445±0.0185 0.8214±0.0217
UCI HAR 0.9059±0.0133 0.8998±0.0139 0.8981±0.0148 0.8789±0.0168 | 0.9097±0.0129 0.9073±0.0145 0.9065±0.0152 0.8879±0.0175
WISDM 0.9024±0.0076 0.8657±0.0206 0.8764±0.0168 0.8211±0.0258 | 0.9058±0.0102 0.8907±0.0113 0.8946±0.0108 0.8517±0.0153
MotionSense 0.9164±0.0053 0.8993±0.0091 0.9027±0.0085 0.8763±0.0110 | 0.9223±0.0081 0.9059±0.0132 0.9096±0.0126 0.8843±0.0160
Table 3. Task and Dataset Generalization: quantifying the quality of the transferred self-supervised network. We pre-train a TPN on the MobiAct dataset with the proposed self-supervised approach. The classifier is added on top of the transferred model and trained in an end-to-end fashion on a particular activity recognition dataset. We chose MobiAct for the transfer learning evaluation because of the large number of users and activity classes it covers. The reported results are averaged over 10 independent runs, where P, R, F, and K refer to Precision, Recall, F-score, and Kappa, respectively.

Further, we determine the generalization ability in the low-data regime, i.e., when very few labeled data are attainable for the end-task of interest. We transfer the representations learned with self-supervision on the MobiAct dataset as the initialization of an activity recognizer, and the network is then trained in a supervised manner on the available labeled instances of a particular dataset. Figure 7 shows the average kappa score over 10 independent runs of the fully-supervised (learned from scratch) and transferred models for an increasing number of labeled instances per class. For each training run, the desired instances are randomly sampled, and the same instances are used by both techniques for learning the activity classifier. In the majority of cases, transfer learning improves the recognition performance, especially when the number of labeled instances per class is very small. On HHAR, the performance of the model trained with transferred weights is slightly lower in the low-data setting but improves significantly as the number of labeled data points increases. We think this may be due to the complex characteristics of the HHAR dataset, which was specifically collected to show the heterogeneity of devices (and sensors) with varying sampling rates and its impact on activity recognition performance.

Figure 7. Assessment of the self-supervised features transferred from a different but related dataset (MobiAct) under the semi-supervised setting. We evaluate the performance of the self-supervised approach when different unlabeled data are accessible for representation learning but very few labeled instances are available for training a network on the task of interest. The TPN is pre-trained on MobiAct data and the activity classifier is added on top; the entire network is then trained in an end-to-end fashion on a few labeled instances. The reported results are averaged over 10 independent runs for each of the evaluated approaches, where we randomly sample an increasing number of labeled instances per class for learning the activity classifier. The results with weighted f-score are provided in Figure 13 of the Appendix.

4.4. Determining Representational Similarity

The previous experiments establish that self-supervised sensor representations for activity classification are significantly better than unsupervised ones and on par with fully-supervised approaches. The critical question that arises is whether the self-supervised representations are similar to those learned via direct supervision, i.e., with activity labels. The interpretability of neural networks and the deciphering of their learned representations have recently gained significant attention, especially for images (see (Olah et al., 2018) for an excellent review). Here, to better understand the similarity between the representations extracted by the TPN and by the supervised network, we utilize singular vector canonical correlation analysis (SVCCA) (Raghu et al., 2017), saliency maps (Simonyan et al., 2013), and t-distributed stochastic neighbor embedding (t-SNE) (Maaten and Hinton, 2008).

Insights on Representational Similarity with Canonical Correlation

SVCCA allows for a comparison of the learned distributed representations across different networks and layers. It does so by identifying optimal linear relationships between two sets of multidimensional variates (i.e., neuron activation vectors) arising from an underlying process (i.e., a neural network being trained on a specific task) (Raghu et al., 2017). Figure 8 provides the mean similarity of the top SVCCA correlation coefficients for all pairs of layers of a self-supervised (trained to predict transformations) and a fully-supervised network. We average only the top coefficients: SVCCA implicitly assumes that all CCA vectors are equally crucial for the representation at a specific layer, but there is plenty of evidence that high-performing deep networks do not utilize the entire dimensionality of a layer (Morcos et al., 2018; LeCun et al., 1990; Li et al., 2018), so averaging over all the coefficients underestimates the degree of representational similarity. To apply SVCCA, we train both networks as explained earlier and record the activations of each layer; when one layer has more neurons than the layer it is compared with, we randomly sample neuron activation vectors to obtain comparable dimensionality. In Figure 8, each grid entry represents the mean SVCCA similarity between two layers of the different networks. We observe a high correlation among the temporal convolution layers trained with the two different methods across all the evaluated datasets. In particular, a strong grid-like structure emerges between the last layers of the networks, because those layers are learned from scratch with activity-labeled data and thus arrive at very similar representations.
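The core of this layer-wise comparison can be sketched with plain linear algebra; the following is a simplified variant that omits SVCCA's initial singular-vector pruning of low-variance directions:

```python
import numpy as np

def mean_cca_similarity(A, B, top_k=None):
    """Mean of the top canonical correlations between two activation
    matrices A and B of shape (samples, neurons)."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    # Orthonormal bases for each view via thin SVD (a form of whitening).
    Ua, _, _ = np.linalg.svd(A, full_matrices=False)
    Ub, _, _ = np.linalg.svd(B, full_matrices=False)
    # The singular values of Ua^T Ub are exactly the canonical correlations.
    rho = np.linalg.svd(Ua.T @ Ub, compute_uv=False)
    return float(rho[:top_k].mean() if top_k else rho.mean())
```

Because canonical correlations are invariant to invertible linear maps, two layers that encode the same information in different bases still score close to 1, which is what makes the comparison meaningful across separately trained networks.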

Figure 8. CCA similarity between fully-supervised and self-supervised networks. We employ the SVCCA technique (Raghu et al., 2017) to determine the representational similarity between model layers trained with our proposed approach and in the standard supervised setting. Each pane is a matrix of size layers × layers, with each entry showing the mean similarity (i.e., an average of the top correlation coefficients) between the two layers. Note that there is a strong relation between the convolutional layers even though the self-supervised network is pre-trained with unlabeled data, showing that the features learned by our approach are very similar to those learned directly via supervised learning with activity classes. Likewise, a grid-like structure appears between the last layers of the networks, depicting high similarity, as those layers are always (randomly initialized and) trained with activity labels.

Visualizing Salient Regions

To further understand the predictions produced by both models, we visualize saliency maps (Simonyan et al., 2013) for the highest-scoring class on randomly selected instances from the MotionSense dataset. Saliency maps highlight which time steps largely affect the output by computing the gradient of the loss function with respect to each input time step. More formally, let $x = (x_1, \ldots, x_n)$ be an accelerometer sample of length $n$ and $f_c(x)$ be the probability of class $c$ produced by a network $f(\cdot)$. The saliency score $e_i$ of each input element $x_i$, indicating its influence on the prediction, is calculated as:

$$e_i = \left| \frac{\partial \mathcal{L}(f(x), c)}{\partial x_i} \right|,$$

where $\mathcal{L}$ is the negative log-likelihood loss of the activity classification network for an input example $x$.

Figure 9 provides the saliency mapping of the same input produced by the two networks for the class with the highest score. To aid interpretability of the saliency score, we compute the magnitude of each tri-axial accelerometer sample, effectively combining all three channels. The actual input is given in the top-most pane, and the magnitudes, with varying color intensity, are shown in the bottom panes; the darker colors illustrate the regions that contribute most to the network's prediction. We observe that the saliency maps of the self-supervised and fully-supervised networks point towards similar regions as being crucial for deciding on the class label.
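In gradient terms, the map is just $|\partial \mathcal{L} / \partial x_i|$ per time step. A framework-free sketch follows; real implementations obtain this gradient via backpropagation, while central differences are used here only to keep the example self-contained:

```python
import numpy as np

def saliency(loss_fn, x, eps=1e-5):
    """Approximate |dL/dx_i| for every input element by central differences.
    In practice this gradient comes from backpropagation; the numerical
    version is only for illustration on small inputs."""
    scores = np.zeros_like(x)
    for i in range(x.size):
        xp, xm = x.copy(), x.copy()
        xp.flat[i] += eps
        xm.flat[i] -= eps
        scores.flat[i] = abs(loss_fn(xp) - loss_fn(xm)) / (2 * eps)
    return scores

# Tri-axial samples are summarized by their magnitude for visualization.
magnitude = lambda sample: np.linalg.norm(sample, axis=-1)
```

For a toy loss such as `lambda v: v[0] ** 2 + 4 * v[2]`, the scores recover the analytic gradient magnitudes per element.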

Interestingly, for the Sitting class instance, both networks mainly focus on a small region of the input with slightly more variation in the values. We think this is because one thing a network learns is to find periodic variations in the signal (such as peaks and slopes); hence, it pays attention even to the slightest fluctuation, yet still decides on the Sitting label because the signal remains constant before and after these minor changes, an entirely different pattern compared to the instances of the other classes. This analysis further supports the point that our self-supervised network learns generalizable features for activity classification.

Figure 9. Saliency maps (Simonyan et al., 2013) of randomly selected instances from the MotionSense dataset. The input signal is illustrated in the top pane, with the magnitude computed from the sample shown in the bottom panes for better interpretability. The stronger color intensities exhibit the regions that substantially affect the model predictions. The saliency mappings of both networks focus on similar input areas, which shows that the self-supervised representations are useful for the end-task.

Visualization of High-Level Feature Space through t-SNE

t-SNE is a non-linear technique for exploring and visualizing multi-dimensional data (Maaten and Hinton, 2008). It approximates a low-dimensional manifold of its high-dimensional counterpart by minimizing the Kullback-Leibler divergence between the two distributions with a gradient-based optimization method. More specifically, it maps multi-dimensional data onto a lower-dimensional space and discovers patterns in the input by identifying clusters based on the similarity of the data points. Here, the activations from the global max-pooling layers of both the self-supervised and the fully-supervised networks are projected onto a 2-D space. Figure 10 provides the t-SNE embeddings, showing the high semantic relevance of the learned features for the various activity classes. We notice that the self-supervised features largely correspond to those learned with the labeled activity data. Importantly, the clusters of data points are similar across the two feature learning strategies; e.g., in UCI HAR, activity classes like Upstairs, Downstairs, and Walking are grouped together. Likewise, in HHAR, the data points for Walking, Upstairs, and Downstairs are close-by, as opposed to the others, in the embeddings of both networks. Finally, it is important to note that t-SNE is an unsupervised technique that does not use class labels; the activity labels are used only for the final visualization.

Figure 10. t-SNE visualization of the learned representations. We visualize the features from the Global Max Pooling layers of the fully-supervised and self-supervised networks by projecting them onto a 2-D space. The clusters show high correspondence among the representations across datasets. For instance, in the UniMiB embeddings, samples belonging to the same class are close-by, as opposed to those from different classes; Running and Walking lie alongside each other, while data points from the SittingDown class are far away. Note that the t-SNE embeddings do not use activity labels; they are only used for the final visualizations.

5. Related Work

Deep learning methods have been used successfully in several applications of ubiquitous computing, pervasive intelligence, health, and well-being (Radu et al., 2018; Georgiev et al., 2017; Saeed and Trajanovski, 2017; Hannun et al., 2019; Liu et al., 2016; Yao et al., 2018), eliminating the need for hand-crafted feature engineering. Convolutional and recurrent neural networks have shown dominant performance in solving numerous high-level recognition tasks on temporal data, such as activity detection and stress recognition (Wang et al., 2018a; Hammerla et al., 2016; Saeed and Trajanovski, 2017). In particular, CNNs are becoming increasingly popular in sequence (or time-series) modeling due to their weight sharing, translation invariance, scale separation, and localization of filters in space and time (Bai et al., 2018; LeCun et al., 2015). In fact, (1D) temporal CNNs are now widely used in the area of HAR (see (Wang et al., 2018a) for a detailed review), but the prior works are mostly concerned with supervised learning approaches. Training deep networks requires a huge, carefully curated dataset of labeled instances, which in several domains is infeasible due to the required manual labeling effort, or is only possible on a small scale in a controlled lab environment. This inherent limitation of the fully-supervised learning paradigm emphasizes the importance of unsupervised representation learning (Bengio et al., 2013), which can leverage the large amounts of unlabeled data that are easily acquired in a real-world setting.

Unsupervised learning has been well-studied in the literature over the past years. Before the era of end-to-end learning, manual feature design strategies (Figo et al., 2010), such as those employing statistical measures, were used with clustering algorithms to discover latent groups of activities (Wawrzyniak and Niemiro, 2015). Although deep learning techniques have almost entirely replaced hand-crafted feature extraction with learning rich features directly from data, representation learning still stands as a fundamental problem in machine learning (see (Bengio et al., 2013) for an in-depth review). The classical approaches for unsupervised learning include autoencoders (Baldi, 2012), restricted Boltzmann machines (Nair and Hinton, 2010), and convolutional deep belief networks (Lee et al., 2009). Another emerging line of research for unsupervised feature learning (also studied in this work), which has shown promising results and does not require manual annotations, is self-supervised learning (Raina et al., 2007; Doersch et al., 2015; Agrawal et al., 2015). These methods exploit the inherent structure of the data to acquire a supervisory signal, so that a pretext task can be solved with reliable and widely used supervised learning schemes.

Self-supervision has been actively studied recently in the vision domain, and several surrogate tasks have been proposed for learning representations from static images, videos, and sound, and in robotics (Noroozi and Favaro, 2016; Owens et al., 2016; Gomez et al., 2017; Zhang et al., 2017; Larsson et al., 2017; Jenni and Favaro, 2018; Gidaris et al., 2018; Lee et al., 2017; Doersch and Zisserman, 2017; Fernando et al., 2017; Misra et al., 2016; Arandjelović and Zisserman, 2017; Owens and Efros, 2018; Pathak et al., 2017). For example, in images and videos, spatial and temporal contexts, respectively, provide rich forms of supervision for learning features. Similarly, colorization of gray-scale images (Larsson et al., 2017; Zhang et al., 2017), rotation classification (Gidaris et al., 2018), odd-sequence detection (Fernando et al., 2017), frame order prediction (Misra et al., 2016), learning the arrow of time (Wei et al., 2018), audio-visual correspondence (Owens et al., 2016; Arandjelović and Zisserman, 2017), and synchronization (Korbar et al., 2018; Owens and Efros, 2018) are some of the recently explored directions of self-supervised techniques. Furthermore, multiple such tasks have been utilized together in a multi-task learning setting for solving diverse visual recognition problems (Doersch and Zisserman, 2017). These self-supervised learning paradigms have been shown to extract high-level representations that are on par with those acquired through fully-supervised pre-training (e.g., with ImageNet labels), and they help tremendously in transfer and semi-supervised learning scenarios. Inspired by this research direction, we explore multi-task self-supervision for learning representations from sensory data by utilizing transformations of the signals.

Some earlier works on time-series analysis have explored transformations to exploit invariances, either through architectural modifications (to automatically learn task-relevant variations) or, less commonly, through augmentation and synthesis. In (Um et al., 2017), task-specific transformations (such as added noise and rotation) are applied to wearable sensor data to augment it and improve the performance of Parkinson’s disease monitoring systems. Saeed et al. (Saeed et al., 2018) utilized an adversarial autoencoder for class-conditional (multimodal) synthetic data generation for behavioral context recognition in a real-life setting. Moreover, Oh et al. (Oh et al., 2018) focused on learning invariances directly from clinical time-series data with a specialized neural network architecture. Razavian et al. (Razavian et al., 2016) used convolution layers with filters of varying sizes to capture temporal patterns at different resolutions. Similarly, through additional pre-processing of the original data, Cui et al. (Cui et al., 2016) fed transformed signals as extra channels to the model for learning multi-scale features. To summarize, these works are geared towards learning supervised networks for specific tasks by exploiting invariances, but they do not address semi-supervised or unsupervised learning.

To the best of our knowledge, the work presented here is the first attempt at self-supervision for sensor representation learning, in particular for HAR. Our work differs from the aforementioned works in several ways, as we learn representations with self-supervision from completely unlabeled data and without using any specialized architecture. We show that when a CNN is trained to predict commonly known (time-series) transformations (Batista et al., 2011; Um et al., 2017) as a surrogate task, the model learns features that are on a par with those of a fully-supervised network and far better than those from unsupervised pre-training with an autoencoder. We also demonstrate that representations learned from a different (but related) unlabeled data source can be successfully transferred to improve the performance of diverse tasks, even in the case of semi-supervised learning. In terms of transfer learning, our approach also differs significantly from some earlier attempts (Morales and Roggen, 2016; Wang et al., 2018b) that were concerned with the transferability of features from a fully-supervised model learned from inertial measurement unit data, as our approach utilizes widely available smartphones and does not require labeled data. Finally, the proposed technique also differs from previously studied unsupervised pre-training methods such as autoencoders (Li et al., 2014), restricted Boltzmann machines (Plötz et al., 2011), and sparse coding (Bhattacharya et al., 2014), as we employ an end-to-end (self-)supervised learning paradigm on multiple surrogate tasks to extract features.
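The surrogate-task setup can be sketched as follows: each auxiliary task is a binary classification between an original window and a transformed copy. The particular transformation set and parameters below are illustrative assumptions in the spirit of (Um et al., 2017), not the paper's exact configuration.

```python
# Sketch of self-supervised label generation: "transformed vs. original" as a
# binary label per auxiliary task. Transformation choices here are assumptions.
import numpy as np

rng = np.random.default_rng(1)

def add_noise(x, sigma=0.05):
    return x + rng.normal(scale=sigma, size=x.shape)

def scale(x, factor=1.5):
    return factor * x

def negate(x):
    return -x

def time_flip(x):
    return x[::-1]

TRANSFORMS = [add_noise, scale, negate, time_flip]

def make_pretext_batch(window):
    """For each task, emit one original copy (label 0) and one transformed
    copy (label 1) of the signal window."""
    samples, labels = [], []
    for t in TRANSFORMS:
        samples += [window, t(window)]
        labels += [0, 1]
    return np.stack(samples), np.array(labels)

window = np.sin(np.linspace(0, 10, 128))  # toy single-channel signal window
X, y = make_pretext_batch(window)
print(X.shape, y.shape)  # (8, 128) (8,)
```

A multi-task network would then attach one sigmoid output head per transformation, so that the shared convolutional trunk learns features useful for all of the binary tasks at once.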

6. Conclusions and Future Work

We present a novel approach for self-supervised sensor representation learning from unlabeled data, with a focus on smartphone-based human activity recognition (HAR). We train a multi-task temporal convolutional network to recognize potential transformations that may have been applied to the raw input signal. Despite the simplicity of the proposed self-supervised technique (and of the network architecture), we show that it enables the convolutional model to learn high-level features that are useful for the end task of HAR. We exhaustively evaluate our approach in unsupervised learning, semi-supervised learning, and transfer learning settings on several publicly available datasets. The performance we achieve is consistently superior to or comparable with that of fully-supervised methods, and it is significantly better than that of traditional unsupervised learning methods such as autoencoders. Specifically, our self-supervised framework drastically improves the detection rate in the semi-supervised learning setting, i.e., when very few labeled instances are available for learning. Likewise, the features transferred from a different but related unlabeled dataset (MobiAct in our case) further improve the performance compared with merely training a model from scratch. Notably, these transferred representations even boost the performance of an activity recognizer in semi-supervised learning on the dataset (or task) of interest. Finally, canonical correlation analysis, saliency mapping, and t-SNE visualizations show that the representations of the self-supervised network are very similar to those learned by a fully-supervised model trained end-to-end with activity labels. We believe that, by utilizing more sophisticated layers and deeper architectures, the presented approach can further reduce the gap between unsupervised and supervised feature learning.
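The semi-supervised protocol summarized above can be sketched in a few lines: the self-supervised feature extractor is kept frozen, and only a small classifier is fitted on a handful of labeled examples per class. A fixed random projection with a ReLU stands in for the frozen convolutional encoder here; the dimensions and class count are illustrative assumptions.

```python
# Minimal sketch of the semi-supervised setting: frozen encoder + small
# classifier trained on 10 labeled windows per class (toy data throughout).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
encoder = rng.normal(size=(128, 32))  # stand-in for frozen conv features

def features(x):
    # "Frozen" encoder: a fixed transform, never updated by the classifier.
    return np.maximum(x @ encoder, 0.0)

X_lab = rng.normal(size=(30, 128))    # 3 activity classes x 10 labeled windows
y_lab = np.repeat(np.arange(3), 10)

clf = LogisticRegression(max_iter=1000).fit(features(X_lab), y_lab)
print(clf.score(features(X_lab), y_lab))
```

The paper's "Self-Supervised (FT)" variant additionally fine-tunes (part of) the encoder with the labeled data, rather than keeping it fully frozen.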

In this work, we provided a basis for self-supervision in smartphone-based HAR with only a few labeled data points. In the Internet of Things era, there are many exciting opportunities for future work in related areas, such as industrial manufacturing, the electrical grid, smart wearable technologies, and home automation. In particular, we believe that self-supervision is of immense value for automatically extracting generalizable representations in domains where labeled data are challenging to acquire but unlabeled data are available in vast quantities. We hope that the presented perspective on self-supervision inspires the development of additional approaches, specifically for the selection of appropriate auxiliary tasks (based on domain expertise) that enable the network to learn features useful for solving a particular problem. Likewise, combining self-supervision with neural architecture search is another crucial area of improvement that would automate the process of optimal model discovery. Another exciting avenue for future research is evaluating self-supervised representations on imbalanced activity datasets, where the number of classes is high and collecting even a few labeled data points for each activity class is not feasible. Finally, evaluation in a real-world setting (with the application deployed on real devices) is of prime importance to further understand the aspects that need improvement concerning computational, energy, and labeled-data requirements.

This work is funded by the SCOTT project. It has received funding from the Electronic Component Systems for European Leadership Joint Undertaking under grant agreement No. 737422. This Joint Undertaking receives support from the European Union’s Horizon 2020 research and innovation programme and Austria, Spain, Finland, Ireland, Sweden, Germany, Poland, Portugal, the Netherlands, Belgium, and Norway.
We thank Prof. Peter Baltus for a helpful discussion and anonymous reviewers for their insightful comments and suggestions. Various icons used in the figures are created by Anuar Zhumaev, Korokoro, Gregor Cresnar, Becris, Hea Poh Lin, AdbA Icons, Universal Icons, and Baboon designs from the Noun Project.


  • Agrawal et al. [2015] Pulkit Agrawal, Joao Carreira, and Jitendra Malik. Learning to see by moving. In Proceedings of the IEEE International Conference on Computer Vision, pages 37–45, 2015.
  • Anguita et al. [2013] Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra, and Jorge Luis Reyes-Ortiz. A public domain dataset for human activity recognition using smartphones. In ESANN, 2013.
  • Arandjelović and Zisserman [2017] Relja Arandjelović and Andrew Zisserman. Objects that sound. arXiv preprint arXiv:1712.06651, 2017.
  • Aytar et al. [2016] Yusuf Aytar, Carl Vondrick, and Antonio Torralba. Soundnet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems, pages 892–900, 2016.
  • Bai et al. [2018] Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.
  • Baldi [2012] Pierre Baldi. Autoencoders, unsupervised learning, and deep architectures. In Proceedings of ICML workshop on unsupervised and transfer learning, pages 37–49, 2012.
  • Batista et al. [2011] Gustavo EAPA Batista, Xiaoyue Wang, and Eamonn J Keogh. A complexity-invariant distance measure for time series. In Proceedings of the 2011 SIAM international conference on data mining, pages 699–710. SIAM, 2011.
  • Bengio et al. [2013] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
  • Bhattacharya et al. [2014] Sourav Bhattacharya, Petteri Nurmi, Nils Hammerla, and Thomas Plötz. Using unlabeled data in a sparse-coding framework for human activity recognition. Pervasive and Mobile Computing, 15:242–262, 2014.
  • Caruana [1997] Rich Caruana. Multitask learning. Machine learning, 28(1):41–75, 1997.
  • Chatzaki et al. [2016] Charikleia Chatzaki, Matthew Pediaditis, George Vavoulas, and Manolis Tsiknakis. Human daily activity and fall recognition using a smartphone’s acceleration sensor. In International Conference on Information and Communication Technologies for Ageing Well and e-Health, pages 100–118. Springer, 2016.
  • Cui et al. [2016] Zhicheng Cui, Wenlin Chen, and Yixin Chen. Multi-scale convolutional neural networks for time series classification. arXiv preprint arXiv:1603.06995, 2016.
  • Doersch and Zisserman [2017] Carl Doersch and Andrew Zisserman. Multi-task self-supervised visual learning. In The IEEE International Conference on Computer Vision (ICCV), 2017.
  • Doersch et al. [2015] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1422–1430, 2015.
  • Fernando et al. [2017] Basura Fernando, Hakan Bilen, Efstratios Gavves, and Stephen Gould. Self-supervised video representation learning with odd-one-out networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5729–5738. IEEE, 2017.
  • Figo et al. [2010] Davide Figo, Pedro C Diniz, Diogo R Ferreira, and João M Cardoso. Preprocessing techniques for context recognition from accelerometer data. Personal and Ubiquitous Computing, 14(7):645–662, 2010.
  • Georgiev et al. [2017] Petko Georgiev, Sourav Bhattacharya, Nicholas D Lane, and Cecilia Mascolo. Low-resource multi-task audio sensing for mobile and embedded devices via shared deep neural network representations. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 1(3):50, 2017.
  • Gidaris et al. [2018] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.
  • Gomez et al. [2017] Lluis Gomez, Yash Patel, Marçal Rusiñol, Dimosthenis Karatzas, and CV Jawahar. Self-supervised learning of visual features through embedding images into text topic spaces. arXiv preprint arXiv:1705.08631, 2017.
  • Hammerla et al. [2016] Nils Y Hammerla, Shane Halloran, and Thomas Ploetz. Deep, convolutional, and recurrent models for human activity recognition using wearables. arXiv preprint arXiv:1604.08880, 2016.
  • Hannun et al. [2019] Awni Y. Hannun, Pranav Rajpurkar, Masoumeh Haghpanahi, Geoffrey H. Tison, Codie Bourn, Mintu P. Turakhia, and Andrew Y. Ng. Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nature Medicine, 25(1):65–69, 2019. ISSN 1546-170X. doi: 10.1038/s41591-018-0268-3.
  • Hashimoto et al. [2017] Kazuma Hashimoto, Yoshimasa Tsuruoka, Richard Socher, et al. A joint many-task model: Growing a neural network for multiple nlp tasks. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1923–1933, 2017.
  • Howard and Ruder [2018] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 328–339, 2018.
  • Jenni and Favaro [2018] Simon Jenni and Paolo Favaro. Self-supervised feature learning by learning to spot artifacts. arXiv preprint arXiv:1806.05024, 2018.
  • Kendall et al. [2018] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7482–7491, 2018.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Korbar et al. [2018] Bruno Korbar, Du Tran, and Lorenzo Torresani. Cooperative learning of audio and video models from self-supervised synchronization. In Advances in Neural Information Processing Systems, pages 7774–7785, 2018.
  • Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • Kwapisz et al. [2011] Jennifer R Kwapisz, Gary M Weiss, and Samuel A Moore. Activity recognition using cell phone accelerometers. ACM SigKDD Explorations Newsletter, 12(2):74–82, 2011.
  • Larsson et al. [2017] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Colorization as a proxy task for visual understanding. In CVPR, volume 2, page 7, 2017.
  • LeCun et al. [1990] Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in neural information processing systems, pages 598–605, 1990.
  • LeCun et al. [2015] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
  • Lee et al. [2009] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th annual international conference on machine learning, pages 609–616. ACM, 2009.
  • Lee et al. [2017] Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Unsupervised representation learning by sorting sequences. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 667–676. IEEE, 2017.
  • Li et al. [2016] Chi Li, M Zeeshan Zia, Quoc-Huy Tran, Xiang Yu, Gregory D Hager, and Manmohan Chandraker. Deep supervision with shape concepts for occlusion-aware 3d object parsing. arXiv preprint arXiv:1612.02699, 2016.
  • Li et al. [2018] Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. arXiv preprint arXiv:1804.08838, 2018.
  • Li et al. [2014] Yongmou Li, Dianxi Shi, Bo Ding, and Dongbo Liu. Unsupervised feature learning for human activity recognition using smartphone sensors. In Mining Intelligence and Knowledge Exploration, pages 99–107. Springer, 2014.
  • Liu et al. [2016] Chang Liu, Yu Cao, Yan Luo, Guanling Chen, Vinod Vokkarane, and Yunsheng Ma. Deepfood: Deep learning-based food image recognition for computer-aided dietary assessment. In International Conference on Smart Homes and Health Telematics, pages 37–48. Springer, 2016.
  • Maaten and Hinton [2008] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
  • Malekzadeh et al. [2018] Mohammad Malekzadeh, Richard G Clegg, Andrea Cavallaro, and Hamed Haddadi. Protecting sensory data against sensitive inferences. In Proceedings of the 1st Workshop on Privacy by Design in Distributed Systems, page 2. ACM, 2018.
  • Micucci et al. [2017] Daniela Micucci, Marco Mobilio, and Paolo Napoletano. Unimib shar: A dataset for human activity recognition using acceleration data from smartphones. Applied Sciences, 7(10):1101, 2017.
  • Misra et al. [2016] Ishan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: unsupervised learning using temporal order verification. In European Conference on Computer Vision, pages 527–544. Springer, 2016.
  • Mohamed et al. [2012] Abdel-rahman Mohamed, George E Dahl, Geoffrey Hinton, et al. Acoustic modeling using deep belief networks. IEEE Trans. Audio, Speech & Language Processing, 20(1):14–22, 2012.
  • Morales and Roggen [2016] Francisco Javier Ordóñez Morales and Daniel Roggen. Deep convolutional feature transfer across mobile activity recognition domains, sensor modalities and locations. In Proceedings of the 2016 ACM International Symposium on Wearable Computers, pages 92–99. ACM, 2016.
  • Morcos et al. [2018] Ari S Morcos, David GT Barrett, Neil C Rabinowitz, and Matthew Botvinick. On the importance of single directions for generalization. arXiv preprint arXiv:1803.06959, 2018.
  • Nair and Hinton [2010] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010.
  • Noroozi and Favaro [2016] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pages 69–84. Springer, 2016.
  • Oh et al. [2018] Jeeheh Oh, Jiaxuan Wang, and Jenna Wiens. Learning to exploit invariances in clinical time-series data using sequence transformer networks. arXiv preprint arXiv:1808.06725, 2018.
  • Olah et al. [2018] Chris Olah, Arvind Satyanarayan, Ian Johnson, Shan Carter, Ludwig Schubert, Katherine Ye, and Alexander Mordvintsev. The building blocks of interpretability. Distill, 2018.
  • Oliver et al. [2018] Avital Oliver, Augustus Odena, Colin A Raffel, Ekin Dogus Cubuk, and Ian Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms. In Advances in Neural Information Processing Systems, pages 3235–3246, 2018.
  • Owens and Efros [2018] Andrew Owens and Alexei A Efros. Audio-visual scene analysis with self-supervised multisensory features. arXiv preprint arXiv:1804.03641, 2018.
  • Owens et al. [2016] Andrew Owens, Jiajun Wu, Josh H McDermott, William T Freeman, and Antonio Torralba. Ambient sound provides supervision for visual learning. In European Conference on Computer Vision, pages 801–816. Springer, 2016.
  • Pan et al. [2010] Sinno Jialin Pan, Qiang Yang, et al. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
  • Pathak et al. [2017] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML), volume 2017, 2017.
  • Plötz et al. [2011] Thomas Plötz, Nils Y Hammerla, and Patrick Olivier. Feature learning for activity recognition in ubiquitous computing. In IJCAI Proceedings-International Joint Conference on Artificial Intelligence, volume 22, page 1729, 2011.
  • Radu et al. [2018] Valentin Radu, Catherine Tong, Sourav Bhattacharya, Nicholas D Lane, Cecilia Mascolo, Mahesh K Marina, and Fahim Kawsar. Multimodal deep learning for activity and context recognition. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 1(4):157, 2018.
  • Raghu et al. [2017] Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In Advances in Neural Information Processing Systems, pages 6076–6085, 2017.
  • Raina et al. [2007] Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y Ng. Self-taught learning: transfer learning from unlabeled data. In Proceedings of the 24th international conference on Machine learning, pages 759–766. ACM, 2007.
  • Razavian et al. [2016] Narges Razavian, Jake Marcus, and David Sontag. Multi-task prediction of disease onsets from longitudinal laboratory tests. In Machine Learning for Healthcare Conference, pages 73–100, 2016.
  • Saeed and Trajanovski [2017] Aaqib Saeed and Stojan Trajanovski. Personalized driver stress detection with multi-task neural networks using physiological signals. arXiv preprint arXiv:1711.06116, 2017.
  • Saeed et al. [2018] Aaqib Saeed, Tanir Ozcelebi, and Johan Lukkien. Synthesizing and reconstructing missing sensory modalities in behavioral context recognition. Sensors, 18(9):2967, 2018.
  • Sharif Razavian et al. [2014] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. Cnn features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 806–813, 2014.
  • Simonyan et al. [2013] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
  • Stisen et al. [2015] Allan Stisen, Henrik Blunck, Sourav Bhattacharya, Thor Siiger Prentow, Mikkel Baun Kjærgaard, Anind Dey, Tobias Sonne, and Mads Møller Jensen. Smart devices are different: Assessing and mitigating mobile sensing heterogeneities for activity recognition. In Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems, pages 127–140. ACM, 2015.
  • Sutskever et al. [2014] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
  • Taigman et al. [2014] Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1701–1708, 2014.
  • Um et al. [2017] Terry T Um, Franz MJ Pfister, Daniel Pichler, Satoshi Endo, Muriel Lang, Sandra Hirche, Urban Fietzek, and Dana Kulić. Data augmentation of wearable sensor data for parkinson’s disease monitoring using convolutional neural networks. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, pages 216–220. ACM, 2017.
  • Wang et al. [2018a] Jindong Wang, Yiqiang Chen, Shuji Hao, Xiaohui Peng, and Lisha Hu. Deep learning for sensor-based activity recognition: A survey. Pattern Recognition Letters, 2018a.
  • Wang et al. [2018b] Jindong Wang, Vincent W Zheng, Yiqiang Chen, and Meiyu Huang. Deep transfer learning for cross-domain activity recognition. In Proceedings of the 3rd International Conference on Crowd Science and Engineering, page 16. ACM, 2018b.
  • Wawrzyniak and Niemiro [2015] S. Wawrzyniak and W. Niemiro. Clustering approach to the problem of human activity recognition using motion data. In 2015 Federated Conference on Computer Science and Information Systems (FedCSIS), pages 411–416, Sep. 2015. doi: 10.15439/2015F424.
  • Wei et al. [2018] Donglai Wei, Joseph Lim, Andrew Zisserman, and William T Freeman. Learning and using the arrow of time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8052–8060, 2018.
  • Yang et al. [2015] Jianbo Yang, Minh Nhut Nguyen, Phyo Phyo San, Xiaoli Li, and Shonali Krishnaswamy. Deep convolutional neural networks on multichannel time series for human activity recognition. In Ijcai, volume 15, pages 3995–4001, 2015.
  • Yao et al. [2018] Shuochao Yao, Yiran Zhao, Huajie Shao, Chao Zhang, Aston Zhang, Shaohan Hu, Dongxin Liu, Shengzhong Liu, Lu Su, and Tarek Abdelzaher. Sensegan: Enabling deep learning for internet of things with a semi-supervised framework. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 2(3):144, 2018.
  • Zhang et al. [2017] Richard Zhang, Phillip Isola, and Alexei A Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In CVPR, volume 1, page 5, 2017.


(a) HHAR
Random Init.          0.3429±0.1395   0.302±0.0465    0.2023±0.0333   0.1611±0.0536
Supervised            0.8403±0.0349   0.816±0.0518    0.8076±0.0612   0.7788±0.0624
Autoencoder           0.6068±0.2149   0.594±0.1858    0.5474±0.2182   0.5159±0.2208
Self-Supervised       0.8454±0.0378   0.8239±0.0462   0.8153±0.053    0.7881±0.0556
Self-Supervised (FT)  0.8439±0.0753   0.8101±0.1004   0.8038±0.1072   0.7719±0.1204

(b) UniMiB
Random Init.          0.416±0.0307    0.3615±0.0503   0.2875±0.0722   0.2281±0.0622
Supervised            0.805±0.0095    0.7899±0.0153   0.7866±0.0165   0.7576±0.0181
Autoencoder           0.5989±0.0313   0.5743±0.0192   0.5494±0.0227   0.5009±0.0242
Self-Supervised       0.77±0.0211     0.7618±0.0191   0.7577±0.0208   0.724±0.0218
Self-Supervised (FT)  0.8396±0.0226   0.8311±0.0269   0.8285±0.0283   0.8046±0.0309

(c)
Random Init.          0.6147±0.1845   0.5019±0.0999   0.4297±0.1141   0.3872±0.1277
Supervised            0.9068±0.0332   0.9035±0.0366   0.903±0.0377    0.8827±0.0446
Autoencoder           0.8745±0.0367   0.8485±0.0604   0.8461±0.0642   0.8161±0.0734
Self-Supervised       0.9054±0.0273   0.8919±0.0388   0.889±0.0444    0.8688±0.0472
Self-Supervised (FT)  0.9125±0.0403   0.906±0.0473    0.9046±0.049    0.8859±0.0573

(d) MobiAct
Random Init.          0.4814±0.1405   0.3828±0.0417   0.3018±0.0422   0.191±0.0435
Supervised            0.9121±0.0118   0.9029±0.016    0.9043±0.0153   0.876±0.0198
Autoencoder           0.7488±0.0402   0.749±0.0398    0.7323±0.0408   0.6692±0.0542
Self-Supervised       0.8977±0.0128   0.8838±0.0133   0.8868±0.013    0.8521±0.0165
Self-Supervised (FT)  0.9185±0.0056   0.9129±0.0077   0.9127±0.0067   0.8883±0.0101

(e)
Random Init.          0.6074±0.0555   0.3231±0.0958   0.3318±0.0885   0.1828±0.0623
Supervised            0.8983±0.0185   0.8799±0.0331   0.8846±0.0297   0.8359±0.0452
Autoencoder           0.6878±0.093    0.6868±0.0715   0.6671±0.0797   0.5571±0.1125
Self-Supervised       0.8711±0.0389   0.8369±0.0636   0.8462±0.0556   0.7802±0.0826
Self-Supervised (FT)  0.8842±0.0285   0.8518±0.048    0.8597±0.0409   0.7995±0.0645

(f) MotionSense
Random Init.          0.5152±0.0997   0.4237±0.1006   0.3451±0.1069   0.2713±0.1246
Supervised            0.9332±0.0144   0.9219±0.0186   0.9242±0.0172   0.9034±0.0231
Autoencoder           0.8297±0.0658   0.8194±0.0734   0.818±0.075     0.7767±0.0892
Self-Supervised       0.9261±0.0189   0.9176±0.0195   0.9189±0.019    0.8979±0.0244
Self-Supervised (FT)  0.9476±0.0174   0.9373±0.028    0.9391±0.0254   0.9225±0.0345

Table 4. Evaluating self-supervised representations with (user-split based) 5-fold cross-validation for activity recognition (values are mean ± standard deviation). We perform this assessment based on a user split of the data with no overlap between training and test sets, i.e., distinct users’ data are used for training and testing of the models. The reported results are averaged over the 5 folds.
Figure 11. Convergence analysis of the transformation recognition tasks. We plot the kappa score of the self-supervised tasks (i.e., transformation prediction) as a function of the training epochs. To produce the kappa curves, a snapshot of the TPN model is saved every second epoch until the defined number of training epochs is reached. For each saved network, we evaluate its performance on the self-supervised data obtained by processing the corresponding test sets. Note that the TPN never sees any test-set data during its learning phase.
Figure 12. Weighted F-score: generalization of the self-supervised learned features under the semi-supervised setting. The reported results are averaged over independent runs for each of the evaluated approaches; see Section 4.3.4 for details.
Figure 13. Weighted F-score: assessment of the self-supervised features transferred from a different but related dataset (MobiAct) under the semi-supervised setting. The reported results are averaged over independent runs for each of the evaluated approaches; see Section 4.3.5 for details.
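For reference, the two evaluation metrics used throughout (Cohen's kappa and the weighted F-score) can be computed with scikit-learn; the label vectors below are purely illustrative.

```python
# Computing Cohen's kappa and the weighted F-score on toy predictions.
from sklearn.metrics import cohen_kappa_score, f1_score

y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 0, 1, 2, 2, 2, 1, 1]

kappa = cohen_kappa_score(y_true, y_pred)            # chance-corrected agreement
f1w = f1_score(y_true, y_pred, average="weighted")   # per-class F1, support-weighted
print(round(kappa, 3), round(f1w, 3))  # 0.619 0.75
```

Kappa corrects for chance agreement, which is why it is a stricter summary than accuracy on imbalanced activity datasets.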