Federated Multi-task Hierarchical Attention Model for Sensor Analytics

Sensors are an integral part of modern Internet of Things (IoT) applications. There is a critical need for the analysis of heterogeneous multivariate temporal data obtained from the individual sensors of these systems. In this paper we particularly focus on the problem of the scarce amount of training data available per sensor. We propose a novel federated multi-task hierarchical attention model (FATHOM) that jointly trains classification/regression models from multiple sensors. The attention mechanism of the proposed model seeks to extract feature representations from the input and learn a shared representation focused on time dimensions across multiple sensors. The underlying temporal and non-linear relationships are modeled using a combination of attention mechanism and long-short term memory (LSTM) networks. We find that our proposed method outperforms a wide range of competitive baselines in both classification and regression settings on activity recognition and environment monitoring datasets. We further provide visualization of feature representations learned by our model at the input sensor level and central time level.



1 Introduction

Ubiquitous sensors seek to improve the quality of everyday life through pervasively interconnected objects. As an example, consumer-centric healthcare devices provide accurate and personalized feedback on individuals’ health conditions. A wide array of sensors in the form of wearable devices (e.g., clothing and wrist-worn devices), smartphones, and infrastructure components (e.g., low-cost sensors, cameras, WiFi, and workstations) are the chief enablers. These solutions, commonly referred to as the Internet of Things (IoT), allow for fine-grained sensing and inference of users’ context, physiological signals, and even mental health states. These sensing and detection capabilities, coupled with advanced data analytics, are leading to intelligent intervention and persuasion techniques that provide an appealing end-to-end solution for various domains, e.g., environment monitoring, healthcare, education, and workplace management.

Most data generated by these sensors are multivariate temporal sequences. We model the sequential data from sensors using recurrent neural networks (RNNs) [Rumelhart et al.1988] adapted for handling long-term dependencies, commonly referred to as long short-term memory (LSTM) models [Hochreiter and Schmidhuber1997]. Further, individual sensors usually do not have enough training data for learning generalizable models. In this study we aim to address this particular challenge within the framework of federated multi-task learning (MTL). This allows for multiple benefits: (i) reduced communication costs by not sending large volumes of data to a central server, (ii) distributed learning, and (iii) privacy guarantees when the data resides only on the client. Similar to prior work on federated multi-task learning [Smith et al.2017], data associated with each task stays local. Individual models are learned for each task first, and a central node controls the communication between local model parameters and globally shared parameters. We present a novel federated multi-task approach that uses a hierarchical attention mechanism to learn a common representation among multiple tasks. The inter-feature correlations of each individual task are captured by sensor-specific attention layers applied to the input sensors’ time series. Meanwhile, the temporal correlations are captured by attending to all tasks across the time dimension. The scope of this paper involves modeling the input from multiple heterogeneous sensors across multiple users. The proposed algorithm is federated in the sense that no data leaves the user. However, we assume that data from each of the sensors are available at the same time.

The key contributions of this paper can be summarized as follows:

  • We propose a novel federated multi-task hierarchical attention model (FATHOM) that extracts feature representations with multiple attention aspects. This leads to improved classification and regression performance for input temporal data from multiple sensors.

  • Our proposed approach outperforms a wide range of baselines on two multi-modal sensor datasets from different domains with multi-binary labels or multi-continuous labels.

  • We present a framework to extract and visualize key feature representations from sequential data.

2 Related Work

2.1 Multi-task learning

Multi-task learning (MTL) is designed for the simultaneous training of multiple related prediction tasks. Leveraging common information across related tasks has been shown to be effective in improving the generalization performance of each task [Caruana1997, Evgeniou et al.2005, Bonilla et al.2007, Pentina and Lampert2017]. MTL is particularly useful when there are a number of related tasks but some tasks have scarce amounts of available training data. Many researchers have investigated MTL from varied perspectives [Chen et al.2018, Zhou et al.2012]. Task relationships are modeled by sharing layers/units of neural networks [Caruana1997], sharing a low-dimensional subspace [Argyriou et al.2007], or assuming a clustering among tasks [Zhou et al.2011, Jacob et al.2009]. In this work we do not make any assumptions on task relationships beforehand; we learn the common representation with attention mechanisms directly from the data.

Specifically, for sensor analytics, a multi-task multilayer perceptron (MLP) model [Vaizman et al.2018b] was developed to recognize different human activities from mobile sensors. By adapting the MLP model to fit uncontrolled in-the-wild unbalanced data, this approach outperformed a standard logistic regression (LR) model [Vaizman et al.2017]. However, the MLP and LR approaches do not leverage the underlying temporal dependencies and inter-feature correlations of sensor data.

2.2 Federated learning

The goal of federated learning is to train a centralized model while the training data remains distributed across separate client nodes. Prior research on federated learning aims at learning a single model across the network [McMahan et al.2016, Konečnỳ et al.2016, Konečnỳ et al.2015]. Different from these works, Smith et al. [Smith et al.2017] provided an approach to solve statistical challenges in the federated setting of multi-task learning, with the key objectives of addressing high communication costs, stragglers (nodes with less computation power), and reliability. We adopt federated multi-task learning using a hierarchical attention mechanism, and focus on improving model performance across all local tasks.

2.3 Attention-based deep network

Fundamentally, neural networks allocate importance to input features through the weights of the learned model. In the context of deep learning, an attention mechanism allows a network to assign different levels of importance to various inputs by adjusting the weights, which leads to a better feature representation. Attention approaches can be roughly divided into global attention and local attention [Luong et al.2015]. Global attention is akin to soft attention [Bahdanau et al.2014, Xu et al.2015, Yao et al.2015], where the alignment weights are learned and placed “softly” over all patches in the source data. Local attention instead selects one patch of the data to attend to at a time.

Multi-level attentions are studied for improving document classification [Yang et al.2016] and predicting spatio-temporal data [Liang et al.2018].

Our proposed hierarchical attention approach learns individual feature correlations within each local task. It also learns shared feature representations across distributed tasks with a central attention mechanism that focuses on time steps. By passing this central time attention back to each local task, we are able to learn task-specific representations.

3 Methods

3.1 Problem Definition and Notations

Suppose we are given $K$ tasks with their input data $\{X^1, \dots, X^K\}$, generated by $K$ distributed nodes. Each task $k$ consists of $N_k$ instances collected from multiple sensors and each sensor provides multiple features. Assuming we have $D$ features in total from all the sensors, $\mathbf{x}_i^k \in \mathbb{R}^D$ represents the features for the $i$-th instance, $\mathbf{y}_i^k$ is the corresponding label vector, and $M_k$ is the number of labels in task $k$. With labeling function $f_k$, the goal is to learn a model from a hypothesis set using the training data that minimizes the average error across all the tasks:

$$\min_{f_1, \dots, f_K} \; \frac{1}{K} \sum_{k=1}^{K} \frac{1}{N_k} \sum_{i=1}^{N_k} \mathcal{L}\big(f_k(\mathbf{x}_i^k), \mathbf{y}_i^k\big)$$

For multi-label classification problems, we train a network to minimize the cross-entropy of the predicted and true distributions for each task $k$ with $N_k$ instances, where each instance has $M_k$ labels:

$$\mathcal{L}_k = -\frac{1}{N_k} \sum_{i=1}^{N_k} \sum_{m=1}^{M_k} \Big[ y_{i,m}^k \log \hat{y}_{i,m}^k + \big(1 - y_{i,m}^k\big) \log\big(1 - \hat{y}_{i,m}^k\big) \Big]$$

Here $\mathcal{L}$ is the loss function, $\hat{y}_{i,m}^k$ denotes the predicted labels, and $y_{i,m}^k$ the true labels. For multi-output regression tasks, we train the networks to minimize the mean absolute error of the predicted and true outputs for each task $k$ with $N_k$ instances, where each instance has $M_k$ outputs:

$$\mathcal{L}_k = \frac{1}{N_k} \sum_{i=1}^{N_k} \sum_{m=1}^{M_k} \big| y_{i,m}^k - \hat{y}_{i,m}^k \big|$$
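As an illustration of these objectives, the two per-task losses and the federated average can be sketched in a few lines of NumPy. The function names and the per-label binary form of the cross-entropy are our own choices for the sketch, not the authors' implementation:

```python
import numpy as np

def task_cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy for one task's multi-binary labels, averaged over instances.

    y_true, y_pred: (N, M) arrays; y_pred holds predicted probabilities.
    """
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # guard against log(0)
    per_label = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(per_label.sum(axis=1).mean())

def task_mae(y_true, y_pred):
    """Mean absolute error for one multi-output regression task."""
    return float(np.abs(y_true - y_pred).mean())

def federated_objective(task_losses):
    """Average of the per-task losses across all K tasks."""
    return float(np.mean(task_losses))
```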
3.2 Proposed Model

Notation                                 Meaning
$K$                                      # of tasks
$D$                                      # of features in each task
$M_k$                                    # of labels in task $k$
$X^k \in \mathbb{R}^{T \times D}$        input of task $k$ with window size $T$
$\alpha$, $\beta$                        attention weights
$\mathbf{c}_s$                           sensor-specific level context vector
$\mathbf{u}$                             central time context vector
Table 1: Notations. Lower case letters represent scalars, bold upper case letters represent matrices, bold lower case letters represent vectors.

Our proposed model includes two main components: 1) Sensor-specific attention is applied directly on the input data of each task to learn the importance of each feature, followed by an LSTM layer that learns the captured feature representations. 2) A central time attention is applied on the time steps of the concatenated tasks to capture the importance of time across all tasks. A final LSTM layer captures the learned central feature representations, followed by two fully connected layers that predict the task labels. Our method is detailed in Algorithm 1.

In each task, the input series has $D$ features. The sensor-specific attention at each task node obtains the attention weight of each series within a time-step window of length $T$ using a softmax function. It then gets the attention distribution over the feature dimension by multiplying the raw features of each time-step window with the attention weights. The attention vector is obtained by passing the attention distribution to one fully connected layer with a nonlinear activation. The obtained attention vector is fed into an LSTM layer to further capture the learned feature representation. The central time attention aims to extract a shared representation across all input tasks in the time window. After concatenating all hidden states of the LSTM layers of each task, we apply a flatten function, a nonlinear activation, and a softmax function to get the shared attention weights. By multiplying the raw features at each time-step window of each task with the shared attention weights, we get the attention score of each task. A second LSTM layer learns the second pass after the central time attention is applied to each task. Finally a classifier is used to predict the labels of each task. An illustration of our model structure can be found in Figure 1.


Figure 1: Architecture illustration of the proposed Federated Multi-task Hierarchical Attention Model (FATHOM). ‘FC’ indicates a fully connected layer; the concatenation and element-wise multiplication operators are marked in the diagram.

In the following sections we will describe each part of our model in detail.

Input: data $\{X^1, \dots, X^K\}$ of $K$ tasks

1:  for $k = 1, \dots, K$ in parallel over nodes do
2:     calculate attention vector $A^k$
3:     pass $A^k$ to an LSTM layer to get hidden states $H^k$
4:  end for
5:  calculate $H = [H^1; \dots; H^K]$
6:  compute central attention based on $H$
7:  for $k = 1, \dots, K$ in parallel over nodes do
8:     update attention vector $G^k$
9:  end for
10:  return label matrices $\{\hat{Y}^1, \dots, \hat{Y}^K\}$
Algorithm 1 Learning for FATHOM
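The data flow of Algorithm 1 can be sketched end to end with NumPy stand-ins. Here `lstm_stub` is a hypothetical placeholder for the real recurrent layer, and the attention computations are simplified (no trainable weights), so this shows only the federated structure, not the trained model:

```python
import numpy as np

def sensor_attention_local(X):
    """Local node step: softmax over features, reweight the raw window."""
    w = np.exp(X) / np.exp(X).sum(axis=1, keepdims=True)
    return np.tanh(X * w)

def lstm_stub(A):
    """Placeholder for an LSTM layer; a real model runs a recurrent cell."""
    return np.tanh(A)

def fathom_round(tasks):
    """One pass of the Algorithm 1 flow (sketch).

    tasks: list of K arrays, each (T, D_k) -- raw windows held at each node.
    """
    # each node computes its attention vector and hidden states locally
    hidden = [lstm_stub(sensor_attention_local(X)) for X in tasks]
    # the central node concatenates hidden states across tasks
    H = np.concatenate(hidden, axis=1)            # (T, sum of D_k)
    # central attention over the time axis, shared by all tasks
    scores = H.sum(axis=1)                        # (T,)
    beta = np.exp(scores) / np.exp(scores).sum()
    # each node reweights its own raw input by the shared time attention
    return [X * beta[:, None] for X in tasks]
```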

3.2.1 Sensor-specific Attention

Feature extraction at the input sensor level can increase the probability of capturing features that are related to the target labels. These features should be given higher weights than other features when computing a sensor-specific representation. Since it is hard to know in advance which features carry predictive information, we choose a global attention mechanism [Luong et al.2015] to capture feature representations by attending to all input features from each task.

Assume a given task $k$ with input series $X^k = [\mathbf{x}_1, \dots, \mathbf{x}_D] \in \mathbb{R}^{T \times D}$, where $T$ is the time-step window size and $\mathbf{x}_j \in \mathbb{R}^T$ is the $j$-th series within the window. First we transform each series of $X^k$ using a fully connected layer with $T$ units to obtain a hidden representation $\mathbf{h}_j$:

$$\mathbf{h}_j = \tanh(W_s \mathbf{x}_j + \mathbf{b}_s)$$

in which $W_s$ is the trainable weight matrix. We compute the attention weight $\alpha_{t,j}$ of the $j$-th input feature at time step $t$ by applying a softmax function:

$$\alpha_{t,j} = \frac{\exp(h_{t,j})}{\sum_{l=1}^{D} \exp(h_{t,l})}$$

We measure the importance of features by computing the context vector $\mathbf{c}_s$ with the element-wise multiplication of $X^k$ and the attention weights $\alpha$. Then a $\tanh$ activation is applied to get the final attention vector $A^k$:

$$A^k = \tanh\big(X^k \otimes \alpha\big)$$
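A minimal NumPy sketch of this sensor-specific attention step, assuming the reconstruction above (a per-series fully connected layer, a softmax over the feature dimension, and a final tanh); in the actual model the weights would be trained:

```python
import numpy as np

def sensor_specific_attention(X, W_s, b_s):
    """Sensor-specific attention for one task (illustrative sketch).

    X   : (T, D) input window, one column per sensor series.
    W_s : (T, T) trainable weights of the fully connected layer.
    b_s : (T,)  trainable bias.
    """
    H = np.tanh(W_s @ X + b_s[:, None])     # hidden representation, (T, D)
    # softmax across the D features at each time step
    alpha = np.exp(H) / np.exp(H).sum(axis=1, keepdims=True)
    C = X * alpha                           # context: reweighted input
    return np.tanh(C)                       # final attention vector
```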
3.2.2 Central Time Attention

The attention distribution captured at the sensor-specific level focuses on a specific part of the features, which can only reflect the label information at the current timestamp. However, for time series data there is a tight temporal correlation between instances, and it is essential to capture this hidden temporal information. The central attention component aims at learning a shared representation across all tasks at each time step. Let $H^k$ be the hidden representation of task $k$ after the first LSTM layer. First we concatenate the hidden representations across tasks to get the shared hidden representation $H$:

$$H = [H^1; H^2; \dots; H^K]$$

We pass the shared hidden representation to a flatten layer to get a flattened hidden representation $\mathbf{h}_f$:

$$\mathbf{h}_f = \mathrm{flatten}(H)$$

Different from the input-level attention, we apply a nonlinearity before the softmax. By transforming $\mathbf{h}_f$ using a fully connected layer with $T$ units, an activation is applied to obtain the time-step level context vector $\mathbf{u}$:

$$\mathbf{u} = \mathrm{ReLU}(W_c \mathbf{h}_f + \mathbf{b}_c)$$

Then we compute the attention weight $\beta_t$ using a softmax function for each time step $t$:

$$\beta_t = \frac{\exp(u_t)}{\sum_{s=1}^{T} \exp(u_s)}$$

The attention score is obtained by normalizing the context score at each time step $t$. We then repeat the attention weight $D$ times to get the attention matrix across the input dimension; with a permutation we obtain the attention matrix $B \in \mathbb{R}^{T \times D}$.

We measure the importance of each time step by computing the attention matrix $G^k$ with an element-wise multiplication of $X^k$ and the attention matrix $B$:

$$G^k = X^k \otimes B$$

Here we obtain the extracted hidden representation across all tasks at the time-step level. By feeding $G^k$ of task $k$ to an LSTM layer and two fully connected layers, we get the predicted labels of each task.
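The central time attention can be sketched the same way; the shapes of `W_c` and `b_c` are our assumption for a fully connected layer that maps the flattened shared states to one score per time step:

```python
import numpy as np

def central_time_attention(hidden_states, raw_inputs, W_c, b_c):
    """Central time attention shared across K tasks (illustrative sketch).

    hidden_states : list of K arrays, each (T, H), from the first LSTM layer.
    raw_inputs    : list of K arrays, each (T, D_k), the raw task windows.
    W_c, b_c      : (T, T*K*H) weights and (T,) bias of the central FC layer.
    """
    H = np.concatenate(hidden_states, axis=1)   # shared representation
    h_flat = H.reshape(-1)                      # flatten layer
    u = np.maximum(0.0, W_c @ h_flat + b_c)     # ReLU context vector, (T,)
    beta = np.exp(u) / np.exp(u).sum()          # softmax over time steps
    # repeat the T time weights across each task's feature dimension and
    # reweight the raw inputs element-wise
    return [X * beta[:, None] for X in raw_inputs]
```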

4 Experiments

4.1 Datasets

To evaluate the performance of our approach we use several real-world datasets that have previously been used in multi-task learning frameworks for sensor analytics.

  • ExtraSensory Dataset (http://extrasensory.ucsd.edu/): Mobile phone sensor data (e.g., high-frequency motion-reactive sensors, location services, audio, watch compass) and watch sensor data (accelerometer) collected from 60 users performing any of 51 activities [Vaizman et al.2018a]. We select 40 users with at least 3000 samples and use the provided 225-length feature vectors of time and frequency domain variables generated for each instance. We model each user as a separate task and predict their activities (e.g., walking, talking, running).

  • Air Quality Data (https://biendata.com/competition/kdd_2018/data/): Weather data collected from multiple weather sensors (e.g., thermometer, barometer) in 9 areas of Beijing from Jan 2017 to Jan 2018. We model each area as a separate task and use the Observed Weather Data to predict the measures of air pollutants (e.g., PM2.5, PM10) from May 1st, 2018 to May 31st, 2018.

4.2 Comparative Methods

We compared the proposed FATHOM approach to several single-task learning and multi-task learning approaches. In particular, we consider the Logistic Regression (LR) model in [Vaizman et al.2017] as the single-task learning baseline. We also use the multi-layer perceptron MLP(16,16) multi-task model in [Vaizman et al.2018b].

Convolutional Neural Networks (CNNs) are able to extract short-term basic patterns along the time dimension and find local dependencies among features. This is represented by the Convolutional Recurrent Neural Network (CRNN) [Cirstea et al.2018], where we feed each input task to a separate CNN layer and replace the RNN layer with an LSTM layer.

Besides our proposed FATHOM approach, we provide several other models for comparison.

  • We use a single LSTM layer for single-task prediction as a comparison with LR approach. We refer to this by S-LSTM.

  • We replace MLP(16,16) with a one-layer LSTM, which can better capture long-term dependencies. We refer to this model as the Multi-task LSTM model (M-LSTM).

  • We also perform an ablation study to assess the strengths of the different attention layers introduced in our proposed FATHOM. FATHOM-sa is without the Sensor-specific Attention and FATHOM-ca is without the Central Time Attention.

We compare the performance of these models with the proposed FATHOM model. For the classification dataset, the labels are highly unbalanced and hence we report the F1 score, precision, recall, and Balanced Accuracy (BA) from [Vaizman et al.2018b]. For the regression dataset we evaluate the performance using the symmetric mean absolute percentage error (SMAPE). For each experiment, we split our dataset into training, validation, and test data in the proportions of 60%, 20%, and 20%, respectively.
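For reference, the two headline metrics can be sketched as follows. Note that SMAPE has several conventions in the literature; this sketch uses the mean absolute error over the average magnitude (range 0 to 2), which may differ from the exact variant used in the experiments, and the balanced-accuracy helper treats a single binary label:

```python
import numpy as np

def smape(y_true, y_pred, eps=1e-8):
    """Symmetric mean absolute percentage error; 0 means a perfect fit."""
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0 + eps
    return float(np.mean(np.abs(y_pred - y_true) / denom))

def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for one binary label."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return float((sensitivity + specificity) / 2.0)
```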

4.3 Hyper-parameters

Based on the performance on the validation set we choose the best group of parameters, retrain a model with the identified parameters, and report results on the test set. We set the hidden units of both LSTM layers to 64, and both the regular dropout and the recurrent dropout to 0.25. We also impose constraints on the weights within the LSTM nodes to further reduce overfitting. We use a batch size of 60 for training and an initial learning rate of 0.001. We employ early stopping with a fixed patience value. For classification tasks, we use categorical cross-entropy to monitor the loss within the Adam optimization algorithm. For regression tasks, we change the loss to mean absolute error (MAE).

Methods                             ExtraSensory             Air Quality
                                    Pr    Rec   F1    BA     SMAPE
LR [Vaizman et al.2017]             0.57  0.60  0.52  0.72   1.23
MLP(16,16) [Vaizman et al.2018b]    0.55  0.61  0.58  0.76   0.65
S-LSTM                              0.79  0.71  0.74  0.84   0.52
M-LSTM                              0.45  0.62  0.52  0.77   0.66
CRNN [Cirstea et al.2018]           0.43  0.68  0.54  0.78   0.64
FATHOM (Proposed)                   0.89  0.77  0.82  0.88   0.45
Table 2: Comparative Performance on the ExtraSensory Dataset (Classification) and the Air Quality Dataset (Regression). M-LSTM and S-LSTM use a one-layer LSTM for multi-task learning and single-task learning, respectively. Pr and Rec denote Precision and Recall, respectively.

Methods        ExtraSensory             Air Quality
               Pr    Rec   F1    BA     SMAPE
FATHOM-ca      0.50  0.61  0.54  0.77   0.73
FATHOM-sa      0.80  0.69  0.74  0.84   0.51
FATHOM         0.89  0.77  0.82  0.88   0.45
Table 3: Ablation Study Showing Performance of FATHOM Variants with Different Attention Aspects. FATHOM-sa is without the Sensor-specific Attention and FATHOM-ca is without the Central Time Attention. Pr and Rec denote Precision and Recall, respectively.

5 Results

5.1 Comparative Performance

We demonstrate the prediction performance of the proposed FATHOM approach in comparison to different baseline approaches. Table 2 shows the classification and regression performance of the different models on the benchmarks. We observe that FATHOM significantly outperforms the other models in terms of precision, recall, F1, and Balanced Accuracy for the classification tasks. Given the highly unbalanced nature of the ExtraSensory Dataset, it is promising to observe that FATHOM yields more true positives and fewer false positives. The proposed model outperforms all of the other models by 0.08-0.30 in F1 score and by 0.04-0.16 in Balanced Accuracy. For the CRNN model, the CNN layer can extract distinctive features of the correlated multiple time series, which acts similarly to the attention mechanism. We observe that CRNN performs slightly better than the multi-task LSTM (M-LSTM) and LR approaches, but not better than the other models. The feature correlations captured by the CNN on each task are loose and cannot represent temporal dependencies effectively.

From the Table 2 results on the Air Quality Data we also note that FATHOM outperforms all of the other models, lowering the SMAPE by 0.07-0.78.

5.1.1 Attention vs. No attention

To further study the function of the two attention mechanisms in our model we perform an ablation study, removing either the local sensor-specific attention layer or the central time attention layer; we denote the resulting variants by FATHOM-sa and FATHOM-ca, respectively. Table 3 shows the prediction performance of these two variants of FATHOM. We observe that FATHOM-sa, which retains the central time attention, still achieves very good performance in comparison to the other baseline approaches: it captures both the temporal correlation and a common feature representation across all tasks. However, FATHOM-ca does not perform well and is similar to the CRNN approach; the feature representations learned by the sensor-specific layer alone fail to leverage the temporal correlations in the input data. The full FATHOM model with both attention mechanisms outperforms FATHOM-ca and FATHOM-sa by 0.28 and 0.08 in F1 score on the ExtraSensory dataset, respectively.

Figure 2: Case study of attention weight spikes in the feature dimension captured by the sensor-specific attention. A, B, and C represent three different instances at certain timestamps, together with their corresponding labels.

5.1.2 Single-task versus Multi-task learning

We seek to assess the benefits of multi-task learning in comparison to single-task learning models (see Table 2). The single-task LR model has the worst F1 and Balanced Accuracy scores (and a high error for the regression task). The single-task S-LSTM model, which captures temporal dependencies, outperforms the LR model. However, the jointly trained multi-task LSTM model (M-LSTM) performs worse than the S-LSTM model. In general, multi-task learning approaches should improve classification/regression performance, but they fail when the relationships between the tasks are not modeled well. FATHOM, on the other hand, outperforms the single-task learning models because it is able to identify specific features that are important (and not noisy) across the different tasks and across the temporal domain.

5.2 Feature Extraction

To better understand the attention mechanisms and their ability to weigh certain features across the different tasks, we present several qualitative studies. Figure 2 shows the burstiness of features (spikes) captured by the sensor-specific attention for three instances from three different users (tasks) at three different time points. We observe a high correlation between the feature spikes and the corresponding labels. For example, for person A, who is walking at home with a phone in pocket, the captured related sensors are the phone accelerometer, magnetometer, location, microphone, and time. For person B, who is walking but talking with the phone on a table, there is no change in the magnetometer, no acceleration of the phone, and a lower voice volume (there is no position change of the phone, and the device cannot capture a louder voice from a table). For person C, who is driving a car and talking with friends, the sensor-specific attention captures the sensors correlated with the corresponding group activities.

Figure 3: Sensor-specific attention matrix of two users (tasks) from the ExtraSensory Dataset. Each column is the attention vector over the input series within a 30-minute time window.
Figure 4: Sensor-specific attention matrix of one task from the Air Quality Data. (a) is a prediction over a 20-hour window, (b) is a prediction over a 240-hour window. Each column is the attention vector over the input series.

We take two tasks from each dataset to visualize the variation of the attention vectors across the feature dimension in Figures 3 and 4, respectively. Figure 3 shows the attention weight matrices of two different users. The user of matrix (a) first lies down, then walks and talks with friends on a phone in pocket. The captured highly related sensors are the phone gyroscope, watch accelerometer, and microphone. The user of matrix (b) first grooms and gets dressed, then stays in a lab. We find that the watch and phone accelerometers have a strong correlation with body movement, while the microphone is directly correlated with voice in the surroundings.

The continuous value predictions for the Air Quality Data include air pollutants such as PM2.5, PM10, CO, NO2, O3, and SO2. From the attention weight matrix shown in Figure 4 we observe that in a prediction window of 20 hours, temperature and humidity have the highest weights amongst all the input features. Recall that the attention weights semantically indicate the relative importance of each local contributing feature series. We find that for short-term prediction, temperature and humidity affect the air pollutants most, while wind speed and weather have relatively lighter influence. This is because people in Beijing consume fuel for heating in the winter and the humidity is usually very low during winter time. In a prediction window of 240 hours, however, wind speed has the highest weight: in a dry season with low temperatures, the only effective way to disperse the air pollutants is wind. All the above case studies show that our method is effective at capturing sensor-specific features and leads to interpretable results.

5.3 Central Temporal Attention Evaluation

Figure 5: Time-dimension attention distribution of two different tasks

The prediction performance results show the benefit of the central time attention mechanism (in Table 3, FATHOM-sa, with just the central time attention, still outperforms the other baseline models).

Figure 5 shows the attention distribution from the central time attention layer for the ExtraSensory and Air Quality datasets. With the central time attention, the weight distribution does not just focus on the last step, but also spreads to earlier steps. Our observation is that temporal information is not lost and gets re-introduced, leading to stronger predictive performance.

6 Conclusion

In this paper we present FATHOM, a novel federated multi-task model utilizing hierarchical attention to generate a more effective task-specific feature representation. Sensor-specific attention captures the feature correlations within each local task, and central time attention generalizes feature representations across all tasks. We evaluate our proposed model on both classification and regression tasks. The results show that our approach achieves much better performance compared to a wide range of state-of-the-art methods. We also show multiple qualitative case studies to interpret our model.

In the future, we plan to investigate other federated multi-task settings such as learning multiple tasks asynchronously and mechanisms that protect local data privacy.