Human Activity Recognition (HAR) is a fundamental building block for many emerging services such as health monitoring, smart personal assistance or fitness tracking. These activities are often detected with mobile wearable sensors, like accelerometers in smartphones, and classified by a pre-trained machine learning model [lockhart2012applications]. A challenge is to personalize a general recognition model trained on a set of source users to a new unseen target user. Without any personalization, general models often perform poorly on unseen target users [weiss2012impact], [jordao2018human]. This is because many users have a unique motion pattern that leads to a significant shift between the source and target distributions. Much previous work has addressed this challenge in a static batch setting, i.e., the full target user data is available as a batch at once. The current work addresses the personalization problem in the more challenging online setting. Specifically, we assume that the samples from a new, unseen target user arrive sequentially, possibly until infinity. Here, classification and personalization need to happen continuously in real-time (e.g., every 1-2 seconds) based on a few current observations. Also, an algorithm cannot store all previous observations and retrain the model each time with the entire batch. Instead, it must update its state incrementally based on the new observations and possibly disregard them afterwards. Furthermore, the motion pattern of the target user for an activity or the operation setting of the system does not necessarily remain the same over time. Thus, the algorithm must “unlearn” old information and adapt to new information [gepperth2016incremental].
To tackle these problems, some previous work has proposed solutions based on active learning [settles2010active], variants of self-learning [triguero2015self] or a combination of the two. Solutions based on active learning try to identify a few target instances where ground truth information would increase classification accuracy significantly. In a real application, the label would be obtained by asking the user, which could decrease the overall satisfaction and willingness to use the recognition system. Self-learning based solutions, on the other hand, classify the target data and update the model by assuming that very confident predictions are indeed true. While the user does not have to label any data, motion patterns between users can be very different. As self-learning is not a transfer learning algorithm designed to adapt to such significant distribution shifts, it is likely to yield erroneous results under these circumstances. Another shortcoming of previous work is that personalization does not happen simultaneously with classification in real-time. Instead, it is postponed until a larger batch is collected. We expect that using newly available target user data right away may improve classification earlier. Finally, previous work does not address the stability-plasticity dilemma that occurs in the online setting, i.e., regulating the trade-off between forgetting old information and adapting to new information [gepperth2016incremental]. Depending on the setting, adaptation must be stronger or weaker.
Our solution addresses these issues. We use a convolutional neural network (CNN) as the initial general model trained on a set of source users to classify the incoming sensor stream of a target user. The CNN contains an incremental online extension of adaptive batch normalization for domain adaptation (DA-BN) [mancini2018kitting, li2016revisiting]. During training, the only difference to a standard CNN with batch normalization (BN) layers [ioffe2015batch] is that each batch consists only of data from the same person. Given this, each batch is normalized by user-specific BN statistics. In the online testing phase, subsequent sliding windows from a new unseen target user arrive and are classified in real-time. Since the target user has never been seen by the model before, the user-specific statistics for the normalization are unknown. Initially, these statistics are estimated from the source users during training. In the online phase, the initial values are updated incrementally from each single sliding window, right before classifying the respective window. Thus, the statistics gradually adjust to the target user with each additional instance. This makes the feature distribution of the target overlap with that of the sources from training. [mancini2018kitting] were the first to propose an online version of DA-BN. We in turn use an incremental exponential variance formulation proposed by [finch2009incremental] that updates the global statistics based on a single instance, instead of a batch. Thus, our approach is an online, real-time extension of DA-BN.
We summarize the advantages of our method as follows:
Personalization happens in real-time, i.e., no batch of different activities from the target person has to be collected before or during the online phase. Instead, an incremental personalization step happens each time right before classifying an instance. Further, the user is never asked to label any data before or during the online phase.
Our method processes a (potentially) infinite sequence of measurements with constant memory and performs incremental updates efficiently with one pass over the data.
Our method is adaptive to changes in user motion or the operation setting of the system over time (concept drift). The exponential average and variance formulation provides a parameter to regulate how gradual adaptation to such changes should be.
Our method models the personalization problem as a theoretically grounded online unsupervised transfer learning problem. This allows it to deal with distribution shifts between source and target by design.
Our method is the first deep-learning based approach applied to personalization in online HAR. Deep learning based models have been shown to outperform shallow models in HAR and many other machine learning tasks.
In our experiments with 44 subjects on 5 activities, we observe improvements in accuracy over our baseline of up to 14 % for some individuals. The median improvement is 4 %.
2 Related Work
In this section, we first introduce personalization for HAR, in general, before reviewing current online personalization approaches.
[twomey2018comprehensive] is a general introduction to HAR. Usually, in HAR general models are trained from a set of source users for whom labeled measurements have been collected for training. Personalization aims at adjusting the general model to accurately classify the activities of a new unseen target user whose motion patterns may be quite different from the source users.
[weiss2012impact] showed the need for personalization. In their study, models trained only on target user data outperformed general models that did not train with any target data as well as hybrid models containing some target data during training. However, the hybrid models still clearly outperformed the unpersonalized models. Because collecting a sufficient amount of data for each target user is not practical, much research focused on methods to fine-tune general models given only a small amount of labeled target data [sani2018matching, parkka2010personalization, hong2015toward, cvetkovic2011semi, reiss2013personalized, sani2017knn, garcia2015building, rokni2018personalized]. Acquiring even a small set of labeled data may be costly and impractical. Therefore, many researchers also applied unsupervised transfer learning algorithms or semi-supervised learning [ramasamy2018recent, barbosa2018unsupervised, zhao2011cross, lane2011enabling, maekawa2011unsupervised, hachiya2012importance, deng2014cross, wang2018deep, saeedi2018personalized, soleimani2019cross]. The problem is also closely related to sensor placement adaptation [chang2020systematic]. As such, it may be beneficial to take solutions for one problem and to adapt and evaluate elements of them for the other problem.
All this work assumes the availability of the full target user data as one static batch. In real applications, however, this data may often not be available. A new unseen target user would start using the system, and the data would arrive sequentially, presumably until infinity. There is little work devoted to personalization in this online setting.
For example, [cvetkovic2015adapting] present an approach that relies on a general classifier trained on multiple source users and a personal classifier trained on a small subset of the data from the target user. During the online phase, a meta-model decides for each incoming instance whether to output the prediction from the general or the personal model. Another meta-model decides if the classified instance with its predicted label should be included in the training set of the general model, to be retrained periodically. Their method assumes the availability of a batch of target user data before the online phase, relies on self-learning and saves all the previous data to retrain the model periodically.
Similarly, [siirtola2019incremental] use an ensemble trained statically on source individuals as a general model. In the online phase, data of the target person becomes available sequentially and is gathered into a small batch containing several activities. The general model classifies the batch, and a new ensemble is trained based on these predictions. Depending on the confidence of the predictions of the initial base classifier, either the predictions themselves are used as ground truth, or the user is asked for a label. Finally, the general and the new ensemble are merged into an overall model. Their method relies on a combination of self-learning and active learning. Further, personalization does not happen in real-time but is postponed until a batch with data containing several activities has been collected. In their setting, they do not feature real-time classification either. Instead, they divide the incoming stream into chunks, the first one being for personalization only, the next one for classification and testing, the next one again for personalization and so on.
Moreover, [abdallah2015adaptive] also use an approach that combines active with self-learning and collects a batch of data containing multiple activities. Their method clusters the data and extracts cluster features that are passed to an ensemble as input. In the online phase, incoming measurements from small batches are clustered and classified. Personalization takes place through a combination of active and self-learning.
In [sztyler2017online] a priori data about physical characteristics of the target and source users is used to determine similar source users for a given target. Then the sensory measurements from the selected source users are taken to train an online classifier. Their method uses active learning for personalization.
Quite similarly, [mannini2018classifier] train an incremental SVM on source individuals and update the model incrementally in the online phase, using active learning as well. Their approach improved classification accuracy by only approximately 1%.
All these approaches rely on self-learning or active learning. None of them personalizes in real-time. None of them provides a way to balance the stability-plasticity tradeoff. Our approach, on the other hand, relies on theoretically founded transfer learning that explicitly deals with distribution shifts between training and testing data. It is fully unsupervised, i.e., no labels of the target user are needed. Personalization happens each time right before classifying an activity in real-time. Using an exponential average provides a parameter to adjust the adaptation rate.
3 Problem Definition
Using a regular smartphone, 3-axis accelerometer measurements $m_t^p = (x_t^p, y_t^p, z_t^p)$ are collected at regular intervals, one per time step $t$. Here, $x_t^p$, $y_t^p$ and $z_t^p$ denote the respective measured values in the x, y and z accelerometer dimension for a person $p$. We assume that one person is the target, and the other ones are source individuals. For the target person, measurements arrive sequentially, possibly until infinity ($t \to \infty$). For each source person we assume the availability of a labeled training set of measurements with activity classes. Given the time-dependent nature of activities, it is hardly possible to classify an activity based on a single measurement. One prominent way of dealing with this problem is to collect the measurements into sliding windows $s_i^p$ of size $w$, with $k$ being the stride between subsequent sliding windows. For the target person, $i \to \infty$ holds. The training set is transformed into a set of labeled sliding windows, with a class label being assigned to each sliding window based on the most frequent class within it. So an instance to be classified is not represented by a single measurement but a set of measurements in a sliding window, each measurement representing a feature of the instance. The task is to classify each subsequent sliding window $s_i$ from the target person by a function $f$ in real-time, and to disregard $s_i$ afterwards. Before classifying $s_i$, there is an incremental learning procedure that updates $f$ based on that sliding window. A supervised machine learning algorithm learns the initial function $f$ from the training set. Figure 1 illustrates the setting.
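The transformation of a labeled measurement sequence into labeled sliding windows can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `sliding_windows`, its parameters and the toy data are assumptions made for the example.

```python
from collections import Counter


def sliding_windows(measurements, labels, size, stride):
    """Group (x, y, z) measurements into sliding windows.

    Each window is labeled with the most frequent class among its measurements.
    """
    windows = []
    for start in range(0, len(measurements) - size + 1, stride):
        window = measurements[start:start + size]
        majority = Counter(labels[start:start + size]).most_common(1)[0][0]
        windows.append((window, majority))
    return windows


# Toy stream (hypothetical values): 7 "walking" measurements, then 3 "sitting".
stream = [(0.1 * t, 0.0, 9.8) for t in range(10)]
classes = ["walking"] * 7 + ["sitting"] * 3
result = sliding_windows(stream, classes, size=4, stride=2)
```

With a window size of 4 and a stride of 2, the toy stream yields four overlapping windows; the last one contains three "sitting" measurements and is labeled accordingly.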
The approach to be presented is based on convolutional neural networks (CNN) with an incremental extension of Domain Alignment Batch Normalization (DA-BN) layers [mancini2018kitting, li2016revisiting]. We first introduce DA-BN layers for the unsupervised batch case, i.e., the test data from the unlabeled target person is available as a single batch at once, and then move on to the online case. Labeled training data from the source individuals is fully available in both cases.
Recent work has demonstrated the effectiveness of DA-BN layers for deep domain adaptation [li2016revisiting, mancini2018kitting, mancini2018boosting, mancini2018robust, cariucci2017autodial, carlucci2017just]. Domain adaptation is a branch of transfer learning that deals with problems under the covariate shift assumption. Given a set of features $X$ and labels $Y$ from the same feature/label space for a source domain $S$ and target domain $T$, the conditional distribution between source and target stays the same, i.e., $P_S(Y \mid X) = P_T(Y \mid X)$, while the marginal distributions differ, i.e., $P_S(X) \neq P_T(X)$ [pan2009survey]. DA-BN within a CNN is based on the following results. As [shimodaira2000improving] showed, a learner trained on a given source domain will not work optimally on a target domain when the marginal distributions of the domains are different. One solution is to find a feature representation that maximizes the overlap between the domains while also maximizing class separability. Training a classifier that minimizes the error on the source domain using such a feature representation minimizes the classification error on both source and target domain [ben2007analysis, ben2010theory]. As we will see, CNNs with DA-BN follow this theory. Figure 2 is an illustration of this concept. The concept is generalizable to the case where multiple source domains are present: In HAR, each individual can be seen as its own domain and the differences in the motion patterns between individuals as the covariate shift.
4.1 Batch Normalization
We now revisit Batch Normalization before explaining some existing extensions as well as our innovations for domain adaptation. Batch normalization [ioffe2015batch] is a well known technique to reduce covariate shift within deep neural network layers and thus stabilize and accelerate training. The idea is to keep the input distribution for a given layer constant by replacing the channel input $x$ with its standardized value $\hat{x}$ using its mean $\mu$ and variance $\sigma^2$. During each training iteration, $\mu$ and $\sigma^2$ are computed over the input batch of the respective layer.
An exponential average of these statistics over subsequent batches is computed to be used as a global estimate for $\mu$, $\sigma^2$ in the testing phase.
Irrespective of the normalization happening during training using batch estimates or testing using global estimates, $\hat{x}$ is computed by:
$$\hat{x} = \gamma \, \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta \tag{5}$$
with $\gamma$ and $\beta$ being trainable parameters letting the network shift the imposed distribution. $\epsilon$ is a (very small) constant for numerical stability.
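The training and testing behavior described above can be sketched for a single channel as follows. This is an illustrative, pure-Python sketch under stated assumptions, not the authors' implementation: the class name `BatchNormChannel`, the parameter name `momentum` and its default value are hypothetical, and the scale and shift parameters are kept fixed instead of being trained.

```python
import math


class BatchNormChannel:
    """Batch normalization for one channel (illustrative sketch)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum  # assumed name: weight of the newest batch
        self.eps = eps            # small constant for numerical stability
        self.gamma = 1.0          # trainable scale (kept fixed in this sketch)
        self.beta = 0.0           # trainable shift (kept fixed in this sketch)
        self.running_mean = 0.0   # global estimates used in the testing phase
        self.running_var = 1.0

    def _standardize(self, x, mean, var):
        return self.gamma * (x - mean) / math.sqrt(var + self.eps) + self.beta

    def forward_train(self, batch):
        n = len(batch)  # needs n > 1 for the unbiased variance below
        mean = sum(batch) / n
        var = sum((x - mean) ** 2 for x in batch) / n  # biased batch variance
        unbiased_var = var * n / (n - 1)
        m = self.momentum
        # Exponential averages over subsequent batches (global estimates).
        self.running_mean = (1 - m) * self.running_mean + m * mean
        self.running_var = (1 - m) * self.running_var + m * unbiased_var
        # Training normalizes with the statistics of the current batch.
        return [self._standardize(x, mean, var) for x in batch]

    def forward_test(self, x):
        # Testing normalizes with the global estimates.
        return self._standardize(x, self.running_mean, self.running_var)


bn = BatchNormChannel(momentum=0.1)
train_out = bn.forward_train([1.0, 2.0, 3.0, 4.0])
test_out = bn.forward_test(0.25)
```

The sketch standardizes with the biased batch variance during training and stores an unbiased variance in the global estimate, which is one common convention and matches the remark on bias correction made later in the text.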
4.2 Domain Adaptive Batch Normalization
Batch normalization can be applied to unsupervised single-source-single-target domain adaptation by a simple change in the testing phase, as proposed by [li2016revisiting]. Instead of applying Equation 5 with the global estimates of mean and variance, target-specific estimates are computed using the fully available unlabeled target test data. In the case of multiple source domains, each batch must contain only instances from the same source during training. Therefore, each domain is normalized using its own domain-specific statistics, imposing the same target distribution on the features. The feature distribution of the classification layer’s input should thus overlap for all domains. All other parameters (weight and bias terms) of the neural network remain shared. During training, they are optimized to minimize classification loss. As a result, the network learns a feature transformation that maximizes class separability while making the domains overlap.
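The effect of normalizing each domain with its own statistics can be illustrated with a toy example. The sketch below assumes, purely for illustration, that two users produce the same underlying pattern distorted by a different offset and scale; the function names and all values are hypothetical.

```python
import math


def domain_statistics(values):
    """Mean and (biased) variance over all available data of one domain."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


def normalize(values, mean, var, eps=1e-5):
    """Standardize values with domain-specific statistics."""
    return [(v - mean) / math.sqrt(var + eps) for v in values]


# Hypothetical feature values: same pattern, user-specific offset and scale.
pattern = [-1.0, 0.0, 1.0, 2.0]
source = [2.0 * v + 5.0 for v in pattern]
target = [0.5 * v - 3.0 for v in pattern]

source_norm = normalize(source, *domain_statistics(source))
target_norm = normalize(target, *domain_statistics(target))
```

After normalization, both domains occupy nearly the same range, so a classifier with shared weights sees overlapping feature distributions. Real motion patterns will not differ by a simple affine distortion, so this only conveys the intuition.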
4.3 Online Domain Adaptive Batch Normalization
The method outlined in the previous section works in the batch setting. Test data from the target domain, or at least a sufficiently large subset of it, is available to estimate the mean and variance specific to the target domain. [mancini2018kitting] proposed an online extension of DA-BN for visual object recognition under changing visual conditions. Their method features the following adjustment in the online testing phase of DA-BN. Over the incoming stream of test data, small batches of the incoming instances (images in their case) are collected. Equivalently to the training procedure in regular batch normalization, the mean and variance are computed using Equation 1 and Equation 2, and the global estimates are updated with Equation 3 and Equation 4. These global estimates are then used in Equation 5 to transform the current batch $B_t$. Here, $B_t$ denotes the $t$-th batch to be processed in the online phase. After collecting and processing each batch, an incremental adaptation step takes place.
Under our problem formulation, their solution is not directly applicable. Section 3 has explained that we collect measurements into sequences of measurements (sliding windows) as well. However, we represent one instance of our data with one sliding window. The data units for the subsequent data-processing steps are entire sliding windows, i.e., one window is a data point. The measurements are its features. Put differently, a sliding window is a point in the multidimensional feature space. Since in our case the incremental adaptation step takes place after each incoming sliding window and before real-time classification, we must update the global mean and variance estimates with a formulation that uses a single instance, i.e., a batch of size one. As all previously proposed DA-BN variants (online and offline) assume a batch of several instances, we replace Equation 3 and Equation 4 with an incremental, exponential formulation in our approach. This formulation updates the global statistics directly from a single instance based on [finch2009incremental].
The online adaptation momentum in Equation 6 and Equation 7 is a weighting factor. By choosing its value, one can regulate how strong the influence of a new instance should be on the running mean and running variance estimate. Therefore, it allows balancing the stability-plasticity tradeoff for the given application setting.
Note that Equation 4 computes an unbiased variance estimate. This means that there is a correction for the bias introduced by estimating a population statistic from a finite sample. Since the statistic is simply computed from a batch, a correction term is known. On the other hand, Equation 7 does not correct for bias. We are unaware of a bias correction term for the incremental exponentially weighted variance. Yet, we do not expect this to influence our method significantly. Other works sometimes also apply a biased variance computed over a sample in their methods. For instance, regular batch normalization deliberately does not correct for bias in the training phase, to facilitate gradient computation [ioffe2015batch].
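A single-instance update of this kind can be sketched as follows. The sketch assumes the standard exponentially weighted form from [finch2009incremental], i.e., $\mu_t = \mu_{t-1} + \alpha (x_t - \mu_{t-1})$ and $\sigma_t^2 = (1 - \alpha)(\sigma_{t-1}^2 + \alpha (x_t - \mu_{t-1})^2)$; whether this matches the referenced Equation 6 and Equation 7 symbol for symbol is an assumption. The class name `OnlineStats` and the symbol `alpha` for the online adaptation momentum are hypothetical.

```python
class OnlineStats:
    """Exponentially weighted running mean and variance, one value at a time."""

    def __init__(self, mean, var, alpha):
        if not 0.0 < alpha < 1.0:
            raise ValueError("alpha must lie strictly between 0 and 1")
        self.mean = mean    # initial estimate, e.g. from the source users
        self.var = var
        self.alpha = alpha  # assumed symbol for the online adaptation momentum

    def update(self, x):
        diff = x - self.mean
        self.mean += self.alpha * diff
        # No bias correction is applied to the variance.
        self.var = (1 - self.alpha) * (self.var + self.alpha * diff ** 2)
        return self.mean, self.var


# One update by hand: diff = 2, mean = 0 + 0.5 * 2, var = 0.5 * (1 + 0.5 * 4).
single = OnlineStats(mean=0.0, var=1.0, alpha=0.5).update(2.0)

# A small momentum adapts slowly (stable), a larger one quickly (plastic).
slow = OnlineStats(mean=0.0, var=1.0, alpha=0.01)
fast = OnlineStats(mean=0.0, var=1.0, alpha=0.1)
for _ in range(100):
    slow.update(5.0)
    fast.update(5.0)
```

After 100 identical observations, the estimate with the larger momentum has almost reached the new value, while the one with the smaller momentum still retains a substantial share of the initial estimate. Only the two running statistics are stored, so memory stays constant regardless of the stream length.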
5 Description of Approach
For the general model, we train a convolutional neural network (CNN) with online DA-BN layers. As described in Section 4, these layers perform an incremental learning step to successively adjust the global model to the target person in the online testing phase. As usual, the CNN consists of 3 parts: (1) a regular convolutional block with several convolutional and max pooling layers, (2) a fully connected block with one or multiple fully connected, online DA-BN layers and (3) a softmax classification layer.
For each source person, the training data is split into non-overlapping batches of a fixed size, each batch containing only data from the same source person.
Note that the batch in training iteration $i$ is the $j$-th batch of the $e$-th training epoch of the neural network. Each batch is first processed by the convolutional block as expected in a regular CNN architecture and is passed to the fully connected block. For each fully connected layer and each channel, the BN statistics over the current batch are computed using Equation 1 and Equation 2. Each element in the batch is transformed using Equation 5. As each batch contains only data from the same source person, these statistics are person-specific. The global BN statistics are updated using Equation 3 and Equation 4. The output is passed to the activation function and the next layer until reaching the classification layer. At the classification layer, a loss is computed to be propagated back through the network. At the end of the training phase, we end up with the trained general model.
5.2 Online Testing
In the online testing phase, sliding windows from the target person are processed one by one in the order of arrival. The incremental learning and classification step per sliding window happens within one pass through the model. The current sliding window is passed to the CNN and is processed by the convolutional block. In the fully connected block, each layer and channel updates its global estimate using Equation 6 and Equation 7 from the input. The global estimates are used in Equation 5 to standardize the respective inputs. The output is passed to the activation function and the next layer until reaching the classification layer. The network outputs an activity, and the sliding window is removed from memory. Then, the next sliding window is processed. Figure 3 illustrates an online pass. The architecture is parametrized as in our experiments.
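The order of operations in one online pass (update the statistics, standardize, classify, discard the window) can be sketched as follows. This is a simplified illustration, not the authors' network: `extract` and `classify` are hypothetical stand-ins for the convolutional block and the classification layer, the scale and shift parameters are assumed to be 1 and 0, and the update rule assumes the exponentially weighted form from [finch2009incremental].

```python
import math


class OnlineDABN:
    """Per-channel online step: update global statistics, then standardize."""

    def __init__(self, means, variances, alpha, eps=1e-5):
        self.means = list(means)          # initial estimates from training
        self.variances = list(variances)
        self.alpha = alpha                # assumed symbol for the momentum
        self.eps = eps

    def __call__(self, activations):
        out = []
        for c, x in enumerate(activations):
            diff = x - self.means[c]
            self.means[c] += self.alpha * diff
            self.variances[c] = (1 - self.alpha) * (
                self.variances[c] + self.alpha * diff * diff
            )
            out.append((x - self.means[c]) / math.sqrt(self.variances[c] + self.eps))
        return out


def classify_stream(windows, extract, layer, classify):
    """Yield one prediction per window; no window is kept after its pass."""
    for window in windows:
        yield classify(layer(extract(window)))


# Hypothetical stand-ins for the convolutional block and the classifier.
extract = lambda w: [sum(w) / len(w), max(w) - min(w)]
classify = lambda z: "high" if z[0] > 0 else "low"

windows = ([float(i + j) for j in range(4)] for i in range(5))  # lazy stream
layer = OnlineDABN(means=[0.0, 0.0], variances=[1.0, 1.0], alpha=0.1)
predictions = list(classify_stream(windows, extract, layer, classify))
```

Because the stream is consumed lazily and the layer only stores one mean and one variance per channel, the memory footprint does not grow with the number of processed windows.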
6.1 Experimental Setup
In our experiments, we use the WISDM dataset publicly available in the UCI Repository [weiss2019smartphone]. The dataset contains accelerometer and gyroscope measurements collected at approx. 20 Hz with a smartphone and a smartwatch from 51 subjects. It contains data on 18 activities of daily living. During collection, all subjects had the smartphone in the same pocket and in the same orientation. For each subject and activity, approx. 3 minutes have been recorded. The activities have been recorded separately. This means that each person performs one activity for approx. 3 minutes in a row, followed by the next activity etc. The timestamps show that the transitions from one activity to the next are not continuous, i.e., each activity was recorded in isolation. Also, the data from the smartphone and smartwatch is not synchronized, i.e., they have not been collected in parallel.
In line with most other work on personalized HAR, we only use the data from the smartphone accelerometer. The accelerometer is the most meaningful sensor for motion based HAR. We also do not want to influence our results with the effects of sensor fusion. We consider the activities walking, jogging, walking stairs, sitting and standing. Because the data of subjects 09, 16 and 42 did not contain all relevant activities, we disregard it. Additionally, as [burns2020personalized] reported issues with the collected measurements of subjects 37 to 40, we disregard them as well. As such, we are left with 44 subjects.
Because the recording of measurements happened in isolation, we group the data by activity and person id for the following preprocessing steps. We resample the data so that the sampling frequency is at exactly 20 Hz. We also truncate the last measurements to ensure an equal number of measurements per activity and person. This yields a perfectly balanced dataset. We apply a non-centered moving average filter of size 4 for consistency with the online setting. The value of a filtered measurement should not be based on future values but should only use measurements from the past. Therefore, the filtered value at timestep $t$ is computed as the average of the values at timesteps $t-3$ to $t$. This filter size was chosen as a combination of results from preliminary experiments and common practice in related work [saeedi2018personalized, bruno2013analysis]. We apply a min-max normalization with a min-max range of [-78, 78] based on the value range of the accelerometer. Finally, sliding windows are of size 40 (2 sec) with 50 % overlap. The size and overlap have been chosen based on an empirical study that has tested HAR models with varying sliding window sizes and overlaps [banos2014window]. One advantage of using a neural network based model is that feature extraction is part of the overall learning process. As such, we do not extract any hand-crafted features from the sliding windows.
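The filtering and normalization steps can be sketched as follows. Two details are assumptions made for this illustration because the text does not state them: the first timesteps, which have fewer than four past values, are averaged over the values available so far, and the min-max normalization maps the accelerometer range [-78, 78] to [0, 1]. The function names are hypothetical.

```python
def causal_moving_average(values, size=4):
    """Average each value with up to size - 1 preceding values (no future values)."""
    filtered = []
    for t in range(len(values)):
        start = max(0, t - size + 1)  # assumption: shorter window at the start
        past = values[start:t + 1]
        filtered.append(sum(past) / len(past))
    return filtered


def min_max_normalize(values, low=-78.0, high=78.0):
    """Scale values from the sensor range [low, high] to [0, 1] (assumed target)."""
    return [(v - low) / (high - low) for v in values]


smoothed = causal_moving_average([4.0, 8.0, 0.0, 4.0, 8.0])
scaled = min_max_normalize([-78.0, 0.0, 78.0])
```

Because the filter only looks backwards, the same code can run on a live stream: each new measurement is filtered as soon as it arrives.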
All in all, we end up with 3560 measurements (2:58 min) separated into 177 3-dimensional sliding windows of size 40 per activity and person for 5 activities and 44 individuals. Thus we have $177 \times 5 \times 44 = 38{,}940$ instances in total.
6.1.3 Evaluation Method
To show the effectiveness of DA-BN layers for personalization in HAR, we conducted several experiments. They evaluate the method in the batch and the online setting.
Unless otherwise stated, we employ the leave-one-person-out-cross-validation (LOPOCV) evaluation model. We create one fold per person and assign the data of each person to exactly one fold [jordao2018human]. So each person is the target person exactly once, while the base model is trained on all the remaining individuals. For each fold, the classification accuracy is computed. In the evaluation section, results are often summarized as medians or means over all folds. To make sure that results are comparable, each experiment is based on the same CNN architecture, with the same hyper parameters and initialization weights.
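The fold construction can be sketched as follows. The function name `leave_one_person_out`, the dictionary layout and the toy data are assumptions made for this illustration.

```python
def leave_one_person_out(data_by_person):
    """Yield (target id, source data, target data), one fold per person."""
    for target in sorted(data_by_person):
        sources = {p: d for p, d in data_by_person.items() if p != target}
        yield target, sources, data_by_person[target]


# Hypothetical toy data: person id -> list of (window, label) instances.
toy_data = {
    "p01": [("w1", "walking"), ("w2", "sitting")],
    "p02": [("w3", "walking")],
    "p03": [("w4", "jogging")],
}
folds = list(leave_one_person_out(toy_data))
```

Each fold trains the general model on `sources` only and evaluates it on the held-out target person, so no target data leaks into training.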
We conduct the following experiments:
Baselines: First, we create a Lower Baseline and an Upper Baseline. The Lower Baseline consists of a regular CNN with regular batch normalization layers in the fully connected block evaluated under the LOPOCV. The Upper Baseline uses a different evaluation model than the other experiments. The data of each person is randomly split into a training and a test set. For each person, a regular CNN is trained and evaluated on that person's data only. Results are often summarized as medians or means over all individuals.
Unsupervised Batch: A CNN with DA-BN layers in the fully connected block is trained. The held out data of the target person is randomly split into a pre-estimation and test set with varying relative sizes from 1:90 to 9:1. The pre-estimation set is passed to the model to estimate the global mean and variance for each channel and layer over the entire batch. These estimates are used in Equation 5 when classifying the test set.
Supervised Batch: This experiment is like the Unsupervised Batch experiment, except that the pre-estimation set contains labels that are used to additionally tune the network weights for 10 epochs. To allow comparisons, we also tune the weights of the Lower Baseline, dubbed Supervised Baseline.
Online Unrandomized: A CNN with Online DA-BN layers in the fully connected block is trained. The order of sliding windows in the held out data of the target person is kept as in the original dataset. So, all instances of one class are processed before all instances of the next class. The online adaptation momentum is varied within the range [0.0001, 0.005]. Instances are processed one at a time, i.e., not as a batch.
Online Randomized: This experiment is like the Online Unrandomized experiment, but the order of the instances is randomized. So activities are uniformly distributed in time. This should simulate a slightly more realistic scenario than keeping the order in blocks of activities, as provided by the authors of the dataset. This experiment is repeated 5 times, varying the order of sliding windows, and results are averaged. We vary the online adaptation momentum within the range [0.001, 0.05].
The CNN model employed in all our experiments consists of 5 1D-convolutional layers with 64 feature maps, a convolutional kernel of size 5, stride 1 and ReLU activation function. Zero padding is applied to keep the size of the feature maps constant throughout the convolutional block. After the last convolutional layer, 1D non-overlapping max pooling with a kernel size of 4 is applied. The following fully connected block consists of 1 fully connected layer with 256 neurons, the batch normalization of the respective experiment, a 50 % dropout rate and a ReLU activation function. The classification layer uses a Softmax activation function. Figure 3 summarizes the architecture with the respective hyper-parameters. As the loss function, we have chosen the cross-entropy loss. During training, we employ the ADAM optimizer with a 0.0001 learning rate and 0.001 decay, training for 649 epochs on batches of size 177.
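The feature map sizes implied by this architecture can be checked with a short calculation. The sketch assumes a zero padding of 2 on each side (which keeps the length constant for a kernel of size 5 and stride 1) and that all feature maps are flattened before the fully connected layer; both are assumptions, as is the helper name `output_length`.

```python
def output_length(length, kernel, stride=1, padding=0):
    """Output length of a 1D convolution or pooling operation."""
    return (length + 2 * padding - kernel) // stride + 1


window_size = 40   # measurements per sliding window
feature_maps = 64

length = window_size
for _ in range(5):  # 5 convolutional layers, kernel 5, stride 1
    length = output_length(length, kernel=5, stride=1, padding=2)

pooled = output_length(length, kernel=4, stride=4)  # non-overlapping max pooling
flattened = feature_maps * pooled                   # inputs to the 256-neuron layer
```

Under these assumptions, the convolutional block keeps the window length at 40, pooling reduces it to 10, and the fully connected layer receives 640 inputs.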
We determined these hyper parameters to work best in a grid search for a general recognition model. We also conducted a grid search for the personal models of the Upper Baseline. However, the results with the best Lower Baseline hyper parameters in the Upper Baseline experiment were only slightly different. So, for higher comparability, we also employ the same hyper parameters for the Upper Baseline, except for the number of epochs. We determined these separately for each personal model using early stopping on a validation set.
The code for the experiments is in Python, using the PyTorch (with CUDA), Numpy and Pandas libraries. The experiments have been run on the Pittsburgh Super Computer with NVIDIA Tesla V100 16 GB memory GPUs [Nystrom:2015:BUF:2792745.2792775].
6.2.1 Comparison Across All Experiments
Figure 4 compares the LOPOCV results of the experiments. The values in the boxes are the median accuracies. For Online Unrandomized and Online Randomized, the results are for online adaptation momentum values of 0.0009 and 0.01 respectively. For Unsupervised Batch, Supervised Baseline and Supervised Batch, 10 % of the target data is used as the pre-estimation set and the remaining 90 % are used for testing.
As intended and expected, the Upper Baseline sets the maximal possible detection accuracies through personalization. On the other hand, the Lower Baseline must be improved upon for personalization to be any good. The (unsupervised) online experiments have slightly smaller improvements over the Lower Baseline than the Unsupervised Batch experiment. The supervised (batch) experiments outperform all the unsupervised ones. The boxplot contains outliers towards the lower tail for all but two experiments. These are users for whom the accuracy (“user performance” in the following) is very low compared to the other users.
Comparing the unsupervised experiments to the Lower Baseline, we see an improvement over the whole distribution of users: The medians, the Inter Quartile Range (IQR) and the minimum and maximum are higher, as well as the two outliers (User 10 and 14). The biggest improvement can be seen between the minimum of the Lower Baseline and the Online Unrandomized experiment. This suggests that in particular users with low accuracies on the Lower Baseline experience significant improvements. The same effect can be seen when comparing the accuracies of the outliers on the Supervised Baseline to the Supervised Batch case. For some users, accuracies already are above 90 % using a simple general model. So it is important to improve detection accuracy for users who are different and hard to classify by the general model. Our approach seems to do this, as we will see later when discussing Figure 5 and Figure 6.
Between the Online Unrandomized and the Online Randomized experiment, median, maximum and minimum are almost equal. Users 10 and 14 do worse in the Online Unrandomized case, pushing its overall results down slightly. Although the IQR is of the same size, its upper bound is slightly higher for the Online Randomized experiment. This suggests that accuracy is higher for more users in the randomized case. A reason could be that in the unrandomized case all instances of one activity are processed before all instances of the next one. The DA-BN layer needs to process some instances of the new activity to adjust its statistics to the new pattern; this might lead to an artificial “concept drift”. During that time, instances of the new activity are classified using statistics based on the previous class. In the randomized case, this does not happen. As this data is randomized, concept drift occurs only once at the beginning, when data of the new user arrives. The statistics converge towards their target values and do not change much until the end of the online phase. Nevertheless, both cases show strong improvements over the Lower Baseline, with results not too different from each other. This shows that online DA-BN is applicable in different scenarios. We will see in Figure 8 that this has to do with the choice of online adaptation momentum.
When comparing the Online Randomized to the Unsupervised Batch experiment, one can see that the minimum accuracy and the accuracy of one of the outliers are lower for the Unsupervised Batch case. This means that the lowest performers are doing better in the online case than in the batch case. However, this might be a random effect that only occurs for the two lowest performers on this specific data. The median, the 75th percentile and the maximum, however, are higher in the batch case. This suggests that accuracy for the average and top performers is higher in the batch case.
The Supervised Baseline improves the median accuracy over the Lower Baseline by 6 %. This shows the effect of tuning the network weights with a small amount of labeled target user data. However, the differences in median and maximum between the Supervised Baseline and the Unsupervised Batch experiment are only marginal. This shows the strength of our approach in the batch setting: the results of our unsupervised approach are not much worse than the results with supervised fine-tuning. Nevertheless, supervised fine-tuning still beats the unsupervised approach. This can be seen from the higher minimum value and the narrower, upward-shifted IQR of the Supervised Baseline. Still, applying DA-BN on top of weight tuning improves activity recognition. The higher median and 75th percentile of the Supervised Batch experiment show this.
6.2.2 Average Results for Groups of Users
Figure 5 compares the average accuracies over all users, as well as results on the subsets of the 10 best (dubbed “top 10”) and 10 worst (“flop 10”) users on the Lower Baseline. As in Figure 4, we report results for Online Unrandomized and Online Randomized with an online adaptation momentum of 0.0009 and 0.01, respectively. Likewise, for Unsupervised Batch, Supervised Baseline and Supervised Batch, 10 % of the target data is used as the pre-estimation set and the remaining 90 % is used for testing.
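The grouping used in Figure 5 can be sketched as follows: users are ranked once by their Lower Baseline accuracy, and the mean accuracy of any experiment is then reported for all users, for the k best and for the k worst of that ranking. The function name `group_means` and the toy accuracies are hypothetical; the toy example uses k = 2 instead of 10 to stay small.

```python
from statistics import mean


def group_means(baseline_acc, experiment_acc, k=10):
    """Mean experiment accuracy for all users, the top k and the flop k.

    Groups are always defined by the ranking on the baseline, not on the
    experiment being reported.
    """
    ranked = sorted(baseline_acc, key=baseline_acc.get, reverse=True)
    groups = {"all": ranked, "top": ranked[:k], "flop": ranked[-k:]}
    return {name: mean(experiment_acc[u] for u in users)
            for name, users in groups.items()}


# Hypothetical accuracies for four users.
baseline = {1: 0.9, 2: 0.5, 3: 0.8, 4: 0.6}
adapted = {1: 0.9, 2: 0.7, 3: 0.8, 4: 0.7}
result = group_means(baseline, adapted, k=2)
```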
When looking at the results for all users, there is almost no difference between the Unsupervised Batch and the Online Randomized experiment, while there is a 1 % difference between Online Randomized and Online Unrandomized. This, again, comes from the relatively good performance of User 10 and User 14 in the Online Randomized case. Their relatively worse performance in the other two cases pushes the respective averages down. Therefore, the same pattern can be seen for the flop 10 but not for the top 10.
For the Supervised Batch case, the improvement on all users over the Supervised Baseline is higher in terms of mean than median. This is because the Supervised Batch sharply improved the outliers. It suggests that tuning weights in connection with DA-BN layers is even more beneficial in the supervised than in the unsupervised case.
We see that using DA-BN has the greatest effect on the flop 10. There is a big leap from the Lower Baseline to the Online Unrandomized case. However, the improvement from adding supervision is obviously larger than from unsupervised DA-BN. DA-BN also improves in the supervised case but not as much as in the unsupervised (online) cases. For the top 10, the unsupervised online experiments do not show an improvement. Unsupervised Batch and Supervised Batch, however, still show an improvement of approx. 2 % over their respective baseline. Supervised Batch even achieves results equal to the Upper Baseline.
All in all, using DA-BN consistently improves detection accuracy, be it in the online or batch, supervised or unsupervised case. For users who do not perform well under a general model, there is more room for improvement and a higher overall effect. For the top 10, there is less room for improvement. Still, DA-BN shows a significant effect. There is a difference in performance between the online randomized and unrandomized cases that warrants further investigation.
6.2.3 Online Randomized Improvement per Person
Figure 6 shows the accuracy improvement of the Online Randomized experiment over the Lower Baseline for each user. The x-axis is in descending rank order based on the users' Lower Baseline accuracy. To illustrate, User 38 has the highest accuracy on the Lower Baseline.
There are a few individuals for whom the accuracy goes down. The biggest decline is about 4 %. It seems that users who perform well on the Lower Baseline are more likely to experience a decrease in performance. Other studies also showed decreased performance for some users after personalization, cf. [qin2019cross, chang2020systematic, deng2014cross]. In transfer learning, this effect is known as negative transfer. Recall from Section 4 that domain adaptation assumes the conditional distribution to stay the same between source and target, while the marginal distributions are different. In this real-world scenario, the assumption of equal conditional distributions may be violated between some individuals, possibly leading to negative transfer. However, for most users the accuracy improves, with gains of up to 14 %. Some of the top gainers, namely Users 10, 12, 18, and 27, are among the lowest performers on the Lower Baseline. In fact, User 10 has the lowest accuracy on the Lower Baseline and is the second biggest gainer. This is in line with what we have seen in Figure 5; however, we had expected this relationship to be stronger. For instance, User 1 has the 10th best accuracy on the Lower Baseline but the 4th highest improvement. This makes him the top performer of the Online Randomized experiment.
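A per-user view like the one in Figure 6 can be derived from two accuracy tables, as in the sketch below. It orders users by descending Lower Baseline accuracy, computes the change after adaptation and lists the users whose accuracy decreased (candidates for negative transfer). The function names, user labels and accuracy values are hypothetical and do not reproduce the reported results.

```python
def improvement_by_rank(baseline_acc, adapted_acc):
    """Return (user, accuracy change), ordered by descending baseline accuracy."""
    ranked = sorted(baseline_acc, key=baseline_acc.get, reverse=True)
    return [(u, adapted_acc[u] - baseline_acc[u]) for u in ranked]


def negative_transfer_users(rows, tol=0.0):
    """Users whose accuracy dropped by more than `tol` after adaptation."""
    return [u for u, delta in rows if delta < -tol]


# Hypothetical accuracies for three users.
baseline = {"A": 0.95, "B": 0.90, "C": 0.55}
adapted = {"A": 0.93, "B": 0.97, "C": 0.68}
rows = improvement_by_rank(baseline, adapted)
declined = negative_transfer_users(rows)
```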
6.2.4 Impact of Pre-Estimation Set Size
Figure 7 displays the median LOPOCV accuracy depending on the relative pre-estimation set size. The x-axis denotes the relative size of the pre-estimation batch compared to the overall size of the target data. Note that the scale of the x-axis is not linear.
All experiments start with a sharp increase in accuracy. The Supervised Batch already improves accuracy by 1 % over the Lower Baseline to 0.81, given only 1 % (8 instances / 9 sec) of the data of the target individuals for pre-estimation. Given 2 % of this data, accuracy sharply increases to 0.84. From there on, accuracy slowly rises to almost 0.86 with 10 % of the target person’s data. These results show how DA-BN can achieve strong improvements with only little data from the target person and without any label. They also indicate how adaptive the online algorithm should be in the given setting. When the online algorithm has processed 16 sliding windows (x = 0.02 / 17 sec), the DA-BN statistics could already be based predominantly on data from the target person. This speaks for a rather high online adaptation momentum in the beginning. Over time, the gain due to fast adaptation decreases. It might then be beneficial to have a lower online adaptation momentum to be more robust towards, say, unbalanced class distributions, short within-user temporal changes, noise, etc. Developing a method that continuously adjusts the online adaptation momentum on its own during the online phase might be a promising future direction.
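The relation between the number of sliding windows and the elapsed time quoted above (8 instances / 9 sec, 16 windows / 17 sec) can be checked with a short calculation. The window length of 2 s and the stride of 1 s used below are assumptions chosen because they reproduce both quoted durations; they are not taken from the experimental setup.

```python
def stream_duration_sec(n_windows, window_sec=2.0, stride_sec=1.0):
    """Signal time covered by n overlapping sliding windows.

    Assumed defaults: 2 s windows with a 1 s stride (not confirmed by the text).
    """
    if n_windows <= 0:
        return 0.0
    return (n_windows - 1) * stride_sec + window_sec


d_one_percent = stream_duration_sec(8)    # 1 % of the target data
d_two_percent = stream_duration_sec(16)   # 2 % of the target data
```

Under these assumptions, the two calls return 9.0 and 17.0 seconds, matching the figures given in the text.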
On the other hand, using only a small subset of a target person’s labeled data to fine-tune the network weights (without DA-BN) has a negative impact on accuracy. It drops by 2 % below the Lower Baseline. Up to x = 0.05, the Unsupervised Batch performs better than the Supervised Baseline, and it remains marginally better up to x = 0.2. From there on, the advantages of supervised fine-tuning come to bear, and the Supervised Baseline becomes better than the Unsupervised Batch. Comparing these results to the Supervised Batch, one can see that it is consistently better than the Supervised Baseline by 1 to 4 %. Using DA-BN in the supervised setting mitigates the initial loss at x = 0.01 and already improves over the Unsupervised Batch very early, at x = 0.02. Thus, it is always beneficial to apply DA-BN.
6.2.5 Impact of Online Adaptation Momentum
Operating in a dynamic online environment leads to the stability-plasticity dilemma, which incurs a tradeoff between the ability to take in new knowledge and to “forget” old information [gepperth2016incremental]. Finding an appropriate way to regulate this tradeoff is key to optimizing online detection performance. In contrast to other online personalization approaches in HAR, online DA-BN can explicitly regulate this balance using the online adaptation momentum. Figure 8 shows that very different values of this hyperparameter are optimal, depending on whether we are looking at the Online Randomized or the Online Unrandomized case. In the Online Unrandomized case, activities arrive in blocks, one after the other. Thus, using rather high momentum values confines the DA-BN statistics to the pattern of one activity only. This has a negative impact on detection accuracy. In the Online Randomized case, on the other hand, activities do not arrive in blocks but are mixed. Estimating the DA-BN statistics from the last, say, 16 sliding windows then reflects the overall pattern across all activities better. In this case, being more adaptive and “forgetting” the statistics over all users in favor of the user-specific statistics leads to better detection accuracies. We discussed this effect when presenting Figure 4. We can see how crucial it is to regulate the strength of adaptation depending on the setting in online HAR.
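The effect of the momentum on blocked versus mixed data can be illustrated with a toy running-mean update. This is a sketch under stated assumptions, not the DA-BN implementation: it assumes the exponential moving average convention used by PyTorch batch normalization, running = (1 - momentum) * running + momentum * new, and it tracks a single scalar feature whose mean is 0 for one activity and 10 for another. A strictly alternating stream serves as a deterministic stand-in for the randomized order, and the momentum value 0.05 is purely illustrative.

```python
def update_running_mean(running, new_value, momentum):
    """One exponential-moving-average step (PyTorch BatchNorm convention, assumed)."""
    return (1.0 - momentum) * running + momentum * new_value


def track(stream, start, momentum):
    """Return the running statistic after each element of the stream."""
    running, history = start, []
    for x in stream:
        running = update_running_mean(running, x, momentum)
        history.append(running)
    return history


# Hypothetical target user: two activities with feature means 0 and 10,
# so the statistic over all activities of this user is 5.
blocked = [0.0] * 200 + [10.0] * 200   # activities arrive one after the other
mixed = [0.0, 10.0] * 200              # activities are interleaved
source_stat = 4.0                      # hypothetical statistic of the source users

fast_blocked = track(blocked, source_stat, momentum=0.05)
slow_blocked = track(blocked, source_stat, momentum=0.0009)
fast_mixed = track(mixed, source_stat, momentum=0.05)
```

In this toy setting, the high momentum on blocked data ends close to 10, the statistic of the last activity only, whereas the low momentum stays within 1 of the overall value 5. On the interleaved stream, the high momentum settles close to 5, which mirrors the qualitative argument made above.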
In this work, we have presented the first fully unsupervised online personalization approach based on theoretically grounded domain adaptation for accelerometer-based HAR. The approach incrementally personalizes a general model in real-time, right before classification. It also allows regulating how gradual the adaptation to new information should be.
Personalization of general activity recognition models is necessary to achieve good detection results for new, unseen users with unique motion patterns. In the online setting, no samples from the target user are available in advance; instead, they arrive sequentially, possibly until infinity. Thus, an algorithm cannot store all previous observations and retrain the model each time with the entire batch. Further, the user’s motion pattern or the operation setting of the system may change over time. Therefore, adapting to new information and forgetting old information must be balanced. Finally, the target user should not have to do any work to use the recognition system, such as labeling activities. As we have seen, our approach addresses all of these challenges.
The experiments on the publicly available WISDM dataset confirmed this. Our approach improved accuracy for all but a few users, and in particular for users whose movement patterns are quite different from those of their peers, by up to 14 %. This indicates that our approach provides improvements especially for users who are hard to classify by a general model. The experiments also showed that utilizing new data as soon as it becomes available is indeed beneficial. However, depending on the setting, the adaptation rate (momentum) to new information must be stronger or weaker, and may also change over time. Investigating an algorithm to automatically regulate the momentum parameter may be a promising future direction. Further, our experiments showed that using DA-BN layers also leads to competitive results in the supervised and unsupervised batch cases. This is especially true if only very little (labeled or unlabeled) target data is available. A major next step would be to extend these experiments to a variety of additional datasets.
This work was supported by The International Center for Advanced Communication Technologies (InterACT) and the Baden-Württemberg Foundation. This work further used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562. Specifically, it used the Bridges system, which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC).