People-centric sensing enables a wide range of challenging but promising applications that have great potential to impact people’s daily lives [Liao et al.2015, Chen et al.2018b] in many realms such as Brain Computer Interface (BCI) [Zhang et al.2018], assistive living [Basanta, Huang, and Lee2017], robotics [Lauretti et al.2017] and rehabilitation [Smeddinck, Herrlich, and Malaka2015]. One of the major components of people-centric sensing is understanding human behaviors by analyzing the data collected from people-centric sensing devices, such as wearable sensors and biosensors. However, annotation is difficult in the context of people-centric sensing due to the high cost of manual labeling, privacy concerns and the difficulty of automation [Do and Gatica-Perez2014]. Therefore, a large body of research on semi-supervised learning (SSL) has emerged. SSL enables a reliable model to be trained by learning from the labeled samples while also properly leveraging the unlabeled samples.
Most of the existing SSL works are based on the assumption that the labeled data and the unlabeled data are drawn from identical or similar distributions. For example, [Cheng et al.2016], [Xing et al.2018] and [Chen et al.2018a] utilize multiple classifiers to pseudo-label the unlabeled samples that obtain confident predictions. In their tasks, the correctness of labeling is ensured by the condition that the labeled data and the unlabeled data are drawn from similar distributions. But this assumption does not always hold.
In practical human-centred scenarios, labeled data can be collected from only a few subjects for training, while unlabeled data are usually collected from the target users. Since people have diverse behavior patterns and biological phenomena [Bulling, Blanke, and Schiele2014], data collected from different subjects follow different distributions. This triggers the distribution shift problem, where the labeled data and the unlabeled data are distributed differently.
Distribution shift is a common problem in people-centric sensing and in most practical applications that require predictive modeling. Despite this, most attention has been given to semi-supervised learning in which the main challenge is data scarcity rather than shifted distributions, and distribution shift has been relatively underexplored until recently. Some researchers propose to tackle the distribution shift problem by unsupervised domain adaptation, that is, by transferring the model trained on the labeled data to the unlabeled data. For instance, some recent works such as [Liu and Tuzel2016] and [Tzeng et al.2017] are committed to mapping both domains into a common feature space. However, they make the covariate shift assumption that only the marginal distributions of the input data are shifted, and overlook the potential shift in the conditional distributions of the output labels given the inputs. In this setting, their models only see the difference between the labeled data and the unlabeled data but neglect their latent output-related similarity.
To fill this gap, we propose a two-faced treatment that tackles the problem of SSL for distribution shift. We define two characteristics for the training data, person-specific discrepancy and task-specific consistency. Person-specific discrepancy means the distribution divergence of data collected from different people owing to their different behavior patterns and biological phenomena. In our semi-supervised setting, person-specific discrepancy also represents the distribution divergence between the labeled data and the unlabeled data. By contrast, task-specific consistency denotes the inherent similarity of the data of the subjects performing the same task. Our aim is to learn an embedding that reduces person-specific discrepancy and simultaneously preserves task-specific consistency. The main building blocks of the proposed approach are illustrated in Figure 1. We start by reducing person-specific discrepancy. By adversarial training, we reduce the distribution divergence between the latent features of the labeled data and the unlabeled data. Then, we generate paired features and force them to lie in the same space to preserve task-specific consistency. In this way, we ensure the classifier trained with the labeled samples is also effective on the unlabeled samples.
The key contributions of this research are as follows:
- We propose a novel distributionally robust semi-supervised learning algorithm to address the distribution shift problem. We consider the distribution discrepancy between the labeled data and the unlabeled data, and align the feature distributions when the training data are distributed differently. We also leverage the similarity of the labeled data and the unlabeled data to learn the task-related discriminative features for classification.
- We propose to reduce person-specific discrepancy by aligning the marginal distributions of the labeled data and the unlabeled data. Specifically, we force the latent feature distributions to be similar by training the model in an adversarial way.
- Furthermore, considering the classification task of our model, we propose to preserve task-specific consistency by generating paired data and keeping their features consistent. Task-specific consistency prevents the features from losing task-related information and facilitates the classification.
- We compare the proposed model with eight state-of-the-art methods in four challenging people-centric sensing tasks: intention recognition, activity recognition, muscular movement recognition and gesture recognition. The comprehensive results demonstrate the effectiveness of our model in tackling the distribution shift problem in SSL.
The Proposed Method
Problem Statement and Method Overview
We now detail the distributionally robust model for semi-supervised learning on distributionally shifted data. Assume there are two parts to the training data: the labeled set and the unlabeled set . In , each sample (, , ) consists of an input vector, an activity label and a distribution indicator that indicates the sample is from , where is some input space and is a finite label space for classification problems. In , the samples that lack labels are denoted by (, ), where and which indicates the sample is from . For simplicity, when referring to a sample regardless of whether it is labeled or unlabeled, we denote the input vector by .
Under the distribution shift assumption, we assume that the data are drawn from different distributions, that is, is drawn from a marginal distribution and is drawn from a different marginal distribution . Thus, person-specific discrepancy is formulated as the divergence of and : . Simultaneously, unlike some domain adaptation methods [Liu and Tuzel2016, Tzeng et al.2017] that assume , we do not make the same assumption but hold the opinion that there exists latent consistency for data collected in the same tasks. Therefore, we aim at preserving task-specific consistency by learning latent features so that and the predictor learned with is also effective on .
We decompose the proposed model into five parts: an encoder that maps input data to a latent feature , a label predictor that maps feature to the label , a distribution predictor that predicts whether the feature is mapped from or , and two decoders and that reconstruct input vectors of and . The parameters of the five parts are denoted by , respectively. An overview of the proposed model is shown in Figure 1.
We define four components of the training objective: the user adversarial loss, , forces a reduction in the distribution divergence of the latent features of and ; the reconstruction loss, , learns two decoders to reconstruct input vectors from latent features ; the latent consistency loss, , is a constraint that avoids losing the task-specific information during training; the final prediction loss, , encourages the encoder to learn discriminative features and ensures a powerful label predictor is trained. The total loss can be defined as the sum of the four components:
Reducing Person-Specific Discrepancy
To reduce person-specific discrepancy, we aim at learning features and making the distributions and similar. Since calculating and controlling the distribution discrepancy is non-trivial, we force the feature extractor to map and to a unified distribution by learning the features whose distributions cannot be distinguished by the distribution classifier. This is constrained by an adversarial loss (see Figure 1(a)). For the binary classification problem, the loss function is defined as:
where is the number of labeled samples and is the number of unlabeled samples. Firstly, we need a sufficiently strong classifier to distinguish users from latent features because successfully deceiving a weak classifier does not mean the features are drawn from similar distributions. This step is done by updating while maximizing Eq. 2 and fixing . Meanwhile, we need to learn the features that are unidentifiable for . This is done by updating while minimizing Eq. 2 and fixing . Therefore, the optimization of the adversarial loss can be summarized as:
Probabilistically, Eq. 2 can be rewritten as:
Regardless of the constant, the optimization of the adversarial loss can be formulated as the problem of finding the optimal so that the discrepancy between and is minimized.
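To make the min-max game concrete, the following is a minimal sketch of a two-term log-likelihood objective for a binary distribution predictor, written in plain Python. The function name `adversarial_objective` and the convention that indicator 1 marks the labeled set are assumptions made for illustration; the exact form and normalization of Eq. 2 may differ. The distribution predictor would try to increase this value, while the encoder would try to decrease it.

```python
import math

def adversarial_objective(d_labeled, d_unlabeled):
    """Empirical log-likelihood of a binary distribution predictor.

    d_labeled:   predicted probabilities that a latent feature comes from the
                 labeled set, evaluated on features of labeled samples.
    d_unlabeled: the same probabilities, evaluated on features of unlabeled samples.
    Assumption (not from the source): indicator 1 = labeled, 0 = unlabeled.
    """
    eps = 1e-12  # guards against log(0)
    term_l = sum(math.log(p + eps) for p in d_labeled) / len(d_labeled)
    term_u = sum(math.log(1.0 - p + eps) for p in d_unlabeled) / len(d_unlabeled)
    return term_l + term_u

# A confident, correct distribution predictor pushes the objective towards 0.
strong = adversarial_objective([0.99, 0.98], [0.01, 0.02])

# Features the predictor cannot tell apart (output 0.5 everywhere) give -log 4.
confused = adversarial_objective([0.5, 0.5], [0.5, 0.5])
```

In this sketch, `confused` equals -log 4, which is the constant that appears in the analysis of generative adversarial networks [Goodfellow et al.2014] when the two distributions coincide; `strong` is much closer to zero, reflecting features that are still easy to separate.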
Preserving Task-Specific Consistency
By preserving task-specific consistency, we learn features so that . Intuitively, if there exists a matching sample that belongs to the same label , we only need to make . However, in our semi-supervised setting, we do not have paired data to assess the latent task-related differences. Instead, we generate paired data using the decoders shown in Figure 1(b). and are able to reconstruct input vectors from the corresponding latent features . They can also be regarded as two generators that generate from . Therefore, we generate with : , and similarly for the reverse: . In this way, we only need to make and to ensure task-specific consistency of the paired data.
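The paired-feature idea above can be sketched with toy stand-ins for the encoder and a cross-domain decoder. Everything named here (`consistency_loss`, `squared_distance`, the toy lambdas) is hypothetical, and squared Euclidean distance is an assumed choice, since the text only speaks of a distance between vectors.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def consistency_loss(x, encoder, cross_decoder):
    """Encode x, generate its counterpart with the other domain's decoder,
    re-encode the generated sample, and measure how far the feature moved."""
    z = encoder(x)
    x_paired = cross_decoder(z)      # generated sample in the other domain
    z_paired = encoder(x_paired)
    return squared_distance(z, z_paired)

# Toy stand-ins (hypothetical): the encoder halves each coordinate.
encoder = lambda x: [0.5 * v for v in x]
exact_decoder = lambda z: [2.0 * v for v in z]          # inverts the encoder
shifted_decoder = lambda z: [2.0 * v + 1.0 for v in z]  # adds a constant offset

x_labeled = [1.0, -2.0, 4.0]
loss_exact = consistency_loss(x_labeled, encoder, exact_decoder)
loss_shifted = consistency_loss(x_labeled, encoder, shifted_decoder)
```

With the exact decoder the feature returns to where it started and the loss is zero; with the shifted decoder each of the three latent coordinates moves by 0.5, giving a loss of 0.75. Minimizing such a loss would push the encoder to map a sample and its generated counterpart to the same feature.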
Firstly, we need two decoders that can reconstruct input vectors from the corresponding latent features . They are optimized as regular autoencoders (see the left of Figure 1(b)):
where denotes the distance between vectors. Note that only the two decoders are updated when minimizing since may distract the encoder from learning the features that reduce person-specific discrepancy. Then, task-specific consistency is ensured by the consistency loss as shown in the right of Figure 1(b):
We finally conduct the prediction. Good prediction performance not only relies on a powerful predictor but also requires discriminative features. We harness the annotated data to optimize the parameters of both the feature extractor (the encoder) and the predictor as Figure 1
where is the number of label classes, and if the -th sample belongs to the -th class and otherwise. ensures the discriminativeness of the features learned by the encoder and the good classification ability of the predictor for the annotated data. Reducing person-specific discrepancy and preserving task-specific consistency ensures that the learned with only is effective on .
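The description of the prediction loss (softmax outputs combined with a one-hot class indicator) matches a standard cross-entropy. A minimal sketch in plain Python follows; the function names and the averaging over samples are assumptions, not details confirmed by the text.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def prediction_loss(batch_logits, labels):
    """Mean cross-entropy over labeled samples; labels are class indices,
    which is equivalent to summing over classes with a one-hot indicator."""
    total = 0.0
    for logits, k in zip(batch_logits, labels):
        probs = softmax(logits)
        total -= math.log(probs[k])
    return total / len(labels)

# Uninformative logits over 3 classes give a loss of log(3).
uniform = prediction_loss([[0.0, 0.0, 0.0]], [1])

# A confident, correct prediction gives a loss close to zero.
confident = prediction_loss([[0.0, 10.0, 0.0]], [1])
```

Only labeled samples enter this loss, which is why the encoder also needs the adversarial and consistency terms for the resulting predictor to transfer to the unlabeled set.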
Training and Optimization
The training objective is to minimize Eq. 1. Nevertheless, the four losses , , and have respective goals and different associated parameters to learn. The optimization problem can be summarized and jointly trained as:
However, in the experiments, we find that a very strong classifier may minimize the feature distribution discrepancy of and , but it will also distract the encoder from learning discriminative features for prediction. Therefore, we set a threshold to seek a balance for the min-max game between person-specific discrepancy and discriminativeness. On the other hand, we require rather strong decoders for reconstruction, so a second threshold is set to guarantee the reconstruction performance. The detailed procedure is shown in Algorithm 1.
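One possible reading of the two thresholds is a gating rule that decides which parameter groups are updated in each iteration. The sketch below is purely illustrative: the names `groups_to_update`, `tau_adv` and `tau_rec`, the use of loss values as the gated quantities, and the direction of each comparison are all assumptions, and Algorithm 1 remains the authoritative description.

```python
def groups_to_update(disc_loss, rec_loss, tau_adv, tau_rec):
    """Return the parameter groups to update in one training iteration.

    Assumed rule (hypothetical): the distribution predictor is strengthened
    only while its loss is still above tau_adv, so it never becomes too
    strong; the decoders keep training until their reconstruction loss
    has fallen to tau_rec.
    """
    groups = ["encoder", "label_predictor"]
    if disc_loss > tau_adv:
        groups.append("distribution_predictor")
    if rec_loss > tau_rec:
        groups.append("decoders")
    return groups

# Early in training both auxiliary parts are still weak and get updated.
early = groups_to_update(disc_loss=0.9, rec_loss=0.8, tau_adv=0.5, tau_rec=0.1)

# Later, both thresholds are met and only the encoder and predictor move.
late = groups_to_update(disc_loss=0.3, rec_loss=0.05, tau_adv=0.5, tau_rec=0.1)
```

Under this reading, a larger `tau_adv` stops the distribution predictor earlier and keeps it weaker, while a smaller `tau_rec` demands stronger decoders.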
In this section, we evaluate the performance of our proposed method in four challenging people-centric sensing tasks: intention recognition, activity recognition, muscular movement recognition and gesture recognition. In particular, we first compare our model with both semi-supervised methods that take no account of distribution shift and state-of-the-art domain adaptation methods. The experimental results show that our method outperforms these state-of-the-art methods. Secondly, we perform a detailed ablation study to examine the contributions of the proposed components to the prediction performance. Then we explore the scalability of our model when and are associated with multiple subjects. We further present the visualized distributions of the latent features. Lastly, we analyze the model’s sensitivity to the two thresholds.
Intention Recognition–EEG Dataset [Goldberger et al.2000]: The EEG dataset contains 108 subjects executing left/right fist open and close intention tasks. The EEG data is collected using BCI2000 instrumentation [Schalk et al.2004] with 64 electrode channels and 160Hz sampling rate. Each subject performs around 45 trials with a roughly balanced ratio of the right and the left fist. We randomly choose 10 subjects for evaluation and select the period from 1 second after the onset to the end of one trial.
Muscular Movement Recognition–EMG Dataset (http://archive.ics.uci.edu/ml/datasets/emg+dataset+in+lower+limb#): The UCI EMG Dataset in Lower Limb contains 11 subjects with no abnormalities in the knee executing three different exercises for analysis of the behavior associated with the knee muscle: gait, leg extension from a sitting position, and flexion of the leg up. The data is collected by an MWX8 datalog from the Biometrics company. The acquisition process was conducted with four electrodes and one goniometer in the knee. Data with 5 channels are acquired directly from the MWX8 equipment at 14 bits of resolution and 1000Hz frequency.
Activity Recognition–MHEALTH [Banos et al.2014]: This dataset is devised to benchmark human activity recognition methods based on multimodal wearable sensor data. Three inertial measurement units (IMUs) are respectively placed on 10 participants’ chest, right wrist, and left ankle to record the acceleration (m/s²), angular velocity (deg/s) and the magnetic field (local) data while they are performing 12 activities. The IMU on the chest also collects 2-lead ECG data (mV) to monitor the electrical activity of the heart. All sensing modalities are recorded at a frequency of 50 Hz.
Gesture Recognition–Opportunity Gesture [Roggen et al.2010]: This dataset consists of data collected from four subjects by a wide variety of body-worn, object-based and ambient sensors in a realistic manner. There are a total of 17 gesture classes that comprise the coarser characterization of the user’s hand activities, such as opening a door, closing a door, and toggling a switch. Each recording contains 242 real-valued sensory readings.
In this work, we use a convolutional autoencoder as the main architecture. The encoder has one convolutional layer, one max-pooling layer and one fully-connected layer. The two decoders mirror the architecture of the encoder, with one fully-connected layer, one un-pooling layer and one deconvolutional layer. Each convolutional layer is followed by a rectified linear unit (ReLU) activation and the classification outputs are calculated by the softmax functions. The kernel size of the convolutional layer and the deconvolutional layers is and the number of feature maps is 40, where denotes the number of features of the datasets, and the pooling size is . We use stochastic gradient descent with the Adam update rule to minimize the loss functions at a learning rate of-. Dropout regularization with a keep probability of is applied before the fully-connected layers. Batch normalization is also used during training to improve performance. All the experiments are conducted on an Nvidia Titan X Pascal GPU.
Comparison with State-of-the-Art
To verify the overall performance of the proposed model, we first compare our model with other state-of-the-art methods. The compared methods include semi-supervised methods (Tri-Net [Chen et al.2018a], DP [Cheng et al.2016] and MSS [Shinozaki2016]), none of which take into account distribution shift, and other domain adaptation methods (DANN [Ganin et al.2016], CYCADA [Hoffman et al.2018], ADDA [Tzeng et al.2017], CoGAN [Liu and Tuzel2016] and Cycle GAN [Zhu et al.2017]). We also employ a regular CNN as a supervised baseline, which is trained only with the labeled set . Considering that different people have different behavior patterns and biological phenomena, we simulate distribution shift scenarios by drawing training sets and from two different subjects and . The data of is evenly split into two halves: one is the unlabeled training set and the other is used as the test set . Cross-validation is conducted on all the participating subjects to ensure rigor.
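The subject-pair protocol described above can be sketched as follows. The helper names, the dictionary layout of the data, the in-order (non-random) halving of the second subject's data, and the use of all ordered subject pairs for cross-validation are assumptions made for illustration only.

```python
from itertools import permutations

def make_split(data_by_subject, labeled_subject, unlabeled_subject):
    """Build (labeled set, unlabeled training set, test set) for one subject pair.

    data_by_subject maps a subject id to a list of (input_vector, label) pairs.
    Assumption: the second subject's data is halved in order, not shuffled.
    """
    labeled = list(data_by_subject[labeled_subject])
    target = list(data_by_subject[unlabeled_subject])
    half = len(target) // 2
    unlabeled = [x for x, _ in target[:half]]  # labels are dropped for training
    test = target[half:]                       # labels are kept for evaluation
    return labeled, unlabeled, test

def all_subject_pairs(subjects):
    """All ordered (labeled subject, unlabeled subject) pairs."""
    return list(permutations(subjects, 2))

# Toy data: two hypothetical subjects with four samples each.
toy = {
    "s1": [([0.1], 0), ([0.2], 1), ([0.3], 0), ([0.4], 1)],
    "s2": [([1.1], 1), ([1.2], 0), ([1.3], 1), ([1.4], 0)],
}
labeled, unlabeled, test = make_split(toy, "s1", "s2")
pairs = all_subject_pairs(["s1", "s2", "s3"])
```

Keeping the test half disjoint from the unlabeled training half means that accuracy on the test set measures generalization to unseen data from the target subject rather than memorization of the unlabeled samples.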
As we can observe from Table 1, the performance of all the methods on MHEALTH achieves even though and are collected from different subjects, while the performance on the other datasets only achieves or . The prediction performance demonstrates the degrees of distribution shift in four datasets, among which the discrepancy in MHEALTH is the smallest. This observation coincides with the visualized distribution discrepancy we show in Figure 3.
With respect to the compared methods, the semi-supervised methods Tri-Net, DP and MSS only obtain results similar to the regular CNN even though they resort to the unlabeled data of . Owing to the distribution shift, the information of cannot be well leveraged by these methods. In contrast, DANN, CYCADA, ADDA, CoGAN and Cycle GAN achieve better results since they consider distribution shift and are devoted to mitigating the shift. Overall, the proposed model significantly outperforms the conventional semi-supervised methods. Our model also achieves better performance than the state-of-the-art domain adaptation methods. By reducing person-specific discrepancy and preserving task-specific consistency, our model makes the classifier trained on also effective on and .
We perform a detailed ablation study to examine the contributions of the proposed model components to the prediction performance in Table 2. We first consider the model trained only with . This model is composed of and , which is the same as a regular CNN trained on and tested on . This model serves as a baseline to evaluate the effectiveness of the other components. Secondly, we evaluate the contribution of reducing person-specific discrepancy by combining and . This model is composed of , and . As we can see in Table 2, the adversarial loss is effective since the prediction results are improved by to . This is in accordance with the analysis that optimizes the parameters of the encoder to minimize person-specific discrepancy and is beneficial to prediction. We also conduct experiments using the model that preserves task-specific consistency but omits the adversarial loss, that is, . The model is composed of , , and . Note that is only meaningful when it works with to build the consistency loop. Otherwise, it only trains two decoders that serve no purpose. It can be observed that this setting also achieves better performance than the regular model since it directly forces the paired features to be equal and generalizes the model by creating more samples. However, it is less effective than reducing person-specific discrepancy. When person-specific discrepancy is large, it is harder to generate data or so the effect of preserving task-specific consistency of and is limited. When all components are combined, our model achieves the best performance.
Scalability to Multi-Subjects
The setting of this model is that and obey two different distributions. For example, and may be drawn from two subjects. However, situations still exist when and are separately collected from quite a number of subjects. Therefore, the training sets and may include multiple diverse distributions. We now explore the scalability of our model in this setting. As Figure 2 shows, we increase the number of the labeled subjects from to in the EEG and MHEALTH datasets, from to in the EMG dataset, and from to in the OPPORTUNITY dataset, and increase the number of the unlabeled subjects in the same way. Note that we do not conduct experiments in the settings where the sum of the number of labeled subjects and the number of unlabeled subjects is larger than the total number of participating subjects, since in these settings there must exist overlapping data shared by and , which violates the overall distribution shift setting.
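The constraint on subject counts can be made explicit with a small enumeration. The function name and the `max_per_set` parameter are hypothetical; the value 3 used below is only an example upper bound, while the total of four subjects corresponds to the Opportunity dataset described earlier.

```python
def valid_subject_settings(total_subjects, max_per_set):
    """Enumerate (labeled subjects, unlabeled subjects) counts whose sum does
    not exceed the number of available subjects, so the two sets never have
    to share a subject."""
    settings = []
    for n_labeled in range(1, max_per_set + 1):
        for n_unlabeled in range(1, max_per_set + 1):
            if n_labeled + n_unlabeled <= total_subjects:
                settings.append((n_labeled, n_unlabeled))
    return settings

# With four subjects in total, (2, 3) or (3, 3) would force an overlap.
four_subject_settings = valid_subject_settings(total_subjects=4, max_per_set=3)
```

For four subjects this leaves six admissible settings: (1, 1), (1, 2), (1, 3), (2, 1), (2, 2) and (3, 1).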
In this experiment, the distribution classifier still works as a binary classifier. We consider the merging of all the distributions in as a new distribution and the same for . It can be observed that accuracy increases with an increase in the number of labeled subjects and decreases with an increase in the number of unlabeled subjects, which conforms to the intuition that diversely distributed labeled data gives the model generalization ability, but too scattered unlabeled data is detrimental to training.
Latent Feature Visualization
To verify the effectiveness of the proposed model, we present the visualized distributions of both the raw data and the latent features of and via t-SNE visualization [Maaten2013], as Figure 3 shows. We can observe a rather obvious discrepancy between the raw data distributions of and . In line with Table 1, the discrepancy of raw data is relatively small in MHEALTH and is noticeable in OPPORTUNITY. After training, the features of the labeled data and the unlabeled data are well merged in the MHEALTH, EMG and OPPORTUNITY datasets. The merging is less effective in the EEG dataset, but a reduction in the discrepancy can still be noticed.
Sensitivity to Thresholds
Lastly, we present the model’s sensitivity to the two thresholds in Figure 4. controls how strong the classifier is to align the features of and , and affects the reconstruction performance. In Figure 4(a), the prediction accuracy peaks when is around or . The reason for this is that although an overly strong classifier may minimize the feature distribution discrepancy of and , it also distracts the encoder from learning discriminative features for prediction. Meanwhile, too weak is meaningless to our model. The best , in fact, finds the balance for the min-max game between person-specific discrepancy and discriminativeness. In Figure 4(b), accuracy decreases with an increase in . It can be inferred that powerful reconstruction ability is important for the proposed model.
We propose a novel distributionally robust semi-supervised method for handling shifted distributions of the labeled and the unlabeled data. The model first reduces person-specific discrepancy by aligning the distributions of the labeled data and the unlabeled data. Task-specific consistency is further proposed for extracting label-related features. We experimentally validate our model on a variety of people-centric sensing tasks. The results show that the proposed model outperforms the state-of-the-art methods. Our model is generic and can be applied to practical applications.
- [Banos et al.2014] Banos, O.; Garcia, R.; Holgado-Terriza, J. A.; Damas, M.; Pomares, H.; Rojas, I.; Saez, A.; and Villalonga, C. 2014. mhealthdroid: a novel framework for agile development of mobile health applications. In International Workshop on Ambient Assisted Living, 91–98. Springer.
- [Basanta, Huang, and Lee2017] Basanta, H.; Huang, Y.-P.; and Lee, T.-T. 2017. Assistive design for elderly living ambient using voice and gesture recognition system. In Systems, Man, and Cybernetics (SMC), 2017 IEEE International Conference on, 840–845. IEEE.
- [Bulling, Blanke, and Schiele2014] Bulling, A.; Blanke, U.; and Schiele, B. 2014. A tutorial on human activity recognition using body-worn inertial sensors. ACM Computing Surveys (CSUR) 46(3):33.
- [Chen et al.2018a] Chen, D.; Wang, W.; Gao, W.; and Zhou, Z. 2018. Tri-net for semi-supervised deep learning. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden., 2014–2020.
- [Chen et al.2018b] Chen, K.; Yao, L.; Wang, X.; Zhang, D.; Gu, T.; Yu, Z.; and Yang, Z. 2018. Interpretable parallel recurrent neural networks with convolutional attentions for multi-modality activity modeling. In 2018 International Joint Conference on Neural Networks, IJCNN 2018, Rio de Janeiro, Brazil, July 8-13, 2018, 1–8.
- [Cheng et al.2016] Cheng, Y.; Zhao, X.; Cai, R.; Li, Z.; Huang, K.; and Rui, Y. 2016. Semi-supervised multimodal deep learning for rgb-d object recognition. In IJCAI, 3345–3351.
- [Do and Gatica-Perez2014] Do, T. M. T., and Gatica-Perez, D. 2014. The places of our lives: Visiting patterns and automatic labeling from longitudinal smartphone data. IEEE Transactions on Mobile Computing 13(3):638–648.
- [Ganin et al.2016] Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; and Lempitsky, V. 2016. Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17(1):2096–2030.
- [Goldberger et al.2000] Goldberger, A. L.; Amaral, L. A.; Glass, L.; Hausdorff, J. M.; Ivanov, P. C.; Mark, R. G.; Mietus, J. E.; Moody, G. B.; Peng, C.-K.; and Stanley, H. E. 2000. Physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals. Circulation 101(23):e215–e220.
- [Goodfellow et al.2014] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in neural information processing systems, 2672–2680.
- [Hoffman et al.2018] Hoffman, J.; Tzeng, E.; Park, T.; Zhu, J.; Isola, P.; Saenko, K.; Efros, A. A.; and Darrell, T. 2018. Cycada: Cycle-consistent adversarial domain adaptation. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, 1994–2003.
- [Lauretti et al.2017] Lauretti, C.; Cordella, F.; Guglielmelli, E.; and Zollo, L. 2017. Learning by demonstration for planning activities of daily living in rehabilitation and assistive robotics. IEEE Robotics and Automation Letters 2(3):1375–1382.
- [Liao et al.2015] Liao, L.; Xue, F.; Lin, M.; Li, X.-L.; and Krishnaswamy, S. P. 2015. Human activity classification in people centric sensing exploiting sparseness measurement. In Information, Communications and Signal Processing (ICICS), 2015 10th International Conference on, 1–5. IEEE.
- [Liu and Tuzel2016] Liu, M.-Y., and Tuzel, O. 2016. Coupled generative adversarial networks. In Advances in neural information processing systems, 469–477.
- [Maaten2013] Maaten, L. v. d. 2013. Barnes-hut-sne. In Proceedings of the International Conference on Learning Representations.
- [Roggen et al.2010] Roggen, D.; Calatroni, A.; Rossi, M.; Holleczek, T.; Förster, K.; Tröster, G.; Lukowicz, P.; Bannach, D.; Pirkl, G.; Ferscha, A.; et al. 2010. Collecting complex activity datasets in highly rich networked sensor environments. In Networked Sensing Systems (INSS), 2010 Seventh International Conference on, 233–240. IEEE.
- [Schalk et al.2004] Schalk, G.; McFarland, D. J.; Hinterberger, T.; Birbaumer, N.; and Wolpaw, J. R. 2004. Bci2000: a general-purpose brain-computer interface (bci) system. IEEE Transactions on biomedical engineering 51(6):1034–1043.
- [Shinozaki2016] Shinozaki, T. 2016. Semi-supervised learning for convolutional neural networks using mild supervisory signals. In International Conference on Neural Information Processing, 381–388. Springer.
- [Smeddinck, Herrlich, and Malaka2015] Smeddinck, J. D.; Herrlich, M.; and Malaka, R. 2015. Exergames for physiotherapy and rehabilitation: a medium-term situated study of motivational aspects and impact on functional reach. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 4143–4146. ACM.
- [Tzeng et al.2017] Tzeng, E.; Hoffman, J.; Saenko, K.; and Darrell, T. 2017. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), volume 1, 4.
- [Xing et al.2018] Xing, Y.; Yu, G.; Domeniconi, C.; Wang, J.; and Zhang, Z. 2018. Multi-label co-training. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden., 2882–2888.
- [Zhang et al.2018] Zhang, D.; Yao, L.; Zhang, X.; Wang, S.; Chen, W.; Boots, R.; and Benatallah, B. 2018. Cascade and parallel convolutional recurrent neural networks on eeg-based intention recognition for brain computer interface. In AAAI.
- [Zhu et al.2017] Zhu, J.-Y.; Park, T.; Isola, P.; and Efros, A. A. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision.