Semi-unsupervised Learning of Human Activity using Deep Generative Models

by   Matthew Willetts, et al.

Here we demonstrate a new deep generative model for classification. We introduce ‘semi-unsupervised learning’, a problem regime related to transfer learning and zero/few-shot learning where, in the training data, some classes are sparsely labelled and others entirely unlabelled. Models able to learn from training data of this type are potentially of great use, as many medical datasets are ‘semi-unsupervised’. Our model demonstrates superior semi-unsupervised classification performance on MNIST to model M2 from Kingma et al. (2014). We apply the model to human accelerometer data, performing activity classification and structure discovery on windows of time series data.






1 Introduction

When developing machine learning solutions in healthcare and medicine, one typically has far more unlabelled data than labelled data. Further, there is selection bias: the labelled data is often a biased sample of the overall data distribution. For rare diseases, rare risk factors and other uncommon states, particular classes might be entirely unobserved in the labelled dataset, appearing only in unlabelled data.

This occurs in the use of machine learning to measure sleep duration and physical activity from sensor data, to better understand their association with morbidity and mortality Doherty2017; Willetts2018. By increasing our understanding of human activity, and combining it with rich medical datasets such as biobanks, we can potentially obtain deeper insights into diseases and their links with lifestyle factors.

Thus we are interested in the case where an unlabelled instance of data could be from one of the sparsely-labelled classes or from an entirely-unlabelled class. We call this ‘semi-unsupervised learning’. Here we are jointly performing semi-supervised learning on sparsely-labelled classes, and unsupervised learning on completely unlabelled classes. We give a deep generative model Rezende2014; Kingma2013 that can solve this problem.¹

¹ Note: related work-in-progress, with more focus on theory, is in the NeurIPS Bayesian Deep Learning Workshop 2018, titled ‘Semi-unsupervised Learning using Deep Generative Models’.

Semi-unsupervised learning has similarities to some varieties of zero-shot learning (ZSL), where deep generative models have been of interest Weiss2016, but in ZSL one has access to auxiliary side information (commonly an ‘attribute vector’) for data at training time, which we do not. So our regime is equivalent to transductive generalised ZSL, but with no side information Xian2018. It also has similarities to transfer learning. In Cook et al.’s terms Cook2013, ‘semi-unsupervised learning’ is related to uninformed semi-supervised transductive transfer learning, but here we use our source and target information jointly, and our discrete label space can be either the same or different for our labelled and unlabelled data.

Numerous methods have been proposed for activity recognition, including: deep discriminative models Alsheikh2015; Ordonez2016; Hammerla2016; random forests; probabilistic graphical models/hidden Markov models Ellis2016; Willetts2018; and Gaussian processes Alvarez2011. None are set up with the ‘semi-unsupervised’ regime in mind. Thus we are interested in flexible, scalable semi-unsupervised machine learning methods, where unlabelled and labelled data are used together to improve classification performance. We show our model’s utility for MNIST image data classification, and for human activity recognition from sensor data.

2 Deep Generative Models

In a deep generative model, the parameters of the distributions within a probabilistic graphical model are themselves parameterised by neural networks. The simplest is the variational autoencoder (VAE) Kingma2013; Rezende2014, the deep version of factor analysis. Here there is a continuous unobserved latent z and observed data x. The joint probability is

p_θ(x, z) = p_θ(x|z) p(z),

with p(z) = N(z | 0, I) and p_θ(x|z) = N(x | μ_θ(z), σ²_θ(z)), where μ_θ, σ_θ are each parameterised by neural networks with parameters θ. As exact inference for the posterior p_θ(z|x) is intractable, it is standard to perform stochastic amortised variational inference to obtain an approximation to the true posterior.

For a VAE, introduce a recognition network q_φ(z|x) = N(z | μ_φ(x), σ²_φ(x)), where μ_φ, σ_φ are neural networks with parameters φ. Through joint optimisation over {θ, φ} using stochastic gradient descent we aim to find the point estimates of the parameters that maximise the evidence lower bound (ELBO)

L(θ, φ; x) = E_{q_φ(z|x)}[log p_θ(x|z)] − KL(q_φ(z|x) ‖ p(z)) ≤ log p_θ(x).

For the expectation over q_φ(z|x) in L we take Monte Carlo (MC) samples. To then take derivatives through these samples w.r.t. φ we need to be able to differentiate through a sample from a Gaussian. For this we use a case of the ‘reparameterisation trick’, where we notice that it is possible to rewrite a sample from a Gaussian as a deterministic function of a sample ε from N(0, I):

z = μ_φ(x) + σ_φ(x) ⊙ ε, with ε ∼ N(0, I),

thus we can differentiate a sample w.r.t. φ, and so we can differentiate our MC approximation of L w.r.t. φ.
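As an illustration, the reparameterised sampler and the closed-form Gaussian KL term of the ELBO can be written in a few lines of NumPy (a minimal sketch of these standard results, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterise(mu, log_sigma, rng):
    # z = mu + sigma * eps, eps ~ N(0, I): a sample from N(mu, sigma^2)
    # written as a deterministic, differentiable function of eps.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps

def gaussian_kl(mu, log_sigma):
    # KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions;
    # this is the analytic KL term of the ELBO for a Gaussian VAE.
    return 0.5 * np.sum(mu**2 + np.exp(2 * log_sigma) - 1 - 2 * log_sigma, axis=-1)

# Illustrative encoder outputs for a batch of 4 points with a 2-d latent.
mu = np.zeros((4, 2))
log_sigma = np.zeros((4, 2))          # i.e. sigma = 1, matching the prior
z = reparameterise(mu, log_sigma, rng)
kl = gaussian_kl(mu, log_sigma)       # zero, as the posterior equals the prior
```

In an actual VAE the gradients flow through mu and log_sigma back into the encoder; the point here is only that the sample is an explicit function of those parameters.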

To perform semi-supervised classification with a deep generative model, introduce a discrete class variable y into the generative model and into the recognition networks. There will be two evidence lower bounds for the model: one where y is a latent variable to be inferred, L_u(x), and one where y is observed, L_l(x, y).

In this work we build on the M2 model developed by Kingma et al. (2014) Kingma2014a. Here p_θ(x, y, z) = p_θ(x|y, z) p(y) p(z), with p(z) = N(z | 0, I), and recognition networks q_φ(y|x) and q_φ(z|x, y). p(y) = Cat(y|π) is the discrete prior on y. Via simple manipulation one can show L_u(x) = Σ_y q_φ(y|x) L_l(x, y) + H(q_φ(y|x)). Note that q_φ(y|x), which is to be our trained classifier at the end, only appears in L_u, so it would only be trained on unlabelled data. To remedy this, motivated by considering a Dirichlet hyperprior on π, they add to the loss the cross entropy between the true label and q_φ(y|x), weighted by a factor α. So the overall objective for model M2, with unlabelled data D_u and labelled data D_l, is:

J = Σ_{x ∈ D_u} L_u(x) + Σ_{(x,y) ∈ D_l} [ L_l(x, y) + α log q_φ(y|x) ].
This model has been demonstrated in the semi-supervised case Kingma2014a, but when there is no labelled data at all, so that we are just optimising L_u, the model can fail to learn an informative distribution for y (see the similar effect in Dilokthanakula). Either q_φ(y|x) collapses to the prior p(y), or it maps every datapoint to one class. Either way, the model reduces to something very similar to a standard VAE with no y variable. This happens when the encoder and decoder have high enough capacity to reach a locally optimal value of the evidence lower bound without using the class label. Thus, if one wishes to use high-capacity neural networks, it is necessary to adjust the model in some way. Here we propose a change to the generative structure.
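The decomposition of the unlabelled bound, and the α-weighted objective, can be sketched numerically as follows (a toy NumPy fragment with illustrative names, not the paper's code):

```python
import numpy as np

def entropy(q):
    # H(q(y|x)) for each row of class probabilities.
    return -np.sum(q * np.log(q + 1e-12), axis=-1)

def unlabelled_bound(labelled_bounds, q_y):
    # L_u(x) = sum_y q(y|x) L_l(x, y) + H(q(y|x)): marginalise the
    # labelled bound over the classifier, then add its entropy.
    return np.sum(q_y * labelled_bounds, axis=-1) + entropy(q_y)

def m2_objective(Lu, Ll, log_q_true, alpha):
    # Overall M2 objective: unlabelled bounds plus labelled bounds,
    # with the alpha-weighted term that trains q(y|x) on the labels.
    return np.sum(Lu) + np.sum(Ll + alpha * log_q_true)

# A uniform classifier over 2 classes with equal per-class bounds of -10
# gives L_u = -10 + log(2): the labelled bound plus maximal entropy.
q_y = np.array([[0.5, 0.5]])
labelled_bounds = np.array([[-10.0, -10.0]])
Lu = unlabelled_bound(labelled_bounds, q_y)
```

Note how, in this fragment, q_y influences only the unlabelled term unless alpha is non-zero, which is exactly the motivation for the cross-entropy addition above.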
















Figure 1: Representation of our DGM as a probabilistic graphical model, for data x, partially observed class y, continuous latent z, and parameters θ, φ. Figures (a,c) show the generative model with y latent and y observed. Figures (b,d) show the variational approximate posterior with y latent and y observed.

M2’s inability to consistently learn in the semi-unsupervised case is not satisfactory for us, as we are interested in cases where some classes of data are not found in our labelled dataset. This is to enable us to learn with both semi-supervised and unsupervised classes. Many deep generative models have been proposed for semi-supervised learning Maaloe2016; Maaloe2017a and for unsupervised learning Dilokthanakula; Burda; Kingma2016, but none have dealt with posterior collapse in q_φ(y|x) so as to perform semi-unsupervised learning. So we propose a deep generative model: a Gaussian-mixture version of a variational autoencoder, inspired by Kingma et al.’s M2 Kingma2014a and the GMM-VAE Dilokthanakula. Rather than having the same prior p(z) for all classes as in M2, we condition the prior on y to obtain a mixture of Gaussians in z space. We perform semi-unsupervised classification with this model, and also compare performance with M2. Note that our model can also be trained unsupervised. The generative model for the data is:

p_θ(x, y, z) = p_θ(x|y, z) p_θ(z|y) p(y), with p(y) = Cat(y|π) and p_θ(z|y) = N(z | μ_θ(y), σ²_θ(y)).
We then perform amortised stochastic variational inference, with variational distributions as before for M2. See Figure 1 for a graphical representation of our model.
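Ancestral sampling from this generative structure can be sketched as follows; the per-class parameters below are random placeholders standing in for the outputs of the networks μ_θ(y), σ_θ(y):

```python
import numpy as np

rng = np.random.default_rng(1)
n_classes, z_dim = 5, 2

# Placeholder per-class prior parameters; in the model these come from
# neural networks mu_theta(y), sigma_theta(y).
mu_y = rng.standard_normal((n_classes, z_dim))
log_sigma_y = np.zeros((n_classes, z_dim))
pi = np.full(n_classes, 1.0 / n_classes)   # uniform Cat(y | pi)

def sample_y_z(rng):
    # y ~ Cat(pi), then z ~ N(mu_theta(y), sigma_theta(y)^2):
    # the prior over z is a mixture of Gaussians, one component per class.
    y = rng.choice(n_classes, p=pi)
    z = mu_y[y] + np.exp(log_sigma_y[y]) * rng.standard_normal(z_dim)
    return y, z

y, z = sample_y_z(rng)
```

Sampling y first and then z from that class's component is what gives each (possibly unlabelled) class its own region of z space.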

3 Experimental setup

3.1 Model Implementation

All networks are small MLPs, 2-4 layers with 500 hidden units per layer and ReLU activations. z is 100-dimensional. The same network architectures were used for networks with the same inputs and outputs. Our code is based on the template code associated with Gordon and Hernández-Lobato (2017) Gordon2017. Training was done using Adam Kingma2015. Kernel initialisation was Glorot-normal, and weights were regularised via a Gaussian prior as in Kingma2014a.² Code accompanying the paper is available at:
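As a concrete sketch of the architectures just described (a hypothetical forward pass, not the released code), an encoder-mean network mapping the 126 activity features to a 100-dimensional z might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_normal(fan_in, fan_out, rng):
    # Glorot-normal kernel initialisation, as used in the paper.
    return rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / (fan_in + fan_out))

def make_mlp(sizes, rng):
    # Weight matrices and zero biases for each consecutive pair of sizes.
    return [(glorot_normal(a, b, rng), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    # ReLU on hidden layers, linear output layer.
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

# Two hidden layers of 500 units, 126 input features, 100-dimensional output.
params = make_mlp([126, 500, 500, 100], rng)
z_mean = forward(params, rng.standard_normal((8, 126)))
```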

3.2 MNIST experiment

Here we trained both our model and M2 with digits 0, 1, 2, 8, 9 semi-supervised with 100 labels each, and digits 3, 4, 5, 6, 7 entirely unsupervised. We augmented y with 5 extra classes for unsupervised data to be learnt into, in addition to the 5 vacated classes. π_y was equal to 1/10 for each of the 5 semi-supervised classes and 1/20 for each of the 10 unsupervised classes. To calculate accuracy, we attributed each learnt unsupervised class to the most common true class within it at test time. See Figure 2 for the resulting confusion matrices.
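The attribution step, mapping each learnt class to its most common true label at test time, can be sketched as follows (function and variable names are ours):

```python
import numpy as np

def attribute_clusters(cluster_ids, true_labels, n_clusters):
    # Map each learnt (possibly unsupervised) class to the most common
    # true label among the test points assigned to it, then score the
    # labelling that this mapping induces.
    mapping = {}
    for k in range(n_clusters):
        members = true_labels[cluster_ids == k]
        mapping[k] = int(np.bincount(members).argmax()) if members.size else -1
    preds = np.array([mapping[k] for k in cluster_ids])
    return mapping, float((preds == true_labels).mean())

# Toy example: cluster 0 is mostly digit 3, cluster 1 is mostly digit 7.
clusters = np.array([0, 0, 0, 1, 1, 1])
labels = np.array([3, 3, 7, 7, 7, 3])
mapping, acc = attribute_clusters(clusters, labels, n_clusters=2)
```

Empty clusters are given the sentinel label -1 here; any such choice is fine, since an empty cluster contributes nothing to accuracy.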

3.3 Activity recognition experiment

We trained and evaluated the model to distinguish between activity states using the CAPTURE-24 dataset. This dataset was gathered from 132 people (mean age 42) wearing a wrist-mounted accelerometer for 2-3 days each. The device measured acceleration at 100 Hz tri-axially, along axes pinned to the device, saturating at the device's measurement limit Doherty2017. We preprocess the data into 126 standard features extracted over 30-second windows, as in Willetts et al. (2018) Willetts2018. In addition to the accelerometer, participants wore cameras that took photos at regular intervals Doherty2013; Gurrin2014. These images were then classified visually by medical students into the classes of the Compendium of Physical Activities (Ainsworth2011), which are detailed but overlapping. Thus we select only the most unambiguous CPA classes (e.g. 17250: walking as a means of transport, or 1010: bicycling) to use as labels; the rest are considered unlabelled. See the appendix for details. Further, we only give ourselves access to labels for a subset of 20 participants; the remaining 92 are treated as entirely unlabelled, with a test set of 20 participants. We treat the problem as classification of these windows, but with ground-truth labels for only a fraction of the data. This dataset is also highly unbalanced: the rarest class, running, is many times less prevalent than the commonest, sleep.

4 Results

Figure 2: Confusion matrices for (a) M2, accuracy 0.82, and (b) our model, accuracy 0.94, when trained on MNIST with 100 labelled examples each for digits 0, 1, 2, 8, 9 and digits 3, 4, 5, 6, 7 entirely unsupervised. 5 additional classes were added to y. All unsupervised classes were attributed to the true 10 classes by taking each to represent its most prevalent original class. Clearly the original model M2 shows more confusion between unsupervised classes than our model. Each is the best of 10 runs.

4.1 Activity Results

Model performance is not well captured by simple scores on the predictions: as we are interested in state discovery, some data may be counted as an error for having been placed in a newly discovered class when that placement is in fact appropriate. Looking at the full CPA annotations within the discovered clusters, we see the model learns a class for activities while standing, containing: 5035: kitchen activity, general cooking; 11795: walking on job and carrying light objects; 11600: (generic) standing tasks such as store clerk; 11475: (generic) manual labour; and a cluster for sitting activities: 11580: office work/computer work general; 11580: office work such as writing and typing; 7030: sleeping; 9030: sitting, desk work. Compared to M2, our model improved the overall Caliński–Harabasz score Calinski1974.
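For reference, the Caliński–Harabasz index is the ratio of between-cluster to within-cluster dispersion, rescaled by degrees of freedom (higher means tighter, better-separated clusters). A minimal NumPy version, matching in spirit sklearn.metrics.calinski_harabasz_score:

```python
import numpy as np

def calinski_harabasz(X, labels):
    # Between-cluster dispersion over within-cluster dispersion,
    # scaled by (n - k) / (k - 1) for n points and k clusters.
    X, labels = np.asarray(X), np.asarray(labels)
    n, k = len(X), len(np.unique(labels))
    overall_mean = X.mean(axis=0)
    between, within = 0.0, 0.0
    for c in np.unique(labels):
        Xc = X[labels == c]
        centroid = Xc.mean(axis=0)
        between += len(Xc) * np.sum((centroid - overall_mean) ** 2)
        within += np.sum((Xc - centroid) ** 2)
    return (between / (k - 1)) / (within / (n - k))

# Two well-separated clusters score far higher under the correct
# labelling than under a shuffled one.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
good, bad = np.array([0, 0, 1, 1]), np.array([0, 1, 0, 1])
```

Because it needs only cluster assignments and features, not ground-truth labels, it is a natural score for the unsupervised portion of the data.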

5 Conclusion

We show that our method can perform better than Kingma et al.'s M2 Kingma2014a in semi-unsupervised learning on MNIST, and can discover structure in CAPTURE-24 data. y and z can be thought of as separating out class and style information about data x. Our model, through having a mixture of Gaussians in z space, is a suitable choice when different classes of data carry different stylistic information. This is work in progress; the next steps are to apply these methods, with more flexible and powerful parameterisations of the distributions, to the full UK Biobank activity dataset of 100k people, using CAPTURE-24 as the small labelled dataset, and to extend this to time-series analysis.


  • (1) Aiden Doherty, Dan Jackson, Nils Hammerla, Thomas Plötz, Patrick Olivier, Malcolm H Granat, Tom White, Vincent T. Van Hees, Michael I Trenell, Christopher G Owen, Stephen J Preece, Rob Gillions, Simon Sheard, Tim Peakman, Soren Brage, and Nicholas J Wareham. Large scale population assessment of physical activity using wrist worn accelerometers: The UK biobank study. PLoS ONE, 12(2), 2017.
  • (2) Matthew Willetts, Sven Hollowell, Louis Aslett, Chris Holmes, and Aiden Doherty. Statistical machine learning of sleep and physical activity phenotypes from sensor data in 96,220 UK Biobank participants. NSR, 2018.
  • (3) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In Proceedings of the 31st International Conference on Machine Learning (ICML), 2014.
  • (4) Diederik P Kingma and Max Welling. Auto-Encoding Variational Bayes. In International Conference on Learning Representations (ICLR), 2014.
  • (5) Karl Weiss, Taghi M Khoshgoftaar, and Ding Ding Wang. A survey of transfer learning. Journal of Big Data, 3(1), 2016.
  • (6) Yongqin Xian, Christoph H Lampert, Bernt Schiele, and Zeynep Akata. Zero-Shot Learning: A Comprehensive Evaluation of the Good, the Bad and the Ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  • (7) Diane Cook, Kyle D Feuz, and Narayanan C Krishnan. Transfer learning for activity recognition: A survey. Knowledge and Information Systems, 36(3):537–556, 2013.
  • (8) Mohammad Abu Alsheikh, Ahmed Selim, Dusit Niyato, Linda Doyle, Shaowei Lin, and Hwee-Pink Tan. Deep Activity Recognition Models with Triaxial Accelerometers. In Workshops of AAAI Conference on Artificial Intelligence, 2015.
  • (9) Francisco Javier Ordóñez and Daniel Roggen. Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition. Sensors, 16(1), 2016.
  • (10) Nils Y. Hammerla, Shane Halloran, and Thomas Ploetz. Deep, Convolutional, and Recurrent Models for Human Activity Recognition using Wearables. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI), 2016.
  • (11) Katherine Ellis et al. Hip and wrist accelerometer algorithms for free-living behavior classification. Medicine and Science in Sports and Exercise, 48(5):933–940, 2016.
  • (12) Mauricio A Alvarez, Jan Peters, Bernhard Schölkopf, and Neil D Lawrence. Switched Latent Force Models for Movement Segmentation. In Advances in Neural Information Processing Systems (NIPS), 2011.
  • (13) Diederik P Kingma, Danilo J Rezende, Shakir Mohamed, and Max Welling. Semi-Supervised Learning with Deep Generative Models. In Advances in Neural Information Processing Systems (NIPS), 2014.
  • (14) Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. Auxiliary Deep Generative Models. In Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016.
  • (15) Lars Maaløe, Marco Fraccaro, and Ole Winther. Semi-Supervised Generation with Cluster-aware Generative Models. CoRR, 2017.
  • (16) Nat Dilokthanakul, Pedro A M Mediano, Marta Garnelo, Matthew C H Lee, Hugh Salimbeni, Kai Arulkumaran, and Murray Shanahan. Deep Unsupervised Clustering with Gaussian Mixture VAE. CoRR, 2017.
  • (17) Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance Weighted Autoencoders. arXiv preprint, 1509.00519, 2015.
  • (18) Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improving Variational Inference with Inverse Autoregressive Flow. In Advances in Neural Information Processing Systems (NIPS), 2016.
  • (19) Jonathan Gordon and José Miguel Hernández-Lobato. Bayesian Semisupervised Learning with Deep Generative Models. ICML Workshop on Principled Approaches to Deep Learning, 2017.
  • (20) Diederik P Kingma and Jimmy Lei Ba. Adam: A Method for Stochastic Optimisation. ICLR, 2015.
  • (21) Aiden R Doherty, Steve E Hodges, Abby C King, Alan F Smeaton, Emma Berry, Chris J.A. Moulin, Siân Lindley, Paul Kelly, and Charlie Foster. Wearable cameras in health: The state of the art and future possibilities. American Journal of Preventive Medicine, 44(3):320–323, 2013.
  • (22) Cathal Gurrin, Alan F. Smeaton, and Aiden R. Doherty. LifeLogging: Personal Big Data. Foundations and Trends in Information Retrieval, 8(1):1–125, 2014.
  • (23) Barabara E Ainsworth, William L Haskell, Stephen D Herrman, Nathanael Meckes, David R Bassett, Catrine Tudor-Locke, Jennifer Greer, Jesse Vezina, Melicia Whitt-Glover, and Arthur Leon. Compendium of Physical Activities: A Second Update of Codes and MET Values. Medicine and Science in Sports and Exercise, 43(8):1575–1581, 2011.
  • (24) T. Caliński and J. Harabasz. A Dendrite Method for Cluster Analysis. Communications in Statistics, 3(1):1–27, 1974.

Appendix A Dictionary from CPA to Classes

We chose a small number of well-represented and unambiguous CPA codes to form the sparsely-labelled classes in the activity data analysis. In total these CPA codes cover roughly a third of the labelled data (33.6%, summing the proportions in Table 1).

Class | CPA Code | CPA Description | Proportion (%)
Bicycling | 1010 | Bicycling | 1.5
Driving | 6206 | Driving automobile or light truck | 3.4
Eating | 13030 | Eating, sitting alone or with someone | 4.2
Riding Transport | 16016 | Riding in a bus or train | 1.5
Running | 12150 | Running | 0.12
Sitting | 9060 | Sitting/lying, reading or without observable/identifiable activities | 8.2
Sleep | 7030 | Sleep | 9.6
Walking | 17250 | Walking as the single means to a destination, not to work or class | 2.5
Walking | 17161 | Walking, not as the single means of transport | 1.8
Walking | 17270 | Walking as the single means to work or class | 0.55
Walking | 17082 | Hiking or walking at a normal pace through fields and hillsides | 0.18

Table 1: CPA Dictionary