Accurate and effective activity monitoring is becoming more important for improving quality of life. Current smartphones and tablets, equipped with powerful processors and a wide variety of sensors, have become ideal platforms for activity monitoring. For elderly people in particular, accurate monitoring of daily activities is vital for overall health condition analysis.
A hierarchical method of activity classification has been described based on a smartphone, equipped with an embedded 3D accelerometer, worn on a belt. Using two multi-class SVM classifiers for rule-based reasoning over motion and motionless activities, the authors were able to achieve good accuracy for six different activities. Accelerometers have also been used in activity classification for sports, in order to capture and archive various training statistics. One of the key points discussed in this line of work
is that the placement of the phone makes it hard to generalize a model trained offline. This is the main drawback of the supervised approach. Ideally, an adaptive model should be able to learn activities on the fly and expand its repertoire when it encounters a new type of activity. In other words, we need an unsupervised non-parametric model.
A few unsupervised learning methods have been applied to wearable accelerometer data. For instance, the k-nearest neighbor algorithm has been applied to the problem of activity classification for dogs, reporting the resulting classification accuracy.
Unsupervised learning has also been applied to smartphone sensor data, including accelerometer, magnetometer and gyroscope readings, to classify human activities using a method based on hierarchical clustering coupled with Gaussian mixture models. Unsupervised human activity classification has also been proposed using a hidden Markov model (HMM) in a multiple-regression context, with sensors attached to the ankle, hip and chest. The sensors record three-dimensional accelerometer data, resulting in nine-dimensional inputs for the classifier. Moreover,
a sticky hierarchical Dirichlet process HMM (sticky HDP-HMM) has been used to segment human activities such as waving good-bye, walking and throwing a ball so that a robot can imitate them; the angles of various joints of the human body are measured to estimate an unknown number of human activities and classify them. In all these methods, the autocorrelated nature of the accelerometer data is not accounted for. Most notably, HMM-based approaches make the assumption that each observation is i.i.d. given the activity. This is a valid approach if the activities differ in the mean of the observation, but it fails if the activities differ mostly in the covariance of the observation.
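To illustrate why the i.i.d. emission assumption can fail, consider the following hypothetical sketch (not drawn from the paper's dataset): two zero-mean signals with matched marginal variance but different autocorrelation look almost identical to a Gaussian-emission HMM, yet are cleanly separated by their temporal dynamics.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 20000

# Signal A: i.i.d. Gaussian noise (no temporal structure).
a = rng.normal(size=T)

# Signal B: AR(1) process scaled so its stationary variance is also 1.
phi = 0.9
e = rng.normal(scale=np.sqrt(1.0 - phi**2), size=T)
b = np.zeros(T)
for t in range(1, T):
    b[t] = phi * b[t - 1] + e[t]

def lag1_autocorr(x):
    """Sample autocorrelation at lag 1."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]

# Marginal mean and variance are nearly identical, so an emission-only
# model sees almost the same distribution for both signals; the lag-1
# autocorrelation, however, separates them sharply.
```

The example uses an AR(1) surrogate rather than the paper's activity data; the point is only that discriminating such regimes requires modeling the dynamics, which is what the SLDS approach below does.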
To collect the accelerometer data, the smartphone is attached to a belt around the waist. This placement has two benefits: i) it is close to the center of gravity, so the images from the camera are more stable; ii) it has recently been shown that an accelerometer placed around the waist provides the highest classification accuracy. In terms of activities, the subjects are instructed to choose among walking, going up and down the stairs, turning, taking the elevator, sitting and lying down in a random order. The subjects were also encouraged to repeat each activity a random number of times.
The activity classification problem exhibits a piecewise linear autocorrelation characteristic. One of the popular models for such time-series data is the switching linear dynamical system (SLDS). An SLDS consists of a collection of linear dynamical system (LDS) models, which are primarily used to infer the hidden dynamical behavior of a noisy system from noisy observations. SLDS models have successfully been used to model regime switching in interest rates, human motion from video data, the dance of honey bees, the respiration pattern of a sleep apnea patient, the interconnectivity of brain regions, and many others.
An SLDS can be described as a hybrid of a hidden Markov model (HMM) and an LDS. The linear Gaussian process models the dynamical behavior of the system within each temporal mode, and the hidden Markov chain captures the sequence of temporal modes. Each hidden state in the Markov chain corresponds to a distinct temporal mode and has its own parameters for the linear Gaussian process it governs. In an ordinary SLDS, the number of temporal modes must be specified ahead of time; however, this information may not always be available. Furthermore, it might be desirable to expand the repertoire of the model as new data arrives. With Bayesian nonparametric techniques, this problem can be alleviated. Letting a hierarchical Dirichlet process (HDP) determine the state transitions, one can allow for a countably infinite number of states. This method was first explored for HMMs; the initial attempts resulted in oscillatory behavior in practice, and the generative process was later modified to ensure mode persistence. This concept was then extended to SLDS, yielding a model named the sticky HDP-SLDS.
The Bayesian approach to HMM, LDS and SLDS models has been explored extensively. A variational inference algorithm was developed for a fully hierarchical LDS model with automatic relevance determination (ARD), and variational inference algorithms have been proposed for various SLDS models. These early works lack a non-parametric treatment of the state transitions. The HDP has been incorporated into SLDS; however, Gibbs sampling was used for inference, under the assumption that each temporal mode shares the same observation noise model. The use of Gibbs sampling degraded the speed of inference significantly.
The major contribution of this work is to develop a variational inference algorithm for the sticky HDP-SLDS model. We restrict ourselves to ARD modeling; however, we allow each temporal mode to have a distinct sensor model. We further extend the original model to have multiple observations sharing the same underlying switching behavior. We compare the original Gibbs sampling and variational inference on synthetic data in terms of their ability to capture the true temporal mode sequence as well as their speed.
Having shown the superiority of variational inference in terms of inference speed, we employed it for the activity classification problem. The sticky HDP-HMM has previously been used on a similar dataset with a limited number of activities. We experimentally verified that sticky HDP-SLDS with variational inference is a fast and accurate unsupervised method suitable for activity classification with a single accelerometer sensor.
II-A Switching Linear Dynamical Systems
An LDS model with an underlying state $x_t$ and observation $y_t$ can be described as
$$x_t = A x_{t-1} + w_t, \qquad w_t \sim \mathcal{N}(0, Q),$$
$$y_t = C x_t + v_t, \qquad v_t \sim \mathcal{N}(0, R),$$
where $A$ is the state dynamics matrix, $C$ is the observation matrix, and $Q$ and $R$ are the state and observation noise covariance matrices, respectively. It has been shown that, without loss of generality, we can assume the state noise covariance $Q$ to be the identity matrix.
An SLDS requires another state variable $z_t$ for the Markov chain of temporal modes. An SLDS with $M$ observation sequences sharing the same temporal mode sequence can be described as
$$z_t \sim \pi_{z_{t-1}},$$
$$x^{(m)}_t = A^{(z_t)} x^{(m)}_{t-1} + w^{(m)}_t, \qquad w^{(m)}_t \sim \mathcal{N}(0, I),$$
$$y^{(m)}_t = C^{(z_t)} x^{(m)}_t + v^{(m)}_t, \qquad v^{(m)}_t \sim \mathcal{N}\bigl(0, R^{(z_t)}\bigr), \qquad m = 1, \dots, M,$$
where $\pi$ is the transition matrix and the initial states are given by $z_1 \sim \pi_0$ and $x^{(m)}_0 \sim \mathcal{N}(\mu_0, \Sigma_0)$.
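As a concrete illustration of this generative process, the following sketch simulates a small SLDS; all parameter values (two modes, 2-D state, the specific matrices) are hypothetical choices for the example, not the paper's learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-mode SLDS: each mode k has its own dynamics matrix A[k];
# the mode sequence z_t follows a Markov chain with transition matrix Pi.
A = [np.array([[0.99, 0.0], [0.0, 0.9]]),    # mode 0: slow decay
     np.array([[0.0, -0.95], [0.95, 0.0]])]  # mode 1: rotation
C = np.eye(2)                                 # shared observation matrix
Pi = np.array([[0.95, 0.05], [0.05, 0.95]])   # "sticky" transitions
Q = 0.01 * np.eye(2)                          # state noise covariance
R = 0.05 * np.eye(2)                          # observation noise covariance

def simulate(T):
    """Draw a mode sequence z, latent states x, and observations y."""
    z = np.zeros(T, dtype=int)
    x = np.zeros((T, 2))
    y = np.zeros((T, 2))
    x_prev, z_prev = np.array([1.0, 0.0]), 0
    for t in range(T):
        z[t] = rng.choice(2, p=Pi[z_prev])
        x[t] = A[z[t]] @ x_prev + rng.multivariate_normal(np.zeros(2), Q)
        y[t] = C @ x[t] + rng.multivariate_normal(np.zeros(2), R)
        x_prev, z_prev = x[t], z[t]
    return z, x, y

z, x, y = simulate(200)
```

The near-diagonal transition matrix mimics the mode persistence that the sticky prior below encourages.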
II-B Hierarchical Dirichlet Process
A two-level HDP can be described as a collection of Dirichlet processes (DPs), each characterized by the concentration parameter $\alpha$ and a base measure $G_0$, which is itself drawn from a DP with concentration parameter $\gamma$ and base measure $H$. Mathematically, we write
$$G_0 \sim \mathrm{DP}(\gamma, H), \qquad G_j \mid G_0 \sim \mathrm{DP}(\alpha, G_0).$$
To construct such an HDP, Sethuraman's stick-breaking procedure can be used as follows:
$$\beta'_k \sim \mathrm{Beta}(1, \gamma), \qquad \beta_k = \beta'_k \prod_{l=1}^{k-1} (1 - \beta'_l), \qquad G_0 = \sum_{k=1}^{\infty} \beta_k \delta_{\theta_k}, \qquad \theta_k \sim H.$$
In the case of SLDS, $\theta_k$ refers to the set of parameters of temporal mode $k$. The weight vector $\beta = (\beta_1, \beta_2, \dots)$ acts as an average distribution over the temporal modes, and $G_j$ is related to the transition probabilities from temporal mode $j$. The connection between $G_j$ and $G_0$ can be established with a set of indicator variables so that both measures share the same atoms $\{\theta_k\}$, i.e. $\pi_j \sim \mathrm{DP}(\alpha, \beta)$ and $G_j = \sum_{k=1}^{\infty} \pi_{jk} \delta_{\theta_k}$. Now we can write the transition probabilities as
$$z_t \mid z_{t-1} \sim \pi_{z_{t-1}}.$$
With this construction we ensure that the transition distribution shares the same base measure in expectation, i.e. $\mathbb{E}[\pi_j \mid \beta] = \beta$. As noted earlier, this implementation might fail to capture mode persistence. To solve this problem, the concept of a sticky HDP was introduced. For the sticky version we make the following modification:
$$\pi_j \sim \mathrm{DP}\!\left(\alpha + \kappa,\; \frac{\alpha \beta + \kappa \delta_j}{\alpha + \kappa}\right),$$
where $\kappa$ is the self-transition parameter. With this construction we obtain a slightly different expectation for the transition distribution:
$$\mathbb{E}[\pi_j \mid \beta] = \frac{\alpha \beta + \kappa \delta_j}{\alpha + \kappa}.$$
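A truncated sketch of this construction (a finite Dirichlet approximation to the DP rows, with hypothetical concentration values) shows how the sticky term $\kappa$ boosts self-transition probabilities:

```python
import numpy as np

rng = np.random.default_rng(1)

def stick_breaking(gamma, L):
    """Truncated Sethuraman construction of the global weights beta."""
    b = rng.beta(1.0, gamma, size=L)
    beta = b * np.concatenate(([1.0], np.cumprod(1.0 - b)[:-1]))
    beta[-1] = 1.0 - beta[:-1].sum()  # absorb the leftover stick mass
    return beta

def transition_rows(beta, alpha, kappa):
    """Sample each row pi_j ~ Dirichlet(alpha*beta + kappa*e_j), a finite
    approximation to DP(alpha+kappa, (alpha*beta + kappa*delta_j)/(alpha+kappa))."""
    L = len(beta)
    return np.vstack([rng.dirichlet(alpha * beta + kappa * np.eye(L)[j])
                      for j in range(L)])

beta = stick_breaking(gamma=2.0, L=10)
Pi_sticky = transition_rows(beta, alpha=1.0, kappa=50.0)  # mode-persistent
Pi_plain = transition_rows(beta, alpha=1.0, kappa=0.0)    # ordinary HDP
# A large kappa concentrates mass on the diagonal (self-transitions).
```

With $\kappa = 0$ this reduces to the ordinary HDP transition prior, so the sketch also illustrates the non-sticky limit.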
II-C Variational Inference
In a well-constructed graphical model, the latent variables encode the hidden pattern in the data, and it is feasible to learn this pattern by computing the posterior of the latent variables given the observations. Using the chain rule we write
$$p(x \mid y) = \frac{p(x, y)}{p(y)} = \frac{p(y \mid x)\, p(x)}{\int p(y \mid x)\, p(x)\, dx},$$
where $x$ and $y$ are the sets of latent variables and observed variables, respectively. The inference of the latent variables is relatively easy given the evidence $p(y)$. However, this integration becomes intractable for many practical models. With no analytical solution, one has to resort to approximate inference methods. A common approach is to use sampling techniques such as Markov chain Monte Carlo (MCMC). An alternative strategy is to use divergence-based methods such as variational inference, in which the posterior over the latent variables is approximated by another set of distributions. Using Jensen's inequality we write
$$\log p(y) = \log \int p(x, y)\, dx \;\ge\; \mathbb{E}_{q(x)}[\log p(x, y)] - \mathbb{E}_{q(x)}[\log q(x)] = \mathcal{L}(q),$$
where $q(x)$ is the variational distribution used for approximation and $\mathcal{L}(q)$ is the evidence lower bound (ELBO). Maximizing $\mathcal{L}(q)$ is equivalent to minimizing the KL divergence between the posterior distribution $p(x \mid y)$ and the variational distribution $q(x)$.
To simplify further, we make the mean-field assumption, which requires all the variational distributions to be in the mean-field variational family. In this family, each latent variable is independent and governed by its own parameter, i.e. $q(x) = \prod_i q_i(x_i)$. With this assumption, we can pose inference as an optimization problem which can be solved with a coordinate ascent algorithm. Furthermore, it offers the flexibility of using stochastic methods and provides scalability. The update for each variational distribution can be represented as
$$q_i(x_i) = \frac{1}{Z_i} \exp\!\left( \mathbb{E}_{q_{-i}}[\log p(x, y)] \right),$$
where $Z_i$ is the normalization constant and the expectation is taken under $q_{-i} = \prod_{j \neq i} q_j(x_j)$.
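As a minimal, self-contained illustration of such coordinate-ascent updates (a standard bivariate-Gaussian textbook example, separate from the SLDS model), the factorized means converge to the exact posterior mean:

```python
import numpy as np

# Target: p(x1, x2) = N(mu, inv(Lam)), approximated by q(x1) q(x2).
# For a Gaussian target, each coordinate-ascent update is available in
# closed form: q(x_i) is Gaussian with precision Lam_ii and mean m_i below.
mu = np.array([1.0, -2.0])
Lam = np.array([[2.0, 0.8],
                [0.8, 1.0]])  # precision (inverse covariance) matrix

m = np.zeros(2)               # initial variational means
for _ in range(50):           # coordinate ascent sweeps
    m[0] = mu[0] - (Lam[0, 1] / Lam[0, 0]) * (m[1] - mu[1])
    m[1] = mu[1] - (Lam[1, 0] / Lam[1, 1]) * (m[0] - mu[0])
# The factorized means converge to the exact posterior mean mu
# (though the factorized variances underestimate the true covariance).
```

The same fixed-point structure underlies the updates used for the SLDS below, only with far more elaborate factors.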
III Variational Inference for HDP-SLDS
The graphical model for the multi-observation sticky HDP-SLDS is depicted in Figure 1. $x^{(m)}_t$ is the hidden Markov process for observation sequence $y^{(m)}_t$, and $z_t$ is the hidden Markov chain of temporal modes as described in Section II-A. $\pi_j$ and $\beta$ govern the transitions between the temporal modes characterized by the parameters $\theta_k$; $\beta$ is related to the average distribution in Section II-B. $\pi_0$ is the distribution of the initial state $z_1$. The joint likelihood of the model can be written as
$$p(y, x, z, \pi, \beta, \theta) = p(\beta)\, p(\pi \mid \beta)\, p(\theta)\, p(z_{1:T} \mid \pi) \prod_{m=1}^{M} p\bigl(x^{(m)}_{0:T}, y^{(m)}_{1:T} \mid z_{1:T}, \theta\bigr).$$
With the mean-field assumption, the variational distribution can be decomposed into its components. Using the stick-breaking construction described in Section II-B and ARD modeling, the variational distributions for the time-invariant latent variables can be written as a product of standard conjugate factors, where $\pi_j$ denotes the $j$-th row of $\pi$, $A^{(k)}_{\cdot i}$ denotes the $i$-th column of $A^{(k)}$, and $L$ is the truncation parameter, i.e. $q(z_t = k) = 0$ for $k > L$. The truncation parameter might give the impression that we are working with a parametric model; however, $L$ is simply an upper bound on the number of possible modes, which can be allowed to grow as needed.
We can make use of conjugacy relations to update the time-invariant variational parameters; the details of these updates are given in Table I. We explain the update for the global weights $\beta$ in further detail, as it requires an approximation. We note that the required moments are easy to compute since the stick-breaking variables $\beta'_k$ are i.i.d. a priori. The prior on $\beta$ is given by the stick-breaking construction
$$\beta'_k \sim \mathrm{Beta}(1, \gamma), \qquad \beta_k = \beta'_k \prod_{l=1}^{k-1} (1 - \beta'_l).$$
The posterior for $\beta$ is proportional to the product of this prior and the expected likelihood of the transition rows $\pi_{1:L}$. Since the latter involves intractable expectations, we approximate them in order to write the update for $q(\beta)$ in closed form.
We note that as $\kappa \to 0$, the update for $\beta$ becomes equivalent to the non-sticky case. Optionally, it is possible to update the hyperparameters $\alpha + \kappa$ and $\rho = \kappa / (\alpha + \kappa)$. We place a Gamma prior on $\alpha + \kappa$ and a Beta prior on $\rho$ as follows:
$$\alpha + \kappa \sim \mathrm{Gamma}(a, b), \qquad \rho \sim \mathrm{Beta}(c, d).$$
Given the updates for $\alpha + \kappa$ and $\rho$, it is possible to calculate point estimates for $\alpha$ and $\kappa$.
For the time-invariant parameters, we are able to compute the KL divergence terms directly. This proves useful in computing the evidence lower bound $\mathcal{L}$, since closed-form expressions are readily available for standard distributions. Unfortunately, we still need to compute the entropies of $q(x)$ and $q(z)$ separately. In a simple HMM or LDS, there is no need to compute the entropy of the time-dependent latent variables explicitly: one can simply compute the likelihood of the observations conditioned on the latent variables, and the entropy term cancels out. However, in an SLDS the time-dependent latent variables are tightly coupled and this approach is not applicable; therefore, we need to deal with the first term in $\mathcal{L}$. It can be written more explicitly as
where we define the relevant expectations under the variational distribution. Using Eq. 5 we can write the updates for $q(x^{(m)}_{0:T})$ and $q(z_{1:T})$ accordingly. For these updates it is sufficient to use the terms of the joint likelihood involving these variables, since the other terms only contribute to the constants.
There exist efficient and well-studied inference algorithms for LDS models and HMMs, such as the RTS smoother and the forward-backward algorithm [18, 27]. However, we cannot plug the expected parameters into these algorithms directly; this would mean working with a point estimate instead of the relevant expectation. One solution to this problem is to introduce a set of auxiliary variables that convert the expected log-likelihood into an ordinary likelihood function that can be fed into these algorithms. This approach was first explored for HMMs and later extended to LDS models.
To update $q(x^{(m)}_{0:T})$, we first compute the auxiliary variables using Algorithm 1 and then run the RTS smoother with them; the expectations in Algorithm 1 are taken under $q(z_{1:T})$. With a similar approach, we introduce another set of variables to emulate the likelihood function of an ordinary HMM, with the probability of the initial state, the probability of transition, and the probability of emission. We compute these using Algorithm 2 and then run the forward-backward algorithm to infer $q(z_{1:T})$; the expectations of the evidence in Algorithm 2 are taken under $q(x^{(m)}_{0:T})$. It is usually more convenient to work in the log domain to prevent underflow in the forward-backward algorithm; therefore, we directly compute the auxiliary variables in the log domain. The full variational inference algorithm is given in Algorithm 3.
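A log-domain forward-backward smoother of the kind used here can be sketched as follows. This is generic HMM smoothing; the interface and variable names are illustrative, not the paper's Algorithm 2, and the per-step log-likelihoods stand in for the auxiliary variables.

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along an axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m + np.log(np.sum(np.exp(a - m), axis=axis,
                                        keepdims=True)), axis=axis)

def forward_backward(log_pi0, log_Pi, log_lik):
    """Log-domain forward-backward smoothing for an HMM.
    log_lik[t, k] plays the role of the expected log emission probability
    of y_t under mode k. Returns the smoothed marginals q(z_t = k)."""
    T, K = log_lik.shape
    la = np.zeros((T, K))   # log forward messages
    lb = np.zeros((T, K))   # log backward messages
    la[0] = log_pi0 + log_lik[0]
    for t in range(1, T):
        la[t] = log_lik[t] + logsumexp(la[t - 1][:, None] + log_Pi, axis=0)
    for t in range(T - 2, -1, -1):
        lb[t] = logsumexp(log_Pi + (log_lik[t + 1] + lb[t + 1])[None, :],
                          axis=1)
    lg = la + lb
    lg -= logsumexp(lg, axis=1)[:, None]  # normalize each time step
    return np.exp(lg)

# Tiny usage example with two modes and three time steps.
g = forward_backward(np.log([0.5, 0.5]),
                     np.log(np.full((2, 2), 0.5)),
                     np.log([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]))
```

Working entirely in the log domain avoids the underflow that plain message passing suffers on long sequences.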
We collected data from 10 different subjects using a smartphone worn on a belt around the waist. After a trivial rescaling of the data, we first applied an HMM with Gaussian emission probabilities (Gaussian HMM). We then used an HMM with GMM emissions. These two methods are parametric unsupervised methods and require the number of modes to be specified ahead of time; furthermore, they do not possess the stickiness property and do not account for autocorrelation. We compared these HMM models to sticky HDP-SLDS with variational inference. We then compared the Gibbs sampler and the proposed variational inference algorithm. Finally, we investigated the sensitivity of variational inference to hyperparameter initialization.
IV-A Data Collection
The experiments are recorded using a Galaxy S smartphone. Using the Android platform, the camera and accelerometer data are recorded synchronously at the camera's frame rate, to be processed offline. The subjects wore the smartphone on top of a belt around the waist during the recordings, as shown in Fig. 2. The image resolution used for recording is QVGA (320x240); although the smartphone is capable of recording at higher resolutions, QVGA was selected to limit the computational complexity. The front camera of the device is used to record activities while the captured images are displayed on the screen.
While wearing the smartphone on the belt, the subjects performed various activities in indoor environments of a university building and an apartment unit. The activities were selected to replicate daily activities in indoor environments for accurate and robust monitoring of a subject. They include sitting down, standing up, walking, turning to change direction, going up and down the stairs, and taking the elevator. While the activities are performed, accelerometer data and images are recorded continuously and simultaneously, with the subjects following their daily routines.
The procedure for recording the experimental data used to train the proposed model is visualized in Fig. 3. As can be observed, three-dimensional accelerometer data, including the gravity component, is recorded in sync with the images captured from the camera. The accelerometer is sampled at the camera's frame rate, since the camera is the slower of the two. The accelerometer data is recorded as a comma-separated values (.csv) file, whereas the images are recorded into a separate folder. Different sets of recordings are then fed into the learning algorithm to derive a model for classification. In the next section, we evaluate the performance of the derived model based on Hamming distances and comparison of the estimated classes against the ground-truth activity classes.
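A minimal loader for such a recording layout might look like the following sketch; the column names and file layout are assumptions for illustration, not the actual recording format.

```python
import csv
import io

# Hypothetical excerpt of one .csv recording: timestamped 3-axis samples.
raw = ("t,ax,ay,az\n"
       "0.0,0.12,9.81,0.05\n"
       "0.2,0.10,9.79,0.07\n"
       "0.4,0.15,9.80,0.02\n")

def load_accelerometer(fileobj):
    """Parse a .csv of accelerometer samples into (t, ax, ay, az) tuples."""
    reader = csv.DictReader(fileobj)
    return [(float(r["t"]), float(r["ax"]), float(r["ay"]), float(r["az"]))
            for r in reader]

samples = load_accelerometer(io.StringIO(raw))
```

In practice the same function would be given an open file handle for each recording instead of the in-memory string used here.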
In parametric models where the number of modes is specified ahead of time, it is common to relabel the predicted sequence and measure the accuracy of the model. The relabeling problem is combinatorial and becomes intractable when the number of modes is not small. In non-parametric models, it is therefore more common to use normalized mutual information (NMI), which allows us to compare two models resulting in different numbers of modes. NMI takes values between 0 and 1, where 0 corresponds to random assignments. The NMI between two collections of sets $\Omega$ and $\mathbb{C}$ can be defined as
$$\mathrm{NMI}(\Omega, \mathbb{C}) = \frac{I(\Omega; \mathbb{C})}{\bigl[H(\Omega) + H(\mathbb{C})\bigr]/2},$$
where $I$ denotes the mutual information and $H$ the entropy.
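A direct implementation of this definition (using the arithmetic-mean normalization; other normalizations of mutual information also exist) can be written as:

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information I(A;B) / ((H(A) + H(B)) / 2)
    between two labelings of the same items."""
    n = len(labels_a)
    pa, pb = Counter(labels_a), Counter(labels_b)
    pab = Counter(zip(labels_a, labels_b))

    def entropy(counts):
        return -sum((v / n) * math.log(v / n) for v in counts.values())

    mi = sum((v / n) * math.log((v / n) / ((pa[a] / n) * (pb[b] / n)))
             for (a, b), v in pab.items())
    denom = 0.5 * (entropy(pa) + entropy(pb))
    return mi / denom if denom > 0 else 1.0

# Identical labelings up to renaming give NMI = 1; independent ones give 0.
```

Because NMI is invariant to relabeling, the combinatorial matching step required for accuracy-based evaluation is avoided entirely.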
IV-C Experimental Setup
We first rescaled each accelerometer channel so that the standard deviation of the walking mode is fixed to a constant. This trivial rescaling makes it easier to set the hyperparameters.
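This rescaling step can be sketched as follows; the unit-standard-deviation target, the synthetic data, and the walking-segment indices are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for one recording: T x 3 accelerometer samples with
# a known walking segment (the indices here are hypothetical).
acc = rng.normal(scale=3.0, size=(1000, 3))
walking = slice(100, 600)

# Per-channel scale estimated from the walking segment only,
# then applied to the whole recording.
scale = acc[walking].std(axis=0)
acc_scaled = acc / scale
# Within the walking segment, each channel now has unit standard deviation.
```

Anchoring the scale to a single well-defined activity keeps the channels comparable across subjects, which simplifies choosing shared hyperparameters.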
In the HMM experiments, we provided the true number of modes to the algorithms. Inference is performed via the Viterbi algorithm, and parameter estimation is done with maximum likelihood. In the GMM-HMM, the number of Gaussian mixture components per state is fixed.
In all sticky HDP-SLDS experiments we fixed the upper limit on the number of modes. We first did a grid search in the log domain for the remaining hyperparameters after placing uninformative priors on the rest. We then chose the values that maximize the evidence lower bound. Across all subjects, setting