With the rapid growth in the computational capability of mobile chips, more and more augmented intelligence (AI) applications are moving away from cloud-based solutions to explore on-device methods. The trend is not a surprise, as mobile device users gain a better understanding of data privacy [12, 8] and prefer a highly reliable, low-latency experience [10, 6, 5]. Meanwhile, many research works [13, 11] have indicated that the performance of on-device AI applications can be enhanced if correct contextual information about the user can be leveraged. For example, speaker recognition [1] can switch to a different model if the user context indicates a meeting room or a bus. GPS can be pre-activated and smartphone messaging can be disabled when the context shows that the user starts to drive [9]. Gesture and speech recognition [4, 3] can infer a different intention depending on whether the user’s location context is a car, a bedroom, or a restaurant.
Nevertheless, the contextual information becomes worthless if the cost of computing it is far higher than the value it brings. In practice, the cost comprises both the infrastructure cost, which includes sensor price, battery consumption, and component size, and the algorithm cost, which includes runtime complexity, memory requirement, and training data gathering effort. For example, GPS is an informative source widely used in the recognition of mode of transportation [2]. However, the on-time power consumption when GPS is in tracking mode is typically on the order of tens of mA, assuming a one-second epoch. This discourages designers from including GPS in an always-on transportation mode recognition engine in mobile devices. More importantly, we observe that the context information typically does not vary frequently in time. Thus, always-on low-power solutions that leverage simple sensors and adopt low-complexity algorithms are preferred.
Even though enabling high-cost sensing modalities in always-on inference is not a practical solution, we claim that they can still be leveraged in an opportunistic fashion to maximize system efficiency. Currently, we are exploring two approaches to incorporate this information opportunistically. On the one hand, we can use the information as a confirmation of or correction to the always-on prediction result. On the other hand, we consider leveraging the information, whenever it is available, to perform on-device learning that personalizes and enhances the always-on inference model for each particular user. In this work, we focus on the second route. Specifically, we model the problem as a weakly supervised learning task, where the annotation is provided opportunistically through the high-cost, non-always-on sensors. In mobile sensing, personalizing the inference model to the target user can lead to significant improvement. In other words, in most mobile sensing use cases it turns out to be extremely challenging to build a model with good interpersonal generalization performance. This has become the primary motivation of this work.
The rest of this work is organized as follows. In Section 2, we review related art on weakly supervised learning to serve as background. In Section 3, we propose our main algorithm, which leverages ideas from weakly supervised learning to learn the target distribution of interest. In Section 4, we show that the proposed algorithm is statistically consistent and provide some insight that relates the performance of the algorithm to the noise statistics in the annotation. In Section 5, we provide both a synthetic example and an application to validate our theory. Finally, we conclude and point out a few problems that are worth further research.
2 Multimodal weakly supervised learning
The definition of weakly supervised learning varies in the literature. To the best of our knowledge, state-of-the-art works have focused on three types of weakly supervised learning problems. The first scenario considers that only part of the data is labeled. This is also known as semi–supervised learning in some works. The objective in this case is to leverage prior knowledge, such as the geometry of the data, to optimize prediction power on the labeled instances while encouraging the prior knowledge to be satisfied on both labeled and unlabeled instances. The second scenario is known as positive–unlabeled (PU) learning. Under this category, only part of the instances from the positive hypothesis are labeled. The challenge is to properly handle the negative instances, without explicit knowledge of the label, so that the algorithm does not overfit due to the label imbalance. The last set of problems can be categorized as learning with label noise. In other words, even though labeled, the training instances may not be perfectly supervised. Instead, some noise-adding process is considered, so that the learner has no direct access to the ground truth. There can be many practical reasons for this to happen, for example, privacy requirements and adversarial attacks.
In this paper, we consider the last type of problem, however, under a slightly different setup. Instead of directly assuming some label noise behavior, we consider the label to be obtained through a separate inference process, which is imperfect. The following graphical model illustrates the concept under discussion:
Here, denotes the ground truth class label, which takes values in a discrete finite alphabet. and represent two independent measurements which contain class-conditional information about . The objective here is to learn the statistical relationship between and . In this work, we learn the generative distribution and assume the prior distribution is given. However, we are not directly given the pair . Instead, there is an inference process which takes as input and predicts as . In other words, we have no access to the generative distribution of and , but only to the pair . We start by assuming the predictive model is trained separately beforehand and fixed. Moreover, the experts who trained the model can also provide confusion statistics that characterize its performance in predicting using . Specifically, the th element of the confusion matrix represents the probability. (Footnote 1: Alternatively, one can also define the forward confusion matrix which represents .) Later, we will consider that can be flexible and discuss its effect on the learning process.
From a practical system design perspective, this model captures a few common use cases in mobile sensing. For example, modality– can be heavily user dependent, such as speech, face image, or motion kinetics, while modality– can be invariant to user identity, such as speed of traveling in traffic, altitude change, or illumination level. Therefore, we may enhance and personalize the prediction model for modality–. Also, the price of obtaining a label might itself differ. For example, we can always ask the user to provide an annotation, which is the most accurate but also the most expensive option. In comparison, an annotator using modality– that runs in the background will not interfere with the user experience, but the label will become imperfect as a consequence. It is also worth mentioning that typically the predictive power using modality– is worse than using modality– or tends to be more power hungry, because otherwise the problem statement becomes trivial and there will be no value in improving the model for .
3 Noise correction estimator
In this section, we describe our algorithm for recovering the generative distribution of with access only to the pairs . We start with the following assumptions on the confusion matrix .
Assumption 1. The confusion matrix is a proper left stochastic (Markov) matrix, i.e. its entries are nonnegative and each of its columns sums to one.
Assumption 2. The confusion matrix is invertible, i.e. its determinant is nonzero.
Assumption 3. The inference algorithm is deterministic.
The first assumption can be easily met. Typically the confusion matrix is provided by the designer of the inference system using modality– through empirical evaluation on some validation dataset. Therefore, it requires only proper column normalization. As shown later, the second assumption ensures that the recovery of the generative distribution is feasible. Intuitively, if the confusion matrix is not full rank, then information about some alphabet in will be lost in modality–, and it becomes unrecoverable. Moreover, we assume the inference algorithm to be deterministic, which will most likely be the case in practical systems due to complexity concerns. Therefore, we can simplify the graphical model to Figure 2.
From Figure 2 we observe that since is unobserved, the generative distribution of can be written as:
If we write it in matrix format, we have:
Finally, the generative distribution of can be recovered by inverting the confusion matrix:
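As a hedged sketch of the three relations just described, the following LaTeX uses assumed notation that may differ from the symbols used elsewhere in this paper: x is the measurement whose distribution is being learned, y is the true label with K classes, \hat{y} is the label predicted from the other modality, and C is the confusion matrix with C_{ij} = P(y = i | \hat{y} = j). The first line relies on x and \hat{y} being conditionally independent given y, which follows from the graphical model.

```latex
% Assumed notation (not taken from the original equations):
%   x        measurement whose class-conditional distribution is learned
%   y        true class label, y \in \{1, \dots, K\}
%   \hat{y}  label predicted by the separate inference process
%   C_{ij} = P(y = i \mid \hat{y} = j), so every column of C sums to one
\begin{align*}
p(x \mid \hat{y} = j)
  &= \sum_{i=1}^{K} p(x \mid y = i)\, P(y = i \mid \hat{y} = j)
   = \sum_{i=1}^{K} p(x \mid y = i)\, C_{ij}, \\
\big[\, p(x \mid \hat{y} = 1), \dots, p(x \mid \hat{y} = K) \,\big]
  &= \big[\, p(x \mid y = 1), \dots, p(x \mid y = K) \,\big]\, C, \\
\big[\, p(x \mid y = 1), \dots, p(x \mid y = K) \,\big]
  &= \big[\, p(x \mid \hat{y} = 1), \dots, p(x \mid \hat{y} = K) \,\big]\, C^{-1}.
\end{align*}
```

Because the columns of C sum to one, the columns of its inverse also sum to one, so each recovered function still integrates to one even when some of its values are negative.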
In practice, we also have no access to the true distribution of , which has to be learned from the stored training instances . Denoting the estimator of as , we can similarly calculate the noise-corrected estimator of as:
which simplifies to
However, we need to be aware that the estimator (5) may not be a proper probability distribution. Even though it automatically satisfies the condition that the integral over equals one, there can be regions where this function takes negative values. Therefore, it becomes challenging to construct when is a continuous random vector. In the next section, we will prove that this noise correction estimator is indeed lossless when the number of stored training instances goes to infinity. Similarly, the posterior probability can be estimated by following this method. We have, for the posterior probability:
Thus we have:
in this case represents the right stochastic matrix whose th element measures .
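The noise correction described in this section can be sketched for a measurement that takes values in a small discrete alphabet. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `noise_corrected`, the toy distributions, and the convention that `C[i, j]` holds the probability of true class i given predicted class j are all introduced here for illustration.

```python
import numpy as np


def noise_corrected(Q, C):
    """Recover class-conditional distributions from label-noisy ones.

    Q[:, j] approximates p(x | predicted label = j) on a discrete alphabet.
    C[i, j] is assumed to be P(true label = i | predicted label = j),
    so every column of C sums to one (left stochastic).
    Since Q = P C, the recovered matrix is P = Q C^{-1}.
    """
    return Q @ np.linalg.inv(C)


# Hypothetical toy problem: 4 symbols, 2 classes.
P_true = np.array([[0.7, 0.1],
                   [0.2, 0.1],
                   [0.1, 0.3],
                   [0.0, 0.5]])
C = np.array([[0.9, 0.2],
              [0.1, 0.8]])

# With exact label-noisy distributions the recovery is exact.
Q_exact = P_true @ C
P_exact = noise_corrected(Q_exact, C)

# A slightly perturbed (finite-sample-like) estimate of the noisy
# distributions; each column still sums to one.
Q_emp = np.array([[0.66, 0.22],
                  [0.19, 0.12],
                  [0.12, 0.26],
                  [0.03, 0.40]])
P_emp = noise_corrected(Q_emp, C)

print("exact recovery:", np.allclose(P_exact, P_true))
print("column sums of perturbed estimate:", P_emp.sum(axis=0))
print("smallest entry of perturbed estimate:", P_emp.min())
```

The perturbed case illustrates the caveat noted above: the corrected estimate keeps unit total mass per class but can dip below zero where the true probability is small.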
Here, we also observe that estimator (5) is interestingly related to the method of unbiased estimators proposed by Natarajan et al. [7]. Specifically, they considered a binary classification problem where with the following label flipping probabilities:
satisfying the constraint:
They proposed an unbiased estimator for the loss function that can be adopted in an empirical risk minimization (ERM) procedure as:
Their result can indeed be understood as the frequentists’ counterpart of the probabilistic framework we proposed here. Additionally, we can generalize the theory to a multiclass setting and construct:
where in this case is the right stochastic matrix with . Similarly, we assume the forward confusion matrix to be invertible, i.e. . Therefore, the multiclass unbiased risk can be calculated as:
which simplifies to
One interesting observation here is that the original constraint on the label flipping probabilities appears to be unnecessary. Instead, in the binary case, one can easily prove that the invertibility condition requires only that the two flipping probabilities do not sum to one. This matches the properties of the receiver operating characteristic (ROC) curve in decision theory. In principle, the ROC curve is always above the straight line passing through (0, 0) and (1, 1), which corresponds to the Bernoulli random guess decision. Otherwise, the decision rule can be flipped to achieve that. In (10), the matrix inverse operation automatically handles the decision flipping procedure when the two flipping probabilities sum to more than one.
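The connection between the matrix-inverse correction and the binary unbiased loss can be checked numerically. The sketch below is an illustration under assumptions: `T[i, j]` is taken to be the forward confusion probability of observing label j when the true label is i, the binary closed form is the one from Natarajan et al. with classes ordered (-1, +1), and the function names and numeric values are hypothetical.

```python
import numpy as np


def corrected_losses(losses, T):
    """Noise-corrected loss values, one per observed (noisy) label.

    losses[i] is the loss incurred if the true label is i.
    T[i, j] is assumed to be P(noisy label = j | true label = i), so rows
    sum to one. Solving T @ l_tilde = losses makes the expectation of
    l_tilde over the noisy label equal to the clean loss for every class.
    """
    return np.linalg.solve(T, losses)


def natarajan_binary(losses, rho_neg, rho_pos):
    """Binary closed form, classes ordered (-1, +1).

    rho_neg = P(noisy = +1 | true = -1), rho_pos = P(noisy = -1 | true = +1).
    """
    l_neg, l_pos = losses
    denom = 1.0 - rho_pos - rho_neg
    return np.array([((1 - rho_pos) * l_neg - rho_neg * l_pos) / denom,
                     ((1 - rho_neg) * l_pos - rho_pos * l_neg) / denom])


def forward_matrix(rho_neg, rho_pos):
    return np.array([[1 - rho_neg, rho_neg],
                     [rho_pos, 1 - rho_pos]])


losses = np.array([0.4, 1.5])  # hypothetical loss values for one prediction

# Flip rates summing to less than one: both forms agree.
T = forward_matrix(0.2, 0.3)
l_tilde = corrected_losses(losses, T)
l_closed = natarajan_binary(losses, 0.2, 0.3)

# Flip rates summing to more than one: the matrix is still invertible,
# so the correction remains unbiased.
T_flipped = forward_matrix(0.8, 0.7)
l_tilde_flipped = corrected_losses(losses, T_flipped)

print(l_tilde, l_closed, T_flipped @ l_tilde_flipped)
```

The same `corrected_losses` call works unchanged for more than two classes, since it only requires the forward matrix to be invertible.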
We analyze the consistency of our estimator defined in (4) in this section. The key challenge in (4) is to analyze the effect of the inverse stochastic matrix on the density estimator. We start by establishing a theory that guarantees estimator (4) can recover the true generative distribution given certain conditions.
Let be the Kullback–Leibler (KL) divergence that measures the discrepancy between the two distributions and . Specifically,
Similarly, let denote the KL divergence between and . Suppose both and are valid distribution functions and Assumptions 1–3 hold. Then we have
if and only if
To prove the necessity part, observe that
Applying the log–sum inequality to yields
Necessity follows since if for all , then for all .
For sufficiency, the log–sum property cannot be used since may be negative valued. Instead, we evaluate
by directly observing that since implies for finite . Therefore, the integral remains zero. ∎
Theorem 1 guarantees that when the sample size goes to infinity, a perfect estimator for yields a perfect estimator for . Next, we discuss how the confusion matrix affects the convergence rate.
Let denote the following empirical process
where denotes the true distribution of and is the empirical version of it. Then the following condition holds for :
where and are respectively the maximal and minimal eigenvalues of , and denotes the empirical process that measures the convergence of the individual density estimators.
where the inequality is based on the fact that the eigenvalues of satisfy . For the second term, we observe that the rate at which converges to is determined by the slowest term, since each individual term is strictly positive. Therefore, the upper bound on it is . For the first term, from the central limit theorem we have . ∎
Theorem 2 provides insight into how the confusion process affects the learning rate. Specifically, in addition to the learning rate of each individual density estimator, an extra cost has to be paid based on how much information is lost during the confusion process. To give one example, when the confusion matrix is any permutation matrix, there is no loss of information; only deterministic label swapping is performed. Thus, we have and no additional cost will be paid. In contrast, if we have a close-to-singular confusion matrix, the loss of information will be high, because there are multiple now representing highly similar information in . In this case, is large.
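The two extreme cases described above can be illustrated numerically. The following is a minimal sketch that assumes the eigenvalue spread of the Gram matrix of the confusion matrix (ratio of its largest to smallest eigenvalue) is a reasonable proxy for the information lost in the confusion process; this proxy, the helper names, and the example matrices are assumptions made for illustration.

```python
import numpy as np


def eigen_spread(C):
    """Ratio of the largest to the smallest eigenvalue of C^T C.

    C^T C is symmetric positive semi-definite, so eigvalsh applies and
    returns the eigenvalues in ascending order.
    """
    eig = np.linalg.eigvalsh(C.T @ C)
    return eig[-1] / eig[0]


def symmetric_confusion(num_classes, off_diag):
    """Confusion matrix with identical diagonal and off-diagonal entries."""
    C = np.full((num_classes, num_classes), off_diag)
    np.fill_diagonal(C, 1.0 - (num_classes - 1) * off_diag)
    return C


# Pure label swapping: a permutation matrix loses no information.
permutation = np.array([[0.0, 1.0, 0.0],
                        [0.0, 0.0, 1.0],
                        [1.0, 0.0, 0.0]])

mild_noise = symmetric_confusion(3, 0.05)      # hypothetical noise level
near_singular = symmetric_confusion(3, 0.33)   # columns almost identical

for name, C in [("permutation", permutation),
                ("mild noise", mild_noise),
                ("near singular", near_singular)]:
    print(name, eigen_spread(C))
```

Under this proxy, the permutation matrix gives a spread of one, while the spread grows rapidly as the columns of the confusion matrix become nearly identical.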
Finally, the risk function constructed in (10) can be proven to be unbiased.
The risk function estimator defined in (10) is an unbiased estimator of :
5 Experimental results
In this section, we provide a synthetic example and a real-world application for our theory. We start by considering a synthetic example. We draw binomial samples from three classes with success parameters respectively. The class annotations are corrupted using a confusion matrix . We select the confusion matrix to have identical diagonal components and identical off-diagonal components to simplify the experiment. The noise level in this case can be controlled by adjusting the off-diagonal elements while keeping each row sum equal to one. We evaluate the performance of estimator (5) by computing the sum of the KL divergences between the estimated distributions and the true distributions, using the analytic form.
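Figure 3: Sum of KL divergence between the estimated and the true distributions (a) as the noise level increases; (b) as sample size increases; (c) panel (b) in logarithmic scale. Solid squares indicate the mean over 2000 test runs. Error bars show the one-sided standard deviation.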
We observe that the convergence behavior in terms of is close to linear for each fixed sample size. The convergence behavior in sample size in this example can be theoretically proven to be using the central limit theorem.
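A small simulation in the spirit of the synthetic experiment can be sketched as follows. The success parameters (0.2, 0.5, 0.8), the number of binomial trials (10), the uniform class prior, and the clipping used to keep the KL divergence finite are all illustrative assumptions, not the settings used for the reported figure. With a uniform prior and a symmetric confusion matrix, the forward and backward confusion matrices coincide, which keeps the sketch short.

```python
from math import comb

import numpy as np


def binomial_pmf(n, p):
    """Analytic binomial pmf on the alphabet {0, ..., n}."""
    k = np.arange(n + 1)
    coef = np.array([comb(n, int(i)) for i in k], dtype=float)
    return coef * p ** k * (1.0 - p) ** (n - k)


def symmetric_confusion(num_classes, off_diag):
    """Identical diagonal and identical off-diagonal entries."""
    C = np.full((num_classes, num_classes), off_diag)
    np.fill_diagonal(C, 1.0 - (num_classes - 1) * off_diag)
    return C


def estimate(rng, n_samples, n_trials, probs, C):
    """Noise-corrected estimate of p(x | true class) from noisy labels."""
    probs = np.asarray(probs)
    K = len(probs)
    y = rng.integers(K, size=n_samples)       # uniform class prior (assumed)
    x = rng.binomial(n_trials, probs[y])      # binomial measurement
    # Corrupt each label by sampling from row y of the confusion matrix.
    cdf = np.cumsum(C[y], axis=1)
    y_hat = (rng.random(n_samples)[:, None] > cdf).sum(axis=1)
    y_hat = np.minimum(y_hat, K - 1)          # guard against rounding
    # Empirical p(x | noisy label): one histogram per noisy class.
    Q = np.zeros((n_trials + 1, K))
    np.add.at(Q, (x, y_hat), 1.0)
    Q /= np.maximum(Q.sum(axis=0, keepdims=True), 1.0)
    return Q @ np.linalg.inv(C)


def summed_kl(P_true, P_hat, eps=1e-6):
    """Sum over classes of KL(true || estimate).

    The estimate can contain negative entries, so it is clipped and
    renormalized first; this is a choice made for this sketch only.
    """
    P_fix = np.clip(P_hat, eps, None)
    P_fix = P_fix / P_fix.sum(axis=0, keepdims=True)
    return float(np.sum(P_true * np.log(P_true / P_fix)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    probs, n_trials = (0.2, 0.5, 0.8), 10
    C = symmetric_confusion(3, 0.1)
    P_true = np.column_stack([binomial_pmf(n_trials, p) for p in probs])
    for n_samples in (200, 2000, 20000):
        runs = [summed_kl(P_true, estimate(rng, n_samples, n_trials, probs, C))
                for _ in range(20)]
        print(n_samples, np.mean(runs))
```

Varying the second argument of `symmetric_confusion` instead of the sample size gives the corresponding sweep over the noise level.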
Next we apply the proposed algorithm in a real application. We consider activity recognition using smartphone accelerometer and gyroscope sensors. We noticed that for many users it is a challenging task to distinguish phone call (holding the phone close to the ear while speaking; people tend to pace around very slowly without moving to a particular destination) and slow walk ( Hz step rate). Also, for some users, their slow walk and biking signatures can be hard to distinguish. Thus, we select these three classes and apply our algorithm. Our basic classifier is a Bayes network that converts the time series input from sensors to activity class probabilities. We built a hierarchical model that first extracts features from the time series and then performs smoothed prediction using a hidden semi-Markov model (HsMM). The graphical model is shown in Figure 4. The feature extraction layer converts the time series into a set of finite alphabets; however, we cannot reveal details about the feature extraction block for confidentiality reasons. We explore next whether the inference layer can be re-trained and enhanced by leveraging GPS speed readings. As GPS readings are power consuming, they are not always available. In addition, we cross-validated that, for these three classes, a simple threshold-based classifier leads to the confusion statistics in Table 1.
| Truth \ Result | call (fidget) | slow walk | bike |
The test users are required to collect two minutes of data within each category for re-training purposes, followed by an uninterrupted collection of transitions among those three classes. The baseline model is trained on our company’s internal dataset, which contains not only these three classes but also a few other classes. Next, the baseline model is personalized using the two-minute clean collection, annotated in two different ways. The first annotation uses the ground truth. The second annotation uses the GPS speed based classifier in Table 1. Then estimator (5) is used to correct the annotation noise. We select a Dirichlet–Multinomial pair for the HsMM’s emission probability, where the Dirichlet prior parameters are set according to the posterior values of the baseline model. After re-training, those parameters are updated again using the clean data. The baseline model, the personalized model using ground truth, and the personalized model using GPS inputs are tested on the transition data for comparison in Figure 5.
As can be observed, the baseline model does not correctly recognize call as fidget. Instead, it creates some confusion between slow walk and bike. After we personalize the HsMM emission model using the separate collection paired with ground truth annotation, the model gains significant confidence and correctly recognizes call as a fidget event. Finally, the model personalized with GPS-based annotation also achieves a satisfactory recognition result. The detailed empirical Bayes error rate (BER) in this experiment is provided in Table 2.
In this work, we proposed an automated annotation method for the personalization of always-on mobile sensing models. The proposed algorithm leverages the non-always-on sensing modalities opportunistically. Synthetic results show that our algorithm can find the correct generative model given enough data. Our application shows that the model can indeed help to improve smartphone-based human activity recognition performance in some cases.
Nevertheless, some problems remain open. First, it is still challenging to construct the estimated generative model and to verify whether it satisfies basic probability measure properties, especially for high-dimensional and continuous random variables. Second, as the convergence rate is governed by both the sample size and the eigenvalue structure of the confusion matrix, it is worth investigating whether some tradeoff can be defined to perform sample selection. For example, if in addition to the noisy annotation we are also provided with a confidence measure for that annotation, it is interesting to consider subsampling the data for re-training. While rejecting samples with low confidence can lead to cleaner confusion statistics, it also reduces the number of samples that are available to learn the generative distribution. Also, in situations where training needs to happen on the edge, it is important for mobile devices to save as little data as possible due to storage constraints.
References
[1] (2018) Smart and robust speaker recognition for context-aware in-vehicle applications. IEEE Transactions on Vehicular Technology 67 (9), pp. 8808–8821.
[2] (2013) Transportation mode recognition using GPS and accelerometer data. Transportation Research Part C: Emerging Technologies 37, pp. 118–130.
[3] (2016) Towards pervasive augmented reality: context-awareness in augmented reality. IEEE Transactions on Visualization and Computer Graphics 23 (6), pp. 1706–1724.
[4] (2016-January 26) Location based conversational understanding. Google Patents. Note: US Patent 9,244,984.
[5] An early resource characterization of deep learning on wearables, smartphones and internet-of-things devices. In Proceedings of the 2015 International Workshop on Internet of Things towards Applications, pp. 7–12.
[6] (2018) Learning IoT in edge: deep learning for the internet of things with edge computing. IEEE Network 32 (1), pp. 96–101.
[7] (2013) Learning with noisy labels. In Advances in Neural Information Processing Systems, pp. 1196–1204.
[8] (2017) A hybrid deep learning architecture for privacy-preserving mobile analytics. arXiv preprint arXiv:1703.02952.
[9] (2017) Automatic identification of driver’s smartphone exploiting common vehicle-riding actions. IEEE Transactions on Mobile Computing 17 (2), pp. 265–278.
[10] (2016) Challenges and opportunities in edge computing. In 2016 IEEE International Conference on Smart Cloud (SmartCloud), pp. 20–26.
[11] (2012) Context-aware mobile music recommendation for daily activities. In Proceedings of the 20th ACM International Conference on Multimedia, pp. 99–108.
[12] Privacy-preserving machine learning based data analytics on edge devices. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 341–346.
[13] (2015) Mining mobile user preferences for personalized context-aware recommendation. ACM Transactions on Intelligent Systems and Technology (TIST) 5 (4), pp. 58.