1 Introduction
Deep learning has been shown to be an effective solution for analyzing time-series data and making continuous predictions [fawaz2018deep]. It redefines state-of-the-art performance in a wide range of areas, including illness prediction, surveillance monitoring and anomaly detection [zhao2016deep, acharya2018deep, schirrmeister2017deep, yildirim2018arrhythmia, javaid2016deep, xu2015learning]. As a canonical example, results in [rajpurkar2017cardiologist] show that a deep-learning-based algorithm already exceeds board-certified cardiologists in detecting heart arrhythmias from electrocardiograms when given a large annotated dataset.

The success of deep learning often relies on large model sizes and therefore heavy computational complexity. For example, the aforementioned arrhythmia detection algorithm [rajpurkar2017cardiologist] relies on a 34-layer convolutional neural network to map sequences of ECG samples to rhythm classes on a fixed dataset. However, in most real scenarios, monitoring is performed on edge devices that are designed with limited storage and computation ability, such as wearable devices for health monitoring and cameras for anomaly detection. Besides, the input signals for real-time monitoring arrive in sequential order rather than lying in a fixed dataset, making it demanding or even impossible for edge devices to provide real-time predictions if a huge network is deployed.
One possible solution is to collect data on edge devices and provide real-time responses with a truncated or compressed model. But this often leads to an undesirable sacrifice of prediction accuracy, as most simple neural networks cannot fully reveal the underlying mechanisms and fail to provide predictions as accurate as those of complex models. Another possible approach is to treat the local monitoring devices only as data collectors, while model training and data analysis are performed on remote servers. But this may incur large communication overheads if the data dimension and/or the monitoring time is large. Moreover, from the users' perspective, constantly sending local data such as biomedical information to a remote server may raise concerns about privacy and security. Meanwhile, servers would need to make continuous predictions and provide timely responses for each user, which can be impractical since the number of users can be huge.
Therefore, it is necessary to design a new learning paradigm with collaborative inference, so that most computations and predictions are performed on the local device without the necessity of sending sensitive data to servers. In case of emergencies or situations beyond the capability of the simple local model, data are transported to servers where further analysis is performed. Moreover, for scenarios like medical monitoring, one needs more than an adequate approximation to a target function: a safety requirement should be guaranteed so that the local device can detect unusual situations in advance, and false negative cases should always be eliminated.
To address these two issues, we consider the following question: given a strong but complex classifier, how can we make it suitable for deployment on edge devices while guaranteeing the safety requirement?
The answer provided in this paper is a new learning architecture that decomposes the monitoring system into two separate parts: a simple on-device model acting as the monitoring tool to provide timely responses locally, and a complex remote model acting as the corrector to provide accurate modifications for unusual cases (Fig 1). With this design, the whole system achieves the following advantages.

Approximation accuracy: with the assistance of the on-server corrector, we show that the approximation ability of this collaborative system is no worse than that of the original accurate but complex model (see Prop 1). This is in contrast to classical model compression methods, which usually involve a trade-off between model size and accuracy.

Communication reduction: most of the time, computations and predictions are performed on the local device without the necessity of sending sensitive data to servers. Communication is triggered only when the local prediction exceeds a predefined threshold.

Safety requirement: the on-server correction term is designed to be everywhere non-positive, so the local model provides upper-bound predictions and guarantees the safety requirement.
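The advantages above can be illustrated with a minimal sketch of the collaborative loop. The names `local_model` and `server_correction` are hypothetical stand-ins (not from the paper) for the on-device monitor and the non-positive on-server corrector:

```python
# Minimal sketch of the collaborative inference loop described above.
# `local_model` and `server_correction` are hypothetical placeholders for
# the on-device monitor and the on-server (non-positive) corrector.

def local_model(x):
    # Cheap on-device upper-bound prediction (toy placeholder).
    return 0.6 * x + 0.3

def server_correction(x):
    # Remote corrector; designed to be non-positive everywhere.
    return -0.25 * x

def monitor(x, threshold=0.5):
    """Run on-device; contact the server only above the threshold."""
    g = local_model(x)                  # always computed locally
    if g < threshold:
        return g, False                 # no communication triggered
    f_hat = g + server_correction(x)    # refined prediction from server
    return f_hat, True

pred, contacted_server = monitor(0.1)   # stays local for benign inputs
```

The threshold controls the trade-off sketched in the bullets: a lower threshold triggers the server more often, improving accuracy at the cost of communication.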
Related Work
Our work follows the studies of model compression, which aim to deploy complex deep learning algorithms on edge devices. Typical model compaction approaches include distillation [Bucilua:2006:MC:1150402.1150464, ba2014deep, romero2014fitnets], pruning [hanson1989comparing, gong2014compressing, han2015learning], compact convolution filters [zhai2016doubly, cohen2016group] and dynamic execution [DBLP:journals/corr/abs180607568, DBLP:journals/corr/abs181101476]. In contrast to these studies, which focus only on model compression and may sacrifice prediction accuracy, we reuse the original model architecture on the server and show that, through collaboration, the approximation ability of our scheme is no worse than that of the original model.
Collaborative inference between the server and edge devices is also studied in [kang2017neurosurgeon, li2018edge, choi2018deep] using various computation partitioning strategies. The main difference between our work and these studies is the safety regime proposed here: we strictly require the local predictions to lie above the ground truth (see Fig 1), making our approach more suitable for scenarios like illness prediction and anomaly detection, where safety is more vital than a good approximation.
Our work is also partially related to the studies of boosting algorithms [friedman2001greedy, chen2016xgboost, schwenk2000boosting] and cascading methods [gama2000cascade, zhao2004constrained, marquez2018deep], as both use multiple classifiers to form a strong classifier. The difference lies in the fact that boosting and cascading algorithms usually assume only weak classifiers are available and focus on forming a strong classifier by cascading their decisions. Our work instead assumes that the strong classifier already exists but is unsuitable for edge devices due to its large model size and heavy computation requirements. We therefore decompose the strong classifier into a weak monitoring classifier on the device and a strong correcting classifier on the server.
2 Efficient Monitoring by Model Decomposition
The conceptual scheme above is formally formulated in this section, followed by three evaluation metrics: approximation error, false positive rate and false negative rate.
2.1 Problem Formulation
The model decomposition scheme introduced in the previous section can be summarized as the task of computing a function approximation. Formally, we let $\mathcal{X}$ be the bounded set of all possible states (e.g. physiological data collected by a wearable device), and let the function $f: \mathcal{X} \to \mathbb{R}$ denote the ground truth target, which maps the input to a scalar output (e.g. some metric that measures the health index of an individual, based on the physiological data). Note that the output of $f$ can also be binary ($-1$ and $1$), in which case we have a classification problem. We assume that an adverse event (e.g. onset of a heat stroke) happens when $f(x) \geq \eta$ for some threshold $\eta$, which for simplicity of presentation we can set to $0$. In the binary classification case, an adverse event is simply $f(x) = 1$.
The goal of efficient remote monitoring is to learn an approximation $\hat f$ that is accurate, deployable on edge devices and, most importantly, safe: the approximation should signal an adverse event before or when it happens. Moreover, we want the approach to be easily implementable. To this end, we assume that we already have a good hypothesis space $\mathcal{H}_c \subset C(\mathcal{X})$ ($C(\mathcal{X})$ denotes the space of continuous functions on $\mathcal{X}$) that approximates $f$ well. However, $\mathcal{H}_c$ can be very complex (e.g. deep convolutional neural networks), so that its members, despite being able to approximate $f$ well, cannot be directly deployed for monitoring. The question we would like to answer is: given $\mathcal{H}_c$, can we construct a new hypothesis space so that we can realize efficient monitoring?
2.2 Model Decomposition
The approach we propose in this paper is as follows. Consider, in addition to $\mathcal{H}_c$, another hypothesis space $\mathcal{H}_s$ that is very simple, so that any function from it is deployable on an edge device. For example, $\mathcal{H}_s$ may consist of much smaller neural networks than those in $\mathcal{H}_c$. Now, we form $\hat f$ as an approximation to $f$ in the following form:
$$\hat f(x) = g(x) - s\,\sigma(h(x)), \qquad g \in \mathcal{H}_s,\; h \in \mathcal{H}_c,\; s > 0. \quad (1)$$
Here, $\sigma$ is chosen to be some fixed continuous invertible function whose purpose is to modulate the output of $h$ so that it is contained within a bounded interval, whose scale is set by $s$; for instance, a convenient choice for $\sigma$ is the sigmoid function $\sigma(z) = 1/(1+e^{-z})$. For the purpose of theoretical analysis, we hereafter assume that $\sigma$ is twice differentiable with bounded second derivative. Note that these conditions are satisfied by the sigmoid function.

Now, suppose that $g, h$ are chosen such that $\hat f$ approximates $f$ well. Then, by construction (1) we must have $\hat f \leq g$, since both $s$ and $\sigma$ are necessarily positive. Hence, $g$ satisfies both the efficiency condition ($g$ is simple) and the safety condition ($g \geq \hat f \approx f$), and is a promising candidate for remote monitoring on edge devices. On the other hand, the second term in (1) serves as a correction on the server side and need only be evaluated when accurate predictions are required (e.g. when $g$ is above the threshold, signaling the possibility of unusual cases).
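Assuming the decomposition takes the form $\hat f = g - s\,\sigma(h)$ with a sigmoid $\sigma$ (the toy models `g` and `h` below are illustrative placeholders, not the paper's networks), the construction can be sketched as:

```python
import math

def sigmoid(z):
    # sigma(z) in (0, 1): continuous, invertible, bounded second derivative.
    return 1.0 / (1.0 + math.exp(-z))

def decomposed_prediction(g, h, s, x):
    """f_hat(x) = g(x) - s * sigma(h(x)); the correction -s*sigma(h(x))
    lies in (-s, 0), so f_hat(x) <= g(x) for every input x."""
    return g(x) - s * sigmoid(h(x))

# Toy placeholder models (assumptions, not from the paper):
g = lambda x: x * x + 1.0      # simple on-device monitor
h = lambda x: 4.0 * x          # complex on-server model (stand-in)
s = 0.5

x = 0.3
f_hat = decomposed_prediction(g, h, s, x)
assert f_hat <= g(x)           # safety: local output upper-bounds f_hat
```

Because $\sigma$ is bounded in $(0, 1)$, the server can only lower the local prediction, and by at most $s$; this is exactly the non-positive correction that makes $g$ a safe upper bound.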
2.3 Performance metrics
The following performance metrics are of interest both in theory and in practice:

Approximation Error. Take
$$\mathcal{E}(\hat f) = \| f - \hat f \|. \quad (2)$$

False Positive Rate. Take
$$\mathrm{FP}_\delta(g) = \mu\big(\{x \in \mathcal{X} : g(x) \geq 0,\; f(x) \leq -\delta\}\big). \quad (3)$$

False Negative Rate. Take
$$\mathrm{FN}_\delta(g) = \mu\big(\{x \in \mathcal{X} : g(x) \leq 0,\; f(x) \geq \delta\}\big). \quad (4)$$
Some remarks on the metrics are in order. For approximation we typically consider the $L^2(\mu)$ and $L^\infty$ norms. In the definitions of the false positive and false negative rates, we introduced a regularization parameter $\delta \geq 0$ representing a margin between the decision boundaries. For understanding's sake one may take $\delta = 0$, but a positive value is useful to rule out pathological cases involving the decision boundary. Also, if $\mu(\mathcal{X}) = 1$ then these are true "rates"; otherwise they are unnormalized, but this minor point does not affect subsequent results.
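Under the conventions assumed here (decision threshold 0, monitor output g, margin delta), the three metrics can be estimated empirically from samples; this is a Monte-Carlo sketch, not the paper's evaluation code:

```python
def empirical_metrics(f_vals, fhat_vals, g_vals, delta=0.0):
    """Empirical estimates of approximation error (L2), false positive
    rate and false negative rate, with decision threshold 0 and margin
    delta. All three value lists are aligned sample-by-sample."""
    n = len(f_vals)
    approx_err = (sum((f - fh) ** 2 for f, fh in zip(f_vals, fhat_vals)) / n) ** 0.5
    # FP: monitor alarms (g >= 0) although the truth is safely below -delta.
    fp = sum(1 for f, g in zip(f_vals, g_vals) if g >= 0 and f <= -delta) / n
    # FN: monitor stays silent (g <= 0) although the event occurs (f >= delta).
    fn = sum(1 for f, g in zip(f_vals, g_vals) if g <= 0 and f >= delta) / n
    return approx_err, fp, fn

# Tiny made-up sample: monitor outputs (g_vals) upper-bound the refined
# predictions (fhat_vals), as the decomposition guarantees.
f_vals    = [-0.8, -0.2, 0.1, 0.7]
fhat_vals = [-0.7, -0.2, 0.2, 0.6]
g_vals    = [-0.5,  0.1, 0.4, 0.9]
err, fp, fn = empirical_metrics(f_vals, fhat_vals, g_vals, delta=0.05)
```

In this toy sample the monitor incurs one false positive (second point) and no false negatives, matching the safety-first design.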
3 Theoretical Analysis
We established our model decomposition scheme in the previous section, but a few questions remain: 1) whether this decomposition scheme works well; 2) how to choose the simplified model and the parameter $s$. The answers are provided through a complete theoretical analysis in the following parts.
3.1 Approximation Error
The validity of our model decomposition approach rests upon the following premise: given that $\mathcal{H}_c$ approximates $f$ well, must there exist an approximation of the form (1) that also approximates $f$ well? It turns out that the answer is positive under mild conditions on $\mathcal{H}_s$ and $\mathcal{H}_c$.
Proposition 1.
Consider the approximation scheme in (1). Suppose that $\mathcal{H}_s, \mathcal{H}_c$ are linearly closed, meaning that for every $g$ in $\mathcal{H}_s$ (resp. $h$ in $\mathcal{H}_c$), $\lambda g \in \mathcal{H}_s$ (resp. $\lambda h \in \mathcal{H}_c$) for any $\lambda \in \mathbb{R}$. Then
$$\inf_{g \in \mathcal{H}_s,\, h \in \mathcal{H}_c,\, s > 0} \big\| f - \big(g - s\,\sigma(h)\big) \big\|_\infty \;\leq\; \inf_{h \in \mathcal{H}_c} \| f - h \|_\infty. \quad (5)$$
Proof.
Let $\varepsilon > 0$ and fix $h^* \in \mathcal{H}_c$ such that $\| f - h^* \|_\infty \leq \inf_{h \in \mathcal{H}_c} \| f - h \|_\infty + \varepsilon$. By Taylor's theorem, for each $z$,
$$\sigma(z) = \sigma(0) + \sigma'(0)\, z + \tfrac{1}{2}\sigma''(\xi)\, z^2$$
for some $\xi$ between $0$ and $z$. Taking $z = -h^*(x) / (s\,\sigma'(0))$, we have for any $x$,
$$-s\,\sigma\!\left(\frac{-h^*(x)}{s\,\sigma'(0)}\right) = -s\,\sigma(0) + h^*(x) + O\!\left(\frac{\|h^*\|_\infty^2}{s}\right).$$
Now, simply take $s$ large enough so that the remainder term is at most $\varepsilon$, and then take $g$ to absorb the constant $s\,\sigma(0)$ and $h$ to be the rescaling of $h^*$ (which is always possible by linear closure). We then have
$$\big\| f - \big(g - s\,\sigma(h)\big) \big\|_\infty \leq \| f - h^* \|_\infty + \varepsilon \leq \inf_{h \in \mathcal{H}_c} \| f - h \|_\infty + 2\varepsilon. \quad (6)$$
Moreover, $h \in \mathcal{H}_c$ by the linear closure assumption. Since $\varepsilon$ is arbitrary, this proves the claim. ∎
Remark: The above result shows that, by collaboration of the on-device model $g$ and the on-server negative corrector $-s\,\sigma(h)$, the approximation power is no worse than that of the original complex model. This is in contrast to previous algorithms where the original model is discarded and the model's approximation ability usually drops after compression.
Prop 1 ensures that the combined model is a good approximator, but it does not tell us how to choose the model $g$ and the parameter $s$. To proceed, we need the following assumption.
Assumption 1.
The underlying ground truth $f$ admits a decomposition as a linear combination of simpler functions drawn from a collection $\Phi = \{\varphi_i\}$, namely
$$f(x) = \sum_{i=1}^{\infty} a_i\, \varphi_i(x). \quad (7)$$
Remark: Any well-behaved function has the above decomposition, including many practical machine learning models. In neural networks, the $\varphi_i$ represent the high-level features at the second-last layer, and the final classification relies on a linear combination of these feature functions (e.g., in LeNet [lecun1998gradient], the classification on the last layer relies on the 84 high-level feature functions in the second-last layer). For explicit kernel methods, the $\varphi_i$ may represent feature maps, while for Fourier expansions they represent the basis functions.
With this assumption, we have a direct way of constructing simple approximations to $f$ by truncating the series. To promote the safety condition $g \geq f$, it is necessary to augment the truncated series with an appropriate positive function. For simplicity, we analyze the setting where we augment the truncated series with a constant term:
$$g_m(x) = \sum_{i=1}^{m} a_i\, \varphi_i(x) + c. \quad (8)$$
Here, $m$ controls the complexity of $g_m$, which for $c$ large enough will become a safe monitoring function. The following result gives some quantitative estimates in this direction.
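The construction in (8) — truncate the series and add a constant that covers the sup-norm of the discarded tail — can be sketched numerically. The cosine features and geometric coefficients below are assumed for illustration:

```python
import math

def make_safe_monitor(coeffs, feats, m, tail_sup):
    """g_m(x) = sum_{i<m} a_i * phi_i(x) + c, where c upper-bounds the
    sup-norm of the truncated tail so that g_m(x) >= f(x) everywhere."""
    def g(x):
        return sum(a * phi(x) for a, phi in zip(coeffs[:m], feats[:m])) + tail_sup
    return g

# Assumed example: f(x) = sum_i a_i cos(i x) with a_i = 0.5**i, |cos| <= 1.
coeffs = [0.5 ** i for i in range(1, 30)]
feats  = [(lambda i: (lambda x: math.cos(i * x)))(i) for i in range(1, 30)]
m = 5
c = sum(abs(a) for a in coeffs[m:])           # tail sup-norm bound
g = make_safe_monitor(coeffs, feats, m, c)

def f(x):  # (near-)ground truth from the full series
    return sum(a * phi(x) for a, phi in zip(coeffs, feats))

# Safety check: the truncated-plus-offset monitor upper-bounds f on a grid.
assert all(g(x) >= f(x) for x in [i / 10 for i in range(-30, 31)])
```

Because the offset covers the worst case of the tail, the monitor is conservative: it never under-estimates, at the price of a uniform positive bias that the server-side correction later removes.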
Proposition 2.
Suppose that $f$ admits the decomposition (7) and $\mathcal{H}_c$ is a universal approximator, so that any continuous function on $\mathcal{X}$ can be approximated arbitrarily well. By setting $c = \sum_{i=m+1}^{\infty} \| a_i \varphi_i \|_\infty$ and $s = 2c$, we have
$$\mathrm{FN}_\delta(g_m) = 0, \quad (9)$$
$$\inf_{h \in \mathcal{H}_c} \big\| f - \big(g_m - s\,\sigma(h)\big) \big\| = 0. \quad (10)$$
Proof of Proposition 2.
The first bound comes directly from the definition of $g_m$: we have $g_m(x) - f(x) = c - \sum_{i>m} a_i \varphi_i(x) \geq 0$, so the monitor never stays below the threshold when an adverse event occurs.
Next, note that
$$\big\| f - \big(g_m - s\,\sigma(h)\big) \big\| = \big\| s\,\sigma(h) - (g_m - f) \big\|,$$
where the inequality $0 \leq g_m(x) - f(x) \leq 2c$ comes from $\big|\sum_{i>m} a_i \varphi_i(x)\big| \leq c$. Noticing that $g_m - f$ takes values in $[0, 2c] = [0, s]$, and that $s\,\sigma(h)$ can match any continuous function with values in $(0, s)$ ($\mathcal{H}_c$ being a universal approximator and $\sigma$ invertible), the right-hand side infimum equals 0. ∎
Remark: (1) By construction of $g_m$, the constant $c$ guarantees the safety requirement, as the false negative rate is always 0.
(2) Assuming for simplicity that $\mathcal{H}_c$ is a universal approximating class for continuous functions (e.g. very large neural networks), the estimate says that as long as the scale $s$ exceeds $2c$, the infimum on the right-hand side is 0, i.e. the combined prediction approximates $f$ well. Compared with Proposition 1, which may require a very large $s$, here $s$ only needs to exceed $2c$, which converges to 0 as $m \to \infty$.
(3) The choice of $s$ captures the dynamic range of the residual term. Consequently, this choice of $s$ is most useful if the series (7) converges rapidly. More generally, Proposition 2 is most useful when applied in conjunction with function classes that allow us to expand the target function with rapidly decaying coefficients. For example, if $\Phi$ is a class of feature maps, then we want to choose this class so that the ground truth is well approximated by only a few of the feature maps. This can be done, for example, by training with regularizers that promote sparsity in the combination coefficients.
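Sparsity in the combination coefficients can be promoted, for instance, by soft-thresholding — the proximal step of an L1 regularizer. This is a generic illustration of the idea, not the paper's training procedure:

```python
import math

def soft_threshold(coeffs, lam):
    """Proximal operator of lam * ||a||_1: shrinks every coefficient
    toward zero by lam, setting small ones exactly to zero. This keeps
    only a few dominant feature maps in the expansion."""
    return [math.copysign(max(abs(a) - lam, 0.0), a) for a in coeffs]

# Coefficients below the threshold are zeroed out entirely.
raw = [0.9, -0.4, 0.05, -0.02]
sparse = soft_threshold(raw, lam=0.1)
```

In an iterative training loop this step would be interleaved with gradient updates (proximal gradient descent); here it simply shows how a regularizer concentrates the expansion on few terms, which is exactly the regime where the truncation (8) is cheap and accurate.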
3.2 False Positive Rate
Furthermore, observe that the scale $s$ is intimately connected with the false positive rate: they increase together. The following result makes this precise.
Proposition 3.
Let $\delta > s$ and $\varepsilon = \| f - \hat f \|_{L^2(\mu)}$, then
$$\mathrm{FP}_\delta(g) \leq \frac{\varepsilon^2}{(\delta - s)^2},$$
where $\hat f = g - s\,\sigma(h)$.
Proof.
We have $g = \hat f + s\,\sigma(h) \leq \hat f + s$, so $g(x) \geq 0$ implies $\hat f(x) \geq -s$. Together with $f(x) \leq -\delta$, this gives $|f(x) - \hat f(x)| \geq \delta - s$. Hence, by Chebyshev's inequality,
$$\mathrm{FP}_\delta(g) \leq \mu\big(\{x : |f(x) - \hat f(x)| \geq \delta - s\}\big) \leq \frac{\varepsilon^2}{(\delta - s)^2}.$$
∎
Remark: This shows that the upper bound on the false positive rate increases with $s$, so a smaller $s$ is preferred to minimize false positives. Combined with Prop 2, the minimal admissible scale $s = 2c$ is sufficient to obtain accurate approximation with minimal false positive cases.
Moreover, if we substitute the approximation results of Proposition 2, then the false positive rate again involves a trade-off between $m$ (increasing which decreases $s$) and the model complexity. In summary, Propositions 2 and 3 are quantitative estimates of the performance trade-off: for fixed model complexity of the on-device monitoring model (fixed $m$), increasing the scale of the server-side correction ($s$) improves overall model accuracy but also increases the false positive rate. On the other hand, for a fixed scale, increasing the monitoring model complexity improves both the approximation quality and the false positive rate, at the cost of heavier computation.
3.3 False Negative Rates
Prop 2 already guarantees the false negative rate to be always 0 by augmenting with a positive constant based on the size of the residual term in the infinity-norm. This of course relies on the assumption that the residual series converges uniformly, so that this constant is finite, which holds in most practical scenarios.
For completeness of the analysis, it is of interest to understand the performance of our framework in case this uniform convergence is violated or an offset smaller than $c$ is chosen. An immediate consequence is that we may violate the safety requirement $g \geq f$; i.e., we incur false negative instances. The analyses developed in Propositions 2 and 3 are applicable for quantifying the extent to which our safety requirement is violated, as well as the size of the residual error, and we describe these results formally in Proposition 4.
Proposition 4.
Proof of Proposition 4.
The first bound follows from an application of Chebyshev's inequality. Next, we evaluate the integral of the residual error over the two regions and apply the triangle inequality to arrive at the bound
For the first term we have
Note that the latter term is valid because the image of the function lies within the domain of whenever . For the second term we have
By combining both inequalities and subsequently taking the infimum over we arrive at the following:
Next, by applying the Cauchy–Schwarz inequality we have the bound
By applying Chebyshev’s inequality appropriately we arrive at the bounds
The result follows by combining these bounds. ∎
Remark: Our result shows that, under the assumption that the series (7) converges in the $L^2$-norm, the false negative rate is effectively governed by how quickly the series vanishes, as well as by the size of the offset. Our bound in Proposition 4 comprises two key components. The first quantity depends on the expressive power of $\mathcal{H}_s$ in approximating the function $f$, and it may potentially increase with a larger offset. The second quantity captures the residue of $f$ that cannot be captured by the truncated expansion. In contrast with the first quantity, it is independent of $\mathcal{H}_s$ and depends inversely on the offset parameter.
3.4 Summary and Practical Implications
To summarize, Prop 1 indicates that for an arbitrary on-device structure, by collaborating with the on-server network and selecting a proper parameter $s$, the model decomposition scheme is always guaranteed to perform no worse than the original complex model. With the further Assumption 1, which is applicable to most deep learning models, the on-device model structure can be precisely truncated from the original model structure. We illustrate applications of the bounds in Propositions 2 and 3 with a series of examples.
General Case: In the general case, the on-device network can be truncated from the original model, with an augmented constant term to capture the residuals as shown in (8). By combining the results of Prop 2 and Prop 3, it is sufficient to set $c = \sum_{i>m} \|a_i \varphi_i\|_\infty$ and $s = 2c$ to obtain an ideal approximation with the lowest FP rate, and the FN rate is strictly 0 in this scenario.
Exponential Decay: In case the coefficients in (7) decay at a specific rate, we have more concrete results. For example, suppose the coefficients decay exponentially, i.e. $|a_i| \leq C \rho^i$ for some fixed $C > 0$ and $0 < \rho < 1$, with $\|\varphi_i\|_\infty \leq 1$. In this case, Prop 2 says that picking $s = 2C\rho^{m+1}/(1-\rho)$ ensures both the positivity criterion and accurate approximation. Moreover, Prop 3 suggests that the false positive bound shrinks at the same exponential rate in $m$.
Power-law Decay: Suppose instead that the coefficients have power-law decay, i.e. $|a_i| \leq C i^{-\alpha}$ for some fixed $\alpha > 1/2$. Assume further that the $\varphi_i$'s in (7) are orthonormal. First we have the tail bound $\sum_{i>m} a_i^2 \lesssim m^{1-2\alpha}$. Thus, the upper bound on the approximation error is $O(m^{1/2-\alpha})$, and we may set $s \propto m^{1/2-\alpha}$. The false positive rate then has an estimate of the same order.
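The parameter choices above can be sketched numerically. The formulas follow the assumed bounds $|a_i| \leq C\rho^i$ and $|a_i| \leq C i^{-\alpha}$ with $\sup|\varphi_i| \leq 1$; the power-law variant uses a direct sup-norm tail sum, which requires $\alpha > 1$:

```python
def scale_exponential(C, rho, m):
    """Correction scale for exponentially decaying coefficients
    |a_i| <= C * rho**i (assuming sup|phi_i| <= 1): twice the tail sum
    C * rho**(m+1) / (1 - rho), in closed form."""
    tail = C * rho ** (m + 1) / (1.0 - rho)
    return 2.0 * tail

def scale_power_law(C, alpha, m, n_terms=10**5):
    """Correction scale for power-law coefficients |a_i| <= C * i**(-alpha)
    with alpha > 1; the tail sum is approximated by direct summation."""
    tail = sum(C * i ** (-alpha) for i in range(m + 1, n_terms))
    return 2.0 * tail

# Larger m -> smaller required correction scale s, in both regimes.
assert scale_exponential(1.0, 0.5, 8) < scale_exponential(1.0, 0.5, 4)
assert scale_power_law(1.0, 2.0, 8) < scale_power_law(1.0, 2.0, 4)
```

The contrast between the two regimes mirrors the analysis: exponential tails let $s$ shrink geometrically with $m$, while power-law tails shrink only polynomially, so more on-device terms are needed for the same correction budget.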
4 Experimental Validations
In this part, we conduct a series of experiments to validate our theoretical analysis in the previous section.
4.1 Synthetic Dataset
The first experiment considers a synthetic dataset simulating the exponential-decay case of Sec 3.4. The goal of this simulation is to validate the analysis of the previous section, as well as to examine the model's efficiency.
The dataset is generated by sampling the input from a uniform distribution and computing its label as a linear combination of cosine functions with exponentially decaying coefficients, which is equivalent to setting the $\varphi_i$'s in Eq (7) to cosine functions at different frequencies. In practice, a FC(1,16,32,64,100,1) neural network structure can efficiently train the loss to approximately 0, hence we select it as our on-server structure. The on-device monitoring structure is selected by truncating and augmenting as in Eq (8). Once the on-device and on-server architectures are selected, the entire model is trained end-to-end using the Adam optimizer [kingma2014adam].
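Since the exact label expression is not reproduced here, the following is a hypothetical reconstruction of such a synthetic dataset — uniform inputs with labels built from cosine features and assumed geometrically decaying coefficients:

```python
import math
import random

def make_synthetic(n, n_freq=8, rho=0.5, seed=0):
    """Sample x ~ Uniform(-pi, pi) and label y = sum_i rho**i * cos(i*x),
    mimicking the exponential-decay case of Sec 3.4. The coefficient
    values and number of frequencies are illustrative assumptions."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.uniform(-math.pi, math.pi)
        y = sum(rho ** i * math.cos(i * x) for i in range(1, n_freq + 1))
        xs.append(x)
        ys.append(y)
    return xs, ys

xs, ys = make_synthetic(1000)
# Labels are bounded by the geometric sum of the coefficients.
bound = sum(0.5 ** i for i in range(1, 9))
assert all(abs(y) <= bound + 1e-9 for y in ys)
```

A dataset of this shape lets one sweep $m$ (number of kept frequencies) and $s$ and plot the three metrics over the $(m, s)$ grid, as Fig 2 does.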
We first validate the analysis presented in Sec 3.1 to 3.3 by showing the landscape of the three metrics in Fig 2.

For safety concerns, the FN rate should be minimal, and Fig 2(b) indicates that the FN rate is almost 0 everywhere, except when both $m$ and $s$ are small, where the requirement is violated.
The above results show that the decomposed model can achieve a low approximation error with a safety guarantee (0 false negative rate) under proper choices of $(m, s)$. Indeed, our analysis in Sec 3.4 provides a more concrete choice of $s$. Fig 3 shows how the practical approximation error varies with different $s$, and the results indicate that the theoretical choices obtain good approximations in all three cases. Since an upper bound is used to approximate the residual, there is a gap between the theoretically and practically optimal values.
4.2 Financial Dataset
The above experiment validates the feasibility of our model decomposition strategy and shows that the analysis developed in Sec 3 can guide practical architecture design. In the following parts, we further demonstrate the broad applicability of our proposed decomposition scheme on a real-world dataset.
The Dow Jones Industrial Average (DJIA) dataset [finance] contains the stock values of 30 companies and how they traded in the stock market during various periods of time. We consider Apple Inc.'s stock price as the ground truth and use the prices of the other 29 companies to predict it. All data are normalized to a common range, with a fixed level serving as the warning threshold. The baseline on-server architecture is a FC(29,64,128,256,1) neural network, which attains a small mean-squared loss. We truncate the number of neurons in the second-last layer to 16 to obtain the on-device model.
Fig 4 reports the results in the following aspects.

The on-device predictor can always provide an upper approximation of the true signal, leading to an early warning before the ground truth exceeds the threshold. This naturally guarantees the safety requirement, since the false negative rate is always 0 in this case.

When the on-device signal exceeds the threshold, the negative corrector is activated and the server provides more accurate predictions. In Fig 4, this combined approximation is almost identical to the ground truth, hence the whole system obtains a more accurate approximation and the extra false positive predictions are eliminated by the remote corrector.

By keeping monitoring and analysis local, the communication cost is reduced by roughly 10x.
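The reported communication saving can be understood as the fraction of time steps whose local prediction stays below the warning threshold. A back-of-the-envelope sketch with made-up numbers (not the DJIA data):

```python
def communication_reduction(local_preds, threshold):
    """Return (trigger rate, reduction factor): the fraction of time steps
    that contact the server, and how many times fewer transmissions occur
    compared with streaming every step to the server."""
    n = len(local_preds)
    sent = sum(1 for g in local_preds if g >= threshold)
    return sent / n, n / max(sent, 1)

# Hypothetical monitoring trace: the local signal crosses the warning
# threshold on 10 of 100 steps, so only 10% of steps trigger the server.
trace = [0.3] * 90 + [0.9] * 10
rate, factor = communication_reduction(trace, threshold=0.8)
```

Under this simple accounting, a trace that alarms on 10% of steps yields a 10x reduction, matching the order of magnitude reported above.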
In addition to strictly truncating the model based on Prop 2, Prop 1 also indicates that a simple neural network can act as the on-device upper approximator. In the appendix, we use a FC(29,10,1) network as the local monitoring tool, so that the on-device model size is compressed more significantly. The drawback is that we have to manually select a larger $s$ and incur slightly more false positive cases (+2%).
To summarize, the model decomposition scheme allows us to use collaborative inference instead of purely on-device or on-server predictions. The approximation error is shown to be no worse than that of the original complex model, both in theory and in practice. In particular, the safety regime proposed in this paper distinguishes our work from previous studies. For the financial dataset, this upper approximation allows users to act on stock movements in advance. For other scenarios like health monitoring, it would allow us to predict disease onset in advance, which can be more vital than merely providing an accurate approximation.
5 Conclusion
In this paper, we introduced a model decomposition scheme to realize efficient and safe inference for remote monitoring tasks on edge devices. The key idea is the combination of a simple local monitoring surrogate and a complex negative corrector on the server side. We demonstrated through experiments that, following our theoretical analysis, one can greatly decrease the model complexity required for safe monitoring, thereby increasing the applicability of deep learning models in resource-constrained and safety-sensitive applications such as remote monitoring.