Deep learning has been shown to be an effective solution for analyzing time-series data and making continuous predictions [fawaz2018deep]. It has redefined state-of-the-art performance in a wide range of areas, including illness prediction, surveillance monitoring and anomaly detection [zhao2016deep, acharya2018deep, schirrmeister2017deep, yildirim2018arrhythmia, javaid2016deep, xu2015learning]. As a canonical example, results in [rajpurkar2017cardiologist] show that a deep-learning-based algorithm already exceeds board-certified cardiologists in detecting heart arrhythmias from electrocardiograms when given a large annotated dataset.
The success of deep learning often relies on large model sizes and therefore heavy computational costs. For example, the aforementioned arrhythmia detection algorithm [rajpurkar2017cardiologist] relies on a 34-layer convolutional neural network to map sequences of ECG samples to rhythm classes on a fixed dataset. However, in most real scenarios, monitoring is performed on edge devices that are often designed with limited storage and computation ability, such as wearable devices for health monitoring and cameras for anomaly detection. Besides, the input signals for real-time monitoring arrive in sequential order rather than lying in a fixed dataset, making it demanding or even impossible for edge devices to provide real-time predictions if a huge network is deployed.
One possible solution is to collect data on edge devices and provide real-time responses with a truncated or compressed model. But this often leads to an undesirable sacrifice of prediction accuracy, as most simple neural networks cannot fully capture the underlying mechanisms and fail to provide predictions as accurate as those of complex models. Another possible approach is to treat the local monitoring devices only as data collectors, while model training and data analysis are performed on remote servers. But this may incur large communication overheads if the data dimension and/or the monitoring duration is large. Moreover, from the users' perspective, constantly sending local data such as their biomedical information to a remote server may raise privacy and security concerns. In the meantime, the server may need to make continuous predictions and provide timely responses for each user, which can often be impractical since the number of users can be huge.
Therefore, it is necessary to design a new learning paradigm with collaborative inference, so that most of the computations and predictions are performed on the local device, without the necessity of sending sensitive data to servers. In emergencies or situations beyond the capability of the simple local model, data are sent to the server, where further analysis is performed. Moreover, for scenarios like medical monitoring, one needs more than an adequate approximation to a target function: a safety requirement should be guaranteed, so that the local device can detect unusual situations in advance and false negative cases are always eliminated.
To address these two issues, we consider the following question: given a strong but complex classifier, how can we make it suitable for deployment on edge devices while at the same time guaranteeing the safety requirement?
The answer provided in this paper is to design a new learning architecture by decomposing the monitoring system into two separate parts: a simple on-device model acting as the monitoring tool to provide timely responses locally, and a complex remote model acting as the corrector to provide accurate modifications in unusual cases (Fig 1). With this design, the whole system achieves the following advantages.
Approximation accuracy: with the assistance of the on-server corrector, we show that the approximation ability of this collaborative system is no worse than that of the original accurate but complex model (see Prop 1). This is in contrast to classical model compression methods, where the system usually involves a trade-off between the model size and the accuracy drop.
Communication reduction: most of the time, computations and predictions are performed on the local device without the necessity of sending sensitive data to servers. Communication is triggered only when the local prediction exceeds the predefined threshold.
Safety requirement: the server-side correction term is designed to be everywhere non-positive, so the local model provides upper-bounding predictions and guarantees the safety requirement.
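As a concrete illustration, the collaborative loop implied by these three properties can be sketched as follows; the function names and interface are ours for illustration, not a prescribed implementation:

```python
# Sketch of the collaborative inference loop: the on-device model answers
# locally, and the server-side corrector is queried only when the local
# prediction crosses the warning threshold. All names here are hypothetical.

def collaborative_predict(x, local_model, server_corrector, threshold=0.0):
    """Return (prediction, server_was_called)."""
    g = local_model(x)                 # cheap, always runs on-device
    if g < threshold:
        return g, False                # normal regime: no communication
    # Local upper estimate crossed the threshold: query the server for
    # the non-positive correction and refine the prediction.
    return g + server_corrector(x), True
```

Because the correction is non-positive, the refined prediction never exceeds the local one, so a locally negative prediction can be trusted without any communication.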
Our work follows the studies of model compression, which aim to deploy complex deep learning algorithms on edge devices. Typical model compression approaches include distillation [Bucilua:2006:MC:1150402.1150464, ba2014deep, romero2014fitnets], pruning [hanson1989comparing, gong2014compressing, han2015learning], compact convolution filters [zhai2016doubly, cohen2016group] and dynamic execution [DBLP:journals/corr/abs-1806-07568, DBLP:journals/corr/abs-1811-01476]. In contrast to these studies, which focus only on model compression and may sacrifice prediction accuracy, we reuse the original model architecture on the server and show that, through collaboration, the approximation ability of our scheme is no worse than that of the original model.
Collaborative inference between the server and edge devices is also studied in [kang2017neurosurgeon, li2018edge, choi2018deep] using various computation partitioning strategies. The main difference between our work and these studies is the safety regime proposed here: we strictly require the local predictions to lie above the ground truth (see Fig 1), making our approach more suitable for scenarios like illness prediction and anomaly detection, where safety is more vital than a good approximation.
Our work is also partially related to studies of boosting algorithms [friedman2001greedy, chen2016xgboost, schwenk2000boosting] and cascading methods [gama2000cascade, zhao2004constrained, marquez2018deep], as both lines of work use multiple classifiers to form a strong classifier. The difference is that boosting and cascading algorithms usually assume that only weak classifiers are available and focus on forming a strong classifier by cascading the decisions of these weak classifiers. Our work instead assumes that the strong classifier already exists but is unsuitable for edge devices due to its large model size and heavy computation requirements. Therefore, we decompose the strong classifier into one weak monitoring classifier on the device and one strong correcting classifier on the server.
2 Efficient Monitoring by Model Decomposition
In this part, we formally formulate the conceptual scheme introduced above, followed by three evaluation metrics: the approximation error, the false positive rate and the false negative rate.
2.1 Problem Formulation
The model decomposition scheme introduced in the previous section can be summarized as the task of computing a function approximation. Formally, we let $\mathcal{X}$ be the bounded set of all possible states (e.g. physiological data collected by a wearable device), and let the function $f: \mathcal{X} \to \mathbb{R}$ denote the ground truth target, which maps the input to a scalar output (e.g. some metric that measures the health index of an individual, based on the physiological data). Note that the output of $f$ can also be binary ($-1$ and $1$), in which case we have a classification problem. We assume that an adverse event (e.g. onset of a heat stroke) happens when $f(x) \ge c$ for some threshold $c$, which for simplicity of presentation we set to $0$. In the binary classification case, an adverse event is simply $f(x) = 1$.
The goal of efficient remote monitoring is to learn an approximation $\hat{f}$ that is accurate, deployable on edge devices and, most importantly, safe – the approximation should signal an adverse event before or when it happens. Moreover, we want the approach to be easily implementable. To this end, we assume that we already have a good hypothesis space $\mathcal{H} \subset C(\mathcal{X})$ ($C(\mathcal{X})$ denotes the space of continuous functions on $\mathcal{X}$) that approximates $f$ well, i.e. $\inf_{h \in \mathcal{H}} \|f - h\| \approx 0$. However, $\mathcal{H}$ can be very complex (e.g. deep convolutional neural networks), so that functions in $\mathcal{H}$, despite being able to approximate $f$ well, cannot be directly deployed for monitoring. The question we would like to answer is: given $\mathcal{H}$, can we construct a new hypothesis space so that we can realize efficient monitoring?
2.2 Model Decomposition
The approach we propose in this paper is as follows. Consider, in addition to $\mathcal{H}$, another hypothesis space $\mathcal{G}$ that is very simple, so that any function from it is deployable on an edge device. For example, $\mathcal{G}$ may consist of much smaller neural networks than those in $\mathcal{H}$. Now, we form an approximation $\hat{f}$ to $f$ in the following form
$$\hat{f}(x) = g(x) - \lambda\,\sigma(h(x)), \qquad g \in \mathcal{G},\ h \in \mathcal{H},\ \lambda > 0. \qquad (1)$$
Here, $\sigma$ is chosen to be some fixed continuous invertible function whose purpose is to modulate the output of the function $h$ so that it is contained within a bounded interval, whose scale is set by $\lambda$; for instance, a convenient choice for $\sigma$ is the sigmoid function. For the purpose of theoretical analysis, we hereafter assume that $\sigma$ is twice differentiable with bounded second derivative. Note that these conditions are satisfied by the sigmoid function.
Now, suppose that $g$, $h$ and $\lambda$ are chosen such that $\hat{f} \ge f$. Then, by construction (1) we must have $g \ge f$, since both $\lambda$ and $\sigma$ are necessarily positive. Hence, $g$ satisfies both the efficiency condition ($g$ is simple) and the safety condition ($g \ge f$), and is a promising candidate for remote monitoring on edge devices. On the other hand, the second term in (1) serves as a correction on the server side and need only be evaluated when accurate predictions are required (e.g. when $g(x)$ is above the threshold, signaling the possibility of unusual cases).
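As a minimal numerical sketch of the construction in (1), with the sigmoid as the modulating function (the helper names and the toy on-device and on-server functions below are ours):

```python
import math

def sigmoid(z):
    """Sigmoid: continuous, invertible, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def combined_prediction(g, h, lam, x):
    """Combined prediction in the spirit of (1): the on-device term g(x)
    minus the server-side correction lam * sigmoid(h(x)). The correction
    is strictly positive, so g(x) always upper-bounds the combined value."""
    return g(x) - lam * sigmoid(h(x))
```

For any positive scale, the on-device value strictly dominates the combined prediction, which is exactly the safety property exploited above.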
2.3 Performance metrics
The following performance metrics are of interest both in theory and in practice:
Approximation Error. Take
$$\mathcal{E}_p = \big\| f - \hat{f} \big\|_{L^p(\mu)},$$
where $\mu$ is a measure on $\mathcal{X}$ (e.g. the distribution of the input data).
False Positive Rate. Take
$$\mathrm{FP} = \mu\big(\{x \in \mathcal{X} : g(x) \ge 0 \text{ and } f(x) < -\delta\}\big).$$
False Negative Rate. Take
$$\mathrm{FN} = \mu\big(\{x \in \mathcal{X} : g(x) < 0 \text{ and } f(x) \ge \delta\}\big).$$
Some remarks on the metrics are in order. For approximation we typically consider the $L^2$ ($p = 2$) and $L^\infty$ ($p = \infty$) norms. In the definitions of the false positive and false negative rates, we introduced a regularization parameter $\delta \ge 0$ representing a margin between the decision boundaries. For understanding's sake one may take $\delta = 0$, but a positive value is useful to rule out pathological cases involving the decision boundary. Also, if $\mu(\mathcal{X}) = 1$ then these are true "rates". Otherwise they are unnormalized, but this minor point does not affect subsequent results.
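The three metrics can be computed empirically on a finite sample; a hedged sketch, where the function name and the finite-sample formulas are our illustration of the definitions above:

```python
def monitoring_metrics(f_true, g_pred, delta=0.0):
    """Empirical versions of the three metrics on n sample points:
    L2 approximation error, false positive rate (alarm raised while the
    truth is below -delta) and false negative rate (truth at or above
    delta while no alarm is raised). Illustrative formulas, not the
    paper's exact definitions."""
    n = len(f_true)
    err = (sum((a - b) ** 2 for a, b in zip(f_true, g_pred)) / n) ** 0.5
    fp = sum(1 for a, b in zip(f_true, g_pred) if b >= 0 and a < -delta) / n
    fn = sum(1 for a, b in zip(f_true, g_pred) if a >= delta and b < 0) / n
    return err, fp, fn
```

With an upper-biased predictor (every prediction at or above the truth), the empirical false negative rate is zero by construction.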
3 Theoretical Analysis
We established our model decomposition scheme in the previous part, but a few questions remain open: 1) whether this decomposition scheme works well; 2) how to choose the simplified model $g$ and the parameter $\lambda$. The answers are provided through a theoretical analysis in the following parts.
3.1 Approximation Error
The validity of our model decomposition approach rests upon the premise: given that $\mathcal{H}$ approximates $f$ well, does that mean that there must exist $g \in \mathcal{G}$, $h \in \mathcal{H}$ and $\lambda > 0$ so that $\hat{f}$ in the form of (1) also approximates $f$ well? It turns out that the answer is positive under mild conditions on $\mathcal{G}$, $\mathcal{H}$ and $\sigma$.
Proposition 1. Consider the approximation scheme in (1). Suppose that $\mathcal{G}$ and $\mathcal{H}$ are linearly closed, meaning that for every $g$ in $\mathcal{G}$ (resp. $h$ in $\mathcal{H}$), $cg \in \mathcal{G}$ (resp. $ch \in \mathcal{H}$) for any $c \in \mathbb{R}$. Then
$$\inf_{g \in \mathcal{G},\, h \in \mathcal{H},\, \lambda > 0} \big\| f - \big(g - \lambda\,\sigma(h)\big) \big\|_\infty \;\le\; \inf_{h \in \mathcal{H}} \|f - h\|_\infty.$$
Proof. Let $\epsilon > 0$ be arbitrary and fixed. Then, there exists an $h^\dagger \in \mathcal{H}$ such that $\|f - h^\dagger\|_\infty \le \inf_{h \in \mathcal{H}} \|f - h\|_\infty + \epsilon$ and $\|h^\dagger\|_\infty < \infty$. By assumption, there exists $M > 0$ such that $\sup_z |\sigma''(z)| \le M$. By Taylor's theorem, for each $z \in \mathbb{R}$,
$$\sigma(z) = \sigma(0) + \sigma'(0)\,z + \tfrac{1}{2}\,\sigma''(\xi)\,z^2$$
for some $\xi$ between $0$ and $z$. Let $h_\lambda = -h^\dagger / (\lambda\,\sigma'(0))$ and take $z = h_\lambda(x)$ in the above; then we have for any $x \in \mathcal{X}$,
$$\big| \lambda\,\sigma(h_\lambda(x)) - \lambda\,\sigma(0) + h^\dagger(x) \big| \le \frac{M\,\|h^\dagger\|_\infty^2}{2\,\sigma'(0)^2\,\lambda}.$$
Now, simply take $\lambda$ large enough so that the right-hand side is at most $\epsilon$, and then take $g \equiv \lambda\,\sigma(0)$ (which is always possible by linear closure, assuming the constant functions belong to $\mathcal{G}$). We then have
$$\big\| f - \big(g - \lambda\,\sigma(h_\lambda)\big) \big\|_\infty \le \|f - h^\dagger\|_\infty + \epsilon \le \inf_{h \in \mathcal{H}} \|f - h\|_\infty + 2\epsilon.$$
Moreover, $h_\lambda \in \mathcal{H}$ by the linear closure assumption. Since $\epsilon$ is arbitrary, this proves the claim. ∎
Remark: The above result shows that, through collaboration between the on-device model $g$ and the on-server negative corrector $-\lambda\,\sigma(h)$, the approximation power is no worse than that of the original complex model class $\mathcal{H}$. This is in contrast to previous algorithms, where the original model is discarded and the model's approximation ability usually drops after compression.
Prop 1 ensures that the combined model is a good approximator, but it does not indicate how to choose the model $g$ and the parameter $\lambda$. To proceed, we need the following assumption.
Assumption 1. The underlying ground truth $f$ admits a decomposition as a linear combination of simpler functions drawn from a collection $\Phi = \{\varphi_i\}_{i \ge 1}$, namely
$$f(x) = \sum_{i=1}^{\infty} a_i\,\varphi_i(x). \qquad (7)$$
Remark: Any well-behaved function admits the above decomposition, including many practical machine learning models. In neural networks, the $\varphi_i$ represent the high-level features at the second-last layer, and the final classification relies on a linear combination of these feature functions (e.g., in LeNet [lecun1998gradient], the classification on the last layer relies on the 84 high-level feature functions in the second-last layer). For explicit kernel methods, the $\varphi_i$ may represent feature maps, while for Fourier expansions the $\varphi_i$ represent the basis functions.
With this assumption, we have a direct way of constructing simple approximations to $f$ by truncating the series. To promote the safety condition $g \ge f$, it is necessary to augment the truncated series with an appropriate positive function. For simplicity, we analyze the setting where we augment the truncated series with a constant term:
$$g_m(x) = \sum_{i=1}^{m} a_i\,\varphi_i(x) + \lambda\,\sigma(0). \qquad (8)$$
Here, $m$ controls the complexity of $g_m$, which for $\lambda$ large enough will become a safe monitoring function. The following result gives some quantitative estimates in this direction.
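A small sketch of the truncated-and-augmented surrogate in the spirit of (8), under our simplifying assumption that each feature function is bounded by 1, so that the sum of the discarded coefficient magnitudes bounds the residual:

```python
import math

def truncated_monitor(coeffs, feats, m, lam):
    """On-device surrogate: keep the first m terms of the expansion of
    the ground truth and add a constant offset lam. If lam upper-bounds
    the discarded residual, the surrogate never under-predicts the
    ground truth (illustrative construction)."""
    def g(x):
        return sum(a * phi(x) for a, phi in zip(coeffs[:m], feats[:m])) + lam
    return g
```

Choosing the offset as the sum of absolute discarded coefficients (valid when every feature is bounded by 1) makes the surrogate a pointwise upper approximation.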
Proposition 2. Suppose that $f$ admits the decomposition (7) and $\mathcal{H}$ represents a universal approximator, so that $\inf_{h \in \mathcal{H}} \|v - h\|_\infty = 0$ for every $v \in C(\mathcal{X})$. By setting $g = g_m$ as in (8) and $\lambda = 2\,\|\sum_{i>m} a_i\varphi_i\|_\infty$ (with $\sigma$ the sigmoid, so that $\sigma(0) = 1/2$), we have
$$\mathrm{FN} = 0 \qquad \text{and} \qquad \inf_{h \in \mathcal{H}} \big\| f - \big(g_m - \lambda\,\sigma(h)\big) \big\|_\infty = 0.$$
Proof of Proposition 2.
Writing $r_m = \sum_{i>m} a_i\varphi_i$ for the residual, the first bound comes directly from the definition (8):
$$g_m - f = \lambda\,\sigma(0) - r_m = \|r_m\|_\infty - r_m \ge 0,$$
so $g_m \ge f$ everywhere and no false negative can occur. Next, by noting that
$$\inf_{h \in \mathcal{H}} \big\| f - g_m + \lambda\,\sigma(h) \big\|_\infty = \lambda\, \inf_{h \in \mathcal{H}} \Big\| \sigma(h) - \Big( \sigma(0) - \frac{r_m}{\lambda} \Big) \Big\|_\infty,$$
it suffices to show that the right-hand side vanishes. Noticing $\lambda = 2\,\|r_m\|_\infty$, we have $|r_m/\lambda| \le 1/2$, and therefore $\sigma(0) - r_m/\lambda$ takes value in $[0, 1]$. Since $\mathcal{H}$ is a universal approximator and $\sigma(h)$ can match any continuous target taking values in $[0, 1]$ to arbitrary accuracy, the r.h.s. equals 0. ∎
Remark: (1) By construction of $g_m$, the choice $\lambda = 2\,\|\sum_{i>m} a_i\varphi_i\|_\infty$ guarantees the safety requirement, as the false negative rate is always 0.
(2) By assuming for simplicity that $\mathcal{H}$ is a universal approximating class for continuous functions (e.g. very large neural networks), the estimate says that as long as the scale $\lambda$ exceeds $2\,\|\sum_{i>m} a_i\varphi_i\|_\infty$, the infimum on the right-hand side is 0, i.e. the combined prediction approximates $f$ well. Compared with Proposition 1, which may require a very large $\lambda$, here $\lambda$ only needs to exceed $2\,\|\sum_{i>m} a_i\varphi_i\|_\infty$, which converges to 0 as $m \to \infty$.
(3) The choice of $\lambda$ captures the dynamic range of the residual term. Subsequently, this choice of $\lambda$ is most useful if the series (7) converges rapidly. More generally, Proposition 2 is most useful when applied in conjunction with function classes that allow us to expand the target function with rapidly decaying coefficients. For example, if $\Phi$ is a class of feature maps, then we want to choose this class so that the ground truth is well approximated by only a few of the feature maps. This can be done, for example, by training with regularizers that promote sparsity in the combination coefficients.
3.2 False Positive Rate
Furthermore, observe that $\lambda$ is intimately connected with the false positive rate: they increase together. The following result makes this precise.
Proposition 3. Let $g = g_m$ as in (8) with $\lambda \ge 2\,\|\sum_{i>m} a_i\varphi_i\|_\infty$. Then
$$\mathrm{FP} \le \mu\big(\{x \in \mathcal{X} : -\lambda \le f(x) < -\delta\}\big).$$
Remark: This shows that the upper bound on the false positive rate increases with $\lambda$, so a smaller $\lambda$ is preferred to incur minimal false positives. Combining this with Prop 2, setting $\lambda = 2\,\|\sum_{i>m} a_i\varphi_i\|_\infty$ is sufficient to obtain the minimal number of false positive cases.
Moreover, if we put $\lambda = 2\,\|\sum_{i>m} a_i\varphi_i\|_\infty$ into the approximation results in Proposition 2, then the false positive rate again involves a trade-off between the truncation level $m$ (which decreases $\lambda$) and the on-device computational cost. In summary, Propositions 2 and 3 are quantitative estimates of the performance trade-off: for a fixed model complexity of the on-device monitoring model (fixed $m$), increasing the scale of the server-side correction ($\lambda$) improves overall model accuracy but also increases the false positive rate. On the other hand, for a fixed scale, increasing the monitoring model complexity improves both the approximation quality and the false positive rate, at the cost of heavier computation.
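The trade-off can be illustrated numerically with an idealized monitor whose only error is the safety offset itself (our simplification: the monitor equals the ground truth shifted up by the offset):

```python
def alarm_rates(f_values, lam, delta=0.0):
    """False positive / false negative rates of the shifted monitor
    g = f + lam on a sample of ground-truth values: an idealized
    on-device model whose only error is the safety offset lam."""
    n = len(f_values)
    fp = sum(1 for v in f_values if v + lam >= 0 and v < -delta) / n
    fn = sum(1 for v in f_values if v >= delta and v + lam < 0) / n
    return fp, fn
```

Growing the offset raises more false alarms, but a positive offset can never miss a true alarm, which mirrors the trade-off stated above.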
3.3 False Negative Rates
Prop 2 already guarantees the false negative rate to be always 0, by augmenting a positive constant based on the size of the residual term in the infinity norm. This of course relies on the assumption that the residual series converges uniformly, so that $\|\sum_{i>m} a_i\varphi_i\|_\infty$ is finite, which holds in most practical scenarios.
For completeness of the analysis, it is of interest to understand the performance of our framework in case this uniform convergence is violated, or an offset smaller than $2\,\|\sum_{i>m} a_i\varphi_i\|_\infty$ is chosen. An immediate consequence is that we may violate the safety requirement $g \ge f$; i.e., we incur false negative instances. The analyses developed in Propositions 2 and 3 are applicable for quantifying the extent by which our safety requirement is violated, as well as the size of the residual error, and we describe these results formally in Proposition 4.
Proof of Proposition 4.
The first bound follows from an application of Chebyshev's inequality. Next, we evaluate the integral of the residual error over the two complementary regions and apply the triangle inequality to arrive at the bound
For the first term we have
Note that the latter term is valid because the image of the function lies within the domain of whenever . For the second term we have
By combining both inequalities and subsequently taking the infimum over $h \in \mathcal{H}$, we arrive at the following:
Next, by applying the Cauchy-Schwarz inequality we have the bound
By applying Chebyshev’s inequality appropriately we arrive at the bounds
The result follows by combining these bounds. ∎
Remark: Our result shows that, under the assumption that the series (7) converges in the $L^2$-norm, the false negative rate is effectively governed by how quickly the series vanishes, as well as by the size of the offset. Our bound on the error in Proposition 4 comprises two key components. The first quantity depends on the expressive power of $\mathcal{H}$ in approximating the target, and it may potentially increase with a larger $\lambda$. The second quantity captures the residue of $f$ that cannot be captured by the truncated expansion. In contrast with the first quantity, it is independent of $\mathcal{H}$ and depends inversely on the parameter $\lambda$.
3.4 Summary and Practical Implications
To summarize, Prop 1 indicates that for an arbitrary on-device structure, by collaborating with the on-server network and selecting a proper parameter $\lambda$, the model decomposition scheme is always guaranteed to perform no worse than the original complex model. With the further Assumption 1, which is applicable to most deep learning models, the on-device model structure can be precisely truncated from the original model structure. We illustrate applications of the bounds in Propositions 2 and 3 with a series of examples.
General Case: In the general case, the on-device network can be truncated as $g_m$ from the original model, with an augmented constant term to capture the residuals, as shown in (8). By combining the results of Prop 2 and Prop 3, it is sufficient to set $g = g_m$ and $\lambda = 2\,\|\sum_{i>m} a_i\varphi_i\|_\infty$ to obtain an ideal approximation with the lowest FP rate, and the FN rate is strictly 0 in this scenario.
Exponential Decay: In case the coefficients in (7) decay in a specific manner, we have more concrete results. For example, suppose the coefficients have exponential decay, i.e. $|a_i| \le c\,\gamma^i$ for some fixed $c > 0$ and $\gamma \in (0, 1)$. In this case, Prop 2 says that picking $\lambda = O(\gamma^m)$ ensures both the positivity criterion and accurate approximation. Moreover, Prop 3 suggests that $\mathrm{FP} \le \mu(\{x : -O(\gamma^m) \le f(x) < -\delta\})$.
Power-law Decay: Suppose instead that the coefficients have power-law decay, i.e. $|a_i| \le c\,i^{-\alpha}$ for some fixed $\alpha > 1/2$. Assume further that the $\varphi_i$'s in (7) are orthonormal. First we have the bound $\|\sum_{i>m} a_i\varphi_i\|_{L^2}^2 = \sum_{i>m} a_i^2 = O(m^{1-2\alpha})$. Thus, the upper bound on the approximation error is $O(m^{\frac{1}{2}-\alpha})$, and we may set $\lambda = O(m^{\frac{1}{2}-\alpha})$. The false positive rate then has estimate $O(m^{1-2\alpha}/\delta^2)$ by Chebyshev's inequality.
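The tail sums that drive these two regimes can be checked numerically; a sketch under our assumption that the feature functions are bounded by 1, so that the sum of discarded coefficient magnitudes bounds the residual:

```python
def tail_sum(coef, m, n_terms=100000):
    """Numerical upper bound on the discarded residual sum_{i>m} |a_i|,
    where coef(i) returns the i-th coefficient. With features bounded
    by 1, this is a natural choice of scale for the safety offset
    (illustrative computation)."""
    return sum(abs(coef(i)) for i in range(m + 1, n_terms + 1))
```

Geometric coefficients leave a tail that itself decays geometrically, while power-law coefficients leave a much heavier tail, which is why the exponential case admits a far smaller offset at the same truncation level.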
4 Experimental Validations
In this part, we conduct a series of experiments to validate our theoretical analysis in the previous section.
4.1 Synthetic Dataset
The first experiment considers a synthetic dataset that simulates the exponential decay case in Sec 3.4. The goal of this simulation is to validate the analysis of the previous section, as well as to examine the model's efficiency.
The dataset is generated by sampling the input from a uniform distribution and computing its label as a superposition of cosines with exponentially decaying coefficients, which is equivalent to setting the $\varphi_i$'s in Eq (7) to cosine functions at different frequencies with exponentially decaying $a_i$. In practice, a FC(1,16,32,64,100,1) neural network can efficiently drive the training loss to approximately 0, hence we select it as our on-server structure. The on-device monitoring structure is obtained by truncating and augmenting as in Eq (8). Once the on-device and on-server architectures are selected, the entire model is trained end-to-end using the Adam optimizer [kingma2014adam].
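For reference, a stand-in data generator in this spirit might look as follows; the number of terms, the decay rate and the frequencies are our assumptions, not the exact constants of the experiment:

```python
import math
import random

def make_synthetic(n, n_terms=10, gamma=0.5, seed=0):
    """Illustrative stand-in for the synthetic task: inputs sampled
    uniformly from [0, 1], labels given by a cosine expansion whose
    coefficients decay exponentially (gamma ** k). The constants here
    are assumptions, not those of the original experiment."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ys = [sum(gamma ** k * math.cos(2.0 * math.pi * k * x)
              for k in range(n_terms)) for x in xs]
    return xs, ys
```

Because the coefficients are geometric, the labels are uniformly bounded by the coefficient sum, which is what makes the constant augmentation in Eq (8) effective here.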
For safety concerns, the FN rate should be minimal; Fig 2(b) indicates that the FN rate is almost 0 everywhere, except when both $m$ and $\lambda$ are small, where the requirement is violated.
The above results show that the decomposed model can achieve a low approximation error with a safety guarantee (0 false negative rate) under proper choices of $\lambda$. Indeed, our analysis in Sec 3.4 provides a more concrete choice of $\lambda$. Fig 3 shows how the practical approximation error varies with $\lambda$, and the results indicate that the theoretical choices obtain good approximation in all three cases. Since the theoretical value is computed from an upper bound on the residual, there is a gap between the theoretical and the practical optimal value.
4.2 Financial Dataset
The above experiment validates the feasibility of our model decomposition strategy, and shows that the analysis developed in Sec 3 can guide practical architecture design. In the following parts, we further demonstrate the broad applicability of the proposed decomposition scheme on a real-world dataset.
The Dow Jones Industrial Average (DJIA) dataset [finance] records the stock values of the 30 companies composing the index and how they have traded in the stock market during various periods of time. We consider Apple Inc.'s stock price as the ground truth and use the prices of the other 29 companies to predict it. All data are normalized to the range $[0, 1]$, with a fixed level chosen as the warning threshold. The baseline architecture is selected as a FC(29,64,128,256,1) neural network, which attains a low mean-squared loss. We truncate the number of neurons on the second-last layer to 16 to obtain the on-device model.
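One plausible way to carry out such a truncation is to rank the second-last-layer units by the magnitude of their output weights; this selection rule is our assumption, as the text does not specify how the 16 units are chosen:

```python
def select_units(output_weights, keep=16):
    """One plausible truncation rule (our assumption, not necessarily
    the paper's): keep the second-last-layer units whose output weights
    have the largest magnitude and drop the rest. Returns the sorted
    indices of the kept units."""
    order = sorted(range(len(output_weights)),
                   key=lambda i: -abs(output_weights[i]))
    return sorted(order[:keep])
```

The returned indices define the reduced on-device head; the corresponding rows and columns of the adjacent weight matrices are then sliced accordingly before fine-tuning.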
Fig 4 reports the results in the following aspects.
The on-device predictor always provides an upper approximation of the true signal, leading to an early warning before the ground truth exceeds the threshold. This naturally guarantees the safety requirement, since the false negative rate is always 0 in this case.
When the on-device signal exceeds the threshold, the negative corrector is activated and the server provides more accurate predictions. In Fig 4, this combined approximation is almost identical to the ground truth; hence the whole system obtains a more accurate approximation, and the extra false positive predictions are eliminated by the remote corrector.
Through local monitoring and local analysis, the communication cost is reduced by a factor of 10.
In addition to strictly truncating the model based on Prop 2, Prop 1 also indicates that a simple neural network can act as an on-device upper approximator. In the appendix, we use a FC(29,10,1) network as the local monitoring tool, so that the on-device model size is compressed more significantly. The drawback is that we have to manually select a larger $\lambda$ and incur slightly more false positive cases (+2%).
To summarize, the model decomposition scheme allows us to use collaborative inference instead of purely on-device or on-server predictions. The approximation error is shown to be no worse than that of the original complex model, in both theory and practice. In particular, the safety regime proposed in this paper distinguishes our work from previous studies. For the financial dataset, this upper approximation allows users to buy the stock in advance. For other scenarios like health monitoring, it would allow us to predict the disease in advance, which can be more vital than merely providing an accurate approximation.
In this paper, we introduce a model decomposition scheme to realize efficient and safe inference for remote monitoring tasks on edge devices. The key idea is the combination of a simple local monitoring surrogate and a complex negative corrector on the server side. We demonstrate through experiments that, by following our theoretical analysis, one can greatly decrease the model complexity required for safe monitoring, thereby increasing the applicability of deep learning models in resource-constrained and safety-sensitive applications such as remote monitoring.