## 1 Introduction

The increased computing power of mobile devices and the growing demand for real-time sensor data analytics have created a trend of mobile-centric artificial intelligence (AI) [lasagna, on-device-inference-survey, edge-ai, in-edge-ai]. It is estimated that over 80% of enterprise IoT projects will incorporate AI by 2022. On-device inference of computer vision models brings increasingly rich real-time AR applications to mobile devices [mobile-ar]. A judicious combination of on-device and edge computing can analyze videos taken by drones in real time [drone-video]. The resource efficiency of model inference is critical for AI applications, especially for resource-limited mobile devices and latency-sensitive tasks. However, many AI models with state-of-the-art accuracy [sota-seg, sota-pose, sota-nlp] are too computationally intensive to perform high-throughput inference, even when they are offloaded to edge or cloud servers [elf].

For resource-efficient inference, one direct and popular way is to eliminate the redundancy of the deep model itself via acceleration and compression techniques [yolov3-tiny, light-openpose, mobile-bert, mobilenet-v2, mnasnet, mcdnn, automc]. In this work, we follow another line of approaches [ff, reducto, foggycache, potluck, glimpse, noscope] that attempt to filter the redundancy in the input data. Fig. 1 shows four examples of input redundancy in mobile-centric AI applications. We call this line of approaches input filtering and classify them into two categories: SKIP and REUSE. (1) SKIP methods [ff, noscope] aim to filter input data that would yield useless inference results, e.g., images without faces for a face detector (Fig. 1(a)) and audio clips without a valid command for a speech recognizer (Fig. 1(b)). FilterForward [ff] trains a binary classifier and sets a threshold on classification confidence to filter input images. (2) REUSE methods [foggycache, potluck] attempt to filter inputs whose results can be taken from previous inference results, e.g., motion signals of the same action (Fig. 1(c)) and video frames with the same vehicle count (Fig. 1(d)). FoggyCache [foggycache] maintains a cache of feature embeddings and inference results of previous inputs and searches the cache for reusable results for newly arrived data.

Input filtering usually works as a necessary prelude to inference for under-resourced mobile systems. Moreover, compared with model optimizations, input filtering provides more flexible trade-offs between accuracy and efficiency, e.g., FilterForward can adjust the threshold in SKIP and FoggyCache can adjust the cache size in REUSE. Although prior efforts have designed effective input filters for a range of applications, two important and challenging questions remain unanswered:

1. Theoretical filterability analysis for the guidance of applying input filtering to mobile-centric inference: Not all inference workloads can be optimized by input filtering. Sometimes, to achieve the required accuracy, a SKIP/REUSE filter is more costly than the original inference. Characterizing the conditions under which the filter has to cost more to be accurate is thus essential to input filtering. Previous efforts study the input filtering problem from an application-oriented perspective. They start from the observation of redundancy and propose bespoke input filtering solutions without further analyzing the relation between their inference workloads and input filters. Although they delivered accurate and lightweight input filters for specific workloads, without theoretical guidance and explanation the trial-and-error process of designing input filters for other workloads remains cumbersome and may fail, especially for resource-scarce mobile systems.

2. Robust feature discriminability for diverse tasks and modalities in mobile-centric inference: A discriminative feature representation [dis-feat] is critical to filtering performance, since it directly determines the accuracy of making SKIP decisions and finding REUSABLE results. Recent work [reducto] shows that the discriminability of low-level features differs across workloads, e.g., the area feature works better for counting while the edge feature works better for detection. Most existing filtering methods leverage handcrafted features [foggycache, potluck, reducto] or pre-trained neural networks as feature embedding [ff], and implicitly assume that these features are sufficiently discriminative for the target workloads. However, mobile applications usually have high diversity in input content and inference tasks. The dependency on pre-trained or handcrafted features means that discriminability is not guaranteed across this diversity. Our experiments (Sec. 6.2) show that, for an action classification workload, neither a SKIP method using the pre-trained feature [ff] nor a REUSE method using the handcrafted feature [foggycache] works effectively. The feature embedding should be obtained in a workload-agnostic and learnable manner, rather than tailored case by case.

To answer these questions, we first provide a generic formalization of the input filtering problem and the conditions of valid filters. Then we theoretically define filterability and analyze the filterability of the two most common types of inference workloads, namely classification and regression, by comparing the hypothesis complexity [foundationML, colt] of the inference model and its input filter. Instead of designing bespoke solutions for narrowly defined tasks, we propose the first, to the best of our knowledge, end-to-end learnable framework that unifies both SKIP and REUSE approaches [ff, reducto, foggycache]. The end-to-end learnability provides feature embedding with robust discriminability in a workload-agnostic manner, and thus significantly broadens applicability. Based on the unified framework, we design an input filtering system, named InFi, which supports both SKIP and REUSE functions. In addition to image, audio, and video inputs, InFi complements existing techniques by supporting text, sensor signal, and feature map inputs. Previous methods are typically designed for a certain deployment, e.g., inference offloading [reducto, foggycache]. InFi flexibly supports common deployments in mobile systems, including on-device inference, offloading, and model partitioning [model-partition].
In summary, our main contributions are as follows:

• We formalize the input filtering problem and provide validity conditions of a filter. We present the analysis on complexity comparisons between hypothesis families of inference workloads and input filters, which can guide and explain the application of input filtering techniques.

• We propose the first end-to-end learnable input filtering framework that unifies SKIP and REUSE methods. Our framework covers most existing methods and surpasses them in feature embedding with robust discriminability, thus supporting more input modalities and inference tasks.

• We design and implement an input filtering system InFi. Comprehensive evaluations on workloads with 6 input modalities, 12 inference tasks, and 3 types of mobile-centric deployments show that InFi has wider applicability and outperforms strong baselines in accuracy and efficiency. For a video analytics application on a mobile platform (NVIDIA JETSON TX2), InFi can achieve up to 8.5× throughput and save 95% bandwidth compared with the naive vehicle counting workload, while keeping over 90% accuracy.

## 2 Input Filtering

This section formalizes the input filtering problem and provides the conditions of a “valid” input filter for resource-efficient mobile-centric inference.

### 2.1 Problem Definition

An input filtering problem determines which inputs are redundant and should be filtered for a given inference model. Its definition is therefore based on the target inference model. Let $\mathcal{X}$ denote the input space and $\mathcal{Y}$ the label space of the target model. Define $c: \mathcal{X} \to \mathcal{Y}$, named the target concept [pac], which provides the ground-truth label for each input. Training a target model then means searching for a function $h$ from a hypothesis family $\mathcal{H}$ [pac] using a set of training samples $S = \{(x_i, y_i)\}_{i=1}^{m}$, where the $x_i$ are sampled independently from $\mathcal{X}$ with an identical distribution $\mathcal{D}$ and $y_i = c(x_i)$. Using the above notations, we define the learning problem of the target inference model by $(\mathcal{X}, \mathcal{Y}, c, \mathcal{H}, \mathcal{D}, S)$. Step 0 in Fig. 2 shows the original inference workflow of a trained model $h$, which takes an input $x$ from $\mathcal{X}$ and returns an inference result $h(x)$.

Next, given a trained inference model $h$, its redundancy measurement function can be defined as:

###### Definition 2.1 (Redundancy Measurement).

A redundancy measurement of a model $h$ is a function $f_h$ that takes only the output of $h$ as input and returns a score that indicates whether the inference computation is redundant.

Such measurements are common in practice. For example, based on the output of a face detector the inference computation that returns no detected face is redundant and can be skipped, and we can set the score ; Otherwise, . Formally, , where is the output set of detected faces, is the indicator function. For REUSE cases, if the inference result of an action classifier on a new query is the same as previously cached, the computation is redundant, and we can define . Note that, this definition of redundancy measurement does not depend on ground-truth labels, since our focus is not the accuracy but to optimize the resource efficiency of a deployment-ready target model with trusted accuracy by eliminating its redundant inference. Step 1 in Fig. 2 shows how redundancy measurement works.

Given the inference workload and its redundancy measurement $f_h$, as Step 2 in Fig. 2, learning an input filter is defined as searching for a function $g$ from a hypothesis family $\mathcal{G}$ using a set of training samples $S' = \{(x'_i, z_i)\}$, where the $x'_i$ are sampled independently with a distribution $\mathcal{D}'$ and $z_i = f_h(h(x'_i))$. This learning problem is denoted by $(\mathcal{X}, \mathcal{Z}, f_h \circ h, \mathcal{G}, \mathcal{D}', S')$, where $\mathcal{Z}$ is the space of redundancy scores, i.e., $g$'s target concept is the composite function of $f_h$ and $h$.

Inference with an input filter. Once an input filter is trained, the inference workflow changes from Step 0 to Step 3 in Fig. 2. The input filter $g$ becomes the entrance of the workload and predicts the redundancy score $g(x)$ of each input $x$. If the input is not redundant, the inference model is executed on it directly.
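The workflow with an input filter can be sketched as a thin wrapper around the model. This is a minimal illustration, not the authors' implementation: the function name `filtered_inference`, the threshold of 0.5, the score convention (a higher score means less redundant, following the SKIP interpretation in Sec. 4.1), and the toy filter and model are all assumptions made for the example.

```python
from typing import Any, Callable, Sequence


def filtered_inference(
    x: Any,
    input_filter: Callable[[Any], float],
    model: Callable[[Any], Any],
    threshold: float = 0.5,
    none_result: Any = None,
) -> Any:
    """Run the cheap filter first and the expensive model only if needed."""
    # Assumed convention: the score is the predicted probability that the
    # input is NOT redundant, so low scores are filtered out.
    score = input_filter(x)
    if score < threshold:
        return none_result  # filtered: the model is never executed
    return model(x)


# Hypothetical stand-ins for a trained model and a trained filter.
def count_positive(xs: Sequence[float]) -> int:
    """Toy 'inference model': counts positive readings."""
    return sum(1 for v in xs if v > 0)


def toy_filter(xs: Sequence[float]) -> float:
    """Toy 'input filter': 1.0 if any reading is positive, else 0.0."""
    return 1.0 if any(v > 0 for v in xs) else 0.0


skipped = filtered_inference([0, 0, 0], toy_filter, count_positive, none_result=0)
executed = filtered_inference([3, 0, 5], toy_filter, count_positive, none_result=0)
```

In this toy setting the filtered result (a count of zero) coincides with what the model would have returned, which is the situation the redundancy measurement is meant to capture.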

### 2.2 Validity Conditions

After defining an input filter, we now give the conditions that a “valid” input filter needs to meet for resource-efficient mobile inference. The input filter is designed to balance the resource and accuracy: filtering more inputs can save more resources, but it also brings a higher risk of incorrect inference results.

Inference accuracy. With an input filter, the inference result for an input is returned either by executing the inference model $h$ or by applying the input filter $g$. Following previous work [foggycache, reducto, ff], the correctness of a result refers to its consistency with the exact inference result of $h$, rather than with the ground-truth label. An input filter's inference accuracy, denoted by $Acc$, is defined as the ratio of correct results obtained by the inference workload with the filter.

Filtering rate. The filtering rate, denoted by $r$, is defined as the ratio of filtered inputs (i.e., the ratio of results obtained by applying $g$), which is also an important performance metric considered in previous work [ff, reducto, foggycache].

Overall cost. The overhead of an inference workload with an input filter needs to take $g$, $h$, and $r$ into consideration. Let $C(\cdot)$ denote the cost of a certain function. For the cost of computation (e.g., runtime), the average cost per input changes from $C(h)$ to $C(g) + (1 - r)\,C(h)$, since the filter runs on every input while the model runs only on the unfiltered fraction. The communication cost (e.g., bandwidth) depends on the deployment of the mobile-centric inference workload. On-device inference does not involve communication, while the overall bandwidth cost of offloading [ff, reducto] and model partitioning [model-partition] deployments becomes the original cost multiplied by $(1 - r)$.

Based on the above metrics, we define an input filter as "valid" if it satisfies two conditions: 1) Accurate enough: $Acc \geq T$, where $T$ is the threshold of acceptable inference accuracy. 2) Reduced overhead: the overall cost with an input filter is lower. If we aim to reduce the computation cost, we need $C(g) + (1 - r)\,C(h) < C(h)$, i.e., $C(g) < r\,C(h)$; if we aim to reduce the communication cost, we only need $r > 0$.
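The two validity conditions can be checked numerically. The sketch below is one reading of the conditions described above (the filter runs on every input, the model only on unfiltered inputs); the function names, argument names, and example numbers are hypothetical and do not come from the evaluation.

```python
def average_cost(filter_cost: float, model_cost: float, filtering_rate: float) -> float:
    """Average per-input computation cost when a filter guards the model."""
    # The filter always runs; the model runs on the (1 - r) unfiltered share.
    return filter_cost + (1.0 - filtering_rate) * model_cost


def is_valid_filter(
    accuracy: float,
    accuracy_threshold: float,
    filtering_rate: float,
    filter_cost: float,
    model_cost: float,
    goal: str = "computation",
) -> bool:
    """Check the 'accurate enough' and 'reduced overhead' conditions."""
    if accuracy < accuracy_threshold:
        return False
    if goal == "computation":
        return average_cost(filter_cost, model_cost, filtering_rate) < model_cost
    if goal == "communication":
        # Bandwidth scales with (1 - r), so any positive filtering rate helps.
        return filtering_rate > 0.0
    raise ValueError(f"unknown goal: {goal}")


# Hypothetical numbers: a 2 ms filter in front of a 50 ms model.
good = is_valid_filter(0.93, 0.90, filtering_rate=0.60, filter_cost=2.0, model_cost=50.0)
bad = is_valid_filter(0.93, 0.90, filtering_rate=0.02, filter_cost=2.0, model_cost=50.0)
```

With a 60% filtering rate the average cost drops well below the cost of the model alone, whereas at a 2% filtering rate the filter costs more than it saves, which is the situation the computation condition rules out.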

## 3 Filterability Analysis

As mentioned in Sec. 1, not all inference workloads can be optimized by input filtering techniques. Given an inference workload in a mobile-centric AI application, is there a valid input filter? To answer this question, based on our formalization of the input filtering problem, we first define the filterability of an inference workload. Then we analyze filterability in three typical inference cases in SKIP settings, and discuss uncovered cases.

### 3.1 Definition of Filterability

Given the learning problem of an inference model and the learning problem of its input filter, we make the following assumptions to simplify the analysis: (1) $\mathcal{D}' = \mathcal{D}$, i.e., the training samples of both problems follow the identical distribution; (2) $x'_i = x_i$ for all $i$, i.e., the two learning problems share the same inputs in their training samples. They are, however, supervised under different labels: the inference model is supervised by the target concept $c$, while the input filter is supervised by the composite $f_h \circ h$ of the trained model $h$ and its redundancy measurement $f_h$. Our intuitive idea for filterability is that, if an inference workload is filterable, the learning problem of its input filter should have lower complexity than the learning problem of its inference model. Formally, we define filterability as follows:

###### Definition 3.1 (Filterability).

Let denote the complexity measurement of a hypothesis family. We say that the inference workload is filterable, if , where and .

Since the hypothesis family cannot be determined based only on input and output spaces, we use the family of the input filter’s target concept as .

Now we can characterize the theoretically achievable accuracy and overhead of the input filter for a given inference model by leveraging computational learning theory [foundationML]. It has been proven that the more complex the hypothesis family is, the worse the bounds on the generalization error. On the other hand, the hypothesis complexity of neural networks is positively correlated with the number of parameters. For example, let $W$ denote the number of weights and $L$ the number of layers in a deep neural network. The VC-dimension [vcdim] (a measurement of the hypothesis complexity) is $O(WL \log W)$ [vcd-linear-nn]. For the same layer structure, more parameters mean higher inference overhead. The generalization error bound and the number of parameters correspond to the accuracy and efficiency metrics in the validity conditions (Sec. 2.2), respectively, although they are not strict quantifications. Therefore, if an inference workload is filterable, i.e., its input filter has lower hypothesis complexity, we can be confident of obtaining a valid filter with sufficiently high accuracy and lower overhead than the inference model. Next, we analyze the complexities of the hypothesis families of the inference workload and its input filter in different cases.

### 3.2 Low-Confidence Classification as Redundancy

Considering an inference workload, where the inference model is a binary classifier that returns the classification confidence, and the redundancy measurement regards the classification result with confidence lower than a threshold as redundant, i.e., . Confidence-based classification is very common in mobile AI applications, such as speaker verification. We adopt the empirical Rademacher complexity [colt], denoted by , as the complexity measurement, which derives the following generalization bounds [foundationML]:

###### Theorem 1 (Rademacher complexity bounds).

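Let $\mathcal{H}$ be a family of hypotheses taking values in $\{-1, +1\}$. Then for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in \mathcal{H}$:

$$R(h) \leq \hat{R}_S(h) + \hat{\mathfrak{R}}_S(\mathcal{H}) + 3\sqrt{\frac{\log(2/\delta)}{2m}} \tag{1}$$

where $\hat{R}_S(h)$ and $R(h)$ denote the empirical and generalization errors, $\hat{\mathfrak{R}}_S(\mathcal{H})$ is the empirical Rademacher complexity of $\mathcal{H}$, and $m$ is the number of training samples.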

This theorem shows that the higher a hypothesis family’s empirical Rademacher complexity, the worse the bounds of its generalization error. The classification confidence-based redundancy measurement creates two hyperplanes parallel to

: points between them are considered redundant, and points outside them are considered not redundant. Thus, the hypothesis family of the input filter’s target concept has the form: , where and . Then we have proven the following lemma, which shows that the discussed inference workload is not filterable.

###### Lemma 2.

Let be a family of binary classifiers taking values in . For where :

(2) |

###### Proof.

By definition, and , where Rademacher variables . By fixing ,

where we used the fact that . ∎

Multi-class classifiers can be treated as a set of confidence scoring functions, one for each class. The above lemma can also be applied to derive that multi-class classifiers using such a confidence-based redundancy measurement are not filterable either.

### 3.3 Class Subset as Redundancy

Considering the inference model as a multi-class mono-label classifier and . Then its hypothesis family has the form: , where returns the probability of the -th class. The redundancy measurement checks whether the predicted class belongs to a specific subset, i.e., , where . It is common in mobile applications to select only a subset of labels for use. For example, when deploying a pre-trained common object detector [mscoco] on a drone for traffic monitoring, we only care about the labels of vehicles and pedestrians, while considering other labels like animals and trees as redundancy. With the class subset-based redundancy measurement, the hypothesis family of the input filter’s target concept has the form: . We have proven the following lemma, which shows that the discussed inference workload is filterable:

###### Lemma 3.

Let be hypothesis sets in , , and let . For , where :

(3) |

###### Proof.

For any :

∎

The equation holds only if the max-value scoring function is in the selected subset for all , which means that without loss of inference accuracy, the optimal filterable ratio in the data is 0. Except in this extreme case, we can think that the complexity of learning the input filter is strictly lower.

### 3.4 Regression Bound as Redundancy

Considering a bounded regression model , whose outputs are bounded by that (recall that is the target concept) for all . The redundancy measurement checks whether the returned value is larger than a threshold, i.e., . As an example, face authentication on mobile devices usually requires the coordinates of the detected face to be within the specified range. Then learning the target concept of input filter becomes learning a regression model whose outputs are bounded by , where . We also adopt the empirical Rademacher complexity and have the following theorem [foundationML]:

###### Theorem 4.

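Let $p \geq 1$ and $\mathcal{H}_p = \{x \mapsto |h(x) - c(x)|^p : h \in \mathcal{H}\}$. Assume that $|h(x) - c(x)| \leq M$ for all $x \in \mathcal{X}$ and $h \in \mathcal{H}$. Then the following inequality holds: $\hat{\mathfrak{R}}_S(\mathcal{H}_p) \leq p M^{p-1}\, \hat{\mathfrak{R}}_S(\mathcal{H})$.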

Since , this theorem shows that the upper bound of is tighter than the upper bound of . So we can be confident that the bounded regression inference workload discussed is filterable.

## 4 Framework

In this section, we first propose a novel input filtering framework that unifies SKIP and REUSE approaches. Then we discuss how existing approaches are covered by our framework and what their limitations are. Finally, we present the key design principle, end-to-end learnability, and the advantages it brings.

### 4.1 SKIP as REUSE

We unify SKIP and REUSE approaches based on the idea that:

SKIP equals REUSE of the NONE output of the inference model $h$.

Suppose we have an all-zero input $x_0$ whose inference result can be interpreted as NONE. Given a new input $x$, if it is similar to $x_0$ in the feature space, we can REUSE the cached NONE result, i.e., we SKIP the inference computation. The key to reuse is to measure the semantic similarity between the current input and previously cached ones. However, it is difficult to measure semantic similarity accurately from the raw input. As Step 1 in Fig. 3 illustrates, our framework first computes the feature embedding of each raw input. Taking a pair of inputs $(x_1, x_2)$, our framework applies a difference function $d$ to their embeddings $(e_1, e_2)$ and feeds the result into a classifier that predicts a single scalar $\hat{z}$. Under this framework, for SKIP, we fix $x_2$ as the all-zero input $x_0$; the process then degenerates to a binary classification task that takes $x_1$ as input and returns the prediction $\hat{z}$. In this way, our framework unifies SKIP and REUSE approaches, which differ only in the interpretation of the value $\hat{z}$. For REUSE, we interpret $\hat{z}$ as the distance between two inputs. For SKIP, we interpret $\hat{z}$ as the probability that the input $x_1$ is not redundant.
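The "SKIP as REUSE" idea can be illustrated with a small pure-Python sketch. It assumes the absolute difference as the difference function (the choice named in Sec. 5.2) and a single sigmoid unit as the classifier; the identity embedding, the weights, and all function names are placeholders chosen for the example rather than the trained networks of the paper.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def abs_difference(e1: Sequence[float], e2: Sequence[float]) -> Vector:
    """Element-wise absolute difference between two embeddings."""
    return [abs(a - b) for a, b in zip(e1, e2)]


def make_pair_filter(
    embed: Callable[[Sequence[float]], Vector],
    weights: Sequence[float],
    bias: float,
) -> Callable[[Sequence[float], Sequence[float]], float]:
    """Build a filter that maps a pair of inputs to one scalar in (0, 1)."""

    def predict(x1: Sequence[float], x2: Sequence[float]) -> float:
        diff = abs_difference(embed(x1), embed(x2))
        return sigmoid(sum(w * d for w, d in zip(weights, diff)) + bias)

    return predict


def skip_score(predict, x: Sequence[float]) -> float:
    """SKIP is the pairwise filter with the second input fixed to all zeros."""
    zero_input = [0.0] * len(x)
    return predict(x, zero_input)


# Placeholder embedding and parameters (assumptions, not learned values).
identity_embed = lambda x: list(x)
predict = make_pair_filter(identity_embed, weights=[1.0, 1.0], bias=-3.0)

reuse_distance = predict([1.0, 2.0], [1.0, 2.0])  # identical inputs: small value
not_redundant = skip_score(predict, [3.0, 4.0])   # far from all-zero: large value
```

The same `predict` function serves both modes: called on two real inputs its output is read as a distance, and called against the all-zero reference it is read as the probability that the input is not redundant.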

### 4.2 Inference with an Input Filter

For the inference phase, as shown in Step 2 in Fig. 3, SKIP and REUSE filters differ only in the inputs of the difference function $d$. (1) SKIP: Inference with a SKIP filter is the same as serving a binary classifier. We can set a threshold on the predicted redundancy score to determine whether to skip. (2) REUSE: Inference with a REUSE filter needs to maintain a key-value table, where a key is a feature embedding and its value is the corresponding inference result. For an arriving input $x$, the trained feature embedding network returns its embedding $e$, and the distances between $e$ and the cached keys are computed by the difference function $d$ and the trained classifier. Then we can leverage classification algorithms, e.g., KNN, to obtain the reusable cached results.

### 4.3 Sub-Instance Approaches

Here we explain how our framework covers three state-of-the-art input filtering methods [ff, reducto, foggycache] that will be used for comparison in our evaluations.

Sub-instance 1: FilterForward (FF) [ff] is a SKIP method for image input. FF uses a pre-trained MobileNet’s intermediate output as the feature embedding. Then it trains a “micro-classifier” that consists of convolution blocks to make the binary decision for filtering.

Sub-instance 2: FoggyCache (FC) [foggycache] is a REUSE method for image and audio input. FC uses low-level features (SIFT for image, MFCC for audio) and applies locality-sensitive hashing (LSH) for embedding. Then FC uses L2 norm as the difference function and applies KNN to get the reusable inference results from previously cached ones.

Sub-instance 3: Reducto [reducto] is a variant of SKIP method for video input. It measures low-level feature (pixel, edge, corner, area) difference between successive frames. If they are similar enough, Reducto skips the current frame and returns the latest result. Formally, let be the current frame and be the previous frame. Reducto defines , where are low-level features of . It uses a threshold function as the classifier, i.e., .
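The frame-differencing scheme described for Reducto can be sketched as follows. This is a simplified illustration, not Reducto's actual code: the mean absolute pixel difference stands in for its low-level features, frames are plain nested lists, and `filter_stream`, `pixel_feature_diff`, and the threshold value are names and numbers assumed for the example.

```python
from typing import Any, Callable, List, Optional

Frame = List[List[float]]


def pixel_feature_diff(cur: Frame, prev: Frame) -> float:
    """Mean absolute pixel difference between two equally sized frames."""
    total, count = 0.0, 0
    for row_c, row_p in zip(cur, prev):
        for c, p in zip(row_c, row_p):
            total += abs(c - p)
            count += 1
    return total / count if count else 0.0


def filter_stream(
    frames: List[Frame],
    model: Callable[[Frame], Any],
    threshold: float,
) -> List[Any]:
    """Skip inference on frames that barely differ from their predecessor."""
    results: List[Any] = []
    prev_frame: Optional[Frame] = None
    last_result: Any = None
    for frame in frames:
        similar = (
            prev_frame is not None
            and pixel_feature_diff(frame, prev_frame) < threshold
        )
        if not similar:
            last_result = model(frame)  # run the expensive model
        results.append(last_result)     # otherwise return the latest result
        prev_frame = frame              # compare successive frames
    return results


# Toy model (assumption): the "inference result" is the sum of pixel values.
toy_model = lambda f: sum(sum(row) for row in f)
stream = [[[0.0, 0.0]], [[0.0, 0.01]], [[1.0, 1.0]]]
outputs = filter_stream(stream, toy_model, threshold=0.1)
```

Because the features and the threshold function are fixed, the quality of such a filter depends entirely on whether the chosen feature happens to track changes in the inference result.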

### 4.4 End-to-end Learnability

To obtain features with robust discriminability for diverse data modalities and inference tasks in mobile applications, a key design principle of our framework is end-to-end learnability. An end-to-end learning system casts complex processing components into coherent connections in deep neural networks [e2e-limit] and optimizes itself by applying gradient-based back-propagation algorithms throughout the networks. Deep end-to-end models have shown state-of-the-art performance on various tasks including autonomous driving [auto-drive] and speech recognition [deepspeech2]. As mentioned above, a main component of our unified framework is to measure the semantic similarity between two inputs. To make our framework end-to-end learnable, we leverage the metric learning paradigm, whose goal is to learn a task-specific distance function on two objects. The metric learning paradigm turns the fixed difference function (e.g., Euclidean distance and L2 norm) used by existing methods into an end-to-end learnable network. Within the metric learning paradigm, we adopt the Siamese network structure [siamese] for feature embedding to support two inputs and flexible input modalities. A Siamese network uses the same weights while working on two different inputs to compute comparable output vectors, and has been successfully applied in face verification [face-verification], pedestrian tracking [pedestrian-track], etc. We can flexibly implement the Siamese feature embedding by incorporating different neural network blocks to learn modality-specific features in an end-to-end manner, instead of tailoring handcrafted or pre-trained feature modules. Our experimental results show that our end-to-end learned features have robust discriminability for diverse inference workloads in mobile-centric AI applications.

## 5 Design of InFi

Based on our input filtering framework, in this section, we present the concrete design of InFi (INput FIlter), which supports both SKIP and REUSE functions, named InFi-Skip and InFi-Reuse. The design of InFi has four key components: feature embedding, classifier, training mechanism, and inference algorithm. We also discuss diverse deployments of InFi in AI applications on mobile, edge, and cloud devices.

### 5.1 Feature Networks for Diverse Input Modalities in Mobile-Centric AI

InFi supports filtering inference workloads with six typical input modalities in mobile applications: text, image, video, audio, sensor signal, and feature map. We develop a collection of modality-specific feature networks as building blocks for learning feature embedding. Our major consideration in designing these feature networks is resource efficiency on mobile devices.

Text modality (). Text is tokenized into a sequence of integers, where each integer refers to the index of a token. We adopt the word-embedding layer to map the sequence to a fixed-length vector by a transformation matrix and use a densely connected layer with a Sigmoid activation to learn the text features.

Image modality (). We use depth-wise separable convolution [separablecnn], denoted by , to learn visual features. is a parameter-efficient and computation-efficient variant of the traditional convolution which performs a depth-wise spatial convolution on each feature channel separately and a point-wise convolution mixing all output channels. Then we build residual convolution blocks [residual] as follows:

where $\mathrm{ReLU}$ denotes the rectified linear unit, $\mathrm{LN}$ denotes the layer normalization, and $\mathrm{MaxPool}$ denotes the 2D max-pooling layer. Finally, we build the image feature network with two residual blocks followed by a global max-pooling layer and a Sigmoid-activated dense layer.
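To make the parameter efficiency of depth-wise separable convolution concrete, the following sketch counts weights for a standard and a separable 2D convolution with the same kernel size and channel counts. The formulas are the standard ones (bias terms ignored); the layer sizes in the example are arbitrary and are not taken from InFi's configuration.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights of a standard 2D convolution: one k x k filter per (in, out) pair."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights of a depth-wise separable convolution.

    Depth-wise step: one k x k spatial filter per input channel.
    Point-wise step: a 1 x 1 convolution that mixes channels.
    """
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Arbitrary example sizes: a 3x3 layer mapping 64 channels to 128 channels.
standard = standard_conv_params(3, 64, 128)    # 73728 weights
separable = separable_conv_params(3, 64, 128)  # 8768 weights
reduction = standard / separable
```

For this example the separable layer needs roughly one eighth of the weights of the standard layer, which is why it suits feature networks that must stay much cheaper than the model they guard.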

Video modality (). For video modality, we need to represent not only the spatial but also the temporal features. Given a window of frames, we stack one residual block for each frame and then concatenate their resulting feature maps. Except for the first residual block, the video feature network performs the same operation as the image feature network.

Audio modality (). We consider audio inputs in the form of either a 1D waveform or a 2D spectrogram and use the same structure as image feature networks to learn features from audio.

Sensor signal and feature map modality (). Motion sensors are widely used in mobile devices and play a key role in many smart applications, e.g., the gyroscope for augmented reality [gyroscope-ar] and the accelerometer for activity analysis [ucihar]. Feature maps refer to the intermediate outputs of deep models and need to be transmitted in workloads that involve model partitioning [model-partition]. We treat these two types of input as a vector with fixed shape and use two densely connected layers to learn the feature embedding from the flattened vector.

Flexible support for input modalities. Our design provides flexible support for diverse input modalities in mobile-centric AI applications. We can easily integrate a modality-specific neural network from advanced machine learning research as the feature network block in our framework, so as to learn feature embeddings end-to-end.

### 5.2 Task-Agnostic Classifier

Each feature network , where belongs to {text, image, video, audio, vec}, takes as input and output the embedding . We add a dropout layer after the last dense layer of feature networks to reduce overfitting. Following previous design of Siamese network [siamese], we use the absolute difference as the function . Let denote the embedding outputs of two inputs . The classifier is defined as , where denotes the j-th element in the embedding vector and

is the Sigmoid function. To sum up, the input filter function

can be defined as . With a proper implementation, the modality of the input data can be detected automatically without manual configuration.

### 5.3 Multi-Task Extension

The above design is described for single-task workloads; however, it is common to run multiple AI models concurrently in real applications. We show that the design of InFi can be flexibly extended to multi-modal and multi-task inference workloads.

Multi-modality single-task. Multi-modal learning aims to learn AI models given multiple inputs with different modalities, which is receiving increasing attention in areas such as autonomous driving [multimodal-autodrive]. Our designs of modality-specific feature networks and task-agnostic classifier naturally support multi-modal extension: For each modality {text, image, video, audio, vec}, we build the corresponding feature network to learn its embedding. Then we concatenate the resulting embeddings and feed it to the classifier .

Single-modality multi-task. It is common to deploy multiple AI models to analyze the same input, e.g., detecting vehicles and classifying traffic conditions on the same video stream. For input filtering, we simply extend the length of the last dense layer in the classifier , one dimension per task. Existing work on multi-task learning [mtl] demonstrate that the cross-task representation improves learning performance. Formally, given tasks, it has been proven that the sample complexity needed [theory-mtl] is as follows:

(4) |

That is, we can save sample complexity, compared with learning a filter for each task separately. And our experimental results (Fig. 13) also show that the cross-task representation is beneficial for input filtering.

Multi-modality multi-task. In the general case where we need to filter inputs for multi-modality and multi-task workloads, we can combine the above two extensions, as shown in Fig. 4: for each input modality, we build the corresponding feature network and concatenate the resulting embeddings; on top of the concatenated embedding, we build a multi-dimensional classifier with one output dimension per task. Compared with the naive approach of deploying an independent InFi for each inference workload, our proposed extension saves computation and leverages the potential advantages of cross-task and cross-modality representations.
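The combined extension can be sketched as one shared forward pass: per-modality embeddings are concatenated and a single output layer produces one score per task. The sketch assumes the embeddings have already been computed (and, where applicable, differenced) by the feature networks; `multi_task_scores` and the weights shown are hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def multi_task_scores(
    modality_embeddings: Sequence[Sequence[float]],
    task_weights: Sequence[Sequence[float]],
    task_biases: Sequence[float],
) -> List[float]:
    """Concatenate modality embeddings and emit one sigmoid score per task."""
    # Fixed modality order is assumed, e.g. [image_embedding, audio_embedding].
    joint = [v for emb in modality_embeddings for v in emb]
    scores = []
    for weights, bias in zip(task_weights, task_biases):
        if len(weights) != len(joint):
            raise ValueError("each task needs one weight per joint feature")
        scores.append(sigmoid(sum(w * v for w, v in zip(weights, joint)) + bias))
    return scores


# Two modalities (2-d and 1-d embeddings) shared by two tasks.
scores = multi_task_scores(
    modality_embeddings=[[1.0, 0.0], [2.0]],
    task_weights=[[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    task_biases=[0.0, 0.0],
)
```

The feature networks run once per input regardless of the number of tasks, so only the final layer grows with the task count; this is the computation saving relative to independent filters.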

### 5.4 End-to-End Training

InFi-Skip and InFi-Reuse share the same model architecture but have different formats of training data. 1) Learning an InFi-Skip filter uses the same paradigm as training a binary classifier. Thus its training samples pair each input $x_i$ with its redundancy score $f_h(h(x_i))$, where $h$ is the inference model and $f_h$ its redundancy measurement, and we use the binary cross-entropy loss function. In practice, we can use the original training set of $h$ or data collected while serving $h$. Since $f_h$ depends only on the inference result, the supervision labels can be collected automatically. 2) InFi-Reuse filters are trained using the contrastive loss [contrastive-loss] with a margin parameter of one. Given a set of inputs and their discrete inference results, the redundancy measurement is defined as the distance metric between a pair of inputs. Formally, a training sample consists of a pair of inputs and their distance label . We can optimize all trainable parameters end-to-end, using standard back-propagation algorithms.

Online active update. Unlike benchmark datasets, the distribution of real-world inputs, e.g., the video streams captured by surveillance cameras, is much narrower and changes online [online-kd]. In a video-based vehicle counting application (see Sec. 6.1 for the detailed setup), we explored the shifted distribution of frames over wall-clock time. As shown in Fig. 5(a), the vehicle count varies with time. There are two distinct count peaks in the morning and evening, and the nighttime results remain stable at low values. We noticed that the captured frames switched between infrared (IR) and RGB images when the lighting condition changed. The RGB-IR technology provides day and night vision capability for cameras and is supported by popular commercial sensors. We split all frames into infrared and RGB subsets and plot their distribution over the number of detected vehicles in Fig. 5(b). We can see a clear difference in the distribution. Therefore, a vanilla offline training policy that selects initial samples (e.g., frames in the first hour) for training the input filter quickly leads to sub-optimal performance. To overcome the poor adaptability of offline training, we adopt the least confidence [active] strategy to actively select samples for update on the fly. Existing work [active-complexity] proved that the sample complexity of active learning is asymptotically smaller than that of passive learning. Specifically, we preset a period length and a sampling ratio %. Then we execute the input filter on all inputs within a period and select % samples with least confidence (). For InFi-Reuse, we treat as the confidence score. Our experimental results (Fig. 14) show that, given the same budget for the number of training samples, this active strategy significantly outperforms the offline one.

### 5.5 Inference Phase

After training an InFi filter, we integrate it into the original inference workload using Alg. 1.

InFi-Skip. We set a redundancy threshold for InFi-Skip to determine whether to skip the current input. If the input is skipped, InFi-Skip returns a NONE result, whose interpretation depends on the redundancy measurement of the specific application. For example, NONE means no face detected in face detection, zero vehicles in vehicle counting, meaningless speech in speech recognition, etc.

InFi-Reuse. To reuse previous inference results, we need to maintain a cache whose entries are key-value pairs of an input embedding and its inference result. Following the previous REUSE approach [foggycache], we adopt the K-Nearest Neighbors (KNN) algorithm to reuse cached results. However, a new input may not be similar to any cached entry, i.e., a cache miss. To handle this problem, we adopt the Homogenized KNN (H-KNN) [foggycache] algorithm, which calculates a homogeneity score of the K nearest neighbors found and sets a threshold on this score to detect a cache miss. Then we can replace entries using policies like least frequently used (LFU), denoted by replace in Alg. 1. Different from the original KNN, which typically uses the non-parametric Euclidean distance, we use the distance learned by the trained InFi-Reuse filter as the distance measurement. The H-KNN function returns the majority inference result of the K nearest neighbors of the input embedding in the cache, using the learned distance to compare embeddings, and computes the homogeneity score. We focus on taking advantage of end-to-end learnability; other subtle optimization opportunities such as cache warm-up are out of the scope of this work.
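The cache lookup described above can be illustrated with a minimal sketch. It is not the InFi or FoggyCache implementation: the names `hknn_lookup` and `euclidean` are hypothetical, the Euclidean distance only stands in for the learned distance, and the homogeneity score is assumed here to be the share of neighbors that agree with the majority result, which may differ from the definition in [foggycache].

```python
from collections import Counter
from typing import Callable, Hashable, List, Optional, Sequence, Tuple

Embedding = Sequence[float]
Entry = Tuple[Embedding, Hashable]  # (input embedding, cached inference result)


def euclidean(a: Embedding, b: Embedding) -> float:
    """Non-parametric stand-in for the learned distance (assumption)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def hknn_lookup(
    query: Embedding,
    cache: List[Entry],
    distance: Callable[[Embedding, Embedding], float] = euclidean,
    k: int = 10,
    homogeneity_threshold: float = 0.5,
) -> Tuple[Optional[Hashable], float]:
    """Return (result, score); result is None when a cache miss is detected."""
    if not cache:
        return None, 0.0
    # Take the k cached entries closest to the query embedding.
    neighbors = sorted(cache, key=lambda entry: distance(query, entry[0]))[:k]
    votes = Counter(result for _, result in neighbors)
    majority, count = votes.most_common(1)[0]
    # Assumed homogeneity score: fraction of neighbors voting for the majority.
    score = count / len(neighbors)
    if score < homogeneity_threshold:
        return None, score  # neighbors disagree too much: treat as a miss
    return majority, score
```

On a miss, the caller would run the original inference model and insert the new embedding and result into the cache, evicting an entry with a policy such as LFU when the cache is full.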

### 5.6 Mobile-Centric Deployments

Unlike existing work tailored for a specific deployment, e.g., inference offloading [ff, foggycache, reducto], InFi supports diverse mobile-centric deployments: (1) On-device: both the inference model and the input filter are deployed on one device; (2) Offloading: the input filter is deployed on one device, and the inference model is deployed on another device; (3) Model Partitioning (MP) [model-partition]: the inference model is partitioned across two devices, and the input filter is deployed with the first part. MP is a promising approach to collaboratively make use of the computing resources of mobile and edge devices [code-partition, dnn-partition] and to better protect the privacy of mobile data [partition-privacy]. For MP deployment, the filter’s input is the feature map, so existing filtering approaches [ff, foggycache, reducto] cannot be applied. Owing to its support for the feature map modality, InFi is the first input filter that can be applied to model partitioning workloads. Note that InFi is not limited to systems with a single mobile and edge node. For example, by training one filter per server, or by changing one filter’s binary classifier into a multi-category one (one bit per server), InFi-Skip can be used in the multi-tenancy context [ff].

## 6 Evaluation

### 6.1 Implementation and Configurations

We implemented InFi (https://github.com/yuanmu97/infi) in Python. We build all feature networks and classifiers with TensorFlow 2.4. Learning rate is set as 0.001, batch size is 32, and the number of training epochs is 20. In the text feature network, the output dimension of embedding layer is 32. In image, video, and audio feature networks, we use 32 and 64 convolution kernels in the two residual blocks. We use 128 units in the first dense layer in vector feature networks. The last dense layer of all feature networks has 200 units and 0.5 dropout probability.

Tab. I: Datasets, input modalities, and inference tasks.

Dataset | Modality | Inference Task |
---|---|---|
Hollywood2 | Video Clip | Action Classification (AC) |
Hollywood2 | Image | Face Detection (FD) |
Hollywood2 | Image | Pose Estimation (PE) |
Hollywood2 | Image | Gender Classification (GC) |
Hollywood2 | Audio | Speech Recognition (SR) |
Hollywood2 | Text | Named Entity Recognition (NER) |
Hollywood2 | Text | Sentiment Classification (SC) |
ESC-10 | Audio | Anomaly Detection (AD) |
UCI HAR | Motion Signal | Activity Recognition (HAR) |
MoCap | Motion Signal | User Identification (UI) |
City Traffic | Video Stream | Vehicle Counting (VC) |
City Traffic | Feature Map | Vehicle Counting (VC-MP) |

Datasets and inference models. To evaluate InFi’s wide applicability, we choose 10 inference workloads that cover six input modalities and three deployments (see Tab. I). Five datasets are used: (1) We reprocessed a standard video dataset, Hollywood2 [hw2], to create four different input modalities: video clip, image, audio and text. An action classification model [actionmodel] is deployed on the original video clips. Images are sampled from the video clips, and a face detection model [deepface], a pose estimation model [openpose] and a gender classification model [deepface] are deployed. Audio is extracted from each video clip and we deploy a speech recognition model [deepspeech2]. Text is the caption generated on sampled images by an image captioning model [caption]. A named entity recognition model (spacy) and a sentiment classification model [sentiment-model] are deployed. (2) We use the ESC-10 dataset [esc] for audio anomaly detection and deploy a transformer-based model [ast-model]. (3) We use the UCI HAR dataset [ucihar] for motion signal-based human activity recognition and deploy an LSTM-based model. (4) We use the MoCap dataset [mocap] for training a motion signal-based user identification (12 users) model, using an LSTM-based architecture, and deploy it as the inference workload. (5) We collected a video dataset, named City Traffic, from a real city-scale video analytics platform. We collected 48 hours of videos (1 FPS) from 10 cameras at road intersections and deploy YOLOv3 re-implemented with TensorFlow 2.0 to count the number of vehicles in video frames. All deployed inference models load publicly released pretrained weights. We split each dataset for training and testing by 1:1 (Hollywood2 and UCI HAR are split randomly, while City Traffic is split by time on each camera).

Devices and deployments. We use an edge server with one NVIDIA 2080Ti GPU and three mobile platforms: (1) NVIDIA JETSON TX2, (2) XIAOMI Mi 5, and (3) HUAWEI WATCH. All device-independent metrics are tested on the edge. For vehicle counting, we test three deployments: on-device, offloading, and model partitioning (see Sec. 5.6).

Baselines. We adopt three strong baselines: FilterForward (FF) [ff], Reducto [reducto], and FoggyCache (FC) [foggycache]. See Sec. 4.3 for details of the baselines. For workloads for which, to the best of our knowledge, no existing method has been presented, we tested a method dubbed Low-level that first computes a low-level embedding for inputs (MFCC for audio, Bag-of-Words for text, raw data for motion signal and feature map). Low-level then uses a K-nearest neighbors vote (K=10) for both SKIP and REUSE cases. We also deployed the YOLOv3-tiny [yolov3-tiny] model for vehicle counting and a lightweight pose estimation model [light-openpose] to compare input filtering and model compression techniques.

Tab. II: Filtering rates of SKIP methods at 90% inference accuracy.

Method | FD | PE | GC | AC | VC | AD |
---|---|---|---|---|---|---|
FF | 0.0% | 14.5% | 0.0% | 0.0% | 48.0% | / |
Reducto | / | / | / | / | 48.6% | / |
InFi-Skip | 36.1% | 18.9% | 33.1% | 56.0% | 66.5% | 75.4% |
Optimal | 64.8% | 34.4% | 71.8% | 91.2% | 77.7% | 86.8% |

Method | SR | NER | HAR | UI | SC | VC-MP |
---|---|---|---|---|---|---|
InFi-Skip | 44.1% | 26.8% | 91.2% | 72.4% | 22.5% | 70.7% |
Optimal | 59.9% | 34.4% | 91.8% | 79.8% | 63.8% | 77.7% |

Tab. III: Filtering rates of REUSE methods at 90% inference accuracy.

Method | GC | AC | HAR | SC | VC-MP | VC |
---|---|---|---|---|---|---|
FC | 66.1% | 13.2% | / | / | / | 59.4% |
InFi-Reuse | 98.8% | 32.1% | 98.3% | 43.4% | 95.0% | 91.1% |

### 6.2 Inference Accuracy vs. Filtering Rate

First, we test two device-independent metrics (inference accuracy and filtering rate) on the ten inference workloads. We adjust the confidence threshold in FF, Reducto and InFi-Skip, and the ratio of cached inputs in FC and InFi-Reuse, from 0 to 1 with 0.01 interval.

Redundancy measurements. (1) SKIP: For FD (PE), outputs with no detected face (person keypoints) are redundant. For GC (SC), outputs with classification confidence less than a threshold, CONF (0.9), are redundant. For AC, outputs that are not in a subset of classes, Sub, are redundant. For SR, outputs with the number of recognized words less than a threshold, N, are redundant. For NER, outputs without entity label “PERSON” are redundant. For HAR, outputs that are not “LAYING” are redundant. For UI, outputs that do not belong to the first 6 users are redundant. For AD, outputs that are not in {“Cry, Sneeze, Firing”} (anomaly events) are redundant. For VC and VC-MP, outputs with zero count are redundant. (2) REUSE: Experimental results show that cache miss happens rarely, so the homogeneity threshold is set as 0.5. We regard inputs that hit the cache as redundant. For the VC (-MP), since we have 86K images from each camera, a fixed cache ratio can lead to serious inefficiency in the KNN algorithm. We fix the cache size as 1000 and reinitialize the cache every 5000 frames. For other inference workloads, we set a fixed cache size according to the cache ratio.
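Because each redundancy measurement depends only on the inference output, training labels can be derived automatically from inference results. The following is a minimal sketch of such labeling rules; the function names are hypothetical, and the 0/1 convention for the REUSE distance label (0 for identical discrete results, 1 otherwise) is an assumption rather than a detail stated in the text.

```python
from typing import Hashable, Set


def skip_label_detection(num_detections: int) -> int:
    """FD / PE / VC style: redundant (1) when nothing is detected or counted."""
    return int(num_detections == 0)


def skip_label_confidence(confidence: float, conf_threshold: float = 0.9) -> int:
    """GC / SC style: redundant (1) when confidence is below CONF."""
    return int(confidence < conf_threshold)


def skip_label_class_subset(predicted_class: str, kept_classes: Set[str]) -> int:
    """AC / NER / HAR style: redundant (1) when the output is outside the subset."""
    return int(predicted_class not in kept_classes)


def reuse_distance_label(result_a: Hashable, result_b: Hashable) -> int:
    """Assumed convention: distance 0 for identical discrete results, else 1."""
    return int(result_a != result_b)
```

For instance, a frame on which the detector counts zero vehicles receives the SKIP label 1, and two frames with the same vehicle count form a REUSE pair with distance label 0.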

Overview of results. Tab. II and Tab. III summarize the results of SKIP and REUSE methods. Following related work [reducto], we report the filtering rates at 90% inference accuracy. The optimal results are computed as (1-0.9) plus the ratio of redundant inputs in the test dataset. Results show that InFi-Skip outperforms FF and Reducto on all 10 workloads with significantly higher filtering rate and wider applicability. Similarly, InFi-Reuse significantly outperforms FC on all 6 applicable workloads. InFi-Skip can filter 18.9%-91.2% of inputs and InFi-Reuse can filter 32.1%-98.8% of inputs, while keeping more than 90% inference accuracy. For all workloads, the Low-level method cannot achieve 90% inference accuracy unless no input is filtered (i.e., 0.0% filtering rate), and we omit these results in the tables.
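The reporting procedure can be sketched as a threshold sweep that keeps the highest filtering rate whose inference accuracy still meets the target. This is an illustrative sketch under stated assumptions, not the evaluation code: `sweep_thresholds` and `optimal_filtering_rate` are hypothetical names, an input is assumed to be skipped when its filter score exceeds the threshold, and accuracy is assumed to be measured against the unfiltered model, so that only wrongly skipped non-redundant inputs count as errors.

```python
from typing import Optional, Sequence, Tuple


def sweep_thresholds(
    scores: Sequence[float],
    redundant: Sequence[bool],
    target_accuracy: float = 0.9,
    steps: int = 100,
) -> Tuple[Optional[float], float, float]:
    """Return (threshold, filtering rate, accuracy) with the highest filtering
    rate that keeps accuracy at or above the target."""
    n = len(scores)
    best: Tuple[Optional[float], float, float] = (None, 0.0, 1.0)
    for i in range(steps + 1):  # thresholds 0.00, 0.01, ..., 1.00
        threshold = i / steps
        skipped = [s > threshold for s in scores]
        # A skipped input is an error only if it was not actually redundant.
        wrong = sum(1 for sk, r in zip(skipped, redundant) if sk and not r)
        filtering_rate = sum(skipped) / n
        accuracy = (n - wrong) / n
        if accuracy >= target_accuracy and filtering_rate > best[1]:
            best = (threshold, filtering_rate, accuracy)
    return best


def optimal_filtering_rate(redundant: Sequence[bool], target_accuracy: float = 0.9) -> float:
    """(1 - target accuracy) plus the ratio of redundant inputs, capped at 1."""
    ratio = sum(1 for r in redundant if r) / len(redundant)
    return min(1.0, (1.0 - target_accuracy) + ratio)
```

With a 90% target, a perfect filter can skip every redundant input and still wrongly skip up to 10% of all inputs, which is where the (1-0.9) term in the optimal rate comes from.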

Feature discriminability. By comparing FF and InFi on FD, PE, GC, and AC workloads, we evaluate the discriminability of our end-to-end learned features. As shown in Fig. 6, FF works on the pose estimation workload, but not on the face detection workload. The “Worst” case is calculated by

. The reason may be that there is a “person” label in the ImageNet dataset, so the pretrained feature embedding in FF is discriminative for determining whether there is a human pose. However, on other tasks (e.g., FD, GC and AC), the pretrained features are not discriminative and FF can only provide two extreme filtering policies: either filtering all inputs or filtering nothing, which is useless in practice. On the contrary, InFi-Skip learns feature embedding with robust discriminability and performs well on all four workloads. With over 90% inference accuracy, InFi-Skip can filter 18.9% and 36.1% of inputs for the PE and FD workloads, respectively.

Transferability. One interesting question is, how transferable is the trained filter to workloads with a looser or tighter redundancy measurement? We set the minimal number of recognized words, N, as 0 and 2 and train two InFi-Skip filters. Then we test the two filters on two test sets with different N. As shown in Fig. 7, the performance of InFi-Skip (N=2) is close to that of InFi-Skip (N=0) when tested with N=0. However, the performance of InFi-Skip (N=0) is clearly worse when tested with N=2. An intuitive explanation is that the feature learned with a looser redundancy measurement covers the one learned with a tighter redundancy measurement, while the opposite is not true.

Sensitivity to class subset size. For the class-subset redundancy measurement, we set different subset sizes to test the sensitivity. As shown in Fig. 8, for the action classification workload, setting a smaller class subset brings more redundant samples. InFi-Skip robustly provides smooth accuracy-efficiency trade-off curves in both cases and significantly outperforms FF (which provides only two extreme points).

Sensitivity to training size. We further divide the training splits into subsets of different sizes. As shown in Fig. 9, using only 10% of the samples from the training set, InFi can still achieve near-optimal performance on the HAR workload. Let R denote the ratio of training samples used for training. When achieving over 95% inference accuracy, InFi-Skip (R=1) filters 86.4% of inputs while InFi-Skip (R=0.1) still filters 81.1%. For high-accuracy reuse, the impact of training size is relatively greater. When filtering 90% of inputs, InFi-Reuse (R=1) can achieve 95.9% inference accuracy, while the accuracy of InFi-Reuse (R=0.1) decreases to 88.1%.

Sensitivity to model complexity. To explore the relationship between the complexity and performance of input filters, we trained InFi-Skip filters for the UI workload using different embedding lengths (1, 16, 32, 64, 128, 256) and numbers of dense units (1, 100, 200, 400) in the classifier. We measure the performance by the maximum filtering rate when achieving 90% inference accuracy. As shown in Fig. 9(b), except for extreme cases (e.g., a single dense or embedding unit), the filtering performance is relatively robust.

Sensitivity to K in KNN. The parameter K in KNN affects the classification accuracy. We vary K from 1 to 20 and test the REUSE filters’ performance. As shown in Fig. 11, on the GC workload, InFi-Reuse is robust to varied K parameters, while FC suffers serious performance degradation. For example, with 90% inference accuracy, FC (K=5) can filter 68.4% of inputs, while FC (K=1) can only filter 27.3%, which is slightly higher than random guessing (20%). On the contrary, InFi-Reuse (K=1,5) achieves a 94.3% filtering rate with more than 95% inference accuracy in both settings. For the AC workload, the results show that the handcrafted feature SIFT is not discriminative, and all tested K parameters lead to performance similar to random labeling. InFi-Reuse, in contrast, learns an action-related discriminative feature: it can filter 18.6% of inputs and keep more than 90% inference accuracy (K=10).

Comparisons on VC(-MP) workloads. Unlike other datasets, the video frames arrive in time order rather than randomly. For VC-MP, we partition the YOLOv3 model into a mobile-side part (the first 39 layers) and an edge-side part (the remaining layers). As shown in Fig. 12, InFi outperforms FF, Reducto, and FC, and is also the only applicable method for the VC-MP workload. With over 90% inference accuracy, InFi-Skip achieves a 66.5% filtering rate, while FF and Reducto achieve 48.0% and 48.6%, respectively; InFi-Reuse filters 31.7% more inputs than FC when K=10. The results show the superiority of end-to-end learned features over handcrafted and pre-trained ones.

Multi-task extension. In Sec. 5.3, we present how to extend InFi to multi-task workloads. We use the Hollywood2 dataset and corresponding inference tasks to evaluate the multi-task extension of InFi-Skip. First, we select three inference tasks on the image modality: FD, GC, PE. We build a single-modality (image) and 3-task InFi-Skip and evaluate its performance on each task. As shown in Fig. 12(a), the multi-task filter outperforms single-task ones on all three tasks, improving the filtering rate up to 3.3% when achieving 90% inference accuracy. Next, we select three inference tasks on different modalities: PE on images, NER on texts, SR on audios. And we build a 3-modality (image, text, audio) and 3-task InFi-Skip and evaluate it. As shown in Fig. 12(b), fusing these three multi-modality tasks into one filter results in a slight decrease in the filtering rate. But note that the overall efficiency is improved due to the shared parameters among different tasks.

Online active update. To evaluate the active strategy for online adaptation of InFi, we select the VC workload and compare three training methods: (1) Offline: selects the first 10% of frames of a day to train; (2) Periodic: selects the first 10% of frames of each hour to train and update; (3) Active: see Sec. 5.4. For a fair comparison, we set the same threshold (0.5) for the three methods. As shown in Fig. 14, our proposed active strategy significantly improves the online adaptability of InFi-Skip. The offline policy’s performance seriously degrades when the input distribution changes, mainly because the frames changed from infrared to RGB images. On average, the offline policy achieves only 56.4% inference accuracy. The periodic update policy alleviates this problem to some extent, improving the average accuracy to 87.0%, but still suffers from performance fluctuations. The active strategy’s performance only drops at the 7th time segment, as it does not see any RGB images before that. Afterwards, the active strategy effectively selects informative samples to fit the new distribution and performs accurate filtering robustly. On average, our active policy achieves 94.8% inference accuracy, which is 38.4 percentage points higher than the offline policy.
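The sample selection step of the active strategy can be sketched as follows. This is a minimal illustration, assuming the standard least-confidence criterion for a binary classifier, where the confidence of a sigmoid output p is max(p, 1-p); the exact confidence expression used by InFi is not reproduced here, and `least_confidence_select` is a hypothetical name.

```python
import math
from typing import List, Sequence


def least_confidence_select(probabilities: Sequence[float], ratio_percent: float) -> List[int]:
    """Return the indices of the ratio_percent% least-confident samples in a period."""
    # Assumed confidence of a binary prediction: distance from the 0.5 boundary,
    # expressed as max(p, 1 - p).
    confidences = [max(p, 1.0 - p) for p in probabilities]
    budget = math.ceil(len(probabilities) * ratio_percent / 100)
    order = sorted(range(len(probabilities)), key=lambda i: confidences[i])
    return sorted(order[:budget])
```

The selected inputs would then be passed to the original inference model to obtain labels and used to update the filter, so the labeling budget matches that of the offline and periodic policies.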

### 6.3 Filterability

In Sec. 3, we compare the hypothesis complexity of the inference and filter models. Let “Conf.T” denote the low-confidence classification case (Sec. 3.2), “Class Subset” denote the redundant class subset case (Sec. 3.3), and “Reg.T” denote the bounded regression case (Sec. 3.4). GC and SC belong to the “Conf.T” case, where T is 0.9. AC, NER, and HAR belong to the “Class Subset” case, where AC selects 2 action labels, NER selects the “PERSON” label, and HAR selects the “LAYING” label. FD, PE and VC(-MP) belong to the “Reg.T” case, where T is 0. SR is a sequence-to-sequence model, which does not perfectly fit any of these three cases. To compare the filterability of different cases, we compute the ratio of the achieved filtering rate to the optimal filtering rate at 90% inference accuracy. From a practical perspective, we also evaluate the overall throughput with and without InFi-Skip filters. As shown in Fig. 15, the “Conf.T” case, for which we proved that the filter’s complexity is not less than the inference model’s, achieves a clearly lower ratio (0.41 median), while the other cases, for which we proved that the filter tends to be less complex, achieve clearly higher ratios (0.71/0.78 medians). Likewise, the throughput improvement brought by InFi-Skip filters is more significant in the filterable cases than in the non-filterable ones. In the non-filterable cases, GC and SC, InFi achieves around 1.3× throughput, while in the filterable cases, it improves the throughput by up to 5.92× and achieves medians of 1.8× and 2.25× for the regression and class-subset cases, respectively. These results show that our proved filterability offers practical guidance for real applications.
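The reported median ratios can be recomputed from the InFi-Skip and Optimal rows of Tab. II. The short sketch below does this arithmetic; the grouping of workloads into cases follows the paragraph above, and the dictionary and function names are illustrative only.

```python
from statistics import median

# (InFi-Skip filtering rate, optimal filtering rate) in percent, from Tab. II.
TAB2 = {
    "FD": (36.1, 64.8), "PE": (18.9, 34.4), "GC": (33.1, 71.8),
    "AC": (56.0, 91.2), "VC": (66.5, 77.7), "NER": (26.8, 34.4),
    "HAR": (91.2, 91.8), "SC": (22.5, 63.8), "VC-MP": (70.7, 77.7),
}

# Workloads grouped by the filterability case they belong to.
CASES = {
    "Conf.T": ["GC", "SC"],
    "Class Subset": ["AC", "NER", "HAR"],
    "Reg.T": ["FD", "PE", "VC", "VC-MP"],
}


def median_ratio(case: str) -> float:
    """Median of (achieved filtering rate / optimal filtering rate) for a case."""
    return median(TAB2[w][0] / TAB2[w][1] for w in CASES[case])


for case in CASES:
    print(f"{case}: {median_ratio(case):.2f}")
```

Rounded to two decimals, the medians agree with the values quoted in the text: 0.41 for “Conf.T”, 0.78 for “Class Subset”, and 0.71 for “Reg.T”.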

### 6.4 Computation and Resource Efficiency

As we discussed in Sec. 2.2, a “valid” filter should be both accurate and lightweight. The above results have shown that InFi can filter a significant amount of inputs while keeping inference accurate. In the training phase, InFi (image modality) takes around 710 ms per batch (batch size is 32) and requires 5337 MB of GPU memory, which most commercial GPUs can provide. InFi for other input modalities requires far fewer resources, e.g., InFi (vector) takes 3 ms per batch and needs only 435 MB of memory. We test the latency and energy in the inference phase on mobile platforms. As a fair comparison, we chose the TFLite-optimized MobileNetV1, which is one of the most efficient CNNs on mobile devices. As shown in Fig. 16, on three mobile platforms, InFi with the image feature network costs only 12-25% of the runtime of MobileNetV1. The average energy costs of InFi are 14.4/79.7 mJ per frame, which are much lower than those of MobileNetV1 (410.4/803.8 mJ per frame) on the phone/smartwatch. We also implement InFi with MindSpore, and the results show that InFi’s low energy consumption and low-latency execution do not depend on the implementation framework.

On-device online update. Based on the Chaquopy library, we tested the overhead of training on a mobile phone (XIAOMI Mi 5) and a smartwatch (HUAWEI WATCH). We randomly generated images with the shape of (224, 224, 3) and set the batch size as 16. For InFi (image modality), experiments show that it takes around 20s and 50s per batch to update weights online on the phone and the watch, respectively. For NVIDIA JETSON TX2, training the input filter with the same configurations takes around 1s per batch.

### 6.5 Different Mobile-centric Deployments

Now we evaluate the overall performance of inference workloads in real systems with three ways of deployments.

Vehicle counting. First, we consider the vehicle counting workload: 1) on-device: InFi (image) and YOLOv3 model on TX2; 2) offload: InFi (image) on TX2 and YOLOv3 model on edge; 3) model partitioning (MP): first 39 layers (10 convolution blocks) of YOLOv3 and InFi (feature map) on TX2, rest of YOLOv3 on edge server. The average throughput of the YOLOv3 model on TX2 and edge is 3.2 FPS and 22.0 FPS, respectively. For MP deployment, the edge-side model serves 24.5 FPS. We report the average throughput and the bandwidth saving of using InFi-Skip and InFi-Reuse, with over 90% inference accuracy, in Tab. IV. As a fair comparison, we test the throughput of the YOLOv3-tiny [yolov3-tiny] model, a compressed version of YOLOv3. The inference accuracy of YOLOv3-tiny is only 67.9%, which does not meet the 90% target. Breaking down the overheads, InFi’s inference costs around 3 ms per frame, and the average latency of KNN is 6 ms per frame with K=10 and cache size=1000. Achieving over 90% inference accuracy, InFi-Skip improves the throughput to 9.3/55.2/39.0 FPS for on-device/offload/MP deployments, respectively. In vehicle counting workloads, there are clearly more filtering opportunities for InFi-Reuse, which improves the throughput to 27.2/77.2/46.0 FPS for these three deployments. Except for the on-device deployment, which does not involve cross-device data transmission, InFi-Skip / InFi-Reuse also save 66.5% / 91.1% and 70.7% / 95.0% bandwidth for offloading and MP workloads, respectively. Unlike YOLOv3-tiny, which trades a significant and fixed loss of accuracy for efficiency, InFi provides a flexible trade-off between the inference accuracy and overheads.
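As a rough sanity check on these numbers, the on-device throughput can be approximated with a simple serial cost model: every frame pays the filter latency, and only unfiltered frames pay the inference latency. This model is an assumption made here for illustration, not one stated in the evaluation; it ignores data transmission and the mobile-side layers of a partitioned model, so it is not expected to match the MP results. The function name is hypothetical.

```python
def filtered_throughput(base_fps: float, filtering_rate: float, filter_latency_s: float) -> float:
    """Approximate FPS when a filter runs before an inference model.

    Assumed serial cost model: each input pays the filter latency, and only
    the unfiltered fraction pays the inference latency (1 / base_fps).
    """
    per_input = filter_latency_s + (1.0 - filtering_rate) * (1.0 / base_fps)
    return 1.0 / per_input


# On-device YOLOv3 at 3.2 FPS, InFi-Skip filtering 66.5% with ~3 ms filter cost.
skip_fps = filtered_throughput(3.2, 0.665, 0.003)
# InFi-Reuse filtering 91.1% with ~3 ms filter plus ~6 ms KNN cost.
reuse_fps = filtered_throughput(3.2, 0.911, 0.009)
print(f"InFi-Skip: {skip_fps:.1f} FPS, InFi-Reuse: {reuse_fps:.1f} FPS")
```

Under this assumption, the on-device estimates round to the reported 9.3 FPS and 27.2 FPS, and the offloading estimates (about 54.9 and 76.7 FPS) fall slightly below the reported 55.2 and 77.2 FPS.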

Pose estimation. Second, we evaluate the pose estimation workload: 1) on-device: InFi (image) and OpenPose model on TX2; 2) offload: InFi (image) on TX2 and OpenPose model on edge; 3) model partitioning (MP): first 39 layers (10 convolution blocks) of OpenPose and InFi (feature map) on TX2, rest of OpenPose on edge server. We also test the throughput of the OpenPose-light [light-openpose] model, a lightweight version of OpenPose. Experimental results are shown in Tab. V. Similar to the vehicle counting workload, the lightweight model cannot achieve our target 90% inference accuracy, although its throughput increases significantly. InFi-Skip can flexibly balance the inference accuracy and throughput. For example, for the on-device deployment, the throughput improves by 1.17× (from 15.4 to 18.0 FPS) after using InFi-Skip, while the inference accuracy stays over 90%.

Tab. IV: Vehicle counting: inference accuracy and average throughput (FPS) / bandwidth saving (%).

Workload | YOLOv3 | InFi-Skip | InFi-Reuse | YOLOv3-tiny |
---|---|---|---|---|
Acc. (%) | 100 | 90.3 | 90.5 | 67.9 |
On-device | 3.2/- | 9.3/- | 27.2/- | 20.4/- |
Offloading | 22.0/- | 55.2/66.5 | 77.2/91.1 | 225.3/- |
MP | 24.5/- | 39.0/70.7 | 46.0/95.0 | 230.4/- |

Tab. V: Pose estimation: inference accuracy and average throughput (FPS) / bandwidth saving (%).

Workload | OpenPose | InFi-Skip | OpenPose-light |
---|---|---|---|
Inference Accuracy (%) | 100 | 90.1 | 76.5 |
On-device | 15.4/- | 18.0/- | 28.1/- |
Offloading | 27.7/- | 31.5/18.9 | 98.5/- |
MP | 29.2/- | 33.1/20.2 | 102.4/- |

## 7 Related Work

Frame filtering. NoScope [noscope] trains task-specific difference detectors to choose necessary frames for object queries in the video database. FilterForward [ff] leverages MobileNet and trains a binary micro-classifier on the intermediate output of a selected layer to determine whether to transmit the input image to the server with offloaded model. Reducto [reducto] performs on-device frame filtering by thresholding difference of low-level features between successive frames. Through elaborate selection for different tasks, low-level features can efficiently and accurately measure the difference.

Inference caching. Potluck [potluck] stores and shares inference results between augmented reality applications. It dynamically tunes the threshold of input similarity and manages cache based on the reuse opportunities. FoggyCache [foggycache] is more general and can be applied to both image and audio inputs. It designs adaptive LSH and homogenized KNN algorithms to address practical challenges in inference caching. Instead of caching the final inference results, DeepCache [deepcache] stores the intermediate feature maps to achieve more granular reuse. For object recognition, Glimpse [glimpse] maintains a cache of video frames on mobile devices. It uses cached results to perform on-device object tracking and sends only trigger frames to the server with offloaded recognition model.

Approaches tailored for specific pipelines. Focus [focus] is designed for querying detected objects in video databases; it uses a compressed CNN to index possible object classes at the ingest stage and reduces the query latency by clustering similar objects. Blazeit [blazeit] develops neural network-based methods to optimize approximate aggregation queries of detected objects in video databases. Focusing on object detection in video streams, Chameleon [chameleon] proposes to adaptively select a suitable pipeline configuration including the resolution and frame rate of videos, backbone neural networks for inference, etc. Elf [elf] is designed for mobile video analytics, where the input data is pre-processed by a lightweight on-device model and then offloaded in parallel to multiple servers with the same subsequent inference functionality.

Our proposed input-filtering framework unifies the frame filtering and inference caching approaches. And we complement existing work in theoretical analysis and flexible supports for more input modalities and deployments.

## 8 Conclusion

In this paper, we study the input filtering problem and provide theoretical results on complexity comparisons between the hypothesis families of inference models and their input filters. We propose the first end-to-end learnable framework that unifies both SKIP and REUSE methods and supports multiple input modalities and deployments. We design and implement an input filter system InFi based on our framework. Comprehensive evaluations confirm our proven results and show that InFi has wider applicability and outperforms strong baselines on accuracy and efficiency.

## Acknowledgments

This research was supported by the National Key R&D Program of China 2021YFB2900103, China National Natural Science Foundation with No. 61932016, No. 62132018. This work is partially sponsored by CAAI-Huawei MindSpore Open Fund and “the Fundamental Research Funds for the Central Universities” WK2150110024.