Machine learning (ML) has been increasingly utilized on mobile devices (e.g., face and voice recognition and smart keyboards). However, many ML applications running on mobile devices (such as smartphones and smart home hubs) mainly focus on ML inference, not ML training. Recently, training ML models on mobile devices has attracted attention because of concerns about data privacy, security, network bandwidth, and the availability of a public cloud for model training Eom et al. (2015); Konečnỳ et al. (2016). Training ML models on a device leverages data generated locally without uploading the data to any public domain. However, data generated on local mobile devices usually does not have labels, which causes difficulty for training ML models (especially supervised ones). For example, a user takes pictures with a smartphone; those pictures are seldom labeled and hence are difficult to use for training ML models. How to accurately and efficiently label data on a mobile device is therefore critical for the success of training ML models on the device.
Automatic labeling is a solution to the above problem. Studies of automatic labeling in the past decade have focused on data stored on servers Varma and Ré (2018b); Ratner et al. (2017); Yang et al. (2018); Haas et al. (2015). Such data usually has a fixed size, and labels are pre-determined and fixed. However, the data generated on a mobile device has different characteristics from the data located on a server. In particular, on a mobile device, data is incrementally added. For example, pictures are added to a smartphone on a daily basis, as the user takes pictures from time to time. Also, as data is dynamically generated, new labels may appear. However, most existing methods cannot recognize new labels.
Besides distinctive data characteristics, auto-labeling on mobile devices faces a challenge from hardware heterogeneity. Mobile devices are often equipped with mobile processors with rich heterogeneity for high energy efficiency and performance Wu et al. (2019); Liu et al. (2019b). For example, the Samsung S9, a mobile device we study in this paper, has two types of CPU cores and a mobile GPU. Scheduling computation becomes complicated under hardware heterogeneity, because we must decide where to run computation and how to control thread-level parallelism for short execution time and low energy consumption. Leveraging heterogeneous mobile processors to efficiently execute the auto-labeling workload is key to making auto-labeling feasible on mobile devices.
In this paper, we introduce an auto-labeling system, Flame, to address the above challenges. Flame is particularly designed for mobile devices with heterogeneous mobile processors. Flame features self-adaptiveness to handle incrementally generated data with unknown labels. In particular, Flame uses a cluster-based technique to gather and detect data that belongs to the same class but has new features. Flame then creates and assigns new labels without relying on the existing labels.
Furthermore, Flame features a hardware heterogeneity-aware runtime system. To efficiently schedule computation kernels, the runtime system profiles the performance of kernels on different computing units, based on the high predictability of the auto-labeling workload. Using the profiling results, the runtime system performs kernel scheduling based on a set of greedy heuristic policies. To efficiently run individual kernels, the runtime system chooses the optimal number of threads (i.e., thread-level concurrency control) and divides kernel computation between heterogeneous mobile processors, using a pair of analytical performance models.
We summarize our contributions as follows.
We propose a self-adaptive auto-labeling system to assign labels for data on mobile devices. To our best knowledge, this is the first auto-labeling system focusing on the data labeling problem on mobile devices.
We present a new labeling algorithm to process dynamically generated data (i.e., non-stationary data) with unknown labels.
We introduce a runtime system to efficiently leverage heterogeneous mobile processors for auto-labeling. We evaluate our system on a smartphone, and demonstrate high labeling quality and high performance with Flame.
2.1 Auto-labeling and Preliminaries
An important problem in ML is how to automatically label a relatively large quantity of unlabeled data with little labeled data Mei et al. (2007); Mao et al. (2012); Dube et al. (2019). When the user gets a new mobile device, there is usually a limited amount of labeled data. The goal of our auto-labeling system is to label gradually increasing unlabeled data. An existing solution to the traditional auto-labeling problem on servers is co-training Qiao et al. (2018). In this solution, the label of each data point can be determined with two conditionally independent labeling functions by utilizing both labeled and unlabeled data. Another existing solution on servers uses the boosting method Varma and Ré (2018b), which constructs and combines many “weak” classifiers into a “strong” one. However, for the auto-labeling problem on mobile devices, co-training has difficulty deriving precise classifiers because of the insufficiency of labeled data; the boosting method cannot work either, because it assumes a fixed number of labels and hence cannot handle a growing set of labels. In our paper, we combine the co-training and boosting methods to deal with the auto-labeling problem on mobile devices.
We define several terms used in the paper as follows. A prototype is an average or best exemplar of a category (label), which can represent the instances of an entire category. A prototype is usually denoted as a tuple, where each element in the tuple represents important patterns and characteristics of the category (the exact form is given in Section 3.2).
The heuristic function in our paper is a union of several prototypes, denoted as $h = \{p_1, p_2, \dots, p_K\}$, where $K$ is the number of prototypes contained by the heuristic function. Updating a heuristic function is done by updating its prototypes. The process of maintaining the optimal prototypes for a category is called prototype learning. In our study, prototypes are learned by optimizing a self-defined objective function. We combine prototype learning with the boosting method for auto-labeling, which generates more discriminating and robust results.
2.2 Heterogeneous Mobile Processors
The System on Chips (SoCs) in mobile devices increasingly employ heterogeneous mobile processors, such as CPUs, GPUs, DSPs, and NPUs. The Samsung S9, the mobile device we use for study, has a Qualcomm Snapdragon 845 SoC, which includes a 4-core fast CPU, a 4-core slow CPU, and an Adreno 630 mobile GPU. The fast and slow CPU cores differ in frequency, cache hierarchy, instruction scheduling, and energy efficiency. The mobile GPU is particularly efficient for processing data-intensive tasks.
3 Model Design
In this section, we describe the auto-labeling algorithm design for Flame. As shown in the left part of Figure 1, the algorithm design includes three components: a heuristic function generator to generate a number of heuristic functions for assigning labels, a self-adaptive mechanism for updating the heuristic functions and detecting whether an unknown label appears, and a labeling aggregator for combining and verifying the confidence of label assignments. We describe the three components and input/output of Flame as follows.
3.1 Input and Output Data
Input Data. The input data of Flame is a small number of data samples with labels and a large number of data samples without labels. Each data sample is defined by its primitives. In many of our labeling use cases, the primitives can be viewed as the basic features associated with the corresponding data. For instance, in the use case of labeling images, primitives can be color, size, shape, etc. In our work, we want to label the data in an automatic way, so we extract the features of the images as the primitives in our auto-labeling algorithm. Given a labeled dataset $D_L = \{(x_i, y_i)\}_{i=1}^{N_L}$, $x_i \in \mathbb{R}^d$ is the primitives of the data and $y_i$ is the associated true label, where $\mathbb{R}^d$ represents a $d$-dimensional space for data and $C$ is the total number of labels in $D_L$. The non-stationary unlabeled dataset is $D_U = \{x_j\}_{j=1}^{N_U}$, where $x_j \in \mathbb{R}^d$ and $N_U$ represents the number of unlabeled data. In our setting $N_U$ can be very large, as new data is dynamically generated.
Output Data. The output of Flame is the confidence of a label $y' \in L'$ for each data sample in the unlabeled dataset $D_U$ ($L'$ is the set of result labels, including old labels and detected new labels). Here, $|L'| \geq C$, which indicates that some new labels that are not in $D_L$ may appear in $D_U$, as new data is incrementally generated. The confidence value is calculated through an ensemble method in Flame, discussed in Section 3.4.
3.2 Heuristic Functions Generation
Existing studies on heuristic function generation for auto-labeling have demonstrated the success of using machine learning models (e.g., Decision Tree, Logistic Regression, K-Nearest Neighbor) as the heuristic functions. However, these methods can be costly for two reasons. (1) A large number of heuristic functions are generated, which brings high computation overhead during labeling, especially on mobile devices with limited computation resources; (2) each heuristic function has parameters, and updating the parameters of all heuristic functions demands a large amount of computation. Therefore, we design a heuristic function generation method to address the above problems, while being cognizant of the computational and memory-access efficiency required for Flame.
Each heuristic function works well for a part of the data in the primitive space. We expect that the heuristic functions in the system together have high data coverage: the larger the coverage, the more the labeling cost can be reduced. In Flame, we use a clustering algorithm to determine the boundary of each heuristic function, because of the flexibility of clustering. In particular, each heuristic function consists of several clusters generated through an impurity-based K-means algorithm over the initial limited number of labeled data points. A cluster is pure if it contains labeled data points from only one class (along with some unlabeled data). Once the clusters are created, the raw data points are discarded to save memory space. The discarded data points in a cluster are replaced by a prototype of the cluster. A prototype of a cluster is a tuple denoted by $p = (c, r, \bar{d}, n, F)$, where $c$ is the centroid of the cluster, $r$ represents the radius, which is the distance between the centroid and the farthest data point in the cluster, $\bar{d}$ is the mean distance between each data point and the centroid in the cluster, $n$ is the total number of data points in the cluster, and $F$ is a vector recording the number of data points belonging to different labels (referred to as frequencies in the rest of the paper). For example, $F = (n_1, n_2, \dots, n_C)$, where each element $n_c$ in $F$ is the frequency of the corresponding label $c$ assigned to the existing data. Each heuristic function is a collection of $K$ prototypes, $h = \{p_1, p_2, \dots, p_K\}$.
When building the heuristic functions using the impurity-based K-means algorithm, the objective is to minimize the dispersion and impurity of the data points contained in each prototype. Thus, the objective function for building a heuristic function is formulated as follows.
In Equation 1, $\mathcal{L}_{disp}$ is the loss value caused by the dispersion of data points contained in each prototype. $\mathcal{L}_{disp}$ is calculated as $\mathcal{L}_{disp} = \sum_{k=1}^{K} \sum_{x \in S_k} \lVert x - c_k \rVert^2$, where $K$ is the total number of prototypes in $h$ and $S_k$ is the set of all data points in the prototype $p_k$. In Equation 1, $\mathcal{L}_{imp}$ is the loss value caused by the impurity of data points in each prototype, and $\lambda$ is a hyper-parameter controlling the importance of $\mathcal{L}_{imp}$. $\mathcal{L}_{imp}$ is calculated as follows: $\mathcal{L}_{imp} = \sum_{k=1}^{K} ADC_k \times Ent_k$, where $ADC_k$ quantifies labeling diversity in a prototype and $Ent_k$ is the entropy value of data points in the prototype. A small $\mathcal{L}_{imp}$ leads to small impurity. $ADC_k$ is calculated based on each data point’s labeling dissimilarity in the prototype (particularly, $ADC_k = \frac{1}{n_k}\sum_{x \in S_k} LD(x)$), where $LD(x)$ of a data point $x$ in the prototype with the label $y$ is the total number of labeled points in the prototype that have labels other than $y$. $LD(x) = 0$ if the data point is unlabeled; otherwise, when the data point is labeled and its label is $y$, $LD(x) = \lvert L_k \rvert - \lvert L_k^y \rvert$, where $L_k$ and $L_k^y$ are the sets of all labeled data points and of labeled data points with the label $y$ in the prototype, respectively. Thus, $ADC_k$ can be written as $ADC_k = \frac{1}{n_k}\sum_{x \in S_k} (\lvert L_k \rvert - \lvert L_k^{y_x} \rvert)$. Furthermore, $Ent_k = -\sum_{c} P_k(c) \log P_k(c)$, where $P_k(c) = \lvert L_k^c \rvert / \lvert L_k \rvert$ is the probability of labeling $c$.
Based on the above discussion, the loss function can be re-formulated as follows.
Minimizing $\mathcal{L}$ yields the optimal prototypes for each heuristic function. Each prototype corresponds to a “hypersphere” in the primitive space with a centroid and radius. The coverage of a heuristic function $h$ is the union of the hyperspheres encompassed by all prototypes in $h$. The coverage boundary of the heuristic function pool $H$ is the union of the coverage of all heuristic functions $h \in H$. If a data point $x$ is inside the coverage boundary of $H$, it is labeled using each $h$ as follows. Let $p^*$ be the prototype of $h$ whose centroid is the closest to $x$. In the prototype $p^*$, assume $n_{c^*}$ is the highest frequency value in the frequency vector $F$; then the corresponding label $c^*$ is assigned to the unlabeled data point $x$. Each $h$ maintains an assigned label and this label’s confidence value for the data point $x$. Finally, the label for the data point $x$ is determined by taking the majority vote among all heuristic functions.
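The nearest-prototype labeling and majority vote described above can be sketched as follows. The class and function names are illustrative rather than Flame's actual API, and a prototype is reduced to the fields the labeling step needs (centroid, radius, frequency vector):

```python
import math
from collections import Counter

class Prototype:
    """Reduced prototype: centroid c, radius r, and label-frequency
    vector F from the tuple (c, r, d-bar, n, F) of Section 3.2."""
    def __init__(self, centroid, radius, freq):
        self.centroid = centroid   # cluster centroid
        self.radius = radius       # distance to the farthest member
        self.freq = freq           # {label: count} frequency vector

def dist(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def label_with_heuristic(h, x):
    """Label x with one heuristic function h (a list of prototypes):
    find the nearest prototype and return its most frequent label,
    or None when x falls outside that prototype's hypersphere."""
    p = min(h, key=lambda p: dist(p.centroid, x))
    if dist(p.centroid, x) > p.radius:
        return None                        # outside the coverage of h
    return max(p.freq, key=p.freq.get)     # label with highest frequency

def label_by_majority(H, x):
    """Majority vote among all heuristic functions in the pool H."""
    votes = [label_with_heuristic(h, x) for h in H]
    votes = [v for v in votes if v is not None]
    return Counter(votes).most_common(1)[0][0] if votes else None
```

Points that fall outside every hypersphere yield no vote; in Flame they would instead go to the buffer described in Section 3.3.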
3.3 Labeling Heuristics Self-adaptation
Many automatic labeling methods Dunnmon et al. (2019); Ratner et al. (2017); Yang et al. (2018); Varma et al. (2019) assume that the number of possible labels associated with data points is known and fixed. However, in some cases, this is not true. Data points belonging to unknown labels may appear as the dataset dynamically increases.
In Flame, if a data point $x$ is outside of the coverage boundary of $H$, $x$ is regarded as a data point with an unknown label and stored in a buffer $B$. This buffer is periodically checked to observe whether there are enough data points in the buffer with the same new label. We use a distance-based method called the q-Neighborhood Silhouette Coefficient Masud et al. (2010), abbreviated as q-NSC, to detect new labels, where $q$ is a predefined parameter. q-NSC considers both the cohesion and the separation of data points located in the primitive space, and yields a value in $[-1, 1]$. A positive q-NSC value for the data point $x$ indicates that $x$ is close to other data points in the buffer $B$, which means that these data points together may have the same potential unknown label.
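A simplified, distance-based variant of this check can be sketched as follows. The full q-NSC of Masud et al. compares cohesion among the q nearest buffer neighbors against separation from the existing classes; measuring separation against existing prototype centroids, as done here, is an assumption for illustration:

```python
import math

def euclid(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def q_nsc(x, buffer_pts, existing_centroids, q=3):
    """Simplified q-NSC: compare x's cohesion with its q nearest
    buffer neighbors against its separation from existing-class
    centroids. A positive value suggests x clusters with other
    potential new-label points; the score lies in [-1, 1]."""
    neighbors = sorted(euclid(x, p) for p in buffer_pts if p != x)[:q]
    a = sum(neighbors) / len(neighbors)                 # cohesion
    nearest_old = sorted(euclid(x, c) for c in existing_centroids)[:q]
    b = sum(nearest_old) / len(nearest_old)             # separation
    return (b - a) / max(a, b)
```

If most buffered points score positive, the buffer plausibly holds a coherent new class rather than scattered outliers.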
Because of the dynamic characteristic of the dataset and the requirement of continuously updating prototypes for existing heuristic functions, Flame must have the ability to adapt to changes over the non-stationary dataset without increasing memory footprint and computing overhead. We introduce a mechanism to incrementally incorporate new label information from data points in the buffer into the existing heuristic functions without losing discriminatory characteristics. Moreover, to limit memory usage, we set a maximum number of prototypes across all heuristic functions.
Algorithm 1 depicts the mechanism, including self-adaptation of heuristic functions and the corresponding updates on prototypes. Flame periodically checks the buffer $B$ and requests a check for potential new labels. After a new label is found, Flame removes the data points belonging to the new label from $B$. Next, Flame uses $B$ to collect new data points with potential new labels. Finally, Flame builds new prototypes for each new label detected and updates the parameters of the heuristic functions. Each prototype is a tuple that occupies limited storage space, and Flame also constrains the maximum number of prototypes in each heuristic function; therefore, Flame limits memory usage and computation cost.
3.4 Labeling Confidence Aggregator
After a heuristic function $h$ is built, Flame uses a metric to quantify its confidence for labeling any data point $x$. This metric, denoted $Conf_h(x)$, is defined as follows.
where $p^*$ is the prototype closest to the data point $x$ in the heuristic function $h$, $r$ is the radius of $p^*$, $d$ is the distance between $x$ and the centroid of $p^*$, $n_{max}$ is the highest label frequency in $p^*$, and $n_{sum}$ is the sum of all frequencies in $p^*$.
In Equation 3, if $d$ is small, which means that $x$ is close to the centroid of $p^*$, then $(1 - d/r)$ is large, which leads to a high labeling confidence. In addition, a large $n_{max}/n_{sum}$, which means high purity of the prototype $p^*$, also leads to a high labeling confidence. Hence, the metric considers the impact of both distance and purity, where purity is calculated by $n_{max}/n_{sum}$.
Flame calculates the confidence value $Conf_h(x)$ for each $h$ in $H$. These confidence values are then normalized between 0 and 1, and aggregated to calculate the overall labeling confidence over all heuristic functions, as shown in Equation 4.
In Equation 4 we have a threshold $\theta$ to decide whether the confidence is high enough. If the overall confidence is higher than $\theta$, the label is assigned; otherwise, the data point is added into the buffer $B$ to wait for further checking.
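One plausible instantiation of the per-function confidence and the thresholded aggregation is sketched below. The product form $(1 - d/r) \times$ purity is our reading of the description around Equation 3, not necessarily its exact formula; prototypes are reduced to (centroid, radius, frequency-dict) tuples, and the scores already lie in $[0, 1]$, so the normalization step is omitted:

```python
import math

def dist(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def confidence(prototypes, x):
    """Per-heuristic-function confidence: a distance term (1 - d/r)
    times the purity (n_max / n_sum) of the nearest prototype.
    A prototype here is a (centroid, radius, freq_dict) tuple."""
    c, r, freq = min(prototypes, key=lambda p: dist(p[0], x))
    distance_term = max(0.0, 1.0 - dist(c, x) / r)
    purity = max(freq.values()) / sum(freq.values())
    return distance_term * purity

def aggregate_and_decide(H, x, theta=0.7, buffer=None):
    """Average the per-function confidences; assign a label only if the
    aggregate clears theta, else defer x to the buffer for later
    new-label detection."""
    scores = [confidence(h, x) for h in H]
    overall = sum(scores) / len(scores)
    if overall >= theta:
        votes = {}
        for h in H:                         # majority vote over pool H
            c, r, freq = min(h, key=lambda p: dist(p[0], x))
            lbl = max(freq, key=freq.get)
            votes[lbl] = votes.get(lbl, 0) + 1
        return max(votes, key=votes.get), overall
    if buffer is not None:
        buffer.append(x)                    # wait for new-label detection
    return None, overall
```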
4 Runtime System Design
We describe the runtime design for Flame in this section. Figure 2 gives an overview of the runtime system. In general, the runtime system includes three techniques: hardware heterogeneity-aware kernel scheduling, concurrency control to determine the number of threads to run a kernel on homogeneous CPU cores, and kernel division on heterogeneous mobile processors. We describe them in detail as follows.
4.1 Preliminary Performance Study
Flame decomposes the major computation in the auto-labeling process into kernels. A kernel can be a frequently invoked function; a kernel can also be a computation-intensive loop. Table 1 lists the kernels in Flame. If a kernel has a parallel loop and there is no dependence between iterations of the loop, we call it a parallel kernel. Otherwise, we call it a serial kernel and run it with only one CPU thread.
| Kernel | Type | Time % | Description |
| Sample Processing | Parallel | 38.0% | Process data samples with heuristic functions |
| Detect Change | Parallel | 18.0% | Detect the occurrence of new labels |
| Test Ensemble | Parallel | 17.9% | Test the ensemble of heuristic functions |
| Label Single | Parallel | 17.7% | Label a single instance with heuristic functions |
| Warmup Processing | Serial | 6.4% | Warmup and preparation of Flame |
| Others (20 in total) | Parallel & Serial | 2.0% | Primitive math operations |
We profile the execution times of those kernels using CIFAR-10. Table 1 shows the results. We observe that the top five kernels consume more than 95% of the total execution time in the auto-labeling process. We call the top five kernels the time-consuming kernels; the other kernels are the small kernels.
The auto-labeling process using Flame can involve a number of iterative steps, typically hundreds or thousands (depending on how much data is to be labeled). In each auto-labeling step, a group of data is labeled. We refer to a group of data as a chunk. At the end of each step, there is a barrier working as a synchronization point: detection of new labels based on a chunk must finish before Flame processes the next chunk.
The execution time of some kernels highly depends on the number of existing labels: it is roughly in linear proportion to the number of existing labels. All parallel kernels in Flame are such kernels. Furthermore, for most of the serial kernels, performance is not related to the number of existing labels, because those serial kernels are used to process input and output data and to initialize Flame. We leverage the above facts in our analytical models (Equations 5-7) to optimize the execution of individual kernels.
4.2 Hardware Heterogeneity-aware Kernel Scheduling
Kernel scheduling has a big impact on kernel execution time. To quantify the performance difference between kernel scheduling choices, Figure 3 shows the execution time of four frequently used kernels (HF-A_CS, HF-B_CS, HF-C_CS and HF-D_CS) running on the fast CPU only, on fast and slow CPUs (i.e., using all CPU cores), and on GPU. In general, all kernels show performance variance. There is up to an 18% difference between running HF-B_CS on GPU and on all CPU cores.
To decide where to run a kernel, we use the following three policies. (1) The time-consuming parallel kernels use all processing units (both GPU and CPUs), because those kernels are time-consuming and on the critical path. (2) Small and serial kernels use the fast CPU, because they cannot benefit from high thread-level parallelism on GPU and can be on the critical path. (3) If all fast CPU cores are busy, we assign kernels to slow CPU cores.
The kernel scheduling can suffer from the straggler effect. We describe it as follows. During the auto-labeling process, there is a barrier at the end of each auto-labeling step. The computation of all heuristic functions must finish at the barrier, before the auto-labeling process moves on to the next step. The computation of those heuristic functions happens in parallel. If the computation of one heuristic function finishes much later than the other ones, then we may have idling processing units and hence lose system throughput. Ideally, computation of all heuristic functions should be finished at the same time.
To address the above problem, we schedule kernels based on the following algorithm. In particular, we associate each kernel with the ID of the heuristic function for which the kernel computes. When the runtime system picks up kernels to execute, the runtime follows a round-robin policy to ensure that kernels from different heuristic functions have the same opportunity to execute. Also, kernels with long execution time are scheduled to execute first, in order to shorten the critical path of execution, which is also helpful to avoid the straggler effect.
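The round-robin, longest-first policy can be sketched as follows; the kernel representation (heuristic-function ID, estimated time, name) is illustrative:

```python
from collections import defaultdict, deque

def schedule(kernels):
    """Straggler-avoiding schedule sketch: group ready kernels by the
    heuristic function they compute for, round-robin across groups so
    every heuristic function makes progress, and within each group
    dispatch the longest kernel first to shorten the critical path.
    Each kernel is a (hf_id, est_time_ms, name) tuple."""
    groups = defaultdict(list)
    for k in kernels:
        groups[k[0]].append(k)
    for g in groups.values():
        g.sort(key=lambda k: -k[1])        # longest first within a group
    order, ring = [], deque(sorted(groups))
    while ring:
        hf = ring.popleft()
        order.append(groups[hf].pop(0))
        if groups[hf]:                     # keep the group in rotation
            ring.append(hf)
    return order
```

The interleaving keeps all heuristic functions progressing toward the per-chunk barrier, so no single function's kernels pile up at the end of a step.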
4.3 Optimization of Intra-Kernel Parallelism
The execution time of a kernel is sensitive to the number of threads that run it. This is especially true for small kernels running on CPU. Figure 4 shows the execution time of running four frequently invoked small kernels (HF-A_DC, HF-B_DC, HF-C_DC and HF-D_DC) with different numbers of threads.
4.3.1 Concurrency Control
Figure 4 shows that the kernels HF-A_DC, HF-B_DC and HF-C_DC achieve the best performance using only one CPU thread, while HF-D_DC achieves the best performance using 3 threads on the fast CPU. The performance variance is caused by thread management overhead (e.g., thread spawning and binding to cores) and cache thrashing due to multi-threading.
Equation 5 calculates the performance speedup of using multiple threads to run a kernel. In Equation 5, $t_s$ is the serial execution time of processing one chunk when there is only one existing label; $n$ is the number of threads and $m$ is the number of labels; $t_c(n, m)$ is the computation time with $n$ threads and $m$ labels; $t_o$ is the thread management overhead for one thread; $\alpha$ is the proportion of the execution time in which thread-level parallelism can be employed.
In Equation 5, the numerator is the serial execution time of processing $m$ labels, where $(1-\alpha) \cdot t_s$ corresponds to the execution that is not sensitive to the number of labels, and $\alpha \cdot m \cdot t_s$ corresponds to the execution that is sensitive to the number of labels. The denominator of Equation 5 is the parallel execution time, including the thread management overhead ($n \cdot t_o$) and the computation time ($t_c(n, m)$) of the parallel kernel.
Equation 6 calculates the computation time of the parallel kernel, including the serial time ($(1-\alpha) \cdot t_s$) that is not sensitive to the number of labels and does not run in parallel, and the parallel computation time ($\alpha \cdot m \cdot t_s / n$).
| Symbol | Description |
| $t_s$ | Serial execution time of processing one chunk when there is only one existing label |
| $n$ | Number of threads |
| $t_c(n, m)$ | Computation time of using $n$ threads |
| $t_o$ | Thread management overhead for one thread |
| $m$ | Number of existing labels |
| $\alpha$ | Proportion of the execution time in which thread-level parallelism can be employed |
| $t_{fast}$ | Execution time on fast CPU |
| $t_{slow}$ | Execution time on slow CPU |
| $t_{gpu}$ | Execution time on accelerator (GPU) |
| $t_s^{fast}$ | Serial execution time on fast CPU to process one chunk when there is only one existing label |
| $t_s^{slow}$ | Serial execution time on slow CPU to process one chunk when there is only one existing label |
| $t_s^{gpu}$ | Execution time on accelerator (GPU) to process one chunk when there is only one existing label |
| $t_{copy}$ | Time for data copy between CPU and GPU |
| $w_{fast}$ | Percentage of workload assigned to fast CPU |
| $w_{slow}$ | Percentage of workload assigned to slow CPU |
| $w_{gpu}$ | Percentage of workload assigned to accelerator (GPU) |
| $n_{fast}$ | Number of threads to run on fast CPU |
| $n_{slow}$ | Number of threads to run on slow CPU |
To determine the optimal number of threads to run a parallel kernel, we use the following method. Given a kernel, $t_s$, $t_o$ and $\alpha$ are measured offline. $m$ is known at runtime. We enumerate various numbers of threads $n$ and use Equations 5 and 6 to find the number of threads that leads to the largest speedup.
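This enumeration can be sketched as follows, assuming the model of Equations 5 and 6 as reconstructed above (serial part $(1-\alpha) \cdot t_s$, label-dependent parallel part $\alpha \cdot m \cdot t_s / n$, plus per-thread overhead $n \cdot t_o$):

```python
def parallel_time(n, m, t_s, t_o, alpha):
    """Modeled time of running the kernel with n threads and m labels:
    per-thread management overhead n*t_o, a serial part (1-alpha)*t_s,
    and a label-dependent parallel part alpha*m*t_s/n."""
    return n * t_o + (1 - alpha) * t_s + alpha * m * t_s / n

def best_thread_count(m, t_s, t_o, alpha, max_threads=8):
    """Enumerate thread counts and pick the one maximizing the modeled
    speedup over the serial time (1-alpha)*t_s + alpha*m*t_s."""
    serial = (1 - alpha) * t_s + alpha * m * t_s
    return max(range(1, max_threads + 1),
               key=lambda n: serial / parallel_time(n, m, t_s, t_o, alpha))
```

With high per-thread overhead the model collapses to one thread, while a mostly parallel, many-label kernel saturates the available cores, matching the behavior observed in Figure 4.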
4.3.2 Heterogeneity-Aware Kernel Division
When running a time-consuming parallel kernel on heterogeneous mobile processors, we must ensure load balance between computing units (GPU, slow CPU and fast CPU). We introduce an analytical model to decide how to divide the computation of a kernel between different computing units (i.e., kernel division) for load balance. The kernel division is implemented by assigning iterations of the loop in a parallel kernel to different computing units. Equation 7 shows the model.
Equation 7 is an extension of Equation 6, which is for homogeneous CPUs. Equation 7 considers the difference in computation ability between heterogeneous processors. Given a time-consuming parallel kernel, Equation 7 partitions the workload to run on the fast CPU ($w_{fast}$), the slow CPU ($w_{slow}$) and the GPU ($w_{gpu}$). Equation 7 also considers the computation that is not sensitive to the number of labels ($(1-\alpha) \cdot t_s$) and the computation that is sensitive to the number of labels.
In Equation 7, given a kernel, $t_s^{fast}$, $t_s^{slow}$, $t_s^{gpu}$, $t_{copy}$, $t_o$, and $\alpha$ can be measured offline. $m$ is known at runtime. To determine the optimal kernel division, we enumerate the possible values of $w_{fast}$, $w_{slow}$, $w_{gpu}$, $n_{fast}$, and $n_{slow}$, and use Equations 7 and 5 to find the kernel division that leads to the largest speedup.
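A sketch of this enumeration is below. It assumes a particular reading of Equation 7, not its verbatim form: each unit's time scales with its offline-measured per-chunk time, GPU work pays a data-copy cost, and the parallel portion completes when the slowest unit finishes:

```python
from itertools import product

def division_time(w, m, ts, t_copy, n_fast, n_slow, alpha=0.9):
    """Modeled completion time of the parallel portion when fractions
    w = (w_fast, w_slow, w_gpu) of a kernel's loop iterations run on
    the fast CPU, slow CPU, and GPU. The kernel finishes when the
    slowest unit finishes; GPU work pays a data-copy cost t_copy."""
    w_fast, w_slow, w_gpu = w
    work = alpha * m                      # label-dependent parallel work
    t_f = w_fast * work * ts["fast"] / n_fast
    t_s = w_slow * work * ts["slow"] / n_slow
    t_g = w_gpu * work * ts["gpu"] + (t_copy if w_gpu > 0 else 0)
    return max(t_f, t_s, t_g)

def best_division(m, ts, t_copy, n_fast=4, n_slow=4, step=0.1):
    """Enumerate workload splits in increments of `step` and return
    the split with the smallest modeled completion time."""
    grid = [round(i * step, 2) for i in range(int(1 / step) + 1)]
    splits = [(f, s, round(1 - f - s, 2))
              for f, s in product(grid, grid) if f + s <= 1]
    return min(splits, key=lambda w: division_time(w, m, ts, t_copy,
                                                   n_fast, n_slow))
```

Under this model, an expensive CPU-GPU copy pushes the whole kernel onto the CPUs, while a cheap copy and a fast GPU shift most iterations to the GPU.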
We implement Flame using C++ with Native Development Kit (NDK) on Android 9.0 and evaluate the system on Samsung S9 with Snapdragon 845 SoC. Our implementation includes 3135 lines of code in total. In our mobile platform, we have three types of mobile processors, which are GPU, fast CPU and slow CPU. We implement a scheduler to schedule kernels and divide the computation within a parallel kernel over the three types of mobile processors. To run a kernel on a specific type of CPU, we use the thread affinity API. To execute the kernel on GPU, we maintain a CPU thread to execute an OpenCL version of the kernel. To execute a parallel kernel, we examine the availability of CPU cores and GPU at runtime and then employ the performance models discussed in Section 4.3 to obtain the optimal concurrency and workload division.
We evaluate Flame from the perspectives of labeling quality and execution time on heterogeneous mobile processors.
6.1 Experimental Setup
| Type | Data Set | # of features | # of labels |
Datasets. We use eight datasets to evaluate Flame. Table 3 summarizes them. These datasets are commonly used for object detection or recognition, which are common applications on mobile devices.
Flame can detect unknown new labels when new data is dynamically generated on a mobile device. To evaluate this ability, we use the following method. For each dataset, we choose 20% of the labels as known and the rest as unknown; the unknown labels need to be detected by Flame. We choose a subset of data $D_{init}$ from each dataset. The data subset $D_{init}$ has known labels and is used to build heuristic functions at the beginning of auto-labeling. Excluding $D_{init}$, the rest of the dataset ($D_{rest}$) is used to simulate the scenario where new data is incrementally generated for auto-labeling. The ratio between $D_{init}$ and $D_{rest}$ is 0.1.
We use six heuristic functions, each of which contains 40 prototypes. The dynamically generated data is fed into Flame at the granularity of a chunk. A chunk includes 20 data samples. The labeling confidence threshold $\theta$ is set to 0.7.
Mobile device configuration. We evaluate Flame on a Samsung S9 smartphone. This device is equipped with Snapdragon 845 SoC and Android 9.0 Pie OS. In Snapdragon 845 SoC, there is a mobile GPU, Adreno 630. We program it with OpenCL 2.0 Khronos .
Evaluation metrics. We use the following metrics to evaluate the system’s labeling results.
$Accuracy = (N_{new} + N_{old}) / N$, where $N_{new}$ is the total number of data with new labels correctly labeled, $N_{old}$ is the number of data with existing labels identified correctly, and $N$ is the total number of data labeled by the system.
Let $FP$ represent the total data that should be assigned existing labels but is mislabeled with new labels (i.e., previously unknown labels); let $FN$ represent the total data that should be assigned new labels but is mislabeled with existing labels; let $N_c$ be the total number of data that should be assigned new labels. To evaluate labeling results, besides $Accuracy$, we use another three metrics based on $FP$, $FN$, and $N_c$.
$M_{new}$: Percentage of data that should be assigned new labels but is mislabeled with existing labels, i.e., $M_{new} = FN \times 100 / N_c$.
$F_{new}$: Percentage of data that should be assigned existing labels but is mislabeled with new labels, i.e., $F_{new} = FP \times 100 / (N - N_c)$.
$F$-measure: This metric quantifies the overall labeling quality of the system by considering both precision and recall. $F$-measure is defined as $F_\beta = \frac{(1+\beta^2) \cdot Precision \cdot Recall}{\beta^2 \cdot Precision + Recall}$, where $Precision = TP / (TP + FP)$, $Recall = TP / (TP + FN)$, and $TP$ is the total number of data that should be assigned new labels and is assigned correctly. In this paper, we use $\beta = 1$, which gives $F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$.
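Given per-datum records of whether the true and predicted labels are new and whether the prediction is exactly correct, the four metrics can be computed as follows (a sketch using the definitions above):

```python
def labeling_metrics(results):
    """Compute Accuracy, M_new, F_new, and the F1-based F-measure from
    a list of (true_is_new, predicted_is_new, correct) records, where
    `correct` means the exact label matched."""
    N = len(results)
    n_correct = sum(1 for t, p, c in results if c)
    Nc = sum(1 for t, p, c in results if t)            # truly new-label data
    FN = sum(1 for t, p, c in results if t and not p)  # new mislabeled as old
    FP = sum(1 for t, p, c in results if not t and p)  # old mislabeled as new
    TP = sum(1 for t, p, c in results if t and p and c)
    accuracy = n_correct / N
    m_new = 100.0 * FN / Nc
    f_new = 100.0 * FP / (N - Nc)
    precision = TP / (TP + FP) if TP + FP else 0.0
    recall = TP / (TP + FN) if TP + FN else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return accuracy, m_new, f_new, f_measure
```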
6.2 Evaluation Results
Labeling results. Table 4 shows the results, evaluated with the eight datasets using the four evaluation metrics. In general, Flame provides good labeling quality, comparable to that of existing work Varma and Ré (2018a); Ratner et al. (2017).
The labeling quality on MNIST is the best: its $Accuracy$ and $F$-measure are the highest among all datasets. This is because the image characteristics of different labels in MNIST are quite dissimilar, making auto-labeling easier. However, for the GTSRB dataset, Flame has a relatively low labeling quality. This is because the data in this dataset comes from 40 different classes; using our simulation method to simulate unknown labels, we have 32 unknown labels, and such a large number of unknown labels can influence the labeling quality. The results of $M_{new}$ show that Flame has a good ability to detect new labels as the dataset dynamically increases, especially for the datasets related to image classification. This is due to the effectiveness of the self-adaptation mechanism in Flame, which is superior to other auto-labeling methods that can only work on datasets of fixed size. Besides the superior ability to detect new labels, Flame also maintains high labeling quality for the data that should be assigned existing labels. This fact is supported by the results of $F_{new}$, which can be as small as 4.47.
Parameter sensitivity. We evaluate how execution time and labeling accuracy vary as we use different system configurations (particularly the number of heuristic functions, the confidence score threshold $\theta$, the chunk size, and the number of prototypes in a heuristic function). Figure 5 shows the results.
Figure 5.a shows that the accuracy increases with the number of heuristic functions while that number is smaller than 10. However, the accuracy drops when the number of heuristic functions is larger than 10, because too many heuristic functions cause an overfitting problem. Furthermore, the execution time consistently increases as the number of heuristic functions increases. This is expected, because the execution time of Flame is related to the number of heuristic functions.
Figure 5.b shows that the accuracy increases as the confidence score threshold increases. However, the execution time dramatically increases once the confidence score threshold is larger than 0.7. Hence, we use $\theta = 0.7$ in Flame.
Figure 5.c shows that the accuracy increases as the chunk size increases. This is because as the chunk size increases, the diversity of the data within a chunk also increases, and the higher data diversity improves the labeling quality. However, when the chunk size is larger than 20, increasing it further does not improve the accuracy, while the execution time increases substantially. Hence, we choose 20 as the chunk size in Flame.
Figure 5.d shows that the accuracy increases with the number of prototypes in a heuristic function. However, beyond 40 prototypes, adding more does not improve accuracy, because too many prototypes in a heuristic function can overfit the labeling results. Hence, we choose 40 as the number of prototypes in Flame.
Execution time. Figure 6 presents the execution time of labeling 5000 data samples over the eight datasets. We use six execution strategies to evaluate the effectiveness of Flame: (1) serial execution on a single fast CPU core with a FIFO scheduling strategy; (2) parallel execution using CPUs (fast and slow CPU cores) with the FIFO scheduling strategy; (3) parallel execution using CPUs and the GPU with the FIFO scheduling strategy; (4) parallel execution using CPUs and the GPU with optimized kernel scheduling, but without concurrency control or heterogeneity-aware kernel division; (5) parallel execution using CPUs and the GPU with the FIFO scheduling strategy plus heterogeneity-aware kernel division; (6) parallel execution using all techniques, including kernel scheduling, concurrency control, and heterogeneity-aware kernel division. We use the first strategy as the baseline for comparison.
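The gap between FIFO and heterogeneity-aware kernel scheduling can be illustrated with a greedy earliest-finish-time sketch. The processor names, relative speed factors, and kernel costs below are made up for illustration and are not Flame's actual scheduler or measurements.

```python
def greedy_heterogeneous_schedule(kernels, processors):
    """Heterogeneity-aware scheduling sketch: place each kernel on the
    processor with the earliest estimated finish time, where a processor's
    speed factor scales the kernel's work."""
    finish = {p: 0.0 for p in processors}  # busy-until time per processor
    placement = {}
    for name, work in kernels:
        best = min(processors, key=lambda p: finish[p] + work / processors[p])
        finish[best] += work / processors[best]
        placement[name] = best
    return placement, max(finish.values())  # makespan across processors

processors = {"big_cpu": 1.0, "little_cpu": 0.4, "gpu": 2.5}  # relative speeds
kernels = [("conv", 10.0), ("fft", 4.0), ("dist", 2.0), ("vote", 1.0)]
placement, makespan = greedy_heterogeneous_schedule(kernels, processors)
print(placement, round(makespan, 2))
```

In contrast, a FIFO scheduler would dispatch kernels in arrival order to whichever processor is free, ignoring that a large kernel may run far faster on the GPU than on a slow core, which is the inefficiency Strategies 4-6 address.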
Figure 6 shows 3.6x, 6.9x, 9.2x, 9.6x and 11.6x performance improvement on average after applying Strategies 2-6, respectively. Flame (Strategy 6) leads to the largest performance improvement. With Strategy 2, each kernel is executed on multiple CPU cores, giving a 3.6x improvement, but the GPU remains idle. In Strategy 3, both CPUs and the GPU are utilized, increasing the speedup to 6.9x; however, the FIFO kernel scheduling is not efficient. Strategy 4 improves kernel scheduling by considering hardware heterogeneity, raising the speedup from 6.9x to 9.2x. Strategy 5 does not use Flame's kernel scheduling but does use heterogeneity-aware kernel division; this optimization alone yields a large improvement (9.6x). Strategy 6, which combines all techniques, leads to the largest performance improvement.
Energy consumption. Figure 7 shows the energy consumption of the six strategies, normalized by the energy consumption of Strategy 1. The energy consumption of Strategies 2-6 is 55%, 67%, 52%, 51% and 40% of that of Strategy 1, respectively. Flame (Strategy 6) uses the least energy. Low energy consumption is important for extending the battery life of mobile devices, and Flame's low energy consumption comes from its high performance, i.e., it labels data in the shortest time among all strategies.
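The normalization used in Figure 7 is a simple ratio to the baseline. Below is a sketch with made-up joule values, chosen only so the ratios match the reported percentages; the real measurements are not given here.

```python
def normalized_energy(energies, baseline="s1"):
    """Normalize each strategy's measured energy by the baseline strategy's
    energy, as done for Figure 7."""
    base = energies[baseline]
    return {name: joules / base for name, joules in energies.items()}

# Hypothetical measurements (joules); baseline set to 100 for readability.
measured = {"s1": 100.0, "s2": 55.0, "s3": 67.0, "s4": 52.0, "s5": 51.0, "s6": 40.0}
print(normalized_energy(measured))
```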
7 Related Work
Automatic labeling. We provide an overview of automatic labeling methods, which label data automatically using heuristic functions generated from both labeled and unlabeled data.
The main challenge for auto-labeling is to build proper heuristic functions that can cover almost all data in the dataset Wang and Rudin (2015); Wang et al. (2015); Varma and Ré (2018b); Varma et al. (2017); Ratner et al. (2016); Bach et al. (2017). High-quality heuristic functions are usually difficult to acquire and can be highly application-specific; sometimes domain experts are even needed. In Varma and Ré (2018b), Varma et al. propose a method that uses machine learning models to build heuristic functions under weak supervision. Other work Hastie et al. (2009); Weiss et al. (2016); Ratner et al. (2017) uses distant supervision Hastie et al. (2009); Weiss et al. (2016), in which training sets are generated with the help of external resources, such as knowledge bases. Crowdsourcing methods have also been intensively studied Li (2017); Chai et al. (2016); Khan and Garcia-Molina (2016); Verroios et al. (2017); Das et al. (2017); Wang et al. (2012), and applied in many fields, such as task generation Wang et al. (2012), image labeling Ratner et al. (2017); Varma and Ré (2018b), and task selection Verroios et al. (2017).
Some approaches have recently been proposed for noisy or weak heuristic functions Sheng et al. (2008); Ratner et al. (2017). These approaches demonstrate that proper strategies can boost the overall quality of labeling by ensembling heuristic functions Bach et al. (2017). Our work is different: the existing approaches focus on static datasets with a fixed size and a pre-determined set of labels, deployed on a server, while our work targets dynamically growing datasets on mobile devices and can identify labels never seen before. We also leverage processor heterogeneity in mobile devices to run the auto-labeling workload. Hence, our work not only labels data with high quality but is also tailored to mobile devices.
Optimization of machine learning on mobile devices. Many existing efforts optimize machine learning models on mobile devices, including dynamic resource scheduling Georgiev et al. (2016); LiKamWa and Zhong (2015); Lane et al. (2016); Liu et al. (2019a); Ogden and Guo (2018), computation pruning Gordon et al. (2018); Li et al. (2018), model partitioning Kang et al. (2017); Lane et al. (2016), model compression Fang et al. (2018); Liu et al. (2018), coordination with cloud servers Georgiev et al. (2016); Kang et al. (2017), and memory management Fang et al. (2018); LiKamWa and Zhong (2015). In particular, DeepX Lane et al. (2016) proposes resource scheduling algorithms to decompose DNNs into sub-tasks on mobile devices. LEO Georgiev et al. (2016) introduces a power-priority resource scheduler to maximize energy efficiency. NestDNN Fang et al. (2018) compresses and prunes models based on the hardware resources available on mobile devices. Our work differs from those efforts in that we introduce efficient hardware heterogeneity-aware kernel scheduling and focus on optimizing intra-kernel parallelism to achieve high performance and low energy consumption.
8 Conclusions
Auto-labeling on mobile devices is critical to enable successful ML training on mobile devices for many large ML models. However, enabling auto-labeling on mobile devices is challenging because of the unique data characteristics on mobile devices and the heterogeneity of mobile processors. In this paper, we introduce the first auto-labeling system for mobile devices, named Flame, to address this problem. Flame includes an auto-labeling algorithm that detects new, unknown labels from non-stationary data; it also includes a runtime system that efficiently schedules and executes the auto-labeling workload on heterogeneous mobile processors. Evaluating Flame with eight datasets, we demonstrate that it enables auto-labeling with high labeling accuracy and high performance.
References
- Learning the structure of generative models without labeled data. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 273–282.
- Cost-effective crowdsourced entity resolution: a partial-order approach. In Proceedings of the 2016 International Conference on Management of Data, pp. 969–984.
- Falcon: scaling up hands-off crowdsourced entity matching to build cloud services. In Proceedings of the 2017 ACM International Conference on Management of Data, pp. 1431–1446.
- Automatic labeling of data for transfer learning. Nature 192255, pp. 241.
- Cross-modal data programming enables rapid medical machine learning. arXiv preprint arXiv:1903.11101.
- MALMOS: machine learning-based mobile offloading scheduler with online training. In 2015 3rd IEEE International Conference on Mobile Cloud Computing, Services, and Engineering, pp. 51–60.
- NestDNN: resource-aware multi-tenant on-device deep learning for continuous mobile vision. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking (MobiCom), pp. 115–127.
- LEO: scheduling sensor inference algorithms across heterogeneous mobile processors and network resources. In Proceedings of the 22nd Annual International Conference on Mobile Computing and Networking (MobiCom), pp. 320–333.
- MorphNet: fast & simple resource-constrained structure learning of deep networks. pp. 1586–1595.
- CLAMShell: speeding up crowds for low-latency data labeling. Proceedings of the VLDB Endowment 9 (4), pp. 372–383.
- Multi-class AdaBoost. Statistics and Its Interface 2 (3), pp. 349–360.
- Neurosurgeon: collaborative intelligence between the cloud and mobile edge. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 615–629.
- Attribute-based crowd entity resolution. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, pp. 549–558.
- OpenCL 2.0. https://www.khronos.org/opencl.
- Federated learning: strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492.
- DeepX: a software accelerator for low-power deep learning inference on mobile devices. In Proceedings of the 15th International Conference on Information Processing in Sensor Networks (IPSN), pp. 23.
- DeepRebirth: accelerating deep neural network execution on mobile devices. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Human-in-the-loop data integration. Proceedings of the VLDB Endowment 10 (12), pp. 2006–2017.
- Starfish: efficient concurrency support for computer vision applications. In Proceedings of the 13th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys), pp. 213–226.
- Runtime concurrency control and operation scheduling for high performance neural network training. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 188–199.
- Performance analysis and characterization of training deep learning models on mobile devices. CoRR.
- On-demand deep model compression for mobile devices: a usage-driven model selection framework. In Proceedings of the 16th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys), pp. 389–400.
- Automatic labeling hierarchical topics. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pp. 2383–2386.
- Classification and novel class detection in concept-drifting data streams under time constraints. IEEE Transactions on Knowledge and Data Engineering 23 (6), pp. 859–874.
- Automatic labeling of multinomial topic models. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 490–499.
- MODI: mobile deep inference made efficient by edge computing. In Workshop on Hot Topics in Edge Computing (HotEdge).
- Deep co-training for semi-supervised image recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 135–152.
- Snorkel: rapid training data creation with weak supervision. Proceedings of the VLDB Endowment 11 (3), pp. 269–282.
- Data programming: creating large training sets, quickly. In Advances in Neural Information Processing Systems, pp. 3567–3575.
- Get another label? Improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 614–622.
- Inferring generative model structure with static analysis. Advances in Neural Information Processing Systems 30, pp. 239–249.
- Snuba: automating weak supervision to label training data. Proceedings of the VLDB Endowment 12 (3), pp. 223–236.
- Learning dependency structures for weak supervision models. arXiv preprint arXiv:1903.05844.
- Waldo: an adaptive human interface for crowd entity resolution. In Proceedings of the 2017 ACM International Conference on Management of Data, pp. 1133–1148.
- Falling rule lists. In Artificial Intelligence and Statistics, pp. 1013–1022.
- CrowdER: crowdsourcing entity resolution. Proceedings of the VLDB Endowment 5 (11), pp. 1483–1494.
- Or's of And's for interpretable classification, with application to context-aware recommender systems. arXiv preprint arXiv:1504.07614.
- A survey of transfer learning. Journal of Big Data 3 (1), pp. 9.
- Machine learning at Facebook: understanding inference at the edge. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 331–344.
- Cost-effective data annotation using game-based crowdsourcing. Proceedings of the VLDB Endowment 12 (1), pp. 57–70.