Nowadays, computationally intensive machine-learning applications such as image recognition are becoming popular on resource-constrained edge devices (e.g., intelligent cameras). While enjoying the merits of these applications, users are also frustrated when striking a balance between execution time and resource consumption on the edge. To address this problem, many task partitioning approaches have been proposed. Generally, an edge application is partitioned into a set of tasks which can be executed on the edge devices. For example, a video analytics application usually consists of several tasks (e.g., face detection and action classification), which are allocated to multiple edge nodes for execution. Application partitioning and task allocation reduce the burden on a single edge device and jointly improve the performance of the application.
However, in major edge computing systems, we often face challenges in learning under data scarcity, due to either prohibitive cost (e.g., privacy concerns, storage limitations, and networking costs), or the inherent difficulty of obtaining proper training samples given the system complexity and uncertainty on the edge. Recently, transfer learning has shown its effectiveness in tackling the data scarcity issue [hutchinson2017overcoming] and serves as a widely suggested remedy for different industrial applications with insufficient samples, e.g., image recognition [yuan13], speech analysis [wu15], disease diagnosis [emrani17], medical informatics [zhou11] and industrial operations (e.g., AIOps) [ide2017multi].
In this paper, we focus on Multi-task Transfer Learning (MTL) on the edge, where a machine-learning-based application can be divided into multiple machine-learning tasks, and each task can obtain the knowledge of some other tasks to improve its performance. It is well known that machine-learning-based applications are highly computation-intensive, while the computation resources of edge devices are limited. Many efforts have been devoted to designing task allocation mechanisms to achieve various objectives, e.g., optimizing the makespan [biswas2017multi], throughput [hong2007adaptive] or reliability [jiang2015reliable] of the application. However, these frameworks focus on general parallel tasks in the centralized datacenter, where the computation capacity is assumed to be infinite because virtual machines can be leased constantly.
In edge computing systems, it is sometimes hard to obtain a satisfactory result within time and resource limitations if we directly utilize existing frameworks for the cloud. Admittedly, existing task allocation studies have considered that different tasks may require different resources in edge computing systems in order to jointly improve the performance of the application [sundar2018offloading, cao2015energy, mao2016power]. They are usually designed for general machine learning and typically assume that all tasks contribute identically to overall performance improvement of the application. However, in MTL, tasks belonging to the same machine-learning-based application usually have different potential for improving the application’s overall performance. Directly applying these techniques leads to inefficient resource utilization at a task level under MTL in edge computing systems.
To solve the above inefficiency issue for multiple-task allocation in edge computing systems, the key is that more important tasks, which have the higher potential for improving the application’s overall decision performance, should be allocated to more powerful edge devices for priority execution under time limits. Recently, Geng et al. also considered the priority of tasks by leveraging the dependency of tasks in task allocation [geng2018energy]. In that study, the task dependency is predefined and remains fixed over time, e.g., installing Hadoop before Spark. However, due to the complex nature of machine-learning tasks, variables such as environmental conditions and model configurations are likely to change over time. The dependency of machine-learning tasks is dynamic and usually not available before learning. Directly applying the current allocation mechanism can easily result in significant overall application performance degradation for MTL on the edge.
Instead of assuming that all tasks contribute identically to the application’s overall decision performance improvement, we conduct time-dynamic task allocation on the edge. Our idea is to leverage machine learning techniques to capture the correlated and collective potential improvement of multiple tasks. Accordingly, we propose a Data-driven Cooperative Task Allocation (DCTA) mechanism to maximize the application’s overall decision performance among multiple tasks on the edge. We also conduct a new comprehensive case study under a real-world industrial operation (e.g., AIOps) scenario, where MTL is necessary due to the data scarcity on edge devices.
Challenges and solutions. In designing DCTA, we have to overcome the following three major technical challenges.
First, a metric for the impact of tasks on overall decision performance improvement remains unknown in current studies. To tackle this challenge, we propose a metric of task importance, which measures the overall performance degradation when the measured task is not conducted in MTL. We also observe the long-tail property of task importance, i.e., only a few tasks are important, which serves as a key metric to guide task allocation and facilitate resource saving from less important tasks. We formally define the TATIM problem of task allocation with task importance for MTL on the edge.
Second, the TATIM problem is challenging due to not only its computational complexity (i.e., NP-complete) but also the varying contexts (i.e., dynamic task importance) on the edge. We first prove that TATIM is a variant of the Knapsack problem and thus NP-complete. We then show that the task importance is difficult to capture, due to varying environmental conditions and configurations. Therefore, the complicated computation to solve this problem needs to be conducted repeatedly under varying contexts on the edge. To enhance the computational efficiency, we propose a data-driven task allocation mechanism based on reinforcement learning.
Third, applying the machine learning technique to solve the TATIM problem introduces a trade-off between accuracy and cost. On one hand, an accurate data-driven model requires a huge amount of expensive local data on real-world operations. On the other hand, merely using general data from simulation helps to reduce the amount of local data needed but leads to low accuracy. To tackle the challenge, we propose a cooperative learning mechanism to reduce the amount of data needed to generate a reliable data-driven model, by leveraging both general simulated data and local real-world data.
We implement DCTA as a task allocation approach within a data-driven building management system. We also evaluate various distinct task allocation approaches through not only a trace-driven simulation, but also a new comprehensive real-world AIOps case study which bridges model and practice via a new architecture and main component design within the AIOps system. Extensive experiments show that DCTA reduces processing time by 3.24 times and saves 48.4% of energy consumption when solving TATIM compared to the state-of-the-art.
The rest of the paper is organized as follows. In Sec. 2, differing from our preliminary work [chen2019], we reorganize all the notations and observations on task importance for better understanding. In Sec. 3, we introduce the data-driven approach for task allocation, leveraging both clustered reinforcement learning and a support vector machine. In Sec. 4, we conduct trace-driven simulations to evaluate the performance of the proposed DCTA mechanism. In Sec. 5, we add a new comprehensive case study on AIOps for our DCTA mechanism to bridge model and practice. Specifically, we first elaborate on the background of the AIOps system for better understanding, and thoroughly analyze the motivation for applying the DCTA mechanism to the AIOps system. We then analyze how to apply the DCTA mechanism by proposing a new architecture and main component design within the AIOps system. Extensive experiments are conducted to demonstrate the superiority of the AIOps system integrating our DCTA mechanism. Sec. 6 discusses related work and Sec. 7 analyzes future work and possible improvements. Finally, we conclude this paper in Sec. 8.
2 Background and Problem Definition of Task Allocation with Task Importance
In this section, we first introduce the background of Multi-task Transfer Learning (MTL). We then give a formal definition of task importance. We also observe the long-tail property of task importance and the potential of leveraging task importance for task allocation in MTL. With these notations, we formally define the problem of task allocation with task importance for MTL.
2.1 Background of Multi-task Transfer Learning (MTL)
In this paper, we study Multi-task Transfer Learning (MTL) on the edge, where varying tasks together can facilitate better decision performance. It basically reuses parameters or training samples of source tasks to support target tasks, e.g., those that lack training data. The term task is defined as a set of data, labels and the corresponding learning model for a predefined context. For example, for a self-driving car on the road, the detection of each type of object, e.g., neighboring-car, traffic-sign, or pedestrian detection, can be modeled separately as a task. Another example is to take the coefficient of performance (COP) prediction of a chiller for one particular operation as a task [zheng2018data]. The process is shown in Fig. 1.
The benefits of multiple tasks come mainly in two ways. First, similar tasks can transfer their knowledge to each other during the training process, which reduces the negative effect of data scarcity, especially on the edge. Second, in real-world scenarios, it is common to make the final decision by aggregating the output of multiple tasks. Maintaining the high performance of all these tasks contributes to the final aggregated decision performance. Again in the example of a self-driving car, the final driving operation of the car is conducted based on the result of multiple data-driven tasks, e.g., the neighboring-car, traffic-sign, and pedestrian detection.
The Computation Challenge. However, current MTL systems are far too computationally complicated for edge devices. The reason is twofold: 1) Each task needs to be learned individually from scratch, where siloed tasks make training a new task or a comprehensive perception system a Sisyphean challenge; 2) To avoid data-driven task models becoming out-of-date and to leverage the latest accumulated data as effectively as possible, MTL practitioners retrain their models repeatedly to get the final model with the best quality, including exploring feature representations [yang16, lin16, gong12], adjusting structures of task relationships [zhang17, lin16Interactive, oyen12] and tuning hyper-parameters [isele16]. For better understanding, a formal formulation of MTL tasks on the edge is available in Section 2.4.
2.2 Notations of Task Importance
Confronted with the computational challenge of MTL, we aim to allocate tasks for more efficient MTL on the edge. When allocating tasks, current studies usually assume that all machine-learning tasks are equally important so that resources should be allocated to ensure the accuracy of all these tasks.
However, tasks are not always related to the current context, and thus are not equally important. Within a specific period of time, e.g., one hour, the highly important tasks are likely to be a minority of all possible tasks. For example, for a self-driving car on the highway, neighboring-car detection can be much more relevant and important than most other tasks, such as pedestrian detection, which is more important in a downtown area. (As a further demonstration, a real-world experiment and the corresponding observation are also available at the end of the subsection.)
In this part, before studying task importance, we first formally define it and its related notations. The key notations in this paper are also listed in Table I for ease of reference. A further experiment on task importance is available after the definition.
|Set of tasks where|
|Set of edge devices where|
|The importance of task|
|The merit function indicates the ability to provide credible decision performance|
|The decision-making function indicates the best operation|
|The ideal performance|
|Whether the task is assigned to processor (=1) or not (=0)|
|The execution time of task|
|The resource (e.g., battery) consumption of task|
|The maximum time limits to conduct the decision|
|The maximum resource capacity of processor|
|The model parameters of task|
|The learning loss of task|
|u||The task-allocation matrix where u =|
(Task Importance) Given a task set which consists of a series of tasks, the importance of task is
where a learning task is denoted by ; denotes the model parameters of task and denotes its vector; the merit function outputs the final potential performance improvement; denotes the entire task set.
Thus, given model parameters , the task importance can be updated using the merit function . Such a function indicates the ability to provide credible decision performance (e.g., energy saving) and outputs a value called overall merit, which is formally defined as below.
(Overall Merit) Given the task set and the ideal performance of final decisions , the overall merit is defined as the similarity with the ideal performance, i.e.,
where denotes a decision-making function given model parameters, and denotes the ideal performance which can usually be collected after final optimization, i.e., collected manually or automatically by leveraging historical samples.
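As a minimal sketch of the two definitions above, the snippet below computes a leave-one-out importance for each task, i.e., the drop in overall merit when that task is removed from the task set. The function names, the additive toy merit function, and the per-task contribution values are illustrative assumptions and not the implementation used in this paper.

```python
from typing import Callable, Dict, FrozenSet

# A merit function maps a set of task names to an overall merit value.
Merit = Callable[[FrozenSet[str]], float]


def task_importance(tasks: FrozenSet[str], merit: Merit) -> Dict[str, float]:
    """Leave-one-out importance: merit of all tasks minus merit without the task."""
    full = merit(tasks)
    return {t: full - merit(tasks - {t}) for t in tasks}


# Hypothetical per-task contributions (assumed values, for illustration only).
CONTRIBUTION = {"car": 0.6, "sign": 0.3, "pedestrian": 0.1}


def toy_merit(subset: FrozenSet[str]) -> float:
    """Toy overall merit: an additive stand-in for similarity to the ideal performance."""
    return sum(CONTRIBUTION[t] for t in subset)


importance = task_importance(frozenset(CONTRIBUTION), toy_merit)
```

With an additive merit, each task's importance simply equals its own contribution; a real merit function would generally be non-additive because tasks share knowledge.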
In general, the historical data record the descriptors of contexts/scenarios, requirements, historical operations, and their results. Such information helps us to define the ideal performance. For example, in the case of a self-driving car, in order to ensure that the car arrives at the destination safely, it will conduct a series of decision actions, where the least time-consuming situation can be regarded as the ideal performance.
Such a decision-making function intrinsically solves an optimization problem, finding the best action according to the parameters, and can be set once the scenario is given. For example, in the case of a self-driving car, a possible decision-making function is to find an action which minimizes the probability of an accident while ensuring that the car arrives at the destination within the time limit. For interested readers, a more concrete implementation of this function is also available in Section 5.
2.3 Observations on Task Importance
We have introduced related concepts of task importance. We next justify the motivation of using task importance by further observations.
We first plot the distribution of task importance in Fig. 3, based on a real-world MTL dataset released in [zheng2018data]. In this dataset, there are 50 data-driven tasks in total for cooling operations running across four years in three buildings. We observe a long-tail property of task importance, i.e., merely 12.72% of tasks contribute over 80% of the overall merit. We say that a task is unimportant when its task importance is critically low compared with others, e.g., below 0.05%. This leads to the following observation.
In MTL, unimportant tasks exist; the importance of tasks obeys a long-tail distribution.
This observation reveals the non-uniform distribution of task importance in the real-world environment, which motivates us to break the common assumption of modern MTL. Results in a recent CVPR paper also confirm this observation [taskonomy2018]. The unimportant (e.g., redundant or noisy) tasks can be the result of 1) insufficient training samples on the edge, and 2) a mismatch between the context and the submitted tasks in practical scenarios. It also indicates the potential of speeding up MTL by saving resources on those unimportant tasks.
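One simple way to quantify such a long-tail property is to ask what fraction of tasks, taken from the most important downwards, is needed to cover a given share of the total importance. The sketch below does this; the function name and the toy importance values are assumptions for illustration and are not drawn from the dataset.

```python
from typing import Sequence


def head_fraction(importance: Sequence[float], share: float = 0.8) -> float:
    """Fraction of tasks (most important first) needed to cover `share` of the total."""
    total = sum(importance)
    covered = 0.0
    for count, value in enumerate(sorted(importance, reverse=True), start=1):
        covered += value
        if covered >= share * total:
            return count / len(importance)
    return 1.0


# Assumed toy importances: two dominant tasks and six minor ones.
toy = [50, 35, 5, 4, 3, 1, 1, 1]
fraction = head_fraction(toy)  # two of eight tasks already cover 85% of the total
```

A small returned fraction indicates that a few tasks dominate; a uniform distribution would return a fraction close to `share` itself.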
In the machine learning community, current MTL systems usually conduct tasks in the order of time stamps, where these time-ordered tasks are of arbitrary importance. Thus, the current execution sequence can be regarded as random, e.g., normally distributed, in terms of task importance. When there are limitations on resource and execution time for MTL tasks, the current approach can suffer from lower overall merit.
We conduct an experiment on the MTL dataset mentioned above [zheng2018data], where the decision objective of MTL is to control the Chiller AIOps system to minimize the energy consumption for cooling, which serves as the decision performance. Fig. 3 shows the result of conducting MTL tasks in the order of task importance (called the ACCURATE scheme), compared with the order of time with random task importance (called the CURRENT scheme) under execution time limitations. Such an ACCURATE scheme can be obtained by computing task importance using historical data (Section 2.2), and can be regarded as ground truth. Based on the obtained accurate task importance, we can find the best task allocation strategy. For interested readers, a detailed optimization process is also available in Section 5.2.
Stacked bars on the left indicate the performance with the ACCURATE scheme, whereas those on the right show the CURRENT scheme using random task allocation. We see that the ACCURATE scheme considering task importance could have resulted in an average potential improvement of over 45.68% in terms of the overall merit. These results demonstrate that there is significant room to improve the overall merit when using a more accurate and robust scheme of task allocation. We summarize the observation below.
Overall merit with MTL can be improved by task allocation according to task importance.
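The intuition behind this observation can be illustrated with a toy comparison of the two execution orders under a time limit. In the sketch below, the overall merit is approximated by the summed importance of the tasks that finish within the budget; this additive approximation, the task tuples, and all numbers are assumptions for illustration only.

```python
from typing import List, Tuple

Task = Tuple[str, float, float]  # (name, importance, execution_time)


def merit_within_limit(order: List[Task], time_limit: float) -> float:
    """Sum the importance of tasks executed in order until the time budget runs out."""
    elapsed, merit = 0.0, 0.0
    for _name, importance, exec_time in order:
        if elapsed + exec_time > time_limit:
            break
        elapsed += exec_time
        merit += importance
    return merit


# Assumed toy workload, listed in arrival (time-stamp) order.
arrival_order = [("t1", 0.05, 2.0), ("t2", 0.10, 2.0), ("t3", 0.60, 2.0), ("t4", 0.25, 2.0)]
by_importance = sorted(arrival_order, key=lambda task: task[1], reverse=True)

current = merit_within_limit(arrival_order, time_limit=4.0)   # time-stamp order
accurate = merit_within_limit(by_importance, time_limit=4.0)  # importance order
```

In this toy case only two tasks fit in the budget, so the importance-ordered run captures the two most important tasks while the time-ordered run captures the two least important ones.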
However, the task importance may not be always directly available for run-time usage. The above experiment is based on historical data so that we are able to compute the task importance after a task is executed. For run-time usage, we often need to know the task importance in advance, i.e., before a task is conducted. A natural question is whether the task importance is easy to predict, e.g., a fixed or stable value. Based on the above MTL dataset, we also conduct two experiments as more-detailed distribution studies showing how the importance fluctuates over operations under different industrial demands and conditions.
We first plot the average task importance as a function of different operations in terms of different types of machines in Fig. 5. We pick the first regular machine as an example. It can be seen that these machines often operate at a small portion of operations, and the importance fluctuates somewhat randomly. At the same time, for the same types of machines, we plot in Fig. 5 the variation in their task importance under different operations, and note that there is a large fluctuation even for a given operation. This is because the task importance in practice is highly dependent on a variety of factors like environmental conditions and configurations. Such factors are referred to as the context in this paper. This leads to the following observation.
Task importance fluctuates markedly over varying contexts with MTL in terms of average and variance.
This observation reveals that the time-dynamic task importance changes in varying contexts [hu2018synthesize, sax2018mid], e.g., with different external factors (like environmental conditions and dynamic industrial demands) and internal factors (like machine configurations and response). For example, in the case of a self-driving car, the context contains specific factors such as visual observations, physical information, weather, and traffic conditions. These factors are exceedingly difficult to capture within an analytical model. Given such high variance in task importance, the natural approach of modeling task importance with synthetic models easily suffers from low accuracy.
2.4 Problem of Task Allocation with Task Importance for MTL
Based on the above notations and observations, the intuition behind this paper is that we should allocate more important tasks to more powerful edge devices (e.g., an edge server) to optimize the final decision. Here, we say an edge device is more powerful if its processing speed or frequency is higher. We aim to leverage task importance to facilitate task allocation for MTL tasks on the edge, with an emphasis on time limits.
We start by formally defining the concepts of task allocation and MTL tasks on the edge, where the former consists of task placement and resource allocation. Considering that machine-learning tasks are usually highly computation-intensive, resource-constrained edge devices can barely handle multiple tasks in parallel. Therefore, we assume that, at any given time, a task under execution occupies the whole CPU computing resource.
Given an edge device set which consists of a series of edge devices, the task allocation over is a binary variable, i.e.,
where an edge device is denoted by .
Since each task is indivisible and must be assigned to exactly one edge device, we have the following constraint:
Considering that edge devices are usually resource-constrained and discrete, we classify resources into two categories, i.e., execution-related and basic requirements. The former refers to CPU computing resources, whereas the latter refers to battery or storage resources. Therefore, the CPU execution time and basic resource requirements of all tasks assigned to an edge device should satisfy the following constraints:
where denotes the execution time of task ; denotes the time limitations; denotes the resource required for task ; denotes the resource capacity of edge device .
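The assignment, time, and resource constraints described above can be checked mechanically for any candidate allocation. The following is a minimal sketch of such a feasibility check; the function name, argument names, and the list-of-lists encoding of the allocation matrix are assumptions made for illustration.

```python
from typing import Sequence


def is_feasible(
    u: Sequence[Sequence[int]],
    exec_time: Sequence[float],
    resource: Sequence[float],
    time_limit: float,
    capacity: Sequence[float],
) -> bool:
    """Check an allocation matrix u, where u[j][p] == 1 iff task j runs on device p."""
    # Each task is indivisible and assigned to exactly one device.
    if any(sum(row) != 1 for row in u):
        return False
    for p in range(len(capacity)):
        # Total execution time and resource demand of the tasks placed on device p.
        busy = sum(exec_time[j] for j, row in enumerate(u) if row[p])
        used = sum(resource[j] for j, row in enumerate(u) if row[p])
        if busy > time_limit or used > capacity[p]:
            return False
    return True
```

For example, with three tasks and two devices, `is_feasible([[1, 0], [0, 1], [1, 0]], [2, 3, 1], [1, 2, 1], 3, [2, 2])` holds, while tightening the time limit to 2 makes the same allocation infeasible.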
The objective of traditional MTL is to minimize the collective loss of all tasks. We study the modeling and define the MTL tasks specific to the edge computing scenario for better understanding.
Based on the above definitions, we formally define the problem of task allocation with task importance for MTL on the edge (TATIM Problem) as below.
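For orientation, one possible way to write such a problem is sketched below. The symbols are assumed here for illustration: $J$ is the task set, $P$ the edge device set, $I_j$ the importance of task $j$, $u_{jp}$ the binary allocation variable, $t_j$ and $r_j$ the execution time and resource consumption of task $j$, $T$ the time limit, and $R_p$ the resource capacity of device $p$. The assignment constraint is written as an inequality, which is also an assumption: it lets a task remain unallocated when the limits cannot accommodate it, which is the reading under which the objective is not constant.

```latex
\begin{aligned}
\max_{u}\quad & \sum_{j \in J} \sum_{p \in P} I_j \, u_{jp} \\
\text{s.t.}\quad & \sum_{p \in P} u_{jp} \le 1, && \forall j \in J, \\
& \sum_{j \in J} t_j \, u_{jp} \le T, && \forall p \in P, \\
& \sum_{j \in J} r_j \, u_{jp} \le R_p, && \forall p \in P, \\
& u_{jp} \in \{0, 1\}, && \forall j \in J,\ p \in P.
\end{aligned}
```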
We found that the TATIM problem under the execution time and resource limitations is in fact a 0-1 Knapsack problem, which is in general NP-complete.
Theorem 1. The task allocation problem with task importance is a 0-1 multiply-constrained multiple Knapsack problem.
For interested readers, the proof of Theorem 1 can be found in our conference paper [chen2019].
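To make the Knapsack view concrete, the sketch below solves a tiny instance by exhaustive search: each task is either left out or placed on one device, and the allocation with the largest total importance that respects the time and resource limits is kept. Allowing a task to be left out, the function name, and the argument names are assumptions for illustration. The search visits (number of devices + 1) to the power of the number of tasks candidates, which is exactly the cost that motivates a learned allocator.

```python
from itertools import product
from typing import Optional, Sequence, Tuple


def solve_tatim_exhaustive(
    importance: Sequence[float],
    exec_time: Sequence[float],
    resource: Sequence[float],
    time_limit: float,
    capacity: Sequence[float],
) -> Tuple[float, Tuple[Optional[int], ...]]:
    """Return (best total importance, assignment); assignment[j] is a device index or None."""
    n_tasks, n_devices = len(importance), len(capacity)
    best_value: float = 0.0
    best_assignment: Tuple[Optional[int], ...] = (None,) * n_tasks
    for assignment in product([None, *range(n_devices)], repeat=n_tasks):
        busy = [0.0] * n_devices  # execution time accumulated per device
        used = [0.0] * n_devices  # resource consumption accumulated per device
        for j, p in enumerate(assignment):
            if p is not None:
                busy[p] += exec_time[j]
                used[p] += resource[j]
        if any(b > time_limit for b in busy):
            continue
        if any(x > c for x, c in zip(used, capacity)):
            continue
        value = sum(importance[j] for j, p in enumerate(assignment) if p is not None)
        if value > best_value:
            best_value, best_assignment = value, assignment
    return best_value, best_assignment
```

For instance, with importances `[6, 3, 1]`, unit resources, execution times of 2, a time limit of 2, and two devices of capacity 1, the two most important tasks are placed on separate devices and the third is dropped.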
3 Data-driven Approach for Task Allocation
As shown in the previous section, when we introduce the time-varying task importance, task allocation becomes the TATIM problem, which is NP-complete and challenging in two respects.
First, the complexity introduced by task importance is the reason why we adopt a reinforcement learning (RL) model. We leverage data-driven methods in order to reduce the time needed to solve the original NP-complete Knapsack problem. Specifically, in the data-driven RL method, we integrate task importance into the environment modeling of RL.
Second, because the task importance is time-varying, an RL model cannot simply be applied. In the first part, we propose a clustered reinforcement learning (CRL) model that makes decisions based on how observations of the environment relate to those previously seen. In the second part, because the CRL model can encounter quite a few unseen environments, we further propose a Support Vector Machine (SVM) model to predict the task importance and dynamically adjust the CRL model's decisions based on real-time data.
In brief, the reason for using data-driven techniques for TATIM with task importance is that they have shown their effectiveness for complicated problems in time-varying environments, including intelligent logistics [li2018development], Autonomous Mobility-on-Demand systems [iglesias2018data], and human-level game control [mnih2015human]. Basically, data-driven techniques are particularly helpful for solving complicated problems repeatedly with varying parameters, because they not only help to model and reduce the environmental randomness in multi-task scenarios but also help to significantly enhance the computational efficiency due to the fast inference phase when the solution is needed. (Though the training phase may be long, it needs to be conducted only once in advance.)
Formally, given a task set and the corresponding historical feature space , we are to develop a data-driven task allocation scheme with a loss function which maximizes the overall decision performance of the task allocation, i.e.,
3.1 The Clustered Reinforcement Learning (CRL) Model
Next, we consider the proper approach to solving the TATIM problem. First, in the previous section, we have proved that the TATIM problem is in fact a Knapsack problem and therefore NP-complete. RL is widely suggested to efficiently solve such problems [iglesias2018data, mnih2015human]. Second, decisions made by industrial systems can be highly repetitive, thus generating an abundance of training data to support complicated data-driven models. For these two reasons, we apply the well-known RL to solve the TATIM problem.
In general, RL works as follows: at each decision epoch, the agent makes a decision based on the current state of the environment. Once the decision is made, a reward is provided to the agent and the state of the environment is updated for making future decisions. The agent tries to maximize the cumulative rewards over time. With RL, our TATIM problem is optimized in a Markov Decision Process (MDP), which is a five-tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$, where $\mathcal{S}$ denotes the set of states; $\mathcal{A}$ denotes the set of actions; $\mathcal{P}$ denotes the transition probability distribution; $\mathcal{R}$ denotes the reward function and $\gamma$ denotes the discount factor for future rewards. Note that different optimization problems have quite different objectives, constraints, and variables. To adapt RL to our TATIM problem, the different components of RL need to be specially designed. The detailed design of these components in RL and MDP will be discussed next.
Environment-dynamic Task Allocation. However, RL should not be directly applied in our scenario, where the environment is diverse over time and existing RL approaches usually assume a fixed environment.
1) Novel Problem of Environment-dynamic Knapsacks. In TATIM, the task importance is critical for environment modeling and thus also important for RL. As is well known, an agent learns from the rewards that the environment returns for its decisions. If the task importance and the corresponding environment are not close to reality, the decisions made by the agent will lead to poor performance.
However, due to the varying scenarios in MTL, the environment matrix of RL usually changes over time in reality. Recall the previous example: when a self-driving car is on the highway, pedestrians usually do not appear, so the task of pedestrian detection is less important than other tasks. Nevertheless, when driving around a school, pedestrians are particularly frequent, which makes the task of pedestrian detection more important. Therefore, we see that the environment clearly differs across scenarios, especially when the task importance is encoded in the environment of RL. (Even in the same scenario, the environment can change over time, due to the accumulating size of training data and the overwriting of data when storage is insufficient.) Experiments in Section 2.3 also indicate the fluctuation between historical and current task importance.
In this regard, directly leveraging the RL model can easily mismatch the environment and submit less important tasks, which leads to poor decision performance [bai2015information, hu2018synthesize]. We also conduct an experiment to demonstrate the negative impact. It shows a 46.28% reduction in performance when the environment is inaccurate under existing RL.
To this end, we realize that our TATIM problem can be regarded as a novel variant of the Knapsack problem. It is even more challenging than the Multiply-constrained Multiple Knapsack Problem proved in the previous section. This time, additionally, the item value (i.e., task importance) can be changed randomly over time, instead of being fixed in the traditional Knapsack problem.
2) Clustered Approach for Environment Definition. Accordingly, to solve the TATIM problem, we need to learn the current environment. Our idea is that the more similar a historical day is to the day being predicted, the more similar its environment is. Such similarity can be measured by comparing the scenarios and configuration settings, e.g., sensing data, of the day being predicted and the historical days.
The overall process is illustrated in Fig. 6, which consists of two parts, i.e., environment definition and data-driven task allocation. In the figure, different days represent different environments, and the darkness of each color represents the different task importance. Through the analysis of historical data, we establish an environment data set, i.e., historical environment . We define the historical environment as the collection of environment , i.e.,
where denotes the corresponding environment.
Through environment definition, we can find a similar environment using clustering algorithms such as k-Nearest Neighbors
where denotes the sensing data. We can then perform data-driven task allocation based on the clustered environment under the execution time and resource constraints.
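The environment lookup described above can be sketched as a nearest-neighbor search over historical days. In the snippet below, each historical day is a pair of sensing data and the task-importance vector observed on that day, and the estimate for the current day is the average over the k closest days. The Euclidean distance, the averaging step, and all names are assumptions for illustration; the paper's own similarity measure is not reproduced here.

```python
import math
from typing import List, Sequence, Tuple

# (sensing data of the day, task-importance vector observed on that day)
HistoricalDay = Tuple[Sequence[float], Sequence[float]]


def estimate_environment(
    current: Sequence[float], history: Sequence[HistoricalDay], k: int = 3
) -> List[float]:
    """Average the task importance of the k historical days closest to `current`."""
    if not history:
        raise ValueError("history must contain at least one day")
    # Rank historical days by Euclidean distance between sensing vectors.
    nearest = sorted(history, key=lambda day: math.dist(current, day[0]))[:k]
    n_tasks = len(nearest[0][1])
    return [sum(day[1][j] for day in nearest) / len(nearest) for j in range(n_tasks)]


# Assumed toy history: two similar days and one very different day.
history = [
    ([0.0, 0.0], [0.8, 0.2]),
    ([0.1, 0.0], [0.6, 0.4]),
    ([5.0, 5.0], [0.1, 0.9]),
]
estimate = estimate_environment([0.0, 0.1], history, k=2)
```

Here the distant third day is ignored, so the estimate reflects only the two days whose sensing data resemble the current one.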
Clustered Reinforcement Learning for Environment-dynamic Task Allocation. Next, we propose key designs of our approach, i.e., the environment modeling, state space, action space, reward function, and optimization, which should be specified based on our TATIM problem.
1) Environment. A key component in the RL model is the environment, which is everything outside the agent; it changes its state in response to the action of the agent and gives the agent the corresponding rewards. For an RL predictor, the environment can be described as a matrix which serves as a map for the agent, e.g., as in the maze problem. More specifically, one dimension represents the subject types (e.g., neighboring-car detection, traffic-sign detection, and pedestrian detection), and the other represents the available processors (CPU processor, GPU processor, sensors). Each element of the matrix can be viewed as a data-driven task. It is formulated as follows:
where denotes the corresponding task importance and denotes the corresponding processor capacity.
2) State space. The state represents the current task selection of the system. Specifically, the state is defined by a matrix whose elements are each 0 or 1, where 1 indicates that the task is selected and 0 indicates that it is not, which is formulated as follows:
Such a fixed state representation can be conveniently applied as an input to a neural network.
3) Action space. At each point in time, the scheduler may want to select any subset of the tasks. But this requires an action space whose size is exponential in the number of tasks, leading to unbearable computation to learn on the edge. We keep the action space small using a trick: we allow the agent to execute merely one action in each time step. The action space is given by , where = means to conduct the task for the current processor in the current time step. Hence, the action space is defined as follows:
In this way, we can greatly speed up learning while keeping the action space linear in the number of tasks.
4) Reward Function. We craft the reward signal to guide the agent towards desired solutions for our objective: maximize overall task importance. Specifically, we set the reward at each time step to only if the agent reaches the terminal state (i.e., all tasks in the current system are assigned accordingly), where is the set of tasks currently in the system. Otherwise, the reward is set to 0. Hence,
Note that the agent receives no reward for intermediate decisions, which suits our real-world decision objectives.
5) Optimization. With the above key elements, we leverage Deep Q-learning [liu2017hierarchical], where
denotes the adjustable parameter vector of the neural network. It estimates the value of executing an action from a given state . Formally, given the feature space, which consists of the environment and the initial state , we have
Based on the above design, we propose the Clustered Reinforcement Learning (CRL) approach, as shown in Algorithm 1.
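The value-update idea behind Q-learning can be sketched in tabular form. This is a simplified stand-in rather than the Deep Q-learning used in the paper: a dictionary replaces the neural network, and the step size `alpha`, the discount `gamma`, and the state and action labels are illustrative assumptions. The reward is sparse, i.e., nonzero only on the transition into the terminal state.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, next_actions,
             alpha=0.5, gamma=0.9):
    """One tabular Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    # A terminal next state has no available actions, so its value defaults to 0.
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


q = defaultdict(float)
# Transition into the terminal state: the only step that carries a reward.
v1 = q_update(q, "s1", "assign", reward=1.0, next_state="terminal", next_actions=[])
# Intermediate transition: zero reward, value is bootstrapped from the next state.
v0 = q_update(q, "s0", "assign", reward=0.0, next_state="s1", next_actions=["assign"])
print(v1, v0)
```

Even though intermediate steps receive no reward, the terminal reward propagates backwards through the bootstrapped target, which is why a sparse terminal reward can still guide earlier decisions.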
3.2 The Cooperative Learning Model based on CRL
However, the CRL model should not be applied directly. In our scenario, the environment varies over time. Although similar environments can be found among the historical environments through simple clustering methods, there is a risk that the retrieved environment is still not close to the real one. This is especially true for edge devices with little data, where the RL model can encounter many unseen environments and requires many environment observations to cover all possible situations.
In this regard, directly leveraging the CRL model can still mismatch the environment and submit less important tasks, which leads to poor decision performance [hu2018synthesize]. We also conducted an experiment to demonstrate this negative impact: with our CRL model, an inaccurate environment leads to a 28.84% reduction in performance.
A Cooperative Learning Approach. To tackle the challenge, our idea is to leverage runtime data to adjust the decision of the CRL model.
Accordingly, we propose a cooperative learning approach as shown in Fig. 7, which is especially well-suited to solve this problem. The proposed cooperative learning approach contains two components: 1) a CRL predictor with a huge environment definition data, and 2) an SVM predictor with few real-world data. Formally, let and be the feature spaces of the environment definition data, i.e., , and real-world data, respectively. Let denotes our cooperative learning model, which can be represented more specifically as:
where and denote the CRL predictor and SVM predictor; and denote the weight of the corresponding model results, respectively. In addition, the task-allocation matrix u is outputted by our cooperative learning model , i.e., .
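A minimal sketch of the weighted combination described above is given below. It assumes that both predictors output a score matrix of the same shape and that the final 0/1 allocation is obtained by thresholding the weighted sum; the threshold and the function name `cooperative_allocation` are assumptions for illustration and are not taken from the paper.

```python
def cooperative_allocation(crl_scores, svm_scores, w_crl, w_svm, threshold=0.5):
    """Combine two predictors' score matrices by a weighted sum, then binarize."""
    combined = [
        [w_crl * c + w_svm * s for c, s in zip(crl_row, svm_row)]
        for crl_row, svm_row in zip(crl_scores, svm_scores)
    ]
    # Assumed decision rule: allocate where the combined score reaches the threshold.
    return [[1 if v >= threshold else 0 for v in row] for row in combined]


crl = [[1, 0], [0, 1]]   # decision suggested by the CRL predictor
svm = [[1, 1], [0, 0]]   # adjustment suggested by the SVM predictor
u = cooperative_allocation(crl, svm, w_crl=0.75, w_svm=0.25)
print(u)
```

With a larger weight on the CRL predictor, the SVM predictor only overturns a decision when the two weights are closer; tuning the weights controls how strongly runtime data adjusts the CRL decision.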
As for the SVM predictor, we compared several candidate models, namely SVM, AdaBoost, and Random Forest, and selected SVM because it achieved the highest accuracy. Formally, given the feature values of the target tasks, our objective is to develop an SVM predictor that infers the target task allocation u. This can be formulated as follows:
where denotes its parameter vector. (For interested readers, the design of the loss function and the feature engineering can be found in our conference paper [chen2019].)
4 Performance evaluation
In this section, we investigate the performance of DCTA with extensive simulations over industrial operation (e.g., AIOps) scenarios using real-world data obtained from multiple data-driven building management systems.
4.1 Experiment Setup
To generate MTL tasks, we use a real-world building operation dataset released in [zheng2018data], which contains four years of operation data for three high-rise commercial buildings in a metropolis, collected by a major building service provider. The dataset exceeds 1 TB in total. The 50 supported MTL tasks include independent multi-task learning, self-adapted multi-task learning, and clustered multi-task learning based on SVM, AdaBoost, and Random Forest.
Our simulation consists of nine Raspberry Pi (version 3) and one laptop computer as shown in Fig. 8, which are all interconnected via WiFi under a star network topology in an office building. This represents an edge computing environment where the computational capabilities of edge nodes are heterogeneous. The simulation parameters, e.g., the transmission and receiving energy consumption of the Raspberry Pi are both J/bit, the processing speed and energy consumption are s/bit and J/bit, which are based on the settings from [chen2016joint].
4.2 Comparison Baselines and Metrics
Comparison Baselines. We compare the following task allocation methods. The first two are widely suggested non-data-driven methods (e.g., synthetic methods), and the last two are the data-driven methods we propose.
Random Mapping (RM) assigns each task to an edge device chosen with equal probability [chen2016joint]. In other words, tasks are randomly assigned.
Distributed Machine Learning (DML) distributes tasks to multiple nodes, e.g., allocating the training iteration either to edge devices or to the cloud [teerapittayanon2017distributed].
Clustered Reinforcement Learning (CRL) conducts task allocation with our clustered reinforcement learning model.
Data-driven Cooperative Task Allocation (DCTA) leverages an SVM model to adjust the decision of the CRL model.
Evaluation Metrics. We compare the proposed DCTA method with the baselines above using the following metrics.
1) Overall Merit (OM). Given an allocation method, the ability to provide credible overall merit (e.g., energy saving) is crucial to all stakeholders. For interested readers, a more concrete definition of overall merit is available in Section 2.2.
2) Processing Time (PT). The decision must be made before the deadline. The processing time we measure is the time the main device needs to partition the application and receive the decision results. Formally,
$$PT = t_{end} - t_{start},$$
where $t_{end}$ denotes the time instant when the final decision is made and $t_{start}$ denotes the time when each experiment starts.
3) Energy Consumption (EC). Energy consumption is critical for edge devices because most of them are energy-constrained. Formally, the energy consumption is defined as follows:
where and denote the processing and transmission energy consumption of processor , respectively.
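The two measured metrics can be sketched as follows. The processing time is the gap between the experiment start and the final decision. For energy, the sketch assumes that the total is the sum of the processing and transmission energy over all processors; this summation form and the function names are assumptions for illustration.

```python
def processing_time(t_start: float, t_end: float) -> float:
    """Elapsed time between the experiment start and the final decision (seconds)."""
    if t_end < t_start:
        raise ValueError("the decision cannot precede the experiment start")
    return t_end - t_start


def energy_consumption(processing_j, transmission_j) -> float:
    """Total energy in joules, assumed to be the per-processor sum of both components."""
    if len(processing_j) != len(transmission_j):
        raise ValueError("one processing and one transmission value per processor")
    return sum(p + t for p, t in zip(processing_j, transmission_j))


print(processing_time(10.0, 12.5))
print(energy_consumption([1.0, 2.0], [0.5, 0.5]))
```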
4.3 Experiment Results
Result on Processing Time. Fig. 11 shows the processing time as a function of the number of processors. Consistent with our intuition, as the number of processors increases, the processing time of all methods gradually decreases. DCTA outperforms RM, DML, and CRL by as much as 3.24, 2.32, and 2.01 times, respectively, and by 2.70, 2.05, and 1.80 times on average. This is because DCTA leverages data-driven techniques to capture the dynamic task importance and reduces the number of less important prediction tasks to perform.
Then, we compare the processing time of DCTA with that of RM, DML, and CRL for different average input data sizes. As shown in Fig. 11, DCTA always outperforms the other methods in processing time. For example, at an average input data size of 500 Mb, DCTA outperforms RM, DML, and CRL by 2.71, 1.83, and 1.68 times, respectively. This is because DCTA obtains the importance of each task, which changes over time, and then allocates each task to the most suitable edge device for execution.
Finally, Fig. 11 shows the processing time as a function of network bandwidth. Network bandwidth affects the data transmission time, and transmission time is the main component of processing time. Thus, as the network bandwidth increases, the processing time gradually decreases. Notably, DCTA consistently outperforms RM, DML, and CRL, by 2.68, 1.94, and 1.71 times on average, respectively. This is mainly because DCTA leverages data-driven techniques to capture the importance of each task and performs only the most important tasks.
5 Case Study: Chiller AIOps on the Edge
In this section, we apply our DCTA approach to a real-world edge-computing system. We first introduce the background of one core industry AIOps system, i.e., the chiller AIOps system. We then present an overview of DCTA in the chiller AIOps system and briefly introduce the system architecture and the design of its main components. Finally, through extensive real-world experiments, we demonstrate the superiority of our chiller AIOps system integrating the DCTA mechanism.
5.1 Background of Industry AIOps System
An important application of MTL on the edge is AIOps. The term AIOps [lerner18] was coined for systems that utilize big data, machine learning, and other advanced analytics to enhance IT operations, such as monitoring, automation, and service desk, with proactive, customized, and dynamic insight. Data-driven analytics have been widely suggested for IT Operations Management. According to Gartner Inc., by 2022, 40% of all large enterprises will adopt AIOps systems [cappelli18].
The industry AIOps system usually consists of two stages, i.e., Data-driven Multi-task Transfer Learning and Final Optimization, and they work as follows. First, when an industrial demand arrives, AIOps systems need to choose a series of data-driven prediction tasks to conduct, e.g., by using Data-driven Multi-task Transfer Learning. Second, it comes to Final Optimization. In this stage, the AIOps systems receive all the results of previous prediction tasks and conduct decisions until the decision performance, i.e., overall merit, is no longer improved.
Chiller AIOps System. As a case study, we focus on one of the core industry AIOps systems, namely the chiller AIOps system, i.e., an AIOps system that conducts chiller sequencing. It was deployed for one week in May 2019 in a high-rise office building that serves more than three thousand people. A chiller is a machine that generates cooling power in commercial buildings, and chiller sequencing is an important operation that selects the run-time configurations of chillers in real time so that the chiller AIOps system serves the time-varying cooling demand. For example, conducting chiller sequencing in a building with two chillers as [0.5, 0.7] implies that chiller 1 and chiller 2 operate at 50% and 70% of their maximum rated capacity, respectively. Thus, the chiller sequencing operation allocates the cooling load at any given time to the chillers in the most energy-efficient manner, so that the overall cooling demand of the building is satisfied while the electricity consumed by the chillers is kept at a minimum [liuenergy17]. Chiller AIOps systems have been studied recently to improve energy efficiency in commercial buildings, and this case study is based on a real-world chiller operation dataset [zheng2018data].
5.2 Overview of DCTA in Chiller AIOps System
As mentioned before, the efficacy of chiller sequencing control in the chiller AIOps system relies heavily on the run-time performance profile of the chillers, namely the COP under different cooling load regimes. COP is a measure of the energy efficiency of a chiller and captures the cooling power that it can output for a certain input power consumption [wiki19]. Formally,
$$\mathrm{COP}_i = \frac{Q_i}{P_i},$$
where $P_i$ is the electrical power consumed by chiller $i$ to deliver the required amount of cooling load $Q_i$.
The overall cooling load $Q$ that the chiller AIOps system serves at a given time is the sum of the cooling loads over all chillers, i.e., $Q = \sum_i Q_i$, where $Q_i = c \, m_i \, \Delta T_i$. Here, $c$ is the thermal capacity of water (kJ/(kg·°C)), $m_i$ is the chilled water mass flow rate (kg/s), and $\Delta T_i$ is the temperature difference between the returned and supplied chilled water (°C) [liao15]. All these quantities are logged by our chiller AIOps system.
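The two quantities defined above can be computed directly from the logged values. The sketch below is illustrative only: it uses the standard specific heat of water (about 4.186 kJ/(kg·°C)), and the function names, flow rates, and power figure are made-up examples rather than values from the dataset.

```python
import math

WATER_SPECIFIC_HEAT = 4.186  # kJ/(kg*degC), standard value for liquid water


def cooling_load_kw(mass_flow_kg_s: float, delta_t_c: float,
                    specific_heat: float = WATER_SPECIFIC_HEAT) -> float:
    """Cooling load of one chiller: specific heat * mass flow * temperature difference.

    kJ/(kg*degC) * kg/s * degC = kJ/s = kW.
    """
    return specific_heat * mass_flow_kg_s * delta_t_c


def cop(load_kw: float, electrical_power_kw: float) -> float:
    """Coefficient of performance: delivered cooling load per unit of electrical power."""
    if electrical_power_kw <= 0:
        raise ValueError("electrical power must be positive")
    return load_kw / electrical_power_kw


# Two chillers with illustrative flow rates and a 5 degC temperature difference.
loads = [cooling_load_kw(10.0, 5.0), cooling_load_kw(20.0, 5.0)]
total_load = sum(loads)            # overall cooling load served by the system
chiller_1_cop = cop(loads[0], 50.0)
print(total_load, chiller_1_cop)
```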
Reliable chiller sequencing depends on the COP across all the loading conditions for each chiller. However, besides the well-known fact that COP degrades over time [yu08, firdaus16], COP also fluctuates markedly over different cooling loads and environmental conditions [zheng2018data], which makes it exceedingly difficult to capture with an analytical model. Data-driven techniques can thus play a crucial role in accurate COP prediction for improved chiller sequencing in the chiller AIOps system. Specifically, a learning task is defined as the coefficient of performance (COP) prediction of a chiller for one particular operation, and several works have been proposed for chiller AIOps [michopoulos07, powell13, hartman14, zheng2018data]. After the COPs of the operations are predicted, chiller sequencing is conducted by selecting the operation with the highest COP value, so as to meet the cooling demand with the lowest electricity consumption.
Motivation of DCTA in Chiller AIOps. The chiller sequencing process requires the performance to be predicted across all possible operations. Industrial systems have many controllable parameters, so the number of parameter combinations over all possible operations is usually huge. However, the chiller sequencing process is typically subject to time limits, e.g., two hours for chiller sequencing [sun13]. A previous study indicates that blindly conducting all learning tasks leads to considerable time consumption, which easily exceeds the time limits in chiller sequencing [zheng2018data]. When only some of the operations are evaluated, in random order, and these operations fail to meet the cooling demand, the backup chiller plant is launched and consumes a large amount of additional electricity [yu10]. Therefore, we conduct the proposed task allocation, which assigns more important tasks to more powerful edge devices for priority execution under time limits.
Based on the above real-world chiller operation dataset, while in principle all COP operations (i.e., learning tasks) may be selected to conduct the chiller sequencing, in practice only a small subset of them are frequently selected in the optimal sequencing operation. The historical best operations can be computed with the sequencing optimization based on the ground truth of COP over 1460 days from 2012 to 2015. We then count the number of days on which each operation is selected as the best operation and thus obtain its probability of being optimal. For example, if an operation is selected as the best operation on 120 of the 1460 days, its probability of being optimal is 120 / 1460 = 8.22%. Fig. 12 shows that the probability of becoming the best operation for different machines varies greatly across the exponentially large space of candidate operations. Only a small portion of operations are frequently selected. These results also confirm Observation 1 in Section 2.3.
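The counting procedure described above is straightforward to reproduce. The sketch below uses a synthetic history with hypothetical operation labels (`op_a`, `op_b`) chosen so that the first one matches the 120-out-of-1460 example in the text.

```python
from collections import Counter


def optimal_probabilities(best_operation_per_day):
    """Fraction of days on which each operation was the best one."""
    total = len(best_operation_per_day)
    if total == 0:
        return {}
    counts = Counter(best_operation_per_day)
    return {op: n / total for op, n in counts.items()}


# Synthetic history: one label per day over 1460 days (labels are hypothetical).
history = ["op_a"] * 120 + ["op_b"] * 1340
probs = optimal_probabilities(history)
print(f"{probs['op_a']:.2%}")
```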
Task Importance in Chiller AIOps. The key of the DCTA lies in the computation of task importance. Next, in the context of this AIOps case study, we formally present a specified formulation of the task importance computation.
As discussed in previous Definition 1 in Section 2.2, given model parameters , the task importance can be updated using the merit function . For interested readers, a more concrete definition of is also available in Section 2.2, where involved two following important concepts, i.e., the ideal electricity consumption and decision-making function . Specifically, as for , we first find the best operation of each chiller (i.e., with the highest COP value) in each day through the historical ground truth of COP data, and then compute the electricity consumption of conducting these operations as the ideal performance.
Next, the decision-making function is intrinsically solving the chiller sequencing optimization problem finding the best chiller operations combination which minimize the total electricity consumption on one day, where all time instances in one day are denoted by and each operation is conducted at time . Let denote the maximum cooling capacity of chiller and denote the partial load ratio of chiller at time . Formally,
where denotes the data-driven prediction performance of chiller at time ; and respectively denote the cooling load produced by chiller and the total cooling demand; and denote the total processing time and the deadline, respectively. More specifically, the deadline here means the total time length of one chiller sequencing operation, including the computation time and the mechanical switching time, computed considering both the periodic interval and mechanical switching time , e.g., [zheng2018data].
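To make the structure of this optimization concrete, the sketch below solves a heavily simplified, single-time-instant version by brute force. It assumes a discrete grid of partial load ratios, takes the predicted COP as a user-supplied function `cop_fn(chiller_index, plr)`, and omits the deadline constraint entirely. All names and numbers are hypothetical, and this is not the paper's solver.

```python
from itertools import product


def sequence_chillers(capacities, cop_fn, demand,
                      levels=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Pick partial load ratios that meet the demand with minimum electrical power.

    Returns (power, ratios), or None if no combination meets the demand.
    """
    best = None
    for ratios in product(levels, repeat=len(capacities)):
        loads = [r * cap for r, cap in zip(ratios, capacities)]
        if sum(loads) < demand:
            continue  # cooling demand not satisfied
        # Electrical power of each running chiller = cooling load / predicted COP.
        power = sum(load / cop_fn(i, r)
                    for i, (load, r) in enumerate(zip(loads, ratios)) if r > 0)
        if best is None or power < best[0]:
            best = (power, ratios)
    return best


# Two 100 kW chillers with constant (illustrative) COPs of 2.0 and 4.0.
result = sequence_chillers([100.0, 100.0], lambda i, r: [2.0, 4.0][i], demand=100.0)
print(result)
```

The search space grows exponentially with the number of chillers and load levels, which is consistent with the motivation for evaluating only the most important operations under a time limit.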
5.3 Device Overview of Chiller AIOps System
Based on the above, we conduct the data-driven task allocation on the chiller AIOps system in the Pacific Place, Hong Kong, whose network topology is shown in Fig. 15. The chillers, pumps, air-handling units, and cooling towers differ greatly in operation, maintenance, and services. The data of each piece of equipment in the chiller plant (Fig. 13 A.1) are captured and transmitted by 13 edge nodes, including 3 operation nodes (from the vendor Trane, Fig. 13 A.2) that conduct and record operations, and 10 sensing nodes (from the vendor Schneider Electric, Fig. 13 A.2) that collect sensing data. To process data from different types of equipment, we choose a centralized approach, where the edge nodes transmit data to the controller (from the vendor Wago, Fig. 13 A.3), and the controller is responsible for task allocation and decision making for the edge nodes. Finally, the 3 operation nodes conduct data-driven COP prediction and send control sequences to the devices (Fig. 13 B). The sensing nodes, which have no computation power, are used only to collect data.
Although the hardware could be fully redeployed after introducing data-driven techniques [agyapong14], for scalability we choose an incremental deployment for the chiller AIOps system, with minimal revision to the current HVAC system. That is, we leverage only the current commercial off-the-shelf components and avoid deploying any additional equipment within the HVAC system. The cost of this choice is that we may sacrifice the opportunity to obtain more sensing data and achieve even better prediction performance.
5.4 Components Design within Chiller AIOps System
To apply our DCTA approach to the chiller AIOps system, we also introduce the architecture overview of our chiller AIOps system, as shown in Fig. 15. The architecture contains four main modules: (1) Data Collecting Module collects the data from the surroundings for analysis. Not only the current data but also the historical data are needed to be collected. (2) Data-driven Cooperative Task Allocation (DCTA) Module captures the time-dynamic task importance and allocates tasks with data-driven techniques, which has been introduced in detail in Section 3. (3) Traditional Prediction Module executes the data-driven prediction tasks at the edge nodes and outputs the prediction results. (4) Decision Making Module receives the prediction results from the multiple edge nodes and conducts the optimal decision which is to maximize the overall system merit.
The DCTA module for task allocation lies in the Controller and the design is elaborately introduced in previous Section 3. In the following, we are to briefly introduce the design of other components, i.e., data collecting module, traditional prediction module, and decision making module, within our chiller AIOps system architecture.
Data Collecting Module. This module resides in the Sensing Nodes, e.g., temperature or humidity sensors, which collect data from the surroundings for analysis. These edge nodes share a common data storage problem due to their storage limitations. To tackle this problem, we keep uploading data to a more powerful edge node, e.g., a gateway or server, and overwrite historical data on these edge nodes when the storage is insufficient.
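The upload-then-overwrite policy can be sketched with a bounded buffer. The class name `SensorBuffer`, the capacity, and the upload callback are hypothetical; a real sensing node would replace the callback with its own transport to the gateway or server.

```python
from collections import deque


class SensorBuffer:
    """Bounded local storage that forwards every reading before keeping it locally."""

    def __init__(self, capacity, upload):
        self._buffer = deque(maxlen=capacity)  # oldest entries are dropped when full
        self._upload = upload

    def record(self, reading):
        self._upload(reading)         # forward to the more powerful node first
        self._buffer.append(reading)  # then store locally, overwriting the oldest

    def local_readings(self):
        return list(self._buffer)


uploaded = []  # stands in for the gateway or server
node = SensorBuffer(capacity=3, upload=uploaded.append)
for reading in [21.0, 21.5, 22.0, 22.5]:
    node.record(reading)
print(node.local_readings(), len(uploaded))
```

Uploading before the local append ensures that a reading is never overwritten on the node before it has been handed to the upstream node.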
Traditional Prediction Module. The module lies in the Operation Nodes, e.g., gateway or router, which executes data-driven COP prediction tasks and outputs the prediction results. To ensure the accuracy of each data-driven task in the case of data scarcity on these edge nodes, we apply clustered multi-task learning approach [jacob09]. It learns with training data not only from the target task, but also from other tasks, e.g., cases with similar temporal, meteorological and mechanical conditions.
Decision Making Module. This module resides in the Controller or Operation Nodes, e.g., a server or gateway, and works in an iterative optimization manner. In this setting, the frequency of decision updates is critical to the network resource utilization and energy consumption of the edge nodes. To tackle this problem, we propose an efficient algorithm that determines the frequency of decision updates by analyzing the historical decision data. Specifically, we update the decision each time an industrial demand arrives. To reduce damage to the system, we ensure that the time interval between two decisions is long enough for the system to transition from one steady state to another. Investigating the optimal frequency of decision updates in industrial scenarios is an interesting direction for future work.
5.5 Experiment Results
Result on Overall Merit. With the chiller AIOps system, we first compare the overall merit of our DCTA with that of the other state-of-the-art task allocation methods. Fig. 18 shows that, on one hand, our DCTA approach and other state-of-the-art methods can eventually achieve the same performance; on the other hand, with the same performance, our DCTA approach can greatly reduce the number of tasks performed which means significant savings in time and resources. That is because our DCTA approach is developed combined with the runtime data in the real environment and a huge amount of simulation data. In addition, it leverages the ensemble technique to avoid overfitting in non-linear modeling, which can successfully capture the system local and dynamic performance.
Result on Processing Time. To show the potential for saving time, we compare the processing time of the task allocation methods. As shown in Fig. 18, DCTA outperforms RM, DML, and CRL by 50.2%, 38.6%, and 30.2%, respectively. This is because DCTA uses data-driven allocation to select the most important tasks for prediction, unlike the non-data-driven methods.
Result on Energy Consumption. Fig. 18 compares our DCTA approach with the RM, DML, and CRL methods for different numbers of tasks, in terms of the average energy consumption on edge devices. On average, DCTA outperforms RM, DML, and CRL by 48.4%, 39.6%, and 31.3%, respectively. This is because not all predictions on all operations are necessary. DCTA captures the most important operations and still maintains the advantages of data-driven techniques.
6 Related Work
Task Allocation has been intensively researched in cloud computing systems [biswas2017multi, hong2007adaptive, jiang2015reliable]. Recent years have witnessed great prospects exhibited down to the edge, e.g., from OpenCL (2008) [OpenCL] to AWS IoT Greengrass (2017) [AWS-Greengrass] and Microsoft Azure IoT Edge (2018) [Azure-IoT-Edge]. Under edge computing, existing works on task allocation either 1) partition the machine-learning model and its input, or 2) are conducted according to different objectives.
First, task allocation in many distributed machine learning systems [hsieh17, xing15, li14, teerapittayanon2017distributed] has proven effective in enabling big-data applications deployed on a large number of machines. For example, when allocating tasks for a deep neural network (DNN), Neurosurgeon [kang2017neurosurgeon] identifies a partitioning strategy at a fine-grained layer level between the edge and the cloud. A similar approach presented in [ko2018edge] proposes a design guideline for DNN partitioning based on a layer-wise trade-off study. These methods provide the capability to accelerate the execution of a single data-driven task on the edge.
Second, existing works also consider different objectives for task allocation [gaobin2019, shutong2019]. Examples include reducing the energy consumption of edge device while predefined delay constraint is satisfied [cao2015energy], finding a proper trade-off between the energy consumption and the execution delay [mao2016power], and minimizing the overall application execution cost [sundar2018offloading]. A majority of these works are not designed for machine learning tasks. Nevertheless, though these techniques may consider a multi-task setting, they regard all submitted tasks as equally important, which leads to inefficient resource allocation at a task level when directly applied for MTL.
Different from these works, our study investigates task allocation for multiple machine-learning tasks without knowing task priority. We capture and leverage task importance to accelerate the overall learning process, which sheds some new light on task allocation for MTL on the edge.
Machine learning for Complicated Optimization Problems has been successfully employed especially with time-varying parameters and complicated solutions which are repeatedly conducted [samreen2016daleel, wang2018machine]. Examples include intelligent logistics [li2018development], code optimization [cummins2017end, ogilvie2017minimizing, taylor2017adaptive], task scheduling [wen2014smart, ren2017optimise, chen2018optimizing]. Our cooperative approach is closely related to ensemble learning where multiple models are used to solve an optimization problem. Ensemble learning is shown to be useful when scheduling parallel tasks [emani2015celebrating] and optimizing application memory usage [marco2017improving]. This work is the first attempt in applying ensemble techniques to optimize task allocation of MTL with task importance on the edge.
Industry AIOps. Recent advances in machine learning have been adopted in various business applications for both individuals and enterprises, whereas the industry sector has received relatively less attention, mainly due to the common issue of data scarcity, especially in the past. Nowadays, however, the lowered cost of sensing, computing, and communications has made data-driven techniques that were impractical in the late 1980s eminently practical, e.g., in industrial robots, driverless cars, and, recently, energy-efficient buildings [zheng2019edge]. It is time to reduce costs using data-driven techniques in each industry sector. For example, in building management systems, since the release in 2012 of BLUED [anderson12], a dataset of electricity consumption of buildings from the data analytics community of SIGKDD, various works have demonstrated the need for data analytics in building management systems. Then, in SIGKDD 2016, a data-driven study on energy breakdown in buildings revealed the huge electricity demand [batra16]. Nevertheless, how machine learning can be deployed to guide mechanical operations remains unclear in each industry sector, especially on the edge.
Naturally, there is room for further work and possible improvements. We discuss a few points here.
Data Scarcity on the Edge. For industrial edge-computing applications, data scarcity often exists even though cloud storage can still cooperate for big data. Data scarcity results from 1) the prohibitive cost or inherent difficulty of obtaining proper training samples, and 2) the complexity and uncertainty of the application. First, given privacy concerns, storage limitations, budget, and real-time requirements, part or even all of the data set cannot be stored, transmitted, and processed for edge-computing applications, in contrast to cloud-computing applications. Meanwhile, due to the instability of the sensing devices, data loss also occurs frequently in some environments. Worse still, an industrial application can be complex or highly uncertain, which requires a larger amount of data. For example, many robots for text production, such as search engines or translation programs, have difficulty finding sufficient samples for each context. The reason is that the context of words can cause ambiguities, and there is a huge number of possible contexts. Thus, we believe the data scarcity issue on the edge deserves attention, and we provide an edge-based MTL approach to address it.
Real-time Sensing Data. Real-time sensing data facilitate the learning process by incorporating run-time observations of environmental dynamics. To capture the run-time effect of real-time sensing data, we discuss two learning modes, i.e., the offline and online modes. First, the offline mode divides historical samples into multiple clusters in advance, e.g., using K-means. When real-time sensing data arrive, the system selects the most similar cluster of samples to train and predict. Its drawback is the possibly low prediction accuracy caused by the offline clustering. Second, the online mode prepares the training samples at run time by finding those most similar to the real-time data, e.g., using KNN. This mode achieves a high prediction accuracy but may require extra time to choose the proper training data. In this paper, we adopt the online mode so that our final decision making is more reliable. The additional time overhead can be significantly reduced through our proposed data-driven task allocation mechanism.
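The difference between the two modes can be sketched with plain Euclidean distance. In the offline mode, the clusters and their centroids are assumed to have been computed in advance (e.g., by K-means, which is not implemented here), and selection reduces to a nearest-centroid lookup. In the online mode, the k nearest historical samples are searched at run time. The function names and sample values are hypothetical.

```python
import math


def select_offline(clusters, query):
    """Offline mode: return the samples of the precomputed cluster nearest to the query.

    `clusters` maps a centroid (tuple) to its list of samples.
    """
    centroid = min(clusters, key=lambda c: math.dist(c, query))
    return clusters[centroid]


def select_online(history, query, k):
    """Online mode: return the k historical samples nearest to the query (KNN)."""
    return sorted(history, key=lambda x: math.dist(x, query))[:k]


history = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (6.0, 6.0)]
clusters = {(0.5, 0.5): history[:2], (5.5, 5.5): history[2:]}
query = (0.9, 0.9)

print(select_offline(clusters, query))
print(select_online(history, query, k=2))
```

The offline lookup only compares the query with the centroids, whereas the online search scans the whole history for every query, which reflects the extra run-time cost noted above.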
Multi-task Assumption. In this study, our approach is designed to tackle time-varying environments. We assume that 1) there are multiple related and indivisible machine-learning tasks, 2) there is no strong pre- and post-dependency between tasks, which is also a prerequisite for performing multi-task transfer learning, and 3) not all tasks need to be learned individually from scratch to make the final decision. Thus, cases 1) under single-task settings, 2) under multi-task settings but with sequential dependency between tasks, or 3) under multi-task settings where all tasks must be finished to produce the final result, are beyond the scope of this paper. Extending our approach to those scenarios is an interesting direction for future work.
In this paper, we study task allocation for MTL scenarios on the edge by introducing task importance, and we make the following contributions. First, we reveal that it is important to measure the impact of tasks on decision performance improvement and to quantify task importance. We also observe the long-tail property of task importance, which serves as a key metric to guide task allocation and facilitates resource saving from less important tasks. Second, we show that task allocation with task importance for MTL (TATIM) is a variant of the NP-complete Knapsack problem, where the complicated computation to solve this problem needs to be conducted repeatedly under varying contexts. To solve TATIM with high computational efficiency, we propose a Data-driven Cooperative Task Allocation (DCTA) approach. Third, we conduct trace-driven simulations to evaluate the performance of the proposed DCTA approach. Extensive simulations show that DCTA reduces the processing time by up to 3.24 times compared with the state of the art. Finally, we add a comprehensive real-world case study on AIOps to bridge model and practice, proposing a new architecture and the design of the main components within the AIOps system. Extensive experiments demonstrate the superiority of the AIOps system integrating our DCTA approach, e.g., 48.4% energy saving. We believe that our DCTA approach offers an effective and practical mechanism for reducing the resources required to perform MTL on the edge.
The authors would like to thank Zihan Lin for his valuable discussion and feedback. This work was supported in part by the NSFC under Grant 61722206 and 61761136014 (and 392046569 of NSFC-DFG) and 61520106005, in part by National Key Research & Development (R&D) Plan under grant 2017YFB1001703, in part by the Fundamental Research Funds for the Central Universities under Grant 2017KFKJXX009 and 3004210116, in part by the National Program for Support of Top-notch Young Professionals in National Program for Special Support of Eminent Professionals. Dan Wang’s work was supported in part by RGC GRF PolyU 15210119, CRF C5026-18G, ITF UIM/363, ITF ITS/070/19FP, PolyU 1-ZVPZ, and a Huawei Collaborative Grant. All opinions, findings, conclusions and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agencies.