An Energy-Aware Online Learning Framework for Resource Management in Heterogeneous Platforms

Mobile platforms must satisfy the contradictory requirements of fast response time and minimum energy consumption as a function of dynamically changing applications. To address this need, the systems-on-chip (SoC) at the heart of these devices provide a variety of control knobs, such as the number of active cores and their voltage/frequency levels. Controlling these knobs optimally at runtime is challenging for two reasons. First, the large configuration space prohibits exhaustive solutions. Second, control policies designed offline are at best sub-optimal, since many potential new applications are unknown at design-time. We address these challenges by proposing an online imitation learning approach. Our key idea is to construct an offline policy and adapt it online to new applications to optimize a given metric (e.g., energy). The proposed methodology leverages the supervision enabled by power-performance models learned at runtime. We demonstrate its effectiveness on a commercial mobile platform with 16 diverse benchmarks. Our approach successfully adapts the control policy to an unknown application after executing less than 25% of its instructions.


1. Introduction

Over a billion people use various types of electronic devices including mobile phones, tablets, and personal computers (Statista, [n. d.]). As the processing and sensing capabilities of these devices expand, we see exponential growth in the number and types of applications. Common examples include graphics-intensive games, communication-intensive social media apps, health monitoring, and traditional compute-intensive applications. Delivering the required performance on demand and maximizing the battery life are two common goals independent of the application. However, the runtime configurations that achieve these goals can vary dramatically for different application scenarios. For example, high-performance CPU cores (e.g., big cores) are preferred over low-power cores while running compute-intensive applications. Furthermore, the voltage and frequency levels should be controlled optimally at runtime, since the highest (i.e., most power-hungry) levels are not needed continuously. Therefore, heterogeneous SoCs must orchestrate the utilization of the available resources optimally at runtime as the composition of active applications evolves.

Determining the optimal SoC configuration at runtime is challenging for two reasons. First, the parameter space is prohibitively large to explore at runtime. Even with only two types of cores (e.g., big and little cores), each supporting multiple frequency levels, there are thousands of possible configurations. Second, and more importantly, one cannot find the optimal configuration offline since many potential applications are unknown at design-time. Even if a set of key applications, i.e., key performance indicators (KPIs), is available, we do not know the precise composition of these KPIs and background applications in advance. Furthermore, the type and number of applications grow continuously. Therefore, there is a strong need for approaches that adapt to new applications by learning optimal SoC configurations at runtime.

Existing governors in mobile platforms use simple metrics for runtime power management decisions. For example, the interactive and on-demand governors in Android phones control the operating frequency as a function of the utilization. These policies maximize performance, but they are not energy-efficient. Several power management techniques have recently been proposed to overcome these limitations (Martinez and Ipek, 2009; Park et al., 2017; Pathania et al., 2015). These policies are typically built offline using prior knowledge of a few known applications. Therefore, they may not perform well for new applications encountered at runtime. For example, Figure 1 illustrates an offline policy that performs poorly while running applications outside the training set. Reinforcement learning (RL) (Sutton and Barto, 2018) methods, such as Q-learning, can be employed for online learning. However, RL methods are inefficient at learning the optimal policy for new applications, since they learn via trial and error using weak training signals and require a large number of data samples.

Figure 1. Illustration of the offline policy constructed using a training set that includes BasicMath-Large. The policy decisions for BasicMath-Large are perfect for the entire execution, while the decisions for Kmeans differ from the reference 50% of the time.

In this paper, we propose a novel online imitation learning (Online-IL) approach to learn optimal policies for new applications at runtime. The Online-IL approach leverages an offline control algorithm, constructed at design-time, as the initial policy to effectively bootstrap the learning process. This policy may not perform well on new applications seen at runtime, as demonstrated by Figure 1 and our experimental results. Hence, the proposed Online-IL approach synergistically combines the benefits of power/performance models and machine learning techniques. The power/performance models provide strong supervision via low-overhead and accurate evaluation of candidate configurations. Our machine learning techniques leverage this supervision to efficiently adapt the policy to new applications. In each training epoch, we perform the following four steps: 1) execute the current policy and record the workload metadata, such as the number of instructions and memory accesses; 2) evaluate the candidate configurations using the metadata and analytical power/performance models to determine the best action; 3) update the policy parameters such that the evaluation of the policy at each system state matches the best configuration found via the power/performance models; and 4) if required, update the parameters of the analytical power/performance models based on real measurements. We repeat these four steps until the policy converges.
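The loop below is a minimal Python sketch of these four steps. The callables run_epoch, oracle, update_models, and update_policy are hypothetical placeholders for the platform interface, the model-based search, and the learning components detailed in Section 5; the toy wiring at the end only shows how the pieces connect.

```python
from collections import deque

def online_il(run_epoch, policy, oracle, update_models, update_policy,
              n_epochs, buffer_size=100):
    """Skeleton of the four-step loop: execute, supervise via models, retrain, refine models."""
    buffer = deque()
    for _ in range(n_epochs):
        features, measurement, config = run_epoch(policy)    # 1) run current policy, log metadata
        best = oracle(features, config)                      # 2) best action via power/perf models
        if best != config:
            buffer.append((features, best))
        if len(buffer) >= buffer_size:                       # 3) supervised policy update
            update_policy(list(buffer))
            buffer.clear()
        update_models(features, measurement)                 # 4) refine the analytical models

# Toy wiring with trivial stand-ins for illustration only.
online_il(
    run_epoch=lambda policy: ({"ipc": 1.0}, {"power_w": 1.5, "time_s": 0.01}, policy()),
    policy=lambda: 0,
    oracle=lambda features, cfg: 1,
    update_models=lambda features, measurement: None,
    update_policy=lambda examples: print("retrain on", len(examples), "examples"),
    n_epochs=250,
)
```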

Experimental evaluations on the Odroid-XU3 board show that offline policies achieve almost 100% accuracy with respect to a golden reference while running applications from the training set. However, they perform poorly under new applications seen at runtime. Our proposed Online-IL methodology adapts the offline policy online to new applications within a few seconds.

Contributions: The main contributions of this work are:

  • A novel approach to utilize online power/performance models at runtime to generate training data,

  • A runtime adaptive control algorithm that combines optimal offline policies and online learning,

  • Comprehensive experimental evaluation on a commercial mobile platform using 16 diverse applications.

2. Background and Related Work

Increasing mobile phone usage and performance demands have led to significant research into the power-performance trade-offs of heterogeneous SoCs. Mobile platforms employ heterogeneous SoCs that integrate different types of cores (big/little), GPUs, and application-specific processors. They ship with power management governors, such as on-demand (Pallipadi and Starikovskiy, 2006), that use simple metrics including core utilization to make power management decisions. However, the increasing complexity of heterogeneous SoCs necessitates new algorithms for dynamic resource management.

The majority of prior power management techniques utilize control policies designed offline (Vallina-Rodriguez and Crowcroft, 2012; Pathania et al., 2014). For example, two recent studies propose CPU/GPU frequency selection techniques using decision trees (Pathania et al., 2015; Park et al., 2017). These techniques are designed offline and verified at runtime under gaming applications. Purely offline methodologies are impractical since they cannot adapt to new applications with unknown characteristics. Therefore, a practical dynamic power management technique for mobile platforms should automatically adapt to new applications seen at runtime.

One class of techniques chooses between multiple pre-designed dynamic power management policies by characterizing the incoming applications online (Dhiman et al., 2009; Cochran et al., [n. d.]; Aalsaud et al., 2016; Singh et al., 2019). For example, the technique presented in (Aalsaud et al., 2016) updates the parameters online when a new, uncharacterized application begins. These methods incur the cost of storing multiple control policies and switching among them online. Moreover, the accuracy of these approaches depends critically on the availability of an appropriate policy, which is infeasible given the rapidly growing number of new applications. A recent study updates the policy online with the help of a proportional-integral (PI) controller that estimates the workload behavior (Won et al., 2014). However, this technique controls only one knob, the uncore frequency.

Modeling power and performance with system-level measurements is usually an integral part of dynamic resource management techniques. Multiple studies in the literature discuss power modeling for heterogeneous platforms (Sultan et al., 2014; Brooks et al., 2007; Ye et al., 2000). A comprehensive discussion of power modeling techniques is presented in (Sultan et al., 2014; Brooks et al., 2007). One class of power models relies on parameters obtained through circuit simulations (Ye et al., 2000; Bellosa, 2000). Since typical circuit simulations are slow, obtaining power models through simulation is time-consuming. Another class of techniques uses empirical data to obtain the power models. Since our technique uses empirical data, we discuss this class in more detail. An empirical power modeling technique is proposed in (Bhat et al., 2017), in which the authors collect power data from the platform itself by executing it at different fixed temperatures, voltages, and frequencies. Several researchers have also proposed offline and online performance models for heterogeneous platforms. The authors in (Kadjo et al., 2015; Gupta et al., 2018b) propose machine learning-based performance models that estimate GPU performance as a function of hardware counters. Auto-regressive models are used to predict the performance of heterogeneous processors both offline (Dietrich and Chakraborty, 2014) and online (Dietrich et al., 2010). Adaptive performance models for heterogeneous SoCs are proposed in (Gupta et al., 2018a; Bhat et al., 2018). In these techniques, the authors perform offline feature selection and use Recursive Least Squares (RLS) to learn the model coefficients online. In our proposed work, we first collect power and execution time data for different applications. Since the total power is a function of the dynamic switching capacitance and the leakage current, we model these two quantities as linear combinations of hardware counters. However, these models might be inaccurate for new applications seen at runtime. Therefore, we update these linear models through RLS with the help of the hardware counters and the power values obtained at runtime. We apply a similar technique to estimate the execution time.

Recent studies propose applying reinforcement learning (RL) to dynamic power management to enable online adaptation (Chen and Marculescu, 2015; ul Islam and Lin, 2017; Zhang et al., 2017). For instance, Q-table based approaches are used for energy-aware DVFS scheduling (Chen and Marculescu, 2015; ul Islam and Lin, 2017). These approaches are not scalable for heterogeneous SoCs since the number of state-action pairs increases with the number of cores and supported power states. To address this problem, Zhang et al. (Zhang et al., 2017) propose a deep Q-learning network (DQN) technique for heterogeneous SoCs. A common drawback of all RL methods is their inefficiency in learning the optimal policy for new applications. More specifically, RL techniques require a large number of data samples to converge, since they learn via trial and error using weak training signals.

In this work, we employ the principles of imitation learning (IL) (Schaal, 1999) due to the significant advantages of IL methods over RL: they require less data and time to learn near-optimal policies. IL leverages an optimal sequential decision-making behavior (referred to as the Oracle) and employs supervised learning techniques to mimic this behavior. However, exact imitation learning algorithms suffer from error propagation. To address this problem, advanced IL algorithms use additional training data collected during policy evaluation to recover from mistakes (Ross et al., 2011). This idea has been used to control voltage-frequency islands in a multiprocessor SoC (Kim et al., 2017). In that approach, the control policy is created offline using known applications, and the evaluation is performed for homogeneous SoCs using only simulation studies. Recently, another IL-based dynamic power management technique was proposed in (Mandal et al., 2019) for heterogeneous platforms. In this approach, the authors design the Oracle policy using the applications known at design-time and then use the offline-designed policy at runtime. Therefore, these methodologies cannot improve the control policy for new applications.

In contrast to prior work, we propose a novel low-overhead Online Imitation Learning approach that can learn to control multiple configuration knobs simultaneously as new applications arrive. Our proposed solution utilizes power/performance models to generate strong online supervision which enables us to tailor the policy to new applications seen at runtime. Finally, we evaluate the effectiveness of our solution with experiments on a hardware platform.

s_1, s_2, ..., s_K : Sequence of epochs of an application
s_k : The k-th epoch of an application
C : Set of candidate configurations
c_k : Configuration on which the k-th epoch of an application runs
N_B : Number of active big cores
N_L : Number of active little cores
f_B : Frequency of big cores
f_L : Frequency of little cores
π : A power management policy, which maps s_k to c_k
Table 1. Notations and important parameters
Instructions retired
CPU cycles
Branch mispredictions per core
Level 2 cache misses
Data memory accesses
Non-cache external memory requests
Total little cluster utilization
Total big cluster utilization
Total chip power consumption
Table 2. Performance and power consumption data collected in each epoch.

3. Preliminaries and Problem Setup

We consider a heterogeneous platform with different core types. Each type can contain a single core or multiple cores. For example, the platform may integrate multiple CPU cores while having a single GPU. Given such a platform, our goal is to design a power management policy to control the number of active cores and their respective frequencies. Each possible combination of the number of cores and their frequencies is a unique runtime configuration for the platform. Using this definition, we can represent the set of all possible configurations in the system by C, as summarized in Table 1. For our experimental platform, Odroid-XU3, a configuration consists of four control knobs: the number of active big cores (N_B), the number of active little cores (N_L), the frequency of the big cores (f_B), and the frequency of the little cores (f_L). Therefore, the goal of the power management policy is to determine the tuple (N_B, N_L, f_B, f_L) for each control interval.

The default power management governors used on mobile platforms apply decisions in periodic intervals ranging from 10–100 ms (Pallipadi and Starikovskiy, 2006). These decisions typically involve determining the number of active cores and their respective frequencies. Suppose that we run an application repeatedly at different frequencies. The decision intervals observed in each execution will differ from those of the other runs. This means that we cannot execute the same application at different configurations and collect the optimal configuration for each interval. Thus, periodic intervals impede the construction of optimal Oracles even for known applications.

To facilitate Oracle policy construction, we segment applications into repeatable decision epochs, which are clusters of macro-blocks. To this end, we use the methodology proposed by Gupta et al. (Gupta et al., 2017). We start by inserting PAPI (Mucci et al., 1999) calls between consecutive epochs using the LLVM (Lattner and Adve, 2004) and Clang compiler framework. Specifically, we first find all the available basic blocks in the source code. Then, we instrument all candidate basic blocks in the application. Finally, we remove the PAPI calls from basic blocks at the lower levels of the call graph to prune the number of epochs. As a result of the instrumentation, the number of instructions in each epoch ranges from 10 to 100 million instructions, with a median of about 7.5 million instructions. This translates to about 50–1000 epochs per workload, depending on the total execution time of the workload. The instrumentation enables us to collect the performance counters listed in Table 2 and the power consumption for each repeatable epoch. We note that repeatability here means that the number of instructions executed in a given epoch is always the same regardless of the configuration on which it executes. At the same time, the other performance counters, such as LLC misses and memory accesses, can differ each time the application is executed. Consequently, an application with K decision epochs can be expressed as a sequence of epochs s_1, s_2, ..., s_K, where each epoch consists of a fixed set of macro-blocks (e.g., a while loop). Using these definitions, a policy π is a function that maps a given epoch to one of the supported configurations in C. For instance, π(s_k) = c_k means that epoch s_k should run on configuration c_k under policy π.

The standard approach to obtain a policy is to first collect data with a number of applications and then apply supervised learning methods to train the policy. However, as we show in Figure 1, policies trained offline may not perform well for unseen applications. Therefore, there is a need for methodologies that continue to learn online and adapt to unseen applications. To this end, we first present an approach based on reinforcement learning, since RL is commonly used for online learning. However, RL methods require the design of a reward function and the exploration of a large state space, which can lead to slow convergence of RL policies. To overcome this, we present a novel low-overhead Online Imitation Learning approach that can learn to control multiple configuration knobs simultaneously as new applications arrive online.

4. Canonical Baseline Approach: Reinforcement Learning (RL)

Input: Number of learning epochs (K), reward function (R), exploration probability (ε)
for k = 1, ..., K do
      if rand() < ε then
            Choose c_k randomly from C // Exploration
      else
            Evaluate the Q-value of each candidate configuration as a function of the features of the current epoch (s_k)
            Select the configuration (c_k) with the highest Q-value // Exploitation
      end if
      Obtain the reward (r_k) from the environment for the chosen configuration (c_k)
      Update the Q-function following Equation 1
end for
Algorithm 1 Q-learning Algorithm with ε-Greedy Exploration

Reinforcement learning (RL) is a commonly used framework for solving sequential decision-making problems. In the RL setting, a controller takes an action as a function of the current system state. In our power management problem, we define the state as the hardware counters observed in a given decision epoch. The controller takes an action as a function of these hardware counters and then interacts with the environment. The environment provides feedback (a reward) to the controller about the quality of the action it has taken. The learned utility of actions at states is represented in the form of Q-values; each state-action pair corresponds to a Q-value. Depending on the reward, the controller learns how good or bad the previously taken action was, and the corresponding Q-value is updated. Using the reward and the current state information, the controller updates its parameters so that the actions taken in future states improve. After a sufficient number of observations, the controller learns to take near-optimal actions for a given state. Algorithm 1 delineates the canonical Q-learning algorithm that is popularly employed in the design automation community. There are different methodologies in the literature to implement this learning algorithm, especially regarding how the Q-values are stored. One employs a Q-table to explicitly store the value of each state-action pair, while another uses a function approximator to store the Q-values implicitly.

Table-based approach: A well-known methodology to implement Q-learning is the table-based approach. In this setting, the Q-values represent the value of taking an action in a given state, and the Q-values of all states and corresponding actions are stored in a table. When the controller takes an action, the environment serves the reward using a hand-designed reward function. Subsequently, the controller calculates the Q-value for the current state-action pair and populates the table. It has been proved that the table-based RL approach asymptotically converges to the optimal actions (Sutton and Barto, 2018). Table-based RL is typically used to perform power management for SoCs in the existing literature (Chen and Marculescu, 2015; Ge and Qiu, 2011; Shafik et al., 2015; ul Islam and Lin, 2017).
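As a concrete illustration of the table-based variant, the short Python sketch below runs ε-greedy Q-learning on a toy problem. The state and action counts, the synthetic reward, and the random state transitions are stand-ins for illustration only, not the SoC environment used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 8, 4          # toy sizes: discretized workload states x candidate configs
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def toy_reward(state, action):
    # Synthetic stand-in for the energy-based reward of Equation 2.
    return -abs(action - state % n_actions)

state = 0
for _ in range(5000):
    if rng.random() < epsilon:                   # exploration (Algorithm 1)
        action = int(rng.integers(n_actions))
    else:                                        # exploitation
        action = int(np.argmax(Q[state]))
    r = toy_reward(state, action)
    next_state = int(rng.integers(n_states))     # synthetic state transition
    # Q-learning update of Equation 1
    Q[state, action] += alpha * (r + gamma * Q[next_state].max() - Q[state, action])
    state = next_state

print(np.argmax(Q, axis=1))   # learned greedy action per discretized state
```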

Function approximation approach: Although the table-based RL approach can learn an optimal controller, it scales poorly as the number of states grows beyond a small number. If there are S possible states and A candidate actions, then the Q-table has S × A entries; therefore, the table grows quickly when S is large. Moreover, in the table-based approach, the state has to be discretized into a fixed number of bins when the state space is continuous, as is the case in our problem. If the discretization is coarse, the learned model can suffer from low accuracy. On the other hand, if the discretization is very fine, storing a large table incurs additional overhead. To address these challenges of table-based policy representation, function-approximation-based Q-learning algorithms have been proposed in the literature (Mnih et al., 2015). In this methodology, the Q-values are learned using a function approximator, such as a neural network or regression trees. If a neural network is used as the function approximator, the methodology is often referred to as deep Q-learning (DQN). Deep Q-learning has been used to manage the control knobs of an SoC in prior work (Gupta et al., 2019; Zhang et al., 2018; Tian et al., 2018). We refer the reader to (Mnih et al., 2015) for more details on deep Q-learning.

The power management problem described in Section 3 has a continuous state space and a discrete action space. Therefore, we apply deep Q-learning (Mnih et al., 2015) to obtain a policy π. The Q-values are updated following Equation 1:

Q(s_k, c_k) ← Q(s_k, c_k) + α [ r_k + γ max_{c'} Q(s_{k+1}, c') − Q(s_k, c_k) ]     (1)

where Q(s_k, c_k) is the Q-value for epoch s_k executed on configuration c_k, r_k is the reward, α is the learning rate, and γ is the discount factor.

Reward function: The performance of RL algorithms critically depends on the design of a good reward function. We define the reward function r_k at decision epoch k as follows:

r_k(c_k) = β · E_min,k / (P_k(c_k) · t_k(c_k)) + (1 − β) · t_min,k / t_k(c_k)     (2)

where P_k(c_k) and t_k(c_k) are the power consumption and execution time, respectively, at decision epoch k with configuration c_k. E_min,k is the minimum possible energy consumption at decision epoch k; typically, the energy consumption is minimized when the powersave governor is active. t_min,k is the minimum possible execution time at decision epoch k; the performance governor delivers the minimum execution time. The parameter β can be changed to fine-tune the optimization objective: with β = 0, the reward depends only on the execution-time term, while with β = 1, it depends only on the energy term. In our experiments, we employ β = 1, as this minimizes energy. However, we argue that deep Q-learning based RL is not suitable for controlling heterogeneous processors when compared to our proposed online imitation learning approach. This can be attributed to the following reasons:

  • First, the feedback obtained by the controller from the environment through the reward function is indirect. Therefore, the controller has to sufficiently explore the large state space to be able to learn a near-optimal policy. This can result in a long convergence time for the RL policy, as shown later in Section 6.4. Such a long convergence time is not desirable since the workload composition can change frequently in mobile devices.

  • Second, the accuracy of policy learning in RL critically depends on the reward function. However, designing a good reward function that can achieve the desired convergence is challenging.

In order to overcome these challenges, we propose a novel online imitation learning methodology in the next section.

5. Online Imitation Learning Method

Imitation learning (IL) is a supervised machine learning framework that aims to imitate an Oracle for sequential decision-making problems. The Oracle policy provides expert-level feedback to the policy being learned by IL. In classical IL settings, the Oracle is provided as training data from expert demonstrations (such as humans performing a task in robot learning (Schaal, 1999)). In the case of dynamic management of heterogeneous systems, the Oracle provides the best configuration for a given epoch of the application. The IL-based controller receives the correct control action as feedback when it makes incorrect decisions. In contrast, RL-based controllers can only observe the reward from the environment, which is a weaker form of supervision. It has been shown that IL policies converge exponentially faster than RL policies (Sun et al., 2017). Due to this advantage, we choose IL to train our resource management policies.

Construction of an Oracle is one of the most important steps in training an IL policy. In an offline setting, this can be achieved by collecting characterization data and using advanced algorithms, such as dynamic programming, to obtain the Oracle (Mandal et al., 2019). In the literature, there are techniques to construct an offline policy using IL for dynamic management of SoCs (Kim et al., 2017; Mandal et al., 2019). However, constructing an Oracle policy is non-trivial in an online learning setting for two reasons. First, Oracle construction needs a large amount of data, which is not readily available in an online setting. Second, the methodology to construct the Oracle policy may involve expensive computation, which can be prohibitive at runtime. Therefore, there is a strong need for methodologies that can construct high-quality Oracle policies at runtime with minimal overhead. To address this challenge, we propose a four-step methodology for online imitation learning, as depicted in Figure 2. These steps are outlined below and detailed in the following sub-sections.

1. Construct Control Policy Offline (Section 5.1) – The first step of the proposed methodology is to construct an offline control policy using the set of training applications available at design-time. Prior approaches use similar policies constructed offline to make runtime power management decisions (Park et al., 2017; Gupta et al., 2017). However, they do not learn the characteristics of new applications online and adapt the policy at runtime. In contrast, the proposed approach uses this offline policy as a starting point and employs online imitation learning to adapt it to new applications.

2. Online Execution and Power-Performance Modeling (Section 5.2) – At runtime, we use the most up-to-date policy trained until that point to make power management decisions. Meanwhile, we continuously collect the workload metadata summarized in Table 2. Subsequently, we employ this data to maintain accurate, runtime-adaptive power-performance models with negligible overhead.

3. Online Oracle Policy Construction (Section 5.3) – The fundamental challenge in online learning is to construct a good Oracle (or reference) policy that can provide strong supervision needed to update the policy and quickly learn an optimal policy for new applications. We overcome this challenge by using our power-performance models maintained online, as shown in Figure 2. More specifically, we use these models at the end of each control interval to determine the best action with less than 0.1% runtime overhead. This retrospective view enables us to compare this action against the actual policy decision. If they do not match, we record the metadata and optimal decision as future training data to update the policy.

4. Online Training (Section 5.4) – After a set of metadata and corresponding optimal configuration are collected, we use them as supervised training examples to incrementally re-train the policy using imitation learning. Hence, the policy adapts to new applications via supervision from the oracle constructed online.

Figure 2. Overview of the proposed framework.

5.1. Offline Policy Construction

We can employ any existing method to construct a control policy offline using the applications known at design time. Without loss of generality, we use imitation learning for this purpose. Segmenting the applications into repeatable epochs, i.e., microbenchmarks, enables us to execute each epoch of an application at each supported configuration. Hence, we collect power consumption and performance data for each epoch-configuration pair at design-time. This data allows us to determine the configuration that optimizes a given metric (e.g., energy) for each epoch known at design-time. More specifically, we use this data to construct an Oracle policy that maps each epoch to the configuration that minimizes the energy consumption.

The Oracle cannot be used directly at runtime due to its excessive storage (more than 1 MB) and computational requirements. Therefore, we need to construct a policy that determines the best power management decisions for each epoch s_k. In other words, we want the policy to mimic the behavior of the Oracle policy. To this end, we apply imitation learning to first construct an offline policy using the training data for all epochs and the corresponding labels for the best configuration obtained from the Oracle policy. The features for each epoch are the performance counters and power consumption values listed in Table 2. We use a neural network to approximate the Oracle using this training data. The advantages of neural networks include the ability to learn complex decision functions and to perform online updates via backpropagation. After learning a policy using exact imitation, we further improve its robustness and accuracy using state-of-the-art IL techniques (Ross et al., 2011) to closely approximate the behavior of the Oracle policy. At the end of this design-time step, we obtain the offline policy in the form of four functions that predict the configuration of the knobs N_B, N_L, f_B, and f_L.
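One possible realization of these four per-knob functions is sketched below with Keras, using the hyperparameters of Table 4. The assumed number of levels per knob (taken from the characterization sweep in Section 6.1) and the use of the sparse variant of the categorical cross-entropy loss for integer labels are illustrative choices, not the exact setup of the paper.

```python
import tensorflow as tf

# Assumed level counts per knob; the feature width is inferred from the training data.
KNOB_LEVELS = {"n_big": 4, "n_little": 4, "f_big": 8, "f_little": 5}

policies = {
    name: tf.keras.Sequential([
        tf.keras.layers.Dense(20, activation="relu"),   # two hidden layers of 20 (Table 4)
        tf.keras.layers.Dense(20, activation="relu"),
        tf.keras.layers.Dense(levels, activation="softmax"),
    ])
    for name, levels in KNOB_LEVELS.items()
}
for model in policies.values():
    # Sparse categorical cross-entropy is the integer-label form of the loss in Table 4.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="sparse_categorical_crossentropy")

# Exact imitation then fits each classifier to the Oracle labels, e.g.:
# policies["f_big"].fit(features, oracle_f_big_levels, batch_size=150, epochs=500)
```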

5.2. Online Execution and Power-Performance Modeling

At runtime, we start with the offline policy π created at design time. Subsequently, we run the incoming applications using the most up-to-date policy obtained so far, as shown in Figure 2. Our online learning objective is to update the parameters of the policy to minimize the overall energy consumption. Meeting this goal requires reliable supervision, but constructing an Oracle similar to the one presented in Section 5.1 is not practical at runtime. Therefore, we also collect the power consumption and performance data listed in Table 2 and maintain accurate analytical power-performance models during regular execution at runtime. Then, we use these models to achieve the online learning objective.

Figure 3. Illustration of the online recursive least square algorithm to update the analytical models of power consumption and execution time.
Input: Features listed in Table 2 (x_k), measured power/performance (y_k)
for each epoch s_k do
      Obtain the weights computed at the end of the previous epoch (w_{k−1})
      Estimate the power/performance through a linear model: ŷ_k = w_{k−1} · x_k
      Calculate the error between the measured and estimated values: e_k = y_k − ŷ_k
      Apply the RLS weight update to obtain the weights (w_k) for the next epoch
end for
Algorithm 2 Online power/performance modeling

We begin with the power consumption and execution time models built offline with the help of the same applications used for the design-time Oracle construction. For example, the total power in epoch s_k is expressed as:

P_k = C_dyn,k · V_k² · f_k + V_k · I_leak,k     (3)

where the dynamic switching capacitance C_dyn,k and the leakage current I_leak,k are modeled as functions of the performance counters (x_k) shown in Table 2, V_k is the supply voltage, and f_k is the operating frequency. We employ a linear model based on these counters and a weight vector w, as depicted in Figure 3. Note that our data also includes the measured power consumption enabled by the on-board current sensors (Hardkernel, 2014). At the end of each epoch, we employ our model to estimate the power consumption in the previous interval. Subsequently, we compare it to the actual measurement, as illustrated in Figure 3 and Algorithm 2. We feed this error to the Recursive Least Squares (RLS) (Mendel, 1995) algorithm to update the parameters of the power consumption model. Similarly, we maintain an execution time model as a function of the performance counters. In this case, the reference is simply the length of the epoch, which can be tracked easily at runtime.
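A self-contained sketch of the RLS recursion used for these updates is shown below. The forgetting factor and the initialization constant are illustrative choices, and the synthetic data at the end only demonstrates that the recursion tracks a linear target.

```python
import numpy as np

class RecursiveLeastSquares:
    """Online linear model y ~ w.x with the standard RLS recursion."""
    def __init__(self, n_features, forgetting=0.99, delta=100.0):
        self.w = np.zeros(n_features)
        self.P = delta * np.eye(n_features)   # inverse correlation matrix estimate
        self.lam = forgetting

    def predict(self, x):
        return float(self.w @ x)

    def update(self, x, y):
        # Gain vector, prediction error, and the weight/covariance updates.
        Px = self.P @ x
        k = Px / (self.lam + x @ Px)
        err = y - self.w @ x
        self.w = self.w + k * err
        self.P = (self.P - np.outer(k, Px)) / self.lam
        return err

# Demonstration on synthetic counter data drawn from a known linear power model.
rng = np.random.default_rng(0)
true_w = np.array([0.5, 1.2, 0.1])
rls = RecursiveLeastSquares(n_features=3)
for _ in range(200):
    x = rng.random(3)
    rls.update(x, true_w @ x + 0.01 * rng.standard_normal())
print(rls.w)   # approaches true_w
```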

In summary, we maintain accurate power-performance models by using the data available at the end of each epoch. These models are used to generate the strong supervision required by our online learning algorithm, as described in the following section. We also note that updating the power and performance models has less than 0.1% overhead with respect to the execution time of an epoch.

5.3. Online Oracle Construction

In general, providing online supervision in terms of the best configuration to update the policy is a challenging problem. We overcome this challenge by leveraging the power and performance models described in the previous section. At the end of each epoch s_k, we have the metadata listed in Table 2 and up-to-date power-performance models. These models and inputs enable us to evaluate the energy of candidate configurations in C other than the one chosen by the current policy. Specifically, we use the hardware counters measured on the current configuration to estimate the power and performance of other configurations. In general, the hardware counters do not remain the same across configurations. Indeed, predicting the change in the counters is a complex problem that involves modeling the system dynamics and the state transition probabilities, which could add overhead to the Oracle estimation step of our approach. Therefore, to avoid this overhead, we reuse the hardware counter values observed at runtime on the current configuration to estimate the power consumption of the other configurations. After obtaining the energy consumption values of the candidate configurations in C, we mark the configuration consuming the least energy as the optimal configuration (c*_k).

We note that an exhaustive search of the whole configuration space is too expensive. For example, the Samsung Exynos 5422 processor supports up to 4940 different configurations. Our implementation shows that evaluating each configuration can take up to 5% of the execution time of an epoch, which is prohibitive at runtime. To overcome this difficulty, we exploit the fact that the current policy π is reasonably accurate and makes decisions that are not too far from those made by the Oracle policy. Therefore, instead of searching the whole configuration space, we search within a local neighborhood of the configuration predicted by the current policy, as described in Algorithm 3.

Resource-bounded Online Oracle Construction: We start the local search in the neighborhood of the policy decision by setting a maximum allowable time limit for the search, such as 5 µs. We initialize the optimal configuration c*_k to the current policy decision π(s_k) and the optimal cost to the cost of this configuration (line 3 in Algorithm 3). After this step, we search around the configuration chosen by the current policy. We first evaluate the metrics obtained by the configurations that are one step away from the policy decision. Specifically, we obtain the configurations to be evaluated by changing the frequency of either cluster by one step (i.e., ±200 MHz) or changing the number of active cores of either cluster by one (i.e., ±1). After evaluating all configurations that are one step away, we find the configuration with the minimum cost among them, as shown in Algorithm 3. We update the optimal configuration c*_k whenever a configuration in the neighborhood search outperforms the best configuration found so far (line 10 in Algorithm 3). If the configuration chosen by the current policy differs from c*_k, then we store c*_k along with the features of the system state in a buffer, as illustrated in Figure 2. We continue the search in this way until the time limit is exceeded. The data in this buffer serves as the supervised training data from the Oracle, which is used for online learning, as detailed in the next section. This flexible methodology can easily be extended to different objectives. For example, we can perform a gradient-based search in the direction of increasing objective value if the goal is to maximize performance.
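Algorithm 3 below gives the pseudocode; the snippet that follows is a rough Python rendering of the same time-bounded greedy search. The knob ranges, the toy cost function, and the millisecond time budget are illustrative assumptions, since the real search uses the online power-performance models as its cost.

```python
import time

# Assumed knob ranges (from the Section 6.1 sweep).
BIG_FREQS = set(range(600, 2001, 200))       # MHz
LITTLE_FREQS = set(range(600, 1401, 200))    # MHz
CORES = {1, 2, 3, 4}

def neighbors(cfg):
    """Configurations one step away: +/-1 core or +/-200 MHz on either cluster."""
    n_big, n_little, f_big, f_little = cfg
    cands = []
    for d in (-1, 1):
        cands += [(n_big + d, n_little, f_big, f_little),
                  (n_big, n_little + d, f_big, f_little),
                  (n_big, n_little, f_big + 200 * d, f_little),
                  (n_big, n_little, f_big, f_little + 200 * d)]
    return [c for c in cands
            if c[0] in CORES and c[1] in CORES and c[2] in BIG_FREQS and c[3] in LITTLE_FREQS]

def online_oracle(policy_cfg, cost, timeout_s=0.001):
    """Greedy local search around the policy decision, bounded by a time budget."""
    best_cfg, best_cost = policy_cfg, cost(policy_cfg)
    current = policy_cfg
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        step = neighbors(current)
        if not step:
            break
        current = min(step, key=cost)            # greedy step to the cheapest neighbor
        if cost(current) < best_cost:
            best_cfg, best_cost = current, cost(current)
    return best_cfg

# Toy example: a cost that prefers few cores and mid-range frequencies.
toy_cost = lambda c: 50 * (c[0] + c[1]) + abs(c[2] - 1400) + abs(c[3] - 1000)
print(online_oracle((4, 4, 2000, 1400), toy_cost))
```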

1 Input: Power and performance models; predicted configuration c_k = π(s_k); timeout period
2 Output: Oracle configuration (c*_k) at decision epoch s_k
3 Initialize: c*_k ← c_k, J* ← Cost(c_k), c ← c_k
4 while !Timeout do
5       N ← configurations with control-knob values one step away from c (e.g., f_B ± 200 MHz, N_B ± 1)
6       Evaluate the cost of the configurations in N
7       c ← configuration with minimum cost in N // Greedy search step
8       J ← Cost(c)
9       if J < J* then
10            c*_k ← c
11            J* ← J
12       end if
13 end while
Algorithm 3 Online Estimation of Oracle Configuration
Input: Current policy π, online Oracle via power-performance models
for each epoch s_k do
      if π(s_k) ≠ c*_k then
            Record the hardware counters and the Oracle configuration c*_k as a new training example in the buffer
      end if
      if the buffer is full then
            Perform online training to update the parameters of the policy (i.e., the neural network) via the backpropagation algorithm using the new training examples
            return the new policy π
      end if
end for
Algorithm 4 Policy Update via Online Training

5.4. Policy Update via Online Training

The final step in the proposed online-IL methodology is to incrementally update the parameters of the policy as a function of the new, unseen applications and the new training examples. The online training method used to update the policy is shown in Algorithm 4. As described in the previous section, the training examples generated by the online Oracle (i.e., the system-state features and c*_k) are stored in a buffer. The size of this buffer is important since it determines the training accuracy and the implementation overhead. Our experimental evaluations show that the metadata and labels for 100 epochs easily provide close to 100% accuracy in adapting to new applications. Furthermore, the corresponding storage overhead is less than 20 KB. Therefore, we use a buffer that stores 100 Oracle entries in our experiments. We aggregate the training examples until the buffer is full. Subsequently, we update the parameters of the policy, which is represented as a neural network, using this training data and the backpropagation algorithm (Hecht-Nielsen, 1992). Once the online training procedure is completed, we replace the current policy with the new policy and continue with the next iteration of the online-IL approach.
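The buffer-and-retrain logic of Algorithm 4 can be sketched as follows for a single knob. The feature width, the number of levels, and the synthetic usage at the end are assumptions; the buffer size, batch size, and epoch count follow Table 4.

```python
import numpy as np
import tensorflow as tf

N_FEATURES, N_LEVELS, BUFFER_SIZE = 9, 8, 100    # feature width and level count are assumed

# One per-knob classifier as in Section 5.1 (hyperparameters from Table 4).
policy = tf.keras.Sequential([
    tf.keras.layers.Dense(20, activation="relu"),
    tf.keras.layers.Dense(20, activation="relu"),
    tf.keras.layers.Dense(N_LEVELS, activation="softmax"),
])
policy.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
               loss="sparse_categorical_crossentropy")

buffer_x, buffer_y = [], []

def observe(features, policy_decision, oracle_decision):
    """Record a training example on disagreement; retrain incrementally when the buffer fills."""
    if policy_decision != oracle_decision:
        buffer_x.append(features)
        buffer_y.append(oracle_decision)
    if len(buffer_x) >= BUFFER_SIZE:
        policy.fit(np.array(buffer_x), np.array(buffer_y),
                   batch_size=20, epochs=20, verbose=0)   # online settings from Table 4
        buffer_x.clear()
        buffer_y.clear()

# Synthetic usage: pretend the online Oracle always prefers level 3 for random features.
rng = np.random.default_rng(0)
for _ in range(150):
    observe(rng.random(N_FEATURES), int(rng.integers(N_LEVELS)), 3)
```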

6. Experimental Evaluation

6.1. Experimental Setup

Platform: We implement the proposed online IL algorithm on the Odroid-XU3 (Hardkernel, 2014) board running the Ubuntu 15.04 OS. The board integrates a Samsung Exynos 5422 SoC that consists of an A15 (big) cluster with four cores, an A7 (little) cluster with four cores, an ARM Mali GPU, and other components. To facilitate power measurements, the board also provides sensors that measure the power consumption of the big cores, little cores, main memory, and GPU. We sample these sensors every 50 ms to record the power consumption of each component. These measurements are used in the online learning algorithm to update the power/performance models and to evaluate the performance of the proposed online learning algorithm. The constructed policies are implemented as user-space governors on the Odroid-XU3 board. Specifically, the frequency of operation is specified in a sysfs entry that is read by the kernel to set the frequency. The numbers of cores in the little and big clusters are set using the dynamic hotplugging feature of the Linux kernel.
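The sketch below illustrates how such a user-space governor could actuate a configuration through the standard Linux cpufreq and hotplug sysfs entries. It assumes the usual Exynos 5422 numbering (cpu0-3: little A7 cluster, cpu4-7: big A15 cluster), that the "userspace" cpufreq governor is selected, and that the process has root privileges; it is not the exact implementation used in the paper.

```python
# Frequencies written to scaling_setspeed are in kHz (requires the "userspace" governor).
CPUFREQ = "/sys/devices/system/cpu/cpu{cpu}/cpufreq/scaling_setspeed"
HOTPLUG = "/sys/devices/system/cpu/cpu{cpu}/online"

def write_sysfs(path, value):
    with open(path, "w") as f:
        f.write(str(value))

def apply_configuration(n_big, n_little, f_big_khz, f_little_khz):
    # Hot-plug cores so that exactly n_big big and n_little little cores remain online
    # (cpu0 is kept online, as it typically cannot be hot-plugged).
    for idx, cpu in enumerate(range(4, 8)):          # big cluster
        write_sysfs(HOTPLUG.format(cpu=cpu), 1 if idx < n_big else 0)
    for idx, cpu in enumerate(range(1, 4)):          # little cluster
        write_sysfs(HOTPLUG.format(cpu=cpu), 1 if idx < n_little - 1 else 0)
    # Set the per-cluster frequencies through one online core of each cluster.
    write_sysfs(CPUFREQ.format(cpu=4), f_big_khz)
    write_sysfs(CPUFREQ.format(cpu=0), f_little_khz)

# Example: 2 big cores at 1.4 GHz and 4 little cores at 1.0 GHz (requires the target board).
# apply_configuration(2, 4, 1_400_000, 1_000_000)
```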

Benchmarks: We evaluate the proposed methodology on a total of 16 applications from Mibench (Guthaus et al., 2001), Cortex (Thomas et al., 2014), and PARSEC (Bienia et al., 2008) benchmark suites. Applications from these benchmark suites exhibit a wide range of characteristics, such as compute intensiveness, memory intensiveness, and parallelism. Therefore, these applications represent a broad group of applications commonly executed in mobile platforms. We divide the benchmarks into two sets, as shown in Table 3. At any time, one of the sets is used for the offline learning phase while the other set is reserved for online learning.

Set 1: BML, Dijkstra, FFT, Patricia, Qsort, SHA, Blowfish, String Search, ADPCM
Set 2: AES, Kmeans, Spectral, Motion Estimation, PCA, Blackscholes-2T, Blackscholes-4T
Table 3. List of applications instrumented from different benchmark suites.

Data Collection: Construction of an Oracle for the offline policy requires characterizing the applications while running at different configurations. Therefore, we perform extensive data collection on the Odroid-XU3 platform while running the benchmarks. Specifically, we sweep the core configuration from 1 big–1 little to 4 big–4 little. Within each core configuration, we change the frequency of the big cores from 600 MHz to 2.0 GHz and the frequency of the little cores from 600 MHz to 1.4 GHz in steps of 200 MHz. We do not go below 600 MHz, as lower frequencies do not provide better energy efficiency (Aalsaud et al., 2016). This characterization data is used to construct the Oracle policies in the offline training phase.
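For reference, the characterization grid described above can be enumerated directly. The snippet only reproduces the sweep ranges stated here, which cover a subset of all frequency levels supported by the SoC (Section 5.3).

```python
import itertools

big_cores = range(1, 5)                 # 1-4 active big cores
little_cores = range(1, 5)              # 1-4 active little cores
big_freqs = range(600, 2001, 200)       # 600 MHz to 2.0 GHz in 200 MHz steps
little_freqs = range(600, 1401, 200)    # 600 MHz to 1.4 GHz in 200 MHz steps

sweep = list(itertools.product(big_cores, little_cores, big_freqs, little_freqs))
print(len(sweep))    # number of (N_B, N_L, f_B, f_L) points characterized offline
```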

Model hyperparameters: number of hidden layers: 2; neurons per hidden layer: 20; activation: ReLU; optimizer: Adam; learning rate: 0.001; loss function: categorical cross-entropy.
Training parameters for offline learning: batch size: 150; epochs: 500.
Training parameters for online learning: buffer size: 100; batch size: 20; epochs: 20.
Table 4. Neural network and training parameters.

6.2. Evaluation of the Policy Trained Offline

To obtain a baseline, we first design the initial offline policy using only the applications known at design-time. We employ IL to construct the policy using a neural network with two hidden layers. The parameters of the policy and the hyperparameters used in training are listed in Table 4. We adopt a neural network implementation over other regression techniques, such as linear regression and regression trees, since neural networks allow us to learn non-linear decision functions and facilitate online updates through the backpropagation algorithm (Hecht-Nielsen, 1992). We use a smaller number of training epochs and a smaller batch size for online learning because the policy updates at runtime are incremental, with a relatively small number of training examples. We observe that a training buffer size of 100 and 20 epochs with a batch size of 20 are sufficient to update the policy.

We evaluate the accuracy of the policy for each configuration knob by measuring the distance between the policy decision and a golden reference, which is constructed to evaluate the proposed policies. If the total number of available levels for a configuration knob is L, the level chosen by the policy is l_p, and the reference level is l_r, then the accuracy can be expressed as:

Accuracy = (1 − |l_p − l_r| / L) × 100%     (4)
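A small helper that mirrors this metric is shown below; the normalization by the number of levels is the assumption used here.

```python
def knob_accuracy(policy_level, reference_level, num_levels):
    """Distance-based accuracy: 100% when the policy matches the golden reference."""
    return 100.0 * (1.0 - abs(policy_level - reference_level) / num_levels)

# Example for a knob with 8 levels (e.g., the big-core frequency).
print(knob_accuracy(policy_level=5, reference_level=5, num_levels=8))   # 100.0
print(knob_accuracy(policy_level=3, reference_level=5, num_levels=8))   # 75.0
```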
Figure 4. Accuracy comparison for all knobs for applications from Mibench, Cortex and PARSEC benchmark suites when the policy is trained with only Mibench applications.
Figure 5. Accuracy comparison for all knobs for applications from Mibench, Cortex and PARSEC benchmark suites when the policy is trained with Cortex and PARSEC applications.

After finding the accuracy for each epoch, we take the average across all epochs of an application to compute the overall accuracy for each control knob. We note that 100% accuracy denotes that the decisions of the underlying policy match the golden reference, and no policy can perform better than that. We first train the policy offline using applications from the Mibench suite. Then, we evaluate the trained policy with applications from all three benchmark suites, as shown in Figure 4. Since the Mibench applications are included in the training set for this experiment, the offline policy always reaches more than 95% accuracy with respect to the Oracle for all four configuration knobs. However, the accuracy is significantly lower for the applications from the Cortex suite, which are not included in the training set. For example, MotionEstimation shows only 77% and 53% accuracy for the big and little core frequencies, respectively. Similarly, the accuracies for the Blackscholes application from the PARSEC suite are lower than 80% for the number of big cores and the frequency of the little cores. This shows that the offline policy is not good enough to capture the characteristics of applications not seen at design-time.

We also construct an offline policy using only the applications from the Cortex and PARSEC suites. This time, the offline policy performs well when running applications from these two benchmark suites, but it shows poor performance for the Mibench suite, as shown in Figure 5. More precisely, we observe above 95% accuracy for all knobs when running applications from the Cortex and PARSEC suites. In contrast, the accuracy for applications from the Mibench suite is significantly lower. For example, Qsort has only 67.5% accuracy for the number of big cores. The degradation of accuracy on unseen applications demonstrates the need for online learning.

Next, we compare the performance of the IL and RL policies that are trained offline. Figure 6 shows the accuracy of the offline policies trained with IL and RL, respectively. In this case, we construct two offline policies, one with IL and one with RL, using all 9 applications from the Mibench suite as the training set. Then, we test these policies individually on all applications in the Mibench suite. We observe that the IL policy achieves close to 100% accuracy for all control knobs and all applications, i.e., the IL policy achieves performance similar to that of the Oracle policy. However, the offline RL policy has much lower accuracy than the IL policy. For example, for the Qsort application, the accuracy is 17% for the number of big cores and 33% for the number of little cores. This shows that even in the offline setting, the IL policy achieves superior performance compared to the RL policy. Moreover, the offline RL policy requires more than 10× longer to converge than the offline IL policy.

Figure 6. Accuracy comparison when the policy is trained with only Mibench applications (Set 1 in Table 3).
Figure 7. Comparison of estimated power for the Kmeans application. When estimating with RLS, the power estimation accuracy increases from 90% to 99% within 0.6 seconds.
Figure 8. Histogram showing (a) the number of L2 misses and (b) the number of non-cache memory requests for a particular epoch of the Spectral application.

6.3. Evaluation of Power-Performance Models

Similar to the offline policy, we construct offline power and performance models using the applications available at design-time. These power and performance models may not be valid for new applications encountered only at runtime if they are not represented in the training data. For example, Figure 7 shows that power models constructed using Mibench applications cannot track the power consumption while running the Kmeans application from the Cortex suite. In contrast, our online power models based on RLS converge to the measured power in about 200 ms, as depicted by the dashed line in Figure 7. Although the Kmeans application is not seen during training, we achieve less than 1% error compared to the reference value. Similarly, the performance models converge to the reference values in 100-300 ms, which amounts to less than 10 epochs. This shows that the proposed technique adapts the offline models to unseen applications at runtime with high efficiency. Overall, the power models have an average error of 2.6%. These models enable us to construct the accurate runtime supervision required for adapting the offline policy. From our experiments, we observe that generating the runtime supervision and updating the power and performance models at each epoch take approximately 8.5 µs. This is negligible compared to the execution time of an epoch, which is on the order of 10 ms.

Accuracy of Power Estimation Across Configurations: We use the performance counters measured at a given configuration to estimate the power consumption at other configurations. Figure 8(a) shows the histogram of the number of L2 misses across configurations for a particular epoch of the Spectral application. Across the applications we consider, the number of L2 misses varies over a wide range. However, for this particular epoch of Spectral, 122 out of the 128 configurations (96%) have L2 miss counts within 0.6% of each other. Therefore, the number of L2 misses is nearly the same for all configurations of that epoch. The histogram of the number of non-cache memory requests is shown in Figure 8(b). Again, the number of non-cache memory requests varies over a wide range across applications, yet for this particular epoch of Spectral, 98% of all configurations have values within 0.8% of each other. We observe similar behavior for other hardware counters and for other applications. For this reason, we can safely take the hardware counters of the current configuration as the features to estimate the power of other configurations. Since we also include the configuration itself as a feature, we obtain different power/performance estimates for different configurations. To validate the accuracy of this power estimation, we compare the estimated power of all configurations with the actual power consumption of each configuration. That is, we use the hardware counters observed for a single configuration to estimate the power consumption of all configurations and compare the estimates with the measured values. For the applications used in online training, we observe that the estimation error is around 10%. This error is acceptable, since the online policy still converges to the optimal configurations for unseen applications.

6.4. Evaluation on Stream of Unseen Applications

In this section, we evaluate the effectiveness of the proposed online-IL methodology on new applications observed only at runtime. We start with a policy that is trained offline with applications from the Mibench suite. Then, we run a random sequence of seven applications from the Cortex and PARSEC suites (AES, Kmeans, Spectral, Motion Estimation, PCA, 2-threaded Blackscholes, and 4-threaded Blackscholes) back to back. Each complete run of this set of applications takes around 33 seconds. We repeat this experiment for five different random sequences.

Figure 9. Comparison of accuracy of (a) number of big cores, (b) number of little cores, (c) big core frequency, and (d) little core frequency for applications in Cortex and PARSEC benchmark suites.

After running the application sequences, we evaluate the effectiveness of the proposed approach on each application individually. To achieve this, we average the accuracy of each control knob over all random sequences. We note that accuracy here measures the distance of the policy decision from the golden reference (Equation 4). Figure 9 compares the accuracy of the offline policy and of the online policy after it converges, for all four configuration knobs over the random sequences. Figure 9(a) and Figure 9(b) show that our offline policy provides good accuracy for the number of cores; the online policy maintains this accuracy without providing significant improvement. In contrast, Figure 9(c) and Figure 9(d) show that the proposed online learning technique achieves a significant increase in the accuracy of the little and big core frequencies. The biggest improvements (20%–47%) are observed for the Kmeans, Spectral, and MotionEstimation applications. In particular, the accuracy of the little core frequency for the MotionEstimation application increases from 53% to 99%.

Figure 10. Comparison between the proposed online IL and reinforcement learning (RL) with application sequence 1 for (a) the frequency of the big cores (f_B) and (b) the frequency of the little cores (f_L).
Figure 11. Comparison between the proposed online IL and reinforcement learning (RL) with application sequence 2 for (a) the frequency of the big cores (f_B) and (b) the frequency of the little cores (f_L).

We also analyze how quickly the online policy converges to the highest accuracy for different application sequences. As stated before, we randomly construct five different application sequences. To illustrate the convergence properties, we pick two representative application sequences. Sequence 1 consists of the Kmeans, AES, Spectral, MotionEstimation, PCA, BlackScholes-2T, and BlackScholes-4T applications run one after another. Figure 10(a) and Figure 10(b) show that the accuracies of the big-core (f_B) and little-core (f_L) frequencies start at 52% and 71%, respectively. Our online-IL technique increases both of these accuracies to 100% within 0.9 seconds, i.e., after running less than 3% of the sequence. Moreover, we also use the offline RL policy to perform online learning for this application sequence. For the experiments with RL, we take the offline RL model trained with applications from the Mibench suite, as described in Section 6.2. From Figure 10(a) and Figure 10(b), we observe that the RL-based online policy does not achieve performance similar to the Oracle policy most of the time. For example, the accuracies of choosing the big-core frequency (f_B) and the little-core frequency (f_L) are 100% for only 16% of the entire execution with the RL-based online learned policy. Moreover, we observe that the online RL policy does not converge to the optimal configurations even after the sequence is run 14 times back to back. In Sequence 2, we run the MotionEstimation, BlackScholes-4T, AES, BlackScholes-2T, Kmeans, Spectral, and PCA applications one after another. Similar to the previous example, our initial policy converges to the optimal policy in only 1.2 seconds, as shown in Figure 11(a) and Figure 11(b). RL-based online training is unable to converge as fast as the online IL policy for this application sequence as well. For both sequences, the initial offline IL model is retrained once before converging. However, the offline RL model is unable to obtain high accuracy even after retraining five times. We also compare the energy consumption of the applications when applying the RL policy and the proposed online IL policy. Figure 12 and Figure 13 show this energy comparison with respect to the Oracle for application sequence 1 and application sequence 2, respectively. For both sequences, we observe that the energy consumption with the proposed online IL policy is always within 2% of the energy consumption with the Oracle policy. However, the energy consumption with the RL policy is 10%–40% higher than that with the Oracle policy. This demonstrates the efficiency of our proposed online IL technique in adapting to new, unseen applications.

Figure 12. Energy comparison with application sequence 1.
Figure 13. Energy comparison with application sequence 2.
Figure 14. Comparison between the proposed online IL and reinforcement learning (RL) with application sequence 1 for (a) the frequency of the big cores and (b) the frequency of the little cores. In this experiment, no pre-trained offline model is used.
Figure 15. Comparison between the proposed online IL and reinforcement learning (RL) with application sequence 2 for (a) the frequency of the big cores and (b) the frequency of the little cores. In this experiment, no pre-trained offline model is used.
Figure 16. Energy comparison with application sequence 1 without offline model.
Figure 17. Energy comparison with application sequence 2 without offline model.

Evaluation of the Proposed Approach without Offline Learning: We also evaluate the proposed online learning framework when a pre-trained offline model is not available. In this case, we start with a neural network model initialized with random weights and update it at runtime with the incoming applications using the proposed approach. Figure 14(a) and Figure 14(b) show how fast the online policy converges for the big and little core frequencies, respectively, when no pre-trained offline model is provided. Even in this case, the online IL policy converges within 1.5 seconds of execution, which is around 5% of the sequence. Similar to the results with the offline policy, the policy learned through RL does not reach high accuracy compared to the proposed online IL methodology. Figure 15(a) and Figure 15(b) show the convergence results for application sequence 2. Here too, the proposed online IL policy converges faster than the RL policy; the RL policy fails to reach optimal performance even after the application sequence is run 5 times. Figure 16 and Figure 17 show the energy comparison of the application sequences with the RL and the proposed online IL methods. In both cases, online IL achieves energy consumption similar to the reference Oracle policy, whereas the energy consumption with the RL policy is always higher.

Figure 18. Performance of online learned policy on initial applications.

Performance of Online Learned Policy on Initial Applications: Forgetting the offline learned policy is a common issue in online learning (Goodfellow et al., 2013). To show that the proposed online IL methodology does not suffer from forgetting, we apply the online learned policy to the applications used for offline training. Figure 18 shows the accuracy of choosing the control knobs for the Mibench applications with the policy obtained after online learning is performed on the applications from the Cortex and PARSEC benchmark suites. The accuracy is close to 100% for all applications and for all control knobs. This shows that the policy obtained after several rounds of online updates does not suffer from catastrophic forgetting.

6.5. Convergence for Single Application

Figure 9(c) and Figure 9(d) show that the Kmeans, Spectral, and MotionEstimation applications benefit the most from the online update of the policy. Therefore, we execute these applications standalone to measure the time each of them requires to converge to the optimal Oracle policy. Figure 19(a) shows that Spectral converges to the optimal policy in 0.7 seconds, which is 6% of its total execution time. Within this time, our online learning methodology aggregates 100 new samples, which are then used to update the policy once. Similarly, Figure 19(b) shows that MotionEstimation requires 1.4 seconds to converge; in this case, 149 new samples are collected as training data to retrain the policy twice. Finally, we note that each online update of the policy incurs 2 ms of execution time. Since this is less than 0.5% of the time interval between successive updates, the runtime overhead is small.
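
The sketch below illustrates the aggregate-then-retrain loop described above. The buffer size of 100 samples matches the text; the model and update interfaces are hypothetical stand-ins for the policy network used in the paper, not its exact implementation.

```python
class OnlinePolicyUpdater:
    """Minimal sketch: collect supervised examples from the online Oracle
    and retrain the policy once the buffer fills (assumed interfaces)."""

    def __init__(self, policy_net, buffer_size=100):
        self.policy_net = policy_net      # neural-network policy (stand-in)
        self.buffer = []                  # (state, oracle_action) pairs
        self.buffer_size = buffer_size

    def observe(self, state, oracle_action):
        """Store one example generated by the online Oracle for this epoch."""
        self.buffer.append((state, oracle_action))
        if len(self.buffer) >= self.buffer_size:
            self._retrain()

    def _retrain(self):
        """One backpropagation pass over the buffered samples (about 2 ms)."""
        states, actions = zip(*self.buffer)
        self.policy_net.fit(states, actions)   # single supervised update
        self.buffer.clear()
```
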

Figure 19. Convergence of the big and little core frequencies for (a) Spectral and (b) MotionEstimation applications.
Figure 20. Analysis of a Policy using only Power-Performance Models.

Analysis of a Policy using only Power-Performance Models: We perform another experiment to specifically highlight the contribution of online-IL in providing an optimal policy. In this experiment, we do not use online-IL to estimate the configuration. Instead, we rely only on the power-performance models: in the current epoch, we search around the configuration found optimal by the search performed in the previous epoch (as in Algorithm 3). Figure 20 compares the accuracy of choosing the big core frequency between the proposed online-IL technique and the case where only power-performance models are used. This experiment is performed for the Spectral application. Initially, the accuracy with only power-performance models is higher than that of the proposed method, since the IL policy has not yet adapted to the application. However, after learning on a sufficient number of samples, the IL policy reaches the highest accuracy. This IL policy uses an online Oracle obtained through a local search around the current configuration.

We note that, without a learned policy, a local search guided only by the power-performance models is not effective. The online learned policy provides a good starting configuration to bootstrap the search: only a small amount of search around the configuration suggested by the policy is sufficient to reach the optimal configuration. In contrast, if only power-performance models are used to take the control decision, the local search must evaluate a significantly larger number of configurations (as much as 3× more than online-IL) to construct an equally good online Oracle, which is not practically feasible. Consequently, a local search guided only by the power-performance models never converges to the optimal policy, as shown in Figure 20. The combination of the learned policy, which provides a good initial configuration, and a small amount of local search guided by the power-performance models, which generates better configurations to supervise and improve the policy, makes our overall approach effective. Once the policy converges to near-optimal decisions, it can be applied with very little search, whereas the baseline (search guided only by the power-performance models) continues to require extensive search and remains impractical.
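
A minimal sketch of this bootstrapped local search is shown below. The neighborhood definition, the policy and energy-model interfaces, and the knob encoding are assumptions for illustration, not the exact Algorithm 3 of the paper.

```python
from itertools import product

def neighborhood(config, radius=1):
    """All configurations within +/- radius steps of `config` in each knob
    (upper knob limits omitted for brevity in this sketch)."""
    ranges = [range(k - radius, k + radius + 1) for k in config]
    return [c for c in product(*ranges) if all(v >= 0 for v in c)]

def online_oracle(state, policy_net, energy_model):
    """Local search around the policy's suggestion (online-IL), instead of an
    unguided search driven only by the power-performance models."""
    start = policy_net.predict(state)        # good starting configuration
    candidates = neighborhood(start)         # small local neighborhood
    # energy_model combines the runtime power-performance models to
    # estimate the energy of each candidate configuration
    return min(candidates, key=lambda c: energy_model.estimate(state, c))
```
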

Figure 21. Comparison of energy consumption and execution time with respect to the DyPO (Gupta et al., 2017) and Powersave governor.

6.6. Comparison with State-of-the-Art Power Management Techniques

The Powersave governor is commonly used in mobile platforms to operate in low-power mode. Therefore, we compare the energy consumption of individual applications under the proposed online-IL approach and under the Powersave governor. We note that these comparisons include energy and execution time overheads, since the evaluations are based on hardware measurements. Similar to the previous sections, we start with an offline policy trained with applications from the Mibench suite. Then, we use the proposed technique to update the initial policy as we run applications from the Cortex and PARSEC suites.

Figure 21 shows the energy consumption and execution time comparisons. We normalize the energy and execution time of all applications with respect to the Powersave governor, since absolute numbers vary widely across applications. The proposed online learning methodology achieves both lower energy consumption and shorter execution time than the Powersave governor for all applications. More specifically, we achieve on average 10% lower energy consumption and 24% faster execution. In particular, we improve the execution time of Kmeans by 42% while also reducing its energy consumption.

The proposed policy achieves 48.4% less energy consumption on average compared to the interactive governor; we omit a detailed comparison since that governor targets performance optimization. We also compare the proposed online learning methodology with a recent offline learning technique, DyPO (Gupta et al., 2017). DyPO first finds the Pareto-optimal configurations for energy-optimal operation and then chooses a configuration among them at runtime; however, it does not perform any online learning. Our proposed methodology achieves on average 6.39% lower energy consumption and 15.24% shorter execution time for unseen applications compared to DyPO.

Furthermore, to validate the robustness of the proposed online IL technique, we apply it while executing the Kmeans and BML applications concurrently (Kmeans-BML), with Kmeans as the foreground application and BML as the background application. The proposed online IL technique results in 5% less energy consumption than the Powersave governor for Kmeans-BML. Moreover, the execution time of Kmeans-BML with the proposed online IL technique is 25% and 22% shorter than with the Powersave governor and DyPO, respectively. This experiment shows that the proposed online IL technique can efficiently manage the platform while executing multiple applications concurrently.

In summary, the proposed online imitation learning methodology efficiently adapts to unseen applications and provides significant improvements over state-of-the-art methods with minimal overhead.

Module         Overhead   Time between successive calls   % Overhead (exe. time)
RLS            6 μs       10–20 ms                        0.06–0.12
Local search   2.42 μs    10–20 ms                        0.02–0.05
Backprop.      2 ms       0.5–1.5 s                       0.4–1.0
Table 5. Runtime and implementation overhead summary.

6.7. Implementation Overhead Analysis

We conclude our experimental evaluation by summarizing the runtime and storage overheads of our approach, measured on the Odroid XU3 board. The online power-performance models are updated in each epoch, whose length ranges between 10–20 ms. Since each model update takes 6 μs, the runtime overhead is only 0.06–0.12%, as summarized in Table 5. The online policy update via backpropagation takes longer (2 ms) than the other operations, as expected. However, it runs only every 100 epochs, i.e., on the order of seconds, so its runtime overhead is less than 1%, as shown in Table 5. Finally, the storage overhead of the buffer that stores new training examples is only 18 KB. In summary, the proposed approach incurs minimal runtime and storage overheads.
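
For reference, the percentages in Table 5 follow from dividing a module's execution time by the interval between its successive invocations; for instance, a 6 μs RLS update within a 10 ms epoch gives the lower end of the reported range:

\[
\text{overhead} \;=\; \frac{t_{\text{module}}}{t_{\text{interval}}}\times 100\%,
\qquad
\frac{6~\mu\text{s}}{10~\text{ms}}\times 100\% \;=\; 0.06\%.
\]
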

Overhead Comparison with RL: We begin the overhead comparison with the cost of retraining the neural network using the backpropagation algorithm. Each backpropagation step uses 100 training samples for both the online-IL and RL algorithms, and the network structure is the same in both cases. Therefore, the execution time overhead of a single backpropagation step is the same (2 ms) for both algorithms. However, RL needs more backpropagation steps than the proposed online-IL to learn a policy that performs as well as online-IL. As a result, the cumulative overhead of RL, in terms of both execution time and required training samples, is much higher than that of online-IL. For example, to obtain the optimal policy for the Spectral application, online-IL performs the backpropagation step only once (2 ms total overhead), whereas RL is unable to learn the optimal policy even after 8 backpropagation steps (16 ms total overhead). Overall, for the Spectral application, the performance overhead incurred by RL is twice the overhead incurred by IL. This further demonstrates the efficiency of the proposed online-IL approach over RL.

7. Conclusions and Future Work

Designing an optimal runtime power management policy that achieves fast response time and minimum energy consumption is challenging due to the large space of available control knobs and the diversity of applications. Policies constructed at design time may fail to deliver optimal execution for applications encountered at runtime. This paper presented a novel online imitation learning methodology to efficiently learn the optimal policy for new applications at runtime. Experimental evaluation on the Odroid XU3 platform shows more than 10% lower energy consumption and 24% faster execution time with respect to the Powersave governor. The proposed approach is also applicable to other sequential decision-making problems in mobile systems, such as prefetching, cache replacement, and branch prediction. Extending the proposed IL framework to these domains is beyond the scope of this paper and is left as future work.

References

  • Aalsaud et al. (2016) Ali Aalsaud et al. 2016. Power–Aware Performance Adaptation of Concurrent Applications In Heterogeneous Many-Core Systems. In Proc. Int. Symp. on Low Power Electronics and Design. 368–373.
  • Bellosa (2000) Frank Bellosa. 2000. The Benefits of Event-Driven Energy Accounting in Power-Sensitive Systems. In Proceedings of the 9th Workshop on ACM SIGOPS European Workshop: Beyond the PC: New Challenges for the Operating System. 37–42.
  • Bhat et al. (2018) Ganapati Bhat, Sumit K Mandal, Ujjwal Gupta, and Umit Y Ogras. 2018. Online learning for adaptive optimization of heterogeneous SoCs. In Proceedings of the International Conference on Computer-Aided Design. ACM, 61.
  • Bhat et al. (2017) Ganapati Bhat, Gaurav Singla, Ali K Unver, and Umit Y Ogras. 2017. Algorithmic Optimization of Thermal and Power Management for Heterogeneous Mobile Platforms. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 26, 3 (2017), 544–557.
  • Bienia et al. (2008) Christian Bienia et al. 2008. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In Proc. Int. Conf. on Parallel Arch. and Compilation Techniques. 72–81.
  • Brooks et al. (2007) David Brooks, Robert P Dick, Russ Joseph, and Li Shang. 2007. Power, Thermal, and Reliability Modeling in Nanometer-Scale Microprocessors. IEEE Micro 27, 3 (2007), 49–62.
  • Chen and Marculescu (2015) Zhuo Chen and Diana Marculescu. 2015. Distributed Reinforcement Learning For Power Limited Many-Core System Performance Optimization. In Proc. of the DATE. 1521–1526.
  • Cochran et al. ([n. d.]) Ryan Cochran et al. [n. d.]. Pack & Cap: adaptive DVFS and thread packing under power caps. In 2011 44th Annual IEEE/ACM Intl. Symp. on Microarchitecture (MICRO). 175–185.
  • Dhiman et al. (2009) Gaurav Dhiman et al. 2009. System-level power management using online learning. IEEE Trans. on CAD 28, 5 (2009), 676–689.
  • Dietrich and Chakraborty (2014) Benedikt Dietrich and Samarjit Chakraborty. 2014. Lightweight Graphics Instrumentation for Game State-Specific Power Management in Android. Multimedia Systems 20, 5 (2014), 563–578.
  • Dietrich et al. (2010) Benedikt Dietrich, Swaroop Nunna, Dip Goswami, Samarjit Chakraborty, and Matthias Gries. 2010. LMS-based Low-Complexity Game Workload Prediction for DVFS. In 2010 IEEE International Conference on Computer Design. 417–424.
  • Ge and Qiu (2011) Yang Ge and Qinru Qiu. 2011. Dynamic Thermal Management for Multimedia Applications using Machine Learning. In Proceedings of the 48th Design Automation Conference. 95–100.
  • Goodfellow et al. (2013) Ian J Goodfellow et al. 2013. An Empirical Investigation of Catastrophic Forgetting in Gradient-based Neural Networks. arXiv preprint arXiv:1312.6211 (2013).
  • Gupta et al. (2017) Ujjwal Gupta et al. 2017. DyPO: Dynamic Pareto-Optimal Configuration Selection For Heterogeneous MpSoCs. ACM Trans. Embedd. Comput. Syst. 16, 5s (2017), 123.
  • Gupta et al. (2019) Ujjwal Gupta et al. 2019. A Deep Q-Learning Approach for Dynamic Management of Heterogeneous Processors. IEEE Computer Architecture Letters (2019).
  • Gupta et al. (2018b) Ujjwal Gupta, Manoj Babu, Raid Ayoub, Michael Kishinevsky, Francesco Paterna, Suat Gumussoy, and Umit Y Ogras. 2018b. An Online Learning Methodology for Performance Modeling of Graphics Processors. IEEE Trans. Comput. 67, 12 (2018), 1677–1691.
  • Gupta et al. (2018a) Ujjwal Gupta, Manoj Babu, Raid Ayoub, Michael Kishinevsky, Francesco Paterna, and Umit Y Ogras. 2018a. STAFF: Online Learning with Stabilized Adaptive Forgetting Factor and Feature Selection Algorithm. In Proceedings of the 55th Annual Design Automation Conference. 1–6.
  • Guthaus et al. (2001) Matthew R Guthaus et al. 2001. Mibench: A Free, Commercially Representative Embedded Benchmark Suite. In Proc. of the Int. Workshop on Workload Characterization. 3–14.
  • Hardkernel (2014) Hardkernel. 2014. ODROID-XU3. https://wiki.odroid.com/old_product/odroid-xu3/odroid-xu3 Accessed 24 Nov. 2018.
  • Hecht-Nielsen (1992) Robert Hecht-Nielsen. 1992. Theory of the backpropagation neural network. In Neural networks for perception. Elsevier, 65–93.
  • Kadjo et al. (2015) David Kadjo, Raid Ayoub, Michael Kishinevsky, and Paul V Gratz. 2015. A Control-Theoretic Approach for Energy Efficient CPU-GPU Subsystem in Mobile Platforms. In Proceedings of the 52nd Annual Design Automation Conference. ACM, 62.
  • Kim et al. (2017) Ryan Gary Kim et al. 2017. Imitation Learning For Dynamic VFI Control In Large-Scale Manycore Systems. IEEE Trans. on VLSI Systems 25, 9 (2017), 2458–2471.
  • Lattner and Adve (2004) Chris Lattner and Vikram Adve. 2004. LLVM: A compilation Framework For Lifelong Program Analysis & Transformation. In Proc. of the Int. Symp. on Code Generation and Optimization: Feedback-Directed and Runtime Optimization. 75.
  • Mandal et al. (2019) Sumit K Mandal et al. 2019. Dynamic Resource Management of Heterogeneous Mobile Platforms via Imitation Learning. IEEE Transactions on Very Large Scale Integration (VLSI) Systems (2019).
  • Martinez and Ipek (2009) Jose F Martinez and Engin Ipek. 2009. Dynamic Multicore Resource Management: A Machine Learning Approach. IEEE Micro 29, 5 (2009).
  • Mendel (1995) Jerry M Mendel. 1995. Lessons in estimation theory for signal processing, communications, and control. Pearson Education.
  • Mnih et al. (2015) Volodymyr Mnih et al. 2015. Human-level Control Through Deep Reinforcement Learning. Nature 518, 7540 (2015), 529.
  • Mucci et al. (1999) Philip J Mucci, Shirley Browne, Christine Deane, and George Ho. 1999. PAPI: A Portable Interface To Hardware Performance Counters. In Proc. of the Dept. of defense HPCMP Users Group Conf., Vol. 710.
  • Pallipadi and Starikovskiy (2006) Venkatesh Pallipadi and Alexey Starikovskiy. 2006. The Ondemand Governor. In Proc. Linux Symp., Vol. 2. 215–230.
  • Park et al. (2017) Jurn-Gyu Park, Nikil Dutt, and Sung-Soo Lim. 2017. ML-Gov: A Machine Learning Enhanced Integrated CPU-GPU DVFS Governor For Mobile Gaming. In Proc. Symp. on Embedd. Syst. for Real-Time Multimedia. 12–21.
  • Pathania et al. (2015) Anuj Pathania et al. 2015. Power-Performance Modelling of Mobile Gaming Workloads On Heterogeneous MPSoCs. In Proc. of Design Autom. Conf. 201.
  • Pathania et al. (2014) Anuj Pathania, Qing Jiao, Alok Prakash, and Tulika Mitra. 2014. Integrated CPU-GPU Power Management For 3D Mobile Games. In Design Autom. Conf. 1–6.
  • Ross et al. (2011) Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. 2011. A Reduction of Imitation Learning And Structured Prediction To No-Regret Online Learning. In Proc. of the Int. Conf. on Art. Intel. and Stat. 627–635.
  • Schaal (1999) Stefan Schaal. 1999. Is Imitation Learning The Route To Humanoid Robots? Trends in cognitive sciences 3, 6 (1999), 233–242.
  • Shafik et al. (2015) Rishad A Shafik, Sheng Yang, Anup Das, Luis A Maeda-Nunez, Geoff V Merrett, and Bashir M Al-Hashimi. 2015. Learning transfer-based adaptive energy minimization in embedded systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 35, 6 (2015), 877–890.
  • Singh et al. (2019) Amit Singh, Karunakar Reddy Basireddy, Alok Prakash, Geoff Merrett, and Bashir M Al-Hashimi. 2019. Collaborative Adaptation for Energy-Efficient Heterogeneous Mobile SoCs. IEEE Trans. Comput. (2019).
  • Statista ([n. d.]) Statista. [n. d.]. Mobile App Usage - Statistics & Facts. https://www.statista.com/topics/1002/mobile-app-usage/ Accessed 24 Nov. 2018.
  • Sultan et al. (2014) Hameedah Sultan, Gayathri Ananthanarayanan, and Smruti R Sarangi. 2014. Processor Power Estimation Techniques: A Survey. IJHPSA 5, 2 (2014), 93–114.
  • Sun et al. (2017) Wen Sun, Arun Venkatraman, Geoffrey J Gordon, Byron Boots, and J Andrew Bagnell. 2017. Deeply AggreVaTeD: Differentiable Imitation Learning for Sequential Prediction. In Proc. 34th Int. Conf. Machine Learning, Vol. 70. 3309–3318.
  • Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. 2018. Reinforcement Learning: An Introduction. MIT press.
  • Thomas et al. (2014) Shelby Thomas et al. 2014. CortexSuite: A Synthetic Brain Benchmark Suite. In IISWC. 76–79.
  • Tian et al. (2018) Zhongyuan Tian, Zhe Wang, Haoran Li, Peng Yang, Rafael Kioji Vivas Maeda, and Jiang Xu. 2018. Multi-device collaborative management through knowledge sharing. In 2018 23rd Asia and South Pacific Design Automation Conference (ASP-DAC). 22–27.
  • ul Islam and Lin (2017) F.M.M. ul Islam and Man Lin. 2017. Hybrid DVFS Scheduling For Real-Time Systems Based On Reinforcement Learning. IEEE Systems J. 11, 2 (2017), 931–940.
  • Vallina-Rodriguez and Crowcroft (2012) N. Vallina-Rodriguez and Jon Crowcroft. 2012. Energy Management Techniques in Modern Mobile Handsets. IEEE Comm. Surveys & Tutorials 15, 1 (2012), 1–20.
  • Won et al. (2014) Jae-Yeon Won et al. 2014. Up by their bootstraps: Online learning in artificial neural networks for CMP uncore power management. In 2014 IEEE 20th Intl. Symp. on HPCA. 308–319.
  • Ye et al. (2000) Wu Ye, Narayanan Vijaykrishnan, Mahmut Kandemir, and Mary Jane Irwin. 2000. The Design and Use of Simplepower: A Cycle-Accurate Energy Estimation Tool. In Proceedings of the 37th Annual Design Automation Conference. ACM, 340–345.
  • Zhang et al. (2017) Qingchen Zhang et al. 2017. Energy-efficient Scheduling for Real-time Systems Based on Deep Q-learning Model. IEEE Trans. on Sustainable Computing (2017).
  • Zhang et al. (2018) Qingchen Zhang, Man Lin, Laurence T Yang, Zhikui Chen, Samee U Khan, and Peng Li. 2018. A double deep Q-learning model for energy-efficient edge scheduling. IEEE Transactions on Services Computing (2018).