Embedded GPUs (Graphics Processing Units), which are physically located on the same chip as the central processing unit (CPU), are popular as general-purpose accelerators in power-constrained applications such as unmanned aerial vehicles (UAVs), self-driving cars and robotics.
The power profiles of these GPUs are on the order of watts, compared with the hundreds of watts required by their desktop counterparts, which invariably use the PCIe bus to communicate with the host CPU and memory. Power optimization in these embedded devices is of critical importance since in these applications the power sources tend to be batteries and the systems must operate untethered for as long as possible. In this paper we investigate the application of the power modelling framework created in (Nikov2020)
for heterogeneous embedded CPUs to embedded GPUs capable of general-purpose computing thanks to their support for languages such as CUDA and OpenCL. Our framework, called ROSE (RObust Statistical search of explanatory activity Events), can be used to automatically collect activity and power data and then perform a complete search for optimal events across a large range of frequency and voltage pairs as defined in the device DVFS (Dynamic Voltage and Frequency Scaling) tables. The ROSE multiple linear regression optimization uses OLS (Ordinary Least Squares), which is well understood for both desktop and integrated CPUs but has been less studied in GPUs, which are characterized by a proprietary black-box architecture with restricted access to internal microarchitecture details. Taking these points into account, the novelty of this work can be summarized as follows:
We perform power modelling on an embedded GPU device with integrated power measurement and voltage/frequency scaling compared with previous work largely focused on desktop GPUs.
We limit the number of explanatory variables used in the model to enable the run-time collection of this information using a limited number of hardware registers.
We propose a novel unified model that includes temperature, frequency and voltage as global states and compare it against models with coefficients optimized for single temperatures and frequency/voltage pairs.
This paper is organized as follows. Section 2 introduces related work in the area of power modelling for GPUs. Section 3 presents the methodology, based on our previous work in this area, the set of CUDA benchmarks for model training/verification and the techniques used to obtain run-time measurements of power and event information. Section 4 develops models based on this methodology with coefficients optimized for individual voltage and frequency pairs. Section 5 proposes a unified model so that the power of a single per-frequency model can be scaled to an extended range of voltage and frequency points. Section 6 investigates temperature effects on power consumption and model accuracy. Finally, Section 7 concludes the paper.
As previously indicated, there is a significant amount of work on power modelling of CPU cores and CPU-based systems, and the interested reader is referred to (nunez) for a review of results and techniques. In the field of general-purpose GPU-based computing, the amount of power modelling research is more limited. The authors of (nag10) investigate how performance counters can be used to model power on desktop NVIDIA GPUs connected to a host computer via a PCIe interface. The PCIe interface is instrumented with current clamp sensors and the host computer samples these sensors while collecting performance counter information. The authors identify a total of 13 relevant CUDA performance counters but, since only four hardware registers are available, multiple runs are needed to collect all of them. They also identify that certain kernels that perform texture reads, such as Rodinia Leukocyte, show significant power errors of up to 50% due to the lack of relevant counter information. The impact of DVFS on power modelling is not considered. Also targeting desktop GPUs, the authors of (mian12)
introduce a support vector regression (SVR) model instead of the least squares linear regression more commonly used. A total of five variables, such as vertex shader busy and texture busy, are used to build the model. Instead of predicting the power of full kernels as done in (nag10), they predict the power of the different execution phases, similarly to our work. The authors show a slight accuracy advantage for SVR, although some execution phases of the GPU power cannot be modelled correctly. The performance and power characterization done in (abe14) considers different desktop NVIDIA GPU families (i.e. Tesla, Fermi and Kepler). The external power measurements apply to the entire system, which includes the GPU and CPU, and not the individual components. The proposed power model uses performance counters and linear regression, and introduces a frequency scaling parameter in the power equation to account for the different performance levels possible in the GPU. It does not consider the operating voltage and, with multiple voltage levels possible for a single frequency, this could explain the prediction errors, which are measured at around 20 to 30%. The work of (Mei16)
also focuses on desktop GPUs with a review that shows that the number of explanatory variables used varies between 8 and 23. It considers the use of neural networks to perform the prediction, indicating how neural networks can address the nonlinear dependencies of the input variables at the expense of significantly higher complexity. However, this could make the models harder to deploy as part of an energy-aware operating system. In this paper, we focus on using a low number of explanatory variables to make the models easy to deploy at run-time and investigate the accuracy of multiple linear regression for power modelling considering voltage, frequency and temperature in embedded GPUs.
The methodology is based on our previous work targeting ARM big.LITTLE SoCs and introduced in (Nikov2020). The CUDA benchmarks used for the model creation and validation are shown in Table 1. Training and testing benchmarks are independent and have been obtained from the Rodinia and CUDA SDK benchmark sets.
Table 1: CUDA benchmarks — Rodinia train set and CUDA SDK test set.
We have modified the collection and processing stages to account for the differences in counter availability, power and current sensors and DVFS implementations. In the CPU-based power model developed in (Nikov2020) the DVFS table contains fixed pairs of voltage and frequency, and the preferred way to build the power model is to use a per-frequency model in which a distinct set of coefficient values is calculated for each pair. Using this approach directly on the TX1 is problematic due to temperature dependencies and the practical difficulties of adjusting the temperature of the device to each possible value during a data collection run that typically executes over several days. To manage this complexity, in this work we distinguish between local events, such as the number of instructions executed or the number of memory accesses, which affect power in certain regions of the device, and global states, such as the operating frequency, voltage and temperature, which affect power globally. This approach enables us to propose a unified model with a single set of coefficients that can be used for multiple combinations of voltage, frequency and temperature. The development of this unified model and its comparison with the per-frequency models is conducted in Sections 5 and 7. The performance counters considered in this work are shown in Figure 1. The number of physical registers available in the GPU device to collect activity information in parallel is limited, and for this reason limiting the number of model counters is preferred. The methodology presented in (Nikov2020) implements different types of automatic searches and analysis of the effects of different counters on the power model accuracy.
In the methodology flow, the octave_makemodel script receives with -r a measurement.txt text file containing the power and activity counter samples (around 12,000 samples in our case); with -b a benchmark.txt file that identifies which benchmarks should be used for training and which for testing; and with -f all the frequency values to be considered (each frequency value also corresponds to a different voltage as determined by the DVFS table). The switch -p identifies the column number in measurement.txt that contains power information, -m set to 1 selects the bottom-up search heuristic, -l lists the performance counters selected for analysis as columns in measurement.txt, and -n set to 4 instructs the framework to search for the best possible four performance counters that result in the most accurate model according to the criterion indicated with -c 1. This means that the script will automatically search for up to a maximum of 4 performance counters from the list provided, across all frequencies and voltages. The result is a set of coefficients for each frequency/voltage pair. To minimize temperature interference, the experiments are conducted with the available TX1 fan initially set to maximum speed. If, for example, the user is interested in obtaining models across all possible frequencies for a particular set of pre-selected events, the switch -e can be used to specify the four columns in power_measurement.txt with the events to be analyzed.
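The bottom-up search heuristic selected with -m 1 can be illustrated with a short sketch: starting from an empty set, greedily add the counter whose inclusion most reduces the OLS model error, up to the four-counter limit. This is a simplified illustration with synthetic data, not the actual octave_makemodel implementation; function and variable names are our own.

```python
import numpy as np

def ols_fit_error(X, y):
    """Fit OLS coefficients and return the mean absolute percentage error."""
    A = np.column_stack([np.ones(len(y)), X])   # constant term alpha_0
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    pred = A @ coef
    return float(np.mean(np.abs((pred - y) / y)) * 100.0)

def bottom_up_search(events, power, max_counters=4):
    """Greedy bottom-up search: repeatedly add the counter that most
    reduces the OLS model error, up to max_counters counters."""
    selected = []
    remaining = list(range(events.shape[1]))
    best_err = float("inf")
    while remaining and len(selected) < max_counters:
        err, col = min((ols_fit_error(events[:, selected + [c]], power), c)
                       for c in remaining)
        if err >= best_err:
            break                                # no further improvement
        best_err = err
        selected.append(col)
        remaining.remove(col)
    return selected, best_err
```

For example, with synthetic data whose power is a linear function of two of the candidate counters, the search recovers those two counters.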
4. Per-frequency model development
Equation 4 shows the general form of the power model proposed in this work. Compared with our previous work in (Nikov2020), we normalize the total event count by the total number of cycles available in the time slot to obtain an activity density measurement that should remain constant as frequency changes. For example, if the frequency doubles, then the number of events (e.g. instructions executed) in the same time period should also double, but since the number of clock cycles also doubles, the ratio should remain constant.
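The normalization argument above can be checked with a trivial sketch (the counts are illustrative): doubling the frequency doubles both the event count and the cycle count over the same time slot, leaving the activity density unchanged.

```python
# Activity density: events in a sample slot normalized by the clock
# cycles in the slot. Doubling the frequency doubles both counts over
# the same time window, so the rate is unchanged (counts illustrative).
def activity_rate(events, freq_hz, slot_seconds):
    cycles = freq_hz * slot_seconds          # clock cycles in the slot
    return events / cycles

rate_low  = activity_rate(events=1.0e6, freq_hz=380e6, slot_seconds=0.5)
rate_high = activity_rate(events=2.0e6, freq_hz=760e6, slot_seconds=0.5)
```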
We limit all experiments to a maximum set of four counters to account for the limited number of registers available in commercial GPUs. Figure 2 shows four examples of performance counter sets that the methodology selects as the most accurate, identified as models A, B, C and D. The coefficients shown are for a single example frequency of 76 MHz with a corresponding voltage of 0.82 V; a similar set of coefficients exists for each of the other 12 possible frequency and voltage pairs at a constant temperature.
P_{GPUfreq_1} = \alpha_0 + \alpha_1 \times \frac{events_1}{cycles} + \dots + \alpha_n \times \frac{events_n}{cycles} \qquad (4)
Figure 3 compares the accuracy of these four models across all the frequency and voltage pairs. The performance of models A, C and D is similar, with an overall error below 5%. Model D offers slightly better overall accuracy, as shown in the overall value, and is taken forward to derive a unified power model in the next section. We can also see that accuracy varies across frequencies, and this variation is largely determined by the model parameters.
5. Unified model development
The previous per-frequency models contain a total of 13 × 5 = 65 parameters: four event coefficients and a constant parameter for each of the 13 voltage/frequency points. They are obtained at fixed voltage and frequency pairs and do not take into account the multiple voltage levels available for each frequency in the TX1 DVFS table. In this section, we propose a new type of model that unifies the previous models with a single set of coefficients and includes independent variables for frequency and voltage. Equation 5 shows the general form of this unified power model, with two added terms corresponding to dynamic and static power. The approach consists of scaling the power predicted by a power model at a single frequency to fit the rest of the frequencies and voltages.
P_{GPUfreq_x} = (P_{GPUfreq_1} - P_{GPUsta_x}) \times \frac{freq_x}{freq_1} \times \left(\frac{volt_x}{volt_1}\right)^2 + P_{GPUsta_x} \times \left(\frac{volt_x}{volt_1}\right)^2 \qquad (5)
Scaling is possible because the model uses normalized activity rates that should remain constant at different frequencies, since both events and cycles reduce proportionally. The scaling is based on how voltage and frequency affect the dynamic and static power of a chip. Dynamic power is proportional to frequency and to the square of the voltage. In our experiments, we observe that static power accuracy also improves with voltage-square scaling. Static power, or leakage, is the power of the device when the frequency is zero, so the frequency term should not be used to scale it. To isolate the static power in the second term of the equation and scale it correctly, we need to measure it first. It is important to note that the per-frequency model contains a constant component that represents the device power with no activity; this power can be defined as idle power, as shown in Equation 6. This idle power is formed mainly by the static power and the clock power, since the clocks remain active when there is no active load. A direct way to measure static power would be to clock-gate the GPU device; however, the Linux for Tegra (L4T) JetPack 4.2.1 used with the TX1 SoC in this work does not implement this feature and only allows frequency configurations that are part of the DVFS table. To extract the static power, we therefore use an indirect method as follows. We sweep all the points available in the DVFS table with no benchmarks running to obtain the idle power. The first few frequency points in the DVFS table do not affect the supply voltage of 0.82 V, and this results in a linear relation between power and frequency, as shown in Figure 4 for these points at a reference temperature of 23 °C at full fan speed. We use the point at which this line intersects the Y axis as frequency zero and the corresponding value, rounded to 0.21 W, as the static power present in the device at that voltage level and temperature.
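The intercept method described above can be sketched as follows, assuming a set of idle power readings at the DVFS points that share the 0.82 V supply. The readings below are illustrative values on a line with a 0.21 W intercept, not the measured TX1 data.

```python
import numpy as np

# Idle power readings (W) at the low DVFS points that share the 0.82 V
# supply; frequencies in MHz. Values are illustrative, not measured data.
freqs_mhz = np.array([76.8, 153.6, 230.4, 307.2, 380.0])
p_idle_w  = np.array([0.2246, 0.2392, 0.2538, 0.2684, 0.2822])

# Fit P_idle = slope * f + intercept; the intercept is the extrapolated
# power at frequency zero, i.e. the static power at this voltage and
# temperature.
slope, intercept = np.polyfit(freqs_mhz, p_idle_w, 1)
p_static_w = intercept
```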
With this information, we can create a unified model based on Equation 5, using as the reference point any frequency that has the common voltage of 0.82 V and should therefore have constant static power.
We consider two possible reference points at the minimum frequency of 76 MHz and the middle frequency of 380 MHz, which both share the voltage of 0.82 V. We call these models UAL and UAM for "unified anchor low" and "unified anchor middle", respectively. Another alternative is to use a reference point at a high frequency if we can estimate the static power at that level. Since we know that the dynamic clock power follows Equation 7, we can substitute it into Equation 6 and, with the available values for P_idle, P_static, V and f, obtain αC, which we treat as a constant. Our hypothesis is that αC should remain constant within a small sample interval across different frequencies because we are measuring events divided by cycles in the sample interval. We can then extract the static power for the high frequency point of 998 MHz and 1.07 V by obtaining P_dynamic_clock and subtracting it from P_idle. We call this model UAH for "unified anchor high".
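The indirect extraction of the static power at the high anchor point can be sketched as follows. The 0.21 W static power at 0.82 V comes from the intercept method in the text; the idle power readings used here are illustrative values, and the function names are our own.

```python
# Indirect extraction of static power at the 998 MHz / 1.07 V anchor.

def alpha_c(p_idle, p_static, volt, freq_hz):
    """Solve P_idle = alphaC * V^2 * f + P_static for the constant alphaC."""
    return (p_idle - p_static) / (volt ** 2 * freq_hz)

def static_power(p_idle, a_c, volt, freq_hz):
    """P_static = P_idle - P_dynamic_clock, with P_dynamic_clock = alphaC * V^2 * f."""
    return p_idle - a_c * volt ** 2 * freq_hz

# alphaC estimated at a 0.82 V point where the static power is known
a_c = alpha_c(p_idle=0.2822, p_static=0.21, volt=0.82, freq_hz=380e6)

# reuse alphaC to split the idle power at the high anchor point
p_static_high = static_power(p_idle=1.10, a_c=a_c, volt=1.07, freq_hz=998e6)
```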
Figure 5 shows the accuracy of the UAL, UAM and UAH unified models derived from the PF (per-frequency) model D. The anchor low and anchor high models obtain the best accuracy at their respective reference points but suffer a significant degradation as the frequency/voltage moves further away from the reference point. On the other hand, the anchor middle model offers largely identical accuracy to model D, with an overall percentage error of around 5%. This result shows that the unified model can be competitive in accuracy with the per-frequency models developed in the previous section. This unified power model, with 380 MHz and 0.82 V as the reference frequency and voltage, is shown in Equation 8, where P_GPUfreq_ref is obtained from Equation 9. Equation 9 contains a negative coefficient which, in principle, is not an intuitive result, but it can be explained by the fact that the explanatory variables are correlated with each other (i.e. gpu_busy increases as the number of instructions executed increases) and the multiple linear regression process finds that this negative value improves the model fit to the training data.
Finally, Figure 6 compares the power predictions of the per-frequency model D and the derived unified model with the values measured at run-time for a full sweep of the test benchmarks at different voltages and frequencies. We observe that the power consumption ranges from below 1 Watt to over 13 Watts depending on the operating point and benchmark. The power predictions follow the different execution phases, although the errors are more noticeable at the highest points of power consumption. Also, the measured power tends to show low spikes between benchmarks that the model does not predict. This effect could be due to our sampling rate, which is limited to one sample every 0.5 seconds. Further research is needed to increase this sample rate, taking into account that, since the thread that samples the power sensors is also executed by the CPU cores, higher sampling rates could mean that the processors are not available to launch the CUDA benchmarks, which could introduce artifacts.
P_{idle} = P_{dynamic\_clock} + P_{static} \qquad (6)
P_{dynamic\_clock} = \alpha C \times V^2 \times f \qquad (7)
P_{GPUfreq_x} = (P_{GPUfreq\_ref} - 0.21\,W) \times \frac{freq_x}{380\,MHz} \times \left(\frac{volt_x}{0.82\,V}\right)^2 + 0.21\,W \times \left(\frac{volt_x}{0.82\,V}\right)^2 \qquad (8)
P_{GPUfreq\_ref} = 0.7720\,W + 0.0025\,W \times \frac{inst\_executed\_cs}{cycles} + 0.0908\,W \times \frac{executed\_global\_stores}{cycles} - 0.000017\,W \times \frac{gpu\_busy}{cycles} + 0.000019\,W \times \frac{active\_warps}{cycles} \qquad (9)
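A minimal sketch of the unified model at the 380 MHz / 0.82 V anchor, combining the reference-point coefficients with the voltage and frequency scaling above. Inputs are event counts already normalized by cycles; the counter rate values passed in any usage are hypothetical.

```python
# Sketch of the unified model anchored at 380 MHz / 0.82 V: the
# reference-point power is computed from the four counter rates and then
# scaled to the target voltage/frequency point.

def p_ref(inst_executed_cs, executed_global_stores, gpu_busy, active_warps):
    """Reference-point power (W); inputs are event counts divided by cycles."""
    return (0.7720
            + 0.0025 * inst_executed_cs
            + 0.0908 * executed_global_stores
            - 0.000017 * gpu_busy
            + 0.000019 * active_warps)

def p_unified(p_ref_w, freq_mhz, volt, p_static_w=0.21,
              f_ref_mhz=380.0, v_ref=0.82):
    """Scale the reference power to another DVFS point: dynamic power
    scales with f and V^2, static power with V^2 only."""
    f_scale = freq_mhz / f_ref_mhz
    v_scale = (volt / v_ref) ** 2
    return (p_ref_w - p_static_w) * f_scale * v_scale + p_static_w * v_scale
```

At the reference point both scaling factors are 1 and the model reduces to the reference-point power.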
6. Temperature effects
The power equation presented in Equation 8 considers possible changes in voltage and frequency due to temperature changes as defined in the DVFS table, but it does not consider the changes in power due to temperature itself. Temperature has a direct effect on the static power consumption of the device, as shown in (GOEL20127). The static power depends linearly on the leakage current and supply voltage, while the leakage current itself depends exponentially on the supply voltage and temperature. In this analysis we approximate these exponential relations linearly for the range in which device temperatures occur (GOEL20127). To understand the dependency between static power and temperature, we run a number of experiments with no load on the GPU, varying the fan rate and frequency at a constant voltage of 0.82 V to generate different temperature profiles. For each run we obtain a linear relation between frequency and power that enables us to estimate the static power by setting frequency to zero. We use the same approach to obtain a linear relation between power and temperature that allows us to estimate the temperature at frequency zero for each of the runs.
We can now plot the points of temperature and power, as shown in Figure 7, and obtain a linear relation between temperature and static power at 0.82 V. Using this information we can replace P_static in Equation 8 to obtain Equation 10.
Two temperature and power profiles, generated by varying the fan activity, are used to test the temperature-aware model. Overall, the results show that under the same workload, voltage and frequency, temperature can cause a power variation higher than 20%. This result justifies the importance of capturing temperature in a power model, as done in this work. We evaluate the temperature-aware power equation in Figures 8 and 9 against the other models. Figure 8 shows that, under the same temperature conditions considered in the previous section, the model operates with a similar level of accuracy. Figure 9 shows that when the device heats up, the accuracy of the original models degrades significantly while the temperature-aware model largely maintains the same level of accuracy.
P_{GPUfreq_x} = (P_{GPUfreq\_ref} - (T_{ref} \times 0.0051 + 0.0849)\,W) \times \frac{freq_x}{380\,MHz} \times \left(\frac{volt_x}{0.82\,V}\right)^2 + (T \times 0.0051 + 0.0849)\,W \times \left(\frac{volt_x}{0.82\,V}\right)^2 \qquad (10)
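The temperature-aware variant can be sketched by replacing the fixed 0.21 W static term of the unified model with the linear temperature fit; note that at the 23 °C reference the fit gives 23 × 0.0051 + 0.0849 ≈ 0.20 W, close to the measured 0.21 W. Function names are our own.

```python
# Temperature-aware unified model: the fixed static term is replaced by
# the linear temperature fit P_static(T) = T * 0.0051 + 0.0849 (W, at 0.82 V).

def p_static_at(temp_c):
    """Static power (W) at 0.82 V as a linear function of temperature."""
    return temp_c * 0.0051 + 0.0849

def p_temp_aware(p_ref_w, t_ref_c, freq_mhz, volt, temp_c,
                 f_ref_mhz=380.0, v_ref=0.82):
    """Unified model with the static term evaluated at the current
    temperature instead of the reference temperature."""
    f_scale = freq_mhz / f_ref_mhz
    v_scale = (volt / v_ref) ** 2
    return ((p_ref_w - p_static_at(t_ref_c)) * f_scale * v_scale
            + p_static_at(temp_c) * v_scale)
```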
7. Conclusions and future work
The proposed per-frequency and unified power models are kept simple by using only four performance counters collected in parallel. We also extend the unified model with temperature-aware capabilities to improve accuracy when the device can work in multiple temperature regimes, such as in a fanless configuration. We observe that the temperature impact on static power can increase power by over 20%, which is the rationale for making it part of the proposed model. Overall, the research shows that the CPU power methodology can be applied successfully to GPU devices, despite the performance counters being very different in nature, and that the prediction error can be maintained at around 5% using a combination of local events, represented by the performance counters, and global states, represented by the voltage, frequency and temperature variables. The simplicity of these models means that they could be deployed as part of an energy-aware operating system and scheduling framework. The unified model could be particularly useful since it can capture multiple voltage levels for one frequency level with a single set of coefficients. Our future work involves further validation of the methodology with additional benchmarks, improving the data collection approach to increase the granularity of the samples to better capture the different phases of benchmark execution, and experimenting with inter-prediction strategies across different GPU devices and technologies. The power modelling methodology used in this paper is available open-source at the following GitHub repository (buildmodel).