Energy-Efficient High-Throughput Data Transfers via Dynamic CPU Frequency and Core Scaling

04/11/2019
by   Luigi Di Tacchio, et al.

The energy footprint of global data movement has surpassed 100 terawatt hours, costing more than 20 billion US dollars to the world economy. Depending on the number of switches, routers, and hubs between the source and destination nodes, the networking infrastructure consumes around 10% of the total energy during active data transfers, and the rest is consumed by the end systems. Even though there has been extensive research on reducing the power consumption of the networking infrastructure, the work focusing on saving energy at the end systems has been limited to the tuning of a few application-level parameters such as parallelism, pipelining, and concurrency. In this paper, we introduce three novel application-level parameter tuning algorithms which employ dynamic CPU frequency and core scaling, combining heuristics and runtime measurements to achieve energy-efficient data transfers. Experimental results show that our proposed algorithms outperform the state-of-the-art solutions, achieving up to 48% reduced energy consumption and 80% better throughput.

I Introduction

The tsunami of data generated by Internet users, sensors, e-commerce, and surveillance cameras is fueling large-scale Artificial Intelligence (AI) systems. As a result, data transfer over the Internet has been increasing exponentially each year and has already exceeded the zettabyte scale [19]. It is estimated that the communication industry could use 20% of all the world's electricity by 2025. More than one billion people are expected to come online in developing countries in the next 5 years. Currently, the data transfer task alone consumes over a hundred terawatt-hours of energy, with a price tag of 20 billion US dollars annually. Moreover, the environmental side effect is monumental: Information and Communication Technologies (ICT) alone could be responsible for a staggering 3.5% of carbon emissions by 2020 [18]. This trend has motivated a considerable amount of work on optimizing the energy consumption of hardware and software systems as well as network devices.

Numerous works have been done to optimize the power consumption of the core network infrastructure (e.g., routers, switches, hubs, and network interface cards); however, little work has been done on energy efficiency in the end systems (i.e., sender/receiver compute nodes). Many of these works suggested putting idle components to sleep [8], turning off unused links and switches [10], adapting link rates [15], and power-aware optimization of packet routing based on power models [4].

Many of these approaches require expensive, energy-efficient hardware replacements. Some solutions require replacing an existing protocol with a newer one, which comes with a huge adoption overhead. Most of the solutions cannot balance performance and energy efficiency, and many are not well accepted in industry because they take a big toll on performance to provide energy efficiency. Moreover, these approaches hardly consider the energy consumption in the end systems. In order to make the end systems more energy efficient, one could optimize the transport protocol with an adaptive sending rate. However, such a protocol update requires expensive kernel changes, and its adoption takes a long time since operating system vendors are reluctant to apply such updates. In this paper, we propose a solution that is free from expensive hardware or protocol replacements and provides a complete application-layer solution which employs dynamic CPU frequency and core scaling. This novel solution can balance performance and energy efficiency using Service Level Agreement (SLA) based tuning. Our model is easy to deploy as it can be implemented in user space. Moreover, users can set performance or energy constraints based on SLAs.
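As an illustration of what "dynamic CPU frequency and core scaling from user space" can look like in practice, the following sketch is not the paper's implementation; it only shows, under the assumption of a Linux host with root privileges and the sysfs cpufreq and CPU-hotplug interfaces available, how an application-level tool could adjust the frequency cap and the set of online cores.

from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")

def set_max_frequency(cpu: int, khz: int) -> None:
    # Cap one core's frequency; the active governor stays at or below this value.
    (CPU_ROOT / f"cpu{cpu}/cpufreq/scaling_max_freq").write_text(str(khz))

def set_core_online(cpu: int, online: bool) -> None:
    # CPU 0 usually cannot be taken offline, so callers should skip it.
    (CPU_ROOT / f"cpu{cpu}/online").write_text("1" if online else "0")

def apply_scaling(active_cores: int, max_khz: int, total_cores: int) -> None:
    # Keep the first `active_cores` cores online with the requested frequency cap
    # and take the remaining cores offline to save energy.
    for cpu in range(total_cores):
        if cpu < active_cores:
            if cpu != 0:
                set_core_online(cpu, True)
            set_max_frequency(cpu, max_khz)
        elif cpu != 0:
            set_core_online(cpu, False)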

The major contributions of this paper include the following:

  1. It proposes three novel application-level parameter tuning algorithms for energy efficient data transfers.

  2. It introduces CPU frequency scaling and active cores tuning to reduce the energy consumption of large scale data transfers.

  3. It combines heuristics and runtime measurements to dynamically and jointly tune five application-level parameters and satisfy the SLA requirements.

The rest of the paper is organized as follows: Section II provides a short description of the parameters that we aim to optimize; Section III provides an overview of the runtime parameter tuning model; Section IV presents three SLA-based energy-efficient parameter tuning algorithms; Section V provides an experimental evaluation of our models; Section VI describes the related work in this field; and Section VII concludes the paper.

II Application-Level Parameters

Data transfer throughput and energy consumption are influenced by a plethora of parameters at different layers of the network protocol stack. However, there has been little work on tuning application layer parameters, which has the advantage of leaving the rest of the protocol stack unchanged. To the best of our knowledge, this is the first work performing joint tuning of 5 application-level parameters: number of active cores, CPU frequency level, pipelining, parallelism, and concurrency.

Number of active cores and CPU frequency level determine the number of Instructions per Second (IPS) that the CPU can execute, as well as its energy consumption. Since pipelining, parallelism, and concurrency have an impact on the CPU utilization, it is important for those 3 parameters to be tuned jointly with the number of active cores and the CPU frequency level.

Pipelining is the number of requests that can be sent back to back before having to stop and wait for the data to reach the destination. Pipelining reduces the total number of Round-Trip-Times (RTTs) required to complete the transfer: therefore, it is most beneficial when moving smaller files, since as the download or upload time decreases, the RTTs have a greater impact on the total transfer time.

Parallelism is the number of file chunks that can be transferred concurrently for each file in the dataset. Using parallelism improves the network utilization by opening multiple connections and increasing the fraction of the bandwidth used during the transfer. Parallelism is most advantageous when transferring large files, especially when their size greatly exceeds the buffer size.

Concurrency is the number of files that can be transferred at the same time on multiple connections. Like parallelism, opening multiple streams allows the transfer to use a larger share of the bandwidth, but it can improve the throughput even when transferring smaller files. Tuning concurrency is a difficult task, since having too many streams competing for a share of the bandwidth might lower the throughput and increase the energy consumption.
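To make parallelism and concurrency concrete, the hypothetical sketch below (not taken from the paper) downloads a dataset over HTTP with a given concurrency (files moved at once) and parallelism (byte-range chunks per file), assuming the server honors Range requests; pipelining would additionally reuse each connection for back-to-back requests.

import concurrent.futures
import urllib.request

def fetch_chunk(url: str, start: int, end: int) -> bytes:
    # Parallelism: each chunk of a file travels on its own connection.
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def fetch_file(url: str, size: int, parallelism: int) -> bytes:
    chunk = size // parallelism + 1
    ranges = [(i * chunk, min((i + 1) * chunk, size) - 1) for i in range(parallelism)]
    ranges = [r for r in ranges if r[0] <= r[1]]          # drop empty trailing ranges
    with concurrent.futures.ThreadPoolExecutor(max_workers=parallelism) as pool:
        parts = list(pool.map(lambda r: fetch_chunk(url, *r), ranges))
    return b"".join(parts)

def fetch_dataset(files, parallelism: int, concurrency: int):
    # Concurrency: `concurrency` files are transferred at the same time.
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda f: fetch_file(f[0], f[1], parallelism), files))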

III Heuristic and Runtime Tuning Techniques

In this section, we present some of the techniques used to set and tune the 5 application-level parameters presented in section II. First, the transfer parameters are initialized using a heuristic-based approach, described in section III-A. After the transfer starts, runtime measurements are used to adjust the parameter values and guarantee the SLA requirements stipulated with the client, following the finite state machine presented in section III-B. During the transfer, the CPU frequency and number of active cores are tuned following the threshold-based policy presented in section III-C.

1:  datasets = partitionFiles()
2:  for dataset in datasets do
3:     if avgFileSize > BDP then
4:        dataset.splitFiles(BDP)
5:     end if
6:     ppLevel = BDP / avgFileSize
7:  end for
8:  tputChannel = avgWinSize / RTT
9:  numChannels = bandwidth / tputChannel
10:  for dataset in datasets do
11:     weight = partitionSize / totalSize
12:     ccLevel = weight × numChannels
13:  end for
14:  if SLApolicy(Energy) then
15:     numActiveCores = 1
16:     coreFrequency = minFrequency
17:  else if SLApolicy(Throughput) then
18:     numActiveCores = numCores
19:     coreFrequency = minFrequency
20:  end if
Algorithm 1 Heuristic-based parameter initialization

Iii-a Heuristic-based Parameter Initialization

Algorithm 1 is executed at the beginning of a data transfer. Its purpose is to initialize the 5 parameters described in section II to near-optimal values, based on a heuristic aimed at optimizing channel usage and the utilization of system resources.

After clustering the data, if a partition contains files larger than the Bandwidth-Delay Product (BDP), its files will be split into chunks (line 2-5). This has a twofold effect: multiple chunks can be transferred on different channels concurrently, increasing the throughput, and each chunk completely fills up the channel, its size being equal to the BDP.

Subsequently, the algorithm tries to find the minimum number of channels necessary to use the entire bandwidth (line 8-9). First, it calculates the theoretical throughput of a single TCP channel (line 8), as the average TCP window size over the Round-Trip-Time (RTT). The average window size can be easily estimated using a network benchmark tool such as iperf. Given the estimated channel throughput, the algorithm calculates the number of channels necessary to use the whole bandwidth (line 9).
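As a purely illustrative worked example (the average window size below is hypothetical, not reported by the paper): on a 10 Gbps link with a 32 ms RTT, an average TCP window of 4 MB would give tputChannel = 4 MB / 32 ms ≈ 125 MB/s ≈ 1 Gbps, so the algorithm would open numChannels = 10 Gbps / 1 Gbps = 10 channels to fill the pipe.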

After that, the channels are distributed among the file partitions based on the fraction of data that each partition contains with respect to the entire dataset (line 10-13).

Finally, depending on the SLA requirement, the algorithm initializes the number of active cores and the core frequency (line 14-20).
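A minimal Python rendering of this initialization heuristic is sketched below; the variable and method names (split_files, pp_level, cc_level) are illustrative, since the paper specifies the logic only as pseudocode.

def initialize_transfer(datasets, bdp, avg_win_size, rtt, bandwidth,
                        sla, num_cores, min_freq):
    for ds in datasets:
        if ds.avg_file_size > bdp:
            ds.split_files(bdp)                              # chunks of one BDP each
        ds.pp_level = max(1, round(bdp / ds.avg_file_size))  # pipelining level
    tput_channel = avg_win_size / rtt                        # throughput of one TCP stream
    num_channels = max(1, round(bandwidth / tput_channel))   # channels needed to fill the pipe
    total_size = sum(ds.size for ds in datasets)
    for ds in datasets:
        ds.weight = ds.size / total_size                     # share of the whole dataset
        ds.cc_level = max(1, round(ds.weight * num_channels))
    if sla == "energy":
        return 1, min_freq            # single active core, lowest frequency
    return num_cores, min_freq        # throughput SLA: all cores, scaled up later by load control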

III-B Runtime Tuning Finite State Machine

Fig. 1: Algorithms Finite State Machine (states: Slow Start, Increase, Warning, Recovery; transitions are driven by positive, neutral, and negative feedback and trigger increaseChannels() or decreaseChannels())
1:  for Timeout do
2:     calculateThroughput()
3:     numCh = bandwidth / lastThroughput
4:     updateWeights()
5:     for dataset in datasets do
6:        ccLevel = weight × numCh
7:        updateChannels()
8:     end for
9:  end for
Algorithm 2 Slow Start algorithm

After initializing pipelining, parallelism, concurrency, CPU frequency, and number of active CPU cores using algorithm 1, the data transfer is started and a different algorithm is executed depending on the SLA. Nonetheless, all three algorithms presented in this paper follow a similar structure that can be described using a Finite State Machine, illustrated in figure 1.

The first state, called Slow Start, is entered after algorithm 1. After a short timeout, the tuning algorithm measures the throughput and, if necessary, adjusts the number of channels to compensate for the initial estimation error.

In state Increase, the transfer parameters are increased or decreased based on the feedback from the channel. If the algorithm’s goal is energy-related, the feedback is represented by the energy consumption since the last timeout, otherwise it is the average throughput during the last time interval.

Upon receiving negative feedback, the algorithm transitions to the state Warning. From there, a positive or neutral feedback suggests that the performance drop was only temporary, which causes a transition back to state Increase. However, upon receiving a second negative feedback, the algorithm enters state Recovery.

From here, a positive or neutral feedback is a sign that reducing the channel count eased the load on the network, and the algorithm goes back to state Increase. On the other hand, a negative feedback indicates that the channel’s available bandwidth dropped, hence the previous channel count is restored and the algorithm transitions back to state Increase.
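The transition logic of Fig. 1 can be captured compactly as a lookup table, as in the illustrative sketch below (the Slow Start state is entered only once and is therefore omitted); how feedback is classified is left to the individual algorithms.

# (state, feedback) -> (next state, action); None means no channel change.
TRANSITIONS = {
    ("INCREASE", "positive"): ("INCREASE", "increase_channels"),
    ("INCREASE", "neutral"):  ("INCREASE", None),
    ("INCREASE", "negative"): ("WARNING",  None),
    ("WARNING",  "positive"): ("INCREASE", None),
    ("WARNING",  "neutral"):  ("INCREASE", None),
    ("WARNING",  "negative"): ("RECOVERY", "decrease_channels"),
    ("RECOVERY", "positive"): ("INCREASE", None),
    ("RECOVERY", "neutral"):  ("INCREASE", None),
    ("RECOVERY", "negative"): ("INCREASE", "increase_channels"),  # restore previous count
}

def fsm_step(state: str, feedback: str):
    return TRANSITIONS[(state, feedback)]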

III-C Threshold-based dynamic frequency and core scaling

1:  for Timeout do
2:     if cpuLoad > maxLoad then
3:        if numActiveCores < numCores then
4:           increaseActiveCores()
5:        else if cpuFreq < maxFreq then
6:           increaseFrequency()
7:        end if
8:     else if cpuLoad < minLoad then
9:        if cpuFreq > minFreq then
10:           decreaseFrequency()
11:        else if numActiveCores > 1 then
12:           decreaseActiveCores()
13:        end if
14:     end if
15:  end for
Algorithm 3 Load Control algorithm

The CPU frequency and the number of active cores are dynamically tuned using a threshold-based policy, implemented in algorithm 3.

When the CPU utilization increases above a certain threshold, named maxLoad, the algorithm tries to increase the number of active cores or CPU frequency, in order to reduce the load on the system (line 2-7). Conversely, if the CPU utilization is lower than a certain threshold, named minLoad, the algorithm tries to reduce the CPU frequency or the number of active cores.

Algorithm 3 is called at regular intervals by the parameter tuning algorithms to keep the energy consumption as low as possible without sacrificing performance. In fact, every time one of the other transfer parameters is modified, the CPU load might change as well, and the system could either use more energy than needed or achieve a lower performance gain than expected.
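A hedged sketch of this load-control step is shown below; it assumes the psutil library for CPU utilization, and the minLoad/maxLoad thresholds and frequency step are illustrative values not taken from the paper. The returned settings could then be applied, for example, through sysfs interfaces like the ones sketched in the introduction.

import psutil

def load_control(active_cores, freq, total_cores, min_freq, max_freq, freq_step,
                 min_load=30.0, max_load=80.0):
    load = psutil.cpu_percent(interval=1.0)       # utilization over the last second
    if load > max_load:                           # CPU is the bottleneck
        if active_cores < total_cores:
            active_cores += 1                     # bring one more core online first
        elif freq < max_freq:
            freq = min(max_freq, freq + freq_step)
    elif load < min_load:                         # CPU is underutilized
        if freq > min_freq:
            freq = max(min_freq, freq - freq_step)
        elif active_cores > 1:
            active_cores -= 1                     # take one core offline
    return active_cores, freq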

IV Parameter Tuning Algorithms

In this section, we present three novel energy-efficient parameter tuning algorithms, which dynamically adapt the parameter values to achieve three different SLA requirements: minimum energy consumption, maximum throughput, and target throughput.

IV-A Minimum energy algorithm

1:  SlowStart()
2:  for Timeout do
3:     calculateThroughput()
4:     calculateEnergy()
5:     remainTime = remainData / avgThroughput
6:     predictedEnergy = avgPower × remainTime
7:     if state = INCREASE then
8:        if consumedEnergy + predictedEnergy ≤ (1 - ε) prevEstimate then
9:           numCh = min(numCh + ΔCh, maxCh)
10:        else if consumedEnergy + predictedEnergy ≥ (1 + ε) prevEstimate then
11:           state = WARNING
12:        end if
13:     else if state = WARNING then
14:        if consumedEnergy + predictedEnergy ≤ (1 + ε) prevEstimate then
15:           state = INCREASE
16:        else
17:           numCh = max(numCh - ΔCh, 1)
18:           state = RECOVERY
19:        end if
20:     else if state = RECOVERY then
21:        if consumedEnergy + predictedEnergy ≤ (1 + ε) prevEstimate then
22:           state = INCREASE
23:        else
24:           numCh = min(numCh + ΔCh, maxCh)
25:           state = INCREASE
26:        end if
27:     end if
28:     updateWeights()
29:     for dataset in datasets do
30:        ccLevel = weight × numCh
31:        updateChannels()
32:     end for
33:  end for
Algorithm 4 Minimum energy algorithm

The minimum energy algorithm tries to achieve minimum energy consumption using two different strategies: 1) increasing the concurrency level only if that results in a lower estimated energy usage; 2) increasing the active core count and the CPU frequency only if the CPU is reaching full utilization, while reducing the number of active cores and CPU frequency if the CPU load is lower than a certain threshold.

During the Slow Start phase (line 1-9), the algorithm updates the weights assigned to each dataset, based on the remaining data size, and redistributes the channels across the datasets (line 7-10).

Once every timeout, the algorithm assesses whether or not the last parameter change has caused a performance improvement (line 10-14). Depending on the feedback, it either increases the channel count (line 17) or enters state Warning (line 19).

While in state Warning, the algorithm tries to assess whether the performance drop has been caused by an excessively high concurrency level or by a change in available bandwidth (line 21-27). In order to do that, it estimates the transfer energy consumption, and if still higher than the previous estimate, it decreases the channel count and moves to state Recovery. If the energy spike was only temporary (line 22), it goes back to state Increase.

In state Recovery, the algorithm determines whether or not the parameter reduction has caused the energy consumption to lower (line 29). If that is the case, the previous channel count was too high and the new value is closer to optimal than the previous one. Otherwise (line 32), the available bandwidth has changed and the algorithm shifts back to state Increase to find a new optimal channel count.

At every iteration, the algorithm recalculates weights and concurrency levels for each dataset based on the remaining data size. Slower datasets will receive a higher fraction of channels in order to complete the transfer at approximately the same time.
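The energy feedback driving these transitions can be summarized as in the sketch below; the tolerance ε and the unit conventions are illustrative, and the estimate is simply the energy consumed so far plus the average power times the predicted remaining transfer time.

def energy_feedback(consumed_j, avg_power_w, remaining_bytes, avg_tput_bps,
                    prev_estimate_j, eps=0.05):
    remain_time = remaining_bytes * 8 / avg_tput_bps      # seconds left at the current rate
    estimate = consumed_j + avg_power_w * remain_time     # projected total energy (joules)
    if estimate <= (1 - eps) * prev_estimate_j:
        return "positive", estimate                       # energy outlook improved
    if estimate >= (1 + eps) * prev_estimate_j:
        return "negative", estimate                       # energy outlook worsened
    return "neutral", estimate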

IV-B Energy-efficient maximum throughput algorithm

1:  SlowStart()
2:  for Timeout do
3:     calculateThroughput()
4:     if state = INCREASE then
5:        if avgTput ≥ (1 + ε) refTput then
6:           numCh = min(numCh + ΔCh, maxCh)
7:           refTput = avgTput
8:        else if avgTput ≤ (1 - ε) refTput then
9:           state = WARNING
10:        end if
11:     else if state = WARNING then
12:        if avgTput ≥ (1 - ε) refTput then
13:           state = INCREASE
14:        else
15:           numCh = max(numCh - ΔCh, 1)
16:           state = RECOVERY
17:        end if
18:     else if state = RECOVERY then
19:        if avgTput ≥ (1 - ε) refTput then
20:           state = INCREASE
21:        else
22:           numCh = min(numCh + ΔCh, maxCh)
23:           state = INCREASE
24:           refTput = avgTput
25:        end if
26:     end if
27:     updateWeights()
28:     for dataset in datasets do
29:        ccLevel = weight × numCh
30:        updateChannels()
31:     end for
32:  end for
Algorithm 5 Energy-efficient maximum throughput algorithm

The energy-efficient maximum throughput algorithm tries to maximize the throughput while keeping the number of channels as low as possible. It reaches this goal by not increasing the channel count when doing so does not improve the throughput.

The algorithm starts by executing the Slow Start phase (line 1-10). It also updates the reference throughput to the average throughput measured in the Slow Start phase. The reference throughput is set to the best achieved throughput in state Increase, and is used to determine the feedback received from the channel while in other states.

Initially, the algorithm starts in state Increase. Upon timeout, it measures the average throughput, and if it is higher than the reference throughput by at least a factor of ε, it increases the number of channels and updates the reference throughput (line 11-16). Otherwise, if the feedback was negative, it enters state Warning to determine whether the performance drop has been caused by an excessively high channel count or by a change in available bandwidth (line 17-19).

If the algorithm receives positive or neutral feedback while in state Warning, it goes back to state Increase, assuming the performance drop was only temporary (line 21-23). Conversely, a negative feedback causes the algorithm to enter state Recovery and temporarily reduce the number of channels (line 24-26).

If decreasing the channel count improved the throughput, the algorithm transitions to state Increase without further changing the parameter values (line 28-30); otherwise, it restores the previous channel count, assuming that the performance drop was caused by a reduction in available bandwidth, and updates the reference throughput to the last measured average throughput (line 31-34).

Finally, no matter which state the algorithm is in, the weights for each dataset are recalculated based on the remaining data size, and channels are redistributed among datasets based on their weights (line 37-39).
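Across all states, the feedback classification used here reduces to a comparison against the reference throughput within a tolerance ε, as in this illustrative sketch:

def throughput_feedback(avg_tput, ref_tput, eps=0.05):
    # eps is an illustrative tolerance; positive feedback triggers a channel increase
    # and a reference update, negative feedback moves the FSM toward Warning/Recovery.
    if avg_tput >= (1 + eps) * ref_tput:
        return "positive"
    if avg_tput <= (1 - eps) * ref_tput:
        return "negative"
    return "neutral"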

IV-C Energy-efficient target throughput algorithm

1:  SlowStart()
2:  for Timeout do
3:     calculateThroughput()
4:     if state = INCREASE then
5:        if avgTput ≥ (1 + ε) targetTput or avgTput ≤ (1 - ε) targetTput then
6:           state = RECOVERY
7:        end if
8:     else if state = RECOVERY then
9:        if avgTput ≥ (1 + ε) targetTput then
10:           numCh = max(numCh - ΔCh, 1)
11:        else if avgTput ≤ (1 - ε) targetTput then
12:           numCh = min(numCh + ΔCh, maxCh)
13:        end if
14:        state = INCREASE
15:     end if
16:     updateWeights()
17:     for dataset in datasets do
18:        ccLevel = weight × numCh
19:        updateChannels()
20:     end for
21:  end for
Algorithm 6 Energy-efficient target throughput algorithm

The energy-efficient target throughput algorithm aims at reaching the target throughput using as few channels as possible. It follows a simplified Finite State Machine with only 3 states, Slow Start, Increase, and Recovery, in order to have a faster reaction time to changes in the channel.

After running through the Slow Start phase (line 1-10), the algorithm measures the throughput and compares it with the target throughput. If it is higher or lower than the target by at least a factor of ε, it enters state Recovery. After one more timeout, if the throughput is still higher than the target by at least a factor of ε, the channel count is reduced; on the other hand, if it is still lower than the target by at least a factor of ε, the channel count is increased instead. No matter which feedback the algorithm received, it transitions back to state Increase to keep measuring the achieved throughput.

Finally, the algorithm updates the weights for each dataset and reassigns the channels across transfers similarly to the other algorithms.
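The per-timeout adjustment can be condensed into a single band check, as in the sketch below (the tolerance, step, and channel cap are illustrative values, not the paper's):

def adjust_channels(avg_tput, target_tput, num_ch, ch_step=1, max_ch=64, eps=0.05):
    if avg_tput >= (1 + eps) * target_tput:
        return max(num_ch - ch_step, 1)        # above the band: release channels
    if avg_tput <= (1 - eps) * target_tput:
        return min(num_ch + ch_step, max_ch)   # below the band: add channels
    return num_ch                              # within the band: keep the current count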

V Experimental Results

Testbed         | Bandwidth | RTT   | BDP    | CPU architecture (server / client)
Chameleon Cloud | 10 Gbps   | 32 ms | 40 MB  | Haswell / Haswell
CloudLab        | 1 Gbps    | 36 ms | 4.5 MB | Haswell / Broadwell
DIDCLab         | 1 Gbps    | 44 ms | 5.5 MB | Haswell / Bloomfield
TABLE I: Characteristics of testbeds
Dataset      | Num files | Total size | Avg file size | Std dev
Small files  | 20,000    | 1.94 GB    | 101.92 KB     | 29.06 KB
Medium files | 5,000     | 11.70 GB   | 2.40 MB       | 0.27 MB
Large files  | 128       | 27.85 GB   | 222.78 MB     | 15.19 MB
TABLE II: Characteristics of datasets
Fig. 2: Comparison of throughput and energy consumption across 3 different testbeds (panels a-c: average throughput on Chameleon Cloud, CloudLab, and DIDCLAB; panels d-f: energy consumption on the same testbeds)

We evaluated the parameter tuning algorithms on three different testbeds: i) Chameleon Cloud, with the server located at the University of Chicago and the client at the Texas Advanced Computing Center; ii) CloudLab, with the server located at the University of Wisconsin and the client at the University of Utah; iii) DIDCLab, with the client located at the Data-Intensive and Distributed Computing Lab at the University at Buffalo and the server at the University of Chicago. An overview of the three testbeds is provided in table I.

In order to compare the algorithms across different scenarios, we used four different datasets in the experiments: the three described in table II, and a mixed dataset, which is a combination of the previous three datasets.

We compared our algorithms with the only alternative solutions, proposed by Alan et al. [2, 3], and with some other commonly used data transfer tools: i) curl and wget, which are standard command-line tools to transfer single files and datasets; ii) http/2.0, which is an upgrade to http/1.1 and promises to offer better performance by implementing multiplexing, which allows multiple streams to be transferred over the same TCP connection.

The energy consumption has been measured using a Yokogawa WT210 digital power meter on the client in the DIDCLab testbed, while Intel RAPL has been used on every other node. Intel RAPL [5] is a software power model that estimates energy usage by using hardware performance counters and I/O models. Its accuracy for the Haswell and Broadwell CPUs used in the experiments has been validated in previous work [12, 9, 22]. Since Intel RAPL provides accurate measurements only for CPU and memory usage, we reduced the impact of disk activity to a minimum by performing all transfers memory-to-memory.
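For reference, RAPL counters can be sampled from user space through the Linux powercap interface, as in the hedged sketch below; the exact path and the presence of the package-0 domain depend on the system, and this reflects only the CPU/DRAM domains, not a wall-power meter such as the Yokogawa WT210.

from pathlib import Path
import time

RAPL = Path("/sys/class/powercap/intel-rapl:0/energy_uj")   # package-0 energy counter

def measure_energy_joules(duration_s: float) -> float:
    before = int(RAPL.read_text())
    time.sleep(duration_s)
    after = int(RAPL.read_text())
    delta = after - before
    if delta < 0:                                            # the counter wrapped around
        delta += int(RAPL.with_name("max_energy_range_uj").read_text())
    return delta / 1e6                                       # microjoules to joules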

Fig. 3: Comparison of target throughput algorithms (average throughput and energy consumption with different targets on Chameleon Cloud and CloudLab)

V-A Comparison of minimum energy and maximum throughput algorithms

Figure 2 shows the throughput and energy consumption of the transfer tools and algorithms tested on the Chameleon Cloud, CloudLab, and DIDCLAB testbeds.

Wget and curl perform very poorly due to the lack of any optimization, with high energy consumption and very low throughput. On the other hand, http/2.0 achieves better performance thanks to multiplexing, which reduces the impact of RTTs, especially when transferring small files. However, on a wide area network, http/2.0 is not able to fully use the bandwidth due to the lack of parallelism and concurrency tuning.

Conversely, the Minimum Energy and Maximum Throughput algorithms by Alan et al. achieve much better performance, although they suffer from some major drawbacks, which become evident when transferring the large and mixed datasets on the large-BDP testbed (Chameleon Cloud): i) static parameter tuning, which at times leads to suboptimal parameters; ii) in both algorithms, as the buffer size grows to match the network BDP, the parallelism level drops to 1, causing poor performance.

On the other hand, our algorithms outperform the state-of-the-art solutions across all scenarios. ME reduces the energy consumption by up to 48% with respect to the Min Energy algorithm by Alan et al. when transferring the mixed dataset. Moreover, EEMT achieves up to 80% better throughput than the Maximum Throughput algorithm by Alan et al. when transferring the mixed dataset, and reduces the energy consumption by up to 43%.

The reduced energy consumption is mainly due to better dynamic tuning of the transfer parameters and the use of dynamic scaling of CPU frequency and number of active cores.

V-B Comparison of target throughput algorithms

Figure 3 shows a comparison between our target throughput algorithm and the state-of-the-art solution by Alan et al., using the mixed dataset and different target values (80%, 60%, 40%, and 20% of the maximum theoretical bandwidth). We excluded the DIDCLab testbed from the comparison due to its low available bandwidth.

Our target throughput algorithm achieves a throughput within 5-10% of the target across all scenarios, with the only exception of Chameleon Cloud when the target is set to 8 Gbps. However, since no algorithm achieves more than 7 Gbps (figure 2), this is most likely due to low available bandwidth. The target throughput algorithm by Alan et al. is able to achieve the target only for low throughput values (20% of the bandwidth on Chameleon Cloud and 20-40% on CloudLab). Its poor performance is most likely due to 2 factors: i) the algorithm starts with one channel and slowly increments its channel count, taking a very long time to reach the target; ii) the algorithm does not distribute the channels across datasets based on the remaining size or current speed, so slower datasets become bottlenecks.

Even when achieving a very similar throughput, our algorithm consumes much less energy than the target throughput algorithm by Alan et al., achieving 20% reduced energy consumption on Chameleon Cloud for a target of 2 Gbps and 29% less energy consumption on CloudLab for a target of 400 Mbps.

The only scenario in which the algorithm by Alan et al. consumes slightly less energy than EETT is on CloudLab when the target is set to 200 Mbps; however, this is because it achieves a throughput 60% higher than the target, greatly reducing the transfer time.

Fig. 4: Effect of frequency and core scaling on the client's energy consumption (Chameleon Cloud, CloudLab, and DIDCLAB)

V-C Effect of frequency scaling and number of active cores

In order to analyze the extent to which CPU frequency and core scaling reduce the energy consumption, we removed the load control module from the 2 algorithms ME (Minimum Energy) and EEMT (Energy Efficient Maximum Throughput). We then measured the energy consumption on the client, since there is no frequency scaling on the server.

Figure 4 shows a comparison of all algorithms across the 3 testbeds. Without frequency and active core scaling, ME reduces the energy consumption by up to 42% on Chameleon Cloud with respect to the Min Energy algorithm by Alan et al., while EEMT achieves 30% less energy consumption than the Max Throughput algorithm by Alan et al. on the same testbed. However, when adding frequency and core scaling to both algorithms, the energy consumption drops by an additional 19% for ME and 17% for EEMT, bringing the total energy reduction to 53% with respect to Min Energy (Alan et al.) and 43% with respect to Max Throughput (Alan et al.).

On the DIDCLab testbed, the bandwidth is much more limited, and therefore the potential for energy savings is lower than on a high-bandwidth testbed such as Chameleon Cloud. In fact, ME's energy consumption is 9% lower than Min Energy by Alan et al. with no scaling. However, when using frequency and active core scaling, the energy saving rises to 22%, showing the potential of this technique in limited-bandwidth scenarios. Similarly, EEMT achieves 8% less energy consumption than Max Throughput by Alan et al. without scaling, whereas the reduction in energy rises to 23% when using CPU frequency and core scaling.

VI Related Work

Application-layer network optimization mainly focuses on tuning protocol parameters to avoid congestion. Earlier works proposed models that allocate TCP socket buffers to saturate the network links [11]. However, TCP buffer size allocation alone fails to achieve the optimal bandwidth in long-RTT networks. Numerous works have proposed opening multiple parallel streams to increase the transfer throughput [16, 21, 20, 13, 6]. Liu et al. [14] introduced a GridFTP-based solution that can open multiple transfer sessions to facilitate concurrent file transfers from a single dataset. Pipelining [7] was introduced to reduce the one-RTT delay between consecutive small file transfers.

Energy efficiency in data communication deals mostly with the network infrastructure. Many works suggested different power modes [8]. Nedevschi et al. [17] explored the joint effect of sleeping support and network rate adaptation based on workloads. Hardware-level energy efficiency was proposed in the IEEE 802.3az standard [1] to make Ethernet cards more energy efficient. Alan et al. investigated the energy consumption and throughput of data transfers under different concurrency and parallelism levels, and proposed a heuristic-based parameter search to improve performance and energy consumption [2, 3].

VII Conclusion

In this paper, we introduced three novel application-level parameter tuning algorithms to provide SLA-based energy-efficient data transfer service. The algorithms combine heuristics and runtime tuning to satisfy the SLA requirements set by the user. Our model reduces the energy consumption by dynamically tuning the CPU frequency and changing the number of active cores, as well as adjusting the pipelining, parallelism, and concurrency levels. Experimental results show that the proposed algorithms outperform the state-of-the-art solutions, providing up to 48% reduced energy consumption and 80% better throughput.

References

  • [1] IEEE energy efficient ethernet standards. 10.1109/IEEESTD.2010.5621025, Oct. 2010.
  • [2] I. Alan, E. Arslan, and T. Kosar. Power-aware data scheduling algorithms. In Proceedings of IEEE/ACM Supercomputing Conference (SC15), November 2015.
  • [3] I. Alan, E. Arslan, and T. Kosar. Energy-Performance Trade-offs in Data Transfer Tuning at the End-Systems. Sustainable Computing: Informatics and Systems Journal, Under Review, 2014.
  • [4] J. Chabarek, J. Sommers, P. Barford, C. Estan, D. Tsiang, and S. Wright. Power awareness in network design and routing. In IEEE INFOCOM 2008-The 27th Conference on Computer Communications, pages 457–465. IEEE, 2008.
  • [5] H. David, E. Gorbatov, U. R. Hanebutte, R. Khanna, and C. Le. Rapl: Memory power estimation and capping. In 2010 ACM/IEEE International Symposium on Low-Power Electronics and Design (ISLPED), pages 189–194, Aug 2010.
  • [6] E. Deelman, T. Kosar, C. Kesselman, and M. Livny. What makes workflows work in an opportunistic environment? Concurrency and Computation: Practice and Experience, 18(10):1187–1199, 2006.
  • [7] N. Freed. SMTP service extension for command pipelining. http://tools.ietf.org/html/rfc2920.
  • [8] M. Gupta and S. Singh. Greening of the internet. In ACM SIGCOMM, pages 19–26, 2003.
  • [9] M. Hähnel, B. Döbel, M. Völp, and H. Härtig. Measuring energy consumption for short code paths using rapl. SIGMETRICS Perform. Eval. Rev., 40(3):13–17, Jan. 2012.
  • [10] B. Heller, S. Seetharaman, P. Mahadevan, Y. Yiakoumis, P. Sharma, S. Banerjee, and N. McKeown. Elastictree: Saving energy in data center networks. In Nsdi, volume 10, pages 249–264, 2010.
  • [11] M. Jain, R. S. Prasad, and C. Dovrolis. The tcp bandwidth-delay product revisited: network buffering, cross traffic, and socket buffer auto-sizing. 2003.
  • [12] K. N. Khan, M. Hirki, T. Niemi, J. K. Nurminen, and Z. Ou. Rapl in action: Experiences in using rapl for power measurements. ACM Trans. Model. Perform. Eval. Comput. Syst., 3(2):9:1–9:26, Mar. 2018.
  • [13] T. Kosar. Data Placement in Widely Distributed Sytems. PhD thesis, University of Wisconsin–Madison, 2005.
  • [14] W. Liu, B. Tieman, R. Kettimuthu, and I. Foster. A data transfer framework for large-scale science experiments. In Proceedings of DIDC Workshop, 2010.
  • [15] D. Lopez-Perez, X. Chu, A. V. Vasilakos, and H. Claussen. Power minimization based resource allocation for interference mitigation in ofdma femtocell networks. IEEE Journal on Selected Areas in Communications, 32(2):333–344, 2014.
  • [16] D. Lu, Y. Qiao, P. A. Dinda, and F. E. Bustamante. Modeling and taming parallel tcp on the wide area network. In Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International, pages 68b–68b. IEEE, 2005.
  • [17] S. Nedevschi, L. Popa, G. Iannaccone, S. Ratnasamy, and D. Wetherall. Reducing network energy consumption via rate-adaptation and sleeping. In Proceedings of NSDI, April 2008.
  • [18] 'Tsunami of data' could consume one fifth of global electricity by 2025. https://www.theguardian.com/environment/2017/dec/11/tsunamiofdatacouldconsumefifthglobalelectricityby2025, 2017.
  • [19] Cisco Systems. Visual networking index: Forecast and methodology, 2017–2022, January 2019.
  • [20] E. Yildirim and T. Kosar. End-to-end data-flow parallelism for throughput optimization in high-speed networks. Journal of Grid Computing, pages 1–24, 2012.
  • [21] E. Yildirim, D. Yin, and T. Kosar. Balancing tcp buffer vs parallel streams in application level throughput optimization. In Proceedings of DADC Workshop, 2009.
  • [22] H. Zhang and H. Hoffmann. A quantitative evaluation of the RAPL power control system. 2014.