Efficient Memory Management for GPU-based Deep Learning Systems

02/19/2019 ∙ by Junzhe Zhang, et al. ∙ National University of Singapore

GPU (graphics processing unit) has been used for many data-intensive applications. Among them, deep learning systems are one of the most important consumer systems for GPU nowadays. As deep learning applications adopt deeper and larger models in order to achieve higher accuracy, memory management becomes an important research topic for deep learning systems, given that GPU has limited memory size. Many approaches have been proposed to address this issue, e.g., model compression and memory swapping. However, they either degrade the model accuracy or require a lot of manual intervention. In this paper, we propose two orthogonal approaches to reduce the memory cost from the system perspective. Our approaches are transparent to the models, and thus do not affect the model accuracy. They are achieved by exploiting the iterative nature of the training algorithm of deep learning to derive the lifetime and read/write order of all variables. With the lifetime semantics, we are able to implement a memory pool with minimal fragments. However, the optimization problem is NP-complete. We propose a heuristic algorithm that reduces up to 13.3% of memory compared with Nvidia's default memory pool, with equal time complexity. With the read/write semantics, the variables that are not in use can be swapped out from GPU to CPU to reduce the memory footprint. We propose multiple swapping strategies to automatically decide which variable to swap and when to swap out (in), which reduces the memory cost by up to 34.2% without incurring communication overhead.

I Introduction

GPU has boosted the performance of many data-intensive applications, including database management [1, 2, 3, 4, 5, 6], graph processing [7], and machine learning tasks [8, 9]. Among them, deep learning, also known as deep neural networks (DNNs), is one of the most successful and popular applications of GPU. The training algorithm of DNNs involves many large matrix multiplication operations. By accelerating the matrix operations via thousands of processing units in parallel, GPU enables us to train complex DNN models efficiently, speeding up the training by one order of magnitude [10, 11]. Numerous studies have shown that larger and deeper DNNs can significantly increase the model accuracy [12, 13] for computer vision and natural language processing applications. However, GPU has limited memory while DNNs are memory hungry. For instance, AlexNet [9] was trained on two GPUs (each with 3 GB of memory) in parallel to overcome the memory limitation, while the VGG network is much larger and has to be trained on a 4-GPU system [14]. This limitation has been a bottleneck in exploring deep and wide DNNs that capture complex regularities of big data [12]. In fact, there are many system challenges [15, 16, 17] for deep learning. In this paper, we focus on memory optimization from the system perspective.

Various techniques for reducing the GPU memory footprint have been proposed (see Section II for the details), including (i) buffering and paging, (ii) model compression, (iii) memory sharing, (iv) trading computation for memory, and (v) memory swapping. However, those approaches have their respective drawbacks or limitations. For instance, the general buffering and paging strategies work at a coarse memory granularity, which is not optimal in memory saving for deep learning, whose variables' sizes vary significantly. The model compression approaches tend to decrease the accuracy or introduce quantization error [18]. Existing swapping approaches require a lot of human intervention.

To address these issues, we propose two automatic approaches for efficient and effective GPU memory optimization that can be easily adopted into existing deep learning systems and are transparent to the end-users. They are also orthogonal to multiple-GPU systems that partition the model or data onto GPUs to reduce the memory footprint of each GPU. Our approaches do not alter the model structure or training algorithm; hence, there is no effect on the accuracy and convergence. They do not require computational graph semantics or knowledge of specific DNN models, but exploit the iterative nature of the deep learning training algorithms to derive the lifetime and read/write order of all variables for memory optimization. In particular, the first approach implements a smart GPU memory pool that optimizes the memory allocation based on the lifetime of all variables. For example, variables whose lifetimes do not overlap can be allocated into the same memory space, i.e., memory sharing. However, finding the optimal allocation scheme is an NP-complete problem. In this paper, we propose a heuristic solution to it.

The second approach automatically swaps variables not in use from GPU to CPU memory and swaps them back before the next access. We observe that the back-propagation algorithm for training DNNs has a special pattern: variables from the bottom layers are only accessed at the beginning and end of each iteration. Consequently, the swapping approach has the potential to save a significant amount of GPU memory. However, it incurs communication and synchronization overhead. The swapping schedule, which decides which variable to swap and when to swap it, has to be designed carefully to hide the overhead. We propose multiple scheduling strategies to trade off the overhead against the memory reduction.

The contributions of this paper include:

  • We propose two approaches to reduce the GPU memory cost of deep learning training, including a memory pool and an automatic swapping mechanism.

  • We implement the two approaches under a unified abstraction, which collects the lifetime and read/write orders of the variables, and then runs the memory pool and memory swapping.

  • We conduct experiments to evaluate the performance of our approaches. The results confirm the superiority of our approaches over the baselines. In particular, our memory pool reduces the memory footprint by up to 13.3% compared with CnMem, Nvidia's default memory pool, at equal time complexity. Our memory swapping approach reduces the memory cost by up to 34.2% without incurring any communication overhead.

The rest of this paper is organized as follows. We review the related work in Section II. Two optimization algorithms for memory pool management and memory swapping are introduced in Section III and Section IV respectively, followed by their implementations in Section V. The performance of the proposed methods is evaluated in Section VI. We conclude this paper in Section VII.

II Related Work and Background

Due to the limited GPU memory size, effective memory management is a must for handling large data sets on GPUs. In this section, we review related work on general and GPU memory management, and then give some background on the training algorithm of DNNs to introduce its iterative nature. Two terms are used frequently in this section, namely, memory footprint and memory load:

Definition 1

Memory footprint or memory consumption is the actual GPU memory consumed by the program. They are used interchangeably in this paper.

Definition 2

Memory load at a point of time is the summation of sizes of all the variables which are currently residing in the GPU memory. Memory load does not consider memory allocation, such as the memory imposed by buffers, fragmentation, etc.

II-A Memory Management

Memory management is an important component of computer systems, including database systems. Various techniques have been proposed, such as paging and cache optimization [19, 20]. A memory pool is one effective memory optimization technique. It pre-allocates a contiguous chunk of memory and takes over the memory management from the operating system. There are numerous variations of memory pool implementations [21], designed to obtain a small time complexity and competitive ratio [22]. CnMem [23] is a GPU memory pool developed by Nvidia. However, it is not optimized for the training of deep learning models. In this paper, we propose to optimize the memory allocation within the GPU memory pool by exploiting variables' lifetime and size information to achieve a better competitive ratio with low time complexity.

Definition 3

Competitive ratio is the ratio of memory footprint to peak memory load.

Definition 4

Peak memory load is the maximum of the memory load in a training iteration. The logical time where the peak memory load occurs is defined as the peak time.
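Definitions 2 to 4 can be made concrete with a small sketch. The code below is illustrative only and is not taken from the paper's implementation: the trace encoding (a positive number is a malloc of that size, a negative number is a free) and the names `Trace`, `MemoryLoadProfile`, `FindPeak`, and `CompetitiveRatio` are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical encoding for this sketch: one entry per operation index,
// positive = malloc of that many bytes, negative = free of that many bytes.
using Trace = std::vector<long long>;

// Memory load after each operation: the running sum of live variable sizes.
std::vector<long long> MemoryLoadProfile(const Trace& trace) {
  std::vector<long long> profile;
  long long load = 0;
  for (long long delta : trace) {
    load += delta;
    assert(load >= 0);  // the load of live variables can never be negative
    profile.push_back(load);
  }
  return profile;
}

struct Peak {
  long long load;    // peak memory load
  std::size_t time;  // peak time: operation index of the first maximum
};

Peak FindPeak(const std::vector<long long>& profile) {
  assert(!profile.empty());
  Peak peak{profile[0], 0};
  for (std::size_t t = 1; t < profile.size(); ++t) {
    if (profile[t] > peak.load) peak = Peak{profile[t], t};
  }
  return peak;
}

// Competitive ratio: memory footprint divided by peak memory load.
double CompetitiveRatio(long long footprint, long long peak_load) {
  assert(peak_load > 0);
  return static_cast<double>(footprint) / static_cast<double>(peak_load);
}
```

For a trace `{4, 3, -3, 2, -4, -2}` the load profile is `{4, 7, 4, 6, 2, 0}`, so the peak memory load is 7 at operation index 1. A pool that occupies 14 units for this trace would have a competitive ratio of 2.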

II-B GPU Memory Management

There have been proposals on system support for paging on the GPU [24, 25]. However, those approaches usually require modifications to hardware or drivers. In data processing, many existing studies (e.g., [5, 7]) partition the data into chunks and ensure that the processing of each data chunk fits into the GPU memory. This approach is effective only if the data accesses are regular and partitionable into chunks. The sizes of variables in DNNs vary a lot, which requires the optimization algorithm to work at a granularity finer than a page or buffer, e.g., at the level of individual variables. Wang et al. [4] proposed a cost-driven replacement policy for the GPU memory, which combines the effects of data size, eviction latency, and data locality. However, those studies do not take advantage of the iterative nature of deep learning training, which we demonstrate to significantly improve the effectiveness of memory management for deep learning systems.

II-C GPU Memory Management for Deep Learning

II-C1 Model Compression

Reducing the precision of parameters or the complexity of the network structure reduces the memory load in deep learning. Half precision (FP16), single bit, and mixed precision numbers have been applied in DNNs [26, 27, 28]. Typically, using low precision numbers introduces quantization error [18]. Model pruning, which removes layers or connections in the DNNs [29, 30], reduces the complexity and memory load; however, it also decreases the training accuracy. It should be noted that our proposed solutions are orthogonal to the model compression approaches stated above.

II-C2 Memory Sharing

Two types of memory sharing have been proposed for deep learning frameworks: in-place operation and buffer reuse. An in-place operation stores the output at the physical address of the input. For instance, the output of an operation can be stored directly at the memory block of its input while the operation is being computed. There are only a few in-place operations in DNNs. Buffer reuse shares memory between variables whose lifetimes do not overlap, as implemented in MXNet-memonger [31]. It can be conveniently implemented in deep learning frameworks of the declarative paradigm (e.g., TensorFlow, MXNet), where the entire computation graph is constructed before computing, providing the topology and data dependencies for smarter memory allocation [32, 31]. It should be noted that the SmartPool approach proposed in this paper provides more fine-grained memory sharing, which will be elaborated in Section III.

II-C3 Trading Computation for Memory

Some intermediate variables, such as feature maps, are freed during forward-propagation and recomputed during backward-propagation to compute gradients. Relevant work includes MXNet-memonger [31], SuperNeurons [33], and the DenseNets implementation in PyTorch [34], all of which use a cost-aware strategy to selectively drop the feature maps that are easy to recompute. However, recomputing variables requires high-level semantics, i.e., the computation graph, and hence cannot be done at the memory management level.

II-C4 Memory Swapping

Memory load can be reduced by swapping variables to CPU memory when they are not in use, and swapping them back to GPU memory right before their next access. Ideally, the communication in both directions should be hidden under computation (via separate streams) to minimize the communication overhead [1]. In GeePS [35], the decision of which layer or tensor to swap is made by the end user. This requires the end user to have a good understanding of the model, including the memory and time consumption of each layer. SuperNeurons [33] restricts swapping to convolution layers only. Big tensors of other layers are not considered for swapping. It also requires either the computational graph of the DNN or the end-user's intervention. A fully automatic approach that is transparent to end-users and is able to swap memory with small communication overhead is therefore desirable.

II-D The Iterative Training Algorithm

Many machine learning algorithms, such as those for training DNNs, are iterative [36]. The stochastic gradient descent (SGD) algorithm for training DNNs typically needs thousands of iterations (or even more) to converge. In each iteration, the network undergoes forward-propagation and backward-propagation as illustrated in Fig. 1, during which up to tens of thousands of variables of different sizes are created, read, updated and deleted. The network computes feature maps during forward-propagation, obtains the loss at the end of forward-propagation, and then computes gradients to update weights during backward-propagation. Feature maps computed in forward-propagation are normally retained in GPU memory until they are used to compute weight gradients. Hence, the overall memory usage increases during forward-propagation, peaks at the end of forward-propagation, and decreases during backward-propagation. The memory load profile over a 5-iteration training process for the VGG network [14] is shown in Fig. 2, with each operation index referring to one malloc / free / read / write operation. We observe that the lifetimes, sizes, and read/write sequences of the variables are rather stable across all the middle iterations (from the second to the second last) as shown in Fig. 2, for most DNNs except stochastic neural networks [37]. This iterative nature enables us to collect the lifetime and read/write semantics of variables in the beginning iterations, and optimize memory management for the later iterations (in Sections III and IV).

Fig. 1: Forward-propagation and backward-propagation.
Fig. 2: Memory load for VGG16 in 5 iterations.

III Memory Pool: SmartPool

In this section, we exploit the iterative nature of the training algorithm of DNNs to optimize the allocation of variables in a memory pool. We call the proposed memory pool SmartPool.

III-A Dynamic Storage Allocation (DSA)

Over one training iteration, tens of thousands of variables are allocated and freed at fixed logical times, i.e., the operation indices. This can be illustrated in a 2D view shown in Fig. 3, where the x-axis represents the logical time in one iteration, and the y-axis represents the memory location in megabytes (MB). Each rectangle refers to a variable, where the width and the height refer to the lifetime and size of the variable respectively. The lifetime information tells at which operation indices a variable is allocated and then deallocated. The memory footprint can be optimized by moving the rectangles vertically such that rectangles whose lifetimes do not overlap share the same memory, i.e., overlap vertically. The problem of allocating and deallocating memory blocks has been well studied as Dynamic Storage Allocation (DSA) [22]. DSA was first shown to be NP-complete by Stockmeyer [38, 39]. Depending on the semantics, the problem can be categorized as: (i) online DSA, where items must be allocated once they arrive, without information about the items yet to arrive; and (ii) offline DSA, where the lifetime and size of all variables are known before allocating the first object. Offline DSA provides more semantics and is normally able to achieve better performance [40]. The iterative nature of deep learning enables us to collect the lifetime and size of all variables in one iteration, and hence the memory footprint optimization problem is an offline DSA problem.

Fig. 3: Illustration of the lifetime and size of variables.

III-B Weighted Graph Representation of Offline DSA

We follow the graph coloring methods to solve the offline DSA problem in the deep learning context, which guarantees desirable time complexity, competitive ratio, and scalability. A method named Weighted Interval Coloring (WIC) [39] was first developed, which provides a good representation of the offline DSA problem by adopting the weight parameter into the graph, denoted by $G = (V, E, w)$.

In this representation, the vertices $V$ denote the memory blocks storing the variables. The edges $E$ denote the pairwise relationship between vertices, with $e_{ij} = 1$ representing that the lifetimes of two memory blocks (i.e., vertices) $v_i$ and $v_j$ overlap, and $e_{ij} = 0$ otherwise. The weights $w$ here are on the vertices rather than the edges, denoting the sizes of the memory blocks.

The weighted interval coloring procedure assigns a range of discrete color values to each node (i.e., variable) with the length equal to its weight instead of one color value, such that the color range is never shared with any of its neighbors. The color range assigned is, in fact, the range of memory on the GPU RAM where a corresponding memory block resides.

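In interval graph coloring, the clique number, denoted as $\omega$, refers to the largest clique size in a graph, which is equivalent to the smallest number of colors required for coloring. In the context of weighted interval coloring, $\omega$ becomes the largest total weight of any clique:

$\omega = \max_{C \in \mathcal{C}} \sum_{v \in C} w(v)$ (1)

where $\mathcal{C}$ is the set of cliques of the graph and $w(v)$ is the weight (size) of vertex $v$. Here $\omega$ is actually the peak memory load defined in Section II. The chromatic number of WIC, denoted as $\chi$, is the number of colors assigned to the graph by this algorithm. It is equivalent to the memory footprint of this allocation, and is bounded between 1 and $r$ times $\omega$, with $r$ being the competitive ratio in the DSA problem.

$\omega \le \chi \le r \cdot \omega$ (2)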
Fig. 4: Illustration of the difference between first-fit and best-fit.

III-C SmartPool Algorithm

In this section, we present our heuristic algorithm that implements WIC [39]. Our algorithm drops the idea of rounding each weight to a power of two, which saves memory space by up to half.

Given a list of variables with their pairwise time-overlapping relationship, we first sort these variables in descending order of size. Starting from the largest variable, we assign a memory block to each variable such that the block does not overlap in address with the blocks already allocated to any of its neighbors, i.e., the variables whose lifetimes overlap with it. In particular, the algorithm tries to fit the variable into one of the free holes between the occupied memory blocks. There are two methods of fitting blocks into the memory pool: first-fit and best-fit. Both are simple yet effective [41]. The difference between first-fit and best-fit is illustrated in Fig. 4. When a variable is to be allocated by the memory pool, the first-fit method allocates it to the first hole that the variable fits in, while the best-fit method allocates it to the smallest hole that the variable fits in. If the memory pool has no hole big enough for the variable, the memory pool is extended and the new memory is used for the variable. As a result, the memory footprint increases as allocation progresses.

In our implementation, best-fit is the default option. Note should be taken that the algorithm only needs to run once when constructing the memory pool; the real allocation is done by looking up a hash table, which will be elaborated in Section VI-A. Hence, the overhead introduced by best-fit is negligible compared to the entire training process.

In previous work, one variable shares memory with only one other variable (one-to-one). In SmartPool, as can be seen from Fig. 4, memory can be shared among any blocks (variables) whose lifetimes do not overlap, and this sharing is not restricted to one-to-one sharing. SmartPool allows more flexible sharing, such that one large variable can share memory with several small ones (whether consecutive or not), and vice versa.
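The heuristic of this section can be sketched as follows. This is a minimal illustration under stated assumptions rather than the SmartPool source: the names `Variable`, `Allocation`, and `PlanPool` are hypothetical, lifetimes are half-open intervals of operation indices, and when no hole fits, the variable is placed right after the last lifetime neighbor so that the pool grows only by the missing amount.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct Variable {
  std::size_t size;       // block size
  std::size_t malloc_at;  // operation index of the malloc
  std::size_t free_at;    // operation index of the free (exclusive)
};

struct Allocation {
  std::vector<std::size_t> offset;  // offset[i]: pool address of variable i
  std::size_t footprint = 0;        // total pool size
};

bool LifetimesOverlap(const Variable& a, const Variable& b) {
  return a.malloc_at < b.free_at && b.malloc_at < a.free_at;
}

Allocation PlanPool(const std::vector<Variable>& vars, bool best_fit = true) {
  // Visit variables from the largest to the smallest.
  std::vector<std::size_t> order(vars.size());
  for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
  std::stable_sort(order.begin(), order.end(),
                   [&](std::size_t a, std::size_t b) {
                     return vars[a].size > vars[b].size;
                   });

  Allocation plan;
  plan.offset.assign(vars.size(), 0);
  std::vector<std::size_t> placed;
  for (std::size_t i : order) {
    const Variable& v = vars[i];
    assert(v.size > 0 && v.malloc_at < v.free_at);

    // Address ranges of placed variables that are alive together with v.
    std::vector<std::pair<std::size_t, std::size_t>> busy;
    for (std::size_t j : placed) {
      if (LifetimesOverlap(v, vars[j])) {
        busy.emplace_back(plan.offset[j], plan.offset[j] + vars[j].size);
      }
    }
    std::sort(busy.begin(), busy.end());

    bool found = false;
    std::size_t best_offset = 0, best_len = 0;
    auto consider = [&](std::size_t begin, std::size_t end) {
      if (end < begin + v.size) return;  // hole is empty or too small
      const std::size_t len = end - begin;
      if (!found || (best_fit && len < best_len)) {
        found = true;
        best_offset = begin;
        best_len = len;
      }
    };
    std::size_t cursor = 0;  // end of the busy ranges scanned so far
    for (const auto& range : busy) {
      consider(cursor, range.first);
      cursor = std::max(cursor, range.second);
    }
    consider(cursor, plan.footprint);  // hole up to the current pool end

    // No hole fits: place v after its last neighbor and extend the pool.
    const std::size_t offset = found ? best_offset : cursor;
    plan.offset[i] = offset;
    plan.footprint = std::max(plan.footprint, offset + v.size);
    placed.push_back(i);
  }
  return plan;
}
```

With four variables of sizes 4, 3, 2, 1 and lifetimes [0, 10), [0, 4), [5, 10), [4, 6), the sketch returns offsets 0, 4, 4, 6 and a footprint of 7, which equals the peak memory load of this toy trace: the third and fourth variables reuse the space of the second one after it is freed.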

IV Memory Swapping: AutoSwap

Each variable during its lifetime might be accessed multiple times for reading and writing. For example, in Fig. 1, the variables from the bottom layers are accessed at the beginning and the end of each training iteration. These variables could be swapped to CPU memory after one access to reduce the memory load, and then prefetched before the next access. However, swapping incurs communication and synchronization overhead, which would slow down the training speed. Even though we can perform the swapping using 2 separate cudaStreams (one for swap-out and the other for swap-in) in parallel to the computation cudaStream, the communication and synchronization overhead may not be completely hidden under computation. In this section, we propose AutoSwap to schedule the swapping in order to minimize the overhead and maximize the reduction of memory load.

AutoSwap applies simple filters to get a set of candidate variables (Section IV-A) for swapping. These candidate variables are ranked according to priority scores (Section IV-B). Given the GPU memory limit, AutoSwap schedules (Section IV-D) the swapping by selecting the variables with higher rankings.

IV-A Candidate Swapping Variables

We set up two simple criteria to obtain a list of candidate variables for swapping. First, we filter out tiny variables that are smaller than a threshold, e.g., 1 MB, because small variables such as the bias vectors in DNNs have little effect on memory load reduction if they are swapped out, while we still have to bear the latency of data transmission. Second, we only consider swapping the variables whose lifetimes span the peak time. There are two advantages: (i) the memory load is tightest near the peak time, where a reduction is needed most; (ii) intermediate variables whose lifetime spans only a few layers during forward-propagation or backward-propagation are excluded, since they have to be swapped back to GPU after a short while. In other words, swapping these variables contributes to memory reduction for only a very short period, while occupying the I/O (i.e., PCIe) bandwidth.
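The two criteria amount to a simple filter, sketched below. The struct `Variable`, the function `CandidateIndices`, and the half-open lifetime convention are assumptions made for this illustration; the threshold is passed in by the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Variable {
  std::size_t size;       // bytes
  std::size_t malloc_at;  // operation index of the malloc
  std::size_t free_at;    // operation index of the free (exclusive)
};

// Returns the indices of variables that pass both filters:
// (1) size is at least min_size, (2) the lifetime spans peak_time.
std::vector<std::size_t> CandidateIndices(const std::vector<Variable>& vars,
                                          std::size_t peak_time,
                                          std::size_t min_size) {
  std::vector<std::size_t> candidates;
  for (std::size_t i = 0; i < vars.size(); ++i) {
    const Variable& v = vars[i];
    assert(v.malloc_at < v.free_at);
    const bool big_enough = v.size >= min_size;
    const bool spans_peak = v.malloc_at <= peak_time && peak_time < v.free_at;
    if (big_enough && spans_peak) candidates.push_back(i);
  }
  return candidates;
}
```

With a 1 MB threshold and a peak time of 5, a 512-byte bias vector and a 2 MB tensor that is freed at operation index 3 are both rejected, while a 4 MB tensor alive over [0, 10) is kept.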

Fig. 5: Illustration of the swapping priority scores for variables.

IV-B Priority Scores

According to the analysis in Section IV-A, to evaluate whether a variable is worth swapping, we look at: (i) the size of the variable, (ii) the time span during which it can be absent from the GPU memory (from the completion of swap-out until the start of swap-in), and (iii) the position of this time span within a training iteration. To quantify the relative priority of the candidate variables, we define a set of Priority Scores (PS), each of which takes one or two factors into account. The four PS are discussed below and illustrated in Fig. 5.

(i) Duration of Absence (DOA): It is simply the duration between two accesses minus the time spent in swap-out and swap-in. A candidate variable with larger DOA is preferred for swapping as the variable resides outside of the GPU for a longer period within one iteration. However, DOA does not care about the variable size, or where the absence is located.

(ii) Area of Absence (AOA): It is the product of DOA and size, which quantifies the amount of memory load reduction over a time period. It can be viewed as removing an AOA amount of area below the memory load curve. A candidate variable with larger AOA is preferred for swapping as it reduces a larger portion of memory and/or for a longer period. However, it does not consider the location of the absence.

For a large candidate variable in the top layers of the neural network, DOA can be negative because the swap-out and swap-in time, i.e., the data transfer time, may be larger than the period between two consecutive accesses of the same variable without swapping. In this case, the definition of AOA is changed to the product of the inverse of the variable size and DOA. As such, among negative AOA values, a larger AOA represents a relatively higher contribution to memory load reduction.

Fig. 6: Illustration of the procedures to determine Submodular Weighted Duration of Absence (SWDOA).

(iii) Weighted Duration of Absence (WDOA): WDOA of a candidate variable is the area under the original memory load curve between two consecutive accesses without swapping. It considers both DOA and the memory load during the period when the variable is swapped out, but not the variable size. A variable with larger WDOA is preferred as it indicates that this variable will be accessed in a longer time frame and the accesses are nearer the peak memory load.

Fig. 7: Illustration of scheduling.

(iv) Submodular WDOA (SWDOA): WDOA for all candidate variables is computed based on the original memory load without swapping any variable. However, as the selection process progresses, the WDOA of the remaining variables becomes inaccurate, since the memory load changes once some variables are swapped out.

SWDOA computes WDOA in a submodular way with an updated memory load profile. It computes WDOA for all candidate variables under the original memory load, selects the variable with the highest WDOA, and updates the memory load. This process is repeated for the remaining candidate variables using the updated memory load. An illustration of the SWDOA procedure to determine the swapping priorities of three variables is shown in Fig. 6. SWDOA addresses the problem that the peak time in the updated memory load profile shifts elsewhere after a number of variables are selected.
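The four scores can be sketched on a discrete memory load profile. The code is a hedged illustration, not the AutoSwap implementation: `Candidate`, `Doa`, `Aoa`, `Wdoa`, and `SwdoaOrder` are hypothetical names, time is measured in operation indices, the area under the load curve is approximated by a sum over operation indices, and the load update in `SwdoaOrder` treats a selected variable as absent for the whole interval between its two accesses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Candidate {
  double size;           // MB
  std::size_t last_use;  // operation index of the access before the absence
  std::size_t next_use;  // operation index of the next access
  double swap_time;      // swap-out plus swap-in time, in operation indices
};

// Duration of Absence: gap between the two accesses minus the transfer time.
double Doa(const Candidate& c) {
  return static_cast<double>(c.next_use - c.last_use) - c.swap_time;
}

// Area of Absence: DOA times size, or DOA divided by size if DOA is negative.
double Aoa(const Candidate& c) {
  const double doa = Doa(c);
  return doa >= 0 ? doa * c.size : doa / c.size;
}

// Weighted DOA: area under the load curve between the two accesses.
double Wdoa(const Candidate& c, const std::vector<double>& load) {
  assert(c.last_use < c.next_use && c.next_use <= load.size());
  double area = 0;
  for (std::size_t t = c.last_use; t < c.next_use; ++t) area += load[t];
  return area;
}

// SWDOA: repeatedly pick the remaining candidate with the highest WDOA
// and lower the load over its absence before scoring the others again.
std::vector<std::size_t> SwdoaOrder(const std::vector<Candidate>& cands,
                                    std::vector<double> load) {
  std::vector<bool> taken(cands.size(), false);
  std::vector<std::size_t> order;
  for (std::size_t round = 0; round < cands.size(); ++round) {
    std::size_t best = cands.size();
    for (std::size_t i = 0; i < cands.size(); ++i) {
      if (taken[i]) continue;
      if (best == cands.size() ||
          Wdoa(cands[i], load) > Wdoa(cands[best], load)) {
        best = i;
      }
    }
    taken[best] = true;
    order.push_back(best);
    for (std::size_t t = cands[best].last_use; t < cands[best].next_use; ++t) {
      load[t] -= cands[best].size;
    }
  }
  return order;
}
```

For the load profile `{50, 60, 60, 50, 50, 50, 10}` and three candidates with (size, last use, next use) of (40, 0, 4), (10, 1, 4), and (10, 4, 6), the plain WDOA values are 220, 170, and 100. Once the first candidate is selected and the load is updated, the third candidate scores 100 against 50 for the second, so the SWDOA order is first, third, second, which differs from the plain WDOA ranking.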

IV-C Bayesian Optimization

As discussed in Section IV-B, none of the 4 PS is able to cover all three factors. Moreover, as will be discussed in Section VI, it is difficult to determine which factor dominates in a particular scenario. Therefore, it is desirable to have a more powerful PS that combines the 4 existing PS. In this paper, we propose to exploit Bayesian Optimization to generate the aggregated priority score, denoted as BO.

In particular, we generate the aggregated PS via a linear combination of the 4 basic scores, as shown in the equation below. The 4 priority scores are standardized before being fed into the formula. We use Bayesian Optimization to automatically tune the weights $\alpha_1$, $\alpha_2$, $\alpha_3$, and $\alpha_4$, whose values lie in the range [-1, 1]. A Gaussian Process prior is chosen [42], and the optimization objective is set to be the communication overhead. The weights are randomly initialized at the beginning. Then we run the swapping scheduling algorithm using the combined PS. The communication overhead is measured and fed into the Bayesian Optimization framework to update the weights. After about 30 updates, the weights become stable, i.e., they converge. Compared with the total training time, which spans thousands of iterations, the cost of the optimization procedure (about 30 iterations) is negligible.

$BO = \alpha_1 \cdot DOA + \alpha_2 \cdot AOA + \alpha_3 \cdot WDOA + \alpha_4 \cdot SWDOA$ (3)
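The linear combination itself is easy to sketch; the Bayesian Optimization loop that tunes the weights is not shown here. The code below assumes z-score standardization (the text does not say which standardization is used), and `Standardize` and `CombinedScore` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Z-score standardization (an assumption of this sketch).
// A constant column is mapped to all zeros.
std::vector<double> Standardize(const std::vector<double>& xs) {
  assert(!xs.empty());
  double mean = 0;
  for (double x : xs) mean += x;
  mean /= static_cast<double>(xs.size());
  double var = 0;
  for (double x : xs) var += (x - mean) * (x - mean);
  const double sd = std::sqrt(var / static_cast<double>(xs.size()));
  std::vector<double> out;
  for (double x : xs) out.push_back(sd == 0 ? 0.0 : (x - mean) / sd);
  return out;
}

// scores[k][i] is the k-th basic score (DOA, AOA, WDOA, SWDOA) of candidate i.
// Returns the weighted sum of the standardized scores for each candidate.
std::vector<double> CombinedScore(
    const std::array<std::vector<double>, 4>& scores,
    const std::array<double, 4>& weights) {
  const std::size_t n = scores[0].size();
  std::vector<double> combined(n, 0.0);
  for (std::size_t k = 0; k < 4; ++k) {
    assert(scores[k].size() == n);
    assert(weights[k] >= -1.0 && weights[k] <= 1.0);  // range from the text
    const std::vector<double> z = Standardize(scores[k]);
    for (std::size_t i = 0; i < n; ++i) combined[i] += weights[k] * z[i];
  }
  return combined;
}
```

In a full pipeline, an optimizer would propose the four weights, the scheduler would be run with the resulting ranking, and the measured communication overhead would be reported back as the objective value.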

IV-D Swapping Variables Selection

To select the swapping variables, the PS or BO of all candidate variables are calculated. These variables are then sorted in descending order and inserted into a swapping list one by one according to the sorted order, while the memory load is updated (reduced) at the same time. When the peak memory load is no larger than the user-defined memory load limit, the selection stops.

Definition 5

Memory load limit: In the AutoSwap approach, users define the value of memory load limit such that the peak memory load after swapping is no larger than this limit.
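The selection loop can be sketched as a greedy pass over the ranked candidates. This is an illustrative simplification: `Candidate` and `SelectSwapVariables` are hypothetical names, the score is assumed to be precomputed (any PS or BO), and a selected variable is treated as absent for the whole interval between its two accesses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Candidate {
  double size;           // MB
  std::size_t last_use;  // operation index of the access before the absence
  std::size_t next_use;  // operation index of the next access
  double score;          // precomputed priority score (PS or BO)
};

// Adds candidates in descending score order until the peak of the updated
// memory load is no larger than the limit. Returns the selected indices.
std::vector<std::size_t> SelectSwapVariables(
    const std::vector<Candidate>& cands, std::vector<double> load,
    double limit) {
  assert(!load.empty());
  std::vector<std::size_t> order(cands.size());
  for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
  std::stable_sort(order.begin(), order.end(),
                   [&](std::size_t a, std::size_t b) {
                     return cands[a].score > cands[b].score;
                   });

  std::vector<std::size_t> selected;
  for (std::size_t i : order) {
    if (*std::max_element(load.begin(), load.end()) <= limit) break;
    const Candidate& c = cands[i];
    assert(c.last_use < c.next_use && c.next_use <= load.size());
    for (std::size_t t = c.last_use; t < c.next_use; ++t) load[t] -= c.size;
    selected.push_back(i);
  }
  return selected;
}
```

If the limit cannot be reached even with every candidate swapped, the sketch simply returns all candidates; how such a case should be reported is left open here.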

IV-E Swapping Scheduling Algorithm

AutoSwap schedules to swap the selected variables out when they are not in use, and swap them in before the next access, i.e., prefetching. To maximize the PCI-e bandwidth, we schedule all the swap-out events in ascending order of their access time. A variable is swapped out immediately after the access completes. The swap-in events are scheduled according to the next access time of each variable. A variable starts swap-in in advance to make sure its second access is not delayed by the communication. The schedule is designed in the way such that a variable starts swap-out (-in) only after the previous swap-out (-in) event completes.

Communication and synchronization can be completely hidden under computation when the memory load limit is relaxed. Under a stringent memory load limit, however, the memory load may exceed the limit because variables cannot be swapped out quickly enough. If this occurs during forward-propagation, the next Malloc is delayed until sufficient variables are swapped out and their memory is freed; if it occurs during backward-propagation, a swap-in can start only after some variables are freed. This is where the communication and synchronization overhead arises and cannot be hidden.

A simple example of the scheduling is illustrated in Fig. 7. In this example, the peak memory load is 120 MB and the memory load limit is 60 MB. Three variables are selected to meet the memory load limit. The scheduling follows the strategy described above, and two types of time are involved in the figure. A time with a superscript is the scheduled time of a swapping variable, where the first superscript letter marks the start or the end of the swap event and the second superscript letter marks the direction, swap-out or swap-in. A time without a superscript is the time of an operation index. The figure shows two updated load profiles: the first is computed without considering that the updated memory load may exceed the limit; the second guarantees that the peak never exceeds the limit of 60 MB, for which a certain operation has to be delayed.
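The serialized swap-out and swap-in queues can be sketched as below. This is one possible reading of the strategy, not the AutoSwap scheduler: `Swap`, `Slot`, and `ScheduleSwaps` are hypothetical names, each direction has its own queue, swap-outs are packed forward in ascending order of access time, and swap-ins are packed backward from the latest deadline so that each finishes no later than its next access. Delays caused by a stringent memory load limit are not modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

struct Swap {
  double out_ready;    // time at which the access before the absence completes
  double in_deadline;  // time of the next access
  double duration;     // one-way transfer time
};

struct Slot {
  double out_start, out_end, in_start, in_end;
};

std::vector<Slot> ScheduleSwaps(const std::vector<Swap>& swaps) {
  std::vector<Slot> plan(swaps.size());
  std::vector<std::size_t> order(swaps.size());
  for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;

  // Swap-out queue: ascending access time, one transfer at a time.
  std::stable_sort(order.begin(), order.end(),
                   [&](std::size_t a, std::size_t b) {
                     return swaps[a].out_ready < swaps[b].out_ready;
                   });
  double link_free = 0.0;  // time at which the swap-out stream becomes idle
  for (std::size_t i : order) {
    assert(swaps[i].duration > 0);
    plan[i].out_start = std::max(link_free, swaps[i].out_ready);
    plan[i].out_end = plan[i].out_start + swaps[i].duration;
    link_free = plan[i].out_end;
  }

  // Swap-in queue: packed backward from the latest deadline, so every
  // transfer ends before its own deadline and before the next one starts.
  std::stable_sort(order.begin(), order.end(),
                   [&](std::size_t a, std::size_t b) {
                     return swaps[a].in_deadline > swaps[b].in_deadline;
                   });
  double latest_end = std::numeric_limits<double>::infinity();
  for (std::size_t i : order) {
    plan[i].in_end = std::min(latest_end, swaps[i].in_deadline);
    plan[i].in_start = plan[i].in_end - swaps[i].duration;
    latest_end = plan[i].in_start;
  }
  return plan;
}
```

A caller would still need to check that each swap-in starts after the matching swap-out ends; where it does not, the variable is not worth swapping or the transfer cannot be hidden under computation.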

V Implementation

In this section, we introduce the unified program abstraction for implementing SmartPool and AutoSwap, which could be adopted into an existing deep learning framework with ease. The core abstraction is the Device class, which is shown below:

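class Device {
 public:
  // Allocates a block of the given size on the device.
  Block* Malloc(size_t);
  // Releases a block obtained from Malloc.
  void Free(Block*);
  // Runs a function on the device; the two lists hold the blocks
  // to be read and the blocks to be written.
  void Exec(Function, vector<Block*>,
    vector<Block*>);

  // Builds the memory pool from the recorded malloc/free sequence.
  SmartPool PoolOpt();
  // Builds the swapping schedule from the recorded read/write sequence.
  map<int, Event> SwapOpt();

 private:
  SmartPool pool;
  map<int, Event> schedule;
};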

The Device class represents a computing device, i.e., a GPU. For every variable that uses GPU memory, its memory (represented by Block) is managed via Malloc and Free. At the beginning of a training process when the memory pool is not created yet, these two functions are implemented by calling the native cudaMalloc and cudaFree respectively. Meanwhile, malloc/free requests are recorded into a list for detecting an iteration. Once two consecutive subsequences are detected to be repeating, the subsequence is fed into PoolOpt for constructing the memory pool. This subsequence represents all operations involved in one training iteration. PoolOpt extracts the lifetime of each variable and runs the SmartPool algorithm from Section III-C to create a SmartPool instance, i.e., pool. pool maps from each operation index (with a malloc request) to a GPU memory address for memory allocation. Upon creation of pool, Malloc and Free will execute according to the map (implemented as a hash table).
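The repeatability test on the recorded requests can be sketched as follows. This is an assumption-laden illustration, not the detection code used by the authors: the trace encoding (positive = malloc size, negative = free size), the function name `DetectIterationLength`, and the `min_len` guard against trivially short repeats are all choices made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical encoding for this sketch: one entry per recorded request,
// positive = malloc of that size, negative = free of that size.
using Trace = std::vector<long long>;

// Returns the length p of the shortest subsequence (p >= min_len) such that
// the last 2p requests consist of the same subsequence twice in a row,
// or 0 if no such repetition has been observed yet.
std::size_t DetectIterationLength(const Trace& trace, std::size_t min_len) {
  assert(min_len > 0);
  const std::size_t n = trace.size();
  for (std::size_t p = min_len; 2 * p <= n; ++p) {
    bool same = true;
    for (std::size_t k = 0; k < p && same; ++k) {
      same = trace[n - 2 * p + k] == trace[n - p + k];
    }
    if (same) return p;
  }
  return 0;
}
```

Once a non-zero length is returned, the last `p` requests represent one training iteration and can be handed to the pool construction step. The brute-force scan is quadratic in the trace length, which is acceptable for a sketch but would likely need a smarter check on long traces.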

In the meantime, all arithmetic operations running on the GPU device are executed by the function Exec, including cuDNN operations (provided by Nvidia's cuDNN library, https://developer.nvidia.com/cudnn) and user-defined CUDA (https://developer.nvidia.com/cuda-toolkit) kernel operations. Two lists of variables are passed to Exec, one for the variables to be read and the other for the variables to be written. Consequently, these read/write requests are recorded along with the malloc/free requests in a list, which undergoes the repeatability test as well in order to apply the AutoSwap approach. Then SwapOpt is called to optimize the swapping schedule using the algorithms from Section IV. Note that before the swapping schedule is made, the system swaps out as many variables as necessary when it is about to exceed the limit, and swaps a variable in when it is accessed again while not residing in the GPU memory. No prefetching is applied. Hence, the early iterations are slow. The swap-out and swap-in events are executed with 2 separate cudaStreams alongside the default cudaStream for computing. Swapping and synchronization are based on the swapping scheduling algorithm described in Section IV-E.

To combine SmartPool and AutoSwap, we first create AutoSwap and then create SmartPool. The order cannot be exchanged because SmartPool depends on the sequence of Malloc and Free calls, and AutoSwap issues additional Malloc and Free calls for swap-out and swap-in.

VI Evaluation

In this section, we evaluate the performance of SmartPool, AutoSwap, and the combined approach, i.e., the mode in which the orthogonal SmartPool and AutoSwap approaches work concurrently. We conduct the experiments on a workstation equipped with an Intel Xeon E5-1650 v4 CPU and a GeForce GTX 1080 Ti GPU with 11 GB of GDDR5X memory. The CPU memory is 64 GB of DDR4 RAM. The motherboard is an ASUS X99-E WS, which provides PCI-E x16 for data communication between CPU memory and GPU memory. The environment is Ubuntu 16.04 with CUDA 8.0 and cuDNN 5.1. We choose the fastest cuDNN algorithm (without workspace limit) for all the experiments. We use the image dataset CIFAR-10 [43] and evaluate training performance using the commonly used benchmark DNNs ResNet (http://torch.ch/blog/2016/02/04/resnets.html) and VGG (http://torch.ch/blog/2015/07/30/cifar.html) of different depths.

Firstly, we evaluate the competitive ratio and the time complexity of SmartPool compared with the baselines CnMem Pool and cudaMalloc. Secondly, we evaluate the memory load reduction and the corresponding overhead of AutoSwap by using different PS and BO. Thirdly, we evaluate the scalability of the combined approach by comparing its memory footprint with CnMem Pool and cudaMalloc for different DNNs at various batch sizes. Finally, we compare the performance of our combined approach with 3 other memory reduction baselines.

TABLE I: Competitive Ratio and Time Complexity of SmartPool, CnMem Pool and cudaMalloc

VI-A SmartPool

Our experiments show that first-fit and best-fit are comparable in terms of memory footprint. Thus, we directly compare the performance of SmartPool, Nvidia’s native cudaMalloc, and its default memory pool CnMem Pool. We conduct the experiments on ResNet and VGG of different depths at a batch size of 100, with results shown in Table I.

cudaMalloc consumes exactly the amount of memory requested when allocating each variable. Hence, it gives optimal memory consumption, which is equal to the peak memory load. However, calling cudaMalloc frequently inevitably increases the training time and causes fragmentation because variable sizes differ. In this experiment, we take cudaMalloc as the baseline and compare SmartPool with CnMem Pool in terms of competitive ratio and speedup. By definition, the competitive ratio of SmartPool and of CnMem Pool here equals the respective memory footprint divided by the footprint of cudaMalloc; the closer to 1, the better. Speedup refers to the training speed relative to cudaMalloc; a higher value corresponds to a shorter time per training iteration.
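The two metrics can be written down as a short sketch. The speedup formula (baseline time divided by pool time) is an assumption inferred from the statement that a higher value means a shorter iteration, and the numbers below are made up for illustration; they are not taken from Table I.

```python
def competitive_ratio(pool_footprint_mb: float, cudamalloc_footprint_mb: float) -> float:
    """Footprint of a pool divided by the cudaMalloc footprint (1.0 is ideal)."""
    return pool_footprint_mb / cudamalloc_footprint_mb


def speedup(cudamalloc_iter_ms: float, pool_iter_ms: float) -> float:
    """Assumed definition: baseline time per iteration over pool time per iteration."""
    return cudamalloc_iter_ms / pool_iter_ms


# Hypothetical measurements, for illustration only.
ratio = competitive_ratio(pool_footprint_mb=1100.0, cudamalloc_footprint_mb=1000.0)
gain = speedup(cudamalloc_iter_ms=200.0, pool_iter_ms=100.0)
print(round(ratio, 2), round(gain, 2))
```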

SmartPool achieves a very low competitive ratio compared to CnMem Pool for all the network models in the experiment. For VGG11, SmartPool reduces the memory footprint by up to 13.3% compared with CnMem Pool. Even under the worst condition (VGG13), SmartPool introduces only 1.6% fragmentation.

Exploiting the iterative nature of training is the key to achieving a near-optimal competitive ratio. Firstly, it reduces the online DSA problem to an offline one, allowing smarter memory allocation based on the lifetimes and sizes of variables. Secondly, as discussed in Section III-C, SmartPool enables superior memory sharing, such that the memory of a large variable can be shared with several smaller ones, and vice versa. This makes memory sharing efficient and reduces the memory footprint. Thirdly, our algorithm colors the variables in order of decreasing size, and hence the largest variables are allocated at the bottom of the memory pool, which also contributes to a low competitive ratio even for networks with large variables such as VGG.
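To make the offline setting concrete, the following sketch places variables with known lifetimes in order of decreasing size, each at the lowest offset that does not collide with an already placed variable whose lifetime overlaps. It is a simplified first-fit heuristic written for illustration, not the SmartPool algorithm of Section III-C; the function name and the toy variables are assumptions.

```python
from typing import Dict, List, Tuple

Var = Tuple[str, int, int, int]  # (name, size, first_op, last_op), lifetime inclusive


def place_largest_first(variables: List[Var]) -> Tuple[Dict[str, int], int]:
    """Assign each variable an offset in the pool; return (offsets, pool_size)."""
    placed: List[Tuple[int, int, int, int]] = []  # (offset, size, first_op, last_op)
    offsets: Dict[str, int] = {}
    for name, size, first, last in sorted(variables, key=lambda v: -v[1]):
        # Address ranges held by variables whose lifetimes overlap this one.
        busy = sorted((o, o + s) for o, s, f, l in placed if f <= last and first <= l)
        offset = 0
        for start, end in busy:
            if offset + size <= start:
                break  # the gap below this busy range is large enough
            offset = max(offset, end)
        offsets[name] = offset
        placed.append((offset, size, first, last))
    pool_size = max((o + s for o, s, _, _ in placed), default=0)
    return offsets, pool_size


# Toy lifetimes: B and C never coexist, so they can share the same addresses.
toy = [("A", 8, 0, 3), ("B", 4, 0, 1), ("C", 4, 2, 3), ("D", 2, 1, 2)]
offsets, pool_size = place_largest_first(toy)
print(offsets, pool_size)
```

In this toy case the pool size equals the peak load of 14 units (A, B and D are live together at operation 1), so the competitive ratio is 1; the heuristic gives no such guarantee in general.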

Concerning the time complexity, both CnMem Pool and SmartPool are much faster than cudaMalloc, with up to two times speedup. In SmartPool, Malloc is implemented by looking up a table (C++ std::map) of n variables, which takes O(log n) time for a tree-based std::map; in CnMem, Malloc is done by searching over a linked list of m holes, which takes O(m) time.

VI-B AutoSwap

In this subsection, we evaluate the memory load reduced by AutoSwap. Note that the evaluation does not concern the memory footprint. We start by comparing the communication overhead using different priority scores under various memory load limits for VGG16. We then analyze the optimality of the method. Finally, we present the results for other DNNs.

VI-B1 Minimum Memory Load

Fig. 8 shows the memory load profile of VGG16 under different conditions: (i) the original load without swapping, (ii) the updated load with swapping at a load limit of 500 MB and (iii) of 400 MB, and (iv) the lowest load profile, which is achieved by swapping all candidate variables out and in synchronously. Note that the peak of the fourth curve is located at the start of the iteration (or at the end of the iteration in other cases), which differs from the location of the original peak. We define this peak value as the minimum memory load. AutoSwap can reduce the memory load efficiently when the user-defined limit is above the minimum memory load, as it only needs to swap each portion of the memory load once. In the extreme scenario where the user-defined limit is below the minimum memory load, the limit can still be met by also swapping variables outside the candidate set. In that case, a certain part of the memory has to be swapped more than once, which is less cost-effective. As different DNNs vary significantly in their minimum memory load, we only evaluate the communication overhead for memory load limits above the minimum memory load of each DNN.
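A memory load profile of this kind can be computed from variable lifetimes, as the sketch below shows. Swapping a variable out between two accesses is modelled, as a simplifying assumption, by splitting its residency into two short segments; the sizes and operation indices are toy values, not measurements of VGG16.

```python
from typing import List, Tuple

Segment = Tuple[int, int, int]  # (size_mb, first_op, last_op), residency inclusive


def load_profile(segments: List[Segment], num_ops: int) -> List[int]:
    """Sum the sizes of all variables resident on the GPU at each operation index."""
    load = [0] * num_ops
    for size, first, last in segments:
        for op in range(first, last + 1):
            load[op] += size
    return load


# Toy iteration with 6 operations; the 300 MB variable is accessed at ops 0 and 5 only.
original = [(300, 0, 5), (200, 1, 2), (150, 3, 4)]
# Same variables, with the 300 MB one swapped out during ops 1-4.
swapped = [(300, 0, 0), (300, 5, 5), (200, 1, 2), (150, 3, 4)]

before = load_profile(original, 6)
after = load_profile(swapped, 6)
print(before, max(before))
print(after, max(after))
```

In the toy profile the peak moves from the middle of the iteration to its start and end once the large variable is swapped out, which mirrors the shift in peak location noted for the fourth curve.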

Fig. 8: Memory load of VGG16 under different conditions.
Fig. 9: Overhead of different priority scores and Bayesian Optimization for VGG16.

VI-B2 Comparing the Priority Scores

We compare the communication overhead of the 4 PS and BO for VGG16 at a batch size of 100. Fig. 9 shows the comparison results, which cover a wide range of memory load limits from 647 MB down to 280 MB (slightly above the minimum memory load). Comparing the 4 PS, we observe that WDOA and SWDOA exhibit lower overhead under most of the memory load limits, while under certain limits AOA or DOA performs better. The overheads of these greedy methods differ by less than 3% of the time of one training iteration. However, there is no clear trend indicating that one method is better than the rest in a particular regime. All the methods maintain a reasonable overhead of less than 20% throughout the experiment. In contrast, BO keeps the overhead no larger than the minimum of the 4 PS. Since there are only 4 hyperparameters to optimize, the time complexity is reasonable, and BO converges within 30 to 40 evaluations. As a result, the system achieves zero overhead with the memory load reduced to 447 MB, i.e., a memory load reduction of about 30% without increasing the training time. Note from Fig. 9 that the overhead does not always increase monotonically as the memory load limit decreases. This is due to the granularity of the candidate variables.

VI-B3 Optimality Analysis

We now zoom into the conditions that incur overhead, to analyze whether some portion of the overhead is avoidable. We observe that in most cases the overhead occurs during forward propagation, where the memory load increases significantly faster than in backward propagation, as shown in Fig. 8. Whenever the swap-out cudaStream is idle during forward propagation, we check whether any variable that was not selected could be swapped out during that time and hence indirectly avoid the overhead contributed by other variables.

For example, because more variables are created in forward propagation, the memory load of VGG16 takes 24 ms from the start of an iteration to ramp up to 95% of the peak memory load. Swapping out 300 MB to bring the memory load below 350 MB takes around 28.9 ms and incurs 7.3 ms of overhead with the BO method, of which less than 1.7 ms is avoidable. In other words, our algorithm manages to hide more than 94% of the communication that is avoidable in the background.
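One way to arrive at the 94% figure is sketched below. It assumes the percentage is measured against the 28.9 ms of swap-out time, which the text does not state explicitly, so this reading is an interpretation rather than the authors' derivation.

```latex
\frac{28.9\,\text{ms} - 1.7\,\text{ms}}{28.9\,\text{ms}}
  = \frac{27.2}{28.9} \approx 0.941
```

Since the avoidable overhead is reported as less than 1.7 ms, the hidden fraction under this reading is slightly more than 94%.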

We conduct this analysis on various models under different memory load limits. We observe that in most cases there is no avoidable overhead, or only a negligible amount. This indicates that our greedy algorithm is indeed near-optimal.

Moreover, it is worth pointing out that there are a number of variables whose lifetime, or whose interval between two consecutive accesses, spans one or more iterations, such as weights that are kept until the end of the process. They are more favorable for swapping, since they can be swapped out in one iteration after the last access and swapped back in the next iteration before the first access. This provides a wider time window, so that the communication is fully hidden under computation. Our PS and BO successfully capture these variables, which makes better use of the PCI-E bandwidth and hence keeps the overhead ideally low.

TABLE II: Maximum Memory Load Reduction without Overhead

VI-B4 Overhead of Various Models

In this subsection, we evaluate other models using the BO method. Table II shows the maximum memory load reduction without overhead for different depths of VGG and ResNet at the same batch size of 100. For the VGG network structure, it reduces the memory load by up to 30.9% without overhead, from 647 MB to 447 MB for VGG16. For the ResNet network structure, it reduces the memory load by up to 34.2% without overhead, from 4418 MB to 2906 MB for ResNet50. We also evaluate the overhead under different memory limits for each model; the results are presented in Section VI-C (Fig. 11), where the footprint rather than the memory load is plotted. The swapping performance scales well across both memory limits and network depths.
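The reported percentages follow directly from the before and after values quoted above, as this short check shows. The helper name `reduction_pct` is hypothetical.

```python
def reduction_pct(before_mb: float, after_mb: float) -> float:
    """Percentage by which the memory load drops from before_mb to after_mb."""
    return 100.0 * (before_mb - after_mb) / before_mb


vgg16 = round(reduction_pct(647, 447), 1)       # VGG16: 647 MB -> 447 MB
resnet50 = round(reduction_pct(4418, 2906), 1)  # ResNet50: 4418 MB -> 2906 MB
print(vgg16, resnet50)
```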

VI-C Combined Approach

VI-C1 Comparisons with cudaMalloc and CnMem Pool

Now, we consider both the allocation of variables and the memory load reduction. With SmartPool and AutoSwap applied simultaneously, we perform experiments to evaluate the memory footprint reduction for different network depths and batch sizes. Fig. 10 compares the memory footprint of cudaMalloc, CnMem Pool, SmartPool, and the combined approach SmartPool+AutoSwap. We take the footprint of CnMem Pool as the baseline. At each batch size for each model, there is a clear gap between the footprint of CnMem Pool and that of the combined approach at different overhead levels. The results show that we can reduce the footprint by up to 1/3 without increasing the training time, and by no less than 60% with less than 15% overhead. The performance scales as we increase the depth of the network and use a larger batch size. The percentage of memory reduction achieved by the combined approach at 15% overhead does not decrease as the model depth increases. The curve for each network is linear with a smaller slope than the baseline, which shows that the approach works well for deeper networks and is even more promising for larger batch sizes.

Fig. 10: Memory footprint of different memory management schemes at various batch sizes.

VI-C2 Comparisons with Other Baselines

Finally, we compare our combined approach with three other memory reduction methods as baselines: MXNet-memonger based on MXNet [31], GeePS based on Caffe [35], and a recent deep learning framework, SuperNeurons [33]. MXNet-memonger uses memory sharing, in-place operations, and trading computation for memory. The optimization is performed automatically. However, the memory reduction achieved depends on the choice of searching nodes for recomputation. SuperNeurons provides options to recompute and/or swap, where swapping is restricted to convolution layers. It uses a memory pool only for efficient allocation and deallocation of variables and does not optimize the memory allocation inside the pool. GeePS is a distributed parameter server system which solely uses memory sharing. It allows us to choose different GPU memory limits, and the decision of which layer or which tensor to swap is made by the end user.

Since different frameworks vary significantly in the absolute values of computation time and memory consumption, we present the results as the percentage of overhead incurred relative to one iteration and the percentage of memory footprint reduction obtained with the memory optimization option. To compute the percentage of overhead, we measure the average time per iteration with and without the memory optimization option for each baseline. Memory consumption is measured with nvidia-smi at the microsecond level, except for MXNet-memonger, which takes up more GPU memory at the beginning while searching for the optimal convolution algorithm; its actual memory consumption is therefore measured from the steady state of the training phase.

We cover the widest possible range of data points for the baselines and our combined approach. The combined approach provides a wide range of memory consumption, from the baseline footprint down to the minimum memory load. For MXNet-memonger, we change the position of the searching nodes at different layers and unit blocks, and obtain 3 data points for each network. For SuperNeurons, the options to recompute and/or swap provide 3 different combinations of compilation and hence 3 data points. For GeePS, we run the training model on a single machine with one GPU instead of in a distributed environment, for a fair and accurate comparison. It gives a wide range of memory consumption; the baseline time and memory consumption are obtained using the original Caffe. The batch size is fixed at 100 for the CIFAR-10 dataset using ResNet and VGG of various depths. For those networks that cannot fit in our GPU (11 GB) under certain baselines, no data point is available.

The percentage of overhead versus the percentage of footprint reduction for all four methods is shown in Fig. 11. MXNet-memonger shows significant memory reduction for VGG models, reducing the memory consumption by 40% with less than 30% overhead. However, it has less effect on ResNet models. SuperNeurons is able to reduce memory consumption further than MXNet-memonger through recomputation and swapping. However, its overhead is consistently higher than that of MXNet-memonger. GeePS exhibits zero or nearly zero overhead within a wide range of footprints. Overall, the overhead of GeePS is comparable with that of SmartPool+AutoSwap, and under some conditions it is smaller than that of our combined approach. However, the memory consumption of GeePS cannot be reduced as far as that of SmartPool+AutoSwap, because cudaMalloc runs out of memory while constructing the network when the memory limit is below a certain value for certain models. Moreover, users of GeePS have to decide manually which layers and tensors to swap, and hence the actual performance depends on the skills and knowledge of the end user.

Compared to other methods, our combined approach has the following advantages. Firstly, the percentage of footprint reduction is adjustable within a wide range. The approach meets the required limit without swapping memory excessively, and hence avoids redundant data communication. Secondly, our approach is transparent to users and starts working automatically in the early iterations. It does not require the user to know the memory consumption of the network or to decide which layer to swap. Thirdly, it gives low overhead and scales well as we vary the DNN type and depth, the batch size, and the percentage of footprint reduction.

Fig. 11: Memory footprint reduction and overhead for different depths of VGG and ResNet models.

VII Conclusion and Future Work

In this paper, we exploit the iterative nature in training DNNs and propose two orthogonal approaches, SmartPool and AutoSwap, to reduce the GPU memory consumption efficiently and effectively. They are transparent to the end users and do not require human intervention. Experiments show that SmartPool effectively optimizes the allocation of variables in the memory pool; AutoSwap efficiently reduces the memory load by swapping out the currently not-in-use variables to CPU memory with ideally low communication overhead; the combined approach further reduces the memory footprint. In addition, it scales well for different network architectures as we vary the network depth, the batch size, and the memory load limit.

In future work, we will extend our solutions to applications whose iterations have slight variations, explore memory reduction in a distributed environment [44], and adapt our approaches to other memory-hungry applications of an iterative nature, such as large-scale K-Means running on GPU.

References

  • [1] B. He and J. X. Yu, “High-throughput transaction executions on graphics processors,” Proceedings of the VLDB Endowment, vol. 4, no. 5, pp. 314–325, 2011.
  • [2] B. He, K. Yang, R. Fang, M. Lu, N. Govindaraju, Q. Luo, and P. Sander, “Relational joins on graphics processors,” in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD ’08.   New York, NY, USA: ACM, 2008, pp. 511–524. [Online]. Available: http://doi.acm.org/10.1145/1376616.1376670
  • [3] K. Zhang, J. Hu, B. He, and B. Hua, “Dido: Dynamic pipelines for in-memory key-value stores on coupled cpu-gpu architectures,” in 2017 IEEE 33rd International Conference on Data Engineering (ICDE), April 2017, pp. 671–682.
  • [4] K. Wang, K. Zhang, Y. Yuan, S. Ma, R. Lee, X. Ding, and X. Zhang, “Concurrent analytical query processing with gpus,” Proc. VLDB Endow., vol. 7, no. 11, pp. 1011–1022, Jul. 2014. [Online]. Available: http://dx.doi.org/10.14778/2732967.2732976
  • [5] B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander, “Relational query coprocessing on graphics processors,” ACM Trans. Database Syst., vol. 34, no. 4, pp. 21:1–21:39, Dec. 2009. [Online]. Available: http://doi.acm.org/10.1145/1620585.1620588
  • [6] J. Zhou, Q. Guoy, H. V. Jagadish, L. Krcaly, S. L. W. Luan, A. K. H. Tung, Y. Yang, and Y. Zheng, “A generic inverted index framework for similarity search on the gpu,” ICDE, 2018.
  • [7] M. Sha, Y. Li, B. He, and K.-L. Tan, “Accelerating dynamic graph analytics on gpus,” Proc. VLDB Endow., vol. 11, no. 1, pp. 107–120, Sep. 2017. [Online]. Available: https://doi.org/10.14778/3151113.3151122
  • [8] Z. Wen, J. Shi, B. He, J. Chen, and Y. Chen, “Efficient multi-class probabilistic svms on gpus,” IEEE Transactions on Knowledge and Data Engineering, pp. 1–1, 2018.
  • [9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [10] R. Dogaru and I. Dogaru, “Optimization of gpu and cpu acceleration for neural networks layers implemented in python,” 5th International Symposium on Electrical and Electronics Engineering (ISEEE), 2017.
  • [11] S. Scanzio, S. Cumani, R. Gemello, F. Mana, and P. Laface, “Parallel implementation of artificial neural network training for speech recognition,” Pattern Recognition Letters, vol. 31, pp. 1302–1309, 2010.
  • [12] S. Bahrampour, N. Ramakrishnan, L. Schott, and M. Shah, “Comparative study of deep learning software frameworks,” arXiv preprint arXiv:1511.06435, 2015.
  • [13] Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y. Ng, “On optimization methods for deep learning,” in Proceedings of the 28th International Conference on International Conference on Machine Learning.   Omnipress, 2011, pp. 265–272.
  • [14] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556v6, 2015.
  • [15] W. Wang, M. Zhang, G. Chen, H. Jagadish, B. C. Ooi, and K.-L. Tan, “Database meets deep learning: Challenges and opportunities,” ACM SIGMOD Record, vol. 45, no. 2, pp. 17–22, 2016.
  • [16] C. Ré, D. Agrawal, M. Balazinska, M. Cafarella, M. Jordan, T. Kraska, and R. Ramakrishnan, “Machine learning and databases: The sound of things to come or a cacophony of hype?” in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data.   ACM, 2015, pp. 283–284.
  • [17] T. Li, J. Zhong, J. Liu, W. Wu, and C. Zhang, “Ease. ml: towards multi-tenant resource sharing for machine learning workloads,” Proceedings of the VLDB Endowment, vol. 11, no. 5, pp. 607–620, 2018.
  • [18] V. Vanhoucke, A. Senior, and M. Z. Mao, “Improving the speed of neural networks on cpus,” in Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, vol. 1.   Citeseer, 2011, p. 4.
  • [19] H. Zhang, G. Chen, B. C. Ooi, K.-L. Tan, and M. Zhang, “In-memory big data management and processing: A survey,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 7, pp. 1920–1948, 2015.
  • [20] S. Palkar, J. Thomas, D. Narayanan, P. Thaker, R. Palamuttam, P. Negi, A. Shanbhag, M. Schwarzkopf, H. Pirk, S. Amarasinghe et al., “Evaluating end-to-end optimization for data analytics applications in weld,” Proceedings of the VLDB Endowment, vol. 11, no. 9, pp. 1002–1015, 2018.
  • [21] J. Alger, C++ for real programmers.   Academic Press, Inc., 2000.
  • [22] M. A. Bender, M. Farach-Colton, S. P. Fekete, J. T. Fineman, and S. Gilbert, “Cost-oblivious storage reallocation,” ACM Transactions on Algorithms (TALG), vol. 13, no. 3, p. 38, 2017.
  • [23] NVIDIA, “cnmem,” https://github.com/NVIDIA/cnmem, 2015.
  • [24] R. Ausavarungnirun, J. Landgraf, V. Miller, S. Ghose, J. Gandhi, C. J. Rossbach, and O. Mutlu, “Mosaic: A gpu memory manager with application-transparent support for multiple page sizes,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-50 ’17.   New York, NY, USA: ACM, 2017, pp. 136–150. [Online]. Available: http://doi.acm.org/10.1145/3123939.3123975
  • [25] T. Zheng, D. Nellans, A. Zulfiqar, M. Stephenson, and S. W. Keckler, “Towards high performance paged memory for gpus,” in 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), March 2016, pp. 345–357.
  • [26] M. Courbariaux, Y. Bengio, and J. David, “Low precision arithmetic for deep learning,” CoRR, abs/1412.7024, vol. 4, 2014.
  • [27] S. Le Grand, A. W. Götz, and R. C. Walker, “Spfp: Speed without compromise—a mixed precision model for gpu accelerated molecular dynamics simulations,” Computer Physics Communications, vol. 184, no. 2, pp. 374–380, 2013.
  • [28] M. D. McDonnell, “Training wide residual networks for deployment using a single bit for each weight,” arXiv preprint arXiv:1802.08530, 2018.
  • [29] B. Hassibi and D. G. Stork, “Second order derivatives for network pruning: Optimal brain surgeon,” in Advances in neural information processing systems, 1993, pp. 164–171.
  • [30] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “Eie: efficient inference engine on compressed deep neural network,” in Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on.   IEEE, 2016, pp. 243–254.
  • [31] T. Chen, B. Xu, C. Zhang, and C. Guestrin, “Training deep nets with sublinear memory cost,” arXiv preprint arXiv:1604.06174, 2016.
  • [32] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, “Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems,” arXiv preprint arXiv:1512.01274, 2015.
  • [33] L. Wang, J. Ye, Y. Zhao, W. Wu, A. Li, S. L. Song, Z. Xu, and T. Kraska, “Superneurons: Dynamic gpu memory management for training deep neural networks,” arXiv preprint arXiv:1801.04380, 2018.
  • [34] G. Pleiss, D. Chen, G. Huang, T. Li, L. van der Maaten, and K. Q. Weinberger, “Memory-efficient implementation of densenets,” arXiv preprint arXiv:1707.06990, 2017.
  • [35] H. Cui, H. Zhang, G. R. Ganger, P. B. Gibbons, and E. P. Xing, “Geeps: Scalable deep learning on distributed gpus with a gpu-specialized parameter server,” in Proceedings of the Eleventh European Conference on Computer Systems.   ACM, 2016, p. 4.
  • [36] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature, vol. 521, no. 7553, p. 436, 2015.
  • [37] C. Turchetti, Stochastic Models of Neural Networks.   IOS Press, 2004.
  • [38] M. R. Garey, D. S. Johnson, and L. Stockmeyer, “Some simplified np-complete graph problems,” Theoretical computer science, vol. 1, no. 3, pp. 237–267, 1976.
  • [39] H. A. Kierstead, “A polynomial time approximation algorithm for dynamic storage allocation,” Discrete Mathematics, vol. 88, no. 2-3, pp. 231–237, 1991.
  • [40] L. Lu, X. Shi, Y. Zhou, X. Zhang, H. Jin, C. Pei, L. He, and Y. Geng, “Lifetime-based memory management for distributed data processing systems,” Proceedings of the VLDB Endowment, vol. 9, no. 12, pp. 936–947, 2016.
  • [41] S. V. Pemmaraju, S. Penumatcha, and R. Raman, “Approximating interval coloring and max-coloring in chordal graphs,” in International Workshop on Experimental and Efficient Algorithms.   Springer, 2004, pp. 399–416.
  • [42] J. Snoek, H. Larochelle, and R. P. Adams, “Practical bayesian optimization of machine learning algorithms,” in Advances in neural information processing systems, 2012, pp. 2951–2959.
  • [43] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Tech Report, 2009.
  • [44] J. Jiang, B. Cui, C. Zhang, and L. Yu, “Heterogeneity-aware distributed parameter servers,” in Proceedings of the 2017 ACM International Conference on Management of Data.   ACM, 2017, pp. 463–478.