In the life cycle of deep learning, serving models for inferences is a vital stage and usually incurs significant operational costs. An Amazon user study found that model serving is responsible for - % of the total cost of ownership of data science solutions (amazon-tco). One important reason is that most of today’s platforms that serve deep neural network (DNN) models, such as Nexus (shen2019nexus), Clipper (crankshaw2017clipper), Pretzel (lee2018pretzel), TensorFlow Serving (olston2017tensorflow), and Rafiki (wang2018rafiki), are standalone systems that are totally decoupled from the data management systems. From the perspective of end-to-end applications, this decoupling incurs significant costs as follows:
(1) Existing deep learning serving frameworks are compute-focused and require models, input features, and intermediate feature maps to all fit in memory. If this requirement is not met, the system fails. For large models with large working sets, which are common in applications such as natural language processing and extreme multi-label classification (extreme-classification), this problem significantly impacts the availability of a model serving system.
(2) The physical decoupling of data serving and model serving introduces management complexity and extra latency to transfer input features from the databases where input features are extracted to the deep learning frameworks.
Therefore, it is imperative to investigate the serving of deep learning models natively from the relational database management system (RDBMS) (yuan2020tensor; jankov2019declarative; nakandala2020tensor; karanasos2019extending; hutchison2017laradb; DBLP:journals/pvldb/KoutsoukosNKSAI21; wang2020spores; dolmatova2020relational; boehm2016systemml). RDBMS has a long history of optimizing the memory locality for computations (i.e., queries), whether the working set size exceeds memory capacity or not, through effective buffer pool management. It also eases the management of data through data independence, views, and fine-grained authorization. All of these capabilities, if leveraged for model serving, will significantly reduce the operational costs and simplify system management for a broad class of real-world workloads (olteanu2020relational), such as credit-card fraud detection, personalized targeting recommendation, and personalized conversational AI for customer support. In such applications, the features are extracted from various historical transaction records or customer profiles, which are stored in an RDBMS.
As aforementioned, unlike deep learning frameworks, workloads in RDBMS are not expected to have a working set fit in the available memory. The RDBMS buffer pool manager moves pages between disk and memory to optimize the data locality while continuing query processing. This allows more models to be served concurrently than deep learning frameworks such as TensorFlow with the same memory capacity. Nonetheless, there is always a desire to increase buffer reuse and minimize page displacement. To achieve this in model serving, we look into model deduplication.
Serving multiple similar models, such as ensemble and personalized model serving, can greatly improve the accuracy and customer experiences, and thus becomes a common pattern of DNN model serving (anyscale; crankshaw2015scalable; crankshaw2017clipper). Such DNN models contain abundant similar tensor blocks that can be deduplicated without affecting the accuracy. As a result, proper deduplication of such DNN models significantly reduces the storage space, memory footprint, and cache misses, and thus reduces the inference costs and latency.
However, existing deduplication techniques for tensors (vartak2018mistique), files (meyer2012study; zhu2008avoiding; bhagwat2009extreme; li2016cachededup; debnath2010chunkstash; wang2020austere), relational data (elmagarmid2006duplicate; bilenko2006adaptive; ananthakrishna2002eliminating; hernandez1995merge; borthwick2020scalable; yu2016generic; xiao2008ed), and MapReduce platforms (kolb2012load; kolb2012dedoop; chu2016distributed), are not applicable to the above problem, because: (1) They do not consider the impacts on model inference accuracy; (2) They do not consider how existing database storage functionalities, including indexing, page packing, and caching, should be enhanced to better support the inference and the deduplication of DNN models. The challenges that we focus on in this work include:
1. How to leverage indexing to efficiently detect similar parameters that can be deduplicated without hurting the inference accuracy?
2. A database page can contain multiple tensor blocks. How to pack tensor blocks into pages to maximize page sharing across multiple models and minimize the total number of needed pages for representing all tensors?
3. How to augment the caching policy to increase the data locality for deduplicated model parameters, so that pages that are needed by multiple models have a higher priority to be kept in memory?
To address these challenges, in this work, we propose a novel RDBMS storage design optimized for tensors and DNN inference workloads. We mainly leverage our previous works on Tensor Relational Algebra (yuan2020tensor; jankov2019declarative) to map deep learning computations to relational algebra expressions. A tensor is partitioned and stored as a set of tensor blocks of equivalent shape, where each block contains the metadata that specifies its position in the tensor. A tensor is similar to a relation and a tensor block is similar to a tuple. A DNN model inference is represented as a relational algebra graph, as detailed in Sec. 2. This high-level abstraction is also consistent with many popular systems that integrate database and machine learning, such as SystemML (boehm2016systemml), Spark MLlib (meng2016mllib), SciDB (stonebraker2011architecture), SPORES (wang2020spores), LaraDB (hutchison2017laradb), among others.
Similar to the classical physical representation of a relation, we store a tensor as a set of database pages, with each page containing multiple tensor blocks. The difference is that each tensor relation consists of a set of private pages, and an array of references to shared pages that belong to more than one tensor, as detailed in Sec. 3. On top of such physical representation, we propose novel and synergistic indexing, paging, and caching techniques as follows:
Tensor block index for fast duplication detection (Sec. 4). It is widely observed that a small portion of model parameters (e.g., weights, bias) are critical to prediction accuracy. Deduplicating these parameters will lead to a significant reduction in accuracy (lee2020fast). To address the problem, different from existing tensor deduplication works (vartak2018mistique), we propose to first measure each tensor block’s sensitivity to prediction accuracy based on weight magnitude or other post-hoc analysis (han2015learning), and thus avoid deduplicating accuracy-critical blocks. Because pair-wise similarity-based comparison across tensor blocks incurs prohibitive overhead, we use Locality Sensitive Hash (LSH) based on Euclidean (L2) distance (indyk1998approximate; zhou2020s) to facilitate the nearest neighbor clustering.
Packing distinct tensor blocks to pages for minimizing storage size (Sec. 5). The problem is a variant of the bin-packing problem with different constraints: (1) Two bins (i.e., pages) can share space if they contain the same set of items (i.e., tensor blocks) (korf2002new; sindelar2011sharing); (2) For each tensor, there must exist a set of pages that exactly contain all blocks of that tensor. To address this problem, we propose a concept called equivalent class so that blocks that are owned by the same set of tensors will be assigned to the same class. Then, we propose a two-stage algorithm that first employs a divide-and-conquer approach to pack tensor blocks in each equivalent class to pages respectively, and later it adopts an approximation algorithm to repack the tensor blocks from non-full pages.
Deduplication-aware buffer pool management (Sec. 6). Existing deduplication-aware cache replacement strategies (li2016cachededup; wang2020austere) do not consider the locality patterns of different sets of pages, which are important for model inference workloads where the input and the output of each layer have different locality patterns. Conversely, existing locality-aware buffer pool management policies (chou1986evaluation) do not distinguish private pages and shared pages. To address this problem, we propose a cost model for page eviction, which considers the reference count of a page (i.e., the number of locality sets/tensors that share this page) and gives pages that are shared by more tensors higher priority to be kept in memory.
The key contributions of our work are as follows:
1. We are the first to systematically explore the storage optimization for DNN models in RDBMS, with an overall goal of supporting deep learning model serving (i.e., inferences) natively from RDBMS.
2. We propose three synergistic storage optimizations: (a) A novel index based on L2 LSH and magnitude ordering to accelerate the discovery of duplicate tensor blocks with limited impacts on the accuracy; (b) A two-stage strategy to group tensor blocks to pages to minimize the number of pages that are needed to store the tensor blocks across all tensors; (c) A novel caching algorithm that recognizes and rewards shared pages across locality sets. It is noteworthy that our optimization can work together with other compression techniques such as pruning (han2015deep; han2015learning) and quantization (jacob2018quantization) to achieve a better compression ratio, as detailed in Sec. 7.6.
3. We implement the system, called netsDB (https://github.com/asu-cactus/netsdb; related documentation can be found at https://github.com/asu-cactus/netsdb/tree/master/model-inference/), in an object-oriented relational database based on our previous work on PlinyCompute (zou2018plinycompute; zou2019pangea; zou2020architecture; zou2020lachesis). We evaluate the proposed techniques using the serving of (1) multiple customized Word2Vec embedding models; (2) multiple versions of text classification models; (3) multiple specialized models for extreme classification. The results show that our proposed deduplication techniques achieved to reduction in storage size, sped up the inference by to , and improved the cache hit ratio by up to . The results also show that netsDB outperformed TensorFlow for these workloads.
2. Background and Related Works
2.1. ML Model Inferences as Queries
Existing works (luo2018scalable; meng2016mllib; boehm2016systemml) propose to: (1) Abstract the tensor as a set of tensor blocks; (2) Encode the local linear algebra computation logic that manipulates a single tensor block or a pair of tensor blocks, such as matrix multiplication and matrix addition, in user-defined functions (UDFs), also called kernel functions; (3) Apply the relational algebra operators nested with these UDFs for performing linear algebra computations. Based on the above ideas, tensor relational algebra (TRA) (yuan2020tensor) further introduces a set of tensor-oriented relational operations, such as tile, concat, rekey, transform, join, aggregation, selection, etc. We found that most ML workloads can be decomposed into linear algebra operations that are further represented in such TRA.
For example, matrix multiplication is a join followed by aggregation (yuan2020tensor; boehm2016systemml). The join pairs two blocks from the two tensors if the first block’s column index equals the second’s row index. A UDF that multiplies the two tensor blocks is then applied to each joined pair. An output block takes its row index from the first block’s row index and its column index from the second block’s column index. Then all tensor blocks output from the transformation are grouped by their row and column indexes, and all tensor blocks in the same group are added up in an aggregate/reduce UDF. Similarly, matrix addition is a join. In addition, matrix transpose is a rekey (yuan2020tensor); activations such as relu, tanh, and sigmoid are transforms; softmax and normalization can be represented as an aggregation followed by a transform.
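The join-then-aggregate formulation of matrix multiplication can be illustrated with a minimal sketch in plain Python. The dictionary layout (`(row_idx, col_idx) -> block`) and the function names `block_matmul`, `block_add`, and `tra_matmul` are assumptions made for this illustration; they are not the TRA or netsDB API.

```python
from collections import defaultdict


def block_matmul(a, b):
    """Kernel UDF: multiply two dense blocks stored as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def block_add(a, b):
    """Aggregate UDF: element-wise sum of two blocks of equal shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def tra_matmul(lhs, rhs):
    """Blocked matrix multiplication as a join followed by an aggregation.

    lhs and rhs map (row_idx, col_idx) -> block (hypothetical layout).
    """
    # Join: pair blocks whose lhs column index equals the rhs row index,
    # and key each product by (lhs row index, rhs column index).
    groups = defaultdict(list)
    for (i, k), a in lhs.items():
        for (k2, j), b in rhs.items():
            if k == k2:
                groups[(i, j)].append(block_matmul(a, b))
    # Aggregation: sum all partial products that share an output key.
    out = {}
    for key, partials in groups.items():
        acc = partials[0]
        for p in partials[1:]:
            acc = block_add(acc, p)
        out[key] = acc
    return out


# A 2x4 matrix split into two 2x2 blocks, times a 4x2 matrix split likewise.
lhs = {(0, 0): [[1, 2], [5, 6]], (0, 1): [[3, 4], [7, 8]]}
rhs = {(0, 0): [[1, 0], [0, 1]], (1, 0): [[1, 0], [0, 1]]}
result = tra_matmul(lhs, rhs)
```

The nested loop is a naive join used only for readability; a database engine would use a hash or sort-merge join on the shared index instead.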
Therefore, as illustrated in Fig. 1, a fully-connected feed-forward network (FFNN) can be represented in relational algebra (jankov2019declarative; luo2018scalable).
While the experiments in this work (Sec. 7) mainly used the aforementioned operators, other types of neural networks can also be represented in relational algebra. For example, convolution can be converted into a multiplication of two matrices (Conv-spatial-rewrite; spatial-rewrite), where the first matrix is created by spatially flattening every filtered area of the input features into a vector and concatenating these vectors, and the second matrix is created by concatenating all filters and bias. Long short-term memory (LSTM) consists of concat, matrix multiplication, matrix addition, tanh, and sigmoid; and the transformer’s attention mechanism consists of matrix multiplication, transpose, softmax, etc. (vaswani2017attention).
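The convolution rewrite described above can be sketched as follows. This is a simplified illustration that assumes a single input channel, stride 1, and no padding; the function names `im2col` and `conv_as_matmul` and the trick of appending a constant 1 to each flattened patch (so the bias becomes one more weight) are choices made for this sketch.

```python
def im2col(image, k):
    """Flatten every k x k window of a 2-D image into one row vector.

    A trailing 1.0 is appended so that the bias can be folded into the
    weight matrix (an assumption of this sketch).
    """
    height, width = len(image), len(image[0])
    rows = []
    for i in range(height - k + 1):
        for j in range(width - k + 1):
            patch = [image[i + di][j + dj] for di in range(k) for dj in range(k)]
            rows.append(patch + [1.0])
    return rows


def conv_as_matmul(image, filters, biases):
    """Compute a convolution as (patch matrix) x (filter matrix)."""
    k = len(filters[0])
    patches = im2col(image, k)
    # One column per filter: the flattened filter followed by its bias.
    columns = [[v for row in f for v in row] + [b] for f, b in zip(filters, biases)]
    return [[sum(p * c for p, c in zip(patch, col)) for col in columns]
            for patch in patches]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
filters = [[[1, 0], [0, 1]]]   # a single 2x2 filter
biases = [0.5]
output = conv_as_matmul(image, filters, biases)  # one row per window
```

Each output row corresponds to one window position in row-major order, and each output column corresponds to one filter.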
The storage optimization techniques proposed in this work can be easily extended to other tensor/array-based machine learning systems, which adopt a similar tensor representation that chunks a tensor to blocks, such as SystemML (boehm2016systemml), Spark MLlib (meng2016mllib), SciDB (stonebraker2011architecture), SPORES (wang2020spores), LaraDB (hutchison2017laradb), etc. In contrast, Raven (karanasos2019extending) and HummingBird (nakandala2020tensor) propose to transform relational data to tensors and leverage deep learning frameworks to run tensor computations. We will investigate how to apply proposed deduplication techniques to such and other systems (DBLP:journals/pvldb/KoutsoukosNKSAI21; dolmatova2020relational) in the future.
2.2. Tensor Deduplication and Virtualization
Mistique (vartak2018mistique) proposed a data store for managing and querying the historical intermediate data generated from ML models. It optimized the storage of fuzzy intermediate data using quantization, summarization, and deduplication. However, these techniques are designed for diagnosis queries, which are not linear algebra computations and have significantly less stringent accuracy and latency requirements compared to model inferences. While they considered both exact and approximate deduplication for traditional ML models, they only considered exact deduplication for DNN models, which is another limitation. In addition, they did not consider page packing and caching optimization.
Jeong et al. (jeong2020accelerating) proposed to merge related models resulting from ensemble learning, transfer learning, and retraining into a single model through input-weight connectors, so that multiple models can be served in one process and context switch overheads caused by running multiple concurrent model processes can be avoided. However, their method makes strong assumptions about the model architecture, achieves only coarse-grained deduplication, and is not applicable to models that are owned by different individuals and organizations.
Weight virtualization (lee2020fast) is a recently proposed technique for edge device environments. It merges pages across multiple heterogeneous models into a single page that is shared by these models. However, their work relies on each weight’s Fisher information, which must be extracted from the training process and is usually not available at the serving stage in production. It also models the page matching and merging process as an expensive optimization process, which may work for small-scale models on edge devices but is not scalable to large-scale models. In addition, they did not consider the integration with relational databases.
2.3. Other Existing Deduplication Techniques
Deduplication of relational data in RDBMS, also known as record linkage, identifies duplicate items through entity matching (elmagarmid2006duplicate), using various blocking techniques to avoid the pair-wise comparison for dissimilar items (bilenko2006adaptive; ananthakrishna2002eliminating; hernandez1995merge; borthwick2020scalable). Various distributed algorithms were proposed to further accelerate such deduplication (chu2016distributed). For example, Dedoop (kolb2012load; kolb2012dedoop) leveraged the MapReduce platform, and Dis-Dedup (chu2016distributed) provided strong theoretical guarantees for load balance. In addition, various similarity join techniques were proposed to identify pairs of similar items, which leveraged similarity functions to filter out pairs that have similarity scores below a threshold (xiao2008ed) or used LSH to convert similarity join to an equi-join problem (yu2016generic). While these works are helpful for cleaning data in RDBMS, they are not optimized for numerical tensor data. For example, they never considered how deduplication of tensor data will affect the accuracy of ML applications.
There exists abundant work in storage deduplication to facilitate the file backup process (meyer2012study). Bhagwat et al. (bhagwat2009extreme) proposed a two-tier index managing the fingerprints and file chunks. Zhu et al. (zhu2008avoiding) proposed RAM prefetching and bloom-filter based techniques, which can avoid disk I/Os on close to of the index lookups. ChunkStash (debnath2010chunkstash) proposed to construct the chunk index using flash memory. CacheDedup (li2016cachededup) proposed duplication-aware cache replacement algorithms (D-LRU, DARC) to optimize both cache performance and endurance. AustereCache (wang2020austere) proposed a new flash caching design that aims for memory-efficient indexing for deduplication and compression. All such works focus on exact deduplication of file chunks, because information integrity is required for file storage. However, the storage of model parameters for model serving can tolerate a certain degree of approximation if such approximation will not harm the inference accuracy.
3. System Overview
Leveraging tensor relational algebra (yuan2020tensor; jankov2019declarative), a tensor is represented as a set of tensor blocks (Luo et al. (luo2021automatic) proposed an auto-tuning strategy for blocking tensors for TRA (yuan2020tensor)). Without deduplication, the set is physically stored in an array of pages of equivalent size, where each page consists of multiple tensor blocks. With deduplication, certain pages will be shared by multiple tensors. These shared pages are stored separately in a special type of set. Each tensor not only stores an array of private pages, but also maintains a list of page IDs that points to the shared pages that belong to the set.
Given a set of models, we propose a novel deduplication process, as illustrated in Fig. 2 and described below:
(1) An LSH-based index is incrementally constructed to group tensor blocks based on similarity, so that similar tensor blocks can be replaced by one representative tensor block in their group, with limited impacts on the model inference accuracy. To achieve the goal, the main ideas include: (a) Always examining the tensor blocks in the ascending ordering of their estimated impacts on the accuracy; (b) Periodically testing the deduplicated model inference accuracy along the duplication detection process, and stopping the deduplication for tensor blocks from a model, if its accuracy drops below a threshold. (Sec. 4)
(2) Each set of tensor blocks is physically stored as an array of pages of fixed size on disk. Distinct tensor blocks identified by the indexing are carefully grouped to pages so that each tensor is exactly covered by a subset of pages, and the number of pages that are required by all models is minimized. We optimize these objectives by assigning distinct tensor blocks that are shared by the same set of tensors to one equivalent class. Then blocks in the same equivalent class are grouped to the same set of pages. After this initial packing, tensor blocks from non-full pages are repacked to further improve the storage efficiency. (Sec. 5)
(3) The pages are automatically cached in the buffer pool. When memory resources become insufficient, the buffer pool manager will consider the locality patterns of each tensor and give hot pages and shared pages higher priority to be kept in memory through a novel cost model. (Sec. 6)
Block Metadata. A major portion of the overhead of the proposed deduplication mechanism is incurred by the additional metadata used to map each tensor block in these shared pages to the correct position in each tensor. Each tensor block needs integers to specify such mapping, where is the number of tensors that share the block and is the number of dimensions of the tensor. The metadata size is usually much smaller than the block size. For an megabytes block (e.g., with double precision), its metadata for position mapping is merely bytes, supposing such a D block is shared by tensors, using short type to store block indexes. Even when we use small block sizes such as , the block size is hundreds of times larger than the metadata size.
As aforementioned, an important pattern of model serving involves multiple versions of models that have the same architecture, e.g., obtained by retraining/finetuning a model using different datasets. We found that the deduplication of such models does not require tensor block remapping at all, as a shared tensor block is often mapped to the same position of all tensors it belongs to. That’s because during the process of finetuning and retraining, only partial weights will change. For a tensor block in such scenarios, we only need integers to specify the IDs of tensors that share it.
Model Removal and Updates. To remove a tensor, all private pages belonging to the tensor will be removed, and then, for each shared page belonging to this tensor, its reference count will be decremented. Once a shared page’s reference count drops to 1, this shared page will be moved from the shared page set to the private set of the tensor that owns the page. Given that the models in a serving scenario are less frequently updated than models in a training scenario, an update is implemented as a removal of the old tensor followed by an insertion of the new tensor. However, the index can be easily extended to facilitate model updates at a fine-grained level, as discussed in Sec. 4.
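The removal procedure can be sketched with simple reference counting. The class `PageStore` and its methods are hypothetical names used only for this illustration, and the sketch assumes that a shared page becomes private once exactly one tensor still references it.

```python
class PageStore:
    """Toy bookkeeping for private and shared pages (hypothetical layout)."""

    def __init__(self):
        self.private = {}  # tensor id -> set of private page ids
        self.shared = {}   # shared page id -> set of tensor ids referencing it

    def add_private(self, tensor, page):
        self.private.setdefault(tensor, set()).add(page)

    def add_shared(self, page, tensors):
        self.shared[page] = set(tensors)

    def remove_tensor(self, tensor):
        # Step 1: drop all private pages of the tensor.
        self.private.pop(tensor, None)
        # Step 2: decrement the reference count of each shared page it used.
        for page in list(self.shared):
            owners = self.shared[page]
            if tensor not in owners:
                continue
            owners.remove(tensor)
            if len(owners) == 1:
                # Assumption: a page with a single remaining owner is moved
                # to that owner's private set.
                (last_owner,) = owners
                self.add_private(last_owner, page)
                del self.shared[page]
            elif not owners:
                del self.shared[page]


store = PageStore()
store.add_private("A", "p0")
store.add_shared("s0", {"A", "B"})
store.add_shared("s1", {"A", "B", "C"})
store.remove_tensor("A")  # s0 becomes private to B; s1 stays shared by B and C
```

Under this scheme an update can be expressed as `remove_tensor` followed by re-inserting the new tensor's pages, matching the removal-then-insertion approach described above.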
4. Index for Duplication Detection
4.1. Problem Description
In this section, we focus on one problem: For tensors with the same blocking shapes, how can we divide all tensor blocks of these tensors into distinct groups, so that the tensor blocks in each group can replace each other without a significant drop in the inference accuracy of each model? We can further pick one block, i.e., the first identified block, in each group as a representative tensor block to replace other blocks in its group, without a significant accuracy drop. The problem is formalized as follows:
Given tensors:, the -th tensor is split into tensor blocks: . The question is how to divide all tensor blocks, , into clusters: , so that (1) ; (2) , ; (3) , . Here, means that can be replaced by so that the drop in model accuracy is smaller than a threshold .
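One way to write the clustering conditions compactly is sketched below. All symbol names ($T$, $t_i$, $n_i$, $b_{i,j}$, $B$, $C$, $c_p$, $\tau$, and the relation $\approx_\tau$) are notation assumed here for illustration, and reading conditions (1) and (2) as a partition of the block set is an interpretation of "divide all tensor blocks into clusters".

```latex
\begin{align*}
& T = \{t_1, \dots, t_k\}, \qquad
  t_i = \{b_{i,1}, \dots, b_{i,n_i}\}, \qquad
  B = \bigcup_{i=1}^{k} t_i, \\
& \text{find } C = \{c_1, \dots, c_m\} \text{ such that} \\
& \quad (1)\ \bigcup_{p=1}^{m} c_p = B; \\
& \quad (2)\ \forall\, p \neq q:\ c_p \cap c_q = \emptyset; \\
& \quad (3)\ \forall\, c_p \in C,\ \forall\, b, b' \in c_p:\ b \approx_\tau b',
\end{align*}
```

where $b \approx_\tau b'$ would mean that replacing $b$ by $b'$ lowers the model accuracy by less than the threshold $\tau$.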
4.2. Main Ideas
4.2.1. Magnitude-aware Duplicate Detection
Existing works on deduplication (li2016cachededup; elmagarmid2006duplicate; bilenko2006adaptive; ananthakrishna2002eliminating; hernandez1995merge; borthwick2020scalable; chu2016distributed; kolb2012load; kolb2012dedoop) and tensor chunk deduplication (vartak2018mistique) include exact page deduplication and similar/approximate page deduplication, as detailed in Sec. 2.2 and 2.3. However, we found these works cannot be directly applied to tensor block deduplication for model serving applications:
(1) Exact deduplication of tensor chunks does not consider the fuzziness or similarity of model weights. In fact, the number of tensor blocks that can be deduplicated based on exact match is significantly lower than similarity-based match.
(2) We also found it ineffective to perform deduplication solely based on the similarity, without considering the impact of model weights on the prediction accuracy. For example, we found that deduplicating similar blocks in a batch normalization layer in a ResNet50 model (two blocks with less than different weights were considered as similar), without considering the importance of weights, will reduce accuracy from to .
Therefore, it is critical to develop new methods to identify tensor blocks that can be deduplicated with limited impacts on accuracy.
Motivated by the iterative pruning process (han2015deep; han2015learning), in which weights with small magnitude are pruned first, we developed a process of magnitude-aware duplicate detection, where blocks of smaller magnitude are deduplicated first, and the model accuracy is periodically validated after deduplicating every blocks.
4.2.2. LSH-based Approximate Tensor Block Deduplication
To reduce the pair-wise similarity comparison overhead, we consider leveraging Locality Sensitive Hash (LSH), which is a popular technique to solve nearest neighbor problems. LSH schemes based on Hamming distance (datar2004locality), Euclidean distance (indyk1998approximate), and cosine similarity (charikar2002similarity) are designed to identify similar numerical vectors with fixed dimensions, and can be directly applied to detect similar tensor blocks. In addition, the MinHash based on Jaccard similarity (broder1997resemblance) is designed to identify similar binary vectors or similar sets of items. In this work, we mainly use the LSH based on Euclidean distance (indyk1998approximate; chen2019locality), which we call L2 LSH, because it is easy to compute (e.g., it does not require an expensive numeric value discretization process like MinHash) and it can be linked to the JS-divergence (lin1991divergence) of weights’ probability distributions of two tensor blocks (chen2019locality).
For each block, its LSH signature is computed and used as the search key, and the identifier of the block (TensorID, BlockID) is used as the value. The key-value pair is sent to an index to look up a group of similar blocks that collide on the signature. For each group, the first indexed block is used as the representative block of this group, and other blocks are replaced by this representative block if the accuracy drop is tolerable. If another block in the group has the same BlockID as the representative block, the BlockID field, which encodes the block’s position along all dimensions of the tensor, can be omitted to save space.
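A minimal sketch of this signature-keyed lookup is given below. It assumes the common random-projection form of Euclidean LSH, in which each hash is `floor((a · v + b) / w)` with Gaussian `a` and uniform offset `b`. The class `L2LSH`, the function `lookup_or_insert`, and the parameters `num_hashes` and `bucket_width` are illustrative names and defaults, not the netsDB implementation.

```python
import math
import random


class L2LSH:
    """Random-projection LSH for Euclidean distance (illustrative sketch)."""

    def __init__(self, dim, num_hashes=4, bucket_width=1.0, seed=0):
        rng = random.Random(seed)
        self.width = bucket_width
        self.projections = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                            for _ in range(num_hashes)]
        self.offsets = [rng.uniform(0.0, bucket_width) for _ in range(num_hashes)]

    def signature(self, block):
        """Hash a flattened block to a tuple of bucket ids."""
        return tuple(
            math.floor((sum(a * x for a, x in zip(proj, block)) + off) / self.width)
            for proj, off in zip(self.projections, self.offsets)
        )


def lookup_or_insert(index, lsh, ident, block):
    """Return the representative for `block`, creating a group if needed.

    `ident` is a (TensorID, BlockID) pair; `index` maps signature -> group.
    """
    sig = lsh.signature(block)
    group = index.get(sig)
    if group is None:
        # First block with this signature becomes the representative.
        index[sig] = {"representative": ident, "members": [ident]}
        return ident
    group["members"].append(ident)
    return group["representative"]


lsh = L2LSH(dim=4)
index = {}
first = lookup_or_insert(index, lsh, ("model1", 0), [0.1, 0.2, 0.3, 0.4])
second = lookup_or_insert(index, lsh, ("model2", 0), [0.1, 0.2, 0.3, 0.4])
# `second` is ("model1", 0): identical blocks always collide.
```

Whether a colliding block is actually replaced by its representative is decided separately by the accuracy check; this sketch covers only the lookup.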
4.3. Index Building
Given a set of models, we execute the following steps for each model:
Step 1. Calculate an aggregated magnitude value (e.g., average, median, 1st percentile, 3rd percentile, etc.) for each tensor block in the tensors of the model. We use the 3rd percentile, because even if the block contains only a few large magnitude weights, it may impact the inference accuracy significantly and should not be deduplicated. 3rd percentile can better reflect the magnitude of large weights in this block than aforementioned alternatives.
Step 2. Order all tensor blocks in the model by their magnitude values in ascending order.
Step 3. Select blocks that have the lowest magnitude values, and for each block, compute its LSH signature and use it to query the index. If the index has seen similar blocks before, the block’s identifier will be added to the corresponding group and this block will be replaced by the representative block, which is the first indexed block in this group. If the index has not seen a similar block before, a new group will be created, and this block becomes the representative block in the group.
Step 4. We test the model using a validation dataset to check whether its inference accuracy drop is less significant than a threshold . If so, the algorithm repeats Steps 3 and 4. Otherwise, it stops deduplication for this model: it simply adds each remaining block to the corresponding group, but such a block will NOT be replaced by the representative block in the group. Such remaining blocks, as well as the representative blocks, are called distinct blocks (a remaining block may also be the representative block of its own group), each of which has only one physical copy.
We repeat the above process for each model to incrementally construct the index, as illustrated in Alg. 1. The inputs of the algorithm include: (1) , which is a set of tensors belonging to the model; (2) , which maps an LSH signature to a representative block and a cluster consisting of the identifiers of blocks of which the signatures collide and thus are similar to the representative block; (3) , which is a list of distinct tensor blocks derived from previous models. The and are shared by all models and will be updated during the execution of the algorithm.
The output of the algorithm is . Each is a mapping for the -th tensor in the model, which specifies the identifier of the distinct tensor block corresponding to each (logical) block in the tensor. The deduplication is achieved by allowing multiple tensor blocks across models mapped to one distinct block. The output information is needed to pack distinct tensor blocks to pages as detailed in Sec. 5.
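The four steps can be sketched as a single loop per model. This is an illustrative outline rather than Alg. 1 itself: the function name `build_index_for_model`, its parameters (`signature_fn`, `magnitude_fn`, `accuracy_fn`, `batch_size`, `max_drop`), and the dictionary-based index layout are all assumptions of this sketch. The magnitude aggregate is passed in as a callable, and the sketch does not roll back the last batch when the accuracy check fails, since the text does not describe a rollback.

```python
def build_index_for_model(blocks, index, signature_fn, magnitude_fn,
                          accuracy_fn, batch_size, max_drop):
    """Magnitude-aware duplicate detection for one model (sketch).

    blocks: dict mapping (TensorID, BlockID) -> flattened weights.
    index:  dict shared across models, signature -> group.
    Returns a mapping from each logical block to its distinct block.
    """
    # Steps 1-2: order blocks by aggregated magnitude, ascending.
    order = sorted(blocks, key=lambda ident: magnitude_fn(blocks[ident]))
    mapping = {ident: ident for ident in blocks}
    baseline = accuracy_fn(mapping)

    def add_to_group(ident):
        sig = signature_fn(blocks[ident])
        group = index.setdefault(sig, {"representative": ident, "members": []})
        group["members"].append(ident)
        return group["representative"]

    pos = 0
    while pos < len(order):
        # Step 3: deduplicate the next batch of lowest-magnitude blocks.
        batch = order[pos:pos + batch_size]
        for ident in batch:
            mapping[ident] = add_to_group(ident)
        pos += len(batch)
        # Step 4: validate; stop once the accuracy drop reaches the threshold.
        if baseline - accuracy_fn(mapping) >= max_drop:
            break
    # Remaining blocks join their groups but are NOT replaced.
    for ident in order[pos:]:
        add_to_group(ident)
    return mapping
```

In a usage scenario, `accuracy_fn` would run the deduplicated model on a validation set, and `signature_fn` would be an LSH signature; both are left abstract here so that the control flow stays visible.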
Further Optimizations. In order to further improve the accuracy, after deduplicating the models based on the constructed index, an additional parameter finetune stage can be carried out to optimize the accuracy after deduplication. In our implementation, for simplicity, during the finetune process, the tensor blocks that are shared by multiple models will be frozen, and only the weights in the private pages will be tuned for each model.
Removal and Updates. If a tensor block in a model needs to be removed, the LSH signature of the block is computed to query the index. If there exists a match and the block’s identifier exists in the corresponding group, the identifier will be removed from the group. Adding or removing blocks from the group will not affect the representative block of the group. If the representative block is the only block in the group, and it is to be removed, the group will be removed. The update of a tensor block can be regarded as a removal followed by an insertion.
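The removal rule can be sketched as follows, assuming a dictionary index that maps each signature to a group with a `representative` and a `members` list. The function name `remove_block` and this layout are hypothetical, and the case where the last remaining member is not the representative is handled by dropping the group, which is a choice of this sketch.

```python
def remove_block(index, signature, ident):
    """Remove a block identifier from its group; return True if it was found."""
    group = index.get(signature)
    if group is None or ident not in group["members"]:
        return False
    group["members"].remove(ident)
    # The representative field is left untouched by removals.
    if not group["members"]:
        # The group held only this block, so the group itself is removed.
        del index[signature]
    return True


index = {
    (0, 0): {"representative": ("m1", 0), "members": [("m1", 0), ("m2", 0)]},
    (5, 5): {"representative": ("m1", 1), "members": [("m1", 1)]},
}
remove_block(index, (0, 0), ("m2", 0))  # group (0, 0) keeps its representative
remove_block(index, (5, 5), ("m1", 1))  # group (5, 5) is deleted
```

An update of a block can then be expressed as `remove_block` with the old signature followed by an insertion with the new one.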
5. Grouping Tensor Blocks into Pages
Based on Sec. 4, we obtained a mapping from each (logical) tensor block to a (physical) distinct block. Each tensor may consist of both private distinct blocks that belong to only one tensor and shared distinct blocks that belong to multiple tensors. Now we investigate the problem of how to pack multiple tensor blocks to database pages, so that we can maximize the sharing of pages and minimize the total number of pages that are needed.
5.1. Inconsistent Pages and Tensor Blocks
Database storage organizes data in pages, so that a page is the smallest unit of data for I/O read/write and cache load/evict operations. Analytics databases usually use a page size significantly larger than a tensor block (e.g., Spark uses megabytes page size and block shape by default (meng2016mllib)). As a result, a database page may contain multiple tensor blocks. Each tensor consists of a set of pages that should contain exactly the set of tensor blocks belonging to the tensor: no more and no less. If these pages contain tensor blocks that do not belong to the tensor, it will significantly complicate the scanning and various operations over the tensor.
However, the default paging process used in database systems cannot work well with deduplication. By default, tensor blocks are packed into pages based on the ordering of the time when each block is written to the storage. If a page can hold up to tensor blocks, every batch of consecutive tensor blocks are packed into one page. Then a page deduplication process is performed, so that each distinct page will be physically stored once. However, such default packing with page-level deduplication is sub-optimal, because deduplicable tensor blocks may not be adjacent to each other spatially. As illustrated in Fig. 3, the default packing requires pages, while the optimal packing scheme requires only pages.
5.2. Problem Formalization
The problem is: How to group the tensor blocks across all models to pages to satisfy that: (1) For each tensor, we can find a subset of pages so that the set of tensor blocks contained in the pages is exactly the set of all tensor blocks that belong to the tensor; (2) The total number of distinct pages that need to be physically stored is minimized.
Here we formalize the problem as a variant of the bin packing problem, where each bin represents a page that holds a limited number of tensor blocks, and each distinct tensor block represents an item. Given $k$ tensors $T = \{t_1, \dots, t_k\}$ and a set of $m$ distinct tensor blocks $I = \{item_1, \dots, item_m\}$ derived from these tensors, a Boolean value $a_{ij}$ specifies whether $item_i$ exists in the $j$-th tensor, with $1 \le i \le m$ and $1 \le j \le k$, as described in Sec. 4. The problem is to look for a bin-packing scheme that packs the items (i.e., distinct tensor blocks) to $n$ bins (i.e., pages), denoted as $B = \{bin_1, \dots, bin_n\}$, where each bin can hold at most $l$ items and each item can be allocated to one or more bins, denoted as $\sum_{i=1}^{m} p_{ij} \le l$ and $\sum_{j=1}^{n} p_{ij} \ge 1$. The Boolean value $p_{ij}$ denotes whether $item_i$ exists in $bin_j$. The bin packing mechanism must satisfy the following conditions: (1) the total number of bins, $\sum_{j=1}^{n} y_j$, is minimized, where the Boolean value $y_j$ denotes whether $bin_j$ is used; (2) $\forall t_i \in T$, $\exists B_i \subseteq B$, so that $\bigcup_{bin \in B_i} bin = \{item_q \mid a_{qi} = 1\}$, which means the set of distinct items contained in a tensor $t_i$ is equivalent to the set of distinct items contained in all bins belonging to $t_i$.
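Condition (2) and the capacity limit can be checked mechanically for any candidate packing. The following sketch assumes tensors are given as sets of distinct block identifiers and pages as frozensets; the function name and the example data are hypothetical.

```python
def is_valid_packing(tensors, pages, page_capacity):
    """Return True if no page exceeds the capacity and every tensor can be
    represented exactly by a subset of the pages."""
    if any(len(page) > page_capacity for page in pages):
        return False
    for blocks in tensors.values():
        covered = set()
        for page in pages:
            # A page is usable by a tensor only if it holds no foreign block.
            if page <= blocks:
                covered |= page
        # The usable pages together must contain every block of the tensor.
        if covered != blocks:
            return False
    return True


tensors = {"t1": {"a", "b", "s1", "s2"}, "t2": {"s1", "s2", "c", "d"}}
good = [frozenset({"a", "b"}), frozenset({"s1", "s2"}), frozenset({"c", "d"})]
bad = [frozenset({"a", "s1"}), frozenset({"b", "c"}), frozenset({"s2", "d"})]
```

The check relies on the observation that taking all pages that are subsets of a tensor yields the largest possible cover, so a valid subset exists exactly when that cover equals the tensor's block set.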
Problem Importance and Hardness. It is an important problem, because large page sizes up to hundreds of megabytes, are widely adopted in analytics databases (zaharia2010spark) and when memory resource becomes insufficient, even saving only a few pages may significantly reduce the memory footprint and improve the performance.
The problem is a variant of the bin-packing problem where items (i.e., distinct blocks) can share space when packed into a bin (i.e., a page) (korf2002new; sindelar2011sharing), which is NP-hard. A dynamic programming strategy, which searches packing plans for one tensor first and then repeatedly packs more tensors based on the previously searched packing plans, easily fails because the search space explodes.
5.3. Equivalent Class-based Packing
While approximation algorithms (coffman2013bin), such as Best-fit, First-fit, and Next-fit, are widely used for general bin-packing problems, they are suboptimal for the above problem, because they do not consider how tensor blocks are shared by tensors.
To solve the problem, we propose to group tensor blocks that are shared by the same set of tensors together into equivalent classes. Different tensor blocks that are shared by the same set of tensors are regarded as equivalent in terms of page packing.
As illustrated in Fig. 4, which depicts the tensor sharing relationship for the example in Fig. 3, distinct blocks are shared by Tensor 1() and Tensor 2(), these distinct tensor blocks can be grouped to the same equivalent class . Four distinct tensor blocks are private to and they can be grouped to the same equivalent class , and so do the blocks private to ().
It is beneficial to use a divide and conquer strategy to pack for each equivalent class in parallel by grouping the blocks falling into the same equivalent class to the same page(s). That’s because each page can be shared by all tensors associated with the page’s corresponding equivalent class. By doing so, in the above example (Fig. 3 and Fig. 4), the distinct blocks in equivalent class will be packed to three pages, the four distinct blocks in will be packed to one page, and the four distinct blocks in will be packed to one page, which leads to the optimal plan, as shown in Fig. 3. The algorithm is illustrated in Alg. 2.
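A minimal sketch of the equivalent class-based packing idea follows. It is not a transcription of Alg. 2: the function name, the input representation (tensor name mapped to a set of distinct block identifiers), and the example are assumptions made for illustration.

```python
from collections import defaultdict


def pack_by_equivalence_class(tensors, page_capacity):
    """Group blocks by the set of tensors sharing them, then pack each group
    into its own pages so that no page mixes blocks of different classes."""
    owners = defaultdict(set)
    for name, blocks in tensors.items():
        for block in blocks:
            owners[block].add(name)

    classes = defaultdict(list)
    for block, names in owners.items():
        # Blocks shared by exactly the same tensors are equivalent.
        classes[frozenset(names)].append(block)

    pages = []
    for blocks in classes.values():
        blocks.sort()  # deterministic order within a class
        for start in range(0, len(blocks), page_capacity):
            pages.append(frozenset(blocks[start:start + page_capacity]))
    return pages


tensors = {"t1": {"a", "b", "s1", "s2"}, "t2": {"s1", "s2", "c", "d"}}
pages = pack_by_equivalence_class(tensors, page_capacity=2)
```

Because each class is packed independently, the loop over `classes` could be run in parallel, which is the divide and conquer aspect described above.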
5.4. A Two-Stage Page Packing Strategy
The problem with the equivalent class-based packing is that it may lead to non-full pages, because items in certain equivalent classes may not fully fill the bins. For another example, as illustrated in Fig. 5, if a bin can hold at most two items, the items in , , will be packed to three non-full bins respectively. However, a better scheme is to pack these items into two bins: and . Considering that a page may have a size of up to tens or hundreds of megabytes, repacking non-full pages can significantly improve storage efficiency, memory footprint, and data locality. Therefore, we propose a two-stage strategy for optimizing page packing schemes. At the first stage, items from each equivalent class are packed to bins separately, and no bin is allowed to mix items from different equivalent classes. Then, at the second stage, we repack items from non-full bins, by applying an approximation algorithm based on the following heuristics: (1) Largest-Tensor-First. A tensor that contains more tensor blocks to be repacked is more likely to generate pages that can be reused by other tensors. (2) Hottest-Block-First. Frequently shared tensor blocks, if packed together, are more likely to generate pages that can be reused across multiple tensors.
The approximation algorithm picks the tensor that has the most tensor blocks in non-full pages to repack first. When it repacks for a given tensor, it first attempts to identify and reuse packed pages that cover as many blocks to repack as possible. Then it orders the remaining tensor blocks based on their sharing frequency (i.e., the number of tensors a block is shared by), and simply packs these blocks to pages in order, without leaving any holes in a page except for the last page. We formalize the algorithm for the second stage as Alg. 3. The algorithm for the first stage is the same as Alg. 2.
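The two stages can be sketched together as below. This is a simplified reading of the description, not Alg. 3 itself: function names, the static tensor ordering, the tie-breaking by name, and the toy example are assumptions.

```python
from collections import defaultdict


def pack_by_equivalence_class(tensors, page_capacity):
    """Stage 1: pack blocks shared by the same set of tensors together."""
    owners = defaultdict(set)
    for name, blocks in tensors.items():
        for block in blocks:
            owners[block].add(name)
    classes = defaultdict(list)
    for block, names in owners.items():
        classes[frozenset(names)].append(block)
    pages = []
    for blocks in classes.values():
        blocks.sort()
        for start in range(0, len(blocks), page_capacity):
            pages.append(frozenset(blocks[start:start + page_capacity]))
    return pages


def two_stage_pack(tensors, page_capacity):
    stage1 = pack_by_equivalence_class(tensors, page_capacity)
    full = [p for p in stage1 if len(p) == page_capacity]
    # Blocks sitting in non-full pages are candidates for repacking.
    leftover = set().union(*[p for p in stage1 if len(p) < page_capacity])
    freq = {b: sum(b in blocks for blocks in tensors.values()) for b in leftover}
    todo = {name: blocks & leftover for name, blocks in tensors.items()}

    repacked = []
    # Largest-Tensor-First: most blocks to repack goes first.
    for name in sorted(todo, key=lambda n: (-len(todo[n]), n)):
        remaining = set(todo[name])
        for page in repacked:
            # Reuse any already repacked page that holds no foreign block.
            if page <= tensors[name]:
                remaining -= page
        # Hottest-Block-First: most widely shared blocks are packed together.
        ordered = sorted(remaining, key=lambda b: (-freq[b], b))
        for start in range(0, len(ordered), page_capacity):
            repacked.append(frozenset(ordered[start:start + page_capacity]))
    return full + repacked


tensors = {"t1": {"x", "y"}, "t2": {"x", "z"}}
pages = two_stage_pack(tensors, page_capacity=2)
```

In this toy input, stage 1 alone produces three half-empty pages (one per class), while the second stage stores block `x` in two pages and ends up with two full pages, `{x, y}` and `{x, z}`.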
Online Packing. The proposed algorithms can also be utilized for online packing of tensor blocks to pages. Each time a new tensor is about to be added to the database, the list of tensor blocks in this tensor as well as in all related tensors (i.e., tensors which share at least one block with the new tensor) is retrieved to run the proposed algorithm and obtain a new packing scheme. Then the difference between the new packing scheme and the existing packing scheme is computed. Only the pages that need to be changed are repacked.
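The difference between two packing schemes can be computed with plain set operations when each page is identified by the set of blocks it holds. The sketch below assumes that representation; the function name and the example pages are hypothetical.

```python
def diff_packing(old_pages, new_pages):
    """Compare two packing schemes, each an iterable of frozensets of block ids."""
    old, new = set(old_pages), set(new_pages)
    return {
        "reused": old & new,     # identical pages, left untouched
        "discarded": old - new,  # pages that no longer appear
        "created": new - old,    # pages that must be written
    }


old_scheme = [frozenset({"a", "b"}), frozenset({"s1", "s2"}), frozenset({"c"})]
new_scheme = [frozenset({"a", "b"}), frozenset({"s1", "s2"}), frozenset({"c", "e"})]
delta = diff_packing(old_scheme, new_scheme)
```

Only the pages in `delta["discarded"]` and `delta["created"]` require I/O, which is why a high reuse fraction keeps online packing cheap.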
6. Buffer Pool Management
A model serving workload involves multiple types of tensors that have different locality patterns. For example, the model parameter tensors at each layer are persisted to disk and are repeatedly read for making inferences; the input feature vector also needs to be persisted, but is read only once. The intermediate features output from each layer do not need to be persisted and are read only once.
Existing works have shown that, compared to LRU/MRU/LFU, which only consider reference time/distance/frequency, a fine-grained buffer pool management strategy that groups different types of data based on a locality set abstraction (zou2019pangea; zou2020architecture; chou1986evaluation) and considers the access pattern and durability requirements of each locality set can achieve better data locality for large-scale data analytics processing (zou2019pangea; zou2020architecture). A locality set is a set of pages that will be processed similarly. For example, the pages in each equivalent class are regarded as a separate locality set. Users can configure the page eviction policy, e.g., MRU or LRU, for each locality set. When pages need to be evicted from the buffer pool to make room for new pages, the system chooses a locality set to be the victim locality set if the next page-to-be-evicted from the locality set has the lowest expected eviction cost among all locality sets. The expected eviction cost is formalized in Eq. 6.
$$c_w + p_{reuse} \times c_r \qquad (6)$$
Here, $c_w$ is the cost for writing out the page, $c_r$ is the cost for loading it back for reading, and $p_{reuse}$ is the probability of accessing the page within the next $t$ time ticks. The formulations of $c_w$ and $c_r$ in existing works (zou2020architecture; zou2019pangea; chou1986evaluation) have considered the lifetime, durability requirements, access patterns, etc. of each locality set, and can be reused for this work. However, when modeling $p_{reuse}$, existing works did not consider page sharing caused by model deduplication. To address the problem, we need to reformulate this factor.
In the scenario of serving multiple models, we propose to apply queueing theory (kleinrock1976queueing) to model the page accesses so that each page is like a server, and each model inference request that triggers a page access is like a customer. Because a page may be shared by multiple models, inference requests from each model will be dispatched to a queue associated with the model. If we assume the arrival time of the next access to each page from each queue follows an independent Poisson point process (kleinrock1976queueing), the probability of reusing each page, $p_{reuse}$ (i.e., the probability that the page will be accessed within $t$ time ticks), can be estimated using Eq. 7. Here, $M$ represents the set of models that share this page, and $\lambda_m$ denotes the access rate per time tick for the model $m \in M$.
$$p_{reuse} = 1 - e^{-\left(\sum_{m \in M} \lambda_m\right) t} \qquad (7)$$
This approach is more accurate than simply estimating $p_{reuse}$ based on the reference frequency/distance measured for each page, because the access patterns of the various datasets involved in each model inference are fixed and mostly affected by $\lambda_m$.
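The effect of page sharing on eviction can be illustrated numerically. The sketch below assumes that the expected eviction cost has the form write cost plus reuse probability times read cost, and uses the standard result that independent Poisson arrivals with per-model rates have at least one arrival within a horizon with probability one minus the exponential of the negated summed rate times the horizon. Function and parameter names are hypothetical.

```python
import math


def reuse_probability(access_rates, horizon):
    """Probability of at least one access within `horizon` time ticks, given
    independent Poisson arrivals with one rate per model sharing the page."""
    return 1.0 - math.exp(-sum(access_rates) * horizon)


def expected_eviction_cost(write_cost, read_cost, access_rates, horizon):
    """Assumed cost form: write-out cost plus expected cost of reading back."""
    return write_cost + reuse_probability(access_rates, horizon) * read_cost


# A page shared by three models versus a page private to one model.
shared_cost = expected_eviction_cost(1.0, 4.0, [0.2, 0.1, 0.3], horizon=5)
private_cost = expected_eviction_cost(1.0, 4.0, [0.2], horizon=5)
```

With equal I/O costs, the shared page has the higher expected eviction cost, so a policy that evicts the cheapest candidate tends to keep shared pages in memory longer.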
7. Evaluation
In this section, we will answer the following questions:
(1) How effective is the proposed synergistic model deduplication mechanism in reducing the latency and improving the storage efficiency for various model serving scenarios? (Sec. 7.2)
(2) How will the proposed index approach affect the time required for detecting the duplicate blocks, the overall storage efficiency, and the model serving accuracy? (Sec. 7.3)
(3) How will the proposed strategies of packing blocks to pages affect the storage efficiency and the computation overheads, compared to various baselines? (Sec. 7.4)
(4) How will optimized caching improve memory locality? (Sec. 7.5)
(5) How will deduplication work with popular model compression techniques, such as pruning and quantization? (Sec. 7.6)
7.1. Evaluation and Workloads
7.1.1. Multiple Versions of Personalized Text Embedding Models
Text embedding is important for many natural language processing applications, and its accuracy can be greatly improved using a large open corpus such as Wikipedia (wikipedia-data). However, every enterprise or domain has its own terminology, which is not covered by the open data. To personalize the text embeddings, for each domain, we need to train a separate model on both the shared open data and the private domain/enterprise data. Word2Vec is a two-layer neural network used to generate word embeddings. We use skip-gram Word2Vec as well as negative sampling, with negative samples, and noise contrastive estimation (NCE) loss. We deploy a Word2Vec model pretrained using a Wikipedia dump and downloaded from TFHub (tfhub). The model embeds the 1 million most frequent tokens in the corpus. Then we finetune the pre-trained model using different domain-specific corpora, including texts extracted from Shakespeare’s plays (TF-shakespeare), posts collected from the Firefox support forum (Web-text-corpus), articles collected from Fine Wine Diary (Web-text-corpus), Yelp reviews (zhang2015character), and IMDB reviews (maas2011learning). The input document is processed with a skip window size of 1. The Word2Vec embedding layer has one million dimensional embedding vectors corresponding to one million words in the dictionary. Therefore, a weight tensor is in the shape of .
The inference of a Word2Vec model on netsDB is implemented via matrix multiplication, where an input feature vector is of the shape of , representing a batch of input words, sentences, or documents. A word can be represented as a one-million-dimensional one-hot encoding vector (one dimension per vocabulary word), where the corresponding word in the vocabulary is specified as 1, and other words are specified as 0. Then, multiplying the batch of encoding vectors of words with the embedding weight matrix will output the batch of embedding vectors for these words. Similarly, the encoding vector for a sentence or a document, which is seen as a "bag of words", can be represented as the sum of the one-hot encoding vectors of all the words in this sentence or document. By multiplying the batch of encoding vectors and the embedding weight matrix, the embedding for each sentence or document is obtained as the weighted sum of the embedding vectors of the words in this sentence or document.
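The equivalence between matrix multiplication over one-hot encodings and direct embedding lookup can be verified on a small example. The sketch below uses NumPy with a deliberately tiny vocabulary and embedding dimension; these sizes, the random weights, and the helper names are assumptions chosen for illustration and are not the configuration used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 8, 3                          # toy sizes, not the paper's
weights = rng.standard_normal((vocab_size, embed_dim))


def one_hot_batch(word_ids, vocab_size):
    """One row per word: 1 at the word's vocabulary index, 0 elsewhere."""
    batch = np.zeros((len(word_ids), vocab_size))
    batch[np.arange(len(word_ids)), word_ids] = 1.0
    return batch


def bag_of_words(word_ids, vocab_size):
    """Sum of the one-hot vectors of all words in a sentence or document."""
    vec = np.zeros(vocab_size)
    np.add.at(vec, word_ids, 1.0)                     # handles repeated words
    return vec


word_embeddings = one_hot_batch([2, 5], vocab_size) @ weights
doc_embedding = bag_of_words([1, 1, 4], vocab_size) @ weights
```

Here `word_embeddings` equals rows 2 and 5 of `weights`, and `doc_embedding` equals twice row 1 plus row 4, i.e., the count-weighted sum of the word embeddings.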
7.1.2. Multiple Versions of Text Classification Models
We further investigate a scenario that serves five different text semantic classification models. Each classification task takes a review as input and outputs a binary label to indicate whether the input is toxic or nontoxic (zhang2015character; maas2011learning; borkan2019nuanced). All tasks use the same model architecture. Each model uses three layers. The first layer is a Word2Vec layer as mentioned in Sec. 7.1.1, using a vocabulary size of one million and an embedding dimension of . The second layer is a fully connected layer that consists of merely parameters, and the third layer is an output layer that consists of parameters. Because the fully connected layer is small in size, we encode it in a UDF that is applied to the output of the Word2Vec embedding layer.
The first two text semantic classification models are trained using the same IMDB review datasets. The difference is that Model-1’s Word2Vec layer uses the weights of a pre-trained model directly downloaded from TFHub as mentioned in Sec. 7.1.1, which is set as Non-Trainable, so that only the weights of the fully connected layers are changed during the training process. However, Model-2’s Word2Vec layer is set to be Trainable, which means the weights of the layer will also change during the training process. Similarly, Model-3 and Model-4 are trained using Yelp datasets, with the Word2Vec layer set to be Non-Trainable and Trainable respectively. Model-5 is trained using the civil comments (borkan2019nuanced), which are collected from news sites with labeled toxicity values, and the Word2Vec layer in this model is set to be Trainable.
7.1.3. Transfer Learning of Extreme Classification Models
Following TRA (yuan2020tensor), a two-layer feed-forward neural network (FFNN) is implemented in our proposed system for the AmazonCat-14K (mcauley2015image; mcauley2015inferring) benchmark. This FFNN requires five parameter tensors: the weight tensors and bias tensors of the two layers, and the input tensor for which predictions are generated. The input tensor includes data points that have features, and the extreme classification task uses labels. The hidden layer has neurons. Therefore, the weight tensor (denoted as ) in the first layer has parameters, and the weight tensor (denoted as ) in the second layer has parameters.
A transfer learning scenario is tested, where the first layer is frozen, and is specialized for different tasks. Only for this scenario, the inputs, weights, and biases are randomly generated instead of being trained from real-world data as in the other scenarios. The experiments are still reasonable, as deduplication in this scenario hardly affects the inference accuracy. That is because used in all the models are the same and thus no weights need to be approximated for deduplicating it, and we also choose not to deduplicate any blocks from the specialized and smaller layer.
The implementation of the feed-forward inference at each fully-connected layer is illustrated in Fig. 1.
Evaluation Environment Setup. Unless explicitly specified, most of the experiments used an AWS r4.xlarge instance that has four vCPU cores and gigabytes RAM. The storage volumes include a GB SSD, and a GB hard disk drive. For the experiments on the GPU, we used an AWS g4dn.2xlarge instance that is equipped with one NVIDIA T4 Tensor Core GPU that has gigabytes memory, besides eight CPU cores and gigabytes host memory.
7.2. Overall Evaluation Results
7.2.1. Multiple Versions of Personalized Text Embeddings
We find that word embedding models finetuned from the same TFHub pretrained Word2Vec model share more than of pages. (The accuracy of each Word2Vec model after finetuning is above .) Each model is a tensor, stored in a set of tensor blocks in the shape of , each weight is stored in double precision. Without our proposed deduplication mechanism, storing six word embedding models separately requires more than gigabytes storage space. However, by applying our work, only gigabytes storage space is required, which is a reduction. Note that the overall memory requirements for serving models will be higher than the storage requirements, as we also need to cache the intermediate data, which includes the join HashMap constructed for probing the model parameters, and about gigabytes input data.
In Tab. 1 and Tab. 2, we measured the total latency of making a batch of inferences on all six models using different configurations of buffer pool size and storage hardware. We observed that our proposed deduplication mechanism brought up to and speedups in model serving latency for SSD and HDD storage respectively.
|num models||disk type||w/o dedup||w/ dedup & optimized caching|
|disk type||buffer pool size||w/o dedup||w/ dedup||w/ dedup & optimized caching|
We also compared netsDB’s performance to the CPU-based TensorFlow on the same AWS r4.xlarge instance and the GPU-based TensorFlow on a g4dn.2xlarge instance. On TensorFlow, we developed two approaches for Word2Vec inference.
The first approach used matrix multiplication (tf.matmul), which is similar to the netsDB implementation of Word2Vec inference as mentioned in Sec. 7.1.1. In the experiments of comparing this approach and netsDB, we used double precision for both systems.
The second approach is based on embedding lookup by using Keras’ Word2Vec embedding layer (i.e., keras.layers.Embedding). The implementation takes a list of IDs as input, and searches the embedding for each ID (via index) in parallel.
For the second approach, because Keras’ embedding layer enforces single precision, we changed netsDB implementation to use the single-precision float type. The experiments for this approach used million IDs in each batch. For netsDB’s implementation based on matrix multiplication, we assume the million IDs are from documents, and each document has different words, so its input features include vectors, each vector is a sum of the one-hot embedding vectors of words, as mentioned in Sec. 7.1.1. The input batch has megabytes in size for the implementation based on matrix multiplication, but only megabytes for the implementation based on embedding lookup.
In Tab. 3, TF-mem, TF-file, and TF-DB load an input batch from the local memory, a local CSV file, and a PostgreSQL table ( BLOB fields for the first approach, and BLOB field for the second approach), respectively. We observed that netsDB supports the inference of significantly more models in the same system than TensorFlow. For this case, we did not observe performance gain brought by GPU acceleration in TensorFlow, mainly because inference is less complicated than training and a batch of such inferences cannot fully utilize the GPU parallelism.
|TensorFlow CPU||TensorFlow GPU|
|Matrix-Multiplication-based inference, double precision|
|Embedding-lookup-based inference ( million IDs/batch), single precision|
7.2.2. Multiple Versions of Text Classification Models
Based on the above results, we further evaluated the proposed techniques on the text classification task described in Sec. 7.1.2.
We imported these text classification models into netsDB. The default page size used in this experiment is megabytes and when using a block shape of , each text classification model requires pages of storage size before deduplication. We first compared the required number of private and shared pages after deduplication as well as the classifier inference accuracy before and after deduplication. The comparison results are illustrated in Tab. 4.
Without deduplication, the total storage space required is GB for pages in total. After applying the proposed deduplication mechanism, the total storage space required is reduced to GB for pages, using the block size of .
|private pages||num shared pages||auc before dedup||auc after dedup|
|pages shared by 5 models|
|pages shared by 4 models|
|pages shared by 3 models|
|pages shared by 2 models|
Each shared page may have a different reference count (i.e., shared by a different set of tensors). So we illustrate the reference counts of pages for each model in Tab. 5.
The comparison of the overall inference latency of all five text classification models, using different block sizes and storage configurations, is illustrated in Tab. 6. We observed that to speedup were achieved by applying our proposed techniques.
7.2.3. Transfer Learning of Extreme Classification Models
In this experiment, all three models have the same architecture as described in Sec. 7.1.3, using double precision weights. They are specialized from the same feed-forward model through transfer learning and share a fully connected layer, which contains millions of parameters. This layer is stored as a shared set in netsDB, and it accounts for gigabytes of storage space. Each model’s specialized layer only accounts for gigabytes of storage space. Therefore, with deduplication of the shared layer, the overall required storage space is reduced from gigabytes to gigabytes. We need to note that the required memory size for storing the working sets involved in this model-serving workload is almost twice the required storage space, considering the input batch of the dimensional feature vectors and the intermediate data between layers for both models.
Besides a significant reduction in storage space, we also observed up to and speedup in SSD and HDD storage respectively, because of the improvement in cache hit ratio (). Because this is a transfer learning scenario, the shared pages involve no approximation at all, so there is no influence on accuracy.
|disk type||buffer pool size||w/o dedup||w/ dedup||w/ dedup & optimized caching|
|disk type||buffer pool size||w/o dedup||w/ dedup||w/ dedup & optimized caching|
We also compared the netsDB performance to TensorFlow, using the Keras implementation of the FFNN model. As illustrated in Tab. 8, netsDB outperforms TensorFlow for loading input from a CSV file and a Blob field of a PostgreSQL table. If we compute and store the input feature vectors in a table of Blob fields, the TF-DB latency for CPU and GPU is and seconds respectively, significantly slower than the latency on netsDB, which serves data and model in the same system.
|TensorFlow CPU||TensorFlow GPU|
7.3. Evaluation of Duplicate Block Detection
We compared our indexing strategy as illustrated in Alg. 1 to two baselines: (1) A naive indexing scheme using pair-wise comparison to identify similar blocks based on Euclidean distance; (2) Mistique’s approximate deduplication using MinHash (vartak2018mistique). As illustrated in Fig. 6, we observed significant accuracy improvement brought by our proposed deduplication detection approaches (w/ and w/o finetune) for deduplicating the same amount of blocks. That’s because both baselines failed to consider a block’s magnitude as well as its impact on accuracy.
Moreover, we also compared the compression ratio, the average latency for querying one tensor block from the index, and the accuracy of our proposed approach to (1) Mistique's exact deduplication approach, where two tensor blocks are deduplicated only if they have the same hash code; (2) Mistique's approximate deduplication; and (3) an enhanced pairwise comparison approach with magnitude ordering applied. Both (2) and (3) used periodic accuracy checks, for which we evaluate the accuracy of a model once for indexing every five blocks from the model, and we stop deduplication for a model once its accuracy drop exceeds . However, we do not roll back to ensure the accuracy drop is within for these experiments, though such rollbacks can be easily implemented. As illustrated in Tab. 9 and Tab. 10, the proposed approach based on L2 LSH still achieved the best compression ratio. Mistique’s approximate approach is significantly slower in querying the index because a new block needs to be discretized and the MinHash generation requires multiple rounds of permutations. Due to such overhead, the latency required for building an index using the Mistique approximate approach is significantly higher than that of our proposed approach.
|Blocks w/o dedup||Blocks w/ dedup||
|Mistique Exact Dedup|
|Mistique Approximate Dedup|
|Proposed (w/o finetune)|
|Mistique Exact Dedup|
|Mistique Approximate Dedup|
|Proposed (w/o finetune)|
We also visualized the distribution of duplicate blocks across the models for the text classification workload, as illustrated in Fig. 7. The results showed that the blocks that are shared across models tend to be located in the same position of the tensor. This observation leads to the optimization of metadata as described in Sec. 3: metadata such as the index (i.e., position) of a shared tensor block in each tensor can be simplified.
7.4. Evaluation of Page Packing Algorithms
We evaluated our proposed page packing algorithms using four evaluation scenarios: (1) the Two-Stage algorithm, which uses Alg. 2 in stage 1 and then applies Alg. 3 to items in non-full bins in stage 2; (2) the Greedy-1 algorithm, which is based on equivalent classes (Alg. 2); (3) the Greedy-2 algorithm, which applies Alg. 3 to overall page packing; (4) a baseline algorithm, where we simply pack tensor blocks to pages in order, and then eliminate the duplicate pages that contain the same set of tensor blocks.
We observed significant improvement in storage efficiency brought by our proposed two-stage algorithm compared to alternatives, as illustrated in Tab. 11. In addition, the computation efficiency of the two-stage algorithm is comparable to Greedy-1, as illustrated in Tab. 12. As mentioned, the extreme classification workload involves models that share the same fully connected layer, which means all tensor blocks in that layer are fully shared by all models. In such a special case, all algorithms achieve similar storage efficiency.
|Scenario (block size, page size)||Baseline||Two-Stage||Greedy-1||Greedy-2|
|word2vec (, MB)||98||98|
|text classification (, MB)||87||87|
|text classification (, MB)||104|
|text classification (, MB)||195|
|Scenario (block size, page size)||Baseline||Two-Stage||Greedy-1||Greedy-2|
|text classification (, MB)||0.68||0.01||0.01||0.52|
|text classification (, MB)||13.65||0.05||0.05||11.50|
|text classification (, MB)||44.72||0.04||0.04||42.72|
The above results are based on offline page packing. We also tested the online approach of page packing. We find that in the text classification workload, when using block size and megabytes page, each time we add a new model, about of pages need to be reorganized, while of pages can be reused and thus do not need to be changed, as illustrated in Tab. 13.
|Step||New model to pack||pages reused||pages discarded||pages created|
7.5. Evaluation of Caching Optimization
We also compared the proposed caching optimization to a number of baselines, including LRU, MRU, and the locality set-based page replacement policy without considering page sharing. The detailed cache hit ratio comparison for the Word2Vec and text classification applications is illustrated in Tab. 8. Locality Set-M/L refers to the locality set page replacement policy (zou2020architecture; zou2019pangea) that treats shared pages as one locality set and applies MRU/LRU to this locality set of shared pages. Optimized-M/L refers to Locality Set-M/L with the proposed caching optimization applied (i.e., shared pages will be given a higher priority to be kept in memory).
We observed that, after deduplication, the cache hit ratio improved significantly because of the reduction in memory footprint. In addition, with the proposed deduplication approach applied, Optimized-M/L achieved a significantly better cache hit ratio than alternative page replacement policies.
7.6. Relationship to Model Compression
Besides deduplication, there exist a number of model compression techniques, such as pruning (han2015deep; han2015learning) and quantization (jacob2018quantization), which can only be applied to each single model separately. In this work, we found that as a cross-model compression technique, model deduplication can be applied after pruning or quantizing individual models, which achieved to better storage efficiency. The reason is that pruning and quantization will not significantly change the similarity of tensor blocks across models.
We also observed similar results for an ensemble of VGG-16 models. We omit the details here, because the use cases of convolutional neural networks in RDBMS are unclear and the volume of model parameters is relatively small (up to hundreds of megabytes).
Serving deep learning models from RDBMS can greatly benefit from the RDBMS’ physical data independence and manageability. This work proposed several synergistic storage optimization techniques covering indexing, page packing, and caching. We implemented the system in netsDB, an object-oriented relational database. We evaluated these proposed techniques using several typical model serving scenarios, including the serving of (1) multiple fine-tuned word embedding models, (2) multiple text classification models, and (3) multiple extreme classification models specialized through transfer learning. The results showed that our proposed deduplication techniques achieved to reduction in storage size, sped up the inference by to , and improved the cache hit ratio by up to . The results also showed that significantly more models can be served from RDBMS than TensorFlow, which helps to reduce the operational costs of model inferences.