1. Introduction
Data compression is a popular technique for improving the efficiency of data processing workloads such as SQL queries over compressed databases (abadi2006integrating; li2013bitweaving; wesley2014leveraging; elgohary2016compressed; wang2017experimental) and, more recently, machine learning with classical batch gradient methods (elgohary2016compressed). However, to the best of our knowledge, there is no such study of data compression for minibatch stochastic gradient descent (MGD) (hogwild; wu2017bolt; bismarck; kaoudi2017cost; qin2017scalable), which is known for its fast convergence rate and statistical stability, and is arguably the workhorse algorithm (ruder2016overview; hinton2012neural) of modern ML. This research gap grows more critical as training dataset sizes in ML keep increasing (chelba2013one; russakovsky2015imagenet). For example, if no compression is used to train ML models on large datasets that exceed single-machine or even distributed memory capacity, disk I/O time becomes a significant overhead (elgohary2016compressed; yu2012large) for MGD. Figure 1A highlights this issue in more detail.
Despite the need for a good data compression scheme to improve the efficiency of MGD workloads, unfortunately, the main existing data compression schemes designed for general data files or batch gradient methods are not a good fit for the data access pattern of MGD. Figure 1B highlights these existing solutions. For example, general compression schemes (GC) such as Gzip and Snappy are designed for general data files. GC typically achieves good compression ratios on minibatches; however, a minibatch has to be decompressed before any computation can be carried out, and the decompression overhead is significant (elgohary2016compressed) for MGD. Lightweight matrix compression schemes (LMC) include classical methods such as compressed sparse row (saad2003iterative) and value indexing (kourtis2008optimizing) and, more recently, a state-of-the-art technique called compressed linear algebra (elgohary2016compressed). LMC is suitable for batch gradient methods because its compression ratio is satisfactory on the whole dataset and matrix operations can operate directly on the encoded output without decompression overheads. Nevertheless, the compression ratio of LMC on minibatches is generally not as good as that of GC, which makes it less attractive for MGD.
In this paper, we fill this crucial research gap by proposing a lossless matrix compression scheme called tuple-oriented compression (TOC), whose name reflects the fact that tuple boundaries (i.e., boundaries between columns/rows in the underlying tabular data) are preserved. Figure 1C highlights the advantage of TOC over existing compression schemes. TOC has both good compression ratios on minibatches and no decompression overheads for matrix operations, which are the main operations executed by MGD on compressed data. Orthogonal to existing works like GC and LMC, TOC takes inspiration from an unlikely source, the popular string/text compression scheme Lempel-Ziv-Welch (LZW) (welch1984technique; ziv1977universal; ziv1978compression), and builds a compression scheme with compression ratios comparable to Gzip on minibatches. In addition, this paper also proposes a suite of compressed matrix operation execution techniques, tailored to the TOC compression scheme, that operate directly over the compressed data representation and avoid decompression overheads. Even for a small dataset that fits into memory, these compressed execution techniques are often faster than uncompressed execution techniques because they can reduce computational redundancies in matrix operations. Collectively, these techniques present a fresh perspective: weaving together ideas from databases, text processing, and ML can achieve substantial efficiency gains for popular MGD-based ML workloads. Figure 1D highlights the effect of TOC in reducing MGD runtimes, especially on large datasets.
TOC consists of three components at different layers of abstraction: sparse encoding, logical encoding, and physical encoding. All these components respect the boundaries between rows and columns in the underlying tabular data so that matrix operations can be carried out on the encoded output directly. Sparse encoding uses the well-known sparse row technique (saad2003iterative) as a starting point. Logical encoding uses a prefix tree encoding algorithm, based on the LZW compression scheme, to further compress matrices. Specifically, we notice that there are sequences of column values that repeat across matrix rows. These repeated sequences can be stored in a prefix tree, where each tree node represents a sequence. The occurrences of these sequences in the matrix can then be encoded as indexes to tree nodes to reduce space. Note that we only need to store the encoded matrix and the first layer of the prefix tree as encoded outputs, as the prefix tree can be rebuilt from the encoded outputs if needed. Lastly, physical encoding encodes integers and floating-point numbers efficiently.
We design a suite of compressed execution techniques that operate directly over the compressed data representation without decompression overheads for three classes of matrix operations. These matrix operations are used by MGD to train popular ML models such as Linear/Logistic regression, Support vector machine, and Neural network. These compressed execution techniques only need to scan the encoded table and the prefix tree at most once. Thus, they are fast, especially when TOC exploits significant redundancies. For example, right multiplication (e.g., matrix times vector) and left multiplication (e.g., vector times matrix) can be computed with one scan of the encoded table and the prefix tree. Lastly, since these compressed execution techniques for matrix operations differ drastically from the uncompressed execution techniques, we provide mathematical analysis to prove the correctness of these compressed techniques.
To summarize, the main contributions of this paper are:

To the best of our knowledge, this is the first work to study lossless compression techniques to reduce the memory/storage footprints and runtimes of minibatch stochastic gradient descent (MGD), the workhorse algorithm of modern ML. We propose a lossless matrix compression scheme called tuple-oriented compression (TOC) with compression ratios comparable to Gzip on minibatches.

We design a suite of novel compressed matrix operation execution techniques tailored to the TOC compression scheme that directly operate over the compressed data representation and avoid decompression overheads for MGD workloads.

We provide a formal mathematical analysis to prove the correctness of the above compressed matrix operation execution techniques.

We perform an extensive evaluation of TOC against seven compression schemes on six real datasets. Our results show that TOC consistently achieves substantial compression ratios of up to 51x. Moreover, TOC reduces MGD runtimes for three popular ML models by up to 5x compared to the state-of-the-art compression schemes and by up to 10.2x compared to the encoding methods in some popular ML systems (e.g., Scikit-learn (scikitlearn), Bismarck (bismarck), and TensorFlow (abadi2016tensorflow)). An integration of TOC into Bismarck also confirmed that TOC can greatly benefit MGD performance in ML systems.
The remainder of this paper proceeds as follows: § 2 presents some required background information. § 3 explains our TOC compression scheme, while § 4 presents the techniques to execute matrix operations on the compressed data. § 5 presents the experimental results and § 6 discusses TOC extensions. § 7 presents related work, and we conclude in § 8.
2. Background
In this section, we discuss two important concepts: ML training in the generalized setting and data compression.
2.1. Machine Learning Training
2.1.1. Empirical Risk Minimization
We begin with a description of ML training in the generalized setting based on standard ML texts (SSSSS10; shai). Formally, we have a hypothesis space $\mathcal{H}$, an instance set $\mathcal{Z}$, and a loss function $\ell: \mathcal{H} \times \mathcal{Z} \rightarrow \mathbb{R}_{+}$. Given a training set $S = \{z_1, z_2, \ldots, z_n\}$ of i.i.d. draws based on a distribution $\mathcal{D}$ on $\mathcal{Z}$, and a hypothesis $w \in \mathcal{H}$, our goal is to minimize the empirical risk over the training set, defined as

$$L_S(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(w, z_i). \qquad (1)$$
Many ML models including Logistic/Linear regression, Support vector machine, and Neural network fit into this generalized setting (SSSSS10).
2.1.2. Gradient Descent
ML training can be viewed as the process of finding the optimal hypothesis $w^{*} \in \arg\min_{w \in \mathcal{H}} L_S(w)$. This is essentially an optimization problem, and gradient descent is a common and established class of algorithms for solving it. There are three main variants of gradient descent: batch gradient descent, stochastic gradient descent, and minibatch stochastic gradient descent.
Batch Gradient Descent (BGD). BGD uses all the training data to compute the gradient and update $w$ per iteration.
Stochastic Gradient Descent (SGD). SGD uses a single tuple to compute the gradient and update $w$ per iteration.
Minibatch Stochastic Gradient Descent (MGD). MGD uses a small batch of tuples (typically tens or hundreds of tuples) to compute the gradient and update $w$ per iteration:

$$w^{(t+1)} = w^{(t)} - \eta \cdot \frac{1}{|B_t|} \sum_{z_i \in B_t} \nabla \ell(w^{(t)}, z_i), \qquad (2)$$

where $B_t$ is the current ($t$-th) minibatch we visit, $z_i$ is a tuple from $B_t$, and $\eta$ is the learning rate.
Note that MGD can cover the spectrum of gradient descent methods by setting the minibatch size $b = |B_t|$. For example, MGD morphs into SGD and BGD by setting $b = 1$ and $b = n$, respectively.
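To make the update rule concrete, the following is a minimal MGD loop on a toy least-squares objective; the synthetic data, batch size, learning rate, and epoch count are made-up illustrative values, not from the paper.

```python
import numpy as np

# Minimal MGD sketch (cf. Equation 2) on a toy least-squares objective.
# All data and hyperparameters below are illustrative, not the paper's.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
w_true = np.arange(1.0, 6.0)
y = X @ w_true                      # noiseless labels

def mgd(X, y, batch_size=100, lr=0.05, epochs=20):
    n, d = X.shape
    idx = rng.permutation(n)        # shuffle once (cf. Section 2.1.3)
    X, y = X[idx], y[idx]
    w = np.zeros(d)
    for _ in range(epochs):
        for s in range(0, n, batch_size):
            Xb, yb = X[s:s + batch_size], y[s:s + batch_size]
            grad = Xb.T @ (Xb @ w - yb) / len(yb)   # minibatch-averaged gradient
            w -= lr * grad
    return w

w_hat = mgd(X, y)
```

Setting `batch_size` to 1 or to `n` in this sketch recovers SGD and BGD, respectively.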
MGD gains its popularity due to its fast convergence rate and statistical stability. It typically requires fewer epochs (one whole pass over the dataset is an epoch) to converge than BGD and is more stable than SGD (ruder2016overview). Figure 2 illustrates the optimization efficiencies of these gradient descent variants, among which MGD with hundreds of rows in a minibatch achieves the best balance between fast convergence rate and statistical stability. Thus, in this paper, we focus on MGD with minibatch sizes ranging from tens to hundreds of tuples.
2.1.3. Shuffle Once vs. Shuffle Always
The random sampling of tuples in SGD/MGD is typically done without replacement, which is achieved by shuffling the dataset before an epoch (bengio2012practical). However, shuffling the data at every epoch (shuffle always) incurs a high overhead. Thus, we follow the standard technique of shuffling once (bismarck; wu2017bolt; bengio2012practical) (i.e., shuffling the data once upfront) to improve ML training efficiency.
2.1.4. Core Matrix Operations for Gradient Descent
The core operations that dominate the CPU time when using gradient descent to optimize many ML models (e.g., Linear/Logistic regression, Support vector machine, and Neural network) are matrix operations (elgohary2016compressed). We illustrate this point using an example of Linear regression, and summarize the core matrix operations for these ML models in Table 1.
Table 1. Core matrix operations for gradient descent on popular ML models (Linear regression, Logistic regression, Support vector machine, and Neural network); the operations column did not survive extraction.
Example. Consider the supervised learning algorithm Linear regression, where $\mathcal{H} = \mathbb{R}^{m}$, $z = (x, y)$ with feature vector $x \in \mathbb{R}^{m}$ and label $y \in \mathbb{R}$, and $\ell(w, z) = \frac{1}{2}(x^{T}w - y)^{2}$. Let matrix $X = [x_1, x_2, \ldots, x_n]^{T}$ and vector $Y = [y_1, y_2, \ldots, y_n]^{T}$; then the aggregated gradient of the loss function is:

$$\sum_{i=1}^{n} \nabla_w \ell(w, z_i) = \left((Xw - Y)^{T} X\right)^{T}. \qquad (3)$$

Thus, two core matrix operations are needed to compute Equation 3: matrix times vector ($Xw$) and vector times matrix ($(Xw - Y)^{T} X$).
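A quick numeric check of the two core operations, on arbitrary toy values of $X$, $Y$, and $w$ (not from the paper):

```python
import numpy as np

# The aggregated linear-regression gradient evaluated with its two core
# matrix operations; toy values only.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
Y = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.5])

r = X @ w - Y        # matrix times vector
grad = r @ X         # vector times matrix; equals X.T @ (X @ w - Y)
```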
2.2. Data Compression
Data compression, also known as source coding, is an important technique to reduce data sizes. There are two main components in a data compression scheme, an encoding process that encodes the data into coded symbols (hopefully with fewer bits), and a decoding process that reconstructs the original data from the compressed representation.
Based on whether the reconstructed data differs from the original data, data compression schemes are usually classified into lossless compression and lossy compression. In this paper, we propose a lossless compression scheme called tuple-oriented compression (TOC), which is inspired by a classical lossless string/text compression scheme that has gained both academic influence and industrial popularity: Lempel-Ziv-Welch (LZW) (welch1984technique; ziv1977universal; ziv1978compression). For example, the Unix file compression utility (ncompress.sourceforge.net) and the GIF image format (wiggins2001image) are based on LZW.
3. Tuple-oriented Compression
In this section, we introduce our tuple-oriented compression (TOC) scheme. The goal of TOC is to (1) compress a minibatch as much as possible and (2) preserve the row/column boundaries in the underlying tabular data so that matrix operations can directly operate on the compressed representation without decompression overheads. Following the popular sparse row technique (saad2003iterative), we use sparse encoding as a starting point, and introduce two new techniques: logical encoding and physical encoding. Figure 3 demonstrates a running example of the encoding process.
For sparse encoding, we drop the zero values and then prefix each nonzero value with its column index. We call a value prefixed with its column index a column_index:value pair. For example, in Figure 3, tuple R2 [1.1, 2, 3, 0] is encoded as [1:1.1, 2:2, 3:3], where 1:1.1 is a column_index:value pair. As a result of sparse encoding, the original table (A) in Figure 3 is converted to the sparse encoded table (B).
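The sparse-encoding step can be sketched in a few lines; this is a toy helper of ours, not the paper's implementation, with 1-based column indexes as in Figure 3.

```python
# Sparse-encoding sketch: drop zeros and prefix each remaining value with
# its 1-based column index.
def sparse_encode(row):
    return [(j + 1, v) for j, v in enumerate(row) if v != 0]

r2 = sparse_encode([1.1, 2, 3, 0])   # tuple R2 from Figure 3
```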
3.1. Logical Encoding
The sparse encoded table (e.g., B in Figure 3) can be further compressed logically. The key idea is that there are repeating sequences of column_index:value pairs across tuples in the table. For example, R2 and R4 in the table B both have the same sequence [1:1.1, 2:2]. Thus, these occurrences of the same sequence can be encoded as the same index pointing to a dictionary entry, which represents the original sequence. Since many of these sequences have common prefixes, a prefix tree is used to store all the sequences. Finally, each original tuple is encoded as a vector of indexes pointing to prefix tree nodes.
We present the prefix tree structure and its APIs in § 3.1.1. In § 3.1.2, we present the actual prefix tree encoding algorithm, including how to dynamically build the tree and encode tuples. The comparison between our prefix tree encoding algorithm and LZW is presented in § 3.1.3.
3.1.1. Prefix Tree Structure and APIs
Each node of the prefix tree has an index. Except for the root node, each node stores a column_index:value pair as its key. Each node also represents a sequence of column_index:value pairs, which are obtained by concatenating the keys from the prefix tree root to the node itself. For example, in the prefix tree C in Figure 3, the left bottom tree node has index 9, stores key 3:3, and represents the sequence of column_index:value pairs [1:1.1, 2:2, 3:3].
The prefix tree supports two main APIs: AddNode and GetIndex.
AddNode(idx, key). This API creates a new prefix tree node that has key key and is a child of the tree node with index idx. It returns the index of the newly created tree node, which is assigned from a sequence number starting from 0.
GetIndex(idx, key). This API looks up the tree node that has key key and is a child of the tree node with index idx. It returns the index of the found tree node; if there is no such node, it returns -1.
The implementation of AddNode is straightforward. The implementation of GetIndex is more involved, and we use a standard technique reported in (blelloch2001introduction). In essence, for each tree node, we create a hash map mapping from its child node keys to its child node indexes.
3.1.2. Prefix Tree Encoding Algorithm
Our prefix tree encoding algorithm encodes the sparse encoded table (e.g., B in Figure 3) to an encoded table (e.g., D in Figure 3). During the encoding process, we build a prefix tree (e.g., C in Figure 3) and each original tuple is encoded as a vector of indexes pointing to prefix tree nodes. Algorithm 1 presents the pseudocode of the algorithm. Figure 3 presents a running example of executing the algorithm and encoding table B to table D.
The prefix tree encoding algorithm has two main phases. In phase 1 (line 5 to line 8 of Algorithm 1), we initialize the prefix tree with all the unique column_index:value pairs in the sparse encoded table as the children of the root node.
In phase 2 (line 9 to line 17 of Algorithm 1), we leverage the repeated sequences across tuples so that the same sequence (for example, R2 and R4 in Figure 3 both contain the sequence [1:1.1, 2:2]) is encoded as the same index to a prefix tree node. At its heart, we scan all the tuples to detect whether part of a tuple matches a sequence that already exists in the prefix tree, and we build up the prefix tree along the way. The function LongestMatchFromTree in Algorithm 1 finds the longest sequence in the prefix tree that matches the sequence in tuple t starting from the current position; it returns the tree node index of the longest match and the next matching start position. If the tuple is not yet exhausted, we add a new node to the prefix tree, which is a child of the matched tree node and whose key is the next unmatched column_index:value pair, to capture this new sequence in tuple t. In this way, later tuples can leverage this new sequence. Note that the longest match found is at least of length one because we store all the unique column_index:value pairs as the children of the root node in phase 1. Table 2 gives a running example of executing Algorithm 1 on table B in Figure 3.
Our prefix tree encoding and LZW are both linear algorithms in the sense that each input unit is read at most twice and the operation on each input unit takes constant time. So the time complexity of Algorithm 1 is $O(N)$, where $N$ is the number of column_index:value pairs in the sparse encoded table B.
Tuple | Start pos. | LongestMatchFromTree | Append | AddNode
R1 | 0 | 1 [1:1.1] | 1 | 6 [1:1.1, 2:2]
R1 | 1 | 2 [2:2] | 2 | 7 [2:2, 3:3]
R1 | 2 | 3 [3:3] | 3 | 8 [3:3, 4:1.4]
R1 | 3 | 4 [4:1.4] | 4 | NOT called
R2 | 0 | 6 [1:1.1, 2:2] | 6 | 9 [1:1.1, 2:2, 3:3]
R2 | 2 | 3 [3:3] | 3 | NOT called
R3 | 0 | 5 [2:1.1] | 5 | 10 [2:1.1, 3:3]
R3 | 1 | 8 [3:3, 4:1.4] | 8 | NOT called
R4 | 0 | 6 [1:1.1, 2:2] | 6 | NOT called
3.1.3. Comparisons with Lempel-Ziv-Welch (LZW)
Our prefix tree encoding algorithm is inspired by the classical compression scheme LZW. However, a key difference between LZW and our algorithm is that we preserve the row and column boundaries in the underlying tabular data, which is crucial for executing matrix operations directly on the compressed representation. For example, our algorithm encodes each tuple separately (although the dictionary is shared) to respect the row boundaries, and the compression unit is a column_index:value pair to respect the column boundaries. In contrast, LZW simply encodes a blob of bytes without preserving any structural information, because LZW was invented primarily for string/text compression. There are several other notable differences between our algorithm and LZW, which are summarized in Table 3.
 | LZW | Ours
Input | bytes | sparse encoded table
Encode unit | 8 bits | column_index:value (cv) pair
Tree init. | all values of 8 bits | all unique cv pairs
Tuple bound. | lost | preserved
Output | a vector of codes | encoded table & prefix tree first layer
3.2. Physical Encoding
The output of the logical encoding (i.e., I and D in Figure 3) can be further encoded physically to reduce sizes. We use two simple techniques—bit packing (lemire2015decoding) and value indexing (kourtis2008optimizing)— that can reduce sizes without incurring significant overheads when accessing the original values.
We notice that some information in I and D can be stored using arrays of nonnegative integers, and these integers are typically small. For example, the maximal column index in I of Figure 3 is 4, so 1 byte is enough to encode each integer. Bit packing is used to encode these arrays of small nonnegative integers efficiently. Specifically, we use the smallest sufficient number of bytes to encode each nonnegative integer in an array. Each encoded array has a header that records the number of integers in the array and the number of bytes used per integer. More advanced encoding methods, such as Varint (dean2009challenges) and SIMD-BP128 (lemire2015decoding), can also be used and point to interesting directions for future work.
Value indexing is essentially a dictionary encoding technique. That is, we store all the unique values (excluding column indexes) in the column_index:value pairs (e.g., I in Figure 3) in an array. Then, we replace the original values with the indexes pointing to the values in the array.
Figure 3 illustrates an example of how we encode the input (e.g., I and D) to physical bytes. For I, the column indexes are encoded using bit packing, while the values are encoded using value indexing. The value indexes from applying value indexing are also encoded using bit packing. For D, we concatenate the tree node indexes from all the tuples and encode them all together using bit packing. We also encode the tuple starting indexes using bit packing.
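The two physical-encoding primitives can be sketched as below. The header layout (`<IB`: count, then bytes per integer) and the function names are our assumptions, not the paper's exact byte format.

```python
import struct

# Toy sketch of bit packing and value indexing.
def bit_pack(ints):
    # Smallest byte width that fits the largest integer in the array.
    width = max(1, -(-max(ints).bit_length() // 8)) if ints else 1
    header = struct.pack('<IB', len(ints), width)   # count, bytes per integer
    return header + b''.join(i.to_bytes(width, 'little') for i in ints)

def value_index(values):
    uniques = sorted(set(values))                   # dictionary of distinct values
    return uniques, [uniques.index(v) for v in values]

uniques, idxs = value_index([1.1, 2.0, 3.0, 1.4, 1.1, 2.0])
packed = bit_pack(idxs)                             # 5-byte header + 1 byte per index
```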
4. Matrix Operation Execution
In this section, we introduce how to execute matrix operations on the TOC output. Most of the operations can directly operate on the compressed representation without decoding the original matrix. This direct execution avoids the tedious and expensive decoding process and reduces the runtime to execute matrix operations and MGD.
Let $A$ be a TOC-compressed matrix, $c$ be a scalar, and let $v$ and $B$ be an uncompressed vector and matrix, respectively. We discuss four common classes of matrix operations:

Sparse-safe element-wise operations (elgohary2016compressed) (e.g., $A \cdot c$ and the element-wise power $A^{2}$).

Right multiplication operations (e.g., $Av$ and $AB$).

Left multiplication operations (e.g., $vA$ and $BA$).

Sparse-unsafe element-wise operations (elgohary2016compressed) (e.g., $A + c$ and $\exp(A)$).
Informally speaking, a sparse-safe operation means that zero elements in the matrix remain zero after the operation; a sparse-unsafe operation means that zero elements in the matrix may become nonzero after the operation.
Figure 4 gives an overview of how to execute different matrix operations on the TOC output. The first three classes of operations can directly operate over the compressed representation without decoding the original matrix. The last class of operations needs to decode the original matrix. However, it is less likely to be used for training machine learning models because it changes the input data.
4.1. Shared Operators
In this subsection, we discuss some shared operators for executing matrix operations on the TOC output.
4.1.1. Access Values of I and D From Physical Bytes
As shown in Figure 4, executing many matrix operations requires scanning I or D, which are encoded to physical bytes using bit packing and value indexing as explained in § 3.2. Thus, we briefly discuss how to access values of I and D from encoded physical bytes.
To access a nonnegative integer encoded using bit packing, one can simply seek to the starting position of the integer and cast its bytes to uint_8, uint_16, or uint_32, respectively. Unfortunately, most programming languages do not support a 3-byte uint_24 natively. Nevertheless, one can copy the three bytes into a uint_32 and mask its leading byte as zero.
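In C++ one would memcpy the three bytes into a uint32_t whose leading byte is zeroed; Python's `int.from_bytes` performs the same zero-extension. The helper and buffer below are illustrative only.

```python
# Zero-extension trick for 3-byte (uint_24) packed integers.
def read_uint24(buf, offset):
    return int.from_bytes(buf[offset:offset + 3], 'little')

# Two back-to-back 3-byte integers with arbitrary toy values.
buf = (0x123456).to_bytes(3, 'little') + (0xABCDEF).to_bytes(3, 'little')
first, second = read_uint24(buf, 0), read_uint24(buf, 3)
```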
To access values encoded using value indexing, one can look up the array which stores the unique values using the value indexes.
4.1.2. Build Prefix Tree For Decoding
As shown in Figure 4, executing all matrix operations except sparse-safe element-wise operations requires building the prefix tree $T_d$, which is a simplified variant of the prefix tree C built during encoding. Each node in $T_d$ has the same index and key as the corresponding node in C. The difference is that each node in $T_d$ stores the index of its parent but does NOT store indexes of its children. Table 4 demonstrates an example of $T_d$.
Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
Key | - | 1:1.1 | 2:2 | 3:3 | 4:1.4 | 2:1.1 | 2:2 | 3:3 | 4:1.4 | 3:3 | 3:3
ParentIndex | - | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 3 | 6 | 5
Algorithm 2 presents how to build $T_d$. There are two main phases in Algorithm 2. In phase 1, $T_d$ and F are both initialized from I, where F stores the first column_index:value pair of the sequence represented by each tree node.
In phase 2, we scan D to build $T_d$, mimicking how C is built in Algorithm 1. From line 11 to line 13 of Algorithm 2, we add a new prefix tree node indexed by idx_seq_num. More specifically, the new tree node is a child of the tree node indexed by D[i][j] (line 11), the first column_index:value pair of the sequence represented by the new tree node is the same as that of its parent (line 12), and the key of the new tree node is the first column_index:value pair of the sequence represented by the next tree node, indexed by D[i][j+1] (line 13).
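This replay can be sketched as below on the Figure 3 example, where I is the first tree layer and D the encoded table; names are ours. One simplification: the sketch does not handle the rare LZW-style corner case in which a row references the node created at the immediately preceding step.

```python
# Our sketch of Algorithm 2: rebuild the decoding tree T_d from I and D.
def build_decode_tree(I, D):
    T = {}   # node index -> (key, parent index); index 0 is the root
    F = {}   # node index -> first pair of the node's sequence
    for k, pair in enumerate(I, start=1):       # phase 1: root's children from I
        T[k] = (pair, 0)
        F[k] = pair
    nxt = len(I) + 1
    for row in D:                               # phase 2: replay the encoder's AddNode calls
        for j in range(len(row) - 1):
            parent = row[j]
            T[nxt] = (F[row[j + 1]], parent)    # key = first pair of the next node's sequence
            F[nxt] = F[parent]
            nxt += 1
    return T

I = [(1, 1.1), (2, 2.0), (3, 3.0), (4, 1.4), (2, 1.1)]
D = [[1, 2, 3, 4], [6, 3], [5, 8], [6]]
T = build_decode_tree(I, D)                     # matches Table 4
```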
4.2. Sparse-safe Element-wise Operations
To execute sparse-safe element-wise operations (e.g., $A \cdot c$) on the TOC output directly, one can simply scan and modify I because all the unique column_index:value pairs in $A$ are stored in I. Algorithm 3 demonstrates how to execute the matrix-times-scalar operation (i.e., $A \cdot c$) on the TOC output. Algorithms for other sparse-safe element-wise operations are similar.
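For instance, scaling the compressed matrix by a scalar touches only I; D and the prefix tree are untouched. The list-of-pairs representation of I below is a toy stand-in.

```python
# Sparse-safe execution sketch: A * c rescales only the distinct
# column_index:value pairs stored in I.
I = [(1, 1.1), (2, 2.0), (3, 3.0), (4, 1.4), (2, 1.1)]
c = 10.0
I_scaled = [(col, val * c) for col, val in I]
```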
4.3. Right Multiplication Operations
We first do some mathematical analysis to transform the uncompressed execution of right multiplication operations into a compressed execution that operates directly on the TOC output without decoding the original table. The analysis also proves the correctness of the algorithm, since the algorithm follows the transformed form directly. Then, we demonstrate the detailed algorithm. In the rest of this subsection, we use $Av$ as an example; we put the result for $AB$ (similar to $Av$) in Appendix B for brevity.
Theorem 4.1 ($Av$).
Let $A \in \mathbb{R}^{n \times m}$, $v \in \mathbb{R}^{m}$, D be the output of TOC on $A$, $T_d$ be the prefix tree built for decoding, $t.seq$ be the sequence of tree node $t$ defined in § 3.1.1, $t.key$ be the key of tree node $t$ defined in § 4.1.2, and $t.parent$ be the parent index of tree node $t$ defined in § 4.1.2. Note that $t.seq$ and $t.key$ are both sparse representations of vectors (i.e., $t.seq, t.key \in \mathbb{R}^{m}$). Define function $g$ to be

$$g(t) = t.seq \cdot v. \qquad (4)$$

Then, we have

$$(Av)[i] = \sum_{j=1}^{len(D[i])} g\!\left(T_d[D[i][j]]\right), \quad i = 1, 2, \ldots, n, \qquad (5)$$

and

$$g(t) = g\!\left(T_d[t.parent]\right) + t.key \cdot v. \qquad (6)$$

Proof.
See Appendix A.1. ∎
Remark on Theorem 4.1. $Av$ can be directly executed on the TOC output following Equation 5 by scanning $T_d$ first (computing $g$ for every node via the recurrence in Equation 6) and scanning D second. The detailed steps are demonstrated in Algorithm 4.
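On the running example (tree contents as in Table 4), the two scans look like this; the variable names and the toy vector v are ours. The result agrees with multiplying the decoded matrix by v.

```python
# Right multiplication A @ v: one scan of the decoding tree fills g via the
# parent recurrence, then one scan of D sums g over each row.
T = {1: ((1, 1.1), 0), 2: ((2, 2.0), 0), 3: ((3, 3.0), 0), 4: ((4, 1.4), 0),
     5: ((2, 1.1), 0), 6: ((2, 2.0), 1), 7: ((3, 3.0), 2), 8: ((4, 1.4), 3),
     9: ((3, 3.0), 6), 10: ((3, 3.0), 5)}
D = [[1, 2, 3, 4], [6, 3], [5, 8], [6]]
v = [1.0, 2.0, 3.0, 4.0]

g = {0: 0.0}
for k in sorted(T):                      # parents always precede children
    (col, val), parent = T[k]
    g[k] = g[parent] + val * v[col - 1]  # g(t) = g(parent) + t.key . v

Av = [sum(g[k] for k in row) for row in D]
```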
4.4. Left Multiplication Operations
We first give the mathematical analysis and then present the detailed algorithm, for the same reason as given in § 4.3. We only demonstrate the result for $vA$ and put the result for $BA$ in Appendix B for brevity.
Theorem 4.2 ($vA$).
Let $A \in \mathbb{R}^{n \times m}$, $v \in \mathbb{R}^{1 \times n}$, D be the output of TOC on $A$, $T_d$ be the prefix tree built for decoding, $t.seq$ be the sequence of tree node $t$ defined in § 3.1.1, $t.key$ be the key of tree node $t$ defined in § 4.1.2, and $t.parent$ be the parent index of tree node $t$ defined in § 4.1.2. Note that $t.seq$ and $t.key$ are both sparse representations of vectors (i.e., $t.seq, t.key \in \mathbb{R}^{m}$). Define function $h$ to be

$$h(k) = \sum_{i=1}^{n} \sum_{j=1}^{len(D[i])} v[i] \cdot \mathbb{1}\!\left[D[i][j] = k\right]. \qquad (7)$$

Then, we have

$$vA = \sum_{k=1}^{len(T_d)} h(k) \cdot T_d[k].seq. \qquad (8)$$

Proof.
See Appendix A.2. ∎
Remark on Theorem 4.2. We can compute $vA$ following Equation 8 by simply scanning D first and scanning $T_d$ second. Algorithm 5 presents the detailed steps. First, we scan D to compute the function $h$ defined in Equation 7. Specifically, we initialize $h$ as a zero vector, and then add $v[i]$ to $h[D[i][j]]$ for each $i, j$ (lines 6-8 in Algorithm 5). After this step is done, $h(k)$ is available for $k = 1, 2, \ldots, len(T_d)$.
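A sketch of the two scans for left multiplication on the running example (toy v; names ours). The reverse scan of the tree exploits the fact that a node's sequence equals its parent's sequence plus its key, so each node's accumulated weight can be passed up to its parent.

```python
# Left multiplication v @ A: scan D to accumulate per-node weights h, then one
# reverse scan of the tree emits each key's contribution and pushes the weight
# to the parent.
T = {1: ((1, 1.1), 0), 2: ((2, 2.0), 0), 3: ((3, 3.0), 0), 4: ((4, 1.4), 0),
     5: ((2, 1.1), 0), 6: ((2, 2.0), 1), 7: ((3, 3.0), 2), 8: ((4, 1.4), 3),
     9: ((3, 3.0), 6), 10: ((3, 3.0), 5)}
D = [[1, 2, 3, 4], [6, 3], [5, 8], [6]]
v = [1.0, 2.0, 3.0, 4.0]          # one weight per matrix row
m = 4                             # number of matrix columns

h = {k: 0.0 for k in T}
for vi, row in zip(v, D):         # h(k): total weight of rows referencing node k
    for k in row:
        h[k] += vi

u = [0.0] * m                     # u = v @ A
for k in sorted(T, reverse=True): # children before parents
    (col, val), parent = T[k]
    u[col - 1] += h[k] * val      # contribution of this node's key
    if parent:
        h[parent] += h[k]         # seq(k) = seq(parent) + key: pass weight up
```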
4.5. Sparse-unsafe Element-wise Operations
For sparse-unsafe element-wise operations (e.g., $A + c$), we need to fully decode $A$ first and then execute the operations on the decoded matrix. Although this process is slow due to the tedious decoding step, fortunately, sparse-unsafe element-wise operations are rarely used for training ML models because they change the input data. Algorithm 6 presents the detailed steps.
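Decoding itself can be sketched by expanding each node index in D into its full sequence via parent pointers and scattering the pairs into a dense row (running example again; names ours).

```python
# Decoding sketch: reconstruct the dense matrix from the decoding tree and D.
T = {1: ((1, 1.1), 0), 2: ((2, 2.0), 0), 3: ((3, 3.0), 0), 4: ((4, 1.4), 0),
     5: ((2, 1.1), 0), 6: ((2, 2.0), 1), 7: ((3, 3.0), 2), 8: ((4, 1.4), 3),
     9: ((3, 3.0), 6), 10: ((3, 3.0), 5)}
D = [[1, 2, 3, 4], [6, 3], [5, 8], [6]]
m = 4

def node_seq(T, k):
    pairs = []
    while k:                      # walk up to the root, then reverse
        key, parent = T[k]
        pairs.append(key)
        k = parent
    return pairs[::-1]

A = []
for row in D:
    dense = [0.0] * m
    for k in row:
        for col, val in node_seq(T, k):
            dense[col - 1] = val
    A.append(dense)
```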
4.6. Time Complexity Analysis
We give a detailed time complexity analysis of the matrix operations, except for $AB$ and $BA$, which we put in Appendix C for brevity. For sparse-safe element-wise operations such as $A \cdot c$, we only need to scan I, so the time complexity is $O(|I|)$.
For $Av$ and $vA$, we need to build $T_d$, scan $T_d$, and scan D. As shown in Algorithm 2, building $T_d$ requires scanning I and D, and $|T_d| \le |I| + |D|$. So the complexity of building and scanning $T_d$ is $O(|I| + |D|)$. Overall, the complexity of $Av$ and $vA$ is $O(|I| + |D|)$. This indicates that the computational redundancy incurred by the data redundancy is generally avoided by the TOC matrix execution algorithms. Thus, theoretically speaking, the TOC matrix execution algorithms perform well when there is significant data redundancy.
For sparse-unsafe element-wise operations such as $A + c$, we need to decompress $A$ first. Similar to LZW, the decompression of TOC is linear in the sense that each element has to be output and the cost per output element is constant. Thus, the complexity of decompressing $A$ is $O(nm)$, and the overall complexity of $A + c$ is also $O(nm)$.
5. Experiments
In this section, we answer the following questions:

Can TOC compress minibatches effectively?

Can common matrix operations be executed efficiently on TOC compressed minibatches?

Can TOC reduce the endtoend MGD runtimes significantly for training common machine learning models?

Can TOC compress and decompress minibatches quickly?
Summary of Answers. We answer these questions positively by conducting extensive experiments. First, on datasets with moderate sparsity, TOC reduces minibatch sizes notably, with compression ratios of up to 51x. Compression ratios of TOC are up to 3.8x larger than those of the state-of-the-art lightweight matrix compression schemes, and comparable to general compression schemes such as Gzip. Second, the matrix operation runtime of TOC is comparable to the lightweight matrix compression schemes, and up to 20,000x better than the state-of-the-art general compression schemes. Third, TOC reduces the end-to-end MGD runtimes by up to 1.4x, 5.6x, and 4.8x compared to the state-of-the-art compression schemes for training Neural network, Logistic regression, and Support vector machine, respectively. TOC also reduces the MGD runtimes by up to 10.2x compared to the best encoding methods used in popular machine learning systems: Bismarck, Scikit-learn, and TensorFlow. Finally, the compression speed of TOC is much faster than Gzip but slower than Snappy, whereas the decompression speed of TOC is faster than both Gzip and Snappy.
Datasets. We use six real-world datasets. The first four datasets have moderate sparsity, which is typical for enterprise machine learning (ashari2015optimizing; harnik2012estimation). Rcv1 and Deep1Billion represent an extremely sparse and an extremely dense dataset, respectively. Table 5 lists the dataset statistics.
Dataset  Dimensions  Size  Sparsity 

US Census (elgohary2016compressed)  2.5 M * 68  0.46 GB  0.43 
ImageNet (elgohary2016compressed)  1.2 M * 900  2.8 GB  0.31 
Mnist8m (elgohary2016compressed)  8.1 M * 784  11.3 GB  0.25 
Kdd99 (Lichman:2013)  4 M * 42  1.6 GB  0.39 
Rcv1 (amini2009learning)  800 K * 47236  0.96 GB  0.0016 
Deep1Billion (babenko2016efficient)  1 B * 96  475 GB  1.0 
Compared Methods. We compare TOC with one baseline (DEN), four lightweight matrix compression schemes (CSR, CVI, DVI, and CLA), and two general compression schemes (Snappy and Gzip). A brief summary of these methods is as follows:

DEN: This is the standard dense binary format for dense matrices. We store the matrix row by row and each value is encoded using IEEE754 double format. Categorical features are encoded using the standard onehot (dummy) encoding (garavaglia1998smart).

CSR: This is the standard compressed sparse row encoding for sparse matrices. We store the matrix row by row. For each row, we only store the nonzero values and associated column indexes.

CVI: This format is also called CSR-VI (kourtis2008optimizing; elgohary2016compressed). We first encode the matrix using CSR and then encode the nonzero values via the value indexing of Section 3.2.

DVI: We first encode the matrix using DEN and then encode the values via the value indexing in Section 3.2.

CLA: This method (elgohary2016compressed) divides the matrix into column groups and compresses each column group column-wise. Note that matrix operations can be executed directly on CLA-compressed matrices.

Snappy: We compress the serialized bytes of DEN using Snappy.

Gzip: We use Gzip to compress the serialized bytes of DEN.
Machine and System Setup. All experiments were run on Google Cloud (cloud.google.com) using a typical machine with a multi-core Intel Xeon CPU running Ubuntu (unless otherwise specified). We did not choose a more powerful machine because of the higher cost: a machine with many more cores and much more RAM costs substantially more per month. Thus, our techniques can save costs for ML workloads, especially in such cloud settings.
Our techniques were implemented in C++ and compiled using g++ with the -O3 optimization flag. We also compare with four machine learning systems: ScikitLearn 0.19.1 (http://scikitlearn.org/stable/), Systemml 1.3.0 (https://systemml.apache.org/), Bismarck 2.0 (http://i.stanford.edu/hazy/victor/bismarck/), and TensorFlow (https://www.tensorflow.org/). Furthermore, we integrate TOC into Bismarck for a fair comparison. We put the integration details in Appendix D.1 for brevity.
5.1. Compression Ratios
Setup. We are not aware of a first-principles way in the literature to set minibatch sizes (the number of rows in a minibatch). In practice, the minibatch size typically depends on system constraints (e.g., the number of CPUs) and is set to some number ranging from 10 to 250 (mishkin2016systematic). Thus, we tested five minibatch sizes, 50, 100, 150, 200, and 250, which cover the most common use cases. The compression ratio is defined as the uncompressed minibatch size (encoded using DEN) over the compressed minibatch size. We implemented DEN, CSR, CVI, and DVI ourselves but used CLA from Systemml and Gzip/Snappy from standard libraries. We tested minibatches of all these sizes from all the real datasets.
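The compression ratio above is straightforward to compute; a small helper (ours, for illustration), assuming the DEN baseline of 8 bytes per cell:

```cpp
#include <cstddef>

// Compression ratio as defined above: DEN bytes over compressed bytes.
// DEN stores every cell as an 8-byte IEEE 754 double.
double compression_ratio(std::size_t rows, std::size_t cols,
                         std::size_t compressed_bytes) {
    double den_bytes = 8.0 * rows * cols;
    return den_bytes / compressed_bytes;
}
```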
Overall Results. Figure 5 presents the overall results. For the very sparse dataset Rcv1, CSR is the best encoding method and TOC’s performance is similar to CSR. For the very dense dataset Deep1Billion, which does not contain repeated subsequences of column values, Gzip is the best method but it only achieved a marginal 1.15x compression ratio. CSR and TOC have similar performance because of the sparse encoding.
For the other 4 datasets with moderate sparsity, TOC has larger compression ratios than all the other methods except on Mnist, where TOC is inferior to Gzip. The main reason is that Mnist does not contain many repeated subsequences of column values that TOC's logical encoding can exploit; this is also verified by the ablation study in Figure 6.
Overall, TOC is suitable for datasets of moderate sparsity, which are common in enterprise ML. TOC is not suitable for very sparse datasets or very dense datasets that do not contain repeated subsequences of column values. Note that these datasets are challenging for the other compression methods too. Nevertheless, one can simply test TOC on a minibatch sample to determine whether TOC is suitable for a given dataset.
Ablation Study. We conduct an ablation study to show the effectiveness of the different components (sparse encoding, logical encoding, and physical encoding) in TOC. Figure 6 shows the results: TOC_SPARSE_AND_LOGICAL compresses better than TOC_SPARSE, and TOC_FULL, with all the encoding techniques, compresses best. This demonstrates the effectiveness of all our encoding components.
Large Minibatches. We compare different compression methods on large minibatches. Figure 7 shows the results. As the minibatch size grows, TOC becomes more competitive. When the percent of rows of the whole dataset in the minibatch is 1.0, this is essentially batch gradient descent (BGD) and TOC has the best compression ratio in this case. This shows the potential of applying TOC to BGD related workloads.
5.2. Matrix Operation Runtimes
Setup. We tested three classes of matrix operations: a sparse-safe element-wise operation (X⊙c), left multiplication operations (vX and WX), and right multiplication operations (Xv and XW), where c is a scalar, v is an uncompressed vector, W is an uncompressed matrix, and X is the compressed minibatch. We set the minibatch size to 250 (results for other minibatch sizes are similar). Figure 8 presents the results.
Sparse-safe Element-wise Operation (X⊙c). In general, DVI, CVI, and TOC are the fastest methods, which shows the effectiveness of value indexing (kourtis2008optimizing), used by all three. Notably, TOC can be four orders of magnitude faster than Gzip and Snappy (e.g., on Imagenet). The slowness of these general compression schemes is caused by their significant decompression overheads.
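The speed of the value-indexed formats on sparse-safe element-wise operations comes from touching only the dictionary of distinct values rather than every cell; a minimal sketch (ours, not the paper's kernel):

```cpp
#include <vector>

// Scaling a value-indexed minibatch by a scalar rewrites only the small
// dictionary of distinct values; the per-cell codes are untouched.
void scale_value_indexed(std::vector<double>& dict, double c) {
    for (double& v : dict) v *= c;   // O(#distinct values), not O(#cells)
}

// Decode one cell for checking: cell value = dict[codes[i]].
double cell(const std::vector<double>& dict,
            const std::vector<int>& codes, int i) {
    return dict[codes[i]];
}
```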
Right Multiplication Operations (Xv and XW). For Xv, CSR and DEN are the best methods for Rcv1 and Deep1Billion due to their extreme sparsity and density, respectively. For the remaining datasets of moderate sparsity, DEN, CSR, CVI, DVI, and TOC are the fastest methods. CLA and general compression schemes like Snappy and Gzip are much slower. We do see that TOC is 2-3x slower than CSR on Imagenet and Mnist. There are two main reasons for this slowness. First, building the prefix tree in TOC takes extra time. Second, TOC's compression ratios over CSR's are relatively small on these datasets, which makes the computational redundancies exploited by TOC on these datasets also smaller.
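For reference, the CSR baseline for right multiplication with a vector is the standard sparse matrix-vector product, sketched here (variable names are ours):

```cpp
#include <vector>

// y = X v for a CSR-encoded matrix X: touch only the nonzeros.
std::vector<double> csr_mult_vec(const std::vector<double>& vals,
                                 const std::vector<int>& cols,
                                 const std::vector<int>& row_ptr,
                                 const std::vector<double>& v) {
    int rows = (int)row_ptr.size() - 1;
    std::vector<double> y(rows, 0.0);
    for (int i = 0; i < rows; ++i)
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            y[i] += vals[k] * v[cols[k]];  // vals[k] sits in column cols[k]
    return y;
}
```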
For XW, we set the row size of W to 20. TOC is consistently the fastest on all the datasets except Rcv1 and Deep1Billion, due to their extreme sparsity and density, respectively. CLA in Systemml does not support this operation yet, thus CLA is excluded.
Left Multiplication Operations (vX and WX). The results of the left multiplication operations are similar to those of the right multiplication operations. Thus, we omit them for brevity.
Summary. Overall, TOC achieves the best runtime performance on the element-wise operation and the matrix-matrix multiplications. TOC can be 2-3x slower than the fastest method on the matrix-vector multiplications. However, as we will show shortly in § 5.3, this has a negligible effect on the overall ML training time.
5.3. EndtoEnd MGD Runtimes
In this subsection, we discuss the endtoend MGD runtime performance with different compression schemes.
Compared Methods. We compare TOC with DEN, CSR, CVI, DVI, Snappy, and Gzip in our C++ implementation. We also integrate TOC into Bismarck and compare it with DEN and CSR as implemented in Bismarck, ScikitLearn, and TensorFlow. These are denoted by the ML system name plus the data format, e.g., BismarckTOC, ScikitLearnDEN, and TensorFlowCSR.
Machine Learning Models.
We choose three ML models: logistic regression (LR), support vector machine (SVM), and neural network (NN). LR/SVM/NN use the standard logistic/hinge/cross-entropy loss, respectively. For LR and SVM, we use the standard one-versus-the-rest technique for multi-class classification. Our NN has a feed-forward structure with two hidden layers of 200 and 50 neurons using the sigmoid activation function; the output layer activation is sigmoid for binary outputs and softmax for multi-class outputs. For Mnist, the output has 10 classes, while all the other datasets have binary outputs.
MGD Training. We use MGD to train the ML models. Each dataset is divided into minibatches of 250 rows, encoded with the different methods. For simplicity, we run MGD for a fixed 10 epochs; the results under more sophisticated termination conditions are similar. In each epoch, we visit all the minibatches and update the ML model with each minibatch. For SVM/LR, we train sequentially. For NN, we use the classical approach (dean2012large) to train the network in parallel. The end-to-end MGD runtimes include all epochs of training time but do NOT include the compression time because, in practice, compression is a one-time cost that is typically amortized across different ML models.
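The epoch/minibatch loop above can be sketched as follows, simplified (by us) to a one-parameter linear model with squared loss rather than the paper's LR/SVM/NN:

```cpp
#include <cstddef>
#include <vector>

// One feature value and one label per row.
struct Minibatch { std::vector<double> x, y; };

// MGD: each epoch visits every minibatch once and applies one gradient
// update per minibatch (here: squared loss, model y ≈ w * x).
double train_mgd(const std::vector<Minibatch>& batches,
                 int epochs, double lr) {
    double w = 0.0;
    for (int e = 0; e < epochs; ++e) {
        for (const Minibatch& b : batches) {
            double grad = 0.0;
            for (std::size_t i = 0; i < b.x.size(); ++i)
                grad += (w * b.x[i] - b.y[i]) * b.x[i];  // d/dw of squared loss
            w -= lr * grad / (double)b.x.size();         // minibatch average
        }
    }
    return w;
}
```

In the paper's setting, each minibatch stays compressed and the gradient's matrix operations run directly on the TOC-encoded data.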
Dataset Generation. We use the same technique reported in (elgohary2016compressed) to generate scaled real datasets, e.g., Imagenet1m (1 million rows) and Mnist25m (25 million rows).
Summary of Results. Table 6 presents the overall results on the Imagenet and Mnist datasets. We put the results on the remaining datasets in Appendix D.2 for brevity.
On Imagenet1m and Mnist1m, minibatches encoded using all the methods fit into memory. In this case, CVI and TOC are the fastest methods. General compression schemes like Snappy and Gzip are much slower due to their significant decompression overheads. Interestingly, TOC is even faster than CVI for LR and SVM on Imagenet1m, despite TOC's matrix operations being slower on Imagenet1m and all the data fitting into memory. The reason is that TOC reduces the initial IO time because of its better compression ratio: on Imagenet1m, TOC takes 10 seconds to read the data, while CVI takes 36 seconds. On Mnist1m, CVI is faster than TOC for LR and SVM because we need to train ten LR/SVM models, so more matrix operations are involved.
On Imagenet25m and Mnist25m, only minibatches encoded using Snappy, Gzip, and TOC fit into memory. In this case, TOC is up to 1.4x/5.6x/4.8x faster than the state-of-the-art methods for NN/LR/SVM, respectively. The speedup of TOC for LR/SVM is larger on Imagenet25m than on Mnist25m because Mnist25m has ten output classes and we train ten models, so more matrix operations are involved.
Methods | Imagenet1m (NN / LR / SVM) | Imagenet25m (NN / LR / SVM) | Mnist1m (NN / LR / SVM) | Mnist25m (NN / LR / SVM)
TOC (ours) | 12.3 / 0.7 / 0.7 | 249 / 13 / 13 | 9.0 / 2.1 / 2.1 | 182 / 52 / 54
DEN | 14.6 / 3.9 / 3.8 | 666 / 374 / 360 | 15.8 / 7.9 / 7.8 | 708 / 526 / 545
CSR | 12.7 / 2.1 / 2.1 | 428 / 199 / 187 | 10.8 / 1.6 / 1.6 | 346 / 156 / 155
CVI | 12.5 / 1.0 / 1.1 | 323 / 98 / 83 | 9.6 / 1.4 / 1.4 | 250 / 92 / 91.6
DVI | 13.0 / 1.2 / 1.2 | 311 / 73.1 / 63 | 14.5 / 6.2 / 6.4 | 385 / 224 / 226
Snappy | 14.8 / 3.9 / 4.0 | 348 / 126 / 127 | 15.8 / 8.5 / 8.4 | 363 / 210 / 213
Gzip | 20.8 / 11.7 / 12.5 | 463 / 247 / 255 | 20.5 / 12.6 / 12.9 | 393 / 238 / 243
BismarckTOC | 12.6 / 0.76 / 0.77 | 264 / 13.8 / 13.7 | 10.3 / 2.2 / 2.2 | 198 / 54 / 57
BismarckDEN | N/A / 3.5 / 3.2 | N/A / 309 / 310 | N/A / 7.2 / 7.1 | N/A / 428 / 421
BismarckCSR | N/A / 2.4 / 2.2 | N/A / 141 / 134 | N/A / 1.8 / 1.7 | N/A / 114 / 110
ScikitLearnDEN | 14.7 / 4 / 3.6 | 633 / 454 / 456 | 14.8 / 8.1 / 7.2 | 638 / 536 / 488
ScikitLearnCSR | 42.7 / 2.4 / 2.2 | 1003 / 332 / 334 | 32.9 / 4.4 / 3.3 | 865 / 303 / 284
TensorFlowDEN | 11.2 / 3.6 / 3.4 | 550 / 426 / 439 | 10.9 / 4.4 / 4.2 | 537 / 439 / 427
TensorFlowCSR | 18.4 / 4.4 / 4.3 | 601 / 373 / 359 | 14.8 / 6.7 / 6.5 | 512 / 372 / 341
More Dataset Sizes. We also study the MGD runtime over more dataset sizes. Figure 9 presents the results. In general, TOC remains the fastest method across all the settings we tested. When the dataset is small, CSR, CVI, and DVI have performance comparable to TOC because all the data fit into memory. When the dataset is large, TOC is faster than the other methods because only the TOC, Gzip, and Snappy data fit into memory, and TOC avoids decompression. The speedup of TOC is larger on logistic regression than on the neural network because more matrix operations are involved in training the neural network.
Ablation Study. We conduct an ablation study to verify whether the components in Figure 3 actually matter for TOC’s performance in reducing MGD runtimes. Specifically, we compare three variants of TOC: TOC_SPARSE (sparse encoding), TOC_SPARSE_AND_LOGICAL (sparse and logical encoding), and TOC_FULL (all the encoding techniques). Figure 10 presents the results. With more encoding techniques used, TOC’s performance becomes better, which shows the effectiveness of all our encoding components.
Comparisons with Popular Machine Learning Systems. Table 6 also includes the MGD runtimes of DEN and CSR in Bismarck, ScikitLearn, and TensorFlow. We modified our TensorFlow and ScikitLearn code slightly so that they can do disk-based learning when the dataset does not fit into memory. The table also includes BismarckTOC, which typically has less than 10 percent overhead compared with running TOC in our C++ program. This overhead is caused by the fudge factor of the database storage and thus a slightly larger disk IO time. On Imagenet1m and Mnist1m, BismarckTOC is comparable to the best method in these systems (TensorFlowDEN) for NN, but up to 3.2x/2.9x faster than the best methods in these systems for LR/SVM, respectively. On Imagenet25m and Mnist25m, BismarckTOC is up to 2.6x/10.2x/9.8x faster than the best methods in the other systems for NN/LR/SVM, respectively, because only the TOC data fit into memory. Thus, integrating TOC into these ML systems can greatly benefit their MGD performance.
Accuracy Study. We also plot the error rate of the neural network and logistic regression as a function of time on Mnist. The goal is to compare the convergence rate of BismarckTOC with other standard tools like ScikitLearn and TensorFlow. For Mnist1m (7GB) and Mnist25m (170GB), we train 30 epochs and 10 epochs, respectively. Figure 11 presents the results. On Mnist1m and a 15GB RAM machine, BismarckTOC and TensorFlowDEN finished the training at roughly the same time; this verifies our claim that BismarckTOC has performance comparable to a state-of-the-art ML system when the data fit into memory. On Mnist25m and a 15GB RAM machine, BismarckTOC finished the training much faster than the other ML systems because only the TOC data fit into memory. We also used a machine with 180GB RAM on Mnist25m, which did not change BismarckTOC's performance but boosted the performance of TensorFlow and ScikitLearn to be comparable to BismarckTOC, as all the data then fit into memory. However, renting a 180GB RAM machine is more expensive than renting a 15GB RAM machine. Thus, we believe BismarckTOC can significantly reduce users' cloud costs.
5.4. Compression and Decompression Runtimes
We measured the compression and decompression times of Snappy, Gzip, and TOC on minibatches with 250 rows; the results are similar for other minibatch sizes. Figure 12 presents the results. TOC is faster than Gzip but slower than Snappy for compression. However, TOC is faster than both Gzip and Snappy for decompression.
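Wall-clock measurements of this kind can be taken with a small chrono-based harness like the following (our illustration; the paper does not specify its timing code):

```cpp
#include <chrono>
#include <functional>

// Return wall-clock seconds for one call of fn, using a monotonic clock.
double time_once(const std::function<void()>& fn) {
    auto t0 = std::chrono::steady_clock::now();
    fn();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

In practice one would repeat the call and report a median to reduce noise.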
6. Discussion
Advanced Neural Network.
It is possible to apply TOC to more advanced neural networks, such as convolutional neural networks on images. One just needs to apply the common image-to-column (lai2018cmsis) operation, which replicates the pixels of each sliding window as a matrix column. This way, the convolution operation can be expressed as a matrix multiplication over the replicated matrix. The replicated matrix can be compressed by TOC, and we expect higher compression ratios due to the data replication.
7. Related Work
Data Compression for Analytics. There is a long line of research (abadi2006integrating; raman2013db2; li2013bitweaving; wesley2014leveraging; elgohary2016compressed; wang2017experimental) on integrating data compression into databases and running relational query processing workloads on the compressed data. TOC is orthogonal to these works since TOC focuses on a different workload: minibatch stochastic gradient descent for machine learning training.
Machine Learning Analytics Systems. There are a number of systems (e.g., MLlib (meng2016mllib), MadLib (hellerstein2012madlib), Systemml (boehm2014hybrid; elgohary2016compressed), Bismarck (bismarck), SimSQL (cai2013simulation), ScikitLearn (scikitlearn), MLBase (kraska2013mlbase), and TensorFlow (abadi2016tensorflow)) for machine learning workloads. Our work focuses on the algorithmic perspective and is complementary to these systems, i.e., integrating TOC into these systems can greatly benefit their ML training performance.
Compressed Linear Algebra (CLA). CLA (elgohary2016compressed) compresses the whole dataset and supports batch gradient methods such as vanilla BGD, L-BFGS, and conjugate gradient, while TOC focuses on MGD. Furthermore, CLA needs to store an explicit dictionary. When applying CLA to BGD, there are many references to the dictionary entries, so the dictionary cost is amortized. On a small minibatch, there are not that many references to the dictionary entries, so the explicit dictionary cost makes the CLA compression ratio less desirable. In contrast, TOC is adapted from LZW and does not store an explicit dictionary, so TOC achieves good compression ratios even on small minibatches.
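To illustrate why an LZW-style scheme needs no explicit dictionary, here is classic byte-level LZW (not TOC itself): the decoder rebuilds, from the code stream alone, the same dictionary the encoder built while compressing.

```cpp
#include <map>
#include <string>
#include <vector>

std::vector<int> lzw_encode(const std::string& in) {
    std::map<std::string, int> dict;                  // starts with all bytes
    for (int b = 0; b < 256; ++b) dict[std::string(1, (char)b)] = b;
    std::vector<int> out;
    std::string cur;
    for (char ch : in) {
        std::string next = cur + ch;
        if (dict.count(next)) { cur = next; }         // grow the match
        else {
            out.push_back(dict[cur]);                 // emit longest match
            int fresh = (int)dict.size();
            dict[next] = fresh;                       // learn a new sequence
            cur = std::string(1, ch);
        }
    }
    if (!cur.empty()) out.push_back(dict[cur]);
    return out;
}

std::string lzw_decode(const std::vector<int>& codes) {
    std::vector<std::string> dict;                    // rebuilt on the fly
    for (int b = 0; b < 256; ++b) dict.push_back(std::string(1, (char)b));
    std::string out, prev;
    for (int c : codes) {
        std::string entry = c < (int)dict.size()
                          ? dict[c]
                          : prev + prev[0];           // the one special case
        out += entry;
        if (!prev.empty()) dict.push_back(prev + entry[0]);  // mirror encoder
        prev = entry;
    }
    return out;
}
```

The dictionary never appears in the compressed output; both sides derive it deterministically, which is the property TOC inherits to keep small minibatches compact.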
Factorized Learning. Factorized machine learning techniques (orion; olteanuf; kumar2016join; chen2017towards) push machine learning computations through joins and avoid the schemabased redundancy on denormalized datasets. These techniques need a schema to define the static redundancies in the denormalized datasets, while TOC can find the redundancies in the datasets automatically without a schema. Furthermore, factorized learning techniques work for BGD while TOC focuses on MGD.
8. Conclusion and Future Work
Minibatch stochastic gradient descent (MGD) is a workhorse algorithm of modern ML. In this paper, we propose a lossless data compression scheme called tuple-oriented compression (TOC) to reduce memory/storage footprints and runtimes for MGD. TOC follows a design principle that tailors the compression scheme to the data access pattern of MGD in a way that preserves row/column boundaries in minibatches and adapts matrix operation executions to the compression scheme as much as possible. This enables TOC to attain both good compression ratios and decompression-free execution of the matrix operations used by MGD. There are a number of interesting directions for future work, including identifying more workloads that can execute directly on TOC outputs and investigating the common structures between such adaptable workloads and compression schemes.
Acknowledgments
We thank all the anonymous reviewers. This work was partially supported by a gift from Google.
References
Appendix A Proof of Theorems
A.1. Theorem 4.1
A.2. Theorem 4.2
Appendix B More Algorithms
B.1. Right Multiplication
We present how to compute XW, where W is an uncompressed matrix and X is a compressed matrix. This is an extension of the right multiplication with a vector in § 4.3.
Theorem B.1.
Let , , D be the output of TOC on , be the prefix tree built for decoding, be the sequence of the tree node defined in § 3.1.1, be the key of the tree node defined in § 4.1.2, and be the parent index of the tree node defined in § 4.1.2. Note that and are both sparse representations of vectors (i.e., and ). Define function to be
(12) 
Then, we have
(13) 
Proof.
Algorithm 7 shows the details. First, we scan similar to right multiplication with vector, and we use H[,:] to remember the computed value of .
Second, we scan D to compute the result stored in R. For the th column of the result R and each D[][], we simply add H[D[][]][] to R[][]. Because H[D[][]][] is a random access of H, we make the loop over the columns the innermost loop so that we can scan D only once and have better cache performance.
B.2. Left Multiplication
We discuss how to compute WX, where W is an uncompressed matrix and X is a compressed matrix. This is an extension of the left multiplication with a vector in § 4.4.
Theorem B.2.
Let , , D be the output of TOC on , be the prefix tree built for decoding, .seq be the sequence of the tree node defined in § 3.1.1, be the key of the tree node defined in § 4.1.2, and be the parent index of the tree node defined in § 4.1.2. Note that and are both sparse representations of vectors (i.e., and ). Define function to be
(15) 
Then, we have
(16) 
Proof.
Algorithm 8 shows the details. First, we scan D similarly to the left multiplication with a vector. Specifically, we initialize as a zero matrix, and then add to for each . Note that H here is stored in a transposed manner so that we only need one scan and retain good cache performance at the same time.
Second, we scan backwards to actually compute stored in R. Specifically, for th row and each , we add .key * to the result of th row R[i,:] and add to .parent].
Appendix C More Time Complexity Analysis
Executing and needs to build , scan , and scan D. As shown in Algorithm 2, building has complexity and . When scanning and , each element needs to do r operations for / respectively. Thus, the time complexity for and is and respectively.
Appendix D More Experiments
D.1. Integrating TOC into Bismarck
We integrated TOC into Bismarck and replaced its existing matrix kernels. There are three key parts to the integration. First, we allocate an arena space in Bismarck's shared memory for storing the ML models. Second, we replace the existing Bismarck matrix kernel with the TOC matrix kernel for updating the ML models. Third, a database table is used to store the TOC-compressed minibatches; the serialized bytes of each TOC-compressed minibatch are stored as a variable-length bytes field in a row. Finally, we modified the UDF for ML training to read each compressed minibatch from the table and use the replaced matrix kernel to update the ML model in the arena.
Methods | Census15m (NN / LR / SVM) | Census290m (NN / LR / SVM) | Kdd7m (NN / LR / SVM) | Kdd200m (NN / LR / SVM)
TOC (ours) | 35 / 0.8 / 0.7 | 702 / 16 / 14 | 16.1 / 0.2 / 0.2 | 323 / 6.1 / 5.9
DEN | 39 / 4.0 / 4.0 | 1108 / 253 / 251 | 29 / 4.6 / 4.4 | 1003 / 608 / 615
CSR | 38 / 1.8 / 1.8 | 942 / 161 / 167 | 19.2 / 0.4 / 0.4 | 438 / 56 / 53
CVI | 37 / 1.1 / 1.0 | 844 / 80 / 67 | 18.5 / 0.3 / 0.3 | 422 / 31 / 30
DVI | 38 / 1.2 / 1.1 | 800 / 46 / 43 | 28.4 / 1.2 / 1.1 | 611 / 71 / 71
Snappy | 41 / 4.7 / 4.6 | 905 / 121 / 115 | 27.2 / 3.5 / 3.5 | 616 / 127 / 128
Gzip | 46 / 11.1 / 11.1 | 965 / 244 / 241 | 33.5 / 7.5 / 7.5 | 683 / 235 / 235
BismarckTOC | 38 / 0.87 / 0.88 | 742 / 17.4 / 14.8 | 16.8 / 0.3 / 0.31 | 329 / 6.4 / 6.3
BismarckDEN | N/A / 4.2 / 4.3 | N/A / 321 / 310 | N/A / 4.0 / 3.8 | N/A / 645 / 644
BismarckCSR | N/A / 3.2 / 3.2 | N/A / 222 / 234 | N/A / 0.9 / 0.9 | N/A / 114 / 115
ScikitLearnDEN | 73.2 / 7.3 / 6.6 | 1715 / 604 / 580 | 42 / 5 / 4.6 | 1797 / 771 / 772
ScikitLearnCSR | 105.1 / 5.7 / 5.1 | 2543 / 421 / 408.8 | 44 / 1.7 / 1.5 | 1476 / 166 / 160
TensorFlowDEN | 38.1 / 9.4 / 10.5 | 1073 / 638 / 610 | 21.4 / 5.5 / 5.1 | 1199 / 781 / 779
TensorFlowCSR | 54.7 / 15.1 / 14.0 | 1244 / 681 / 661 | 15.2 / 4.1 / 4.4 | 577 / 300 / 274
D.2. End-to-End MGD Runtimes
MGD runtimes on Census and Kdd99 are reported in Table 7. Overall, the results are similar to those presented in § 5.3. On small datasets like Census15m and Kdd7m, TOC has performance comparable to the other methods. On large datasets like Census290m and Kdd200m, TOC is up to 1.8x/17.8x/18.3x faster than the state-of-the-art compression schemes for NN/LR/SVM, respectively. We omit the results for Rcv1 and Deep1Billion because of their extreme sparsity/density, for which we do not expect better performance from TOC.