When Lempel-Ziv-Welch Meets Machine Learning: A Case Study of Accelerating Machine Learning using Coding

02/22/2017 · by Fengan Li, et al. · University of Wisconsin-Madison · University of California, San Diego

In this paper we study the use of coding techniques to accelerate machine learning (ML). Coding techniques, such as prefix codes, have been extensively studied and used to accelerate low-level data processing primitives such as scans in a relational database system. However, there is little work on how to exploit them to accelerate ML algorithms. In fact, applying coding techniques for faster ML faces a unique challenge: one needs to consider both how the codes fit into the optimization algorithm used to train a model, and the interplay between the model structure and the coding scheme. Surprisingly and intriguingly, our study demonstrates that a slight variant of the classical Lempel-Ziv-Welch (LZW) coding scheme is a good fit for several popular ML algorithms, resulting in substantial runtime savings. Comprehensive experiments on several real-world datasets show that our LZW-based ML algorithms exhibit speedups of up to 31x compared to a popular and state-of-the-art ML library, with no changes to ML accuracy, even though the implementations of our LZW variants are not heavily tuned. Thus, our study reveals a new avenue for accelerating ML algorithms using coding techniques and we hope this opens up a new direction for more research.

1. Introduction

Figure 1. A: If no compression is used to train ML models on large datasets that cannot fit into memory, loading a mini-batch (IO time) from disk can be significantly more expensive than matrix operations (CPU time) performed on the mini-batch for MGD. B: One typically uses a compression scheme to compress mini-batches so that they can fit into memory. For general compression schemes (GC), a mini-batch has to be decoded before any computation can be carried out. For light-weight matrix compression schemes (LMC) and our proposed tuple-oriented compression (TOC), matrix operations can directly operate on the encoded output without decompression overheads. C: TOC has compression ratios comparable to GC. Similar to LMC, matrix operations can directly operate on the TOC output without decoding the mini-batch. D: Since TOC has good compression ratios and no decompression overheads, it reduces the MGD training time especially on large datasets. For small datasets, TOC has comparable performance to LMC. Note that MGD training time grows sharply once the data is spilled to disk.

Data compression is a popular technique for improving the efficiency of data processing workloads such as SQL queries over compressed databases (abadi2006integrating; li2013bitweaving; wesley2014leveraging; elgohary2016compressed; wang2017experimental) and, more recently, machine learning with classical batch gradient methods (elgohary2016compressed). However, to the best of our knowledge, there is no such study of data compression for mini-batch stochastic gradient descent (MGD) (hogwild; wu2017bolt; bismarck; kaoudi2017cost; qin2017scalable), which is known for its fast convergence rate and statistical stability, and is arguably the workhorse algorithm (ruder2016overview; hinton2012neural) of modern ML. This research gap is becoming more pressing as training dataset sizes in ML keep growing (chelba2013one; russakovsky2015imagenet). For example, if no compression is used to train ML models on large datasets that cannot fit into memory capacity or even distributed memory capacity, disk IO time becomes a significant overhead (elgohary2016compressed; yu2012large) for MGD. Figure 1A highlights this issue in more detail.

Despite the need for a good data compression scheme to improve the efficiency of MGD workloads, unfortunately, the main existing data compression schemes designed for general data files or batch gradient methods are not a good fit for the data access pattern of MGD. Figure 1B highlights these existing solutions. For example, general compression schemes (GC) such as Gzip and Snappy are designed for general data files. GC typically has good compression ratios on mini-batches; however, a mini-batch has to be decompressed before any computation can be carried out, and the decompression overhead is significant (elgohary2016compressed) for MGD. Light-weight matrix compression schemes (LMC) include classical methods such as compressed sparse row (saad2003iterative) and value indexing (kourtis2008optimizing) and, more recently, a state-of-the-art technique called compressed linear algebra (elgohary2016compressed). LMC is suitable for batch gradient methods because the compression ratio of LMC is satisfactory on the whole dataset and matrix operations can directly operate on the encoded output without decompression overheads. Nevertheless, the compression ratio of LMC on mini-batches is generally not as good as that of GC, which makes it less attractive for MGD.

In this paper, we fill this crucial research gap by proposing a lossless matrix compression scheme called tuple-oriented compression (TOC), whose name is based on the fact that tuple boundaries (i.e., boundaries between columns/rows in the underlying tabular data) are preserved. Figure 1C highlights the advantage of TOC over existing compression schemes. TOC has both good compression ratios on mini-batches and no decompression overheads for matrix operations, which are the main operations executed by MGD on compressed data. Orthogonal to existing works like GC and LMC, TOC takes inspiration from an unlikely source—a popular string/text compression scheme, Lempel-Ziv-Welch (LZW) (welch1984technique; ziv1977universal; ziv1978compression)—and builds a compression scheme with compression ratios comparable to Gzip on mini-batches. In addition, this paper also proposes a suite of compressed matrix operation execution techniques, tailored to the TOC compression scheme, that operate directly over the compressed data representation and avoid decompression overheads. Even for a small dataset that fits into memory, these compressed execution techniques are often faster than uncompressed execution techniques because they can reduce computational redundancies in matrix operations. Collectively, these techniques present a fresh perspective that weaving together ideas from databases, text processing, and ML can achieve substantial efficiency gains for popular MGD-based ML workloads. Figure 1D highlights the effect of TOC in reducing the MGD runtimes, especially on large datasets.

TOC consists of three components at different layers of abstraction: sparse encoding, logical encoding, and physical encoding. All these components respect the boundaries between rows and columns in the underlying tabular data so that matrix operations can be carried out on the encoded output directly. Sparse encoding uses the well-known sparse row technique (saad2003iterative) as a starting point. Logical encoding uses a prefix tree encoding algorithm, which is based on the LZW compression scheme, to further compress matrices. Specifically, we notice that there are sequences of column values that repeat across matrix rows. Thus, these repeated sequences can be stored in a prefix tree, where each tree node represents a sequence. The occurrences of these sequences in the matrix can then be encoded as indexes to tree nodes to reduce space. Note that we only need to store the encoded matrix and the first layer of the prefix tree as encoded outputs, as the prefix tree can be rebuilt from the encoded outputs if needed. Lastly, physical encoding encodes integers and floating-point numbers efficiently.

We design a suite of compressed execution techniques that operate directly over the compressed data representation without decompression overheads for three classes of matrix operations. These matrix operations are used by MGD to train popular ML models such as Linear/Logistic regression, Support vector machine, and Neural network. These compressed execution techniques only need to scan the encoded table and the prefix tree at most once. Thus, they are fast, especially when TOC exploits significant redundancies. For example, right multiplication (e.g., matrix times vector) and left multiplication (e.g., vector times matrix) can be computed with one scan of the encoded table and the prefix tree. Lastly, since these compressed execution techniques for matrix operations differ drastically from the uncompressed execution techniques, we provide mathematical analysis to prove the correctness of these compressed techniques.

To summarize, the main contributions of this paper are:

  1. To the best of our knowledge, this is the first work to study lossless compression techniques to reduce the memory/storage footprints and runtimes for mini-batch stochastic gradient descent (MGD), which is the workhorse algorithm of modern ML. We propose a lossless matrix compression scheme called tuple-oriented compression (TOC) with compression ratios comparable to Gzip on mini-batches.

  2. We design a suite of novel compressed matrix operation execution techniques tailored to the TOC compression scheme that directly operate over the compressed data representation and avoid decompression overheads for MGD workloads.

  3. We provide a formal mathematical analysis to prove the correctness of the above compressed matrix operation execution techniques.

  4. We perform an extensive evaluation of TOC compared to seven compression schemes on six real datasets. Our results show that TOC consistently achieves substantial compression ratios of up to 51x. Moreover, TOC reduces MGD runtimes for three popular ML models by up to 5x compared to the state-of-the-art compression schemes and by up to 10.2x compared to the encoding methods in some popular ML systems (e.g., ScikitLearn (scikit-learn), Bismarck (bismarck), and TensorFlow (abadi2016tensorflow)). An integration of TOC into Bismarck also confirms that TOC can greatly benefit MGD performance in ML systems.

The remainder of this paper proceeds as follows: § 2 presents some required background information. § 3 explains our TOC compression scheme, while § 4 presents the techniques to execute matrix operations on the compressed data. § 5 presents the experimental results and § 6 discusses TOC extensions. § 7 presents related work, and we conclude in § 8.

2. Background

In this section, we discuss two important concepts: ML training in the generalized setting and data compression.

2.1. Machine Learning Training

2.1.1. Empirical Risk Minimization

We begin with a description of ML training in the generalized setting based on standard ML texts (SSSSS10; shai). Formally, we have a hypothesis space H, an instance set Z, and a loss function ℓ: H × Z → R≥0. Given a training set S = {z_1, z_2, …, z_n} whose elements are i.i.d. draws from a distribution over Z, and a hypothesis h ∈ H, our goal is to minimize the empirical risk over the training set, defined as

L_S(h) = (1/n) Σ_{i=1}^{n} ℓ(h, z_i).    (1)

Many ML models including Logistic/Linear regression, Support vector machine, and Neural network fit into this generalized setting (SSSSS10).

2.1.2. Gradient Descent

ML training can be viewed as the process of finding the optimal hypothesis h* = argmin_{h ∈ H} L_S(h). This is essentially an optimization problem, and gradient descent is a common and well-established class of algorithms for solving it. There are three main variants of gradient descent: batch gradient descent, stochastic gradient descent, and mini-batch stochastic gradient descent.

Batch Gradient Descent (BGD). BGD uses all the training data to compute the gradient and update the model parameters w per iteration.

Stochastic Gradient Descent (SGD). SGD uses a single tuple to compute the gradient and update the model parameters per iteration.

Mini-batch Stochastic Gradient Descent (MGD). MGD uses a small batch of tuples (typically tens or hundreds of tuples) to compute the gradient and update the model parameters per iteration:

w ← w − (η / |B_t|) Σ_{z ∈ B_t} ∇ℓ(w, z),    (2)

where B_t is the current (t-th) mini-batch we visit, z is a tuple from B_t, and η is the learning rate.

Note that MGD can cover the spectrum of gradient descent methods by varying the mini-batch size |B_t|. For example, MGD morphs into SGD and BGD by setting |B_t| = 1 and |B_t| = n, respectively.
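
For concreteness, the following minimal sketch (Python/NumPy; the names mgd and grad_fn are ours, not from the paper) implements the MGD update of Equation 2 together with the shuffle-once strategy of § 2.1.3. grad_fn(w, Xb, yb) is assumed to return the aggregated gradient over the mini-batch (Xb, yb).

```python
import numpy as np

def mgd(X, y, grad_fn, batch_size=250, lr=0.01, epochs=10, seed=0):
    """Run MGD following Equation 2: one parameter update per mini-batch."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    order = rng.permutation(n)          # shuffle once (see Section 2.1.3)
    for _ in range(epochs):
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # Equation 2: w <- w - (eta / |B|) * aggregated mini-batch gradient
            w -= lr / len(idx) * grad_fn(w, X[idx], y[idx])
    return w
```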

MGD gains its popularity due to its fast convergence rate and statistical stability. It typically requires fewer epochs (one whole pass over the dataset is an epoch) to converge than BGD and is more stable than SGD (ruder2016overview). Figure 2 illustrates the optimization efficiencies of these gradient descent variants, among which MGD with hundreds of rows in a mini-batch achieves the best balance between fast convergence and statistical stability. Thus, in this paper, we focus on MGD with mini-batch sizes ranging from tens to hundreds of tuples.

Figure 2. Optimization efficiencies of BGD, SGD, and MGD for training a neural network with one hidden layer (no convolutional layers) on Mnist. MGD (250 rows) has 250 rows in a mini-batch. MGD-20%, MGD-50%, and MGD-80% have 20, 50, and 80 percent of the rows of the whole dataset in each mini-batch, respectively.

2.1.3. Shuffle Once vs. Shuffle Always

The random sampling of tuples in SGD/MGD is typically done without replacement, which is achieved by shuffling the dataset before each epoch (bengio2012practical). However, shuffling the data at every epoch (shuffle always) incurs a high overhead. Thus, we follow the standard technique of shuffling once (bismarck; wu2017bolt; bengio2012practical) (i.e., shuffling the data once upfront) to improve ML training efficiency.

2.1.4. Core Matrix Operations for Gradient Descent

The core operations, which dominate the CPU time, for using gradient descent to optimize many ML models (e.g., Linear/Logistic regression, Support vector machine, and Neural network) are matrix operations (elgohary2016compressed). We illustrate this point using an example of Linear regression, and summarize the core matrix operations for these ML models in Table 1.

ML models: core matrix operations
Linear regression: Xw, vX
Logistic regression: Xw, vX
Support vector machine: Xw, vX
Neural network: XW, VX
Table 1. The core matrix operations when using gradient descent to optimize popular ML models. X is a mini-batch of data used to update the model; w/W and v/V are either ML model parameters or intermediate results for computing gradients (vectors for the single-output models, matrices for the neural network). We use logistic loss for Logistic regression, hinge loss for Support vector machine, and mean squared loss for Linear regression and Neural network. For the sake of simplicity, our neural network has a feed-forward structure with a single hidden layer.

Example. Consider the supervised learning algorithm Linear regression, where H = R^d, z = (x, y) with x ∈ R^d and y ∈ R, and ℓ(w, (x, y)) = (1/2)(x^T w − y)^2. Let matrix X = [x_1, x_2, …, x_b]^T and vector y = [y_1, y_2, …, y_b]^T for a mini-batch of b tuples; then the aggregated gradient of the loss function is:

Σ_{i=1}^{b} ∇ℓ(w, (x_i, y_i)) = X^T (Xw − y).    (3)

Thus, there are two core matrix operations—matrix times vector (u = Xw) and vector times matrix ((u − y)^T X)—needed to compute Equation 3.
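
As a small illustration (a Python/NumPy sketch under our own naming, not the paper's code), the aggregated gradient of Equation 3 can be computed with exactly these two operations and plugged into an MGD loop such as the one sketched in § 2.1.2:

```python
import numpy as np

def linear_regression_gradient(w, X, y):
    """Aggregated least-squares gradient (Equation 3), computed with the two
    core matrix operations: matrix times vector, then vector times matrix."""
    u = X @ w            # matrix times vector:  X w
    g = (u - y) @ X      # vector times matrix:  (X w - y)^T X
    return g

# Sanity check against the closed form X^T (X w - y)
rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(250, 8)), rng.normal(size=250), rng.normal(size=8)
assert np.allclose(linear_regression_gradient(w, X, y), X.T @ (X @ w - y))
```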

2.2. Data Compression

Data compression, also known as source coding, is an important technique to reduce data sizes. There are two main components in a data compression scheme, an encoding process that encodes the data into coded symbols (hopefully with fewer bits), and a decoding process that reconstructs the original data from the compressed representation.

Based on whether the reconstructed data differs from the original data, data compression schemes can usually be classified as lossless compression or lossy compression. In this paper, we propose a lossless compression scheme called tuple-oriented compression (TOC), which is inspired by a classical lossless string/text compression scheme that has gained both academic influence and industrial popularity: Lempel-Ziv-Welch (welch1984technique; ziv1977universal; ziv1978compression). For example, the Unix file compression utility compress (ncompress.sourceforge.net) and the GIF image format (wiggins2001image) are based on LZW.

3. Tuple-oriented Compression

Figure 3. A running example of the TOC encoding process. TOC has three components: sparse encoding, logical encoding, and physical encoding. The red dotted lines connect these components. Sparse encoding encodes the original table A to the sparse encoded table B. Logical encoding encodes B to the encoded table D. It also outputs I, which is the column_index:value pairs in the first layer of the prefix tree C. Physical encoding encodes I and D to physical bytes efficiently.

In this section, we introduce our tuple-oriented compression (TOC). The goal of TOC is to (1) compress a mini-batch as much as possible and (2) preserve the row/column boundaries in the underlying tabular data so that matrix operations can directly operate on the compressed representation without decompression overheads. Following the popular sparse row technique (saad2003iterative), we use sparse encoding as a starting point, and introduce two new techniques: logical encoding and physical encoding. Figure 3 demonstrates a running example of the encoding process.

For sparse encoding, we ignore the zero values and then prefix each non-zero value with its column index. We call a value together with its column index prefix a column_index:value pair. For example, in Figure 3, tuple R2 - [1.1, 2, 3, 0] is encoded as [1:1.1, 2:2, 3:3], where 1:1.1 is a column_index:value pair. As a result of sparse encoding, the original table (A) in Figure 3 is converted to the sparse encoded table (B).
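
A minimal sketch of sparse encoding (Python; the function name is ours, and column indexes are 1-based as in Figure 3):

```python
def sparse_encode(row):
    """Keep only non-zero values, each prefixed with its (1-based) column index."""
    return [(j + 1, v) for j, v in enumerate(row) if v != 0]

# Tuple R2 from Figure 3: [1.1, 2, 3, 0] -> [1:1.1, 2:2, 3:3]
print(sparse_encode([1.1, 2, 3, 0]))   # [(1, 1.1), (2, 2), (3, 3)]
```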

3.1. Logical Encoding

The sparse encoded table (e.g., B in Figure 3) can be further compressed logically. The key idea is that there are repeating sequences of column_index:value pairs across tuples in the table. For example, R2 and R4 in the table B both have the same sequence [1:1.1, 2:2]. Thus, these occurrences of the same sequence can be encoded as the same index pointing to a dictionary entry, which represents the original sequence. Since many of these sequences have common prefixes, a prefix tree is used to store all the sequences. Finally, each original tuple is encoded as a vector of indexes pointing to prefix tree nodes.

We present the prefix tree structure and its APIs in § 3.1.1. In § 3.1.2, we present the actual prefix tree encoding algorithm, including how to dynamically build the tree and encode tuples. The comparison between our prefix tree encoding algorithm and LZW is presented in § 3.1.3.

3.1.1. Prefix Tree Structure and APIs

Each node of the prefix tree has an index. Except for the root node, each node stores a column_index:value pair as its key. Each node also represents a sequence of column_index:value pairs, which are obtained by concatenating the keys from the prefix tree root to the node itself. For example, in the prefix tree C in Figure 3, the left bottom tree node has index 9, stores key 3:3, and represents the sequence of column_index:value pairs [1:1.1, 2:2, 3:3].

The prefix tree supports two main APIs: AddNode and GetIndex.

AddNode(parent_index, key). This API creates a new prefix tree node that has key key and is a child of the tree node with index parent_index. It returns the index of the newly created tree node, which is assigned from a sequence number starting from 0 (the root takes index 0).

GetIndex(parent_index, key). This API looks up the tree node that has key key and is a child of the tree node with index parent_index. It returns the index of the found tree node, or -1 if there is no such node.

The implementation of AddNode is straightforward. The implementation of GetIndex is more involved, and we use a standard technique reported in (blelloch2001introduction). In essence, for each tree node, we create a hash map mapping from its child node keys to its child node indexes.
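
The following sketch (Python; class and method names are ours) shows one way to realize this structure, with a per-node hash map from child keys to child indexes:

```python
class PrefixTree:
    """Prefix tree from Section 3.1.1. Node 0 is the root. For O(1) GetIndex,
    each node keeps a hash map from child keys to child indexes, following the
    standard technique in (blelloch2001introduction)."""

    def __init__(self):
        self.keys = [None]       # keys[i]: key of node i (the root has none)
        self.parents = [None]    # parents[i]: parent index of node i
        self.children = [{}]     # children[i]: key -> child index

    def add_node(self, parent_index, key):
        idx = len(self.keys)     # indexes are assigned sequentially
        self.keys.append(key)
        self.parents.append(parent_index)
        self.children.append({})
        self.children[parent_index][key] = idx
        return idx

    def get_index(self, parent_index, key):
        return self.children[parent_index].get(key, -1)

tree = PrefixTree()
a = tree.add_node(0, (1, 1.1))           # a first-layer node for key 1:1.1
b = tree.add_node(a, (2, 2))             # represents the sequence [1:1.1, 2:2]
assert tree.get_index(a, (2, 2)) == b and tree.get_index(0, (3, 3)) == -1
```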

3.1.2. Prefix Tree Encoding Algorithm

Our prefix tree encoding algorithm encodes the sparse encoded table (e.g., B in Figure 3) to an encoded table (e.g., D in Figure 3). During the encoding process, we build a prefix tree (e.g., C in Figure 3) and each original tuple is encoded as a vector of indexes pointing to prefix tree nodes. Algorithm 1 presents the pseudo-code of the algorithm. Figure 3 presents a running example of executing the algorithm and encoding table B to table D.

The prefix tree encoding algorithm has two main phases. In phase 1 (lines 5 to 8 of Algorithm 1), we initialize the prefix tree with all the unique column_index:value pairs in the sparse encoded table as the children of the root node.

In phase 2 (lines 9 to 16 of Algorithm 1), we leverage the repeated sequences across tuples so that the same sequence—for example, R2 and R4 in Figure 3 both contain the sequence [1:1.1, 2:2]—is encoded as the same index to a prefix tree node. At its heart, we scan all the tuples to detect whether part of a tuple matches a sequence that already exists in the prefix tree, and we build up the prefix tree along the way. We use the function LongestMatchFromTree in Algorithm 1 to find the longest sequence in the prefix tree that matches the sequence in tuple t starting from position p. The function returns the tree node index idx of the longest match and the next matching starting position p. If p < len(t), we add a new node to the prefix tree that is a child of the tree node with index idx and has key t[p], to capture this new sequence in tuple t. In this way, later tuples can leverage this new sequence. Note that the longest match found is at least of length one because we store all the unique column_index:value pairs as children of the root node in phase 1. Table 2 gives a running example of executing Algorithm 1 on table B in Figure 3.

Our prefix tree encoding and LZW are both linear algorithms in the sense that each input unit is read at most twice and the work per input unit is constant. So the time complexity of Algorithm 1 is O(N), where N is the number of column_index:value pairs in the sparse encoded table B.

1:  function PrefixTreeEncode(B)
2:      inputs: sparse encoded table B
3:      outputs: column_index:value pairs I in the first layer of the prefix tree and encoded table D
4:      Initialize C with a root node with index 0
5:      for each tuple t in B do                              ▷ phase 1: initialization
6:          for each column_index:value pair t[j] in t do
7:              if C.GetIndex(0, t[j]) = -1 then
8:                  C.AddNode(0, t[j])
9:      for each tuple t in B do                              ▷ phase 2: encoding
10:         p ← 0                                             ▷ the matching starting position
11:         D[t] ← []                                         ▷ initialize D[t] as an empty vector
12:         while p < len(t) do
13:             (idx, p) ← LongestMatchFromTree(t, p, C)
14:             D[t].append(idx)
15:             if p < len(t) then
16:                 C.AddNode(idx, t[p])
17:     I ← first_layer(C)
18:     return (I, D)
19:
20: function LongestMatchFromTree(t, p, C)
21:     inputs: input tuple t, matching starting position p in t, and prefix tree C
22:     outputs: index idx of the tree node of the longest match and the next matching starting position p
23:     idx ← C.GetIndex(0, t[p])                             ▷ match the 1st element
24:     p ← p + 1
25:     loop                                                  ▷ try matching the next element
26:         if p < len(t) then
27:             next ← C.GetIndex(idx, t[p])                  ▷ -1 if no such tree node exists
28:         else
29:             next ← -1                                     ▷ reached the end of tuple t
30:         if next = -1 then
31:             return (idx, p)
32:         idx ← next
33:         p ← p + 1
Algorithm 1 Prefix Tree Encoding Algorithm
Tuple  p  LMFromTree        App  AddNode
R1     0  1 [1:1.1]         1    6 [1:1.1, 2:2]
       1  2 [2:2]           2    7 [2:2, 3:3]
       2  3 [3:3]           3    8 [3:3, 4:1.4]
       3  4 [4:1.4]         4    NOT called
R2     0  6 [1:1.1, 2:2]    6    9 [1:1.1, 2:2, 3:3]
       2  3 [3:3]           3    NOT called
R3     0  5 [2:1.1]         5    10 [2:1.1, 3:3]
       1  8 [3:3, 4:1.4]    8    NOT called
R4     0  6 [1:1.1, 2:2]    6    NOT called
Table 2. The steps of running Algorithm 1 on table B in Figure 3. We omit phase 1 of the algorithm, which initializes the prefix tree with nodes 1 → [1:1.1], 2 → [2:2], 3 → [3:3], 4 → [4:1.4], and 5 → [2:1.1], where the left side of the arrow is the tree node index and the right side is the sequence of column_index:value pairs represented by the tree node. Each entry illustrates one iteration of the while loop (lines 12-16) of Algorithm 1. Column p is the starting position in the tuple at which we try to match a sequence in the prefix tree. Column LMFromTree shows the index and the corresponding sequence of the longest match found by the function LongestMatchFromTree. Column App is the tree node index appended to encode the tuple. Column AddNode shows the index and the corresponding sequence of the newly added tree node.
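
To make the encoding concrete, here is a compact runnable sketch (Python; the dictionary-of-children layout is our implementation choice) that mirrors Algorithm 1 and reproduces the encoded table D of the running example:

```python
def prefix_tree_encode(B):
    """Encode a sparse-encoded table B (a list of tuples, each a list of
    column_index:value pairs) into (I, D), as in Algorithm 1."""
    children = [{}]          # children[i]: key -> child index; node 0 is the root
    keys = [None]
    def add_node(parent, key):
        children.append({}); keys.append(key)
        children[parent][key] = len(keys) - 1
        return len(keys) - 1
    # phase 1: all unique pairs become children of the root
    for t in B:
        for pair in t:
            if pair not in children[0]:
                add_node(0, pair)
    # phase 2: encode each tuple, growing the tree along the way
    D = []
    for t in B:
        codes, p = [], 0
        while p < len(t):
            idx = children[0][t[p]]            # longest match, 1st element
            p += 1
            while p < len(t) and t[p] in children[idx]:
                idx = children[idx][t[p]]      # extend the match
                p += 1
            codes.append(idx)
            if p < len(t):
                add_node(idx, t[p])            # remember the new sequence
        D.append(codes)
    I = keys[1:len(children[0]) + 1]           # first layer of the tree
    return I, D

# Table B from Figure 3 (R1-R4), written as column_index:value pairs
B = [[(1, 1.1), (2, 2), (3, 3), (4, 1.4)],
     [(1, 1.1), (2, 2), (3, 3)],
     [(2, 1.1), (3, 3), (4, 1.4)],
     [(1, 1.1), (2, 2)]]
I, D = prefix_tree_encode(B)
print(D)   # per Table 2: [[1, 2, 3, 4], [6, 3], [5, 8], [6]]
```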

3.1.3. Comparisons with Lempel-Ziv-Welch (LZW)

Our prefix tree encoding algorithm is inspired by the classical compression scheme LZW. However, a key difference between LZW and our algorithm is that we preserve the row and column boundaries in the underlying tabular data, which is crucial for executing matrix operations directly on the compressed representation. For example, our algorithm encodes each tuple separately (although the dictionary is shared) to respect the row boundaries, and the compression unit is a column_index:value pair to respect the column boundaries. In contrast, LZW simply encodes a blob of bytes without preserving any structural information. The reason is that LZW was designed primarily for string/text compression. There are several other notable differences between our algorithm and LZW, which are summarized in Table 3.

             LZW                    Ours
Input        bytes                  sparse encoded table
Encode unit  8 bits                 column_index:value pair
Tree init.   all values of 8 bits   all unique column_index:value pairs
Tuple bound. lost                   preserved
Output       a vector of codes      encoded table & prefix tree first layer
Table 3. Differences between LZW and our prefix tree encoding.

3.2. Physical Encoding

The output of the logical encoding (i.e., I and D in Figure 3) can be further encoded physically to reduce sizes. We use two simple techniques—bit packing (lemire2015decoding) and value indexing (kourtis2008optimizing)— that can reduce sizes without incurring significant overheads when accessing the original values.

We notice that some information in I and D can be stored using arrays of non-negative integers, and these integers are typically small. For example, the maximal column index in I of Figure 3 is 4, so 1 byte is enough to encode a single integer. Bit packing is used to encode these arrays of small non-negative integers efficiently. Specifically, we encode each non-negative integer in an array using the smallest number of bytes that can hold the largest integer in that array. Each encoded array has a header that tells the number of integers in the array and the number of bytes used per integer. More advanced encoding methods, such as Varint (dean2009challenges) and SIMD-BP128 (lemire2015decoding), can also be used and point to interesting directions for future work.
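
A minimal sketch of the bit-packing idea (Python; the exact header layout shown here is our assumption, not necessarily the paper's on-disk format):

```python
import struct

def bit_pack(ints):
    """Pack non-negative integers with the minimum byte width needed for the
    largest value, preceded by a small header (count, bytes per integer)."""
    width = max(1, (max(ints, default=0).bit_length() + 7) // 8)
    out = bytearray(struct.pack("<IB", len(ints), width))     # header
    for v in ints:
        out += v.to_bytes(width, "little")
    return bytes(out)

def bit_unpack(buf):
    n, width = struct.unpack_from("<IB", buf, 0)
    off = struct.calcsize("<IB")
    return [int.from_bytes(buf[off + i * width: off + (i + 1) * width], "little")
            for i in range(n)]

vals = [1, 2, 3, 4, 2]            # e.g., the column indexes of I in Figure 3
assert bit_unpack(bit_pack(vals)) == vals
print(len(bit_pack(vals)))        # 5 header bytes + 5 one-byte integers = 10
```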

Value indexing is essentially a dictionary encoding technique. That is, we store all the unique values (excluding column indexes) in the column_index:value pairs (e.g., I in Figure 3) in an array. Then, we replace the original values with the indexes pointing to the values in the array.
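
A minimal sketch of value indexing (Python; the function name is ours):

```python
def value_index(pairs):
    """Store each distinct value once; replace values with dictionary indexes."""
    values, lookup, indexed = [], {}, []
    for col, val in pairs:
        if val not in lookup:
            lookup[val] = len(values)
            values.append(val)
        indexed.append((col, lookup[val]))
    return values, indexed

# I from Figure 3: five column_index:value pairs, four distinct values
I = [(1, 1.1), (2, 2), (3, 3), (4, 1.4), (2, 1.1)]
print(value_index(I))  # ([1.1, 2, 3, 1.4], [(1, 0), (2, 1), (3, 2), (4, 3), (2, 0)])
```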

Figure 3 illustrates an example of how we encode the input (e.g., I and D) to physical bytes. For I, the column indexes are encoded using bit packing, while the values are encoded using value indexing. The value indexes from applying value indexing are also encoded using bit packing. For D, we concatenate the tree node indexes from all the tuples and encode them all together using bit packing. We also encode the tuple starting indexes using bit packing.

4. Matrix Operation Execution

In this section, we introduce how to execute matrix operations on the TOC output. Most of the operations can directly operate on the compressed representation without decoding the original matrix. This direct execution avoids the tedious and expensive decoding process and reduces the runtime to execute matrix operations and MGD.

Let A be a TOC-compressed matrix, c be a scalar, and x and Y be an uncompressed vector and matrix, respectively. We discuss four common classes of matrix operations:

  1. Sparse-safe element-wise operations (elgohary2016compressed) (e.g., cA).

  2. Right multiplication operations (e.g., Ax and AY).

  3. Left multiplication operations (e.g., xA and YA).

  4. Sparse-unsafe element-wise operations (elgohary2016compressed) (e.g., A + c).

Informally speaking, a sparse-safe operation is one for which zero elements in the matrix remain zero after the operation; a sparse-unsafe operation is one for which zero elements may become non-zero after the operation.

Figure 4 gives an overview of how to execute different matrix operations on the TOC output. The first three classes of operations can directly operate over the compressed representation without decoding the original matrix. The last class of operations needs to decode the original matrix; however, such operations are less likely to be used for training machine learning models because they change the input data.

Figure 4. An overview of how to execute different matrix operations on the TOC output. For sparse-safe element-wise operations, right multiplication operations, and left multiplication operations, we can execute them on the TOC output directly. For sparse-unsafe element-wise operations, we need to fully decode the input matrix A and then apply the operation on the decoded matrix.

4.1. Shared Operators

In this subsection, we discuss some shared operators for executing matrix operations on the TOC output.

4.1.1. Access Values of I and D From Physical Bytes

As shown in Figure 4, executing many matrix operations requires scanning I or D, which are encoded to physical bytes using bit packing and value indexing as explained in § 3.2. Thus, we briefly discuss how to access values of I and D from encoded physical bytes.

To access a non-negative integer encoded using bit packing, one can simply seek to the starting position of the integer and cast its bytes to uint_8, uint_16, uint_24, or uint_32, respectively. Unfortunately, most programming languages do not support uint_24 natively; nevertheless, one can copy the three bytes into a uint_32 and mask its leading byte as zero.
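
A small sketch of this access path (Python; the function name and header-free layout are ours):

```python
import struct

def read_packed_uint(buf, pos, width):
    """Read one bit-packed integer of `width` bytes (1, 2, 3, or 4) at byte
    offset `pos`. For width 3 we emulate the uint_24 trick from the text:
    copy the 3 bytes into a 4-byte word whose leading byte is zero."""
    if width in (1, 2, 4):
        fmt = {1: "<B", 2: "<H", 4: "<I"}[width]
        return struct.unpack_from(fmt, buf, pos)[0]
    # width == 3: pad to 4 bytes (little-endian) and read as a uint_32
    word = bytes(buf[pos:pos + 3]) + b"\x00"
    return struct.unpack("<I", word)[0]

buf = (300_000).to_bytes(3, "little") + (7).to_bytes(1, "little")
assert read_packed_uint(buf, 0, 3) == 300_000
assert read_packed_uint(buf, 3, 1) == 7
```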

To access values encoded using value indexing, one can look up the array which stores the unique values using the value indexes.

4.1.2. Build Prefix Tree For Decoding

As shown in Figure 4, executing all matrix operations except for sparse-safe element-wise operations requires building the prefix tree C', which is a simplified variant of the prefix tree C built during encoding. Each node in C' has the same index and key as the corresponding node in C. The difference is that each node in C' stores the index of its parent but does NOT store indexes to its children. Table 4 demonstrates an example of C'.

Index        0  1      2    3    4      5      6    7    8      9    10
Key          -  1:1.1  2:2  3:3  4:1.4  2:1.1  2:2  3:3  4:1.4  3:3  3:3
ParentIndex  -  0      0    0    0      0      1    2    3      6    5
Table 4. An example of C', which is a simplified variant of the prefix tree C in Figure 3. Each node in C' stores only the index of its parent, NOT indexes to its children. Node 0 is the root and has no key or parent.

Algorithm 2 presents how to build C'. There are two main phases. In phase 1, C' and an auxiliary array F are both initialized from I, where F[i] stores the first column_index:value pair of the sequence represented by tree node i.

In phase 2, we scan D to build C', mimicking how C is built in Algorithm 1. In lines 11 to 13 of Algorithm 2, we add a new prefix tree node indexed by idx_seq_num. More specifically, the new tree node is a child of the tree node indexed by D[i][j] (line 11), the first column_index:value pair of the sequence represented by the new tree node is the same as that of its parent (line 12), and the key of the new tree node is the first column_index:value pair of the sequence represented by the next tree node, indexed by D[i][j+1] (line 13).

1:  function BuildPrefixTree(I, D)
2:      inputs: column_index:value pairs I in the first layer of the prefix tree and encoded table D
3:      outputs: a prefix tree C' used for decoding
4:      for i ← 1 to len(I) do                            ▷ phase 1: initialize C' with I
5:          C'[i].key ← I[i]
6:          C'[i].parent ← 0
7:          F[i] ← I[i]           ▷ F stores the first column_index:value pair of the node's sequence
8:      idx_seq_num ← len(I) + 1
9:      for i ← 0 to len(D) - 1 do                        ▷ phase 2: build C'
10:         for j ← 0 to len(D[i]) - 2 do                 ▷ skip the last element
11:             C'[idx_seq_num].parent ← D[i][j]
12:             F[idx_seq_num] ← F[D[i][j]]
13:             C'[idx_seq_num].key ← F[D[i][j+1]]
14:             idx_seq_num ← idx_seq_num + 1
15:     return (C')
Algorithm 2 Build Prefix Tree
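
The following runnable sketch (Python; array names are ours) mirrors Algorithm 2 and reproduces the tree C' of Table 4 from the (I, D) of the running example:

```python
def build_prefix_tree(I, D):
    """Rebuild the decoding tree C' from (I, D), as in Algorithm 2.
    Returns parallel arrays key[i] and parent[i]; node 0 is the root."""
    key = [None] + list(I)
    parent = [None] + [0] * len(I)
    first = [None] + list(I)              # F: first pair of each node's sequence
    for row in D:
        for j in range(len(row) - 1):     # skip the last code of each tuple
            parent.append(row[j])
            first.append(first[row[j]])
            key.append(first[row[j + 1]])
    return key, parent

# (I, D) from the running example in Figure 3
I = [(1, 1.1), (2, 2), (3, 3), (4, 1.4), (2, 1.1)]
D = [[1, 2, 3, 4], [6, 3], [5, 8], [6]]
key, parent = build_prefix_tree(I, D)
print(key[9], parent[9])    # (3, 3) 6 -- node 9 represents [1:1.1, 2:2, 3:3]
```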

4.2. Sparse-safe Element-wise Operations

To execute sparse-safe element-wise operations (e.g., cA) directly on the TOC output, one can simply scan and modify I, because all the unique column_index:value pairs of A are stored in I. Algorithm 3 demonstrates how to execute the matrix times scalar operation (i.e., cA) on the TOC output. Algorithms for other sparse-safe element-wise operations are similar.

1:  function MatrixTimesScalar(I, c)
2:      inputs: column_index:value pairs I in the first layer of the prefix tree and a scalar c
3:      outputs: the modified I
4:      for i ← 0 to len(I) - 1 do
5:          I[i].value ← I[i].value * c
6:      return (I)
Algorithm 3 Execute the matrix times scalar operation (i.e., cA) on the TOC output.
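
A one-line sketch of this idea (Python; the function name is ours):

```python
def toc_matrix_times_scalar(I, c):
    """Scaling A only touches the unique values stored in I (Algorithm 3)."""
    return [(col, val * c) for col, val in I]

I = [(1, 1.1), (2, 2), (3, 3), (4, 1.4), (2, 1.1)]
print(toc_matrix_times_scalar(I, 2))  # [(1, 2.2), (2, 4), (3, 6), (4, 2.8), (2, 2.2)]
```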

4.3. Right Multiplication Operations

We first provide mathematical analysis that transforms the uncompressed execution of right multiplication operations into a compressed execution that operates directly on the TOC output without decoding the original table. The analysis also proves the correctness of the algorithm, since the algorithm follows the transformed form directly. Then, we present the detailed algorithm. In the rest of this subsection, we use Ax as an example. We put the result for AY (which is similar to Ax) in Appendix B for brevity.

Theorem 4.1 (Ax).

Let A ∈ R^{n×d}, x ∈ R^d, (I, D) be the output of TOC on A, C' be the prefix tree built for decoding, C'[i].seq be the sequence of tree node i as defined in § 3.1.1, C'[i].key be the key of tree node i as defined in § 4.1.2, and C'[i].parent be the parent index of tree node i as defined in § 4.1.2. Note that C'[i].seq and C'[i].key are both sparse representations of vectors (i.e., lists of column_index:value pairs), so their dot products with x are well defined. Define the function h as

h(i) = C'[i].seq · x.    (4)

Then, we have

(Ax)[i] = Σ_{j=0}^{len(D[i])−1} h(D[i][j]),    (5)

and

h(i) = C'[i].key · x + h(C'[i].parent),  with h(0) = 0.    (6)

Proof.

See Appendix A.1. ∎

Remark on Theorem 4.1. Ax can be directly executed on the TOC output following Equation 5 by scanning C' first and D second. The detailed steps are demonstrated in Algorithm 4.

First, we scan C' to compute the function h defined in Equation 4 (lines 5-7 in Algorithm 4). Dynamic programming is used following Equation 6. Specifically, we use H[i] to remember the computed value of h(i); we compute each H[i] as the sum of C'[i].key · x and H[C'[i].parent], which has already been computed.

Second, we scan D to compute Ax and store it in R following Equation 5 (lines 8-11 in Algorithm 4). For each D[i][j], we simply add H[D[i][j]] to R[i].

1:  function MatrixTimesVector(D, I, x)
2:      inputs: column_index:value pairs I in the first layer of the prefix tree, encoded table D, and vector x
3:      outputs: the result of Ax in R
4:      C' ← BuildPrefixTree(I, D)
5:      H ← a zero vector
6:      for i ← 1 to len(C') - 1 do                       ▷ scan C' to compute H
7:          H[i] ← C'[i].key · x + H[C'[i].parent]
8:      R ← a zero vector
9:      for i ← 0 to len(D) - 1 do                        ▷ scan D to compute R
10:         for j ← 0 to len(D[i]) - 1 do
11:             R[i] ← R[i] + H[D[i][j]]
12:     return (R)
Algorithm 4 Execute the matrix times vector operation (i.e., Ax) on the TOC output.
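
The following sketch (Python/NumPy; names are ours, and the decoding tree is passed in as the key/parent arrays of Table 4 rather than rebuilt) mirrors Algorithm 4 and checks the result against the uncompressed computation on the running example:

```python
import numpy as np

def toc_matrix_times_vector(D, key, parent, x):
    """Compute A @ x directly on the TOC output (Algorithm 4). key/parent
    describe the decoding tree C'; each key is a (column_index, value) pair
    with 1-based column indexes."""
    H = np.zeros(len(key))
    for i in range(1, len(key)):            # scan C' to compute h (Equation 6)
        col, val = key[i]
        H[i] = x[col - 1] * val + H[parent[i]]
    R = np.zeros(len(D))
    for r, row in enumerate(D):             # scan D and sum up h values (Equation 5)
        for code in row:
            R[r] += H[code]
    return R

# Running example from Figure 3 / Table 4
D = [[1, 2, 3, 4], [6, 3], [5, 8], [6]]
key = [None, (1, 1.1), (2, 2), (3, 3), (4, 1.4), (2, 1.1),
       (2, 2), (3, 3), (4, 1.4), (3, 3), (3, 3)]
parent = [None, 0, 0, 0, 0, 0, 1, 2, 3, 6, 5]
A = np.array([[1.1, 2, 3, 1.4],            # the original table A of Figure 3
              [1.1, 2, 3, 0.0],
              [0.0, 1.1, 3, 1.4],
              [1.1, 2, 0.0, 0.0]])
x = np.array([1.0, 2.0, 3.0, 4.0])
assert np.allclose(toc_matrix_times_vector(D, key, parent, x), A @ x)
```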

4.4. Left Multiplication Operations

We first give the mathematical analysis and then present the detailed algorithm; the reason for doing so is the same as in § 4.3. We only demonstrate the result for xA and put the result for YA in Appendix B for brevity.

Theorem 4.2 (xA).

Let A ∈ R^{n×d}, x ∈ R^n, (I, D) be the output of TOC on A, C' be the prefix tree built for decoding, C'[i].seq be the sequence of tree node i as defined in § 3.1.1, C'[i].key be the key of tree node i as defined in § 4.1.2, and C'[i].parent be the parent index of tree node i as defined in § 4.1.2. Note that C'[i].seq and C'[i].key are both sparse representations of vectors. Define the function g as

g(i) = Σ_{r=0}^{n−1} Σ_{j : D[r][j]=i} x[r].    (7)

Then, we have

xA = Σ_{i=1}^{len(C')−1} g(i) · C'[i].seq.    (8)

Proof.

See Appendix A.2. ∎

Remark on Theorem 4.2. We can compute xA following Equation 8 by scanning D first and C' second. Algorithm 5 presents the detailed steps. First, we scan D to compute the function g defined in Equation 7 (lines 6-8 in Algorithm 5). Specifically, we initialize H as a zero vector and then add x[i] to H[D[i][j]] for each entry D[i][j]. After this step, H[i] = g(i) for i = 1, 2, …, len(C') − 1.

Second, we scan C' backwards to compute xA and store it in R following Equation 8 (lines 10-12 in Algorithm 5). Dynamic programming is used following Equation 6, which expands C'[i].seq into C'[i].key plus the sequence of its parent. Specifically, for each i, we add C'[i].key × H[i] to R and add H[i] to H[C'[i].parent].

1:  function VectorTimesMatrix(D, I, x)
2:      inputs: column_index:value pairs I in the first layer of the prefix tree, encoded table D, and vector x
3:      outputs: the result of xA in R
4:      C' ← BuildPrefixTree(I, D)
5:      H ← a zero vector
6:      for i ← 0 to len(D) - 1 do                        ▷ scan D to compute H
7:          for j ← 0 to len(D[i]) - 1 do
8:              H[D[i][j]] ← x[i] + H[D[i][j]]
9:      R ← a zero vector
10:     for i ← len(C') - 1 to 1 do                       ▷ scan C' backwards to compute R
11:         R ← R + C'[i].key × H[i]                      ▷ i.e., R[c] ← R[c] + v·H[i] for key c:v
12:         H[C'[i].parent] ← H[C'[i].parent] + H[i]
13:     return (R)
Algorithm 5 Execute the vector times matrix operation (i.e., xA) on the TOC output.
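
Analogously, a sketch of Algorithm 5 (Python/NumPy; same assumptions as the sketch after Algorithm 4) that checks xA against the uncompressed computation:

```python
import numpy as np

def toc_vector_times_matrix(D, key, parent, x, num_cols):
    """Compute x @ A directly on the TOC output (Algorithm 5). key/parent
    describe the decoding tree C' with 1-based column indexes."""
    H = np.zeros(len(key))
    for r, row in enumerate(D):              # scan D: H[i] accumulates g(i)
        for code in row:
            H[code] += x[r]
    R = np.zeros(num_cols)
    for i in range(len(key) - 1, 0, -1):     # scan C' backwards
        col, val = key[i]
        R[col - 1] += val * H[i]
        H[parent[i]] += H[i]                 # push the weight up to the parent
    return R

# Same running example as in Section 4.3
D = [[1, 2, 3, 4], [6, 3], [5, 8], [6]]
key = [None, (1, 1.1), (2, 2), (3, 3), (4, 1.4), (2, 1.1),
       (2, 2), (3, 3), (4, 1.4), (3, 3), (3, 3)]
parent = [None, 0, 0, 0, 0, 0, 1, 2, 3, 6, 5]
A = np.array([[1.1, 2, 3, 1.4],
              [1.1, 2, 3, 0.0],
              [0.0, 1.1, 3, 1.4],
              [1.1, 2, 0.0, 0.0]])
x = np.array([1.0, 2.0, 3.0, 4.0])
assert np.allclose(toc_vector_times_matrix(D, key, parent, x, 4), x @ A)
```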

4.5. Sparse-unsafe Element-wise Operations

For sparse-unsafe element-wise operations (e.g., A + c), we need to fully decode A first and then execute the operations on the decoded matrix. Although this process is slow due to the decoding step, fortunately, sparse-unsafe element-wise operations are rarely used for training ML models because they change the input data. Algorithm 6 presents the detailed steps.

1:  function MatrixPlusScalar(D, I, c)
2:      inputs: column_index:value pairs I in the first layer of the prefix tree, encoded table D, and scalar c
3:      outputs: the result of A + c in R
4:      C' ← BuildPrefixTree(I, D)
5:      for i ← 0 to len(D) - 1 do
6:          B[i] ← []                                     ▷ initialize B[i] as an empty vector
7:          for j ← 0 to len(D[i]) - 1 do
8:              reverse_seq ← []
9:              tree_index ← D[i][j]
10:             while tree_index ≠ 0 do                   ▷ backtrack to get the reversed sequence of tree node D[i][j]
11:                 reverse_seq.Append(C'[tree_index].key)
12:                 tree_index ← C'[tree_index].parent
13:             for k ← len(reverse_seq) - 1 to 0 do
14:                 B[i].Append(reverse_seq[k])
15:     A ← TransformToDense(B)                           ▷ the decoded dense matrix
16:     R ← A + c
17:     return (R)
Algorithm 6 Execute the matrix plus scalar element-wise operation (i.e., A + c) on the TOC output.

4.6. Time Complexity Analysis

We give a detailed time complexity analysis of the different matrix operations, except for AY and YA, which we put in Appendix C for brevity. For cA, we only need to scan I, so the time complexity is O(len(I)).

For Ax and xA, we need to build C', scan C', and scan D. As shown in Algorithm 2, building C' requires one scan of I and D, and the number of nodes in C' is at most len(I) + Σ_i len(D[i]). So the complexity of building and scanning C' is O(len(I) + Σ_i len(D[i])). Overall, the complexity of Ax and xA is O(len(I) + Σ_i len(D[i])), i.e., linear in the size of the compressed representation rather than the size of the original matrix. This indicates that the computational redundancy induced by data redundancy is largely avoided by the TOC matrix execution algorithms. Thus, theoretically speaking, TOC matrix execution algorithms perform well when the data contains many redundancies.

For A + c, we need to decompress A first. Similar to LZW, the decompression of TOC is linear in the sense that each element has to be output and the cost per output element is constant. Thus, the complexity of decompressing A is linear in the number of non-zero elements of A. Overall, the complexity of A + c is linear in the size of the decompressed matrix.

5. Experiments

In this section, we answer the following questions:

  1. Can TOC compress mini-batches effectively?

  2. Can common matrix operations be executed efficiently on TOC compressed mini-batches?

  3. Can TOC reduce the end-to-end MGD runtimes significantly for training common machine learning models?

  4. Can TOC compress and decompress mini-batches quickly?

Summary of Answers. We answer these questions positively by conducting extensive experiments. First, on datasets with moderate sparsity, TOC reduces mini-batch sizes notably with compression ratios up to 51x. Compression ratios of TOC are up to 3.8x larger than the state-of-the-art light-weight matrix compression schemes, and comparable to general compression schemes such as Gzip. Second, the matrix operation runtime of TOC is comparable to the light-weight matrix compression schemes, and up to 20,000x better than the state-of-the-art general compression schemes. Third, TOC reduces the end-to-end MGD runtimes by up to 1.4x, 5.6x, and 4.8x compared to the state-of-the-art compression schemes for training Neural network, Logistic regression, and Support vector machine, respectively. TOC also reduces the MGD runtimes by up to 10.2x compared to the best encoding methods used in popular machine learning systems: Bismarck, ScikitLearn, and TensorFlow. Finally, the compression speed of TOC is much faster than Gzip but slower than Snappy, whereas the decompression speed of TOC is faster than both Gzip and Snappy.

Datasets. We use six real-world datasets. The first four datasets have moderate sparsity, which is typical for enterprise machine learning (ashari2015optimizing; harnik2012estimation). Rcv1 and Deep1Billion represent extremely sparse and extremely dense datasets, respectively. Table 5 lists the dataset statistics.

Dataset Dimensions Size Sparsity
US Census (elgohary2016compressed) 2.5 M * 68 0.46 GB 0.43
ImageNet (elgohary2016compressed) 1.2 M * 900 2.8 GB 0.31
Mnist8m (elgohary2016compressed) 8.1 M * 784 11.3 GB 0.25
Kdd99 (Lichman:2013) 4 M * 42 1.6 GB 0.39
Rcv1 (amini2009learning) 800 K * 47236 0.96 GB 0.0016
Deep1Billion (babenko2016efficient) 1 B * 96 475 GB 1.0
Table 5. Dataset statistics. Except for Deep1Billion, which is in binary format, we report the sizes of the datasets stored in text format. Sparsity is defined as the ratio of the number of non-zero values to the total number of values.

Compared Methods. We compare TOC with one baseline (DEN), four light-weight matrix compression schemes (CSR, CVI, DVI, and CLA), and two general compression schemes (Snappy and Gzip). A brief summary of these methods is as follows:

  1. DEN: This is the standard dense binary format for dense matrices. We store the matrix row by row and each value is encoded using IEEE-754 double format. Categorical features are encoded using the standard one-hot (dummy) encoding (garavaglia1998smart).

  2. CSR: This is the standard compressed sparse row encoding for sparse matrices. We store the matrix row by row. For each row, we only store the non-zero values and associated column indexes.

  3. CVI: This format is also known as CSR-VI (kourtis2008optimizing; elgohary2016compressed). We first encode the matrix using CSR and then encode the non-zero values via the value indexing of § 3.2.

  4. DVI: We first encode the matrix using DEN and then encode the values via the value indexing of § 3.2.

  5. CLA: This method (elgohary2016compressed) divides the matrix into column groups and compresses each column group column-wise. Note that matrix operations can be executed on compressed CLA matrices directly.

  6. Snappy: We compress the serialized bytes of DEN using Snappy.

  7. Gzip: We use Gzip to compress the serialized bytes of DEN.

Machine and System Setup. All experiments were run on Google Cloud (https://cloud.google.com/) using a typical machine with an Intel Xeon CPU, 15 GB of RAM (unless otherwise specified), and Ubuntu as the OS. We did not choose a more powerful machine because of the higher cost; for example, a machine with many more cores and 180 GB of RAM costs substantially more per month than ours (see § 5.3). Thus, our techniques can save costs for ML workloads, especially in such cloud settings.

Our techniques were implemented in C++ and compiled using g++ with the -O3 optimization flag. We also compare with four machine learning systems: ScikitLearn 0.19.1 (http://scikit-learn.org/stable/), Systemml 1.3.0 (https://systemml.apache.org/), Bismarck 2.0 (http://i.stanford.edu/hazy/victor/bismarck/), and TensorFlow (https://www.tensorflow.org/). Furthermore, we integrate TOC into Bismarck to enable a fair comparison; we put the integration details in Appendix D.1 for brevity.

5.1. Compression Ratios

Figure 5. Compression ratios of different methods on mini-batches with varying sizes.
Figure 6. Compression ratios of TOC variants on varying size mini-batches. TOC_SPARSE uses sparse encoding. TOC_SPARSE_AND_LOGICAL uses sparse and logical encoding. TOC_FULL uses all the encoding techniques.
Figure 7. Compression ratios of different methods on large mini-batches. The x-axis is the percent of rows of the whole dataset in the mini-batch.

Setup. We are not aware of a first-principles way in the literature to set mini-batch sizes (the number of rows in a mini-batch). In practice, the mini-batch size typically depends on system constraints (e.g., the number of CPUs) and is set to some number ranging from 10 to 250 (mishkin2016systematic). Thus, we tested five mini-batch sizes—50, 100, 150, 200, and 250—which cover the most common use cases. Compression ratio is defined as the uncompressed mini-batch size (encoded using DEN) over the compressed mini-batch size. We implemented DEN, CSR, CVI, and DVI ourselves but use CLA from Systemml and Gzip/Snappy from standard libraries. We tested mini-batches of the above sizes from all the real datasets.

Overall Results. Figure 5 presents the overall results. For the very sparse dataset Rcv1, CSR is the best encoding method and TOC’s performance is similar to CSR. For the very dense dataset Deep1Billion, which does not contain repeated subsequences of column values, Gzip is the best method but it only achieved a marginal 1.15x compression ratio. CSR and TOC have similar performance because of the sparse encoding.

For the other four datasets with moderate sparsity, TOC has larger compression ratios than all the other methods, except on Mnist, where TOC is inferior to Gzip. The main reason is that Mnist does not contain many repeated subsequences of column values that TOC's logical encoding can exploit; this is also verified by the ablation study in Figure 6.

Overall, TOC is suitable for datasets of moderate sparsity, which are common in enterprise ML. TOC is not suitable for very sparse datasets or very dense datasets that do not contain repeated subsequences of column values. Note that these datasets are challenging for other compression methods too. Nevertheless, one can simply test TOC on a sample mini-batch to determine whether TOC is suitable for a given dataset.

Ablation Study. We conduct an ablation study to show the effectiveness of the different components of TOC (sparse encoding, logical encoding, and physical encoding). Figure 6 shows the results: TOC_SPARSE_AND_LOGICAL compresses better than TOC_SPARSE, and TOC_FULL, with all the encoding techniques, compresses best. This shows the effectiveness of all our encoding components.

Large Mini-batches. We compare different compression methods on large mini-batches. Figure 7 shows the results. As the mini-batch size grows, TOC becomes more competitive. When the percent of rows of the whole dataset in the mini-batch is 1.0, this is essentially batch gradient descent (BGD) and TOC has the best compression ratio in this case. This shows the potential of applying TOC to BGD related workloads.

5.2. Matrix Operation Runtimes

Figure 8. Average runtimes (5 runs) and 95% confidence intervals for executing different matrix operations on compressed mini-batches. From top to bottom are different matrix operations, where c is a scalar, x is an uncompressed vector, Y is an uncompressed matrix, and A is the compressed matrix. From left to right are different datasets.

Setup. We tested three classes of matrix operations: a sparse-safe element-wise operation (cA), left multiplication operations (xA and YA), and right multiplication operations (Ax and AY), where c is a scalar, x is an uncompressed vector, Y is an uncompressed matrix, and A is the compressed mini-batch. We set the mini-batch size to 250 (results for other mini-batch sizes are similar). Figure 8 presents the results.

Sparse-safe Element-wise Operations (cA). In general, DVI, CVI, and TOC are the fastest methods. This shows the effectiveness of value indexing (kourtis2008optimizing), which is used by all these methods. It is noteworthy that TOC can be four orders of magnitude faster than Gzip and Snappy (e.g., on Imagenet). The slowness of these general compression schemes is caused by their significant decompression overheads.

Right Multiplication Operations (Ax and AY). For Ax, CSR and DEN are the best methods for Rcv1 and Deep1Billion, respectively, due to their extreme sparsity and density. For the remaining datasets of moderate sparsity, DEN, CSR, CVI, DVI, and TOC are the fastest methods; CLA and general compression schemes like Snappy and Gzip are much slower. We do see that TOC is 2-3x slower than CSR on Imagenet and Mnist. There are two main reasons for this. First, building the prefix tree in TOC takes extra time. Second, TOC's compression ratios over CSR's are relatively small on these datasets, which makes the computational redundancy that TOC can exploit correspondingly small.

For AY, we set the number of columns of Y to 20. TOC is consistently the fastest on all the datasets except for Rcv1 and Deep1Billion, again due to their extreme sparsity and density. CLA in Systemml does not support this operation yet, so CLA is excluded.

Left Multiplication Operations (xA and YA). The results for the left multiplication operations are similar to those for the right multiplication operations, so we omit them for brevity.

Summary. Overall, TOC achieves the best runtime performance on cA, AY, and YA. TOC can be 2-3x slower than the fastest method on Ax and xA. However, as we show in § 5.3, this has a negligible effect on the overall ML training time.

5.3. End-to-End MGD Runtimes

In this subsection, we discuss the end-to-end MGD runtime performance with different compression schemes.

Compared Methods. We compare TOC with DEN, CSR, CVI, DVI, Snappy, and Gzip in our C++ implementation. We also integrate TOC into Bismarck and compare it with DEN and CSR as implemented in Bismarck, ScikitLearn, and TensorFlow. These are denoted as the ML system name plus the data format, e.g., BismarckTOC, ScikitLearnDEN, and TensorFlowCSR.
Machine Learning Models.

We choose three ML models: Logistic regression (LR), Support vector machine (SVM), and Neural network (NN). LR, SVM, and NN use the standard logistic, hinge, and cross-entropy loss, respectively. For LR and SVM, we use the standard one-versus-rest technique for multi-class classification. Our NN has a feed-forward structure with two hidden layers of 200 and 50 neurons using the sigmoid activation function; the output layer activation is sigmoid for binary outputs and softmax for multi-class outputs. For Mnist, the output has 10 classes, while all the other datasets have binary outputs.


MGD Training. We use MGD to train the ML models. Each dataset is divided into mini-batches of 250 rows encoded with the different methods. For the sake of simplicity, we run MGD for a fixed 10 epochs; the results with more sophisticated termination conditions are similar. In each epoch, we visit all the mini-batches and update the ML models using each mini-batch. For SVM and LR, we train sequentially. For NN, we use the classical approach (dean2012large) to train the network in parallel. The end-to-end MGD runtimes include all epochs of training time but do NOT include the compression time, because in practice compression is a one-time cost that is typically amortized across different ML models.

Dataset Generation. We use the same technique reported in (elgohary2016compressed) to generate scaled real datasets, e.g., Imagenet1m (1 million rows) and Mnist25m (25 million rows).

Summary of Results. Table 6 presents the overall results on the Imagenet and Mnist datasets. We put the results on the remaining datasets in Appendix D.2 for brevity.

On Imagenet1m and Mnist1m, mini-batches encoded using all the methods fit into memory. In this case, CVI and TOC are the fastest methods. General compression schemes like Snappy and Gzip are much slower due to their significant decompression overheads. It is interesting to see that TOC is even faster than CVI for LR and SVM on Imagenet1m, despite the fact that TOC's matrix operations are slower on Imagenet1m and all the data fit into memory. The reason is that TOC reduces the initial IO time because of its better compression ratio: on Imagenet1m, TOC takes 10 seconds to read the data while CVI takes 36 seconds. On Mnist1m, CVI is faster than TOC for LR and SVM because we need to train ten LR/SVM models, so more matrix operations are involved.

On Imagenet25m and Mnist25m, only mini-batches encoded using Snappy, Gzip, and TOC fit into memory. In this case, TOC is up to 1.4x/5.6x/4.8x faster than the state-of-the-art methods for NN/LR/SVM respectively. The speed-up of TOC for LR/SVM is larger on Imagenet25m than Mnist25m, as Mnist25m has ten output classes and we train ten models so there are more matrix operations involved.

Methods Imagenet1m Imagenet25m Mnist1m Mnist25m
NN LR SVM NN LR SVM NN LR SVM NN LR SVM
TOC (ours) 12.3 0.7 0.7 249 13 13 9.0 2.1 2.1 182 52 54
DEN 14.6 3.9 3.8 666 374 360 15.8 7.9 7.8 708 526 545
CSR 12.7 2.1 2.1 428 199 187 10.8 1.6 1.6 346 156 155
CVI 12.5 1.0 1.1 323 98 83 9.6 1.4 1.4 250 92 91.6
DVI 13.0 1.2 1.2 311 73.1 63 14.5 6.2 6.4 385 224 226
Snappy 14.8 3.9 4.0 348 126 127 15.8 8.5 8.4 363 210 213
Gzip 20.8 11.7 12.5 463 247 255 20.5 12.6 12.9 393 238 243
BismarckTOC 12.6 0.76 0.77 264 13.8 13.7 10.3 2.2 2.2 198 54 57
BismarckDEN N/A 3.5 3.2 N/A 309 310 N/A 7.2 7.1 N/A 428 421
BismarckCSR N/A 2.4 2.2 N/A 141 134 N/A 1.8 1.7 N/A 114 110
ScikitLearnDEN 14.7 4 3.6 633 454 456 14.8 8.1 7.2 638 536 488
ScikitLearnCSR 42.7 2.4 2.2 1003 332 334 32.9 4.4 3.3 865 303 284
TensorFlowDEN 11.2 3.6 3.4 550 426 439 10.9 4.4 4.2 537 439 427
TensorFlowCSR 18.4 4.4 4.3 601 373 359 14.8 6.7 6.5 512 372 341
Table 6. End-to-end MGD runtimes (in minutes) for training machine learning models: Neural network (NN), Logistic regression (LR), and Support vector machine (SVM) on datasets Imagenet and Mnist. Imagenet1m and Imagenet25m are 7GB and 170GB respectively, while Mnist1m and Mnist25m are 6GB and 150GB respectively.

More Dataset Sizes. We also study the MGD runtime over more different dataset sizes. Figure 9 presents the results. In general, TOC remains the fastest method among all the settings we have tested. When the dataset is small, CSR, CVI, and DVI have comparable performance to TOC because all the data fit into memory. When the dataset is large, TOC is faster than other methods because only TOC, Gzip, and Snappy data fit into memory and TOC avoids the decompression. The speed-up of TOC is larger on Logistic regression than on Neural network because there are more matrix operations involved in training Neural network.

Figure 9. End-to-end MGD runtimes of ML training.

Ablation Study. We conduct an ablation study to verify whether the components in Figure 3 actually matter for TOC’s performance in reducing MGD runtimes. Specifically, we compare three variants of TOC: TOC_SPARSE (sparse encoding), TOC_SPARSE_AND_LOGICAL (sparse and logical encoding), and TOC_FULL (all the encoding techniques). Figure 10 presents the results. With more encoding techniques used, TOC’s performance becomes better, which shows the effectiveness of all our encoding components.

Figure 10. Ablation study of TOC for MGD runtimes.

Comparisons with Popular Machine Learning Systems. Table 6 also includes the MGD runtimes of DEN and CSR in Bismarck, ScikitLearn, and TensorFlow. We slightly modified our TensorFlow and ScikitLearn code so that they can do disk-based learning when the dataset does not fit into memory. The table also includes BismarckTOC, which typically has less than 10 percent overhead compared with running TOC in our C++ program; this overhead is caused by the storage overhead (fudge factor) of the database pages, which leads to slightly larger disk IO time. On Imagenet1m and Mnist1m, BismarckTOC is comparable with the best method in these systems (TensorFlowDEN) for NN but up to 3.2x and 2.9x faster than the best methods in these systems for LR and SVM, respectively. On Imagenet25m and Mnist25m, BismarckTOC is up to 2.6x/10.2x/9.8x faster than the best methods in the other systems for NN/LR/SVM, respectively, because only the TOC data fit into memory. Thus, integrating TOC into these ML systems can greatly benefit their MGD performance.
Accuracy Study. We also plot the error rate of the neural network and logistic regression as a function of time on Mnist. The goal is to compare the convergence rate of BismarckTOC with other standard tools such as ScikitLearn and TensorFlow. For Mnist1m (6GB) and Mnist25m (150GB), we train 30 epochs and 10 epochs, respectively. Figure 11 presents the results. On Mnist1m with a 15GB RAM machine, BismarckTOC and TensorFlowDEN finished training at roughly the same time; this supports our claim that BismarckTOC has performance comparable to a state-of-the-art ML system when the data fit into memory. On Mnist25m with a 15GB RAM machine, BismarckTOC finished training much faster than the other ML systems because only the TOC data fit into memory. We also used a machine with 180GB RAM on Mnist25m, which did not change BismarckTOC's performance but boosted the performance of TensorFlow and ScikitLearn to be comparable with BismarckTOC, as all the data then fit into memory. However, renting a 180GB RAM machine is much more expensive than renting a 15GB RAM machine. Thus, we believe BismarckTOC can significantly reduce users' cloud cost.

Figure 11. Test error rate on Mnist dataset as a function of time on different systems.

5.4. Compression and Decompression Runtimes

We measured the compression and decompression time of Snappy, Gzip, and TOC on mini-batches with 250 rows. The results are similar for other mini-batch sizes. Figure 12 presents the results. TOC is faster than Gzip but slower than Snappy for compression. However, TOC is faster than both Gzip and Snappy for decompression.

Figure 12. Left: Compression time of Snappy, Gzip, and TOC on a mini-batch with 250 rows. Right: Decompression time of Snappy, Gzip, and TOC on a mini-batch with 250 rows.
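As a reference point for reproducing the Gzip and Snappy measurements, the following is a minimal Python timing sketch. It uses the standard gzip module and the python-snappy package; the random 250-row mini-batch is only a placeholder for our data, and TOC is not shown since it is not part of these libraries.

import gzip
import time
import numpy as np
import snappy  # provided by the python-snappy package

# Placeholder mini-batch: 250 rows of dense float features, serialized to bytes.
minibatch = np.random.rand(250, 1000).astype(np.float32).tobytes()

def timed(fn, *args):
    # Run fn(*args) once and return (result, elapsed seconds).
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

gz_blob, gz_ct = timed(gzip.compress, minibatch)
_, gz_dt = timed(gzip.decompress, gz_blob)
sn_blob, sn_ct = timed(snappy.compress, minibatch)
_, sn_dt = timed(snappy.decompress, sn_blob)

print("Gzip:   compress %.4fs  decompress %.4fs" % (gz_ct, gz_dt))
print("Snappy: compress %.4fs  decompress %.4fs" % (sn_ct, sn_dt))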

6. Discussion

Advanced Neural Networks. It is possible to apply TOC to more advanced neural networks such as convolutional neural networks on images. One just needs to apply the common image-to-column (im2col) (lai2018cmsis) operation, which replicates the pixels of each sliding window as a matrix column. This way, the convolution operation can be expressed as a matrix multiplication over the replicated matrix. The replicated matrix can be compressed by TOC, and we expect even higher compression ratios due to the data replication.
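To make this concrete, here is a minimal sketch (in Python with NumPy, separate from our implementation) of how image-to-column replication turns a stride-1, no-padding 2D convolution into a matrix multiplication. The function names im2col and conv2d_as_matmul are illustrative; the replicated matrix cols is what TOC would compress.

import numpy as np

def im2col(image, kh, kw):
    # Replicate each kh x kw sliding window of a 2D image as one matrix column.
    h, w = image.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = np.empty((kh * kw, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Neighboring windows overlap, so pixel values are replicated
            # across columns -- the redundancy a compressor can exploit.
            cols[:, i * out_w + j] = image[i:i + kh, j:j + kw].ravel()
    return cols

def conv2d_as_matmul(image, kernel):
    # Stride-1, no-padding convolution (ML-style, i.e., cross-correlation)
    # expressed as a vector-matrix product over the replicated matrix.
    kh, kw = kernel.shape
    cols = im2col(image, kh, kw)
    out = kernel.ravel() @ cols
    return out.reshape(image.shape[0] - kh + 1, image.shape[1] - kw + 1)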

7. Related Work

Data Compression for Analytics. There is a long line of research (abadi2006integrating; raman2013db2; li2013bitweaving; wesley2014leveraging; elgohary2016compressed; wang2017experimental) on integrating data compression into databases and running relational query processing workloads on the compressed data. TOC is orthogonal to these works since it targets a different workload: mini-batch stochastic gradient descent for machine learning training.

Machine Learning Analytics Systems. There are a number of systems for machine learning workloads, e.g., MLlib (meng2016mllib), MADlib (hellerstein2012madlib), SystemML (boehm2014hybrid; elgohary2016compressed), Bismarck (bismarck), SimSQL (cai2013simulation), ScikitLearn (scikit-learn), MLbase (kraska2013mlbase), and TensorFlow (abadi2016tensorflow). Our work takes an algorithmic perspective and is complementary to these systems: integrating TOC into them can greatly benefit their ML training performance.

Compressed Linear Algebra (CLA). CLA (elgohary2016compressed) compresses the whole dataset and targets batch-gradient-style methods such as vanilla BGD, L-BFGS, and conjugate gradient, while TOC focuses on MGD. Furthermore, CLA needs to store an explicit dictionary. When CLA is applied to BGD, there are many references to the dictionary entries, so the dictionary cost is amortized. On a small mini-batch, there are far fewer references to the dictionary entries, so the cost of the explicit dictionary makes CLA’s compression ratio less attractive. In contrast, TOC is adapted from LZW and does not store an explicit dictionary, so it achieves good compression ratios even on small mini-batches.

Factorized Learning. Factorized machine learning techniques (orion; olteanuf; kumar2016join; chen2017towards) push machine learning computations through joins and avoid the schema-based redundancy in denormalized datasets. These techniques need a schema to define the static redundancies in the denormalized datasets, whereas TOC finds redundancies in the data automatically, without a schema. Furthermore, factorized learning techniques target BGD, while TOC focuses on MGD.

8. Conclusion and Future Work

Mini-batch stochastic gradient descent (MGD) is a workhorse algorithm of modern ML. In this paper, we propose a lossless data compression scheme called tuple-oriented compression (TOC) to reduce memory/storage footprints and runtimes for MGD. TOC follows a design principle that tailors the compression scheme to the data access pattern of MGD: it preserves row/column boundaries in mini-batches and adapts the execution of matrix operations to the compression scheme as much as possible. This enables TOC to attain both good compression ratios and decompression-free execution of the matrix operations used by MGD. There are several interesting directions for future work, including identifying more workloads that can execute directly on TOC outputs and investigating the common structure shared by such adaptable workloads and compression schemes.

Acknowledgments

We thank all the anonymous reviewers. This work was partially supported by a gift from Google.

References

Appendix A Proofs of Theorems

A.1. Theorem 4.1

Proof.

Without loss of generality, we use a specific row in the proof. First, we substitute with sequences stored in the prefix tree , then

(9)

Plugging Equation 4 into Equation 9, we get Equation 5. Following the definition of the sequence of a tree node, we immediately get Equation 6. ∎

A.2. Theorem 4.2

Proof.

We substitute with sequences stored in

(10)

Merging the terms in Equation 10 that have the same sequences, we obtain

(11)

Plugging Equation 7 into Equation 11, we get Equation 8. ∎

Appendix B More Algorithms

B.1. Right Multiplication

We present how to compute where is an uncompressed matrix and is a compressed matrix. This is an extension of right multiplication with vector in § 4.3.

Theorem B.1 ().

Let , , D be the output of TOC on , be the prefix tree built for decoding, be the sequence of the tree node defined in § 3.1.1, be the key of the tree node defined in § 4.1.2, and be the parent index of the tree node defined in § 4.1.2. Note that and are both sparse representations of vectors (i.e., and ). Define function to be

(12)

Then, we have

(13)
Proof.

Without loss of generality, we use a specific row in the proof. First, we substitute with sequences stored in prefix tree , then

(14)

Plugging Equation 12 into Equation 14, we get Equation 13. ∎

Algorithm 7 shows the details. First, we scan the prefix tree, as in right multiplication with a vector, and use H[,:] to remember the computed value for each tree node.

Second, we scan D to compute the result stored in R. For the th column of the result R and each code D[][], we simply add H[D[][]][] to R[][]. Because H[D[][]][] is a random access into H, we make the loop over the columns the innermost loop so that we scan D only once and get better cache performance.

1:function MatrixTimesUncompressedMatrix(D, I, )
2:     inputs: column_index:value pairs in the first layer of I, encoded table D, and uncompressed matrix
3:     outputs: the result of in R
4:      BuildPrefixTree(I, D)
5:     H initialize as a zero matrix
6:     for  = 1 to len()-1 do scan to compute H
7:         for  = 0 to num_of_columns()-1 do
8:              H[]               
9:     R initialize as a zero matrix
10:     for  = 0 to len(D)-1 do scan D to compute R
11:         for  = 0 to len(D[,:])-1 do
12:              for  = 0 to num_of_columns()-1 do
13:                                               
14:     return(R)
Algorithm 7 Execute on the TOC output.
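To complement the pseudocode, the following is a minimal Python/NumPy sketch of the same two-pass idea under an assumed, simplified representation of the TOC output: D is a list of encoded rows, each a list of tree-node indices, and tree[t] = (parent_index, column_index, value) with an empty root at index 0 and parents appearing before children (as in an LZW-style tree built incrementally). The names toc_times_dense, D, tree, and W are illustrative, not our actual API.

import numpy as np

def toc_times_dense(D, tree, W):
    # Pass 1: H[t, :] = (sequence of tree node t) multiplied by W, computed
    # incrementally from the parent, since node t's sequence is its parent's
    # sequence plus one extra (column, value) entry.
    n_cols = W.shape[1]
    H = np.zeros((len(tree), n_cols))
    for t in range(1, len(tree)):          # node 0 is the empty root, so H[0] stays zero
        parent, col, val = tree[t]
        H[t, :] = H[parent, :] + val * W[col, :]
    # Pass 2: each encoded row is a concatenation of node sequences, so its
    # product with W is the sum of the corresponding rows of H.
    R = np.zeros((len(D), n_cols))
    for i, codes in enumerate(D):
        for c in codes:
            R[i, :] += H[c, :]
    return R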

B.2. Left Multiplication

We discuss how to compute where is an uncompressed matrix and is a compressed matrix. This is an extension of left multiplication with vector in § 4.4.

Theorem B.2 ().

Let , , D be the output of TOC on , be the prefix tree built for decoding, .seq be the sequence of the tree node defined in § 3.1.1, be the key of the tree node defined in § 4.1.2, and be the parent index of the tree node defined in § 4.1.2. Note that and are both sparse representations of vectors (i.e., and ). Define function to be

(15)

Then, we have

(16)
Proof.

We substitute with sequences stored in

(17)

Merging the terms in Equation 17 that have the same sequences, we obtain

(18)

Plugging Equation 15 into Equation 18, we get Equation 16. ∎

Algorithm 8 shows the details. First, we scan D as in left multiplication with a vector. Specifically, we initialize H as a zero matrix and then, for each code we encounter, add the corresponding column of the uncompressed matrix to the code’s entry in H. Note that H is stored in a transposed (node-major) manner so that we scan D only once while keeping good cache performance.

Second, we scan the prefix tree backwards to compute the result stored in R. Specifically, for each tree node, we add .key * its accumulated entry of H to the corresponding row of the result R[i,:], and then add its entry of H to that of .parent.

1:function UncompressedMatrixTimesMatrix(D, I, )
2:     inputs: column_index:value pairs in the first layer of I, encoded table D, and uncompressed matrix
3:     outputs: the result of in R
4:      BuildPrefixTree(I, D)
5:     H initialize as a zero matrix
6:     for  = 0 to len(D)-1 do scan D to compute H
7:         for  = 0 to len(D[,:]) -1 do
8:              for  = 0 to num_of_rows() -1 do
9:                                               
10:     R initialize as a zero matrix
11:     for  = len() -1 to 1 do scan to compute R
12:         for  = 0 to num_of_rows() -1 do
13:              .key *
14:              Add to .parent               
15:     return(R)
Algorithm 8 Execute on the TOC output.
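Under the same assumed representation as the right-multiplication sketch above, a minimal Python/NumPy version of this two-pass procedure looks as follows; dense_times_toc and n_features are illustrative names, not our actual API.

import numpy as np

def dense_times_toc(W, D, tree, n_features):
    # Pass 1: scan D once; H[t, :] accumulates the sum of W's columns over all
    # encoded rows in which tree node t appears (H is kept node-major for locality).
    n_rows = W.shape[0]
    H = np.zeros((len(tree), n_rows))
    for i, codes in enumerate(D):
        for c in codes:
            H[c, :] += W[:, i]
    # Pass 2: walk the tree backwards (children before parents). Each node adds
    # the contribution of its own (column, value) entry, then pushes its
    # accumulated weight to its parent, which covers the shared sequence prefix.
    R = np.zeros((n_rows, n_features))
    for t in range(len(tree) - 1, 0, -1):
        parent, col, val = tree[t]
        R[:, col] += val * H[t, :]
        H[parent, :] += H[t, :]
    return R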

Appendix C More Time Complexity Analysis

Executing and requires building , scanning , and scanning D. As shown in Algorithm 2, building has complexity and . When scanning and , each element needs r operations for / , respectively. Thus, the time complexities for and are and , respectively.

Appendix D More Experiments

D.1. Integrating TOC into Bismarck

We integrated TOC into Bismarck and replaced its existing matrix kernels. The integration has three key parts. First, we allocate an arena in Bismarck’s shared memory for storing the ML models. Second, we replace the existing Bismarck matrix kernel with the TOC matrix kernel for updating the ML models. Third, we use a database table to store the TOC-compressed mini-batches; the serialized bytes of each compressed mini-batch are stored as a variable-length bytes field in a row. Finally, we modify the ML training UDF to read the compressed mini-batch from the table and use the new matrix kernel to update the ML model in the arena.
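For illustration only, the sketch below mimics this storage layout using Python’s sqlite3 and pickle as stand-ins for the PostgreSQL table and the actual TOC serializer; the table name toc_minibatches and the placeholder payload are hypothetical, not Bismarck’s real schema.

import pickle
import sqlite3

# Stand-in table: one TOC-compressed mini-batch per row, stored as a
# variable-length bytes (BLOB) field.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toc_minibatches (batch_id INTEGER PRIMARY KEY, payload BLOB)")

# Placeholder object standing in for one serialized compressed mini-batch.
fake_batch = {"codes": [[1, 2], [3]], "tree": [(0, 0, 0.0)]}
conn.execute("INSERT INTO toc_minibatches VALUES (?, ?)", (0, pickle.dumps(fake_batch)))

# In the real integration, the training UDF reads each compressed mini-batch,
# deserializes it, and calls the TOC matrix kernel to update the model in the
# shared-memory arena; the kernel call is omitted in this sketch.
for batch_id, payload in conn.execute(
        "SELECT batch_id, payload FROM toc_minibatches ORDER BY batch_id"):
    minibatch = pickle.loads(payload)
    print(batch_id, len(payload), "bytes")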

Methods          | Census15m           | Census290m          | Kdd7m               | Kdd200m
                 | NN    LR    SVM     | NN    LR    SVM     | NN    LR    SVM     | NN    LR    SVM
TOC (ours)       | 35    0.8   0.7     | 702   16    14      | 16.1  0.2   0.2     | 323   6.1   5.9
DEN              | 39    4.0   4.0     | 1108  253   251     | 29    4.6   4.4     | 1003  608   615
CSR              | 38    1.8   1.8     | 942   161   167     | 19.2  0.4   0.4     | 438   56    53
CVI              | 37    1.1   1.0     | 844   80    67      | 18.5  0.3   0.3     | 422   31    30
DVI              | 38    1.2   1.1     | 800   46    43      | 28.4  1.2   1.1     | 611   71    71
Snappy           | 41    4.7   4.6     | 905   121   115     | 27.2  3.5   3.5     | 616   127   128
Gzip             | 46    11.1  11.1    | 965   244   241     | 33.5  7.5   7.5     | 683   235   235
BismarckTOC      | 38    0.87  0.88    | 742   17.4  14.8    | 16.8  0.3   0.31    | 329   6.4   6.3
BismarckDEN      | N/A   4.2   4.3     | N/A   321   310     | N/A   4.0   3.8     | N/A   645   644
BismarckCSR      | N/A   3.2   3.2     | N/A   222   234     | N/A   0.9   0.9     | N/A   114   115
ScikitLearnDEN   | 73.2  7.3   6.6     | 1715  604   580     | 42    5     4.6     | 1797  771   772
ScikitLearnCSR   | 105.1 5.7   5.1     | 2543  421   408.8   | 44    1.7   1.5     | 1476  166   160
TensorFlowDEN    | 38.1  9.4   10.5    | 1073  638   610     | 21.4  5.5   5.1     | 1199  781   779
TensorFlowCSR    | 54.7  15.1  14.0    | 1244  681   661     | 15.2  4.1   4.4     | 577   300   274
Table 7. End-to-end MGD runtimes (in minutes) for training machine learning models: Neural network (NN), Logistic regression (LR), and Support vector machine (SVM) on datasets Census and Kdd99. Census15m and Census290m are 7GB and 140GB respectively, while Kdd7m and Kdd200m are 7GB and 200GB respectively.

D.2. End-to-End MGD Runtimes

MGD runtimes on Census and Kdd99 are reported in Table 7. Overall, the results are similar to those presented in § 5.3. On small datasets such as Census15m and Kdd7m, TOC has performance comparable to the other methods. On large datasets such as Census290m and Kdd200m, TOC is up to 1.8x/17.8x/18.3x faster than the state-of-the-art compression schemes for NN/LR/SVM, respectively. We omit the results for the Rcv1 and Deep1Billion datasets because their extreme sparsity/density means we do not expect TOC to perform better on them.