Universal and Succinct Source Coding of Deep Neural Networks

04/09/2018, by Sourya Basu et al., University of Illinois at Urbana-Champaign

Deep neural networks have shown incredible performance for inference tasks in a variety of domains. Unfortunately, most current deep networks are enormous cloud-based structures that require significant storage space, which limits scaling of deep learning as a service (DLaaS) and use for on-device augmented intelligence. This paper is concerned with finding universal lossless compressed representations of deep feedforward networks with synaptic weights drawn from discrete sets, and directly performing inference without full decompression. The basic insight that allows less rate than naive approaches is the recognition that the bipartite graph layers of feedforward networks have a kind of permutation invariance to the labeling of nodes, in terms of inferential operation. We provide efficient algorithms to dissipate this irrelevant uncertainty and then use arithmetic coding to nearly achieve the entropy bound in a universal manner. We also provide experimental results of our approach on the MNIST dataset.

I Introduction

Deep learning has achieved incredible performance for inference tasks such as speech recognition, image recognition, and natural language processing. Most current deep neural networks, however, are enormous cloud-based structures that are too large and too complex to perform fast, energy-efficient inference on device. Even in providing personalized deep learning as a service (DLaaS), each customer for an application like bank fraud detection may require a different trained network, but scaling to millions of stored networks is not possible even in the cloud. Compression, with the capability of providing inference without full decompression, is important. We develop new universal source coding techniques for feedforward deep networks having synaptic weights drawn from finite sets that essentially achieve the entropy lower bound, which we also compute. Further, we provide an algorithm to use these compressed representations for inference tasks without complete decompression. Structures that can represent information near the entropy bound while also allowing efficient operations on them are called succinct structures [2, 3, 4, 5]. Thus, we provide a succinct structure for feedforward neural networks, which may fit on-device and may enable scaling of DLaaS in the cloud.

Over the past couple of years, there has been growing interest in compact representations of neural networks [6, 7, 8, 9, 10, 11, 12, 13, 14], largely focused on lossy representations; see [15] for a recent survey of developed techniques including pruning, pooling, and factoring. These works largely lack strong information-theoretic foundations and may discretize real-valued weights through simple uniform quantization, perhaps followed by independent entropy coding applied to each. It is worth noting that binary-valued neural networks (having only a network structure [16] rather than trained synaptic weights) can often achieve high-fidelity inference [17, 18] and that there is a view in neuroscience that biological synapses may be discrete-valued [19].

Taking advantage of certain invariances in the structure of neural networks (previously unrecognized, e.g. [20]) in performing lossless entropy coding, however, can lead to rate reductions on top of the lossy representation techniques that have been developed [15]. In particular, the layers of feedforward deep networks past the input layer are unlabeled bipartite graphs where node labeling is irrelevant, much like for nonsequential data [21, 22, 23]. By dissipating the uncertainty in this invariance, lossless coding can compress more than universal graph compression for labeled graphs [24], essentially a gain of bits for networks with nodes.

The remainder of the paper first develops the entropy limits, once the appropriate invariances are recognized. Then it designs an appropriate “sorting” of synaptic weights to put them into a canonical order where irrelevant uncertainty beyond the invariants is removed. Finally, arithmetic coding is used to represent the weights [25, 26]. The coding algorithm essentially achieves the entropy bound. Further, we provide an efficient inference algorithm that uses the compressed form of the feedforward neural network to calculate its output without completely decoding it, taking additional dynamic space for a network with nodes in the layer with the maximum number of nodes. We also provide experimental results of our compression and inference algorithms on a feedforward neural network trained to perform a classification task on the MNIST dataset. A preliminary version of this work only dealt with universal compression and not succinctness [1].

II Feedforward Neural Network Structure

Neural networks are composed of nodes connected by directed edges. Feedforward networks have connections in one direction, arranged in layers. An edge from one node to another propagates an activation value between them, and each edge has a synaptic weight that determines the sign/strength of the connection. Each node computes an activation function applied to the weighted sum of its inputs, which we can note is a permutation-invariant function of the (weight, input) pairs for any permutation of the presynaptic nodes. Nodes in the second layer are indistinguishable.
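As a concrete illustration of this invariance, the following minimal Python sketch (with hypothetical weights, inputs, and a ReLU activation chosen purely for illustration) checks that a node's output is unchanged when its presynaptic nodes are relabeled, provided weights and activations are permuted together.

import random

def node_output(weights, inputs, activation=lambda z: max(z, 0.0)):
    """Activation applied to the weighted sum of the node's inputs."""
    return activation(sum(w * x for w, x in zip(weights, inputs)))

weights = [0.5, -1.2, 0.8, 0.0, 2.0]   # hypothetical synaptic weights
inputs  = [1.0,  0.3, -0.7, 0.9, 0.2]  # hypothetical presynaptic activations

perm = list(range(len(weights)))
random.shuffle(perm)                    # relabel the presynaptic nodes

permuted_weights = [weights[i] for i in perm]
permuted_inputs  = [inputs[i]  for i in perm]

# The node cannot tell the difference: same output either way.
assert abs(node_output(weights, inputs)
           - node_output(permuted_weights, permuted_inputs)) < 1e-12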

Consider a multi-layer feedforward neural network with each layer having the same number of nodes (for notational convenience), such that nodes in the first layer are labeled and all nodes in each of the remaining layers are indistinguishable from each other (when edges are ignored). Suppose each edge may take one of finitely many possible colors (corresponding to synaptic weights), and that the connection from each node in a layer to any given node in the next layer takes each color independently with a fixed probability, one of these colors representing the absence of an edge. The goal is to universally find an efficient representation of this neural network structure, first considering two substructures that comprise it and later considering it as a whole. Later, we will consider the problem of inference without the need to decode.

Let us consider two substructures: partially-labeled bipartite graphs and unlabeled bipartite graphs, see Fig. 1. A partially-labeled bipartite graph consists of two sets of vertices, one containing labeled vertices and the other containing unlabeled vertices. For any pair of vertices with one vertex from each set, there is a connecting edge of a given color with a corresponding probability, one of these probabilities being the probability that the two nodes are disconnected. Multiple edges between nodes are not allowed. Unlabeled bipartite graphs are a variation of partially-labeled bipartite graphs where both sets consist of unlabeled vertices; for simplicity, in the sequel we assume there is only a single edge color and that any two nodes from the two different sets are connected with a fixed probability.

To construct the multi-layer neural network from the two substructures, one can think of it as made of partially-labeled bipartite graphs for the first and last layers and a cascade of unlabeled bipartite graphs for the remaining layers. An alternate construction, however, may be more insightful: the first two layers are still a partially-labeled bipartite graph, but then each time the nodes of an unlabeled layer are connected, we treat it as a labeled layer, based on its connections to the previous labeled layer (i.e. we can label the unlabeled nodes based on the nodes of the previous layer they are connected to), and iteratively complete the multi-layer neural network.
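The following sketch mimics this iterative relabeling under the assumption (purely illustrative, not the authors' implementation) that each layer is stored as an integer weight matrix whose rows index the later layer: hidden nodes are put into a canonical order determined by their connections to the previous layer, and the matching column permutation is applied to the next layer's matrix, which leaves the network function unchanged.

import numpy as np

def relabel_hidden_layers(layers):
    """layers[l]: weight matrix from layer l to layer l+1,
    shape (#nodes in layer l+1, #nodes in layer l).
    Hidden layers are put in a canonical order determined by their
    connections to the previous (already ordered) layer; the matching
    column permutation is applied to the next matrix, so the overall
    input-output map is preserved."""
    layers = [np.array(W) for W in layers]
    for l in range(len(layers) - 1):             # last layer stays labeled
        order = np.lexsort(layers[l].T[::-1])    # sort rows lexicographically
        layers[l] = layers[l][order]             # relabel hidden nodes
        layers[l + 1] = layers[l + 1][:, order]  # keep the function intact
    return layers

# sanity check on a random 4-layer binary network with identity activations
rng = np.random.default_rng(0)
Ws = [rng.integers(0, 2, size=(5, 5)) for _ in range(3)]
x = rng.integers(0, 2, size=5)
Vs = relabel_hidden_layers(Ws)
out = lambda mats: mats[2] @ (mats[1] @ (mats[0] @ x))
assert np.array_equal(out(Ws), out(Vs))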

Fig. 1: (a) Partially-labeled bipartite graph with colored edges, where a pair of vertices (one from each set) that is not connected in the figure carries an edge of the color denoting disconnection. (b) Unlabeled bipartite graph.

III Representing Partially-Labeled Bipartite Graphs

Consider a matrix representing the edges in a partially-labeled bipartite graph, such that each row represents an unlabeled node and each column represents a labeled node. A non-zero matrix element indicates there is an edge of the corresponding color between the two nodes, whereas a zero indicates they are disconnected. Observe that if the order of the rows of this matrix is randomly permuted (preserving the order of the columns), then the corresponding bipartite graph remains the same. That is, to represent the matrix, the order of rows does not matter. Hence the matrix can be viewed as a multiset of vectors, where each vector corresponds to a row of the matrix. Using these facts, we calculate the entropy of a partially-labeled bipartite graph. Our proofs for the entropy of random bipartite graphs follow that of [24] for the entropy of random graphs.
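The multiset view can be made concrete with a small sketch (illustrative names, not the authors' code): two adjacency matrices that differ only by a row permutation describe the same partially-labeled bipartite graph, so sorting the rows yields a canonical representative that discards exactly the irrelevant row order.

import numpy as np

def canonical_form(adj):
    """Rows = unlabeled nodes, columns = labeled nodes; entries are edge
    colors, with 0 meaning 'no edge'.  Sorting the rows lexicographically
    keeps the multiset of row vectors and drops the irrelevant row order."""
    adj = np.asarray(adj)
    return adj[np.lexsort(adj.T[::-1])]

rng = np.random.default_rng(1)
A = rng.integers(0, 3, size=(6, 4))   # 2 edge colors plus 'no edge'
B = A[rng.permutation(6)]             # same graph, rows relabeled
assert np.array_equal(canonical_form(A), canonical_form(B))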

Theorem 1.

For large , and for all satisfying and , the entropy of a partially-labeled bipartite graph, with each set containing vertices and binary colored edges is , where , and the notation means .

Proof:

Consider a random bipartite graph model , where graphs are randomly generated on two sets of vertices, and , having labeled vertices each, with edges chosen independently between any two vertices belonging to different sets with probability . Then, for a graph with edges,

Now, consider a partially-labeled random bipartite graph model which is formed in the same way as a random bipartite graph, except that the vertices in the set are unlabeled. Thus, for each , which represents a partially-labeled structure of a bipartite graph, there can exist a number of for . We say is isomorphic to if can be formed by making all the vertices in set of unlabeled, keeping all the edge connections the same. If the set of all bipartite graphs isomorphic to partially-labeled bipartite graphs is represented by , then,

The automorphism of a graph, for , is defined as an adjacency-preserving permutation of the vertices of a graph. Considering only the permutations of vertices in the set , we have a total of permutations. Given that each partially-labeled graph corresponds to number of bipartite graphs, and each bipartite graph corresponds to (which is equal to) number of adjacency-preserving permutations of vertices in the graph, from [27, 28] one can observe that:

By definition, the entropy of a random bipartite graph, , is where . The entropy of a partially-labeled graph is:

Now [29] shows that for all satisfying the conditions in this theorem, a random graph on vertices with edges occurring between any two vertices with probability is symmetric with probability for some positive constant . We have stated and proved Lem. 17 in the Appendix to provide a similar result on symmetry of random bipartite graphs which will be used to compute its entropy.

Note that for asymmetric graphs, hence

We know that , hence . Therefore,

Hence, for any constant ,

This completes the proof. ∎

We can also provide an alternate expression for the entropy of partially-labeled graphs with possible colors that will be amenable to comparison with the rate of a universal coding scheme.

Lemma 2.

The entropy of a partially-labeled bipartite graph, with each set containing nodes and edges colored with possibilities is , where and the s are non-negative integers that sum to .

Proof:

As observed earlier, the adjacency matrix of a partially-labeled bipartite graph is nothing but a multiset of vectors. From [21], we know that the empirical frequency of all elements of a multiset completely describes it. Each cell of the vector can be filled in ways corresponding to colors or no connection (color ), hence there can be in total possible vectors. The probability of a vector with the th element having appearances is:

Here, is the probability of occurrence of each of the possible vectors. In the th vector, let the number of edges with color be . Then, . Hence, the entropy of the multiset is:

and

where represents the number of vectors having edges of color . By linearity of expectation and rearranging terms, we get:

Now,

Thus,

Next we present Alg. 1, a universal algorithm for compressing a partially-labeled bipartite graph, and its performance analysis.

1:  Encode the total number of multisets in the root node of an ()-ary tree using an integer code and initialize depth, .
2:  Form child nodes of the root node, and encode the th child node with the number , the number of vectors with th cell having the th color under the multinomial distribution. The vector follows a multinomial distribution , where represents the probability vector . Increase depth by 1.
3:  while  do
4:     for each of the nodes at the current depth do
5:        Form child nodes of the current node (say, the current node is encoded with the number ), and encode the child node of color with the number , where represents the number of vectors with the th column having color and all previous columns from to having the same colors in the same order as that of the ancestor nodes of the child node starting from the root node. Here, follows a multinomial distribution .
6:     end for
7:     increase the depth by 1.
8:  end while
Algorithm 1 Compressing a partially-labeled bipartite graph.
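As a much-simplified illustration of the tree that Alg. 1 encodes, the following sketch builds a count tree from the multiset of row vectors: the root stores the number of rows, and each node at a given depth splits its count according to the color found in the next column among the rows matching the color prefix on its root path. It omits the arithmetic coder, only creates children for colors that actually occur, and all names (build_count_tree, rows) are illustrative rather than the authors' implementation.

def build_count_tree(rows, depth=0):
    """rows: list of equal-length tuples of colors (0 = no edge), one per
    unlabeled node of the bipartite graph.  Each tree node records how many
    rows share the color prefix leading to it; children branch on the color
    found in the next column, mirroring the tree encoded by Alg. 1."""
    node = {"count": len(rows), "children": {}}
    if rows and depth < len(rows[0]):
        groups = {}
        for r in rows:
            groups.setdefault(r[depth], []).append(r)
        node["children"] = {color: build_count_tree(group, depth + 1)
                            for color, group in sorted(groups.items())}
    return node

# toy example: 5 unlabeled nodes, 3 labeled inputs, colors in {0, 1, 2}
rows = [(1, 0, 2), (1, 0, 2), (0, 2, 1), (2, 2, 0), (0, 2, 1)]
tree = build_count_tree(rows)
assert tree["count"] == 5
assert {c: child["count"] for c, child in tree["children"].items()} == {0: 2, 1: 2, 2: 1}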
Lemma 3.

If Alg. 1 takes bits to represent the partially-labeled bipartite graph, then .

Proof:

We know, for any node encoded with with the encodings of its child nodes , that is distributed as a multinomial distribution, . So, using arithmetic coding to encode all the nodes, the expected number of bits required to encode all the nodes is

(1)

Here, the summation is over all non-zero nodes of the ()-ary tree. Hence (1) can be simplified as

When the term is summed over all nodes, then all terms except those corresponding to the nodes of depth cancel, i.e. . Similarly, the term can be simplified as , since in the adjacency matrix of the graph, each cell can have colors from to with probability , and for each color , the expected number of cells having color is . Thus, we find

Since we are using an arithmetic coder, it takes at most 2 extra bits [30, Ch. 13.3]. ∎
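To make the quantity being bounded tangible, the sketch below (reusing the hypothetical build_count_tree from the sketch after Alg. 1, with made-up probabilities) sums the ideal arithmetic-coding cost of a given tree: for each non-zero node, minus log2 of the multinomial probability of its child counts given its own count. This only illustrates the per-node cost that the proof averages; it is not the authors' coder.

import math

def log2_multinomial_pmf(child_counts, probs):
    """log2 of the multinomial pmf of child_counts with cell probabilities probs."""
    n = sum(child_counts)
    log = math.lgamma(n + 1)
    for c, p in zip(child_counts, probs):
        log -= math.lgamma(c + 1)
        log += (c * math.log(p) if c else 0.0)
    return log / math.log(2)

def ideal_code_length(tree, probs):
    """Sum over all non-zero tree nodes of -log2 P(child counts | node count),
    i.e. the ideal arithmetic-coding cost of the tree from Alg. 1, ignoring
    the integer code for the root and the constant coder overhead."""
    if not tree["children"]:
        return 0.0
    counts = [tree["children"].get(color, {"count": 0})["count"]
              for color in range(len(probs))]
    bits = -log2_multinomial_pmf(counts, probs)
    for child in tree["children"].values():
        bits += ideal_code_length(child, probs)
    return bits

# usage with the toy multiset from the earlier sketch; probs[0] = no-edge probability
rows = [(1, 0, 2), (1, 0, 2), (0, 2, 1), (2, 2, 0), (0, 2, 1)]
probs = [0.5, 0.3, 0.2]
print(round(ideal_code_length(build_count_tree(rows), probs), 3), "bits")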

Theorem 4.

The expected compressed length generated by Alg. 1 is within 2 bits of the entropy bound.

Proof:

The result follows from Lem. 2 and Lem. 3 by comparing the entropy expression of a partially-labeled random bipartite graph with the expected length when using Alg. 1. ∎

Alg. 1 achieves near-optimal compression of partially-labeled bipartite graphs, but we also wish to use such graphs as two-layered neural networks without fully decompressing. We next present Alg. 2 to directly use compressed graphs for the inference operations of two-layered neural networks. Structures that take space equal to the information-theoretic minimum with only a little bit of redundancy while also supporting various relevant operations on them are called succinct structures [4]. In particular, we call a structure succinct if it represents the data using the information-theoretic minimum number of bits plus lower-order terms, while allowing relevant operations on the data.

1:  Input: , the input vector to the neural network, and , the compressed representation of the partially-labeled bipartite graph obtained from Alg. 1.
2:  Output: , the output vector of the neural network, and , the compressed representation as obtained from input.
3:  Initialize: the output vector, the count of neurons covered at the current depth, an empty queue, and an empty string which will hold the compressed representation once the algorithm has executed.
4:  Enqueue with .
5:  while  is not empty and  do
6:      = the first element obtained after dequeuing .
7:     .
8:     while  and  do
9:        decode the child node of corresponding to color and store it as .
10:        Encode back in .
11:        Enqueue in .
12:        Add to each of to
13:        Add to
14:        if  equals 1 and at least one non-zero node has been processed at the current depth then
15:            = +
16:        end if
17:     end while
18:  end while
19:  Update the vector using the required activation function.
Algorithm 2 Inference algorithm for compressed network.
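The idea behind Alg. 2 can be sketched as follows. For brevity this sketch is a depth-first variant that walks the uncoded count tree from the earlier hypothetical build_count_tree rather than the arithmetic-coded bit stream, and it omits Alg. 2's queue bookkeeping; the names and the color-to-weight map are illustrative.

def infer_from_tree(tree, x, color_weights, partial=0.0, depth=0):
    """Walk the count tree: at depth t the branch of color c contributes
    color_weights[c] * x[t] to every row in that subtree.  Leaves emit the
    accumulated pre-activation value once per row they represent.  The
    returned list equals W @ x up to a permutation of the rows."""
    if depth == len(x):                       # leaf: one value per identical row
        return [partial] * tree["count"]
    out = []
    for color, child in tree["children"].items():
        out += infer_from_tree(child, x, color_weights,
                               partial + color_weights[color] * x[depth],
                               depth + 1)
    return out

# usage (hypothetical): colors 0, 1, 2 map to synaptic weights 0, +1, -1
rows = [(1, 0, 2), (1, 0, 2), (0, 2, 1), (2, 2, 0), (0, 2, 1)]
x = [0.5, -1.0, 2.0]
color_weights = {0: 0.0, 1: 1.0, 2: -1.0}
y = infer_from_tree(build_count_tree(rows), x, color_weights)
W = [[color_weights[c] for c in r] for r in rows]
y_direct = [sum(w * xi for w, xi in zip(row, x)) for row in W]
assert sorted(y) == sorted(y_direct)

The final assertion is exactly the content of Prop. 5 below: the compressed-domain output agrees with the direct matrix-vector product up to a permutation of the unlabeled nodes.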

Alg. 2 is a breadth-first search algorithm, which traverses the compressed tree representation of the two-layered neural network and simultaneously updates the output of the neural network. Note that the output vector obtained from Alg. 2 is a permutation of the original output vector obtained from the original uncompressed network. Observe that each element of the output has a corresponding vector indicating its connections to the input of the neural network, and when all these elements are sorted in a decreasing manner based on these connection vectors, the output of Alg. 2 is obtained. This happens because Alg. 2 is designed to produce the same output vector independent of the arrangement of the unlabeled nodes. Based on this invariance in the output of the compressed neural network, we can rearrange the weights of the next layers of the neural network accordingly before compressing them, to obtain a multi-layered neural network with the desired output.

Proposition 5.

Output obtained from Alg. 2 is a permutation of , the output from the uncompressed neural network representation.

Proof:

We need to show that the output obtained from Alg. 2 is a permutation of the output obtained by direct multiplication of the weight matrix with the input vector, without any compression. Say we have an input vector to be multiplied with a weight matrix to obtain the output vector; then each element of the output is the weighted sum of the inputs given by the corresponding row of the weight matrix. In Alg. 2, while traversing a particular depth, we multiply the corresponding input entry by the colors at that depth, and hence when we reach the final depth, we obtain the output vector as required. The output is a permutation of the uncompressed output because, while compressing, we do not encode the ordering of the unlabeled nodes, while the ordering of the labeled input nodes is retained. ∎

Proposition 6.

The additional dynamic space requirement of Alg. 2 is .

Proof:

It can be seen that Alg. 2 uses some space in addition to the compressed data. The symbols decoded from the compressed input are encoded back into the output string; hence the combined space taken by both of them at any point in time remains almost the same as the space taken by the compressed input at the beginning of the algorithm. However, the main dynamic space requirement comes from the decoding of individual nodes and from the queue. Clearly, the space required for the queue, which stores up to two depths of nodes in the tree, is much more than the space required for decoding a single node.

We next show that the expected space complexity corresponding to the queue satisfies the claimed bound when Elias-Gamma integer codes (with a small modification to be able to encode zero as well) are used for each entry in the queue. Note that the queue holds nodes from at most two consecutive depths, and since only the child nodes of non-zero nodes are encoded, and the number of non-zero nodes at any depth is bounded, only a bounded number of nodes is encoded in the queue at any time. Consider the non-zero tree nodes at some depth of the tree and the total space required to store them. Using integer codes, we can encode any positive number in a logarithmic number of bits, and encoding zero as well changes this only slightly. Thus, the arithmetic-geometric mean inequality yields the claimed bound.
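For concreteness, here is one standard way to realize an Elias-Gamma code that can also encode zero, namely shifting every value by one before coding; this shift is an assumption about the unspecified "small modification", and the sketch is self-contained and illustrative only.

def elias_gamma_encode(n):
    """Elias gamma code, shifted so that 0 is encodable: encode m = n + 1
    as (len(bin(m)) - 1) zeros followed by the binary expansion of m."""
    if n < 0:
        raise ValueError("only non-negative integers")
    m = n + 1
    binary = bin(m)[2:]                 # binary expansion of m, MSB first
    return "0" * (len(binary) - 1) + binary

def elias_gamma_decode(bits, pos=0):
    """Decode one codeword starting at position pos; return (value, new_pos)."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    m = int(bits[pos + zeros: pos + 2 * zeros + 1], 2)
    return m - 1, pos + 2 * zeros + 1

# round-trip check; each value n costs 2*floor(log2(n + 1)) + 1 bits
stream = "".join(elias_gamma_encode(n) for n in [0, 1, 5, 12, 0, 7])
pos, decoded = 0, []
while pos < len(stream):
    value, pos = elias_gamma_decode(stream, pos)
    decoded.append(value)
assert decoded == [0, 1, 5, 12, 0, 7]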

Theorem 7.

The compressed representation formed in Alg. 1 is succinct in nature.

Proof:

From Prop. 5 and Prop. 6 we know that the additional dynamic space required for Alg. 2 is , while the entropy of a partially-labeled bipartite graph is . Thus, from the definition of succinctness, it follows that the structure is succinct. ∎

IV Unlabeled Bipartite Graphs

Next we consider an unlabeled bipartite graph, for which we construct the adjacency matrix as before, but now the possible entries in each cell are 1 or 0, corresponding to whether or not there is an edge, respectively.

Although the structure is slightly different from the previous case, it also has some interesting properties. The connectivity pattern is independent of the order of the row vectors and column vectors in this bipartite adjacency matrix. We call a rearrangement of the matrix valid if we change the order of the rows keeping the order of the columns fixed, or change the order of the columns keeping the order of the rows fixed. Observe that, after any sequence of valid rearrangements, the set of elements in the row vector containing a particular element remains the same, and the set of elements in the column vector containing that element also remains the same. Let us call the set of elements in a row a row block and, similarly, the set of elements in a column a column block. Then every element of the adjacency matrix is the intersection point of a unique pair of row and column blocks, and this pair does not change under any valid rearrangement.
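A quick check of this invariance (binary entries, illustrative names): under a valid rearrangement, a tracked entry keeps its value and its row and column blocks, viewed as multisets, are unchanged.

import numpy as np
from collections import Counter

rng = np.random.default_rng(2)
M = rng.integers(0, 2, size=(5, 7))           # unlabeled bipartite adjacency
i, j = 3, 4                                   # track one particular cell

def blocks(mat, r, c):
    """Row block and column block of cell (r, c), as multisets of entries."""
    return Counter(mat[r, :].tolist()), Counter(mat[:, c].tolist())

row_perm, col_perm = rng.permutation(5), rng.permutation(7)
P = M[row_perm][:, col_perm]                  # a composition of valid rearrangements
i_new = int(np.where(row_perm == i)[0][0])    # where the tracked cell moved
j_new = int(np.where(col_perm == j)[0][0])

assert P[i_new, j_new] == M[i, j]
assert blocks(P, i_new, j_new) == blocks(M, i, j)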

We will next show that the entropy of an unlabeled random bipartite graph is as given in Thm. 8, following which we will provide an algorithm that is optimal up to the second-order term.

Theorem 8.

For large , and for all satisfying and , the entropy of an unlabeled bipartite graph, with each set containing vertices and binary colored edges is , where , and the notation means .

Proof:

From Thm. 1, we know that for a graph with edges,

Now consider an unlabeled random bipartite graph model which is formed in the same way as that of a random bipartite graph, except that the vertices in both sets are unlabeled, but the sets themselves remain labeled, i.e. two sets of unlabeled vertices having the same edge connections as those of a random bipartite graph. Thus, for each unlabeled structure of a bipartite graph, there can exist a number of corresponding labeled bipartite graphs. We say a bipartite graph is isomorphic to an unlabeled bipartite graph if the latter can be formed by making all of its vertices unlabeled, keeping all the edge connections the same. If the set of all bipartite graphs isomorphic to the unlabeled bipartite graph is considered, then,

The automorphism of a graph is defined as an adjacency-preserving permutation of the vertices of the graph. Considering the permutations of vertices within each of the two sets, we have a certain total number of permutations. Given that each unlabeled graph corresponds to a number of bipartite graphs, and each bipartite graph corresponds to an equal number of adjacency-preserving permutations of vertices in the graph, from [27, 28] one can observe that:

We also know that the entropy of a random bipartite graph, , is . The entropy of an unlabeled graph is:

We will next use a result, Lem. 18 in the Appendix, on symmetry of random bipartite graphs to compute entropy.

Note that for asymmetric graphs and so:

We know that , hence . Therefore,

Further, note that where . Hence, for any constant ,

This completes the proof. ∎

The following Alg. 3 is an algorithm to efficiently compress the adjacency matrix of any unlabeled bipartite graph of the previously described type. All encodings are done using arithmetic codes.

1:  Choose any random cell containing 1 (call it 1-cell) from the adjacency matrix (or any cell containing 0 (0-cell) only if no 1 cell is available) and using valid rearrangements make this cell the top left element of the matrix. Call it the parent cell. Initially, all cells are unmarked.
2:  Form two trees and , and store in the root nodes of each of the trees. Initialize depth, .
3:  while depth of  do
4:     Divide every non-empty leaf node at the current depth of tree into two child nodes. The left child denotes the number of 1-cells that are unmarked in the column block containing the parent cell; similarly the right child denotes the remaining 0-cells that are unmarked.
5:     Mark all unmarked cells in the column block containing the parent cell.
6:     Remove an element from the leftmost node of the tree .
7:     Choose any cell from the newly formed leftmost child of the tree as the parent cell.
8:     Divide all the leaf nodes at the current depth of the tree into two child nodes. The left child denotes the number of unmarked 1-cells in the row block containing the parent cell; similarly the right child denotes the remaining 0-cells that are unmarked.
9:     Choose any cell from the newly formed leftmost child of the tree as the parent cell.
10:     Mark all the unmarked cells in the row block containing the parent cell.
11:     Remove an element from the leftmost node of the tree .
12:     Increase depth of and by 1.
13:  end while
Algorithm 3 Compressing an unlabeled bipartite graph.

In Alg. 3, all the child nodes of a node with a given value are first stored using a corresponding number of bits, followed by an arithmetic coder. Note that while encoding numbers in Alg. 3, the binomial distribution has been used for arithmetic coding, with a fixed probability of existence of an edge between any two nodes of the bipartite graph and the complementary probability that the two nodes are disconnected.
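As a small illustration of this coding step (not the authors' implementation; the function name and the example numbers are made up), the ideal arithmetic-coding cost of revealing one split under the binomial model is:

import math

def split_code_length(m, k, p):
    """Ideal arithmetic-coding cost, in bits, of revealing that k of the m
    unmarked cells in a block are 1-cells, when each cell is a 1 independently
    with probability p (the binomial model used in Alg. 3)."""
    q = 1.0 - p
    log_prob = (math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)
                + (k * math.log(p) if k else 0.0)
                + ((m - k) * math.log(q) if m - k else 0.0))
    return -log_prob / math.log(2)

# e.g. a block of 16 unmarked cells splitting into 5 ones and 11 zeros at p = 0.3
print(round(split_code_length(16, 5, 0.3), 3), "bits")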

Now we bound the compression performance of Alg. 3. The proof for this bound is based on a theorem for compression of graphical structures [24] and before stating our result and its proof, we recall two lemmas from there.

Lemma 9.

For all integers and ,

where satisfies and for ,

Lemma 10.

For all and ,

such that satisfies and for ,

Theorem 11.

If an unlabeled bipartite graph can be represented by Alg. 3 in bits, then , where is an explicitly computable constant, and is a fluctuating function with a small amplitude independent of .

Proof:

We need to find the expected value of the sum of all the encoding-lengths in all nodes of both trees. It can be observed that the structure of the trees formed in Alg. 3 is the same as in [24] except that there are two trees in our algorithm and the first tree does not lose an element from the root node on its first division. Nevertheless, the expected value of length of encoding for both trees can be upper-bounded by an expression provided in [24].

Let us formally prove that both encodings are upper-bounded by this expression. Let be the number of elements in some node of either of the trees (say , where can be or ). Then the total number of bits required for encoding a tree is , where is any tree with the root node containing elements and losing an element before its first division. Similarly, is a tree with root node containing nodes with the tree nodes not losing any element before divisions. Define and . So, the total expected bit length is before using arithmetic coding. These encoded bits are further compressed using arithmetic encoder which results in bits in total. Now define

In our setting, if and are the number of bits required to represent trees and , respectively, then the following equations hold.

Similarly, and are the number of bits required to represent trees and after using arithmetic coding, respectively. Using Lem. 9 and Lem. 10, together with bounds from [24], it follows that for any :

Hence, the sum:

where is an explicitly computable constant and is a fluctuating function with a small amplitude independent of . This completes the proof. ∎

It can be observed that by using Alg. 3 for unlabeled bipartite graphs, we save roughly bits when compared to compressing a partially-labeled bipartite graph using Alg. 1.

V Deep Neural Networks

Now we consider the multi-layer neural network model. First we extend the algorithm for unlabeled bipartite graphs to compress a multi-layered unlabeled graph, and then store the permutations of the first and last layers. This gives an efficient compression algorithm for a multi-layered neural network, saving around bits compared to standard arithmetic coding of the weight matrices.

1:  Form root nodes of binary trees corresponding to layers of the neural network, and store in the root node of all the trees, corresponding to the neural network nodes in each of the layers.
2:  Initialize iteration number, , and layer number, . Let represent the set of indices of trees corresponding to layers neighboring to the th layer of the neural network.
3:  while depth of  do
4:     while depth of  do
5:        Selection: Select a node of the neural network from layer that corresponds to one of the neural network nodes in the leftmost non-zero node of and subtract 1 from the leftmost non-zero node of .
6:        Division: Divide every non-empty leaf node of the trees for into two child nodes based on the connections of the neural network nodes corresponding to the leaf nodes with the selected node in the previous step. The left child denotes the number of neural network nodes not connected to the selected node; similarly the right child denotes the neural network nodes connected to the selected node.
7:        Increment by 1.
8:     end while
9:     Increment by 1.
10:  end while
Algorithm 4 Compressing a -layer unlabeled graph.
Theorem 12.

Let be the number of bits required to represent a -layer neural network model using Alg. 4. Then , where is an explicitly computable constant, and is a fluctuating function with a small amplitude independent of .

Proof:

The encoding of Alg. 4 is similar to the encoding of Alg. 3. For all trees, the child nodes of any node with non-zero value are stored using bits. Let the number of bits required to encode the th layer be . These bits are further compressed using an arithmetic coder, which gives us, say, bits for the th layer. Observe that the trees for the st and th layer are nothing but and respectively. Hence, based on results from previous sections,

But the binary trees formed for the intermediate layers are different. Instead of a subtraction from the leftmost non-zero node at each division after the first divisions, as in the earlier type of tree, in these trees the subtraction takes place in every alternate division after the first divisions. We will follow the same procedure for compressing these trees as for the earlier ones, i.e. we will encode the child nodes of a node with a given value using a corresponding number of bits, followed by an arithmetic coder. Now define,

We will next show that