## I Introduction

Deep learning has achieved incredible performance for inference tasks such as speech recognition, image recognition, and natural language processing. Most current deep neural networks, however, are enormous cloud-based structures that are *too large* and *too complex* to perform fast, energy-efficient inference on device. Even in providing personalized deep learning as a service (DLaaS), each customer for an application like bank fraud detection may require a different trained network, but scaling to millions of stored networks is not possible even in the cloud. Compression, with the capability of providing inference without full decompression, is important. We develop new universal source coding techniques for feedforward deep networks having synaptic weights drawn from finite sets that essentially achieve the entropy lower bound, which we also compute. Further, we provide an algorithm to use these compressed representations for inference tasks without complete decompression. Structures that can represent information near the entropy bound while also allowing efficient operations on them are called *succinct structures* [2, 3, 4, 5]. Thus, we provide a succinct structure for feedforward neural networks, which may fit on-device and may enable scaling of DLaaS in the cloud.

Over the past couple of years, there has been growing interest in compact representations of neural networks [6, 7, 8, 9, 10, 11, 12, 13, 14], largely focused on lossy representations; see [15] for a recent survey of developed techniques including pruning, pooling, and factoring. These works largely lack strong information-theoretic foundations and may discretize real-valued weights through simple uniform quantization, perhaps followed by independent entropy coding applied to each. It is worth noting that binary-valued neural networks (having only a network structure [16] rather than trained synaptic weights) can often achieve high-fidelity inference [17, 18] and that there is a view in neuroscience that biological synapses may be discrete-valued [19]. Taking advantage of certain invariances in the structure of neural networks (previously unrecognized, e.g. [20]) in performing lossless entropy coding, however, can lead to rate reductions on top of the lossy representation techniques that have been developed [15]. In particular, the structure of feedforward deep networks in layers past the input layer is that of unlabeled bipartite graphs where node labeling is irrelevant, much like for nonsequential data [21, 22, 23]. By dissipating the uncertainty in this invariance, lossless coding can compress more than universal graph compression for labeled graphs [24], essentially a gain of bits for networks with nodes.

The remainder of the paper first develops the entropy limits, once the appropriate invariances are recognized. Then it designs an appropriate “sorting” of synaptic weights to put them into a canonical order where irrelevant uncertainty beyond the invariants is removed. Finally, arithmetic coding is used to represent the weights [25, 26]. The coding algorithm essentially achieves the entropy bound. Further, we provide an efficient inference algorithm that uses the compressed form of the feedforward neural network to calculate its output without completely decoding it, taking additional dynamic space for a network with nodes in the layer with the maximum number of nodes. We also provide experimental results of our compression and inference algorithms on a feedforward neural network trained to perform a classification task on the MNIST dataset. A preliminary version of this work dealt only with universal compression and not succinctness [1].

## II Feedforward Neural Network Structure

Neural networks are composed of nodes connected by directed edges. Feedforward networks have connections in one direction, arranged in layers. An edge from node to node propagates an activation value from to , and each edge has a synaptic weight that determines the sign/strength of the connection. Each node computes an activation function applied to the weighted sum of its inputs, which we can note is a permutation-invariant function: for any permutation . Nodes in the second layer are indistinguishable.
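The permutation invariance noted above can be checked numerically. The following is a minimal sketch, not taken from the source: the function name `node_activation`, the choice of `tanh` as the activation function, and the weight set {-1, 0, 1} are all assumptions made for illustration.

```python
import math
import random

def node_activation(weights, inputs, g=math.tanh):
    """Activation of one node: g applied to the weighted sum of its inputs.

    The activation g is an assumption; any function of the weighted sum works.
    """
    return g(sum(w * a for w, a in zip(weights, inputs)))

random.seed(0)
weights = [random.choice([-1, 0, 1]) for _ in range(6)]  # weights from a finite set
inputs = [random.random() for _ in range(6)]

# Apply the same permutation to the incoming weights and to the inputs.
perm = list(range(6))
random.shuffle(perm)
permuted_weights = [weights[i] for i in perm]
permuted_inputs = [inputs[i] for i in perm]

original = node_activation(weights, inputs)
permuted = node_activation(permuted_weights, permuted_inputs)
assert math.isclose(original, permuted)  # equal up to floating-point summation order
```

Because the weighted sum does not depend on the order of its terms, relabeling the nodes that feed a given node leaves its activation unchanged, which is the invariance the coding scheme exploits.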

Consider a -layer feedforward neural network with each layer having nodes (for notational convenience), such that nodes in the first layer are labeled and all nodes in each of the remaining layers are indistinguishable from each other (when edges are ignored). Suppose there are possible colorings of edges (corresponding to synaptic weights), and that the connection from each node in a layer to any given node in the next layer takes color with probability , , where is the probability of no edge. The goal is to universally find an efficient representation of this neural network structure, first considering two substructures that comprise it and later considering it as a whole. Later, we will consider the problem of inference without the need to decode.

Let us consider two substructures: partially-labeled bipartite graphs and unlabeled bipartite graphs, see Fig. 1. A partially-labeled bipartite graph consists of two sets of vertices, and . The set contains labeled vertices, whereas the set contains unlabeled vertices. For any pair of vertices with one vertex from each set, there is a connecting edge of color with probability , , with as the probability the two nodes are disconnected. Multiple edges between nodes are not allowed. Unlabeled bipartite graphs are a variation of partially-labeled bipartite graphs where both sets and consist of unlabeled vertices; for simplicity, in the sequel we assume there is only a single color for all nodes and that any two nodes from two different sets are connected with probability .

To construct the -layer neural network from the two substructures, one can think of it as made of a partially-labeled bipartite graph for the first and last layers and a cascade of layers of unlabeled bipartite graphs for the remaining layers. An alternate construction, however, may be more insightful: the first two layers are still a partially-labeled bipartite graph, but each time the nodes of an unlabeled layer are connected, we treat it as a labeled layer based on its connection to the previous labeled layer (i.e. we can label the unlabeled nodes based on the nodes of the previous layer they are connected to), and iteratively complete the -layer neural network.

## III Representing Partially-Labeled Bipartite Graphs

Consider a matrix representing the edges in a partially-labeled bipartite graph, such that each row represents an unlabeled node from and each column represents a node from . A non-zero matrix element indicates there is an edge between the corresponding two nodes of color , whereas a 0 indicates they are disconnected. Observe that if the order of the rows of this matrix is randomly permuted (preserving the order of the columns), then the corresponding bipartite graph remains the same. That is, to represent the matrix, the *order of rows does not matter*. Hence the matrix can be viewed as a multiset of vectors, where each vector corresponds to a row of the matrix. Using these facts, we calculate the entropy of a partially-labeled bipartite graph. Our proofs for entropy of random bipartite graphs follow that of [24] for entropy of random graphs.

###### Theorem 1.

For large , and for all satisfying and , the entropy of a partially-labeled bipartite graph, with each set containing vertices and binary colored edges is , where , and the notation means .

###### Proof:

Consider a random bipartite graph model , where graphs are randomly generated on two sets of vertices, and , having labeled vertices each, with edges chosen independently between any two vertices belonging to different sets with probability . Then, for a graph with edges,

Now, consider a partially-labeled random bipartite graph model which is formed in the same way as a random bipartite graph, except that the vertices in the set are unlabeled. Thus, for each , which represents a partially-labeled structure of a bipartite graph, there can exist a number of for . We say is *isomorphic* to if can be formed by making all the vertices in set of unlabeled, keeping all the edge connections the same. If the set of all bipartite graphs isomorphic to partially-labeled bipartite graphs is represented by , then,

The *automorphism* of a graph, for , is defined as an adjacency-preserving permutation of the vertices of a graph. Considering only the permutations of vertices in the set , we have a total of permutations. Given that each partially-labeled graph corresponds to number of bipartite graphs, and each bipartite graph corresponds to (which is equal to) number of adjacency-preserving permutations of vertices in the graph, from [27, 28] one can observe that:

By definition, the entropy of a random bipartite graph, , is where . The entropy of a partially-labeled graph is:

Now [29] shows that for all satisfying the conditions in this theorem, a random graph on vertices with edges occurring between any two vertices with probability is symmetric with probability for some positive constant . We have stated and proved Lem. 17 in the Appendix to provide a similar result on symmetry of random bipartite graphs which will be used to compute its entropy.

Note that for asymmetric graphs, hence

We know that , hence . Therefore,

Hence, for any constant ,

This completes the proof. ∎

We can also provide an alternate expression for the entropy of partially-labeled graphs with possible colors that will be amenable to comparison with the rate of a universal coding scheme.

###### Lemma 2.

The entropy of a partially-labeled bipartite graph, with each set containing nodes and edges colored with possibilities is , where and the s are non-negative integers that sum to .

###### Proof:

As observed earlier, the adjacency matrix of a partially-labeled bipartite graph is nothing but a multiset of vectors. From [21], we know that the empirical frequency of all elements of a multiset completely describes it. Each cell of the vector can be filled in ways corresponding to colors or no connection (color ), hence there can be in total possible vectors. The probability of a vector with the th element having appearances is:

Here, is the probability of occurrence of each of the possible vectors. In the th vector, let the number of edges with color be . Then, . Hence, the entropy of the multiset is:

and

where represents the number of vectors having edges of color . By linearity of expectation and rearranging terms, we get:

Now,

Thus,

∎
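The multiset view used in this proof can be illustrated with a short Python sketch. It is an illustration only: the helper names `row_multiset` and `canonical_form`, the matrix size, and the number of colors are assumptions, not part of the source.

```python
from collections import Counter
import random

def row_multiset(matrix):
    """Empirical frequency of each distinct row, which fully describes the multiset."""
    return Counter(tuple(row) for row in matrix)

def canonical_form(matrix):
    """One representative per equivalence class: the rows in sorted order."""
    return sorted(tuple(row) for row in matrix)

random.seed(1)
m = 2  # hypothetical number of colors; entry 0 means "no edge"
A = [[random.randint(0, m) for _ in range(4)] for _ in range(5)]

# Permuting the rows (the unlabeled nodes) changes the matrix but not the graph.
shuffled = A[:]
random.shuffle(shuffled)

assert row_multiset(A) == row_multiset(shuffled)
assert canonical_form(A) == canonical_form(shuffled)
```

Any row ordering of the same adjacency matrix yields the same frequency table, so a code only needs to describe that table rather than the full ordered matrix.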

Next we present Alg. 1, a universal algorithm for compressing a partially-labeled bipartite graph, and its performance analysis.

###### Lemma 3.

If Alg. 1 takes bits to represent the partially-labeled bipartite graph, then .

###### Proof:

We know, for any node encoded with with the encodings of its child nodes , that is distributed as a multinomial distribution, . So, using arithmetic coding to encode all the nodes, the expected number of bits required to encode all the nodes is

(1)

Here, the summation is over all non-zero nodes of the ()-ary tree. Hence (1) can be simplified as

When the term is summed over all nodes, then all terms except those corresponding to the nodes of depth cancel, i.e. . Similarly, the term can be simplified as , since in the adjacency matrix of the graph, each cell can have colors from to with probability , and for each color , the expected number of cells having color is . Thus, we find

Since we are using an arithmetic coder, it takes at most 2 extra bits [30, Ch. 13.3]. ∎
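The cancellation argument in this proof can be checked numerically. The sketch below rests on an assumption, since Alg. 1 is not reproduced in this text: it supposes that each tree node stores the number of matrix rows sharing a given prefix, so that the child counts of a node are multinomially distributed as the proof states. The function names, the example matrix, and the probabilities are hypothetical.

```python
from collections import Counter
from math import factorial, isclose, log2

def multinomial_bits(counts, probs):
    """Ideal code length in bits of child counts drawn as Multinomial(sum(counts), probs)."""
    log_prob = log2(factorial(sum(counts)))
    for c, p in zip(counts, probs):
        if c:
            log_prob += c * log2(p) - log2(factorial(c))
    return -log_prob

def tree_code_length(matrix, probs):
    """Sum of ideal code lengths over all non-zero nodes of the assumed prefix-count tree."""
    total = 0.0
    level = Counter({(): len(matrix)})  # root: every row shares the empty prefix
    for depth in range(len(matrix[0])):
        children = Counter(tuple(row[:depth + 1]) for row in matrix)
        for prefix in level:
            counts = [children.get(prefix + (c,), 0) for c in range(len(probs))]
            total += multinomial_bits(counts, probs)
        level = children
    return total

def multiset_bits(matrix, probs):
    """Negative log2-probability of the row multiset, computed directly."""
    log_prob = log2(factorial(len(matrix)))
    for row, k in Counter(tuple(r) for r in matrix).items():
        log_prob += k * sum(log2(probs[c]) for c in row) - log2(factorial(k))
    return -log_prob

matrix = [[1, 0, 2], [0, 0, 1], [1, 0, 2], [2, 1, 0]]
probs = [0.5, 0.3, 0.2]  # probs[0] is the probability of "no edge"
assert isclose(tree_code_length(matrix, probs), multiset_bits(matrix, probs))
```

Under this assumed tree, the factorial terms of interior nodes telescope, leaving only the root and the deepest nodes, which is why the per-node total matches the code length of the multiset itself.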

###### Theorem 4.

The expected compressed length generated by Alg. 1 is within 2 bits of the entropy bound.

###### Proof:

Alg. 1 achieves near-optimal compression of partially-labeled bipartite graphs, but we also wish to use such graphs as two-layered neural networks *without fully decompressing*. We next present Alg. 2 to directly use compressed graphs for the inference operations of two-layered neural networks. Structures that take space equal to the information-theoretic minimum with only a little bit of redundancy while also supporting various relevant operations on them are called *succinct structures* [4]. In particular, if is the information-theoretic minimum number of bits required to store some data, then we call a structure succinct if it represents the data in bits, while allowing relevant operations on the data.

Alg. 2 is a breadth-first search algorithm, which traverses the compressed tree representation of the two-layered neural network and simultaneously updates the output of the neural network, say . Note that the vector obtained from Alg. 2 is a permutation of the original vector obtained from the original uncompressed network. Observe that each element of has a corresponding vector indicating its connection with the input to the neural network, say , and when all these elements are sorted in decreasing order based on these connections, it gives . This happens because Alg. 2 is designed to give the same vector independent of the arrangement in . Based on this invariance in the output of the compressed neural network, we can rearrange the weights of the next layers of the neural network accordingly before compressing them to get a -layered neural network with the desired output.

###### Proposition 5.

Output obtained from Alg. 2 is a permutation of , the output from the uncompressed neural network representation.

###### Proof:

We need to show that the obtained from Alg. 2 is a permutation of , obtained by direct multiplication of the weight matrix with the input vector without any compression. Say, we have an vector to be multiplied with an weight matrix , to get the output , an vector. Then, , and so the th element of , . In Alg. 2, while traversing a particular depth , we multiply all with and hence when we reach depth , we get the vector as required. The change in permutation of with respect to is because while compressing , we do not encode the permutation of the columns, retaining the row permutation. ∎
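The statement of Prop. 5 can be illustrated without the compressed tree. The sketch below is not Alg. 2: it replaces the compressed representation with an explicit weight matrix whose columns are put into decreasing lexicographic order, which is an assumed stand-in for the canonical order. The names `forward` and `sort_columns` and the example values are hypothetical.

```python
def forward(x, W):
    """y_j = sum_i x_i * w_ij for a row vector x and a weight matrix W."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def sort_columns(W):
    """Stand-in canonical form (an assumption): columns in decreasing lexicographic order."""
    cols = sorted(zip(*W), reverse=True)
    return [list(row) for row in zip(*cols)]

x = [0.5, -1.0, 2.0]
W = [[1, 0, -1, 1],
     [0, 1, 1, 0],
     [-1, 1, 0, 1]]

y = forward(x, W)                       # output of the uncompressed network
y_tilde = forward(x, sort_columns(W))   # output when the column order is not stored

# The two outputs contain the same values, possibly in a different order.
assert sorted(y) == sorted(y_tilde)
```

Since each output entry depends only on its own column of the weight matrix, discarding the column order can only reorder the output, never change its values.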

###### Proposition 6.

The additional dynamic space requirement of Alg. 2 is .

###### Proof:

It can be seen that Alg. 2 uses some space in addition to the compressed data. The symbols decoded from are encoded into , hence, the combined space taken by both of them at any point in time remains almost the same as the space taken by at the beginning of the algorithm. However, the main dynamic space requirement is because of the decoding of individual nodes, and the queue, . Clearly, the space required for , storing up to two depths of nodes in the tree, is much more than the space required for decoding a single node.

We next show that the expected space complexity corresponding to is less than or equal to using Elias-Gamma integer codes (with a small modification to be able to encode as well) for each entry in . Note that has nodes from at most two consecutive depths, and since only the child nodes of non-zero nodes are encoded, and the number of non-zero nodes at any depth is less than , we can have a maximum of nodes encoded in . Let be the non-zero tree nodes at some depth of the tree, where . Let be the total space required to store . Using integer codes, we can encode any positive number in bits, and to allow , we need bits. Thus, the arithmetic-geometric inequality implies

∎
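For concreteness, an Elias-gamma code extended to encode zero might look like the following sketch. The source only says "a small modification"; shifting every value by one before encoding is an assumption, as are the function names and the example queue contents.

```python
def elias_gamma_encode(n):
    """Elias-gamma codeword of n >= 0, shifted by one so that 0 is encodable (assumed)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    binary = bin(n + 1)[2:]                  # binary expansion of n + 1
    return "0" * (len(binary) - 1) + binary  # unary length prefix, then the value

def elias_gamma_decode(bits):
    """Decode a concatenation of shifted Elias-gamma codewords into a list of integers."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                # count the unary length prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2) - 1)
        i += zeros + 1
    return values

queue = [0, 3, 1, 12, 0, 7]                  # hypothetical node values held in the queue
stream = "".join(elias_gamma_encode(v) for v in queue)
assert elias_gamma_decode(stream) == queue
```

With this shift, a value n costs 2*floor(log2(n + 1)) + 1 bits, so small counts, which dominate deep in the tree, are stored in only a few bits.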

###### Theorem 7.

The compressed representation formed in Alg. 1 is succinct in nature.

## IV Unlabeled Bipartite Graphs

Next we consider an unlabeled bipartite graph for which we construct the adjacency matrix as before, but now the possible entries in each cell will be 1 or 0, corresponding to whether or not there is an edge, respectively.

Although the structure is slightly different from the previous case, it also has some interesting properties. The connectivity pattern is independent of the order of the row vectors and column vectors in this bipartite adjacency matrix. We call a rearrangement of the matrix *valid* if it changes the order of the rows while keeping the order of the columns fixed, or changes the order of the columns while keeping the order of the rows fixed. Observe that, after any sequence of valid rearrangements, the set of elements in a row vector (corresponding to a particular element) remains the same, and the set of elements in a column vector (corresponding to a particular element) also remains the same. Let us call the set of elements in a row a *row block* and similarly the set of elements in a column a *column block*. Then, every element of the adjacency matrix has a unique pair of row and column blocks for which it is the intersecting point, and this pair does not change with any valid rearrangement.
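The invariance of row and column blocks under valid rearrangements can be checked with a small Python sketch. The helper names `blocks` and `rearrange` and the matrix dimensions are assumptions made for illustration.

```python
import random
from collections import Counter

def blocks(matrix):
    """Multisets of row contents and of column contents, ignoring all ordering."""
    row_blocks = Counter(tuple(sorted(row)) for row in matrix)
    col_blocks = Counter(tuple(sorted(col)) for col in zip(*matrix))
    return row_blocks, col_blocks

def rearrange(matrix, row_perm, col_perm):
    """Apply a row permutation and a column permutation to the matrix."""
    return [[matrix[r][c] for c in col_perm] for r in row_perm]

random.seed(2)
A = [[random.randint(0, 1) for _ in range(5)] for _ in range(4)]
row_perm = random.sample(range(4), 4)
col_perm = random.sample(range(5), 5)

# Row and column blocks survive any combination of valid rearrangements.
assert blocks(rearrange(A, row_perm, col_perm)) == blocks(A)
```

A row permutation reorders rows without touching their contents, and a column permutation reorders entries within each row, so the contents of every row and column are preserved.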

We will next show that the entropy of an unlabeled random bipartite graph is , following which, we will provide an algorithm which is optimal up to the second-order term.

###### Theorem 8.

For large , and for all satisfying and , the entropy of an unlabeled bipartite graph, with each set containing vertices and binary colored edges is , where , and the notation means .

###### Proof:

From Thm. 1, we know that for a graph with edges,

Now consider an unlabeled random bipartite graph model which is formed in the same way as a random bipartite graph, except that the vertices in both the sets, and , are unlabeled, but the sets themselves and remain labeled, i.e. two sets of unlabeled vertices having the same edge connections as that of a random bipartite graph. Thus, for each , which represents an unlabeled structure of a bipartite graph, there can exist a number of for . We say is isomorphic to if can be formed by making all the vertices of unlabeled, keeping all the edge connections the same. If the set of all bipartite graphs isomorphic to unlabeled bipartite graphs is represented by , then,

The automorphism of a graph, for , is defined as an adjacency-preserving permutation of the vertices of a graph. Considering the permutations of vertices in the sets and themselves, we have a total of permutations. Given that each unlabeled graph corresponds to number of bipartite graphs, and each bipartite graph corresponds to (which is equal to), we get the number of adjacency-preserving permutations of vertices in the graph, from [27, 28], as:

We also know that the entropy of random bipartite graph, , is . The entropy of an unlabeled graph is:

We will next use a result, Lem. 18 in the Appendix, on symmetry of random bipartite graphs to compute entropy.

Note that for asymmetric graphs and so:

We know that , hence . Therefore,

Further, note that where . Hence, for any constant ,

This completes the proof. ∎

The following Alg. 3 is an algorithm to efficiently compress the adjacency matrix of any unlabeled bipartite graph of the previously described type. All encodings are done using arithmetic codes.

In Alg. 3, all the child nodes of a node with value, say , are first stored using bits, followed by an arithmetic coder. Note that while encoding numbers in Alg. 3, the binomial distribution has been used for arithmetic coding, with as the probability of existence of an edge between any two nodes of the bipartite graph and as the probability that the two nodes are disconnected.

Now we bound the compression performance of Alg. 3. The proof for this bound is based on a theorem for compression of graphical structures [24] and before stating our result and its proof, we recall two lemmas from there.

###### Lemma 9.

For all integers and ,

where satisfies and for ,

###### Lemma 10.

For all and ,

such that satisfies and for ,

###### Theorem 11.

If an unlabeled bipartite graph can be represented by Alg. 3 in bits, then , where is an explicitly computable constant, and is a fluctuating function with a small amplitude independent of .

###### Proof:

We need to find the expected value of the sum of all the encoding-lengths in all nodes of both trees. It can be observed that the structure of the trees formed in Alg. 3 is the same as in [24] except that there are two trees in our algorithm and the first tree does not lose an element from the root node on its first division. Nevertheless, the expected value of length of encoding for both trees can be upper-bounded by an expression provided in [24].

Let us formally prove that both encodings are upper-bounded by this expression. Let be the number of elements in some node of either of the trees (say , where can be or ). Then the total number of bits required for encoding a tree is , where is any tree with the root node containing elements and losing an element before its first division. Similarly, is a tree with root node containing nodes with the tree nodes not losing any element before divisions. Define and . So, the total expected bit length is before using arithmetic coding. These encoded bits are further compressed using an arithmetic encoder, which results in bits in total. Now define

In our setting, if and are the number of bits required to represent trees and , respectively, then the following equations hold.

Similarly, and are the number of bits required to represent trees and after using arithmetic coding, respectively. Using Lem. 9 and Lem. 10, and bounds on and from [24] it implies that for any :

Hence, the sum:

where is an explicitly computable constant and is a fluctuating function with a small amplitude independent of . This completes the proof. ∎

## V Deep Neural Networks

Now we consider the -layer neural network model. First we will extend the algorithm for unlabeled bipartite graphs to compress a -layered unlabeled graph, and then store the permutation of the first and last layers. This would give us an efficient algorithm to compress a -layered neural network, saving around bits compared to standard arithmetic coding of weight matrices.

###### Theorem 12.

Let be the number of bits required to represent a -layer neural network model using Alg. 4. Then , where is an explicitly computable constant, and is a fluctuating function with a small amplitude independent of .

###### Proof:

The encoding of Alg. 4 is similar to the encoding of Alg. 3. For all trees, the child nodes of any node with non-zero value are stored using bits. Let the number of bits required to encode the th layer be . These bits are further compressed using an arithmetic coder, which gives us, say, bits for the th layer. Observe that the trees for the st and th layer are nothing but and respectively. Hence, based on results from previous sections,

But the binary trees formed for the layers to are different. Instead of a subtraction from the leftmost non-zero node at each division after the first divisions as in a type of tree, in these types of trees (let us call them type of trees), subtraction takes place in every alternate division after the first divisions. We will follow the same procedure for compression of to as for and , i.e. we will encode the child nodes of a node with value with bits followed by an arithmetic coder. Now define,

We will next show that and