1 Introduction
Recently, DNNs have achieved great success [Lecun et al.2015, He et al.2016, ZhiHua Zhou2017], and many related works try to explain them [Setiono and Liu1995, Moskvina and Liu2016, Bietti and Mairal2017]. However, there is still no comprehensive informational interpretation of DNNs, although such an interpretation would be valuable for understanding the optimization process and the internal organization of these so-called “black boxes” [Alain and Bengio2016]. Because DNNs use nonlinear functions and an unfixed number of layers and neural units, their structure is nonlinear and complex, which makes it difficult to interpret them from the perspective of information theory [ShwartzZiv and Tishby2017].
There are three existing approaches to this problem. The first is the Vapnik-Chervonenkis (VC) dimension [Vapnik and Chervonenkis2015], which is the largest number of samples that a hypothesis space can shatter and therefore measures the capacity of the hypothesis space [Vapnik2000]. In [Vapnik et al.1994], the authors suggest determining the VC dimension empirically, but they also state that the approach described cannot be used for neural networks because they are “beyond theory”. So far, the VC dimension can only be applied to neural networks approximately. The second approach, the capacity scaling law [Friedland and Krell2017], predicts the behavior of perceptron networks by calculating the lossless memory (LM) dimension and the MacKay (MK) dimension [MacKay2003]. However, this approach is based on an idealized neural network and applies only to perceptron networks. The third approach, the Information Bottleneck (IB) theoretical bound, is an information-theoretic technique introduced by [Tishby et al.2000]. In [ShwartzZiv and Tishby2017], the authors demonstrate the effectiveness of visualizing DNNs in the Information Plane using the IB theoretical bound. However, the IB bound does not perform well on more complex and deeper networks. It is therefore important to propose an interpretation theory that applies to all DNNs and has more precise rules of measurement.
To address the above difficulties, we introduce a typical family of DNN models named ConvACs [Cohen et al.2016b]. ConvACs can be viewed as convolutional networks that compute high-dimensional functions through tensor decompositions [Cohen et al.2017]. Thus, we can map ConvACs to concrete mathematical formulas through the decomposed high-dimensional functions and analyze those formulas with mathematical tools, especially information theory. This allows us to understand complicated DNNs more concretely and provides design guidelines for DNNs.
In this paper, we propose a novel information scaling law that interprets the network from the perspective of information theory and provides a better understanding of the expressive efficiency of DNNs. First, we give an informational interpretation of the activation function. Second, we prove that the information entropy increases when information is transmitted through the ConvACs. Finally, we propose the information scaling law of ConvACs under a reasonable assumption.
The remainder of this paper is organized as follows. Section 2 introduces tensor theory and ConvACs. Section 3 describes the informational interpretation of DNNs and demonstrates that the information entropy increases through ConvACs. Section 4 proposes the information scaling law of ConvACs. We conclude in Section 5.
2 Preliminaries
2.1 Tensor Theory
A tensor can be regarded as a multidimensional array $\mathcal{A}_{d_1, \ldots, d_N} \in \mathbb{R}$, where $i \in [N]$, $d_i \in [M_i]$, and $[K]$ denotes the set $\{1, \ldots, K\}$. The number of indexing entries (modes) is known as the order of the tensor. The number of values that each mode can take is known as the dimension of the mode. So, the tensor mentioned above is of order $N$ with dimension $M_i$ in the $i$-th mode. In this paper, we consider $M_1 = \cdots = M_N = M$, i.e. $\mathcal{A} \in \mathbb{R}^{M \times \cdots \times M}$. The tensor product, denoted by $\otimes$, is the basic operator in tensor analysis and generalizes the outer product of vectors to tensors.
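As a minimal sketch of the tensor product (an illustration added here, not code from the paper), the snippet below stores a tensor as a dictionary from index tuples to values. The helper names `vector` and `tensor_product`, the 0-based indices, and the numeric values are all assumptions chosen for the example.

```python
def vector(values):
    """Wrap a list as an order-1 tensor: {(d,): value}, with 0-based indices."""
    return {(d,): v for d, v in enumerate(values)}

def tensor_product(a, b):
    """Tensor (outer) product: entry (i..., j...) equals a[i...] * b[j...]."""
    return {ia + ib: va * vb for ia, va in a.items() for ib, vb in b.items()}

u = vector([1.0, 2.0])
v = vector([3.0, 4.0, 5.0])
w = vector([1.0, -1.0])

uv = tensor_product(u, v)     # order 2: the outer-product matrix of u and v
uvw = tensor_product(uv, w)   # order 3: orders add under the tensor product

order = len(next(iter(uvw)))
dims = tuple(max(idx[k] for idx in uvw) + 1 for k in range(order))
print(order, dims)                   # 3 (2, 3, 2)
print(uv[(1, 2)], uvw[(1, 2, 1)])    # 10.0 -10.0
```

The order of the result is the sum of the orders of the operands, and the mode dimensions are concatenated.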
The main tool we use from tensor theory is tensor decomposition. The most common one is the CANDECOMP/PARAFAC (CP) decomposition, which expresses a tensor as a sum of rank-1 tensors. Similar to matrix decomposition, the rank-$Z$ CP decomposition of a tensor $\mathcal{A}$ can be represented as:
(1) $\mathcal{A} = \sum_{z=1}^{Z} a_z \, \mathbf{a}^{z,1} \otimes \cdots \otimes \mathbf{a}^{z,N}$
where $\mathbf{a} \in \mathbb{R}^{Z}$ and $\mathbf{a}^{z,i} \in \mathbb{R}^{M}$ are the parameters.
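To make the CP form concrete, the following sketch assembles a full tensor from a weight vector and per-mode factor vectors, assuming the usual weighted sum of rank-1 terms. The function name `cp_tensor` and all numeric values are hypothetical and chosen only for illustration.

```python
from itertools import product

def cp_tensor(weights, factors, M):
    """A[d_1..d_N] = sum_z weights[z] * prod_i factors[z][i][d_i] (0-based indices)."""
    N = len(factors[0])
    A = {}
    for idx in product(range(M), repeat=N):
        total = 0.0
        for a_z, vecs in zip(weights, factors):
            term = a_z
            for i, d in enumerate(idx):
                term *= vecs[i][d]
            total += term
        A[idx] = total
    return A

# Hypothetical rank-2 example with N = 3 modes of dimension M = 2.
weights = [0.5, 2.0]
factors = [
    [[1.0, 0.0], [1.0, 1.0], [2.0, 1.0]],    # factor vectors of the first rank-1 term
    [[0.0, 1.0], [1.0, -1.0], [1.0, 3.0]],   # factor vectors of the second rank-1 term
]
A = cp_tensor(weights, factors, M=2)

full_entries = len(A)                                                 # M**N
cp_params = len(weights) + sum(len(v) for f in factors for v in f)    # Z + Z*N*M
print(A[(0, 1, 0)], A[(1, 1, 1)], full_entries, cp_params)            # 1.0 -6.0 8 14
```

In this toy case the CP form has more parameters than the full tensor, but its size grows as Z + Z·N·M, whereas the full tensor grows as M^N, so the saving appears as N increases.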
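Another decomposition is hierarchical and known as the Hierarchical Tucker (HT) decomposition [Hackbusch2014]. Unlike the CP decomposition, which combines vectors into a higher-order tensor in a single step, the HT decomposition follows a tree structure. It combines vectors into matrices, combines these matrices into 4th-order tensors, and so on recursively in a hierarchical manner. Specifically, the recursive formula of the HT decomposition for a tensor $\mathcal{A} \in \mathbb{R}^{M \times \cdots \times M}$ is described as follows, where $N = 2^L$,
(2) $\begin{aligned} \phi^{1,j,\gamma} &= \sum_{\alpha=1}^{r_0} a_\alpha^{1,j,\gamma} \, \mathbf{a}^{0,2j-1,\alpha} \otimes \mathbf{a}^{0,2j,\alpha} \\ \phi^{l,j,\gamma} &= \sum_{\alpha=1}^{r_{l-1}} a_\alpha^{l,j,\gamma} \, \phi^{l-1,2j-1,\alpha} \otimes \phi^{l-1,2j,\alpha} \\ \mathcal{A} &= \sum_{\alpha=1}^{r_{L-1}} a_\alpha^{L} \, \phi^{L-1,1,\alpha} \otimes \phi^{L-1,2,\alpha} \end{aligned}$
where the parameters of the decomposition are the vectors $\{\mathbf{a}^{l,j,\gamma} \in \mathbb{R}^{r_{l-1}}\}$ for $l \in \{0, \ldots, L-1\}$, $j \in [N/2^l]$, $\gamma \in [r_l]$ (with $r_{-1} := M$) and the top-level vector $\mathbf{a}^{L} \in \mathbb{R}^{r_{L-1}}$, and the scalars $r_0, \ldots, r_{L-1}$ are referred to as the ranks of the decomposition. Similar to the CP decomposition, any tensor can be converted to an HT decomposition with only a polynomial increase in the number of parameters.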
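The tree structure can be sketched for the smallest deep case, assuming N = 4 modes (two levels) and ranks of 2 at both levels. Vectors are first combined pairwise into order-2 tensors, and those are then combined into the order-4 tensor. The helper names and all numeric values below are hypothetical and chosen only to illustrate the recursion.

```python
def vec(values):
    """Order-1 tensor stored as {(d,): value}, with 0-based indices."""
    return {(d,): v for d, v in enumerate(values)}

def outer(a, b):
    """Tensor product of two tensors stored as {index_tuple: value} dicts."""
    return {ia + ib: va * vb for ia, va in a.items() for ib, vb in b.items()}

def combine(coeffs, tensors):
    """Weighted sum of equally shaped tensors: sum_alpha coeffs[alpha] * tensors[alpha]."""
    out = {}
    for c, t in zip(coeffs, tensors):
        for idx, val in t.items():
            out[idx] = out.get(idx, 0.0) + c * val
    return out

# Level 0: for each of the 4 modes, two vectors of dimension M = 2.
level0 = [
    [vec([1.0, 0.0]), vec([0.0, 1.0])],    # mode 1
    [vec([1.0, 1.0]), vec([1.0, -1.0])],   # mode 2
    [vec([2.0, 1.0]), vec([0.0, 1.0])],    # mode 3
    [vec([1.0, 2.0]), vec([1.0, 0.0])],    # mode 4
]
# Level 1: for each position j, one weight vector per output channel.
level1_weights = [
    [[1.0, 0.5], [0.0, 1.0]],   # j = 0 merges modes 1 and 2
    [[1.0, 1.0], [2.0, 0.0]],   # j = 1 merges modes 3 and 4
]
top_weights = [1.0, -1.0]       # top-level weight vector

# Level 1 tensors (order 2): weighted sums of pairwise outer products of level-0 vectors.
phi = [
    [combine(w, [outer(level0[2 * j][a], level0[2 * j + 1][a]) for a in range(2)])
     for w in level1_weights[j]]
    for j in range(2)
]
# Top level (order 4): weighted sum of outer products of the two level-1 tensors.
A = combine(top_weights, [outer(phi[0][a], phi[1][a]) for a in range(2)])

print(len(A), len(next(iter(A))))          # 16 4
print(A[(0, 0, 0, 0)], A[(1, 1, 1, 1)])    # 2.0 3.0
```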
2.2 Convolutional Arithmetic Circuits
In 2016, Cohen et al. proposed a family of models called Convolutional Arithmetic Circuits (ConvACs) [Cohen et al.2016b]. ConvACs are convolutional networks with a particular choice of nonlinearities: the pointwise activations are linear (as opposed to sigmoid), and the pooling operators are based on products (as opposed to max or average). ConvACs are related to many mathematical fields (tensor analysis, measure theory, functional analysis, theoretical physics, graph theory and more), which makes them especially amenable to theoretical analysis.
In [Cohen et al.2016a], Cohen et al. represent the input signal $X$ by a sequence of low-dimensional local structures $X = (\mathbf{x}_1, \ldots, \mathbf{x}_N)$. $X$ is typically considered as an image, where each local structure $\mathbf{x}_i$ corresponds to a local patch from that image. Mixture models are one of the simplest forms of ConvACs. The probability distribution of a mixture model is defined by the convex combination of $M$ mixing components $\{P(\mathbf{x} \mid d; \theta_d)\}_{d=1}^{M}$ (e.g. Normal distributions): $P(\mathbf{x}) = \sum_{d=1}^{M} P(d) \, P(\mathbf{x} \mid d; \theta_d)$. Mixture models are easy to learn, and many of them can approximate any probability distribution given enough components, which makes them suitable for various tasks.
Formally, for all $i \in [N]$ there exists $d_i \in [M]$ such that $\mathbf{x}_i \sim P(\mathbf{x} \mid d_i; \theta_{d_i})$, where $d_i$ is a hidden variable representing the matching component of the $i$-th local structure. So, the probability density of sampling $X$ is described by:
(3) $P(X) = \sum_{d_1, \ldots, d_N = 1}^{M} P(d_1, \ldots, d_N) \prod_{i=1}^{N} P(\mathbf{x}_i \mid d_i; \theta_{d_i})$
where $P(d_1, \ldots, d_N)$ represents the prior probability of assigning components $d_1, \ldots, d_N$ to their respective local structures. As with typical mixture models, any probability density function $P(X)$ could be approximated arbitrarily well by (3) as $M \to \infty$. The prior probability $P(d_1, \ldots, d_N)$ can be represented by a tensor $\mathcal{A}$ (the priors tensor), which is of order $N$ with dimension $M$ in each mode.
3 Information Analysis of ConvACs
We analyze the ConvACs using information theory. The analysis covers both the activation function and the network structure. With tensor algebra, ConvACs can be used to represent both shallow and deep networks.
3.1 Information Theory
We will need the following three definitions to calculate the entropy [Jacobs2007].
Definition 1.
Conditioning does not increase entropy,
(4) $H(X \mid Y) \le H(X)$
with equality if and only if $X$ and $Y$ are independent of each other, where $H(X)$ and $H(X \mid Y)$ are the entropy of $X$ and the conditional entropy of $X$ given $Y$, respectively.
Definition 2.
Independence bound on information entropy. Given random variables $X_1, \ldots, X_n$ with a joint distribution $p(x_1, \ldots, x_n)$, then
(5) $H(X_1, \ldots, X_n) \le \sum_{i=1}^{n} H(X_i)$
with equality if and only if $X_1, \ldots, X_n$ are independent of each other, where $H(X_i)$ and $H(X_1, \ldots, X_n)$ are the entropy of $X_i$ and the joint entropy of $X_1, \ldots, X_n$, respectively.
Definition 3.
Mutual Information. Given any two random variables $X$ and $Y$ with a joint distribution $p(x, y)$, their Mutual Information is defined as:
(6) $I(X; Y) = H(X) - H(X \mid Y)$
where $H(X)$ and $H(X \mid Y)$ are the entropy of $X$ and the conditional entropy of $X$ given $Y$, respectively.
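The three statements above can be checked numerically on a small example. The sketch below uses a hypothetical joint distribution over two binary variables (the values are an assumption made for illustration) and computes the conditional entropy through the chain rule H(X|Y) = H(X,Y) - H(Y).

```python
import math

def entropy(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution P(X = x, Y = y) over two binary variables.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

p_x = [sum(p for (x, _), p in joint.items() if x == v) for v in (0, 1)]
p_y = [sum(p for (_, y), p in joint.items() if y == v) for v in (0, 1)]

h_x, h_y = entropy(p_x), entropy(p_y)
h_xy = entropy(joint.values())     # joint entropy H(X, Y)
h_x_given_y = h_xy - h_y           # chain rule: H(X|Y) = H(X, Y) - H(Y)
mutual_info = h_x - h_x_given_y    # Definition 3

print(f"H(X)={h_x:.4f}  H(X|Y)={h_x_given_y:.4f}  H(X,Y)={h_xy:.4f}  I(X;Y)={mutual_info:.4f}")
```

Because X and Y are dependent in this example, both inequalities are strict and the mutual information is positive.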
3.2 Activation Function Analysis
First, we analyze how the activation function affects information entropy. We derive two theorems by analyzing the sigmoid function and the rectified linear unit (ReLU) function [He et al.2015].
Theorem 1.
Sigmoid function reduces the information entropy. Given the probability density function of , the sigmoid function , we can get:
(7) 
where , are the information entropy of and , respectively.
Proof.
By probability theory, we can get the probability density function of
(8) 
where is the inverse function. Then, the entropy of :
where , and , are the information entropy of and , respectively. ∎
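The entropy reduction stated in Theorem 1 can be checked numerically under explicit assumptions. The sketch below assumes a standard normal input and uses differential entropy in nats; both choices are made here for illustration and are not taken from the paper. For a strictly increasing differentiable map g, the change-of-variables rule gives h(g(X)) = h(X) + E[ln g'(X)]. For the sigmoid, g'(x) = g(x)(1 - g(x)) is at most 1/4, so the correction term is at most ln(1/4) and is therefore negative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gaussian_pdf(x, sigma=1.0):
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def expected_log_slope(sigma=1.0, lo=-12.0, hi=12.0, steps=20000):
    """Trapezoidal estimate of E[ln g'(X)] for X ~ N(0, sigma^2) and g = sigmoid."""
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * dx
        s = sigmoid(x)
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * gaussian_pdf(x, sigma) * math.log(s * (1.0 - s))
    return total * dx

h_x = 0.5 * math.log(2.0 * math.pi * math.e)   # differential entropy of N(0, 1)
delta = expected_log_slope()                   # equals h(Y) - h(X) for Y = sigmoid(X)
h_y = h_x + delta
print(f"h(X)={h_x:.4f}  h(Y)={h_y:.4f}  change={delta:.4f}")
```

The computed change is below ln(1/4), consistent with the bound on the sigmoid's slope.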
Theorem 2.
ReLU function doesn’t change the information entropy when used in probability models. Given the probability density function of , the ReLU function , we can get:
(9) 
where , are the information entropy of and , respectively.
Proof.
Our models are probability models, so the must be a nonnegative number. The ReLU function can therefore be turned into , which is linear and doesn’t change the information entropy. ∎
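The argument of this proof can be illustrated directly: on nonnegative inputs ReLU acts as the identity, so a probability vector and its entropy pass through unchanged. The probability values below are hypothetical.

```python
import math

def relu(v):
    return max(0.0, v)

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

probs = [0.5, 0.25, 0.125, 0.125]   # hypothetical layer output, already a distribution
after = [relu(p) for p in probs]

print(after == probs, entropy(probs), entropy(after))   # True 1.75 1.75
```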
3.3 CP-Model Analysis
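Secondly, by representing the priors tensor $\mathcal{A}$ according to the CP decomposition (1), we can get the following equation, called the CP-model [Sharir et al.2017]:
(10) $P(X) = \sum_{z=1}^{Z} a_z \prod_{i=1}^{N} \sum_{d=1}^{M} a_d^{z,i} \, P(\mathbf{x}_i \mid d; \theta_d)$
where $a_z$ and $a_d^{z,i}$ are the weights of the CP decomposition of $\mathcal{A}$. As shown in Fig. 1, the CP-model is a shallow network. The first layer is a representation layer, followed by a convolutional layer, then a global pooling layer, and finally the output layer. For the CP-model, (10) can be decomposed into three equations:
(11) $P(\mathbf{x}_i \mid z) = \sum_{d=1}^{M} a_d^{z,i} \, P(\mathbf{x}_i \mid d; \theta_d)$
(12) $P(X \mid z) = \prod_{i=1}^{N} P(\mathbf{x}_i \mid z)$
(13) $P(X) = \sum_{z=1}^{Z} a_z \, P(X \mid z)$
That is to say, the CP-model is composed of two mixture models, (11) and (13), and a continued product (12). We first introduce a definition about the continued product.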
Definition 4.
The continued product doesn’t change information entropy, i.e. for the following equation:
(14) 
the relationship of information entropy is:
(15) 
where , , are the information entropy of and , respectively.
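The three-stage structure of the CP-model (a per-position mixture, a continued product, and an outer mixture) can be sketched as follows. The sketch assumes the standard CP-model form, in which the priors tensor is a weighted sum of rank-1 terms, and compares the staged evaluation against the direct sum over all component assignments. Function names and all numeric values are hypothetical.

```python
from itertools import product
import math

def cp_model(likelihoods, weights, factors):
    """Evaluate the CP-model in three stages.

    likelihoods[i][d] : likelihood of local structure i under component d
    weights[z]        : top-level mixture weights
    factors[z][i][d]  : per-channel, per-position mixture weights
    """
    total = 0.0
    for a_z, vecs in zip(weights, factors):
        # Stage 1: one mixture model per local structure.
        mixed = [sum(w * p for w, p in zip(vecs[i], likelihoods[i]))
                 for i in range(len(likelihoods))]
        # Stage 2: continued product over the local structures.
        pooled = math.prod(mixed)
        # Stage 3: outer mixture over the channels.
        total += a_z * pooled
    return total

def full_tensor_model(likelihoods, weights, factors):
    """Same density via the explicit priors tensor: M**N terms instead of Z*N*M."""
    N, M = len(likelihoods), len(likelihoods[0])
    total = 0.0
    for idx in product(range(M), repeat=N):
        prior = sum(a_z * math.prod(vecs[i][d] for i, d in enumerate(idx))
                    for a_z, vecs in zip(weights, factors))
        total += prior * math.prod(likelihoods[i][d] for i, d in enumerate(idx))
    return total

# Hypothetical values: N = 3 local structures, M = 2 components, Z = 2 channels.
likelihoods = [[0.9, 0.2], [0.3, 0.7], [0.5, 0.5]]
weights = [0.6, 0.4]
factors = [
    [[0.5, 0.5], [0.2, 0.8], [1.0, 0.0]],
    [[0.1, 0.9], [0.7, 0.3], [0.4, 0.6]],
]
p_cp = cp_model(likelihoods, weights, factors)
p_full = full_tensor_model(likelihoods, weights, factors)
print(p_cp, p_full)
```

Both routes give the same density up to floating-point rounding, but the staged evaluation never materializes the order-N priors tensor.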
3.4 HT-Model Analysis
Finally, having analyzed the CP-model, which is a shallow network, we turn to the deep case. By the HT decomposition (2), we can get the deep network called the HT-model, shown in Fig. 2 [Sharir et al.2017]. The HT-model can be represented by the following equation:
(20) 
where ,, presents the th component in the th layer.
Each layer in the HT-model can be seen as a CP-model (). By the conclusion of the last subsection, we can get the following equations:
(21) 
where represents the entropy of each channel () of each layer (), . We naturally derive the following equation:
(22) 
Combining (19) and (22), we prove that the information entropy increases when information is transmitted through the ConvACs. In the next section, we will focus on the concrete information scaling law.
4 Information Scaling Law of ConvACs
We have now shown that information loss exists in the ConvACs. In this section, we propose the information scaling law of the ConvACs.
For the graphical description of the HT-model shown in Fig. 3, the information loss can be analyzed as follows. We call the operation (the dotted box shown in Fig. 3) a fusion. A fusion can be represented as follows.
(23) 
(24) 
(25) 
where ,, presents the th component in the th layer.
Theorem 3.
The operation we call a mapping of probability space is shown in Fig. 5. The relationship of information entropy between and can be described by the following formula:
(26) 
Proof.
If matrix is invertible, we can get:
and by Definition 3, we can get:
(31) 
(32) 
i.e. if matrix is invertible, there is no information loss. So we can propose a reasonable assumption:
Assumption 1.
For (26), when is a singular matrix, there exists maximum information loss, i.e. the uncertainty of is larger than . The information entropy increases with the increase of uncertainty, so we can get the following formula:
(33) 
and another formula with the same meaning:
(34) 
where , is the information entropy of and , respectively. And represents the maximum information entropy increase, and represents the maximum gain ratio of information entropy.
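The contrast between an invertible and a singular mapping can be illustrated on a toy example. The sketch below assumes the mapping acts on a probability vector as a matrix-vector product with a stochastic matrix. This is an assumption made here for illustration, since the paper's mapping may differ in form. A permutation matrix (invertible) only relabels outcomes and leaves the entropy unchanged, while the uniform averaging matrix (rank one, hence singular) sends every input to the uniform distribution and attains the maximum entropy. All values are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def push_forward(matrix, p):
    """Matrix-vector product y = W p for a row-major list-of-lists matrix."""
    return [sum(w * q for w, q in zip(row, p)) for row in matrix]

p = [0.5, 0.25, 0.125, 0.125]           # hypothetical input distribution

permutation = [[0, 1, 0, 0],            # invertible: only relabels the outcomes
               [0, 0, 1, 0],
               [0, 0, 0, 1],
               [1, 0, 0, 0]]
averaging = [[0.25] * 4 for _ in range(4)]   # rank one, hence singular

h_in = entropy(p)
h_perm = entropy(push_forward(permutation, p))
h_avg = entropy(push_forward(averaging, p))
print(h_in, h_perm, h_avg)              # 1.75 1.75 2.0
```

The averaging matrix is only one singular choice, picked because its output entropy reaches the upper bound of log2(4) = 2 bits. It illustrates the worst case the assumption refers to rather than the behavior of every singular matrix.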
We now present the information scaling law of the ConvACs.
Theorem 4.
Information scaling law of the HT-model:
(35) 
(36) 
where is the information entropy of , and is the information entropy of local structure . is the maximum information entropy increase of a mapping of probability space, and is the maximum gain ratio of information entropy of a mapping of probability space.
Proof.
As mentioned before, a fusion can be described by (23), (24) and (25). By Assumption 1, from (24) and (25), we can get:
(37) 
(38) 
From (23), by Definition 4, we can get:
(39) 
Combining (38) and (39), we can get the information scaling law of a fusion:
(40) 
(41) 
As shown in Fig. 3, the ConvACs have layers, and the th layer has fusions. So we get:
(42) 
(43) 
We get the following equations by and Assumption 1:
(44) 
(45) 
Combining (42) with (44) and (43) with (45), respectively, we can get the information scaling law of the HT-model:
(46) 
(47) 
∎
From Assumption 1 and (10), the following theorem about the CP-model is a direct result of Theorem 4.
Theorem 5.
Information scaling law of the CP-model:
(48) 
(49) 
5 Conclusion
In this paper, we have revealed an information scaling law of DNNs. First, we converted complex DNNs into rigorous mathematical formulas by exploiting ConvACs, since a proper mathematical representation makes it convenient to explore the inner organization of DNNs. The resulting information scaling law allows us to understand complicated DNNs more concretely from the perspective of information theory and provides design suggestions for DNNs.
References

[Alain and Bengio2016] Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. 2016.
[Bietti and Mairal2017] Alberto Bietti and Julien Mairal. Invariance and stability of deep convolutional representations. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6211–6221. Curran Associates, Inc., 2017.
[Cohen et al.2016a] Nadav Cohen, Or Sharir, and Amnon Shashua. Deep simnets. In Computer Vision and Pattern Recognition, pages 4782–4791, 2016.
[Cohen et al.2016b] Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. Computer Science, 2016.
[Cohen et al.2017] Nadav Cohen, Or Sharir, Yoav Levine, Ronen Tamari, David Yakira, and Amnon Shashua. Analysis and design of convolutional networks via hierarchical tensor decompositions. 2017.
[Friedland and Krell2017] Gerald Friedland and Mario Krell. A capacity scaling law for artificial neural networks. 2017.
[Hackbusch2014] Wolfgang Hackbusch. Tensor spaces and numerical tensor calculus (Springer Series in Computational Mathematics). 2014.
[He et al.2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. pages 1026–1034, 2015.
[He et al.2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, pages 770–778, 2016.
[Jacobs2007] Konrad Jacobs. Elements of information theory. Optica Acta International Journal of Optics, 39(7):1600–1601, 2007.
[Lecun et al.2015] Yann Lecun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
[MacKay2003] D. J. C. MacKay. Information theory, inference, and learning algorithms. Cambridge University Press, 2003.
[Moskvina and Liu2016] Anastasia Moskvina and Jiamou Liu. How to build your network? A structural analysis. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9–15 July 2016, pages 2597–2603, 2016.
[Setiono and Liu1995] Rudy Setiono and Huan Liu. Understanding neural networks via rule extraction. In International Joint Conference on Artificial Intelligence, pages 480–485, 1995.
[Sharir et al.2017] Or Sharir, Ronen Tamari, Nadav Cohen, and Amnon Shashua. Tractable generative convolutional arithmetic circuits. 2017.
[ShwartzZiv and Tishby2017] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. 2017.
[Tishby et al.2000] Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. University of Illinois, 411(2930):368–377, 2000.
[Vapnik and Chervonenkis2015] V. N. Vapnik and A. Ya. Chervonenkis. On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities. Springer International Publishing, 2015.
[Vapnik et al.1994] Vladimir Vapnik, Esther Levin, and Yann Le Cun. Measuring the VC-dimension of a learning machine. MIT Press, 1994.
[Vapnik2000] Vladimir N. Vapnik. The nature of statistical learning theory. Springer, 2000.
[ZhiHua Zhou2017] Zhi-Hua Zhou and Ji Feng. Deep forest: Towards an alternative to deep neural networks. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 3553–3559, 2017.