Recently, DNNs have achieved great success [Lecun et al.2015, He et al.2016, Zhi-Hua Zhou2017], and there are many related works trying to explain DNNs [Setiono and Liu1995, Moskvina and Liu2016, Bietti and Mairal2017]. However, there is no comprehensive informational interpretation of DNNs, which has promising applications in the optimization process and the internal organization of DNNs called “black boxes” [Alain and Bengio2016]. Due to the nonlinear function and the uncertain number of layers and neural units used in the DNNs, the network structure shows nonlinearity and complexity. So, it is difficult to propose the interpretation of DNNs from the perspective of information theory [Shwartz-Ziv and Tishby2017].
There are three approaches to solve the above problems. The first approach is the Vapnik-Chervonenkis (VC) dimension [Vapnik and Chervonenkis2015], which represents the largest amount of samples in a dataset divided by a hypothesis space. The VC dimension shows a hypothesis space’s powerfulness [Vapnik2000]. In [Vapnik et al.1994], the authors suggest that we should determine the VC dimension by experience, but they also state that the approach described cannot be used in neural networks because they are “beyond theory”. So far, the VC dimension can only be used to approximate neural networks. The second approach called capacity scaling law [Friedland and Krell2017]
predicts the behavior of perceptron networks by calculating the lossless memory (LM) dimension and Mackey (MK) dimension[MacKay2003]. However, this approach is based on an ideal neural network and only applies to perceptron networks. The third approach named Information Bottleneck (IB) theoretical bound is a technique in information theory introduced by [Tishby et al.2000]. In [Shwartz-Ziv and Tishby2017], the authors demonstrate the effectiveness of the Information Plane visualization of DNNs by the IB theoretical bound. However, the IB bound doesn’t have good performance on more complex and deeper networks. So, it is important to propose an interpretation theory that can apply to all DNNs and has more precise rules of measurement.
To address the above difficulties, we introduce the typical network models of the DNNs named ConvACs [Cohen et al.2016b]
. The ConvACs can be viewed as convolutional networks that calculate high-dimensional functions by tensor decompositions[Cohen et al.2017]. Thus, we can map the ConvACs to concrete mathematical formulas through the decomposited high-dimensional functions. Obviously, it is convenient for us to analyze the mathematical formulas by mathematical method. That is, we can analyze the complex DNNs by mathematical method, especially the information theory, which allows us to understand the complicated DNNs more concretely and provides us designing guidelines of the DNNs.
In this paper, we propose a novel information scaling law that can interpret the network from the perspective of information theory, which provides a better understanding of the expressive efficiency of the DNNs. Firstly, we show the informational interpretation of activation function. Secondly, we prove that the Information entropy increases when the information is transmitted through the ConvACs. Finally, we propose the information scaling law of ConvACs through making a reasonable assumption.
The remainder of this paper is organized as follows. Section 2 provides an introduction of tensor theory and ConvACs. Section 3 describes the informational interpretation of DNNs and demonstrates that the information entropy increases through ConvACs. In Section 4, we propose the information scaling law of ConvACs. We conclude in Section 5.
2.1 Tensor Theory
A tensor can be regarded as a multi-dimensional array , where and denotes the set . We call the number of indexing entries as modes, which is known as the order of the tensor. The number of values that each mode can take is known as the dimension of the mode. So, the tensor mentioned before is of order with dimension in -th mode. In this paper, we consider that , i.e. . The tensor product denoted by
is the basic operator in tensor analysis, which is the generalization of outer product of the vectors to tensors.
The main methods which we use from tensor theory in our paper is tensor decompositions .The rank-1 decomposition is the most common tensor decomposition, which is viewed as a CANDECOMP/PARAFAC (CP) decomposition. Similar to the matrix decoposition, the rank- CP decomposition of a tensor can be represented as:
where and are the parameters.
Another decomposition is hierarchical and known as the Hierarchical Tucker (HT) decomposition [Hackbusch2014]. Unlike the CP decomposition combining vectors into higher order tensors with only one step, the HT decomposition follows a tree structure. It combines vectors into matrices, and combines these matrices into 4th ordered tensor and so on recursively in a hierarchically way. Specifically, the recursive formula of the HT decomposition for a tensor is described as follows, where ,
where the parameters of the decomposition are the vectors and the top level vector , and the scalars are referred to as the ranks of the decomposition. Similar to the CP decomposition, any tensor can be converted to an HT decomposition by only a polynomial increase in the number of parameters.
2.2 Convolutional Arithmetic Circuits
In 2016, Cohen and Sharir raised a family of models called Convolutional Arithmetic Circuits (ConvACs) [Cohen et al.2016b]. The ConvACs are convolutional networks which have a particular choice of non-linearities. They set point-wise activations to be linear (as opposed to sigmoid), and the pooling operators are based on product (as opposed to max or average). In fact, The ConvACs is related to many mathematical fields (tensor analysis, measure theory, functional analysis, theoretical physics, graph theory and more), which makes them especially amendable to theoretical analysis.
In [Cohen et al.2016a], Cohen represents the input signal by a sequence of low-dimensional local structures . is typically considered as an image, where each local structure
corresponds to a local patch from that image. Mixture models is one of the simplest forms of ConvACs. The probability distribution of mixture models is defined by the convex combination ofmixing components
(e.g. Normal distributions):. It is easy to learn mixture models, and many of them can be used to approximate any probability distribution if given enough numbers of components, which makes them suitable for various tasks.
Formally, for all there exists such that , where is a hidden variable presenting the matching component of the -th local structure. So, the probability density of sampling is described by:
represents the prior probability of assigning componentsto their rspective local structures
. As with typical mixture models, any probability density functioncould be approximated arbitrarily well by (3) as . The prior probability can be represented by a tensor , which is of order N with dimension M in each mode.
3 Information Analysis of ConvACs
We analyze the ConvACs by the information theory. The analysis include the activation function and network’s structure. With the tensor algebra, the ConvACs can be used to represent both shallow and deep networks.
3.1 Information Theory
We will need the following three definitions to calculate the entropy [Jacobs2007].
Conditioning does not increase entropy,
with equality if and only if X and Y are independent of each other. Where and are the entropy and conditonal entropy of and , respectively.
Mutual Information. Given any two random variables, and , with a joint distribution , their Mutual Information is defined as:
where and are the entropy and conditonal entropy of and , respectively.
3.2 Activation Function Analysis
Sigmoid function reduces the information entropy. Given the probability density function of , the sigmoid function , we can get:
where , are the information entropy of and , respectively.
By the probability Theory, we can get the probability density function of
where is the inverse function. Then, the entorpy of :
where , and , are the information entropy of and , respectively. ∎
ReLU function doesn’t change the information entropy when used in probability models. Given the probability density function of , the ReLU function , we can get:
where , are the information entropy of and , respectively.
Our models are probability models, the must be nonnegative number. So the ReLU function could be turn into , which is linear and doesn’t change the information entropy. ∎
3.3 CP-Model Analysis
Secondly, by representing the priors tensor according to the CP decomposition, we can get the following equation called CP-model [Sharir et al.2017]:
As shown in Fig. 1, the CP-model is a shallow network. The first layer is a representation layer followed by a convolutional layer, and then a global pooling layer, finally is the output layer. For CP-model, (10) can be decomposed into three equations:
The continued product doesn’t change information entropy. i.e. for following equation:
the relationship of information entropy is:
where , , are the information entropy of and , respectively.
3.4 HT-Model Analysis
Finally, we analyze the CP-model, which is a shallow network. By the HT decomposition (3), we can get the deep network called HT-model shown in Fig. 2 [Sharir et al.2017]. And the HT-model can be presented by following equation:
where ,, presents the -th component in the -th layer.
Each layer in the HT-model can be seen as a CP-model (). We can get following equations by the conclusion in last subsection:
where represents the entropy of the each channel () of each layer (), . We naturally derive the following equation:
4 Information Scaling Law of ConvACs
Now, we prove that the loss of information exists in the ConvACs. In this section, we will propose the information scaling law of the ConvACs.
For the graphical description of HT-model shown in Fig. 3, the information loss could be analysized as followling. We call the operation (the dotted box shown in Fig. 3) as a fusion. A fusion can be presented as follows.
where ,, presents the -th component in the -th layer.
The operation we call a mapping of probability space is shown in Fig. 5, the relationship of information entropy between and can be described by following fomular:
If matrix is invertible, we can get:
and by the Definition 3, we can get:
i.e. if matrix is invertible, there is no information loss. So we can propose a reasonable assumption:
For (26), when is a singular matrix, there exists maximum information loss. i.e. the uncertainty of is larger than . And the information entropy increases with the increase of uncertainty. So we can get the following fomular:
and another fomular for the same meaning:
where , is the information entropy of and , respectively. And presents the maximum information entropy increase, and presents the maximum gain ratio of information entropy.
We now present that the information scaling law of the ConvACs.
Information scaling law of HT-model:
where is the information entropy of , and is the information entropy of local strcture . is the maximum information entropy increase of a mapping of probability space, and is the maximum gain ratio of information entropy of a mapping of probability space.
From (23), by Definition 4, we can get:
As shown in Fig. 3, the ConvACs has layers, and at the -th layer has fusions. So we get :
We get the following equations by and Assumption 1:
From Assumption 1 and (10), the following theorem about CP-model is a direct result of Theorem 4.
Information scaling law of CP-model:
In this paper, we have revealed an information scaling law of DNNs. At first, we convert the complex DNNs to rigorous mathematics by exploiting the ConvACs. It is convenient for us to explore the inner organization of DNNs with proper mathematical presentation. Therefore, the information scaling law allows us to understand the complicated DNNs more concretely from the perspective of information theory and provides us designing suggestions of DNNs.
Guillaume Alain and Yoshua Bengio.
Understanding intermediate layers using linear classifier probes.2016.
- [Bietti and Mairal2017] Alberto Bietti and Julien Mairal. Invariance and stability of deep convolutional representations. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6211–6221. Curran Associates, Inc., 2017.
- [Cohen et al.2016a] Nadav Cohen, Or Sharir, and Amnon Shashua. Deep simnets. In Computer Vision and Pattern Recognition, pages 4782–4791, 2016.
[Cohen et al.2016b]
Nadav Cohen, Or Sharir, and Amnon Shashua.
On the expressive power of deep learning: A tensor analysis.Computer Science, 2016.
- [Cohen et al.2017] Nadav Cohen, Or Sharir, Yoav Levine, Ronen Tamari, David Yakira, and Amnon Shashua. Analysis and design of convolutional networks via hierarchical tensor decompositions. 2017.
- [Friedland and Krell2017] Gerald Friedland and Mario Krell. A capacity scaling law for artificial neural networks. 2017.
- [Hackbusch2014] Wolfgang Hackbusch. Tensor spaces and numerical tensor calculus (springer series in computational mathematics). 2014.
[He et al.2015]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Delving deep into rectifiers: Surpassing human-level performance on imagenet classification.pages 1026–1034, 2015.
- [He et al.2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, pages 770–778, 2016.
- [Jacobs2007] Konrad Jacobs. Elements of information theory. Optica Acta International Journal of Optics, 39(7):1600–1601, 2007.
- [Lecun et al.2015] Yann Lecun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
- [MacKay2003] D. J. C. MacKay. Information theory, inference, and learning algorithms. Cambridge University Press, 2003.
Anastasia Moskvina and Jiamou Liu.
How to build your network? A structural analysis.
Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, pages 2597–2603, 2016.
- [Setiono and Liu1995] Rudy Setiono and Huan Liu. Understanding neural networks via rule extraction. In International Joint Conference on Artificial Intelligence, pages 480–485, 1995.
- [Sharir et al.2017] Or Sharir, Ronen Tamari, Nadav Cohen, and Amnon Shashua. Tractable generative convolutional arithmetic circuits. 2017.
- [Shwartz-Ziv and Tishby2017] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. 2017.
- [Tishby et al.2000] Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. University of Illinois, 411(29-30):368–377, 2000.
- [Vapnik and Chervonenkis2015] V. N. Vapnik and A. Ya. Chervonenkis. On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities. Springer International Publishing, 2015.
- [Vapnik et al.1994] Vladimir Vapnik, Esther Levin, and Yann Le Cun. Measuring the VC-dimension of a learning machine. MIT Press, 1994.
Vladimir N. Vapnik.
The nature of statistical learning theory. Springer,, 2000.
- [Zhi-Hua Zhou2017] Ji Feng Zhi-Hua Zhou. Deep forest: Towards an alternative to deep neural networks. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 3553–3559, 2017.