
# Information Scaling Law of Deep Neural Networks

With the rapid development of Deep Neural Networks (DNNs), various network models with strong computing power and impressive expressive power have been proposed. However, there is no comprehensive informational interpretation of DNNs from the perspective of information theory. Owing to the nonlinear activation functions and the varying numbers of layers and neural units used in DNNs, the network structure exhibits nonlinearity and complexity. With a typical family of DNNs named Convolutional Arithmetic Circuits (ConvACs), complex DNNs can be converted into mathematical formulas, so we can apply rigorous mathematical theory, especially information theory, to analyze them. In this paper, we propose a novel information scaling law scheme that interprets the network's inner organization by information theory. First, we give an informational interpretation of the activation function. Secondly, we prove that the information entropy increases when information is transmitted through the ConvACs. Finally, we propose the information scaling law of ConvACs by making a reasonable assumption.

10/08/2020


## 1 Introduction

Recently, DNNs have achieved great success [Lecun et al.2015, He et al.2016, Zhi-Hua Zhou2017], and many related works try to explain DNNs [Setiono and Liu1995, Moskvina and Liu2016, Bietti and Mairal2017]. However, there is no comprehensive informational interpretation of DNNs, although such an interpretation would have promising applications in the optimization process and in revealing the internal organization of these "black boxes" [Alain and Bengio2016]. Owing to the nonlinear activation functions and the varying numbers of layers and neural units used in DNNs, the network structure exhibits nonlinearity and complexity, so it is difficult to interpret DNNs from the perspective of information theory [Shwartz-Ziv and Tishby2017].

There are three approaches to the above problems. The first is the Vapnik–Chervonenkis (VC) dimension [Vapnik and Chervonenkis2015], the size of the largest set of samples that can be shattered by a hypothesis space, which measures the hypothesis space's expressive power [Vapnik2000]. In [Vapnik et al.1994], the authors suggest determining the VC dimension empirically, but they also state that the approach cannot be used for neural networks because they are "beyond theory". So far, the VC dimension can only approximate the capacity of neural networks. The second approach, the capacity scaling law [Friedland and Krell2017], predicts the behavior of perceptron networks by calculating the lossless memory (LM) dimension and the MacKay (MK) dimension [MacKay2003]. However, this approach is based on an idealized neural network and only applies to perceptron networks. The third approach, the Information Bottleneck (IB) theoretical bound, is an information-theoretic technique introduced by [Tishby et al.2000]. In [Shwartz-Ziv and Tishby2017], the authors demonstrate the effectiveness of the Information Plane visualization of DNNs based on the IB bound. However, the IB bound does not perform well on more complex and deeper networks. It is therefore important to propose an interpretation theory that applies to all DNNs and offers more precise rules of measurement.

To address the above difficulties, we adopt a typical family of DNN models named ConvACs [Cohen et al.2016b]. ConvACs can be viewed as convolutional networks that compute high-dimensional functions through tensor decompositions [Cohen et al.2017]. Thus, we can map ConvACs to concrete mathematical formulas through the decomposed high-dimensional functions, and analyze those formulas with rigorous mathematical tools, especially information theory. This allows us to understand complicated DNNs more concretely and provides design guidelines for DNNs.

In this paper, we propose a novel information scaling law that interprets the network from the perspective of information theory, providing a better understanding of the expressive efficiency of DNNs. Firstly, we give an informational interpretation of the activation function. Secondly, we prove that the information entropy increases when information is transmitted through the ConvACs. Finally, we propose the information scaling law of ConvACs by making a reasonable assumption.

The remainder of this paper is organized as follows. Section 2 provides an introduction to tensor theory and ConvACs. Section 3 describes the informational interpretation of DNNs and demonstrates that the information entropy increases through ConvACs. In Section 4, we propose the information scaling law of ConvACs. Section 5 concludes the paper.

## 2 Preliminaries

### 2.1 Tensor Theory

A tensor $\mathcal{A}\in\mathbb{R}^{M_1\times\cdots\times M_N}$ can be regarded as a multi-dimensional array $\mathcal{A}_{d_1,\dots,d_N}$, where $d_i\in[M_i]$ and $[M]$ denotes the set $\{1,\dots,M\}$. The number of indexing entries is called the number of modes, also known as the order of the tensor. The number of values that each mode can take is known as the dimension of the mode. So, the tensor above is of order $N$ with dimension $M_i$ in the $i$-th mode. In this paper, we consider $M_1=\cdots=M_N=M$, i.e. $\mathcal{A}\in\mathbb{R}^{M\times\cdots\times M}$. The tensor product, denoted by $\otimes$, is the basic operator in tensor analysis; it generalizes the outer product of vectors to tensors.

The main tool we use from tensor theory is tensor decomposition. The rank-1 decomposition is the most common tensor decomposition, and its generalization is the CANDECOMP/PARAFAC (CP) decomposition. Similar to matrix decomposition, the rank-$Z$ CP decomposition of a tensor can be represented as:

$$\mathcal{A}=\sum_{z=1}^{Z} a_z\,\mathbf{a}^{z,1}\otimes\cdots\otimes\mathbf{a}^{z,N}\;\Rightarrow\;\mathcal{A}_{d_1,\dots,d_N}=\sum_{z=1}^{Z} a_z\prod_{i=1}^{N} a^{z,i}_{d_i}, \tag{1}$$

where $a_z\in\mathbb{R}$ and $\mathbf{a}^{z,i}\in\mathbb{R}^{M}$ are the parameters.
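As an illustration (ours, not part of the original paper), the two forms in (1) can be checked numerically with numpy; all sizes and variable names below are arbitrary:

```python
import numpy as np

# Rank-Z CP decomposition of an order-N tensor, as in (1):
# A = sum_z a_z * a^{z,1} (x) ... (x) a^{z,N}
rng = np.random.default_rng(0)
Z, N, M = 3, 4, 5
a = rng.normal(size=Z)                 # weights a_z
factors = rng.normal(size=(Z, N, M))   # factors[z, i] is the vector a^{z,i}

# Build the full tensor by summing outer products of the factor vectors.
A = np.zeros((M,) * N)
for z in range(Z):
    outer = factors[z, 0]
    for i in range(1, N):
        outer = np.multiply.outer(outer, factors[z, i])
    A += a[z] * outer

# Entry-wise form of (1): A[d1,...,dN] = sum_z a_z * prod_i a^{z,i}_{d_i}
d = (1, 0, 3, 2)
entry = sum(a[z] * np.prod([factors[z, i, d[i]] for i in range(N)])
            for z in range(Z))
assert np.isclose(A[d], entry)
```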

Another decomposition, known as the Hierarchical Tucker (HT) decomposition [Hackbusch2014], is hierarchical. Unlike the CP decomposition, which combines vectors into a higher-order tensor in a single step, the HT decomposition follows a tree structure: it combines vectors into matrices, combines these matrices into order-4 tensors, and so on recursively. Specifically, the recursive formula of the HT decomposition for a tensor is described as follows, where $L=\log_2 N$:

$$\begin{aligned}
\phi^{1,j,\gamma}&=\sum_{\alpha=1}^{r_0} a^{1,j,\gamma}_{\alpha}\,\mathbf{a}^{0,2j-1,\alpha}\otimes\mathbf{a}^{0,2j,\alpha},\\
&\;\;\vdots\\
\phi^{l,j,\gamma}&=\sum_{\alpha=1}^{r_{l-1}} a^{l,j,\gamma}_{\alpha}\,\underbrace{\phi^{l-1,2j-1,\alpha}}_{\text{order }2^{l-1}}\otimes\underbrace{\phi^{l-1,2j,\alpha}}_{\text{order }2^{l-1}},\\
&\;\;\vdots\\
\phi^{L-1,j,\gamma}&=\sum_{\alpha=1}^{r_{L-2}} a^{L-1,j,\gamma}_{\alpha}\,\underbrace{\phi^{L-2,2j-1,\alpha}}_{\text{order }N/4}\otimes\underbrace{\phi^{L-2,2j,\alpha}}_{\text{order }N/4},\\
\mathcal{A}&=\sum_{\alpha=1}^{r_{L-1}} a^{L}_{\alpha}\,\underbrace{\phi^{L-1,1,\alpha}}_{\text{order }N/2}\otimes\underbrace{\phi^{L-1,2,\alpha}}_{\text{order }N/2},
\end{aligned}\tag{2}$$

where the parameters of the decomposition are the vectors $\{\mathbf{a}^{0,j,\alpha}\}$, the scalars $\{a^{l,j,\gamma}_{\alpha}\}$ and the top-level vector $\mathbf{a}^{L}$, and the scalars $r_0,\dots,r_{L-1}$ are referred to as the ranks of the decomposition. Similar to the CP decomposition, any tensor can be converted to an HT decomposition with only a polynomial increase in the number of parameters.
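A minimal numpy sketch of the recursion in (2) for the smallest deep case, $N=4$ (so $L=2$); the ranks, sizes and names are illustrative assumptions of ours:

```python
import numpy as np

# HT decomposition (2) for N = 4 (L = log2 N = 2), all modes of dimension M.
rng = np.random.default_rng(1)
M, r0, r1 = 3, 2, 2

a0 = rng.normal(size=(4, r0, M))   # level-0 vectors a^{0,j,alpha} in R^M
a1 = rng.normal(size=(2, r1, r0))  # scalars a^{1,j,gamma}_alpha
aL = rng.normal(size=r1)           # top-level vector a^L

# phi^{1,j,gamma} = sum_alpha a^{1,j,gamma}_alpha * a^{0,2j-1,alpha} (x) a^{0,2j,alpha}
phi1 = np.zeros((2, r1, M, M))
for j in range(2):
    for g in range(r1):
        for al in range(r0):
            phi1[j, g] += a1[j, g, al] * np.outer(a0[2 * j, al], a0[2 * j + 1, al])

# A = sum_alpha a^L_alpha * phi^{1,1,alpha} (x) phi^{1,2,alpha}: an order-4 tensor.
A = sum(aL[al] * np.multiply.outer(phi1[0, al], phi1[1, al]) for al in range(r1))
assert A.shape == (M, M, M, M)
```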

### 2.2 Convolutional Arithmetic Circuits

In 2016, Cohen et al. introduced a family of models called Convolutional Arithmetic Circuits (ConvACs) [Cohen et al.2016b]. ConvACs are convolutional networks with a particular choice of non-linearities: point-wise activations are linear (as opposed to sigmoid), and the pooling operators are based on products (as opposed to max or average). In fact, ConvACs are related to many mathematical fields (tensor analysis, measure theory, functional analysis, theoretical physics, graph theory and more), which makes them especially amenable to theoretical analysis.

In [Cohen et al.2016a], Cohen et al. represent the input signal by a sequence of low-dimensional local structures $X=(\mathbf{x}_1,\dots,\mathbf{x}_N)$, $\mathbf{x}_i\in\mathbb{R}^{s}$. $X$ is typically an image, where each local structure $\mathbf{x}_i$ corresponds to a local patch of that image. Mixture models are one of the simplest forms of ConvACs. The probability distribution of a mixture model is defined by a convex combination of $M$ mixing components $\{P(\mathbf{x}|d;\theta_d)\}_{d=1}^{M}$ (e.g. Normal distributions): $P(\mathbf{x})=\sum_{d=1}^{M}P(d)\,P(\mathbf{x}|d;\theta_d)$. Mixture models are easy to learn, and many of them can approximate any probability distribution given enough components, which makes them suitable for various tasks.

Formally, for each $i$ there exists $d_i\in[M]$ such that $\mathbf{x}_i$ is sampled from $P(\mathbf{x}|d_i;\theta_{d_i})$, where $d_i$ is a hidden variable representing the matching component of the $i$-th local structure. So, the probability density of sampling $X$ is described by:

$$P(X)=\sum_{d_1,\dots,d_N=1}^{M}P(d_1,\dots,d_N)\prod_{i=1}^{N}P(\mathbf{x}_i|d_i;\theta_{d_i}), \tag{3}$$

where $P(d_1,\dots,d_N)$ represents the prior probability of assigning components $d_1,\dots,d_N$ to their respective local structures $\mathbf{x}_1,\dots,\mathbf{x}_N$. As with typical mixture models, any probability density function could be approximated arbitrarily well by (3) as $M\to\infty$. The prior probability can be represented by a tensor $\mathcal{P}$ of order $N$ with dimension $M$ in each mode.
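The density (3) can be illustrated with a small discrete sketch (our own example, with categorical components standing in for Normal densities); it confirms that (3) defines a valid distribution when the priors tensor is normalized:

```python
import itertools
import numpy as np

# Discrete sketch of (3): N local structures, M mixing components, and each
# component P(x|d) a categorical distribution over S symbols.
rng = np.random.default_rng(2)
N, M, S = 3, 2, 4

prior = rng.random((M,) * N)
prior /= prior.sum()                     # priors tensor P(d1,...,dN)
comp = rng.random((M, S))
comp /= comp.sum(axis=1, keepdims=True)  # comp[d] = P(x|d)

def p_of_X(x):
    """P(X) = sum over (d1..dN) of P(d1..dN) * prod_i P(x_i|d_i)."""
    total = 0.0
    for d in itertools.product(range(M), repeat=N):
        total += prior[d] * np.prod([comp[d[i], x[i]] for i in range(N)])
    return total

# A valid density: P(X) sums to 1 over all possible inputs X.
mass = sum(p_of_X(x) for x in itertools.product(range(S), repeat=N))
assert np.isclose(mass, 1.0)
```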

## 3 Information Analysis of ConvACs

We analyze the ConvACs using information theory. The analysis covers the activation function and the network structure. With tensor algebra, ConvACs can represent both shallow and deep networks.

### 3.1 Information Theory

We will need the following three definitions to calculate the entropy [Jacobs2007].

###### Definition 1.

Conditioning does not increase entropy:

$$H(X|Y)\le H(X), \tag{4}$$

with equality if and only if $X$ and $Y$ are independent of each other, where $H(X)$ and $H(X|Y)$ are the entropy of $X$ and the conditional entropy of $X$ given $Y$, respectively.

###### Definition 2.

Independence bound on information entropy. Given random variables $X_1,X_2,\dots,X_n$ with a joint distribution $p(x_1,x_2,\dots,x_n)$, then

$$H(X_1,X_2,\dots,X_n)\le\sum_{i=1}^{n}H(X_i), \tag{5}$$

with equality if and only if the $X_i$ are independent of each other, where $H(X_i)$ and $H(X_1,\dots,X_n)$ are the entropy and the joint entropy, respectively.

###### Definition 3.

Mutual information. Given any two random variables $X$ and $Y$ with a joint distribution $p(x,y)$, their mutual information is defined as:

$$I(X;Y)=\sum_{x\in\mathcal{X},\,y\in\mathcal{Y}}p(x,y)\log\frac{p(x,y)}{p(x)p(y)}=H(X)-H(X|Y), \tag{6}$$

where $H(X)$ and $H(X|Y)$ are the entropy and the conditional entropy, respectively.
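These three standard facts are easy to verify numerically on a random joint distribution; the following sketch (ours, not the paper's) checks (4), (5) and (6) with base-2 entropies:

```python
import numpy as np

# Numeric check of Definitions 1-3 on a random joint distribution p(x, y).
rng = np.random.default_rng(3)
p = rng.random((4, 5))
p /= p.sum()                        # joint p(x, y)
px, py = p.sum(axis=1), p.sum(axis=0)

def H(dist):
    dist = dist[dist > 0]
    return -np.sum(dist * np.log2(dist))

H_X, H_Y, H_XY = H(px), H(py), H(p.ravel())
H_X_given_Y = H_XY - H_Y            # chain rule: H(X|Y) = H(X,Y) - H(Y)

assert H_X_given_Y <= H_X + 1e-12   # (4): conditioning can't increase entropy
assert H_XY <= H_X + H_Y + 1e-12    # (5): independence bound
I = np.sum(p * np.log2(p / np.outer(px, py)))
assert np.isclose(I, H_X - H_X_given_Y)  # (6): I(X;Y) = H(X) - H(X|Y)
```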

### 3.2 Activation Function Analysis

First, we analyze how the activation function affects information entropy. We derive two theorems by analyzing the sigmoid function and the rectified linear unit (ReLU) [He et al.2015].

###### Theorem 1.

The sigmoid function reduces information entropy. Given the probability density function $f_x(x)$ of $X$ and the sigmoid function $y=g(x)=\frac{1}{1+e^{-x}}$, we have:

$$H(Y)<H(X), \tag{7}$$

where $H(X)$, $H(Y)$ are the information entropy of $X$ and $Y$, respectively.

###### Proof.

By probability theory, the probability density function of $Y$ is

$$f_y(y)=f_x(h(y))\,|h'(y)|, \tag{8}$$

where $h=g^{-1}$ is the inverse function. Then, the entropy of $Y$ is:

$$\begin{aligned}
H(Y)&=-\int f_y(y)\log f_y(y)\,dy\\
&=-\int f_x(h(y))|h'(y)|\,\log\!\big(f_x(h(y))|h'(y)|\big)\,dy\\
&=-\int f_x(x)\log f_x(x)\,dx-\int f_x(x)\log\frac{(1+e^{-x})^2}{e^{-x}}\,dx\\
&=H(X)+C,
\end{aligned}$$

where $C=-\int f_x(x)\log\frac{(1+e^{-x})^2}{e^{-x}}\,dx<0$, since $\frac{(1+e^{-x})^2}{e^{-x}}=e^{x}+2+e^{-x}\ge 4>1$, and $H(X)$, $H(Y)$ are the information entropy of $X$ and $Y$, respectively. Hence $H(Y)<H(X)$. ∎
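The sign of the constant $C$ in this proof can be checked by Monte Carlo; the sketch below (our own, for $X$ drawn from a standard Normal as an arbitrary example) estimates $C$ and confirms it lies below $-\log 4$:

```python
import numpy as np

# Monte Carlo estimate of C = -E[ log( (1+e^{-X})^2 / e^{-X} ) ]
#                            = -E[ log(e^X + 2 + e^{-X}) ]  for X ~ N(0, 1).
# Since e^x + 2 + e^{-x} >= 4 for all x, every sample term is <= -log 4,
# so C <= -log 4 < 0 and hence H(Y) = H(X) + C < H(X).
rng = np.random.default_rng(4)
x = rng.normal(size=200_000)
C = -np.mean(np.log(np.exp(x) + 2 + np.exp(-x)))
assert C < -np.log(4) + 1e-9
```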

###### Theorem 2.

The ReLU function does not change the information entropy when used in probability models. Given the probability density function $f_x(x)$ of $X$ and the ReLU function $y=\max(0,x)$, we have:

$$H(Y)=H(X), \tag{9}$$

where $H(X)$, $H(Y)$ are the information entropy of $X$ and $Y$, respectively.

###### Proof.

Since our models are probability models, $x$ must be nonnegative. So the ReLU function reduces to $y=x$, which is linear and does not change the information entropy. ∎

### 3.3 CP-Model Analysis

Secondly, by representing the priors tensor according to the CP decomposition, we obtain the following equation, called the CP-model [Sharir et al.2017]:

$$P(X)=\sum_{d_1,\dots,d_N=1}^{M}\Big(\sum_{z=1}^{Z}a_z\prod_{i=1}^{N}a^{z,i}_{d_i}\Big)\prod_{i=1}^{N}P(\mathbf{x}_i|d_i;\theta_{d_i})=\sum_{z=1}^{Z}a_z\prod_{i=1}^{N}\sum_{d_i=1}^{M}a^{z,i}_{d_i}P(\mathbf{x}_i|d_i;\theta_{d_i}). \tag{10}$$
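The equality of the two forms in (10) is a purely algebraic identity (distributing the product over the sums); a small discrete sketch (ours, with arbitrary sizes) confirms it:

```python
import itertools
import numpy as np

# Check that the two forms of (10) agree for a discrete CP-model: the naive sum
# over all (d1,...,dN) equals the factored product-of-mixtures form.
rng = np.random.default_rng(5)
N, M, S, Z = 3, 2, 4, 2
a = rng.random(Z); a /= a.sum()                       # mixture weights a_z
w = rng.random((Z, N, M))
w /= w.sum(axis=2, keepdims=True)                     # weights a^{z,i}_{d_i}
comp = rng.random((M, S))
comp /= comp.sum(axis=1, keepdims=True)               # P(x|d)

x = (0, 2, 1)                                         # a fixed input X

# Left form: sum over d of (sum_z a_z prod_i w[z,i,d_i]) * prod_i P(x_i|d_i).
lhs = sum(
    sum(a[z] * np.prod([w[z, i, d[i]] for i in range(N)]) for z in range(Z))
    * np.prod([comp[d[i], x[i]] for i in range(N)])
    for d in itertools.product(range(M), repeat=N))

# Right form: sum_z a_z prod_i sum_{d_i} w[z,i,d_i] * P(x_i|d_i).
rhs = sum(a[z] * np.prod([(w[z, i] * comp[:, x[i]]).sum() for i in range(N)])
          for z in range(Z))
assert np.isclose(lhs, rhs)
```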

As shown in Fig. 1, the CP-model is a shallow network. The first layer is a representation layer followed by a convolutional layer, then a global pooling layer, and finally the output layer. For the CP-model, (10) can be decomposed into three equations:

$$P(X)=\sum_{z=1}^{Z}a_z\,P(X|z), \tag{11}$$
$$P(X|z)=\prod_{i=1}^{N}P(\mathbf{x}_i|z), \tag{12}$$
$$P(\mathbf{x}_i|z)=\sum_{d_i=1}^{M}a^{z,i}_{d_i}P(\mathbf{x}_i|d_i;\theta_{d_i}). \tag{13}$$

That is to say, the CP-model is composed of two mixture models, (11) and (13), and a continued product, (12). First, we introduce a definition about the continued product.

###### Definition 4.

The continued product does not change information entropy; i.e., when the following equation holds (the $\mathbf{x}_i$ are independent):

$$P(X)=\prod_{i=1}^{N}P(\mathbf{x}_i), \tag{14}$$

the relationship of information entropy is:

$$H(X)=\sum_{i=1}^{N}H(\mathbf{x}_i), \tag{15}$$

where $H(X)$ and $H(\mathbf{x}_i)$ are the information entropy of $X$ and $\mathbf{x}_i$, respectively.

According to Definition 1, Definition 4 and Theorem 3 (see Section 4), (11), (12) and (13) can be transformed into the following relations between information entropies:

$$H(X)\ge H(X|z), \tag{16}$$
$$H(X|z)=\sum_{i=1}^{N}H(\mathbf{x}_i|z), \tag{17}$$
$$H(\mathbf{x}_i|z)\ge H(\mathbf{x}_i|d). \tag{18}$$

Combining the three formulas above, we get the total impact of the CP-model on information entropy:

$$H(X)\ge H(X|z)=\sum_{i=1}^{N}H(\mathbf{x}_i|z)\ge\sum_{i=1}^{N}H(\mathbf{x}_i|d). \tag{19}$$

### 3.4 HT-Model Analysis

The CP-model analyzed above is a shallow network. Applying the HT decomposition (2) instead, we get a deep network called the HT-model, shown in Fig. 2 [Sharir et al.2017]. The HT-model can be presented by the following equations:

$$\begin{aligned}
P(\varphi^{0}_{i}|d^{0})&=P(\mathbf{x}_i|d_i;\theta_{d_i}),\\
P(\varphi^{1}_{i}|d^{1})&=\prod_{j=2i-1}^{2i}\sum_{\alpha=1}^{r_0}a^{d^{1}}_{\alpha}P(\varphi^{0}_{j}|d^{0}),\\
&\;\;\vdots\\
P(\varphi^{l}_{i}|d^{l})&=\prod_{j=2i-1}^{2i}\sum_{\alpha=1}^{r_{l-1}}a^{d^{l}}_{\alpha}P(\varphi^{l-1}_{j}|d^{l-1}),\\
&\;\;\vdots\\
P(\varphi^{L}|d^{L})&=\prod_{j=1}^{2}\sum_{\alpha=1}^{r_{L-1}}a^{d^{L}}_{\alpha}P(\varphi^{L-1}_{j}|d^{L-1}),\\
P(X)&=\sum_{\alpha=1}^{r_{L}}a_{\alpha}P(\varphi^{L}|d^{L}),
\end{aligned}\tag{20}$$

where $l\in[L]$, $L=\log_2 N$, and $\varphi^{l}_{i}$ represents the $i$-th component in the $l$-th layer.

Each layer in the HT-model can be seen as a CP-model. We get the following equations by the conclusion of the last subsection:

$$\begin{aligned}
H(\varphi^{0}_{i}|d^{0})&=H(\mathbf{x}_i|d_i),\\
H(\varphi^{1}_{i}|d^{1})&\ge H(\varphi^{0}_{2i-1}|d^{0})+H(\varphi^{0}_{2i}|d^{0}),\\
&\;\;\vdots\\
H(\varphi^{l}_{i}|d^{l})&\ge H(\varphi^{l-1}_{2i-1}|d^{l-1})+H(\varphi^{l-1}_{2i}|d^{l-1}),\\
&\;\;\vdots\\
H(\varphi^{L}|d^{L})&\ge H(\varphi^{L-1}_{1}|d^{L-1})+H(\varphi^{L-1}_{2}|d^{L-1}),\\
H(X)&\ge H(\varphi^{L}|d^{L}),
\end{aligned}\tag{21}$$

where $H(\varphi^{l}_{i}|d^{l})$ represents the entropy of the $i$-th channel of the $l$-th layer, $1\le i\le N/2^{l}$. We naturally derive the following equation:

$$H(X)\ge\sum_{i=1}^{N}H(\mathbf{x}_i|d). \tag{22}$$

Combining (19) and (22), we prove that the information entropy increases when information is transmitted through the ConvACs. In the next section, we will focus on the concrete information scaling law.

## 4 Information Scaling Law of ConvACs

We have proved above that information loss exists in the ConvACs. In this section, we propose the information scaling law of the ConvACs.

For the graphical description of the HT-model shown in Fig. 3, the information loss can be analyzed as follows. We call the operation in the dotted box of Fig. 3 a fusion. A fusion can be presented as follows:

$$P(\varphi^{l}_{i}|d^{l})=P(\varphi^{l-1}_{2i-1}|d^{l})\cdot P(\varphi^{l-1}_{2i}|d^{l}), \tag{23}$$
$$P(\varphi^{l-1}_{2i-1}|d^{l})=\sum_{\alpha=1}^{r_{l-1}}a^{d^{l}}_{\alpha}P(\varphi^{l-1}_{2i-1}|d^{l-1}), \tag{24}$$
$$P(\varphi^{l-1}_{2i}|d^{l})=\sum_{\alpha=1}^{r_{l-1}}a^{d^{l}}_{\alpha}P(\varphi^{l-1}_{2i}|d^{l-1}), \tag{25}$$

where $l\in[L]$ and $\varphi^{l}_{i}$ represents the $i$-th component in the $l$-th layer.

###### Theorem 3.

We call the operation (24), shown in Fig. 5, a mapping of probability space. The relationship between the information entropies of $\varphi^{l-1}_{2i-1}$ conditioned on $d^{l}$ and on $d^{l-1}$ can be described by the following formula:

$$H(\varphi^{l-1}_{2i-1}|d^{l})\ge H(\varphi^{l-1}_{2i-1}|d^{l-1}). \tag{26}$$
###### Proof.

As shown in Fig. 5, we further refine the dotted box of Fig. 4, i.e. (24). The operation can be written in matrix form:

$$u=A\cdot v, \tag{27}$$

where

$$u=\begin{pmatrix}P(\varphi^{l-1}_{2i-1}|d^{l}=1)\\P(\varphi^{l-1}_{2i-1}|d^{l}=2)\\\vdots\\P(\varphi^{l-1}_{2i-1}|d^{l}=r_{l})\end{pmatrix},\quad
v=\begin{pmatrix}P(\varphi^{l-1}_{2i-1}|d^{l-1}=1)\\P(\varphi^{l-1}_{2i-1}|d^{l-1}=2)\\\vdots\\P(\varphi^{l-1}_{2i-1}|d^{l-1}=r_{l-1})\end{pmatrix},$$

and $A=\big(a^{d^{l}=j}_{\alpha}\big)\in\mathbb{R}^{r_{l}\times r_{l-1}}$. From (27), the following formula is obvious:

$$P(\varphi^{l-1}_{2i-1}|d^{l}=j)=\mathbf{a}^{d^{l}=j}\cdot v=\sum_{\alpha=1}^{r_{l-1}}a^{d^{l}=j}_{\alpha}P(\varphi^{l-1}_{2i-1}|d^{l-1}=\alpha). \tag{28}$$

By Definition 1, from (28) we can get:

$$H(\varphi^{l-1}_{2i-1}|d^{l}=j)\ge H(\varphi^{l-1}_{2i-1}|d^{l-1}), \tag{29}$$

and by the definition of information entropy:

$$H(\varphi^{l-1}_{2i-1}|d^{l})=\sum_{j=1}^{r_{l}}P(d^{l}=j)\,H(\varphi^{l-1}_{2i-1}|d^{l}=j)\ge\sum_{j=1}^{r_{l}}P(d^{l}=j)\,H(\varphi^{l-1}_{2i-1}|d^{l-1})=H(\varphi^{l-1}_{2i-1}|d^{l-1}). \tag{30}$$
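The inequality (30) can be checked numerically: mixing conditional distributions with a stochastic matrix, as in (28), cannot decrease the conditional entropy. The sketch below (our own construction, with the prior over $d^{l}$ chosen arbitrarily) verifies this:

```python
import numpy as np

# Numeric check of Theorem 3: coarse-graining d^{l-1} into d^l via a stochastic
# matrix A (rows are convex weights) cannot decrease conditional entropy.
rng = np.random.default_rng(6)
r_prev, r_cur, S = 4, 3, 6           # r_{l-1}, r_l, alphabet size of phi

v = rng.random((r_prev, S))
v /= v.sum(axis=1, keepdims=True)    # v[alpha] = P(phi | d^{l-1} = alpha)
A = rng.random((r_cur, r_prev))
A /= A.sum(axis=1, keepdims=True)    # A[j, alpha] = P(d^{l-1}=alpha | d^l=j)
pj = rng.random(r_cur); pj /= pj.sum()   # prior over d^l

u = A @ v                            # u[j] = P(phi | d^l = j), as in (28)
p_alpha = pj @ A                     # induced prior over d^{l-1}

def H(dist):
    return -np.sum(dist * np.log2(dist))

H_coarse = sum(pj[j] * H(u[j]) for j in range(r_cur))        # H(phi | d^l)
H_fine = sum(p_alpha[a] * H(v[a]) for a in range(r_prev))    # H(phi | d^{l-1})
assert H_coarse >= H_fine - 1e-12
```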

If the matrix $A$ is invertible, we can get:

$$v=A^{-1}\cdot u,$$

and, by the same argument, we can get:

$$H(\varphi^{l-1}_{2i-1}|d^{l-1})\ge H(\varphi^{l-1}_{2i-1}|d^{l}), \tag{31}$$

and combining (26) and (31):

$$H(\varphi^{l-1}_{2i-1}|d^{l})=H(\varphi^{l-1}_{2i-1}|d^{l-1}), \tag{32}$$

i.e. if the matrix $A$ is invertible, there is no information loss. So we can propose a reasonable assumption:

###### Assumption 1.

For (27), when $A$ is a singular matrix, the maximum information loss occurs; i.e. the uncertainty of $u$ is larger than that of $v$, and information entropy increases with the increase of uncertainty. So we can get the following formula:

$$H(u)-H(v)\le C, \tag{33}$$

and another formula with the same meaning:

$$\frac{H(u)}{H(v)}\le\beta, \tag{34}$$

where $H(u)$, $H(v)$ are the information entropy of $u$ and $v$, respectively, $C$ presents the maximum information entropy increase, and $\beta$ presents the maximum gain ratio of information entropy.

We now present the information scaling law of the ConvACs.

###### Theorem 4.

Information scaling law of the HT-model:

$$H(X)-\sum_{i=1}^{N}H(\mathbf{x}_i|d)\le(2^{L}+2^{L-1}+\cdots+1)\,C, \tag{35}$$
$$\frac{H(X)}{\sum_{i=1}^{N}H(\mathbf{x}_i|d)}\le\beta^{L+1}, \tag{36}$$

where $H(X)$ is the information entropy of $X$, and $H(\mathbf{x}_i|d)$ is the information entropy of the local structure $\mathbf{x}_i$. $C$ is the maximum information entropy increase of a mapping of probability space, and $\beta$ is the maximum gain ratio of information entropy of a mapping of probability space.

###### Proof.

As mentioned before, a fusion can be described by (23), (24) and (25). By Assumption 1, from (24) and (25) we can get:

$$H(\varphi^{l-1}_{2i-1}|d^{l})-H(\varphi^{l-1}_{2i-1}|d^{l-1})\le C,\quad H(\varphi^{l-1}_{2i}|d^{l})-H(\varphi^{l-1}_{2i}|d^{l-1})\le C, \tag{37}$$
$$\frac{H(\varphi^{l-1}_{2i-1}|d^{l})}{H(\varphi^{l-1}_{2i-1}|d^{l-1})}\le\beta,\quad \frac{H(\varphi^{l-1}_{2i}|d^{l})}{H(\varphi^{l-1}_{2i}|d^{l-1})}\le\beta. \tag{38}$$

From (23), by Definition 4, we can get:

$$H(\varphi^{l}_{i}|d^{l})=H(\varphi^{l-1}_{2i-1}|d^{l})+H(\varphi^{l-1}_{2i}|d^{l}). \tag{39}$$

Combining (37), (38) and (39), we can get the information scaling law of a fusion:

$$H(\varphi^{l}_{i}|d^{l})-\big(H(\varphi^{l-1}_{2i-1}|d^{l-1})+H(\varphi^{l-1}_{2i}|d^{l-1})\big)\le 2C, \tag{40}$$
$$\frac{H(\varphi^{l}_{i}|d^{l})}{H(\varphi^{l-1}_{2i-1}|d^{l-1})+H(\varphi^{l-1}_{2i}|d^{l-1})}\le\beta. \tag{41}$$

As shown in Fig. 3, the ConvACs has $L$ layers, and the $l$-th layer has $N/2^{l}$ fusions. So we get:

$$H(\varphi^{L}|d^{L})-\sum_{i=1}^{N}H(\varphi^{0}_{i}|d^{0})\le(2^{L}+2^{L-1}+\cdots+2)\,C, \tag{42}$$
$$\frac{H(\varphi^{L}|d^{L})}{\sum_{i=1}^{N}H(\varphi^{0}_{i}|d^{0})}\le\beta^{L}. \tag{43}$$

We get the following equations by the last line of (20) and Assumption 1:

$$H(X)-H(\varphi^{L}|d^{L})\le C, \tag{44}$$
$$\frac{H(X)}{H(\varphi^{L}|d^{L})}\le\beta. \tag{45}$$

Combining (42) with (44) and (43) with (45), respectively, we can get the information scaling law of the HT-model:

$$H(X)-\sum_{i=1}^{N}H(\mathbf{x}_i|d)\le(2^{L}+2^{L-1}+\cdots+2+1)\,C, \tag{46}$$
$$\frac{H(X)}{\sum_{i=1}^{N}H(\mathbf{x}_i|d)}\le\beta^{L+1}. \tag{47}$$

∎

From Assumption 1 and (10), the following theorem about the CP-model is a direct result of Theorem 4.

###### Theorem 5.

Information scaling law of the CP-model:

$$H(X)-\sum_{i=1}^{N}H(\mathbf{x}_i|d)\le(N+1)\,C, \tag{48}$$
$$\frac{H(X)}{\sum_{i=1}^{N}H(\mathbf{x}_i|d)}\le\beta^{2}. \tag{49}$$

## 5 Conclusion

In this paper, we have revealed an information scaling law of DNNs. First, we convert complex DNNs into rigorous mathematics by exploiting ConvACs, which makes it convenient to explore the inner organization of DNNs with a proper mathematical representation. The resulting information scaling law allows us to understand complicated DNNs more concretely from the perspective of information theory and provides design suggestions for DNNs.

## References

• [Alain and Bengio2016] Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. 2016.
• [Bietti and Mairal2017] Alberto Bietti and Julien Mairal. Invariance and stability of deep convolutional representations. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6211–6221. Curran Associates, Inc., 2017.
• [Cohen et al.2016a] Nadav Cohen, Or Sharir, and Amnon Shashua. Deep simnets. In , pages 4782–4791, 2016.
• [Cohen et al.2016b] Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. Computer Science, 2016.
• [Cohen et al.2017] Nadav Cohen, Or Sharir, Yoav Levine, Ronen Tamari, David Yakira, and Amnon Shashua. Analysis and design of convolutional networks via hierarchical tensor decompositions. 2017.
• [Friedland and Krell2017] Gerald Friedland and Mario Krell. A capacity scaling law for artificial neural networks. 2017.
• [Hackbusch2014] Wolfgang Hackbusch. Tensor spaces and numerical tensor calculus (springer series in computational mathematics). 2014.
• [He et al.2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. pages 1026–1034, 2015.
• [He et al.2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, pages 770–778, 2016.
• [Jacobs2007] Konrad Jacobs. Elements of information theory. Optica Acta International Journal of Optics, 39(7):1600–1601, 2007.
• [Lecun et al.2015] Yann Lecun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
• [MacKay2003] D. J. C. MacKay. Information theory, inference, and learning algorithms. Cambridge University Press, 2003.
• [Moskvina and Liu2016] Anastasia Moskvina and Jiamou Liu. How to build your network? A structural analysis. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, pages 2597–2603, 2016.
• [Setiono and Liu1995] Rudy Setiono and Huan Liu. Understanding neural networks via rule extraction. In International Joint Conference on Artificial Intelligence, pages 480–485, 1995.
• [Sharir et al.2017] Or Sharir, Ronen Tamari, Nadav Cohen, and Amnon Shashua. Tractable generative convolutional arithmetic circuits. 2017.
• [Shwartz-Ziv and Tishby2017] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. 2017.
• [Tishby et al.2000] Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. University of Illinois, 411(29-30):368–377, 2000.
• [Vapnik and Chervonenkis2015] V. N. Vapnik and A. Ya. Chervonenkis. On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities. Springer International Publishing, 2015.
• [Vapnik et al.1994] Vladimir Vapnik, Esther Levin, and Yann Le Cun. Measuring the VC-dimension of a learning machine. MIT Press, 1994.
• [Vapnik2000] Vladimir N. Vapnik. The nature of statistical learning theory. Springer, 2000.
• [Zhi-Hua Zhou2017] Ji Feng Zhi-Hua Zhou. Deep forest: Towards an alternative to deep neural networks. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 3553–3559, 2017.