1 Introduction
Neural graph is the intelligence architecture characterized by both logics and neurons [Xiao2017b]. (If you are not familiar with neural graph, simply treat it as a usual neural network and ignore the algorithm-related parts of this paper.) This definition means that traditional algorithms (e.g. the max-flow method, A* search, etc.) can be embedded into neural architectures as a dynamic graph-constructing process, following the principle proposed in [Xiao2017a]. However, the landscape of the objective of neural graph is still unclear, so optimization remains an obstacle to a complete theoretical foundation. Therefore, in this paper, we analyze the convexity of neural graph and provide a transformation methodology that reforms the neural graph into a convex structure.
Mathematically, a neural graph has three kinds of components: operators, algorithms and functions. Operators are the four element-wise arithmetic operations (i.e. plus, minus, multiplication and division), convolution and matrix multiplication. Algorithms are logics-based instructions, while functions are element-wise mathematical mappings (e.g. sigmoid or tanh).
On one hand, a neural graph is organized as a graph. Each node corresponds to an operator, algorithm or function, while each edge links the inputs of an operation to its output. On the other hand, convexity is characterized by the second-order inequality shown in (6). In essence, if we iteratively ensure the second-order condition of (6) at each node along the edges, in the manner of backward propagation, the convexity of the entire graph is guaranteed. In the example of Figure 1, we first ensure the condition of (6) for the cross entropy, and then iteratively propagate the guarantee of (6) from the objective to the variables. In spirit, this process is similar to using backward propagation to work out the second-order form. Specifically, for a weight variable, satisfying (6) means that the cross-entropy objective is convex with respect to that weight.
Notably, the algorithm plays the role of selection: it constructs a graph after the forward propagation [Xiao2017a]. Thus, if the constructed graph ensures the convex condition of (6), the objective of a neural graph with algorithms is convex in each variable. Besides, neural graphs involve few element-wise minus and division operations, hence we package the corresponding parts into functions. For example, we analyze the square loss as a single function, rather than as a “square multiplication” node with a “minus” node.
In this paper, we prove that “a tree-structured neural graph is convex in each variable, when the other variables are fixed, if the function nodes satisfy the convexification inequality.” A circle in a neural graph is a structure in which several inputs of a node are functions of the same symbol. For instance, a multiplication node forms a circle when both of its inputs are functions of the same symbol. From our proofs, we conclude that non-convexity only originates from circles and non-linear functions. Thus, we propose a factor-scaling transformation, the scale mechanism, which modifies the neural graph slightly to make it convex.
Finally, as [Kawaguchi2016] suggests, we conjecture, without proof, that “every local minimum is nearly a global minimum for conventional neural graph”. Empirically, if there were indeed some poor local minima, multiple runs of a single graph would lead to many diverse results. However, nearly all neural architectures can be reproduced with their reported accuracies, which is consistent with our conjecture.
Contributions. This paper contributes in two aspects. (1.) We prove that the non-convexity of neural graph only stems from circles and non-linear functions. (2.) We provide a simple methodology that scales the non-linear functions to convexify the neural graph.
2 Related Work
Neural architectures are attractive in terms of their generalization properties [Mhaskar et al.2016], but they also introduce non-convex issues in optimization [Kawaguchi2016]. Moreover, [Murty and Kabadi1987] has proved that discovering the global minimum of a general non-convex problem is NP-complete. Thus, we intuitively expect that neural graph is nearly convex in nature, so that it can benefit from convex optimization methods such as AdaDelta [Zeiler2012].
Current research only focuses on linear or non-linear multilayer perceptrons (MLPs). For example, [Baldi1988] studies the optimization properties of shallow linear networks, while [Goodfellow et al.2016] and [Choromanska et al.2015] address the conjecture and the open problem for deep linear and non-linear networks, respectively. Further, [Kawaguchi2016] provides an answer to these issues under some assumptions.
However, practical neural graphs are more complex than simple MLPs, such as LSTM-based models [Wang et al.2017] [Yih et al.2013] [Tan et al.2015] and CNN-based methods [He et al.2015] [He et al.2016] [Yin and Schütze2015] [Romero et al.2014]. Thus, convexity analysis should be extended to these conventional situations, which is the target of this paper and has received little attention from our community.
3 Notation, Gradient and Convexity
Without loss of generality, we adopt the matrix as the basic numeric type of each node. Thus, the second-order item takes the form of a four-subscripted array. For brevity, Einstein's summation convention is applied as the protocol for tensor analysis, which specifies an equation without the sum symbol. In detail, we sum out any subscript that does not appear on the other side of the equation.
In this paper, we apply this notation by default.
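As a small illustration of the convention (not taken from the original text), matrix multiplication can be written as $C_{ij} = A_{ik} B_{kj}$: the subscript $k$ appears only on the right-hand side, so it is summed out. The sketch below spells this out with explicit loops in plain Python; the names `contract_matmul`, `A`, `B` and `C` are illustrative assumptions.

```python
def contract_matmul(a, b):
    """Evaluate C_ij = A_ik B_kj; the repeated subscript k is summed out."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


# Hypothetical 2x2 inputs, chosen only to make the summation easy to follow.
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = contract_matmul(A, B)
```

The subscripts `i` and `j` survive on both sides and therefore index the result, while `k` only exists inside the sum.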
The objective of a neural graph is a scalar. Hence, the first-order gradient of the objective with respect to a specific variable node is a matrix, while the second-order gradient is a four-subscripted array. They are defined as below:
where is the matrix of node. Based on the definition of gradient, we expand the scalar objective with Taylor series (using Einstein’s convention), such as:
Besides, we have the convex definition of neural graph of objective function as below [Boyd and Vandenberghe2013]:
which implies that the second-order item (with Einstein convention) is larger than zero as
We reform this critical second-order condition in the form of matrix as:
is a vector ofindexed by , and is a matrix of indexed by . Notably, there exist a sum symbol with the subscript , that is omitted. However, for clarity, we present the condition (6) with the full formulation as below. We call this inequality as convex condition.
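To make the convex condition concrete, the following sketch checks the second-order quadratic form numerically. It is only an illustration under assumptions: the variable is flattened into a short vector instead of a matrix, the second-order array is estimated by central differences, and the helper names (`hessian`, `quadratic_form`, `satisfies_convex_condition`) and the two toy objectives are hypothetical.

```python
def hessian(objective, point, step=1e-3):
    """Central-difference estimate of the matrix of second derivatives."""
    size = len(point)

    def shifted(i, di, j, dj):
        moved = list(point)
        moved[i] += di
        moved[j] += dj
        return objective(moved)

    return [
        [
            (shifted(i, step, j, step) - shifted(i, step, j, -step)
             - shifted(i, -step, j, step) + shifted(i, -step, j, -step))
            / (4.0 * step * step)
            for j in range(size)
        ]
        for i in range(size)
    ]


def quadratic_form(matrix, direction):
    """v_i H_ij v_j with the sums over i and j written out."""
    size = len(direction)
    return sum(
        direction[i] * matrix[i][j] * direction[j]
        for i in range(size)
        for j in range(size)
    )


def satisfies_convex_condition(objective, point, directions, tol=1e-6):
    """True if the quadratic form is non-negative for every tested direction."""
    second_order = hessian(objective, point)
    return all(quadratic_form(second_order, v) >= -tol for v in directions)


def bowl(x):
    """Toy convex objective: a sum of squares."""
    return (x[0] - 1.0) ** 2 + (x[0] + x[1]) ** 2


def saddle(x):
    """Toy non-convex objective: negative curvature along (1, -1)."""
    return x[0] * x[1]


DIRECTIONS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
POINT = [0.5, -0.3]
```

Testing a finite set of directions at a single point can only refute convexity, never prove it. The sketch is meant to show what the inequality asks for, not to replace the analysis.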
4 Proof of Convexification
In this section, we sequentially prove that operators, functions and algorithms guarantee the condition of (6), if the successor node has ensured this condition.
Theorem 1. Assuming there is no circle in the graph and the successor node ensures the condition of (6), the nodes of the element-wise plus operator, the element-wise multiplication operator, the convolution operator and the matrix multiplication operator also guarantee this condition.
Notably, we employ Einstein’s convention for a unified representation of operators, which refers to Section 3.
(1.) Firstly, we prove the case of plus. Let , we have:
Therefore, the conclusion is obvious.
(2.) Secondly, we discuss the case of element-wise multiplication. Let . Since there is no circle that is not a function of , we have:
Substitute into the right side of (6).
The inequation of (4) holds because of the arbitrariness of vector .
(3.) Thirdly, we explore the convolution. Let . Since there is no circle that is not a function of , we have:
where the convolutions of respectively make effect on the corresponding dimensions of the second-order of . Substitute into the right side of (6):
Because of the arbitrariness of vector , our conclusion of (4) is established.
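The operator cases can be sanity-checked in the scalar setting. The sketch below assumes a hypothetical convex successor node (a square loss against the label 1) and holds the other input of each operator fixed, as the no-circle assumption requires. For the multiplication node, the curvature in `x` is the successor curvature times the squared fixed weight, so it stays non-negative. All names are illustrative.

```python
def second_derivative(func, x, step=1e-3):
    """Central-difference estimate of the second derivative of func at x."""
    return (func(x + step) - 2.0 * func(x) + func(x - step)) / (step * step)


def successor(z):
    """Hypothetical convex successor node: square loss against the label 1."""
    return (z - 1.0) ** 2


def through_multiplication(weight):
    """Objective as a function of x when z = weight * x and weight is fixed."""
    return lambda x: successor(weight * x)


def through_plus(offset):
    """Objective as a function of x when z = x + offset and offset is fixed."""
    return lambda x: successor(x + offset)


WEIGHTS = [-3.0, -0.5, 0.0, 2.0]
mul_curvatures = [second_derivative(through_multiplication(w), 0.7) for w in WEIGHTS]
plus_curvatures = [second_derivative(through_plus(w), 0.7) for w in WEIGHTS]
```

A negative weight flips the sign of the first-order gradient but not of the curvature, which is why the sign of the fixed input does not matter here.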
Intuitively, this theorem states that operator nodes preserve convexity iteratively along the graph edges. Next, we prove the case of functions. Without loss of generality, we treat functions in batch mode, where each column corresponds to an independent sample. This assumption accords with practical situations and does not harm the conclusions of this paper.
Theorem 2. Assuming the successor node ensures the condition of (6) and each column corresponds to an independent sample, the function node also guarantees this condition if the convexification inequality holds:
Let , where is the element-wise function, such as . The gradients are as below:
We notate and . Because is unrelated to when , the is only non-zero regarding the position of . Also, assuming each column corresponds to an independent sample, and is only non-zero when . Thus, we reformulate the (15) given the fixed as:
Note that, satisfies (6), which implies it is a positive semi-definite matrix, because indicates one independent sample. Thus, we perform the eigenvalue decomposition as . Besides, the second item in (16) can be rewritten as a diagonal matrix . Hence, we have:
With the satisfaction of convexification inequality, we have proved this theorem. ∎
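A scalar analogue may help to see where the two items in this proof come from. By the chain rule, the second derivative of a loss composed with a function node splits into a main term (loss curvature times the squared first-order gradient of the function) and a residual term (loss gradient times the second-order gradient of the function). The sketch below assumes a sigmoid node followed by a square loss; it is not the matrix form used above, and the names `curvature_terms` and `numeric_curvature` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def curvature_terms(x, label):
    """Split d^2/dx^2 of (sigmoid(x) - label)^2 into its two chain-rule terms."""
    s = sigmoid(x)
    first = s * (1.0 - s)               # first-order gradient of sigmoid
    second = first * (1.0 - 2.0 * s)    # second-order gradient of sigmoid
    loss_first = 2.0 * (s - label)      # gradient of the square loss
    loss_second = 2.0                   # curvature of the square loss
    main = loss_second * first ** 2     # always non-negative
    residual = loss_first * second      # may take either sign
    return main, residual


def numeric_curvature(x, label, step=1e-3):
    """Finite-difference check of the same second derivative."""
    def objective(t):
        return (sigmoid(t) - label) ** 2
    return (objective(x + step) - 2.0 * objective(x) + objective(x - step)) / (step * step)
```

With the label 1, the sum of the two terms is positive at `x = 0.7` but negative at `x = -3.0`, where the residual dominates. A condition on the function node is needed precisely to keep the residual from outweighing the main term.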
Theorem 3. Algorithms do not modify the convexity of neural graph, i.e. the satisfaction of inequality (6).
Algorithms play the role of selection by dynamically constructing the neural graph. If the constructed graph satisfies the convex condition of (6), the neural graph with algorithms also satisfies it. Specifically, the constructed graph is composed only of operators and functions, which are characterized by Theorems 1 and 2. Thus, algorithms do not modify the convexity of neural graph. ∎
Theorem 4. If the objective nodes of neural graph satisfy the condition of (6) and the convexification inequality holds for every function node, the objective of the entire tree-structured neural graph is convex.
As previously stated, neural graph is composed of three parts: operators, algorithms and functions. Algorithms dynamically construct the graph and thus do not modify the convexity. Operators (i.e. element-wise plus/multiplication, convolution and matrix multiplication) guarantee the convex condition of (6) if the successor node ensures it and there is no circle. Function nodes guarantee the condition if the convexification inequality holds (Theorem 2). Specifically in this theorem, if the objective nodes ensure the condition of (6), the conclusion can be deduced iteratively from the objective to the input/variable nodes along the path, which establishes the proof. ∎
Lemma 1. The square loss satisfies the condition of (6).
The square loss is the squared difference between the output of neural graph and the corresponding label. Notably, each column corresponds to an independent sample, so the second-order gradient in matrix form is non-zero only for matching indices. The first-order gradient is proportional to the difference between the output and the label, so the second-order gradient is a positive constant for matching indices and zero otherwise. In conclusion, the condition of (6) is ensured. ∎
Lemma 2. The absolute loss satisfies the condition of (6).
The absolute loss is the absolute difference between the output and the ground-truth label of neural graph. Similarly, the first-order gradient is the sign function of this difference, so the second-order gradient is zero wherever it is defined. The situation is similar to Lemma 1, which ends the proof. ∎
Lemma 3. The cross entropy satisfies the condition of (6).
Regarding the second-order form, it is non-zero only for matching sample indices, because each column corresponds to one independent sample. Thus, we omit the sample subscript. The cross entropy is a function of the output and the ground-truth label of neural graph, and its second-order gradient with respect to the output is non-zero only for matching indices. Because the labels and outputs are non-negative, this second-order gradient is non-negative. In conclusion, cross entropy satisfies the condition of (6). ∎
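The three lemmas can be checked numerically for a single output. The sketch below assumes the textbook forms of the losses (squared difference, absolute difference, and negative label times the logarithm of the output); the normalization used in the lemmas may differ by a constant factor. The names `LOSSES` and `loss_curvature` are hypothetical.

```python
import math


def second_derivative(func, x, step=1e-4):
    """Central-difference estimate of the second derivative of func at x."""
    return (func(x + step) - 2.0 * func(x) + func(x - step)) / (step * step)


def square_loss(output, label):
    return (output - label) ** 2


def absolute_loss(output, label):
    return abs(output - label)


def cross_entropy(output, label):
    # Assumes output lies in (0, 1] and label is non-negative.
    return -label * math.log(output)


LOSSES = {
    "square": square_loss,
    "absolute": absolute_loss,
    "cross_entropy": cross_entropy,
}


def loss_curvature(name, output, label):
    """Second derivative of the named loss with respect to the output."""
    loss = LOSSES[name]
    return second_derivative(lambda o: loss(o, label), output)
```

For example, `loss_curvature("cross_entropy", 0.3, 1.0)` approximates the label divided by the squared output. The absolute loss has zero curvature away from the kink at `output == label`, where the second derivative is undefined.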
If there exists no circle in the graph and the convexification inequality holds for every function node, the objective of a neural graph with a conventional loss (i.e. square, absolute or cross entropy) is convex in each variable.
5 Scale Mechanism for Convexification
The scale mechanism is a technique that modifies the neural graph slightly to make it convex. Non-convexity stems from circles and functions, both of which can be convexified by factor scaling in the manner of this section.
Circles cannot be handled directly in our framework, because they introduce additional residual items in the second-order form. To begin, we notice that there exist many paths from the loss to a symbol, each passing through one of its successor nodes. Thus, the second-order form of the symbol is composed of many similar items, such as (using Einstein’s convention):
There could be many possible forms of , but conventional neural graph only takes the recursive form such as RNN in the case of circle. Simply, if :
For the first term, we have discussed the situation in Theorem 1 and 2. Regarding the second term, we transform into , thus the residual term is as below:
If the scaling factor is sufficiently small, the residual item becomes insignificant, so the main component dominates. In this manner, a circle of the recursive form guarantees the convex condition of (6). Notably, the scaling factor can be multiplied at any node along the path of the circle.
There are three conventional non-linear functions in neural graph: ReLU, sigmoid and tanh. We discuss each of them in this subsection with the convexification inequality. Notably, novel non-linear functions could be analyzed in the same manner.
(1.) ReLU. ReLU (Rectified Linear Units, [Glorot et al.2012]) takes the form of $\max(0, x)$, the second-order item of which is zero. Thus, ReLU naturally satisfies the convexification inequality and ensures the condition of (6).
(2.) Sigmoid. Sigmoid is the most classical non-linear function, $\sigma(x) = 1/(1 + e^{-x})$. Its first- and second-order gradient items are $\sigma(x)(1 - \sigma(x))$ and $\sigma(x)(1 - \sigma(x))(1 - 2\sigma(x))$, respectively. Thus, the convexification inequality is as below:
In practice, the input of the sigmoid usually lies in a bounded domain, and so does its value range. What we expect is that the second-order item approaches zero, which happens when the input approaches the origin. Thus, we scale the input of the sigmoid by a factor, which shrinks the domain accordingly. If the scaling factor is sufficiently small, the domain is nearly a neighborhood of the origin, where the second item of the convexification inequality is nearly insignificant. In this manner, the condition of (6) is guaranteed.
(3.) Tanh. Tanh is the other classical non-linear function, and its analysis is similar to that of sigmoid. Firstly, we calculate the first- and second-order gradient items as $1 - \tanh^2(x)$ and $-2\tanh(x)(1 - \tanh^2(x))$. Secondly, we present the convexification inequality:
Finally, we scale the input of the non-linear function by a factor, which shrinks its domain accordingly. Because the domain scales into a neighborhood of the origin, the output of tanh approaches zero, which makes the second item insignificant. In this manner, the condition of (6) is ensured.
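The effect of the scaling factor on the sigmoid can be illustrated numerically. The sketch below assumes that the factor multiplies the input of the sigmoid and that the unscaled inputs lie in the hypothetical interval [-5, 5]. It reports the largest ratio between the magnitude of the second-order item and the squared first-order item; for `sigmoid(scale * x)` the powers of `scale` cancel, so the ratio only depends on `scale * x`. All names are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relative_residual(t):
    """|sigmoid''(t)| / sigmoid'(t)^2 at pre-activation t."""
    s = sigmoid(t)
    first = s * (1.0 - s)
    second = first * (1.0 - 2.0 * s)
    return abs(second) / first ** 2


def worst_relative_residual(scale, half_width=5.0, samples=101):
    """Largest ratio for sigmoid(scale * x) over x in [-half_width, half_width]."""
    xs = [-half_width + 2.0 * half_width * i / (samples - 1) for i in range(samples)]
    return max(relative_residual(scale * x) for x in xs)


SCALES = [1.0, 0.5, 0.1, 0.01]
worst = [worst_relative_residual(a) for a in SCALES]
```

The ratio shrinks monotonically as the factor decreases, because the scaled inputs concentrate near the origin where the second-order item of sigmoid vanishes. The same experiment can be repeated for tanh by swapping in its two gradient items.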
5.3 Why Scale Works?
To demonstrate the convexification inequality and factor scaling mechanism, we take an example of . Regarding the node “sin”, , , and . Thus, we get the convexification inequality such as:
where challenges the inequality, which means the composed function is possibly non-convex. However, when we calculate the scaling factor as , the scaling leads to a convex function in as shown in Figure 2. Essentially, factor scaling weakens the effect of inner non-convex function, thus outer convex parts could dominate the convexity. Specifically, the non-convexity of inner “” is depressed relatively to the outer convex “”.
It is worth noting that the convexification inequality is a sufficient condition for convexity, not a necessary one. Thus, in practice the scaling factor only needs to be sufficiently small and need not strictly satisfy this inequality.
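The following sketch illustrates the same idea on a hypothetical composed function, `exp(sin(scale * x))`, which is chosen here only for illustration and is not necessarily the example analyzed above. The outer exponential is convex, the inner sine is not, and the sign of a finite-difference second derivative is checked on a grid over the assumed interval [-π, π].

```python
import math


def second_derivative(func, x, step=1e-3):
    """Central-difference estimate of the second derivative of func at x."""
    return (func(x + step) - 2.0 * func(x) + func(x - step)) / (step * step)


def composed(scale):
    """Hypothetical objective exp(sin(scale * x)) with a scaled inner input."""
    return lambda x: math.exp(math.sin(scale * x))


def is_convex_on_grid(func, low=-math.pi, high=math.pi, samples=201, tol=1e-6):
    """True if the estimated second derivative is non-negative on the whole grid."""
    xs = [low + (high - low) * i / (samples - 1) for i in range(samples)]
    return all(second_derivative(func, x) >= -tol for x in xs)


unscaled_is_convex = is_convex_on_grid(composed(1.0))
scaled_is_convex = is_convex_on_grid(composed(0.1))
```

Without scaling, the curvature is negative near `x = π/2`, where the sine peaks. With the factor 0.1 the inner input stays close to the origin and the outer convex part dominates on the whole interval.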
6 Experiments
To verify the scale mechanism, we build a practical MLP-based graph for image recognition on MNIST [Lecun et al.1998] and an LSTM-based graph for sentence matching on the “Quora Paraphrase Dataset” [Xiao2017a]. Besides, we conduct two tasks, variance reduction and faster convergence, to test our theoretical analysis.
6.1 Experimental Setting
Dataset. To test our theory in a practical large-scale scenario, we choose two datasets: MNIST and the “Quora Paraphrase Dataset”. The MNIST dataset [Lecun et al.1998] is a classic benchmark, which consists of handwritten digit images, 28 x 28 pixels in size, organized into 10 classes (0 to 9) with 60,000 training and 10,000 test samples. “Quora Question Pairs” [Xiao2017a] (available at https://data.quora.com) aims at paraphrase identification. Specifically, there are over 400,000 question pairs in this dataset, and each question pair is annotated with a binary value indicating whether the two questions are paraphrases of each other or not. We hold out 10,000 samples each for the development and test sets and select 40,000 instances as the training set.
Baseline & Scale Mechanism. Different graphs are processed differently. Thus, we discuss the baseline and the corresponding scale mechanism for each specific method. Notably, both baselines are trained with AdaDelta [Zeiler2012], using a moment term and regularization.
Regarding MNIST, we apply a multilayer perceptron (MLP) with the structure “input-300-100-output” and sigmoid as the non-linear function. There is no circle in this case, so we directly scale all the non-linear functions with different factors. We train the model until convergence, for at most 1,000 rounds, and perform five training-testing runs to obtain the accuracy and variance.
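A minimal forward pass of such a scaled MLP might look as follows. This is a sketch under assumptions rather than the implementation used in the experiments: the scaling factor is assumed to multiply the input of every hidden sigmoid, the weights are random placeholders, and the names `init_layers` and `scaled_mlp_forward` are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(weights, bias, inputs):
    """Affine layer: one output per weight row."""
    return [
        sum(w * v for w, v in zip(row, inputs)) + b
        for row, b in zip(weights, bias)
    ]


def init_layers(sizes, seed=0):
    """Random placeholder parameters for consecutive layer sizes."""
    rng = random.Random(seed)
    layers = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.gauss(0.0, 0.05) for _ in range(fan_in)] for _ in range(fan_out)]
        layers.append((weights, [0.0] * fan_out))
    return layers


def scaled_mlp_forward(layers, inputs, scale):
    """Hidden layers apply sigmoid(scale * pre_activation); the last layer is linear."""
    hidden = inputs
    for weights, bias in layers[:-1]:
        hidden = [sigmoid(scale * z) for z in dense(weights, bias, hidden)]
    weights, bias = layers[-1]
    return dense(weights, bias, hidden)


# "input-300-100-output" for 28 x 28 images and 10 classes.
LAYERS = init_layers([784, 300, 100, 10])
```

Setting `scale` to 1 recovers the unscaled baseline. In the degenerate case `scale = 0`, every hidden unit outputs 0.5 regardless of the input, which shows why an extremely small factor removes the non-linearity altogether.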
Regarding “Quora”, we apply the Siamese LSTM [Wang et al.2017] as our baseline. Specifically, the Siamese LSTM encodes the two input sentences into two sentence vectors, each by an LSTM [Wang et al.2016]. Based on the two sentence vectors, a cosine similarity is leveraged to make the final decision. We initialize the word embeddings with 300-dimensional GloVe word vectors pre-trained on the 840B Common Crawl corpus [Pennington et al.2014] and set the hidden dimension to 100 for each LSTM.
There are many recursive circles in LSTM: the recurrent input of each gate is a classic circle, because each state is a function of the previous state. Fortunately, all the paths along these circles pass through the non-linear functions, which means that scaling only the non-linear functions fixes the issues of both circles and functions.
6.2 Variance Reduction
Multiple runs with random initialization can discover various saddle points and local or global minima, which produces the variance of accuracy. The scale mechanism yields a more convex surface, which in theory should reduce the variance of accuracy. In fact, stability is critical to industrial applications, which is why reducing the variance of accuracy matters.
Regarding the experimental protocol, we try four settings of the scaling factor: baseline, slight scale, medium scale and destroying scale. The corresponding results are presented in Table 1, from which we conclude:
- The standard deviation of the medium setting is lower than that of the baseline, which supports our theory. Specifically, the corresponding objective is more convex. In this way, the scale mechanism can stabilize the solution to some extent.
- The destroying setting is an extreme case, where the loss surface can be very singular, which increases the variance.
It is also worth noting that the trade-off between the scaling factor and the non-linear function should be carefully tuned. Specifically, a smaller scale leads to a more convex surface but has an uncertain effect on accuracy. We suggest applying a medium scaling factor, which takes care of both convexity and accuracy.
6.3 Faster Convergence
In theory, non-convexity makes the path from the initialization to saddle points and local or global minima very bumpy, which slows convergence down to some degree. The scale mechanism generates a more convex objective, which smooths the path of gradient descent methods. In this manner, each step of gradient descent is more effective, so a faster convergence rate can be expected.
Regarding the experimental protocol, we try four settings of the scaling factor: baseline, slight scale, medium scale and extreme scale. We test the model in each training round and record the accuracies. The corresponding results are presented in Figure 3 and Table 2, from which we conclude:
- The blue line is the baseline, and the other lines perform much better than it in the pre-converged period, which means that the scale mechanism indeed speeds up convergence. This result supports our theoretical analysis.
- The baseline needs the most epochs to converge, which means that the scale mechanism really accelerates the training process.
- The purple line always performs better than the baseline and achieves a significantly better accuracy than the other settings. This result demonstrates that the convexity constructed by the scale mechanism can lead to a better local minimum or even the global minimum.
7 Conjectured Landscape
At the end of this paper, we formally propose a conjecture on the landscape of neural graph: the objective of neural graph is nearly convex from an overview perspective, and each local minimum is nearly a global minimum.
Regarding the first statement, we show, both theoretically and experimentally, that the conventional neural graph with the scale mechanism is indeed nearly convex. In fact, from our proof, we can intuitively conclude that the neural graph without the scale mechanism could still be approximately convex. We hope this conclusion can be generalized to the general case, beyond tree-structured and recursive graphs.
Regarding the second statement, we formulate this concept with concentration inequality, such as
where is the set of local minimums, is the global minimum, is the objective of neural graph and are some parameters independent with . We conjecture this inequality is satisfied for the set of neural graphs without proof. We hope future work could prove/deny this conjecture.
8 Conclusion
In this paper, we propose an iterative method to analyze the convexity of neural graph. From the proof, we conclude that only circles and non-linear functions can introduce non-convexity. Thus, we design the scale mechanism to transform a neural graph into a convex form. Experimentally, we demonstrate that the scale mechanism can stabilize the accuracy, raise the probability of reaching the global minimum and speed up convergence, which supports our theory.
References
- [Baldi1988] Pierre Baldi. Linear learning: landscapes and algorithms. Morgan Kaufmann Publishers Inc., 1988.
- [Boyd and Vandenberghe2013] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. World Publishing Corporation, 2013.
- [Choromanska et al.2015] Anna Choromanska, Yann LeCun, and Gerard Ben Arous. Open problem: The landscape of the loss surfaces of multilayer networks. 2015.
- [Glorot et al.2012] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. International Conference on Artificial Intelligence and Statistics, pages 315–323, 2012.
- [Goodfellow et al.2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
- [He et al.2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. pages 770–778, 2015.
- [He et al.2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. pages 630–645, 2016.
- [Kawaguchi2016] Kenji Kawaguchi. Deep learning without poor local minima. 2016.
- [Lecun et al.1998] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- [Mhaskar et al.2016] Hrushikesh Mhaskar, Qianli Liao, and Tomaso Poggio. Learning real and boolean functions: When is deep better than shallow. 2016.
- [Murty and Kabadi1987] Katta G. Murty and Santosh N. Kabadi. Some np-complete problems in quadratic and nonlinear programming. Mathematical Programming, 39(2):117–129, 1987.
- [Pennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. Conference on Empirical Methods in Natural Language Processing, pages 1532–1543, 2014.
- [Romero et al.2014] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. Computer Science, 2014.
- [Tan et al.2015] Ming Tan, Cicero dos Santos, Bing Xiang, and Bowen Zhou. Lstm-based deep learning models for non-factoid answer selection. arXiv preprint arXiv:1511.04108, 2015.
- [Wang et al.2016] Zhiguo Wang, Haitao Mi, and Abraham Ittycheriah. Semi-supervised clustering for short text via deep representation learning. 2016.
- [Wang et al.2017] Zhiguo Wang, Wael Hamza, and Radu Florian. Bilateral multi-perspective matching for natural language sentences. 2017.
- [Xiao2017a] Han Xiao. Hungarian layer: Logics empowered neural architecture. arXiv preprint arXiv:1712.02555, 2017.
- [Xiao2017b] Han Xiao. NDT: Neual decision tree towards fully functioned neural graph. arXiv preprint arXiv:1712.05934, 2017.
- [Yih et al.2013] Wen Tau Yih, Ming Wei Chang, Christopher Meek, and Andrzej Pastusiak. Question answering using enhanced lexical semantic models. In Meeting of the Association for Computational Linguistics, pages 1744–1753, 2013.
- [Yin and Schütze2015] Wenpeng Yin and Hinrich Schütze. Convolutional neural network for paraphrase identification. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 901–911, 2015.
- [Zeiler2012] Matthew D. Zeiler. Adadelta: An adaptive learning rate method. Computer Science, 2012.