Universal Approximation by a Slim Network with Sparse Shortcut Connections

11/22/2018 · by Fenglei Fan, et al. · Rensselaer Polytechnic Institute

Over recent years, deep learning has become a mainstream method in machine learning. More advanced networks are being actively developed to solve real-world problems in many important areas. Among the successful features of network architectures, shortcut connections, which take the outputs of earlier layers as the inputs to later layers, are well established and produce excellent results, such as in ResNet and DenseNet. Despite the power of shortcuts, important questions remain on the underlying mechanism and associated functionalities. For example, will adding shortcuts lead to a more compact structure? How should shortcuts be used for optimal efficiency and capacity of the network model? Along this direction, here we demonstrate that, given only one neuron in each layer, shortcuts can be sparsely placed to let the slim network become a universal approximator. Potentially, our theoretically-guaranteed sparse network model can achieve a learning performance comparable to densely-connected networks on well-known benchmarks.




I Introduction

Recently, deep learning [1, 2] has rapidly advanced the field of artificial intelligence and machine learning, and achieved great successes in many applications [3, 4, 5, 6, 7, 8]. Since AlexNet was reported [9], there has been a trend for neural networks to go deeper and deeper; for example, GoogleNet [10], Inception [11], VGG [12], ResNet [13, 14], and DenseNet [15]. Among these excellent networks, ResNet and DenseNet utilize internal shortcuts, facilitating the training of extremely deep network structures by the synthesis of multi-resolution features. As the key characteristic of ResNet and DenseNet, shortcut structures are gaining increasing traction in other networks. However, although they perform well in practice, theoretical studies on shortcut connections remain scarce.

Clearly, there are trade-offs in shortcut-facilitated network models. If there are too many shortcuts, the computational burden and overfitting risk are both increased, as with DenseNet, where in each block every layer is concatenated with all prior layers, resulting in a connection complexity of O(L^2), with L being the layer depth. Despite the effort [15] to alleviate this problem by adopting relatively small convolutional kernels and transition layers to squeeze the feature sizes, the network can still be overly complex due to the nature of dense connections.

On the other hand, a natural question is whether such dense connections are really needed. In [16], Zhu et al. demonstrated that a densely connected structure is not an optimal design, as it involves too many parameters during internal aggregation. Then, what is an optimal structure and, more importantly, how can we find it? To use skip connections reasonably, our overall hypothesis is that an optimal structure should achieve a balance between efficiency and capacity for a given problem.

In this paper, we start with an extreme network with just one neuron in every layer and then add only the necessary shortcuts to make the network a universal approximator. Specifically, for any continuous function f supported on [0,1] and any given precision ε > 0, we claim that there exists a sparsely-connected neural network N with the ReLU activation and one neuron in each layer such that

|f(x) - N(x)| < ε for all x in [0,1].
Now, let us analyze the connection topology of such a universal approximator in terms of the following features: (1) Theoretically, in our sparse network the number of connections is O(L), which is significantly lighter than the O(L^2) in DenseNet; this saving can be critical in very deep or wide networks. (2) As a universal approximator with only one neuron in each layer, the slim network has a powerful representation ability. (3) The shortest gradient path in this network is O(1), which means that the network is well suited to avoid the gradient vanishing or explosion problems. With these desirable properties, we designed a network referred to as S3-Net (Slim, Sparse, Shortcut Network). Experiments on MNIST demonstrate that our S3-Net performs competitively compared to some advanced networks.

Universal approximation is a prerequisite for a generic neural network. A classic result [17, 18, 19], referred to as the universal approximation theorem, states that a neural network with only one hidden layer can approximate a general function with any given accuracy. Specifically, in those proofs, the network is made wide to realize a strong representation power. In contrast, inspired by the successes of deep learning, newer results suggest the advantages of depth over width. The basic idea behind these results is to construct a special class of functions that a deep network can represent efficiently while shallow networks cannot [20, 21, 22, 23, 24, 25, 26]. Impressively, [27] reported that, given at most d + 4 neurons per layer and allowing an unbounded depth, a fully-connected deep network with the ReLU activation can approximate any Lebesgue-integrable d-dimensional function accurately in the L1-norm sense, which is the first width-bounded version of the universal approximation theorem. Furthermore, our group reported in [28] that no more than five neurons per layer are needed to express d-dimensional radial functions with our proposed quadratic neural network. To the best of our knowledge, theoretical results illustrating the role of shortcut connections are quite few. In [29], it was pointed out that a ResNet with one neuron in each hidden layer can approximate any continuous function.

To extract critical features and train deep networks effectively, great efforts have been made in exploring the use of skip connections. The Hypercolumn network [30] stacked the units at all layers as concatenated feature descriptors to obtain semantic information and precise localization. SegNet [31] used pooling indices in an encoder-decoder framework to facilitate the pixel-wise correspondence. Similarly, U-Net [32] exploited internal shortcuts bridging encoding and decoding layers to restore image texture. The Highway Network [33] and ResNet were great successes in training very deep networks. The Fractal Network [34] used a different skip-connection design strategy, in which interacting subpaths are used without any pass-through or residual connections. DenseNet [15] employed a topology with densely-connected shortcuts to maximally reuse the features created in each layer. Closest to our work is [16], which heuristically proposed a sparsified variant of DenseNet and numerically demonstrated that the sparsified network is efficient and robust as compared to ResNet and DenseNet. However, there has been no theoretical insight into why such a structure can outperform ResNet and DenseNet.

With this paper we make the following points. First, we prove the efficacy afforded by shortcut connections; that is, with shortcuts, even a one-neuron-wide S3-Net with the ReLU activation can work as a universal approximator. Aided by shortcut connections, the width of the network can be as small as one. Second, in our construction, given L neurons only O(L) connections are required for the network, while a typical DenseNet needs O(L^2) connections. This fact suggests a redundancy in the DenseNet structure. Furthermore, we prototype an S3-Net structure based on our analysis. In an experiment on the MNIST dataset, our network achieved a competitive performance at a reduced computational cost. In the following, we first give our proof. Then, we describe our S3-Net and our numerical results on MNIST. Finally, we discuss related issues and conclude the paper.

II Universal Approximation with a Sparsified DenseNet

Our proof consists of two parts. In the first part, we show how to use a sparsified DenseNet to approximate a continuous univariate function, which is the key part of our proof. Mathematically, any univariate continuous function can be approximated by a continuous piecewise linear function within a given closeness. Therefore, the question becomes how to implement this piecewise approximation with a sparsified DenseNet structure. If the linear activation were used, one neuron could only represent a linear function; the composition of any number of such neurons would still result in a linear function, which cannot serve our intended general approximation. Hence, in our scheme, ReLU is used to produce and integrate piecewise linear segments over the interval of interest. In the second part, we take advantage of the Kolmogorov–Arnold representation theorem [35], which states that any multivariate continuous function can be expressed as a composition of several univariate functions, and extend our univariate results to multivariate functions.
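The strategy of the first part can be sketched numerically. The snippet below is an illustration, not the paper's exact construction: it writes the piecewise-linear interpolant of a target function as a telescoping sum of ReLU ramps and checks that the uniform error is small. The helper name `pwl_from_relus` and the 16-segment partition are our own hypothetical choices.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def pwl_from_relus(x, knots, values):
    """Evaluate the piecewise-linear interpolant of (knots, values)
    as a telescoping sum of ReLU ramps anchored at the knots."""
    slopes = np.diff(values) / np.diff(knots)      # slope of each segment
    out = np.full_like(x, values[0], dtype=float)  # start at f(knots[0])
    prev = 0.0
    for k, s in zip(knots[:-1], slopes):
        out += (s - prev) * relu(x - k)            # change slope at each knot
        prev = s
    return out

f = np.sin
knots = np.linspace(0.0, 1.0, 17)                  # 16 linear segments
x = np.linspace(0.0, 1.0, 1001)
approx = pwl_from_relus(x, knots, f(knots))

# The ReLU sum coincides with ordinary linear interpolation,
# and the uniform error shrinks as the partition refines.
assert np.allclose(approx, np.interp(x, knots, f(knots)))
```

Refining the partition (more knots) drives the uniform error toward zero, which is the spline-fitting fact the proof relies on.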

II-A Universal Approximation of a Univariate Continuous Function

Theorem 1: Given any univariate continuous function f supported on [0,1], for any ε > 0 there exists a function N represented by a network with one neuron per layer, sparsified shortcuts and the ReLU activation, satisfying

|f(x) - N(x)| < ε for all x in [0,1].
As spline fitting theory implies, any continuous function can be approximated by a piecewise linear function to any accuracy in the uniform sense. Here, for convenience and without loss of generality, we impose the constraints that the target function f is continuous and without any nonzero constant offset. Then, when the domain [0,1] is partitioned into sufficiently tiny intervals 0 = x_0 < x_1 < … < x_L = 1, as shown in Figure 1, a general function f can be approximated well by a non-constant continuous piecewise linear function f_L with L segments. Therefore, to realize Theorem 1, we just need to use a one-neuron DenseNet to represent f_L exactly. For clarification, we give a rigorous definition of f_L as follows:

f_L(x) = a_i x + b_i for x in [x_{i-1}, x_i], i = 1, …, L,

where a_i ≠ 0 and a_i x_i + b_i = a_{i+1} x_i + b_{i+1} jointly reflect that f_L is continuous and increasing or decreasing on every segment.

Fig. 1: Approximation of a target function with L piecewise continuous non-constant linear functions.

The network architecture with the positions of shortcuts highlighted is illustrated in Figure 2. The hyperparameters should be appropriately selected. The outputs of the L modules are denoted as g_1, …, g_L. Our strategy is to use each module g_i to characterize one specific piece of the function over [x_{i-1}, x_i], where a constant shift is imposed as large as needed to put the functional value in the operating zone of ReLU, and to use shortcut connections to combine these segments and compensate for the shift with the final neuron. For the module i, its output is expressed in the following equation:

g_i = σ(w_i^(2) σ(w_i^(1) u_i + b_i^(1)) + b_i^(2)),

where σ(·) denotes the ReLU operation for short, u_i is the module input, and w_i^(1), w_i^(2), b_i^(1), b_i^(2) are the weights and biases of the two neurons within the module, correspondingly.

Fig. 2: DenseNet with L hidden modules.

In the following, mathematical induction is used to show that our construction can represent the piecewise linear approximant exactly.


II-A1 The First Module

Without loss of generality, we assume that the function is positive and sufficiently large over [0,1]. Otherwise, as asserted above, a large positive number can first be added to the function and finally removed. This operation is important, as explained later. We use the first module to implement the linear function over the first interval [x_0, x_1]. By setting the weight to a_1 and the bias to b_1, the specific function of the first neuron becomes σ(a_1 x + b_1) = a_1 x + b_1, matching the first linear segment. ReLU keeps the linearity over [x_0, x_1] because the functional value there is positive.

II-A2 Recurrent Relation

Suppose that we have obtained the desired module g_i; we can proceed to design the module g_{i+1} with the goal to express the function over the interval [x_i, x_{i+1}]. The tricky point is that the current neuron basically takes the output of the previous neuron as its input, which lies in the functional range instead of the input domain. Therefore, we need to perform an inverse affine transform.

For convenience, we introduce shorthand for the inverse affine transform of the previous module's output. The trick we use is to map that output back to the input domain and to set the slope so that the effect imposed on the previous segment is cancelled, equivalently squeezing the new module to represent only its own segment once the neighboring outputs are added together; the parameters of the module are chosen accordingly.

Thanks to the recurrent relation, each g_i can accurately represent one segment of the target function over [x_{i-1}, x_i], successively, and the summation of neighboring outputs brings g_i to zero beyond x_i. To clarify, we exemplify the interplay between neighboring modules with g_1 and g_2 by calculating g_1 + g_2. As shown in Figure 3, g_2 crops the part of g_1 for x > x_1, and at the same time accurately represents the function over [x_1, x_2]. When the target consists of L segments, it can be exactly represented by our construction.
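The cropping effect can be checked with a two-segment toy example. This is a hedged sketch: the slopes (2 and -1) and the knot at x = 0.5 are hypothetical choices, and unlike the paper's construction the correction ramp below acts on x directly rather than on the previous module's output through an inverse affine map; the cancellation at the knot is the same.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

x = np.linspace(0.0, 1.0, 1001)

# Hypothetical target: slope 2 on [0, 0.5], then slope -1 on [0.5, 1].
g1 = relu(2.0 * x)             # represents 2x, but keeps growing past x = 0.5
g2 = -3.0 * relu(x - 0.5)      # switched on at the knot; cancels the surplus slope

target = np.where(x <= 0.5, 2.0 * x, 1.0 - (x - 0.5))
assert np.allclose(g1 + g2, target)   # the sum "crops" g1 beyond the knot
```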

Fig. 3: Interplay between neighboring modules exemplified by the summation of g_1 and g_2.

Now we explain why we may need to shift up the function. As shown in Figure 4, for the computation of the next module, the functional value must be kept positive instead of being truncated by ReLU. Otherwise, the truncated part would generate residual interference in the subsequent modules.

Fig. 4: Truncation of the functional value avoided by adding a positive constant.

Lemma 2: Any continuous and non-constant piecewise linear function can be exactly expressed by a one-neuron-wide deep DenseNet.

Proof: By combining the above steps, for any continuous non-constant piecewise linear function consisting of L segments, there is a function represented by a one-neuron-wide L-layer sparsified DenseNet that can exactly represent it.

Proof to Theorem 1: Combining Lemma 2 and the above arguments, we immediately arrive at Theorem 1.

II-B Universal Approximation of a Multivariate Continuous Function

In the preceding subsection, we have shown that a sparsified slim DenseNet can approximate any continuous univariate function to any pre-specified precision. In this subsection, we present a scheme to approximate a multivariate continuous function by stacking the single-neuron-wide DenseNets we designed. Considering a d-dimensional continuous function f, our method is to decompose the high-dimensional function into a combination of several univariate functions based on the Kolmogorov–Arnold representation theorem.

Kolmogorov–Arnold representation theorem: According to the Kolmogorov–Arnold representation theorem, for any continuous function f: [0,1]^n → R with n ≥ 2, there exist continuous functions Φ_q: R → R and φ_{q,p}: [0,1] → R such that

f(x_1, …, x_n) = Σ_{q=0}^{2n} Φ_q(Σ_{p=1}^{n} φ_{q,p}(x_p)).

Hence, f can be written as a composition of finitely many univariate functions and additions. In the first part, we have proved that any univariate continuous function can be approximated by a slim DenseNet of a finite depth. Therefore, a multivariate continuous function can also be approximated by a finite composition of such slim DenseNets.
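The theorem is existential, and the inner and outer functions are generally not available in closed form. For intuition only, here is one multivariate function that does admit an explicit decomposition into sums and univariate maps, the bivariate product; this identity is our own illustrative example, not the paper's construction.

```python
# xy = ((x + y)**2 - (x - y)**2) / 4: the bivariate product written
# entirely with additions and the univariate map t -> t**2, in the
# spirit of the Kolmogorov–Arnold decomposition.
def sq(t):
    # univariate building block; could itself be approximated by a
    # slim one-neuron-per-layer network as in Section II-A
    return t * t

def product(x, y):
    return (sq(x + y) - sq(x - y)) / 4.0

assert abs(product(0.3, 0.7) - 0.21) < 1e-12
```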

III S3-Net

Based on the above proof, we design the following network, referred to as the Slim, Sparse, Shortcut Network (S3-Net), to enrich the armory of deep networks; it enjoys simplicity yet allows universal approximation. As shown in Figure 5, the operator module H performs multiple operations including batch normalization [36], ReLU [37], convolution, and so on, whose details are suppressed in our macro-structural view. Suppose that y_i is the output of the i-th module in a network of L modules; we characterize the workflow of the network architecture as

y_i = H_i(y_{i-1}), i = 1, …, L-1, and y_L = H_L(A(y_1, …, y_{L-1})),

where A is an aggregation operator. There are at least two aggregation modes: summation (as in ResNet) and concatenation (as in DenseNet), and we denote the corresponding variants as S3-Net(+) and S3-Net(c), respectively. We will explore both aggregation modes in the context of our architecture.
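The shape consequence of the two aggregation modes can be illustrated with toy arrays standing in for module outputs (the layer count L and channel count C below are arbitrary choices, not the paper's settings): summation preserves the channel count, while concatenation multiplies it by the number of aggregated layers, which is why concatenation-based variants need transition layers to control width.

```python
import numpy as np

rng = np.random.default_rng(0)
L, C, H, W = 4, 8, 16, 16        # layers, channels, spatial size (toy values)
feats = [rng.standard_normal((C, H, W)) for _ in range(L)]  # stand-ins for module outputs

# Summation aggregation (ResNet-style): channel count stays C.
summed = np.sum(feats, axis=0)
assert summed.shape == (C, H, W)

# Concatenation aggregation (DenseNet-style): channels grow to L * C.
concatenated = np.concatenate(feats, axis=0)
assert concatenated.shape == (L * C, H, W)
```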

Fig. 5: Topology of our proposed S3-Net

III-A Properties of the S3-Net

To minimize the information loss, DenseNet connects all previous layers to the current layer. Every layer is loaded with the features from all the previous layers with little discrimination. From the perspective of trainability, the densely-connected structure alleviates the gradient vanishing and exploding problems. Although DenseNet produces promising results, it suffers from associated weaknesses. Specifically, the required number of parameters grows at an asymptotically quadratic rate with respect to the network depth. To avoid over-constraining and heavy overhead, DenseNet often uses small feature maps. Still, it has been suggested that DenseNet cannot easily make full use of all its skip connections, because a large number of the shortcut parameters are close to zero [16]. This fact indicates that DenseNet may not be as effective and efficient as expected.

In contrast, our S3-Net is significantly simpler and sparser than DenseNet. The number of parameters and the shortest gradient path are listed in Table I for ResNet, DenseNet, and our S3-Net, respectively. The computational burden of the S3-Net roughly ties with that of ResNet and is significantly smaller than that of DenseNet, in terms of either summation or concatenation operations. Notably, as far as the shortest gradient path is concerned, S3-Net is also preferable. Typically, the shortest gradient path determines the quality of gradient search. In S3-Net, each previous layer is connected to the final layer, which means that the feedback error can be passed to each and every layer directly, regardless of where it is, instead of going through redundant intermediate connections as in DenseNet. Therefore, our S3-Net is ideal for effective and efficient gradient transport at a low computational cost.

Model       | Parameters | Shortest Gradient Search Path
ResNet      | O(L)       | O(L)
DenseNet    | O(L^2)     | O(1)
S3-Net (+)  | O(L)       | O(1)
S3-Net (c)  | O(L)       | O(1)

TABLE I: Comparison between ResNet, DenseNet, and S3-Net in terms of the number of involved parameters and the shortest gradient search path.
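The asymptotic gap in connection counts is easy to tabulate. The DenseNet count below is the exact within-block total; the S3-Net formula 2L - 1 (a forward chain plus one shortcut per layer into the final aggregation) is a hypothetical instance of an O(L) placement, since the paper only fixes the order of growth.

```python
def densenet_connections(L):
    # layer l receives the outputs of all l earlier layers: 1 + 2 + ... + L
    return L * (L + 1) // 2      # O(L^2)

def s3net_connections(L):
    # sketch: L - 1 forward connections plus L shortcuts into the final layer
    return 2 * L - 1             # O(L)

for L in (10, 100, 1000):
    print(L, densenet_connections(L), s3net_connections(L))
```

At L = 100, the dense scheme already uses 5050 connections against 199 for the linear placement, which matches the "more than an order of magnitude" saving claimed for deep networks.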

III-B Network Configuration

The above theoretical properties motivate us to justify our S3-Net model numerically. In this subsection, we describe a specific configuration of the S3-Net.

Composition Module: The composition module H includes four consecutive operations: batch normalization (BN), the rectified linear unit (ReLU), a convolutional layer, and a dropout layer [38]. The use of the dropout layer is optional in our experiments, since it was reported that dropout would more or less improve the network performance.

Pooling Layer: The pooling layer is incompatible with concatenation, and thus in S3-Net(c) and DenseNet pooling layers are primarily used in the transition layers between blocks. Similarly, the transition layer in S3-Net contains a batch normalization layer, a convolutional layer and an average pooling layer. On the other hand, pooling layers are extensively used in S3-Net(+), especially when the computational cost is to be minimized.

Feature Maps: The number of feature maps is specific to S3-Net(c). Assuming that there are L layers in a network and each layer generates k feature maps, the total number of feature maps to be concatenated at the final aggregation will be Lk. The number of feature maps here grows with respect to L differently from that in DenseNet blocks, where the number of feature maps fed into the l-th layer is k_0 + (l - 1)k, with k_0 being the number of input channels.

Bottleneck Layer: In DenseNet, to improve the model efficiency, it is potentially advantageous to utilize bottleneck layers, where a 1×1 convolution is introduced in each block. For a fair comparison, we will compare S3-Net to DenseNet in the cases with and without bottleneck layers, respectively.

III-C Experimental Design and Results


The MNIST dataset includes 60,000 training instances and 10,000 testing instances of 28×28 greyscale images of handwritten digits 0-9. For digit recognition, we compared DenseNet and S3-Net models consisting of a 7×7 convolution and 3×3 max-pooling as preprocessing layers, then three aggregation blocks and two transition layers, followed by BN-ReLU-Pooling-FC-Softmax. We used consecutive operations of BN, ReLU, Conv, Dropout and 2×2 average pooling as the transition layers. The bottleneck layer was employed to constitute the composition module; i.e., BN-ReLU-Conv(1×1)-BN-ReLU-Conv(3×3). In the aggregation blocks, only the placement of skip connections was different between S3-Net and DenseNet; the layer depth and growth rate in the three blocks were matched between the two models.

In both models, the training started from scratch. The dropout rate was 0.2. The batch size was 100, with the total number of epochs being 50. The optimization relied on the Adam method with ε = 1e-8. The learning rate was reduced once during training: after the first 25 epochs in the case of DenseNet, and after the first 30 epochs in the case of S3-Net.

In Table II, the results are compared in terms of accuracy and efficiency. In terms of accuracy, the performance of S3-Net is slightly lower than that of DenseNet. One of the state-of-the-art results is an error of 0.91%, as reported in [39]. As for efficiency, the number of parameters in S3-Net is 0.24M, less than half of that used in DenseNet.

Method                  | Test Error | Parameters
DenseNet                | 1.21%      | 0.55M
S3-Net                  | 1.51%      | 0.24M
Rectifier MLP + Dropout | 1.05%      | -
MP-DBM                  | 0.91%      | -

TABLE II: Comparison of accuracy and efficiency between S3-Net and other cutting-edge models.

IV Conclusion

In this study, we have theoretically investigated the utility of skip connections in an extremely slim network, which has only one neuron per layer but goes as deep as needed. Inspired by the elegant topology of the S3-Net, we evaluated an S3-Net prototype for digit recognition, with encouraging preliminary results on the MNIST dataset. Further extension and evaluation are under way.


  • [1] Y. LeCun, Y. Bengio, and G. Hinton, Deep learning. Nature, 521(7553), 436-444, 2015.
  • [2] F. Fan, W. Cong, G. Wang, A new type of neurons for machine learning, International Journal for Numerical Methods in Biomedical Engineering, vol. 34, no. 2, Feb. 2018.
  • [3] G. E. Dahl, D. Yu, L. Deng and A. Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30-42, 2012.
  • [4] A. Kumar, et al., "Ask me anything: Dynamic memory networks for natural language processing," In ICML, 2016.
  • [5] H. Chen, et al., "Low-dose CT with a residual encoder-decoder convolutional neural network," IEEE Transactions on Medical Imaging, vol. 36, no. 12, pp. 2524-2535, 2017.
  • [6] G. Wang, ”A Perspective on Deep Imaging,” IEEE Access, vol. 4, pp. 8914-8924, 2016.
  • [7] M. Anthimopoulos, S. Christodoulidis, L. Ebner, A. Christe, S. Mougiakakou, "Lung pattern classification for interstitial lung diseases using a deep convolutional neural network," IEEE Trans. Med. Imaging, vol. 35, no. 5, pp. 1207-1216, 2016.
  • [8] H. Shan, et al., "3-D convolutional encoder-decoder network for low-dose CT via transfer learning from a 2-D trained network," IEEE Transactions on Medical Imaging, vol. 37, no. 6, pp. 1522-1534, 2018.
  • [9] A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105), 2012.
  • [10] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [11] C. Szegedy, et al., ”Rethinking the inception architecture for computer vision,” In CVPR, 2016.
  • [12] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In The International Conference on Learning Representations (ICLR), 2015.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE conference on computer vision and pattern recognition (CVPR), 2016.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision (ECCV), 2016.
  • [15] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, Densely connected convolutional networks. In CVPR (Vol. 1, No. 2, p. 3), July. 2017
  • [16] L. Zhu, et al, ”Sparsely Aggregated Convolutional Networks,” In ECCV, 2018.
  • [17] K. Funahashi. On the approximate realization of continuous mappings by neural networks. Neural networks, 2(3):183–192, 1989.
  • [18] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
  • [19] V. Kurková. Kolmogorov’s theorem and multilayer neural networks. Neural networks, 5(3):501–506, 1992.
  • [20] R. Eldan, O. Shamir, ”The power of depth for feedforward neural networks,” In COLT, 2016.
  • [21] L. Szymanski and B. McCane. Deep networks are effective encoders of periodicity. IEEE transactions on neural networks and learning systems, 25(10):1816–1827, 2014.
  • [22] D. Rolnick and M. Tegmark. The power of deeper networks for expressing natural functions. In The International Conference on Learning Representations (ICLR), 2018.
  • [23] N. Cohen, O. Sharir, and A. Shashua. On the expressive power of deep learning: A tensor analysis. In Conference on Learning Theory (COLT), 2016.
  • [24] H. N. Mhaskar and T. Poggio. Deep vs. shallow networks: An approximation theory perspective. Analysis and Applications, 14(06):829–848, 2016.
  • [25] R. Eldan and O. Shamir. The power of depth for feedforward neural networks. In Conference on Learning Theory (COLT), 2016.
  • [26] S. Liang and R. Srikant. Why deep neural networks for function approximation? In The International Conference on Learning Representations (ICLR), 2017.
  • [27] Z. Lu, H. Pu, F. Wang, Z. Hu, and L. Wang, The expressive power of neural networks: A view from the width. In Advances in Neural Information Processing Systems (pp. 6231-6239), 2017
  • [28] F. Fan, and G. Wang, Universal Approximation with Quadratic Deep Networks. arXiv preprint arXiv:1808.00098, 2018
  • [29] H. Lin, and S. Jegelka, ResNet with one-neuron hidden layers is a Universal Approximator. arXiv preprint arXiv:1806.10909, 2018
  • [30] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 447-456), 2015
  • [31] V. Badrinarayanan, A. Kendall, and R. Cipolla, SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561, 2015.
  • [32] O. Ronneberger, P. Fischer, and T. Brox, U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention (pp. 234-241). Springer, Cham, Oct. 2015
  • [33] R. K. Srivastava, K. Greff, and J. Schmidhuber, Highway networks. arXiv preprint arXiv:1505.00387, 2015
  • [34] G. Larsson, M. Maire, and G. Shakhnarovich, Fractalnet: Ultra-deep neural networks without residuals. arXiv preprint arXiv:1605.07648, 2016
  • [35] A. N. Kolmogorov, “On the representation of continuous functions of several variables by superpositions of continuous functions of a smaller number of variables”, Proceedings of the USSR Academy of Sciences, 108 (1956), pp. 179–182; English translation: Amer. Math. Soc. Transl., 17 (1961), pp. 369–373.
  • [36] S. Ioffe, and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015
  • [37] V. Nair and G. Hinton, Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10) (pp. 807-814), 2010.

  • [38] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929-1958, 2014
  • [39] I. Goodfellow, M. Mirza, A. Courville, and Y. Bengio, Multi-prediction deep Boltzmann machines. In Advances in Neural Information Processing Systems (pp. 548-556), 2013