
Theoretical Issues in Deep Networks: Approximation, Optimization and Generalization
While deep learning is successful in a number of applications, it is not yet well understood theoretically. A satisfactory theoretical characterization of deep learning, however, is beginning to emerge. It covers the following questions: 1) representation power of deep networks, 2) optimization of the empirical risk, and 3) generalization properties of gradient descent techniques: why does the expected error not suffer, despite the absence of explicit regularization, when the networks are overparametrized? In this review we discuss recent advances in the three areas. In approximation theory both shallow and deep networks have been shown to approximate any continuous function on a bounded domain at the expense of an exponential number of parameters (exponential in the dimensionality of the function). However, for a subset of compositional functions, deep networks of the convolutional type can have a linear dependence on dimensionality, unlike shallow networks. In optimization we discuss the loss landscape for the exponential loss function and show that stochastic gradient descent will find the global minima with high probability. To address the question of generalization for classification tasks, we use classical uniform convergence results to justify minimizing a surrogate exponential-type loss function under a unit norm constraint on the weight matrix at each layer, since the interesting variables for classification are the weight directions rather than the weights. Our approach, which is supported by several independent new results, offers a solution to the puzzle about generalization performance of deep overparametrized ReLU networks, uncovering the origin of the underlying hidden complexity control.
08/25/2019 ∙ by Tomaso Poggio, et al.

Biologically-plausible learning algorithms can scale to large datasets
The backpropagation (BP) algorithm is often thought to be biologically implausible in the brain. One of the main reasons is that BP requires symmetric weight matrices in the feedforward and feedback pathways. To address this "weight transport problem" (Grossberg, 1987), two more biologically plausible algorithms, proposed by Liao et al. (2016) and Lillicrap et al. (2016), relax BP's weight symmetry requirements and demonstrate learning capabilities comparable to those of BP on small datasets. However, a recent study by Bartunov et al. (2018) evaluates variants of target-propagation (TP) and feedback alignment (FA) on the MNIST, CIFAR, and ImageNet datasets, and finds that although many of the proposed algorithms perform well on MNIST and CIFAR, they perform significantly worse than BP on ImageNet. Here, we additionally evaluate the sign-symmetry algorithm (Liao et al., 2016), which differs from both BP and FA in that the feedback and feedforward weights share signs but not magnitudes. We examine the performance of sign-symmetry and feedback alignment on the ImageNet and MS COCO datasets using different network architectures (ResNet-18 and AlexNet for ImageNet, RetinaNet for MS COCO). Surprisingly, networks trained with sign-symmetry can attain classification performance approaching that of BP-trained networks. These results complement the study by Bartunov et al. (2018), and establish a new benchmark for future biologically plausible learning algorithms on more difficult datasets and more complex architectures.
11/08/2018 ∙ by Will Xiao, et al.
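To make the difference between BP and sign-symmetry concrete, the following minimal sketch propagates an error vector back through a single linear layer in two ways: with the transposed feedforward weights (as in BP) and with feedback weights that keep only the signs of the feedforward weights. The function names, the example matrix, and the choice of unit magnitude for the feedback weights are illustrative assumptions, not details taken from the paper.

```python
def sign(x):
    # Returns -1, 0, or 1 depending on the sign of x.
    return (x > 0) - (x < 0)


def backward_signal(weights, delta, mode="bp"):
    """Send the error vector `delta` back through one linear layer.

    weights: list of rows, shape (n_out, n_in).
    mode "bp":   feedback uses W^T (exact backpropagation).
    mode "sign": feedback uses sign(W)^T (assumed unit magnitudes).
    """
    n_in = len(weights[0])
    out = [0.0] * n_in
    for i, row in enumerate(weights):
        for j, w in enumerate(row):
            feedback = w if mode == "bp" else float(sign(w))
            out[j] += feedback * delta[i]
    return out


# Hypothetical 2x2 layer and error vector, for illustration only.
W = [[0.5, -2.0], [-0.1, 3.0]]
delta = [1.0, -1.0]

bp_signal = backward_signal(W, delta, mode="bp")
ss_signal = backward_signal(W, delta, mode="sign")
```

In this toy case the two error signals differ in magnitude but agree in the sign of every component, which is the kind of agreement the sign-symmetry idea relies on; the sketch does not show that this holds in general.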

A Surprising Linear Relationship Predicts Test Performance in Deep Networks
Given two networks with the same training loss on a dataset, when would they have drastically different test losses and errors? Better understanding of this question of generalization may improve practical applications of deep networks. In this paper we show that with cross-entropy loss it is surprisingly simple to induce significantly different generalization performances for two networks that have the same architecture, the same meta parameters and the same training error: one can either pretrain the networks with different levels of "corrupted" data or simply initialize the networks with weights of different Gaussian standard deviations. A corollary of recent theoretical results on overfitting shows that these effects are due to an intrinsic problem of measuring test performance with a cross-entropy/exponential-type loss, which can be decomposed into two components both minimized by SGD, one of which is not related to expected classification performance. However, if we factor out this component of the loss, a linear relationship emerges between training and test losses. Under this transformation, classical generalization bounds are surprisingly tight: the empirical/training loss is very close to the expected/test loss. Furthermore, the empirical relation between classification error and normalized cross-entropy loss seems to be approximately monotonic.
07/25/2018 ∙ by Qianli Liao, et al.
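A small sketch can illustrate why a component of the cross-entropy loss may be unrelated to classification performance: multiplying all logits by a positive constant leaves the predicted class unchanged but changes the loss value. The logits, the label, and the scale factor below are made-up values, and the sketch illustrates only this scale effect, not the paper's actual normalization procedure.

```python
import math


def cross_entropy(logits, label):
    # Numerically stable log-sum-exp minus the logit of the true class.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def predict(logits):
    # Index of the largest logit.
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical logits for one correctly classified example.
logits = [2.0, 1.0, -1.0]
label = 0
scaled = [10.0 * z for z in logits]  # same direction, larger scale

loss_original = cross_entropy(logits, label)
loss_scaled = cross_entropy(scaled, label)
same_prediction = predict(logits) == predict(scaled)
```

Here the prediction is identical for both sets of logits while the loss for the scaled logits is much smaller, so comparing raw cross-entropy values across networks with different output scales can be misleading.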

Theory IIIb: Generalization in Deep Networks
A main puzzle of deep neural networks (DNNs) revolves around the apparent absence of "overfitting", defined in this paper as follows: the expected error does not get worse when increasing the number of neurons or of iterations of gradient descent. This is surprising because of the large capacity demonstrated by DNNs to fit randomly labeled data and the absence of explicit regularization. Recent results by Srebro et al. provide a satisfying solution of the puzzle for linear networks used in binary classification. They prove that minimization of loss functions such as the logistic, the cross-entropy and the exp-loss yields asymptotic, "slow" convergence to the maximum margin solution for linearly separable datasets, independently of the initial conditions. Here we prove a similar result for nonlinear multilayer DNNs near zero minima of the empirical loss. The result holds for exponential-type losses but not for the square loss. In particular, we prove that the weight matrix at each layer of a deep network converges to a minimum norm solution up to a scale factor (in the separable case). Our analysis of the dynamical system corresponding to gradient descent of a multilayer network suggests a simple criterion for ranking the generalization performance of different zero minimizers of the empirical loss.
06/29/2018 ∙ by Tomaso Poggio, et al.

Fisher-Rao Metric, Geometry, and Complexity of Neural Networks
We study the relationship between geometry and capacity measures for deep neural networks from an invariance viewpoint. We introduce a new notion of capacity, the Fisher-Rao norm, that possesses desirable invariance properties and is motivated by Information Geometry. We discover an analytical characterization of the new capacity measure, through which we establish norm-comparison inequalities and further show that the new measure serves as an umbrella for several existing norm-based complexity measures. We discuss upper bounds on the generalization error induced by the proposed measure. Extensive numerical experiments on CIFAR-10 support our theoretical findings. Our theoretical analysis rests on a key structural lemma about partial derivatives of multilayer rectifier networks.
11/05/2017 ∙ by Tengyuan Liang, et al.

Streaming Normalization: Towards Simpler and More Biologically-plausible Normalizations for Online and Recurrent Learning
We systematically explore a spectrum of normalization algorithms related to Batch Normalization (BN) and propose a generalized formulation that simultaneously solves two major limitations of BN: (1) online learning and (2) recurrent learning. Our proposal is simpler and more biologically-plausible. Unlike previous approaches, our technique can be applied out of the box to all learning scenarios (e.g., online learning, batch learning, fully-connected, convolutional, feedforward, recurrent and mixed, i.e. recurrent and convolutional) and compares favorably with existing approaches. We also propose Lp Normalization for normalizing by different orders of statistical moments. In particular, L1 normalization is well-performing, simple to implement, fast to compute, more biologically-plausible and thus ideal for GPU or hardware implementations.
10/19/2016 ∙ by Qianli Liao, et al.
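One plausible reading of "normalizing by different orders of statistical moments" is to center the activations and divide by the p-th root of their mean p-th absolute deviation, so that p = 2 recovers the usual standard-deviation scaling and p = 1 uses the mean absolute deviation. The sketch below follows that reading; the function name, the `eps` term, and the exact formula are assumptions rather than the paper's definition.

```python
def lp_normalize(xs, p=1, eps=1e-8):
    """Center `xs` and rescale by an order-p deviation statistic.

    Assumed formula: scale = (mean(|x - mean|^p))^(1/p).
    `eps` guards against division by zero and is an added assumption.
    """
    n = len(xs)
    mean = sum(xs) / n
    centered = [x - mean for x in xs]
    scale = (sum(abs(c) ** p for c in centered) / n) ** (1.0 / p)
    return [c / (scale + eps) for c in centered]


# Hypothetical activations for illustration.
activations = [1.0, 2.0, 3.0, 4.0]
l1_normalized = lp_normalize(activations, p=1)
l2_normalized = lp_normalize(activations, p=2)
```

With p = 1 the scale needs only absolute values and a mean, with no squares or square roots, which is consistent with the abstract's remark that L1 normalization is simple and fast to compute.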

View-tolerant face recognition and Hebbian learning imply mirror-symmetric neural tuning to head orientation
The primate brain contains a hierarchy of visual areas, dubbed the ventral stream, which rapidly computes object representations that are both specific for object identity and relatively robust against identity-preserving transformations like depth-rotations. Current computational models of object recognition, including recent deep learning networks, generate these properties through a hierarchy of alternating selectivity-increasing filtering and tolerance-increasing pooling operations, similar to simple-complex cell operations. While simulations of these models recapitulate the ventral stream's progression from early view-specific to late view-tolerant representations, they fail to generate the most salient property of the intermediate representation for faces found in the brain: mirror-symmetric tuning of the neural population to head orientation. Here we prove that a class of hierarchical architectures and a broad set of biologically plausible learning rules can provide approximate invariance at the top level of the network. While most of the learning rules do not yield mirror-symmetry in the mid-level representations, we characterize a specific biologically-plausible Hebb-type learning rule that is guaranteed to generate mirror-symmetric tuning to faces at intermediate levels of the architecture.
06/05/2016 ∙ by Joel Z. Leibo, et al.

Bridging the Gaps Between Residual Learning, Recurrent Neural Networks and Visual Cortex
We discuss relations between Residual Networks (ResNet), Recurrent Neural Networks (RNNs) and the primate visual cortex. We begin with the observation that a shallow RNN is exactly equivalent to a very deep ResNet with weight sharing among the layers. A direct implementation of such an RNN, although having orders of magnitude fewer parameters, leads to a performance similar to the corresponding ResNet. We propose 1) a generalization of both RNN and ResNet architectures and 2) the conjecture that a class of moderately deep RNNs is a biologically-plausible model of the ventral stream in visual cortex. We demonstrate the effectiveness of the architectures by testing them on the CIFAR-10 dataset.
04/13/2016 ∙ by Qianli Liao, et al.
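The equivalence between a shallow RNN and a weight-shared ResNet can be sketched with a toy residual update h <- h + relu(W h): applying it with the same W at every layer of a deep stack gives the same result as iterating it over time in a recurrent unit. The update rule, the helper names, and the example weights are simplifying assumptions for illustration, not the architectures evaluated in the paper.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def residual_step(h, W):
    # h <- h + relu(W h): one residual block, or one recurrent time step.
    return [a + b for a, b in zip(h, relu(matvec(W, h)))]


def resnet_forward(h, layer_weights):
    # A stack of residual blocks, one weight matrix per layer.
    for W in layer_weights:
        h = residual_step(h, W)
    return h


def rnn_forward(h, W, steps):
    # A single recurrent unit unrolled for `steps` time steps.
    for _ in range(steps):
        h = residual_step(h, W)
    return h


# Hypothetical weights and input.
W = [[0.1, -0.2], [0.3, 0.0]]
x = [1.0, 2.0]
depth = 3

deep_out = resnet_forward(x, [W] * depth)  # ResNet with shared weights
rnn_out = rnn_forward(x, W, depth)         # RNN unrolled for 3 steps

shared_params = len(W) * len(W[0])         # parameters with weight sharing
untied_params = depth * shared_params      # parameters if each layer had its own W
```

The two outputs coincide because both functions perform the same sequence of operations, while the parameter count of the shared version does not grow with depth.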

Deep Convolutional Networks are Hierarchical Kernel Machines
In i-theory a typical layer of a hierarchical architecture consists of HW modules pooling the dot products of the inputs to the layer with the transformations of a few templates under a group. Such layers include as special cases the convolutional layers of Deep Convolutional Networks (DCNs) as well as the non-convolutional layers (when the group contains only the identity). Rectifying nonlinearities, which are used by present-day DCNs, are one of the several nonlinearities admitted by i-theory for the HW module. We discuss here the equivalence between group averages of linear combinations of rectifying nonlinearities and an associated kernel. This property implies that present-day DCNs can be exactly equivalent to a hierarchy of kernel machines with pooling and non-pooling layers. Finally, we describe a conjecture for theoretically understanding hierarchies of such modules. A main consequence of the conjecture is that hierarchies of trained HW modules minimize memory requirements while computing a selective and invariant representation.
08/05/2015 ∙ by Fabio Anselmi, et al.

Pruning Convolutional Neural Networks for Image Instance Retrieval
In this work, we focus on the problem of image instance retrieval with deep descriptors extracted from pruned Convolutional Neural Networks (CNN). The objective is to heavily prune convolutional edges while maintaining retrieval performance. To this end, we introduce both data-independent and data-dependent heuristics to prune convolutional edges, and evaluate their performance across various compression rates with different deep descriptors over several benchmark datasets. Further, we present an end-to-end framework to fine-tune the pruned network, with a triplet loss function specially designed for the retrieval task. We show that the combination of heuristic pruning and fine-tuning offers a 5x compression rate without considerable loss in retrieval performance.
07/18/2017 ∙ by Gaurav Manek, et al.

Do Deep Neural Networks Suffer from Crowding?
Crowding is a visual effect suffered by humans, in which an object that can be recognized in isolation can no longer be recognized when other objects, called flankers, are placed close to it. In this work, we study the effect of crowding in artificial Deep Neural Networks for object recognition. We analyze both standard deep convolutional neural networks (DCNNs) and a new version of DCNNs which 1) is multi-scale and 2) has convolution filters whose size changes depending on the eccentricity with respect to the center of fixation. Such networks, which we call eccentricity-dependent, are a computational model of the feedforward path of the primate visual cortex. Our results reveal that the eccentricity-dependent model, trained on target objects in isolation, can recognize such targets in the presence of flankers if the targets are near the center of the image, whereas DCNNs cannot. Also, for all tested networks, when trained on targets in isolation, we find that recognition accuracy of the networks decreases the closer the flankers are to the target and the more flankers there are. We find that visual similarity between the target and flankers also plays a role and that pooling in early layers of the network leads to more crowding. Additionally, we show that incorporating the flankers into the images of the training set does not improve performance with crowding.
06/26/2017 ∙ by Anna Volokitin, et al.