
Biologically-plausible learning algorithms can scale to large datasets
The backpropagation (BP) algorithm is often thought to be biologically implausible in the brain. One of the main reasons is that BP requires symmetric weight matrices in the feedforward and feedback pathways. To address this "weight transport problem" (Grossberg, 1987), two more biologically plausible algorithms, proposed by Liao et al. (2016) and Lillicrap et al. (2016), relax BP's weight symmetry requirements and demonstrate learning capabilities comparable to BP's on small datasets. However, a recent study by Bartunov et al. (2018) evaluates variants of target propagation (TP) and feedback alignment (FA) on the MNIST, CIFAR, and ImageNet datasets, and finds that although many of the proposed algorithms perform well on MNIST and CIFAR, they perform significantly worse than BP on ImageNet. Here, we additionally evaluate the sign-symmetry algorithm (Liao et al., 2016), which differs from both BP and FA in that the feedback and feedforward weights share signs but not magnitudes. We examine the performance of sign-symmetry and feedback alignment on the ImageNet and MS COCO datasets using different network architectures (ResNet-18 and AlexNet for ImageNet, RetinaNet for MS COCO). Surprisingly, networks trained with sign-symmetry can attain classification performance approaching that of BP-trained networks. These results complement the study by Bartunov et al. (2018) and establish a new benchmark for future biologically plausible learning algorithms on more difficult datasets and more complex architectures.
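
To make the distinction between the three feedback schemes concrete, here is a minimal numpy sketch (our illustration, not the authors' code) of how the error signal is propagated through one linear layer under BP, sign-symmetry, and feedback alignment; the layer sizes and scales are arbitrary assumptions.

    # Sketch of the three feedback schemes on one linear layer (illustrative only):
    # BP propagates errors through W^T, sign-symmetry through sign(W)^T, and
    # feedback alignment through a fixed random matrix B.
    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_out = 64, 32
    W = rng.normal(scale=0.1, size=(n_out, n_in))   # forward weights
    B = rng.normal(scale=0.1, size=(n_out, n_in))   # fixed random feedback (FA)

    delta = rng.normal(size=n_out)                  # error from the layer above

    grad_bp = W.T @ delta                           # backpropagation: symmetric weights
    grad_ss = np.sign(W).T @ delta                  # sign-symmetry: signs shared only
    grad_fa = B.T @ delta                           # feedback alignment: random feedback

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print("cos(BP, sign-symmetry):", cosine(grad_bp, grad_ss))
    print("cos(BP, feedback alignment):", cosine(grad_bp, grad_fa))
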
11/08/2018 ∙ by Will Xiao, et al.

A Surprising Linear Relationship Predicts Test Performance in Deep Networks
Given two networks with the same training loss on a dataset, when would they have drastically different test losses and errors? A better understanding of this question of generalization may improve practical applications of deep networks. In this paper we show that with the cross-entropy loss it is surprisingly simple to induce significantly different generalization performance for two networks that have the same architecture, the same meta-parameters, and the same training error: one can either pretrain the networks with different levels of "corrupted" data or simply initialize the networks with weights of different Gaussian standard deviations. A corollary of recent theoretical results on overfitting shows that these effects are due to an intrinsic problem of measuring test performance with a cross-entropy/exponential-type loss, which can be decomposed into two components, both minimized by SGD, one of which is not related to expected classification performance. However, if we factor out this component of the loss, a linear relationship emerges between training and test losses. Under this transformation, classical generalization bounds are surprisingly tight: the empirical/training loss is very close to the expected/test loss. Furthermore, the empirical relation between classification error and normalized cross-entropy loss seems to be approximately monotonic.
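
A minimal sketch of the "factor out the scale" idea, under the assumption (our simplification) that the network is a bias-free ReLU network, so its outputs are homogeneous in the weights: rescaling the weights changes the cross-entropy loss without changing a single prediction, whereas the loss evaluated after normalizing each layer's weights is invariant to that rescaling.

    # Toy bias-free ReLU network: outputs are homogeneous in the weights, so
    # rescaling weights changes the cross-entropy but not a single prediction.
    import numpy as np

    def forward(x, weights):
        h = x
        for W in weights[:-1]:
            h = np.maximum(0.0, h @ W)          # ReLU layers, no biases
        return h @ weights[-1]

    def cross_entropy(logits, y):
        z = logits - logits.max(axis=1, keepdims=True)
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(y)), y].mean()

    def normalize(ws):
        return [W / np.linalg.norm(W) for W in ws]

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(100, 20)), rng.integers(0, 5, size=100)
    weights = [rng.normal(size=(20, 50)), rng.normal(size=(50, 5))]
    scaled = [3.0 * W for W in weights]          # same classifier, different loss

    print(cross_entropy(forward(X, weights), y),
          cross_entropy(forward(X, scaled), y))  # losses differ, predictions don't
    print(np.allclose(cross_entropy(forward(X, normalize(weights)), y),
                      cross_entropy(forward(X, normalize(scaled)), y)))  # True
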
07/25/2018 ∙ by Qianli Liao, et al.

Theory IIIb: Generalization in Deep Networks
A main puzzle of deep neural networks (DNNs) revolves around the apparent absence of "overfitting", defined in this paper as follows: the expected error does not get worse when increasing the number of neurons or of iterations of gradient descent. This is surprising because of the large capacity demonstrated by DNNs to fit randomly labeled data and the absence of explicit regularization. Recent results by Srebro et al. provide a satisfying solution to the puzzle for linear networks used in binary classification. They prove that minimization of loss functions such as the logistic, cross-entropy, and exponential losses yields asymptotic, "slow" convergence to the maximum-margin solution for linearly separable datasets, independently of the initial conditions. Here we prove a similar result for nonlinear multilayer DNNs near zero minima of the empirical loss. The result holds for exponential-type losses but not for the square loss. In particular, we prove that the weight matrix at each layer of a deep network converges to a minimum-norm solution up to a scale factor (in the separable case). Our analysis of the dynamical system corresponding to gradient descent of a multilayer network suggests a simple criterion for ranking the generalization performance of different zero minimizers of the empirical loss.
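
The linear result being extended is easy to see numerically; below is a small sketch (ours, with arbitrary synthetic data) of gradient descent on the logistic loss over a linearly separable dataset: the weight norm diverges slowly while the direction w/||w|| settles toward the maximum-margin separator.

    # Gradient descent on the logistic loss for separable data: ||w|| grows
    # without bound, but the direction converges to the max-margin solution.
    import numpy as np

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=+3, size=(50, 2)),
                   rng.normal(loc=-3, size=(50, 2))])
    y = np.hstack([np.ones(50), -np.ones(50)])   # linearly separable classes

    w = np.zeros(2)
    for t in range(1, 100001):
        margins = np.clip(y * (X @ w), -30, 30)  # clip for numerical safety
        grad = -((y / (1.0 + np.exp(margins)))[:, None] * X).mean(axis=0)
        w -= 0.1 * grad
        if t in (100, 10000, 100000):
            print(t, np.linalg.norm(w), w / np.linalg.norm(w))
    # The norm keeps growing ("slow" divergence) while the printed direction
    # stabilizes at the maximum-margin separator.
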
06/29/2018 ∙ by Tomaso Poggio, et al.

Streaming Normalization: Towards Simpler and More Biologically-plausible Normalizations for Online and Recurrent Learning
We systematically explore a spectrum of normalization algorithms related to Batch Normalization (BN) and propose a generalized formulation that simultaneously solves two major limitations of BN: (1) online learning and (2) recurrent learning. Our proposal is simpler and more biologically plausible. Unlike previous approaches, our technique can be applied out of the box to all learning scenarios (e.g., online learning, batch learning, fully-connected, convolutional, feedforward, recurrent, and mixed recurrent-convolutional) and compares favorably with existing approaches. We also propose Lp Normalization for normalizing by different orders of statistical moments. In particular, L1 normalization performs well, is simple to implement, fast to compute, and more biologically plausible, and is thus ideal for GPU or hardware implementations.
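
A minimal sketch of the Lp idea (our reading of the abstract, with hypothetical function names): replace BN's division by the standard deviation, the second moment of the deviations, with a general p-th moment; p=1 uses only absolute values, which is what makes it cheap to compute in hardware.

    # Normalize by the p-th-order moment of the deviations; p=2 recovers the
    # Batch Normalization statistic, p=1 the cheaper mean absolute deviation.
    import numpy as np

    def lp_normalize(x, p=1, eps=1e-5, axis=0):
        centered = x - x.mean(axis=axis, keepdims=True)
        scale = (np.abs(centered) ** p).mean(axis=axis, keepdims=True) ** (1.0 / p)
        return centered / (scale + eps)

    x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(128, 16))
    print(lp_normalize(x, p=2).std(axis=0)[:4])   # ~1.0: exactly BN's statistic
    print(lp_normalize(x, p=1).std(axis=0)[:4])   # ~1.25 for Gaussian inputs
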
10/19/2016 ∙ by Qianli Liao, et al.

View-tolerant face recognition and Hebbian learning imply mirror-symmetric neural tuning to head orientation
The primate brain contains a hierarchy of visual areas, dubbed the ventral stream, which rapidly computes object representations that are both specific for object identity and relatively robust against identity-preserving transformations like depth rotations. Current computational models of object recognition, including recent deep learning networks, generate these properties through a hierarchy of alternating selectivity-increasing filtering and tolerance-increasing pooling operations, similar to simple-complex cell operations. While simulations of these models recapitulate the ventral stream's progression from early view-specific to late view-tolerant representations, they fail to generate the most salient property of the intermediate representation for faces found in the brain: mirror-symmetric tuning of the neural population to head orientation. Here we prove that a class of hierarchical architectures and a broad set of biologically plausible learning rules can provide approximate invariance at the top level of the network. While most of the learning rules do not yield mirror-symmetry in the mid-level representations, we characterize a specific biologically plausible Hebb-type learning rule that is guaranteed to generate mirror-symmetric tuning to faces at intermediate levels of the architecture.
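
One way to see why a Hebbian, PCA-like rule can force mirror symmetry (a toy construction of ours, not the paper's face stimuli): if experience contains every view together with its mirror image, the input covariance commutes with the reflection operator, so the eigenvectors that Oja-type Hebbian learning converges to are even or odd under reflection, and the magnitude of a unit's response is then identical for a view and its mirror.

    # Covariance of a mirror-closed dataset commutes with the flip operator,
    # so its eigenvectors (the fixed points of Oja's Hebbian rule) are even or
    # odd under reflection -- and response magnitudes are mirror-symmetric.
    import numpy as np

    rng = np.random.default_rng(0)
    patterns = rng.normal(size=(200, 32))
    data = np.vstack([patterns, patterns[:, ::-1]])    # each "view" plus its mirror

    cov = np.cov(data, rowvar=False)
    v = np.linalg.eigh(cov)[1][:, -1]                  # top principal component
    print(min(np.abs(v - v[::-1]).max(),
              np.abs(v + v[::-1]).max()))              # ~0: v is even or odd

    probe = rng.normal(size=32)
    print(abs(v @ probe), abs(v @ probe[::-1]))        # equal for a mirror pair
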
06/05/2016 ∙ by Joel Z. Leibo, et al.

Bridging the Gaps Between Residual Learning, Recurrent Neural Networks and Visual Cortex
We discuss relations between Residual Networks (ResNets), Recurrent Neural Networks (RNNs), and the primate visual cortex. We begin with the observation that a shallow RNN is exactly equivalent to a very deep ResNet with weight sharing among the layers. A direct implementation of such an RNN, although having orders of magnitude fewer parameters, achieves performance similar to the corresponding ResNet. We propose 1) a generalization of both RNN and ResNet architectures and 2) the conjecture that a class of moderately deep RNNs is a biologically plausible model of the ventral stream in visual cortex. We demonstrate the effectiveness of the architectures by testing them on the CIFAR-10 dataset.
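
The opening observation is purely structural and easy to state in code; the sketch below (ours, with a toy one-matrix residual branch in place of the paper's convolutional blocks) shows that tying the weights of every residual block turns a deep ResNet into an RNN unrolled in time.

    # A T-block ResNet whose blocks share one weight matrix W computes exactly
    # the same function as the update h <- h + f(h; W) iterated for T steps.
    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(64, 64))

    def resnet(x, blocks):
        h = x
        for Wt in blocks:                        # one residual block per layer
            h = h + np.maximum(0.0, h @ Wt)
        return h

    def shallow_rnn(x, W, T):
        h = x
        for _ in range(T):                       # the same update, iterated in time
            h = h + np.maximum(0.0, h @ W)
        return h

    x = rng.normal(size=(1, 64))
    print(np.allclose(resnet(x, [W] * 10), shallow_rnn(x, W, 10)))   # True
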
04/13/2016 ∙ by Qianli Liao, et al.

Theory II: Landscape of the Empirical Risk in Deep Learning
Previous theoretical work on deep learning and neural network optimization tends to focus on avoiding saddle points and local minima. However, the practical observation is that, at least in the case of the most successful Deep Convolutional Neural Networks (DCNNs), practitioners can always increase the network size to fit the training data (an extreme example would be [1]). The most successful DCNNs, such as VGG and ResNets, are best used with a degree of "overparametrization". In this work, we characterize, with a mix of theory and experiments, the landscape of the empirical risk of overparametrized DCNNs. We first prove, in the regression framework, the existence of a large number of degenerate global minimizers with zero empirical error (modulo inconsistent equations). The argument, which relies on Bezout's theorem, is rigorous when the ReLUs are replaced by a polynomial nonlinearity (which empirically works as well). As described in our Theory III [2] paper, the same minimizers are degenerate and thus very likely to be found by SGD, which will furthermore select with higher probability the most robust zero-minimizer. We further experimentally explore and visualize the landscape of the empirical risk of a DCNN on CIFAR-10 during the entire training process, and especially the global minima. Finally, based on our theoretical and experimental results, we propose an intuitive model of the landscape of a DCNN's empirical loss surface, which might not be as complicated as people commonly believe.
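
The flavor of the first claim can be checked in a few lines (a toy random-features stand-in of ours, not the paper's DCNN experiments): once the model has more units than training points, zero-training-error solutions exist in abundance, and different random seeds land on different global minimizers.

    # Overparametrized random-feature regression: with width > n, every random
    # hidden layer admits output weights that interpolate the data exactly,
    # so distinct seeds yield distinct zero-error global minimizers.
    import numpy as np

    n, d, width = 20, 5, 100
    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(n, d)), rng.normal(size=n)

    for seed in range(3):
        W1 = np.random.default_rng(seed).normal(size=(d, width))  # random features
        H = np.tanh(X @ W1)
        w2 = np.linalg.lstsq(H, y, rcond=None)[0]                 # fit output layer
        print(seed, np.abs(H @ w2 - y).max())                     # ~0 every time
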
03/28/2017 ∙ by Qianli Liao, et al.

Compression of Deep Neural Networks for Image Instance Retrieval
Image instance retrieval is the problem of retrieving images from a database that contain the same object. Convolutional Neural Network (CNN) based descriptors are becoming the dominant approach for generating global image descriptors for the instance retrieval problem. One major drawback of CNN-based global descriptors is that uncompressed deep neural network models require hundreds of megabytes of storage, making them inconvenient to deploy in mobile applications or in custom hardware. In this work, we study the problem of neural network model compression, focusing on the image instance retrieval task. We study quantization, coding, pruning, and weight sharing techniques for reducing model size for the instance retrieval problem. We provide extensive experimental results on the trade-off between retrieval performance and model size for different types of networks on several data sets, providing the most comprehensive study on this topic. We compress models to the order of a few MB: two orders of magnitude smaller than the uncompressed models, while achieving negligible loss in retrieval performance.
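
As one concrete example of the techniques surveyed, here is a sketch of weight sharing via k-means (our illustration; the paper's exact pipeline and hyperparameters differ): every weight is snapped to the nearest of k codebook values, so only the small codebook plus log2(k)-bit indices need to be stored.

    # k-means weight sharing: cluster the weights, store 16 centroids plus a
    # 4-bit index per weight (~8x smaller than float32) with small error.
    import numpy as np

    def kmeans_quantize(W, k=16, iters=20):
        flat = W.ravel()
        centroids = np.quantile(flat, np.linspace(0, 1, k))  # spread initial codebook
        for _ in range(iters):
            idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
            for j in range(k):
                if np.any(idx == j):
                    centroids[j] = flat[idx == j].mean()     # k-means update
        return centroids[idx].reshape(W.shape)

    W = np.random.default_rng(0).normal(size=(256, 256)).astype(np.float32)
    Wq = kmeans_quantize(W)
    print("relative error:", np.linalg.norm(W - Wq) / np.linalg.norm(W))
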
01/18/2017 ∙ by Vijay Chandrasekhar, et al.

Unsupervised learning of clutter-resistant visual representations from natural videos
Populations of neurons in inferotemporal cortex (IT) maintain an explicit code for object identity that also tolerates transformations of object appearance, e.g., position, scale, and viewing angle [1, 2, 3]. Though the learning rules are not known, recent results [4, 5, 6] suggest the operation of an unsupervised temporal-association-based method, e.g., Foldiak's trace rule [7]. Such methods exploit the temporal continuity of the visual world by assuming that visual experience over short timescales will tend to have invariant identity content. Thus, by associating representations of frames from nearby times, a representation that tolerates whatever transformations occurred in the video may be achieved. Many previous studies verified that such rules can work in simple situations without background clutter, but the presence of visual clutter has remained problematic for this approach. Here we show that temporal association based on large class-specific filters (templates) avoids the problem of clutter. Our system learns in an unsupervised way from natural videos gathered from the internet, and is able to perform a difficult unconstrained face recognition task on natural images: Labeled Faces in the Wild [8].
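
For reference, the trace rule itself is a one-line modification of Hebbian learning; the sketch below (our notation, with random vectors standing in for frame features) replaces the instantaneous postsynaptic activity with a slowly decaying temporal trace, so temporally adjacent frames, which likely share identity, reinforce the same weights.

    # Foldiak-style trace rule: Hebbian update driven by a low-pass-filtered
    # ("trace") version of the unit's response rather than the response itself.
    import numpy as np

    def trace_rule_step(w, x_t, y_trace, lr=0.01, delta=0.2):
        y_t = w @ x_t                                    # current response
        y_trace = (1.0 - delta) * y_trace + delta * y_t  # temporal trace
        w = w + lr * y_trace * x_t                       # Hebb with the trace
        return w / np.linalg.norm(w), y_trace            # normalize to bound w

    rng = np.random.default_rng(0)
    w, y_trace = rng.normal(size=64), 0.0
    video = rng.normal(size=(100, 64))                   # stand-in frame features
    for x_t in video:
        w, y_trace = trace_rule_step(w, x_t, y_trace)
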
09/12/2014 ∙ by Qianli Liao, et al.

Can a biologically-plausible hierarchy effectively replace face detection, alignment, and recognition pipelines?
The standard approach to unconstrained face recognition in natural photographs is via a detection, alignment, and recognition pipeline. While that approach has achieved impressive results, there are several reasons to be dissatisfied with it, among them its lack of biological plausibility. A recent theory of invariant recognition by feedforward hierarchical networks, like HMAX, other convolutional networks, or possibly the ventral stream, implies an alternative approach to unconstrained face recognition. This approach accomplishes detection and alignment implicitly by storing transformations of training images (called templates) rather than explicitly detecting and aligning faces at test time. Here we propose a particular voting scheme based on locality-sensitive hashing, which we call "consensus of collisions", and show that it can be used to approximate the full three-layer hierarchy implied by the theory. The resulting end-to-end system for unconstrained face recognition operates on photographs of faces taken under natural conditions, e.g., Labeled Faces in the Wild (LFW), without aligning or cropping them, as is normally done. It achieves a drastic improvement in the state of the art on this end-to-end task, reaching the same level of performance as the best systems operating on aligned, closely cropped images (with no outside training data). It also performs well on two newer datasets, similar to LFW but more difficult: LFW-jittered (new here) and SUFR-W.
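
To give a feel for the voting scheme, here is a stripped-down sketch of "consensus of collisions" using random-hyperplane LSH (our simplification with synthetic vectors; the paper embeds this in a three-layer hierarchy over face templates): a query votes for whichever identity it collides with most often across hash tables.

    # Random-hyperplane LSH with collision voting: stored template signatures
    # fill the hash tables; a query's collisions act as votes over identities.
    import numpy as np

    rng = np.random.default_rng(0)
    d, n_tables, bits, n_ids = 128, 16, 12, 10
    planes = rng.normal(size=(n_tables, bits, d))      # hyperplanes per table

    def signatures(x):
        return ((planes @ x) > 0).astype(np.uint8)     # one bit pattern per table

    templates = rng.normal(size=(500, d))              # stand-ins for face templates
    labels = rng.integers(0, n_ids, size=500)
    tables = [dict() for _ in range(n_tables)]
    for vec, lab in zip(templates, labels):
        for t, sig in enumerate(signatures(vec)):
            tables[t].setdefault(sig.tobytes(), []).append(lab)

    def classify(x):
        votes = np.zeros(n_ids)
        for t, sig in enumerate(signatures(x)):
            for lab in tables[t].get(sig.tobytes(), []):
                votes[lab] += 1                        # each collision is one vote
        return int(votes.argmax())                     # the consensus of collisions

    print(classify(templates[0] + 0.05 * rng.normal(size=d)), labels[0])
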
11/16/2013 ∙ by Qianli Liao, et al.

Theory of Deep Learning III: explaining the non-overfitting puzzle
A main puzzle of deep networks revolves around the absence of overfitting despite overparametrization and despite the large capacity demonstrated by zero training error on randomly labeled data. In this note, we show that the dynamical systems associated with gradient descent minimization of nonlinear networks behave, near stable zero minima of the empirical error, as a gradient system in a quadratic potential with a degenerate Hessian. The proposition is supported by theoretical and numerical results, under the assumption of stable minima of the gradient. Our proposition provides the extension to deep networks of key properties of gradient descent methods for linear networks, which, as suggested in (1), can be the key to understanding generalization. Gradient descent enforces a form of implicit regularization controlled by the number of iterations, asymptotically converging to the minimum-norm solution. This implies that there is usually an optimum early stopping that avoids overfitting of the loss (this is relevant mainly for regression). For classification, the asymptotic convergence to the minimum-norm solution implies convergence to the maximum-margin solution, which guarantees good classification error for "low noise" datasets. The implied robustness to overparametrization has suggestive implications for the robustness of deep hierarchically local networks to variations of the architecture with respect to the curse of dimensionality.
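
The linear-network property being extended is simple to verify numerically; the following sketch (ours) runs plain gradient descent on an underdetermined least-squares problem from a zero initialization and recovers the minimum-norm (pseudoinverse) solution with no explicit penalty.

    # Implicit regularization: gradient descent from w = 0 on an underdetermined
    # least-squares problem converges to the minimum-norm interpolant.
    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(10, 50))                # 10 equations, 50 unknowns
    b = rng.normal(size=10)

    w = np.zeros(50)
    for _ in range(20000):
        w -= 0.005 * A.T @ (A @ w - b)           # GD on ||Aw - b||^2 / 2

    w_min_norm = np.linalg.pinv(A) @ b           # the minimum-norm solution
    print(np.linalg.norm(w - w_min_norm))        # ~0: same solution
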
12/30/2017 ∙ by Tomaso Poggio, et al.
Qianli Liao
Doctor of Philosophy (Ph.D.) at Massachusetts Institute of Technology