
Transformed CNNs: recasting pretrained convolutional layers with self-attention
Vision Transformers (ViT) have recently emerged as a powerful alternativ...

ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases
Convolutional architectures have proven extremely successful for vision ...

More data or more parameters? Investigating the effect of data structure on generalization
One of the central features of deep learning is the generalization abili...

Triple descent and the two kinds of overfitting: Where & why do they appear?
A recent line of research has highlighted the existence of a double desc...

On the Heavy-Tailed Theory of Stochastic Gradient Descent for Deep Neural Networks
The gradient noise (GN) in the stochastic gradient descent (SGD) algorit...

Finding the Needle in the Haystack with Convolutions: on the benefits of architectural bias
Despite the phenomenal success of deep neural networks in a broad range ...

A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks
The gradient noise (GN) in the stochastic gradient descent (SGD) algorit...

Scaling description of generalization with number of parameters in deep learning
We provide a description for the evolution of the generalization perform...

A jamming transition from under- to over-parametrization affects loss landscape and generalization
We argue that in fully-connected networks a phase transition delimits th...

The jamming transition as a paradigm to understand the loss landscape of deep neural networks
Deep learning has been immensely successful at a variety of tasks, rangi...

SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine
We publicly release a new large-scale dataset, called SearchQA, for mach...

Perspective: Energy Landscapes for Machine Learning
Machine learning techniques are being increasingly used as flexible non...

Eigenvalues of the Hessian in Deep Learning: Singularity and Beyond
We look at the eigenvalues of the Hessian of a loss function before and ...

Entropy-SGD: Biasing Gradient Descent Into Wide Valleys
This paper proposes a new optimization algorithm called Entropy-SGD for ...

Universal halting times in optimization and machine learning
The authors present empirical distributions for the halting time (measur...

Explorations on high dimensional landscapes
Finding minima of a real valued non-convex function over a high dimensio...
Levent Sagun