
-
Modifying Memories in Transformer Models
Large Transformer models have achieved impressive performance in many na...
-
O(n) Connections are Expressive Enough: Universal Approximability of Sparse Transformers
Transformer networks use pairwise attention to compute contextual embedd...
-
Why distillation helps: a statistical perspective
Knowledge distillation is a technique for improving the performance of a...
-
Doubly-stochastic mining for heterogeneous retrieval
Modern retrieval problems are characterised by training sets with potent...
-
Federated Learning with Only Positive Labels
We consider learning a multi-class classification model in the federated...
-
Robust Large-Margin Learning in Hyperbolic Space
Recently, there has been a surge of interest in representation learning ...
-
Reliable Distributed Clustering with Redundant Data Assignment
In this paper, we present distributed generalized clustering algorithms ...
-
Low-Rank Bottleneck in Multi-head Attention Models
Attention based Transformer architecture has enabled significant advance...
-
Achieving Multi-Port Memory Performance on Single-Port Memory with Coding Techniques
Many performance critical systems today must rely on performance enhance...
-
Are Transformers universal approximators of sequence-to-sequence functions?
Despite the widespread adoption of Transformer models for NLP tasks, the...
-
Sampled Softmax with Random Fourier Features
The computational cost of training with softmax cross entropy loss grows...
-
The Generalized Lasso for Sub-gaussian Measurements with Dithered Quantization
In the problem of structured signal recovery from high-dimensional linea...
-
Robust Gradient Descent via Moment Encoding with LDPC Codes
This paper considers the problem of implementing large-scale gradient de...
-
Representation Learning and Recovery in the ReLU Model
Rectified linear units, or ReLUs, have become the preferred activation f...
-
The PhaseLift for Non-quadratic Gaussian Measurements
We study the problem of recovering a structured signal x_0 from high-dim...
-
MDS Code Constructions with Small Sub-packetization and Near-optimal Repair Bandwidth
This paper addresses the problem of constructing MDS codes that enable e...