
Multi-Head Attention: Collaborate Instead of Concatenate
Attention layers are widely used in natural language processing (NLP) an...

Taming GANs with Lookahead
Generative Adversarial Networks are notoriously challenging to train. Th...

Byzantine-Robust Learning on Heterogeneous Datasets via Resampling
In Byzantine robust distributed optimization, a central server wants to ...

Dynamic Model Pruning with Feedback
Deep neural networks often have millions of parameters. This can hinder ...

Ensemble Distillation for Robust Model Fusion in Federated Learning
Federated Learning (FL) is a machine learning setting where many devices...

Extrapolation for Large-batch Training in Deep Learning
Deep learning networks are typically trained by Stochastic Gradient Desc...

Secure Byzantine-Robust Machine Learning
Increasingly machine learning systems are being deployed to edge servers...

Masking as an Efficient Alternative to Finetuning for Pretrained Language Models
We present an efficient method of utilizing pretrained language models, ...

Data Parallelism in Training Sparse Neural Networks
Network pruning is an effective methodology to compress large neural net...

A Unified Theory of Decentralized SGD with Changing Topology and Local Updates
Decentralized stochastic optimization methods have gained a lot of atten...

Robust Cross-lingual Embeddings from Parallel Sentences
Recent advances in cross-lingual word embeddings have primarily relied o...

Advances and Open Problems in Federated Learning
Federated learning (FL) is a machine learning setting where many clients...

On the Relationship between Self-Attention and Convolutional Layers
Recent trends of incorporating attention mechanisms in vision have led r...

On the Tunability of Optimizers in Deep Learning
There is no consensus yet on the question whether adaptive gradient meth...

Model Fusion via Optimal Transport
Combining different models is a widely used paradigm in machine learning...

Decentralized Deep Learning with Arbitrary Communication Compression
Decentralized training of deep learning models is a key element for enab...

Correlating Twitter Language with Community-Level Health Outcomes
We study how language on social media is linked to diseases such as athe...

PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization
We study gradient compression methods to alleviate the communication bot...

On Linear Learning with Manycore Processors
A new generation of manycore processors is on the rise that offers dozen...

Better Word Embeddings by Disentangling Contextual n-Gram Information
Pretrained word vectors are ubiquitous in Natural Language Processing a...

Crosslingual Document Embedding as Reduced-Rank Ridge Regression
There has recently been much interest in extending vector-based word rep...

SysML: The New Frontier of Machine Learning Systems
Machine learning (ML) techniques are enjoying rapidly increasing adoptio...

Structure Tree-LSTM: Structure-aware Attentional Document Encoders
We propose a method to create document representations that reflect thei...

Forecasting intracranial hypertension using multiscale waveform metrics
Objective: Intracranial hypertension is an important risk factor of seco...

Overcoming Multi-Model Forgetting
We identify a phenomenon, which we refer to as multi-model forgetting, t...

Evaluating the Search Phase of Neural Architecture Search
Neural Architecture Search (NAS) aims to facilitate the design of deep n...

Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication
We consider decentralized stochastic optimization with the objective fun...

Unsupervised Scalable Representation Learning for Multivariate Time Series
Time series constitute a challenging data type for machine learning algo...

Error Feedback Fixes SignSGD and other Gradient Compression Schemes
Sign-based algorithms (e.g. signSGD) have been proposed as a biased grad...

Efficient Greedy Coordinate Descent for Composite Problems
Coordinate descent with random coordinate selection is the current state...

Sparsified SGD with Memory
Huge scale machine learning problems are nowadays tackled by distributed...

Wasserstein is all you need
We propose a unified framework for building unsupervised representations...

Don't Use Large Mini-Batches, Use Local SGD
Mini-batch stochastic gradient methods are the current state of the art ...

COLA: Communication-Efficient Decentralized Linear Learning
Decentralized machine learning is a promising emerging paradigm in view ...

A Distributed Second-Order Algorithm You Can Trust
Due to the rapid growth of data and computational resources, distributed...

Global linear convergence of Newton's method without strong-convexity or Lipschitz gradients
We show that Newton's method converges globally at a linear rate for obj...

Training DNNs with Hybrid Block Floating Point
The wide adoption of DNNs has given birth to unrelenting computing requi...

End-to-End DNN Training with Block Floating Point Arithmetic
DNNs are ubiquitous datacenter workloads, requiring orders of magnitude ...

Revisiting First-Order Convex Optimization Over Linear Spaces
Two popular examples of first-order optimization methods over linear spa...

EmbedRank: Unsupervised Keyphrase Extraction using Sentence Embeddings
Keyphrase extraction is the task of automatically selecting a small set ...

Efficient Use of Limited-Memory Accelerators for Linear Learning on Heterogeneous Systems
We propose a generic algorithmic building block to accelerate training o...

Learning Aerial Image Segmentation from Online Maps
This study deals with semantic segmentation of high-resolution (aerial) ...

Unsupervised robust nonparametric learning of hidden community properties
We consider learning of fundamental properties of communities in large n...

Greedy Algorithms for Cone Constrained Optimization with Convergence Guarantees
Greedy optimization methods such as Matching Pursuit (MP) and Frank-Wolf...

Generating Steganographic Text with LSTMs
Motivated by concerns for user privacy, we design a steganographic syste...

Faster Coordinate Descent via Adaptive Importance Sampling
Coordinate descent methods employ random partial updates of decision var...

Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features
The recent tremendous success of unsupervised word embeddings in a multi...

Leveraging Large Amounts of Weakly Supervised Data for Multi-Language Sentiment Classification
This paper presents a novel approach for multilingual sentiment classif...

A Unified Optimization View on Generalized Matching Pursuit and Frank-Wolfe
Two of the most fundamental prototypes of greedy optimization are the ma...

Screening Rules for Convex Problems
We propose a new framework for deriving screening rules for convex optim...
Martin Jaggi
Tenure-Track Assistant Professor at EPFL (École polytechnique fédérale de Lausanne)