
A Linearly Convergent Algorithm for Decentralized Optimization: Sending Less Bits for Free!
Decentralized optimization methods enable on-device training of machine ...
Sparse Communication for Training Deep Networks
Synchronous stochastic gradient descent (SGD) is the most common method ...
Mime: Mimicking Centralized Stochastic Algorithms in Federated Learning
Federated learning is a challenging optimization problem due to the hete...
PowerGossip: Practical Low-Rank Communication Compression in Decentralized Deep Learning
Lossy gradient compression has become a practical tool to overcome the c...
Multi-Head Attention: Collaborate Instead of Concatenate
Attention layers are widely used in natural language processing (NLP) an...
Taming GANs with Lookahead
Generative Adversarial Networks are notoriously challenging to train. Th...
Byzantine-Robust Learning on Heterogeneous Datasets via Resampling
In Byzantine robust distributed optimization, a central server wants to ...
Dynamic Model Pruning with Feedback
Deep neural networks often have millions of parameters. This can hinder ...
Ensemble Distillation for Robust Model Fusion in Federated Learning
Federated Learning (FL) is a machine learning setting where many devices...
Extrapolation for Large-batch Training in Deep Learning
Deep learning networks are typically trained by Stochastic Gradient Desc...
Secure Byzantine-Robust Machine Learning
Increasingly machine learning systems are being deployed to edge servers...
Masking as an Efficient Alternative to Finetuning for Pretrained Language Models
We present an efficient method of utilizing pretrained language models, ...
Data Parallelism in Training Sparse Neural Networks
Network pruning is an effective methodology to compress large neural net...
A Unified Theory of Decentralized SGD with Changing Topology and Local Updates
Decentralized stochastic optimization methods have gained a lot of atten...
Robust Cross-lingual Embeddings from Parallel Sentences
Recent advances in cross-lingual word embeddings have primarily relied o...
Advances and Open Problems in Federated Learning
Federated learning (FL) is a machine learning setting where many clients...
On the Relationship between Self-Attention and Convolutional Layers
Recent trends of incorporating attention mechanisms in vision have led r...
On the Tunability of Optimizers in Deep Learning
There is no consensus yet on the question whether adaptive gradient meth...
Model Fusion via Optimal Transport
Combining different models is a widely used paradigm in machine learning...
Decentralized Deep Learning with Arbitrary Communication Compression
Decentralized training of deep learning models is a key element for enab...
Correlating Twitter Language with Community-Level Health Outcomes
We study how language on social media is linked to diseases such as athe...
PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization
We study gradient compression methods to alleviate the communication bot...
On Linear Learning with Manycore Processors
A new generation of manycore processors is on the rise that offers dozen...
Better Word Embeddings by Disentangling Contextual n-Gram Information
Pretrained word vectors are ubiquitous in Natural Language Processing a...
Crosslingual Document Embedding as Reduced-Rank Ridge Regression
There has recently been much interest in extending vector-based word rep...
SysML: The New Frontier of Machine Learning Systems
Machine learning (ML) techniques are enjoying rapidly increasing adoptio...
Structure Tree-LSTM: Structure-aware Attentional Document Encoders
We propose a method to create document representations that reflect thei...
Forecasting intracranial hypertension using multiscale waveform metrics
Objective: Intracranial hypertension is an important risk factor of seco...
Overcoming Multi-Model Forgetting
We identify a phenomenon, which we refer to as multi-model forgetting, t...
Evaluating the Search Phase of Neural Architecture Search
Neural Architecture Search (NAS) aims to facilitate the design of deep n...
Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication
We consider decentralized stochastic optimization with the objective fun...
Unsupervised Scalable Representation Learning for Multivariate Time Series
Time series constitute a challenging data type for machine learning algo...
Error Feedback Fixes SignSGD and other Gradient Compression Schemes
Sign-based algorithms (e.g. signSGD) have been proposed as a biased grad...
Efficient Greedy Coordinate Descent for Composite Problems
Coordinate descent with random coordinate selection is the current state...
Sparsified SGD with Memory
Huge scale machine learning problems are nowadays tackled by distributed...
Wasserstein is all you need
We propose a unified framework for building unsupervised representations...
Don't Use Large Mini-Batches, Use Local SGD
Mini-batch stochastic gradient methods are the current state of the art ...
COLA: Communication-Efficient Decentralized Linear Learning
Decentralized machine learning is a promising emerging paradigm in view ...
A Distributed Second-Order Algorithm You Can Trust
Due to the rapid growth of data and computational resources, distributed...
Global linear convergence of Newton's method without strong-convexity or Lipschitz gradients
We show that Newton's method converges globally at a linear rate for obj...
Training DNNs with Hybrid Block Floating Point
The wide adoption of DNNs has given birth to unrelenting computing requi...
End-to-End DNN Training with Block Floating Point Arithmetic
DNNs are ubiquitous datacenter workloads, requiring orders of magnitude ...
Revisiting First-Order Convex Optimization Over Linear Spaces
Two popular examples of first-order optimization methods over linear spa...
EmbedRank: Unsupervised Keyphrase Extraction using Sentence Embeddings
Keyphrase extraction is the task of automatically selecting a small set ...
Efficient Use of Limited-Memory Accelerators for Linear Learning on Heterogeneous Systems
We propose a generic algorithmic building block to accelerate training o...
Learning Aerial Image Segmentation from Online Maps
This study deals with semantic segmentation of high-resolution (aerial) ...
Unsupervised robust nonparametric learning of hidden community properties
We consider learning of fundamental properties of communities in large n...
Greedy Algorithms for Cone Constrained Optimization with Convergence Guarantees
Greedy optimization methods such as Matching Pursuit (MP) and Frank-Wolf...
Generating Steganographic Text with LSTMs
Motivated by concerns for user privacy, we design a steganographic syste...
Faster Coordinate Descent via Adaptive Importance Sampling
Coordinate descent methods employ random partial updates of decision var...
Martin Jaggi
Tenure-Track Assistant Professor at EPFL (École polytechnique fédérale de Lausanne)