Sparse mixture of expert architectures (MoEs) scale model capacity witho...
The ubiquitous and demonstrably suboptimal choice of resizing images to ...
The top-k operator returns a k-sparse vector, where the non-zero values...
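As a small, hedged illustration of the operator described in the entry above, here is a minimal NumPy sketch of a hard top-k that keeps the k largest entries and zeroes out the rest; the helper name topk_sparse and the example vector are illustrative, not taken from the paper.

    import numpy as np

    def topk_sparse(x, k):
        """Keep the k largest entries of x and zero the rest (a k-sparse vector).

        Illustrative sketch only; assumes a 1-D input array.
        """
        out = np.zeros_like(x)
        idx = np.argpartition(x, -k)[-k:]  # indices of the k largest values
        out[idx] = x[idx]
        return out

    # Usage: with k=2 only the two largest entries survive.
    print(topk_sparse(np.array([0.1, 3.0, -1.2, 2.5, 0.7]), k=2))  # [0.  3.  0.  2.5 0. ]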
Training large, deep neural networks to convergence can be prohibitively...
Adversarial robustness is a key desirable property of neural networks. I...
Regularized optimal transport (OT) is now increasingly used as a loss or...
Effective scaling and a flexible task interface enable large language mo...
Large sparsely-activated models have obtained excellent performance in m...
Transformers are widely applied to solve natural language understanding ...
Machine learning models based on the aggregated outputs of submodels, ei...
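To make the prediction-level aggregation mentioned in the entry above concrete, here is a hedged NumPy sketch that simply averages the class-probability outputs of the ensemble members; the helper ensemble_predict and the example values are hypothetical.

    import numpy as np

    def ensemble_predict(member_probs):
        """Average per-member probability vectors into one ensemble prediction."""
        return np.mean(np.stack(member_probs, axis=0), axis=0)

    # Usage: three members, three classes.
    members = [np.array([0.7, 0.2, 0.1]),
               np.array([0.5, 0.3, 0.2]),
               np.array([0.6, 0.3, 0.1])]
    print(ensemble_predict(members))  # [0.6        0.26666667 0.13333333]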
Sparsely-gated Mixture of Experts networks (MoEs) have demonstrated exce...
In the low-data regime, it is difficult to train good supervised models ...
Transfer learning has been recently popularized as a data-efficient alte...
Transfer of pre-trained representations can improve sample efficiency an...
Modern deep convolutional networks (CNNs) are often criticized for not g...
Transfer of pre-trained representations improves sample efficiency and s...