
Unsupervised Image Matching and Object Discovery as Optimization
Learning with complete or partial supervision is powerful but relies on ever-growing human annotation efforts. As a way to mitigate this serious problem, as well as to serve specific applications, unsupervised learning has emerged as an important field of research. In computer vision, unsupervised learning comes in various guises. We focus here on the unsupervised discovery and matching of object categories among images in a collection, following the work of Cho et al. (2015). We show that the original approach can be reformulated and solved as a proper optimization problem. Experiments on several benchmarks establish the merit of our approach.
04/05/2019 ∙ by Huy V. Vo, et al.

Partially Encrypted Machine Learning using Functional Encryption
Machine learning on encrypted data has received a lot of attention thanks to recent breakthroughs in homomorphic encryption and secure multiparty computation. It allows outsourcing computation to untrusted servers without sacrificing the privacy of sensitive data. We propose a practical framework to perform partially encrypted and privacy-preserving predictions which combines adversarial training and functional encryption. We first present a new functional encryption scheme to efficiently compute quadratic functions so that the data owner controls what can be computed but is not involved in the calculation: it provides a decryption key which allows one to learn a specific function evaluation of some encrypted data. We then show how to use it in machine learning to partially encrypt neural networks with quadratic activation functions at evaluation time, and we provide a thorough analysis of the information leaks based on indistinguishability of data items with the same label. Last, since most encryption schemes cannot deal with the final thresholding operation used for classification, we propose a training method to prevent selected sensitive features from leaking, which adversarially optimizes the network against an adversary trying to identify these features. This is interesting for several existing works using partially encrypted machine learning, as it comes with little loss in the model's accuracy and significantly improves data privacy.
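To make the "quadratic functions" concrete: a network whose first layer uses a squared activation computes, for each class, an exact quadratic form x'Q_c x of the input, which is precisely the function class such an encryption scheme can evaluate. A minimal plaintext sketch (no cryptography, hypothetical toy sizes, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, k = 8, 4, 3             # toy input, hidden, and class dimensions
W1 = rng.normal(size=(h, d))  # layer applied to the (encrypted) input
W2 = rng.normal(size=(k, h))  # layer applied to plaintext activations

def forward(x):
    # Quadratic activation: each hidden unit computes (w_j . x)^2.
    z = (W1 @ x) ** 2
    return W2 @ z

# The same class scores written as explicit quadratic forms x' Q_c x,
# i.e. the functions a quadratic FE scheme would evaluate on ciphertexts.
Qs = [sum(W2[c, j] * np.outer(W1[j], W1[j]) for j in range(h))
      for c in range(k)]

x = rng.normal(size=d)
scores = forward(x)
scores_quad = np.array([x @ Q @ x for Q in Qs])  # matches `scores`
```

The equivalence holds because W2 @ z = sum_j W2[c,j] (w_j . x)^2 = x' (sum_j W2[c,j] w_j w_j') x.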
05/24/2019 ∙ by Theo Ryffel, et al.

On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport
Many tasks in machine learning and signal processing can be solved by minimizing a convex function of a measure. This includes sparse spikes deconvolution or training a neural network with a single hidden layer. For these problems, we study a simple minimization method: the unknown measure is discretized into a mixture of particles and a continuous-time gradient descent is performed on their weights and positions. This is an idealization of the usual way to train neural networks with a large hidden layer. We show that, when initialized correctly and in the many-particle limit, this gradient flow, although non-convex, converges to global minimizers. The proof involves Wasserstein gradient flows, a by-product of optimal transport theory. Numerical experiments show that this asymptotic behavior is already at play for a reasonable number of particles, even in high dimension.
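A toy illustration of the particle discretization (our numpy sketch on a hypothetical 1-D regression task, not the paper's code): the measure is replaced by m particles with weights w_i and positions a_i, and the continuous-time flow is discretized with plain gradient steps on both.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))   # hypothetical 1-D inputs
y = np.sin(3 * X[:, 0])                 # toy regression target

m = 100                       # number of particles (hidden units)
a = rng.normal(size=(m, 1))   # particle positions (input weights)
w = rng.normal(size=m) / m    # particle weights (output layer)

def predict(a, w, X):
    # Mixture-of-particles model: f(x) = sum_i w_i * tanh(a_i . x)
    return np.tanh(X @ a.T) @ w

lr, n = 0.05, len(X)
mse0 = np.mean((predict(a, w, X) - y) ** 2)
for _ in range(3000):
    H = np.tanh(X @ a.T)                    # activations, (n, m)
    r = H @ w - y                           # residuals
    grad_w = H.T @ r / n                    # gradient in the weights
    G = (1 - H ** 2) * w                    # tanh'(a_i . x) * w_i
    grad_a = (G * r[:, None]).T @ X / n     # gradient in the positions
    w -= lr * grad_w
    a -= lr * grad_a

mse = np.mean((predict(a, w, X) - y) ** 2)  # drops well below mse0
```

The step size and particle count are arbitrary toy choices; the point is only that weights and positions are updated jointly by the same gradient flow.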
05/24/2018 ∙ by Lenaïc Chizat, et al.

Globally Convergent Newton Methods for Ill-conditioned Generalized Self-concordant Losses
In this paper, we study large-scale convex optimization algorithms based on the Newton method applied to regularized generalized self-concordant losses, which include logistic regression and softmax regression. We first prove that our new simple scheme, based on a sequence of problems with decreasing regularization parameters, is provably globally convergent, and that this convergence is linear with a constant factor that scales only logarithmically with the condition number. In the parametric setting, we obtain an algorithm with the same scaling as regular first-order methods but with improved behavior, in particular on ill-conditioned problems. Second, in the nonparametric machine learning setting, we provide an explicit algorithm combining the previous scheme with Nyström projection techniques, and prove that it achieves optimal generalization bounds with a time complexity of order O(n df_λ), a memory complexity of order O(df_λ²), and no dependence on the condition number, generalizing the results known for least-squares regression. Here n is the number of observations and df_λ is the associated number of degrees of freedom. In particular, this is the first large-scale algorithm to solve logistic and softmax regressions in the nonparametric setting with large condition numbers and theoretical guarantees.
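The decreasing-regularization scheme can be sketched as follows (a toy numpy version for plain logistic regression, not the paper's algorithm or its Nyström variant; the halving schedule and step counts are arbitrary assumptions):

```python
import numpy as np

def newton_logreg(X, y, lam, w0, steps=5):
    # Full Newton steps on L2-regularized logistic loss, labels in {-1, +1}.
    w, n = w0.copy(), len(y)
    for _ in range(steps):
        z = np.clip(y * (X @ w), -30, 30)
        s = 1 / (1 + np.exp(z))                       # sigmoid(-y x.w)
        grad = -(X.T @ (y * s)) / n + lam * w
        H = (X.T * (s * (1 - s))) @ X / n + lam * np.eye(X.shape[1])
        w -= np.linalg.solve(H, grad)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
w_true = rng.normal(size=5)
y = np.sign(X @ w_true + 0.1 * rng.normal(size=500))

# Continuation: start well-conditioned, halve the regularization and
# warm-start Newton each time, instead of attacking the ill-conditioned
# target problem (lam = 1e-6) directly.
lam, lam_target, w = 1.0, 1e-6, np.zeros(5)
while lam > lam_target:
    w = newton_logreg(X, y, lam, w)
    lam /= 2
w = newton_logreg(X, y, lam_target, w)

acc = np.mean(np.sign(X @ w) == y)
```

Each subproblem stays in the region where Newton converges fast because its warm start comes from the previous, more regularized problem.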
07/03/2019 ∙ by Ulysse Marteau-Ferey, et al.

Localized Structured Prediction
Key to structured prediction is exploiting the problem structure to simplify the learning process. A major challenge arises when data exhibit a local structure (e.g., are made by "parts") that can be leveraged to better approximate the relation between (parts of) the input and (parts of) the output. Recent literature on signal processing, and in particular computer vision, has shown that capturing these aspects is indeed essential to achieve state-of-the-art performance. While such algorithms are typically derived on a case-by-case basis, in this work we propose the first theoretical framework to deal with part-based data from a general perspective. We derive a novel approach to deal with these problems and study its generalization properties within the setting of statistical learning theory. Our analysis is novel in that it explicitly quantifies the benefits of leveraging the part-based structure of the problem with respect to the learning rates of the proposed estimator.
06/06/2018 ∙ by Carlo Ciliberto, et al.

Marginal Weighted Maximum Log-likelihood for Efficient Learning of Perturb-and-MAP Models
We consider the structured-output prediction problem through probabilistic approaches and generalize the "perturb-and-MAP" framework to more challenging weighted Hamming losses, which are crucial in applications. While in principle our approach is a straightforward marginalization, it requires solving many related MAP inference problems. We show that for log-supermodular pairwise models these operations can be performed efficiently using the machinery of dynamic graph cuts. We also propose to use doubly stochastic gradient descent, both on the data and on the perturbations, for efficient learning. Our framework can naturally take weak supervision (e.g., partial labels) into account. We conduct a set of experiments on medium-scale character recognition and image segmentation, showing the benefits of our algorithms.
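The core perturb-and-MAP idea, stripped down (a toy numpy sketch on a small binary chain model with an attractive coupling, not the paper's dynamic-graph-cut machinery): perturb the unary potentials with i.i.d. Gumbel noise, solve a MAP problem per sample, and average the MAP solutions to approximate marginals.

```python
import numpy as np

def map_chain(unary, pair):
    # Max-sum dynamic programming (Viterbi) for a binary chain:
    # score(x) = sum_t unary[t, x_t] + pair * sum_t [x_t == x_{t+1}]
    T = len(unary)
    trans = np.array([[pair, 0.0], [0.0, pair]])  # agreement bonus
    msg = unary[0].copy()
    back = np.zeros((T, 2), dtype=int)
    for t in range(1, T):
        scores = msg[:, None] + trans             # (prev_state, cur_state)
        back[t] = scores.argmax(axis=0)
        msg = scores.max(axis=0) + unary[t]
    x = np.zeros(T, dtype=int)
    x[-1] = msg.argmax()
    for t in range(T - 1, 0, -1):
        x[t - 1] = back[t, x[t]]
    return x

rng = np.random.default_rng(0)
T = 10
unary = rng.normal(size=(T, 2))   # hypothetical unary potentials
pair = 1.0                        # attractive (log-supermodular) coupling

# Perturb-and-MAP: Gumbel-perturb the unaries, take the MAP, average.
n_samples = 500
counts = np.zeros((T, 2))
for _ in range(n_samples):
    x = map_chain(unary + rng.gumbel(size=(T, 2)), pair)
    counts[np.arange(T), x] += 1
marginals = counts / n_samples    # approximate per-node marginals
```

In the paper the per-sample MAP problems are solved with (dynamic) graph cuts rather than chain dynamic programming; the Viterbi solver here is just the simplest stand-in for a tractable MAP oracle.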
11/21/2018 ∙ by Tatiana Shpakova, et al.

A General Theory for Structured Prediction with Smooth Convex Surrogates
In this work we provide a theoretical framework for structured prediction that generalizes the existing theory of surrogate methods for binary and multiclass classification based on estimating conditional probabilities with smooth convex surrogates (e.g., logistic regression). The theory relies on a natural characterization of structural properties of the task loss and allows one to derive statistical guarantees for many widely used methods in the context of multilabeling, ranking, ordinal regression and graph matching. In particular, we characterize the smooth convex surrogates compatible with a given task loss in terms of a suitable Bregman divergence composed with a link function. This allows us to derive tight bounds for the calibration function and to obtain novel results on existing surrogate frameworks for structured prediction such as conditional random fields and quadratic surrogates.
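A minimal instance of this characterization, for binary classification (our worked example, not taken from the paper): with the negative binary entropy $\varphi(q) = q\log q + (1-q)\log(1-q)$ and the sigmoid link $q = \sigma(v)$, the Bregman-divergence-composed-with-link construction recovers the logistic loss.

```latex
% Bregman divergence of \varphi(q) = q\log q + (1-q)\log(1-q):
%   D_\varphi(p, q) = \varphi(p) - \varphi(q) - \varphi'(q)\,(p - q).
% Composing with the link q = \sigma(v) gives, for labels y \in \{0,1\},
D_\varphi(1, \sigma(v)) = -\log \sigma(v), \qquad
D_\varphi(0, \sigma(v)) = -\log \sigma(-v),
% which is exactly the logistic surrogate loss.
```

The first identity follows by direct computation: $D_\varphi(1,q) = -\varphi(q) - \varphi'(q)(1-q) = -\log q$, and symmetrically for $y=0$.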
02/05/2019 ∙ by Alex Nowak-Vila, et al.

A Universal Algorithm for Variational Inequalities Adaptive to Smoothness and Noise
We consider variational inequalities coming from monotone operators, a setting that includes convex minimization and convex-concave saddle-point problems. We assume access to potentially noisy unbiased values of the monotone operators and assess convergence through a compatible gap function which corresponds to the standard optimality criteria in the aforementioned subcases. We present a universal algorithm for these inequalities based on the Mirror-Prox algorithm. Concretely, our algorithm simultaneously achieves the optimal rates for the smooth/nonsmooth and noisy/noiseless settings. This is done without any prior knowledge of these properties, and in the general setup of arbitrary norms and compatible Bregman divergences. For convex minimization and convex-concave saddle-point problems, this leads to new adaptive algorithms. Our method relies on a novel yet simple adaptive choice of the step-size, which can be seen as the appropriate extension of AdaGrad to handle constrained problems.
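A sketch of the adaptive extragradient/Mirror-Prox update in the Euclidean case (our toy instantiation on a strongly convex-concave quadratic saddle-point problem, not the paper's general Bregman setup; the domain-size constant D is an assumed rough bound):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.normal(size=(d, d))

def F(z):
    # Monotone operator of min_x max_y 0.5||x||^2 + x'Ay - 0.5||y||^2,
    # F(z) = (grad_x, -grad_y); its unique zero is the saddle point z = 0.
    x, y = z[:d], z[d:]
    return np.concatenate([x + A @ y, -(A.T @ x) + y])

z = rng.normal(size=2 * d)
D = np.linalg.norm(z) + 1.0   # rough bound on the distance to the solution
G2 = 1e-12                    # accumulated squared operator norms

for _ in range(2000):
    g = F(z)
    G2 += g @ g
    eta = D / np.sqrt(G2)         # AdaGrad-style adaptive step-size
    z_half = z - eta * g          # extrapolation step
    z = z - eta * F(z_half)       # Mirror-Prox (extragradient) update

gap = np.linalg.norm(F(z))        # vanishes at the saddle point
```

The step-size needs no knowledge of the operator's Lipschitz constant or noise level: it shrinks automatically as squared operator norms accumulate.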
02/05/2019 ∙ by Francis Bach, et al.

Massively scalable Sinkhorn distances via the Nyström method
The Sinkhorn distance, a variant of the Wasserstein distance with entropic regularization, is an increasingly popular tool in machine learning and statistical inference. We give a simple, practical, parallelizable algorithm, Nys-Sink, based on Nyström approximation, for computing Sinkhorn distances on a massive scale. As we show in numerical experiments, our algorithm easily computes Sinkhorn distances on data sets hundreds of times larger than can be handled by state-of-the-art approaches. We also give provable guarantees establishing that the running time and memory requirements of our algorithm adapt to the intrinsic dimension of the underlying data.
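The idea in miniature (a toy numpy sketch, not the paper's implementation): Nyström-factor the Gibbs kernel with a handful of landmark points, then run Sinkhorn using only factored matrix-vector products, so the full n x n kernel is never formed. All sizes and the landmark choice below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 200, 2, 4.0               # toy sizes; eps = entropic regularization
X = rng.normal(size=(n, d))           # source point cloud
Y = rng.normal(size=(n, d)) + 0.5     # target point cloud
a = np.full(n, 1 / n)                 # source marginal
b = np.full(n, 1 / n)                 # target marginal

def gibbs(U, V):
    # Gibbs kernel K = exp(-C/eps) for the squared Euclidean cost.
    return np.exp(-((U[:, None] - V[None]) ** 2).sum(-1) / eps)

# Nystrom factorization K ~ K_XL K_LL^{-1} K_LY using 40 landmarks.
L = np.concatenate([X[:20], Y[:20]])
Kll = gibbs(L, L) + 1e-8 * np.eye(len(L))
Kxl = gibbs(X, L)
M = np.linalg.solve(Kll, gibbs(L, Y))          # K_LL^{-1} K_LY

def K_mv(v): return Kxl @ (M @ v)              # ~ K v,  in O(n r) time
def KT_mv(u): return M.T @ (Kxl.T @ u)         # ~ K' u, in O(n r) time

# Sinkhorn iterations touching the kernel only through the factors.
u, v = np.ones(n), np.ones(n)
for _ in range(200):
    u = a / np.maximum(K_mv(v), 1e-30)
    v = b / np.maximum(KT_mv(u), 1e-30)

row_err = np.abs(u * K_mv(v) - a).sum()        # marginal violation
col_err = np.abs(v * KT_mv(u) - b).sum()
```

Each iteration costs O(n r) with r landmarks instead of O(n²), which is what makes the massive scale possible.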
12/12/2018 ∙ by Jason Altschuler, et al.

Nonlinear Acceleration of Deep Neural Networks
Regularized nonlinear acceleration (RNA) is a generic extrapolation scheme for optimization methods, with marginal computational overhead. It aims to improve convergence using only the iterates of simple iterative algorithms. However, so far its application to optimization has been theoretically limited to gradient descent and other single-step algorithms. Here, we adapt RNA to a much broader setting including stochastic gradient with momentum and Nesterov's fast gradient. We use it to train deep neural networks, and empirically observe that extrapolated networks are more accurate, especially in the early iterations. A straightforward application of our algorithm when training ResNet-152 on ImageNet produces a top-1 test error of 20.88%, improving on the reference classification pipeline. Furthermore, the code runs offline in this case, so it never negatively affects performance.
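The RNA extrapolation step itself fits in a few lines (a generic numpy sketch, not the paper's released code, demonstrated here on plain gradient descent for a quadratic rather than on a deep network): find weights summing to one that minimize the norm of the combined residuals, with a small regularizer, and return the weighted combination of iterates.

```python
import numpy as np

def rna(xs, lam=1e-8):
    # Regularized Nonlinear Acceleration: choose c with sum(c) = 1
    # minimizing ||sum_i c_i (x_{i+1} - x_i)||^2 + lam ||c||^2,
    # then return the corresponding combination of the iterates.
    X = np.stack(xs, axis=1)              # iterates as columns, (d, k+1)
    R = X[:, 1:] - X[:, :-1]              # residuals, (d, k)
    k = R.shape[1]
    RR = R.T @ R
    RR /= np.linalg.norm(RR)              # normalize before regularizing
    z = np.linalg.solve(RR + lam * np.eye(k), np.ones(k))
    c = z / z.sum()
    return X[:, 1:] @ c                   # extrapolated point

rng = np.random.default_rng(0)
d = 20
B = rng.normal(size=(d, d))
A = B.T @ B / d + 0.1 * np.eye(d)         # moderately conditioned quadratic
b = rng.normal(size=d)
x_star = np.linalg.solve(A, b)

# Run a few steps of plain gradient descent, keeping the iterates.
x, xs = np.zeros(d), [np.zeros(d)]
eta = 1 / np.linalg.eigvalsh(A)[-1]
for _ in range(10):
    x = x - eta * (A @ x - b)
    xs.append(x.copy())

err_gd = np.linalg.norm(x - x_star)          # last gradient-descent iterate
err_rna = np.linalg.norm(rna(xs) - x_star)   # extrapolated iterate
```

The overhead is one small k x k linear solve per extrapolation, and since the extrapolated point is computed off to the side, the underlying training run is untouched.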
05/24/2018 ∙ by Damien Scieur, et al.

Nonlinear Acceleration of CNNs
The Regularized Nonlinear Acceleration (RNA) algorithm is an acceleration method capable of improving the rate of convergence of many optimization schemes such as gradient descent, SAGA or SVRG. Until now, its analysis has been limited to convex problems, but empirical observations show that RNA may be extended to wider settings. In this paper, we further investigate the benefits of RNA when applied to neural networks, in particular for the task of image recognition on CIFAR10 and ImageNet. With very few modifications of existing frameworks, RNA slightly improves the optimization process of CNNs after training.
06/01/2018 ∙ by Damien Scieur, et al.