
Privacy-Utility Tradeoff of Linear Regression under Random Projections and Additive Noise
Data privacy is an important concern in machine learning, and is fundamentally at odds with the task of training useful learning models, which typically require the acquisition of large amounts of private user data. One possible way of fulfilling the machine learning task while preserving user privacy is to train the model on a transformed, noisy version of the data, which does not reveal the data itself directly to the training procedure. In this work, we analyze the privacy-utility tradeoff of two such schemes for the problem of linear regression: additive noise and random projections. In contrast to previous work, we consider a recently proposed notion of differential privacy that is based on conditional mutual information (MI-DP), which is stronger than the conventional (ϵ, δ)-differential privacy, and use relative objective error as the utility metric. We find that projecting the data to a lower-dimensional subspace before adding noise attains a better tradeoff in general. We also make a connection between this privacy problem and the (non-coherent) SIMO channel, which has been extensively studied in wireless communication, and use tools from that field in our analysis. We present numerical results demonstrating the performance of the schemes.
02/13/2019 ∙ by Mehrdad Showkatbakhsh, et al.
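As a rough illustration of the two schemes compared above, the NumPy sketch below (all dimensions and noise scales are made-up illustrative values, not the paper's parameters) trains ordinary least squares on additively-noised data and on randomly projected, noised data, and evaluates the relative objective error used as the utility metric:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, k = 1000, 20, 100      # samples, features, projected dimension
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

sigma = 1.0                  # noise scale acting as the privacy knob

# Scheme 1: additive noise only.
X_noisy = X + sigma * rng.normal(size=X.shape)
y_noisy = y + sigma * rng.normal(size=n)

# Scheme 2: random projection to a lower-dimensional subspace, then noise.
P = rng.normal(size=(k, n)) / np.sqrt(k)
X_proj = P @ X + sigma * rng.normal(size=(k, d))
y_proj = P @ y + sigma * rng.normal(size=k)

# The solver only ever sees the transformed data, never X or y directly.
w_noisy, *_ = np.linalg.lstsq(X_noisy, y_noisy, rcond=None)
w_proj, *_ = np.linalg.lstsq(X_proj, y_proj, rcond=None)

def rel_error(w):
    """Relative objective error versus the non-private least-squares fit."""
    w_opt, *_ = np.linalg.lstsq(X, y, rcond=None)
    f = lambda u: np.sum((X @ u - y) ** 2)
    return (f(w) - f(w_opt)) / f(w_opt)
```

Comparing `rel_error(w_noisy)` and `rel_error(w_proj)` across noise scales traces out the utility side of the tradeoff; the privacy side (the MI-DP guarantee) is what the paper's analysis quantifies.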

Differentially Private Consensus-Based Distributed Optimization
Data privacy is an important concern in learning, when datasets contain sensitive information about individuals. This paper considers consensus-based distributed optimization under data privacy constraints. Consensus-based optimization consists of a set of computational nodes arranged in a graph, each having a local objective that depends on their local data, where in every step nodes take a linear combination of their neighbors' messages, as well as taking a new gradient step. Since the algorithm requires exchanging messages that depend on local data, private information gets leaked at every step. Taking (ϵ, δ)-differential privacy (DP) as our criterion, we consider the strategy where the nodes add random noise to their messages before broadcasting them, and show that the method achieves convergence with a bounded mean-squared error, while satisfying (ϵ, δ)-DP. By relaxing the more stringent ϵ-DP requirement in previous work, we strengthen a known convergence result in the literature. We conclude the paper with numerical results demonstrating the effectiveness of our methods for mean estimation.
03/19/2019 ∙ by Mehrdad Showkatbakhsh, et al.
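A minimal sketch of the noise-before-broadcast strategy, for the mean-estimation example the abstract closes with: nodes on a ring mix their neighbors' noise-perturbed messages and take a local gradient step. The ring topology, mixing weights, geometric noise decay, and 1/t step size here are illustrative assumptions, not the paper's exact parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy mean estimation: node i holds a private value a_i and has local
# objective (x - a_i)^2 / 2; the network-wide optimum is the mean of a.
n_nodes, T = 8, 200
a = 5.0 + rng.normal(size=n_nodes)
x = np.zeros(n_nodes)

# Doubly stochastic mixing matrix for a ring (self plus two neighbors).
W = np.zeros((n_nodes, n_nodes))
for i in range(n_nodes):
    W[i, i] = 0.5
    W[i, (i - 1) % n_nodes] = 0.25
    W[i, (i + 1) % n_nodes] = 0.25

sigma, q = 0.5, 0.98   # initial noise scale and geometric decay factor
for t in range(1, T + 1):
    noise = sigma * q ** t * rng.normal(size=n_nodes)
    msgs = x + noise                      # perturb before broadcasting
    x = W @ msgs - (1.0 / t) * (x - a)    # mix neighbors, local gradient step
```

The injected noise keeps raw values private while the iterates still settle near the network mean, matching the bounded mean-squared error behavior the paper proves.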

Straggler Mitigation in Distributed Optimization Through Data Encoding
Slow-running or straggler tasks can significantly reduce computation speed in distributed computation. Recently, coding-theory-inspired approaches have been applied to mitigate the effect of straggling, through embedding redundancy in certain linear computational steps of the optimization algorithm, thus completing the computation without waiting for the stragglers. In this paper, we propose an alternate approach where we embed the redundancy directly in the data itself, and allow the computation to proceed completely oblivious to encoding. We propose several encoding schemes, and demonstrate that popular batch algorithms, such as gradient descent and L-BFGS, applied in a coding-oblivious manner, deterministically achieve sample path linear convergence to an approximate solution of the original problem, using an arbitrarily varying subset of the nodes at each iteration. Moreover, this approximation can be controlled by the amount of redundancy and the number of nodes used in each iteration. We provide experimental results demonstrating the advantage of the approach over uncoded and data replication strategies.
11/14/2017 ∙ by Can Karakus, et al.
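The core idea (encode the data once, then run the algorithm unmodified while dropping whichever node straggles) can be sketched as below. The Gaussian encoding matrix, 1.5x redundancy, six-worker partition, and rotating-straggler pattern are illustrative assumptions, not the paper's specific schemes:

```python
import numpy as np

rng = np.random.default_rng(2)

n, d = 120, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true

# Encode once: S has 1.5x as many rows as the data (the redundancy), and
# the workers store only the encoded pair (S X, S y).
m = int(1.5 * n)
S = rng.normal(size=(m, n)) / np.sqrt(m)
Xe, ye = S @ X, S @ y

# Split encoded rows across 6 workers; each iteration one worker straggles
# and its partial gradient is simply dropped -- the update is oblivious
# to the encoding and to which node was slow.
blocks = np.array_split(np.arange(m), 6)
lr = 1.0 / np.linalg.norm(Xe, ord=2) ** 2   # safe step size for any subset

w = np.zeros(d)
for it in range(1000):
    straggler = it % 6                      # adversarially rotating straggler
    g = np.zeros(d)
    for i, blk in enumerate(blocks):
        if i == straggler:
            continue                        # never wait for the slow node
        g += Xe[blk].T @ (Xe[blk] @ w - ye[blk])
    w -= lr * g
```

Because the redundancy lives in the data, any sufficiently large subset of encoded blocks still determines the solution, so gradient descent converges despite a different node being dropped every iteration.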

Redundancy Techniques for Straggler Mitigation in Distributed Optimization and Learning
Performance of distributed optimization and learning systems is bottlenecked by "straggler" nodes and slow communication links, which significantly delay computation. We propose a distributed optimization framework where the dataset is "encoded" to have an overcomplete representation with built-in redundancy, and the straggling nodes in the system are dynamically left out of the computation at every iteration, whose loss is compensated by the embedded redundancy. We show that oblivious application of several popular optimization algorithms on encoded data, including gradient descent, L-BFGS, proximal gradient under data parallelism, and coordinate descent under model parallelism, converge to either approximate or exact solutions of the original problem when stragglers are treated as erasures. These convergence results are deterministic, i.e., they establish sample path convergence for arbitrary sequences of delay patterns or distributions on the nodes, and are independent of the tail behavior of the delay distribution. We demonstrate that equiangular tight frames have desirable properties as encoding matrices, and propose efficient mechanisms for encoding large-scale data. We implement the proposed technique on Amazon EC2 clusters, and demonstrate its performance over several learning problems, including matrix factorization, LASSO, ridge regression and logistic regression, and compare the proposed method with uncoded, asynchronous, and data replication strategies.
03/14/2018 ∙ by Can Karakus, et al.
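Why tight frames make good encoding matrices can be seen in a small deterministic example. Below, a harmonic frame (columns of a unitary DFT matrix; equiangular tight frames, which the paper advocates, are a special case with additionally optimal coherence) stays well conditioned after erasing a row, i.e., after treating one straggler's data as an erasure. The specific sizes are arbitrary illustrative choices:

```python
import numpy as np

m, n = 32, 16                            # 2x redundancy
F = np.fft.fft(np.eye(m)) / np.sqrt(m)   # unitary DFT matrix
S = F[:, :n]                             # m x n harmonic tight frame (rows)

# Tightness: the frame operator of the rows is exactly the identity.
assert np.allclose(S.conj().T @ S, np.eye(n))

# Erase one row (a straggler treated as an erasure). Every row has squared
# norm n/m = 0.5, so the surviving frame operator deterministically
# satisfies lambda_min >= 1 - n/m = 0.5 -- no matter which row is lost.
S_sub = np.delete(S, 0, axis=0)
lmin = np.linalg.eigvalsh(S_sub.conj().T @ S_sub).min()
# Here the bound is met with equality: lmin = 0.5.
```

This worst-case (rather than probabilistic) conditioning is what enables the sample-path convergence guarantees for arbitrary delay patterns.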

Densifying Assumed-sparse Tensors: Improving Memory Efficiency and MPI Collective Performance during Tensor Accumulation for Parallelized Training of Neural Machine Translation
Neural machine translation (using neural networks to translate human language) is an area of active research exploring new neuron types and network topologies with the goal of dramatically improving machine translation performance. Current state-of-the-art approaches, such as the multi-head attention-based transformer, require very large translation corpora and many epochs to produce models of reasonable quality. Recent attempts to parallelize the official TensorFlow "Transformer" model across multiple nodes have hit roadblocks due to excessive memory use and resulting out-of-memory errors when performing MPI collectives. This paper describes modifications made to the Horovod MPI-based distributed training framework to reduce memory usage for transformer models by converting assumed-sparse tensors to dense tensors, and subsequently replacing sparse gradient gather with dense gradient reduction. The result is a dramatic increase in scale-out capability, with CPU-only scaling tests achieving 91% scaling efficiency (300 nodes), and up to 65% (200 nodes) using the Stampede2 supercomputer.
05/10/2019 ∙ by Derya Cavdar, et al.
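The sparse-gather vs. dense-reduce distinction can be illustrated without MPI. Each worker's gradient for a large embedding table arrives as (indices, values) slices, much as TensorFlow's IndexedSlices represents them; the sizes and slice counts below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
vocab, dim, n_workers = 1000, 8, 4

# Each worker touches only a few embedding rows, so its gradient is a
# variable-size (indices, values) pair rather than a full dense tensor.
sparse_grads = []
for _ in range(n_workers):
    idx = rng.choice(vocab, size=50, replace=False)
    val = rng.normal(size=(50, dim))
    sparse_grads.append((idx, val))

# Sparse aggregation needs a variable-size allgather: every worker ends up
# holding every other worker's index/value lists before summing locally,
# so memory grows with the union of all slices. Densifying instead
# scatters each slice into one fixed-size buffer, after which a single
# fixed-size sum-reduction (allreduce) completes the aggregation.
dense = np.zeros((vocab, dim))
for idx, val in sparse_grads:
    np.add.at(dense, idx, val)   # unbuffered add handles repeated indices
```

The reduction over fixed-size dense buffers is exactly equivalent to summing the gathered slices, but its memory footprint is independent of how many workers contribute.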

Qsparse-local-SGD: Distributed SGD with Quantization, Sparsification, and Local Computations
Communication bottleneck has been identified as a significant issue in distributed optimization of large-scale learning models. Recently, several approaches to mitigate this problem have been proposed, including different forms of gradient compression or computing local models and mixing them iteratively. In this paper we propose the Qsparse-local-SGD algorithm, which combines aggressive sparsification with quantization and local computation along with error compensation, by keeping track of the difference between the true and compressed gradients. We propose both synchronous and asynchronous implementations of Qsparse-local-SGD. We analyze convergence for Qsparse-local-SGD in the distributed setting for smooth nonconvex and convex objective functions. We demonstrate that Qsparse-local-SGD converges at the same rate as vanilla distributed SGD for many important classes of sparsifiers and quantizers. We use Qsparse-local-SGD to train ResNet-50 on ImageNet, and show that it results in significant savings over the state-of-the-art, in the number of bits transmitted to reach target accuracy.
06/06/2019 ∙ by Debraj Basu, et al.
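The sparsify-quantize-compensate loop can be sketched for a single worker on a least-squares objective (the multi-worker averaging and local-computation rounds of the full algorithm are omitted for brevity; the top-k sparsifier, scaled-sign quantizer, and all sizes below are illustrative assumptions):

```python
import numpy as np

def topk(v, k):
    """Sparsifier: keep the k largest-magnitude entries, zero the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def quantize(v):
    """Scaled-sign quantizer: one bit per surviving entry plus one scale."""
    nz = v != 0
    scale = np.abs(v).sum() / max(nz.sum(), 1)
    return scale * np.sign(v)

rng = np.random.default_rng(4)
n, d, k, lr = 200, 50, 10, 0.1
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d)

x = np.zeros(d)
memory = np.zeros(d)   # error compensation: what compression has dropped

for _ in range(3000):
    g = A.T @ (A @ x - b) / n        # gradient of 0.5*||Ax - b||^2 / n
    update = lr * g + memory         # fold past compression error back in
    compressed = quantize(topk(update, k))
    memory = update - compressed     # remember what was dropped this round
    x -= compressed                  # only `compressed` would be transmitted
```

Only the compressed vector (k signs, k indices, and one scale) would cross the network, yet the error-compensation memory ensures no gradient information is permanently lost, which is why convergence matches vanilla SGD's rate.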