Yu-Xiang Wang

Scientist, Amazon AI (AWS); previously Applied Scientist, Amazon AI, and PhD student, Machine Learning Department, Carnegie Mellon University.

  • A Higher-Order Kolmogorov-Smirnov Test

    We present an extension of the Kolmogorov-Smirnov (KS) two-sample test, which can be more sensitive to differences in the tails. Our test statistic is an integral probability metric (IPM) defined over a higher-order total variation ball, recovering the original KS test as its simplest case. We give an exact representer result for our IPM, which generalizes the fact that the original KS test statistic can be expressed in equivalent variational and CDF forms. For small enough orders (k ≤ 5), we develop a linear-time algorithm for computing our higher-order KS test statistic; for all others (k ≥ 6), we give a nearly linear-time approximation. We derive the asymptotic null distribution for our test, and show that our nearly linear-time approximation shares the same asymptotic null. Lastly, we complement our theory with numerical studies.

    03/24/2019 ∙ by Veeranjaneyulu Sadhanala, et al.
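As a rough illustration, here is the classical statistic that this test generalizes, together with a hypothetical order-k variant that takes cumulative sums of the empirical CDF difference over the pooled sample (a discretized sketch for intuition, not the paper's exact representer-based algorithm):

```python
import numpy as np

def ks_statistic(x, y):
    """Classical two-sample KS statistic: sup_t |F_x(t) - F_y(t)|."""
    pooled = np.sort(np.concatenate([x, y]))
    # Empirical CDFs of each sample, evaluated on the pooled grid.
    fx = np.searchsorted(np.sort(x), pooled, side="right") / len(x)
    fy = np.searchsorted(np.sort(y), pooled, side="right") / len(y)
    return np.max(np.abs(fx - fy))

def higher_order_ks(x, y, k=1):
    """Illustrative order-k statistic: k cumulative sums of the CDF
    difference over the pooled grid, then a sup-norm. k = 0 recovers
    the classical statistic above."""
    pooled = np.sort(np.concatenate([x, y]))
    fx = np.searchsorted(np.sort(x), pooled, side="right") / len(x)
    fy = np.searchsorted(np.sort(y), pooled, side="right") / len(y)
    diff = fx - fy
    for _ in range(k):
        diff = np.cumsum(diff)
    return np.max(np.abs(diff))
```

Higher orders weight the tails more heavily, which is the intuition behind the increased tail sensitivity claimed above.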

  • Improving the Gaussian Mechanism for Differential Privacy: Analytical Calibration and Optimal Denoising

    The Gaussian mechanism is an essential building block used in a multitude of differentially private data analysis algorithms. In this paper we revisit the Gaussian mechanism and show that the original analysis has several important limitations. Our analysis reveals that the variance formula for the original mechanism is far from tight in the high privacy regime (ε→ 0) and that it cannot be extended to the low privacy regime (ε→∞). We address these limitations by developing an analytic Gaussian mechanism whose variance is calibrated directly using the Gaussian cumulative distribution function instead of a tail bound approximation. We also propose to equip the Gaussian mechanism with a post-processing step based on adaptive denoising estimators, leveraging the fact that the variance of the perturbation is known. Our experiments show that analytical calibration removes at least a third of the variance of the noise compared to the classical Gaussian mechanism, and that denoising dramatically improves the accuracy of the Gaussian mechanism in the high-dimensional regime.

    05/16/2018 ∙ by Borja Balle, et al.
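A minimal sketch of the two calibrations being compared: the classical tail-bound formula (valid only for ε < 1) and a bisection on the exact Gaussian-CDF expression for the privacy curve, in the spirit of the paper's analytic mechanism (function names and the bisection bracket are illustrative):

```python
import math

def classical_gaussian_sigma(eps, delta, sensitivity=1.0):
    """Classical calibration via a Gaussian tail bound (requires eps < 1)."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / eps

def analytic_gaussian_sigma(eps, delta, sensitivity=1.0):
    """Smallest sigma with delta(eps; sigma) <= delta, found by bisection
    on the exact Gaussian-CDF expression for the privacy curve."""
    Phi = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

    def delta_of(sigma):
        a = sensitivity / (2.0 * sigma)
        b = eps * sigma / sensitivity
        return Phi(a - b) - math.exp(eps) * Phi(-a - b)

    lo, hi = 1e-6, 1e3  # delta_of is decreasing in sigma on this bracket
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if delta_of(mid) > delta:
            lo = mid
        else:
            hi = mid
    return hi
```

For the same (ε, δ), the analytic calibration returns a noticeably smaller σ, which is the variance saving reported above.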

  • Imitation-Regularized Offline Learning

    We study the problem of offline learning in automated decision systems under the contextual bandits model. We are given logged historical data consisting of contexts, (randomized) actions, and (nonnegative) rewards. A common goal is to evaluate what would happen if different actions were taken in the same contexts, so as to optimize the action policies accordingly. The typical approach to this problem, inverse probability weighted estimation (IPWE) [Bottou et al., 2013], requires logged action probabilities, which may be missing in practice due to engineering complications. Even when available, small action probabilities cause large uncertainty in IPWE, rendering the corresponding results insignificant. To solve both problems, we show how one can use policy improvement (PIL) objectives, regularized by policy imitation (IML). We motivate and analyze PIL as an extension to Clipped-IPWE, by showing that both are lower-bound surrogates to the vanilla IPWE. We also formally connect IML to IPWE variance estimation [Swaminathan and Joachims, 2015] and natural policy gradients. Without probability logging, our PIL-IML interpretations justify and improve, by reward-weighting, the state-of-the-art cross-entropy (CE) loss that predicts the action items among all action candidates available in the same contexts. With probability logging, our main theoretical contribution connects IML-underfitting to the existence of either confounding variables or model misspecification. We show the value and accuracy of our insights by simulations based on Simpson's paradox, on standard UCI multiclass-to-bandit conversions, and on the Criteo counterfactual analysis challenge dataset.

    01/15/2019 ∙ by Yifei Ma, et al.
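To make the lower-bound relationship concrete, here is a minimal sketch of vanilla IPWE and its clipped variant (the clipping threshold and names are illustrative):

```python
import numpy as np

def ipwe(rewards, logged_probs, target_probs):
    """Vanilla inverse probability weighted estimate of the target policy value."""
    w = target_probs / logged_probs  # importance weights
    return np.mean(w * rewards)

def clipped_ipwe(rewards, logged_probs, target_probs, clip=10.0):
    """Clipped-IPWE: cap the importance weights at `clip`.
    With nonnegative rewards, each term can only shrink, so this is a
    lower-bound surrogate of the vanilla estimate with far smaller
    variance when logged probabilities are tiny."""
    w = np.minimum(target_probs / logged_probs, clip)
    return np.mean(w * rewards)
```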

  • Online Forecasting of Total-Variation-bounded Sequences

    We consider the problem of online forecasting of sequences of length n with total variation at most C_n using observations contaminated by independent σ-subgaussian noise. We design an O(n log n)-time algorithm that achieves a cumulative square error of Õ(n^1/3 C_n^2/3 σ^4/3) with high probability. The result is rate-optimal as it matches the known minimax rate for offline nonparametric estimation of the same class [Mammen and van de Geer, 1997]. To the best of our knowledge, this is the first polynomial-time algorithm that optimally forecasts total variation bounded sequences. Our proof techniques leverage the special localized structure of the Haar wavelet basis and the adaptivity to the unknown smoothness parameter in classical wavelet smoothing [Donoho et al., 1998]. We also compare our model to the rich literature on dynamic regret minimization and nonstationary stochastic optimization, where our problem can be treated as a special case. We show that the workhorses in those settings, online gradient descent and its variants with a fixed restarting schedule, are instances of a class of linear forecasters that incur a suboptimal regret of Ω̃(√(n)). This implies that more adaptive algorithms are necessary to obtain the optimal rate.

    06/08/2019 ∙ by Dheeraj Baby, et al.
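For intuition, here is a toy member of the linear-forecaster class discussed above: predict the running mean and restart on a fixed schedule (a strawman baseline, not the paper's wavelet-based algorithm):

```python
import numpy as np

def restart_mean_forecaster(y, restart_every):
    """Predict each y_t by the mean of observations since the last restart.
    Each restart discards all history, which is what forces the suboptimal
    regret on hard total-variation-bounded sequences."""
    preds = np.zeros_like(y, dtype=float)
    start = 0
    for t in range(len(y)):
        if t - start == restart_every:
            start = t  # fixed restarting schedule
        preds[t] = y[start:t].mean() if t > start else 0.0
    return preds
```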

  • Provably Efficient Q-Learning with Low Switching Cost

    We take initial steps in studying PAC-MDP algorithms with limited adaptivity, that is, algorithms that change their exploration policy as infrequently as possible during regret minimization. This is motivated by the difficulty of running fully adaptive algorithms in real-world applications (such as medical domains), and we propose to quantify adaptivity using the notion of local switching cost. Our main contribution, Q-Learning with UCB2 exploration, is a model-free algorithm for H-step episodic MDPs that achieves sublinear regret and whose local switching cost in K episodes is O(H^3SA log K), and we provide a lower bound of Ω(HSA) on the local switching cost of any no-regret algorithm. Our algorithm can be naturally adapted to the concurrent setting, yielding nontrivial results that improve upon prior work in certain aspects.

    05/30/2019 ∙ by Yu Bai, et al.
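The low switching cost comes from updating the executed policy only on an epoch schedule. As a hypothetical sketch of a UCB2-style schedule for a single state-action pair (Auer et al.'s UCB2, which the algorithm above adapts): epoch lengths grow geometrically, so the number of policy switches is only logarithmic in the visit count:

```python
import math

def ucb2_switch_count(total_pulls, alpha=0.5):
    """Count the epochs a UCB2-style schedule needs to cover `total_pulls`
    visits when epoch r ends once the cumulative count reaches
    ceil((1 + alpha)^r). The returned count grows like log(total_pulls)."""
    epochs, pulls = 0, 0
    while pulls < total_pulls:
        epochs += 1
        pulls = math.ceil((1 + alpha) ** epochs)
    return epochs
```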

  • Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting

    Time series forecasting is an important problem across many domains, including prediction of solar plant energy output, electricity consumption, and traffic conditions. In this paper, we propose to tackle such forecasting problems with the Transformer. Although impressed by its performance in our preliminary study, we found two major weaknesses: (1) locality-agnostic attention: the point-wise dot-product self-attention in the canonical Transformer architecture is insensitive to local context, which can make the model prone to anomalies in time series; (2) memory bottleneck: the space complexity of the canonical Transformer grows quadratically with the sequence length L, making modeling of long time series infeasible. To address these two issues, we first propose convolutional self-attention, producing queries and keys with causal convolution so that local context is better incorporated into the attention mechanism. We then propose the LogSparse Transformer with only O(L(log L)^2) memory cost, improving forecasting at fine granularity under a constrained memory budget. Our experiments on both synthetic data and real-world datasets show that it compares favorably to the state-of-the-art.

    06/29/2019 ∙ by Shiyang Li, et al.
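A sketch of one plausible log-sparse attention pattern: each cell attends to itself and to exponentially spaced past positions, so each row of the attention mask has O(log L) entries (illustrative only; the paper's exact pattern may differ):

```python
def logsparse_indices(t):
    """Positions that cell t attends to under a log-sparse pattern:
    itself plus t-1, t-2, t-4, t-8, ... back to the start of the sequence."""
    idx = {t}
    step = 1
    while t - step >= 0:
        idx.add(t - step)
        step *= 2
    return sorted(idx)
```

Each row costs O(log L) instead of O(L), which is where the near-linear memory footprint comes from.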

  • Subsampled Rényi Differential Privacy and Analytical Moments Accountant

    We study the problem of subsampling in differential privacy (DP), a question at the center of many successful differentially private machine learning algorithms. Specifically, we provide a tight upper bound on the Rényi Differential Privacy (RDP) (Mironov, 2017) parameters for algorithms that: (1) subsample the dataset, and then (2) apply a randomized mechanism M to the subsample, in terms of the RDP parameters of M and the subsampling probability parameter. This result generalizes the classic subsampling-based "privacy amplification" property of (ϵ,δ)-differential privacy, which applies to only one fixed pair of (ϵ,δ), to a stronger version that exploits properties of each specific randomized algorithm and satisfies an entire family of (ϵ(δ),δ)-differential privacy guarantees for all δ∈ [0,1]. Our experiments confirm the advantage of using our techniques over keeping track of (ϵ,δ) directly, especially in the setting where we need to compose many rounds of data access.

    07/31/2018 ∙ by Yu-Xiang Wang, et al.
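For context, here is a minimal accountant built from two standard facts: the RDP curve of the (unsubsampled) Gaussian mechanism and the RDP-to-(ε,δ) conversion of Mironov (2017). The paper's contribution, the tight RDP bound for the subsampled mechanism, is not reproduced here:

```python
import math

def gaussian_rdp(alpha, sigma, sensitivity=1.0):
    """RDP curve of the Gaussian mechanism: eps(alpha) = alpha * Delta^2 / (2 sigma^2)."""
    return alpha * sensitivity**2 / (2.0 * sigma**2)

def rdp_to_dp(alpha, rdp_eps, delta):
    """Standard conversion: (alpha, rdp_eps)-RDP implies
    (rdp_eps + log(1/delta) / (alpha - 1), delta)-DP."""
    return rdp_eps + math.log(1.0 / delta) / (alpha - 1.0)

def best_eps(sigma, delta, alphas=range(2, 128)):
    """Optimize over alpha, as an accountant does after composing RDP curves."""
    return min(rdp_to_dp(a, gaussian_rdp(a, sigma), delta) for a in alphas)
```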

  • ProxQuant: Quantized Neural Networks via Proximal Operators

    To make deep neural networks feasible in resource-constrained environments (such as mobile devices), it is beneficial to quantize models by using low-precision weights. One common technique for quantizing neural networks is the straight-through gradient method, which enables back-propagation through the quantization mapping. Despite its empirical success, little is understood about why the straight-through gradient method works. Building upon a novel observation that the straight-through gradient method is in fact identical to the well-known Nesterov's dual-averaging algorithm on a quantization constrained optimization problem, we propose a more principled alternative approach, called ProxQuant, that formulates quantized network training as a regularized learning problem instead and optimizes it via the prox-gradient method. ProxQuant does back-propagation on the underlying full-precision vector and applies an efficient prox-operator in between stochastic gradient steps to encourage quantizedness. For quantizing ResNets and LSTMs, ProxQuant outperforms state-of-the-art results on binary quantization and is on par with state-of-the-art on multi-bit quantization. For binary quantization, our analysis shows both theoretically and experimentally that ProxQuant is more stable than the straight-through gradient method (i.e. BinaryConnect), challenging the indispensability of the straight-through gradient method and providing a powerful alternative.

    10/01/2018 ∙ by Yu Bai, et al.
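A minimal sketch of the prox step for binary quantization, using a W-shaped L1 distance to {−1, +1} as the regularizer (an illustrative choice consistent with the description above, not necessarily the paper's exact regularizer):

```python
import numpy as np

def prox_binary(w, lam):
    """Elementwise prox operator of lam * dist_1(w, {-1, +1}):
    each weight is soft-thresholded toward its nearest binary value,
    moving at most lam per step."""
    q = np.where(w >= 0, 1.0, -1.0)          # nearest point of {-1, +1}
    return w + np.clip(q - w, -lam, lam)     # shrink toward q, capped at lam
```

As lam grows, the prox step collapses weights exactly onto {−1, +1}, recovering hard binarization (BinaryConnect-style) in the limit.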

  • Non-stationary Stochastic Optimization with Local Spatial and Temporal Changes

    We consider a non-stationary sequential stochastic optimization problem, in which the underlying cost functions change over time under a variation budget constraint. We propose an L_p,q-variation functional to quantify the change, which captures local spatial and temporal variations of the sequence of functions. Under the L_p,q-variation functional constraint, we derive both upper and matching lower regret bounds for smooth and strongly convex function sequences, which generalize previous results in (Besbes et al., 2015). Our results reveal some surprising phenomena under this general variation functional, such as the curse of dimensionality of the function domain. The key technical novelties in our analysis include an affinity lemma that characterizes the distance of the minimizers of two convex functions with bounded L_p difference, and a cubic spline based construction that attains matching lower bounds.

    08/09/2017 ∙ by Xi Chen, et al.
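One plausible discretization of such a variation functional, for function values sampled on a common grid (an assumption for illustration; the paper defines the functional over the continuous domain):

```python
import numpy as np

def lpq_variation(fs, p=2, q=1):
    """( sum_t ||f_{t+1} - f_t||_p^q )^(1/q) for a sequence of functions
    fs, given as an array of shape (T, grid_size): the L_p norm measures
    spatial change, the outer q-sum aggregates change over time."""
    diffs = np.diff(fs, axis=0)                    # f_{t+1} - f_t
    norms = np.linalg.norm(diffs, ord=p, axis=1)   # spatial L_p norms
    return np.sum(norms ** q) ** (1.0 / q)
```

A stationary sequence has zero variation; the budget constraint above caps this quantity over the horizon.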

  • Per-instance Differential Privacy and the Adaptivity of Posterior Sampling in Linear and Ridge regression

    Differential privacy (DP) has, ever since its advent, been a controversial object. On the one hand, it provides strong provable protection of individuals in a data set; on the other hand, it has been heavily criticized as impractical, partially due to its complete independence from the actual data set it tries to protect. In this paper, we address this issue with a new and more fine-grained notion of differential privacy: per-instance differential privacy (pDP), which captures the privacy of a specific individual with respect to a fixed data set. We show that this is a strict generalization of standard DP that inherits all its desirable properties, e.g., composition, invariance to side information, and closure under post-processing, except that they all hold for every instance separately. When the data are drawn from a distribution, we show that per-instance DP implies generalization. Moreover, we provide explicit calculations of the per-instance DP of output perturbation on a class of smooth learning problems. The result reveals an interesting and intuitive fact: an individual has stronger privacy if he/she has a small "leverage score" with respect to the data set and if he/she can be predicted more accurately using the leave-one-out data set. Using the developed techniques, we provide a novel analysis of the One-Posterior-Sample (OPS) estimator and show that when the data set is well-conditioned it provides (ϵ,δ)-pDP for any target individual and matches the exact lower bound up to a 1+Õ(n^-1ϵ^-2) multiplicative factor. We also propose AdaOPS, which uses adaptive regularization to achieve the same results with (ϵ,δ)-DP. Simulations show several orders of magnitude more favorable privacy-utility trade-offs when we consider the privacy of only the users in the data set.

    07/24/2017 ∙ by Yu-Xiang Wang, et al.
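The "leverage score" mentioned above is the standard statistical leverage; a minimal computation:

```python
import numpy as np

def leverage_scores(X):
    """Leverage of each row of the design matrix X:
    h_i = x_i^T (X^T X)^{-1} x_i, i.e. the diagonal of the hat matrix.
    Per the pDP analysis above, individuals with small leverage (and small
    leave-one-out prediction error) enjoy stronger per-instance privacy."""
    G_inv = np.linalg.inv(X.T @ X)
    return np.einsum('ij,jk,ik->i', X, G_inv, X)
```

For a full-column-rank X with d columns, the scores lie in [0, 1] and sum to d.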

  • Understanding the 2016 US Presidential Election using ecological inference and distribution regression with census microdata

    We combine fine-grained spatially referenced census data with the vote outcomes of the 2016 US presidential election. Using this dataset, we perform ecological inference using distribution regression (Flaxman et al., KDD 2015) with a multinomial-logit regression, modeling the vote outcome (Trump, Clinton, Other / Didn't vote) as a function of demographic and socioeconomic features. Ecological inference allows us to estimate "exit poll"-style results (e.g., Trump's support among white women), but for entirely novel categories. We also perform exploratory data analysis to understand which census variables are predictive of voting for Trump, voting for Clinton, or not voting for either. All of our methods are implemented in Python and R and are available online for replication.

    11/11/2016 ∙ by Seth Flaxman, et al.
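A stripped-down sketch of the distribution-regression pipeline: mean-embed each region's individual-level features, then apply a multinomial logit over outcomes. The weight matrix W here is hypothetical; in the paper it is fitted to observed region-level vote outcomes:

```python
import numpy as np

def mean_embed(region_features):
    """Step 1: summarize each region's individuals by the mean of their
    feature vectors (a simple mean embedding of the region's distribution)."""
    return np.array([f.mean(axis=0) for f in region_features])

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def predict_vote_shares(region_features, W):
    """Step 2: multinomial logit on the embeddings gives per-region
    probabilities over the outcomes (Trump, Clinton, Other / Didn't vote).
    W is a hypothetical (d features x 3 outcomes) weight matrix."""
    return softmax(mean_embed(region_features) @ W)
```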