# Online and Batch Learning Algorithms for Data with Missing Features

We introduce new online and batch algorithms that are robust to data with missing features, a situation that arises in many practical applications. In the online setup, we allow for the comparison hypothesis to change as a function of the subset of features that is observed on any given round, extending the standard setting where the comparison hypothesis is fixed throughout. In the batch setup, we present a convex relaxation of a non-convex problem to jointly estimate an imputation function, used to fill in the values of missing features, along with the classification hypothesis. We prove regret bounds in the online setting and Rademacher complexity bounds for the batch i.i.d. setting. The algorithms are tested on several UCI datasets, showing superior performance over baselines.

## 1 Introduction

Standard learning algorithms assume that each training example is fully observed and doesn’t suffer any corruption. However, in many real-life scenarios, training and test data often undergo some form of corruption. We consider settings where all the features might not be observed in every example, allowing for both adversarial and stochastic feature deletion models. Such situations arise, for example, in medical diagnosis—predictions are often desired using only a partial array of medical measurements due to time or cost constraints. Survey data are often incomplete due to partial non-response of participants. Vision tasks routinely need to deal with partially corrupted or occluded images. Data collected through multiple sensors, such as multiple cameras, is often subject to the sudden failure of a subset of the sensors.

In this work, we design and analyze learning algorithms that address these examples of learning with missing features. The first setting we consider is online learning where both examples and missing features are chosen in an arbitrary, possibly adversarial, fashion. We define a novel notion of regret suitable to the setting and provide an algorithm which has a provably bounded regret on the order of $O(\sqrt{T})$, where $T$ is the number of examples. The second scenario is batch learning, where examples and missing features are drawn according to a fixed and unknown distribution. We design a learning algorithm which is guaranteed to globally optimize an intuitive objective function and which also exhibits a generalization error on the order of $O(\sqrt{d/T})$, where $d$ is the data dimension.

Both algorithms are also explored empirically across several publicly available datasets subject to various artificial and natural types of feature corruption. We find very encouraging results, indicating the efficacy of the suggested algorithms and their superior performance over baseline methods.

Learning with missing or corrupted features has a long history in statistics [14, 10], and has received recent attention in machine learning [9, 15, 5, 7]. Imputation methods (see [14, 15, 10]) fill in missing values, generally independent of any learning algorithm, after which standard algorithms can be applied to the data. Better performance might be expected, though, by learning the imputation and prediction functions simultaneously. Previous works [15] address this issue using EM, but can get stuck in local optima and do not have strong theoretical guarantees. Our work also differs from settings where features are missing only at test time [9, 11], settings that give access to noisy versions of all the features [6], and settings where observed features are picked by the algorithm [5].

Section 2 introduces both the general online and batch settings. Sections 3 and 4 detail the algorithms and theoretical results within the online and batch settings, respectively. Empirical results are presented in Section 5.

## 2 The Setting

In our setting it will be useful to denote a training instance $x_t \in \mathbb{R}^d$ and prediction $y_t \in \mathcal{Y}$, as well as a corruption vector $z_t \in \{0,1\}^d$, where

$$[z_t]_i = \begin{cases} 0 & \text{if feature } i \text{ is not observed},\\ 1 & \text{if feature } i \text{ is observed}. \end{cases}$$

We will discuss as specific examples both classification problems where $\mathcal{Y} = \{-1, +1\}$ and regression problems where $\mathcal{Y} \subseteq \mathbb{R}$. The learning algorithm is given the corruption vector $z_t$ as well as the corrupted instance,

$$x'_t = x_t \circ z_t,$$

where $\circ$ denotes the component-wise product between two vectors. Note that the training algorithm is never given access to $x_t$; however, it is given $z_t$, and so has knowledge of exactly which coordinates have been corrupted. The following subsections explain the online and batch settings respectively, as well as the type of hypotheses that are considered in each.
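The corruption model above can be illustrated with a minimal NumPy sketch. The array values and the variable names (`x`, `z`, `x_prime`, `observed`) are chosen here for illustration only and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
x = rng.normal(size=d)             # full instance x_t (never shown to the learner)
z = np.array([1, 0, 1, 1, 0])      # corruption vector z_t: 1 = observed, 0 = missing
x_prime = x * z                    # corrupted instance x'_t = x_t ∘ z_t

# The learner receives only (x_prime, z); z tells it which zeros are "missing".
observed = np.flatnonzero(z)       # indices of the observed features
```

Note that a zero in `x_prime` is ambiguous on its own; only together with `z` can the learner tell a missing feature from a feature whose true value is zero.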

### 2.1 Online learning with missing features

In this setting, at each time-step $t$ the learning algorithm is presented with an arbitrarily (possibly adversarially) chosen instance $(x'_t, z_t)$ and is expected to predict $y_t$. After prediction, the label is revealed to the learner, which can then update its hypothesis.

A natural question to ask is what happens if we simply ignore the distinction between $x'_t$ and $x_t$ and just run an online learning algorithm on this corrupted data. Indeed, doing so would give a small bound on regret:

$$R(T,\ell) = \sum_{t=1}^{T} \ell(\langle w_t, x'_t\rangle, y_t) - \inf_{w \in \mathcal{W}} \sum_{t=1}^{T} \ell(\langle w, x'_t\rangle, y_t), \tag{1}$$

with respect to a convex loss function $\ell$ and for any convex compact subset $\mathcal{W} \subseteq \mathbb{R}^d$. However, any fixed weight vector $w$ in the second term might have a very large loss, making the regret guarantee useless: both the learner and the comparator have a large loss, making the difference small. For instance, assume one feature perfectly predicts the label, while another one only predicts the label with 80% accuracy, and $\ell$ is the quadratic loss. It is easy to see that there is no fixed $w$ that will perform well on both examples where the first feature is observed and examples where the first feature is missing but the second one is observed.

To address the above concerns, we consider using a linear corruption-dependent hypothesis which is permitted to change as a function of the observed corruption $z_t$. Specifically, given the corrupted instance and corruption vector, the predictor uses a function $w(\cdot): \{0,1\}^d \to \mathbb{R}^d$ to choose a weight vector, and makes the prediction $\hat{y}_t = \langle w(z_t), x'_t\rangle$. In order to provide theoretical guarantees, we will bound the following notion of regret,

$$R_z(T,\ell) = \sum_{t=1}^{T} \ell(\langle w_t, x'_t\rangle, y_t) - \inf_{w \in \mathcal{W}} \sum_{t=1}^{T} \ell(\langle w(z_t), x'_t\rangle, y_t), \tag{2}$$

where it is implicit that $w_t$ also depends on $z_t$ and $\mathcal{W}$ now consists of corruption-dependent hypotheses $w(\cdot)$. Similar definitions of regret have been looked at in the setting of learning with side information [8, 12], but our special case admits stronger results in terms of both upper and lower bounds. In the most general case, we may consider $\mathcal{W}$ as the class of all functions which map $\{0,1\}^d \to \mathbb{R}^d$; however, we show this can lead to an intractable learning problem. This motivates the study of interesting subsets of this most general function class, which is the main focus of Section 3.

### 2.2 Batch learning with missing features

In the setup of batch learning with i.i.d. data, examples $(x_t, z_t, y_t)$ are drawn according to a fixed but unknown distribution and the goal is to choose a hypothesis $h$ that minimizes the expected error with respect to an appropriate loss function $\ell$: $\mathbb{E}[\ell(h(x', z), y)]$.

The hypotheses we consider in this scenario are inspired by the imputation-based methods prevalent in the statistics literature for addressing the problem of missing features [14]. An imputation mapping is a function used to fill in unobserved features using the observed features, after which the completed examples can be used for prediction. In particular, if we consider an imputation function $\phi$, which is meant to fill missing feature values, and a linear predictor $w$, we can parameterize a hypothesis with these two functions: $h_{\phi,w}(x'_t, z_t) = \langle w, \phi(x'_t, z_t)\rangle$.

It is clear that the multiplicative interaction between $w$ and $\phi$ will make most natural formulations non-convex, and we elaborate more on this in Section 4. In the i.i.d. setting, the natural quantity of interest is the generalization error of our learned hypothesis. We provide a Rademacher complexity bound on the class of $(w, \phi)$ pairs we use, thereby showing that any hypothesis with a small empirical error will also have a small expected loss. The specific class of hypotheses and details of the bound are presented in Section 4. That section also explains why an imputation-based hypothesis class is not analyzed in the more general adversarial setting.

## 3 Online Corruption-Based Algorithm

In this section, we consider the class of corruption-dependent hypotheses defined in Section 2.1. Recall the definition of regret (2), which we wish to control in this framework, and of the comparator class of functions $\mathcal{W} \subseteq \{0,1\}^d \to \mathbb{R}^d$. It is clear that the function class $\mathcal{W}$ is much richer than the comparator class in the corruption-free scenario, where the best linear predictor is fixed for all rounds. It is natural to ask if it is even possible to prove a non-trivial regret bound over this richer comparator class $\mathcal{W}$. In fact, the first result of our paper provides a lower bound on the minimax regret when the comparator is allowed to pick arbitrary mappings, i.e. the set $\mathcal{W}$ contains all $\{0,1\}^d \to \mathbb{R}^d$ mappings. The result is stated in terms of the minimax regret $R^*(T,\ell)$ under the loss function $\ell$ under the usual (corruption-free) definition (1):

$$R^*(T,\ell) = \inf_{w_1 \in \mathcal{W}} \sup_{(x_1,z_1,y_1)} \cdots \inf_{w_T \in \mathcal{W}} \sup_{(x_T,z_T,y_T)} R(T,\ell)$$

###### Proposition 1

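If $\mathcal{W}$ contains all mappings $\{0,1\}^d \to \mathbb{R}^d$, then the minimax value of the corruption-dependent regret for any loss function $\ell$ is lower bounded as

$$\inf_{w_1 \in \mathcal{W}} \sup_{(x_1,z_1,y_1)} \cdots \inf_{w_T \in \mathcal{W}} \sup_{(x_T,z_T,y_T)} R_z(T,\ell) = \Omega\left(2^{d/2}\, R^*\left(\frac{T}{2^{d/2}}, \ell\right)\right).$$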

This proposition (the proof of which appears in the appendix [17]) shows that the minimax regret is lower bounded by a term that is exponential in the dimensionality of the learning problem. For most non-degenerate convex and Lipschitz losses, $R^*(T,\ell) = \Omega(\sqrt{T})$ without further assumptions (see e.g. [1]), which yields an $\Omega(2^{d/4}\sqrt{T})$ lower bound. The bound can be further strengthened to $\Omega(2^{d/2}\sqrt{T})$ for linear losses, which is unimprovable since it is achieved by solving the classification problem corresponding to each pattern independently.
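As a quick check of the arithmetic behind these rates, suppose $R^*(T,\ell) \ge c\sqrt{T}$ for some constant $c > 0$ (the constant $c$ and the per-pattern round counts $T_z$ below are introduced here only for illustration). Substituting into the bound of Proposition 1 gives

$$2^{d/2}\, R^*\left(\frac{T}{2^{d/2}}, \ell\right) \;\ge\; 2^{d/2}\, c \sqrt{\frac{T}{2^{d/2}}} \;=\; c\, 2^{d/4} \sqrt{T}.$$

For comparison, running one independent learner per corruption pattern $z \in \{0,1\}^d$, each with $O(\sqrt{T_z})$ regret on its own $T_z$ rounds, gives by the Cauchy-Schwarz inequality

$$\sum_{z} \sqrt{T_z} \;\le\; \sqrt{2^{d} \sum_{z} T_z} \;=\; 2^{d/2} \sqrt{T},$$

so any lower bound of order $2^{d/2}\sqrt{T}$ is matched by this naive strategy.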

Thus, it will be difficult to achieve a low regret against arbitrary maps from $\{0,1\}^d$ to $\mathbb{R}^d$. In the following section we consider a restricted function class and show that a mirror-descent algorithm can achieve regret polynomial in $d$ and sub-linear in $T$, implying that the average regret is vanishing.

### 3.1 Linear Corruption-Dependent Hypotheses

Here we analyze a corruption-dependent hypothesis class that is parametrized by a matrix $A \in \mathbb{R}^{d \times k}$, where $k$ may be a function of $d$. In the simplest case of $k = d$, the parametrization looks for weights that depend linearly on the corruption vector $z_t$. Defining $w_A(z_t) = A z_t$ achieves this, and intuitively this allows us to capture how the presence or absence of one feature affects the weight of another feature. This will be clarified further in the examples.

In general, the matrix $A$ will be $d \times k$, where $k$ will be determined by a function $\psi(z_t) \in \{0,1\}^k$ that maps $z_t$ to a possibly higher-dimensional space. Given a fixed $\psi$, the explicit parameterization in terms of $A$ is

$$w_{A,\psi}(z_t) = A\,\psi(z_t). \tag{3}$$

In what follows, we drop the subscript from $w_{A,\psi}$ in order to simplify notation. Essentially this allows us to introduce non-linearities as a function of the corruption vector, but the non-linear transform is known and fixed throughout the learning process. Before analyzing this setting, we give a few examples and intuition as to why such a parametrization is useful. In each example, we will show how there exists a choice of a matrix $A$ that captures the specific problem’s assumptions. This implies that the fixed comparator can use this choice in hindsight, and by having a low regret, our algorithm would implicitly learn a hypothesis close to this reasonable choice of $A$.
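The parametrization in (3) can be sketched in a few lines of NumPy. This is only an illustration under assumed names: `predict` and `identity_psi` are hypothetical helpers, and the matrix `A` below is an arbitrary example rather than a learned parameter.

```python
import numpy as np

def predict(A, psi, x_prime, z):
    """Prediction <A psi(z), x'> of the corruption-dependent linear hypothesis."""
    w = A @ psi(z)              # weight vector chosen as a function of the corruption
    return float(w @ x_prime)

def identity_psi(z):
    """Simplest feature map: psi(z) = z, so that k = d."""
    return z.astype(float)

d = 3
A = np.eye(d)                   # with psi(z) = z this gives w(z) = z
x = np.array([2.0, -1.0, 0.5])
z = np.array([1, 0, 1])
y_hat = predict(A, identity_psi, x * z, z)
```

Swapping in a different `psi` changes the number of columns of `A` but leaves the prediction rule unchanged, which is what keeps the learning problem linear in `A`.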

#### 3.1.1 Corruption-free special case

We start by noting that in the case of no corruption (i.e. $z_t = \mathbf{1}$ for all $t$) a standard linear hypothesis model can be cast within the matrix-based framework by defining $\psi(z_t) = 1$ and learning $A \in \mathbb{R}^{d \times 1}$.

#### 3.1.2 Ranking-based parameterization

One natural method for classification is to order the features by their predictive power, and to weight features proportionally to their ranking (in terms of absolute value; that is, the sign of weight depends on whether the correlation with the label is positive or negative). In the corrupted features setting, this naturally corresponds to taking the available features at any round and putting more weight on the most predictive observed features. This is particularly important while using margin-based losses such as the hinge loss, where we want the prediction to have the right sign and be large enough in magnitude.

Our parametrization allows such a strategy when using the simple function $\psi(z_t) = z_t$. Without loss of generality, assume that the features are arranged in decreasing order of discriminative power (we can always rearrange rows and columns of $A$ if they are not). We also assume positive correlations of all features with the label; a more elaborate construction works when they are not. In this case, consider the parameter matrix and the induced classification weights on the observed features

$$[A]_{i,j} = \begin{cases} 1, & j = i,\\ -\frac{1}{d}, & j < i,\\ 0, & j > i, \end{cases} \qquad [w(z_t)]_i = [z_t]_i \Big(1 - \frac{1}{d}\sum_{j < i} [z_t]_j\Big).$$

Thus, for all $i < j$ such that $[z_t]_i = [z_t]_j = 1$ we have $[w(z_t)]_i \geq [w(z_t)]_j$. The choice of 1 for diagonals and $-1/d$ for off-diagonals is arbitrary and other values might also be picked based on the data sequence $(x_t, z_t, y_t)$. In general, features are weighted monotonically with respect to their discriminative power with signs based on correlations with the label.
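The ranking construction can be checked numerically. The sketch below assumes the matrix has 1 on the diagonal, $-1/d$ strictly below it and 0 above it; `ranking_matrix` is a hypothetical helper name used only here.

```python
import numpy as np

def ranking_matrix(d):
    """A with 1 on the diagonal, -1/d below it (j < i), 0 above it (j > i)."""
    return np.eye(d) + np.tril(-np.ones((d, d)) / d, k=-1)

d = 4
A = ranking_matrix(d)
z = np.array([1.0, 0.0, 1.0, 1.0])   # feature with index 1 (0-based) is missing
w = A @ z                            # induced weights w(z) = A z
w_observed = w[z == 1]               # weights that actually multiply observed features
```

The entry of `w` at the missing coordinate is irrelevant for prediction because the corresponding entry of the corrupted instance is zero; among the observed features, the weights decrease with the feature index.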

#### 3.1.3 Feature group based parameterization

Another class of hypotheses that we can define within this framework are those restricted to consider up to $r$-wise interactions between features for some constant $r$. In this case, we index the $k$ unique subsets of features of size up to $r$. Then define $[\psi(z_t)]_j = 1$ if the corresponding subset $j$ is uncorrupted by $z_t$ and equal to $0$ otherwise. An entry $[A]_{i,j}$ now specifies the importance of feature $i$, assuming that at least the subset $j$ is present. Such a model would, for example, have the ability to capture the scenario of a feature that is only discriminative in the presence of some other features. For example, we can generalize the ranking example from above to impose a soft ranking on groups of features.
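A possible way to build such a subset-indicator map is sketched below. The function name `subset_features`, the parameter name `r` for the maximum subset size, and the ordering of subsets are all assumptions made for this illustration.

```python
from itertools import combinations
import numpy as np

def subset_features(z, r):
    """psi(z): one indicator per feature subset of size 1..r, equal to 1
    if and only if every feature in that subset is observed."""
    d = len(z)
    subsets = [s for size in range(1, r + 1) for s in combinations(range(d), size)]
    psi = np.array([float(all(z[i] == 1 for i in s)) for s in subsets])
    return psi, subsets

z = np.array([1, 0, 1])
psi, subsets = subset_features(z, r=2)
```

The length of `psi` grows roughly like $d^r$, so in practice small values of `r` keep the matrix $A$ manageable.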

#### 3.1.4 Corruption due to failed sensors

A common scenario for missing features arises in applications involving an array of measurements, for example, from a sensor network, wireless motes, array of cameras or CCDs, where each sensor is bound to fail occasionally. The typical strategy for dealing with such situations involves the use of redundancy. For instance, if a sensor fails, then some kind of an averaged measurement from the neighboring sensors might provide a reasonable surrogate for the missing value.

It is possible to design a choice of $A$ matrix for the comparator that only uses the local measurement when it is present, but uses an averaged approximation based on some fixed averaging distribution on neighboring features when the local measurement is missing. For each feature $j$, we consider a probability distribution $p_j$, with entries $p_{ji}$, which specifies the averaging weights to be used when approximating feature $j$ using neighboring observations. Let $w^*$ be the weight vector that the comparator would like to use if all the features were present. Then, with $\psi(z_t) = z_t$ and for $j \neq i$ we define

$$[A]_{i,i} = w^*_i + \sum_{j \neq i} w^*_j\, p_{ji}, \qquad [A]_{i,j} = -w^*_j\, p_{ji}. \tag{4}$$

Thus, say only feature $j$ is missing, we still have $\langle A z_t, x'_t\rangle = \sum_{i \neq j} w^*_i [x_t]_i + w^*_j \sum_{i \neq j} p_{ji} [x_t]_i$, where by assumption $\sum_{i \neq j} p_{ji} [x_t]_i \approx [x_t]_j$.

Of course, the averaging in such applications is typically local, and we expect each sensor to put large weights only on neighboring sensors. This can be specified via a neighborhood graph, where nodes $i$ and $j$ have an edge if $i$ is used to predict $j$ when feature $j$ is not observed and vice versa. From the construction (4) it is clear that the only off-diagonal entries that are non-zero would correspond to the edges in the neighborhood graph. Thus we can even add this information to our algorithm and constrain several off-diagonal elements to be zero, thereby restricting the complexity of the problem.
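Construction (4) can be verified on a toy example. In the sketch below, `sensor_matrix` is a hypothetical helper, `P[j, i]` stores the averaging weight $p_{ji}$ with a zero diagonal, and the numbers are arbitrary.

```python
import numpy as np

def sensor_matrix(w_star, P):
    """Build A from construction (4). P[j, i] = p_ji is the weight feature i
    receives when imputing feature j; rows sum to 1 and P[j, j] = 0."""
    diag = w_star + P.T @ w_star          # [A]_ii = w*_i + sum_{j != i} w*_j p_ji
    off = -(w_star[:, None] * P).T        # [A]_ij = -w*_j p_ji (zero on the diagonal)
    return np.diag(diag) + off

w_star = np.array([1.0, 2.0, -1.0])
P = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
A = sensor_matrix(w_star, P)

x = np.array([0.3, -0.7, 1.2])
z = np.array([1.0, 0.0, 1.0])             # only the feature with index 1 is missing
pred = (A @ z) @ (x * z)

# Same prediction written as "impute the missing feature by averaging, then apply w*".
x_imputed = x.copy()
x_imputed[1] = P[1] @ (x * z)
```

With all features observed the rows of `A` sum to `w_star`, so the comparator falls back to its preferred weights when nothing is missing.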

### 3.2 Matrix-Based Algorithm and Regret

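We use a standard mirror-descent style algorithm [16, 3] in the matrix-based parametrization described above. It is characterized by a strongly convex regularizer $\mathcal{R}: \mathcal{A} \to \mathbb{R}$, that is

$$\mathcal{R}(A) \geq \mathcal{R}(B) + \langle \nabla\mathcal{R}(B), A - B\rangle_F + \frac{1}{2}\|A - B\|^2 \quad \forall A, B \in \mathcal{A},$$

for some norm $\|\cdot\|$ and where $\langle A, B\rangle_F = \mathrm{Tr}(A^\top B)$ is the trace inner product. An example is the squared Frobenius norm $\mathcal{R}(A) = \frac{1}{2}\|A\|_F^2$. For any such function, we can define the associated Bregman divergence

$$D_{\mathcal{R}}(A, B) = \mathcal{R}(A) - \mathcal{R}(B) - \langle \nabla\mathcal{R}(B), A - B\rangle_F.$$

We assume $\mathcal{A}$ is a convex subset of $\mathbb{R}^{d \times k}$, which could encode constraints such as some off-diagonal entries being zero in the setup of Section 3.1.4. To simplify presentation in what follows, we will use the shorthand $\ell_t(A) = \ell(\langle A z_t, x'_t\rangle, y_t)$ (with $z_t$ replaced by $\psi(z_t)$ for a general feature map). The algorithm initializes with any matrix in $\mathcal{A}$ and updates

$$A_{t+1} = \arg\min_{A \in \mathcal{A}} \big\{ \eta_t \langle \nabla\ell_t(A_t), A\rangle_F + D_{\mathcal{R}}(A, A_t) \big\}. \tag{5}$$

If $\mathcal{R}(A) = \frac{1}{2}\|A\|_F^2$ and $\mathcal{A} = \mathbb{R}^{d \times k}$, the update simplifies to gradient descent $A_{t+1} = A_t - \eta_t \nabla\ell_t(A_t)$.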
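For the unconstrained Frobenius-norm case, one step of the update can be sketched as follows. The squared loss, the choice $\psi(z) = z$, the step size, and the helper name `ogd_step` are assumptions for this illustration; the paper's experiments may use other losses and constraints.

```python
import numpy as np

def ogd_step(A, x_prime, z, y, eta):
    """One unconstrained step A <- A - eta * grad for the squared loss
    l_t(A) = (<A z, x'> - y)^2, using psi(z) = z."""
    pred = float((A @ z) @ x_prime)
    grad = 2.0 * (pred - y) * np.outer(x_prime, z)   # d/dA of (x'^T A z - y)^2
    return A - eta * grad, pred

A0 = np.zeros((2, 2))
x_prime = np.array([1.0, 0.0])     # second feature is missing
z = np.array([1.0, 0.0])
A1, p0 = ogd_step(A0, x_prime, z, y=1.0, eta=0.1)
```

Because the gradient is an outer product of the corrupted instance and the corruption vector, only entries $[A]_{i,j}$ with both features $i$ and $j$ observed are modified in a given round.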

Our main result of this section is a guarantee on the regret incurred by Algorithm (5). The proof follows from standard arguments (see e.g. [16, 4]). Below, the dual norm is defined as $\|V\|_* = \sup_{U : \|U\| \leq 1} \langle U, V\rangle_F$.

###### Theorem 1

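Let $\mathcal{R}$ be strongly convex with respect to a norm $\|\cdot\|$ and $\|\nabla\ell_t(A)\|_* \leq G$ for all $t$. Then Algorithm (5) with a suitably chosen learning rate $\eta_t$ exhibits the following regret upper bound compared to any $A$ with $\|A\| \leq R$,

$$\sum_{t=1}^{T} \ell(\langle A_t z_t, x'_t\rangle, y_t) - \inf_{A \in \mathcal{A}} \sum_{t=1}^{T} \ell(\langle A z_t, x'_t\rangle, y_t) \leq 3RG\sqrt{T}.$$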

## 4 Batch Imputation Based Algorithm

Recalling the setup of Section 2.2, in this section we look at imputation mappings of the form

$$\phi_M(x', z) = x' + \mathrm{diag}(1 - z)\, M^\top x'. \tag{6}$$

Thus we retain all the observed entries in the vector $x'$, while the missing features are predicted using a linear combination of the observed features, where the $i$-th column of $M$ encodes the averaging weights for the $i$-th feature. Such a linear prediction framework for features is natural. For instance, when the data vectors are Gaussian, the conditional expectation of any feature given the other features is a linear function. The predictions are now made using the dot product

$$\langle w, \phi_M(x', z)\rangle = \langle w, x'\rangle + \langle w, \mathrm{diag}(1 - z)\, M^\top x'\rangle,$$

where we would like to estimate $w$ and $M$ based on the data samples. From a quick inspection of the resulting learning problem, it becomes clear that optimizing over such a hypothesis class leads to a non-convex problem. The convexity of the loss plays a critical role in the regret framework of online learning, which is why we restrict ourselves to a batch i.i.d. setting here.
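The imputation map (6) amounts to a masked matrix-vector product. The sketch below uses a hand-picked matrix `M` purely for illustration; in the paper $M$ is estimated from data, and `impute` is a hypothetical helper name.

```python
import numpy as np

def impute(x_prime, z, M):
    """phi_M(x', z) = x' + diag(1 - z) M^T x'  (equation (6))."""
    return x_prime + (1.0 - z) * (M.T @ x_prime)

d = 3
M = np.full((d, d), 0.5)       # hypothetical averaging weights, not learned
np.fill_diagonal(M, 0.0)
x = np.array([1.0, 2.0, 3.0])
z = np.array([1.0, 0.0, 1.0])  # the middle feature is missing
phi = impute(x * z, z, M)      # observed entries kept, missing entry filled in
```

Here the missing middle entry is replaced by the average of its two observed neighbors, while the observed entries pass through unchanged.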

In the sequel we will provide a convex relaxation to the learning problem resulting from the parametrization (6). While we can make this relaxation for natural loss functions in both classification and regression scenarios, we restrict ourselves to a linear regression setting here as the presentation for that example is simpler due to the existence of a closed form solution for the ridge regression problem.

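In what follows, we consider only the corrupted data and thus simply denote corrupted examples as $x_i$. Let $X$ denote the matrix with $i$-th row equal to $x_i$ and similarly define $Z$ as the matrix with $i$-th row equal to $z_i$. It will also be useful to define $\bar{z}_i = \mathbf{1} - z_i$ and $\bar{Z} = \mathbf{1}\mathbf{1}^\top - Z$, and finally let $\bar{Z}_i = \mathrm{diag}(\bar{z}_i)$.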

### 4.1 Imputed Ridge Regression (IRR)

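In this section we will consider a modified version of the ridge regression (RR) algorithm, robust to missing features. The overall optimization problem we are interested in is as follows,

$$\min_{M : \|M\|_F \leq \gamma}\ \min_{w}\ \lambda \|w\|^2 + \frac{1}{T}\sum_{i=1}^{T} \big(y_i - \langle w,\, x_i + \bar{Z}_i M^\top x_i\rangle\big)^2, \tag{7}$$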

where the hypothesis $w$ and imputation matrix $M$ are simultaneously optimized. In order to bound the size of the hypothesis set, we have introduced the constraint $\|M\|_F \leq \gamma$ that bounds the Frobenius norm of the imputation matrix. The global optimum of the problem as presented in (7) cannot be easily found as it is not jointly convex in both $w$ and $M$. We next present a convex relaxation of the formulation (7). The key idea is to take a dual over $w$ but not $M$, so that we have a saddle-point problem in the dual vector $\alpha$ and $M$. The resulting saddle-point problem, while being concave in $\alpha$, is still not convex in $M$. At this step we introduce a new tensor $N \in \mathbb{R}^{d \times d \times d}$, where $N_k = M_k M_k^\top$ and $M_k$ denotes the $k$-th column of $M$. Finally we drop the non-convex constraint relating $M$ and $N$, replacing it with a matrix positive semidefiniteness constraint.

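Before we can describe the convex relaxation, we need one more piece of notation. Given a matrix $M$ and a tensor $N$, we define the matrix $K_{MN} \in \mathbb{R}^{T \times T}$,

$$[K_{MN}]_{i,j} = x_i^\top x_j + x_i^\top M \bar{Z}_i x_j + x_i^\top \bar{Z}_j M^\top x_j + \sum_{k=1}^{d} [\bar{z}_i]_k [\bar{z}_j]_k\, x_i^\top N_k x_j. \tag{8}$$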
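The matrix in (8) can be assembled with a few matrix products. The following sketch is an illustration under assumed conventions: `k_mn` is a hypothetical helper, `N` is stored as an array of shape `(d, d, d)` with `N[k]` the $k$-th slice, and the random data are placeholders.

```python
import numpy as np

def k_mn(X, Z, M, N):
    """Matrix K_MN from (8). X, Z are T x d (corrupted rows x_i, masks z_i),
    M is d x d, and N has shape (d, d, d) with N[k] the k-th d x d slice."""
    Zb = 1.0 - Z                                   # rows are z-bar_i
    K = X @ X.T
    cross = ((X @ M) * Zb) @ X.T                   # [i, j] = x_i^T M Zbar_i x_j
    K = K + cross + cross.T                        # cross.T[i, j] = x_i^T Zbar_j M^T x_j
    for k in range(N.shape[0]):
        Q = X @ N[k] @ X.T                         # [i, j] = x_i^T N_k x_j
        K = K + np.outer(Zb[:, k], Zb[:, k]) * Q
    return K

rng = np.random.default_rng(0)
T, d = 5, 3
Z = (rng.random((T, d)) < 0.7).astype(float)
X = rng.normal(size=(T, d)) * Z                    # corrupted examples
M = rng.normal(size=(d, d))
N = np.stack([np.outer(M[:, k], M[:, k]) for k in range(d)])
K = k_mn(X, Z, M, N)
Phi = X + Zb_M if False else X + (1.0 - Z) * (X @ M)   # rows: x_i + Zbar_i M^T x_i
```

When each slice `N[k]` equals the outer product of the $k$-th column of `M` with itself, `K` coincides with the Gram matrix `Phi @ Phi.T` of the imputed examples; the relaxation then lets `N` vary freely subject to the stated constraints.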

The following proposition gives the convex relaxation of the problem (7) that we refer to as Imputed Ridge Regression (IRR) and which includes a strictly larger hypothesis set than the $(w, M)$ pairs with which we began.

###### Proposition 2

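The following semi-definite programming optimization problem provides a convex relaxation to the non-convex problem (7):

$$\min_{\substack{t,\ M : \|M\|_F^2 \leq \gamma^2 \\ N : \sum_k \|N_k\|_F^2 \leq \gamma^4}} t \quad \text{s.t.} \quad \begin{bmatrix} K_{MN} + \lambda T I & y \\ y^\top & t \end{bmatrix} \succeq 0, \quad K_{MN} \succeq 0. \tag{9}$$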

The proof is deferred to the appendix for lack of space. The main idea is to take the quadratic form that arises in the dual formulation of (7) with the Gram matrix of the imputed examples, $[K_M]_{i,j} = \langle \phi_M(x_i, z_i), \phi_M(x_j, z_j)\rangle$, and relax it to the matrix $K_{MN}$ of (8). The constraint involving positive semidefiniteness of $K_{MN}$ is needed to ensure the convexity of the relaxed problem. The norm constraint on $N$ is a consequence of the norm constraint on $M$.

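One tricky issue with relaxations is using the relaxed solution in order to find a good solution to the original problem. In our case, this would correspond to finding a good $(w, M)$ pair for the primal problem (7). We bypass this step, and instead directly define the prediction on any point $(x_0, z_0)$ as:

$$h(x_0, z_0) = \sum_{i=1}^{T} \alpha_i \Big( x_i^\top x_0 + x_i^\top M \bar{Z}_i x_0 + x_i^\top \bar{Z}_0 M^\top x_0 + \sum_{k=1}^{d} [\bar{z}_i]_k [\bar{z}_0]_k\, x_i^\top N_k x_0 \Big). \tag{10}$$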

Here, $\alpha, M, N$ are solutions to the saddle-point problem

$$\min_{\substack{M : \|M\|_F \leq \gamma \\ N : \sum_k \|N_k\|_F^2 \leq \gamma^4}}\ \max_{\alpha}\ 2\alpha^\top y - \alpha^\top (K_{MN} + \lambda T I)\,\alpha. \tag{11}$$

We start by noting that the above optimization problem is equivalent to the one in Proposition 2. The intuition behind the definition (10) is that the solution to the problem (7) has this form, with $N_k$ replaced with $M_k M_k^\top$. In the next section, we show a Rademacher complexity bound over functions of the form above to justify our convex relaxation.
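For fixed $M$ and $N$, the inner maximization over $\alpha$ in (11) is a concave quadratic with a closed-form maximizer. The sketch below checks this numerically; the helper names and the random positive semidefinite matrix standing in for $K_{MN}$ are assumptions made for illustration.

```python
import numpy as np

def dual_solution(K, y, lam):
    """Maximizer of 2 a^T y - a^T (K + lam*T*I) a for a fixed PSD matrix K."""
    T = len(y)
    return np.linalg.solve(K + lam * T * np.eye(T), y)

def dual_objective(alpha, K, y, lam):
    """Value of the inner objective of (11) at alpha."""
    T = len(y)
    return 2 * alpha @ y - alpha @ (K + lam * T * np.eye(T)) @ alpha

rng = np.random.default_rng(1)
T = 6
F = rng.normal(size=(T, 3))
K = F @ F.T                      # any PSD matrix stands in for K_MN here
y = rng.normal(size=T)
lam = 0.1
alpha = dual_solution(K, y, lam)
value = dual_objective(alpha, K, y, lam)   # equals y^T (K + lam*T*I)^{-1} y
```

Plugging the maximizer back in leaves $y^\top (K_{MN} + \lambda T I)^{-1} y$ to be minimized over $M$ and $N$, which the Schur-complement constraint in (9) expresses as a bound on $t$.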

### 4.2 Theoretical analysis of IRR

As mentioned in the previous section, we predict with a hypothesis of the form (10) rather than going back to the primal class indexed by $(w, M)$ pairs. In this section, we would like to show that the new hypothesis class parametrized by $\alpha, M, N$ is not too rich for the purposes of learning. To do this, we give the class of all possible hypotheses that can be the solutions to the dual problem (9) and then prove a Rademacher complexity bound over that class. The set of all possible $\alpha, M, N$ triples that can be potential solutions to (9) lies in the following set

$$\mathcal{H} = \Big\{ (\alpha, M, N) : \|\alpha\| \leq \frac{B}{\lambda\sqrt{T}},\ \|M\|_F \leq \gamma,\ \sum_k \|N_k\|_F^2 \leq \gamma^4 \Big\}.$$

The bound on $\|\alpha\|$ is made implicitly in the optimization problem (assuming the training labels are bounded, $|y_i| \leq B$). To see this, we note that the problem (9) is obtained from (11) by using the closed-form solution of the optimal $\alpha = (K_{MN} + \lambda T I)^{-1} y$. Then we can bound $\|\alpha\| \leq \|y\| / \lambda_{\min}(K_{MN} + \lambda T I) \leq B/(\lambda\sqrt{T})$, where $\lambda_{\min}(A)$ denotes the smallest eigenvalue of the matrix $A$. Note that in general there is no linear hypothesis $w$ that corresponds to the hypotheses in the relaxed class and that we are dealing with a strictly more general function class. However, the following theorem demonstrates that the Rademacher complexity of this function class is reasonably bounded in terms of the number of training points $T$ and dimension $d$ and thereby still provides provable generalization performance [2].

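Recall the Rademacher complexity of a class $\mathcal{H}$,

$$\mathfrak{R}_T(\mathcal{H}) = \mathbb{E}_S \mathbb{E}_\sigma \left[ \frac{1}{T} \sup_{h \in \mathcal{H}} \left| \sum_{i=1}^{T} \sigma_i\, h(x_i, z_i) \right| \right], \tag{12}$$

where the inner expectation is over independent Rademacher random variables $\sigma_1, \ldots, \sigma_T$ and the outer one over a sample $S = ((x_1, z_1), \ldots, (x_T, z_T))$.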

###### Theorem 2

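If we assume a bounded regression problem, $|y| \leq B$ and $\|x\| \leq R$, then the Rademacher complexity of the hypothesis set $\mathcal{H}$ is bounded as follows,

$$\mathfrak{R}_T(\mathcal{H}) \leq \big(1 + \gamma + (\gamma + \gamma^2)\sqrt{d}\big) \frac{B R^2}{\lambda\sqrt{T}} = O\left(\sqrt{\frac{d}{T}}\right).$$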

Due to space constraints, the proof is presented in the appendix. Theorem 2 allows us to control the gap between empirical and expected risks using standard Rademacher complexity results. Theorem 8 of [2] immediately provides the following corollary.

###### Corollary 3

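Under the conditions of Theorem 2, for any $\delta > 0$, with probability at least $1 - \delta$ over samples of size $T$, every $h \in \mathcal{H}$ satisfies

$$\mathbb{E}\big[(y - h(x', z))^2\big] \leq \frac{1}{T}\sum_{t=1}^{T} \big(y_t - h(x'_t, z_t)\big)^2 + \frac{B R^2 (1+\gamma)^2}{\lambda} \left( \frac{B R^2 (1+\gamma)^2}{\lambda} \sqrt{\frac{d}{T}} + \sqrt{\frac{8 \ln(2/\delta)}{T}} \right).$$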

## 5 Empirical Results

This section presents an empirical evaluation of the online matrix-based Algorithm (5), as well as the Imputed Ridge Regression algorithm of Section 4.1. As baselines we use zero-imputation and mean-imputation, where the missing entries are replaced with zeros and with the mean estimated from the observed values of those features, respectively. Once the data is imputed, a standard online gradient descent algorithm or ridge regression algorithm is used. As a reference, we also show the performance of a standard algorithm on uncorrupted data. The algorithms are evaluated on several UCI repository datasets, summarized in Table 1.
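The two baselines can be sketched as follows. This is a minimal illustration with assumed function names (`zero_impute`, `mean_impute`) and toy data; the convention of using 0 for a feature that is never observed is a choice made here, not something stated in the paper.

```python
import numpy as np

def zero_impute(X, Z):
    """Replace missing entries (Z == 0) with zeros."""
    return X * Z

def mean_impute(X, Z):
    """Replace missing entries with the mean of that feature's observed values."""
    counts = Z.sum(axis=0)
    means = (X * Z).sum(axis=0) / np.maximum(counts, 1)   # 0 if never observed
    return X * Z + (1 - Z) * means

X = np.array([[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]])
Z = np.array([[1, 1], [0, 1], [1, 0]])     # 1 = observed, 0 = missing
X_zero = zero_impute(X, Z)
X_mean = mean_impute(X, Z)
```

Both baselines fix the imputation before any predictor is trained, which is the property the jointly optimized methods of Sections 3 and 4 are meant to improve on.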

The thyroid dataset includes naturally corrupted/missing data. The optdigits dataset is subjected to artificial corruption by deleting a column of pixels, chosen uniformly at random from the 3 central columns of the image (each image contains 8 columns of pixels total). The remainder of the datasets are subjected to two types of artificial corruption: data-independent or data-dependent corruption. In the first case, each feature is randomly deleted independently, while the features are deleted based on thresholding values in the latter case.

We report average error and standard deviations over 5 trials, using 1000 random training examples and corruption patterns. We tune hyper-parameters using a grid search from

to . Further details and explicit corruption processes appear in the appendix.

### 5.1 Online Corruption Dependent Hypothesis

Here we analyze the online algorithm presented in section 3.2 using two different types of regularization. The first method simply penalizes the Frobenius norm of the parameter matrix (frob-reg), . The second method (sparse-reg) forces a sparse solution by constraining many entries of the parameter matrix equal to zero as mentioned in Section 3.1.4. We use the regularizer , where is an additional tunable parameter. This choice of regularization is based on the example given in equation (4), where we would have .

We apply these methods to the splice classification task and the optdigits dataset in several one-vs-all classification tasks. For splice, the sparsity pattern used by the sparse-reg method is chosen by constraining to zero those entries $[A]_{i,j}$ where features $i$ and $j$ have a correlation coefficient less than 0.2, as measured with the corrupted training sample. In the case of optdigits, only entries corresponding to neighboring pixels are allowed to be non-zero.

Figure 1 shows that, when subject to data-independent corruption, the zero imputation, mean imputation and frob-reg methods all perform relatively poorly while the sparse-reg method provides significant improvement for the splice dataset. Furthermore, we find data-dependent corruption is quite harmful to mean imputation as might be expected, while both frob-reg and sparse-reg still provide significant improvement over zero-imputation. More surprisingly, these methods also perform better than training on uncorrupted data. We attribute this to the fact that we are using a richer hypothesis function that is parametrized by the corruption vector while the standard algorithm uses only a fixed hypothesis. In Table 2 we see that the sparse-reg performs at least as well as both zero and mean imputation in all tasks and offers significant improvement in the 3-vs-all and 6-vs-all task. In this case, the frob-reg method performs comparably to sparse-reg and is omitted from the table due to space.

### 5.2 Imputed Ridge Regression

In this section we consider the performance of IRR across many datasets. We found standard SDP solvers to be quite slow for problem (9). We instead use a semi-infinite linear program (SILP) to find an approximately optimal solution (see e.g. [13] for details).

In Tables 3 and 4 we compare the performance of the IRR algorithm to zero and mean imputation as well as to standard ridge regression performance on the uncorrupted data. Here we see IRR provides improvement over zero-imputation in all cases and does at least as well as mean-imputation when dealing with data-independent corruption. For data-dependent corruption, IRR continues to perform well, while mean-imputation suffers. For this setting, we have also compared to an independent-imputation method, which imputes data using an $M$ matrix that is trained independently of the learning algorithm. In particular, the $i$-th column of $M$ is selected as the best linear predictor of the $i$-th feature given the rest, i.e. the solution to a regression problem over $S_i$, the set of training examples that have the $i$-th feature present. Although this method can perform better than mean-imputation, the joint optimization solution provided by IRR provides an even more significant improvement. At the bottom of Table 4 we also measure performance with thyroid, which has naturally missing values. Here again IRR performs significantly better than the competitor methods. Zero-imputation is not shown due to space, but it performs uniformly worse. Figure 1 shows more detailed results for the abalone dataset across different levels of corruption and displays the consistent improvement which the IRR algorithm provides.

In Table 5 we see that, with respect to the column-corrupted optdigits dataset, the IRR algorithm performs significantly better than zero-imputation and mean-imputation in the majority of tasks.

## 6 Conclusion

We have introduced two new algorithms, addressing the problem of learning with missing features in both the adversarial online and i.i.d. batch settings. The algorithms are motivated by intuitive constructions and we also provide theoretical performance guarantees. Empirically we show encouraging initial results for online matrix-based corruption-dependent hypotheses as well as many significant results for the suggested IRR algorithm, which indicate superior performance when compared to several baseline imputation methods.

#### Acknowledgements

We gratefully acknowledge the support of the NSF under award DMS-0830410. AA was partially supported by an MSR PhD Fellowship. We also thank anonymous reviewers for suggesting additional references and improvements to proofs.

## References

• [1] J. Abernethy, A. Agarwal, P. L. Bartlett, and A. Rakhlin. A stochastic view of optimal regret through minimax duality. CoRR, abs/0903.5328, 2009.
• [2] P.L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3, 2003.
• [3] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3), 2003.
• [4] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambr. Univ. Press, 2006.
• [5] N. Cesa-Bianchi, S. Shalev-Shwartz, and O. Shamir. Efficient learning with partially observed attributes. ICML, 2010.
• [6] N. Cesa-Bianchi, S.S. Shwartz, and O. Shamir. Online Learning of Noisy Data with Kernels. COLT, 2010.
• [7] G. Chechik, G. Heitz, G. Elidan, P. Abbeel, and D. Koller. Max-margin classification of data with absent features. JMLR, 9, 2008.
• [8] T.M. Cover and E. Ordentlich. Universal portfolios with side information. IEEE Transactions on Information Theory, 42(2):348–363, 1996.
• [9] O. Dekel, O. Shamir, and L. Xiao. Learning to classify with missing and corrupted features. Machine Learning, 2010.
• [10] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1), 1977.
• [11] A. Globerson and S. Roweis. Nightmare at test time: robust learning by feature deletion. In ICML, 2006.
• [12] E. Hazan and N. Megiddo. Online learning with prior information. In COLT, 2007.
• [13] K. Krishnan and J.E. Mitchell. Semi-infinite linear programming approaches to semidefinite programming problems. Novel approaches to hard discrete optimization problems, 37, 2003.
• [14] R.J.A. Little and D.B. Rubin. Statistical analysis with missing data. Wiley New York, 1987.
• [15] B. M. Marlin. Missing Data Problems in Machine Learning. PhD thesis, University of Toronto, 2008.
• [16] A. S. Nemirovski and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. 1983.
• [17] A. Rostamizadeh, A. Agarwal, and P. Bartlett. Online and Batch Learning Algorithms for Data with Missing Features. ArXiv e-prints, 2011.

## Appendix A Proof of Proposition 1

The strategy used here is to consider the total regret accumulated by an algorithm on several different tasks, each one indexed by a different $z$.

First, in order to simplify the interaction between $x$ and $z$, suppose only the first $d/2$ coordinates of $x$ contain information (and the rest are always set equal to 0) and assume only the last $d/2$ coordinates of $z$ contain any information (the rest are always set equal to 1). Thus, with every one of the $2^{d/2}$ distinct values of $z$ we associate a different independent $w(z)$, or task, which the algorithm is trying to learn.

The main intuition is that the learning problem now reduces to a multitask classification problem with $2^{d/2}$ different tasks. Without further assumptions, it can be shown that the minimax regret of such a multitask classification problem is as bad as solving the tasks independently.

We partition the total number of iterations as $T = \sum_{i=1}^{2^{d/2}} T_i$, where each $T_i$ is the number of iterations on which a particular $z$ was used by the adversary. In order to analyze the minimax regret, we can use von Neumann duality (see e.g. [1]) to get

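$$
\begin{aligned}
&\inf_{w_1}\sup_{(x_1,z_1,y_1)}\cdots\inf_{w_T}\sup_{(x_T,z_T,y_T)}\left[\sum_{t=1}^{T}\ell(\langle w_t(z_t),x'_t\rangle,y_t)-\inf_{w\in\mathcal{W}}\sum_{t=1}^{T}\ell(\langle w(z_t),x'_t\rangle,y_t)\right]\\
&\quad=\sup_{p}\,\mathbb{E}\left[\sum_{t=1}^{T}\inf_{w_t\in\mathcal{W}}\mathbb{E}\Big[\ell(\langle w_t(z_t),x'_t\rangle,y_t)\,\Big|\,(x_s,y_s,z_s)_{1}^{t-1}\Big]-\inf_{w\in\mathcal{W}}\sum_{t=1}^{T}\ell(\langle w(z_t),x'_t\rangle,y_t)\right],
\end{aligned}
$$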

where the supremum is over joint distributions $p$ on sequences $(x_t, z_t, y_t)_{t=1}^{T}$.

It is clear that the first term decomposes over the tasks (since it decomposes over individual examples). The second term minimizes over all mappings in the set $\mathcal{W}$. Equivalently, this minimization can be carried out by choosing a weight vector for each task individually. As a result, the minimax regret decomposes as a sum of the minimax regrets for each task.

If we choose $T_i = T/2^{d/2}$ for every task, then the total regret (which is the sum of the regrets accumulated on each task) is

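$$
\sum_{i=1}^{2^{d/2}} R^*(T_i,\ell)=\sum_{i=1}^{2^{d/2}} R^*\!\left(\frac{T}{2^{d/2}},\ell\right)=2^{d/2}\,R^*\!\left(\frac{T}{2^{d/2}},\ell\right).
$$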

This completes the proof of the proposition.
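To get a feel for the size of this quantity, consider the following worked instance. It rests on an assumption that is not part of the proposition: that the single-task minimax regret grows as $R^*(T,\ell)=c\sqrt{T}$ for some constant $c>0$, the typical rate for convex Lipschitz losses.

```latex
\[
2^{d/2}\,R^*\!\left(\frac{T}{2^{d/2}},\ell\right)
  = 2^{d/2}\cdot c\,\sqrt{\frac{T}{2^{d/2}}}
  = c\,2^{d/4}\sqrt{T}.
\]
```

Under that assumption the total regret is still sublinear in $T$, but it carries a factor exponential in the dimension $d$, so it stays large unless $T$ is of order $2^{d/2}$ or more.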

## Appendix B Proof of Theorem 1

The proof is standard and just included for completeness. We recall from the update rule that

$$
A_{t+1}=\arg\min_{A\in\mathcal{A}}\big\{\eta_t\langle\nabla\ell_t(A_t),A\rangle_F+D_R(A,A_t)\big\}.
$$

Consequently, $A_{t+1}$ satisfies the first-order optimality conditions:

$$
\langle\eta_t\nabla\ell_t(A_t)+\nabla R(A_{t+1})-\nabla R(A_t),\,A-A_{t+1}\rangle_F\ge 0, \tag{13}
$$

for all $A\in\mathcal{A}$. Now for any fixed $A\in\mathcal{A}$, we can bound the regret as

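$$
\begin{aligned}
\sum_{t=1}^{T}\ell_t(A_t)-\ell_t(A)&\le\sum_{t=1}^{T}\langle\nabla\ell_t(A_t),A_t-A\rangle_F\\
&\le\sum_{t=1}^{T}\Big[\langle\nabla\ell_t(A_t),A_{t+1}-A\rangle_F+\langle\nabla\ell_t(A_t),A_t-A_{t+1}\rangle_F\Big]\\
&\le\sum_{t=1}^{T}\Big[\frac{1}{\eta_t}\langle\nabla R(A_{t+1})-\nabla R(A_t),A-A_{t+1}\rangle_F+\langle\nabla\ell_t(A_t),A_t-A_{t+1}\rangle_F\Big].
\end{aligned}
\tag{14}
$$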

Here the first inequality follows from the convexity of the loss and the last inequality is a consequence of (13). Also, applying (13) with $A=A_t$ gives

$$
\eta_t\langle\nabla\ell_t(A_t),A_t-A_{t+1}\rangle_F\ge\langle\nabla R(A_{t+1})-\nabla R(A_t),A_{t+1}-A_t\rangle_F\ge\|A_t-A_{t+1}\|^2,
$$

where the last step is a consequence of the strong convexity of the regularizer $R$. Finally, applying Hölder's inequality to the left-hand side of the above display yields

$$
\|A_t-A_{t+1}\|^2\le\eta_t\langle\nabla\ell_t(A_t),A_t-A_{t+1}\rangle_F\le\eta_t\|\nabla\ell_t(A_t)\|_*\,\|A_t-A_{t+1}\|,
$$

where $\|\cdot\|_*$ is the dual norm to $\|\cdot\|$. Hence we get

$$
\|A_t-A_{t+1}\|\le\eta_t G, \tag{15}
$$

where the last step follows from the Lipschitz assumption in the theorem statement. As a result, we can bound the second term in (14) as

$$
\langle\nabla\ell_t(A_t),A_t-A_{t+1}\rangle_F\le\|\nabla\ell_t(A_t)\|_*\,\|A_t-A_{t+1}\|\le\eta_t G^2. \tag{16}
$$

For the first term in (14), we note that

$$
\begin{aligned}
\langle\nabla R(A_{t+1})-\nabla R(A_t),A-A_{t+1}\rangle_F&\le D_R(A,A_t)-D_R(A_{t+1},A_t)-D_R(A,A_{t+1})\\
&\le D_R(A,A_t)-D_R(A,A_{t+1}),
\end{aligned}
$$

where the first step is the three-point property of Bregman divergences and the last step follows from their non-negativity. Finally, we combine the two bounds from above and substitute the value of $\eta_t$. Simplifying yields the statement of the theorem.
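The argument can be checked numerically in a simple special case. The sketch below is illustrative only and makes several assumptions that are not in the theorem statement: the regularizer is $R(A)=\frac{1}{2}\|A\|_F^2$ (so the update reduces to a projected gradient step), the feasible set is a Frobenius-norm ball, the losses are linear with gradients of Frobenius norm $G$, and the step size $\eta$ is constant. With a constant step size, telescoping the Bregman terms and summing (16) gives the bound $D_R(A,A_1)/\eta+\eta TG^2$. All function and variable names are hypothetical.

```python
import numpy as np

def project_frobenius_ball(A, radius):
    """Euclidean projection onto {A : ||A||_F <= radius}."""
    norm = np.linalg.norm(A)  # Frobenius norm for a 2-D array
    return A if norm <= radius else A * (radius / norm)

def online_mirror_descent(grads, radius, eta):
    """Mirror descent with R(A) = 0.5 * ||A||_F^2, i.e. projected gradient steps.

    Returns the iterates A_1, ..., A_T played against the gradients.
    """
    A = np.zeros_like(grads[0])
    iterates = []
    for g in grads:
        iterates.append(A)
        A = project_frobenius_ball(A - eta * g, radius)
    return iterates

rng = np.random.default_rng(0)
T, shape, radius, G = 200, (3, 4), 1.0, 1.0

# Linear losses l_t(A) = <g_t, A>_F with ||g_t||_F = G.
grads = []
for _ in range(T):
    g = rng.standard_normal(shape)
    grads.append(G * g / np.linalg.norm(g))

eta = radius / (G * np.sqrt(2 * T))  # minimizes radius^2/(2 eta) + eta T G^2
iterates = online_mirror_descent(grads, radius, eta)

alg_loss = sum(np.sum(g * A) for g, A in zip(grads, iterates))
best_loss = -radius * np.linalg.norm(sum(grads))  # best fixed A in the ball
regret = alg_loss - best_loss

# D_R(A, A_1) <= radius^2 / 2 because A_1 = 0 and ||A||_F <= radius.
bound = radius**2 / (2 * eta) + eta * T * G**2
```

For this choice of $\eta$ the bound equals `radius * G * sqrt(2 * T)`, and the measured `regret` should fall below it for any gradient sequence of this form.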

## Appendix C Proof of Proposition 2

In order to formulate a tractable problem we first rewrite the imputed ridge regression problem in its dual formulation.

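$$
\min_{M}\max_{\alpha}\;\; 2\sum_{i=1}^{T}\alpha_i y_i-\sum_{i,j=1}^{T}\alpha_i\alpha_j\Big((x_i+\bar{Z}_i M^\top x_i)^\top(x_j+\bar{Z}_j M^\top x_j)+\lambda T I\Big)\quad\text{s.t.}\;\;\|M\|_F^2\le\gamma^2
$$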

The inner maximization problem is concave in $\alpha$, and the optimal solution for any fixed $M$ is found via the standard closed-form solution for ridge regression:

$$
\alpha=(K_M+\lambda T I)^{-1}y,
$$

where $K_M$ denotes the Gram matrix containing dot-products between imputed training instances, which can be written compactly using component-wise (Hadamard) products between matrices. Plugging this solution into the minimax problem results in the following matrix fractional minimization problem,

$$
\min_{M}\; y^\top(K_M+\lambda T I)^{-1}y\quad\text{s.t.}\;\;\|M\|_F^2\le\gamma^2.
$$

This problem is still not convex in $M$ due to the quadratic terms that appear in $K_M$. The main idea for the convex relaxation will be to introduce new variables $[N_k]_{r,s}$ which substitute the quadratic terms $[M]_{r,k}[M]_{s,k}$, resulting in a matrix that is linear in terms of the optimization variables $M$ and $N_k$. This is shown precisely below:

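$$
[K_M]_{i,j}=x_i^\top x_j+x_i^\top M\bar{Z}_i x_j+x_i^\top\bar{Z}_j M^\top x_j+\underbrace{x_i^\top M\bar{Z}_i\bar{Z}_j M^\top x_j}_{\sum_{r,s,k=1}^{d}[x_i]_r[x_j]_s[\bar{z}_i]_k[\bar{z}_j]_k[M]_{r,k}[M]_{s,k}}
$$

which, after the substitution, becomes the $(i,j)$ entry

$$
x_i^\top x_j+x_i^\top M\bar{Z}_i x_j+x_i^\top\bar{Z}_j M^\top x_j+\underbrace{\sum_{k=1}^{d}[\bar{z}_i]_k[\bar{z}_j]_k\,x_i^\top N_k x_j}_{\sum_{r,s,k=1}^{d}[x_i]_r[x_j]_s[\bar{z}_i]_k[\bar{z}_j]_k[N_k]_{r,s}}
$$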
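The expansion can be checked numerically. The following NumPy sketch is illustrative and rests on assumptions not spelled out in this appendix: `Zbar[i]` is taken to be the 0/1 diagonal of $\bar{Z}_i$ (1 marking a missing feature), missing entries of `X` are stored as 0, and the $d$ matrices $N_k$ are stacked in an array `N` of shape `(d, d, d)`. The function names are hypothetical. When each $N_k$ equals the outer product of the $k$-th column of $M$ with itself, the linearized matrix should coincide with $K_M$.

```python
import numpy as np

def imputed_gram(X, Zbar, M):
    """Gram matrix K_M of the imputed instances x_i + Zbar_i M^T x_i."""
    X_tilde = X + Zbar * (X @ M)  # row i: x_i + diag(Zbar[i]) M^T x_i
    return X_tilde @ X_tilde.T

def linearized_gram(X, Zbar, M, N):
    """Matrix obtained by replacing [M]_{r,k}[M]_{s,k} with [N_k]_{r,s}.

    It is linear in (M, N); N[k] plays the role of N_k.
    """
    cross = (Zbar * (X @ M)) @ X.T  # entry (i, j): x_i^T M Zbar_i x_j
    quad = np.einsum('ik,jk,ir,krs,js->ij', Zbar, Zbar, X, N, X)
    return X @ X.T + cross + cross.T + quad

def matrix_fractional_objective(K, y, lam):
    """y^T (K + lam * T * I)^{-1} y, with T the number of training instances."""
    T = len(y)
    return y @ np.linalg.solve(K + lam * T * np.eye(T), y)

rng = np.random.default_rng(0)
T, d, lam = 6, 4, 0.1
Zbar = (rng.random((T, d)) < 0.3).astype(float)   # 1 = feature missing
X = rng.standard_normal((T, d)) * (1.0 - Zbar)    # missing entries set to 0
y = rng.standard_normal(T)
M = 0.5 * rng.standard_normal((d, d))

# N_k = M[:, k] M[:, k]^T recovers the original quadratic terms exactly.
N = np.einsum('rk,sk->krs', M, M)

K_M = imputed_gram(X, Zbar, M)
K_lin = linearized_gram(X, Zbar, M, N)
objective = matrix_fractional_objective(K_M, y, lam)
```

Once `N` is treated as a free variable rather than being tied to `M`, `linearized_gram` can return a matrix that is not a Gram matrix at all, which is the situation the explicit positive semi-definiteness constraint is meant to handle.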

Note that the resulting matrix no longer necessarily corresponds to a Gram matrix and may no longer be positive semi-definite (which is required for the convexity of a matrix fractional objective). Thus, we add an explicit positive semi-definiteness constraint, resulting in the following optimization problem,

 minM,N