# Streaming Methods for Restricted Strongly Convex Functions with Applications to Prototype Selection

In this paper, we show that if the optimization function is restricted-strongly-convex (RSC) and restricted-smooth (RSM) -- a rich subclass of weakly submodular functions -- then a streaming algorithm with a constant factor approximation guarantee is possible. More generally, our results apply to any monotone weakly submodular function whose submodularity ratio is bounded from above. This positive result, which provides a sufficient condition for a constant factor streaming guarantee for weakly submodular functions, may be of special interest given the recent negative result (Elenberg et al., 2017) for the general class of weakly submodular functions. We apply our streaming algorithms to create compact synopses of large complex datasets by selecting m representative elements that optimize a suitable RSC and RSM objective function. These results hold even with additional constraints such as learning, for interpretability, a non-negative weight for each selected element that indicates its importance. We empirically evaluate our algorithms on two real datasets: MNIST, a handwritten digits dataset, and Letters, a UCI dataset containing the alphabet written in different fonts and styles. We observe that our algorithms are orders of magnitude faster than the state-of-the-art streaming algorithm for weakly submodular functions, while our main algorithm still provides equally good solutions in practice.


## 1 Introduction

Extracting compact synopses of large data sets or important features is a vital tool for summarizing, understanding, explaining and manipulating large datasets and large, complex machine learning models [12, 11]. Besides interpretability and human understanding, such synopses also enable outlier detection, retaining information in lifelong learning systems, scaling deep learning, transfer learning and obtaining quick performance estimates for autoML systems [8]. These applications demand fast yet accurate and reliable algorithms for synopsis generation that can flexibly adapt to user and application demands and are robust to uncertainties in the data. These approaches can be unified as finding a subset $S$ out of a collection of items (data points, features, etc.) that maximizes a scoring function $f(S)$. The scoring function measures the information, relevance and quality of the selection. The desiderata for the scoring function naturally imply notions of diminishing returns: for any two sets $A \subseteq B$ and any item $i \notin B$, it holds that $f(A \cup \{i\}) - f(A) \ge f(B \cup \{i\}) - f(B)$. This is the definition of submodularity [10, 15].

In this paper, we provide two streaming algorithms for selecting such high value elements from data streams or large complex datasets. We also learn non-negative weights for each of them indicative of their importance. The non-negativity makes the weights more interpretable, as many domain experts find negative weights hard to interpret [14, 11]. Our first streaming algorithm, ProtoBasic, is extremely efficient, and we prove a constant factor approximation guarantee for it when the objective that it tries to maximize is restricted strongly convex (RSC) and restricted smooth (RSM) [5], even with the additional non-negativity constraint on the importance weights. Functions that are RSC and RSM form a rich subclass of weakly submodular functions, including but not limited to ordinary least squares, generalized linear models, structured regularizers for matrix completion or any form of M-estimator [4, 16]. Loosely speaking, weakly submodular functions are close to being submodular but not quite, and greedy algorithms lead to good solutions for them in the batch setting [17]. The submodularity ratio [4] is a way of measuring this distance from submodularity.

In fact, more generally, a constant factor bound can be shown for monotonic weakly submodular functions whose submodularity ratio can be bounded from above. This includes the RSC and RSM function class. This positive result, which provides a sufficient condition for a constant factor streaming guarantee for weakly submodular functions, may be of special interest given the recent negative result [6] showing the absence of such a guarantee for the general class of weakly submodular functions. To give the reader further insight, we discuss the counterexample used in [6] to prove their negative result in the context of the submodularity ratio, arguing that the ratio cannot be bounded for that specific function.

Our second streaming algorithm, ProtoStream, is an enhancement of the first. It is threshold based: it selects elements with high incremental gain, which leads to a diverse selection, something that may not be the case with ProtoBasic. We provide theoretical arguments for which thresholds should be selected when running this algorithm.

We then empirically evaluate the efficacy of our algorithms for the prototype selection application [11]. We compare with the state-of-the-art streaming algorithm recently proposed for weakly submodular functions [6] in terms of performance and speed on two real datasets: MNIST- a handwritten digits dataset and Letters- a UCI dataset containing the alphabet written in different fonts and styles.

## 2 Preliminaries

Given a positive integer $n$, let $[n] := \{1, 2, \ldots, n\}$ denote the set of the first $n$ natural numbers.

• Let $L, S \subseteq [n]$ be two disjoint sets, and $f : 2^{[n]} \to \mathbb{R}$. The submodularity ratio [4] of $L$ with respect to (w.r.t.) $S$ is given by:

$$\gamma_{L,S} = \frac{\sum_{i \in S} \left( f(L \cup \{i\}) - f(L) \right)}{f(L \cup S) - f(L)} \tag{2.1}$$

The function $f$ is submodular iff $\gamma_{L,S} \ge 1$ for all $L, S$. However, if $\gamma_{L,S}$ can be shown to be bounded away from 0, but not necessarily $\ge 1$, then $f$ is said to be weakly submodular.

• A function $l$ is said to be restricted strongly concave with parameter $c_\Omega$ and restricted smooth with parameter $C_\Omega$ [5] if, for all $x, y \in \Omega$,

$$-\frac{c_\Omega}{2} \|y - x\|_2^2 \;\ge\; l(y) - l(x) - \langle \nabla l(x), y - x \rangle \;\ge\; -\frac{C_\Omega}{2} \|y - x\|_2^2. \tag{2.2}$$

We denote the RSC and RSM parameters on the domain of all $m$-sparse non-negative vectors by $c_m$ and $C_m$ respectively. We care about this non-negative orthant because of our additional non-negativity constraint on the learned weights for each selected prototype, which is motivated from an interpretability standpoint. This is further explained in Section 6. Also, let $\tilde{C}_k$ denote the smoothness parameter used in Lemma 5.1.
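As an illustration of (2.2), consider a concave quadratic of the form $l(w) = w^T\mu - \frac{1}{2} w^T K w$ with $K$ positive definite. A direct expansion gives $l(y) - l(x) - \langle \nabla l(x), y - x \rangle = -\frac{1}{2}(y-x)^T K (y-x)$, so (2.2) holds with $c$ and $C$ equal to the smallest and largest eigenvalues of $K$. The sketch below checks this numerically; the 2x2 matrix, the vector and all function names are hypothetical values chosen only for illustration.

```python
import math
import random

# Hypothetical 2x2 positive definite matrix K and vector MU (illustrative values only).
K = [[2.0, 0.5], [0.5, 1.0]]
MU = [1.0, 0.3]

def l(w):
    """Concave quadratic l(w) = w^T mu - 0.5 * w^T K w."""
    lin = sum(wi * mi for wi, mi in zip(w, MU))
    quad = sum(w[i] * K[i][j] * w[j] for i in range(2) for j in range(2))
    return lin - 0.5 * quad

def grad(w):
    """Gradient of l, namely mu - K w."""
    return [MU[i] - sum(K[i][j] * w[j] for j in range(2)) for i in range(2)]

def eig_sym_2x2(a):
    """Closed-form (smallest, largest) eigenvalues of a symmetric 2x2 matrix."""
    mean = 0.5 * (a[0][0] + a[1][1])
    rad = math.sqrt((0.5 * (a[0][0] - a[1][1])) ** 2 + a[0][1] ** 2)
    return mean - rad, mean + rad

def bregman_gap(x, y):
    """Middle term of (2.2): l(y) - l(x) - <grad l(x), y - x>."""
    g = grad(x)
    return l(y) - l(x) - sum(gi * (yi - xi) for gi, xi, yi in zip(g, x, y))

# For this quadratic the RSC and RSM constants are the extreme eigenvalues of K.
c, C = eig_sym_2x2(K)

def check_rsc_rsm(trials=1000, seed=0, tol=1e-9):
    """Return True if (2.2) holds on randomly drawn non-negative pairs (x, y)."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.uniform(0, 3) for _ in range(2)]
        y = [rng.uniform(0, 3) for _ in range(2)]
        d2 = sum((yi - xi) ** 2 for xi, yi in zip(x, y))
        gap = bregman_gap(x, y)
        if not (-0.5 * C * d2 - tol <= gap <= -0.5 * c * d2 + tol):
            return False
    return True
```

Restricting $x$ and $y$ to sparse vectors can only improve these constants, since the quadratic form then involves a principal submatrix of $K$.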

## 3 Problem Statement

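Given $n$ elements from an input space, a constant $m$ independent of $n$, and a continuous function $l$ with RSC and RSM properties, our objective is:

$$\text{Maximize } l(w) \quad \text{s.t.} \quad \|w\|_0 \le m \ \text{ and } \ w \ge 0.$$

Defining a set function $f$ as

$$f(L) \equiv \max_{w \,:\, \mathrm{supp}(w) \subseteq L} l(w) \tag{3.1}$$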

for a set $L \subseteq [n]$, our goal is to find the set that maximizes $f$ subject to the cardinality constraint $|L| \le m$. Note that $f$ is monotonic: if $L_1 \subseteq L_2$ then $f(L_1) \le f(L_2)$. Hence, without loss of generality we assume that . Given a set $L$, the point at which $l$ attains its maximum with support in $L$ is denoted by $\zeta(L)$.

It is easy to see that explicitly computing the optimal set $L^*$ is an NP-complete problem. In this work, we develop a fast streaming algorithm that closely approximates $f(L^*)$ even for the worst case streaming order of the elements. To this end, we show later that when $l$ is RSC and RSM, it is possible to have a constant factor streaming algorithm even for the worst case streaming order. More generally, we establish that if the submodularity ratio of a weakly submodular monotonic set function is bounded from above, then a streaming algorithm with a constant approximation guarantee exists, as stated in Theorem 5.5.

## 4 Related Work

As mentioned before, subset selection, especially when based on submodularity, has wide applications in understanding, summarizing and manipulating large datasets [10, 15, 12], given that it is possible to obtain tractable algorithms with constant factor guarantees. In fact, it is known that constant factor algorithms are possible for submodular functions even in the streaming setting [2].

Recently, however, it was shown [6] that for the larger class of weakly submodular functions [4] no constant factor algorithm can exist in the streaming setting. This was a surprising result given that for the batch setting such approximation algorithms have long been known to exist [17].

In this work we propose streaming algorithms for a rich subclass of weakly submodular functions [5], namely those that are RSC and RSM. Efficient batch algorithms for the same class were proposed in [11, 5], and the focus on interpretability through learning non-negative weights was highlighted in [11]. Our work shows that a constant factor streaming algorithm is possible for RSC and RSM weakly submodular functions or, more generally, for weakly submodular functions whose submodularity ratio can be bounded from above. This holds even when we must learn non-negative weights for the selected elements, which indicate their importance from an interpretability standpoint [14, 11, 3]. The result thus provides a sufficient condition, covering a rich subclass of weakly submodular functions, for obtaining such a guarantee, and is interesting in light of the recent result [6].

## 5 Methods and Results

In this section we propose two streaming algorithms, a simple one and an enhanced threshold based one. We show based on our first algorithm that it is possible to obtain a constant factor bound for RSC and RSM functions. Also more generally, the constant factor bound can be shown for any monotonic weakly submodular function with submodularity ratio bounded from above. Here we also discuss the counter example given in [6] in the context of submodularity ratio. We then describe our second threshold based algorithm which is an enhancement of the first and that adds elements based on high incremental gain and is thus likely to select diverse elements leading to potentially better performance in practice. We provide a (theoretical) discussion here of what thresholds should be considered when running this algorithm.

### 5.1 Algorithmic Description for ProtoBasic

Algorithm 1, ProtoBasic, is the first streaming algorithm we propose. The algorithm is quite simple: it maintains only one active solution set, which makes it extremely fast. Moreover, deciding on each new element requires only function gradient evaluations rather than function evaluations as in [6], which adds to its scalability.

The algorithm first selects the first $m$ elements. Then, for every subsequent element, it checks the value of adding that element to the empty set based on the function gradient. If this value is higher than the minimum value among the currently selected elements, we replace that minimum value element with the current one. The minimum value element can be accessed efficiently, for example using a min heap data structure. Finally, the optimal weights are computed for the selected set.
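The selection step described above can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 1: the function name `proto_basic` and the `grad_at_zero` callable are hypothetical interfaces, and the final weight computation for the selected set is omitted.

```python
import heapq

def proto_basic(stream, grad_at_zero, m):
    """Keep the m stream elements with the largest gradient value at w = 0.

    stream: iterable of element identifiers (hypothetical interface).
    grad_at_zero: callable returning the partial derivative of l at 0
                  for a given element (hypothetical interface).
    """
    heap = []  # min-heap of (score, arrival_index, element)
    for idx, elem in enumerate(stream):
        score = grad_at_zero(elem)
        if len(heap) < m:
            # The first m elements are always taken.
            heapq.heappush(heap, (score, idx, elem))
        elif score > heap[0][0]:
            # Replace the current minimum-value element.
            heapq.heapreplace(heap, (score, idx, elem))
    # Return the selected elements, highest score first.
    return [elem for _, _, elem in sorted(heap, reverse=True)]
```

Each arriving element costs one gradient evaluation plus at most one heap operation, so the per-element cost is logarithmic in $m$ beyond the gradient itself.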

### 5.2 Theoretical Guarantees

Using Lemmas 5.1 and 5.2, we first show a constant factor bound for ProtoBasic for any RSC and RSM function. We then show how RSC and RSM imply bounds on the submodularity ratio and how a bounded submodularity ratio also leads to algorithms with constant factor guarantees. Complete proofs can be found in Appendix A.

###### Lemma 5.1.

Let be the RSM constant for any two vectors and where . Then for any two sets and with , and we have,

$$l(\zeta(L \cup S)) - l(\zeta(L)) \ge \frac{1}{2\tilde{C}_k} \left\| \nabla l_S^+(\zeta(L)) \right\|^2.$$
###### Proof Sketch.

Based on definition of RSM and evaluating the KKT conditions for optimality we get the necessary lower bound. ∎

###### Lemma 5.2.

Let be the RSC constant for any two sparse vectors and . Then for any two sets and with , and we have,

$$l(\zeta(L \cup S)) - l(\zeta(L)) \le \frac{1}{2c_k} \left\| \nabla l_S^+(\zeta(L)) \right\|^2.$$
###### Proof Sketch.

Based on definition of RSC and evaluating the KKT conditions for optimality we get the necessary upper bound. ∎

###### Theorem 5.3 (Constant factor guarantee for RSC and RSM functions).

Consider a function $l$ with RSC and RSM parameters $c_m$ and $\tilde{C}_m$ respectively, and let $f$ be a set function defined as in (3.1). If $S$ is the solution of ProtoBasic and $L^*$ is the optimal set of size $m$, then for $\kappa = \frac{c_m}{\tilde{C}_m}$ we have

$$f(S) \ge \kappa f(L^*).$$
###### Proof Sketch.

First setting $L = \emptyset$ in Lemma 5.1 and then setting $L = \emptyset$ and $S = L^*$ in Lemma 5.2, we get the constant factor bound. ∎

###### Lemma 5.4 (Bounded submodularity ratio γ).

Let $f$ be a set function defined as in (3.1) where $l$ is RSC and RSM. Then for any two disjoint sets $L$ and $S$ we have,

$$\frac{c_{|L|+|S|}}{\tilde{C}_1} \le \gamma_{L,S} \le \frac{\tilde{C}_{|S|}}{c_{|L|+1}}.$$
###### Proof Sketch.

Using inequalities in lemmas 5.1 and 5.2 we can bound the submodularity ratio for any RSC and RSM function as above. ∎
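To make the quantity bounded in Lemma 5.4 concrete, the sketch below evaluates the submodularity ratio (2.1) by brute force for an arbitrary set function. The coverage function used as an example is a hypothetical stand-in (it is not the objective studied in this paper); coverage functions are submodular, so the computed ratio is at least 1.

```python
def submodularity_ratio(f, L, S):
    """gamma_{L,S} from (2.1) for a set function f defined on frozensets.

    L and S must be disjoint. Raises ZeroDivisionError when adding S to L
    does not change f, in which case the ratio is undefined.
    """
    L, S = frozenset(L), frozenset(S)
    assert not (L & S), "L and S must be disjoint"
    base = f(L)
    numerator = sum(f(L | {i}) - base for i in S)
    denominator = f(L | S) - base
    return numerator / denominator

# Hypothetical coverage function: each item covers a few ground-set points.
COVER = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d", "e"}}

def coverage(items):
    """Number of ground-set points covered by the given items."""
    covered = set()
    for i in items:
        covered |= COVER[i]
    return len(covered)
```

For instance, `submodularity_ratio(coverage, set(), {1, 2})` sums the two singleton gains (2 + 2) and divides by the joint gain (3).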

###### Theorem 5.5 (Constant factor guarantee for functions with bounded γ).

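Let $f$ be a monotonic weakly submodular function with the property that any set $S$ of cardinality $m$ has a bounded submodularity ratio, i.e., $r_m \le \gamma_{\emptyset,S} \le R_m$, where $r_m$ and $R_m$ are positive constants that are independent of $n$ and depend only on $m$. Then the set $S$ containing the $m$ elements with the highest singleton values, which is computable in a streaming setting (say by using min heaps), satisfies

$$f(S) \ge \kappa f(L^*) \quad \text{where} \quad \kappa = \frac{r_m}{R_m},$$

where $L^*$ is the optimal size-$m$ solution at which $f$ attains its maximum.

###### Proof Sketch.

The result follows from the inequalities that ensue given the facts that $\sum_{j \in S} f(\{j\}) \ge \sum_{p \in L^*} f(\{p\})$; $\gamma_{\emptyset,S} \le R_m$; $\gamma_{\emptyset,L^*} \ge r_m$. ∎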

### 5.3 Impossibility Result and Submodularity Ratio

We now briefly describe how the submodularity ratio of the weakly submodular function constructed in [6] to show the impossibility result cannot be bounded from above and thus does not contradict our results. Moreover, it provides insight into the connections between the two. As considered in [6], for any set define the functions and using the base elements and . An impossibility result is shown for the set function . Letting and we find . For any singleton set , and implies . Further as . Hence grows with which can be made large enough to violate any upper bound and thereby engendering the impossibility result.

### 5.4 Algorithmic Description for ProtoStream

Algorithm 2, ProtoStream, unlike ProtoBasic is threshold based. It maintains multiple candidate sets of elements in parallel corresponding to thresholds in the range at intervals of for an user input . Here where is the element such that among all the encountered elements . The total number of candidates sets that are simultaneously maintained is requiring a total space of independent of . Value depends on the highest gradient element encountered thus far which is also one of the candidate sets. Those sets are updated for which the incremental gain in adding the new element based on its gradient is greater than , where the are the thresholds in . Notice that the incremental gain is a constant that does not depend on or RSC and RSM parameters of the objective function and is thus easily computable. Eventually, the set along with its corresponding weights that has the highest value of is chosen as the final solution. Lemma 5.6 gives a lower bound for the set function evaluated at the set containing elements corresponding to a threshold .
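The per-threshold update can be sketched for a single candidate set $L_\tau$ as follows. This is a simplified illustration under stated assumptions, not the paper's Algorithm 2: it assumes a quadratic objective of the form (6.1), uses the acceptance rule (A.7), and recomputes the non-negative weights with a plain projected coordinate ascent solver chosen here only for brevity. The grid of thresholds and the running update of the largest gradient seen so far are omitted, and all function names are hypothetical.

```python
import math

def fit_weights(support, mu, K, sweeps=200):
    """Maximize mu^T w - 0.5 * w^T K w over w >= 0 restricted to `support`,
    using projected coordinate ascent (a simple stand-in solver)."""
    w = {j: 0.0 for j in support}
    for _ in range(sweeps):
        for j in support:
            rest = sum(K[j][i] * w[i] for i in support if i != j)
            w[j] = max(0.0, (mu[j] - rest) / K[j][j])
    return w

def single_threshold_pass(stream, mu, K, m, tau):
    """Build one candidate set L_tau over a stream of element indices.

    Element j is added when its gradient at the current weights satisfies
    grad_j >= sqrt(2 * tau / m), following (A.7).
    """
    support, w = [], {}
    cutoff = math.sqrt(2.0 * tau / m)
    for j in stream:
        if len(support) >= m:
            break
        # Gradient of the quadratic objective at the current solution.
        grad_j = mu[j] - sum(K[j][i] * w[i] for i in support)
        if grad_j >= cutoff:
            support.append(j)
            w = fit_weights(support, mu, K)  # recompute weights after each addition
    return support, w
```

Because the gradient is evaluated at the current weights, an element that is highly similar to one already selected sees a reduced gradient and tends to be skipped, which is the mechanism behind the more diverse selections of ProtoStream.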

###### Lemma 5.6.

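If the set $L_\tau$ for the threshold $\tau$ has cardinality $m$ then $f(L_\tau) \ge \frac{\tau}{\tilde{C}_1}$.

###### Proof Sketch.

The result follows from Lemma 5.1 and the fact that we add an element $j$ only if $\nabla l_j(\zeta(L_\tau)) \ge \sqrt{\frac{2\tau}{m}}$. ∎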

### 5.5 Choosing Thresholds for ProtoStream

Recall that the thresholds are searched in the interval where the interval length is independent of the RSC and RSM parameters and hence readily available. The upper bound on the range of is chosen to guarantee that for any new element , all candidate sets to which must be appended when its incremental gain exceeds are considered and no already seen elements are overlooked that should have been taken for the set when instantiating a new . This is because when is chosen from , every element that satisfies the threshold criteria to be a part of will appear on or after is instantiated and never before, as for any past element , where is the new value of that may be instantiated after seeing . Ergo, . The following insight is useful in motivating our choice for the lower range of . Setting and in Lemma 5.2 we get

$$f(L^*) \le \frac{1}{2c_m} \sum_{j \in L^*} \left[ \nabla l_j^+(0) \right]^2 \le \frac{\rho m}{2c_m} \tag{5.1}$$

implying that . Hence we choose the lower range of to be the value that lower bounds . Setting to be the singleton set which has the maximum gradient at and in Lemma 5.1 we have

$$\frac{c_m \rho}{2\tilde{C}_1} \le c_m f(\{p\}) \le c_m f(L^*). \tag{5.2}$$

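Let us first consider the case where the number of chosen prototypes is so few that $m \le \frac{\tilde{C}_1}{c_m}$. From Lemma 5.1 and the inequality in (5.1) we find

$$f(\{p\}) \ge \frac{\rho}{2\tilde{C}_1} \ge \frac{c_m f(L^*)}{\tilde{C}_1 m} \ge \frac{c_m^2 f(L^*)}{\tilde{C}_1^2}.$$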

Hence by just opting for the singleton set , we obtain a constant factor approximation. In the more interesting case where , (5.2) implies that . Hence we set the range to be . Note that for a value , if , then in accordance with Lemma 5.6 we will have , resulting in a better constant approximation factor compared to derived for ProtoBasic as .

## 6 Experiments

We now empirically investigate the performance of our algorithms relative to the state-of-the-art Streak algorithm [6] on two real datasets, MNIST [13] and Letters [9]. We extract compact synopses on the fly for these datasets of size $n$ by selecting a maximum of $m$ prototypes obtained by maximizing the following cost function, which is a reformulation of the maximum mean discrepancy metric and has been successfully used to select prototypes in the batch setting [11, 12]:

$$\text{Maximize } l(w) = w^T \mu - \frac{1}{2} w^T K w \quad \text{s.t.} \quad \|w\|_0 \le m \ \text{ and } \ w \ge 0. \tag{6.1}$$

Here $K$ is the positive definite kernel matrix with entries $K_{ij} = k(x_i, x_j)$, where $k$ is an appropriately chosen kernel function that defines the inner products between data samples. Each entry of the vector $\mu$ contains the mean inner product of a data sample with the rest and is defined as $\mu_j = \frac{1}{n} \sum_{i} k(x_i, x_j)$. An empirical estimate of $\mu$ is maintained in the experiments based on ideas described in [2]. The vector $w$ holds the non-negative weights, with at most $m$ non-zero entries, which indicate the importance of the corresponding prototypes. It was shown in [11] that the function in equation (6.1) is RSC and RSM, and that the corresponding set function defined as in equation (3.1) is weakly submodular even with the non-negativity constraint on the weights. When all weights are fixed to a common value and only the support set is unknown, the set function in (3.1) can be shown to be (strongly) submodular [12], and for this class of functions streaming algorithms with constant factor guarantees are developed in [2]. As [11] describes in detail the usefulness of having non-equal weights, we consider the more general setting here and apply our streaming algorithms to it. In all the experiments we use a Gaussian kernel for $k$, whose width is found through cross-validation, and set as smaller values did not improve the objective by much but significantly slowed down Streak.
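A minimal sketch of the quantities entering (6.1) is given below, assuming a Gaussian kernel parameterized as $k(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))$ and $\mu_j$ computed as the mean kernel value against all samples. The parameterization, the default width and the function names are assumptions made for illustration; the experiments above choose the width by cross-validation.

```python
import math

def gaussian_kernel(x, y, width=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * width^2)); the width is a free parameter."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * width ** 2))

def kernel_stats(points, width=1.0):
    """Return the kernel matrix K and the vector mu with mu_j = mean_i k(x_i, x_j)."""
    n = len(points)
    K = [[gaussian_kernel(p, q, width) for q in points] for p in points]
    mu = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    return K, mu

def objective(w, mu, K):
    """l(w) = w^T mu - 0.5 * w^T K w, as in (6.1)."""
    n = len(w)
    lin = sum(w[j] * mu[j] for j in range(n))
    quad = sum(w[i] * K[i][j] * w[j] for i in range(n) for j in range(n))
    return lin - 0.5 * quad

def gradient(w, mu, K):
    """Gradient of l at w, namely mu - K w."""
    n = len(w)
    return [mu[j] - sum(K[j][i] * w[i] for i in range(n)) for j in range(n)]
```

At $w = 0$ the gradient equals $\mu$, so ranking elements by their gradient at zero amounts to ranking them by their mean similarity to the data.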

For both MNIST and Letters, a (global) 1-nearest neighbor (1-NN) classifier [12] was used to evaluate the efficacy of the selected prototypes. Since the learned weights and the distance metric in 1-NN classification are not on the same scale, we performed standard 1-NN classification using the top $m$ prototypes, chosen as those with the largest weights.
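The evaluation step described above can be sketched as follows: the weights are used only to rank prototypes, and classification is ordinary 1-NN under Euclidean distance. The function names and the tie-breaking rule are illustrative assumptions.

```python
def top_prototypes(weights, k):
    """Indices of the k largest weights (ties broken by lower index)."""
    order = sorted(range(len(weights)), key=lambda j: (-weights[j], j))
    return order[:k]

def one_nn_predict(x, prototypes, labels):
    """Label of the prototype closest to x in squared Euclidean distance."""
    def dist2(p):
        return sum((a - b) ** 2 for a, b in zip(x, p))
    best = min(range(len(prototypes)), key=lambda i: dist2(prototypes[i]))
    return labels[best]

def one_nn_accuracy(test_points, test_labels, prototypes, labels):
    """Fraction of test points whose nearest prototype carries the correct label."""
    hits = sum(
        one_nn_predict(x, prototypes, labels) == y
        for x, y in zip(test_points, test_labels)
    )
    return hits / len(test_points)
```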

Appendix B.2 reports additional experiments in which the test set is split into multiple target datasets, each containing only examples of a single digit or letter, while the training (source) dataset remains the same. These experiments evaluate how well the algorithms adapt to such heavily skewed test distributions. We observe in such settings that ProtoBasic is in fact the method of choice.

### 6.1 MNIST

The MNIST dataset consists of 70000 (60K+10K) handwritten digits. We use the set of size 10000 as the base set from which we choose up to 750 prototypes (beyond this the gain to our objective (6.1) was incremental) and then evaluate them on the remaining 60000, which serve as the test set.

We observe in Figure 1 (left) that the classification accuracies of Streak and ProtoStream on the test set are very similar across different values of $m$. ProtoBasic is significantly worse, and the reason is the lack of diversity in its chosen prototypes, as visualized in Figure 1 (right). In this plot, we see that the distribution of the 10 digits in the base set is almost uniform, and both Streak and ProtoStream are able to recover this reasonably well, whereas ProtoBasic ends up selecting just a few digits. This is because ProtoBasic chooses the prototypes based only on their gradient values computed at $0$. It is non-incremental in the sense that subsequent choices do not depend on which elements have been chosen thus far, and hence it is unable to create a diverse prototype set. In contrast, both ProtoStream and Streak are incremental methods, as the incremental gain for an incoming element depends on the current content of the sets.

In Figure 1 (center), we see the main benefit of our methods. We plot the per threshold times as parallelized implementations may be possible for maintaining the different sets and so a comparison on this metric is important. In Table 1, we see the total run times for a serial implementation of these methods. In both cases we see that ProtoStream is approximately 2 orders of magnitude faster than Streak. Moreover, in Table 1 we observe that ProtoStream achieves the same quality solution as Streak, given that the maximum objective value (of equation 6.1) is identical for both of them.

The reason for such a wide computational gap is that our algorithms only require gradient evaluations which are about for each new instance, while Streak performs function evaluations which are for each new instance as (3.1) is a quadratic optimization problem. Moreover, while ProtoStream does only function evaluations to recompute the weights after the addition of an instance to the set , Streak performs function evaluations per threshold as computing the incremental gain for every element requires such an evaluation.

### 6.2 Letters

The Letters dataset is a UCI repository dataset consisting of 20000 instances of the 26 letters in the alphabet written in 20 different fonts and 5 different styles. There are 16 attributes, which encompass statistical moments and edge counts obtained when scanning these letter images in different directions. Typically, the first 16000 instances are used for training and the remaining 4000 are used for testing. We selected up to 500 prototypes from the base set of 16000, since the gain based on (6.1) after that was marginal. The selected prototypes were then used to classify the other 4000 using a 1-NN classifier. In Figure 2 (left) we see that the accuracy of ProtoStream is almost indistinguishable from that of Streak and at times superior for some values of $m$. ProtoBasic again performs worse due to the lack of diversity elucidated above, which is validated in Figure 2 (right). We again observe in Figure 2 (center) and Table 1 that our algorithms are orders of magnitude faster than Streak, as they do not require evaluating the set function in (3.1) for every new instance, while ProtoStream still achieves the same quality (i.e. same maximum objective value) solution as Streak.

More experiments showcasing the diversity of our selection across fonts and stroke styles are given in Appendix B.1.

## 7 Discussion

In summary, we described sufficient conditions for obtaining a constant factor streaming algorithm for weakly submodular functions. Our conditions cover a rich class of functions, namely those that are RSC and RSM. As a more general result, we established that any monotonic weakly submodular function whose submodularity ratio is bounded from above has a streaming algorithm with constant approximation guarantees. We developed an extremely fast threshold free algorithm and a high performing threshold based algorithm that is still orders of magnitude faster than the state-of-the-art, at least for quadratic functions over several variables, and that closely matches the latter in practical performance. In the future, we would like to study how much our conditions can be relaxed to bridge the gap between necessity and sufficiency for the rich class of weakly submodular functions.

## Appendix A Proofs

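### A.1 Proof of Lemma 5.1

###### Proof.

Let $\mathbf{1}(S)$ be a vector with value one only at the coordinates in $S$ and zero elsewhere. For any scalars $\alpha_j \ge 0$, $j \in S$, define $y(S) = \zeta(L) + \sum_{j \in S} \alpha_j \mathbf{1}(\{j\})$. As $\zeta(L \cup S)$ is the optimal point for $f(L \cup S)$ we have

$$\begin{aligned} l(\zeta(L \cup S)) - l(\zeta(L)) &\ge l(y(S)) - l(\zeta(L)) \\ &\ge \Big\langle \nabla l(\zeta(L)), \sum_{j \in S} \alpha_j \mathbf{1}(\{j\}) \Big\rangle - \frac{\tilde{C}_k}{2} \sum_{j \in S} \alpha_j^2. \end{aligned} \tag{A.1}$$

Maximizing w.r.t. each $\alpha_j$, we get $\alpha_j = \frac{\nabla l_j^+(\zeta(L))}{\tilde{C}_k}$, where $\nabla l_j^+(\cdot) = \max(\nabla l_j(\cdot), 0)$. Substituting these values of $\alpha_j$ in (A.1) gives us the required lower bound, namely

$$l(\zeta(L \cup S)) - l(\zeta(L)) \ge \frac{1}{2\tilde{C}_k} \left\| \nabla l_S^+(\zeta(L)) \right\|^2. \tag{A.2}$$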

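### A.2 Proof of Lemma 5.2

###### Proof.

By the definition of the RSC constant $c_k$ we find

$$\begin{aligned} l(\zeta(L \cup S)) - l(\zeta(L)) &\le \langle \nabla l(\zeta(L)), \zeta(L \cup S) - \zeta(L) \rangle - \frac{c_k}{2} \left\| \zeta(L \cup S) - \zeta(L) \right\|^2 \\ &\le \max_{v \,:\, v_{(L \cup S)^c} = 0,\ v \ge 0} \ \langle \nabla l(\zeta(L)), v - \zeta(L) \rangle - \frac{c_k}{2} \left\| v - \zeta(L) \right\|^2. \end{aligned} \tag{A.3}$$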

Observe that the KKT conditions at the optimum for the function necessitates that ,

$$\zeta(L)_j > 0 \implies \nabla l_j(\zeta(L)) = 0, \qquad \zeta(L)_j = 0 \implies \nabla l_j(\zeta(L)) \le 0$$

and hence we have . When , , and maximizing w.r.t. , the maximum occurs at where . Plugging this maximum value of in (A.3) we get the upper bound

$$l(\zeta(L \cup S)) - l(\zeta(L)) \le \frac{1}{2c_k} \left\| \nabla l_S^+(\zeta(L)) \right\|^2. \tag{A.4}$$

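### A.3 Proof of Theorem 5.3

Setting $L = \emptyset$ in Lemma 5.1 we get

$$f(S) \ge \frac{\left\| \nabla l_S^+(0) \right\|^2}{2\tilde{C}_m} \ge \frac{\left\| \nabla l_{L^*}^+(0) \right\|^2}{2\tilde{C}_m} \ge \frac{c_m f(L^*)}{\tilde{C}_m}. \tag{A.5}$$

The second inequality follows from the fact that $S$ contains the $m$ elements that maximize the gradient values $\nabla l_j^+(0)$. The third inequality is obtained by setting $L = \emptyset$ and $S = L^*$ in Lemma 5.2. Setting $\kappa = \frac{c_m}{\tilde{C}_m}$ we obtain the constant factor approximation $f(S) \ge \kappa f(L^*)$.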

### A.4 Proof of Lemma 5.4

Recall that given two disjoint sets $L$ and $S$, the submodularity ratio is defined as

$$\gamma_{L,S} = \frac{\sum_{j \in S} \left[ f(L \cup \{j\}) - f(L) \right]}{f(L \cup S) - f(L)}.$$

where and . Using inequalities (A.2) and (A.4) we can bound the submodularity ratio as

$$\frac{c_{|L|+|S|}}{\tilde{C}_1} \le \gamma_{L,S} \le \frac{\tilde{C}_{|S|}}{c_{|L|+1}}. \tag{A.6}$$

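### A.5 Proof of Theorem 5.5

As the set $S$ consists of those elements whose function values on the singleton sets are the largest, we have $\sum_{j \in S} f(\{j\}) \ge \sum_{p \in L^*} f(\{p\})$; $\gamma_{\emptyset,S} \le R_m$; $\gamma_{\emptyset,L^*} \ge r_m$. When compared with the optimal set $L^*$ we find

$$f(S) = \frac{\sum_{j \in S} f(\{j\})}{\gamma_{\emptyset,S}} \ge \frac{1}{R_m} \left[ \sum_{p \in L^*} f(\{p\}) \right] = \frac{\gamma_{\emptyset,L^*}}{R_m} f(L^*) \ge \frac{r_m}{R_m} f(L^*).$$

Thus $f(S) \ge \kappa f(L^*)$ where $\kappa = \frac{r_m}{R_m}$.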

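### A.6 Proof of Lemma 5.6

Recall that an incoming element $j$ is added to the set $L_\tau$ provided

$$\nabla l_j(\zeta(L_\tau)) \ge \sqrt{\frac{2\tau}{m}}. \tag{A.7}$$

By setting $S$ to be the singleton set $\{j\}$ in Lemma 5.1 we get

$$f(L_\tau \cup \{j\}) - f(L_\tau) \ge \frac{1}{2\tilde{C}_1} \left[ \nabla l_j^+(\zeta(L_\tau)) \right]^2 \ge \frac{\tau}{\tilde{C}_1 m}.$$

So by adding $j$ to the current set $L_\tau$, the increase in the set function is at least $\frac{\tau}{\tilde{C}_1 m}$. When $|L_\tau| = m$, it follows that $f(L_\tau) \ge \frac{\tau}{\tilde{C}_1}$.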

### B.1 Letters: Fonts and Stroke Styles

As mentioned in Section 6.2, we know that the letters dataset spans 20 different fonts and 5 different stroke styles. It has been known from previous studies [7, 9] that one could cluster any letter into 20 groups and partition based on the fonts. Analogously clustering into 5 groups can largely uncover the different stroke styles.

Given this, we wanted to see if our prototypes from ProtoStream span the different fonts and styles. Since the partitions are not given, we perform k-means clustering and partition copies of each letter into 20 and then 5 groups. We assigned each of our 500 prototypes to the closest cluster based on Euclidean distance. We then plotted a histogram of the fraction of prototypes belonging to each cluster. We also compared this with assignment to randomly formed clusters, to verify that the clustering in fact carried some information.
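The assignment and histogram step can be sketched as below, assuming the cluster centroids have already been computed (for example by k-means, which is not shown). The function name and the toy data in the usage note are hypothetical.

```python
from collections import Counter

def assignment_fractions(prototypes, centroids):
    """Fraction of prototypes whose nearest centroid (Euclidean) is each cluster."""
    counts = Counter()
    for p in prototypes:
        nearest = min(
            range(len(centroids)),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
        )
        counts[nearest] += 1
    return [counts[c] / len(prototypes) for c in range(len(centroids))]
```

With two centroids at `(0.0, 0.0)` and `(10.0, 10.0)` and four prototypes split evenly around them, the function returns `[0.5, 0.5]`, the perfectly uniform case.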

These results are shown in Figures 3 and 4. The more uniform the distribution, the better. We see clearly that our prototypes are quite equitably distributed across the different clusters, and much more so than with random clusters. This implies two things. First, the clusters do capture information, possibly about fonts and styles. Second, our prototypes span these fonts and styles well, again verifying that ProtoStream selects diverse, informative instances.

### B.2 Adapting to Target Dataset

The plots in Figures 1 and 2 indirectly appraise the quality of the selected prototypes based on their accuracy in classifying a test set. In this section we design experiments from which we can directly infer the prototype selection quality by studying how well our algorithms adapt to a different test or target distribution. To this end, we create target datasets having samples only from a single class (digits in MNIST and letters in UCI). For example, we create a target dataset for the digit 1 by selecting only the examples of that digit from the original test set of 60000. Given the original source dataset, which contains (almost) an equal mix of different digits or letters, the goal is to see how well our algorithms adapt to these heavily skewed target distributions that contain only a single digit or letter. In other words, we wish to evaluate whether they still just pick a uniform distribution over all the digits or letters from the source dataset, or adapt and pick more prototypes of the target digit. Selecting prototypes from one source set that match well with a different target distribution is natural in covariate shift correction settings [1].

For going across datasets, we optimize the cost function:

$$\text{Maximize } l(w) = w^T \mu - \frac{1}{2} w^T K w \quad \text{s.t.} \quad \|w\|_0 \le m \ \text{ and } \ w \ge 0 \tag{B.1}$$

where as before is the positive definite Kernel matrix with entries and the entries of the vector contains the mean inner product of a data sample in with the target and is given by: . Here . Note that the labels of the target samples are not exposed to the algorithms. The prototype selection quality can be quantified from the percentage of selected prototypes that match target class. Higher the percentage, better is the selection quality.

We see in Figure 5 that our algorithms along with Streak do adapt to the target distribution. In fact, ProtoBasic almost exclusively picks examples of the target digit in MNIST showcasing its effectiveness in such a setting. The relative running times are similar to those reported in the main document. Given this, ProtoBasic could be the most preferred method in scenarios where the target dataset more or less contains a single class.

## References

• [1] D. Agarwal, L. Li, and A. J. Smola. Linear-Time Estimators for Propensity Scores. In Intl. Conference on Artificial Intelligence and Statistics (AISTATS), pages 93–100, 2011.
• [2] A. Badanidiyuru, B. Mirzasoleiman, A. Karbasi, and A. Krause. Streaming submodular maximization: Massive data summarization on the fly. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 671–680. ACM, 2014.
• [3] J. Bien and R. Tibshirani. Prototype Selection for Interpretable Classification. Ann. Appl. Stat., pages 2403–2424, 2011.
• [4] A. Das and D. Kempe. Submodular meets Spectral: Greedy Algorithms for Subset Selection, Sparse Approximation and Dictionary Selection. In Intl. Conference on Machine Learning (ICML), 2011.
• [5] E. Elenberg, R. Khanna, A. G. Dimakis, and S. Negahban. Restricted Strong Convexity Implies Weak Submodularity. In https://arxiv.org/abs/1612.00804, 2017.
• [6] E. R. Elenberg, A. G. Dimakis, M. Feldman, and A. Karbasi. Streaming weak submodularity: Interpreting neural networks on the fly. Advances in Neural Inf. Processing, 2017.
• [7] X. Z. Fern and C. Brodley. Cluster Ensembles for High Dimensional Clustering: An Empirical Study. Machine Learning Research, 22, January 2004.
• [8] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter. Efficient and robust automated machine learning. Advances in Neural Information Processing Systems Workshop, 12 2015.
• [9] P. W. Frey and D. J. Slate. Letter recognition using holland-style adaptive classifiers. Machine Learning, 6(2), 1991.
• [10] S. Fujishige. Submodular functions and optimization. Number 58 in Annals of Discrete Mathematics. Elsevier Science, 2 edition, 2005.
• [11] K. Gurumoorthy, A. Dhurandhar, and G. Cecchi. Protodash: Fast interpretable prototype selection. In https://arxiv.org/abs/1707.01212v2, 2017.
• [12] B. Kim, R. Khanna, and O. Koyejo. Examples are not Enough, Learn to Criticize! Criticism for Interpretability. In Conference on Neural Information Processing Systems (NIPS), 2016.
• [13] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.
• [14] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In NIPS, pages 556–562. MIT Press, 2001.
• [15] L. Lovász. Mathematical programming – The State of the Art, chapter Submodular Functions and Convexity, pages 235–257. Springer, 1983.
• [16] S. Negahban, B. Yu, M. J. Wainwright, and P. K. Ravikumar. A unified framework for high-dimensional analysis of m-estimators with decomposable regularizers. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1348–1356. 2009.
• [17] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An Analysis of Approximations for Maximizing Submodular Set Functions. Math. Program., 14:265–294, December 1978.