Robust Classification by Pre-conditioned LASSO and Transductive Diffusion Component Analysis

11/19/2015 · Yanwei Fu et al. · Disney Research, Stanford University

Modern machine learning-based recognition approaches require large-scale datasets with a large number of labelled training images. However, such datasets are inherently difficult and costly to collect and annotate. Hence there is great and growing interest in automatic dataset collection methods that can leverage the web, albeit in an unreliable way. Collecting datasets in this way, however, requires robust and efficient ways of detecting and excluding outliers, which are common and prevalent. So far, there has been limited effort in the machine learning community to directly detect outliers for robust classification. Inspired by recent work on Pre-conditioned LASSO, this paper formulates the outlier detection task as a Pre-conditioned LASSO problem and employs unsupervised transductive diffusion component analysis both to integrate the topological structure of the data manifold, from labeled and unlabeled instances, and to reduce the feature dimensionality. Synthetic experiments as well as results on two real-world classification tasks show that our framework can robustly detect outliers and improve classification.


1 Introduction

In recent years, machine learning-based approaches have profoundly helped to push the performance of computer vision algorithms. Most modern computer vision recognition approaches take the form of supervised learning and rely on large corpora of labeled data to train classification (or regression) models. To a large extent such corpora (e.g., ImageNet by Deng et al. (2009), MSR COCO by Lin et al. (2014)) have been collected from the web by searching for query terms relevant to a particular object label and then verifying consistency through crowdsourced labeling (e.g., using Amazon Mechanical Turk). However, such data collection methods, where every image needs to be verified or labeled by one or more annotators, are costly and difficult to scale, particularly to multi-label datasets.

To address the difficulties of data collection and annotation, a number of automated and semi-automated ways to leverage web data have been proposed. For example, Berg & Forsyth (2006), and later NEIL by Chen et al. (2013), proposed to first cluster the web-collected images, either by Latent Dirichlet Allocation (Berg & Forsyth (2006)) or exemplar-based affinity propagation (Chen et al. (2013)), and then either label the clusters (Berg & Forsyth (2006)) or attempt to assign query labels to the largest clusters directly. Such automated or semi-automated methods of dataset construction, by and large, result in datasets and learned models that are plagued with outliers. Outliers are problematic as they can adversely bias the decision boundary of the classifiers and degrade the overall performance. Further, ideally one would want to eliminate human labeling altogether from the data collection process, instead treating results returned from a search engine as weakly-supervised noisy data, shifting the onus onto learning to identify and be robust to outliers.

Detecting outliers in training datasets is, however, extremely challenging for two reasons. (1) Masking and swamping of outliers (She & Owen (2011)). Masking refers to a phenomenon that arises when one outlying observation is detected and left out: the remaining outliers can cause the resulting model to be less accurate, thereby making the removed outlier look like an inlier. As an example, consider fitting a line to a set of inlier observations and two outliers on opposite sides of the line; removing one of the outliers will actually cause more, not less, bias in the resulting regression. Swamping refers to the fact that outliers can make some inliers look like outliers, which leads to inlier observations being removed, reducing the accuracy of the learned model; swamping becomes more serious in the presence of multiple outliers. (2) High dimensionality of the feature representation and the lack of good similarity metrics in the feature space. Ideally, inliers and outliers should have different low-level feature distributions. However, due to the aforementioned factors it is extremely difficult to use distributions in the feature space for outlier detection.

Inspired by the theoretical analysis of Pre-conditioned LASSO in Wauthier et al. (2013), we design a general framework for outlier detection. We formulate outlier detection for multi-class classification as a Pre-conditioned LASSO problem. We further design an unsupervised transductive diffusion component analysis (TDCA) for feature dimension reduction, to meet the conditions needed to recover the signed support of the true outliers. TDCA, by construction, also limits the negative effects of data bias between web data and unlabelled data through (1) its diffusion on transductive graphs (Eq (8)); and (2) softmax embedding for inferring more comparable node features (Eq (9)). Formally, for multi-class classification problems, we assume there are true feature coefficient vectors (e.g., β) which can help infer the corresponding class labels of all instances. In particular, the inferred label ŷ_i of instance i follows a Gaussian distribution in the label space: ŷ_i ∼ N(y_i, σ²), where y_i is the ground-truth label of instance i and σ² is the variance.

For simplicity, labels for a multi-class problem can be encoded using real-valued numbers (e.g., label instances of class 1 as 1, instances of class 2 as 2, etc.). Note that this effectively converts the multi-class classification problem into a regression problem. The low-level features of the instances are assumed to form a low-dimensional manifold in a high-dimensional feature space (Roweis & Saul (2000)),

y = Xβ + γ + ε,   (1)

where X ∈ R^{n×p} is the low-level feature matrix, with n being the number of training instances and p the feature dimension; β are the coefficients of the low-level features and y is an n-dimensional label vector; ε stands for the Gaussian noise in the model. γ is the instance-wise sparse outlier vector: γ_i is nonzero if training instance i is an outlier; otherwise, for an inlier i, γ_i is zero.
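The generative model of Eq (1) can be simulated directly; in this sketch the sizes, the noise scale, and the outlier magnitude are arbitrary illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5                          # n training instances, p feature dimensions

X = rng.normal(size=(n, p))            # low-level feature matrix
beta = rng.normal(size=p)              # true feature coefficients
gamma = np.zeros(n)                    # instance-wise sparse outlier vector
outlier_idx = rng.choice(n, size=5, replace=False)
gamma[outlier_idx] = 10.0              # outliers receive a large mean shift
eps = rng.normal(scale=0.1, size=n)    # Gaussian noise

y = X @ beta + gamma + eps             # Eq (1): y = X beta + gamma + eps
```

Note that γ is sparse: only the few shifted instances have nonzero entries, which is the structure the LASSO penalty below exploits.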

2 Related work

The problem of leveraging data from the web to learn recognition models with minimal additional human supervision dates back to the work of Fergus et al. (2004, 2005) and Bergamo & Torresani (2010), where attempts were made at re-ranking images obtained from Google Image Search using visual information. In Fergus et al. (2004) the models for re-ranking of images were learned either in an unsupervised mode, where all returned images were used, or in a relevance feedback mode, where the user is tasked with annotating a few images. Berg & Forsyth (2006) constructed a dataset automatically for several animal categories: Latent Dirichlet allocation was used to identify the latent image topics and corresponding images; users were then asked to judge whether clusters were relevant or not. Other forms of clustering or latent models have also been tried, e.g., pLSA in Fergus et al. (2010) and exemplar-based affinity propagation in Chen et al. (2013). Schroff et al. (2007) used a combination of textual and image-based analysis to arrive at the training data. A few attempts have been made to shift the onus onto the learning algorithms, by proposing active learning (Collins et al. (2008)) or incremental learning (Li et al. (2007)) architectures that implement forms of iterative model or image-ranking refinement. To the best of our knowledge, none of these methods (see the survey of Frenay & Verleysen (2014)) provided a theoretically sound way of dealing with outliers, and they may suffer from masking and swamping problems; our model is an attempt to address this explicitly, particularly in the image classification context.

In statistics and economics, the variable γ in Eq (1) is also called an incidental parameter (Fan et al. (2012)), first studied in Neyman & Scott (1948). Recently, robust regression methods with instance-wise outlier indicators (Eq (1)) have been studied in She & Owen (2011); Witten (2013); Katayama & Fujisawa (2015); Nguyen & Tran (2013). They showed that penalized least squares with an ℓ1 penalty (also known as soft-thresholding; Eq (2)) is in fact equivalent to the Huber M-estimator (Huber (1981); Gannaz (2007)). Thus they introduced a non-convex penalty (i.e., hard-thresholding) for outlier detection. In contrast, our framework is inspired by the analysis of Pre-conditioned LASSO (Wauthier et al. (2013)). In many practical applications, the LASSO suffers from high-dimensional and correlated features. We thus introduce TDCA, which plays a vital role in making our outlier detection work; otherwise, soft-thresholding fails, as reported in She & Owen (2011) (and shown in our AwA experiments). Fu et al. (2014, to appear) discussed the problem of outlier detection using a LASSO framework on crowdsourced pairwise comparison graphs. In their work, the incidence matrix in their Eq (2) greatly simplifies the problem by decomposing the original pairwise graph space into gradient flow and cyclic flow, with outliers projected only into the cyclic flow. In contrast, our approach (P-LASSO-TDCA) aims at a more general classification scenario, learning from noisy web data without assuming such a specific graph structure.

Graph-based transductive learning methods (e.g., Zhu (2007)) have attracted considerable attention in recent years. Their benefit is that graphs can capture the manifold structure of the data in a transductive setting. However, the potentially high dimensionality of each node (in the graph, each node may interact with all other nodes, and such interactions are taken as the node's features on the graph) makes the properties of such graphs hard to analyze. Classical linear dimensionality reduction techniques such as principal component analysis do not work for dimensionality reduction on graphs, because they fail to encode the graph's topological structure. In contrast, softmax embedding (Eq (9)) has been around for more than a decade (Hinton & Roweis (2002); van der Maaten & Hinton (2008)); it improves over 'traditional' manifold dimension reduction methods (e.g., LLE, Roweis & Saul (2000)), in which widely separated data points can be "collapsed" as near neighbors in the low-dimensional space. Beyond softmax embedding, our TDCA is further built on diffusion maps and unsupervised transductive learning to capture the topological structure of both labelled and unlabelled data for feature dimension reduction. Note that the idea of using diffusion maps for dimension reduction has been used in Cho et al. (2015), whilst our TDCA performs transductive learning, designed for weakly labelled tasks, by computing the concept manifold over both training and testing images.

3 Main Approach (P-LASSO-TDCA)

Our approach has two parts, which focus on the label space and the low-level feature space, respectively. In the label space, we formulate a Pre-conditioned LASSO for outlier detection, while TDCA is utilized for graph-based feature dimension reduction in the feature space. Once outliers are found, the TDCA features of the inliers can be used to train any classifier of choice.

3.1 Pre-conditioned LASSO (P-LASSO)

According to the Gauss–Markov theorem, the best linear unbiased estimator for a linear regression model, in the absence of outliers, is the ordinary least squares (OLS) estimator. We assume the presence of outliers in our training set, but that those outliers are sparse. These two observations, in conjunction, lead to the following problem formulation:

min_{β,γ} ‖y − Xβ − γ‖₂² + λ‖γ‖₁.   (2)

In other words, having subtracted the sparse set of outliers from the data, we assume OLS for the remaining inliers. Taking the derivative with respect to β and setting it to zero results in:

β̂ = (XᵀX)⁻¹ Xᵀ (y − γ).   (3)

Putting Eq (3) back into Eq (2), we can configure a LASSO for the outlier variable γ,

min_γ ‖(y − γ) − X(XᵀX)⁻¹Xᵀ(y − γ)‖₂² + λ‖γ‖₁.

We introduce here the hat matrix H = X(XᵀX)⁻¹Xᵀ. The hat matrix is a symmetric (Hᵀ = H) and idempotent (HH = H) matrix. Thus the above equation can be simplified into,

min_γ ‖(I − H)y − (I − H)γ‖₂² + λ‖γ‖₁,

where we write ỹ = (I − H)y and γ̃ = (I − H)γ, and I is the n × n identity matrix; λ controls the amount of regularization on γ: λ = 0 simplifies the problem to OLS, while λ → ∞ shrinks all γ_i to 0.
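The two stated properties of the hat matrix are easy to verify numerically; this quick sanity check uses an arbitrary full-rank X and is not part of the method itself:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))                 # any full-rank n x p matrix with n > p
H = X @ np.linalg.solve(X.T @ X, X.T)        # hat matrix H = X (X^T X)^{-1} X^T

print(np.allclose(H, H.T))    # symmetric:  H^T = H
print(np.allclose(H @ H, H))  # idempotent: H H = H
```

Both checks print True; idempotence is what lets ‖(I − H)(y − γ)‖² be rewritten with a single projection.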

Let X have a full Singular Value Decomposition (SVD) (in this paper, we assume n > p; if not, we can use TDCA for dimension reduction) as in Fu et al. (2014): X = UDVᵀ and U = [U₁, U₂], where U₁ and U₂ are the orthogonal bases of the column space of X and of its orthogonal complement (the kernel space of Xᵀ), respectively (U₁ ∈ R^{n×p} and U₂ ∈ R^{n×(n−p)}). Uᵀ is the conjugate transpose of U. We have I − H = U₂U₂ᵀ and U₂ᵀU₂ = I. Eq (3.1) can thus be simplified into the following pre-conditioned LASSO (Wauthier et al. (2013)),

min_γ ‖U₂ᵀy − U₂ᵀγ‖₂² + λ‖γ‖₁.   (5)

So the number of 'effective' observations in Eq (5) is n − p. The above equation is actually trying to recover γ from the projected observations U₂ᵀy = U₂ᵀγ + U₂ᵀε.

Let us review some basic facts about LASSO:

  1. LASSO is sign consistent if there exists a sequence λ_n (the subscript indicates that λ is a function of n) such that P(sign(γ̂(λ_n)) = sign(γ*)) → 1 as n → ∞ (i.e., n − p → ∞ in Eq (5)), and if the Irrepresentable Condition holds (Fan & Li (2001); Zhao & Yu (2006); Wainwright (2009)). Here sign(γ̂) = sign(γ*) iff the two vectors agree in sign element-wise; γ* indicates the true sparse outliers with the corresponding signed support.

  2. Eq (3.1) and Eq (5) are also called Pre-conditioned LASSO. The more general conditions for signed support recovery are thoroughly analyzed in Wauthier et al. (2013).

  3. The solution of LASSO is piecewise linear in λ as λ varies from ∞ to 0. The regularization path of LASSO can be obtained as efficiently as solving a "ridge regression" (Hastie et al. (2009)).

  4. The LASSO estimator is biased (Hastie et al. (2009)): it will bias the estimated non-zero coefficients of γ toward zero.

  5. Standard cross-validation on Eq (5) may also not work: each instance is associated with an outlier variable, which makes classical leave-out cross-validation unstable; there is no information to assign values to the outlier variables of the left-out samples.

Due to facts 4 and 5 above, and further inspired by Fu et al. (2014, to appear) (as well as the suggestions on page 91 of Hastie et al. (2009)), we turn the problem of solving for γ in Eq (3.1) into a problem of ordering training instances along the regularization path. Specifically, we compute the regularization path of the LASSO as λ is changed from ∞ to 0; the LASSO will first select the variable subset accounting for the highest variance in the observations, as noted in She & Owen (2011); Fan et al. (2012). Such a subset will be assigned nonzero elements γ_i in Eq (3.1) and thus has a higher likelihood of being outliers. We can therefore order the samples by checking when their γ_i become nonzero as λ is changed from ∞ to 0. The top subset of this ordered list is taken as the outliers in our problems. Furthermore, we can do cross-validation with respect to λ. Specifically, for each λ value, we take as outliers those instances that have nonzero coefficient values γ_i and leave them out. Then we do cross-validation on the remaining training set and select the subset that achieves the highest accuracy on withheld data.
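A minimal sketch of this ordering procedure, using scikit-learn's `lasso_path` on the pre-conditioned problem of Eq (5); the function and variable names are our own, and the paper does not prescribe an implementation:

```python
import numpy as np
from sklearn.linear_model import lasso_path

def order_by_outlierness(X, y):
    """Rank training instances by where their outlier coefficient gamma_i
    first becomes nonzero on the LASSO regularization path of the
    pre-conditioned problem (Eq (5)). Assumes n > p."""
    n, p = X.shape
    U, _, _ = np.linalg.svd(X, full_matrices=True)
    U2 = U[:, p:]                            # orthogonal basis of the kernel space of X^T
    # Pre-conditioned LASSO: min ||U2^T y - U2^T gamma||^2 + lam * ||gamma||_1
    alphas, coefs, _ = lasso_path(U2.T, U2.T @ y)
    first_active = np.full(n, -1.0)          # lambda at which gamma_i first turns on
    for i in range(n):
        nz = np.nonzero(coefs[i])[0]
        if nz.size:
            first_active[i] = alphas[nz[0]]  # alphas are in decreasing order
    return np.argsort(-first_active)         # earliest-activated (most outlying) first
```

Instances whose coefficients activate at large λ head the returned ordering; the top of the list is then taken as the outlier candidates, or cross-validated over as described above.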

3.2 Transductive Diffusion Component Analysis

For sign consistency in the LASSO, and to recover the signed support of the outlier variables γ, our task requires n − p to be large in Eq (5); we can either increase the number of training instances n (always an option given the high number of instances on the Internet; however, the more instances we take, the more outliers there will likely be, as the fraction of outliers increases with the number of responses to a query) or reduce the feature dimension p. Furthermore, correlated variables (features) are a perennial problem for the LASSO and frequently lead to systematic failures. It is essential to both reduce the feature dimension and disentangle the correlated feature dimensions.

Here we discuss using TDCA for feature dimension reduction. Another difficulty in solving the LASSO is that the columns of the design matrix are correlated (Wauthier et al. (2013)); we hope that the dimensionality reduction process can help alleviate this by removing correlations and redundancies in the data. Our final observation is that high-dimensional data, such as image features, often lie on a low-dimensional manifold (Roweis & Saul (2000)). To leverage this, we want a manifold-aware or manifold-based dimensionality reduction technique which is also robust to noise and outliers. This motivates our transductive diffusion component analysis for feature dimension reduction.

Diffusion-based methods are normally used to model the topological structure of data. However, the web training datasets that we target contain far more noise and outlier data points. To prevent these artifacts from degrading results, we introduce transductive diffusion component analysis to help unravel transductive graphs composed of both training and testing data.

(1) Transductive Graph Construction. Suppose we construct a graph with nodes corresponding to training and testing images. Assuming f_i and f_j are the original low-level features of nodes i and j in our graph, the similarity weight between these two nodes is defined as:

w_ij = κ_σ(f_i, f_j),   (6)

where κ_σ(f_i, f_j) is the square of the inner product of the features of nodes i and j, with a free parameter σ. For computational efficiency, we use a k-nearest-neighbour graph instead of a fully connected graph. Thus we have the graph G, and the transition probability between instances i and j is defined as,

p_ij = w_ij / Σ_{l∈N_k(i)} w_il,   (7)

where the sum is over the k-nearest-neighbour set N_k(i) of node i, and node j is also within this set.
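A sketch of the construction in Eqs (6)-(7); for simplicity this uses a Gaussian kernel on Euclidean distance as the similarity weight, a common substitute for the paper's squared-inner-product kernel:

```python
import numpy as np

def knn_transition_matrix(F, k=10, sigma=1.0):
    """Build the row-stochastic transition matrix of Eqs (6)-(7) on a
    k-nearest-neighbour graph. F is the (n x d) matrix of node features."""
    n = F.shape[0]
    d2 = ((F[:, None, :] - F[None, :, :]) ** 2).sum(axis=-1)  # pairwise squared distances
    W = np.exp(-d2 / (2.0 * sigma ** 2))                      # similarity weights w_ij
    np.fill_diagonal(W, 0.0)                                  # no self-loops
    for i in range(n):                                        # keep only the k nearest neighbours
        drop = np.argsort(-W[i])[k:]
        W[i, drop] = 0.0
    return W / W.sum(axis=1, keepdims=True)                   # Eq (7): p_ij = w_ij / sum_l w_il
```

Each row of the returned matrix sums to one, so it can be fed directly into the lazy random walk below; the paper's own kernel can be swapped in by replacing the `W` computation.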

(2) Lazy Random Walk (LRW). LRW is used here to guarantee convergence to a stationary distribution. The LRW from node i is defined as in Zhou et al. (2003); Lafon & Lee (2006),

s_i^{t+1} = (1 − p_r) s_i^t P + p_r e_i,   (8)

where P = [p_jl] is the transition matrix of Eq (7), e_i is the one-hot restart distribution of node i, and s_i^0 = e_i. The p_r is the restart probability (usually set to 0.5), balancing the influence of local and global topological information in the diffusion; s_i^t is the diffusion state of node i at the t-th step. The stationary diffusion state is defined as s_i = lim_{t→∞} s_i^t for node i.
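The fixed-point iteration of Eq (8) can be run for all start nodes at once; a short sketch (the stationary states also admit the closed form p_r (I − (1 − p_r)P)⁻¹, useful as a check):

```python
import numpy as np

def lazy_random_walk(P, restart=0.5, tol=1e-10, max_iter=10000):
    """Iterate Eq (8), s^{t+1} = (1 - p_r) s^t P + p_r e_i, simultaneously for
    every start node i. Row i of the result is the stationary diffusion state
    s_i of node i; P must be row-stochastic."""
    n = P.shape[0]
    S = np.eye(n)                                   # s_i^0 = e_i for each node
    for _ in range(max_iter):
        S_next = (1.0 - restart) * S @ P + restart * np.eye(n)
        if np.abs(S_next - S).max() < tol:
            break
        S = S_next
    return S_next
```

Because the restart term damps the walk by a factor (1 − p_r) per step, the iteration converges geometrically, and each stationary row remains a probability distribution.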

(3) Softmax Embedding for Dimensionality Reduction. Noise and outlier data samples cause missing or spurious interactions in the graph, and thus have a significant negative impact on the random walks. Softmax normalization, in contrast, is a way of reducing the influence of extreme values or outliers in the data. Thus, we employ the softmax function to approximate the probability ŝ_ij assigned to node j in the diffusion state of node i, i.e.,

ŝ_ij = exp(x_iᵀ w_j) / Σ_l exp(x_iᵀ w_l),   (9)

where the original feature of node i is unraveled and reduced into two vector representations, x_i and w_i (both usually of much lower dimension than s_i), which model the topological structure of node i in the graph. The x_i is referred to as the node features while w_i is the context features, which capture the connections of node i with the other nodes (Cho et al. (2015)). For undirected graphs, x_i and w_i should be close in direction (in the sense of cosine similarity), and both capture fine-grained topological structures which can be useful for classification tasks.

We collect all x_i and w_i into vectors x and w. They are computed by solving the following optimization problem,

min_{x,w} Σ_i D_KL(s_i ‖ ŝ_i),   (10)

where the KL-divergence (relative entropy) D_KL is used as the objective function. This optimization problem can be efficiently solved using L-BFGS (Liu & Nocedal (1989)). Here, we use the node features x_i as the reduced low-level feature set for the Pre-conditioned LASSO in Eq (5). After removing the outliers, we use these features, with a linear SVM classifier, for classification.
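The embedding of Eqs (9)-(10) can be sketched with SciPy's L-BFGS; this simplified version leaves the gradients to finite differences (a full implementation would supply analytic gradients), and the names are our own:

```python
import numpy as np
from scipy.optimize import minimize

def tdca_embed(S, d=2, seed=0):
    """Learn node features x_i and context features w_j so that
    softmax_j(x_i . w_j) approximates the diffusion state s_ij (Eq (9)),
    minimising the summed KL divergence of Eq (10) with L-BFGS."""
    n = S.shape[0]
    rng = np.random.default_rng(seed)
    z0 = 0.1 * rng.standard_normal(2 * n * d)       # stacked [x; w] parameters

    def unpack(z):
        return z[: n * d].reshape(n, d), z[n * d :].reshape(n, d)

    def kl_objective(z):
        Xn, Wn = unpack(z)
        logits = Xn @ Wn.T
        logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
        log_q = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -(S * log_q).sum()   # sum_i KL(s_i || q_i), up to a constant in S

    res = minimize(kl_objective, z0, method="L-BFGS-B")
    return unpack(res.x)
```

The returned node features play the role of the reduced low-level feature set fed into the Pre-conditioned LASSO of Eq (5).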

4 Experiments

4.1 Synthetic experiments

We use synthetic experiments to evaluate the efficacy of Pre-conditioned LASSO. Specifically, we generate 3 different classes as Gaussian clusters in the feature space. Each class has 100 generated instances. We add outliers for each class, uniformly sampled from a neighborhood around the mean of that class. Note that the outliers, typically, have larger magnitude than the Gaussian noise. The data is visualized in Fig. 1 (left): red indicates the outliers and blue the inliers for the three classes. We assign a distinct real-valued label to each class. The instances are indexed so that inliers come first, followed by the outliers. The regularization path is shown in Fig. 1 (middle), with the index of each instance (curve) labelled. The graph shows that the outliers are encountered first (i.e., obtain nonzero γ_i) as λ changes from ∞ to 0 along the regularization path. The fact that blue inliers are encountered far to the left of Fig. 1 (middle), after most outliers, indicates that outliers can be effectively identified. The number of instances removed is exactly the same as the number of outliers (we use this cutoff value only to facilitate the evaluation in Fig. 1 (right); note that (1) this value does not affect the P-LASSO step; and (2) real applications usually do not know this number, but we can roughly estimate the outlier percentage by randomly sampling a small portion of instances and manually identifying the outliers); and we can compute the accuracy of outlier detection in all the experiments shown in Fig. 1 (right).

The ability to detect outliers is clearly a function of the number of outliers themselves. It is typically much more difficult to detect outliers when their fraction, relative to the inliers, is large. To validate how our approach handles such a scenario, we do an additional experiment where we vary the percentage of the outliers. Specifically, we keep the 300 inliers while we vary the percentage of outliers up to 150% of the number of inliers. Note that with 150% outliers, the number of outliers is 50% larger than the number of inliers, making outlier detection rather challenging. We use our P-LASSO framework to detect the corresponding outliers. The accuracy of outlier detection is shown in Fig. 1 (right). The experiments are repeated 10 times, with mean and standard deviation bars shown. Fig. 1 (right) shows that the accuracy of outlier detection does not drop significantly as the ratio of outliers increases, making the approach applicable to scenarios with a large number of outliers. Significantly, even at the largest outlier percentage, our algorithm can still remove most of the outliers. This further validates the efficacy of our P-LASSO framework.

Figure 1: Synthetic experiment: the experimental setup and data are illustrated on the (left); (middle) shows the regularization path of P-LASSO. Notably, the fact that outliers appear to the right of the inliers on the regularization path indicates the ability of P-LASSO to effectively detect outliers. On the (right), the accuracy of outlier detection is illustrated as a function of the percentage of outliers with respect to the inliers. See text for further discussion.

4.2 Labeling Actor faces using Google Images

We apply our approach to automatic actor face labeling in the context of the “Buffy the Vampire Slayer” dataset (Bauml et al. (2013); Everingham et al. (2006)). The dataset consists of episodes 1–6 from season 5 of the show (BF-1 to BF-6), with labels provided by Bauml et al. (2013). Previous work requires scripts, manual labeling, or both as the training data for classification (Tapaswi et al. (2014)). In contrast, we attempt to recognize the actors in a fully automatic setting using only web training images. For each episode of Buffy, the cast list is downloadable from the IMDB website, and for each actor we issue three queries to Google: “actor_name”, “TV_series_name + actor_name”, and “TV_series_name + character_name”, and download the top 20 results from Google image search as our training images. The main challenge of using web data is that the downloaded images can be very noisy. We show how our framework deals with this challenge and achieves reasonably good performance using only web images. The raw features we use have 128 dimensions and are extracted using a standard face pipeline (Bauml et al. (2013)): by detecting facial landmarks, aligning the faces, and describing each face by Fisher Vector faces with large-margin dimensionality reduction (Simonyan et al. (2013)). A number of different actors appear in the Buffy series. The ground-truth is provided by Bauml et al. (2013); classification accuracy results are reported in Tab. 1. We set reduced dimensionalities for x_i and w_i respectively in TDCA. A linear SVM is used as the final classifier once outliers are removed. In this experiment, we always have n > p.

We compare our framework (P-LASSO-TDCA) with five different baselines. Raw: directly using raw features for classification; LRW: constructing transductive graphs of labelled and unlabelled data and then using lazy random walk (Zhou et al. (2003)) for classification; TDCA: using the reduced features for classification; P-LASSO: using soft-thresholding with raw features to detect outliers, then classifying on the remaining data; IPOD: using hard-thresholding (She & Owen (2011)) with raw features to detect outliers and then classifying on the remaining data. (Note that the IPOD algorithm is non-convex; for a fairer comparison, we employ the regularization path of P-LASSO to initialize IPOD.)

The results are shown in Tab. 1. (1) The averaged results of our P-LASSO-TDCA are better than those of all the other alternatives. This shows the efficacy of our framework. (2) To further evaluate the significance of each component, we compare the results of RAW, LRW and TDCA: LRW is better than RAW on BF-1, BF-2, and BF-3, while RAW is better than LRW on BF-5 and BF-6. We argue that these differences are caused by the outlier images downloaded from the web: the training images of BF-5 and BF-6 have a higher ratio of outliers, which disrupted the random-walk label propagation. TDCA is always better than RAW and LRW, since it employs the softmax function to approximate the stationary distribution of the graph and unravel the topological structure into low-dimensional representations. Both mechanisms limit the negative effects of outliers in the original label propagation steps: (a) softmax normalization in Eq (9) reduces the influence of extreme values and makes the representation of each node more consistent; (b) SVM classification on the low-dimensional representations avoids the failure mode of LRW label propagation, in which the labels of outliers propagate faster than those of good observations.

P-LASSO and IPOD with raw features are compared as alternatives to our framework. Even with raw features, P-LASSO is still able to detect and leave out some true positive outliers; thus P-LASSO with raw features also improves upon the results of RAW. IPOD is initialized by P-LASSO and also detects outliers well. However, IPOD uses a non-convex penalty on the outliers. It works in most cases (BF-3, BF-5, BF-6) but fails on BF-1, BF-2, and BF-4, which makes its average performance worse than that of P-LASSO-TDCA. Note that our results rely solely on the training images downloaded from the web, and thus the reported performances are still lower than those reported in Bauml et al. (2013), which utilizes script data to obtain high-quality training images within the same video. Using only web supervision, in this case, is very challenging since the faces retrieved by actor name may not necessarily be related to those in the TV series, and thus the facial appearance, and even the hairstyles, may be different.

BF-1 BF-2 BF-3 BF-4 BF-5 BF-6 Avg
RAW 0.2644 0.1641 0.2146 0.1760 0.2863 0.2349 0.2234
LRW (Zhou et al. (2003)) 0.3534 0.2274 0.2516 0.2311 0.2023 0.2219 0.2480
TDCA (Cho et al. (2015) ) 0.4110 0.2461 0.2590 0.2671 0.2723 0.2321 0.2813
P-LASSO (Wauthier et al. (2013)) 0.4123 0.2503 0.2618 0.2407 0.3130 0.2396 0.2863
IPOD (She & Owen (2011)) 0.3861 0.2866 0.2886 0.2455 0.3651 0.2618 0.3056
P-LASSO-TDCA 0.4450 0.2980 0.2683 0.2971 0.3473 0.2461 0.3170
Table 1: Results of labelling actor faces.
Figure 2: Regularization paths of P-LASSO, IPOD and P-LASSO-TDCA.

4.3 Animals with Attribute (AwA) dataset

We use the AwA dataset to further evaluate our framework. The AwA dataset consists of 50 classes of animals. We randomly select a set of images from each class as the testing set. We use the animal names as keywords to automatically download web images for each class; this gives us 1102 training images in total. Overfeat features (Sermanet et al. (2014); 4096-dimensional) are used for all the images. Since the animal names are unambiguous, the training images are of much better quality than in the last experiment. Thus this experiment is employed to reveal the differences between IPOD, P-LASSO and P-LASSO-TDCA. We reduce the dimensions for x_i and w_i respectively in TDCA. In this case, TDCA boosts the RAW result, and our P-LASSO-TDCA further improves on TDCA, with accuracy surprisingly comparable to the result obtained using standard supervised learning that uses the same number of instances (20) from the hand-annotated training split of the AwA dataset to train a model.

We compare IPOD, P-LASSO and P-LASSO-TDCA. The feature dimension (4096) is much larger than the number of training instances (1102). Specifically, we compare the regularization paths generated by P-LASSO, IPOD and our P-LASSO-TDCA. The results are shown in Fig. 2. Following the regularization path computed in Fig. 2, we list the top several outliers detected by P-LASSO, IPOD and P-LASSO-TDCA in Fig. 2 (above). The green boxes indicate successfully detected outliers; the red boxes indicate failures. The style of the bobcat image makes it more similar to the tiger class; in the moose image, the animal is walking on a highway, which is atypical. The pig image is also atypical since it is the only black pig image in the entire training data. As for P-LASSO and IPOD on the AwA dataset, we have p > n. In this case, as shown in Fig. 2, P-LASSO and IPOD are very conservative and can only detect two outliers (basically the same images in different categories). IPOD is also better than P-LASSO in removing the false positive instance. However, neither IPOD nor P-LASSO improved the classification accuracy (it remains the same as RAW).

5 Conclusion

This paper introduced a novel framework, P-LASSO-TDCA, for robust classification. Inspired by the recent theoretical analysis of Pre-conditioned LASSO (Wauthier et al. (2013)), we employ P-LASSO and its regularization path for direct outlier detection in the label space; we also design TDCA for manifold-based feature dimension reduction in the feature space. The experiments validate the efficacy of our framework.

References

  • Bauml et al. (2013) Bauml, Martin, Tapaswi, Makarand, and Stiefelhagen, Rainer. Semi-supervised learning with constraints for person identification in multimedia data. In CVPR, 2013.
  • Berg & Forsyth (2006) Berg, T.L. and Forsyth, D.A. Animals on the web. In CVPR, 2006.
  • Bergamo & Torresani (2010) Bergamo, Alessandro and Torresani, Lorenzo. Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach. In Neural Information Processing Systems (NIPS), 2010.
  • Chen et al. (2013) Chen, Xinlei, Shrivastava, Abhinav, and Gupta, Abhinav. Neil: Extracting visual knowledge from web data. In ICCV, 2013.
  • Cho et al. (2015) Cho, Hyunghoon, Berger, Bonnie, and Peng, Jian. Diffusion component analysis: Unraveling functional topology in biological networks. In Annual International Conference on Research in Computational Molecular Biology, 2015.
  • Collins et al. (2008) Collins, Brendan, Deng, Jia, Li, Kai, and Fei-Fei, Li. Towards scalable dataset construction: An active learning approach. In ECCV, 2008.
  • Deng et al. (2009) Deng, Jia, Dong, Wei, Socher, R., Li, Li-Jia, Li, Kai, and Fei-Fei, Li. Imagenet: A large-scale hierarchical image database. In CVPR, 2009. doi: 10.1109/CVPR.2009.5206848.
  • Everingham et al. (2006) Everingham, M., Sivic, J., and Zisserman, A. “Hello! My name is… Buffy” – automatic naming of characters in TV video. In BMVC, 2006.
  • Fan & Li (2001) Fan, Jianqing and Li, Runze. Variable selection via nonconcave penalized likelihood and its oracle properties. JASA, 2001.
  • Fan et al. (2012) Fan, Jianqing, Tang, Runlong, and Shi, Xiaofeng. Partial consistency with sparse incidental parameters. arXiv:1210.6950, 2012.
  • Fergus et al. (2004) Fergus, R., Perona, P., and Zisserman, A. A visual category filter for google images. In ECCV, 2004.
  • Fergus et al. (2005) Fergus, R., Fei-Fei, L., Perona, P., and Zisserman, A. Learning object categories from google’s image search. In ICCV, 2005.
  • Fergus et al. (2010) Fergus, R., Fei-Fei, Li, Perona, P., and Zisserman, A. Learning object categories from internet image searches. Proceedings of the IEEE, 2010.
  • Frenay & Verleysen (2014) Frenay, B. and Verleysen, M. Classification in the presence of label noise: A survey. IEEE Trans. Neural Networks and Learning Systems, 2014.
  • Fu et al. (2014) Fu, Yanwei, Hospedales, Timothy M., Xiang, Tao, Gong, Shaogang, and Yao, Yuan. Interestingness prediction by robust learning to rank. In ECCV, 2014.
  • Fu et al. (to appear) Fu, Yanwei, Hospedales, Timothy M., Xiang, Tao, Xiong, Jiechao, Gong, Shaogang, Wang, Yizhou, and Yao, Yuan. Robust subjective visual property prediction from crowdsourced pairwise labels. IEEE TPAMI, to appear.
  • Gannaz (2007) Gannaz, Irène. Robust estimation and wavelet thresholding in partial linear models. Stat. Comput., 17:293–310, 2007.
  • Hastie et al. (2009) Hastie, Trevor, Tibshirani, Robert, and Friedman, Jerome. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer, 2009.
  • Hinton & Roweis (2002) Hinton, Geoffrey and Roweis, Sam. Stochastic neighbor embedding. In NIPS, 2002.
  • Huber (1981) Huber, P. J. Robust Statistics. New York: Wiley, 1981.
  • Katayama & Fujisawa (2015) Katayama, Shota and Fujisawa, Hironori. Sparse and robust linear regression: An optimization algorithm and its statistical properties. arXiv:1505.05257, 2015.
  • Lafon & Lee (2006) Lafon, Stephane and Lee, Ann B. Diffusion maps and coarse-graining: A unified framework for dimensionality reduction, graph partitioning, and data set parameterization. IEEE TPAMI, 2006.
  • Li et al. (2007) Li, Li-Jia, Wang, Gang, and Fei-Fei, Li. Optimol: automatic online picture collection via incremental model learning. In CVPR, 2007.
  • Lin et al. (2014) Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., and Zitnick, C. L. Microsoft coco: Common objects in context. In ECCV, 2014.
  • Liu & Nocedal (1989) Liu, D. and Nocedal, J. On the limited memory BFGS method for large scale optimization. Mathematical Programming B, 45(3):503–528, 1989.
  • Neyman & Scott (1948) Neyman, J. and Scott, Elizabeth L. Consistent estimates based on partially consistent observations. Econometrica, 1948.
  • Nguyen & Tran (2013) Nguyen, N.H. and Tran, T.D. Robust lasso with missing and grossly corrupted observations. IEEE Tran. Information Theory, 2013.
  • Roweis & Saul (2000) Roweis, Sam and Saul, Lawrence. Nonlinear dimensionality reduction by locally linear embedding. Science, 2000.
  • Schroff et al. (2007) Schroff, F., Criminisi, A., and Zisserman, A. Harvesting image databases from the web. In ICCV, 2007.
  • Sermanet et al. (2014) Sermanet, Pierre, Eigen, David, Zhang, Xiang, Mathieu, Michaël, Fergus, Rob, and LeCun, Yann. Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
  • She & Owen (2011) She, Yiyuan and Owen, Art B. Outlier detection using nonconvex penalized regression. Journal of the American Statistical Association, 2011.
  • Simonyan et al. (2013) Simonyan, K., Parkhi, O. M., Vedaldi, A., and Zisserman, A. Fisher Vector Faces in the Wild. In British Machine Vision Conference, 2013.
  • Tapaswi et al. (2014) Tapaswi, Makarand, Bauml, Martin, and Stiefelhagen, Rainer. Storygraphs: visualizing character interactions as a timeline. In CVPR, 2014.
  • van der Maaten & Hinton (2008) van der Maaten, Laurens and Hinton, Geoffrey. Visualizing data using t-sne. JMLR, 2008.
  • Wainwright (2009) Wainwright, Martin J. Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 2009.
  • Wauthier et al. (2013) Wauthier, Fabian L., Jojic, Nebojsa, and Jordan, Michael I. A comparative framework for preconditioned lasso algorithms. In NIPS, 2013.
  • Witten (2013) Witten, Daniela M. Penalized unsupervised learning with outliers. Statistics and its Interface, 2013.
  • Zhao & Yu (2006) Zhao, Peng and Yu, Bin. On model selection consistency of lasso. JMLR, 2006.
  • Zhou et al. (2003) Zhou, Dengyong, Bousquet, Olivier, Lal, Thomas N., Weston, Jason, and Schölkopf, Bernhard. Learning with local and global consistency. In NIPS. 2003.
  • Zhu (2007) Zhu, Xiaojin. Semi-supervised learning literature survey. Technical Report 1530, University of Wisconsin-Madison Department of Computer Science, 2007.