In human intelligence, the ability to recognize patterns is the most fundamental cognitive skill and serves as a building block for higher-level decision making; historically, it has been crucial for our survival and evolution in complex environments. Likewise, in machine intelligence, pattern recognition is an essential goal of both machine learning and artificial intelligence, where solving many high-level intelligent problems relies heavily on the success of automatic and accurate pattern recognition.
During the past decades (see the survey papers from 1968, 1980, and 2000), many exciting achievements in pattern recognition have been reported, and most successful methods are statistical approaches, such as parametric and nonparametric Bayes decision rules [48], boosting algorithms, and so on. To guarantee high accuracy, these models are usually built on well-designed hand-crafted features; in traditional approaches, the choice of feature representation strongly influences classification performance. Since 2006, the end-to-end approach of deep learning, which learns the feature representation and classifier jointly from the raw data, has become the new cutting-edge solution for many pattern recognition tasks.
The accuracies on many problems have improved significantly and rapidly over time. For example, on the MNIST (10-class handwritten digit) dataset, it is easy to achieve more than 99% accuracy with convolutional neural networks and no traditional hand-crafted features. On the more challenging 1000-class ImageNet large scale visual recognition task, the accuracy improved year by year, for example, AlexNet (2012, 84.7%), GoogLeNet (2015, 93.33%), ResNet (2016, 96.43%), and so on. The newest accuracies have already surpassed human-level performance by large margins. Actually, this kind of accuracy improvement and record-breaking happens all the time for different pattern recognition tasks, such as face recognition [270, 264], speech recognition [50, 242], handwriting recognition [92, 325], and so on. It seems that, from the perspective of accuracy, pattern recognition has become a well-solved problem.
However, accuracy is only one aspect of performance. When a high-accuracy pattern recognition system is launched in real applications, many unsatisfying and unexpected results may still occur, revealing a lack of robustness, and the cause of these problems is usually a mixture of different factors. For example, Nguyen et al. reveal that state-of-the-art deep neural networks (DNNs) are easily fooled into assigning high-confidence predictions to unrecognizable images, indicating that although the accuracy of a DNN is very high, it is not as robust as human vision when dealing with outliers. Moreover, as has been shown, a small, particularly designed perturbation of the input sample can cause a large perturbation (an incorrect prediction) in the output of a pattern recognition system, leading to great adversarial risk when such systems are used in real applications with stringent safety requirements. Furthermore, in traditional pattern recognition, the class set is usually assumed to be closed. In the real world, however, the open set problem with a dynamically changing class set is much more common. When re-contextualized as open set problems, many once-solved tasks become significantly challenging again.
Another phenomenon causing a significant performance drop in pattern recognition is distribution mismatch. It has been shown that even a very small distribution shift can lead to a great performance drop for high-accuracy pattern recognition systems. Therefore, besides accuracy, the adaptability and transferability of a pattern recognition system are very important in real-life applications. Most pattern recognition systems have only a single input and output; however, an effective strategy for improving robustness is to increase the diversity of both the input and the output of a system. Therefore, multi-modal learning and multi-task learning are also important issues for robust pattern recognition. In the real world, patterns seldom occur in isolation; instead, they usually appear with rich contextual information. Learning from the dependencies between patterns to improve the robustness of decision making is an important problem in pattern recognition.
Most pattern recognition systems are actually data-hungry, and their high accuracies rely heavily on both the quantity and quality of the training data. Any disturbance in the quantity or labeling quality of the data will usually lead to great changes in the final performance. However, in real applications, it is usually difficult to collect large databases and produce accurate manual labeling. Therefore, the few-shot or even zero-shot learning abilities of pattern recognition systems are of great value for real applications. On the other hand, to reduce the dependence on data quality, a pattern recognition system should be robust and able to learn from noisy data. Moreover, besides supervised training, other strategies like unsupervised, self-supervised, and semi-supervised learning are also valuable for pattern recognition systems to learn from abundant unlabeled data and easily obtained surrogate supervisory signals.
Based on the above observations, the problem of pattern recognition is far from solved when the different requirements of real applications are considered. Besides accuracy, more attention should be paid to improving the robustness of pattern recognition. There are many previous works on robust pattern recognition (see Section III); however, most of them are driven by a single viewpoint of robustness. Currently, there is no clear definition of robust pattern recognition. In order to give a comprehensive understanding of robustness, and more importantly, reduce the gap between pattern recognition research and the requirements of real applications, in this paper we study and review different perspectives that are crucial for robust pattern recognition.
Actually, most pattern recognition models are derived from three implicit assumptions: the closed-world assumption, the independent and identically distributed (IID) assumption, and the clean and big data assumption. These assumptions are reasonable in a controlled laboratory environment and simplify the complexity of the problem; therefore, they are fundamental to most pattern recognition models. However, in the real world, these assumptions are usually not satisfied, and in most cases, the performance of models built under them deteriorates significantly. Therefore, to build robust systems for real environments, we should try to break these assumptions and develop new models and algorithms by reconsidering the essentials of pattern recognition.
In the rest of this paper, we first give a brief overview of pattern recognition methods in Section II. After that, we define robustness for pattern recognition in Section III. Then, we present detailed overviews of different attempts at breaking the three basic assumptions in Sections IV, V, and VI, respectively. Lastly, we draw concluding remarks in Section VII. Readers who want to acquire some background knowledge before reading this paper may refer to the appendix.
II A Brief Overview of Pattern Recognition Methods
As shown in Fig. 1, pattern recognition methods can be divided into two main categories: two-stage and end-to-end. Most traditional methods are two-stage, i.e., cascaded hand-crafted feature representation and pattern classification. Feature representation transforms the raw data into a feature space with within-class compactness and between-class separability. Preprocessing (like removing noise and normalizing data) is first applied to reduce within-class variance, while feature extraction further enlarges between-class variance; this procedure is usually domain-specific. Actually, for solving new pattern recognition problems, the first step is the design of the feature representation, and a good feature will significantly reduce the burden on subsequent classifier learning. This kind of effort can be found in different applications like iris recognition, gait recognition, action recognition, and so on.
After feature representation, the second stage is pattern classification, which is a much more general problem. Actually, classification is the main focus of many textbooks, including Fukunaga, Duda et al., Bishop, and so on. This stage is also known as statistical pattern recognition, where many different issues are considered from different perspectives. Firstly, dimensionality reduction is widely adopted to derive a lower-dimensional representation that facilitates the subsequent classification task. Another approach, feature selection, can be viewed as discrete dimensionality reduction. After that, many classical classification models can be applied. The most fundamental one is Bayes decision theory [18], [23]. Kernel methods [153, 195] have been widely applied to extend linear models to nonlinear ones by performing linear operations in a higher- or even infinite-dimensional space transformed implicitly by a kernel mapping function, and the most representative method is the SVM (support vector machine). Ensemble methods can further improve performance by combining predictions from multiple complementary models. Clustering is widely used as an unsupervised strategy for pattern recognition.
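The kernel idea mentioned above can be made concrete with a short sketch. The following toy function (an illustration, not any particular library's API) evaluates an RBF kernel, which corresponds to an inner product after an implicit mapping into an infinite-dimensional feature space:

```python
import math

def rbf_kernel(a, b, gamma=1.0):
    """RBF kernel k(a, b) = exp(-gamma * ||a - b||^2).

    Evaluating k(a, b) in the input space equals an inner product of the
    two points after an implicit mapping into an infinite-dimensional
    space, which is what lets linear models such as SVMs become nonlinear.
    """
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)
```

A point compared with itself gives k = 1, and the kernel decays toward 0 as points move apart, so it behaves as a similarity measure.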
In two-stage methods, we usually have multiple choices for both feature representation and classifier learning. It is hard to predict which combination will lead to the best performance, and in practice, different pattern recognition problems usually have different optimal configurations according to domain-specific experience. In contrast, deep learning methods are end-to-end, learning the feature representation and classification jointly from the raw data. In this way, the learned features and classifiers are better adapted to the given task in a data-driven manner, which is more flexible and discriminative than two-stage methods.
Formerly, deep neural networks were usually layer-wise pre-trained with unsupervised models like the auto-encoder [110]. Nowadays, deeper and deeper neural networks can be trained end-to-end thanks to many improved strategies for initialization, activation, optimization, normalization, architecture, and so on. Due to its shared-weight architecture and local connectivity, the convolutional neural network has been successfully used in many visual recognition tasks like image classification, detection, segmentation, and so on. Moreover, due to its ability to handle arbitrary-length sequences, the recurrent neural network has been widely used for sequence-based pattern recognition like speech recognition, scene text recognition, and so on. Furthermore, the attention mechanism can further improve deep learning performance by focusing on the most relevant information. Nowadays, deep learning has become the cutting-edge solution for numerous pattern recognition tasks.
Besides the broad class of statistical pattern recognition approaches, structural pattern recognition has been developed for exploiting and understanding the rich structural information in patterns. Unlike statistical feature representations, the structure of patterns is of variable dimensionality and can be viewed as lying in a non-Euclidean space. String matching and graph matching are basic problems in structural pattern recognition. To improve the learning ability on structural pattern recognition problems, kernel methods (with graph kernels), probabilistic graphical models, and graph neural networks have been used. Overall, the research and application of structural pattern recognition is less popular than that of statistical methods.
Table I: Representative definitions of robustness in the literature.

| Year | Ref. | Definition of Robustness | Type |
|------|------|--------------------------|------|
| 1996 |      | Small-sample effects, distortion of samples | III |
| 1999 |      | Train/test condition mismatch | II |
| 2001 |      | Imprecise and changing environments | II |
| 2003 |      | New class discovery, outlier rejection | I |
| 2007 |      | Clutter, learn from a few examples | III |
| 2011 |      | Noise corruption and occlusion, outliers | III |
| 2011 |      | Small number of training data | III |
| 2017 |      | Adversarial, random noise | I |
| 2018 |      | Outlier, feature noise, label noise | III |
III Robustness in Pattern Recognition
To build a pattern recognition system, there should be some training samples $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ and test samples $\hat{\mathcal{D}} = \{(\hat{x}_j, \hat{y}_j)\}_{j=1}^{M}$, where $x$ is the observed pattern and $y$ is the corresponding label (the hat on a symbol is used to differentiate training and test samples). The purpose of pattern recognition is to learn the joint distribution $p(x, y)$ or the conditional distribution $p(y|x)$ from the training set $\mathcal{D}$ and then evaluate the learned model on a different test set $\hat{\mathcal{D}}$. During this process, there are usually some basic assumptions.
Assumption I: Closed-world assumption. The output space is assumed to be composed of a fixed number (e.g., $K$) of classes which are pre-defined a priori, and all samples are assumed to come from these classes: $y \in \{1, 2, \ldots, K\}$. Under this assumption, we can clearly and easily define the decision boundaries, since the whole space is partitioned into $K$ regions. However, in real-world applications, this assumption does not always hold, and there is an open space $\mathcal{O}$ much larger than the region covered by the known classes. The samples in $\mathcal{O}$ can be outliers not belonging to any of the $K$ classes, unknown samples from new classes not shown in the training set, or even adversarial samples from the confusing area. In these cases, the pattern recognition system will produce over-confident wrong predictions, because in its view there are only $K$ options and the winning class is highly reliable.
Assumption II: IID assumption. The samples are assumed to be independent with respect to the joint distribution $p(x, y)$ of observations and labels or the marginal distribution $p(x)$ of the observations, while the training and test data are assumed to be identically distributed: $p(x, y) = \hat{p}(x, y)$ and $p(x) = \hat{p}(x)$. Under the independence assumption, we can define the empirical loss as the summation of individual losses, e.g., $\frac{1}{N}\sum_{i=1}^{N} \ell(f(x_i), y_i)$. Under the identical-distribution assumption, we can then hope that minimizing the training error on $\mathcal{D}$ will yield good generalization performance on $\hat{\mathcal{D}}$. However, in the real world, the IID assumption is often violated: data collected from multiple sources or conditions cannot simply be viewed as independent, and moreover, even a small mismatch between the training and test environments can cause significant performance degradation.
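Under the independence assumption, the empirical loss decomposes into a mean of per-sample losses. A minimal sketch of this decomposition with a negative log-likelihood loss (helper names and values are illustrative, not from the survey):

```python
import math

def nll_loss(probs, label):
    # per-sample negative log-likelihood: -log p(y | x)
    return -math.log(probs[label])

def empirical_risk(pred_probs, labels):
    # independence lets the empirical loss be the mean of individual losses
    losses = [nll_loss(p, y) for p, y in zip(pred_probs, labels)]
    return sum(losses) / len(losses)

preds = [[0.7, 0.2, 0.1],   # sample 1: true class 0 gets probability 0.7
         [0.1, 0.8, 0.1]]   # sample 2: true class 1 gets probability 0.8
risk = empirical_risk(preds, [0, 1])
```

If the samples were correlated, this simple sum would no longer be a faithful estimate of the expected loss, which is exactly why violating independence hurts.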
Assumption III: Clean and big data assumption. The training data are assumed to be well labeled, and the volume of data is assumed to be large enough to cover different variations. Under this assumption, the only requirement is the capacity of the model, and supervised learning can be used to achieve good generalization performance. However, in real-world applications, it is hard to collect a large number of training samples and also impossible to label all of them perfectly. How to effectively build pattern recognition systems from noisy data under small sample sizes (labeled or unlabeled) is a fundamental difference between machine intelligence and human intelligence.
As long as these assumptions remain stable, we can count on a reliable system to do its job time after time. However, when the assumptions no longer hold and the conditions start drifting, we want the system to maintain its performance and be insensitive to these variations; this is what we call robustness for a pattern recognition system. Actually, there is already a lot of research on robust pattern recognition in the literature, giving different definitions of robustness from diverse perspectives. We show some representative definitions in Table I, and they can be partitioned into three types corresponding to the above three assumptions. Usually, the lack of robustness in these studies is caused by violations of the assumptions. On the other hand, there are many other research works in the literature that focus on breaking the above assumptions without using the terminology of robust pattern recognition, which are also of great value to this field. Therefore, to better understand the current state and identify directions for future research, this paper surveys recent advances in robust pattern recognition and presents them in a common taxonomy according to the three assumptions.
The main contents are organized as shown in Fig. 2. Under each assumption, a taxonomic sub-classification partitions the content into four sub-topics, resulting in twelve issues in total. A brief yet comprehensive overview is presented for each of them, accompanied by a discussion of current and future research directions. For some of the topics there already exist good review papers; however, we differ from them by focusing more on recent advances and the relations to robust pattern recognition. Although these topics are interrelated, the review and discussion of each of them are made as self-contained as possible. Readers can choose to start anywhere according to their own interests. However, it should always be remembered that the purpose is to break the three basic assumptions and realize robust pattern recognition.
IV Breaking the Closed-world Assumption
Most pattern recognition methods are based on the closed-world assumption: although we only have finite observations of samples and categories, we still try to find a full partition of the whole space, which is of course unwise and improper. For example, the support vector machine seeks a hyperplane to partition the whole space into two half-spaces under the principle of maximum margin. In deep neural networks, the softmax layer partitions the whole space into a fixed number of classes, and the summation of class probabilities is assumed to be one. These closed-world models will make overconfident errors on outliers and new-category samples. Actually, there are massive unknown regions in pattern classification due to the finite set of training samples. To avoid making ridiculous mistakes, we must find methods to deal with these open and unseen spaces. In this paper, motivated by the “known and unknown” statement in , we summarize the approaches to breaking the closed-world assumption from the following perspectives.
IV-A Known Known: Empirical Risk
As shown in Fig. 3a, in closed-world recognition, we usually assume that we can observe some samples from some pre-defined categories, and we denote this case as “known known” (things we know that we know). A straightforward strategy in this case is to minimize the empirical risk, estimated as the misclassification rate on the observed samples. However, since the number of samples is finite, minimizing empirical risk cannot guarantee good generalization performance, due to over-fitting. For example, it is common for the nearest neighbor decision rule to achieve perfect accuracy on training data but unsatisfactory performance on test data, and $k$-nearest neighbor is used to improve generalization, with $k$ selected by cross-validation. In decision trees, over-fitting occurs when the tree is designed to perfectly fit all training samples, and pruning methods are applied to trim off unnecessary branches. The multilayer perceptron is able to approximate any decision boundary to arbitrary accuracy, and different tricks (like early stopping and weight decay) are used to avoid over-fitting.
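The nearest-neighbor example can be made concrete with a toy, one-dimensional sketch (all data values are hypothetical) showing how 1-NN perfectly fits a mislabeled training point while 3-NN votes it down:

```python
def knn_predict(train, query, k):
    """Predict by majority vote among the k nearest training samples.

    train: list of (feature, label) pairs; distance is plain |a - b|
    since the toy features are one-dimensional.
    """
    neighbors = sorted(train, key=lambda p: abs(p[0] - query))[:k]
    votes = {}
    for _, label in neighbors:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

# three clean samples of class 'a' plus one noisy sample labeled 'b'
train = [(0.0, 'a'), (1.0, 'a'), (2.0, 'a'), (1.1, 'b')]
```

A query at 1.08 sits next to the noisy point: 1-NN copies its label, while 3-NN outvotes it, illustrating why $k$ is tuned by cross-validation rather than fixed at 1.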
The theoretical research on VC-dimension suggests minimizing the structural risk instead of the empirical risk, by balancing the complexity of the model against its success at fitting the training data. Under this principle, the support vector machine seeks a hyperplane that has the largest distance to the nearest training data point of any class, resulting in large-margin regularization. Since then, many other regularization strategies have been proposed, like sparsity, low-rank, manifold, and so on. These regularization terms are usually combined with the empirical loss to build a better objective function. Other strategies (not integrated into the objective function) can also be viewed as implicit regularization, such as training with noise, regularized parameter estimation, dropout, and so on. In summary, a common strategy for “known known” is empirical risk minimization with a well-defined regularization to improve generalization performance.
IV-B Known Unknown: Outlier Risk
As shown in Fig. 3b, besides known known, there is also “known unknown” in the open space (we know there are some things we do not know). In open-world recognition, the things we do not know are often denoted as outliers. The simplest way to deal with outliers is to extend the $K$-class problem to $K+1$ classes by adding a new class representing outliers. However, the drawback of this approach is that we need to collect outlier samples, and the distribution of the outlier class is usually too complex to model. A more general case is that we do not have outlier samples, and the problem then becomes outlier detection [34], or novelty detection.
IV-B1 Pattern rejection
The first solution to consider is to integrate rejection strategies into traditional pattern classifiers, and many such attempts can be found in the literature. For Bayes decision theory, Chow showed that the optimal rule (for rejecting ambiguous patterns) is to reject a pattern if the maximum of the a posteriori probabilities is less than some threshold. Dubuisson and Masson proposed a modified rejection rule for the case where the data density is smaller than some threshold, which is suitable for rejecting outliers (not belonging to the pre-defined classes). For many other classical models, the rejection option needs to be specifically designed according to the structure of the classifier, as for the support vector machine, nearest neighbor, sparse representation, multilayer perceptron, and so on. Ensemble learning of multiple classifiers can also be used for rejection. It has been shown that different classifier structures and learning algorithms affect the rejection performance significantly.
IV-B2 Softmax and extensions
In many pattern recognition systems like deep neural networks, the softmax function is widely used for classification:

$$p(y=k \mid x) = \frac{\exp(f_k(x))}{\sum_{j=1}^{K} \exp(f_j(x))},$$

where $f_k(x)$ is the discriminant function for class $k$ and $\hat{y} = \arg\max_k p(y=k \mid x)$ is the predicted class for $x$. Let $p_1$ and $p_2$ be the top-1 and top-2 probabilities. To reject outliers, a straightforward strategy is to set thresholds on $p_1$ (confidence in the predicted class) or $p_1 - p_2$ (ambiguity between the top two classes), and the sample is rejected if either value is below its threshold. Actually, this kind of operation rejects uncertain predictions rather than unknown classes. Due to the closed-world property $\sum_{k=1}^{K} p(y=k \mid x) = 1$, it can easily be fooled by outliers: a sample from a novel class (outside the predefined classes) may still have large values for both $p_1$ and $p_1 - p_2$. This means that although the prediction is wrong, the classifier is still very confident (known as an overconfident error), making it hard to apply a threshold for rejection. A simple and straightforward modification is to use the sigmoid function

$$p(y=k \mid x) = \frac{1}{1 + \exp(-f_k(x))}$$

to break the sum-to-one assumption and adopt one-vs-all training [168, 254] to improve outlier rejection. In this case, for each class, the training samples from other classes are viewed as outliers, and a sample can be efficiently rejected if $p(y=k \mid x) < 0.5$ for all $k$, since it does not belong to any known class. Transforming the sigmoid (binary) probabilities into multi-class probabilities satisfying $\sum_{k=1}^{K} p_k \leq 1$ by the Dempster-Shafer theory of evidence makes the outlier probability measurable as $1 - \sum_{k=1}^{K} p_k$.
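The two rejection rules above can be sketched in a few lines of pure Python (the thresholds t1, t2, and t are illustrative assumptions, not values from the literature):

```python
import math

def softmax(scores):
    m = max(scores)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def reject_softmax(scores, t1=0.9, t2=0.3):
    # reject when top-1 confidence or the top-1/top-2 margin is low;
    # note the probabilities still sum to one (closed world)
    p = sorted(softmax(scores), reverse=True)
    return p[0] < t1 or (p[0] - p[1]) < t2

def reject_one_vs_all(scores, t=0.5):
    # one-vs-all sigmoid scores are independent per class; a sample is
    # rejected when no class accepts it, breaking the sum-to-one assumption
    return all(1.0 / (1.0 + math.exp(-s)) < t for s in scores)
```

For strongly negative logits on every class (a plausible outlier), the one-vs-all rule rejects cleanly, whereas the softmax probabilities still distribute the full mass across the known classes.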
To extend softmax to open set recognition, openmax fits a Weibull distribution to the distances between samples and class means, giving a parametric estimate of the probability that an input is an outlier with respect to each class. Another extension, called generative openmax, employs a generative adversarial network to synthesize novel-category data and explicitly model the outlier class.
IV-B3 One-class classification
In the literature, another solution for outlier detection is one-class classification, where all training samples are assumed to come from only one class. The support vector data description uses a hypersphere of minimum volume to encompass as many training points as possible. The one-class SVM treats the origin in feature space as the representation of the open space and maximizes the margin of the training samples with respect to it using a kernel-based method. To use one-class models in multi-class recognition tasks, each class can be modeled with an individual one-class classifier, and the outputs for the different classes can then be combined and normalized to form a multi-class classifier with a reject option.
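As a crude, illustrative stand-in for the minimum-volume sphere of support vector data description (the real SVDD solves a constrained optimization with slack variables; this sketch just uses the sample mean and the farthest training point):

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fit_sphere(points):
    # center at the sample mean, radius = distance to the farthest
    # training point, so every training sample lies inside the sphere
    n, d = len(points), len(points[0])
    center = tuple(sum(p[i] for p in points) / n for i in range(d))
    radius = max(dist(p, center) for p in points)
    return center, radius

def is_outlier(x, center, radius):
    # one-class decision: anything outside the sphere is rejected
    return dist(x, center) > radius

center, radius = fit_sphere([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)])
```

Unlike a discriminative classifier, this model describes only the target class, so a far-away point is rejected rather than forced into one of the known classes.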
IV-B4 Open space risk
Recently, more and more attention has been drawn back to this old and important issue by the work of , which defined the open space risk as:

$$R_{\mathcal{O}}(f) = \frac{\int_{\mathcal{O}} f(x)\, dx}{\int_{S_o} f(x)\, dx},$$

where $f$ is a measurable recognition function: $f(x) = 1$ for recognition of the class of interest and $f(x) = 0$ when it is not recognized. Here $\mathcal{O}$ is the “open space” and $S_o$ is a ball that includes all of the known training samples as well as the open space $\mathcal{O}$. $R_{\mathcal{O}}(f)$ is a relative measure of the open space compared to the whole space, and the challenge in using this theory lies in how to define and obtain a computationally tractable open space risk term. In , the 1-vs-set machine is proposed as an extension of the traditional SVM, using a slab defined by two parallel hyperplanes to delimit the open space. A similar idea has also been studied by  with open-space hyperplane classifiers. The work of  further introduces a model called compact abating probability (CAP), which defines the open space as the space sufficiently far from any known training sample:

$$\mathcal{O} = S_o - \bigcup_{i=1}^{N} B_r(x_i),$$

where $B_r(x_i)$ is a closed ball of radius $r$ centered on training sample $x_i$. A technique called the Weibull-calibrated SVM has been proposed  by combining CAP with statistical extreme value theory for score calibration to improve multi-class open set recognition.
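The CAP idea can be mimicked in a few lines: confidence abates with distance to the nearest training sample and is exactly zero in the open space outside every ball $B_r(x_i)$. This is a simplified linear-decay sketch, not the Weibull-calibrated formulation:

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cap_score(x, train, r):
    # linearly abating confidence: 1 at a training sample, 0 at
    # distance >= r, i.e. zero everywhere in the open space O
    d_min = min(dist(x, xi) for xi in train)
    return max(0.0, 1.0 - d_min / r)

def in_open_space(x, train, r):
    # x belongs to O = S_o - union_i B_r(x_i) iff it is farther
    # than r from every known training sample
    return all(dist(x, xi) > r for xi in train)
```

Because the score abates to zero, thresholding it can never accept a point arbitrarily far from the training data, in contrast to a linear discriminant whose score grows without bound.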
Consider an expert pattern recognition system that can classify the digits 0 to 9 perfectly. When we feed an image of an “apple” into the system and it says “this is a 6 and I am very confident,” our impression of the system immediately changes from intelligent to foolish. The ability to learn to reject is a major difference between closed-world and open-world recognition. Besides specially designed methods, theoretical analysis of this problem is particularly important; although some studies have made good attempts in this direction, it is still worth further exploration. In many approaches, a threshold is used to distinguish normal and abnormal patterns, and different thresholds lead to different tradeoffs between the adopted measurements (like precision and recall). Therefore, the choice of threshold is usually task-dependent (different tasks require different tradeoffs), and to evaluate the overall performance of a particular method, threshold-independent metrics should be used, like the AUROC (area under the receiver operating characteristic curve), AUPR (area under the precision-recall curve), and so on.
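Threshold-independent evaluation can be made concrete with the rank formulation of AUROC: the probability that a randomly chosen in-distribution (positive) sample scores above a randomly chosen outlier (negative), counting ties as one half. A minimal sketch with made-up scores:

```python
def auroc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparisons.

    Equivalent to sweeping every possible threshold, which is why the
    value does not depend on any single operating point.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

A value of 1.0 means perfect separation at some threshold, while 0.5 is chance level; no single threshold has to be committed to in advance.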
IV-C Unknown Known: Adversarial Risk
An intriguing phenomenon in open space is “unknown known”: things we think we know but it turns out we do not. Different from the known unknown in Fig. 3b, which denotes open space far away from the training data where we know we do not know, the unknown known in Fig. 3c represents open space near the decision boundaries, where we are supposed to know but actually do not, because the limited training data do not cover this space. Ambiguous predictions occur for points close to the decision boundaries: for example, visually we think a sample is from one class, but the system classifies it into another. Since it is hard to sample such observations (they have low frequency in the real world), research in this area started by generating such samples to fool the system, which are known as adversarial examples.
IV-C1 Generation of adversarial examples
At the beginning, Szegedy et al. showed that by applying an imperceptible perturbation to an image, it is possible to arbitrarily change its prediction. Given any sample $x$, an adversarial example $x' = x + \eta$ can be found through constrained optimization. Since the perturbation $\eta$ is small, we cannot see any obvious difference between $x$ and $x'$ visually, but their predicted labels are different, indicating that the system is not robust: a small perturbation of the input causes a large perturbation of the output. An efficient method to generate adversarial examples, called the fast gradient sign method, was proposed by : let $\theta$ be the parameters of a model, $x$ and $y$ be the input and ground truth, and $J(\theta, x, y)$ be the cost used to train the model; the perturbation is then defined as:

$$\eta = \epsilon \cdot \mathrm{sign}\big(\nabla_x J(\theta, x, y)\big),$$

where $\epsilon$ is a step parameter. The elements of $\eta$ correspond to the sign of the gradient of the cost function with respect to the input. Since $\eta$ points in the direction of the gradient, moving along $\eta$ will increase $J(\theta, x, y)$, and consequently, a large-enough $\epsilon$ will cause $x + \eta$ to be misclassified. The iterative gradient sign method was proposed as a refinement of the fast gradient sign method. After that, DeepFool was proposed  to search for a minimal perturbation sufficient to change the label.
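For a model with a closed-form gradient, the fast gradient sign step is a one-liner. The sketch below uses a toy logistic model (hypothetical weights, an arbitrary eps) so that $\nabla_x J$ can be written by hand:

```python
import math

def logistic_loss(w, x, y):
    # J = -log sigmoid(y * w.x), with label y in {-1, +1}
    z = sum(wi * xi for wi, xi in zip(w, x))
    return -math.log(1.0 / (1.0 + math.exp(-y * z)))

def grad_x(w, x, y):
    # hand-derived gradient of J with respect to the input x
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-y * z))
    return [-y * (1.0 - p) * wi for wi in w]

def fgsm(x, g, eps):
    # eta = eps * sign(grad): one step that maximally increases J
    # under an L-infinity budget of eps
    sign = [(g_i > 0) - (g_i < 0) for g_i in g]
    return [xi + eps * s for xi, s in zip(x, sign)]

w, x, y = [1.0, -1.0], [1.0, 1.0], 1
x_adv = fgsm(x, grad_x(w, x, y), eps=0.1)
```

Even this two-dimensional toy shows the key property: the perturbed input has a strictly larger training cost than the original, which is the mechanism behind misclassification for large enough eps.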
IV-C2 Threat of adversarial examples
As shown in , adversarial attacks pose great threats to many applications, such as self-driving cars, voice commands, robots, and so on. It has been shown in  that a perturbation with 1/1000 the magnitude of the original image is sufficient to fool state-of-the-art deep neural networks. Moreover, new attack methods are continually being proposed [208, 31]. The work of  showed that it is even possible to fool a system by modifying only one pixel of a natural image. Furthermore, the method of  attacks a system treated as a black box. More surprisingly, the existence of a single small image-agnostic perturbation (called a universal perturbation) that fools state-of-the-art classifiers on most natural images has also been found . All these attempts pose significant challenges to the robustness of pattern recognition systems.
IV-C3 Defense methods
To deal with adversarial attacks, many defense methods have been proposed. A typical approach is to augment the training set with adversarial examples and then retrain the model on the augmented data set [88, 269]. Defensive distillation  smooths the model during training to make it less sensitive to adversarial samples. Defense-GAN  is trained to model the distribution of unperturbed real data, and at inference time it finds an output close to a given sample that does not contain the adversarial changes. Besides making the system robust to adversarial examples, Metzen et al.  show that adversarial perturbations can also be detected by augmenting the system with a detector network trained on the binary task of distinguishing genuine samples from perturbed ones. The deep contractive network  adopts an objective function that augments the standard loss with a layer-wise contractive penalty:

$$J = \sum_{i} \Big( L(x_i, y_i) + \sum_{l} \lambda_l \Big\| \frac{\partial h_l}{\partial h_{l-1}} \Big\| \Big),$$

where $L$ is a standard loss function and $h_l$ is the hidden representation of layer $l$ in a deep neural network. Since adversarial examples can be produced using the gradient sign method , regularizing the smoothness of the gradient is therefore helpful to avoid them.
Since there are open spaces near the decision boundaries, data augmentation can be used to fill them and improve robustness. Stability training  adds pixel-wise uncorrelated Gaussian noise to each input $x$ to produce an augmented sample $x'$, and then forces the outputs of the system on $x$ and $x'$ to be as close as possible, thus improving robustness against small perturbations. Robust optimization  uses an alternating min-max procedure to increase local stability:

$$\min_{\theta} \sum_{i} \max_{\tilde{x}_i \in B_\epsilon(x_i)} L(\tilde{x}_i, y_i; \theta),$$

where $(x_i, y_i)$ is a sample-label pair, $L$ is the loss function, and $B_\epsilon(x_i)$ is a ball of radius $\epsilon$ around $x_i$. In the inner max problem, $\tilde{x}_i$ can be viewed as an augmented worst-case sample, and in the outer min problem, the loss on $\tilde{x}_i$ is minimized, thus making the system stable in a small neighborhood around every training point.
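The inner maximization can be approximated very crudely by scanning the corners and faces of the $L_\infty$ ball (feasible only in tiny dimensions; real implementations use gradient ascent such as FGSM steps). A sketch with a toy quadratic loss, all values illustrative:

```python
import itertools

def quadratic_loss(x, y):
    # toy loss: squared error between a 1-D "prediction" sum(x) and label y
    return (sum(x) - y) ** 2

def worst_case_loss(loss_fn, x, y, eps):
    # inner max over B_eps(x): enumerate corner/face perturbations and
    # keep the largest loss, a stand-in for the adversarial inner step
    worst = loss_fn(x, y)
    for signs in itertools.product((-eps, 0.0, eps), repeat=len(x)):
        x_p = [xi + s for xi, s in zip(x, signs)]
        worst = max(worst, loss_fn(x_p, y))
    return worst
```

Training would then minimize worst_case_loss over the model parameters instead of the plain loss, which is the outer min of the alternating procedure.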
An interesting work, mixup , proposes a data-agnostic method to produce augmented data points $\tilde{x} = \lambda x_i + (1-\lambda) x_j$ and $\tilde{\mathbf{y}} = \lambda \mathbf{y}_i + (1-\lambda) \mathbf{y}_j$, where $x_i, x_j$ are two examples drawn at random from the training data and $\mathbf{y}_i, \mathbf{y}_j$ are their corresponding one-hot label vectors. (Previously, we used $y$ to denote the label as an integer; here $\mathbf{y}$, in bold, represents a one-hot label vector: a vector filled with 1 at the index of the labeled class and 0 everywhere else.) The $\lambda \in [0, 1]$ is a random parameter used to produce the augmented sample $\tilde{x}$ and a new soft label $\tilde{\mathbf{y}}$ (not one-hot anymore). This is based on the assumption that linear interpolations of feature vectors should lead to linear interpolations of the associated labels. Although this approach is simple, it is very effective at producing augmented samples spread not only within the same class but also between different classes; therefore, the open space near decision boundaries is well handled by these augmented samples. A similar approach is also adopted in the work on between-class learning . Although a mixture of two examples may not make sense to humans visually, it makes sense to machines, as suggested by , and it is shown by  that this can not only improve generalization performance but also increase robustness to adversarial examples.
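The interpolation above is a one-liner in practice. A minimal sketch (function name and Beta-distribution choice follow the common mixup recipe; the specific values are illustrative):

```python
import numpy as np

def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=None):
    """Produce one mixup-augmented sample and its soft label.

    x_i, x_j: feature vectors; y_i, y_j: one-hot label vectors.
    lam is drawn from Beta(alpha, alpha), as is common for mixup.
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x_new = lam * x_i + (1.0 - lam) * x_j
    y_new = lam * y_i + (1.0 - lam) * y_j
    return x_new, y_new

# Mixing a class-0 and a class-1 example yields a point between the
# classes and a soft label that still sums to one.
x_new, y_new = mixup(np.array([1.0, 0.0]), np.array([1.0, 0.0]),
                     np.array([0.0, 1.0]), np.array([0.0, 1.0]),
                     rng=np.random.default_rng(0))
```

Because the soft label tracks the mixing coefficient exactly, the augmented pair lies on the line between the two originals, filling the open space near the decision boundary.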
As shown in Fig. 4a, another related concept is fooling examples , which are produced to be completely unrecognizable to human eyes, yet the pattern recognition system still classifies them into particular classes with high confidence. This is different from the adversarial examples shown in Fig. 4b. Actually, the phenomenon of fooling examples results from outliers under the closed-world assumption discussed in Section IV-B. It is shown by  that retraining the system with fooling examples treated as a newly added class is not sufficient to solve this problem: a new batch of fooling images can be produced to fool the new system even after many retraining iterations. This is because there are massive open spaces for outliers, and it is impossible to model them completely. In contrast, augmenting training data with adversarial examples was shown to significantly increase robustness even with only one extra epoch, because the open spaces near decision boundaries are limited and constrained, giving us the possibility to model them. Since new attack methods for producing adversarial examples continue to be proposed, research on novel defense strategies is particularly important to guarantee the safety of pattern recognition.
IV-D Unknown Unknown: Open Class Risk
As shown in Fig. 3d, the last case in open space is the “unknown unknown”: situations where many unknown samples (out of this world) group into different unknown (unseen) categories. In this case, we should not simply mark them as a single large category of unknowns (as in Section IV-B), but also need to identify the newly emerged categories in a fine-grained manner. This is a common situation in real applications, where datasets are dynamic and novel categories must be continuously detected and added; this is denoted open world recognition in  and class-incremental learning in .
IV-D1 Definition of the problem
During continuous use of a pattern recognition system, abundant or even infinite test data will arrive in a streaming manner. As shown in Fig. 5, the open-world recognition process can be decomposed into three steps. The first step is detecting unknown samples and placing them in a buffer, which requires the system to reject samples from unseen classes while keeping high accuracy on seen classes. The second step is labeling the unknown samples in the buffer into new categories, which can be done either by humans or automatically. The last step is updating the classifier with the augmented categories and samples, which requires the classifier to be efficiently trainable in a class-incremental manner where different classes occur at different times. Step 1 has already been discussed in Section IV-B; therefore, this section focuses on steps 2 and 3.
IV-D2 Labeling unknown samples
A simple and accurate approach for step 2 is to seek help from human beings , either in a batch manner when the buffer size reaches some threshold, or immediately when users encounter strange outputs from the system and give feedback. Moreover, a strategy to make this process more efficient is active learning, which reduces the labeling cost by selecting the most valuable data to query for labels. In contrast, a more challenging task is automatic new class discovery  without human labeling. Unsupervised clustering  is an efficient and effective solution for finding new classes. A cluster can be seen as a new category if the number of samples falling in it is large enough; otherwise, it should be viewed as an outlier and ignored. However, a difficulty for this approach is the model selection problem, i.e., how many novel classes are contained in the data? To deal with this, the clustering algorithms should have the ability of automatic model selection .
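The cluster-then-threshold idea can be sketched with a simple leader-clustering pass over the buffer. This is an illustrative sketch only (the clustering rule, `radius`, and `min_size` are hypothetical choices, not from any cited work); real systems would use clustering with automatic model selection as discussed above.

```python
import numpy as np

def discover_new_classes(buffer, radius=1.0, min_size=3):
    """Group buffered unknown samples with simple leader clustering.

    A sample joins the first cluster whose leader is within `radius`;
    otherwise it starts a new cluster. Clusters with at least
    `min_size` members are reported as candidate new classes;
    smaller ones are treated as outliers and ignored.
    """
    leaders, clusters = [], []
    for x in buffer:
        for k, c in enumerate(leaders):
            if np.linalg.norm(x - c) <= radius:
                clusters[k].append(x)
                break
        else:
            leaders.append(x)
            clusters.append([x])
    return [c for c in clusters if len(c) >= min_size]

buffer = [np.array(p) for p in
          [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0),      # dense group -> new class
           (5.0, 5.0), (5.1, 5.0), (5.0, 5.2),      # another dense group
           (9.0, -9.0)]]                            # lone outlier, ignored
new_classes = discover_new_classes(buffer)
```

The size threshold implements the distinction drawn in the text: a large cluster becomes a candidate new category, while an isolated sample remains an outlier.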
IV-D3 Class incremental learning
The solution of step 3 requires us to rethink the relationship between discriminative and generative models. A pattern recognition system usually contains a class-independent feature extraction $f(x)$ and class-specific decision functions $g_k$. (For example, in deep learning, $f$ is a multi-layer neural network, and $g_k$ is usually a linear function on $f(x)$ like $g_k(x) = w_k^\top f(x) + b_k$; the classification is then $\arg\max_k g_k(x)$.) In a discriminative model, all the decision functions are trained jointly (with hinge loss, softmax loss, and so on), which can be viewed as competition between different classes to adjust the decision boundaries. In contrast, in a generative model, $g_k$ is usually used to model each class independently (e.g., a negative log-likelihood loss under some distribution). A discriminative model usually has higher accuracy; however, since the $g_k$ are coupled in training, adding a new class will affect the others, requiring retraining with all data available. Conversely, in a generative model, class-incremental learning becomes much simpler, since the training of $g_k$ is independent of the other classes; the drawback is that a generative model usually has lower accuracy. Therefore, hybrid discriminative-generative models become necessary: $f$ is discriminative while $g_k$ is generative. Actually, many recent works already use this principle.
IV-D4 Prototype based approaches
The nearest class mean (NCM) classifier  can generalize to new classes at near-zero cost: $\hat{y} = \arg\min_k \| f(x) - \mu_k \|$, where $\mu_k = \frac{1}{N_k} \sum_{i: y_i = k} f(x_i)$ is the mean of the $N_k$ training samples in the $f$-space for class $k$, and the decision is based on the Euclidean distance to the class mean. Different criteria can be defined, such as softmax  or sigmoid , to learn a discriminative $f$ for high accuracy, while the updating of $\mu_k$ is always a generative mean calculation. When a new class arrives, it is efficient to compute its class mean and augment the model with a new decision function, without affecting the other classes.
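The near-zero cost of adding a class can be seen directly in code. The sketch below (class and method names are illustrative) works in the raw input space for simplicity; in practice the vectors would first pass through the learned feature extractor $f$.

```python
import numpy as np

class NearestClassMean:
    """NCM classifier: classify by Euclidean distance to class means.

    Adding a class only computes one new mean; existing classes are
    untouched, so incremental updates are near-zero cost.
    """
    def __init__(self):
        self.means = {}                       # class label -> mean vector

    def add_class(self, label, samples):
        self.means[label] = np.mean(samples, axis=0)

    def predict(self, x):
        return min(self.means,
                   key=lambda k: np.linalg.norm(x - self.means[k]))

ncm = NearestClassMean()
ncm.add_class("a", np.array([[0.0, 0.0], [0.0, 2.0]]))   # mean (0, 1)
ncm.add_class("b", np.array([[4.0, 4.0], [6.0, 4.0]]))   # mean (5, 4)
# A brand-new class joins without retraining the others.
ncm.add_class("c", np.array([[-5.0, -5.0]]))
pred1 = ncm.predict(np.array([0.2, 1.1]))
pred2 = ncm.predict(np.array([-4.0, -4.0]))
```

Note how `add_class("c", ...)` leaves the means of "a" and "b" unchanged, which is exactly the class-incremental property discussed above.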
The class-independent feature extraction $f$ can be either a linear dimensionality reduction  or a nonlinear deep neural network . A similar idea is also adopted in , where exemplar samples are selected dynamically to represent each class in the CNN-transformed space, and a nearest-mean-of-exemplars strategy is used for classification. The prototypical network  is likewise an NCM classifier in a deep-neural-network-transformed space. A more general analysis of representing each class as a Gaussian is given in . Convolutional prototype learning  uses automatically learned prototypes to represent each class by regularizing the deviation of the prototypes from the class means. As shown in , when features are normalized to lie on a sphere, NCM is also equivalent to the traditional linear classifier in neural networks.
Since it is hard to enumerate all categories at once, how to smoothly update the system to learn more and more concepts over time is an important and challenging task. Although using class means or prototypes to represent each class is a simple generative model, it is effective for class distribution modeling, because a powerful $f$ (e.g., a deep neural network) can be learned to transform complex intra-class distributions into simplified Gaussian-like distributions, which then makes class-incremental learning efficient and effective. Other classical classifiers can also be modified for class-incremental learning, like random forests, support vector machines , and so on. For class-incremental learning of deep neural networks, the newly learned classes may erase the knowledge of old classes, due to the joint updating of $f$ and $g_k$, resulting in catastrophic forgetting . A remedy is to occasionally review the historical data of old classes to prevent forgetting , and many other recent advances continue to appear on this topic, like evolving neural structures (learning to grow)  and dynamic generative memory (learning to remember) . More efficient class-incremental learning models that handle the forgetting problem effectively will be a focus of future research.
V Breaking IID Assumption
Independent and identically distributed (IID) data is a fundamental assumption of most pattern recognition methods. However, in practical applications this assumption is often violated. At the Dagstuhl seminar organized by Darrell et al. in 2015, all participants agreed that learning with interdependent and non-identically distributed data should be a focus of future research. Moreover, it was shown by  that even a very small mismatch between training and test distributions makes state-of-the-art models drop their performance significantly.
In pattern recognition, a labeled sample is usually assumed to come from a feature space $\mathcal{X}$ and a label space $\mathcal{Y}$. A specific combination of feature space and label space can be viewed as an environment $E = (\mathcal{X}, \mathcal{Y})$, in which a learner is defined and operates. The learner should be adjustable when the environment changes from $E_1 = (\mathcal{X}_1, \mathcal{Y}_1)$ to $E_2 = (\mathcal{X}_2, \mathcal{Y}_2)$ , which can be summarized in four cases:
$\mathcal{X}_1 = \mathcal{X}_2$ and $\mathcal{Y}_1 = \mathcal{Y}_2$. This is the most widely considered case, where the feature spaces and label spaces are identical, and the environmental change comes from the conditional distribution $p(y \mid x)$.
$\mathcal{X}_1 = \mathcal{X}_2$ and $\mathcal{Y}_1 \neq \mathcal{Y}_2$. The feature spaces are identical but the label spaces are different, for example, cross-class transfer learning and multi-task learning.
$\mathcal{X}_1 \neq \mathcal{X}_2$ and $\mathcal{Y}_1 = \mathcal{Y}_2$. The feature spaces are different while the label spaces are identical, which happens often in multi-modal learning.
$\mathcal{X}_1 \neq \mathcal{X}_2$ and $\mathcal{Y}_1 \neq \mathcal{Y}_2$. Both the feature spaces and the label spaces are different, which is the most difficult situation and can be viewed as multi-modal multi-task learning.
V-A Learning with Interdependent Data
In traditional pattern recognition, the samples are assumed to be independent. However, in the real world, we usually have some group information (also denoted as set, bag, or field in the literature) for the samples, implying statistical dependencies among them. Let $\{(X_g, y_g)\}_{g=1}^{G}$ denote groups of samples and their corresponding labels. The purpose now is to learn the classifier and make decisions with grouped data, where:
In each group, the samples are no longer independent.
Different groups may not be identically distributed.
Different groups can have different cardinalities.
V-A1 Content consistency
A straightforward and widely-considered case is that the samples in each group share the same label, which is known as image set classification  or group-based classification  in the literature. This kind of content consistency within a group is very common in practice, for example, the temporal coherence between consecutive images in videos, the same object captured by multi-angle camera networks, classification based on long-term observations , and so on. Each group can be viewed as an unordered set of samples, and the task is therefore to define similarities between different sets, for example by: viewing each set as a linear subspace and defining the similarity between two sets as canonical correlation ; describing each set with Grassmann and Stiefel manifolds  and using geodesic distances as metrics; representing each set as an affine hull  and calculating the between-set distance from the sparse approximated nearest points; and so on. Besides assuming each set lies on a certain geometric surface, a deep learning framework based on minimum reconstruction error  can be used to automatically discover the underlying geometric structure for image set classification. The multiple samples in the same group (or set) provide complementary information from different aspects, such as appearance variations, viewpoints, illumination changes, and nonrigid deformations. Therefore, this offers new opportunities to improve classification accuracy compared with classification from a single example.
V-A2 Style consistency
Besides content consistency, another situation is style consistency : the samples in a group are isogenous, i.e., generated by the same source. For example, in handwriting recognition, a group of characters produced by a certain writer is homogeneous in his/her individual writing style; in face recognition, face images can appear as different groups according to different poses or illumination conditions; in speech recognition, different speakers have different accents; and so on. These situations provide important group information, and within each group the style (rather than the content) is consistent. Moreover, a new group does not necessarily share the same style as the training groups, which means a style mismatch can exist between training and test groups. In the literature, this problem is studied under the terminology of pattern field classification [243, 287, 288], where a field is a group of isogenous patterns. Specifically, in  a class-style conditional mixture of Gaussians is used to model the isogenous patterns, in  the dependencies among samples are modeled by second-order statistics with normally distributed styles, and in  the intraclass and interclass styles are studied under adaptive classification. The traditional Bayes decision theory can also be extended to pattern field classification . By utilizing style consistency, classifying groups of patterns is shown to be much more accurate than classifying single patterns  on various tasks like multi-pose face recognition, multi-speaker vowel classification, and multi-writer handwriting recognition.
V-A3 Group-level supervision
A useful strategy to realize weakly-supervised learning is to use only group-level supervision, without labels for the individual samples; this is a natural fit for numerous real-world applications and is denoted multi-instance learning (MIL)  in the literature. A group of instances is called a bag, and although each bag has an associated label, the labels of the individual instances that constitute the bag are not known. For example, in drug activity prediction , a molecule (bag) can adopt a wide range of shapes (instances) by rotating some of its internal bonds, and knowing that a previously synthesized molecule has the desired drug effect does not directly provide information about its shapes. In image classification , a single image (bag) can be represented by a collection of regions, blocks, or patches (instances), and the labels are attached only to images instead of the low-level segments. The instances in each bag can be treated as non-IID samples , and not all of them are necessarily relevant : some instances may not convey any information about the bag class, or may even come from other classes, thus providing confusing information. MIL usually deals with binary classification, and it is initially defined as follows : a bag is considered positive if and only if it contains at least one positive instance. Relaxed and alternative definitions of MIL are presented by  to extend its application to different domains. Due to this weak-supervision property, MIL has found wide applications in image categorization , object localization , computer-aided diagnosis , and so on.
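The standard MIL assumption above reduces to a simple aggregation rule. A minimal sketch (function names and the thresholding scheme are illustrative):

```python
def bag_label(instance_labels):
    """Standard multi-instance assumption: a bag is positive
    iff it contains at least one positive instance."""
    return int(any(instance_labels))

def predict_bag(instance_scores, threshold=0.5):
    """Bag-level prediction from per-instance scores: threshold each
    instance, then aggregate with the 'any' (max) rule."""
    return bag_label(s > threshold for s in instance_scores)

pos_bag = predict_bag([0.1, 0.2, 0.9])   # one confident instance suffices
neg_bag = predict_bag([0.1, 0.4, 0.3])   # no instance crosses the threshold
```

The "any" aggregation is what makes the supervision weak: a positive bag label constrains only the existence of a positive instance, not which one it is.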
V-A4 Decision making in context
In the situations discussed above, the order of the samples in each group is ignored. However, the organizational structure of the samples provides very important contextual information, which has historically been shown to be crucial in pattern recognition . For example, linguistic context  arises very naturally during the process of human reading. By using a language model , the performance of many related tasks like speech recognition, character recognition, and text processing can be significantly improved. Moreover, the spatial arrangement of the samples, known as geometric context, is also an important piece of information for different pattern recognition tasks like  and . A widely-used strategy to learn from context is viewing the samples as a sequence, and many methods like hidden Markov models (HMMs), conditional random fields (CRFs) , and recurrent neural networks (RNNs)  can be used to model the dependencies among samples from the perspectives of Markov chains, conditional joint probability, and long-short-term dependency, respectively. Besides sequences, the graph is another useful representation for contextual learning, and recently graph neural networks [244, 141, 289] have gained increasing popularity in various domains by modeling the dependencies between the nodes of a graph via efficient and effective message passing among them. Moreover, beyond structured input representations, dependencies can also occur in the output space, known as structured output learning , which tries to predict more complex outputs such as trees, strings, or lattices rather than a group of independent labels. More details on this important issue can be found in .
By utilizing the dependencies among data, assigning labels to a group of patterns simultaneously is more accurate and robust than labeling them separately. The key problem is how to define and learn from the group information. Content and style  are two important factors in pattern recognition, and the dependencies derived from content consistency and style consistency are useful information for improving the performance of group-based pattern recognition. Multi-instance learning, which requires only group-level supervision, is an effective strategy for weakly-supervised learning. The contextual information embedded in the order or arrangement of the samples has proved important for structured prediction. Besides using dependencies to improve performance, automatic discovery of the relationships among samples (relational reasoning)  is also an important direction.
V-B Domain Adaptation and Transfer Learning
As shown in Fig. 6a, when both the feature space and the label space are identical, the non-IIDness may arise in the conditional distribution. In this case, domain adaptation and transfer learning are actually dealing with the same thing: there is usually a source domain with sufficient labeled data and a target domain with a small amount of labeled or unlabeled data, and the purpose is to reduce the distribution mismatch between the two domains in a supervised, unsupervised, or semi-supervised manner.
V-B1 Supervised fine-tuning
When some labeled data exist in the target domain (supervised domain adaptation), a simple and straightforward solution is to fine-tune the model on these extra labeled data. Actually, every pattern classifier that can be trained via incremental or online learning can be used in this way for supervised domain adaptation. The model trained on the source domain can be viewed not only as a good initialization but also as a regularization, and fine-tuning on target data will gradually reduce the distribution shift between the two domains. Many classifiers can naturally be learned incrementally, like neural networks trained with back-propagation. For non-incremental classifiers, counterpart algorithms can be developed, such as incremental decision trees , incremental SVM , and so on.
V-B2 Cross-domain mapping
Another widely used strategy is learning cross-domain mappings to reduce the distribution shift. Let $X_s$ and $X_t$ denote the source and target data, $\theta$ denote the parameters of the classifier, and $T$ denote a cross-domain mapping, which can be defined in various ways. In the first approach of parameter mapping, the source and target distributions are matched using transformed parameters $T(\theta)$. For example, Leggetter and Woodland  use a linear transformation on the mean parameters of the hidden Markov model for speaker adaptation, and in  a residual transformation network is used to map source parameters into target parameters. Another strategy is the source mapping $T(X_s)$ applied to the source data, and the transformed source data can then be used together with the target data to train the classifier. Meanwhile, the target mapping applies the mapping $T(X_t)$ to the target data [327, 115]; the advantage is that the adaptation (the learning of $T$) can happen after the training of the source classifier. Finally, we can also define a co-mapping by projecting both the source and target data  into a shared common space. The mapping $T$ can be either linear [161, 327, 115] or nonlinear [234, 49, 205]. Different criteria can be used to learn $T$, like maximum likelihood , minimum earth mover's distance , minimum regularized Euclidean distance , discriminative training [115, 234], component analysis , and so on.
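As a concrete illustration of a linear source mapping learned by minimizing a Euclidean criterion, the following sketch fits $T$ in closed form by least squares. It assumes paired source/target samples (rows correspond), which is a simplifying assumption for the demo, not a requirement of the general approaches above.

```python
import numpy as np

def fit_source_mapping(Xs, Xt):
    """Learn a linear source mapping T so that Xs @ T ~= Xt.

    Assumes paired source/target samples (corresponding rows);
    solved in closed form by least squares.
    """
    T, *_ = np.linalg.lstsq(Xs, Xt, rcond=None)
    return T

rng = np.random.default_rng(0)
Xs = rng.normal(size=(50, 3))
A = np.array([[1.0, 0.2, 0.0],       # the "true" domain shift (hypothetical)
              [0.0, 0.9, 0.1],
              [0.3, 0.0, 1.1]])
Xt = Xs @ A                           # target data = shifted source data
T = fit_source_mapping(Xs, Xt)
# The transformed source data Xs @ T now matches the target data, so a
# target-domain classifier could be trained on (Xs @ T, source labels).
```

With unpaired data, the same idea is pursued through distribution-level criteria (e.g., MMD or earth mover's distance) rather than row-wise least squares.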
V-B3 Distribution matching
The purpose of domain adaptation and transfer learning is to match the distributions of the source and target domains. Importance re-weighting  is a widely used strategy for distribution matching: each source sample $x$ is weighted by the importance factor $w(x) = p_t(x) / p_s(x)$, where $p_t$ and $p_s$ are the target and source densities. Training the classifier on weighted source samples then works well on the target domain. However, density estimation is known to be a hard problem, especially in high-dimensional spaces; therefore, directly estimating the importance without going through density estimation is more promising, as shown by  and . Another problem in distribution matching is how to measure the discrepancy between two distributions. The maximum mean discrepancy (MMD)  is a widely-used strategy that measures the distance between the means of two distributions in a reproducing kernel Hilbert space (RKHS). It is shown in  that MMD asymptotically approaches zero if and only if the two distributions are the same. Since MMD is easy to calculate and does not require label information, it has been widely used as a regularization term for unsupervised domain adaptation . Rather than using only the distance between first-order means as the measurement, Zhang et al.  propose aligning the second-order covariance matrices in the RKHS for distribution matching. Besides MMD, many other kinds of distances, divergences, and information-theoretic measurements can also be used for distribution matching, as discussed in the survey paper .
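A biased empirical estimate of the squared MMD with an RBF kernel fits in a few lines. This sketch uses the standard three-term form $\mathrm{MMD}^2 = E[k(x,x')] + E[k(y,y')] - 2E[k(x,y)]$; the bandwidth choice is illustrative.

```python
import numpy as np

def mmd_rbf(X, Y, gamma=1.0):
    """Squared maximum mean discrepancy with an RBF kernel.

    MMD^2 = E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)], estimated with
    the biased V-statistic over samples X and Y.
    """
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

rng = np.random.default_rng(0)
same = mmd_rbf(rng.normal(size=(200, 2)), rng.normal(size=(200, 2)))
diff = mmd_rbf(rng.normal(size=(200, 2)), rng.normal(size=(200, 2)) + 2.0)
# `same` compares two draws from one distribution and is near zero;
# `diff` compares shifted distributions and is clearly larger.
```

Because no labels appear anywhere in the computation, this quantity can be added directly as an unsupervised regularization term, as described above.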
V-B4 Adversarial learning
Recently, the increasingly popular idea of adversarial learning tries to make the features from both domains as indistinguishable as possible. As shown in Fig. 7, the whole framework is composed of four components. The feature extractors $F_s$ (for source) and $F_t$ (for target) are usually defined as deep neural networks. The task classifier $C$ performs the original multi-way classification for both source and target data. Importantly, a domain classifier $D$ is used to judge whether a sample comes from the source or the target domain (binary classification). Since we have two classifiers, we can define two standard classification losses, $L_{cls}$ for the task classifier and $L_{dom}$ for the domain classifier, and the key to adversarial learning is that these losses are optimized as a min-max game. The task loss $L_{cls}$ is minimized to guarantee classification accuracy, while the feature extractor maximizes $L_{dom}$ (against the domain classifier, which minimizes it) to confuse the domain classifier and make the feature distributions of the two domains similar, resulting in domain-invariant features. To efficiently solve this maximization with respect to the feature extractor, multiple strategies can be used, like a gradient reversal layer  that reverses the gradient in back-propagation, adopting inverted labels  to calculate a surrogate fooling loss, or minimizing the cross-entropy loss against a uniform distribution. Moreover, the feature extractor modules can be designed as shared ($F_s = F_t$) , partially shared , or independent ($F_s \neq F_t$) . Adversarial learning is efficient and effective for domain adaptation and transfer learning, and many subsequent improvements are still being proposed .
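The gradient reversal trick is mechanically simple: identity in the forward pass, sign-flipped gradient in the backward pass. A minimal conceptual sketch (function names are illustrative; real frameworks implement this inside autograd):

```python
import numpy as np

def grl_forward(features):
    """Gradient reversal layer: the identity in the forward pass."""
    return features

def grl_backward(upstream_grad, lam=1.0):
    """Backward pass: flip the sign (scaled by lam), so gradient
    descent on the domain loss downstream becomes gradient *ascent*
    with respect to the feature extractor, which confuses the
    domain classifier."""
    return -lam * upstream_grad

g = np.array([0.5, -2.0])            # gradient flowing back from D
reversed_g = grl_backward(g)         # what the feature extractor sees
```

A single optimizer step can therefore minimize $L_{dom}$ for the domain classifier while simultaneously maximizing it for the features, without maintaining two separate training loops.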
V-B5 Multi-source problem
In the above approaches, we assume there is only a single source domain; in practice, however, multiple sources  may exist during data collection, which is related to the style-consistent pattern field classification problem in Section V-A. The single-source methods discussed above can be extended accordingly to the multi-source case. For example, the cross-domain mapping can be extended to multiple source mappings with a shared Bayes classifier , the adversarial method can be modified by replacing the binary domain classifier with a multi-way classifier representing the multiple sources , and so on.
Domain adaptation and transfer learning are useful for many applications like speaker adaptation  in speech recognition, writer adaptation  in handwriting recognition, view adaptation  in face recognition, and so on. Fine-tuning is a straightforward and effective strategy for supervised adaptation, while cross-domain mapping is a general approach for supervised , unsupervised , and semi-supervised  adaptation. Traditional approaches usually focus on distribution matching, while adversarial learning has become the new trend for deep learning based adaptation. The multi-source phenomenon is common in practice, and how to discover latent domains  in mixed multi-source data is an important and challenging problem.
V-C Multi-task Learning
As shown in Fig. 6b, another case is that the feature spaces are identical but the label spaces are different. For example, a face image can be classified by race, age, gender, and so on. These tasks are not independent; instead, they are complementary to each other, and learning one task is helpful for solving another. How to efficiently and effectively learn from multiple related tasks is known as multi-task learning.
V-C1 Transferable representation learning
The first question for multi-task learning is: can we find a generic feature representation that is transferable among different tasks? Traditional hand-crafted feature representations are usually task-specific, and new features need to be designed for new tasks. It has been shown by  that the features extracted from a deep neural network pre-trained on a large dataset are transferable to different tasks. The usual method is to train a base network and then copy its first few layers to a target network, whose remaining layers are randomly initialized and trained toward the target task. There are multiple layers in deep neural networks: the first layers usually learn low-level features, whereas the later layers learn semantic or high-level features. The low-level features are more general, while the high-level features are more specific . Therefore, the transferability of features from the bottom, middle, or top of a neural network differs, depending on the distance between the base task and the target task : for similar tasks the later layers are more transferable, while for dissimilar tasks the earlier layers are preferred.
V-C2 Multi-task representation learning
The second question for multi-task learning is: can dealing with multiple tasks simultaneously be used to integrate different supervisory signals for learning an invariant representation? Since each task produces a task-specific loss function, generally, as soon as you find yourself optimizing more than one loss function, you are effectively doing multi-task learning . To learn multiple tasks jointly, there should be both shared and task-specific parameters in the architecture, and sharing what is learned while training different tasks in parallel is the central idea of multi-task learning. A straightforward approach is hard parameter sharing  (Fig. 8a), which shares the bottom layers across all tasks and keeps several task-specific output layers. This is the most widely-used strategy in real applications due to its simplicity and effectiveness; however, it needs intensive experiments to find the optimal split position between shared and task-specific layers. Another approach is soft parameter sharing (Fig. 8b), where each task has its own parameters, which are regularized with some constraints to encourage similarity between them, like minimizing their distances  or using a partially shared structure . Rather than a fixed sharing mechanism, another strategy is learning to share. For example, the cross-stitch network  learns an optimal combination of shared and task-specific representations, as shown in Fig. 8c. The above approaches are based on sharing features among tasks, while the decision-making processes of each task remain independent. However, the solution of one task may be useful for solving other related tasks, indicating that we need task feedback to update the representation, as shown in Fig. 8d. In this way, the performance of different tasks can be improved recurrently by utilizing the solutions (not only the features) of other tasks .
V-C3 Task relationship learning
Finding the relationships between different tasks makes information sharing among tasks more selective and smooth. A straightforward strategy is task clustering , which partitions multiple tasks into several clusters; the tasks in the same cluster are assumed to be similar to each other. It is also possible to dynamically widen a thin neural network in a greedy manner to create a tree-like deep architecture that clusters similar tasks in the same branch, as shown in . Besides task clustering, many studies have also tried to learn both the per-task model parameters and the inter-task relationships simultaneously, where the task relationship can be formulated as a matrix , a tensor, or a nonlinear structure . This topic has attracted much attention in the research community, and the work on taskonomy  won the CVPR 2018 best paper award. The taskonomy (task taxonomy) is a directed hyper-graph that captures the transferability among tasks: an edge between two tasks represents a feasible transfer, and its weight is the predicted transfer performance. With task relationships, transfer learning performance can be improved through better transfer paths from the most related tasks to a target task, which not only reduces the computational complexity of using all tasks but also avoids the phenomenon of negative transfer caused by dissimilar tasks .
When multiple related tasks can be defined naturally, multi-task learning will significantly improve the performance of many problems, such as in computer vision [47]. However, this requires labeled data for each task. A more efficient strategy is to use some auxiliary tasks whose data can be collected without manual labeling (to be discussed in Section VI-B). When multiple tasks are learned jointly, how to balance their loss functions becomes the key problem. A dominant approach is to assign each task a pre-defined weight; however, the optimal weights are expensive and time-consuming to find empirically. Therefore, Kendall et al.  propose learning the optimal weights for the multiple loss functions automatically by considering the homoscedastic uncertainty of each task. Furthermore, gradient normalization  can also be used to automatically balance training in deep multi-task models by dynamically tuning the gradient magnitudes.
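The uncertainty-based weighting can be sketched with the commonly used form $\sum_t \frac{1}{2\sigma_t^2} L_t + \log \sigma_t$; in training, the $\log \sigma_t$ values would be learnable parameters. Function and variable names below are illustrative.

```python
import math

def uncertainty_weighted_loss(losses, log_sigmas):
    """Combine per-task losses with homoscedastic uncertainty,
    using the form 1/(2*sigma^2)*L + log(sigma) per task.

    A large sigma down-weights its task's loss but pays a log(sigma)
    penalty, so the balance is learned rather than hand-tuned.
    """
    total = 0.0
    for L, log_s in zip(losses, log_sigmas):
        sigma2 = math.exp(2.0 * log_s)
        total += L / (2.0 * sigma2) + log_s
    return total

# With equal uncertainties (sigma = 1), both tasks are equally weighted.
baseline = uncertainty_weighted_loss([1.0, 4.0], [0.0, 0.0])
# Raising sigma for the noisier second task reduces its influence.
reweighted = uncertainty_weighted_loss([1.0, 4.0], [0.0, math.log(2.0)])
```

During optimization, gradient descent on the `log_sigmas` themselves automatically raises the uncertainty of tasks whose losses stay large, which is the mechanism behind the automatic balancing mentioned above.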
V-D Multi-modal Learning
The world surrounding us involves multiple modalities: we see objects, hear sounds, feel texture, smell odors, and taste flavors . Different from multi-task learning (Fig. 6b) where multiple tasks are performed with the same input, the purpose of multi-modal learning is to utilize the supplementary and complementary information in different modalities to complete a shared task (Fig. 6c) or multiple related tasks (Fig. 6d).
V-D1 Multi-modal representation and fusion
In multi-modal learning, each instance can be viewed through multiple modalities, which can be fused at different levels. The first approach we can consider is signal-level fusion, for example, the pansharpening of multi-resolution satellite images. However, different modalities usually have different data structures with different sizes or dimensions, such as images, sound waves, and texts, making them hard to fuse at the raw-data level. Therefore, we must design some modality-wise representations, which can be either handcrafted features or features learned with deep neural networks. For example, 2D/3D convolutional neural networks can be used for spatially structured signals like images, CT, fMRI, and video, while recurrent neural networks can be used for temporal data like speech and text. With modality-wise feature extraction, different modalities are transformed into a unified space, and a straightforward fusion strategy is therefore feature-level fusion: for example, the modality-wise representations can be concatenated into a longer representation or averaged (with learnable modality-wise weights) into a new representation. After that, any traditional model can be learned on the fused feature representation for a given task. Another common strategy is decision-level fusion, which is widely investigated in multiple classifier systems. This fusion strategy is often favored because different models are used on different modalities, making the system more flexible and robust to modality missing, as the predictions are made independently. Recently, due to the development of deep learning, which learns hierarchical representations, intermediate-level fusion is used to dynamically integrate modalities at different levels with an automatically learned and optimized fusion structure.
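Feature-level fusion by concatenation or (weighted) averaging can be sketched as follows (illustrative NumPy; a real system would learn the weights and the modality-wise encoders):

```python
import numpy as np

def fuse_features(modal_feats, mode="concat", weights=None):
    """Feature-level fusion of modality-wise representations.

    modal_feats: list of 1-D arrays, one embedding per modality
    (assumed already projected to a unified space when averaging).
    """
    if mode == "concat":                      # concatenation fusion
        return np.concatenate(modal_feats)
    if mode == "average":                     # (weighted) average fusion
        feats = np.stack(modal_feats)         # requires equal dimensions
        if weights is None:
            weights = np.full(len(modal_feats), 1.0 / len(modal_feats))
        weights = np.asarray(weights, dtype=float)
        weights = weights / weights.sum()     # normalize the (learnable) weights
        return weights @ feats
    raise ValueError(mode)

image_feat = np.array([1.0, 0.0])
text_feat = np.array([0.0, 1.0])
fused = fuse_features([image_feat, text_feat], mode="average")  # [0.5, 0.5]
```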
Moreover, another important strategy is learning-based fusion, for example, using cross weights to gradually learn the interactions between modalities, using multiple-kernel learning to learn the optimal feature fusion, learning sharable and specific features for different modalities, and so on.
A widely occurring problem in multi-modal learning is modality missing, i.e., some modalities are inaccessible for some instances during inference. Generative models such as the deep Boltzmann machine can handle modality missing by sampling the absent modality from the conditional distribution. We can also apply modality-wise dropout during training to improve generalization under modality missing.
V-D2 Cross-modal matching, alignment, and generation
Besides fusing multiple modalities to make accurate and robust predictions, another vibrant research direction attracting increasing attention is cross-modal learning. In this case, different modalities are embedded into a coordinated and well-aligned space, for example, the maximally correlated space of canonical correlation analysis (CCA), the semantics-preserving space of joint embedding, and so on. The embedded space enables cross-modal matching tasks, such as retrieving the images relevant to a given textual query, or deciding which face image corresponds to the speaker given an audio clip of someone speaking. A more difficult problem is cross-modal alignment, which finds correspondences between sub-items of instances from multiple modalities, for example, aligning the steps in a recipe to a video showing the dish being made, or aligning a movie to the script or the book chapters. According to , cross-modal alignment can be achieved either explicitly, as with dynamic time warping and CCA, or implicitly, as with the attention mechanism in deep neural networks. Another task, cross-modal generation, which seeks a mapping from one modality to another, has become very popular with an emphasis on language and vision, for example, generating the text description of an input image or, inversely, generating an image given a text description. The difficulty increases from matching to alignment to generation, requiring an increasingly better understanding and high-level capture of the interactions and relationships between modalities.
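A compact NumPy sketch of linear CCA for embedding two modalities into a maximally correlated space (a regularized closed-form solution; the synthetic two-view data are our own illustration):

```python
import numpy as np

def cca(X, Y, k=1, reg=1e-6):
    """Linear CCA: project two modalities into a maximally correlated space.

    Returns projection matrices Wx, Wy and the top-k canonical correlations,
    obtained from the SVD of the whitened cross-covariance matrix.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / n

    def inv_sqrt(S):                       # symmetric inverse square root
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Kx, Ky = inv_sqrt(Sxx), inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(Kx @ Sxy @ Ky)
    return Kx @ U[:, :k], Ky @ Vt[:k].T, s[:k]

# Two "modalities" generated from a shared latent factor should be almost
# perfectly correlated along the first canonical direction.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
X = z @ rng.normal(size=(1, 3)) + 0.01 * rng.normal(size=(500, 3))
Y = z @ rng.normal(size=(1, 4)) + 0.01 * rng.normal(size=(500, 4))
Wx, Wy, corrs = cca(X, Y)
```

Once `Wx` and `Wy` are learned, cross-modal matching reduces to nearest-neighbor search between `X @ Wx` and `Y @ Wy` in the shared space.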
V-D3 Multi-modal multi-task learning
The last case we shall discuss is multi-modal multi-task learning (Fig. 6d), which can be partitioned into two types, as shown in Fig. 9. The first and simpler setting is the synchronous case, where all the modalities are available for solving each task. In this case, a fused representation of all modalities can be learned efficiently during the joint training of multiple related tasks, which is widely used in applications such as disease diagnosis, traffic sign recognition, autonomous driving, emotion recognition, and so on. As shown in Fig. 9b, the second and more challenging setting is the asynchronous case, where different tasks may rely only on their own modalities. For example, image classification works on images, speech recognition deals with sound waves, while machine translation handles text. Intuitively, it is hard to consider these problems jointly, since both their inputs and outputs are different. Moreover, it is also unclear what common knowledge is shared among these seemingly unrelated problems, and how much benefit can be gained from combining them. An interesting work named "one model to learn them all" demonstrates this possibility and potential: a single deep neural network learning multiple tasks from various modalities is designed using modality-specific encoders, an I/O mixer, and task-specific decoders. The joint training of diverse tasks with asynchronous image, speech, and text modalities is shown to benefit from the shared architecture and parameters. Remarkably, although seemingly unrelated, incorporating image classification in training helps improve the performance of language parsing, indicating that some computational primitives can be shared between different modalities and even unrelated tasks.
There are many practical problems that can benefit from multi-modal learning. In biometric applications, a person can be identified by face, fingerprint, iris, or voice. Although each modality is already discriminative on its own, their combination improves both accuracy and robustness. In autonomous driving, the fusion of multiple sensors (radar, camera, LIDAR, GPS, and so on) is necessary and important for making robust decisions. Multi-modal analysis also makes decision making more explainable. Moreover, the human brain is essentially both a multi-modal and a multi-task system: it continuously receives stimuli from various modes of the surrounding world and performs various perceptual and cognitive tasks. For a pattern recognition system, multi-modal perception increases diversity on the input side while multi-task learning increases diversity on the output side, and diversity usually brings robustness. Therefore, joint multi-modal multi-task learning will be an inspiring and important future direction.
VI Breaking Clean and Big Data Assumption
Pattern recognition systems usually have a strong ability to memorize training data. As shown in , even if we completely randomize the labels of the data, neural networks can still achieve near-zero training error, indicating their strong capacity to fit the training data. This is valuable if we have a clean (well-labeled) and large-enough (covering different variations) dataset, and fitting the training data will then usually also lead to good generalization performance. However, this assumption is hard to satisfy in real applications. Actually, clean data and big data are contradictory: it is easy to collect a well-labeled small dataset, but nearly impossible to manually label a big dataset without any error. Therefore, to improve robustness with respect to both the quality and the quantity of data, first of all, the training process should be robust to noisy data, and second, particular learning strategies should be considered to reduce the dependence on large amounts of data. To reach this goal, we present discussions and summaries from the following four perspectives.
VI-A Supervised Learning with Noisy Data
In supervised learning, the noise in data can be partitioned into three types: (1) label noise: the sample is valid but the label is wrong due to mislabeling; (2) sample noise (or attribute noise): the sample is noisy but the label is valid, for example, samples affected by corruption, occlusion, or distortion; (3) outlier noise: both the sample and the label are invalid, for example, a sample from a new irrelevant class or a totally noisy signal that is still labeled as one of the classes to be classified. To deal with noisy data, different approaches have been proposed in the literature. Frénay and Verleysen have surveyed many methods for label noise before 2014. Complementarily, we consider all three noise types and focus more on methods developed in recent years.
VI-A1 Robust loss
An unbounded loss function will usually over-emphasize the noisy data, and hence the decision boundary will deviate severely from the optimal one. Therefore, the first solution for learning with noisy data is to redefine the loss function to be bounded. Convex loss functions are usually unbounded, and therefore most redefined robust loss functions are non-convex. For example, the ramp loss and the truncated hinge loss set an upper bound on the hinge loss by allowing a maximum error for each training observation, resulting in a non-convex but robust SVM. The correntropy-induced loss, which is bounded, smooth, and non-convex, is shown to be robust when combined with kernel classifiers. For deep neural networks, as suggested by , the categorical cross-entropy loss is sensitive to label noise, and a comparison of different loss functions tolerant to label noise is given in .
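To see how bounding tames noisy samples, compare the hinge loss with a ramp (truncated hinge) loss; a minimal sketch with an illustrative truncation point s = -1:

```python
import numpy as np

def hinge_loss(margin):
    """Standard (unbounded) hinge loss: grows linearly for wrong margins."""
    return np.maximum(0.0, 1.0 - margin)

def ramp_loss(margin, s=-1.0):
    """Truncated (ramp) hinge loss: identical to hinge for margins above s,
    but capped at 1 - s so a single badly mislabeled sample has bounded
    influence on the decision boundary."""
    return np.minimum(hinge_loss(margin), 1.0 - s)

margins = np.array([2.0, 0.5, -10.0])    # correct, borderline, badly wrong
h = hinge_loss(margins)                  # [0. , 0.5, 11.]
r = ramp_loss(margins)                   # [0. , 0.5,  2.]
```

The badly mislabeled sample (margin -10) contributes a loss of 11 under the hinge but only 2 under the ramp, which is the source of the robustness.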
VI-A2 Noise transition
For a sample $x$ with annotation $\tilde{y}$ (either correct or wrong), we use $y$ to denote its ground-truth (clean) label. The labeling noise can then be modeled probabilistically by $p(\tilde{y} \mid x, y)$, which is usually a complex process. However, we can assume $p(\tilde{y} \mid x, y) = p(\tilde{y} \mid y)$: the noisy label depends only on the true label and not on the sample. This is an approximation of the real-world labeling process but can still be useful in certain scenarios; for example, there are usually some confusable (similar) categories that are hard for human labelers to distinguish, regardless of the specific samples. In this case, we can simply use a noise transition matrix $T$, with $T_{ij} = p(\tilde{y} = j \mid y = i)$, to specify the probability of one label being wrongly annotated as another. Since $p(\tilde{y} = j \mid x) = \sum_{i=1}^{c} T_{ij}\, p(y = i \mid x)$, we can modify the loss function to be:
$\ell(\tilde{y}, x) = -\log \sum_{i=1}^{c} T_{i\tilde{y}}\, p(y = i \mid x),$
where $c$ is the number of classes. The matrix $T$ can be estimated from the data and subsequently fixed during classifier training, jointly estimated with the classifier [263, 127], or estimated with human assistance. Moreover, Vahdat proposes an undirected graphical model to directly model the sample-dependent noise $p(\tilde{y} \mid x, y)$ rather than $p(\tilde{y} \mid y)$.
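A sketch of the resulting "forward-corrected" loss for a single sample (standard forward correction; the two-class transition matrix is our own illustration):

```python
import numpy as np

def corrected_nll(probs, noisy_label, T):
    """Forward-corrected negative log-likelihood under a noise transition
    matrix T, where T[i, j] = p(noisy=j | true=i).

    probs: model posteriors over the clean classes, p(y=i|x); the loss is
    evaluated on p(noisy|x) = T^T p(y|x), so training on noisy labels stays
    consistent with the clean posterior.
    """
    noisy_probs = T.T @ probs
    return float(-np.log(noisy_probs[noisy_label]))

# Two classes that labelers confuse 20% of the time:
T = np.array([[0.8, 0.2],
              [0.2, 0.8]])
probs = np.array([0.9, 0.1])          # model believes class 0
loss = corrected_nll(probs, 0, T)     # -log(0.8*0.9 + 0.2*0.1) = -log(0.74)
```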
Another approach is to explicitly detect and remove the noisy data. An effective strategy is using ensemble learning to filter noisy data. In ensemble learning, different classifiers are complementary to each other; hence, examples that contradict most learners can be confidently identified as noisy. With this kind of approach, the data pruning method is shown to significantly improve generalization performance. For large-scale dataset cleaning, the partitioning filter is proposed for noise identification in large distributed datasets. Another approach directly incorporates noise detection into the objective function of the learning machine; for example, the robust SVM approach uses a binary indicator variable for each sample to explicitly mark it as noisy or clean. In this way, the noisy data can be automatically suppressed and no loss is charged for them during training. A similar idea is also used for learning a distance metric from noisy side information.
| Label Noise | Sample Noise | Outlier Noise |
Reweighting is a soft version of cleaning: assigning small weights to noisy data rather than removing them completely. The work of proposes a reweighting module built on a Siamese network to distinguish clean and noisy labels under iterative learning. Moreover, the cleanNet assigns weights as the sample-to-label relevance calculated from a joint neural embedding network that measures the similarity between a sample and its noisily labeled class. The mentorNet treats the base model as a studentNet, and a mentorNet provides a curriculum (the reweighting scheme) for the studentNet to focus on samples with probably correct labels, which is shown to significantly improve performance on the real-world large-scale noisy dataset WebVision.
We can also correct the labels of noisy data by relabeling during the learning process. In , a probabilistic graphical model is used to capture the relationship between samples, labels, and noise, deducing the true label with an EM-like algorithm. Bootstrapping is also used to relabel noisy data by iteratively updating each label as a convex combination of the original noisy label and the classifier's current prediction. Similarly, Tanaka et al. propose to learn model parameters and true labels alternately under a joint optimization framework.
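The bootstrapping target can be written in one line; a sketch of the "soft" variant with an illustrative beta = 0.8:

```python
import numpy as np

def bootstrap_target(noisy_onehot, pred_probs, beta=0.8):
    """'Soft' bootstrapping target: a convex combination of the original
    (possibly noisy) one-hot label and the model's current prediction.
    As training progresses, confident predictions gradually override
    mislabeled annotations."""
    return beta * np.asarray(noisy_onehot) + (1.0 - beta) * np.asarray(pred_probs)

noisy = np.array([0.0, 1.0, 0.0])        # annotated as class 1
pred = np.array([0.7, 0.2, 0.1])         # model now believes class 0
target = bootstrap_target(noisy, pred)   # [0.14, 0.84, 0.02]
```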
As shown in Table II, different methods can handle different data noises. Although the technical details differ, the purposes of robust loss, cleaning, and reweighting are actually similar, i.e., reducing the influence of noisy data on the learning process; therefore, they can be used for all three noise types. The noise transition and relabeling methods are only suitable for label noise; however, they are efficient and effective strategies for improving robustness when the noise in the data is mainly caused by mislabeling.
VI-B Unsupervised (Self-supervised) Learning
In traditional pattern recognition, unsupervised learning usually refers to data clustering; nowadays, however, more emphasis is placed on unsupervised representation learning, where good and transferrable feature representations are learned from large amounts of unlabeled data. A widely used strategy is self-supervised learning, a specific instance of supervised learning where the targets are generated directly from the data and therefore require no manual labeling.
Since the data are unlabeled, a straightforward strategy is to use them as both the inputs and the targets, learning a compressed representation with an encoder $f$ and a decoder $g$ that minimize the reconstruction error:
$\min \sum_{x} \| x - g(f(x)) \|^{2}.$
The first approach following this idea is principal component analysis (PCA), which learns a linear subspace via $f(x) = W^{\top}x$ and $g(h) = Wh$ with an orthogonal matrix $W$ for projection. The restricted Boltzmann machine and the auto-encoder are also reconstruction-based methods and can be viewed as nonlinear extensions of PCA. Since then, various improvements have been proposed, such as the denoising auto-encoder, contractive auto-encoder, variational auto-encoder, and so on. The split-brain auto-encoder splits the model into two disjoint sub-networks for cross-channel prediction, which transforms the reconstruction objective into a prediction-based one, making the learned feature representation more semantic and meaningful.
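To make the PCA encoder/decoder concrete, here is a small NumPy sketch (our own illustration) that encodes with the top-k principal directions and decodes back to the input space:

```python
import numpy as np

def pca_reconstruct(X, k):
    """Encode with f(x) = W^T x and decode with g(h) = W h, where W holds
    the top-k principal directions (an orthonormal projection basis)."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k].T                          # d x k, orthonormal columns
    H = Xc @ W                            # encoder: compressed representation
    return H @ W.T + X.mean(axis=0)       # decoder: back to input space

rng = np.random.default_rng(0)
# Data with one dominant direction and one minor direction:
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [0.0, 0.1]])
err1 = np.mean((X - pca_reconstruct(X, 1)) ** 2)
err2 = np.mean((X - pca_reconstruct(X, 2)) ** 2)
# Keeping all components reconstructs exactly; fewer components lose energy.
```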
VI-B2 Pseudo label with clustering
We can also assign pseudo labels to the data and then transform the problem into a supervised learning task. For unlabeled data, a natural idea is to use a clustering algorithm to partition the data into different clusters, and then view the cluster identities as pseudo labels for learning representations. However, a challenge is that clustering relies heavily on a good representation, while conversely the learning of the representation also requires good clustering results as supervision, resulting in a chicken-or-egg problem. To solve this, an alternating learning strategy [10, 314] can be used for the joint unsupervised learning of representations and clusters.
VI-B3 Pseudo label with exemplar learning
Besides clustering, the method of exemplar learning views each sample as a particular class: the pseudo label is now the sample identity, and the purpose is to separate all training samples from each other as much as possible. The exemplar-CNN treats each patch (with random transformations) of an unlabeled image as a particular class, and the classifier is trained to separate all these classes. In this way, the learned representation not only ensures that different patches can be distinguished but also enforces invariance to the specified transformations. Another approach uses noise as targets to learn the representation and a one-to-one matching of training samples to uniformly sampled vectors for separating every training instance. In exemplar learning, each instance is treated as a distinct class of its own; therefore, the number of classes equals the size of the entire training set, and the computational challenges imposed by this large number of classes need to be carefully considered. Recently, momentum contrast (MoCo) proposes using a dictionary as a queue and a momentum update mechanism to efficiently and effectively realize the idea of exemplar learning, and shows that the gap between unsupervised and supervised representation learning can be closed in many vision tasks.
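Instance discrimination in exemplar-style methods such as MoCo is typically driven by an InfoNCE-type contrastive loss; a minimal NumPy sketch (the temperature 0.07 is a common choice, and the tiny example vectors are our own):

```python
import numpy as np

def info_nce(query, key_pos, key_negs, tau=0.07):
    """InfoNCE loss for instance discrimination: the query embedding should
    match its own positive key against a queue of negative keys."""
    def norm(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    q, kp, kn = norm(query), norm(key_pos), norm(key_negs)
    logits = np.concatenate(([q @ kp], kn @ q)) / tau   # positive first
    logits = logits - logits.max()                      # numerical stability
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))

q = np.array([1.0, 0.0])
loss_easy = info_nce(q, np.array([1.0, 0.0]), np.array([[0.0, 1.0]]))
loss_hard = info_nce(q, np.array([0.0, 1.0]), np.array([[1.0, 0.0]]))
# Matching the correct (positive) key gives a much smaller loss.
```

In MoCo, the negative keys would come from a queue of past embeddings produced by a momentum-updated encoder rather than fixed vectors.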
VI-B4 Surrogate tasks for computer vision
Recently, another interesting trend in unsupervised learning is seeking help from surrogate tasks for which the labels or targets come for "free" with the data. For example, as shown in Fig. 10: learning to colorize grayscale images [322, 155], learning by inpainting (generating the contents of a missing area of an image conditioned on its surroundings), learning by context prediction (predicting the position of one patch relative to another in an image), learning by solving jigsaw puzzles (geometric rearrangement of randomly permuted patches), learning by predicting image rotations, and so on. For all these tasks, the supervisory signal can be obtained automatically, and therefore there is no need to worry about the insufficiency or labeling of training data.
Although these tasks seem simple, doing well on them requires the model to learn meaningful and semantic representations. For example, for inpainting the model needs to understand the content of the image to produce a plausible hypothesis for the missing part, and in context prediction the model should learn to recognize objects and their parts in order to predict relative positions. Therefore, the representations learned from these surrogate tasks can transfer well to more complicated tasks. Similar approaches can also be found for video-related tasks [299, 7].
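For instance, the rotation-prediction task generates its labels for free; a minimal sketch of the data construction:

```python
import numpy as np

def rotation_task(images):
    """Build a free self-supervised dataset: each image is rotated by
    0/90/180/270 degrees and labeled with the rotation index (0..3).
    A classifier trained on these labels must understand object
    orientation, and hence object appearance."""
    rotated, labels = [], []
    for img in images:
        for k in range(4):
            rotated.append(np.rot90(img, k))
            labels.append(k)
    return np.stack(rotated), np.array(labels)

imgs = np.arange(16, dtype=float).reshape(1, 4, 4)   # one toy 4x4 "image"
X, y = rotation_task(imgs)                           # 4 samples, labels 0..3
```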
VI-B5 Surrogate tasks for natural language
The idea of using surrogate tasks to learn representations from unlabeled data can also be applied to other domains like natural language processing, such as the language model that predicts what comes next in a sequence [51, 222]. BERT proposes two novel surrogate tasks for unsupervised learning. The first is the masked language model, which randomly masks some tokens of the input, with the objective of predicting the original vocabulary identity of each masked word from its context. The second is next sentence prediction, a binary classification task to predict whether one sentence follows another. These strategies have achieved new benchmark performance on eleven language tasks, indicating that unsupervised pre-training has become an important integral part of language understanding.
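The masked-language-model targets come "for free" from raw text; a toy sketch (the 15% masking ratio follows BERT, everything else, including the function name, is our own illustration):

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, ratio=0.15, seed=0):
    """Masked-language-model targets: hide a fraction of tokens and let the
    model predict the originals from context, so no manual labels needed."""
    rng = random.Random(seed)
    n = max(1, int(len(tokens) * ratio))             # mask at least one token
    positions = sorted(rng.sample(range(len(tokens)), n))
    masked = list(tokens)
    targets = {}                                     # position -> original token
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK
    return masked, targets

masked, targets = mask_tokens("the cat sat on the mat".split())
```

(Real BERT additionally replaces some selected tokens with random words or leaves them unchanged; that refinement is omitted here.)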
Due to the abundance of available unlabeled data, unsupervised or self-supervised learning is a long-pursued objective of representation learning. In addition to the methods discussed above, we expect more effective and interesting self-supervised methods in the future. Besides classification, self-supervised learning can also be used for regression problems. Since different approaches have been proposed from different aspects, combining multiple self-supervised methods through multi-task learning (Section V-C) is an inspiring direction.
VI-C Semi-supervised Learning
Semi-supervised learning (SSL) deals with a small number of labeled data and a large amount of unlabeled data simultaneously, and can therefore be viewed as a combination of supervised and unsupervised learning. In the literature, a wide variety of methods have been proposed for SSL, and comprehensive surveys can be found in and . Nowadays, new progress, especially deep learning based approaches, has become the new state of the art; therefore, in this section we focus on recent advances in SSL.
A straightforward strategy for SSL is combining a supervised loss on labeled data and an unsupervised loss on unlabeled data into a new objective function. The ladder network combines the denoising auto-encoder (as unsupervised learning for every layer) with supervised learning at the top layer. The stacked what-where auto-encoder uses a convolutional net for encoding and a deconvolutional net for decoding to simultaneously minimize a combination of supervised and reconstruction losses. Similarly, Zhang et al. take a segment of the classification network as the encoder and use a mirrored architecture as the decoding pathway to build several auto-encoders for SSL.
Another approach uses an initial classifier to predict pseudo labels for unlabeled data and then retrains the classifier with all the data; this process is repeated iteratively to boost both the accuracy of the pseudo labels and the performance of the classifier. Following this idea, self-training uses a single model to predict the pseudo label as the class with the maximum predicted probability, which is equivalent to entropy minimization in SSL. Co-training uses two different models to label unlabeled data for each other. Moreover, tri-training utilizes bootstrap sampling to obtain three different training sets for building three different models; for example, in the tri-net approach, three different modules are learned, and if two modules confidently agree on the prediction for an unlabeled sample, they teach the third module on this sample. The self-/co-/tri-training strategies are efficient to implement and effective for SSL.
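A minimal self-training loop (our own toy example, using a nearest-class-mean base classifier and a confidence threshold):

```python
import numpy as np

def self_train(Xl, yl, Xu, rounds=5, thresh=0.9):
    """Self-training: iteratively pseudo-label the confident unlabeled
    samples with the current classifier and retrain on the enlarged set."""
    Xl, yl = Xl.copy(), yl.copy()
    for _ in range(rounds):
        # "Train": nearest-class-mean classifier on current labeled set.
        means = np.stack([Xl[yl == c].mean(axis=0) for c in np.unique(yl)])
        d = ((Xu[:, None, :] - means[None]) ** 2).sum(-1)
        p = np.exp(-d) / np.exp(-d).sum(1, keepdims=True)   # soft scores
        conf, pseudo = p.max(1), p.argmax(1)
        keep = conf > thresh                 # only confident pseudo labels
        if not keep.any():
            break
        Xl = np.vstack([Xl, Xu[keep]])
        yl = np.concatenate([yl, pseudo[keep]])
        Xu = Xu[~keep]
    return Xl, yl

Xl = np.array([[0.0, 0.0], [10.0, 10.0]])
yl = np.array([0, 1])
Xu = np.array([[0.5, 0.2], [9.5, 9.9], [0.1, 0.4]])
Xl2, yl2 = self_train(Xl, yl, Xu)   # all three get confident pseudo labels
```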
VI-C3 Generative model
Generative models are another widely used strategy for SSL. For example, the Gaussian mixture model can maximize the joint likelihood of both labeled and unlabeled data using the EM algorithm. The variational auto-encoder (VAE) can be used for SSL by treating the labels as additional latent variables. A recent trend is using the generative adversarial network (GAN) for SSL by setting up an adversarial game between a discriminator $D$ and a generator $G$. In the original GAN, $D$ is a binary classifier. To apply GAN for SSL, $D$ is modified to be a $c$-class model or extended to $c+1$ classes ($c$ real classes and one fake class). For labeled data, $D$ should minimize their supervised loss. For unlabeled data, $D$ is trained to minimize their uncertainty (e.g., by entropy minimization) over the $c$ classes. Moreover, $D$ should also try to distinguish the generated samples, either by maximizing their entropy (uncertainty) over the $c$ classes or by classifying them into the additional fake class. Meanwhile, $G$ is trained in the opposite direction to generate realistic samples for the $c$ classes. Using GAN for SSL has two advantages. First, it can generate synthetic samples of different classes, which serve as additional training data. Second, even bad examples from the generator benefit SSL, because they lie in low-density areas of the manifold and thus guide the classifier to better locate the decision boundary.
Most deep learning models utilize randomness to improve generalization. Therefore, multiple passes of an individual sample through the network may lead to different predictions, and the inconsistency between them can be used as the loss for unlabeled data. Let $f(x;\theta,\xi)$ be a model with parameters $\theta$ and randomness $\xi$ (data augmentation, dropout, and so on). The mean squared error $\|f(x;\theta,\xi_1) - f(x;\theta,\xi_2)\|^2$ can then be minimized to reduce the inconsistency of the predictions. This is denoted as the $\Pi$-model in , where each sample is evaluated twice and the difference between the predictions is minimized.
Actually, this can also be explained from the teacher-student viewpoint. For each unlabeled sample $x$, a teacher prediction $\tilde{z}$ is used to guide the student prediction $z = f(x;\theta,\xi)$ by minimizing
$\| z - \tilde{z} \|^{2}.$
In the $\Pi$-model, the teacher is simply another evaluation with a different perturbation $\xi'$. However, a single evaluation can be very noisy; therefore, temporal ensembling proposes the use of an exponential moving average (EMA) of the predictions to form the teacher:
$\tilde{z}_t = \alpha \tilde{z}_{t-1} + (1-\alpha) z_t,$
where $\alpha$ is a momentum, so each unlabeled sample has a teacher that is the temporal ensemble of its previous predictions. Moreover, the mean teacher proposes to average model parameters rather than predictions, i.e., the teacher uses EMA parameters $\theta'$ of the student model $\theta$:
$\theta'_t = \alpha \theta'_{t-1} + (1-\alpha) \theta_t,$
and now the teacher is $f(x;\theta',\xi)$, the same model as the student but with historically averaged parameters. These perturbation-based approaches smooth the predictions on unlabeled data. Moreover, instead of random perturbations, virtual adversarial training can be used to find a worst-case perturbation for better SSL.
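The EMA parameter update and the consistency loss can be sketched in a few lines (illustrative NumPy, with a constant "student" to show that the teacher converges toward it):

```python
import numpy as np

def ema_update(teacher, student, alpha=0.99):
    """Mean-teacher parameter update: new_teacher = a*teacher + (1-a)*student."""
    return alpha * teacher + (1.0 - alpha) * student

def consistency_loss(student_pred, teacher_pred):
    """Mean squared inconsistency between student and teacher predictions,
    used as the unsupervised loss on unlabeled data."""
    return float(np.mean((student_pred - teacher_pred) ** 2))

theta_t = np.zeros(3)                  # teacher parameters
for step in range(1000):
    theta_s = np.ones(3)               # student has converged to all-ones
    theta_t = ema_update(theta_t, theta_s)
# After many steps the teacher closely tracks the student.
```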
VI-C5 Global consistency
The perturbation-based approach actually seeks a kind of local consistency: samples that are close in the input space (due to perturbation) should also be close in the output space. However, a more important idea is global consistency: samples forming an underlying structure should have similar predictions. To better utilize global consistency, traditional graph-based SSL is combined with deep learning by dynamically creating a graph in the latent space for each batch to model the data manifold, and then regularizing with the manifold for a more favorable state of class separation. Another interesting approach, learning by association, considers global consistency from a different perspective: imagine a walker that moves from labeled data to unlabeled data according to similarities calculated from the latent representation, and then walks back from unlabeled data to labeled data. Correct walks that start and end at the same class are encouraged, while wrong walks that end at a different class are penalized. This cycle-consistent association from labeled data to unlabeled data and back can be efficiently modeled with transition probabilities and is therefore an effective strategy for pursuing global consistency in SSL.
In SSL, the massive unlabeled data and the scarce labeled data together reveal the underlying manifold of the entire dataset, and by making the predictions for all samples smooth on this manifold, a more accurate decision boundary can be obtained than with purely supervised learning. Reconstruction-based methods learn a better representation from unlabeled data, while pseudo labels can be used for iterative training on unlabeled data. Recently, a new trend is using GAN to model the distribution of labeled and unlabeled data for SSL. Moreover, perturbation-based methods utilize the randomness in deep neural networks to seek local consistency on unlabeled data, while effectively enforcing global consistency in deep learning based SSL still needs more exploration.
VI-D Few-shot and Zero-shot Learning
In human intelligence, we can instantly learn a novel concept by observing only a few examples of a particular class. However, state-of-the-art approaches in machine intelligence are usually highly data-hungry. This difference has inspired an important research topic of few-shot learning (FSL). In FSL, we have a many-shot (base) dataset and a few-shot (novel) dataset (usually K-shot N-way: K labeled samples for each of N classes, where K is small, like 1 or 5), and the label spaces of the two datasets are disjoint. The purpose of FSL is to extract transferable knowledge from the many-shot dataset to help us perform better on the few-shot one, as illustrated in Fig. 11.
VI-D1 Metric learning
Under the principle that test and training conditions must match, episode based training is used to mimic the K-shot N-way setting. Specifically, in each training iteration an episode is formed by randomly selecting N classes with K samples per class from the many-shot dataset to act as the support set $S$, and meanwhile a fraction of the remaining samples of those classes are selected as the query set. This support/query split is designed to simulate the situation in real few-shot applications. The purpose now is to define the probability $P(y \mid x, S)$, which can be viewed as a point-to-set metric. Although the underlying classes differ between the many-shot and few-shot datasets, the learned $P(y \mid x, S)$ is hoped to be transferable between them. For example, in , a deep neural network embedding space is learned to compute such a metric. To learn a better embedding (or metric), a memory module can be used to explicitly encode the whole support set into memory for defining $P(y \mid x, S)$. Moreover, different criteria can be used to learn the metric, such as the mean squared error and the ranking loss. Since each support set is designed to be few-shot, the metric learned in this way can be transferred to unseen categories that also have only a few examples.
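A toy sketch of episode sampling and a simple point-to-set metric (classification by nearest class prototype, in the spirit of prototypical networks; the names and the synthetic data are our own illustration):

```python
import numpy as np

def episode(X, y, n_way, k_shot, n_query, rng):
    """Sample one K-shot N-way episode: a support set and a query set,
    with class labels remapped to 0..n_way-1 within the episode."""
    classes = rng.choice(np.unique(y), size=n_way, replace=False)
    sup_x, sup_y, qry_x, qry_y = [], [], [], []
    for i, c in enumerate(classes):
        idx = rng.permutation(np.flatnonzero(y == c))
        sup_x.append(X[idx[:k_shot]])
        sup_y += [i] * k_shot
        qry_x.append(X[idx[k_shot:k_shot + n_query]])
        qry_y += [i] * n_query
    return np.vstack(sup_x), np.array(sup_y), np.vstack(qry_x), np.array(qry_y)

def prototype_predict(sup_x, sup_y, qry_x):
    """Point-to-set metric: classify each query by its nearest class mean."""
    protos = np.stack([sup_x[sup_y == c].mean(0) for c in np.unique(sup_y)])
    d = ((qry_x[:, None, :] - protos[None]) ** 2).sum(-1)
    return d.argmin(1)

rng = np.random.default_rng(0)
# Five well-separated synthetic classes, 20 samples each:
X = np.vstack([rng.normal(c * 10.0, 0.1, size=(20, 2)) for c in range(5)])
y = np.repeat(np.arange(5), 20)
sx, sy, qx, qy = episode(X, y, n_way=3, k_shot=5, n_query=5, rng=rng)
acc = (prototype_predict(sx, sy, qx) == qy).mean()
```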
VI-D2 Learning to learn (meta-learning)
Learning to learn, or meta-learning, trains another learner at a higher level to guide the original learner. In , the meta-learner is defined as the transformation from the model parameters learned from few samples to the model parameters learned from sufficiently many samples, which can be viewed as model-to-model regression. Another approach instead uses sample-to-model regression as the meta-learner, transforming each single sample directly into classifier parameters. Denote the original model as $f(x;\theta)$, where $\theta$ represents the parameters to be learned; instead of learning $\theta$ directly, a meta-learner $g$ is used in to map a sample $z$ to $\theta$. The model then becomes $f(x; g(z))$ and the task changes to learning $g$. The meta-learner can predict any parameters of another network, such as linear layers or convolutional layers, and once learned, the parameters for any novel category can be predicted by a simple forward pass. Besides predicting parameters, other meta-learning methods, such as learning the optimization algorithm or the initialization, can also be used for FSL. Since the tasks considered in meta-learning are category-agnostic, they transfer well to new few-shot categories.
| Open-world | Evaluation Metric | Representative Performance |
| Section IV-A | Classification accuracy: the ratio of the number of correctly classified patterns to the total number of patterns, evaluated on a test dataset different from the training dataset. | On the benchmark 1000-class ImageNet dataset, human-level accuracy is 94.9% (top-5), and ResNet achieves 96.43%. On the smaller 10-class MNIST dataset, it is common to achieve more than 99% accuracy (top-1). |
| Section IV-B | Rejection performance: a threshold is used to distinguish normal and abnormal patterns; to evaluate overall performance, a threshold-independent metric such as area under curve is used. | For detecting notMNIST from MNIST, the AUROC (area under the receiver operating characteristic curve) is 85% and the AUPR (area under the precision-recall curve) is 86%. |
| Section IV-C | Adversarial robustness: let $\Delta(x;f) = \arg\min_r \|r\|_2$ subject to $f(x+r) \neq f(x)$; the robustness of classifier $f$ is $\rho_{adv}(f) = \mathbb{E}_x[\|\Delta(x;f)\|_2 / \|x\|_2]$, where $\mathbb{E}_x$ is the expectation over data. | On ILSVRC the adversarial robustness is on the order of $10^{-3}$ for CaffeNet and GoogLeNet, indicating that a perturbation roughly a thousandth the magnitude of the original image is sufficient to fool state-of-the-art deep neural networks. |
| Section IV-D | Class-incremental capacity: the trend of classification accuracy as the number of classes increases. | On ILSVRC, as the number of classes is incremented from 100 to 1000, the accuracy drops from about 90% to 45%. |
|Non-i.i.d.|Evaluation Metric|Representative Performance|
|---|---|---|
|Section V-A|Contextual learning ability: the performance improvement gained by learning from a group of interdependent patterns.|By integrating geometric and linguistic contexts,  improved the correct rate of handwritten Chinese text recognition from 69% to 91%.|
|Section V-B|Adaptability and transferability: the performance of adaptation and transfer between different (i.e., source and target) domains that have different data distributions.|Through writer adaptation with style transfer mapping,  achieved more than a 30% error reduction rate on the large-scale handwriting recognition database CASIA-OLHWDB.|
|Section V-C|Multi-task cooperation ability: the gain from considering multiple tasks simultaneously compared to handling them independently.|The taskonomy  reduced the number of labeled samples needed for solving 10 tasks by roughly two thirds (compared to training independently) while keeping the performance nearly the same.|
|Section V-D|Multi-modal fusion ability: the performance boost from utilizing the complementary information in different modalities.|By fusing multiple modalities at several spatial and temporal scales,  won first place out of 17 teams on the ChaLearn 2014 “looking at people challenge gesture recognition track”.|
|Noisy Small Data|Evaluation Metric|Representative Performance|
|---|---|---|
|Section VI-A|Noisy data tolerance: the stability of classification performance when the training data contain a certain percentage of noise.|On CIFAR10, the accuracy with clean data is 93%; when 30% of the data are noisy, the accuracy deteriorates to 72%, and a noise-robust model  can recover it to 91%.|
|Section VI-B|Self-supervised capability: the performance of learning from purely unlabeled data under some self-supervised mechanism.|Using instance discrimination as the self-supervision, the unsupervised MoCo  outperformed its supervised pre-training counterpart in 7 vision tasks on many datasets.|
|Section VI-C|Semi-supervised capability: the performance of joint learning from massive unlabeled data and scarce labeled data.|On ImageNet, the accuracy of supervised learning (100% labeled data) is 96%, while a semi-supervised model  (10% labeled and 90% unlabeled data) could achieve 91% accuracy.|
|Section VI-D|Few-shot generalization: the ability to transfer knowledge from learned old classes to new classes with few or even zero data.|On miniImageNet  with 5 classes, the 1-shot (one sample per class) accuracy is 49% and the 5-shot accuracy is 68%. On CUB  with 50 classes, the 0-shot (no samples, but side information in the form of attributes is available) accuracy is 55%.|
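The adversarial robustness metric in the table (Section IV-C row) can be estimated empirically as the average perturbation-to-input norm ratio. A minimal numpy sketch, where the minimal perturbations are synthetic stand-ins for those an attack such as DeepFool would actually find:

```python
import numpy as np

def adversarial_robustness(xs, rs):
    """Empirical robustness: mean of ||r||_2 / ||x||_2 over the data,
    where r is the smallest perturbation found that flips f's prediction."""
    ratios = [np.linalg.norm(r) / np.linalg.norm(x) for x, r in zip(xs, rs)]
    return float(np.mean(ratios))

# Toy check: perturbations one-thousandth the size of the inputs,
# mimicking the order of magnitude reported for ILSVRC models.
rng = np.random.default_rng(1)
xs = [rng.normal(size=100) for _ in range(5)]
rs = [1e-3 * x for x in xs]       # synthetic minimal perturbations
rho = adversarial_robustness(xs, rs)
```

In practice each `r` would be computed per input by a minimal-perturbation attack; the averaging step itself is all this sketch shows.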
VI-D3 Learning with imagination
Another explanation for the FSL ability of humans is that we can easily visualize or imagine what a novel object should look like from different views, even though we have seen only a few examples. This has inspired learning with imagination, which produces additional training examples. As pointed out by , the challenge of FSL is that the few examples can capture only a small part of the category’s intra-class variation. To address this, we can use a many-shot dataset to learn the intra-class transformations of samples and then augment the few-shot samples along these transformations . In another work , a hallucinator is trained to take a single example of a category and produce further examples that expand the training set. The hallucinator is trained jointly with the classification model, and its goal is to help the algorithm learn a better classifier , which differs from other data generation models such as GANs , whose goal is to generate realistic examples. In effect, the benefit of learning with imagination comes from recovering the intra-class variation that is missing in FSL.
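As a toy illustration of this idea, one can transplant intra-class deltas estimated from a many-shot base class onto the few novel examples. This numpy sketch is a hypothetical stand-in for a learned hallucinator, not the jointly trained model of the cited work:

```python
import numpy as np

rng = np.random.default_rng(0)

# Intra-class variation estimated from a many-shot ("base") class:
# deltas between each sample and the class mean, in feature space.
base = rng.normal(loc=2.0, size=(200, 16))
deltas = base - base.mean(axis=0)

def hallucinate(few_shot, deltas, n_new, rng):
    """Expand a few-shot set by adding base-class intra-class deltas
    to randomly chosen novel examples."""
    seeds = few_shot[rng.integers(len(few_shot), size=n_new)]
    moves = deltas[rng.integers(len(deltas), size=n_new)]
    return seeds + moves

novel = rng.normal(loc=-1.0, size=(3, 16))  # 3 examples of a novel class
augmented = np.vstack([novel, hallucinate(novel, deltas, 50, rng)])
```

The augmented set now carries variation the three original samples could not express, which is exactly what the imagined examples are meant to supply.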
VI-D4 Zero-shot learning
An extreme case of FSL is zero-shot learning (ZSL) , where there are no examples at all for the novel categories. In this case, some side information is needed to transfer knowledge from previously learned categories to novel ones (Fig. 11), including attributes , class names , word vectors , text descriptions , and so on. In attribute-based ZSL , attributes are typically nameable properties that are either present or absent for a certain category. In this way, multiple binary attribute-specific classifiers can be trained independently. For a new class, training samples are then no longer required; we only need the attribute associations for this class, and a test sample can be classified effectively by checking its predicted attributes. Besides user-defined attributes, learning latent attributes  and class-attribute associations  can further improve performance. A more general approach to ZSL, suitable for different kinds of side information, is the embedding-based approach , where two embedding networks are learned, one for samples and one for side information, and the similarity between them is measured using Euclidean, cosine, or manifold distance . In the embedding space, nearest-neighbor search (cross-modal matching) can then be used efficiently for ZSL. A third approach to ZSL is to synthesize class-specific samples for unseen classes conditioned on their side information , which can be implemented in various ways: data generation at the feature level  or the sample level , using variational auto-encoders  or generative adversarial networks , conditioned on attributes  or text descriptions , and so on.
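A minimal sketch of the attribute-based route, with a made-up three-class attribute table; in a real system the attribute predictions would come from classifiers trained on seen classes:

```python
import numpy as np

# Hypothetical class-attribute table for unseen classes
# (rows: classes, columns: binary attributes such as "striped",
#  "has wings", "aquatic").
class_names = ["zebra", "eagle", "dolphin"]
class_attrs = np.array([
    [1, 0, 0],   # zebra
    [0, 1, 0],   # eagle
    [0, 0, 1],   # dolphin
], dtype=float)

def zsl_classify(pred_attrs, class_attrs, class_names):
    """Match predicted attributes to the nearest class attribute vector
    (nearest-neighbor search in the shared attribute space)."""
    dists = np.linalg.norm(class_attrs - pred_attrs, axis=1)
    return class_names[int(np.argmin(dists))]

# Attribute classifiers might output something like this for a test image:
pred = np.array([0.9, 0.1, 0.2])  # "mostly striped"
label = zsl_classify(pred, class_attrs, class_names)  # -> "zebra"
```

No training sample of any unseen class was used: only the class-attribute associations, which is what makes the classification zero-shot.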
Building a good model entirely from scratch with a small number of observations is difficult; in fact, the FSL abilities of human beings rest on our abundant prior experience with related tasks. Therefore, as pointed out by : the key insight for FSL is that the categories we have already learned can give us information that helps us to learn new categories with fewer examples. FSL can thus be viewed as cross-class transfer learning. Moreover, humans are good at ZSL because we have other knowledge sources (like books and the Internet) from which we can infer what a new category looks like, so ZSL is more like cross-modal learning (Section V-D). Although many approaches have been proposed in the literature, few-shot and zero-shot learning remain urgently needed skills for machine intelligence.
VII Concluding Remarks
This paper considers the robustness of pattern recognition from the perspective of three basic assumptions, which are reasonable in controlled laboratory environments for pursuing high accuracy but become unstable and unreliable in real-life applications. To improve robustness, we present a comprehensive literature review of approaches that try to break these assumptions:
For breaking the closed-world assumption, we partition the open space into four components: the known known, corresponding to the empirical risk; the known unknown, corresponding to the outlier risk; the unknown known, corresponding to the adversarial risk; and the unknown unknown, corresponding to the open-class risk.
For breaking the independent and identically distributed assumption, we first discuss the problems in learning with interdependent data, then review recent advances in domain adaptation and transfer learning, and finally analyze multi-task and multi-modal learning for increasing the diversity of both the output and the input of the system.
For breaking the clean and big data assumption, we first introduce supervised learning with noisy data, then review un/self-supervised and semi-supervised learning for learning from unlabeled data and surrogate supervision, and lastly discuss few-shot and zero-shot learning for transferring knowledge from big data to small data.
With the above approaches, we can improve the robustness of a pattern recognition system by: growing continuously with changing concepts in an open world, adapting smoothly to changing environments under non-identical conditions, and learning stably with changing resources under different data quality and quantity. These are fundamental issues in robust pattern recognition, because such changing factors usually have a large effect on the stability of the final performance in practice. Furthermore, in the continuous use of a pattern recognition system, rather than remaining a static model, how to make it evolvable through lifelong learning [215, 210] or never-ending learning  is an important step towards real intelligence. By breaking the three basic assumptions, we can eliminate the main obstacles to reaching this goal.
Unlike traditional closed-world classification, which is usually evaluated by accuracy, how to evaluate recognition performance in open and changing environments is a big issue. Besides accuracy, other evaluation metrics reflecting the ability to deal with the changing factors are more important. As shown in Table III, when considering these other evaluation metrics (different from classification accuracy), it is obvious that pattern recognition is far from solved. Moreover, in the research community, different tasks are usually evaluated with different metrics and databases. How to build a general benchmark for evaluating robustness by integrating different metrics together is an important future task for pattern recognition.
Different from the widely used empirical risk minimization, a theoretical analysis unifying the different open-world risks will become the foundation of future classifier design. A future pattern recognition system should acquire complementary information from interdependent data in different modalities and boost itself through the cooperation of multiple tasks by adaptively learning from few labeled, unlabeled, or noisy data. Although many attempts have been made in the literature, most of them try to solve a single problem from a single perspective. However, the three basic assumptions are actually related, and their joint consideration raises many new research problems, such as open-world domain adaptation , open-world semi-supervised learning , cross-modal domain adaptation , multi-task self-supervised learning , few-shot domain adaptation , and so on. Future research on a unified framework that deals with the open-world, non-i.i.d., and noisy-and-small-data issues simultaneously is the ultimate goal of robust pattern recognition.
Besides the robustness issues, many other problems are also important for pattern recognition. One example is the interpretability  of the model: besides high accuracy, the system also needs to explain why a prediction is made, to increase our confidence and safety in trusting the result. Some traditional classifiers, such as decision trees and logistic regression, are interpretable, but how to make other models, especially black-box deep neural networks, explainable is an important task. Another important issue is computational efficiency . Besides big data, strong computing power is also a key to the success of modern pattern recognition technologies. To widen the application scope and reduce resource consumption, the compression and acceleration of pattern recognition models are of great value for practical applications. Since pattern recognition can be viewed as simulating the perception ability of the human brain, which enables machines to recognize objects or events in sensing data, how to learn effectively from neuroscience to develop brain-inspired , biologically plausible  or psychophysics-driven  pattern recognition models is an inspiring future direction. With more attention and effort devoted to these important issues in pattern recognition, the gap between human intelligence and machine intelligence can be narrowed in the foreseeable future.
Appendix A Background References
Outlier Detection 
Adversarial Example 
Open Set Recognition 
Class-incremental Learning 
Contextual Learning 
Domain Adaptation 
Transfer Learning 
Multi-task Learning 
Multi-modal Learning 
Learning with Noise 
Representation Learning 
Self-supervised Learning 
Semi-supervised Learning 
Few-shot Learning 
Zero-shot Learning 
-  Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for image classification. IEEE Trans. Pattern Anal. Mach. Intell., 38(7):1425–1438, 2016.
-  Z. Al-Halah, M. Tapaswi, and R. Stiefelhagen. Recovering the missing link: Predicting class-attribute associations for unsupervised zero-shot learning. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 5975–5984, 2016.
-  T. Almaev, B. Martinez, and M. Valstar. Learning to transfer: transferring latent task structures and its application to person-specific facial action unit detection. In Int. Conf. Comput. Vis., pages 3774–3782, 2015.
-  J. Amores. Multiple instance classification: Review, taxonomy and comparative study. Artificial Intelligence, 201:81–105, 2013.
-  G. Andrew, R. Arora, J. Bilmes, and K. Livescu. Deep canonical correlation analysis. In Int. Conf. Mach. Learn., pages 1247–1255, 2013.
-  A. Angelova, Y. Abu-Mostafam, and P. Perona. Pruning training sets for learning of object categories. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 494–501, 2005.
-  R. Arandjelovic and A. Zisserman. Look, listen and learn. In Int. Conf. Comput. Vis., pages 609–617, 2017.
-  H. Azizpour, A. Razavian, J. Sullivan, A. Maki, and S. Carlsson. Factors of transferability for a generic ConvNet representation. IEEE Trans. Pattern Anal. Mach. Intell., 38(9):1790–1802, 2016.
-  T. Baltrusaitis, C. Ahuja, and L. Morency. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell., 41(2):423–443, 2019.
-  M. Bautista, A. Sanakoyeu, E. Sutter, and B. Ommer. CliqueCNN: Deep unsupervised exemplar learning. In Advances Neural Inf. Process. Syst., pages 3846–3854, 2016.
-  A. Bendale and T. Boult. Towards open world recognition. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 1893–1902, 2015.
-  A. Bendale and T. Boult. Towards open set deep networks. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 1563–1572, 2016.
-  Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798–1828, 2013.
-  Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155, 2003.
-  Y. Bengio, D.-H. Lee, J. Bornschein, T. Mesnard, and Z. Lin. Towards biologically plausible deep learning. arXiv:1502.04156, 2015.
-  L. Bertinetto, J. Henriques, J. Valmadre, P. Torr, and A. Vedaldi. Learning feed-forward one-shot learners. In Advances Neural Inf. Process. Syst., pages 523–531, 2016.
-  H. Bilen and A. Vedaldi. Integrated perception with recurrent multi-task neural networks. In Advances Neural Inf. Process. Syst., pages 235–243, 2016.
-  C. Bishop. Neural Networks for Pattern Recognition. Oxford university press, 1995.
-  C. Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108–116, 1995.
-  C. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York, 2006.
-  A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Annual Conf. Comput. Learn. Theory, pages 92–100, 1998.
-  P. Bojanowski and A. Joulin. Unsupervised learning by predicting noise. In Int. Conf. Mach. Learn., pages 517–526, 2017.
-  L. Breiman. Classification and Regression Trees. Routledge, 2017.
-  C. Brodley and M. Friedl. Identifying mislabeled training data. J. Artif. Intell. Res., 11:131–167, 1999.
-  J. Brooks. Support vector machines with the ramp loss and the hard margin loss. Operations Research, 59(2):467–479, 2011.
-  H. Bunke and K. Riesen. Recent advances in graph-based pattern recognition with applications in document analysis. Pattern Recognition, 44(5):1057–1067, 2011.
-  H. Bunke and A. Sanfeliu. Syntactic and Structural Pattern Recognition - Theory and Applications. World Scientific, 1990.
-  P. Busto and J. Gall. Open set domain adaptation. In Int. Conf. Comput. Vis., pages 754–763, 2017.
-  Q. Cai, Y. Pan, T. Yao, C. Yan, and T. Mei. Memory matching networks for one-shot image recognition. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 4080–4088, 2018.
-  J. Cao, Y. Li, and Z. Zhang. Partially shared multi-task convolutional neural network with local constraint for face attribute learning. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 4290–4299, 2018.
-  N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium Security Privacy, pages 39–57, 2017.
-  R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
-  H. Cevikalp. Best fitting hyperplanes for classification. IEEE Trans. Pattern Anal. Mach. Intell., 39(6):1076–1088, 2017.
-  V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing Surveys, 41(3):1–58, 2009.
-  I. Chang and M. Loew. Pattern recognition with new class discovery. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 438–443, 1991.
-  O. Chapelle, B. Scholkopf, and A. Zien. Semi-Supervised Learning. MIT Press, 2006.
-  D.-D. Chen, W. Wang, W. Gao, and Z.-H. Zhou. Tri-net for semi-supervised deep learning. In Int. Joint Conf. Artif. Intell., 2018.
-  S. Chen, Q. Jin, J. Zhao, and S. Wang. Multimodal multi-task learning for dimensional and continuous emotion recognition. In ACM Annual Workshop on Audio/Visual Emotion Challenge, pages 19–26, 2017.
-  Y. Chen, J. Bi, and J. Wang. MILES: Multiple-instance learning via embedded instance selection. IEEE Trans. Pattern Anal. Mach. Intell., 28(12):1931–1947, 2006.
-  Z. Chen, V. Badrinarayanan, C.-Y. Lee, and A. Rabinovich. GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In Int. Conf. Mach. Learn., pages 1–10, 2018.
-  C. Chibelushi, F. Deravi, and J. Mason. Adaptive classifier integration for robust pattern recognition. IEEE Trans. Systems, Man, and Cybernetics, 29(6):902–907, 1999.
-  K. Cho, A. Courville, and Y. Bengio. Describing multimedia content using attention-based encoder-decoder networks. IEEE Trans. Multimedia, 17(11):1875–1886, 2015.
-  C. Chow. On optimum recognition error and reject tradeoff. IEEE Trans. Information Theory, 16(1):41–46, 1970.
-  S. Chowdhuri, T. Pankaj, and K. Zipser. MultiNet: Multi-modal multi-task learning for autonomous driving. arXiv:1709.05581v4, 2019.
-  C. Ciliberto, A. Rudi, L. Rosasco, and M. Pontil. Consistent multitask learning with nonlinear output relations. In Advances Neural Inf. Process. Syst., pages 1986–1996, 2017.
-  R. Cinbis, J. Verbeek, and C. Schmid. Weakly supervised object localization with multi-fold multiple instance learning. IEEE Trans. Pattern Anal. Mach. Intell., 39(1):189–203, 2017.
-  R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Int. Conf. Mach. Learn., pages 160–168, 2008.
-  C. Cortes and V. Vapnik. Support vector machine. Machine Learning, 20(3):273–297, 1995.
-  N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy. Optimal transport for domain adaptation. IEEE Trans. Pattern Anal. Mach. Intell., 39(9):1853–1865, 2017.
-  G.E. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Trans. Audio, Speech, Langu. Process., 20(1):30–42, 2012.
-  A. Dai and Q. Le. Semi-supervised sequence learning. In Advances Neural Inf. Process. Syst., pages 3079–3087, 2015.
-  Z. Dai, Z. Yang, F. Yang, W. Cohen, and R. Salakhutdinov. Good semi-supervised learning that requires a bad GAN. In Advances Neural Inf. Process. Syst., pages 6510–6520, 2017.
-  T. Darrell, M. Kloft, M. Pontil, G. Ratsch, and E. Rodner. Machine learning with interdependent and non-identically distributed data. Dagstuhl Reports (Dagstuhl Seminar 15152), 5(4):18–55, 2015.
-  J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805, 2018.
-  T. Dietterich. Steps toward robust artificial intelligence. AI Magazine, 38(3):3–24, 2017.
-  T. Dietterich, R. Lathrop, and T. Lozano-Perez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89:31–71, 1997.
-  C. Doersch, A. Gupta, and A. Efros. Unsupervised visual representation learning by context prediction. In Int. Conf. Comput. Vis., pages 1422–1430, 2015.
-  C. Doersch and A. Zisserman. Multi-task self-supervised visual learning. In Int. Conf. Comput. Vis., pages 2051–2060, 2017.
-  J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In Int. Conf. Mach. Learn., pages 647–655, 2014.
-  W. Dong, G. Shi, X. Li, Y. Ma, and F. Huang. Compressive sensing via nonlocal low-rank regularization. IEEE Trans. Image Process., 23(8):3618–3632, 2014.
-  A. Dosovitskiy, P. Fischer, J. Springenberg, M. Riedmiller, and T. Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Trans. Pattern Anal. Mach. Intell., 38(9):1734–1747, 2016.
-  B. Dubuisson and M. Masson. A statistical decision rule with incomplete knowledge about classes. Pattern Recognition, 26(1):155–165, 1993.
-  R. Duda, P. Hart, and D. Stork. Pattern Classification. John Wiley & Sons, 2001.
-  L. Duong, T. Cohn, S. Bird, and P. Cook. Low resource dependency parsing: Cross-lingual parameter sharing in a neural network parser. In Int. Joint Conf. Natural Language Processing, pages 845–850, 2015.
-  E. Elhamifar and R. Vidal. Robust classification using structured sparse representation. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 1873–1879, 2011.
-  S. Ertekin, L. Bottou, and C. Giles. Nonconvex online support vector machines. IEEE Trans. Pattern Anal. Mach. Intell., 33(2):368–381, 2011.
-  A. Fawzi, S. Moosavi-Dezfooli, and P. Frossard. The robustness of deep networks: A geometrical perspective. IEEE Signal Process. Magazine, 34(6):50–62, 2017.
-  L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE Trans. Pattern Anal. Mach. Intell., 28(4):594–611, 2006.
-  P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell., 32(9):1627–1645, 2010.
-  C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Int. Conf. Mach. Learn., 2017.
-  J. Foulds and E. Frank. A review of multi-instance learning assumptions. Knowledge Engineering Review, 25(1):1–25, 2010.
-  B. Frenay and M. Verleysen. Classification in the presence of label noise: a survey. IEEE Trans. Neural Netw. Learn. Syst., 25(5):845–869, 2014.
-  Y. Freund and R. Schapire. Experiments with a new boosting algorithm. In Int. Conf. Mach. Learn., pages 148–156, 1996.
-  J. Friedman. Regularized discriminant analysis. J. American Statistical Association, 84(405):165–175, 1989.
-  A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov. DeViSE: A deep visual-semantic embedding model. In Advances Neural Inf. Process. Syst., pages 2121–2129, 2013.
-  K.-S. Fu. Recent developments in pattern recognition. IEEE Trans. Comput., 29(10):845–854, 1980.
-  Z. Fu, T. Xiang, E. Kodirov, and S. Gong. Zero-shot learning on semantic class prototype graph. IEEE Trans. Pattern Anal. Mach. Intell., 40(8):2009–2022, 2018.
-  K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, 1990.
-  Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In Int. Conf. Mach. Learn., pages 1180–1189, 2015.
-  Z. Ge, S. Demyanov, Z. Chen, and R. Garnavi. Generative openmax for multi-class open set classification. arXiv:1707.07418, 2017.
-  A. Ghosh, H. Kumar, and P. Sastry. Robust loss functions under label noise for deep neural networks. In AAAI Conf. Artif. Intell., pages 1919–1925, 2017.
-  A. Ghosh, N. Manwani, and P. Sastry. Making risk minimization tolerant to label noise. Neurocomputing, 160:93–107, 2015.
-  S. Gidaris, P. Singh, and N. Komodakis. Unsupervised representation learning by predicting image rotations. In Int. Conf. Learn. Representations, 2018.
-  X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Int. Conf. Artif. Intell. Stat., pages 249–256, 2010.
-  M. Gonen and E. Alpaydin. Multiple kernel learning algorithms. J. Mach. Learn. Res., 12:2211–2268, 2011.
-  I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances Neural Inf. Process. Syst., pages 2672–2680, 2014.
-  I. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In Int. Conf. Learn. Representations, 2015.
-  Y. Grandvalet and Y. Bengio. Semi-supervised learning by entropy minimization. In Advances Neural Inf. Process. Syst., pages 529–536, 2005.
-  Y. Grandvalet, A. Rakotomamonjy, J. Keshet, and S. Canu. Support vector machines with a reject option. In Advances Neural Inf. Process. Syst., pages 537–544, 2009.
-  A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Int. Conf. Mach. Learn., pages 369–376, 2006.
-  A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. IEEE Trans. Pattern Anal. Mach. Intell., 31(5):855–868, 2009.
-  A. Gretton, K. Borgwardt, M. Rasch, B. Scholkopf, and A. Smola. A kernel method for the two-sample-problem. In Advances Neural Inf. Process. Syst., pages 513–520, 2007.
-  J. Gu, J. Cai, S. Joty, L. Niu, and G. Wang. Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 7181–7189, 2018.
-  S. Gu and L. Rigazio. Towards deep neural network architectures robust to adversarial examples. arXiv:1412.5068, 2014.
-  S. Guerriero, B. Caputo, and T. Mensink. Deep nearest class mean classifiers. In Worskhop Int. Conf. Learn. Representations, 2018.
-  I. Guyon and A. Elisseeff. An introduction to variable and feature selection. J. Mach. Learn. Res., 3:1157–1182, 2003.
-  P. Haeusser, A. Mordvintsev, and D. Cremers. Learning by association: a versatile semi-supervised training method for neural networks. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 89–98, 2017.
-  J. Hampshire and A. Waibel. The meta-pi network: Building distributed knowledge representations for robust multisource pattern recognition. IEEE Trans. Pattern Anal. Mach. Intell., 14(7):751–769, 1992.
-  B. Han, J. Yao, G. Niu, M. Zhou, I. Tsang, Y. Zhang, and M. Sugiyama. Masking: A new perspective of noisy supervision. arXiv:1805.08193, 2018.
-  S. Han, H. Mao, and W. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In Int. Conf. Learn. Representations, 2016.
-  R. Haralick. Decision making in context. IEEE Trans. Pattern Anal. Mach. Intell., 5(4):417–428, 1983.
-  B. Hariharan and R. Girshick. Low-shot visual recognition by shrinking and hallucinating features. In Int. Conf. Comput. Vis., pages 3018–3027, 2017.
-  M. Hayat, M. Bennamoun, and S. An. Deep reconstruction models for image set classification. IEEE Trans. Pattern Anal. Mach. Intell., 37(4):713–727, 2015.
-  C. He, R. Wang, S. Shan, and X. Chen. Exemplar-supported generative reproduction for class incremental learning. In British Machine Vision Conf., 2018.
-  K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual representation learning. arXiv:1911.05722, 2019.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 770–778, 2016.
-  R. He, W.-S. Zheng, B.-G. Hu, and X.-W. Kong. A regularized correntropy framework for robust pattern recognition. Neural Computation, 23(8):2074–2100, 2011.
-  D. Hendrycks and K. Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In Int. Conf. Learn. Representations, 2017.
-  G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
-  T. Ho, J. Hull, and S. Srihari. Decision combination in multiple classifier systems. IEEE Trans. Pattern Anal. Mach. Intell., 16(1):66–75, 1994.
-  V. Hodge and J. Austin. A survey of outlier detection methodologies. Artif. Intell. Review, 22(2):85–126, 2004.
-  J. Hoffman, S. Gupta, J. Leong, S. Guadarrama, and T. Darrell. Cross-modal adaptation for RGB-D detection. In IEEE Int. Conf. Robotics Automation, pages 5032–5039, 2016.
-  J. Hoffman, B. Kulis, T. Darrell, and K. Saenko. Discovering latent domains for multisource domain adaptation. In European Conf. Computer Vision, pages 702–715, 2012.
-  J. Hoffman, E. Rodner, J. Donahue, T. Darrell, and K. Saenko. Efficient learning of domain-invariant image representations. In Int. Conf. Learn. Representations, pages 1–9, 2013.
-  C. Hu, Y. Chen, L. Hu, and X. Peng. A novel random forests based class incremental learning method for activity recognition. Pattern Recognition, 78:277–290, 2018.
-  Y. Hu, A. Mian, and R. Owens. Face recognition using sparse approximated nearest points between image sets. IEEE Trans. Pattern Anal. Mach. Intell., 34(10):1992–2004, 2012.
-  J. Huang, A. Smola, A. Gretton, K. Borgwardt, and B. Scholkopf. Correcting sample selection bias by unlabeled data. In Advances Neural Inf. Process. Syst., pages 601–608, 2007.
-  K. Huang, R. Jin, Z. Xu, and C.-L. Liu. Robust metric learning by smooth optimization. In Conf. Uncertain. Artif. Intell., 2010.
-  S.-J. Huang, R. Jin, and Z.-H. Zhou. Active learning by querying informative and representative examples. IEEE Trans. Pattern Anal. Mach. Intell., 36(10):1936–1949, 2014.
-  H. Daume III and D. Marcu. Domain adaptation for statistical classifiers. J. Artif. Intell. Res., 26:101–126, 2006.
-  ILSVRC. ImageNet large scale visual recognition challenge. http://www.image-net.org/challenges/LSVRC.
-  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Int. Conf. Mach. Learn., 2015.
-  A.K. Jain. Data clustering: 50 years beyond K-means. Pattern Recognit. Lett., 31(8):651–666, 2010.
-  A.K. Jain, R.P.W. Duin, and J. Mao. Statistical pattern recognition: A review. IEEE Trans. Pattern Anal. Mach. Intell., 22(1):4–37, 2000.
-  L. Jiang, Z. Zhou, T. Leung, L.-J. Li, and L. Fei-Fei. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In Int. Conf. Mach. Learn., pages 2309–2318, 2018.
-  I. Jindal, M. Nokleby, and X. Chen. Learning deep networks from noisy labels with dropout regularization. In Int. Conf. Data Mining, pages 967–972, 2016.
-  P. Junior, R. Souza, R. Werneck, B. Stein, D. Pazinato, W. Almeida, O. Penatti, R. Torres, and A. Rocha. Nearest neighbors distance ratio open-set classifier. Machine Learning, 106(3):359–386, 2017.
-  L. Kaiser, A. Gomez, N. Shazeer, A. Vaswani, N. Parmar, L. Jones, and J. Uszkoreit. One model to learn them all. arXiv:1706.05137, 2017.
-  K. Kamnitsas, D. Castro, L. Folgoc, I. Walker, R. Tanno, D. Rueckert, B. Glocker, A. Criminisi, and A. Nori. Semi-supervised learning via compact latent space clustering. In Int. Conf. Mach. Learn., 2018.
-  M. Kan, J. Wu, S. Shan, and X. Chen. Domain adaptation for face recognition: Targetize source domain bridged by common subspace. Int. Journal of Computer Vision, 109(1):94–109, 2014.
-  M. Kandemir and F. Hamprecht. Computer-aided diagnosis from weak supervision: A benchmarking study. Comput. Med. Imaging Graph., 42:44–50, 2015.
-  X. Kang, S. Li, and J.A. Benediktsson. Pansharpening with matting model. IEEE Trans. Geoscience and Remote Sensing, 52(8):5088–5099, 2014.
-  B. Karmakar and N. Pal. How to make a neural network say “Don’t know”. Information Sciences, 430:444–466, 2018.
-  A. Kendall, Y. Gal, and R. Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 7482–7491, 2018.
-  Y. Kharin. Robustness in Statistical Pattern Recognition. Springer Science & Business Media, 1996.
-  T.-K. Kim, J. Kittler, and R. Cipolla. Discriminative learning and recognition of image set classes using canonical correlations. IEEE Trans. Pattern Anal. Mach. Intell., 29(6):1005–1018, 2007.
-  D. Kingma and J. Ba. Adam: A method for stochastic optimization. In Int. Conf. Learn. Representations, 2015.
-  D. Kingma, S. Mohamed, D. Rezende, and M. Welling. Semi-supervised learning with deep generative models. In Advances Neural Inf. Process. Syst., pages 3581–3589, 2014.
-  D. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv:1312.6114, 2013.
-  T. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In Int. Conf. Learn. Representations, 2017.
-  A. Kolesnikov, X. Zhai, and L. Beyer. Revisiting self-supervised visual representation learning. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 1920–1929, 2019.
-  D. Koller and N. Friedman. Probabilistic Graphical Models. MIT Press, 2009.
-  S. Kotz and S. Nadarajah. Extreme Value Distributions: Theory and Applications. World Scientific, 2000.
-  J. Kozerawski and M. Turk. CLEAR: Cumulative learning for one-shot one-class image recognition. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 3446–3455, 2018.
-  A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In Advances Neural Inf. Process. Syst., pages 1097–1105, 2012.
-  L. Kuncheva. Combining Pattern Classifiers: Methods and Algorithms. John Wiley & Sons, 2004.
-  A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial examples in the physical world. arXiv:1607.02533, 2016.
-  I. Kuzborskij, F. Orabona, and B. Caputo. From N to N+1: Multiclass transfer incremental learning. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 3358–3365, 2013.
-  J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Int. Conf. Mach. Learn., pages 282–289, 2001.
-  S. Laine and T. Aila. Temporal ensembling for semi-supervised learning. In Int. Conf. Learn. Representations, 2017.
-  B. Lake, R. Salakhutdinov, and J. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
-  C. Lampert. Kernel methods in computer vision. Foundations and Trends in Computer Graphics and Vision, 4(3):193–285, 2009.
-  C. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE Trans. Pattern Anal. Mach. Intell., 36(3):453–465, 2014.
-  G. Larsson, M. Maire, and G. Shakhnarovich. Learning representations for automatic colorization. In Eur. Conf. Comput. Vis., pages 577–593, 2016.
-  P. Laskov, C. Gehl, S. Kruger, and K.-R. Muller. Incremental support vector learning: Analysis, implementation and applications. J. Mach. Learn. Res., 7:1909–1936, 2006.
-  Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, 1998.
-  D.-H. Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop Int. Conf. Mach. Learn., 2013.
-  K.-H. Lee, X. He, L. Zhang, and L. Yang. CleanNet: Transfer learning for scalable image classifier training with label noise. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 5447–5456, 2018.
-  C.J. Leggetter and P.C. Woodland. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech and Language, 9(2):171–185, 1995.
-  X. Li, Y. Zhou, T. Wu, R. Socher, and C. Xiong. Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting. In Int. Conf. Mach. Learn., 2019.
-  Z. Li, L.-F. Cheong, S. Yang, and K.-C. Toh. Simultaneous clustering and model selection: Algorithm, theory and applications. IEEE Trans. Pattern Anal. Mach. Intell., 40(8):1964–1978, 2018.
-  Z. Li and D. Hoiem. Learning without forgetting. IEEE Trans. Pattern Anal. Mach. Intell., 40(12):2935–2947, 2018.
-  Z. Li and D. Hoiem. -distillation: Reducing overconfident errors on novel samples. arXiv:1804.03166, 2018.
-  A. Liu and B. Ziebart. Robust classification under sample selection bias. In Advances Neural Inf. Process. Syst., pages 37–45, 2014.
-  C.-L. Liu. Classifier combination based on confidence transformation. Pattern Recognition, 38(1):11–28, 2005.
-  C.-L. Liu. One-vs-all training of prototype classifiers for pattern classification and retrieval. In Int. Conf. Pattern Recognition, pages 3328–3331, 2010.
-  C.-L. Liu, K. Nakashima, H. Sako, and H. Fujisawa. Handwritten digit recognition: benchmarking of state-of-the-art techniques. Pattern Recognition, 36(10):2271–2285, 2003.
-  C.-L. Liu, H. Sako, and H. Fujisawa. Performance evaluation of pattern classifiers for handwritten character recognition. Int. J. Document Anal. Recognit., 4(3):191–204, 2002.
-  K. Liu, Y. Li, N. Xu, and P. Natarajan. Learn to combine modalities in multimodal deep learning. arXiv:1805.11730, 2018.
-  T. Liu and D. Tao. Classification with noisy labels by importance reweighting. IEEE Trans. Pattern Anal. Mach. Intell., 38(3):447–461, 2016.
-  X. Liu, J. Weijer, and A. Bagdanov. Exploiting unlabeled data in CNNs by self-supervised learning to rank. IEEE Trans. Pattern Anal. Mach. Intell., 41(8):1862–1878, 2019.
-  J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 3431–3440, 2015.
-  M. Long, Y. Cao, J. Wang, and M. Jordan. Learning transferable features with deep adaptation networks. In Int. Conf. Mach. Learn., pages 1–9, 2015.
-  Y. Long, L. Liu, F. Shen, L. Shao, and X. Li. Zero-shot learning using synthesised unseen visual data with diffusion regularisation. IEEE Trans. Pattern Anal. Mach. Intell., 40(10):2498–2512, 2018.
-  X. Lu, Y. Wang, X. Zhou, Z. Zhang, and Z. Ling. Traffic sign recognition via multi-modal tree-structure embedded multi-task learning. IEEE Trans. Intelligent Transportation Systems, 18(4):960–972, 2017.
-  Y. Lu, A. Kumar, S. Zhai, Y. Cheng, T. Javidi, and R. Feris. Fully-adaptive feature sharing in multi-task networks with applications in person attribute classification. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 5334–5343, 2017.
-  A. Maas, A. Hannun, and A. Ng. Rectifier nonlinearities improve neural network acoustic models. In Int. Conf. Mach. Learn., 2013.
-  J. Malmaud, J. Huang, V. Rathod, N. Johnston, A. Rabinovich, and K. Murphy. What's cookin'? Interpreting cooking videos using text, speech and vision. arXiv:1503.01558, 2015.
-  R. Mammone, X. Zhang, and R. Ramachandran. Robust speaker recognition: A feature-based approach. IEEE Signal Processing Magazine, 13(5):58–71, 1996.
-  Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation with multiple sources. In Advances Neural Inf. Process. Syst., pages 1041–1048, 2009.
-  H. Masnadi-Shirazi and N. Vasconcelos. On the design of loss functions for classification: theory, robustness to outliers, and savageboost. In Advances Neural Inf. Process. Syst., pages 1049–1056, 2009.
-  M. Masud, J. Gao, L. Khan, J. Han, and B. Thuraisingham. Classification and novel class detection in concept-drifting data streams under time constraints. IEEE Trans. Know. Data Eng., 23(6):859–874, 2011.
-  T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka. Distance-based image classification: Generalizing to new classes at near-zero cost. IEEE Trans. Pattern Anal. Mach. Intell., 35(11):2624–2637, 2013.
-  J. Metzen, T. Genewein, V. Fischer, and B. Bischoff. On detecting adversarial perturbations. In Int. Conf. Learn. Representations, 2017.
-  D. Miller and J. Browning. A mixture model and EM-based algorithm for class discovery, robust classification, and outlier rejection in mixed labeled/unlabeled data sets. IEEE Trans. Pattern Anal. Mach. Intell., 25(11):1468–1483, 2003.
-  I. Misra, A. Shrivastava, A. Gupta, and M. Hebert. Cross-stitch networks for multi-task learning. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 3994–4003, 2016.
-  T. Mitchell, W. Cohen, E. Hruschka, et al. Never-ending learning. Communications of the ACM, 61(5):103–115, 2018.
-  T. Miyato, S. Maeda, M. Koyama, and S. Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell., 41(8):1979–1993, 2018.
-  C. Molnar. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. 2019. https://christophm.github.io/interpretable-ml-book/.
-  S. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard. Universal adversarial perturbations. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 1765–1773, 2017.
-  S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. DeepFool: a simple and accurate method to fool deep neural networks. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 2574–2582, 2016.
-  S. Motiian, Q. Jones, S. Iranmanesh, and G. Doretto. Few-shot adversarial domain adaptation. In Advances Neural Inf. Process. Syst., pages 6670–6680, 2017.
-  K.-R. Muller, S. Mika, G. Ratsch, K. Tsuda, and B. Scholkopf. An introduction to kernel-based learning algorithms. IEEE Trans. Neural Netw., 12:181–201, 2001.
-  K. Murugesan, H. Liu, J. Carbonell, and Y. Yang. Adaptive smoothed online multi-task learning. In Advances Neural Inf. Process. Syst., pages 4296–4304, 2016.
-  A. Nagrani, S. Albanie, and A. Zisserman. Seeing voices and hearing faces: Cross-modal biometric matching. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 8427–8436, 2018.
-  G. Nagy. State of the art in pattern recognition. Proc. IEEE, 56(5):836–863, 1968.
-  N. Neverova, C. Wolf, G. Taylor, and F. Nebout. ModDrop: adaptive multi-modal gesture recognition. IEEE Trans. Pattern Anal. Mach. Intell., 38(8):1692–1706, 2016.
-  A. Nguyen, J. Yosinski, and J. Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 427–436, 2015.
-  M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In Eur. Conf. Comput. Vis., pages 69–84, 2016.
-  S. Nowozin and C. Lampert. Structured learning and prediction in computer vision. Foundations and Trends in Computer Graphics and Vision, 6(3-4):185–365, 2011.
-  A. Oliver, A. Odena, C. Raffel, E. Cubuk, and I. Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms. In Advances Neural Inf. Process. Syst., 2018.
-  O. Ostapenko, M. Puscas, T. Klein, P. Jahnichen, and M. Nabi. Learning to remember: A synaptic plasticity driven framework for continual learning. In IEEE Conf. Comput. Vis. Pattern Recognit., 2019.
-  S. Pan, I. Tsang, J. Kwok, and Q. Yang. Domain adaptation via transfer component analysis. IEEE Trans. Neural Netw., 22(2):199–210, 2011.
-  S. Pan and Q. Yang. A survey on transfer learning. IEEE Trans. Know. Data Eng., 22(10):1345–1359, 2009.
-  N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. Celik, and A. Swami. Practical black-box attacks against machine learning. In ACM Asia Conf. Comput. Communi. Secur., pages 506–519, 2017.
-  N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. Celik, and A. Swami. The limitations of deep learning in adversarial settings. In IEEE Eur. Symposium Security Privacy, pages 372–387, 2016.
-  N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In IEEE Symposium Security Privacy, pages 582–597, 2016.
-  G. Parisi, R. Kemker, J. Part, C. Kanan, and S. Wermter. Continual lifelong learning with neural networks: A review. Neural Netw., 113:54–71, 2019.
-  D. Park, L. Hendricks, Z. Akata, A. Rohrbach, B. Schiele, T. Darrell, and M. Rohrbach. Multimodal explanations: Justifying decisions and pointing to the evidence. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 8779–8788, 2018.
-  D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. Efros. Context encoders: Feature learning by inpainting. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 2536–2544, 2016.
-  G. Patrini, A. Rozza, A. Menon, R. Nock, and L. Qu. Making deep neural networks robust to label noise: A loss correction approach. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 1944–1952, 2017.
-  P. Peng, Y. Tian, T. Xiang, Y. Wang, M. Pontil, and T. Huang. Joint semantic and latent attribute modelling for cross-class transfer learning. IEEE Trans. Pattern Anal. Mach. Intell., 40(7):1625–1638, 2018.
-  A. Pentina and C. Lampert. Lifelong learning with non-i.i.d. tasks. In Advances Neural Inf. Process. Syst., pages 1540–1548, 2015.
-  M. Pimentel, D. Clifton, L. Clifton, and L. Tarassenko. A review of novelty detection. Signal Processing, 99:215–249, 2014.
-  M. Poo, J. Du, N. Ip, Z. Xiong, B. Xu, and T. Tan. China brain project: basic neuroscience, brain diseases, and brain-inspired computing. Neuron, 92(3):591–596, 2016.
-  F. Provost and T. Fawcett. Robust classification for imprecise environments. Machine Learning, 42(3):203–231, 2001.
-  H. Qi, M. Brown, and D. Lowe. Low-shot learning with imprinted weights. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 5822–5830, 2018.
-  R. Qiao, L. Liu, C. Shen, and A. Hengel. Less is more: zero-shot learning from online textual documents with noise suppression. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 2249–2257, 2016.
-  L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77(2):257–286, 1989.
-  A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving language understanding by generative pre-training. OpenAI Technical report, 2018.
-  D. Ramachandram and G. Taylor. Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine, 34(6):96–108, 2017.
-  A. Rasmus, H. Valpola, M. Honkala, M. Berglund, and T. Raiko. Semi-supervised learning with ladder networks. In Advances Neural Inf. Process. Syst., pages 3546–3554, 2015.
-  S. Rastegar, M. Baghshah, H. Rabiee, and S. Shojaee. MDL-CW: A multimodal deep learning framework with cross weights. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 2601–2609, 2016.
-  S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In Int. Conf. Learn. Representations, 2017.
-  S. Rebuffi, A. Kolesnikov, G. Sperl, and C. Lampert. iCaRL: Incremental classifier and representation learning. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 2001–2010, 2017.
-  B. Recht, R. Roelofs, L. Schmidt, and V. Shankar. Do CIFAR-10 classifiers generalize to CIFAR-10? arXiv:1806.00451, 2018.
-  S. Reed, Z. Akata, X. Yan, L. Logeswaran, H. Lee, and B. Schiele. Generative adversarial text to image synthesis. In Int. Conf. Mach. Learn., pages 1060–1069, 2016.
-  S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich. Training deep neural networks on noisy labels with bootstrapping. In Workshop Int. Conf. Learn. Representations, 2015.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell., 39(6):1137–1149, 2017.
-  B. RichardWebster, S. Anthony, and W. Scheirer. PsyPhy: A psychophysics driven evaluation framework for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell., 41(9):2280–2286, 2019.
-  S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In Int. Conf. Mach. Learn., pages 833–840, 2011.
-  A. Rozantsev, M. Salzmann, and P. Fua. Residual parameter transfer for deep domain adaptation. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 4339–4348, 2018.
-  A. Rozantsev, M. Salzmann, and P. Fua. Beyond sharing weights for deep domain adaptation. IEEE Trans. Pattern Anal. Mach. Intell., 41(4):801–814, 2019.
-  S. Ruder. An overview of multi-task learning in deep neural networks. arXiv:1706.05098, 2017.
-  M. Sajjadi, M. Javanmardi, and T. Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In Advances Neural Inf. Process. Syst., pages 1163–1171, 2016.
-  T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances Neural Inf. Process. Syst., pages 2234–2242, 2016.
-  P. Samangouei, M. Kabkab, and R. Chellappa. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. In Int. Conf. Learn. Representations, 2018.
-  N. Samsudin and A. Bradley. Nearest neighbour group-based classification. Pattern Recognition, 43(10):3458–3467, 2010.
-  A. Santoro, D. Raposo, D. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap. A simple neural network module for relational reasoning. In Advances Neural Inf. Process. Syst., pages 4967–4976, 2017.
-  G. Saon and M. Picheny. Recent advances in conversational speech recognition using convolutional and recurrent neural networks. IBM J. Research Development, 61(4):1–10, 2017.
-  P. Sarkar and G. Nagy. Style consistent classification of isogenous patterns. IEEE Trans. Pattern Anal. Mach. Intell., 27(1):88–98, 2005.
-  F. Scarselli, M. Gori, A. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Trans. Neural Netw., 20(1):61–80, 2009.
-  W. Scheirer, L. Jain, and T. Boult. Probability models for open set recognition. IEEE Trans. Pattern Anal. Mach. Intell., 36(11):2317–2324, 2014.
-  W. Scheirer, A. Rocha, A. Sapkota, and T. Boult. Toward open set recognition. IEEE Trans. Pattern Anal. Mach. Intell., 35(7):1757–1772, 2013.
-  J. Schmidhuber. Deep learning in neural networks: An overview. Neural Netw., 61:85–117, 2015.
-  B. Scholkopf, J. Platt, J. Shawe-Taylor, and A. Smola. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.
-  T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio. Robust object recognition with cortex-like mechanisms. IEEE Trans. Pattern Anal. Mach. Intell., 29(3):411–426, 2007.
-  U. Shaham, Y. Yamada, and S. Negahban. Understanding adversarial training: Increasing local stability of neural nets through robust optimization. arXiv:1511.05432, 2015.
-  G. Shakhnarovich, J. Fisher, and T. Darrell. Face recognition from long-term observations. In Eur. Conf. Comput. Vis., pages 851–865, 2002.
-  B. Shi, X. Bai, and C. Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell., 39(11):2298–2304, 2017.
-  H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. J. Stat. Plan. Infer., 90(2):227–244, 2000.
-  L. Shu, H. Xu, and B. Liu. DOC: Deep open classification of text documents. In Conf. Empirical Methods in Natural Language Processing, pages 2911–2916, 2017.
-  A. Smola, A. Gretton, L. Song, and B. Scholkopf. A Hilbert space embedding for distributions. In Int. Conf. Algor. Learn. Theory, pages 13–31, 2007.
-  J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In Advances Neural Inf. Process. Syst., pages 4080–4090, 2017.
-  J. Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. In Int. Conf. Learn. Representations, 2016.
-  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, 2014.
-  N. Srivastava and R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. In Advances Neural Inf. Process. Syst., pages 2222–2230, 2012.
-  J. Su, D. Vargas, and K. Sakurai. One pixel attack for fooling deep neural networks. arXiv:1710.08864, 2017.
-  C. Suen. N-gram statistics for natural language understanding and text processing. IEEE Trans. Pattern Anal. Mach. Intell., 1(2):164–172, 1979.
-  M. Sugiyama, S. Nakajima, H. Kashima, P. Bunau, and M. Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances Neural Inf. Process. Syst., pages 1433–1440, 2008.
-  S. Sukhbaatar, J. Bruna, M. Paluri, L. Bourdev, and R. Fergus. Training convolutional networks with noisy labels. In Workshop Int. Conf. Learn. Representations, 2015.
-  Y. Sun, X. Wang, and X. Tang. Hybrid deep learning for face verification. IEEE Trans. Pattern Anal. Mach. Intell., 38(10):1997–2009, 2016.
-  Z. Sun and T. Tan. Ordinal measures for iris recognition. IEEE Trans. Pattern Anal. Mach. Intell., 31(12):2211–2226, 2009.
-  F. Sung, Y. Yang, L. Zhang, T. Xiang, P. Torr, and T. Hospedales. Learning to compare: Relation network for few-shot learning. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 1199–1208, 2018.
-  I. Sutskever, O. Vinyals, and Q. Le. Sequence to sequence learning with neural networks. In Advances Neural Inf. Process. Syst., pages 3104–3112, 2014.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In IEEE Conf. Comput. Vis. Pattern Recognit., 2015.
-  C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In Int. Conf. Learn. Representations, 2014.
-  Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level performance in face verification. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 1701–1708, 2014.
-  D. Tanaka, D. Ikami, T. Yamasaki, and K. Aizawa. Joint optimization framework for learning with noisy labels. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 5552–5560, 2018.
-  A. Tarvainen and H. Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances Neural Inf. Process. Syst., pages 1195–1204, 2017.
-  D. Tax. One-class classification: concept-learning in the absence of counter-examples. Ph.D. Thesis, Delft University of Technology, 2001.
-  D. Tax and R. Duin. Support vector data description. Machine Learning, 54(1):45–66, 2004.
-  D. Tax and R. Duin. Growing a multi-class classifier with a reject option. Pattern Recognit. Lett., 29(10):1565–1570, 2008.
-  J. Tenenbaum and W. Freeman. Separating style and content with bilinear models. Neural Computation, 12(6):1247–1283, 2000.
-  S. Thrun and J. O’Sullivan. Discovering structure in multiple learning tasks: The TC algorithm. In Int. Conf. Mach. Learn., pages 489–497, 1996.
-  Y. Tokozume, Y. Ushiku, and T. Harada. Between-class learning for image classification. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 5486–5494, 2018.
-  E. Triantafillou, R. Zemel, and R. Urtasun. Few-shot learning through an information retrieval lens. In Advances Neural Inf. Process. Syst., pages 2255–2265, 2017.
-  I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. J. Mach. Learn. Res., 6:1453–1484, 2005.
-  P. Turaga, A. Veeraraghavan, A. Srivastava, and R. Chellappa. Statistical computations on Grassmann and Stiefel manifolds for image and video-based recognition. IEEE Trans. Pattern Anal. Mach. Intell., 33(11):2273–2286, 2011.
-  E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In Int. Conf. Comput. Vis., pages 4068–4076, 2015.
-  E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 7167–7176, 2017.
-  P. Utgoff. Incremental induction of decision trees. Machine Learning, 4:161–186, 1989.
-  A. Vahdat. Toward robustness against label noise in training deep discriminative neural networks. In Advances Neural Inf. Process. Syst., pages 5596–5605, 2017.
-  V.N. Vapnik. Statistical Learning Theory. New York: John Wiley & Sons, 1998.
-  S. Veeramachaneni and G. Nagy. Style context with second-order statistics. IEEE Trans. Pattern Anal. Mach. Intell., 27(1):14–22, 2005.
-  S. Veeramachaneni and G. Nagy. Analytical results on style-constrained Bayesian classification of pattern fields. IEEE Trans. Pattern Anal. Mach. Intell., 29(7):1280–1285, 2007.
-  P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. Graph attention networks. In Int. Conf. Learn. Representations, 2018.
-  V. Verma, G. Arora, A. Mishra, and P. Rai. Generalized zero-shot learning via synthesized examples. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 4281–4289, 2018.
-  P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Int. Conf. Mach. Learn., pages 1096–1103, 2008.
-  O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching networks for one shot learning. In Advances Neural Inf. Process. Syst., pages 3630–3638, 2016.
-  W. Wan, Y. Zhong, T. Li, and J. Chen. Rethinking feature distribution for loss functions in image classification. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 9117–9126, 2018.
-  A. Wang, J. Cai, J. Lu, and T.-J. Cham. MMSS: Multi-modal sharable and specific feature learning for RGB-D object recognition. In Int. Conf. Comput. Vis., pages 1125–1133, 2015.
-  H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition. Int. J. Comput. Vis., 103(1):60–79, 2013.
-  L. Wang, T. Tan, H. Ning, and W. Hu. Silhouette analysis-based gait recognition for human identification. IEEE Trans. Pattern Anal. Mach. Intell., 25(12):1505–1518, 2003.
-  M. Wang and W. Deng. Deep visual domain adaptation: A survey. arXiv:1802.03601, 2018.
-  Q.-F. Wang, F. Yin, and C.-L. Liu. Handwritten Chinese text recognition by integrating multiple contexts. IEEE Trans. Pattern Anal. Mach. Intell., 34(8):1469–1481, 2012.
-  X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. In Int. Conf. Comput. Vis., pages 2794–2802, 2015.
-  Y. Wang, W. Liu, X. Ma, J. Bailey, H. Zha, L. Song, and S.-T. Xia. Iterative learning with open-set noisy labels. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 8688–8696, 2018.
-  Y.-X. Wang, R. Girshick, M. Hebert, and B. Hariharan. Low-shot learning from imaginary data. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 7278–7286, 2018.
-  Y.-X. Wang and M. Hebert. Learning to learn: Model regression networks for easy small sample learning. In Eur. Conf. Comput. Vis., pages 616–634, 2016.
-  Y. Wu and Y. Liu. Robust truncated hinge loss support vector machines. J. American Statistical Association, 102(479):974–983, 2007.
-  Z. Wu, Y. Xiong, S. Yu, and D. Lin. Unsupervised feature learning via non-parametric instance discrimination. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 3733–3742, 2018.
-  Y. Xian, C. Lampert, B. Schiele, and Z. Akata. Zero-shot learning: A comprehensive evaluation of the good, the bad and the ugly. IEEE Trans. Pattern Anal. Mach. Intell., 41(9):2251–2265, 2019.
-  Y. Xian, T. Lorenz, B. Schiele, and Z. Akata. Feature generating networks for zero-shot learning. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 5542–5551, 2018.
-  T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang. Learning from massive noisy labeled data for image classification. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 2691–2699, 2015.
-  G. Xu, B.-G. Hu, and J. Principe. Robust C-loss kernel classifiers. IEEE Trans. Neural Netw. Learn. Syst., 29(3):510–522, 2018.
-  K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Int. Conf. Mach. Learn., pages 2048–2057, 2015.
-  L. Xu, K. Crammer, and D. Schuurmans. Robust support vector machine training via convex outlier ablation. In AAAI Conf. Artif. Intell., pages 536–542, 2006.
-  R. Xu, Z. Chen, W. Zuo, J. Yan, and L. Lin. Deep cocktail network: Multi-source unsupervised domain adaptation with category shift. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 3964–3973, 2018.
-  S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin. Graph embedding and extensions: A general framework for dimensionality reduction. IEEE Trans. Pattern Anal. Mach. Intell., 29(1):40–51, 2007.
-  H.-M. Yang, X.-Y. Zhang, F. Yin, and C.-L. Liu. Robust classification with convolutional prototype learning. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 3474–3482, 2018.
-  J. Yang, D. Parikh, and D. Batra. Joint unsupervised learning of deep representations and image clusters. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 5147–5156, 2016.
-  J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In Advances Neural Inf. Process. Syst., pages 3320–3328, 2014.
-  A. Zamir, A. Sax, W. Shen, L. Guibas, J. Malik, and S. Savarese. Taskonomy: Disentangling task transfer learning. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 3712–3722, 2018.
-  C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In Int. Conf. Learn. Representations, 2017.
-  D. Zhang and D. Shen. Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in Alzheimer’s disease. NeuroImage, 59(2):895–907, 2012.
-  H. Zhang, M. Cisse, Y. Dauphin, and D. Lopez-Paz. Mixup: Beyond empirical risk minimization. In Int. Conf. Learn. Representations, 2018.
-  H. Zhang and V. Patel. Sparse representation-based open set recognition. IEEE Trans. Pattern Anal. Mach. Intell., 39(8):1690–1696, 2017.
-  J. Zhang, W. Li, and P. Ogunbona. Transfer learning for cross-dataset recognition: A survey. arXiv:1705.04396, 2017.
-  R. Zhang, P. Isola, and A. Efros. Colorful image colorization. In Eur. Conf. Comput. Vis., pages 649–666, 2016.
-  R. Zhang, P. Isola, and A. Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 1058–1067, 2017.
-  T. Zhang. Analysis of multi-stage convex relaxation for sparse regularization. J. Mach. Learn. Res., 11:1081–1107, 2010.
-  X.-Y. Zhang, Y. Bengio, and C.-L. Liu. Online and offline handwritten Chinese character recognition: A comprehensive study and new benchmark. Pattern Recognition, 61:348–360, 2017.
-  X.-Y. Zhang, K. Huang, and C.-L. Liu. Pattern field classification with style normalized transformation. In Int. Joint Conf. Artificial Intelligence, pages 1621–1626, 2011.
-  X.-Y. Zhang and C.-L. Liu. Writer adaptation with style transfer mapping. IEEE Trans. Pattern Anal. Mach. Intell., 35(7):1773–1787, 2013.
-  Y. Zhang, K. Lee, and H. Lee. Augmenting supervised neural networks with unsupervised objectives for large-scale image classification. In Int. Conf. Mach. Learn., pages 612–621, 2016.
-  Z. Zhang, M. Wang, Y. Huang, and A. Nehorai. Aligning infinite-dimensional covariance matrices in reproducing kernel hilbert spaces for domain adaptation. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 3437–3445, 2018.
-  Z. Zhang and K. Zhao. Low-rank matrix approximation with manifold regularization. IEEE Trans. Pattern Anal. Mach. Intell., 35(7):1717–1729, 2013.
-  J. Zhao, M. Mathieu, R. Goroshin, and Y. LeCun. Stacked what-where auto-encoders. In Workshop Int. Conf. Learn. Representations, 2016.
-  S. Zheng, Y. Song, T. Leung, and I. Goodfellow. Improving the robustness of deep neural networks via stability training. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 4480–4488, 2016.
-  B. Zhou, D. Bau, A. Oliva, and A. Torralba. Interpreting deep visual representations via network dissection. IEEE Trans. Pattern Anal. Mach. Intell., 41(9):2131–2145, 2019.
-  D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. Scholkopf. Learning with local and global consistency. In Advances Neural Inf. Process. Syst., pages 321–328, 2004.
-  J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, and M. Sun. Graph neural networks: A review of methods and applications. arXiv:1812.08434, 2018.
-  Z.-H. Zhou and M. Li. Tri-training: exploiting unlabeled data using three classifiers. IEEE Trans. Know. Data Eng., 17(11):1529–1541, 2005.
-  Z.-H. Zhou, Y.-Y. Sun, and Y.-F. Li. Multi-instance learning by treating instances as non-iid samples. In Int. Conf. Mach. Learn., pages 1249–1256, 2009.
-  X. Zhu. Semi-supervised learning literature survey. Computer Science, University of Wisconsin-Madison, 2006.
-  X. Zhu, X. Wu, and Q. Chen. Eliminating class noise in large datasets. In Int. Conf. Mach. Learn., pages 920–927, 2003.
-  Y. Zhu, M. Elhoseiny, B. Liu, X. Peng, and A. Elgammal. A generative adversarial approach for zero-shot learning from noisy texts. In IEEE Conf. Comput. Vis. Pattern Recognit., pages 1004–1013, 2018.
-  Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Int. Conf. Comput. Vis., pages 19–27, 2015.