Learning can be understood as the behavior by which a person or system improves itself through interacting with the outside environment and introspecting on its internal model of the world, so as to improve its capability of cognition, adaptation and regulation with respect to the environment. Machine learning aims to simulate such behavior with computers for executing certain tasks, and generally contains the following implementation elements. First, a performance measure is defined; then useful information is extracted from pre-collected historical experience (training data) and known prior knowledge, under the criterion of maximizing this performance measure, so as to train a good learner to help analyze future data (jordan2015machine). Learning has been substantiated to be beneficial to tasks like recognition, causality, inference, understanding, etc., and has achieved extraordinary performance in various practical tasks, including image classification (Krizhevsky2012Alex; He2016Res), speech recognition (Hinton2012; Mikolov2011; Sainath2013; Cambria2013), machine translation (Sutskever2014), Atari video games (Mnih2015), the game of Go (Silver2016; Silver2017), Texas Hold'em poker (Moravvcik2017), skin cancer diagnosis (Esteva2017), the quantum many-body problem (Carleo2017), etc.
Figure 1: (a) Demonstration of Bayesian Program Learning (BPL). Provided only a single example (red boxes), BPL (Lake2015) can rapidly learn the new concept (i.e., the generation procedure of a character), with prior knowledge embedded into its models, to classify new examples, generate new examples, parse an object into parts, and generate new concepts from related concepts. (b) Demonstration of the Recursive Cortical Network (RCN). Given the same character "A" in a wide variety of appearances, RCN (George2017) can parse "A" into contours and surfaces, separate scene context from background via lateral connections, and achieve higher accuracy than CNNs with fewer samples.
In recent decades, machine learning has made significant progress and obtained impressively good performance on various tasks, which has made this line of approaches the most highlighted technique of the entire artificial intelligence field. While such success seems to make people increasingly optimistic about the power of current machine learning approaches, many researchers and engineers have begun to recognize that most of the latest progress in machine learning depends heavily on the premise of a large number of input samples (generally with annotations). Such a learning manner can be called Large Sample Learning (LSL) for notational convenience. Real cases, however, often deviate from such ideal circumstances and generally bear the characteristic of Small Sample Learning (SSL). Specifically, there are mainly two categories of SSL scenarios. The first can be called concept learning, which aims to recognize and form never-seen new concepts from only a few observations of them, by association with previously learned knowledge of other concepts. The other category is experience learning, sometimes also called small data learning, mainly proposed from the opposite side of LSL, i.e., carrying out machine learning under the condition of lacking sufficient training samples.
As a fundamental and widely existing learning paradigm in real cases, the early attempts at SSL may have originated in the context of multimedia retrieval (Zhou2001), and the topic has gradually attracted increasing attention across various areas in recent years (Miller2000; bart2005cross; Fei-Fei2003; Fei-Fei2006). A representative method was proposed by (Lake2015), achieving human-level performance on a one-shot character classification task and being able to generate new examples of a concept trained from only one sample of the class, examples that are even indistinguishable from human-produced ones (see Fig. 1(a)). Afterwards, Lake2016 proposed a "characters challenge" advancement: after an AI system views only a single exemplar, the system should be able to distinguish novel instances of an unfamiliar handwritten character from others. Lately, the AI startup Vicarious (George2017)
claimed to outperform deep neural networks on a challenging text recognition task with less than one three-hundredth of the data, and to break the defense of modern text-based CAPTCHAs. In Vicarious's official blog (https://www.vicarious.com/2017/10/26/common-sense-cortex-and-captcha/), they explained why the central problem in AI is to understand the letter "A" (see Fig. 1(b)), and endorsed the belief that "for any program to handle letter forms with the flexibility that human beings do, it would have to possess full-scale artificial intelligence".
Generally speaking, while humans can very easily perform these tasks, they are difficult for classical AI systems. Human cognition is distinguished by the ability to rapidly constitute a new concept from only a handful of examples, by latently leveraging possessed prior knowledge to enable flexible inductive inference (Hassabis2017). Concepts play an important role in human cognition (carey1999knowledge), where our minds make inferences that appear to go far beyond the data available, through learning concepts and grasping causal relations (tenenbaum2011grow). For example, as shown in (roymachines), a human can recognize a Segway even after seeing it only once (i.e., one-shot learning). This is because our mind can decompose the concept into a collection of known parts, such as wheels and a steering stick. Many variations of these components are already encoded in our memory, and hence it is generally not difficult for even a child to rapidly recognize different models of Segways. Even in the extreme case that one has never seen a sample of the concept before (i.e., zero-shot learning), given a description of some attributes of a Segway, one can still guess how its components should be connected and, by using previously possessed knowledge of these attributes, recognize a Segway without ever having seen one. SSL tries to imitate such human cognitive intelligence to solve, with only a few exemplars, hard tasks that classical AI systems can hardly process.
In summary, SSL is a fundamental and gradually more widespread new learning paradigm featured by few training examples. It aims to simulate the human learning capability of rapidly discovering and representing new concepts from few observations, parsing a scene into objects and relations, recombining these elements to synthesize new instances through imagination, and implementing other small sample learning tasks. The following issues encountered by current machine learning approaches are likely to be alleviated by the future development of SSL techniques, which makes this research direction meaningful to explore:
1) Lack of labels due to the high cost of human annotation.
As aforementioned, LSL can achieve excellent performance on various tasks in the presence of a large amount of high-quality training samples. For example, to learn a deep learning model with tens or even hundreds of layers and a huge number of model parameters, we need to pre-collect a large amount of training samples labeled with full ground-truth annotations (goodfellow2016deep). Typical datasets so generated include PASCAL VOC (everingham2010pascal), ImageNet (russakovsky2015imagenet; deng2009imagenet), Microsoft COCO (lin2014microsoft) and many other well-known ones. In practice, however, it can be difficult to attain such high-quality annotations for many samples, due to the high cost of the data labeling process (e.g., small-scale event annotation in surveillance videos of crowded large-scale scenes (jiang2014easy)) or a lack of expert experience (e.g., certain diseases in medical images (fries2017swellshark)). Besides, many datasets are collected via crowdsourcing systems or search engines to reduce human labor cost; they, however, inevitably contain a large amount of low-quality annotations (i.e., coarse or even inaccurate annotations). This leads to the known settings of weakly supervised learning (zhou2017brief) and webly supervised learning (chen2015webly), which have attracted much research attention in recent years (Section 5.1). For example, the task of semantic segmentation (hong2017weakly) often can only be trained on pre-collected images with image-level labels, rather than the expected pixel-level labels. In such a case, even with a large amount of training samples, the conventional end-to-end training manner of LSL still tends to fail due to such coarse annotations.
2) Long-tail distributions exist extensively in big data. The long-tail phenomenon appears in datasets where a small number of objects/words/classes are very frequent, while many more are rare (bengio2015sharing). Taking the object recognition problem as an example (ouyang2016factors), rahman2018zero showed that the instance numbers over all classes of the well-known ILSVRC dataset (russakovsky2015imagenet) follow an evident long-tail distribution: among all 200 classes of this dataset, only 11 highly frequent classes cover 50% of the samples in the entire dataset, which makes a learner's performance easily dominated by these head classes while degraded on the tail classes. A simple amelioration strategy is to re-balance the training data, e.g., by sampling examples from the rare classes more frequently (shen2016relay) or by reducing the number of examples from the most populous classes (he2009learning). This strategy, however, is generally heuristic and suboptimal: the former manner tends to generate sample redundancy and encounters the problem of over-fitting to the rare classes, whereas the latter tends to lose critical feature knowledge within the classes with more samples (wang2017learning). SSL methods are thus expected to help alleviate such long-tail training issues by leveraging more beneficial prior knowledge of the small-sample classes.
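The two re-balancing manners just described can be sketched in a few lines. The following is an illustrative toy implementation (function and parameter names are our own, not from any cited work): class-wise oversampling repeats rare-class samples up to the head-class count, while undersampling drops head-class samples down to the tail-class count.

```python
import random
from collections import Counter

def rebalance(samples, labels, strategy="oversample", seed=0):
    """Re-balance a long-tailed dataset by class-wise re-sampling.

    "oversample": repeat rare-class samples up to the head-class count
    (risks redundancy and over-fitting to rare classes, as noted above);
    "undersample": drop head-class samples down to the tail-class count
    (risks losing feature knowledge of the populous classes).
    """
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    counts = Counter(labels)
    target = max(counts.values()) if strategy == "oversample" else min(counts.values())
    new_samples, new_labels = [], []
    for y, xs in by_class.items():
        if strategy == "oversample":
            xs = xs + [rng.choice(xs) for _ in range(target - len(xs))]
        else:
            xs = rng.sample(xs, target)
        new_samples.extend(xs)
        new_labels.extend([y] * len(xs))
    return new_samples, new_labels
```

Both branches produce a perfectly balanced class distribution, which is exactly why they are heuristic: the data distribution is altered rather than extra knowledge being injected.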
3) Insufficient data for conventional LSL approaches. Although we are in the big data era, there still exist many domains lacking sufficient ideal training samples. For example, in intelligent medical diagnosis, medical imaging data are much more difficult to annotate in high quality with certain lesions by people without specific expertise, compared with general images of everyday categories. Moreover, new diseases constantly emerge with few historical data, and rare diseases occur with few cases, so only scarce training samples with accurate labels can be obtained. Another example is intelligent communications, where systems should achieve excellent transmission with very few pilot signals. In such cases, conventional LSL approaches can hardly perform well, and effective SSL techniques are urgently required.
4) Arising from cognitive science studies. Many scholars are attempting to achieve future AI by making machines genuinely mimic humans in thinking and learning (russell2016artificial). The main motivation of SSL is, to a certain extent, to construct such a learning paradigm, i.e., simulating how humans learn new concepts from few observations with strong generalization ability. Inspired by cognitive science studies, some progress has been made on this point (Hassabis2017). Recently, the NIPS 2017 Workshop on Cognitively Informed Artificial Intelligence (https://sites.google.com/view/ciai2017/home) made efforts to bring together cognitive scientists, neuroscientists, and machine learning researchers to discuss opportunities for improving AI by leveraging scientific understanding of human perception and cognition. Valuable knowledge and experience from cognitive science are expected to feed AI and SSL, and to inspire useful learning regimes with the strength and flexibility of the human cognitive architecture.
In this paper, we aim to present a comprehensive survey of the current developments of the SSL paradigm, and to introduce some closely related research topics. We will try to define SSL-related concepts so as to clarify their meanings and help avoid confusing and inconsistent usage of these terms in the literature. The relation of SSL to biological plausibility and some discussions of future directions worth investigating will also be presented. The paper is organized as follows. Section 2 presents a definition of SSL, as well as its two categories of learning paradigms: concept learning and experience learning; the biological plausibility of SSL is also given in this section. Section 3 summarizes recent techniques of concept learning, and Section 4 surveys those of experience learning. Section 5 introduces some related research directions of SSL. Finally, we discuss future directions of SSL in Section 6 and conclude the paper in Section 7.
2 Small Sample Learning
In this section, we first present a formal definition for SSL, and then provide some neuroscience evidence to support the rationality of this learning paradigm.
2.1 Definition of SSL
To the best of our knowledge, current research on SSL focuses on two learning aspects, experience learning and concept learning, which we introduce in detail in the following. We thus try to interpret all these related concepts as well as SSL itself. Besides, some other related notions, like k-shot learning, will also be clarified.
To start with, we interpret the task of machine learning based on the description in (jordan2015machine) as follows: a Learning Algorithm helps Learners improve a certain performance measure when executing some tasks, through pre-collected experiential Data (see Fig. 2(a)). Conceptually, machine learning algorithms can be viewed as searching, through a large space of candidate learners and guided by training experience, for a learner that optimizes the performance measure. This setting usually requires a large amount of labeled data for training a good learner.
2.1.1 Experience Learning & Concept Learning
Experience learning is a specific SSL paradigm in which samples directly related to the task are highly insufficient. In other words, this kind of SSL regime co-exists with Large Sample Learning (LSL), and its main goal is to reduce or meet the sample-size requirements of LSL methods. Given a small-sample input S, the main strategies employed in this line of research include two categories of approaches, making use of augmented data A and a knowledge system K, respectively (see Fig. 2(b)), as introduced in the following.
The augmented data approach attempts to complement the input data with other sources of data highly related to the input small samples, usually yielded through transformation, synthesis, imagination or other means, so as to make LSL applicable (kulkarni2015deep; long2017zero; antoniou2017data; chen2018semantic).
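As a minimal illustration of the augmented-data idea (a toy sketch of our own, not the method of any particular cited work), one can expand a small set of feature vectors via perturbation-based synthesis; real systems would instead use learned transformation, synthesis or imagination models:

```python
import random

def augment(small_samples, n_aug=5, noise=0.05, seed=0):
    """Expand a small set of feature vectors by adding jittered copies.

    A toy stand-in for the transformation/synthesis/imagination-based
    augmentation discussed above; practical methods learn these
    transformations rather than sampling uniform noise.
    """
    rng = random.Random(seed)
    augmented = [list(x) for x in small_samples]  # keep the originals
    for x in small_samples:
        for _ in range(n_aug):
            augmented.append([v + rng.uniform(-noise, noise) for v in x])
    return augmented
```

The augmented set can then be fed to an ordinary LSL learner, which is precisely the point of this strategy: it converts a small-sample problem into one where LSL is applicable.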
The knowledge system approach can exploit the following types of knowledge:
Representations from other domains: Transferring knowledge that describes the same target from different domains, e.g., the lack of visual instances can be compensated by semantic descriptions of the same object (srivastava2012multimodal; ramachandram2017deep) (see Fig. 3(a)).
Cognition knowledge on concepts: Such knowledge includes common-sense knowledge, domain knowledge, and other prior knowledge on the learned concept with small training samples (davis2015commonsense; doersch2015unsupervised; stewart2017label). For example, if we want to localize the eyes of Lenna (see Fig. 3(c)), we can employ the cognition knowledge that the eyes are positioned above the mouth.
Meta knowledge: High-level knowledge beyond the data, which can help compensate the learning of each concept with insufficient samples (lemke2015metalearning; Lake2016).
In some literature, the setting of experience learning is also called small data learning (vinyals2016matching; santoro2016meta; hariharan2017low; edwards2016towards; altae2017low; munkhdalai2017meta). For unified description, we call this learning manner experience learning throughout.
Compared with experience learning, concept learning represents the more primary branch of current SSL research. This learning paradigm aims to perform recognition or form new concepts (samples) from few observations (samples) through fast processing. Conceptually, concept learning employs a matching rule M to associate concepts in a concept system C with the input small samples S; its function is mainly to perform cognition or complete a recognition task, as well as generation, imagination, synthesis and analysis (see Fig. 2(c)). The aforementioned notions are explained as follows:
Concept system C includes intensional representations and extensional representations of concepts (see Fig. 4):
Intensional representation indicates precise definitions, in propositional or semantic form, of the learned concept, like its attribute characteristics.
Extensional representation denotes prototypes and instances related to the learned concept.
Matching rule M denotes a procedure to associate concepts in the concept system C with small samples S to implement a cognition or recognition task, keeping the result optimal in terms of a performance measure P.
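As an illustrative sketch (with names of our own choosing), a matching rule M over a toy concept system C can be realized as nearest-prototype search against each concept's extensional representations, returning no match, and hence triggering new-concept formation, when no similarity clears a threshold:

```python
def similarity(a, b):
    # negative Euclidean distance: larger means more similar
    return -sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def match(sample, concept_system, threshold=float("-inf")):
    """Toy matching rule M: score the sample against each concept's stored
    prototypes/instances (its extensional representation) and return the
    best-matching concept, or None when no score clears the threshold --
    the case in which a new concept would have to be formed."""
    best, best_score = None, float("-inf")
    for name, concept in concept_system.items():
        score = max(similarity(sample, p) for p in concept["prototypes"])
        if score > best_score:
            best, best_score = name, score
    return best if best_score >= threshold else None
```

A fuller matching rule would also score the sample against intensional (attribute-level) representations; this sketch covers only the extensional side.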
2.1.2 k-shot Learning
k-shot learning aims to learn information about object categories from only k training images per category, where k is generally a very small number like 0 or 1. The mathematical expression of this learning issue can be described as follows: given a dataset D composed of a training set D_tr with seen-class label space Y_tr and a test set D_te with unseen-class label space Y_te, each datum is a d-dimensional feature vector with a corresponding label, and it is usually supposed that the two label spaces are disjoint, i.e., Y_tr ∩ Y_te = ∅. Zero-shot learning (ZSL) aims to recognize the unseen objects in D_te by leveraging the knowledge in D_tr; that is, ZSL aims to construct a classifier for the unseen classes by using only the knowledge in D_tr. Particularly, if the training and test classes are not disjoint, i.e., Y_tr ∩ Y_te ≠ ∅, the problem is known as generalized zero-shot learning (GZSL) (chao2016empirical; xian2017zero; song2018transductive). Comparatively, k-shot learning aims to construct a classifier by means of the information in D_tr together with an additional support set of unseen-class objects, in which each category contains k labeled objects. Especially, when k = 1, we call it one-shot learning.
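The data partitions above can be made concrete with a small helper (a hedged sketch; all names are ours): given a choice of unseen classes and a value of k, it produces the seen-class training set, the k-shot support set of unseen-class examples, and the unseen-class query set, with k = 0 recovering the ZSL setting:

```python
def make_kshot_split(dataset, unseen_classes, k):
    """Partition (x, y) pairs into a seen-class training set, a k-shot
    support set of unseen-class examples, and an unseen-class query set.
    k = 0 yields the zero-shot setting (empty support set)."""
    train, support, query = [], [], []
    taken = {c: 0 for c in unseen_classes}
    for x, y in dataset:
        if y not in unseen_classes:
            train.append((x, y))       # seen classes: ordinary training data
        elif taken[y] < k:
            support.append((x, y))     # the k labeled unseen-class examples
            taken[y] += 1
        else:
            query.append((x, y))       # unseen-class examples to recognize
    return train, support, query
```

In a GZSL evaluation, the query set would additionally contain held-out seen-class examples, so that the classifier must discriminate over Y_tr ∪ Y_te.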
Note that experience learning and concept learning are two categories of learning approaches to SSL, while k-shot learning merely describes a setting of the SSL problem and can appear under both learning manners. In current research, k-shot learning is mainly studied for the recognition problem (fu2018recent). In the future, more problems are worth investigating, such as generation (Lake2015), synthesis (long2017zero) and parsing (zhu2018zero; bansal2018zero; rahman2018zero; shaban2017one), considering that concepts are far more complicated than object categories alone.
2.2 Neuroscience Evidences for SSL
The motivation of SSL is to mimic the learning capability of humans, who can learn new concepts from small samples with strong generalization ability. Here we list some neuroscience evidence to further support the feasibility of the SSL paradigm (Hassabis2017).
2.2.1 Episodic Memory
Human intelligence relies on multiple memory systems (tulving1985many), including procedural, semantic and episodic memory (tulving2002episodic), among others. In particular, experience replay (Mnih2015; schaul2015prioritized), a theory describing how the multiple memory systems in the mammalian brain might interact, is critical to maximizing data efficiency. In complementary learning systems (CLS) theory (kumaran2016learning), mammals possess two learning systems: a parametric, slow-learning neocortical system and a non-parametric, fast-learning hippocampal system (see Fig. 5(a)). The hippocampus encodes novel information after a single exposure, and this information is gradually consolidated to the neocortex during sleep or resting periods interleaved with periods of activity (Hassabis2017). In O'Neill's view (o2010play), consolidation is accompanied by replay in the hippocampus and neocortex, observed as a reinstatement of the structured patterns of neural activity that accompanied the learning event. Therefore, experiences stored in a memory buffer can be used to gradually adjust the parameters of a learning machine, as in SSL, supporting rapid learning based on an individual experience. This learning procedure is underpinned by episodic control (blundell2016model), by which rewarded action sequences can be internally re-enacted from a rapidly updatable memory store (gershman2017reinforcement). Recently, episodic-like memory systems have shown considerable promise in allowing new concepts to be learned rapidly based on only a few examples (vinyals2016matching; santoro2016meta).
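The experience-replay idea above is commonly realized in AI systems as a memory buffer from which past episodes are re-sampled for consolidation; a minimal sketch (our own naming, not a specific system's API) is:

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience-replay store in the spirit of the episodic-memory
    account above: a fast, instance-based buffer encodes each experience
    after a single exposure, and replayed mini-batches are later used to
    gradually adjust a slow-learning parametric model."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest episodes are evicted
        self.rng = random.Random(seed)

    def store(self, experience):
        self.buffer.append(experience)  # a single exposure suffices to encode

    def replay(self, batch_size):
        # interleaved "consolidation": re-sample past episodes for training
        return self.rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```

The split between a fast non-parametric store (the buffer) and a slow parametric learner (whatever consumes the replayed batches) mirrors the CLS division between hippocampus and neocortex.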
2.2.2 Two-pathway guided search
One notable feature of SSL is fast learning. For example, visual search is necessary for rapid scene analysis in daily life, because information processing in the visual system is limited to one or a few targets or regions at a time. There exists a two-pathway guided search theory (wolfe2011visual) that accounts for fast human visual search. As shown in Fig. 5(b), observers extract spatial layout information rapidly from the entire scene via the non-selective pathway; this global scene information then acts as top-down modulation to guide salient object search in the selective pathway. This two-pathway search strategy provides parallel processing of global and local information for rapid visual search.
2.2.3 Imagination and Planning
Humans are experts in simulation-based planning. Specifically, humans can flexibly select actions based on predictions of long-term outcomes, which are generated from an internal model of the environment learned through experience (dolan2013goals; pezzulo2014internally). Humans are able not only to remember past experience but also to imagine or simulate the future (schacter2012future), which includes memory-based simulations and goal-directed simulations. Although imagination is intrinsically subjective and unobservable, there are reasons to believe that it plays a conserved role in simulation-based planning across species (schacter2012future; hassabis2009construction) (see Fig. 5(c)). In AI systems, some progress based on simulation-based planning and imagination has been made in scene understanding (eslami2016attend), 3D structure learning (rezende2016unsupervised), one-shot generalization of characters (rezende2016one), zero-shot recognition (long2017zero), and imagination of realistic environments (chiappa2017recurrent; racaniere2017imagination; gemici2017generative; hamrick2017metacontrol), leading to improved data efficiency and novel concept learning.
It should be noted that two key ideas, compositionality and causality (Lake2016), underlie the imagination and planning procedure. Rich concepts can be built compositionally from simpler primitives (Lake2015). Compositionality allows a finite set of primitives to be reused across many scenarios by recombining them to produce an exponentially large number of novel yet coherent and useful concepts (Lake2016). Recent progress has been made by SCAN (higgins2017scan) in learning an implicit hierarchy of abstract concepts from as few as five symbol-image pairs per concept. The capacity to extract causal knowledge (sloman2005causal) from the environment allows us to imagine and plan future events, and to use such imagination and planning to decide on a course of simulation and explanation (tervo2016toward; bramley2017constructing). Typically, Lake2015 exploits the causal relationships in combining primitives to reduce the dimension of the hypothesis space, which leads to success in classifying and generating new examples after seeing just a single example of a new concept (see Fig. 5(d)).
3 Techniques on Concept learning
In this section, we give an overview of current developments in concept learning techniques. We first present a general methodology for concept learning, covering intension matching, extension matching, and intension/extension mixed matching, and then review relevant methods from these three aspects, respectively.
3.1 A General Methodology for Concept Learning
Concept learning aims to perform recognition or form new concepts (classes) from few observations (samples) through fast processing. As shown in Fig. 6, a concept system generally includes intensional and extensional representations of concepts. When small samples arrive, it is assumed that there exist various domain feature representations, like visual representations and semantic representations. Concept learning then employs a matching rule to associate concepts in the concept system with the feature representations of the small samples.
We summarize this procedure in Algorithm 1 (steps 1-3 can be executed in any order). Particularly, different domain representations can be integrated to match the intensional representations of concepts (intension matching), and virtual instances can be generated as extensional representations to match the small samples (extension matching). Moreover, feature representations and concept representations can be aligned in a middle-level feature space, often an embedding space. Through these matching rules, the recognition task can be implemented to return the final result. In the unexpected situation that no concept matches the small samples, a new concept can be formed via synthesis and analysis to update the concept system (scheirer2013toward; shmelkov2017incremental).
3.2 Intension Matching
To match the intensional representations of concepts with different domain representations, one generally needs to learn a mapping between feature representations F and intensional representations I, which can be bidirectional, i.e., from F to I or from I to F, and then output the result that maximizes the similarity between the two representations.
3.2.1 From visual feature space to semantic space
The first category of intension matching approaches learns a mapping function by regression from the visual feature space to the semantic space, which includes attributes (farhadi2009describing; parikh2011relative), word vectors (frome2013devise; socher2013zero), text descriptions (elhoseiny2013write; reed2016learning) and so on. Such approaches mainly fall into two categories:
Semantic Embedding. This approach directly learns a mapping function between the visual feature space and the semantic embedding space. The function is usually learned from the labelled training visual data consisting of seen classes only. After that, zero-shot classification is performed directly by measuring similarity using nearest neighbour (NN) or its probabilistic variants such as direct attribute prediction (lampert2009learning; lampert2014attribute).
Semantic Relatedness. This approach learns an n-way discrete classifier for the seen classes in the visual feature space, which is then used to compute the visual similarity between an image of an unseen class and images of the seen classes. Specifically, the semantic relationship between the seen and unseen classes is modeled either by the distance between their prototypes, to combine knowledge of the seen classes, or by a knowledge graph used to distill the relationships (rohrbach2010helps; fergus2010semantic; deng2014large; marino2017more).
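To make the semantic-embedding route concrete, the following hedged sketch (pure Python; all names are ours) learns a linear visual-to-semantic mapping by ridge-regularized least squares via gradient descent, and then classifies by nearest-neighbour search over class prototypes in semantic space, so that unseen classes become recognizable through their prototypes alone:

```python
def fit_linear_map(X, S, lr=0.1, epochs=500, lam=0.01):
    """Learn a d x m matrix W mapping visual features X (n x d) to semantic
    vectors S (n x m) by ridge-regularized least squares, via plain
    gradient descent on ||XW - S||^2 / n + lam * ||W||^2."""
    n, d, m = len(X), len(X[0]), len(S[0])
    W = [[0.0] * m for _ in range(d)]
    for _ in range(epochs):
        pred = [[sum(X[i][a] * W[a][j] for a in range(d)) for j in range(m)]
                for i in range(n)]
        for a in range(d):
            for j in range(m):
                g = sum(X[i][a] * (pred[i][j] - S[i][j]) for i in range(n)) / n
                W[a][j] -= lr * (g + lam * W[a][j])
    return W

def zsl_classify(x, W, prototypes):
    """Project x into the semantic space and return the nearest class
    prototype (NN matching), which may belong to an unseen class."""
    d, m = len(W), len(W[0])
    s = [sum(x[a] * W[a][j] for a in range(d)) for j in range(m)]
    return min(prototypes,
               key=lambda c: sum((s[j] - prototypes[c][j]) ** 2 for j in range(m)))
```

Only seen-class pairs (X, S) are used for fitting W; unseen classes enter solely through their semantic prototypes at test time, which is the defining property of the semantic-embedding route.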
The pioneering work on semantic embedding was conducted by lampert2009learning; lampert2014attribute, which used a Bayesian model to build the relationship between the visual feature space and the semantic space. They provided two models for zero-shot learning, i.e., direct attribute prediction (DAP) and indirect attribute prediction (IAP), whose idea is to learn the probability of attributes for a given visual instance as a prior and compute a MAP prediction of the unseen classes. Variants like topic models (yu2010attribute), random forests (jayaraman2014zero), and Bayesian networks (wang2013unified) have been explored. Meanwhile, palatucci2009zero presented a semantic output codes classifier, which directly learned a function (e.g., a regression) from the visual feature space to the semantic feature space by one-to-one attribute encoding according to its knowledge base. Beyond linear embeddings, and benefiting from deep learning (goodfellow2016deep), some nonlinear embeddings have also been developed. For example, socher2013zero learned a deep model to map images close to the semantic word vectors corresponding to their classes; this embedding can also be used to distinguish whether an image belongs to a seen or unseen class. Also, frome2013devise presented a deep visual-semantic embedding model (DeViSE) trained to bridge visual objects using both labeled image data and semantic information gleaned from unannotated text. Moreover, they tried to overcome the limited ability to scale to large numbers of object categories by introducing unannotated text beyond annotated attributes, achieving good performance on the 1000-class ImageNet object recognition task for the first time. Afterwards, Zhang_2018_CVPR first investigated polynomial and RBF-family kernels to obtain a nonlinear embedding. Recently, some works have introduced additional constraints for learning the mapping. For example, deutsch2017zero cast ZSL as fitting a smooth function, defined on a smooth manifold (the visual domain), to sample data. To enhance discriminative capability, morgado2017semantically introduced two forms of semantic constraints into the CNN architecture, encouraging the model to learn a hidden semantic layer together with a semantic code for classification.
(yu2018stacked) further applied an attention mechanism to generate an attention map weighting the importance of different local regions, and then integrated both local and global features to obtain more discriminative representations for fine-grained ZSL. Recently, chen2018zero introduced adversarial learning to enable semantics transfer across classes and improve classification. Another attempt, presenting an unsupervised-data adaptation inference framework for few/zero-shot learning, was made by (tsai2017learning) to learn robust visual-semantic embeddings. Specifically, they combined auto-encoder representation learning with cross-domain learning criteria (i.e., a Maximum Mean Discrepancy loss) to learn joint embeddings for semantic and visual features in an end-to-end framework.
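The DAP scheme discussed at the beginning of this subsection can be sketched as follows (an illustrative simplification assuming independent binary attributes and uniform class priors; all names are ours): per-attribute probabilities predicted for a test image are combined into a likelihood of each unseen class's attribute signature, and the MAP class is returned:

```python
def dap_predict(attr_probs, class_signatures):
    """Simplified direct attribute prediction (DAP): attr_probs[m] is
    p(attribute m present | image), produced by independently trained
    attribute classifiers; each unseen class is scored by the likelihood
    of its binary attribute signature (uniform class priors assumed),
    and the MAP class is returned."""
    best, best_score = None, -1.0
    for cls, sig in class_signatures.items():
        score = 1.0
        for p, a in zip(attr_probs, sig):
            # likelihood of observing this attribute value under the class
            score *= p if a else (1.0 - p)
        if score > best_score:
            best, best_score = cls, score
    return best
```

The attribute signatures act as the intensional representations of the unseen classes: no visual example of those classes is ever needed at training time.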
Based on semantic relatedness, norouzi2013zero mapped images into the semantic embedding space via a convex combination of the seen-class label embedding vectors, i.e., using the probabilities of a softmax output layer to weight the semantic vectors of all classes, which enables zero-shot learning on the large-scale ImageNet dataset. Likewise, mensink2014costa used co-occurrence statistics, learning concept-concept relationships from texts (between the new label and existing ones) as weights to combine seen-class classifiers, while changpinyo2016synthesized directly applied the convex combination scheme to synthesize classifiers for the unseen classes. Similar work (misra2017red) applied a simple composition rule to generate classifiers for new complex concepts. On the other hand, salakhutdinov2011learning early on used a knowledge graph like WordNet to build a hierarchical classification model that allowed rare objects to borrow statistical strength from related objects that may have many training instances. After that, deng2014large introduced hierarchy and exclusion (HEX) graphs to train object classifiers by leveraging mutual exclusion among different classes. To define a proper similarity metric between a test image and the unseen-class prototypes for ZSL, fu2015zero further explored the rich intrinsic semantic manifold structure using a semantic graph, in which each class is a node and the connectivity on the graph is determined by the semantic relatedness between classes. Recently, semantic relatedness has also provided supervision for transferring knowledge from seen to unseen classes. For example, guo2017zero proposed a sample transfer strategy that transferred samples from seen to unseen classes based on their transferability and diversity via class similarity, and assigned pseudo-labels to them to train classifiers.
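The convex-combination idea of norouzi2013zero can be sketched in miniature (a hedged illustration, not their released implementation): the softmax probabilities of a seen-class classifier weight the seen-class semantic vectors to place a test image in semantic space, where nearest-prototype search over unseen classes can follow:

```python
def convex_combination_embedding(softmax_probs, seen_class_vectors, top_t=None):
    """Place an image in semantic space as the probability-weighted convex
    combination of the seen-class semantic vectors, optionally keeping
    only the top_t most probable seen classes (renormalizing weights)."""
    ranked = sorted(zip(softmax_probs, seen_class_vectors),
                    key=lambda t: -t[0])[:top_t]
    z = sum(p for p, _ in ranked)  # renormalize over the kept classes
    m = len(seen_class_vectors[0])
    return [sum(p * v[j] for p, v in ranked) / z for j in range(m)]
```

Note that no mapping function is trained here: the seen-class classifier and the class embedding vectors are simply composed, which is what distinguishes this semantic-relatedness route from the semantic-embedding route above.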
Furthermore, li2017zero exploited the intrinsic relationship between the semantic space manifold and the transferability of the visual-semantic mapping to generate a semantic space more consistent with the image feature space. The graph convolutional network (GCN) technique was introduced in (wang2018zero) to transfer information (message passing) between different categories. They tried to distill information via both semantic embeddings and knowledge graphs, in which a knowledge graph provided supervision to learn meaningful classifiers on top of semantic embeddings. Another work employing an information propagation mechanism to reason about the unseen labels was proposed by (lee2017multi), which designed multi-label ZSL by incorporating knowledge graphs describing the relationships between multiple labels.
Since the visual domain and semantic domain have different tasks and non-overlapping label spaces, the aforementioned methods are prone to the projection domain shift problem (fu2014transductive). To alleviate this issue, some works rely on the manifold assumption. fu2015transductive first proposed a method to preserve the coherence of the manifold structures of different representation spaces. Then li2017zerostructure incorporated a graph Laplacian regularization to preserve in the label space the geometric properties of the target data in the visual feature space, and similar work focusing on preserving the local visual structure was conducted by ji2017manifold. Moreover, xu2017matrix used a matrix tri-factorization framework with manifold regularization on the visual feature and semantic embedding spaces, and xu2017transductive investigated manifold-regularized regression for zero-shot action recognition. Other works investigated joint structure learning between seen and unseen classes. For example, kodirov2015unsupervised cast the mapping-function learning problem as a sparse coding problem, jointly learning seen and unseen semantic embeddings. Specifically, each dimension of the semantic embedding space corresponds to a dictionary basis vector, and the coefficients/sparse code of each visual feature vector is its projection in the semantic embedding space; visual projections in the semantic embedding space are enforced to be near the unseen-class prototypes. zhang2016zero further proposed a joint structured prediction scheme to seek a globally well-matched assignment structure between visual clusters and unseen classes at test time. Other attempts borrowed the idea of self-paced learning (kumar2010self; jiang2014easy; jiang2014self), as in yu2017transductiveadaptive and niu2017zero.
In a nutshell, these methods iteratively select the unseen instances from reliable to less reliable, gradually refining the predicted test labels and updating the visual classifiers for unseen categories alternately. Along this line, similar works were developed by (ye2018self) and (luo2018zero). Comparatively, in each iteration, ye2018self selected the most confidently predicted unlabeled instances to refine the ensemble network parameters, while luo2018zero refined the class prototypes instead of the labels. Different from other methods, kodirov2017semantic adopted the encoder-decoder paradigm, arguing that the additional reconstruction constraint is very effective in mitigating the domain shift problem. Likewise, fu2018zero extended (fu2015zero) by introducing a ranking loss. Specifically, the ranking loss objective was regularised by unseen-class prototypes to prevent the projected object features from being biased towards the seen prototypes.
Another important factor degrading recognition performance is that the textual representation is usually too noisy. Against this issue, qiao2016less proposed a norm-regularized objective function which could simultaneously suppress the noisy signal in the text and learn a function to match the text document and visual features. Afterwards, al2017automatic used a linguistic prior in a joint deep model to optimize the class-attribute associations, addressing noise and missing data in the text corpora. Besides, elhoseiny2017link proposed a learning framework able to connect text terms to the relevant parts of objects and suppress connections to non-visual text terms without any part-text annotations. More recently, Zhu_2018_CVPR simply passed textual features through an additional fully connected layer before feeding them into the generator, arguing that this modification achieved comparable noise-suppression performance.
3.2.2 From semantic space to visual feature space
The second category of approaches along this research line learns a mapping function from the semantic space to the visual feature space. This direction was originally motivated by the hubness problem, i.e., the neighbourhoods surrounding mapped vectors contain many items that are “universal” neighbours; radovanovic2010hubs and dinu2014improving first noticed this problem in zero-shot learning. shigeto2015ridge argued that least-squares regularised projection functions make the hubness problem worse, and first proposed to perform reverse regression, i.e., embedding class prototypes into the visual feature space. A transductive setting was assumed in (shojaee2016semi), which used both labeled samples of seen classes and unlabeled instances of unseen classes to learn a proper representation of labels in the space of deep visual features, in which the samples of each class are usually condensed in a cluster. After that, changpinyo2017predicting learned a mapping function such that the semantic representation of a class can well predict its class exemplar (center) characterizing the clustering structure, and the function was then used to construct nearest-neighbor style classifiers. Different from the aforementioned works, zhang2017learning learned an end-to-end deep model mapping the semantic space to the visual feature space, which dealt with the hubness problem efficiently. Recently, annadani2018preserving
learned a multilayer perceptron based encoder-decoder that preserved the structure of the semantic space in the embedding space (visual feature space) by utilizing semantic relations between categories while ensuring discriminative capability.
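As a minimal sketch of the reverse-regression idea (mapping class semantics into the visual space and classifying by the nearest projected prototype), assuming toy arrays and a simple ridge solver rather than any cited method's actual code:

```python
import numpy as np

def fit_reverse_ridge(S, X, lam=1.0):
    """Ridge regression from class semantics S (n, d_s) to visual features X (n, d_v).

    Embedding class prototypes into the visual feature space (the reverse
    direction) is the hubness remedy suggested by shigeto2015ridge.
    """
    d_s = S.shape[1]
    W = np.linalg.solve(S.T @ S + lam * np.eye(d_s), S.T @ X)  # (d_s, d_v)
    return W

def classify(x, class_semantics, W):
    """Assign x to the class whose projected prototype is nearest in visual space."""
    protos = class_semantics @ W                  # (n_classes, d_v)
    return int(np.argmin(np.linalg.norm(protos - x, axis=1)))
```

At test time, unseen-class semantic vectors are simply pushed through `W` and nearest-neighbour search happens in the visual space, where hubness is less severe.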
3.3 Extension Matching
This category of approaches is constructed by using the input feature, or generating a series of virtual instances according to the feature, so that it can be compared with the instances in the extension representation, finding the one that maximizes the similarity between the extensional representation and the feature/virtual instances. This is motivated by the fact that humans can associate familiar visual elements and then imagine an approximate scene given a conceptual description. Note that extension matching is different from learning a mapping function from the semantic space to the visual feature space. Intuitively, the latter can be regarded as learning how to recognize the characteristics of an image and match it to a class; on the contrary, extension matching can be described as learning what a class visually looks like. Extension matching has two explicit advantages over the learning manner introduced in the previous section:
1) The mapping-based framework inclines to bring information loss into the system, which degrades the overall performance. Comparatively, extension matching recognizes a new instance in the original space, which helps alleviate this problem.
2) Through synthesizing a series of virtual instances, we can always turn the SSL problem into a conventional supervised learning (LSL) problem, so that we can take advantage of the power of LSL techniques in the SSL task, or directly use the nearest neighbour (NN) algorithm.
The early work on extension matching was conducted by (yu2010attribute), synthesizing data for ZSL using the Author-Topic (AT) model. The drawback of the method is that it only deals with discrete attributes and discrete visual features, like bag-of-visual-words features, accounting for the attributes, whereas visual features usually take continuous values in the real world. More methods have been proposed in recent years, which can be roughly divided into the following three categories:
1) Learning an embedding function from the semantic space to the visual feature space. long2017zerocvpr; long2017zero provided a framework to synthesize unseen visual (prototype) features from given semantic attributes. As aforementioned, the mapping-based framework may lead to inferior performance owing to three main problems: structural difference, training bias, and variance decay. In correspondence, a latent structure-preserving space via a dual-graph approach with diffusion regularisation is proposed in their work.
2) Learning a probabilistic distribution for each seen class and extrapolating to unseen-class distributions using the class-attribute information. Assuming that the data of each class in the image feature space approximately follow a Gaussian distribution, guo2017synthesizing synthesized samples by random sampling from the distribution of each target class. Technically, the conditional probabilistic distribution of each target class was estimated by linear reconstruction based on the structure of the class attributes. Meanwhile, zhao2017zero posed ZSL as a missing data problem, estimating the data distribution of unseen classes in the image feature space by transferring the manifold structure in the label embedding space to the image feature space. More generally, verma2017simple modeled each class-conditional distribution as an exponential family distribution, with the parameters of each seen/unseen class distribution defined as functions of the respective class attributes. These functions can be learned using only the seen-class data and then used to predict the parameters of the class-conditional distribution of each unseen class. Another attempt was to develop a joint attribute feature extractor (lu2017zero), in which each fundamental unit was in charge of extracting one attribute feature vector; based on the attribute descriptions of unseen classes, a probability-based sampling strategy was then exploited to select attribute feature vectors to synthesize combined feature representations for unseen classes.
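A minimal numpy sketch of this Gaussian-synthesis idea, with the unseen-class mean obtained by reusing the linear-reconstruction coefficients of its attribute vector (all names, shapes, and the shared covariance are illustrative assumptions, not the cited method's exact formulation):

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize_unseen(means_seen, attrs_seen, attr_unseen, cov, n=100):
    """Synthesize virtual features for an unseen class (Gaussian assumption).

    The unseen-class mean is estimated by linearly reconstructing its
    attribute vector from the seen-class attributes (least squares), and the
    same combination coefficients are applied to the seen-class feature means.
    """
    # coefficients a such that attrs_seen.T @ a ≈ attr_unseen
    a, *_ = np.linalg.lstsq(attrs_seen.T, attr_unseen, rcond=None)
    mean_unseen = means_seen.T @ a          # transfer the structure to feature space
    return rng.multivariate_normal(mean_unseen, cov, size=n)
```

The synthesized samples can then feed any conventional supervised classifier, turning ZSL into an LSL problem as described above.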
3) Using a generative model like the generative adversarial network (GAN) (goodfellow2014generative)
or variational autoencoder (VAE) (kingma2013auto) to model the unseen classes’ distributions from the semantic descriptions and the visual distribution of the seen classes. In particular, using generated examples of unseen classes together with given examples of seen classes to train a classification model provides an easy way to handle the GZSL problem. For example, bucher2017generating learned a conditional generator (e.g., conditional GAN (odena2017conditional), denoising auto-encoder (bengio2013generalized), and so on) for generating artificial training examples to address the ZSL and GZSL problems. Furthermore, xian2017feature proposed a conditional Wasserstein GAN (gulrajani2017improved) with a classification loss, f-CLSWGAN, generating sufficiently discriminative CNN features from different sources of class embeddings. Similarly, by leveraging GANs, zhang2018visual realized zero-shot video classification. Other works focus on VAEs (kingma2013auto). For example, wang2018zero represented each seen/unseen class by a class-specific latent-space distribution, and used a VAE to learn highly discriminative feature representations for the inputs; at test time, the label of an unseen-class test input is the class that maximizes the VAE lower bound. Afterwards, mishra2017generative trained a conditional VAE (sohn2015learning)
to learn the underlying probability distribution of the image features conditioned on the class embedding vector. Similarly, arora2018generalized proposed a method able to generate semantically rich CNN feature distributions via a conditional VAE with a discriminator-driven feedback mechanism that improves the reconstruction capability.
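The common GZSL recipe above, training a classifier on real seen features plus generated unseen features, can be sketched as follows; the `gen` callable is a stand-in for any conditional GAN/VAE generator, and a nearest-centroid classifier replaces the usual softmax classifier for brevity (all names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

def gzsl_train_with_synthesis(X_seen, y_seen, gen, unseen_ids, n_per_class=50):
    """Train a GZSL classifier on real seen features plus generated unseen features.

    gen(c, n) -> n synthetic feature vectors for class c (stand-in for a
    trained conditional GAN/VAE). Returns class ids and their centroids.
    """
    X, y = [X_seen], [y_seen]
    for c in unseen_ids:
        X.append(gen(c, n_per_class))             # hallucinated unseen features
        y.append(np.full(n_per_class, c))
    X, y = np.vstack(X), np.concatenate(y)
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(x, classes, centroids):
    """Nearest-centroid prediction over seen and unseen classes jointly."""
    return int(classes[np.argmin(np.linalg.norm(centroids - x, axis=1))])
```

Because seen and unseen classes are trained jointly on (real + synthetic) data, the classifier is not biased towards predicting only seen classes, which is the central difficulty of GZSL.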
3.4 Intension/Extension Mixed Matching
This category of methods aims to map both feature and concept representations into a middle-level representation space, and then predict the class label of an unseen instance by ranking the similarity scores between the semantic features of all unseen classes and the visual feature of the instance in that space. The middle-level representation may be the result of a mathematical transformation (say, a Fourier transformation). This strategy is different from the semantic relatedness strategies introduced in Section 3.2.1, where semantic relatedness provides supervision information for combining or transferring knowledge from seen classes to unseen classes. Comparatively, intension/extension mixed matching implicitly or explicitly learns a middle-level representation space in which the similarity between the visual and semantic spaces can be easily determined. There are mainly three categories of typical approaches for learning the middle-level representation space, as summarized in the following.
1) Learning an implicit middle-level representation space through consistency functions like compatibility functions, canonical correlation analysis (CCA), or other strategies. E.g., akata2013label; akata2016label proposed a model that implicitly learned embeddings of instances and attributes onto a common space where the compatibility between any pair of them can be measured; when given an unseen image, the correct class should rank higher than the incorrect ones. This consistency function takes the form of a bilinear relation associating the image embedding and the label representation. To make the whole process simpler and more efficient, romera2015embarrassingly
proposed a different loss function and regularizer based on the same principle as (akata2013label; akata2016label), with a closed-form solution. Moreover, akata2015evaluation learned a joint embedding semantic space between attributes, text, and hierarchical relationships, while akata2013label considered attributes as output embeddings. To learn more powerful mappings, some works pursue nonlinear styles: xian2016latent learned a nonlinear (piecewise linear) compatibility function, incorporating multiple linear compatibility units and allowing each image to choose one of them, achieving factorization over (possibly complex combinations of) variations in pose, appearance and other factors. Readers can refer to (xian2017zero), in which both the evaluation protocols and data splits are evaluated for linear and nonlinear compatibility functions. Another line attempts to learn visual features rather than use fixed ones (akata2013label; akata2016label; akata2015evaluation; romera2015embarrassingly): inspired by (elhoseiny2013write; elhoseiny2017write), which learned pseudo-concepts to associate novel classes using Wikipedia articles, (ba2015predicting) used text features to predict the output weights of both the convolutional and the fully connected layers of a deep CNN as visual features. Similarly, existing methods may rely on provided fixed label embeddings; a representative counter-example is jiang2017learning, which learned label embeddings with or without side information (encoding prior label representations) and integrated label embedding learning with classifier training, expected to produce adaptive label embeddings that are more informative for the target classification task.
To overcome the large performance gap in zero-shot classification between attributes and unsupervised word embeddings, reed2016learning extended (akata2015evaluation)’s work to train end-to-end deep neural language models from texts, using the inner product of features generated by deep neural encoders instead of the bilinear compatibility function, and achieved a recognition accuracy competitive with attributes.
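The bilinear compatibility scoring shared by this family of methods can be sketched as follows (a minimal illustration with hypothetical names, not any cited paper's code):

```python
import numpy as np

def compatibility(x, y_embed, W):
    """Bilinear compatibility score between an image embedding x (d_v,)
    and a label embedding y_embed (d_s,), via the matrix W (d_v, d_s)."""
    return float(x @ W @ y_embed)

def rank_classes(x, label_embeds, W):
    """Rank candidate classes by compatibility; at training time the loss
    pushes the correct class to rank above all incorrect ones."""
    scores = label_embeds @ (W.T @ x)   # equals x^T W y for each label y
    return np.argsort(scores)[::-1]
```

Learning then reduces to fitting `W` (e.g. with a ranking loss and a regularizer, as in the works above); zero-shot prediction only requires the unseen classes' label embeddings.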
On the other hand, early work on CCA was proposed by (hardoon2004canonical), using kernel CCA to learn a semantic representation of web images and their associated text. Recently, gong2014multi investigated a three-view CCA framework that incorporates the dependence of visual features and text on the underlying image semantics for retrieval tasks. fu2015transductive further proposed transductive multi-view CCA to learn a common latent embedding space aligning the different semantic views and the low-level feature view, which alleviated the projection domain shift. Afterwards, cao2017generalized proposed a unified multi-view subspace learning method for CCA using the graph embedding framework for visual recognition and cross-modal retrieval. There also exist other CCA variants for the task, like (qi2017joint; mukherjee2017deep). For example, qi2017joint proposed an embedding model jointly transferring inter-modal and intra-modal labels for an effective image classification model, where the inter-modal label transfer is generalized to zero-shot recognition. mukherjee2017deep introduced deep matching autoencoders (DMAE), which learn a common latent space and pairing from unpaired multi-modal data; DMAE is a general cross-modal learner that can be trained in an entirely unsupervised way, with ZSL as a special case.
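A minimal linear two-view CCA, computed via whitening and an SVD of the cross-covariance, may look like the following sketch (an illustrative baseline, not any of the cited multi-view or kernel variants):

```python
import numpy as np

def cca(X, Y, k=1, eps=1e-8):
    """Minimal linear CCA: project two centred views into a shared k-dim space.

    Whitens each view via its SVD, then takes the top-k singular directions
    of the whitened cross-covariance. Returns projections (Wx, Wy) to apply
    to centred data.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)

    def whiten(A):
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        return U, Vt.T / (s + eps)      # whitened view, and whitening map

    Ux, Px = whiten(X)
    Uy, Py = whiten(Y)
    U, s, Vt = np.linalg.svd(Ux.T @ Uy)  # correlations between whitened views
    return Px @ U[:, :k], Py @ Vt.T[:, :k]
```

Visual features and semantic vectors projected with `Wx`/`Wy` land in a common space where similarity can be measured directly, which is the role CCA plays in the methods above.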
2) Learning an implicit middle-level representation space through dictionary learning. zhang2016zero first learned an intermediate latent embedding based on dictionary learning to jointly learn the model parameters for both domains, which can not only accurately represent the observed data in each domain but also infer cross-domain statistical relationships when they exist. Similar works were also proposed, like (peng2016joint; ding2017low; jiang2017learning; ye2017zero; yu2017transductive; kolouri2017joint). For example, to mitigate the distribution divergence across seen and unseen classes, ding2017low learned a semantic dictionary to link visual features with their semantic representations under a low-rank embedding space assumption, in which the latent semantic dictionary for unseen data should share its majority with that for the seen data. jiang2017learning further proposed to learn a latent attribute space with a dictionary learning framework to simultaneously tackle the problems of attribute-based approaches, i.e., attributes being insufficiently discriminative (yu2013designing), interdependent (jayaraman2014decorrelating), and subject to large variations (kodirov2015unsupervised). Analogously, yu2017transductive formulated a dictionary framework to learn a bidirectional-mapping based semantic relationship modeling scheme that sought cross-modal knowledge transfer by simultaneously projecting the image features and label embeddings into a common latent space. The latest work in (kolouri2017joint) modeled the relationship between visual features and semantic attributes via joint sparse dictionaries, demonstrating that an entropy regularization scheme can help address the domain shift problem and a transductive learning scheme can help reduce the hubness phenomenon.
3) Learning an explicit middle-level representation space. zhang2015zero; zhang2016zero advocated the benefits of using attribute-attribute relationships, termed semantic similarity, as the intermediate semantic representation, and learned a function to match the image features with the semantic similarity. As an extension, long2017zero123 aggregated the visual representation into a discriminative representation, simplifying the images of each class to one template so as to achieve high inter-class variation and low intra-class variation, more powerful than the large margin mechanism of (zhang2015zero; zhang2016zero), and then mapped semantic embeddings to the discriminative representation space. Another work (yu2017zero) used matrix decomposition to expand a latent space from the input modality under an implicit process, with the intuition that learning an explicit encoding function between different modalities may be easily spoiled. Specifically, they learned the optimal intrinsic semantic information of different modalities by decomposing the input features within an encoder-decoder framework, explicitly learning a feature-aware latent space via jointly maximizing the recoverability of the original space from the latent space and the predictability of the latent space from the original space. To eliminate a limitation of existing attribute-based methods, i.e., the sometimes laborious dependency on the attribute signatures of the unseen classes, demirel2017attributes2classname learned a discriminative word representation such that the similarities between class and attribute names follow the visual similarity, and used this learned representation to transfer knowledge from seen to unseen classes.
Similarly to (jiang2017learning), which learned latent attributes, li2018discriminative proposed to learn latent discriminative features for ZSL in both the visual and semantic spaces, learning features from the region containing the object instead of using pre-trained CNN features. In particular, a category-ranking problem was modeled in learning the latent attributes to ensure that they are discriminative.
4 Techniques on Experience Learning
In this section, we give an overview of the techniques on experience learning. We first present a general methodology for experience learning, including the main strategies along this research line, and then review the relevant main techniques.
4.1 General Methodology for Experience Learning
Experience learning denotes the machine learning paradigm designed for circumstances with insufficient samples, also called small data learning. A natural strategy for experience learning is to borrow ideas from LSL techniques, which constitutes the main idea for constructing rational experience learning methods. Specifically, an experience learning task can be implemented using the following approaches:
Approach 1: Increase samples and then directly employ conventional LSL methods;
Approach 2: Utilize small samples to rectify the known models/knowledge learned/obtained from other data sources;
Approach 3: Reduce the dependency of LSL upon the amount of samples to make the method feasible to small samples;
Approach 4: Meta learning.
For convenience, we call the first strategy above the data augmentation (DA) strategy, and the other three the LSL model modification (MD) strategy. Note that these strategies can be performed simultaneously, rather than using only one, in implementing an SSL task. In the following, we review the relevant main techniques of experience learning from these four aspects, respectively.
4.2 Approach 1: Augmented Data
A direct manner of experience learning is to generate more data from the small training samples to compensate for the insufficiency of data; LSL methods can then be directly employed to solve the problem. In the following we summarize five kinds of techniques designed in this manner; in practice, more imaginative strategies could be further designed or may already have been used.
4.2.1 Transformation
By mimicking the imagination mechanism of human beings, more hallucinated samples can be formed as augmented data. This can be realized through various transformations on the original samples, e.g., adding noise, mirroring, scaling, pose and lighting changes (kulkarni2015deep), rotation (okafor2017operational), polar harmonic transforms (yap2010two), radial transforms (salehinejad2017training), and so on. For example, for audio data, salamon2017deep applied four different deformations, namely time stretching, pitch shifting, dynamic range compression, and background noise addition, to overcome data scarcity. For domain-specific applications, there is a requirement to preserve class labels by leveraging task-specific data transformations; ratner2017learning directly leveraged user domain knowledge in the form of transformation operations, able to generate realistic transformed data points useful for data augmentation. Recently, cubuk2018autoaugment introduced an automated approach to find data augmentation policies from data, i.e., to use a search algorithm over a space of augmentation policies like translation, rotation, or shearing to find the policy yielding the highest validation accuracy of the neural network on a target dataset.
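A few such label-preserving transformations can be sketched with numpy on a toy 2-D image array (illustrative only; real pipelines use richer, domain-specific operations):

```python
import numpy as np

rng = np.random.default_rng(2)

def augment(img):
    """Yield simple label-preserving variants of a 2-D image array."""
    yield np.fliplr(img)                          # mirroring
    yield np.rot90(img)                           # 90-degree rotation
    yield img + rng.normal(0, 0.01, img.shape)    # additive noise
    yield np.clip(img * 1.2, 0, 1)                # brightness scaling

img = rng.random((4, 4))
augmented = list(augment(img))
```

Each variant keeps the same class label, so the training set grows fourfold here without any new annotation effort.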
4.2.2 Generative model
This strategy attempts to find the generative model underlying the given small samples, and then use this model to generate more samples for training. Early work along this line focused on semi-supervised learning with generative models, which aroused the attention of many works (chen2016infogan; chongxuan2017triple; gan2017triangle; deng2017structured). Typically, (chongxuan2017triple) achieved a 5% error rate on the SVHN dataset (netzer2011reading) using 1000 examples (fewer than 1% of the whole sample set). Recently, choe2017face
attempted to generate face images with several attributes and poses using a GAN, enlarging the novel set to achieve increased performance on the low-shot face recognition task. Analogously, shrivastava2017learning developed a model called SimGAN that improved the realism of synthetic images from a simulator using unlabeled real data while preserving the annotation information; results show a significant improvement on gaze estimation and hand pose estimation using synthetic images. As an extension of (shrivastava2017learning), Lee2018Simulated leveraged the flexibility of the data simulation process and the efficacy of bidirectional mappings between synthetic and real data. To enhance few-shot learning systems more efficiently, antoniou2017data proposed the data augmentation GAN (DAGAN) to automatically learn to augment data. There also exist works using novel generative models. E.g., hariharan2017low learned to hallucinate additional examples for novel classes by transferring modes of variation from the base classes with reconstruction and classification losses to address low-shot learning. Inspired by the fact that humans can easily visualize or imagine what novel objects look like from different views, wang2018low trained a hallucinator via meta learning to generate additional examples, providing significant gains for low-shot learning. Likewise, vedantam2017generative
first tried to define visually grounded imagination with the evaluation metrics of the 3C’s, i.e., correctness, coverage, and compositionality, and further proposed how to create generative models that can imagine compositionally novel concrete and abstract visual concepts via a modified VAE. Also, to learn compositional and hierarchical representations of visual concepts, higgins2017scan further described the symbol-concept association network (SCAN). Crucially, SCAN can imagine and learn novel concepts never experienced during training through compositional abstract hierarchical representations.
4.2.3 Pseudo-label method
This strategy is specifically designed for cases with a small labeled sample set but sufficient unlabeled samples (i.e., semi-supervised data). The augmented data can be obtained by generating confident pseudo-labels with a self-ameliorating model, such as curriculum/self-paced learning (bengio2009curriculum; kumar2010self; jiang2014easy; jiang2014self), dual learning (he2016dual), and data programming (ratner2016data).
Curriculum learning and self-paced learning are learning regimes inspired by the learning processes of humans and animals, which implement learning by gradually including samples into the training process from easy to complex (bengio2009curriculum; kumar2010self; jiang2014easy; jiang2014self). This regime provides a good way to label new samples using the learning results of earlier stages, boosting the performance of SSL. For example, lin2018active developed a cost-effective framework for face identification, capable of automatically annotating new instances and incorporating them into training under weak expert recertification; experiments demonstrated its accuracy and robustness against noisy data with only a small fraction of samples annotated. In object detection, a self-paced learning (SPL) framework was embedded in the optimization process, with the selected training images going from “easy” to “hard” to gradually improve the object detector (dong2017few). This method trained the detector using only a few annotated images per category (few-example); the detector then generated reliable pseudo box-level labels and was improved with these pseudo-labeled bounding boxes, achieving performance competitive with state-of-the-art weakly supervised object detection approaches (diba2017weakly) while requiring only about 1% of the images in the entire dataset to be annotated. For salient object detection, zhang2017supervision alternately completed the learning procedure without any pixel-level human annotation: reliable supervisory signals are first generated from the fusion of weak saliency models to train the deep salient object detector in iterative learning stages, and the obtained detector is then used to update the weak saliency map collection for the next stage. In medical imaging analysis, li2017self
proposed a self-paced convolutional neural network framework to augment the size of the training set by refining the unlabeled instances, managing to classify computed tomography (CT) image patches with very scarce manual labels. Recently, meng2015objective
provided an insightful understanding that the robustness of self-paced learning to outliers/heavy noise stems from its learning with a latent non-convex regularized penalty. Accordingly, SPL has recently attracted attention in weakly-supervised and webly-supervised learning (see Section 5.1).
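The confident-to-less-confident pseudo-labelling loop shared by these SPL-style methods can be sketched as follows; `fit` and `predict_proba` stand for any base learner (hypothetical signatures, and class ids are assumed to be 0..C-1):

```python
import numpy as np

def self_paced_pseudo_label(fit, predict_proba, X_l, y_l, X_u, rounds=3, frac=0.2):
    """Pseudo-labelling from confident ('easy') to less confident samples.

    fit(X, y) -> model; predict_proba(model, X) -> (n, n_classes) matrix.
    Each round, the most confident unlabelled samples receive pseudo-labels
    and join the training set, then the model is retrained.
    """
    X_train, y_train = X_l.copy(), y_l.copy()
    remaining = X_u.copy()
    for _ in range(rounds):
        if len(remaining) == 0:
            break
        model = fit(X_train, y_train)
        probs = predict_proba(model, remaining)
        conf = probs.max(axis=1)
        k = max(1, int(frac * len(remaining)))
        idx = np.argsort(conf)[::-1][:k]          # easiest (most confident) first
        X_train = np.vstack([X_train, remaining[idx]])
        y_train = np.concatenate([y_train, probs[idx].argmax(axis=1)])
        remaining = np.delete(remaining, idx, axis=0)
    return fit(X_train, y_train)
```

Admitting only high-confidence samples per round is what keeps the pseudo-labels reliable; the pace parameter (`frac` here) plays the role of the age/pace parameter in SPL.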
Dual learning was first proposed by (he2016dual)
in neural machine translation. Specifically, the method defines primal and dual tasks, e.g., English-to-French translation versus French-to-English translation, which form a closed loop. In the closed loop, the primal translation model translates unlabelled English into pseudo-French, and the dual translation model translates the pseudo-French into pseudo-English; the difference between the ground-truth English and the pseudo-English then returns rewards to the learners via reinforcement learning algorithms. After many iterations, confident pseudo-labelled data can benefit both models. In recent years, the dual learning framework has been successfully applied to visual question answering (VQA) (li2018visual), semantic image segmentation (luo2017deep), image-to-image translation (zhu2017unpaired; yi2017dualgan), zero-shot visual recognition (chen2018zero), and neural machine translation (he2016dual; he2017decoding). Further improvements of the dual learning framework have also been developed, such as (he2017decoding; wu2017sequence; xia2017dual).
Data programming (ratner2016data) denotes a technique aiming to learn with labeling functions so as to generate labeled training sets programmatically. Following this idea, wu2018fonduer proposed Fonduer, which uses data programming to provide weak supervision from domain expertise guiding information extraction from richly formatted data. Additionally, other approaches exist for quickly generating labeled training sets; typical works include Snorkel (bach2017snorkel), SwellShark (fries2017swellshark), EZLearn (grechkin2017ezlearn), and Flipper (varma2017flipper).
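In its simplest form, the labeling-function idea can be sketched with a majority vote; data programming proper fits a generative model of the labeling functions' accuracies, and the toy functions below are purely illustrative:

```python
import numpy as np

def apply_lfs(lfs, xs):
    """Apply labeling functions; each returns +1, -1, or 0 (abstain)."""
    return np.array([[lf(x) for lf in lfs] for x in xs])

def majority_vote(votes):
    """Programmatic training labels by unweighted majority vote
    (the simplest special case of data programming's label model)."""
    return np.sign(votes.sum(axis=1))   # 0 means no consensus / all abstained

# toy labeling functions over strings (hypothetical rules)
lfs = [
    lambda x: 1 if "spam" in x else 0,
    lambda x: 1 if "free" in x else 0,
    lambda x: -1 if "meeting" in x else 0,
]
labels = majority_vote(apply_lfs(lfs, ["free spam offer", "team meeting notes"]))
```

The resulting noisy labels can then train any discriminative model, which is the pipeline Snorkel and related systems automate at scale.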
4.2.4 Cross-domain synthesis
The motivation of the cross-domain synthesis approach is to compensate a domain with few samples using knowledge/data from other related domains (srivastava2012multimodal). Mathematically, the small samples are assumed to be generated from a target domain whose content can alternatively be expressed with data obtained from other related source domains; by establishing a mapping from the source domains to the target domain, the augmented data can be formed as the mapped source data. Recent research has made excellent progress on different fashions of such cross-domain synthesis, benefitting from various generative models (as introduced in Section 4.2.2), like CNNs (lecun1998gradient), RNNs with Long Short-Term Memory (LSTM) (hochreiter1997long), GANs, VAEs, and other techniques, including text-to-image synthesis (mansimov2015generating; reed2016generative), attribute-to-image (yan2016attribute2image; lample2017fader; dixit2017aga; chen2018semantic), image-to-text synthesis (vinyals2015show; karpathy2015deep; xu2015show; ren2015exploring), image-to-image synthesis (translation, style transfer) (isola2017image; zhu2017unpaired; gatys2016image; johnson2016perceptual), text-to-speech (van2016wavenet; anderson2013expressive; gibiansky2017deep), video-to-speech (owens2016visually), speech-to-video (deena2009speech), and so on. Crucially, cross-domain synthesis like text-to-image or attribute-to-image plays an important role in many techniques on concept learning (see Section 3.3).
A representative case requiring this technique is medical imaging analysis. In this practical domain, high-quality supervised samples (e.g., labeled with certain diseases) are generally scarce, expensive, and fraught with legal concerns regarding patient privacy. To address these issues, cross-domain image synthesis has recently gained significant interest. The main research efforts focus on image-to-image synthesis, including cross-MRI image synthesis (ye2013modality; van2015cross; vemulapalli2015unsupervised; joyce2017robust; chartsias2018multimodal), MR-to-CT image synthesis (roy2014mr; huynh2016estimating; torrado2016fast; cao2017dual), and label-map-to-MRI synthesis (cordier2016extended). For example, van2015cross proposed the location-sensitive deep network (LSDN), improving MRI-T1-to-MRI-T2 synthesis by conditioning the synthesis on the position in the volume from which the patch comes. More generally, joyce2017robust synthesized MRI FLAIR from multiple inputs, i.e., MRI T1, T2, and DWI, robust to missing data and misaligned inputs, via learning a modality-invariant latent representation. Afterwards, chartsias2018multimodal extended joyce2017robust's work, easily predicting new output modalities through the addition of decoders that can be trained in isolation.
4.2.5 Domain adaptation / data transportation
An SSL problem can be handled by borrowing the solution of other learning problems of the same type, which refers to the domain adaptation problem (pan2010survey). Readers can refer to (csurka2017domain; venkateswara2017deep) for more technical details. The learning manner is to transform data from a source domain with sufficient annotations, via a differentiable mapping under certain constraints, to help form an augmented data set that compensates the small sample set collected from the target domain, where only few or no annotated data are available, so as to obtain a more rational solution (see Fig.7).
The early works in this manner were constructed based on instance re-weighting (zadrozny2004learning; sugiyama2008direct; kanamori2009efficient; huang2007correcting; yan2017mind), which estimated the ratio between the likelihoods of a sample being a source or target example, or used the maximum mean discrepancy (MMD) measure (borgwardt2006integrating) to weight data instances. A second- or higher-order knowledge transfer (koniusz2017domain) was used to bring the within-class scatters closer while maintaining good separation of the between-class scatters, and was further investigated on action recognition datasets (tas2018cnn). Another research direction focused on transformations that match the source and target domains under some constraints. For example, (saenko2010adapting) learned a linear transformation between two domains by minimizing the effect of domain-induced changes in the feature distribution. The non-linear transformation between the two domains was investigated by (long2014transfer) through minimizing the distance between the empirical expectations of the source and target data distributions integrated within a kernel embedding. A typical example is to learn a transportation plan with optimal transport theory (courty2017optimal), which constrains labeled samples of the same class in the source domain to remain close during transport. courty2017joint went a step further to implicitly learn a non-linear transformation that minimized the optimal transport loss between the joint source distribution and an estimated target joint distribution, corresponding to the minimization of a bound on the target error. To compensate for the lack of a target structure constraint, liang2018aggregating added a novel relaxed domain-irrelevant clustering-promoting term that jointly bridged the cross-domain semantic gap and increased the intra-class compactness in both domains. A local sample-to-sample matching method was developed by (das2018sample) recently, in which the source and target samples are treated as graphs. Another strategy, proposed in (bousmalis2017unsupervised), is a pixel-level domain adaptation method (PixelDA), which used a GAN to learn a transformation from the source domain to the target domain. This mechanism is the same as cross-domain synthesis using GANs (Section 4.2.4). Along this line, further works were released, such as (taigman2016unsupervised; tzeng2017adversarial; murez2017image; volpi2017adversarial), boosting performance in object recognition (hu2018duplex), person re-identification (deng2017image), brain lesion segmentation (kamnitsas2017unsupervised), and semantic segmentation (hong2018conditional). The latest work (koniusz2018museum) focused on large domain shifts between source and target datasets, and released a new dataset called the open museum identification challenge (Open MIC).
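The MMD measure used in several of these methods quantifies the source-target discrepancy in a kernel embedding. A minimal numpy sketch (the RBF kernel and its bandwidth gamma are illustrative choices):

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Pairwise RBF kernel k(x, y) = exp(-gamma * ||x - y||^2).
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(Xs, Xt, gamma=1.0):
    # Biased estimate of the squared maximum mean discrepancy between
    # source samples Xs and target samples Xt; zero iff the embedded
    # means coincide, larger under stronger domain shift.
    return (rbf_kernel(Xs, Xs, gamma).mean()
            - 2 * rbf_kernel(Xs, Xt, gamma).mean()
            + rbf_kernel(Xt, Xt, gamma).mean())

rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, size=(200, 3))        # source domain
Xt_near = rng.normal(0.1, 1.0, size=(200, 3))   # mild domain shift
Xt_far = rng.normal(3.0, 1.0, size=(200, 3))    # large domain shift
```

A larger shift yields a larger MMD value, which is why the measure is used both to weight instances and as a training loss that aligns domains.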
4.3 Approach 2: Rectify the Known Models/Knowledge with Small Samples
Under the assumption that the knowledge in the previously learned knowledge system can be shared with future learning, an SSL strategy can be rationally constructed by rectifying the known models/knowledge in the knowledge system to adapt to new observations. We present several rectification techniques in this section.
4.3.1 Fine-tuning

This approach aims to achieve a better cognition level by updating or fine-tuning the current knowledge (a trained model) with small training samples. In this scheme, the small sample set is used to update or fine-tune an existing model (e.g., a trained deep network) (yosinski2014transferable; oquab2014learning; hinton2006reducing; Krizhevsky2012Alex). In practice, one typically pretrains a basic model on a source domain (where data are often abundant), and then fine-tunes the trained model on a target domain (where data are insufficient), sometimes changing the topology of several output layers while fixing the other layers. The motivation is that common representations exist among various objects in nature, so once the basic model has extracted such common representation knowledge from many objects, a small number of novel samples suffices to adapt it to new, similar tasks.
For example, in object detection (girshick2014rich; girshick2015fast; ren2015faster), a CNN pre-trained on the ImageNet ILSVRC2012 classification task is fine-tuned to fit the detection task. In SSL, fine-tuning has been successfully applied to visual classifiers in new domains (chu2016best), object detection with long-tail distributions (ouyang2016factors), neural machine translation (chu2017empirical), remote sensing scene classification (fang2016using), museum artwork identification (zhang2018artwork), and medical image analysis (tajbakhsh2016convolutional; shin2016deep).
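A toy numpy sketch of this pretrain-then-fine-tune recipe, with a hypothetical frozen backbone and only a new logistic output layer trained on the small target set (all shapes, data, and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_backbone(x, W_pre):
    # Stand-in for a pretrained feature extractor: W_pre stays fixed
    # during fine-tuning, mimicking "fix the other layers".
    return np.tanh(x @ W_pre)

# Pretrained weights (would come from the large source-domain task).
W_pre = rng.normal(size=(5, 8))

# Small target-domain set: 40 samples with binary labels.
X = rng.normal(size=(40, 5))
y = (X[:, 0] > 0).astype(float)

# Only the new output layer (w, b) is trained on the small sample set.
w, b, lr = np.zeros(8), 0.0, 0.5
F = frozen_backbone(X, W_pre)       # features are computed once and frozen
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(F @ w + b)))   # sigmoid head
    g = p - y                                 # logistic-loss gradient
    w -= lr * F.T @ g / len(y)
    b -= lr * g.mean()

p = 1.0 / (1.0 + np.exp(-(F @ w + b)))
acc = ((p > 0.5) == (y > 0.5)).mean()
```

Because only the small head is re-estimated, far fewer target samples are needed than when training the whole network from scratch.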
Recently, some progress has been made in improving the fashions of fine-tuning. For example, progressive networks (rusu2016progressive) solve multiple independent tasks at the end of training without assumptions about the relationships between tasks, and modify or ignore previously learned task features via lateral connections, while previous tasks are not affected by the newly learned features in the forward pass. Progressive networks were originally proposed in reinforcement learning to transfer knowledge from a simulated robotic environment to a real robot arm, massively reducing the training time required in the real world (rusu2017sim). The block-modular architecture (terekhov2015knowledge) is a similar work, but more focused on a visual discrimination task. Developmental networks (wang2017growing) explored several routes for increasing model capacity during fine-tuning (see Fig.8), both in terms of going deeper (more layers) and wider (more channels per layer). Such strategies achieved performance beyond classic fine-tuning approaches on certain tasks.
4.3.2 Knowledge distillation / knowledge transfer

Knowledge distillation (hinton2015distilling) is a certain form of knowledge transfer (KT) approach, whose motivation is to transfer knowledge from an ensemble or from a large highly regularized model into a smaller, distilled model, while also capturing the information provided by the true labels on a small sample dataset. This learning form is sometimes called teacher-student networks (TSN) (see Fig.9(a)), where the student is penalized according to a softened version of the output of the teacher network or of an ensemble of teacher networks. Formally, the parameters $\mathbf{W}_S$ of the student network model are learned by minimizing a loss of the form
$$\mathcal{L}_{KD}(\mathbf{W}_S) = \mathcal{H}\big(\mathbf{y}_{\mathrm{true}}, P_S\big) + \lambda\, \mathcal{H}\big(P_T^{\tau}, P_S^{\tau}\big), \qquad (1)$$
where $\mathcal{H}$ refers to the cross-entropy and $\lambda$ is a tunable parameter to balance both cross-entropies; $P_S$ and $P_T$ denote the softmax outputs of the student and teacher networks, and $P_S^{\tau}$, $P_T^{\tau}$ their versions softened by a temperature $\tau$. The first term in Eq.(1) corresponds to the traditional cross-entropy between the output of the (student) network and the labels, whereas the second term enforces the student network to learn from the softened output of the teacher network (see Fig.9(b)).
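The distillation loss of Eq.(1) can be sketched in numpy as follows, assuming raw logits as inputs; the temperature T and the balance weight lam are illustrative values:

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax; larger T yields a "softer" distribution.
    z = z / T
    z = z - z.max(-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def cross_entropy(p, q, eps=1e-12):
    # H(p, q) = -sum_k p_k log q_k, averaged over the batch.
    return -(p * np.log(q + eps)).sum(-1).mean()

def distillation_loss(student_logits, teacher_logits, y_onehot,
                      lam=0.5, T=4.0):
    # Eq.(1): hard-label cross-entropy plus the softened teacher term.
    hard = cross_entropy(y_onehot, softmax(student_logits))
    soft = cross_entropy(softmax(teacher_logits, T), softmax(student_logits, T))
    return hard + lam * soft

t = np.array([[5.0, 0.0, 0.0]])          # confident teacher logits
y1 = np.array([[1.0, 0.0, 0.0]])         # true label (one-hot)
good = distillation_loss(np.array([[5.0, 0.0, 0.0]]), t, y1)
bad = distillation_loss(np.array([[0.0, 0.0, 5.0]]), t, y1)
```

A student that mimics the teacher and fits the label incurs a much smaller loss than one that contradicts both.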
Alternatively, inspired by curriculum learning strategies (bengio2009curriculum), which organize the training examples in a gradually more complex manner such that the learner network gradually receives examples of increasing difficulty w.r.t. the already learned concepts, romero2014fitnets introduced a hint-based learning concept to train the student network. Particularly, they utilized not only the outputs but also the intermediate representations learned by the teacher as hints to improve the training process and the final performance of the student. Mathematically, they trained the student network parameters from the first layer up to the guided layer, as well as the regressor parameters, by minimizing the following loss function (see Fig.9(b)):
$$\mathcal{L}_{HT}(\mathbf{W}_{\mathrm{Guided}}, \mathbf{W}_r) = \frac{1}{2}\left\| u_h(\mathbf{x}; \mathbf{W}_{\mathrm{Hint}}) - r\big(v_g(\mathbf{x}; \mathbf{W}_{\mathrm{Guided}}); \mathbf{W}_r\big) \right\|^2,$$
where $u_h$ and $v_g$ are the teacher/student functions up to their respective hint/guided layers with parameters $\mathbf{W}_{\mathrm{Hint}}$ and $\mathbf{W}_{\mathrm{Guided}}$, and $r$ is the regressor function on top of the guided layer with parameters $\mathbf{W}_r$. Here, the outputs of $u_h$ and $r$ are expected to have a similar structure. Recently, yim2017gift proposed to define distilled knowledge in terms of the flow between layers, calculated by computing the Gram matrix of features from two different layers. They demonstrated that this novel technique optimized fast, and that the student network outperformed the original network even when trained on a different task. Another attempt was made by (huang2017like)
treating KT as a distribution matching problem, that is, matching the distributions of neuron selectivity patterns between the teacher and student networks by minimizing the MMD metric. Different from the above model distillation methods, radosavovic2017data investigated omni-supervised learning, a data distillation method that ensembles predictions from multiple transformations of unlabeled data using a single model to automatically generate new training annotations.
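As a minimal sketch of the hint-based objective of romero2014fitnets discussed above, using a plain linear regressor (the original work uses a convolutional regressor; all names and shapes here are illustrative):

```python
import numpy as np

def hint_loss(hint_feats, guided_feats, W_r):
    # L2 distance between the teacher's hint-layer features and the
    # student's guided-layer features mapped through a regressor W_r,
    # which adapts the student's (usually thinner) features to the
    # hint layer's width.
    pred = guided_feats @ W_r
    return 0.5 * ((hint_feats - pred) ** 2).sum(-1).mean()

rng = np.random.default_rng(0)
guided = rng.normal(size=(4, 3))   # student guided-layer features (batch of 4)
W_r = rng.normal(size=(3, 6))      # regressor: width 3 -> width 6
hint = guided @ W_r                # a teacher whose hints are exactly reachable
```

When the regressor can reproduce the hints exactly the loss is zero; otherwise the gradient pulls the student's intermediate representation toward the teacher's.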
The KT approach is promising for facilitating the SSL task, and has made some progress recently. For example, luo2017graph proposed a graph-based distillation method to distill rich privileged information from a large multi-modal dataset to teach student task models, tackling the problem of action detection with limited data and partially observed modalities. To address the issue of detecting objects of new classes, shmelkov2017incremental proposed a method for not only adapting the old network to the new classes with a cross-entropy loss, but also ensuring that performance on the old classes is not catastrophically forgotten, via a new distillation loss that minimizes the discrepancy between the responses for old classes from the original and the new networks. Experiments demonstrated that the method performs well even in the extreme case of adding new classes one by one. Likewise, chen2018lstd investigated low-shot object detection, where KT helped transfer object-label knowledge for each target-domain proposal to generalize the low-shot learning setting. In the biological domain, christodoulidis2017multisource first trained a teacher network on six general texture databases of a similar domain, then fine-tuned the model on a limited amount of lung tissue data, and finally transferred knowledge in an ensemble manner. The fused knowledge was distilled to a network with the original architecture, yielding a 2% increase in performance.
4.3.3 Domain adaptation/model adaptation
Apart from data transportation in domain adaptation (Section 4.2.5), there exists another learning fashion: model adaptation. The insight is to adapt one or more existing models in the knowledge system to the small sample dataset. The early work assumed that the target model (classifier) consists of the source models (existing) plus perturbation functions (see Fig.10(a)). yang2007cross proposed adaptive support vector machines (A-SVMs), where a set of so-called perturbation functions is added to the source classifier to progressively adjust the decision boundaries of the target classifier in the target domain. The relationship between the source classifier and the target classifier is illustrated in Fig.10(b). Along this research direction, cross-domain SVM (jiang2008cross), domain transfer SVM (duan2009domain), domain adaptation SVM (bruzzone2010domain), adaptive multiple kernel learning (A-MKL) (duan2012exploiting), and the residual transfer network (RTN) (long2016unsupervised) have been progressively proposed. Particularly, long2016unsupervised extended this idea to deep neural networks. Similarly, rozantsev2017residual introduced a residual transformation network to relate the parameters of the network architectures of the two domains. Considering that model adaptation and data transportation (Section 4.2.5) are not independent, jointly optimizing both the transformation and the classifier parameters was also developed in (shi2012information; hoffman2013efficient; saito2017maximum).
With the recent booming of deep neural network techniques, a naive regime that is easily formulated is the fine-tuning strategy (Section 4.3.1): through fine-tuning the pretrained networks with target data, an efficient adaptation can be naturally guided. When the domain discrepancy between the source and the target is very large, however, this adaptation manner might not work. This inspires the idea of minimizing the difference in learned feature covariances across domains, which guarantees that fine-tuning can ameliorate the performance. In this learning manner, tzeng2015simultaneous combined domain confusion and softmax cross-entropy losses to train the network with the target data, where the domain confusion loss tried to learn domain-invariant representations, while the softmax cross-entropy loss ensured the output feature representations of the source and target data were distinct. long2015learning extended the domain confusion loss of (tzeng2015simultaneous) by incorporating an MMD loss for all of the fully connected layers of AlexNet. Furthermore, they combined target classifier adaptation with residual learning (yang2007cross) and feature adaptation with the MMD loss in (long2016unsupervised). Another attempt using an adversarial loss was made by (ganin2015unsupervised). Specifically, they achieved the adaptation by augmenting the network with a gradient reversal layer connecting the bottom feature extraction layers and the domain classifier, whose function is similar to the discriminator in a GAN. In this way, the feature extractor was trained to extract domain-invariant features. Against possible overfitting during the fine-tuning stage, sener2016learning presented an end-to-end deep learning framework to learn the domain transformation, through jointly optimizing the optimal deep feature representation and target label inference.
4.4 Approach 3: Reduce the Dependency of LSL upon the Amount of Samples
One of the significant issues of LSL is that its effectiveness generally depends on a large amount of training data; that is, the regime is constructed in a data-driven rather than a model-driven manner. A rational paradigm to reduce the dependency of LSL upon the amount of samples is to strengthen the power of a machine learning model to reflect more insights underlying the sample domains. We first revisit a general machine learning model as follows:
$$\min_{f \in \mathcal{H}}\; \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(\mathbf{x}_i), y_i\big) + \lambda R(f),$$
where $\mathcal{H}$ is the hypothesis space, $f$ is the learner, $\ell$ is the loss/cost function measuring the discrepancy between the predicted output and the ground-truth, and $R$ is the regularizer. We can then introduce the following manners for this model-strengthening task.
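A concrete instance of this general model is ridge regression, where the squared loss together with an l2 regularizer admits a closed-form solution (data and the regularization weight below are illustrative):

```python
import numpy as np

# Instance of the general model: l(f(x), y) = (w^T x - y)^2 and
# R(f) = ||w||^2, i.e., ridge regression.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=50)   # lightly noisy targets

def ridge(X, y, lam):
    # Closed-form minimizer: (X^T X + lam I)^{-1} X^T y.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_hat = ridge(X, y, lam=0.1)
```

The regularizer shrinks the estimate slightly toward zero, trading a small bias for stability, which is exactly the knowledge-injection role the regularization term plays in the general model above.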
4.4.1 Model-driven Small Sample Learning
Using proper models to confine the hypothesis space in machine learning (or the topology of a neural network) tends to relax the dependence of a learning algorithm on the amount of samples. Following this idea, several promising regimes have been raised recently. We will introduce them in the remainder of this section.
White Box Model:
The white box model denotes a popular strategy presented recently for improving the interpretability of deep learning, which is known to suffer from an unclear working mechanism, i.e., black-box models. A direct line of work is to whiten deep neural networks such as CNNs. Based on the idea of encoding objects in terms of visual concepts (VCs), deng2017unleashing developed an interpretable CNN model for few-shot learning, where the VCs were extracted from a small set of images of novel object categories using features from CNNs trained on other object categories. Recently, tang2017towards proposed a composition network (CompNet) combining And-Or graphs (AOGs) with CNN models, where the learned compositionality is fully interpretable. Alternatively, garcia2017few defined a graph neural representation, which casts few-shot learning as a supervised message passing task.
The unfolded approach was pioneered in (gregor2010learning), where the authors unrolled the ISTA algorithm for sparse coding into a neural network. In this scheme, filters that are normally fixed in the iterative minimization are instead learned. Recently, yang2016deep unrolled the ADMM algorithm to design a CNN for MRI reconstruction, demonstrating performance equivalent to the state of the art, with advantages in running time, using only a few training images. This reflects the main idea of model-driven deep learning (xu2017model), which combines model-based and deep-learning-based approaches: domain knowledge is incorporated into a model family, an algorithm family is established to solve the model family, and the algorithm family is then unfolded into a deep network to learn its unknown parameters. Along this line of research, various unfolded techniques have been successfully applied to dynamic MR image reconstruction (schlemper2018deep; qin2017convolutional), sparse-view/data computed tomography (CT) reconstruction (gupta2018cnn; chen2017learned; adler2018learned), and compressive image reconstruction (metzler2017learned; zhang2018ista). Especially, diamond2017unrolled presented a framework infusing knowledge of the image formation into deep networks that solve inverse problems in imaging by leveraging unrolled optimization with deep priors, outperforming state-of-the-art results for a wide variety of imaging problems, such as denoising, deblurring, and compressed sensing magnetic resonance imaging (MRI).
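The unrolling idea of (gregor2010learning) can be sketched by writing classical ISTA as a stack of identical "layers"; in LISTA the matrices W_e, S and the threshold would become learned, layer-wise parameters, whereas the values below follow the classical algorithm (problem sizes are illustrative):

```python
import numpy as np

def soft_threshold(x, theta):
    # Proximal operator of the l1 norm.
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def unrolled_ista(y, D, n_layers=200, lam=0.05):
    # ISTA for sparse coding min_z 0.5||y - Dz||^2 + lam||z||_1,
    # written as n_layers identical network layers z <- soft(W_e y + S z).
    L = np.linalg.norm(D, 2) ** 2           # Lipschitz constant of the gradient
    W_e = D.T / L                           # "input" weight
    S = np.eye(D.shape[1]) - D.T @ D / L    # "recurrent" weight
    z = np.zeros(D.shape[1])
    for _ in range(n_layers):               # each iteration = one layer
        z = soft_threshold(W_e @ y + S @ z, lam / L)
    return z

rng = np.random.default_rng(0)
D = rng.normal(size=(20, 40)) / np.sqrt(20)   # overcomplete dictionary
z_true = np.zeros(40)
z_true[[3, 17]] = [1.0, -0.8]                 # 2-sparse ground-truth code
y = D @ z_true
z_hat = unrolled_ista(y, D)
```

Even with fixed (unlearned) weights, the unrolled network recovers the support of the sparse code; learning the per-layer weights, as in LISTA, achieves the same accuracy with far fewer layers.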
Memory Neural Networks:
Inspired by episodic memory (see Section 2.2.1), researchers have tried to endow neural networks with memory (sukhbaatar2015end; graves2014neural). santoro2016meta first applied this mechanism to SSL. Specifically, they proposed memory-augmented neural networks (MANN) by combining more flexible storage capabilities with more generalized deep architectures, namely, the ability to rapidly bind never-seen information after a single presentation and the ability to slowly learn an abstract method for obtaining useful representations of raw data. MANN was thus expected to achieve efficient inductive knowledge transfer, which means that new information can be flexibly stored and precisely inferred based on novel data and long experience. Subsequently, some works tried to follow (santoro2016meta) and further enhance its capability. For easy reference, a list of MANN's variations is displayed in Table 1.
| Method | Learning Setting | Key Feature | Dataset | Reference |
|---|---|---|---|---|
| Scaling MANN | one-shot learning | scale to space and time | Omniglot | rae2016scaling |
| MANN with Gaussian | one-shot learning | structured generative model | Omniglot | Harada2017 |
| LMN | few-shot learning | online adapting | Omniglot | Shankar2018 |
| FLMN | one-shot/zero-shot learning | memory interference | Omniglot | mureja2017meta |
| Life-long Memory Module | life-long one-shot learning | scale to large memory | Omniglot | kaiser2017learning |
| MAVOT | one-shot learning for video object tracking | long-term memory | ImageNet ILSVRC2015 | liu2017mavot |
| Augmented LSTM | few-shot learning | long-term memory of scarce training exemplars | VQA benchmark, Visual 7W Telling | Ma2018CVPR_a |
| Memory-Augmented Recurrent Networks | one-shot learning | rapidly adapting | pulmonary lung nodules | mobiny2017lung |
In detail, rae2016scaling incorporated sparse access memory (SAM) into MANN to help it scale in both space and time as the amount of memory grows, making the method capable of exploiting efficient data structures within the network and obtaining significant speedups during training. Alternatively, kaiser2017learning tried to enhance large-scale memory using fast nearest-neighbor algorithms. Another work (Harada2017) constructed a memory-augmented network with Gaussian embeddings capturing latent structure based on the disentanglement of content and style, instead of pointwise embeddings. To establish online model adaptation, Shankar2018 proposed the labeled memory network (LMN) with a label-addressable memory module and an adaptive weighting mechanism. As an extension of (Shankar2018), mureja2017meta further proposed the feature-label memory network (FLMN), which explicitly splits the external memory into feature and label memories, outperforming MANN (santoro2016meta) by a large margin in supervised one-shot classification tasks. For one-shot learning in video object tracking, liu2017mavot employed an external memory to store and remember the evolving features of the foreground object as well as the backgrounds over time during tracking, making it possible to maintain long-term memory of the object. To handle the long-tailed distribution of the question-answer pairs in the VQA benchmark dataset (antol2015vqa), Ma2018CVPR_a developed a MANN with increased capacity to remember uncommon question-answer pairs. In the biological domain, mobiny2017lung extended MANN to CT lung nodule classification, adapting to new CT image data received from a never-before-seen distribution. chen2018sequential further introduced the memory mechanism to recommender systems, developing a MANN integrated with collaborative filtering to help recommendation in a more explicit, dynamic, and effective manner.
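The content-based read operation shared by these memory-augmented models can be sketched as cosine-similarity attention over memory slots (a simplification: MANN additionally uses learned write heads and usage-based addressing):

```python
import numpy as np

def memory_read(key, M, eps=1e-8):
    # Content-based read from an external memory M (rows = memory slots):
    # cosine similarity with the query key -> softmax read weights ->
    # weighted sum of slots, as in memory-augmented networks.
    sims = (M @ key) / (np.linalg.norm(M, axis=1) * np.linalg.norm(key) + eps)
    w = np.exp(sims) / np.exp(sims).sum()   # attention (read) weights
    return w @ M, w

# Three orthogonal memory slots, purely illustrative.
M = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
read, w = memory_read(np.array([0.9, 0.1, 0.0]), M)
```

The query retrieves mostly the slot it resembles, which is how a single stored exemplar can be recalled from one presentation at test time.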
Neural Module Networks:
Neural module networks (NMN) (andreas2016neural; andreas2016learning; hu2017modeling) are composed of jointly trained neural modules, which can be dynamically assembled into arbitrary deep networks. To use a previously trained model on a new task, these modules can be assembled dynamically to produce a new network structure tailored to that task (an illustration of the NMN architecture is depicted in Fig.11). Along the line of NMN, relation networks (RNs) (santoro2017simple), end-to-end module networks (N2NMN) (hu2017learning), program generator + execution engine (PG+EE) (johnson2017inferring), the thalamus-gated recurrent module (ThalNet) (hafner2017learning), and feature-wise linear modulation (FiLM) (perez2018film) have been developed. FiLM, in particular, has been demonstrated to generalize well to challenging new data from few examples or even zero-shot settings.
4.4.2 Metric-driven Small Sample Learning
The metric learning idea along this research line is to learn a mapping from inputs to vectors in an embedding space, such that inputs of the same identity or category are closer than those of different identities or categories (i.e., more discriminative than the original input space) (kulis2013metric; lu2017deep). Once the mapping is learned, a nearest-neighbor method can be used at test time for retrieval and classification without retraining the model for new categories unseen during training. By putting emphasis on high-quality samples while depressing low-quality ones, the dependence of the method on the sample size can be more or less reduced.
Fig.12. (a) Siamese networks for chromosome classification. Siamese networks are composed of two identical neural networks, i.e., base CNNs, with shared parameters. In training stage 1, the input is a similar/dissimilar pair, and the model parameters are learned by optimizing the contrastive loss function. In training stage 2 and at the testing stage, a k-nearest-neighbour (KNN) approach is applied in the embedding space learned by the base CNN. (b) Matching network for low-data drug discovery (image reproduced from (altae2017low)). The core idea of this method is to use an attLSTM to generate both query and support embeddings that map input examples (in small-molecule space) into a continuous representation space. Starting from the initial query and support embeddings, both embeddings are iteratively evolved using a similarity measure, with the support embedding defining the attention. Finally, the prediction can be cast as a siamese one-shot learning problem.
The pioneering work was (wolf2009one), which applied the One-Shot Similarity (OSS) measure as a kernel basis used with SVM, learning a similarity kernel for image classification of insects. wan2013one proposed a new spatio-temporal feature representation (3D EMoSIFT) by fusing RGB-D data, invariant to scale and rotation, and then used a nearest-neighbor classifier for one-shot learning gesture recognition. Benefiting from deep neural networks, deep metric learning has gradually become more popular; it explicitly learns a nonlinear mapping of data points into a new feature space by exploiting the architecture of deep neural networks. The main techniques are siamese networks (bromley1994signature) and triplet networks (hoffer2015deep). In koch2015siamese, powerful discriminative features were generalized for one-shot image recognition without any retraining, learned for the first time via a supervised metric-based approach with siamese neural networks. A similar architecture proposed in (hilliard2017dynamic) could handle arbitrary example sizes dynamically as the system was used. As shown in Fig.12(a), gupta2017siamese augmented vanilla siamese networks for chromosome classification. To reduce the dependency on actual class labels annotated by human experts, chung2017learning proposed a deep siamese CNN that learns fixed-length latent image representations from image-pair information alone. Since siamese network methods place less emphasis on inter-class and intra-class variations, ye2018deep developed deep triplet ranking networks for one-shot image classification, with a larger capacity for handling inter- and intra-class image variations. The triplet ranking loss separates instance pairs belonging to the same class from instance pairs belonging to different classes in the relative distance metric space computed from the image embeddings. Furthermore, dong2017quadruplet tried to add more instances into a tuple, connecting them with a novel loss combining a pair loss and a triplet-based contrastive loss.
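The contrastive loss used to train such siamese networks (training stage 1 in Fig.12(a)) can be sketched as follows, with an illustrative margin value:

```python
import numpy as np

def contrastive_loss(e1, e2, same, margin=1.0):
    # Contrastive loss on a pair of embeddings: pull the pair together
    # when same=1, push it apart up to `margin` when same=0.
    d = np.linalg.norm(e1 - e2)
    return same * d ** 2 + (1 - same) * max(margin - d, 0.0) ** 2

a = np.array([0.0, 0.0])
b = np.array([0.1, 0.0])   # close to a
c = np.array([2.0, 0.0])   # far from a
```

A close similar pair incurs a small loss, a close dissimilar pair a large one, and a dissimilar pair already beyond the margin incurs none, so gradients only act where the embedding is still wrong.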
Note that the above deep metric learning approaches do not offer a natural mechanism for solving $k$-shot $N$-way tasks (recognizing $N$ object classes with $k$ samples each) for $k>1$, and focus only on one-shot learning. Recently, some novel metric learning methods for SSL have been proposed to tackle the more general few-shot learning problem, namely matching networks (vinyals2016matching), prototypical networks (snell2017prototypical), and relation networks (sung2017learning). Typically, vinyals2016matching learned a network called the matching network with an episodic training strategy. In each episode, the algorithm learns the embedding of the few labeled examples (the support set) to predict classes for the unlabeled points (the query set). Mathematically, denote the support set by $\mathcal{S}$ and the query set by $\mathcal{Q}$, with $\mathcal{S}$ containing $k$ (e.g., 1 or 5) exemplar images per category. The query set $\mathcal{Q}$ is coupled with $\mathcal{S}$ (has the same categories), but has no overlapping images; each category of $\mathcal{Q}$ contains $q$ query images. During training, $\mathcal{S}$ is fed into the to-be-learned embedding function $f_{\theta}$ to generate the category classifiers $\mathcal{W}(\mathcal{S}; \theta)$, which are subsequently applied to $\mathcal{Q}$ for evaluating the classification loss. The training objective then amounts to learning the embedding function by minimizing the classification loss. This process can be mathematically expressed as follows:
$$\min_{\theta}\; \mathbb{E}_{(\mathcal{S}, \mathcal{Q})}\Big[\mathcal{L}\big(\mathcal{W}(\mathcal{S}; \theta)(\mathcal{Q})\big)\Big],$$
where $\theta$ denotes the model parameters of the embedding function $f_{\theta}$, and $\mathcal{L}$ is the loss function; $\mathcal{W}(\mathcal{S}; \theta)(\mathcal{Q})$ denotes applying the category classifiers on the query set $\mathcal{Q}$. The purpose of episodic training is to mimic the real test environment containing a few-shot support set and an unlabeled query set, a process that can be viewed as meta-training (Section 4.5). The consistency between the training and test environments alleviates the distribution gap and improves generalization, enabling state-of-the-art performance on a variety of one-shot classification tasks. The resulting model can be interpreted as a weighted nearest-neighbor classifier. To enhance memory capacity, cai2018memory further incorporated a memory module into the matching network learning process, additionally integrating contextual information across support samples into the deep embedding architectures. Some works extended matching networks (vinyals2016matching) to various applications, like low-data drug discovery (altae2017low) (see Fig.12(b)), video action recognition (kim2017matching), one-shot part labeling (choi2017structured), and one-shot action localization (yang2018one). Alternatively, snell2017prototypical established prototypical networks to learn a metric space where classification can be performed by computing distances to prototype representations of each class. Further, fortgaussian2017gaussian improved the prototypical network architecture with an interpretation of the encoder outputs and of the way the metric on the embedding space is constructed. As an extension of (vinyals2015show; snell2017prototypical), sung2017learning provided a learnable rather than fixed metric, i.e., a non-linear rather than linear classifier. Based on (sung2017learning), long2018object learned an object-level representation and exploited rich object-level information to infer image similarity.
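The prototype-based classification rule of prototypical networks (snell2017prototypical) can be sketched on a toy 2-way, 3-shot episode, using the identity map as a stand-in for the learned embedding (all data are synthetic and illustrative):

```python
import numpy as np

def prototypes(support, labels, n_way):
    # Class prototype = mean embedding of that class's support examples.
    return np.stack([support[labels == c].mean(0) for c in range(n_way)])

def classify(queries, protos):
    # Assign each query to its nearest prototype (squared Euclidean distance).
    d2 = ((queries[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    return d2.argmin(1)

# One 2-way 3-shot episode; f_theta(x) = x stands in for the embedding.
rng = np.random.default_rng(0)
support = np.vstack([rng.normal(0, 0.1, (3, 2)),   # class 0 support
                     rng.normal(3, 0.1, (3, 2))])  # class 1 support
s_labels = np.array([0, 0, 0, 1, 1, 1])
queries = np.vstack([rng.normal(0, 0.1, (5, 2)),
                     rng.normal(3, 0.1, (5, 2))])
q_labels = np.array([0] * 5 + [1] * 5)

preds = classify(queries, prototypes(support, s_labels, 2))
```

In the real method the embedding is a deep network trained episodically so that such nearest-prototype classification succeeds on held-out classes.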
Other related methods have been developed in (triantafillou2017few; oreshkin2018tadam; scott2018adapted). Specifically, triantafillou2017few adopted an information retrieval perspective on the problem of few-shot learning, i.e., each point acted as a 'query' that ranked the remaining ones based on its predicted relevance to them. A mean average precision objective function was used, aiming to extract as much information as possible from each training batch by direct loss minimization over all relative orderings of the batch points simultaneously. To find more effective similarity measures for SSL, oreshkin2018tadam proposed metric scaling and metric task conditioning to boost the performance of few-shot algorithms. Furthermore, a hybrid approach was proposed in (scott2018adapted) that combines deep embedding losses for training (metric learning) on the source domain with weight adaptation (domain adaptation) on the target domain for $k$-shot learning.
4.4.3 Knowledge-driven Small Sample Learning
From the traditional Bayesian perspective, regularization can be considered as prior knowledge, which is often an experienced expert's subjective assessment of the learning task. Following this understanding, tenenbaum2011grow showed that when humans or machines make inferences that go far beyond the available data, strong prior knowledge must be making up the difference. In the big data era, the prior takes on broader meanings, such as knowledge extracted from the environment, events and activities. This knowledge may include priors on learning tasks (stewart2017label), domain knowledge (pan2010survey) or side information (vapnik2009new), and human/world knowledge (Lake2016; song2017machine) (e.g., human-level concepts (Lake2015), common sense (davis2015commonsense), and intuitive physics (smith2013sources; battaglia2016interaction; hamrick2017metacontrol)).
In the pioneering work (Fei-Fei2006), the authors verified that learning need not start from scratch; a key insight was that knowledge of previously learned classes could serve as prior knowledge. Recently, bringing specific domain knowledge to learning tasks has attracted widespread attention, for example physical laws (stewart2017label; battaglia2013simulation), low rank, sparsity, side information (vapnik2009new), and domain noise distributions (xie2017robust). For instance, stewart2017label introduced a new method for using physics and other domain constraints to supervise neural networks, detecting and tracking objects without any labeled examples. To fuse side information into data representation learning, tsai2017improving introduced two statistical approaches to improve one-shot learning. Alternatively, ji2017combining aimed to identify related prior knowledge from different sources and systematically encode it into visual learning tasks through joint bottom-up and top-down inference. Specifically, they demonstrated how to identify permanent theoretical knowledge and circumstantial knowledge for different vision tasks, and how to represent and integrate them with the image data, maintaining good recognition performance and excellent generalization ability with minimal or even no data. On the other hand, some novel theories of incorporating knowledge by Bayesian methods have been developed recently. Regularized Bayesian inference (zhu2017big) improved the flexibility of the Bayesian framework via posterior regularization, providing a novel way to incorporate knowledge. Another attempt, called Bayesian deep learning (wang2016towards), integrated deep learning and Bayesian models within a principled probabilistic framework. In this unified framework, the interaction between data-driven deep learning and knowledge-driven Bayesian learning creates synergy and further boosts performance.
Through employing domain knowledge, another research line focuses on unsupervised feature learning (srivastava2015unsupervised) and self-supervised feature learning (pathak2016context). Unsupervised feature learning aims to learn video representations that generate future target sequences by learning from historical frames, where spatial appearances and temporal variations are two crucial structures. CNN-based networks can predict one frame at a time and generate future images recursively, but tend to focus on spatial appearances, whereas RNN-based networks focus on temporal dynamics. The Convolutional LSTM (ConvLSTM) model, which combines both, has therefore become popular (xingjian2015convolutional; lotter2016deep; villegas2017decomposing; wang2017predrnn). Self-supervised feature learning learns invariant features; it requires no manual annotation (human intervention) yet still follows the supervised learning paradigm by inferring supervisory signals from the data structure. Recent methods mainly employ context information. For example, doersch2015unsupervised explored the spatial consistency of images as a context prediction task to learn feature representations. Further, noroozi2016unsupervised extended this idea by solving jigsaw configurations. Alternatively, temporal ordering of patches was investigated in (lee2017unsupervised). Recently, nathan2018improvements developed a set of methods to improve the performance of self-supervised learning using context.
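As a toy illustration of how self-supervised methods in the spirit of (doersch2015unsupervised) infer supervisory signals from the data structure itself, the sketch below generates a pretext label (the relative position of a neighboring patch) without any human annotation; the function name and patch setup are illustrative assumptions, not the cited implementation.

```python
import numpy as np

# The 8 possible relative positions of a neighbor around the center patch.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
           (0, 1), (1, -1), (1, 0), (1, 1)]

def make_context_pair(image, patch=8, rng=None):
    """Crop the central patch and one random neighbor; the neighbor's
    relative-position index is the free ("self-supervised") label."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape[:2]
    cy, cx = h // 2, w // 2
    label = rng.integers(len(OFFSETS))
    dy, dx = OFFSETS[label]
    center = image[cy - patch // 2: cy + patch // 2,
                   cx - patch // 2: cx + patch // 2]
    ny, nx = cy + dy * patch, cx + dx * patch
    neighbor = image[ny - patch // 2: ny + patch // 2,
                     nx - patch // 2: nx + patch // 2]
    return center, neighbor, int(label)
```

A network trained to predict `label` from the pair `(center, neighbor)` is forced to learn semantically meaningful features, which is the core of the context-prediction pretext task.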
Research on human/world knowledge in cognitive science has inspired many SSL strategies. Here we review some recent progress on causality (zhang2017learning) and compositionality (Lake2016), the attention mechanism (desimone1995neural), and curiosity (gottlieb2013information).
Causality and compositionality.
Causality is about using knowledge of how real-world processes produce perceptual observations, which influences how people learn new concepts. Compositionality allows a finite set of primitives to be reused (addressing data efficiency) across many scenarios by recombining them to produce an exponentially large number of novel yet coherent and potentially useful concepts (addressing the overfitting problem). Benefiting from the ideas of causality and compositionality, some novel models have been proposed to perform SSL. A representative work is (Lake2015), which established the framework of Bayesian program learning (BPL) to mimic human handwriting: it inferred the next stroke from the current one using causality and compositionality, achieving one-shot generation of characters. Another line of work is (George2017), which established a generative vision model that was compositional, factorized, hierarchical, and flexibly queryable, achieving excellent generalization and occlusion-reasoning capabilities and outperforming deep neural networks on a challenging scene-text recognition benchmark while being 300-fold more data-efficient. Alternatively, higgins2017scan described the symbol-concept association network (SCAN), which is able to discover and learn an implicit hierarchy of abstract concepts from as few as five symbol-image pairs per concept. Crucially, SCAN can imagine and learn novel concepts that have never been experienced during training, using compositional abstract hierarchical representations. Assuming that complex visual concepts can be composed from primitive ones, misra2017red presented an approach that composes classifiers to obtain classifiers for new complex concepts.
Attention describes the tendency of visual processing to be confined largely to stimuli that are relevant to behavior (addressing data efficiency). This topic has become an active research area in image captioning (xu2015show), image generation (gregor2015draw), VQA (xiong2016dynamic), machine translation (bahdanau2014neural; johnson2016google), and speech recognition (chorowski2015attention). Specifically, gregor2015draw did early work on small sample learning with the deep recurrent attentive writer (DRAW) neural network architecture for image generation, where attention helped the system build up an image incrementally, attending to one portion of a "mental canvas" at a time. Moreover, rezende2016one developed new deep generative models building on the principles of feedback and attention, which could generate compelling and diverse samples after observing new examples just once. Likewise, johnson2016google utilized a single neural machine translation (NMT) model with an attention module to translate between multiple languages, achieving zero-shot translation. Recently, wang2017multi designed a neural network that took the semantic embedding of the class tag to generate attention maps and used those attention maps to create image features for one-shot learning. Besides, he2017single presented a fast yet accurate text detector with an attention mechanism that encoded strong supervision of text during training and predicted word-level bounding boxes in one shot.
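Most of the attention modules above share a common core: queries are scored against keys, and the resulting softmax weights combine the values. A minimal generic sketch (not the exact formulation of any specific cited model) is:

```python
import numpy as np

def attention(Q, K, V):
    """Soft attention: each query attends to all keys, producing a
    convex combination of the values. Scaling by sqrt(d) keeps the
    softmax well-conditioned for large key dimensions d."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # query-key compatibility
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)           # softmax -> attention weights
    return w @ V, w                              # context vectors and weights
```

In DRAW-style models the queries come from the recurrent state and the values are canvas or image features; in NMT the values are encoder states, but the weighted-sum mechanism is the same.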
The role of curiosity has been widely studied in the context of solving tasks with sparse rewards (gottlieb2013information; hester2017intrinsically). In general, a learning agent with curiosity can explore its environment in the quest for new knowledge and learn skills that might be helpful in future scenarios with rare or deceptive rewards. pathak2017curiosity first proposed a mechanism for generating a curiosity-driven intrinsic reward signal that scales to high-dimensional continuous state spaces such as images, enabling the agent to gradually learn more and more complex skills with few or even no external rewards (see Fig.13).
A curiosity-driven exploration system is sometimes also called an intrinsic motivation system (oudeyer2016intrinsic). Recently, forestier2017intrinsically presented the intrinsically motivated goal exploration processes (IMGEP) algorithmic approach, establishing a formal framework for unsupervised multi-goal reinforcement learning. Further, pere2018unsupervised extended IMGEP with an unsupervised goal space learning stage (UGL), where an unsupervised representation learning algorithm learned a lower-dimensional latent representation that was then fed to a standard IMGEP. Readers are referred to (oudeyer2018computational) for a review of computational frameworks and theories of curiosity-driven learning.
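At its core, the curiosity signal of (pathak2017curiosity) rewards states that the agent's learned forward model predicts poorly. A heavily simplified sketch (operating on raw state features rather than learned embeddings, with `forward_model` assumed given) is:

```python
import numpy as np

def intrinsic_reward(forward_model, state, action, next_state, eta=0.01):
    """Curiosity-style intrinsic reward (simplified): the prediction
    error of a forward dynamics model. `forward_model(state, action)`
    is an assumed predictor of the next-state features. Transitions the
    agent cannot predict well yield high reward, driving exploration
    even when extrinsic rewards are rare."""
    pred = forward_model(state, action)
    return eta * 0.5 * float(((pred - next_state) ** 2).sum())
```

In the full method the error is computed in a learned feature space (trained with an inverse-dynamics objective) so that unpredictable but irrelevant noise does not attract the agent; that refinement is omitted here.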
4.5 Approach 4: Meta learning
The explanation of meta learning in computer science on Wikipedia is to use metadata to understand how automatic learning can become flexible in solving learning problems, and hence to improve the performance of existing learning algorithms or to learn (induce) the learning algorithm itself (https://en.wikipedia.org/wiki/Meta_learning_(computer_science)). In other words, meta learning makes a learning system capable of learning to learn by itself, achieving adaptive perception and cognition of the environment. From the cognitive learning perspective, one way humans acquire prior knowledge (Section 4.4.3) is through meta learning, which has a long research history (harlow1949formation; thrun2012learning). In current research, meta learning works by learning the common/shared methodology for accomplishing a family of tightly related tasks, which is sometimes closely related to the machine learning notions of transfer learning (pan2010survey) or multi-task learning (zhang2017overview). The common/shared methodology then adapts easily to novel tasks as a much stronger prior; since this prior is itself learned, the learning system can learn new tasks as rapidly and flexibly as humans do. For example, Lake2015 proposed Bayesian program learning (BPL) to develop hierarchical priors that allowed previous experience with related concepts to ease learning of new concepts. Specifically, meta learning helps BPL learn the generation process of handwritten characters, which is the common/shared methodology for understanding and explaining characters. Therefore, when encountering novel types of handwritten characters, BPL can perform one-shot classification at human-level accuracy. Also, if the common/shared methodology corresponds to other types of symbolic concepts/knowledge, the learning system can perform broader tasks, which may be particularly promising (more detailed discussions appear in Section 6.2).
Meta learning plays an important role in SSL and artificial intelligence (Lake2016), and recent research includes learning to learn (santoro2016meta), learning to reinforcement learn (wang2016learning; duan2016rl; xu2018learning), learning to transfer (Ying2018transfer), learning to optimize (li2016learning; Rosenfeld2018Learning), learning to infer (hu2017learning; marino2018learning), learning to search (guez2018learning; balcan2018learning), learning to control (duan2017meta) and so on. In the following we review some typical SSL methods along this research line.
Learning to learn.
This strategy aims to adaptively determine the appropriate data, loss function, and hypothesis space of a machine learning model in a meta-level learning manner. Pioneering work began with (santoro2016meta), which used LSTM and MANN meta-learners to learn quickly from data presented sequentially, binding data representations to their appropriate labels. This general scheme of mapping data representations to appropriate classes or function values boosts few-shot image classification performance. While vinyals2016matching treated the data as a set, their episodic training strategy helped mimic the real test environment containing a few-shot support set and an unlabeled query set. They also combined metric learning to judge image similarity, and similar strategies were employed in (snell2017prototypical; sung2017learning). romero2014fitnets further extended (snell2017prototypical) with training on unlabeled examples. Another attempt at learning a meta-level network was made in (wang2016learning). The network operated on the space of model parameters, and was specifically trained to regress many-shot model parameters (trained on large datasets) from few-shot model parameters (trained on small datasets). Recently, munkhdalai2017meta proposed meta networks (MetaNet), as shown in Fig.14(a), where the base learner performed in the input task space whereas the meta learner operated in a task-agnostic meta space. The meta learner can continuously learn and perform meta knowledge acquisition across different tasks. When a novel task arrives, the base learner first analyzes the task, and then provides the meta learner with feedback in the form of higher-order meta information (knowledge) to explain its own status in the current task space. Based on this meta information, the meta learner rapidly parameterizes both itself and the base learner so that the MetaNet model can recognize the new concepts rapidly.
To exploit the domain-specific task structure, mishra2017simple proposed a class of simple and generic meta-learner architectures combining temporal convolutions and soft attention. A new framework called learning to teach was proposed by fan2018learning. The teacher model leveraged feedback from the student model to determine the appropriate data, loss function, and hypothesis space to facilitate the training of the student model. This technique can achieve almost the same accuracy as fully supervised training using much less training data and fewer iterations.
Learning to reinforcement learn.
(wang2016learning; duan2016rl) first introduced meta learning into reinforcement learning (RL) to realize deep meta-reinforcement learning, motivated by developing deep RL methods that could adapt rapidly to new tasks. In particular, the learner used deep RL to train a recurrent network on a series of interrelated tasks, with the result that the network dynamics learned a second RL procedure operating on a faster time-scale than the original algorithm. Some typical methods are introduced as follows. sung2017learning proposed to learn a meta-critic network that could be used to train multiple 'actor' networks to solve specific problems; the shared meta-critic provided the transferrable knowledge that allowed actors to be trained with only a few trials on a new problem (see Fig.14(b)). For hierarchically structured policy learning, frans2017meta employed shared primitives (sub-policies) to improve sample efficiency on unseen tasks. A generic neural mechanism for meta learning, called conditionally shifted neurons, was introduced by munkhdalai2017learning; it could modify activation values with task-specific shifts retrieved from a memory module using limited task experience. For continuous adaptation in non-stationary environments, al2017continuous developed a suitable gradient-based meta-learning approach. They regarded non-stationarity as a sequence of stationary tasks and trained agents to exploit the dependencies between consecutive tasks so that they could handle similar non-stationarities at execution time. Unlike model-free RL (wang2016learning; duan2016rl; sung2017learning; al2017continuous), clavera2018learning considered learning online adaptation in the context of model-based reinforcement learning. They trained a global model such that, when combined with recent data, it could be rapidly adapted to the local context.
Learning to transfer.
A meta-learner can be trained across a large number of different tasks to automatically determine what to transfer and how. Inspired by this capability, the model-agnostic meta-learning (MAML) approach (finn2017model) was proposed, aiming to meta-learn an initial condition (a set of neural network weights) suitable for fine-tuning on few-shot problems under model- and task-agnostic conditions. Furthermore, kim2018bayesian extended (finn2017model) by introducing Bayesian mechanisms for fast adaptation and meta-update, quickly obtaining an approximate posterior of a given unseen task, and a probabilistic framework was developed in (finn2018probabilistic). To avoid a biased meta-learner like (finn2017model), jamal2018task proposed a task-agnostic meta-learning (TAML) algorithm to train a meta-learner unbiased towards a variety of tasks before its initial model was adapted to unseen tasks. To deal with the long-tail distribution in big data, wang2017learning introduced a meta-network that learned to progressively transfer meta-knowledge from the head to the tail classes, where the meta-knowledge was encoded by a meta-network trained to predict many-shot model parameters from few-shot model parameters. Another attempt to learn to transfer across domains and tasks was made by (hsu2017learning). They learned a pairwise similarity (i.e., meta-knowledge) to perform both domain adaptation and cross-task transfer learning, realized by a neural network trained on the output of the similarity function. For a model-agnostic training procedure, li2017learning performed meta-learning on simulated train/test splits with domain shift for domain generalization, which could be applied to different base network types. More recently, Ying2018transfer proposed a framework of learning to transfer (L2T) to enhance transfer learning effectiveness by leveraging previous transfer learning experiences.
In particular, L2T learns a reflection function mapping a pair of domains and the knowledge transferred between them to the performance improvement ratio. When a new pair of domains arrives, L2T optimizes what and how to transfer by maximizing the value of the learned reflection function.
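The inner/outer loop structure of MAML (finn2017model) can be sketched on a toy 1-D regression family. This sketch uses the first-order approximation (FOMAML) for simplicity, and the task family and step sizes are illustrative assumptions:

```python
import numpy as np

def maml_step(theta, tasks, alpha=0.01, beta=0.001):
    """One meta-update of (first-order) MAML for 1-D linear regression
    y = theta * x. Each task is an (x, y) pair of arrays.
    Inner loop: adapt theta with one gradient step per task.
    Outer loop: update the shared initialization using the
    post-adaptation gradients."""
    meta_grad = 0.0
    for x, y in tasks:
        # Inner step: gradient of MSE 0.5 * mean((theta*x - y)^2) w.r.t. theta.
        g = np.mean((theta * x - y) * x)
        theta_i = theta - alpha * g                 # task-adapted parameters
        # Outer gradient, first-order approximation (FOMAML):
        meta_grad += np.mean((theta_i * x - y) * x)
    return theta - beta * meta_grad / len(tasks)
```

Iterating this update drives the initialization toward a point from which every task in the family is reachable in one cheap inner step, which is exactly the "good initial condition" that MAML meta-learns.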
Learning to optimize.
Casting optimization algorithm design as a learning problem allows us to specify the class of problems we are interested in through data, as well as to automatically generate optimizers. The learning process is shown in Fig.15. bertinetto2016learning first proposed a method to minimize a one-shot classification objective in a learning-to-learn formulation. Particularly, they optimized a pupil network by constructing a learner, called a learnet, which predicted the parameters of the pupil network from a single exemplar. Following (bertinetto2016learning), ravi2016optimization proposed an LSTM-based meta-learner model to learn the exact optimization algorithm used to train a pupil neural network classifier in the few-shot regime, together with the episodic training idea. Recently, several works have improved the LSTM-based meta-learner model (ravi2016optimization).
For example, finn2017model proposed the model-agnostic meta-learning (MAML) learner, which was compatible with any model trained with gradient descent and applicable to a variety of different learning problems. Alternatively, li2017metasgd argued that the choice of meta-learner was crucial and developed an easily trainable meta-learner, Meta-SGD, that could initialize and adapt any differentiable learner in just one step. Notably, compared with the LSTM-based learner (ravi2016optimization), Meta-SGD was conceptually simpler, easier to implement, and could be learned more efficiently. To help learning-to-learn scale to larger problems and generalize to new tasks, wichrowska2017learned introduced a hierarchical RNN architecture trained on an ensemble of small and diverse optimization tasks capturing common properties of loss landscapes. In theory, finn2018meta stated that a meta-learner was able to approximate any learning algorithm in terms of its ability to represent functions of the dataset and test inputs, independent of the type of meta-learning algorithm. Furthermore, a bridge between gradient-based hyperparameter optimization and learning to learn in certain settings was explored by (franceschibridge2017).
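To make the contrast with plain gradient-based meta-learning concrete, the adaptation rule of Meta-SGD (li2017metasgd) can be sketched as follows, where the per-parameter learning rates `alpha` are themselves meta-learned parameters rather than fixed scalars; `grad_fn` is an assumed task-gradient oracle introduced for illustration:

```python
import numpy as np

def meta_sgd_adapt(theta, alpha, grad_fn, task):
    """Meta-SGD one-step adaptation sketch: unlike plain MAML, the
    elementwise learning rates `alpha` are meta-learned alongside the
    initialization `theta`, so a single inner step can both scale and
    redirect the task gradient. `grad_fn(theta, task)` returns the
    gradient of the task loss at theta (an assumed oracle here)."""
    return theta - alpha * grad_fn(theta, task)   # elementwise product
```

In the full method, `theta` and `alpha` are jointly optimized in the outer loop so that this single adapted step already performs well on a held-out query set for each task.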
5 Beyond Small Sample Learning
In this section, we will introduce some research topics closely related to SSL, and discuss their relationships to SSL.
5.1 Weakly-Supervised Learning
Different from SSL with few annotated samples, weakly-supervised learning usually has more annotated information, but it is coarse-grained or noisy, i.e., the supervision is inexact, inaccurate or incomplete (zhou2017brief). For example, semantic segmentation needs pixel-wise labels for supervised learning, while collecting large-scale annotations is significantly labor-intensive and limited for some applications. To alleviate this annotation-quality issue and make semantic segmentation more scalable and generally applicable, weakly supervised learning has attracted much attention recently. The challenge is that weak labels provide only partial (or even inaccurate) supervision (hong2017weakly), such as image-level labels, bounding boxes, point supervision, scribbles, and so on. To further reduce the human intervention required for training, some approaches design regimes that exploit an additional source of data. For example, hong2017weakly2 made use of web videos as additional data; the annotations of web videos returned by a search engine are inevitably noisy since the query keywords may not be consistent with the visual content of the target images, and thus the problem is evidently weakly supervised. Sometimes, weakly-supervised information helps boost the performance of SSL. For example, Niu_2018_CVPR recently designed a new framework that can jointly leverage both web data and auxiliary labeled categories (zero-shot learning) for fine-grained image classification. Their model can tackle the label noise and domain shift issues to a certain extent.
5.2 Developmental Learning and Lifelong Learning
Unlike machine learning systems, which suffer from catastrophic forgetting of experience (kirkpatrick2017overcoming), humans can learn and remember many different tasks encountered over multiple timescales. Recently, developmental learning (sigaud2016towards) and lifelong learning (thrun1995lifelong; mitchell2018never) have tried to alleviate this issue. SSL focuses on learning with few observations, which sometimes meets the needs of developmental and lifelong learning when new tasks come with few data. On the other hand, the techniques and ideas of developmental and lifelong learning may inspire SSL solving strategies, like fine-tuning (Section 4.3.1). For example, kaiser2017learning proposed life-long one-shot learning, the first attempt to make deep models learn to remember rare events through their lifetime.
5.3 Open Set Learning
Open set learning was first proposed by (scheirer2013toward); it tries to identify whether testing images come from the training classes or from some unseen classes. Unlike zero-shot learning, open set learning does not need to explicitly predict the class labels. Recently, this setting has also been studied as incremental learning (rebuffi2017icarl), where learning systems learn more and more concepts over time from a stream of data. This learning paradigm is distinguished from zero-shot learning in that it deploys data in a dynamic way. Furthermore, when the data features/classes are both incremental and decremental, the setting is called online learning (hou2017one). To summarise, open set learning can be considered a specific SSL problem with certain constraints, because novel classes arrive with few observations, while the setting is dynamic, evolving, and it is infeasible to keep all the data. Recently, busto2017open explored domain adaptation in open sets, called open set domain adaptation. In this setting, both source and target domains contain classes that are not of interest, and the target domain contains classes unrelated to those in the source domain and vice versa.
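A minimal open-set decision rule in this spirit, which rejects low-confidence predictions as belonging to unseen classes, can be sketched as follows; real open-set methods such as (scheirer2013toward) calibrate the rejection far more carefully, and the fixed threshold here is an illustrative assumption:

```python
import numpy as np

def open_set_predict(probs, threshold=0.5):
    """Accept the argmax class only when its confidence clears a
    threshold; otherwise flag the input as an unseen ("open") class,
    encoded as -1. `probs` is an (n_samples, n_known_classes) array of
    class probabilities from a closed-set classifier."""
    preds = probs.argmax(axis=1)
    preds[probs.max(axis=1) < threshold] = -1
    return preds
```

The rejected inputs (-1) are exactly the candidates for the few-shot or incremental-learning machinery discussed above: novel classes arriving with few observations.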
6 Further Research
The research on SSL is just in its very beginning period. Current developments still need to be further improved, empirically justified, theoretically evaluated, and extended in capability. In this section, we list some promising and challenging research directions worth investigating in future research.
6.1 More Neuroscience-Inspired Researches
SSL stems from mimicking the mechanisms by which human beings recognize and form concepts. Though many works have focused on computer simulation of these mechanisms, more intrinsic simulations deserve further exploration. In particular, the following issues deserve consideration:
Faster information process mechanism.
SSL with capability of episodic memory and experience replay.
Generating data as the cognitive manners like imagination, planning, and synthesis.
Continual and incremental SSL regimes.
6.2 Meta Knowledge & Meta Learning
The core strategy of SSL consists in the skillful use of existing knowledge. The knowledge mostly used so far concerns the method itself for specific problem-solving, rather than the methodology of how such methods are developed. The latter is knowledge at the meta-level. How to effectively use meta-level knowledge in SSL deserves further study.
Meta-learning aims at learning the methodology of doing things (namely, learning to learn, optimize, transfer, and so on). This should be one of the next important focuses in AI research. Achieving this goal may go through the following stages:
To accomplish a family of highly related tasks through learning the common methodology of solving the family of tasks.
To accomplish a set of weakly related tasks through learning the common methodology of solving this set of tasks.
To accomplish more general tasks through learning methodology of doing things.
Realizing the above stages requires progressive efforts. Current developments mainly address the goals of the first two stages.
6.3 Concept Learning: Main Challenges
It is also a critical issue how to build a universal (generally applicable) mapping from the visual space (V, image space) to the semantic space (S, concept space). The existing methods are far from satisfactory, and the problem of transforming from V to S remains very challenging, e.g., due to domain shift problems (fu2014transductive) and hubness problems (radovanovic2010hubs; dinu2014improving).
Another challenge is the new concept learning problem: how to justify that a new concept is being formed, and how to properly formalize its intensional and extensional representations. Such research is scarce so far but imperative, e.g., for subtype discovery of cancer.
6.4 Experience Learning: Main Challenges
Cross-domain synthesis is the most attractive manner of forming augmented data. How to realize a proper transformation of an object from one representation to another (say, from CT to MR, or from visual signals to brain signals) is still a challenging issue. The differential homomorphism approach provides a promising mathematical framework for alleviating this issue; its effectiveness is, however, far from satisfactory.
Model/knowledge/metric-driven learning provides promising ways to relax the dependence of LSL on large amounts of samples, but some problems still need to be considered:
How to determine a proper family of models?
How to define, represent, and embed knowledge into a model?
How to design metric learning methods more suitable for SSL?
6.5 Promising SSL Applications
There are many attractive applications to which SSL can hopefully be generalized. Some typical cases include:
New drug discovery, human-machine interaction, subtype discovery of disease, and outlier detection;
Fast cognition and recognition: the applications needed to perceive environment and react in real time;
Experience learning with few samples: medical aid diagnosis, intelligent communication (suh2016label).
7 Conclusion
This paper has provided a comprehensive survey of current developments in small sample learning (SSL). Existing SSL techniques can be divided into two main categories of approaches: experience learning and concept learning. Both concepts, as well as SSL itself, have been formally explained in mathematical terms in this paper, and the most typical methods along both lines of research have been comprehensively reviewed. Besides, biological plausibility has been provided to support the feasibility of SSL. Furthermore, the relationship of related methodologies to SSL has been discussed, and some meaningful research directions for SSL have been introduced for future research.
This research was supported by the China NSFC projects under contracts 61661166011, 11690011, 61603292, 61721002.