“Can machines think?” (Turing, 1950). This is the question raised in Alan Turing’s seminal 1950 paper, “Computing Machinery and Intelligence”. He made the statement that, “The idea behind digital computers may be explained by saying that these machines are intended to carry out any operations which could be done by a human computer”. In other words, the ultimate goal of machines is to be as intelligent as humans. In recent years, owing to powerful computing devices such as GPUs, large-scale data sets such as ImageNet (Deng et al., 2009), and advanced models and algorithms such as CNNs (Krizhevsky et al., 2012), AI has sped up its pace toward human-level ability and has defeated humans in many fields. To name a few, AlphaGo (Silver et al., 2016) has defeated human champions in the ancient game of Go, and ResNet (He et al., 2016) achieves higher classification accuracy than humans on the ImageNet data set of 1,000 classes. Meanwhile, in other fields, AI is involved in humans’ daily lives as highly intelligent tools, such as voice assistants, search engines, autonomous driving cars, and industrial robots.
Despite this prosperity, current AI cannot rapidly generalize from a few examples to perform a task. The aforementioned successful applications of AI rely on exhaustive learning from large-scale data. In contrast, humans are capable of learning new task scenarios rapidly by utilizing what they learned in the past. For example, a child who has learned how to add can rapidly transfer this knowledge to learn how to multiply, given only a few examples. Another example is that, given a few photos of a stranger, a child can easily identify the same person among a large number of photos.
Bridging this gap between AI and human-like learning is an important direction. It can be tackled by turning to machine learning, the sub-field of AI that supplies its scientific basis: models, algorithms and theories. Specifically, machine learning is concerned with the question of how to construct computer programs that automatically improve with experience (Mitchell, 1997). To learn from limited supervised information and still get the hang of the task, a new machine learning problem called Few-Shot Learning (FSL) (Fink, 2005; Fei-Fei et al., 2006) has been proposed. When there is only one example to learn from, FSL is also called the one-shot learning problem. FSL can learn a new task from limited supervised information by incorporating prior knowledge.
FSL acts as a test bed for AI, probing whether a program learns like humans. A typical example is character recognition (Lake et al., 2015), where computer programs are asked to classify, parse and generate new handwritten characters given a few examples. To deal with this task, one can decompose the characters into smaller parts transferable across characters, then aggregate these smaller components into new characters. This is a way of learning like humans (Lake et al., 2017). Naturally, FSL also advances the development of robotics (Craig, 2009), which targets developing machines that can replicate human actions so as to replace humans in some scenarios. Examples are one-shot imitation, multi-armed bandits and visual navigation (Duan et al., 2017), and continuous control in locomotion (Finn et al., 2017).
Apart from serving as a test bed for AI, FSL can also help relieve the burden of collecting large-scale supervised data for industrial needs. For example, ResNet (He et al., 2016) obtains higher classification accuracy than humans on ImageNet data of 1,000 classes. However, this holds only when each class has sufficient labeled images. In contrast, humans can recognize around 30,000 classes (Biederman, 1987), and collecting sufficient images of each class for machines to learn is very laborious. Instead, FSL can help reduce the data-gathering effort for such data-intensive applications, including image classification (Vinyals et al., 2016), object tracking (Bertinetto et al., 2016), gesture recognition (Pfister et al., 2014), image captioning and visual question answering (Dong et al., 2018), video event detection (Yan et al., 2015), and language modeling (Vinyals et al., 2016). Moreover, being able to perform FSL can reduce the cost of computationally expensive applications such as one-shot architecture search (Brock et al., 2018). When models and algorithms succeed at FSL, they naturally apply to data sets with many samples, which are easier to learn from.
Another classic scenario for FSL is tasks where supervised information is hard or impossible to acquire for some reason, such as privacy, safety or ethical issues. For example, drug discovery is the process of discovering the properties of new molecules so as to identify useful ones as new drugs (Altae-Tran et al., 2017). However, due to possible toxicity, low activity, and low solubility, new molecules do not have many real biological records on clinical candidates. This makes drug discovery an FSL problem. Similar rare-case learning applications include FSL translation (Kaiser et al., 2017) and cold-start item recommendation (Vartak et al., 2017), where the target tasks do not have many examples. It is through FSL that learning suitable models for these rare cases becomes possible.
With both the academic dream of AI and the industrial need for cheap learning, FSL has drawn much attention and become a hot topic. As a learning paradigm, many methods endeavor to solve it, such as meta-learning methods (Santoro et al., 2016), embedding learning methods (Vinyals et al., 2016) and generative modeling methods (Edwards and Storkey, 2017). However, no existing work provides an organized taxonomy to connect FSL methods, explains why some methods work while others fail, or discusses the pros and cons of different works. Therefore, we conduct a survey of the FSL problem. (In comparison, the existing survey (Shu et al., 2018) focuses on concept learning and experience learning for small samples, whereas our survey focuses on FSL.) The contributions of this survey are summarized as follows.
We give a formal definition of FSL, which naturally links to the classic machine learning definition proposed in (Mitchell, 1997). The definition is general enough to include all existing FSL works, yet specific enough to clarify the goal of FSL and how it can be solved. Such a definition is helpful for setting future research targets in the FSL area.
We point out the core issue of FSL based on error decomposition (Bottou and Bousquet, 2008) in machine learning. We figure out that it is the unreliable empirical risk minimizer that makes FSL hard to learn. This can be relieved by satisfying or reducing the sample complexity of learning. More importantly, this provides insights to improve FSL methods in a more organized and systematic way.
We perform an extensive literature review, from the birth of FSL to the most recently published works, and categorize them in a unified taxonomy in terms of data, model and algorithm. The pros and cons of the different categories are thoroughly discussed, and we present a summary of the insights under each category. These can help establish a better understanding of FSL methods.
We propose four promising future directions for FSL in terms of problem setup, techniques, applications and theories. These directions target weaknesses in the current development of FSL, with possible improvements to be made in the future.
1.1. Organization of the Survey
The remainder of this survey is organized as follows. Section 2 provides an overview of the survey, including FSL’s formal definition, core issue, relevant learning problems and a taxonomy of existing works in terms of data, model and algorithm. Section 3 is for methods that augment data to solve FSL problem. Section 4 is for methods that constrain the model so as to make FSL feasible. Section 5 is for methods that alter the search strategy of algorithm to deal with FSL problem. In Section 6, we propose future directions for FSL in terms of problem setup, techniques, applications and theories. Finally, the survey closes with conclusions in Section 7.
2. Overview

In this section, we first provide the notation used throughout the paper in Section 2.1. A formal definition of the FSL problem is given in Section 2.2, with concrete examples. As the FSL problem relates to many machine learning problems, we discuss their relatedness and differences in Section 2.3. In Section 2.4, we reveal the core issue that makes the FSL problem hard. Then, according to how existing works deal with this core issue, we present a unified taxonomy in Section 2.5.
2.1. Notation and Terminology

Consider a supervised learning task T. FSL deals with a data set D = {D_train, D_test} consisting of a training set D_train = {(x_i, y_i)}_{i=1}^I, where the number of examples I is small, and a test set D_test. Usually, people consider the N-way-K-shot classification task (Vinyals et al., 2016; Finn et al., 2017), where D_train contains I = KN examples from N classes, each with K examples. Let p(x, y) be the ground-truth joint probability distribution of input x and output y, and ĥ be the optimal hypothesis from x to y. FSL learns to discover ĥ by fitting D_train and testing on D_test. To approximate ĥ, the model determines a hypothesis space H of hypotheses h(·; θ) parameterized by θ. (Parametric h is used here, as non-parametric models count on large-scale data to fit the distribution and are therefore not suitable for FSL.) The algorithm is the optimization strategy that searches through H for the θ parameterizing the best hypothesis for T. The performance is measured by a loss function ℓ(ŷ, y) defined over the prediction ŷ = h(x; θ) and the observed output y.
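The N-way-K-shot setup can be made concrete with a short sketch. The stdlib-only Python function below (the name `sample_episode` and the list-of-pairs data format are our own illustration, not from the survey) samples a support set playing the role of the few-shot D_train and a query set playing the role of D_test:

```python
import random
from collections import defaultdict

def sample_episode(dataset, n_way=5, k_shot=1, n_query=15, rng=None):
    """Sample one N-way-K-shot episode from a labeled dataset.

    `dataset` is a list of (x, y) pairs. The support set plays the role of
    the few-shot D_train (N classes, K examples each); the query set plays
    the role of D_test.
    """
    rng = rng or random.Random(0)
    by_class = defaultdict(list)
    for x, y in dataset:
        by_class[y].append(x)
    # Pick N classes, then K support and n_query query examples per class.
    classes = rng.sample(sorted(by_class), n_way)
    support, query = [], []
    for y in classes:
        xs = rng.sample(by_class[y], k_shot + n_query)
        support += [(x, y) for x in xs[:k_shot]]
        query += [(x, y) for x in xs[k_shot:]]
    return support, query
```

With `n_way=5` and `k_shot=1`, the support set contains exactly I = KN = 5 labeled examples, which is the regime the rest of this section analyzes.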
2.2. Problem Definition
As FSL is naturally a sub-area of machine learning, before giving the definition of FSL, let us recall how machine learning is defined in the literature. We adopt Mitchell’s definition (Mitchell, 1997), shown in Definition 2.1.
Definition 2.1 (Machine learning (Mitchell, 1997)).
A computer program is said to learn from experience E with respect to some classes of tasks T and performance measure P if its performance on T, as measured by P, improves with E.
As we can see, a machine learning problem is specified by E, T and P. For example, consider an image classification task (T): machine learning programs can improve their classification accuracy (P) through the E obtained by training on large-scale labeled images, e.g., the ImageNet data set (Krizhevsky et al., 2012). Another example is the recent computer program AlphaGo (Silver et al., 2016), which has defeated the human champion in the ancient game of Go (T). It improves its winning rate (P) against opponents through the E of training on a database of around 30 million recorded moves of human experts as well as of playing against itself repeatedly.
The above-mentioned typical applications of machine learning require a lot of supervised information for the given tasks. However, as mentioned in the introduction, this may be difficult or even impossible to obtain. FSL is a special case of machine learning, which exactly targets obtaining good learning performance given the limited supervised information provided in a data set D. The supervised information refers to the training set D_train, which comprises examples of the inputs x_i along with their corresponding outputs y_i (Bishop, 2006). Formally, we define FSL in Definition 2.2.
Definition 2.2 (Few-Shot Learning).
Few-Shot Learning (FSL) is a type of machine learning problem (specified by E, T and P) where E contains only a little supervised information for the target T.
To understand this definition better, let us show three typical scenarios of FSL (Table 1):
Act as a test bed for human-like learning: To move towards human intelligence, the ability of computer programs to solve FSL problems is vital. A popular task (T) is to generate samples of a new character given only a few examples (Lake et al., 2015). Inspired by how humans learn, the computer programs learn with an E consisting of both the given examples as supervised information and pre-trained concepts, such as parts and relations, as prior knowledge. The generated characters are evaluated through the pass rate of a visual Turing test (P), which discriminates whether the images are generated by humans or machines. With this prior knowledge, computer programs can also learn to classify, parse and generate new handwritten characters from a few examples, like humans.
Reduce data gathering effort and computation cost: FSL can also help relieve the burden of collecting large-scale supervised information. Consider classifying classes with only a few examples each through FSL (Fei-Fei et al., 2006). The image classification accuracy (P) improves with the E obtained from a few labeled images for each class of the target task (T), and from prior knowledge extracted from other classes, such as raw images used for co-training. Methods that succeed in this task usually have higher generality; therefore they can easily be applied to tasks with many samples.
Learn for rare cases: Finally, it is through FSL that one can learn suitable models for rare cases with limited supervised data. For example, consider a common drug discovery task (T), which is to predict whether a new molecule has toxic effects (Altae-Tran et al., 2017). The percentage of molecules correctly assigned as toxic or non-toxic (P) improves with an E obtained from both the new molecule’s limited assays and many similar molecules’ assays as prior knowledge.
| task (T) | supervised information (in E) | prior knowledge (in E) | performance measure (P) |
|---|---|---|---|
| character generation (Lake et al., 2015) | a few examples of the new character | pre-learned knowledge of parts and relations | pass rate of visual Turing test |
| image classification (Koch, 2015) | a few labeled images for each class of the target task | raw images of other classes, or pre-trained models | classification accuracy |
| drug toxicity discovery (Altae-Tran et al., 2017) | the new molecule’s limited assays | similar molecules’ assays | classification accuracy |
As only a little supervised information directly related to T is contained in E, it is natural that common supervised machine learning approaches fail on FSL problems. Therefore, FSL methods make learning the target T feasible by combining the available supervised information in E with some prior knowledge, which is “any information the learner has about the unknown function before seeing the examples” (Mahadevan and Tadepalli, 1994).
2.3. Relevant Learning Problems
In this section, we discuss learning problems relevant to FSL, clarifying their relatedness to and differences from FSL.
Semi-supervised learning learns the optimal hypothesis from an E containing both labeled and unlabeled samples. Positive-unlabeled learning is a special case of semi-supervised learning, where only positive and unlabeled samples are given. Another related semi-supervised learning problem is active learning (Settles, 2009), which selects informative unlabeled data to query an oracle for its output y. By definition, FSL can be supervised learning, semi-supervised learning or reinforcement learning, depending on what kinds of data are available apart from the limited supervised information. It requires neither the existence of unlabeled samples nor an oracle.
Transfer learning (Pan and Yang, 2010) transfers knowledge learned from a source domain and source task, where sufficient training data are available, to a target domain and target task, where training data are limited. Domain adaptation (Ben-David et al., 2007) is a type of transfer learning problem in which the tasks are the same but the domains differ. Another related transfer learning problem, called zero-shot learning (Lampert et al., 2009), recognizes a new class with no supervised training examples by linking it to existing classes, usually counting on external data sources such as text corpora and lexical databases (Xian et al., 2018). FSL does not need to be a transfer learning problem. However, when the given supervised information is too limited to learn from directly, FSL needs to transfer prior knowledge to the current task; this kind of FSL problem then becomes a transfer learning problem.
Meta-learning, or learning-to-learn (Hochreiter et al., 2001), improves the performance P of a new task T using the provided data set and the meta knowledge extracted across tasks by a meta-learner. Specifically, the meta-learner gradually learns generic information (meta knowledge) across tasks, and the learner rapidly generalizes the meta-learner for a new task using task-specific information. Many FSL methods are meta-learning methods, using the meta-learner as prior knowledge. For later reference, a formal definition of meta-learning is given in Appendix A.
2.4. Core Issue
Usually, we cannot get perfect predictions for a machine learning problem, i.e., there are some prediction errors. In this section, we illustrate the core issue under FSL based on error decomposition in machine learning (Bottou and Bousquet, 2008; Bottou et al., 2018).
Recall that machine learning is about improving P with E on T. In terms of our notation, E corresponds to the data D, T to finding the optimal hypothesis ĥ, and P to the loss ℓ used to measure predictions. Here, θ parameterizes the hypothesis h chosen by the model, and learning is about the algorithm searching H for the θ of the best hypothesis that fits D_train.
2.4.1. Empirical Risk Minimization
In essence, we want to minimize the expected risk R, which is the loss measured with respect to p(x, y). For a hypothesis h, the expected risk is defined as

R(h) = ∫ ℓ(h(x), y) dp(x, y) = E_{(x,y)∼p(x,y)} [ℓ(h(x), y)].

However, p(x, y) is unknown. Hence the empirical risk R_I is used to estimate the expected risk. It is defined as the average of the sample losses over the training set D_train of I samples:

R_I(h) = (1/I) Σ_{i=1}^{I} ℓ(h(x_i), y_i),
and learning is done by empirical risk minimization (Vapnik, 1992) (perhaps also with some regularizers). For illustrative purpose, let
ĥ = argmin_h R(h), the hypothesis at which the expected risk attains its minimum;

h* = argmin_{h ∈ H} R(h), the hypothesis in H that minimizes the expected risk;

h_I = argmin_{h ∈ H} R_I(h), the hypothesis in H that minimizes the empirical risk.
Assume ĥ, h* and h_I are unique for simplicity. The total error of learning, taken with respect to the random choice of training set, can be decomposed into

E[R(h_I) − R(ĥ)] = E_app(H) + E_est(H, I), where E_app(H) = E[R(h*) − R(ĥ)] and E_est(H, I) = E[R(h_I) − R(h*)].

Here the approximation error E_app(H) measures how closely the functions in H can approximate the optimal hypothesis ĥ, while the estimation error E_est(H, I) measures the effect of minimizing the empirical risk R_I instead of the expected risk R within H (Bottou and Bousquet, 2008; Bottou et al., 2018).
As shown, the total error is affected by H (the hypothesis space) and I (the number of examples in D_train). In other words, learning to reduce the total error can be attempted from the perspectives of data, which provides D_train; model, which determines H; and algorithm, which searches through H for the θ of the best hypothesis that fits D_train.
2.4.2. Unreliable Empirical Risk Minimizer
Note that the estimation error decreases with the number of examples, i.e., E_est(H, I) → 0 as I → ∞, which means more examples help reduce it. Thus, in the common setting of supervised learning, the training set D_train is armed with sufficient supervised information, i.e., I is large, and the empirical risk minimizer h_I can provide a good approximation to the best possible hypothesis h* in H.
However, in FSL the number of available examples I is small. This makes the empirical risk R_I(h) far from a good approximation of the expected risk R(h), and the resulting empirical risk minimizer h_I unreliable. Indeed, this is the core issue of FSL: the empirical risk minimizer h_I is no longer reliable. Therefore, FSL is much harder than common machine learning settings. A comparison between the common and few-shot settings is shown in Figure 1.
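A toy experiment illustrates this unreliability. Assume (our simplification for illustration, not the survey's setting) a squared loss and a hypothesis space of constant predictors h_θ(x) = θ, for which the empirical risk minimizer is simply the sample mean. Its deviation from the optimal parameter shrinks as I grows, so a few-shot I yields a far noisier minimizer:

```python
import random
import statistics

def erm_constant_predictor(samples):
    """Under squared loss over constant hypotheses h_theta(x) = theta,
    the empirical risk minimizer is the sample mean."""
    return statistics.fmean(samples)

rng = random.Random(0)
TRUE_MEAN = 0.0  # parameter of the optimal hypothesis h-hat

def avg_deviation(i_samples, trials=200):
    """Average |h_I - h-hat| over repeated random draws of D_train."""
    devs = []
    for _ in range(trials):
        data = [rng.gauss(TRUE_MEAN, 1.0) for _ in range(i_samples)]
        devs.append(abs(erm_constant_predictor(data) - TRUE_MEAN))
    return statistics.fmean(devs)

few_shot_dev = avg_deviation(5)     # I = 5: unreliable minimizer
many_shot_dev = avg_deviation(1000) # I = 1000: close to h*
```

The few-shot deviation is roughly an order of magnitude larger, matching the intuition that the empirical risk with small I is a poor estimate of the expected risk.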
Historically, classical machine learning methods learn with regularization (Goodfellow et al., 2016) so that the learned model generalizes to new data. Regularization techniques are deeply rooted in machine learning and help reduce the estimation error to obtain better learning performance (Mitchell, 1997). Classical examples include the Tikhonov regularizer (Hoerl and Kennard, 1970) and the lasso regularizer (Tibshirani, 1996). Admittedly, these regularizers can restrict the form of models. However, such simple regularization techniques cannot address the FSL problem: they do not bring in any extra supervised information, and therefore cannot remedy the unreliability of the empirical risk minimizer caused by a small I. Thus, learning with regularization alone is not enough to offer good prediction performance for the FSL problem.
2.4.3. Sample Complexity
Empirical risk minimization is closely related to sample complexity. Specifically, the sample complexity S refers to the number of training samples needed to guarantee that the loss of minimizing the empirical risk instead of the expected risk is at most ε with probability at least 1 − δ (Mitchell, 1997). Mathematically, for ε, δ ∈ (0, 1), the sample complexity S is an integer such that for all I ≥ S we have

Pr( R(h_I) − R(h*) ≤ ε ) ≥ 1 − δ.

When H is finite, it is learnable, with S growing with ln|H|. For an infinite hypothesis space H, its complexity can be measured by the Vapnik–Chervonenkis (VC) dimension (Vapnik and Chervonenkis, 1974), defined as the size of the largest set of inputs that can be shattered (split in all possible ways) by H. S is then tightly bounded as

S = Θ( (VC(H) + ln(1/δ)) / ε² ).
As shown by these bounds, for fixed ε and δ, a less complicated H needs fewer samples, so that the provided I samples may be enough. FSL methods usually use prior knowledge to compensate for the lack of samples. One typical kind of FSL method is Bayesian learning (Lake et al., 2015; Fei-Fei et al., 2006). It combines the provided training set D_train with a prior probability, i.e., the probability distribution available before D_train is given (Bishop, 2006). In this way, the S required to determine the final probability of the hypothesis is provably reduced (Mitchell, 1997; Germain et al., 2016). This inspires us to suspect that FSL methods can satisfy or reduce S by using prior knowledge, so that the core issue of the unreliable empirical risk minimizer can be resolved.
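For a finite hypothesis space, the standard Hoeffding-plus-union-bound argument gives S = ⌈ln(2|H|/δ) / (2ε²)⌉. The sketch below (the bound is standard agnostic-PAC machinery; the hypothesis-space sizes are purely illustrative) shows how prior knowledge that shrinks H also shrinks the required S:

```python
import math

def sample_complexity(h_size, eps, delta):
    """Agnostic-PAC sample complexity for a finite hypothesis space of size
    h_size (Hoeffding inequality + union bound): with I >= S samples, the
    empirical risk minimizer is within eps of the best hypothesis in H
    with probability at least 1 - delta."""
    return math.ceil(math.log(2 * h_size / delta) / (2 * eps ** 2))

# Prior knowledge rules out most hypotheses, shrinking |H| by four orders
# of magnitude and reducing the samples needed for a reliable minimizer.
full_s = sample_complexity(10**6, eps=0.1, delta=0.05)
pruned_s = sample_complexity(10**2, eps=0.1, delta=0.05)
```

Because S grows only logarithmically in |H|, pruning H shrinks S by an additive ln-factor rather than proportionally; still, for a fixed small I, reducing |H| is exactly what makes the guarantee attainable.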
In the previous sections, we discovered that the core issue of FSL is the unreliable empirical risk minimizer h_I. We also showed that a reliable empirical risk minimizer can be obtained by satisfying or reducing the required sample complexity S. Existing FSL works use prior knowledge to satisfy or reduce S. Based on how prior knowledge is used, we categorize these works into three kinds:
Data: methods that use prior knowledge to augment D_train, so that the enlarged number of samples satisfies the required sample complexity S. An illustration is shown in Figure 2(a).
Model: methods that design H based on prior knowledge in E so as to constrain the complexity of H. Learning with this constrained H leads to a smaller S, as proved in (Mahadevan and Tadepalli, 1994; Nguyen and Zakynthinou, 2018; Germain et al., 2016). An illustration is shown in Figure 2(b). The gray area is not considered during later optimization, as prior knowledge indicates it is unlikely to contain the optimal h*. For this smaller H, D_train is enough to learn a more reliable h_I.
Algorithm: methods that take advantage of prior knowledge to search for the θ which parameterizes the best hypothesis in H. The prior knowledge alters the search strategy by providing a good initial point to begin the search, or by directly providing the search steps. For methods of this kind, refining existing parameters starts the optimization from a pre-trained point, while utilizing a meta-learner learned from a set of tasks directly targets θ; therefore we show two paths from “start” to θ in Figure 2(c). Both provably reduce S (McNamara and Balcan, 2017; Amit and Meir, 2018).
Accordingly, existing works can be categorized into a unified taxonomy, as shown in Figure 3. Each category is detailed in the following sections.
3. Data

Methods in this section solve the FSL problem by augmenting data using prior knowledge, so as to enrich the supervised information in E. With more samples, the data becomes sufficient to meet the sample complexity S needed by subsequent machine learning models and algorithms, and to obtain a more reliable empirical risk minimizer h_I.
Here, we show how data is augmented in FSL using prior knowledge. Depending on the type of prior knowledge, we classify these methods into four kinds, as shown in Table 2. Accordingly, an illustration of how the transformation works is shown in Figure 4. As the augmentation of each class in D_train is done independently, we illustrate using one class in D_train.
| strategy | input | transformation | output |
|---|---|---|---|
| handcrafted rule | original (x_i, y_i) | handcrafted rule on x_i | (transformed x_i, y_i) |
| learned transformation | original (x_i, y_i) | learned transformation on x_i | (transformed x_i, y_i) |
| weakly labeled or unlabeled data set | weakly labeled or unlabeled x | predictor trained on D_train | (x, output predicted by the predictor) |
| similar data set | samples from a similar data set | aggregate new x and y by a weighted average of samples of the similar data set | aggregated sample |
3.1. Transform Samples from D_train

This strategy augments D_train by transforming each (x_i, y_i) into several samples with some variation. The transformation procedure, which can be learned from similar data or designed using human expertise, is included in E as prior knowledge. It has so far been applied only to images, as synthesized images can be easily evaluated by humans.
3.1.1. Handcrafted Rule
On image recognition tasks, many works augment D_train by transforming the original examples using handcrafted rules as a pre-processing routine, e.g., translating (Shyam et al., 2017; Lake et al., 2015; Santoro et al., 2016; Benaim and Wolf, 2018), flipping (Shyam et al., 2017; Qi et al., 2018), shearing (Shyam et al., 2017), scaling (Lake et al., 2015; Zhang et al., 2018c), reflecting (Edwards and Storkey, 2017; Kozerawski and Turk, 2018), cropping (Qi et al., 2018; Zhang et al., 2018c) and rotating (Santoro et al., 2016; Vinyals et al., 2016) the given examples.
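Such handcrafted rules are easy to sketch. The helpers below (hypothetical names, with images represented as nested lists of pixel values) implement horizontal flipping and 90-degree rotation, and expand one labeled example into several transformed copies that all keep the original label:

```python
def flip_horizontal(img):
    """Mirror each row of the image (left-right flip)."""
    return [row[::-1] for row in img]

def rotate_90(img):
    """Rotate the image 90 degrees clockwise: reverse the rows,
    then transpose."""
    return [list(row) for row in zip(*img[::-1])]

def augment(example):
    """Expand one labeled image (x, y) into several (transformed x, y)
    pairs; the label is preserved under these geometric rules."""
    x, y = example
    variants = [x, flip_horizontal(x), rotate_90(x), rotate_90(rotate_90(x))]
    return [(v, y) for v in variants]
```

Note that, as the discussion below points out, such rules enlarge I without adding any new supervised information; which rules are label-preserving (e.g., flips for natural images but not for characters like “b” vs “d”) is itself prior knowledge supplied by a human.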
3.1.2. Learned Transformation
In contrast, this strategy augments D_train by duplicating the original examples into several samples which are then modified by a learned transformation. The learned transformation itself is the prior knowledge in E, while neither its training samples nor its learning procedure is needed for the current FSL task. The earliest paper on FSL (Miller et al., 2000) learns a set of geometric transformations from a similar class by iteratively aligning each sample with the other samples. The learned transformations are then applied to each (x_i, y_i) to form a large data set which can be learned from normally. Similarly, Schwartz et al. (2018) learn a set of auto-encoders from a similar class, each representing one intra-class variability, to generate new samples by adding the learned variation to x_i. Assuming all categories share general transformable variability across samples, a single transformation function is learned in (Hariharan and Girshick, 2017) to transfer, by analogy, the variation between sample pairs learned from other classes to (x_i, y_i). Instead of enumerating the variability within pairs, Kwitt et al. (2016) transform each x_i into several new samples using a set of independent attribute strength regressors learned from a large set of scene images, and assign these new samples the label of the original x_i. Based on (Kwitt et al., 2016), Liu et al. (2018) further propose learning a continuous attribute subspace which can bring any attribute variation to x_i.
Transforming samples by handcrafted rules is popularly used in deep models to reduce the risk of overfitting (Goodfellow et al., 2016). However, deep models are usually learned from large-scale data sets, where the samples are enough to roughly estimate the underlying distribution (either the conditional distribution for discriminative models or the generating distribution for generative models) (Mitchell, 1997). In this case, augmenting with more samples helps to draw a clearer shape of the distribution. In contrast, FSL contains only a little supervised information, so p(x, y) is not well exposed. Handcrafted rules such as simple scaling and rotation transform all images without considering the task or the desired data properties available in E; they do not bring in extra supervised information. Therefore, they are only used as a pre-processing step for image data. Transformation by a learned transformation, in contrast, is data-driven and exploits prior knowledge akin to the target task T, and can therefore augment more suitable samples. However, this prior knowledge needs to be extracted from similar tasks, which may not always be available and can be costly to collect.
3.2. Transform Other Data Sets
This strategy transforms samples from other data sets and adapts them to be like samples of the target task T, so as to augment the supervised information in E.
3.2.1. Weakly Labeled or Unlabeled Data Set
This strategy uses a large-scale weakly labeled or unlabeled data set, which is known to contain samples with the same labels as D_train, although the outputs are not explicitly given. Therefore, one first has to find the samples with the target labels. As such a large-scale data set contains enormous variations among its samples, augmenting them into D_train helps depict a clearer p(x, y). Consider video gesture recognition: Pfister et al. (2014) use a large but weakly labeled gesture reservoir, which contains large variations of continuous gestures of different people but no clear break between gestures. A classifier learned from D_train is used to pick samples with the same gestures as D_train from the gesture reservoir. The final gesture classifier is then built using these selected samples. Label propagation is used to label an unlabeled data set directly in (Douze et al., 2018).
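A minimal sketch of this selection step, assuming a nearest-centroid classifier as the predictor (our choice for illustration; Pfister et al. use a stronger gesture classifier): fit one centroid per class on the few-shot D_train, then keep only those pool samples that fall within a distance threshold of some class centroid, pseudo-labeling them with that class:

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of feature vectors."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]

def select_pseudo_labeled(train, pool, threshold):
    """Nearest-centroid classifier fit on the few-shot D_train picks
    confident samples from a large weakly labeled/unlabeled pool; the
    selected (sample, predicted label) pairs then augment D_train."""
    groups = {}
    for x, y in train:
        groups.setdefault(y, []).append(x)
    cents = {y: centroid(xs) for y, xs in groups.items()}
    selected = []
    for x in pool:
        dists = {y: math.dist(x, c) for y, c in cents.items()}
        best = min(dists, key=dists.get)
        if dists[best] <= threshold:  # keep only confident picks
            selected.append((x, best))
    return selected
```

The threshold embodies the trade-off discussed below: a loose threshold admits more samples but of lower quality, which is precisely the weakness of cheap weakly labeled sources.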
3.2.2. Similar Data Set
This strategy augments D_train by aggregating sample pairs from other similar but larger data sets. For example, a data set of tigers is similar to a data set of cats. The assumption is that the underlying optimal hypothesis applies across all of these classes, so that the variation among samples of the similar classes can be transferred to the few-shot classes. Therefore, new samples can be generated by aggregating samples from a similar data set, where the aggregation weight is usually some similarity measure extracted from other information sources, such as a text corpus (Tsai and Salakhutdinov, 2017). However, directly adding the aggregated samples to D_train may not be appropriate, as these samples are not drawn from the target FSL class. Therefore, Gao et al. (2018) design a method based on generative adversarial networks (GANs) (Goodfellow et al., 2014) to generate indistinguishable synthetic samples aggregated from data sets with many samples.
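A bare-bones sketch of similarity-weighted aggregation (the function name and the specific averaging scheme are our own illustration, not the method of Tsai and Salakhutdinov or Gao et al.): a new sample is formed as a weighted average of the target few-shot sample and samples from the similar, larger data set, with the weights playing the role of similarity scores:

```python
def aggregate_samples(target_x, similar_samples, weights):
    """Form a new sample as a similarity-weighted average of the target
    few-shot sample and samples drawn from a similar, larger data set.
    `weights` are similarity scores, e.g., extracted from a text corpus;
    the target sample itself carries weight 1."""
    total = sum(weights)
    agg = list(target_x)
    for x, w in zip(similar_samples, weights):
        for d in range(len(agg)):
            agg[d] += w * x[d]
    return [v / (1.0 + total) for v in agg]
```

For example, averaging a target feature vector (1, 1) with a similar-class sample (3, 3) at weight 1 yields (2, 2), a point between the two; the quality of such samples hinges entirely on how well the similarity weights reflect true class relatedness, which motivates the GAN-based refinement of Gao et al.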
Gathering a weakly labeled or unlabeled data set is usually cheap, as no human labeling effort is needed. However, along with this cheapness comes low quality: such data sets are usually coarse and lack a strict collection and scrutiny procedure, resulting in unclear synthesizing quality. Besides, it is costly to pick useful samples out of such a large data set. A similar data set, in contrast, shares some properties with D_train and contains sufficient supervised information, making it a more informative data source to exploit. However, determining the key property by which to seek similar data sets can be subjective, and collecting such a similar data set is laborious.
By augmenting D_train, methods in this section can meet the desired sample complexity S and obtain a reliable empirical risk minimizer h_I. Methods of the first kind transform each original sample by handcrafted or learned transformation rules. As they augment based on the original samples, the constructed new samples will not stray too far from D_train; but for the same reason, the samples that can be generated this way are restricted. Methods of the second kind transform samples from other data sets and adapt them to mimic D_train. These data sets are large-scale, providing tremendous samples of large variation for the transformation. However, adapting those samples to resemble the target data can be hard.
In general, solving FSL from the perspective of augmenting D_train is straightforward: the data can be augmented with the target of the problem in mind, which eases learning, and the augmentation procedure is usually interpretable to humans. However, as p(x, y) is unknown, perfect prior knowledge is not possible, which means the augmentation procedure is never exact. The gap between the estimated distribution and the ground truth largely interferes with the data quality, and can even lead to concept drift.
4. Model

The model determines a hypothesis space H of hypotheses h(·; θ) parameterized by θ, so as to approximate the optimal hypothesis ĥ from input x to output y.
If common machine learning models are used to deal with the few-shot D_train, they have to choose a small hypothesis space H. As shown by the sample complexity bound, a small H has a small sample complexity S, thus requiring fewer samples for training (Mitchell, 1997). When the learning problem is simple, e.g., the feature dimension is low, a small H can already deliver the desired learning performance. However, real-world learning problems are usually very complex and cannot be well represented by hypotheses in a small H, owing to a large approximation error E_app(H) (Goodfellow et al., 2016). Therefore, a large H is preferred for FSL, which makes common machine learning models infeasible. As we will see in the sequel, methods in this section learn with a large H by complementing the lack of samples through prior knowledge in E. Specifically, the prior knowledge is used to affect the design choices of H, such as constraining it. In this way, the sample complexity S is reduced, the empirical risk minimization becomes more reliable, and the risk of overfitting is reduced. In terms of what prior knowledge is used, methods of this kind can be further classified into four categories, as summarized in Table 3.
| strategy | prior knowledge | how to constrain H |
| --- | --- | --- |
| multitask learning | other tasks with their data sets | share parameters |
| embedding learning | embedding learned from/together with other tasks | project samples to a smaller embedding space in which similar and dissimilar samples are easily discriminated |
| learning with external memory | embedding learned from other tasks to interact with memory | refine samples by D_train stored in memory |
| generative modeling | prior model learned from other tasks | restrict the form of the distribution |
4.1. Multitask Learning
Multitask learning (Caruana, 1997) methods learn multiple learning tasks simultaneously, exploiting both the generic information shared across tasks and the specific information of each task. (Here we present some instantiations of multitask learning for FSL problems; for a comprehensive introduction to multitask learning, please refer to (Zhang and Yang, 2017) and (Ruder, 2017).) Multitask learning is popularly used in applications where multiple related tasks with limited training examples co-exist, and therefore it naturally applies to FSL problems. Note that when multitask learning deals with tasks from different domains, it is also called domain adaptation (Ben-David et al., 2007).
Formally, given a set of related tasks including both tasks with few samples and tasks with many samples, each task operates on a data set D consisting of a training set D_train and a test set D_test. Among these tasks, we call the few-shot tasks target tasks and the rest source tasks. Multitask learning learns from the D's a parameter θ for each task. As the tasks are related, they are assumed to have similar or overlapping hypothesis spaces. Explicitly, this is done by sharing parameters among the tasks, and the shared parameters can be viewed as a way to constrain each task's hypothesis space by the other jointly learned tasks. In terms of whether parameter sharing is explicitly enforced, we separate methods of this strategy into hard and soft parameter sharing; both are illustrated in Figure 5.
4.1.1. Hard Parameter Sharing
This strategy explicitly shares parameters among tasks to promote overlapping hypothesis spaces, and can additionally learn a task-specific parameter for each task to account for task specialties. In (Zhang et al., 2018c), this is done by sharing the first several layers of two networks to learn the generic information, while learning a different last layer for each task to deal with its different output. Benaim and Wolf (2018) operate in the opposite way for domain adaptation: they learn separate embeddings for the source and target tasks in different domains to map them into a task-invariant space, then learn a shared classifier to classify samples from all tasks. Finally, the method in (Motiian et al., 2017) first pre-trains a variational auto-encoder on the source tasks in the source domain and clones it for the target task. It then shares some layers to capture generic information, while letting both tasks keep some task-specific layers. The target task can only update its task-specific layers, while the source task can update both the shared and its specific layers. This avoids directly updating the shared layers with the few-shot D_train, so as to reduce the risk of overfitting.
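As a concrete sketch of hard parameter sharing, the snippet below builds a shared embedding used by both a source task and a few-shot target task, each with its own task-specific head; the layer sizes and task names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared layers: capture the generic information jointly trained on all tasks.
W_shared = rng.standard_normal((8, 4))   # 8-d input -> 4-d shared feature

# Task-specific heads: one output layer per task (hypothetical task names).
heads = {
    "source_task": rng.standard_normal((4, 10)),  # 10-way source task
    "target_task": rng.standard_normal((4, 5)),   # 5-way few-shot target task
}

def predict(x, task):
    """Forward pass: shared layers first, then the task-specific head."""
    feature = np.tanh(x @ W_shared)   # shared representation
    return feature @ heads[task]      # task-specific logits

x = rng.standard_normal(8)
logits_source = predict(x, "source_task")
logits_target = predict(x, "target_task")
```

Only the small head of the target task must be fit from its few examples; the shared trunk is constrained by the data-rich source task.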
4.1.2. Soft Parameter Sharing
This strategy does not explicitly share parameters across tasks. Instead, each task has its own hypothesis space and parameter, and the parameters of different tasks are merely encouraged to be similar, resulting in similar hypothesis spaces. This can be done by regularizing the parameters: Yan et al. (2015) penalize the pairwise differences between the parameters of all task combinations, forcing all of them to be learned similarly. Apart from regularizing the parameters directly, another way to enforce soft parameter sharing is to adjust them through a loss; after optimization, the learned parameters also utilize information from each other. Luo et al. (2017) initialize the CNN for the target tasks in the target domain with a CNN pre-trained on the source tasks in the source domain. During training, they use an adversarial loss calculated from the representations in multiple layers of the CNNs to force the two CNNs to project samples into a task-invariant space.
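Soft parameter sharing by regularization can be sketched as a penalty on the pairwise differences between task parameters, in the spirit of Yan et al. (2015); the parameter vectors and the weight `lam` below are hypothetical.

```python
import numpy as np

# Hypothetical per-task parameter vectors (same shape across tasks).
theta_a = np.array([1.0, 2.0])
theta_b = np.array([1.0, 2.0])
theta_c = np.array([3.0, 2.0])

def pairwise_similarity_penalty(params, lam=0.1):
    """Regularizer added to the training loss: penalizes the squared
    difference between every pair of task parameters, encouraging all
    tasks to learn similar parameters without hard sharing."""
    penalty = 0.0
    for i in range(len(params)):
        for j in range(i + 1, len(params)):
            penalty += np.sum((params[i] - params[j]) ** 2)
    return lam * penalty

total = pairwise_similarity_penalty([theta_a, theta_b, theta_c])
```

The penalty is zero only when all task parameters coincide, and grows as any task drifts away from the others.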
Multitask learning methods constrain the hypothesis space learned for each task by the set of jointly learned tasks. By sharing parameters explicitly or implicitly, the jointly learned tasks together implicitly eliminate infeasible regions of the hypothesis space. Hard parameter sharing can be enforced easily: a shared hypothesis space captures the commonality, while each task builds its specific hypothesis space on top of it. In contrast, soft parameter sharing only encourages similar hypotheses, which is a more flexible way to constrain the hypothesis space, but enforcing the similarity constraint needs careful design.
4.2. Embedding Learning
Embedding learning (Spivak, 1970; Jia et al., 2014) methods embed samples into a smaller embedding space Z, in which similar and dissimilar pairs can be easily identified; in this way, H is constrained. The embedding function is mainly learned from prior knowledge, and can additionally use D_train to bring in task-specific information. Note that embedding learning methods are mainly designed for classification tasks.
Embedding learning methods have the following key components: a function f which embeds the test sample x_test into Z, a function g which embeds the training examples x_i into Z, and a similarity measure s(·,·) which calculates the similarity between f(x_test) and g(x_i) for each x_i in D_train. Then x_test is assigned to the class of the most similar x_i. Although g can be the same as f, using two different functions is sometimes beneficial: x_test can then be embedded depending on information from D_train so as to adjust what is being compared (Bertinetto et al., 2016; Vinyals et al., 2016). Hence we distinguish these two embedding functions. Usually, a set of auxiliary data sets, each containing either many or few samples, is used to learn these components. An illustration of the embedding learning strategy is shown in Figure 6. We also present the details of existing embedding learning methods in terms of f, g and s in Table 4.
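The basic loop shared by these methods (embed with f and g, compare with s, assign the class of the most similar training example) can be sketched as below; the identity embeddings and toy labels are hypothetical stand-ins for learned networks.

```python
import numpy as np

def cosine(a, b):
    """Similarity measure s: cosine similarity between two embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify(x_test, f, g, support_x, support_y, s=cosine):
    """Embed x_test with f, each training example with g, and assign
    x_test to the class of the most similar training example under s."""
    q = f(x_test)
    sims = [s(q, g(x)) for x in support_x]
    return support_y[int(np.argmax(sims))]

# Toy one-shot setting: identity embeddings (f = g) and two classes.
identity = lambda x: x
support_x = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
support_y = ["cat", "dog"]
pred = classify(np.array([0.9, 0.1]), identity, identity, support_x, support_y)
```

Replacing `identity` with a learned network and `cosine` with a learned similarity recovers the general template of Table 4.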
Next, according to what information is encoded in the embedding, we classify these methods into task-invariant (in other words, general), task-specific, and a combination of the two.
| method | f | g | conditioned on D_train | similarity measure s | category |
| --- | --- | --- | --- | --- | --- |
| mAP-DLM/SSVM (Triantafillou et al., 2017) | CNN | the same as f | no | cosine similarity/mAP | specific |
| class relevance pseudo-metric (Fink, 2005) | kernel | the same as f | no | squared distance | invariant |
| convolutional siamese net (Koch, 2015) | CNN | the same as f | no | weighted distance | invariant |
| Micro-Set (Tang et al., 2010) | logistic projection | the same as f | no | distance | combined |
| Learnet (Bertinetto et al., 2016) | adaptive CNN | the same as f | yes | weighted distance | combined |
| DyConvNet (Zhao et al., 2018) | adaptive CNN | the same as f | no | - | combined |
| R2-D2 (Bertinetto et al., 2019) | adaptive CNN | the same as f | yes | - | combined |
| Matching Nets (Vinyals et al., 2016) | CNN, then LSTM with attention | CNN, then biLSTM | yes | cosine similarity | combined |
| resLSTM (Altae-Tran et al., 2017) | GCN, then LSTM with attention | GCN, then LSTM with attention | yes | cosine similarity | combined |
| Active MN (Bachman et al., 2017) | CNN | biLSTM | yes | cosine similarity | combined |
| ProtoNet (Snell et al., 2017) | CNN | the same as f | no | squared distance | combined |
| semi-supervised ProtoNet (Ren et al., 2018) | CNN | the same as f | no | squared distance | combined |
| PMN (Wang et al., 2018b) | CNN, then LSTM with attention | CNN, then biLSTM | yes | cosine similarity | combined |
| TADAM (Oreshkin et al., 2018) | CNN | the same as f | yes | squared distance | combined |
| ARC (Shyam et al., 2017) | RNN with attention, then biLSTM | the same as f | yes | - | combined |
| Relation Net (Sung et al., 2018) | CNN | the same as f | no | - | combined |
| GNN (Satorras and Estrach, 2018) | CNN, then GNN | the same as f | yes | learned distance | combined |
| TPN (Liu et al., 2019) | CNN | the same as f | yes | Gaussian similarity | combined |
| SNAIL (Mishra et al., 2018) | CNN with attention | the same as f | no | - | combined |
4.2.1. Task-Specific Embedding
Task-specific embedding methods learn an embedding function tailored for the task at hand. Triantafillou et al. (2017) learn an embedding that maintains a ranking list for each example in D_train, in which examples of the same class rank high and the others rank low. Given the few-shot D_train, the required sample complexity is largely reduced by enumerating all pairwise comparisons between examples in D_train as sample pairs, so that each original example is included in multiple sample pairs.
4.2.2. Task-Invariant Embedding
Task-invariant embedding methods learn the embedding function from a large set of data sets which does not include the few-shot task. The assumption is that if the embedding can successfully separate many data sets, it is general enough to work well for the new task without retraining. Fink (2005) proposes the first embedding method for FSL: it learns from auxiliary data sets a kernel space as Z, embeds both D_train and x_test into Z, and assigns x_test the class of its nearest neighbor in Z. A recent deep model, the convolutional siamese net (Koch, 2015), learns twin CNNs to embed sample pairs from a large set of data sets into a common embedding space Z. It then constructs sample pairs from the original samples of D_train, and reformulates the classification task as a verification/matching task which verifies whether the embeddings of a sample pair belong to the same class or not.
4.2.3. A Combination of Task-invariant and Task-specific
Task-specific embedding methods fully consider the task specialty, while task-invariant embedding methods can rapidly generalize to a new task without re-training. A trend is to combine the best of both: learn to adapt a generic task-invariant embedding space, learned from prior knowledge, with the task-specific information contained in D_train. Tang et al. (2010) first propose to optimize over a distribution of FSL tasks, under the name micro-sets. They learn an embedding by logistic projection from these FSL tasks. For a new few-shot task, all samples in D_train and x_test are mapped to Z, and x_test is classified by a nearest neighbor classifier in Z.
Recent works mainly use meta-learning methods (see Appendix A for a formal definition and a brief introduction of meta-learning) to merge the task-invariant and task-specific knowledge. For these meta-learned embedding methods, the auxiliary data sets are the meta-training data sets of the meta-training tasks, and a new task is one of the meta-testing tasks. As the learner trains on D_train and tests on D_test of the provided task in both the meta-training and meta-testing stages, we do not mark the stage on D_train and D_test for simplicity of illustration. We group the methods by their core ideas and highlight representative works.
Learnet (Bertinetto et al., 2016) improves upon the convolutional siamese net (Koch, 2015) by incorporating the specialty of each task's D_train into the embedding. It learns a meta-learner from the meta-training data sets that maps D_train to the parameters of each layer of the convolutional siamese net. To reduce the number of parameters of the learner, DyConvNet (Zhao et al., 2018) uses a fixed set of filters and only learns how to combine them. The recent work (Bertinetto et al., 2019) replaces the classification layer of Learnet with a ridge regression model whose parameters can be found in cheap closed form.
Matching Nets (Vinyals et al., 2016) assign x_test to the class of the most similar x_i in D_train, where x_test and the x_i's are embedded differently by f and g. Specially, f is conditioned on D_train, and g aggregates information of all examples in D_train by a bi-directional LSTM (biLSTM) (Graves and Schmidhuber, 2005). However, the biLSTM implicitly enforces an order among the examples in D_train: due to the vanishing gradient problem, nearby examples have a larger influence on each other. To remove this unnatural order, Altae-Tran et al. (2017) replace the biLSTM used in g by an LSTM with attention, and further iteratively refine both f and g to encode contextual information. An active learning variant (Bachman et al., 2017) adds a sample selection stage to Matching Nets, which labels the most beneficial unlabeled sample and uses it to augment D_train.
ProtoNet (Snell et al., 2017) performs only one comparison between x_test and the prototype of each class in D_train. A class's prototype is defined as the mean of the embeddings of that class's examples in D_train. ProtoNet embeds both x_test and the x_i's using the same CNN, ignoring the specialty of different tasks. A combination of the best of Matching Nets and ProtoNet is proposed in (Wang et al., 2018b) to account for task-specific information. Further, Oreshkin et al. (2018) average the class prototypes as a task embedding, which is then mapped to some parameters of the CNN used in ProtoNet. A semi-supervised variant of ProtoNet is proposed in (Ren et al., 2018), which learns to soft-assign related unlabeled samples to augment D_train during learning.
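The prototype computation and nearest-prototype classification of ProtoNet can be sketched as follows, assuming the embeddings have already been produced by a CNN (the toy 2-d embeddings below are hypothetical):

```python
import numpy as np

def class_prototypes(embeddings, labels):
    """Prototype of each class = mean of that class's support embeddings."""
    labels = np.asarray(labels)
    return {c: embeddings[labels == c].mean(axis=0) for c in np.unique(labels)}

def protonet_classify(query_emb, protos):
    """Assign the query to the class with the nearest prototype
    under squared Euclidean distance."""
    dists = {c: float(np.sum((query_emb - p) ** 2)) for c, p in protos.items()}
    return min(dists, key=dists.get)

emb = np.array([[0.0, 0.0], [0.2, 0.0],   # class "a" support embeddings
                [5.0, 5.0], [5.2, 5.0]])  # class "b" support embeddings
protos = class_prototypes(emb, ["a", "a", "b", "b"])
pred = protonet_classify(np.array([0.3, 0.2]), protos)
```

Only one distance per class is computed at test time, instead of one per support example as in Matching Nets.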
Relative representations further embed the embedding of x_test and the embedding of each class (calculated from D_train) jointly, and the joint embedding is then directly mapped to a similarity score, much like classification. This idea was developed independently in ARC (Shyam et al., 2017) and Relation Net (Sung et al., 2018). ARC uses an RNN with attention to recurrently compare different regions of x_test and each class prototype to produce the relative representation, and additionally uses a biLSTM to embed the information of the other comparisons into the final representation. Relation Net first uses a CNN to embed x_test and the x_i's into Z, then concatenates the embeddings as the relative representation, and outputs the similarity score by another CNN.
Relation graph is a graph maintaining all pairwise relationships among samples. Specifically, the graph is constructed using samples from both D_train and x_test as nodes, while the edges between nodes are determined by some learned similarity. Each x_test is then predicted using neighborhood information. A GCN is used in (Satorras and Estrach, 2018) to learn the relation graph between the x_i's from D_train and x_test; the resultant embedding of a node is used to predict its label. In contrast, Liu et al. (2019) meta-learn an embedding which maps each x_i and x_test to Z, build a relation graph there, and label x_test by a closed-form label propagation rule.
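A minimal sketch of closed-form label propagation on a Gaussian-similarity relation graph, in the spirit of (Liu et al., 2019), is given below; the point coordinates, `alpha` and `sigma` are hypothetical.

```python
import numpy as np

def propagate_labels(X, Y, alpha=0.5, sigma=1.0):
    """Closed-form label propagation: F = (I - alpha * S)^{-1} Y,
    where S is the symmetrically normalized Gaussian affinity matrix."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)          # no self-loops
    D = W.sum(axis=1)
    S = W / np.sqrt(np.outer(D, D))   # symmetric normalization
    F = np.linalg.solve(np.eye(len(X)) - alpha * S, Y)
    return F.argmax(axis=1)

# Two clusters; one labeled point per cluster, two unlabeled queries.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
Y = np.array([[1.0, 0.0],   # node 0 labeled class 0
              [0.0, 0.0],   # unlabeled
              [0.0, 1.0],   # node 2 labeled class 1
              [0.0, 0.0]])  # unlabeled
pred = propagate_labels(X, Y)
```

The unlabeled nodes inherit the label of the cluster they sit in, because the cross-cluster affinities are vanishingly small.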
SNAIL (Mishra et al., 2018) designs a special embedding network consisting of interleaved temporal convolution layers and attention layers. The temporal convolution aggregates information from past time steps, while the attention selectively attends to the specific time steps relevant to the current input. The network's parameters are meta-learned across tasks. Within each task, the network takes the examples of D_train sequentially, and predicts for x_test immediately.
Task-specific embedding fully considers the domain knowledge of the task. However, the given few-shot D_train can be biased, and learning only from it may be inappropriate; modeling the ranking list among D_train carries a high risk of overfitting. Besides, an embedding learned this way cannot generalize to new tasks or be adapted easily. Using a pre-trained task-invariant embedding has a low computation cost, but the learned embedding function does not consider any task-specific knowledge; when a special property of the task is the very reason D_train has only a few examples, such as learning for rare cases, simply applying a task-invariant embedding function can be unsuitable. A combination of task-invariant and task-specific information is usually learned by meta-learning methods, which can provide a good embedding and quickly generalize to different tasks through the learner. However, how to generalize to a new but unrelated task without introducing negative transfer remains unclear.
4.3. Learning with External Memory
Models with an external memory, such as the neural Turing machine (NTM) (Graves et al., 2014) and memory networks (Weston et al., 2014; Sukhbaatar et al., 2015), allow short-term memorization and rule-based manipulation (Graves et al., 2014). Note that learning is a process of mapping useful information of the training samples to the model parameters. Given a new task with training set D_train, the model has to be re-trained to incorporate its information, which is costly. Instead, learning with an external memory directly memorizes the needed knowledge in the memory, where it can be retrieved or updated; this relieves the burden of learning and allows fast generalization. Formally, denote the memory by M, which has memory slots M(i). Given a sample x, it is first embedded by a function f as a query q = f(x), which then attends to each M(i) through some similarity measure s(q, M(i)), e.g., cosine similarity. The similarities determine which knowledge is extracted from the memory, and the prediction is made based on it. Table 5 introduces the detailed characteristics of each method with external memory.
| method | memory contents | similarity measure |
| --- | --- | --- |
| MANN (Santoro et al., 2016) |  | cosine similarity |
| abstraction memory (Xu et al., 2017) |  | dot product |
| life-long memory (Kaiser et al., 2017) | and age | cosine similarity |
| CMN (Zhu and Yang, 2018) | and age | dot product |
| APL (Ramalho and Garnelo, 2019) |  | squared distance |
| MetaNet (Munkhdalai and Yu, 2017) | fast weight | cosine similarity |
| CSNs (Munkhdalai et al., 2018) | fast weight | cosine similarity |
| MN-Net (Cai et al., 2018) |  | dot product |
For FSL, D_train has limited samples and re-training the model for each task is infeasible. Learning with an external memory helps by storing the knowledge extracted from D_train in the memory. The embedding function f learned from prior knowledge is not re-trained, therefore the initial hypothesis space is not changed. When a new sample comes, relevant contents are extracted from the memory and combined into a local approximation for this sample. This approximation is then fed to the subsequent model, which is also pre-trained, for prediction. As D_train is stored in the memory, the task-specific information is effectively used. In sum, methods of this kind refine and re-interpret samples by the D_train stored in memory, consequently reshaping H. An illustration of this strategy is shown in Figure 7.
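A minimal sketch of the soft memory read described above (embed a query, score it against each slot key by cosine similarity, and mix the stored values by softmax attention) is given below; the keys and values are hypothetical.

```python
import numpy as np

def read_memory(query, keys, values):
    """Soft read from an external memory: cosine-similarity attention
    over the slot keys, returning a weighted mixture of the slot values."""
    sims = keys @ query / (np.linalg.norm(keys, axis=1) * np.linalg.norm(query))
    w = np.exp(sims)
    w = w / w.sum()          # softmax attention weights over slots
    return w @ values, w

keys = np.array([[1.0, 0.0], [0.0, 1.0]])   # slot keys (stored embeddings)
values = np.array([[10.0], [0.0]])          # slot contents
out, weights = read_memory(np.array([1.0, 0.0]), keys, values)
```

The query matching the first key pulls the read result toward that slot's value, while the mismatched slot contributes little.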
Usually, when the memory is not full, new samples can be written to vacant memory slots. When the memory is full, one must decide which memory slots to update or replace according to some designed rules. We group existing works by the preferences revealed in these update rules as follows.
Update the least recently used memory slot. MANN (Santoro et al., 2016), the earliest work that uses memory to solve the FSL classification problem, updates the least recently used memory slot for the new sample when the memory is full. As the image-label binding is shuffled across tasks, MANN cares more about mapping samples of the same class to the same label. In turn, samples of the same class together refine their class representation kept in the memory.
Update by location-based addressing. Some works use the location-based addressing proposed in the NTM, which updates all memory slots at every step by back-propagation of gradients. The abstraction memory (Xu et al., 2017) uses this update strategy. In each task, the meta-learner first extracts relevant contents from a memory containing large-scale auxiliary data, then sends them to the abstraction memory, whose output is used for prediction.
Update according to the age of memory slots. Some memories record an age for each memory slot; a slot's age increases when it is read, and resets to 0 when it is updated, so the oldest slot is the most likely to hold outdated information. Both life-long memory (Kaiser et al., 2017) and CMN (Zhu and Yang, 2018) update the oldest memory slot when the memory is full. However, sometimes one values the rare events kept in old memory slots. To deal with this, life-long memory specially prefers to update memory slots of the same class. As each class then occupies a comparable number of memory slots, rare classes are protected to some extent.
Update the memory only when the loss is high. The surprise-based memory module (Ramalho and Garnelo, 2019) only updates the memory when the prediction loss for a sample is above a threshold. The computation cost is thereby reduced compared to a fully differentiable memory, and the memory contains minimal but diverse information for equivalent prediction.
Use the memory as storage without updating. MetaNet (Munkhdalai and Yu, 2017) stores sample-level fast weights for the training examples in a memory, and conditions its embedding and classification on the extracted fast weights, so as to combine the generic and specific information. MetaNet repeatedly applies the fast weights to selected layers of a CNN. In contrast, Munkhdalai et al. (2018) learn fast weights to change the activation value of each neuron, which has a lower computation cost.
Aggregate the new information into the most similar slot. MN-Net (Cai et al., 2018) merges the information of a new sample into its most similar memory slot. Instead of directly predicting for x_test as in Matching Nets (Vinyals et al., 2016), the memory is used to refine the embedding of D_train and to parameterize a CNN as in Learnet (Bertinetto et al., 2016). Each x is then embedded by this conditional CNN, and x_test is matched with the x_i's by nearest neighbor search.
Adapting to a new task can be done by simply writing its D_train to the memory, so fast generalization comes easily. Besides, preferences such as lifelong learning or reducing memory updates can be incorporated into the design of the memory update and access rules. However, designing the desired rule relies on human knowledge, and the existing works show no clear winner. How to automatically design or choose update rules for different settings is an important open issue.
4.4. Generative Modeling
Generative modeling methods here refer to methods that involve estimating the underlying probability distribution. They use both prior knowledge and D_train to obtain the estimated distribution. The prior knowledge usually takes the form of prior models, represented by the parameters of some probability distribution, that are learned from a set of data sets D's, each consisting of a training set and a test set. Usually, each D is large-scale and D_train is not one of them. Generative modeling methods update the probability distribution using D_train for prediction. An illustration of this strategy is shown in Figure 8.
Specifically, the posterior p(y|x), i.e., the probability of y given x, is computed by Bayes' rule as p(y|x) = p(x|y)p(y)/p(x). Expanding p(x|y) by parametrization, it can be written as p(x|y) = ∫ p(x|θ, y) p(θ|y) dθ, where θ is the parameter of the class-conditional distribution. If D_train is large enough, we can use it to learn a well-peaked p(θ|y), and obtain θ using maximum likelihood estimation (MLE) or maximum a posteriori (MAP) estimation. However, in FSL tasks D_train has limited samples, which is not enough to learn θ; consequently, a good p(x|y) cannot be learned.
Generative models for FSL assume that the prior over θ is transferable across different tasks (e.g., classes). Hence the prior can instead be learned from a large set of data sets D's. Then, for the new task, p(x|y) can be obtained by adapting this learned prior distribution or by learning θ using D_train.
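As a toy illustration of how a prior shared across classes helps when D_train is tiny, the following sketch computes the MAP estimate of a new class's Gaussian mean under a Gaussian prior N(mu0, tau2) assumed to have been learned from many previous classes; all names and numbers here are hypothetical.

```python
def map_class_mean(samples, mu0, tau2, sigma2):
    """MAP estimate of a Gaussian class mean under a shared Gaussian
    prior N(mu0, tau2) (likelihood noise variance sigma2): the estimate
    is a precision-weighted blend of the prior mean and the sample mean."""
    n = len(samples)
    xbar = sum(samples) / n
    return (mu0 / tau2 + n * xbar / sigma2) / (1.0 / tau2 + n / sigma2)

# With one sample the estimate is shrunk halfway toward the prior mean...
one_shot = map_class_mean([4.0], mu0=0.0, tau2=1.0, sigma2=1.0)
# ...while with many samples it approaches the sample mean.
many = map_class_mean([4.0] * 1000, mu0=0.0, tau2=1.0, sigma2=1.0)
```

With few samples, the transferred prior dominates and stabilizes the estimate; as D_train grows, the data take over.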
Table 6 summarizes the main bibliographic references falling into this strategy and their characteristics. By learning the prior probability from prior knowledge, the shape of the distribution is restricted. According to how the prior is defined and shared across tasks, we classify existing methods into parts and relations, super classes, and latent variables.
| category | method | prior from the D's | how to use | task |
| --- | --- | --- | --- | --- |
| parts and relations | Bayesian One-Shot (Fei-Fei et al., 2006) |  |  | object recognition |
| parts and relations | BPL (Lake et al., 2015) |  | fine-tune partial | classification, generation |
| super classes | HB (Salakhutdinov et al., 2012) | a hierarchy of classes | as one of the D's | object recognition |
| super classes | HDP-DBM (Torralba et al., 2011) | a hierarchy of classes | as one of the D's | object recognition |
| latent variables | SeqGen (Rezende et al., 2016) |  | as input | generation |
| latent variables | Attention PixelCNN (Reed et al., 2018) |  | as input | image flipping, generation |
| latent variables | Neural Statistician (Edwards and Storkey, 2017) |  | as input | classification, generation |
| latent variables | VERSA (Gordon et al., 2019) |  | as input | classification, reconstruction |
| latent variables | MetaGAN (Zhang et al., 2018b) |  | as input | classification |
| latent variables | GMN (Bartunov and Vetrov, 2018) |  | as input | classification, generation |
4.4.1. Parts and Relations
This strategy learns parts and relations from a large set of D's as prior knowledge. Although the target few-shot classes have few samples, their components such as parts and relations are shared with many other classes. With many more samples to use, parts and relations are much easier to learn. For x_test, the model needs to infer the correct combination of related parts and relations, then decide which target class this combination belongs to. Bayesian One-Shot (Fei-Fei et al., 2006) and BPL (Lake et al., 2015) fall into this category. Bayesian One-Shot leverages the shapes and appearances of objects to help recognize objects, while BPL separates a character into types, tokens, and further templates, parts and primitives to model characters. As the inference procedure is costly, only a handful of parts is used in Bayesian One-Shot, which largely reduces the combinatorial space of parts and relations, while only the five most probable combinations are considered in BPL.
4.4.2. Super Classes
Parts and relations model the smaller components of samples, while super classes group similar classes by unsupervised learning. Considering the classification task, this strategy finds the best parameters of the super classes as prior knowledge. A new class is first assigned to a super class, and then learns its own parameters by adapting the super class's. In (Salakhutdinov et al., 2012), a hierarchy of classes is learned using the D's (which include D_train). In this way, similar classes together contribute to learning a precise general prior representing their super class, and in return each super class provides guidance to its assigned classes, especially those with only a few examples. The feature learning part of (Salakhutdinov et al., 2012) is further improved in (Torralba et al., 2011) by using deep Boltzmann machines to learn more complicated features.
4.4.3. Latent Variables
Separating samples into parts and relations is handcrafted and relies heavily on human expertise. Instead, this strategy models latent variables, with no explicit meaning, that are shared across classes. Without decomposition, the prior learned from the D's no longer needs to be adjusted, so the computation cost for the new task is largely reduced. To handle more complicated distributions, the models used in this strategy are usually deep models. According to which classic deep generative model acts as the basis, existing works can be grouped as follows.
Variational auto-encoder (VAE) (Kingma and Welling, 2014). Rezende et al. (2016) propose to model the distribution by a set of sequentially inferred latent variables using a sequential VAE. The model repeatedly attends to different regions of each sample from the D's and analyzes the capability of the current model to provide feedback, which allows it to model the density well.
Autoregressive model. Reed et al. (2018) propose an autoregressive model which decomposes the density estimation of an image pixel-wise. The model sequentially generates each pixel conditioned on both the already generated pixels and related information acquired from a memory storing D_train.
Inference networks (Zhang et al., 2018a). Edwards and Storkey (2017) learn one inference network to infer the latent variables, and another inference network to map D_train to the parameter of its generative distribution. Also learning inference networks by amortized variational inference, Gordon et al. (2019) learn to map D_train to the parameters of a variational distribution which approximates the predictive posterior distribution over the output y.
Generative adversarial networks (GAN) (Goodfellow et al., 2014). Zhang et al. (2018b) jointly learn an imperfect GAN with a discriminative model for the classification task. The imperfect GAN generates samples that are similar to the examples in D_train but slightly different. By learning to discriminate between the real examples and the slightly different fake data, the model obtains a sharper decision boundary.
Generative version of Matching Nets (GMN) (Vinyals et al., 2016). Bartunov and Vetrov (2018) extend the discriminative Matching Nets to the generative setting as GMN. GMN replaces the label y used in Matching Nets by a latent variable z. It embeds D_train by g and x_test by f into the embedding space Z, where f(x_test) attends to each g(x_i) to obtain the weights used to aggregate the g(x_i)'s. The resultant embedding and z are then fed to decoder networks to generate the new sample.
Learning each object by decomposing it into smaller parts and relations leverages human knowledge to perform the decomposition. Compared with the other types of generative modeling methods discussed here, parts and relations are more interpretable. However, human knowledge carries a strong bias, which may not suit the given data set; besides, it can be hard or expensive to obtain, putting a strict restriction on application scenarios. In contrast, models learned for super classes can aggregate information from many related classes and act as a general model for a new class. However, they may not be optimal, as no class-specific information is utilized. Methods with latent variables are more efficient and rely less on human knowledge. However, as the exact meaning of the latent variables is unknown, methods of this kind are harder to understand.
In sum, all methods in this section design H based on prior knowledge in the experience E to constrain the complexity of H and reduce its sample complexity.
Multitask learning methods constrain the hypothesis space of the few-shot task by a set of jointly learned tasks. The tasks communicate with each other and improve together along the optimization process. They also implicitly augment the data, as some of the parameters are jointly learned by multiple tasks. However, the target task must be one of the jointly trained tasks; hence for each new task, one has to train from scratch, which can be costly and slow. This is not suitable for tasks which have only one shot or prefer fast inference.
Embedding learning methods learn to embed samples into a smaller embedding space, where similar and dissimilar pairs can be easily identified; therefore, the hypothesis space is constrained. Most works learn the embedding from large-scale data sets to capture task-invariant information, and can additionally absorb the task specialty of new tasks. Once learned, most methods generalize to new tasks by a forward pass followed by nearest neighbor search among the embedded samples. However, how to mix the invariant and specific information of tasks in a principled way is unclear.
Learning with external memory methods refine and re-interpret each sample by the D_train stored in memory, consequently reshaping the hypothesis space. By explicitly storing D_train in memory, they avoid laborious re-training to adapt to D_train, and the task-specific information is effectively used and not easily forgotten. However, learning with an external memory incurs additional space and computational cost, which grows with the memory size. Therefore, current external memories have a limited size and consequently cannot memorize much information.
Generative modeling methods learn the prior probability from prior knowledge, which shapes the form of the distribution. They have good interpretability, causality and compositionality (Lake et al., 2015). By learning the joint distribution, they can deal with broader types of tasks such as generation and reconstruction, and the learned generative models can produce many samples for data augmentation. However, generative modeling methods typically have a high computational cost and are more difficult to derive than other models. For computational feasibility, they require severe simplifications of the structure, which lead to inaccurate approximations.
Algorithm is the strategy used to search the hypothesis space H for the parameter θ of the best hypothesis h. For example, stochastic gradient descent (SGD) and its variants (Bottou and Bousquet, 2008; Bottou et al., 2018) are a popular strategy to search in H. In SGD, θ is updated through a sequence of updates. At the t-th iteration, given a training sample (x_t, y_t), let g_t = ∇_θ ℓ(h(x_t; θ_t), y_t); then θ is updated by

θ_{t+1} = θ_t − η g_t,

where η is the step size to be tuned. When supervised information is rich, there are enough training samples to update θ to arbitrary precision and to find an appropriate η through cross-validation. However, the provided few-shot D_train is not enough to reach the required sample complexity; consequently, the obtained empirical risk minimizer is unreliable.
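The SGD update above can be sketched on a one-dimensional toy objective; the loss f(θ) = (θ − 3)² and the step size below are hypothetical.

```python
def sgd_step(theta, grad, eta):
    """One SGD update: theta_{t+1} = theta_t - eta * g_t."""
    return theta - eta * grad

# Minimize f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta = 0.0
for t in range(100):
    theta = sgd_step(theta, 2.0 * (theta - 3.0), eta=0.1)
```

With abundant data one can run as many such updates as needed and tune the step size by cross-validation; a few-shot D_train affords neither.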
Methods in this section do not restrict the shape of H, so common models such as CNNs and RNNs can still be used. Instead, they take advantage of prior knowledge to alter the search for the θ that parameterizes the best hypothesis in H, thereby solving the FSL problem. In terms of how the search strategy is affected by prior knowledge, we classify methods in this section into three kinds (Table 7):
Refine existing parameters θ0. An initial θ0 learned from other tasks is used to initialize the search, and is then refined by D_train.
Refine meta-learned θ0. A meta-learner is learned from a set of tasks drawn from the same task distribution as the few-shot task to output a general θ0; each learner then refines the θ0 provided by the meta-learner using D_train.
Learn the search steps. This strategy learns a meta-learner that outputs search steps or update rules to guide each learner directly. Instead of learning a better initialization, it alters the search steps themselves, such as the direction or step size.
| strategy | prior knowledge | how to search the θ of the h* in H |
| --- | --- | --- |
| refine existing parameters | learned θ0 | refine θ0 by D_train |
| refine meta-learned θ0 | meta-learner | refine the meta-learned θ0 by D_train |
| learn search steps | meta-learner | use the search steps provided by the meta-learner |
5.1. Refine Existing Parameters
This strategy takes θ0 from a pre-trained model as a good initialization, and adapts it to θ using D_train. The assumption is that θ0 captures general structures learned from large-scale data, and can therefore be adapted within a few iterations to work well on D_train.
5.1.1. Fine-tune with Regularization
This strategy fine-tunes the given θ0 with some regularization. An illustration of this strategy is shown in Figure 9.
Fine-tuning is popularly used in practice: it adapts the parameter of a (deep) model trained on large-scale data such as ImageNet to smaller data sets through back-propagation (Donahue et al., 2014). The single θ0 which contains generic knowledge is usually the parameter of a deep model, which parameterizes a huge and complicated H. Given the few-shot D_train, simply fine-tuning θ0 by gradient descent leads to overfitting. How to adapt θ0 without overfitting to the limited D_train is the key design issue.
In this section, methods fine-tune θ0 with regularization to prevent overfitting. They can be grouped as follows.
Early-stopping is used in (Arik et al., 2018). However, it requires a separate validation set split from D_train to monitor training, which further reduces the number of samples available for training. Moreover, using such a small validation set makes the search strategy highly biased.
Selectively updating θ0 refers to updating only a small portion of θ0 so as to avoid overfitting. Keshari et al. (2018) use a set of fixed filters, and only learn a strength parameter that scales the elements within the filters by fitting D_train. Given a pre-trained CNN, both Qiao et al. (2018) and Qi et al. (2018) directly add the weights for each new class in D_train as new columns in the weight matrix of the final layer, while leaving the pre-trained weights unchanged.
Clustering θ0 into groups and updating each group with the same update information can largely constrain the search strategy. Yoo et al. (2018) use auxiliary data to group the filters of a pre-trained CNN, and fine-tune the CNN by group-wise back-propagation using D_train.
Model regression networks (Wang and Hebert, 2016b) assume there exists a task-agnostic transformation from the parameter trained on a few examples to the parameter trained on many samples. Wang and Hebert (2016b) then refine the θ0 learned in a fixed N-way-K-shot setting. Similarly, Kozerawski and Turk (2018) learn to transform the embedding of a single example into a classification decision boundary.
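The "selectively update" idea (in the style of the new-column construction of Qi et al. (2018)) can be sketched as follows: pre-trained classifier columns are frozen, and the L2-normalized mean embedding of the few-shot samples is appended as the novel class's column. The toy embeddings and the identity-matrix stand-in for the pre-trained classifier are invented for illustration.

```python
import numpy as np

def imprint_new_class(W, few_shot_embeddings):
    """Append a classifier column for a novel class without touching
    the pre-trained columns in W (d x C): the new column is the
    L2-normalized mean embedding of the few-shot samples."""
    w_new = few_shot_embeddings.mean(axis=0)
    w_new = w_new / np.linalg.norm(w_new)
    return np.concatenate([W, w_new[:, None]], axis=1)  # d x (C+1)

def classify(W, x):
    """Cosine-similarity classifier: pick the column closest to x."""
    return int(np.argmax((x / np.linalg.norm(x)) @ W))

rng = np.random.default_rng(1)
W = np.eye(8)[:, :3]                       # toy "pre-trained" classifier columns
few_shot = rng.normal(size=(5, 8)) + 3.0   # toy novel-class embeddings
W_new = imprint_new_class(W, few_shot)
```

Only one vector is fit to D_train, so the number of adapted parameters stays small, which is exactly why this family of methods resists overfitting.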
5.1.2. Aggregate a Set of θ0’s
Usually, we do not have a suitable θ0 to fine-tune. Instead, we may have many model parameters θ0’s learned from related tasks; for example, the task may be face recognition, while only recognition models for the eyes, nose and ears are available. Therefore, one can pick from these θ0’s the relevant ones and aggregate them into a suitable initialization to be adapted by D_train. An illustration of this strategy is shown in Figure 10.
The parameters θ0’s here are usually pre-trained on other data sets. According to which data sets are used, existing methods can be grouped as follows.
Similar data sets. Bart and Ullman (2005) classify a new class given one image by image fragments. The classifier for the new class is built by replacing features of already-learned classes with similar features taken from the novel class while reusing their classifier parameters. Only the classification threshold is adjusted to avoid confusion with those similar classes. Similar to (Qi et al., 2018) and (Qiao et al., 2018), a pre-trained CNN is adapted to deal with a new class in (Gidaris and Komodakis, 2018). However, instead of solely using the embedding of D_train as the classifier parameter, which is highly biased, it constructs the classifier for the new class as a linear combination of the embedding of D_train and a tentative classifier built by attending to the other classes' classifier parameters.
Unlabeled data sets. θ0’s learned from an unlabeled data set can also be discriminative enough to separate samples. Pseudo-labels are iteratively assigned and adjusted for samples in the unlabeled data set so as to learn decision boundaries (Wang and Hebert, 2016a). These learned decision boundaries are then incorporated into a pre-trained CNN. Note that in a pre-trained CNN, the higher the embedding layers, the more class-specific they are. By learning to separate the unlabeled data set, the generality of the embedding of the last layers is improved. Specifically, given a pre-trained CNN, Wang and Hebert (2016a) add a special layer in front of the fully connected layer for classification, fix the rest of the pre-trained parameters, and then learn new parameters that separate the unlabeled data set well. To solve a new task, one only needs to learn the final linear classification layer, reusing the remaining parts of the CNN.
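The aggregation idea of Gidaris and Komodakis (2018) can be sketched roughly: the novel-class classifier mixes the few-shot embedding with an attention-weighted combination of the base classes' classifier parameters. The dimensions, the toy base classifiers, and the mixing coefficient alpha are illustrative assumptions, not the paper's learned values.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                       # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

def novel_classifier(W_base, z_novel, alpha=0.5):
    """Build a classifier vector for a novel class by mixing
    (i) the normalized few-shot mean embedding z_novel and
    (ii) an attention-weighted combination of the base classifier
    columns in W_base (d x C); alpha balances the two parts."""
    z = z_novel / np.linalg.norm(z_novel)
    att = softmax(W_base.T @ z)           # attend to similar base classes
    w = alpha * z + (1 - alpha) * (W_base @ att)
    return w / np.linalg.norm(w)

W_base = np.eye(4)[:, :3]                 # toy base-class classifiers
w_novel = novel_classifier(W_base, np.array([2.0, 0.1, 0.1, 0.5]))
```

Attending to similar base classes lets the novel classifier reuse structure learned from many samples, instead of relying only on the biased few-shot embedding.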
5.1.3. Fine-tune with New Parameters
The pre-trained θ0 may not perfectly suit the structure of the new FSL task. Hence, we need additional new parameters to account for the specialty of D_train. Specifically, this strategy fine-tunes θ0 while learning the new parameters, so that the model parameters to learn become the union of both. An illustration of this strategy is shown in Figure 11.
Hoffman et al. (2013) use the parameters of the lower layers of a pre-trained CNN for feature embedding, and learn a linear classifier on top of it using D_train. Considering a font style transfer task, Azadi et al. (2018) pre-train a network to capture the fonts of gray-scale images, and fine-tune it together with training a network for generating stylish colored fonts.
Methods discussed in this section reduce the effort of searching for an architecture for H from scratch. Since direct fine-tuning easily overfits, methods that fine-tune θ0 with regularization regularize or modify the existing parameters. They usually consider a single θ0 of some deep model. However, suitable existing parameters are not always easy to find. Another way is to aggregate a set of parameters θ0’s from related tasks into a suitable initialization. However, one must make sure that the knowledge embedded in these existing parameters is useful to the current task. Besides, it is costly to search over a large set of existing parameters to find the relevant ones. Fine-tuning with new parameters offers more flexibility. However, given the few-shot D_train, one can only add a limited number of parameters, otherwise overfitting may occur.
5.2. Refine Meta-learned θ0
Methods falling in the following sections are all meta-learning methods (see Appendix A for a formal definition of meta-learning). Instead of working towards the unreliable empirical risk minimizer, this strategy directly targets the task-specific parameter. In the following, we denote the parameter of the meta-learner as θ0, and the task-specific parameter as φ_s for meta-training task T_s and φ_t for meta-testing task T_t. During training, the meta-learner (optimizer) parameterized by θ0 provides information to the learner (optimizee) with parameter φ_s for task T_s, and the learner returns error signals such as gradients to the meta-learner to improve it. Then, given a meta-testing task T_t with its D_train, the meta-learner can be used directly, while the learner learns φ_t from D_train. We mainly use the meta-testing task T_t for illustration. An illustration of this strategy is shown in Figure 12.
5.2.1. Refine by Gradient Descent
This strategy refines the meta-learned θ0 by gradient descent. Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) is a representative method of this kind. It meta-learns θ0 as a good initialization for task T_t. The θ0 can be adjusted effectively through a few gradient descent steps using D_train to obtain a good task-specific φ_t. Mathematically, this is done by φ_t = θ0 − α ∇_{θ0} Σ_{(x_i, y_i) ∈ D_train} ℓ(h(x_i; θ0), y_i), where α is a fixed step size to be chosen. By summing over all samples, this provides a permutation-invariant update. Then, the meta-learner updates θ0 through the averaged gradient steps across all meta-training tasks, θ0 ← θ0 − β ∇_{θ0} Σ_s ℓ_{T_s}(φ_s), where ℓ_{T_s}(φ_s) is the loss of task T_s at the adapted φ_s, and β is also a fixed step size.
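The MAML inner/outer loop can be illustrated on toy one-dimensional quadratic task losses l_c(θ) = 0.5·(θ − c)², for which all gradients are analytic; the tasks, step sizes, and iteration counts below are invented for the sketch (here the meta-learned initialization converges to the mean of the task optima).

```python
import numpy as np

def inner_update(theta, c, alpha=0.4):
    """One inner gradient step on the task loss l_c(theta) = 0.5*(theta - c)**2,
    whose gradient is (theta - c)."""
    return theta - alpha * (theta - c)

def maml(cs, theta=0.0, alpha=0.4, beta=0.1, steps=200):
    """Meta-learn an initialization theta over tasks indexed by c in cs.
    The outer loss is the post-adaptation loss averaged over tasks:
    sum_c 0.5*(phi_c - c)**2 with phi_c = inner_update(theta, c)."""
    for _ in range(steps):
        # Chain rule: d(phi_c)/d(theta) = 1 - alpha for this quadratic loss.
        grad = np.mean([(inner_update(theta, c, alpha) - c) * (1 - alpha)
                        for c in cs])
        theta -= beta * grad
    return theta

theta0 = maml([1.0, 3.0])   # meta-learned initialization
```

A single inner step from theta0 then moves toward either task optimum, which is the behavior MAML optimizes for.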
MAML provides the same initialization for all tasks, neglecting task-specific information. This only suits a set of very similar tasks, and performs badly when the tasks are distinct. In (Lee and Choi, 2018), the method learns to choose, from a subset of the meta-learned parameters, an initialization for the new task T_t. In other words, it meta-learns a task-specific subspace and metric in which the learner performs gradient descent. Therefore, different initializations are provided for different tasks.
As refining θ0 by gradient descent may not be reliable, regularization is used to correct the descent direction. In (Gui et al., 2018), the adapted parameter is further refined by model regression networks (Wang and Hebert, 2016b): it is regularized to be closer to the model trained with many samples. The parameters of the model regression networks are learned by gradient descent like θ0.
5.2.2. Refine in Consideration of Uncertainty
Learning with a few examples inevitably results in a model with high uncertainty (Finn et al., 2018). Can the learned model predict for a new task with high confidence? Will the model improve with more samples? The ability to measure this uncertainty provides a signal for active learning or further data collection (Finn et al., 2018).
There are three kinds of uncertainty considered so far.
Uncertainty over the shared parameter θ0. A single θ0 may not be a good initialization for all tasks. Therefore, by modeling the posterior distribution of θ0, one can sample appropriate initializations for different tasks. Finn et al. (2018) propose modeling the prior distribution of θ0, which is solved by a MAP point estimate. Instead, Yoon et al. (2018) learn the prior distribution of θ0 by Stein Variational Gradient Descent (SVGD). They learn a set of copies of θ0 with shared parameters; these copies communicate with each other to decide the update direction. With the learned copies, φ_t is obtained by taking a few SVGD steps.
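The idea of keeping a distribution over the shared initialization θ0 can be caricatured by fitting a diagonal Gaussian to per-task solutions and sampling candidate initializations from it; this is a much cruder scheme than the MAP estimate of Finn et al. (2018) or the SVGD particles of Yoon et al. (2018), and all names and numbers below are illustrative.

```python
import numpy as np

def fit_prior(task_solutions):
    """Fit a diagonal Gaussian over theta0 from per-task solutions
    (a crude stand-in for a learned prior; not SVGD itself)."""
    sols = np.asarray(task_solutions, dtype=float)
    return sols.mean(axis=0), sols.std(axis=0) + 1e-6  # mean, std per dim

def sample_inits(mu, sigma, n, seed=0):
    """Sample n candidate initializations theta0 for a new task."""
    rng = np.random.default_rng(seed)
    return rng.normal(mu, sigma, size=(n, mu.shape[0]))

mu, sigma = fit_prior([[1.0, 2.0], [3.0, 4.0], [2.0, 3.0]])
inits = sample_inits(mu, sigma, n=5)
```

The spread of the sampled initializations is exactly the uncertainty signal the section discusses: a wide sigma indicates that no single θ0 serves all tasks well.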
Uncertainty over the task-specific parameter φ_t. Each φ_t is a point estimate of the posterior over the task-specific parameter. However, as D_train contains only a few examples, the learned posterior can be skewed. Therefore, Grant et al. (2018) improve MAML by replacing this point estimate of the posterior with a Laplace approximation. Ravi and Beatson (2019) do not use θ0 as the initialization directly. Instead, they learn an inference network that maps D_train to φ_t. As in MAML, this inference network is optimized by taking a few gradient descent steps using D_train so as to adapt to T_t. Finally, φ_t is used as the parameter of the variational distribution which approximates the posterior over the task-specific parameter.
Uncertainty over class-specific parameters. Finally, class-specific uncertainty is modeled in (Rusu et al., 2019). It first maps the samples of each class in D_train to the parameters of a class-conditional multivariate Gaussian in a lower-dimensional latent space, so as to sample a class-dependent initialization. It then uses an encoder to embed the samples of each class into this latent space, where gradient descent is efficiently taken with respect to the low-dimensional latent code.