1 Introduction
Supervised end-to-end learning has been extremely successful in computer vision, speech, and machine translation tasks, thanks to improvements in optimization technology, larger datasets and streamlined designs of deep convolutional or recurrent architectures. Despite these successes, this learning setup does not cover many settings where learning is nonetheless possible and desirable.
One such instance is the ability to learn from few examples, in so-called few-shot learning tasks. Rather than relying on regularization to compensate for the lack of data, researchers have explored ways to leverage a distribution of similar tasks, inspired by human learning Lake et al. (2015). This defines a new supervised learning setup (also called 'meta-learning') in which the input-output pairs are no longer given by i.i.d. samples of images and their associated labels, but by i.i.d. samples of collections of images and their associated label similarity.
A recent and highly successful research program has exploited this meta-learning paradigm on the few-shot image classification task Lake et al. (2015); Koch et al. (2015); Vinyals et al. (2016); Mishra et al. (2017); Snell et al. (2017). In essence, these works learn a contextual, task-specific similarity measure that first embeds input images using a CNN, and then learns how to combine the embedded images in the collection to propagate the label information towards the target image.
In particular, Vinyals et al. (2016) cast the few-shot learning problem as a supervised classification task mapping a support set of images into the desired label, and developed an end-to-end architecture accepting those support sets as input via attention mechanisms. In this work, we build upon this line of work and argue that the task is naturally expressed as a supervised interpolation problem on a graph, where nodes are associated with the images in the collection and edges are given by a trainable similarity kernel. Leveraging recent progress on representation learning for graph-structured data
Bronstein et al. (2017); Gilmer et al. (2017), we propose a simple graph-based few-shot learning model that implements a task-driven message-passing algorithm. The resulting architecture is trained end-to-end, captures the invariances of the task, such as permutations within the input collections, and offers a good trade-off between simplicity, generality, performance and sample complexity.

Besides few-shot learning, two related tasks are the ability to learn from a mixture of labeled and unlabeled examples (semi-supervised learning), and active learning, in which the learner has the option to request those missing labels that will be most helpful for the prediction task. Our graph-based architecture extends naturally to these setups with minimal changes in the training design. We validate the model experimentally on few-shot image classification, matching state-of-the-art performance with considerably fewer parameters, and demonstrate applications to semi-supervised and active learning setups.

Our contributions are summarized as follows:

We cast few-shot learning as a supervised message-passing task which is trained end-to-end using graph neural networks.

We match state-of-the-art performance on Omniglot and MiniImagenet tasks with fewer parameters.

We extend the model to the semi-supervised and active learning regimes.
2 Related Work
One-shot learning was first introduced by Fei-Fei et al. (2006), who assumed that knowledge of currently learned classes can help to make predictions on new ones when just one or a few labels are available. More recently, Lake et al. (2015) presented a Hierarchical Bayesian model that reached human-level error on few-shot alphabet recognition tasks.
Since then, great progress has been made in one-shot learning. Koch et al. (2015) presented a deep-learning model based on computing the pairwise distance between samples using Siamese Networks; this learned distance can then be used to solve one-shot problems by k-nearest-neighbors classification.
Vinyals et al. (2016) presented an end-to-end trainable k-nearest-neighbors method using the cosine distance; they also introduced a contextual mechanism, an attention LSTM, that takes into account all the samples of the subset when computing the pairwise distance between samples. Snell et al. (2017) extended the work of Vinyals et al. (2016) by using the Euclidean distance instead of the cosine distance, which provided significant improvements; they also build a prototype representation of each class for the few-shot learning scenario. Mehrotra & Dukkipati (2017) trained a deep residual network together with a generative model to approximate the pairwise distance between samples.

A new line of meta-learners for one-shot learning has been rising lately. Ravi & Larochelle (2016) introduced a meta-learning method where an LSTM updates the weights of a classifier for a given episode. Munkhdalai & Yu (2017) also presented a meta-learning architecture that learns meta-level knowledge across tasks and changes its inductive bias via fast parametrization. Finn et al. (2017) use a model-agnostic meta-learner based on gradient descent: the goal is to train a classification model such that, given a new task, a small number of gradient steps with few data will be enough to generalize. Lately, Mishra et al. (2017) used Temporal Convolutions, deep recurrent networks based on dilated convolutions; this method also exploits contextual information from the subset, providing very good results.

Another related area of research concerns deep learning architectures on graph-structured data. The GNN was first proposed in Gori et al. (2005); Scarselli et al. (2009) as a trainable recurrent message-passing scheme whose fixed points could be adjusted discriminatively. Subsequent works Li et al. (2015); Sukhbaatar et al. (2016) have relaxed the model by untying the recurrent layer weights and proposed several non-linear updates through gating mechanisms. Graph neural networks are in fact natural generalizations of convolutional networks to non-Euclidean graphs. Bruna et al. (2013); Henaff et al. (2015) proposed to learn smooth spectral multipliers of the graph Laplacian, albeit with high computational cost, and Defferrard et al. (2016); Kipf & Welling (2016)
resolved the computational bottleneck by learning polynomials of the graph Laplacian, thus avoiding the computation of eigenvectors and completing the connection with GNNs. In particular, Kipf & Welling (2016) were the first to propose the use of GNNs on semi-supervised classification problems. We refer the reader to Bronstein et al. (2017) for an exhaustive literature review on the topic. GNNs and the analogous Neural Message Passing Models are finding application in many different domains. Battaglia et al. (2016); Chang et al. (2016) develop graph interaction networks that learn pairwise particle interactions and apply them to discrete-particle physical dynamics. Duvenaud et al. (2015); Kearnes et al. (2016) study molecular fingerprints using variants of the GNN architecture, and Gilmer et al. (2017) further develop the model by combining it with set representations Vinyals et al. (2015), showing state-of-the-art results on molecular prediction.
3 Problem Setup
We describe first the general setup and notations, and then particularize it to the case of fewshot learning, semisupervised learning and active learning.
We consider input-output pairs (𝒯_i, Y_i)_i drawn i.i.d. from a distribution P of partially-labeled image collections:

(1)  𝒯 = ( {(x₁, l₁), …, (x_s, l_s)}, {x̃₁, …, x̃_r}, {x̄₁, …, x̄_t} ),   Y = (y₁, …, y_t) ∈ {1, …, K}^t,

for arbitrary values of s, r and t, where s is the number of labeled samples, r is the number of unlabeled samples (r > 0 in the semi-supervised and active learning scenarios), t is the number of samples to classify, and K is the number of classes. We focus on the case where just one sample is classified per task (t = 1). Each label l ∈ {1, …, K} denotes a class-specific image distribution p_l(x) over ℝ^N from which the corresponding images are drawn. In our context, the targets Y are associated with the image categories of the t designated images x̄_j with no observed label. Given a training set {(𝒯_i, Y_i)}_{i ≤ L}, we consider the standard supervised learning objective

min_Θ (1/L) Σ_{i ≤ L} ℓ( Φ(𝒯_i; Θ), Y_i ) + ℛ(Θ),

where Φ(𝒯; Θ) is the model specified in Section 4 and ℛ is a standard regularization objective.
FewShot Learning
When r = 0 and t = 1, there is a single image in the collection with unknown label. If moreover each label appears exactly q times, so that s = qK, this setting is referred to as q-shot, K-way learning.
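Concretely, sampling one such q-shot, K-way task can be sketched as follows (a toy sketch with placeholder class and image identifiers, not the authors' pipeline):

```python
import random

def sample_episode(dataset, K=5, q=1, seed=None):
    """Sample one q-shot, K-way task: q labeled images for each of K
    classes, plus a single query image whose label must be predicted."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(dataset), K)        # K random classes
    query_class = rng.choice(classes)
    support, query = [], None
    for c in classes:
        n = q + 1 if c == query_class else q
        images = rng.sample(dataset[c], n)
        if c == query_class:
            query = (images.pop(), c)               # held-out query image
        support += [(img, c) for img in images]
    return support, query

# Toy dataset: 10 classes with 20 "images" (here just string ids) each.
toy = {f"class_{i}": [f"img_{i}_{j}" for j in range(20)] for i in range(10)}
support, (query_img, query_label) = sample_episode(toy, K=5, q=1, seed=0)
```

For q = 1, K = 5 this yields 5 support pairs (one per class) and one query drawn from one of those same classes.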
SemiSupervised Learning
When t = 1 and r > 0, the input collection contains auxiliary images that the model can use to improve prediction accuracy, by leveraging the fact that these samples are drawn from the same class distributions as those determining the output.
Active Learning
In the active learning setting, the learner has the ability to request labels from the sub-collection {x̃₁, …, x̃_r}. We are interested in studying to what extent this active learning can improve performance with respect to the previous semi-supervised setup, and whether it can match the performance of the few-shot setting in which the requested labels are known from the start.
4 Model
This section presents our approach, based on a simple end-to-end graph neural network architecture. We first explain how the input context is mapped into a graphical representation, then detail the architecture, and finally show how this model generalizes a number of previously published few-shot learning architectures.
4.1 Set and Graph Input Representations
The input contains a collection of images, both labeled and unlabeled. The goal of few-shot learning is to propagate label information from labeled samples towards the unlabeled query image. This propagation of information can be formalized as posterior inference over a graphical model determined by the input images and labels.
Following several recent works that cast posterior inference as message passing with neural networks defined over graphs Scarselli et al. (2009); Duvenaud et al. (2015); Gilmer et al. (2017), we associate 𝒯 with a fully-connected graph G_𝒯 = (V, E) whose nodes correspond to the images present in 𝒯 (both labeled and unlabeled). In this context, the setup does not specify a fixed similarity between pairs of images, suggesting an approach where this similarity measure is learnt in a discriminative fashion with a parametric model, similarly to Gilmer et al. (2017), such as a Siamese neural architecture. This framework is closely related to the set representation from Vinyals et al. (2016), but extends the inference mechanism using the graph neural network formalism that we detail next.
4.2 Graph Neural Networks
Graph Neural Networks, introduced in Gori et al. (2005); Scarselli et al. (2009) and further simplified in Li et al. (2015); Duvenaud et al. (2015); Sukhbaatar et al. (2016), are neural networks based on local operators of a graph G = (V, E), offering a powerful balance between expressivity and sample complexity; see Bronstein et al. (2017) for a recent survey of models and applications of deep learning on graphs.
In its simplest incarnation, given an input signal x ∈ ℝ^{V×d} on the vertices of a weighted graph G, we consider a family 𝒜 of graph-intrinsic linear operators that act locally on this signal. The simplest is the adjacency operator A, where

(A x)_i := Σ_{j ∼ i} w_{i,j} x_j,

with j ∼ i iff (i, j) ∈ E and w_{i,j} its associated weight. A GNN layer Gc(·) receives as input a signal x^{(k)} ∈ ℝ^{V×d_k} and produces x^{(k+1)} ∈ ℝ^{V×d_{k+1}} as

(2)  x^{(k+1)} = Gc(x^{(k)}) = ρ( Σ_{B ∈ 𝒜} B x^{(k)} θ^{(k)}_B ),

where the θ^{(k)}_B ∈ ℝ^{d_k × d_{k+1}}, B ∈ 𝒜, are trainable parameters and ρ(·) is a point-wise non-linearity, chosen in this work to be a 'leaky' ReLU Xu et al. (2015).

Authors have explored several modeling variants of this basic formulation, replacing the point-wise non-linearity with gating operations Duvenaud et al. (2015), generalizing the generator family 𝒜 to Laplacian polynomials Defferrard et al. (2016); Kipf & Welling (2016); Bruna et al. (2013), or including powers of the adjacency in 𝒜 to encode multi-hop neighborhoods of each node Bruna & Li (2017). Cascaded operations of the form (2) are able to approximate a wide range of graph inference tasks. In particular, inspired by message-passing algorithms, Kearnes et al. (2016); Gilmer et al. (2017) generalized the GNN to also learn edge features Ã^{(k)} from the current node hidden representation:
(3)  Ã^{(k)}_{i,j} = φ_θ̃( x^{(k)}_i, x^{(k)}_j ),

where φ_θ̃ is a symmetric function parametrized with, e.g., a neural network. In this work, we consider a multilayer perceptron stacked after the absolute difference between two node vectors; see eq. (4):

(4)  φ_θ̃( x^{(k)}_i, x^{(k)}_j ) = MLP_θ̃( abs( x^{(k)}_i − x^{(k)}_j ) ).

Ã^{(k)}_{i,j} is then a learned metric, obtained as a non-linear combination of the absolute differences between the individual features of two nodes. With this construction, the symmetry property of a metric is fulfilled by construction, and the identity property is easily learned.
The trainable adjacency Ã^{(k)} is then normalized to a stochastic kernel by applying a softmax along each row. The resulting update rules for the node features are obtained by adding the edge feature kernel to the generator family 𝒜 and applying (2). Adjacency learning is particularly important in applications where the input set is believed to have some geometric structure but the metric is not known a priori, as in our case.
In general graphs, the network depth is chosen to be of the order of the graph diameter, so that all nodes obtain information from the entire graph. In our context, however, since the graph is densely connected, the depth is interpreted simply as giving the model more expressive power.
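Putting the pieces together, the learned edge kernel of eq. (4), its row-wise softmax normalization, and the node update of eq. (2) can be sketched in NumPy as follows; the dimensions and random weights are illustrative stand-ins for the trained parameters:

```python
import numpy as np

def edge_mlp(xi, xj, W1, w2):
    # phi(x_i, x_j): a small MLP on the absolute feature difference,
    # producing one scalar; symmetric in (x_i, x_j) by construction.
    h = np.maximum(np.abs(xi - xj) @ W1, 0.0)   # hidden layer, ReLU
    return float(h @ w2)

def gnn_layer(X, theta_A, theta_I, W1, w2, alpha=0.2):
    """One message-passing step: learned adjacency (softmax-normalized
    rows) plus the identity operator, followed by a leaky ReLU."""
    V = X.shape[0]
    logits = np.array([[edge_mlp(X[i], X[j], W1, w2) for j in range(V)]
                       for i in range(V)])
    A = np.exp(logits - logits.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)           # row-stochastic kernel
    Z = A @ X @ theta_A + X @ theta_I           # generator family {A, 1}
    return np.where(Z > 0, Z, alpha * Z)        # leaky ReLU

rng = np.random.default_rng(0)
V, d_in, d_hid, d_out = 6, 8, 16, 10
X = rng.normal(size=(V, d_in))
W1, w2 = rng.normal(size=(d_in, d_hid)), rng.normal(size=(d_hid,))
thA, thI = rng.normal(size=(d_in, d_out)), rng.normal(size=(d_in, d_out))
out = gnn_layer(X, thA, thI, W1, w2)
```

Stacking several such layers, each with its own adjacency module, gives the 3-block architecture used in the experiments.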
Construction of Initial Node Features
The input collection 𝒯 is mapped into node features as follows. For images x_i with known label l_i, the one-hot encoding of the label is concatenated with the embedding features of the image at the input of the GNN:

(5)  x^{(0)}_i = ( φ(x_i), h(l_i) ),

where φ is a convolutional neural network and h(l) is a one-hot encoding of the label. Architectural details of φ are given in Sections 6.1.1 and 6.1.2. For images with unknown label (the x̃_j and x̄_j), we modify the previous construction to account for full uncertainty about the label variable by replacing h(l) with the uniform distribution over the K-simplex: h̃ = (1/K, …, 1/K).
4.3 Relationship with Existing Models
The graph neural network formulation of few-shot learning generalizes a number of recent models proposed in the literature.
Siamese Networks
Siamese Networks Koch et al. (2015) can be interpreted as a single-layer message-passing iteration of our model: they use the same initial node embedding (5), a non-trainable edge feature given by the distance between embeddings, and a label estimate obtained by a softmax-weighted vote of the support labels under that kernel, read off the label field of the node features. In this model, learning reduces to learning image embeddings whose Euclidean metric is consistent with the label similarities.
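Under this reading, one-shot prediction with a fixed Siamese kernel reduces to a soft nearest-neighbour vote in embedding space. A NumPy sketch with random stand-in embeddings (the class prototypes and query below are synthetic, not learned):

```python
import numpy as np

def siamese_predict(query_emb, support_embs, support_onehot):
    """Propagate labels with a fixed (non-trainable) kernel on embedding
    distances: softmax(-dist) weights a vote over the support labels."""
    d = np.linalg.norm(support_embs - query_emb, axis=1)
    s = -d                                   # closer support -> higher score
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ support_onehot                # distribution over the K classes

rng = np.random.default_rng(1)
K, dim = 5, 32
support = rng.normal(size=(K, dim))          # one support image per class
labels = np.eye(K)                           # one-hot support labels
query = support[2] + 0.05 * rng.normal(size=dim)   # query near class 2
probs = siamese_predict(query, support, labels)
```

Because the kernel is fixed, all of the model capacity lives in the embedding function that produced `support` and `query`.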
Prototypical Networks
Prototypical networks Snell et al. (2017) evolve Siamese networks by aggregating information within each cluster determined by nodes sharing the same label. This operation can also be accomplished with a GNN as follows. We consider the kernel A_{i,j} = 1/q if l_i = l_j and 0 otherwise, where q is the number of examples per class, and keep the edge features defined as in the Siamese networks. Applying this kernel to the node features averages the embeddings within each class, yielding the class prototypes.
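The intra-class averaging can be checked with such a fixed kernel applied to toy node features (the features below are synthetic placeholders for learned embeddings):

```python
import numpy as np

def prototype_kernel(labels, q):
    """A[i, j] = 1/q if l_i == l_j else 0: one matrix product replaces
    each node's features by its class prototype (the class mean)."""
    same = (labels[:, None] == labels[None, :]).astype(float)
    return same / q

labels = np.array([0, 0, 1, 1, 2, 2])          # q = 2 examples per class
X = np.arange(12, dtype=float).reshape(6, 2)   # toy 2-d node features
P = prototype_kernel(labels, q=2) @ X          # rows become class means
```

Every row of `P` belonging to the same class is now identical, equal to that class's mean embedding.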
Matching Networks
Matching networks Vinyals et al. (2016) use a set representation for the ensemble of images in 𝒯, similarly to our proposed graph neural network model, but with two important differences. First, the attention mechanism considered in this set representation is akin to our edge feature learning, with the difference that the mechanism always attends to the same node embeddings, as opposed to our stacked adjacency learning, which is closer to Vaswani et al. (2017). In other words, instead of the learned kernel in (3), matching networks consider attention mechanisms computed from a fixed encoding of the elements of the support set, obtained with bidirectional LSTMs; the support set encoding is thus computed independently of the target image. Second, the label and image fields are treated separately throughout the model, with a final step that aggregates the labels linearly using a trained kernel. This may prevent the model from leveraging complex dependencies between labels and images at intermediate stages.
5 Training
We describe next how to train the parameters of the GNN in the different setups we consider: fewshot learning, semisupervised learning and active learning.
5.1 FewShot and SemiSupervised Learning
In this setup, the model is asked only to predict the label y corresponding to the image to classify x̄, associated with node * in the graph. The final layer of the GNN is thus a softmax mapping the node features to the K-simplex. We then consider the cross-entropy loss evaluated at node *:

ℓ( Φ(𝒯; Θ), Y ) = − Σ_{k=1..K} y_k log P( Y_* = y_k | 𝒯 ).

The semi-supervised setting is trained identically; the only difference is that the initial label fields of the nodes corresponding to {x̃₁, …, x̃_r} are filled with the uniform distribution.
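This read-out can be written in a few lines (a NumPy sketch with made-up logits; in the model the logits come from the final node features at the target node):

```python
import numpy as np

def node_cross_entropy(logits, y_true):
    """Cross-entropy at the single node to classify: a stabilized
    softmax maps the node's logits to the K-simplex, then we take the
    negative log-likelihood of the true class."""
    z = logits - logits.max()            # subtract max for stability
    p = np.exp(z) / np.exp(z).sum()      # point on the K-simplex
    return -np.log(p[y_true]), p

loss, p = node_cross_entropy(np.array([2.0, 0.5, -1.0, 0.1, 0.3]), y_true=0)
```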
5.2 Active Learning
In the Active Learning setup, the model has the intrinsic ability to query for one of the labels from {x̃₁, …, x̃_r}. The network learns to ask for the most informative label in order to classify the sample x̄. The query is performed after the first layer of the GNN, using a softmax attention over the unlabeled nodes of the graph. For this we apply a function g(·) that maps each unlabeled node vector to a scalar value; g is parametrized by a two-layer neural network, and a softmax is applied over the scalar values obtained after applying g:

Attn = Softmax( g(x̃^{(1)}) ).

In order to query only one sample, we set all elements of this attention vector to 0 except one. At test time we keep the maximum value; at train time we randomly sample one position according to its multinomial probability. We then multiply this sparse attention by the label vectors, obtaining the label of the queried node scaled by the attention weight w. This value is summed into the current node representation: since we use dense connections in our GNN model, we can add it directly where the uniform label distribution was concatenated. After the label has been summed into the node, the information is propagated forward through the remaining layers. This attention mechanism is trained end-to-end with the rest of the network by backpropagating the loss from the output of the GNN.
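The query step can be sketched as follows (NumPy, with a hypothetical two-layer scoring function g and random stand-in weights; at train time one index is sampled from the attention, at test time the maximum is kept):

```python
import numpy as np

def query_label(h_unlabeled, onehot_labels, W1, w2, train=False, rng=None):
    """Softmax attention over the unlabeled nodes; exactly one node is
    kept (sampled at train time, argmax at test time), and its label
    vector is returned scaled by the attention weight."""
    scores = np.maximum(h_unlabeled @ W1, 0.0) @ w2   # g: node -> scalar
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()                                # softmax attention
    idx = rng.choice(len(attn), p=attn) if train else int(attn.argmax())
    mask = np.zeros_like(attn)
    mask[idx] = attn[idx]                             # zero all but one
    return mask @ onehot_labels                       # w * h(l_idx)

rng = np.random.default_rng(2)
r, d, K = 4, 16, 5
h_unlabeled = rng.normal(size=(r, d))                 # post-layer-1 nodes
onehot_labels = np.eye(K)[rng.integers(0, K, size=r)]
queried = query_label(h_unlabeled, onehot_labels,
                      rng.normal(size=(d, 8)), rng.normal(size=(8,)))
```

Keeping the attention weight on the selected label (rather than a hard 1) is what lets the loss gradient flow back into g.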
6 Experiments
For the few-shot, semi-supervised and active learning experiments we used the Omniglot dataset presented by Lake et al. (2015) and the MiniImagenet dataset introduced by Vinyals et al. (2016), a small version of ILSVRC-12 Krizhevsky et al. (2012). All experiments are based on the q-shot, K-way setting, and we used the same values of q and K for both training and testing.
Code available at: https://github.com/vgsatorras/fewshotgnn
6.1 Datasets and Implementation
6.1.1 Omniglot
Dataset:
Omniglot is a dataset of 1623 characters from 50 different alphabets, where each character/class has been drawn by 20 different people. Following the implementation of Vinyals et al. (2016), we split the dataset into 1200 classes for training and the remaining 423 for testing. We augmented the dataset with rotations in multiples of 90 degrees, as proposed by Santoro et al. (2016).
Architectures:
Inspired by the embedding architecture from Vinyals et al. (2016) and following Mishra et al. (2017), a CNN was used as the embedding function, consisting of four stacked blocks of {3×3 convolutional layer with 64 filters, batch normalization, 2×2 max-pooling, leaky ReLU}; the output is passed through a fully connected layer resulting in a 64-dimensional embedding. For the GNN we used 3 blocks, each composed of 1) a module that computes the adjacency matrix and 2) a graph convolutional layer. A more detailed description of each block can be found in Figure 3.
6.1.2 MiniImagenet
Dataset:
MiniImagenet is a more challenging dataset for one-shot learning proposed by Vinyals et al. (2016), derived from the original ILSVRC-12 dataset Krizhevsky et al. (2012). It consists of 84×84 RGB images from 100 different classes with 600 samples per class. It was created to increase the complexity of one-shot tasks while keeping the simplicity of a light-sized dataset suitable for fast prototyping. We used the splits proposed by Ravi & Larochelle (2016): 64 classes for training, 16 for validation and 20 for testing, with the 16 validation classes used only for early stopping and parameter tuning.
Architecture:
The embedding architecture used for MiniImagenet consists of 4 convolutional layers followed by a fully-connected layer, resulting in a 128-dimensional embedding. This light architecture is useful for fast prototyping:

1× 3×3 conv. layer (64 filters), batch normalization, max pool, leaky ReLU,
1× 3×3 conv. layer (96 filters), batch normalization, max pool, leaky ReLU,
1× 3×3 conv. layer (128 filters), batch normalization, max pool, leaky ReLU, dropout,
1× 3×3 conv. layer (256 filters), batch normalization, max pool, leaky ReLU, dropout,
1× fc layer (128 units), batch normalization.

The two dropout layers are useful to avoid overfitting the GNN on the MiniImagenet dataset.
The GNN architecture is similar to the one used for Omniglot: it is formed by 3 blocks, each block described in Figure 3.
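As a quick sanity check on these embeddings, the spatial size reaching the final fully-connected layer can be traced by hand. The sketch below assumes 28×28 Omniglot inputs, padded 3×3 convolutions, and a 2×2 max-pool in every block; these preprocessing details are assumptions, not stated above.

```python
def spatial_size_after_blocks(size, blocks=4):
    """Trace the spatial resolution through the embedding: a padded 3x3
    convolution keeps the size, and each 2x2 max-pool halves it
    (integer division models floor rounding)."""
    for _ in range(blocks):
        size //= 2
    return size

omniglot_out = spatial_size_after_blocks(28)   # 28 -> 14 -> 7 -> 3 -> 1
mini_out = spatial_size_after_blocks(84)       # 84 -> 42 -> 21 -> 10 -> 5
```

Under these assumptions, a 28×28 Omniglot character collapses to a single spatial position per channel, while an 84×84 MiniImagenet image keeps a 5×5 map before the fully-connected projection.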
6.2 FewShot
Few-shot learning experiments on Omniglot and MiniImagenet are presented in Table 1 and Table 2 respectively.
We evaluate our model by performing different q-shot, K-way experiments on both datasets. For every few-shot task 𝒯, we sample K random classes from the dataset, and from each class we sample q random samples. An extra sample to classify is chosen from one of those K classes.
Omniglot: The GNN method provides competitive results while remaining simpler than other methods. State-of-the-art results are reached in the 5-Way and 20-Way 1-shot experiments. In the 20-Way 1-shot setting the GNN provides slightly better results than Munkhdalai & Yu (2017) while still being a simpler approach. The TCML approach from Mishra et al. (2017) is in the same confidence interval for 3 out of 4 experiments, and slightly better for the 20-Way 5-shot, although the number of parameters is reduced from ~5M (TCML) to ~300K (3-layer GNN).

MiniImagenet: In Table 2 we also present a baseline, "Our metric learning + KNN", in which no information is aggregated among nodes: a k-nearest-neighbors classifier is applied on top of the pairwise learnable metric and trained end-to-end. This learnable metric is competitive by itself compared with other state-of-the-art methods. Even so, a significant improvement (from 64.02% to 66.41%) is obtained in the 5-shot 5-Way MiniImagenet setting when aggregating information among nodes with the full GNN architecture. A variety of embedding functions are used across papers for the MiniImagenet experiments; in our case we use a simple network of 4 conv. layers followed by a fully connected layer (Section 6.1.2), which allows us to compare "Our GNN" against "Our metric learning + KNN" and is useful for fast prototyping. More complex embeddings have been shown to produce better results: in Mishra et al. (2017) a deep residual network is used as the embedding network, increasing accuracy considerably. Regarding the TCML architecture on MiniImagenet, the number of parameters is reduced from ~11M (TCML) to ~400K (3-layer GNN).
Table 2: MiniImagenet few-shot results (5-Way).

Model | 1-shot | 5-shot
Matching Networks Vinyals et al. (2016) | 43.6% | 55.3%
Prototypical Networks Snell et al. (2017) | 46.61% ± 0.78% | 65.77% ± 0.70%
Model-Agnostic Meta-Learner Finn et al. (2017) | 48.70% ± 1.84% | 63.1% ± 0.92%
Meta Networks Munkhdalai & Yu (2017) | 49.21% ± 0.96% |
Ravi & Larochelle (2016) | 43.4% ± 0.77% | 60.2% ± 0.71%
TCML Mishra et al. (2017) | 55.71% ± 0.99% | 68.88%
Our metric learning + KNN | 49.44% ± 0.28% | 64.02% ± 0.51%
Our GNN | 50.33% ± 0.36% | 66.41% ± 0.63%
6.3 SemiSupervised
Semi-supervised experiments are performed in the 5-Way 5-shot setting. Results are presented for the cases where 20% and 40% of the samples are labeled. The labeled samples are balanced among classes in all experiments; in other words, all classes have the same number of labeled and unlabeled samples.

Two strategies are compared in Tables 3 and 4. "GNN - Trained only with labeled" is equivalent to the supervised few-shot setting; for example, in the 5-Way 5-shot 20%-labeled setting this method is equivalent to 5-Way 1-shot learning, since it ignores the unlabeled samples. "GNN - Semi-supervised" is the actual semi-supervised method; for example, in the 5-Way 5-shot 20%-labeled setting the GNN receives as input 1 labeled sample per class and 4 unlabeled samples per class.

Omniglot results are presented in Table 3. In this scenario we observe that the accuracy improvement from adding unlabeled images is similar to that from adding labels: the GNN extracts enough information from the input distribution of unlabeled samples that, using only 20% of the labels in the 5-shot semi-supervised setting, we obtain the same results as in the 40%-labeled supervised setting.

In the MiniImagenet experiments (Table 4) we also notice an improvement when using semi-supervised data, although it is not as significant as on Omniglot: the distribution of MiniImagenet images is more complex than that of Omniglot. In spite of this, the GNN manages to improve by about 2% in both the 20% and 40% settings.
Table 3: Omniglot semi-supervised results (5-Way 5-shot).

Model | 20%-labeled | 40%-labeled | 100%-labeled
GNN - Trained only with labeled | 99.18% | 99.59% | 99.71%
GNN - Semi-supervised | 99.59% | 99.63% | 99.71%
Table 4: MiniImagenet semi-supervised results (5-Way 5-shot).

Model | 20%-labeled | 40%-labeled | 100%-labeled
GNN - Trained only with labeled | 50.33% ± 0.36% | 56.91% ± 0.42% | 66.41% ± 0.63%
GNN - Semi-supervised | 52.45% ± 0.88% | 58.76% ± 0.86% | 66.41% ± 0.63%
6.4 Active Learning
We performed active learning experiments in the 5-Way 5-shot setup with 20% of the samples labeled. In this scenario our network queries for the label of one sample among the unlabeled ones. The results are compared with a random baseline where the network chooses a random sample to be labeled instead of one that maximally reduces the loss of the classification task.

Results are shown in Table 5. The results of the GNN-Random criterion are close to the semi-supervised results for 20%-labeled samples from Tables 3 and 4, meaning that selecting one label at random barely improves accuracy. When using the learned GNN-AL criterion, we observe a clear improvement for MiniImagenet, meaning that the GNN manages to choose a sample more informative than a random one. In Omniglot the improvement is smaller, since accuracy is almost saturated and the margin for improvement is small.
7 Conclusions
This paper explored graph neural representations for few-shot, semi-supervised and active learning. From the meta-learning perspective, these tasks become supervised learning problems where the input is given by a collection or set of elements, whose relational structure can be leveraged with neural message-passing models. In particular, stacked node and edge features generalize the contextual similarity learning underpinning previous few-shot learning models.
The graph formulation helps unify several training setups (few-shot, active, semi-supervised) under the same framework, a necessary step towards the goal of a single learner able to operate simultaneously in different regimes (a stream of labels with few examples per class, or a stream of examples with few labels). This general goal requires scaling graph models up to millions of nodes, motivating graph hierarchical and coarsening approaches Defferrard et al. (2016).
Another future direction is to generalize the scope of active learning, to include, e.g., the ability to ask questions Rothe et al. (2017), or reinforcement learning setups, where few-shot learning is critical to adapt to non-stationary environments.
Acknowledgments
This work was partly supported by Samsung Electronics (Improving Deep Learning using Latent Structure).
References
 Battaglia et al. (2016) Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks for learning about objects, relations and physics. In Advances in Neural Information Processing Systems, pp. 4502–4510, 2016.
 Bronstein et al. (2017) Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
 Bruna & Li (2017) Joan Bruna and Xiang Li. Community detection with graph neural networks. arXiv preprint arXiv:1705.08415, 2017.
 Bruna et al. (2013) Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. Proc. ICLR, 2013.
 Chang et al. (2016) Michael B. Chang, Tomer Ullman, Antonio Torralba, and Joshua B. Tenenbaum. A compositional objectbased approach to learning physical dynamics. ICLR, 2016.
 Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3837–3845, 2016.
 Duvenaud et al. (2015) David Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael Gómez-Bombarelli, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Neural Information Processing Systems, 2015.
 Edwards & Storkey (2016) Harrison Edwards and Amos Storkey. Towards a neural statistician. arXiv preprint arXiv:1606.02185, 2016.
 Fei-Fei et al. (2006) Li Fei-Fei, Rob Fergus, and Pietro Perona. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):594–611, 2006.
 Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
 Gilmer et al. (2017) Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.
 Gori et al. (2005) M. Gori, G. Monfardini, and F. Scarselli. A new model for learning in graph domains. In Proc. IJCNN, 2005.
 Henaff et al. (2015) M. Henaff, J. Bruna, and Y. LeCun. Deep convolutional networks on graph-structured data. arXiv:1506.05163, 2015.
 Kaiser et al. (2017) Łukasz Kaiser, Ofir Nachum, Aurko Roy, and Samy Bengio. Learning to remember rare events. arXiv preprint arXiv:1703.03129, 2017.
 Kearnes et al. (2016) Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of Computer-Aided Molecular Design, 30(8):595–608, 2016.
 Kipf & Welling (2016) Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
 Koch et al. (2015) Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
 Lake et al. (2015) Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
 Li et al. (2015) Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.
 Mehrotra & Dukkipati (2017) Akshay Mehrotra and Ambedkar Dukkipati. Generative adversarial residual pairwise networks for one shot learning. arXiv preprint arXiv:1703.08033, 2017.
 Mishra et al. (2017) Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. Meta-learning with temporal convolutions. arXiv preprint arXiv:1707.03141, 2017.
 Munkhdalai & Yu (2017) Tsendsuren Munkhdalai and Hong Yu. Meta networks. arXiv preprint arXiv:1703.00837, 2017.
 Ravi & Larochelle (2016) Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. ICLR, 2016.
 Rothe et al. (2017) Anselm Rothe, Brenden Lake, and Todd Gureckis. Question asking as program generation. NIPS, 2017.
 Santoro et al. (2016) Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pp. 1842–1850, 2016.
 Scarselli et al. (2009) Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
 Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard S Zemel. Prototypical networks for fewshot learning. arXiv preprint arXiv:1703.05175, 2017.
 Sukhbaatar et al. (2016) Sainbayar Sukhbaatar, Rob Fergus, et al. Learning multi-agent communication with backpropagation. In Advances in Neural Information Processing Systems, pp. 2244–2252, 2016.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
 Vinyals et al. (2015) Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391, 2015.
 Vinyals et al. (2016) Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638, 2016.
 Xu et al. (2015) Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.