Few-Shot Abstract Visual Reasoning With Spectral Features

by Tanner Bohn, et al.
Western University

We present an image preprocessing technique capable of improving the performance of few-shot classifiers on abstract visual reasoning tasks. Many visual reasoning tasks with abstract features are easy for humans to learn with few examples but very difficult for computer vision approaches with the same number of samples, despite the ability of deep learning models to learn abstract features. Same-different (SD) problems represent a type of visual reasoning task requiring knowledge of pattern repetition within individual images, and modern computer vision approaches have largely faltered on these classification problems, even when provided with vast amounts of training data. We propose a simple method for solving these problems based on the insight that removing peaks from the amplitude spectrum of an image is capable of emphasizing the unique parts of the image. When combined with several classifiers, our method performs well on the SD SVRT tasks with few-shot learning, improving upon the best comparable results on all tasks, with average absolute accuracy increases of nearly 40%. We find that combining Relational Networks with this image preprocessing approach improves their performance from chance-level to over 90% accuracy on several SD tasks.




1 Introduction

The field of artificial intelligence has slowly enveloped human skills ranging from those requiring formal reasoning to those requiring flexibility and intuition (such as image recognition or natural language understanding). The recognition of highly abstract concepts given few examples is a ubiquitous human skill that has yet to see significant progress. In particular, flexible machine learning techniques have not yet been found which can solve a certain type of visual reasoning task known as same-different (SD) problems given only few examples (Stabinger et al., 2016; Ricci et al., 2018). Solving these problems requires reasoning about the similarity between patterns located within the same image, something humans perform with ease. In this work we present a conceptually simple image transformation which can be combined with few-shot image classifiers to perform well at these tasks.

Figure 1: Representative examples from the SVRT task set. For SVRT #1, class 1 contains two identical figures, while images from class 2 contain two unique figures. For SVRT #5, class 1 contains two pairs of identical figures, while class 2 contains four unique figures. For SVRT #22, class 1 contains three identical aligned shapes, while in class 2, the shapes are not identical. For all three of these tasks, deep learning approaches have been unable to achieve good performance, even with large amounts of training data. In contrast, our approach is capable of achieving high accuracies with only 10 labelled samples.

There are many potential sources of abstract visual reasoning tasks one could study. These include Bongard problems (given 6 labelled images, try to find the rule separating them) (Bongard, 1967), Raven's Progressive Matrices (given observed sequences of tiles, identify the missing tile) (Raven and others, 1938), the CLEVR dataset (answer questions based on a scene of multiple objects) (Johnson et al., 2017), and SVRT problems (given N labelled images, classify unseen images) (Fleuret et al., 2011). Although the SVRT lacks the variation of hand-drawn Bongard Problems (23 tasks vs. 200), it contains a variety of highly abstract tasks while being procedurally generated. Unlike Raven's Progressive Matrices or CLEVR tasks, which have seen significant progress (Santoro et al., 2018; Hudson and Manning, 2018), SVRT tasks have seen only partial success. It is worth noting that while creating a program to automatically solve these types of problems may often be easy (the images in these tasks are simple, consisting of simple objects on a solid white background), the difficulty comes from finding an approach capable of learning to solve the problems with a minimal amount of expert knowledge or feature engineering.

The 23 SVRT tasks can be split into two groups, based on the type of patterns separating the two classes: spatial relation (SR) problems (e.g. shapes in a line vs. not in a line) and SD problems (e.g. two pairs of unique shapes vs. two pairs of identical shapes) (Stabinger et al., 2016). Previous attempts at these tasks have shown that convolutional neural networks, a staple of image classification research, are capable of solving SR problems given enough training data (20K in (Stabinger et al., 2016) and 1 million in (Ricci et al., 2018)), but the SD problems have proven to be more difficult. Perhaps unexpectedly, even relational networks (Santoro et al., 2017), which perform well on other reasoning tasks such as CLEVR, have demonstrated great difficulty in learning to solve same-different problems, requiring millions of training samples to achieve above chance-level performance (Ricci et al., 2018).

Our new approach is inspired by an interesting point of overlap between image processing and visual saliency, namely, Fourier transforms. In image processing, it is well-known that unwanted periodic noise in images can be removed by manually zeroing-out corresponding peaks in the amplitude spectrum of the Fourier transform with so-called 'notch' filters. Instead of manually removing peaks, a functionally similar approach is to smooth peaks, which can be done automatically. In (Li et al., 2013), the authors point out that a Gaussian filtering operation on amplitude spectra is an elegant way to compute visual saliency maps. This technique works because non-salient parts of an image are often those which are frequently occurring, leading to peaks in the amplitude spectrum. The effectiveness of this simple approach for producing saliency maps can be seen in the bottom row of Figure 2. In this work, we apply this insight to solve SD tasks. Duplicated figures in SVRT images that undergo this amplitude spectrum filtering process are partially removed, whereas unique figures are largely untouched. This provides easily usable information for convolutional neural-network-based approaches to learn to classify images from SD tasks.

We observe that using these filtered spectra in place of the raw images, we are able to improve the state-of-the-art at few-shot learning SVRT SD tasks by an average of nearly 30%. By working towards solving these abstract visual problems in a simple and interpretable way, we demonstrate how analogous problems in other areas of machine learning may be approached.

The main contributions of this work are thus:

  • We describe a novel approach for solving same-different visual reasoning tasks which exploits insights from visual saliency (Section 3).

  • We establish the performance of several popular algorithms for few-shot learning on SVRT tasks (Section 4).

  • We experimentally demonstrate that our approach allows for achieving state-of-the-art performance on all of the SVRT same-different tasks with few-shot learning (Section 4.3).

Figure 2: A comparison of saliency maps produced by percentile filtering and Gaussian filtering (Li et al., 2013) on psychological test patterns from (Hou and Zhang, 2007). The expected salient locations are circled. As demonstrated by (Hou and Zhang, 2007; Seo and Milanfar, 2009; Guo et al., 2008), the closure test proves to be difficult for many automated methods (this test was not reported by (Li et al., 2013)). Following (Hou and Zhang, 2007; Li et al., 2013), we square the raw saliency map to increase intensity contrast and apply a Gaussian filter to smooth the highlighted regions.

2 Related Work

In this section we first focus on previous work done on SD problems in particular, then provide an overview of relevant work on visual saliency maps.

2.1 Same-Different Visual Reasoning Tasks

With the aim of highlighting the difference in performance between humans and machines at solving visual reasoning tasks, François Fleuret et al. introduced the SVRT tasks in 2011 (Fleuret et al., 2011). With their set of 23 binary classification tasks, the authors demonstrated that humans are much more proficient than a set of standard machine learning image classification techniques chosen at the time. In their few-shot experiments (only 10 training samples), the performance on most of the 23 tasks hovered around random. With 10,000 training images however, their best models were able to obtain 81% accuracy on SD tasks and 88% accuracy on SR tasks.

Studying the SVRT tasks with more modern computer vision approaches, Sebastian Stabinger et al. (Stabinger et al., 2016) trained LeNet (LeCun et al., 1989) and GoogLeNet (Szegedy et al., 2015) CNNs on the tasks. They found that near-perfect performance was achievable on roughly half of the tasks, with the other half being significantly more difficult (near-random). By observing the abstract concepts required to solve each SVRT task, they noticed that this easy-difficult split closely corresponded to whether or not tasks required same-different comparisons, with a couple of exceptions (they determined that a couple of SD problems could be solved by exploiting simple pixel distribution patterns). The authors found that both CNN models performed very similarly, achieving perfect or near-perfect accuracies on the spatial-relation problems but poor performance on the remaining tasks, despite training each network on 20,000 images. In (Ricci et al., 2018), the authors use a minimal synthetic SVRT-like task for deeper exploration. They show that relational networks (RNs) (Santoro et al., 2017) have trouble learning to solve these problems, requiring several million training samples before achieving above chance-level performance.

2.2 Saliency Maps

The aim of a visual saliency map is to indicate where a human is likely to gaze when looking at a scene, and these maps have proven to be useful for a wide variety of applications. Knowing where humans will attend in an image allows for such interesting applications as content-aware image resizing (Suh et al., 2003; Avidan and Shamir, 2007), segmentation of salient objects (Rahtu et al., 2010), aiding general object recognition (Rutishauser et al., 2004), and video summarization (Ma et al., 2005).

Methods for calculating visual saliency can roughly be grouped by their underlying assumptions on the meaning of saliency. One common idea is that saliency is "an anomaly with respect to context" (Wang et al., 2011). Separate approaches thus exist for various interpretations of "context". Approaches that use the local context around a pixel to determine saliency are often based on low-level visual features like colour, colour intensity, edge orientation, or texture (Goferman et al., 2012). Another set of approaches use a global context, where the saliency at a location depends on the entire image. One technique is to explicitly compare every patch in the image to a representative set of other patches. Those patches that are unique will have low cosine-similarities to other patches (Seo and Milanfar, 2009; Goferman et al., 2012).

Another group of approaches most relevant to the present paper has made use of operations on the Fourier transform of images. Adopting notation from (Li et al., 2013), we consider $\mathcal{F}[f(x,y)] = A(u,v)\,e^{iP(u,v)}$ to be the mapping of image $f(x,y)$ to the frequency domain, where $A(u,v)$ is the amplitude spectrum and $P(u,v)$ is the phase spectrum. Xiaodi Hou and Liqing Zhang (Hou and Zhang, 2007) proposed the idea that novelty in images is represented in the amplitude spectra of images as 'residuals':

$$R(u,v) = L(u,v) - h_n * L(u,v)$$

where $L(u,v) = \log A(u,v)$ and $h_n$ is an $n \times n$ low-pass filter convolved over $L(u,v)$. To produce the saliency map, $S(x,y)$, the residual is then used in place of the original amplitude:

$$S(x,y) = \left| \mathcal{F}^{-1}\left[ e^{R(u,v) + iP(u,v)} \right] \right|$$

To achieve more visually pleasing results, they additionally square the elements of $S$ and smooth it with a low-pass filter. Soon after this work, (Guo et al., 2008) showed that similar results could be achieved by simply reconstructing the image using the phase spectrum:

$$S(x,y) = \left| \mathcal{F}^{-1}\left[ e^{iP(u,v)} \right] \right|$$

In (Li et al., 2013), the authors suggest that these two approaches achieve very similar visual results because the residual of the amplitude spectrum is very similar to a plane (i.e. $R(u,v) \approx c$ for some constant $c$) and reconstructing images with only phase information functions similarly to a gradient enhancement, highlighting object boundaries and textured parts of the image. This property leaves the methods ill-suited for identifying large salient regions or salient regions in front of noisy backgrounds.

Guided by the fact that repeated patterns (also called "non-salient" patterns) correspond to spikes in the amplitude spectrum, the authors of (Li et al., 2013) reason that suppressing these peaks corresponds to leaving only the salient parts of an image. To perform this suppression so that sharper spikes are reduced more, a low-pass Gaussian filter, $g$, is used to obtain the smoothed amplitude spectrum:

$$A_g(u,v) = A(u,v) * g$$

The saliency map can then be calculated with:

$$S(x,y) = \left| \mathcal{F}^{-1}\left[ A_g(u,v)\, e^{iP(u,v)} \right] \right|$$

This method is the most similar to our work described in the next section. One main difference is that we additionally consider a percentile filter. Additionally, instead of applying the inverse Fourier transform, we experimentally observe (Section 4.4) that training image classifiers directly on the filtered amplitude spectra leads to improved results on SD problems.
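The Gaussian-smoothing approach to saliency described above can be sketched in a few lines of NumPy. This is a minimal single-scale illustration, not the implementation of (Li et al., 2013): the function names, the default `sigma`, the 3-sigma kernel radius, and the wrap-around boundary handling are all assumptions made for this example.

```python
import numpy as np

def gaussian_kernel_1d(sigma):
    """Normalized 1D Gaussian weights; the 3-sigma radius is an assumed choice."""
    radius = max(1, int(3 * sigma))
    offsets = np.arange(-radius, radius + 1)
    weights = np.exp(-(offsets ** 2) / (2.0 * sigma ** 2))
    return offsets, weights / weights.sum()

def smooth_periodic(values, sigma):
    """Separable Gaussian blur with wrap-around edges (a DFT spectrum is periodic)."""
    offsets, weights = gaussian_kernel_1d(sigma)
    out = values
    for axis in (0, 1):
        blurred = np.zeros_like(out, dtype=float)
        for offset, weight in zip(offsets, weights):
            blurred += weight * np.roll(out, int(offset), axis=axis)
        out = blurred
    return out

def saliency_from_smoothed_amplitude(image, sigma=2.0):
    """Smooth the amplitude spectrum, keep the phase, invert, and square."""
    spectrum = np.fft.fft2(image)
    amplitude = np.abs(spectrum)
    phase = np.angle(spectrum)
    smoothed = smooth_periodic(amplitude, sigma)
    reconstruction = np.fft.ifft2(smoothed * np.exp(1j * phase))
    return np.abs(reconstruction) ** 2

# Usage on a stand-in image (random values, for illustration only).
image = np.random.default_rng(0).random((32, 32))
saliency = saliency_from_smoothed_amplitude(image, sigma=2.0)
```

With a vanishingly small `sigma` the kernel collapses to a single weight of one, the spectrum is left unchanged, and the function returns the squared input image, which is a convenient sanity check.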

More recent CNN-based approaches to saliency also exist (Kümmerer et al., 2014; Kruthiventi et al., 2017). However, work by (Stabinger et al., 2016; Ricci et al., 2018) suggests they do not transfer well to the abstract SVRT tasks, despite the excellent performance of these approaches on calculating saliency maps.

3 Our Approach

Figure 3: A demonstration of how non-unique parts of an image are affected by amplitude spectrum filtering. In the top row, columns (a) and (b) are images with varying scales of repeated vertical lines and columns (c) and (d) are positive and negative samples from SVRT #15. The second row contains the amplitude spectra. The third and fourth rows contain the figures reconstructed with the percentile- and Gaussian-filtered amplitude spectra respectively. Note how the duplicated shapes are partially removed following filtering; for columns (c) and (d) containing SVRT samples, this effect is clearer with the percentile filter. The images in row 2 have undergone contrast enhancement for better printing quality. Rows 3 and 4 have each been normalized to a fixed intensity range.

The core of our SD problem-solving approach is based on the insight that peaks in the amplitude spectrum of an image correspond to the non-unique parts of that image, and removing these peaks corresponds to removing the non-unique parts of the image (Li et al., 2013) (demonstrated in Figure 3). While deep convolutional neural networks have been able to solve a wide variety of visual tasks given only the raw images, they have thus far been unable to solve SD tasks this way (Stabinger et al., 2016; Ricci et al., 2018). A motivation of our model is thus to find a simple transformation of the problem images such that when combined with CNN-based approaches, the classifiers are capable of extracting the necessary information to solve SD problems. Consider SVRT #1 from Figure 1 for example. If an image transformation were capable of making the non-unique figure outlines in class 1 lighter than the unique figure outlines in class 2, then a CNN classifier trained with gradient-descent would have no trouble identifying the relevant visual feature (namely, intensity). This is precisely what amplitude spectrum filtering allows us to do. The main way our primary approach differs from this is that instead of training the classifier on the inverse Fourier transform using the filtered amplitude, we provide the filtered amplitude spectrum directly to the classifier. In Section 4.4 we examine how using the filtered amplitude differs from using the inverse Fourier transform with the filtered amplitude.

In the remainder of this section, we first discuss the intuition of how amplitude filtering works to detect uniqueness and the difference between filtering methods we consider. Second, we discuss how these amplitude spectra fit into the rest of our problem-solving pipeline.

3.1 Amplitude Spectra Filtering

Removing non-uniqueness. Demonstrations of how amplitude filtering affects non-uniqueness in an image are provided in Figure 3. In columns (a) and (b), we can see an image with several unique figures and different scales of repeated (i.e. non-unique) vertical bars superimposed. In the amplitude spectra, these vertical bars correspond to the sharp peaks located symmetrically about the y-axis observable in the second row of the figure. By applying a filter to smooth or remove these peaks before reconstructing the image, we remove the cause of those peaks. As demonstrated, this works even when there are relatively few instances of the repeated shape (column (b)).
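The link between repetition and spectral peaks can be checked numerically. The sketch below uses illustrative values (a 64x64 image with one bright column every 8 pixels) that are not taken from the paper, and it uses a hard notch rather than the smoothing filters discussed in this section.

```python
import numpy as np

size, period = 64, 8  # illustrative values, not taken from the paper

# An image made only of repeated vertical bars (one bright column every `period` pixels).
image = np.zeros((size, size))
image[:, ::period] = 1.0

spectrum = np.fft.fft2(image)
amplitude = np.abs(spectrum)

# The repetition appears as a few sharp peaks: all energy sits in the row of zero
# vertical frequency, at horizontal frequencies that are multiples of size / period.
peak_columns = np.nonzero(amplitude[0] > 1e-6)[0].tolist()

# Notch out every peak except the DC term, then reconstruct the image.
notched = spectrum.copy()
notched[0, 1:] = 0.0
reconstruction = np.fft.ifft2(notched).real
```

Here `peak_columns` contains only multiples of `size / period`, and once those peaks are zeroed the reconstruction is a flat image at the mean intensity: removing the peaks removes the bars that caused them.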

Filtering methods. To perform the amplitude spectrum filtering for our model, we consider Gaussian filtering (which was used in (Li et al., 2013)) and (primarily) percentile filtering. Gaussian filtering of a matrix is performed by convolving a kernel whose values approximate a 2D Gaussian function:

$$g(x,y) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{x^2 + y^2}{2\sigma^2}}$$

where $x$ and $y$ are respectively the horizontal and vertical distances from the origin and $\sigma$ is the standard deviation. For each position, $(u,v)$, in the amplitude spectrum $A$ of a discrete Fourier transform, the $p$-percentile filtered value is:

$$A_p(u,v) = \min\left\{ t : c_{u,v}(t) \ge \tfrac{p}{100}\,\lvert N_k(u,v) \rvert \right\}$$

where $N_k(u,v)$ is a neighborhood of size $k \times k$ centered at $(u,v)$ and $c_{u,v}(t)$ is the number of elements of $A$ evaluated at each point of $N_k(u,v)$ that are less than or equal to $t$. Where $N_k(u,v)$ extends outside of $A$, we consider $A$ to evaluate to a fixed padding value. For convenience, in the rest of this paper we use $k$ to be the (neighborhood width)/(image width).
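A rank-based percentile filter of this kind can be sketched directly in NumPy. The function name, the rounding of the window to an odd width, and the default `pad_value` of zero are assumptions made for this example and are not confirmed by the text.

```python
import numpy as np

def percentile_filter(amplitude, p=10, k=0.1, pad_value=0.0):
    """Rank-based p-percentile filter over a square neighborhood.

    k is the neighborhood width as a fraction of the image width.
    pad_value (an assumption) is used where the window leaves the array.
    """
    height, width = amplitude.shape
    window = max(1, int(round(k * width)))
    if window % 2 == 0:
        window += 1  # keep the window centered on each point
    half = window // 2
    padded = np.pad(amplitude, half, mode="constant", constant_values=pad_value)
    # Index of the smallest value t whose count of elements <= t reaches p percent.
    rank = max(0, int(np.ceil(p / 100.0 * window * window)) - 1)
    out = np.empty(amplitude.shape, dtype=float)
    for u in range(height):
        for v in range(width):
            neighborhood = padded[u:u + window, v:v + window].ravel()
            out[u, v] = np.partition(neighborhood, rank)[rank]
    return out

# Usage: an isolated spike in an otherwise flat spectrum is removed entirely.
spectrum = np.zeros((8, 8))
spectrum[4, 4] = 100.0
filtered = percentile_filter(spectrum, p=10, k=3 / 8)
```

The double loop is written for clarity rather than speed. Each output value requires a partial sort of the whole window, so the cost grows with the neighborhood area.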

In Section 4.4 we demonstrate that both percentile and Gaussian filtering of amplitude spectra lead to state-of-the-art results on different SD tasks, with percentile filtering performing better on average.

3.2 Pipeline

Our primary approach for solving SD tasks follows a straightforward pipeline. First, for every training image for a task, we apply a percentile filter to the amplitude spectrum. Second, we train a classifier to predict class labels given the spectra (we discuss the classifiers used in Section 4.1). In Section 4.4 we will examine how the various choices in image transformation affect performance.
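The two steps of this pipeline can be sketched end to end as follows. The sketch swaps in a plain 1-nearest-neighbor classifier on flattened spectra in place of the classifiers used in our experiments, and it uses `np.percentile` (which interpolates between ranks) for brevity. All function names, the window size, and the random stand-in images are assumptions made for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def percentile_filter(values, p, window):
    """Percentile filter with zero padding; `window` must be odd."""
    half = window // 2
    padded = np.pad(values, half, mode="constant")
    windows = sliding_window_view(padded, (window, window))
    return np.percentile(windows, p, axis=(-2, -1))

def spectral_features(image, p=10, window=3):
    """Step 1: percentile-filter the centered amplitude spectrum of one image."""
    amplitude = np.fft.fftshift(np.abs(np.fft.fft2(image)))
    return percentile_filter(amplitude, p, window).ravel()

def fit(images, labels):
    """Step 2 (stand-in classifier): store the training features and labels."""
    features = np.stack([spectral_features(im) for im in images])
    return features, np.asarray(labels)

def predict(model, images):
    """Label each image with the class of its nearest training example."""
    train_x, train_y = model
    test_x = np.stack([spectral_features(im) for im in images])
    dists = ((test_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=-1)
    return train_y[dists.argmin(axis=1)]

# Usage with 10 labelled stand-in images (random pixels, not SVRT samples).
rng = np.random.default_rng(0)
train_images = rng.random((10, 16, 16))
train_labels = np.array([0, 1] * 5)
model = fit(train_images, train_labels)
predictions = predict(model, train_images)
```

Any classifier that accepts image-shaped inputs can replace the nearest-neighbor step, since the filtered spectrum has the same dimensions as the original image.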

4 Experiments

4.1 Setup

To evaluate our image preprocessing approach for abstract visual reasoning, we make use of tasks from the SVRT challenge, consisting of 23 binary image classification tasks. As noted by (Stabinger et al., 2016), the SVRT tasks can be split into two groups: spatial relation (SR) problems and same-different (SD) problems. Solving SR problems requires attending to such features as relative positioning, sizes, alignments, and grouping of figures. The SD problems require comparing individual figures within a single image to identify if they are identical, often under transformations such as scaling or rotation. In particular, we use the task type grouping proposed by (Ricci et al., 2018), so that there are 9 SD tasks and 14 SR tasks.

We combine our image preprocessing technique with the following few-shot classifiers (with their abbreviated names). VGG-pt: pre-trained VGG16 (Simonyan and Zisserman, 2014) feature extractor with k-nearest neighbors classifier. MAML: model-agnostic meta-learning (Finn et al., 2017). PN: Prototypical Networks (Snell et al., 2017). RN: Relational Networks (Santoro et al., 2017). Network architectures, training details, and hyperparameters are described in the supplementary material.

Fleuret refers to the Adaboost+spectral features model from (Fleuret et al., 2011), which provides the only other comparable results on few-shot learning of these tasks. The feature type abbreviations are as follows (with more feature types examined in Section 4.4). Raw: original grayscale images. $A$: unfiltered amplitude spectra. $A_p$: percentile-filtered amplitude spectra.

4.2 Results At a Glance

Table 1 contains a high-level summary of how amplitude spectra filtering makes few-shot learning of SD tasks easier. In particular, we can make the following observations:

  • By using $A_p$ features for SD tasks, all classifiers tested go from performing at chance-level to above 70% accuracy with only 10 training samples.

  • Relational Networks (Santoro et al., 2017) and Prototypical Networks (Snell et al., 2017) improve the most with $A_p$ features on SD tasks.

  • When solving SR tasks, $A_p$ features appear to make learning slightly easier, but $A$ features perform better.

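| Task type | Feature type | VGG-pt | MAML | PN | RN |
|---|---|---|---|---|---|
| SD | raw | 50.7% | 50.0% | 50.7% | 49.6% |
| SD | $A$ | 60.5% | 61.5% | 63.4% | 66.6% |
| SD | $A_p$ | 70.1% | 78.4% | 78.7% | 79.1% |
| SR | raw | 56.5% | 51.8% | 49.3% | 50.0% |
| SR | $A$ | 58.5% | 59.3% | 62.9% | 55.2% |
| SR | $A_p$ | 53.4% | 56.2% | 55.9% | 56.1% |

Table 1: The test results for each of the four classifiers we use, averaged across all SD and SR problems separately and evaluated with the two baseline feature sets (the original image: raw, and the unfiltered amplitude spectrum: $A$) and with the percentile-filtered amplitude spectrum ($A_p$).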
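The size of the improvement on SD tasks follows directly from the figures in Table 1. The short calculation below takes the raw-image row and the percentile-filtered row for SD tasks (the row whose values match the averages in Table 2) and computes the absolute gain per classifier. The variable names are chosen for this example.

```python
# SD-task averages from Table 1 (accuracy in percent).
sd_raw = {"VGG-pt": 50.7, "MAML": 50.0, "PN": 50.7, "RN": 49.6}
sd_filtered = {"VGG-pt": 70.1, "MAML": 78.4, "PN": 78.7, "RN": 79.1}

# Absolute accuracy gain from switching raw images to percentile-filtered spectra.
gains = {name: round(sd_filtered[name] - sd_raw[name], 1) for name in sd_raw}

for name, gain in gains.items():
    print(f"{name}: +{gain} points")
```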
| Task | 1 | 5 | 7 | 15 | 16 | 19 | 20 | 21 | 22 | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| Fleuret | 53.0% | 47.0% | 47.0% | 54.0% | 62.0% | 51.0% | 48.0% | 39.0% | 53.0% | 50.4% |
| VGG-pt | 89.2% | 70.3% | 52.2% | 91.4% | 98.8% | 55.9% | 51.9% | 50.0% | 70.7% | 70.1% |
| MAML | 99.8% | 92.5% | 58.4% | 100.0% | 96.3% | 56.9% | 56.0% | 50.6% | 95.3% | 78.4% |
| PN | 99.6% | 95.9% | 58.6% | 99.6% | 97.3% | 56.1% | 55.5% | 49.6% | 95.7% | 78.7% |
| RN | 99.9% | 94.7% | 60.8% | 98.7% | 98.1% | 56.4% | 55.3% | 50.3% | 97.6% | 79.1% |

Table 2: A comparison of existing results (Fleuret et al., 2011) to four CNN-based few-shot classifiers utilizing $A_p$ features and 10 labelled samples. All four CNN classifiers outperform the existing results on every SD task. On several tasks, our approach is capable of achieving > 95% accuracy.
| Task | 2 | 3 | 4 | 6 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 17 | 18 | 23 | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fleuret | 55.0% | 54.0% | 56.0% | 50.0% | 57.0% | 52.0% | 50.0% | 52.0% | 46.0% | 50.0% | 51.0% | 59.0% | 50.0% | 53.0% | 52.5% |
| VGG-pt | 66.2% | 50.5% | 66.6% | 51.3% | 75.0% | 50.7% | 71.2% | 50.7% | 57.2% | 53.7% | 67.6% | 53.2% | 50.8% | 54.8% | 58.5% |
| MAML | 68.9% | 56.1% | 61.4% | 50.3% | 73.8% | 51.2% | 74.6% | 60.1% | 54.2% | 65.1% | 58.9% | 51.9% | 51.3% | 52.0% | 59.3% |
| PN | 77.8% | 58.2% | 66.6% | 50.8% | 83.4% | 51.4% | 83.7% | 64.1% | 54.8% | 70.4% | 59.8% | 53.3% | 53.9% | 52.0% | 62.9% |
| RN | 67.8% | 51.2% | 53.9% | 51.1% | 57.1% | 51.1% | 67.1% | 52.8% | 51.6% | 62.1% | 53.4% | 51.6% | 51.0% | 50.4% | 55.2% |

Table 3: A comparison on the SR tasks of existing results and the four CNN-based approaches utilizing $A$ features. MAML, PN, and VGG-pt all perform quite close to each other, achieving approximately 8% better than Fleuret et al. on average.

4.3 A Closer Look

Same-different tasks. Perhaps the most immediate observation from Table 2 is the contrast in performance between classifiers utilizing $A_p$ features and the existing results. On every task, our approaches outperform Fleuret et al., with the best classifier, Relational Networks, achieving an accuracy nearly 30% higher on average. Two problems where our models perform particularly well are #1 and #15. In class 1 for both tasks, some number of identical shapes are present (2 and 4 respectively), while in class 2, the shapes are all unique. These represent the most purely same-different problems. The highest performance of Fleuret et al. is achieved on #16. In this task, both classes contain six identical shapes, but in class 1, the shapes on one side of the image can be obtained from those on the other side by reflection about the vertical image bisector. In class 2, the positions of the shapes are reflected, but not the details of the shapes themselves.¹

¹ Descriptions of each SVRT problem can be found in the appendix for (Fleuret et al., 2011), found at https://www.pnas.org/content/pnas/suppl/2011/10/12/1109168108.DCSupplemental/Appendix.pdf.

Spatial relation tasks. Table 3 contains the results using the best-performing feature type on these tasks, $A$, according to Table 1. Aside from when using the RN classifier, this feature type performs similarly for both SD and SR problems, and consistently better than the raw images. This suggests that while $A$ is not very useful for CNNs in providing uniqueness-type information, these features contain information difficult for CNNs to extract from raw images. On these tasks, the spectral features allow us to achieve up to 10% higher on average than Fleuret et al. with the PN classifier. Previous work has shown that using the raw images, CNNs can achieve very high accuracies on the SR tasks when given a large amount of training data (Stabinger et al., 2016; Ricci et al., 2018), but in this few-shot case we observe that the raw images only achieve up to 56.5% accuracy with the VGG-pt classifier, lower than the high-90s averages reported by (Stabinger et al., 2016) with 20,000 training samples and (Ricci et al., 2018) with several million training samples. This demonstrates that improving few-shot performance on the SVRT SR tasks is worth further study.

4.4 Effects of Hyperparameters

| Feature type | 1 | 5 | 7 | 15 | 16 | 19 | 20 | 21 | 22 | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| raw | 51.0% | 49.5% | 50.3% | 52.5% | 50.8% | 50.3% | 49.8% | 50.8% | 50.9% | 50.7% |
| $A$ | 63.3% | 66.1% | 54.2% | 76.5% | 58.9% | 53.1% | 51.6% | 49.0% | 71.7% | 60.5% |
| $A_p$ | 89.2% | 70.3% | 52.2% | 91.4% | 98.8% | 55.9% | 51.9% | 50.0% | 70.7% | 70.1% |
| $A_g$ | 64.6% | 63.7% | 58.3% | 85.8% | 67.2% | 57.1% | 53.5% | 50.8% | 77.5% | 64.3% |
| $S_p$ | 86.5% | 62.4% | 52.8% | 88.4% | 52.1% | 52.1% | 49.8% | 50.4% | 69.0% | 62.6% |

Table 4: A comparison of the SD task performance of various feature types considered. $A_g$ refers to the Gaussian-filtered amplitude spectra and $S_p$ refers to the saliency map produced by taking the inverse Fourier transform with the original phase spectrum and the percentile-filtered amplitude.

Choice of image transformation. In Table 4 we provide support for our choice of percentile-filtered amplitude spectra as the primary image preprocessing approach in our experiments. For these comparisons, we report the results of the VGG-pt classifier (chosen for its speed, as no neural networks require training). While both filtering methods (percentile and Gaussian) achieve the best performance on subsets of the SD tasks, percentile filtering achieves a 6% higher average accuracy than Gaussian filtering. Additionally, while it may be more intuitive to reconstruct the images with the filtered spectra before training the classifiers (feature type $S_p$), Table 4 demonstrates that this achieves an average accuracy 8% lower than the $A_p$ features, but still better than the raw images and the unfiltered amplitude spectra. However, we note that constructing saliency maps is often subject to additional post-processing parameters which we did not tune.

Filter parameters. We found that the most important parameter for the percentile filter was the size of the neighborhood around each point, $k$, used to calculate the $p$-percentile, with larger values performing better. Unfortunately, calculating the $p$-percentile on a window requires first ranking every value, making the calculation slow for large windows. Thus, while we only tried values up to 0.2, we strongly suspect that using larger values would improve performance on the SD tasks by several percent. For the $p$ value, we found 10 to generally perform best during hyperparameter tuning on the validation set, but with only minor performance decreases when 5 or 20 were used. When tuning the Gaussian kernel to produce the results in Table 4, we found performance to increase sharply at $\sigma = 2$ and to fall off slowly for larger values.

5 Conclusion

We have presented an image preprocessing technique allowing few-shot deep learning classifiers to achieve improved accuracy on same-different (SD) problems. SD problems are a fundamental type of visual reasoning task often trivial for humans to solve with few samples while deep learning approaches training on millions of samples have been unsuccessful. Discovering machine learning approaches capable of solving these tasks is valuable in working towards automating highly abstract human skills.

To solve SD problems, we propose training CNN-based classifiers on the percentile-filtered amplitude spectra. As has been previously established with Gaussian filters, filtering these spectra corresponds to removing the non-unique parts of an image. In line with previous work suggesting that fully convolutional approaches have difficulty learning to solve SD tasks, we demonstrate that a variety of state-of-the-art few-shot classifiers achieve only 50% binary classification accuracy on SD problems when trained on raw images. However, combining the classifiers with our image preprocessing technique allows them to achieve between 70% and 80% accuracy, outperforming the existing comparable state-of-the-art on SVRT SD tasks in the few-shot case, solidifying the effectiveness of this approach.


  • [1] S. Avidan and A. Shamir (2007) Seam carving for content-aware image resizing. In ACM Transactions on graphics (TOG), Vol. 26, pp. 10. Cited by: §2.2.
  • [2] M. M. Bongard (1967) The problem of recognition. Fizmatgiz, Moscow. Cited by: §1.
  • [3] F. Chollet (2015) Keras. GitHub. Note: https://github.com/fchollet/keras Cited by: Appendix A.
  • [4] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135. Cited by: Appendix A, §4.1.
  • [5] F. Fleuret, T. Li, C. Dubout, E. K. Wampler, S. Yantis, and D. Geman (2011) Comparing machines and humans on a visual categorization test. Proceedings of the National Academy of Sciences. Cited by: Appendix A, §1, §2.1, §4.1, Table 2, footnote 1.
  • [6] S. Goferman, L. Zelnik-Manor, and A. Tal (2012) Context-aware saliency detection. IEEE transactions on pattern analysis and machine intelligence 34 (10), pp. 1915–1926. Cited by: §2.2.
  • [7] C. Guo, Q. Ma, and L. Zhang (2008) Spatio-temporal saliency detection using phase spectrum of quaternion fourier transform. In

    2008 IEEE Conference on Computer Vision and Pattern Recognition

    pp. 1–8. Cited by: Figure 2, §2.2.
  • [8] X. Hou and L. Zhang (2007) Saliency detection: a spectral residual approach. In Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, pp. 1–8. Cited by: Figure 2, §2.2.
  • [9] D. A. Hudson and C. D. Manning (2018) Compositional attention networks for machine reasoning. arXiv preprint arXiv:1803.03067. Cited by: §1.
  • [10] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. Lawrence Zitnick, and R. Girshick (2017) Clevr: a diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2901–2910. Cited by: §1.
  • [11] S. S. Kruthiventi, K. Ayush, and R. V. Babu (2017) Deepfix: a fully convolutional neural network for predicting human eye fixations. IEEE Transactions on Image Processing 26 (9), pp. 4446–4456. Cited by: §2.2.
  • [12] M. Kümmerer, L. Theis, and M. Bethge (2014) Deep gaze i: boosting saliency prediction with feature maps trained on imagenet. arXiv preprint arXiv:1411.1045. Cited by: §2.2.
  • [13] B. Lake, R. Salakhutdinov, J. Gross, and J. Tenenbaum (2011) One shot learning of simple visual concepts. In Proceedings of the Annual Meeting of the Cognitive Science Society, Vol. 33. Cited by: Appendix A.
  • [14] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel (1989) Backpropagation applied to handwritten zip code recognition. Neural computation 1 (4), pp. 541–551. Cited by: §2.1.
  • [15] J. Li, M. D. Levine, X. An, X. Xu, and H. He (2013) Visual saliency based on scale-space analysis in the frequency domain. IEEE transactions on pattern analysis and machine intelligence 35 (4), pp. 996–1010. Cited by: Figure 2, §1, §2.2, §2.2, §3.1, §3.
  • [16] Y. Ma, X. Hua, L. Lu, and H. Zhang (2005) A generic framework of user attention model and its application in video summarization. IEEE transactions on multimedia 7 (5), pp. 907–919. Cited by: §2.2.
  • [17] E. Rahtu, J. Kannala, M. Salo, and J. Heikkilä (2010) Segmenting salient objects from images and videos. In Computer Vision–ECCV 2010, pp. 366–379. Cited by: §2.2.
  • [18] J. C. Raven et al. (1938) Raven’s progressive matrices. Western Psychological Services Los Angeles, CA. Cited by: §1.
  • [19] M. Ricci, J. Kim, and T. Serre (2018) Not-so-clevr: visual relations strain feedforward neural networks. arXiv preprint arXiv:1802.03390. Cited by: §1, §1, §2.1, §2.2, §3, §4.1, §4.3.
  • [20] U. Rutishauser, D. Walther, C. Koch, and P. Perona (2004) Is bottom-up attention useful for object recognition?. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, Vol. 2, pp. II–II. Cited by: §2.2.
  • [21] A. Santoro, F. Hill, D. Barrett, A. Morcos, and T. Lillicrap (2018) Measuring abstract reasoning in neural networks. In International Conference on Machine Learning, pp. 4477–4486. Cited by: §1.
  • [22] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap (2017) A simple neural network module for relational reasoning. In Advances in neural information processing systems, pp. 4967–4976. Cited by: Appendix A, §1, §2.1, 2nd item, §4.1.
  • [23] H. J. Seo and P. Milanfar (2009) Static and space-time visual saliency detection by self-resemblance. Journal of vision 9 (12), pp. 15–15. Cited by: Figure 2, §2.2.
  • [24] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: Appendix A, §4.1.
  • [25] J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087. Cited by: Appendix A, 2nd item, §4.1.
  • [26] S. Stabinger, A. Rodríguez-Sánchez, and J. Piater (2016) 25 years of cnns: can we compare to human abstraction capabilities?. In International Conference on Artificial Neural Networks, pp. 380–387. Cited by: §1, §1, §2.1, §2.2, §3, §4.1, §4.3.
  • [27] B. Suh, H. Ling, B. B. Bederson, and D. W. Jacobs (2003) Automatic thumbnail cropping and its effectiveness. In Proceedings of the 16th annual ACM symposium on User interface software and technology, pp. 95–104. Cited by: §2.2.
  • [28] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §2.1.
  • [29] M. Wang, J. Konrad, P. Ishwar, K. Jing, and H. Rowley (2011) Image saliency: from intrinsic to extrinsic context. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 417–424. Cited by: §2.2.

Appendix A Network architecture and hyperparameters

Here we discuss the implementation details and hyperparameters used for each model in our experiments. For all models, we use SVRT images of size 96×96. For each SVRT task, we train models on 10 labelled samples (for comparison with the few-shot experiments in [5]) and evaluate on 1000 samples, measuring performance by classification accuracy. For hyperparameter tuning, we average each measurement across 5 trials on the SD tasks only, using 1000 validation images per task. To produce the test results, we average across 10 trials, drawing 10 new training images and 1000 different unseen test images for each trial. The code to generate samples for SVRT tasks is publicly available at http://www.idiap.ch/~fleuret/svrt/. Next, we discuss the details of tuning each classifier.
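The trial-averaged evaluation protocol can be sketched as follows. This is a minimal illustration and not the code used in our experiments: `sample_task` and `fit_predict` are hypothetical callbacks, and the toy sampler and classifier merely stand in for an SVRT generator and a real model.

```python
import numpy as np


def evaluate_few_shot(sample_task, fit_predict, n_trials=10, n_train=10,
                      n_test=1000, seed=0):
    """Mean classification accuracy over independent trials.

    sample_task(rng, n) -> (X, y): hypothetical sampler returning n labelled samples.
    fit_predict(X_train, y_train, X_test) -> predicted labels for X_test.
    """
    rng = np.random.default_rng(seed)
    accuracies = []
    for _ in range(n_trials):
        # Fresh training and test sets are drawn for every trial.
        X_train, y_train = sample_task(rng, n_train)
        X_test, y_test = sample_task(rng, n_test)
        y_pred = fit_predict(X_train, y_train, X_test)
        accuracies.append(np.mean(y_pred == y_test))
    return float(np.mean(accuracies))


# Toy stand-in for an SVRT sampler: the label is the sign of the first feature.
def toy_task(rng, n):
    X = rng.normal(size=(n, 4))
    y = (X[:, 0] > 0).astype(int)
    return X, y


# Toy classifier that thresholds the first feature and ignores the training set.
def toy_classifier(X_train, y_train, X_test):
    return (X_test[:, 0] > 0).astype(int)


accuracy = evaluate_few_shot(toy_task, toy_classifier)
```

The defaults mirror the test setting described above (10 training samples, 1000 test samples, 10 trials); for tuning, `n_trials=5` would be passed instead.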

Transfer Learning. For this model, we use the VGG16 architecture pre-trained on ImageNet [24], as provided through Keras [3]. We extract features from the last set of convolutions and classify them with k-nearest neighbors. The value of k was chosen from a set of candidate values, and the same best value was found for all feature types.
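The classification stage of this model can be illustrated with a small NumPy sketch. It assumes the convolutional features have already been extracted and flattened into vectors, uses Euclidean distance (the distance metric is an assumption, not stated above), and handles binary labels only. The default `k=1` is a placeholder, not the tuned value.

```python
import numpy as np


def knn_predict(train_feats, train_labels, test_feats, k=1):
    """Majority vote among the k nearest training features (Euclidean distance)."""
    # Pairwise squared distances, shape (n_test, n_train).
    diffs = test_feats[:, None, :] - train_feats[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    # Indices of the k closest training samples for each test sample.
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = train_labels[nearest]  # shape (n_test, k)
    # Binary labels in {0, 1}: predict 1 only when more than half the votes are 1,
    # so a tie (possible for even k) falls back to class 0.
    return (votes.mean(axis=1) > 0.5).astype(int)


# Toy feature vectors standing in for flattened VGG16 activations.
train_feats = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
train_labels = np.array([0, 0, 1, 1])
test_feats = np.array([[0.2, 0.1], [4.8, 5.5]])

pred_k1 = knn_predict(train_feats, train_labels, test_feats, k=1)
pred_k3 = knn_predict(train_feats, train_labels, test_feats, k=3)
```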

Prototypical Networks. We use a publicly available implementation (https://github.com/orobix/Prototypical-Networks-for-Few-shot-Learning-PyTorch) with the same architecture for the embedding stage as in [25] and [13]. Three training samples are randomly chosen from each class for its support set. The number of epochs was chosen from a set of candidate values to maximize the average score on the validation SD problems. When training on raw images, 32 epochs performed best; when using the spectral features, 64 epochs performed best.
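For reference, the classification rule of prototypical networks [25] can be sketched in a few lines of NumPy. The sketch assumes embeddings have already been computed by the embedding network, which is omitted here; the function name and toy data are illustrative only and are not taken from the implementation linked above.

```python
import numpy as np


def prototype_predict(support_emb, support_labels, query_emb):
    """Assign each query to the class with the nearest mean support embedding."""
    classes = np.unique(support_labels)
    # One prototype per class: the mean of that class's support embeddings.
    prototypes = np.stack(
        [support_emb[support_labels == c].mean(axis=0) for c in classes]
    )
    # Squared Euclidean distance from every query to every prototype.
    diffs = query_emb[:, None, :] - prototypes[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    return classes[np.argmin(dists, axis=1)]


# Three support samples per class, matching the support-set size described above.
support_emb = np.array(
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [4.0, 4.0], [4.0, 5.0], [5.0, 4.0]]
)
support_labels = np.array([0, 0, 0, 1, 1, 1])
query_emb = np.array([[0.5, 0.5], [4.0, 4.5]])

query_pred = prototype_predict(support_emb, support_labels, query_emb)
```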

MAML. We use a publicly available implementation (https://github.com/katerakelly/pytorch-maml) with the same architecture and hyperparameters as in the supervised Omniglot experiments in [4]. However, we choose the number of training epochs from a set of candidate values to maximize the average score across the validation SD problems. When training on raw images, 8 epochs performed best; when using the spectral features, 32 epochs performed best.

Relational Networks. We use a publicly available implementation (https://github.com/kimhc6028/relational-networks) with the same architecture used for CLEVR in [22]. However, since this model was originally designed for tasks in which images are paired with textual questions, we modify it by removing the components associated with question processing so that it can be applied to SVRT tasks. We select the number of training epochs and the learning rate from sets of candidate values. When training on raw images, 128 epochs were selected; when training on spectral features, 256 epochs were selected, with the learning rate tuned separately in each case.

When optimizing the two parameters of the percentile filter for each classifier, we chose each from a set of candidate values. The optimal values of the two parameters were identified to be 10 and 0.2, respectively, for all classifiers except the transfer learning model, which used a different setting.
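The selection rule used throughout this appendix, picking the setting with the highest validation accuracy averaged over SD tasks and trials, can be sketched as a plain grid search. Everything named here is illustrative: `validation_accuracy` is a hypothetical callback, and the grid values, task names, and toy scoring function are placeholders rather than the candidates actually searched.

```python
import itertools

import numpy as np


def select_best(grid, validation_accuracy, tasks, n_trials=5):
    """Return the grid point with the highest accuracy averaged over tasks and trials.

    grid: dict mapping a parameter name to its candidate values.
    validation_accuracy(params, task, trial) -> float: hypothetical callback that
    trains on one trial's samples and scores on that task's validation images.
    """
    names = list(grid)
    best_params, best_score = None, -np.inf
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [
            validation_accuracy(params, task, trial)
            for task in tasks
            for trial in range(n_trials)
        ]
        mean_score = float(np.mean(scores))
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Placeholder grid and a toy score that peaks at epochs=32, lr=0.01.
toy_grid = {"epochs": [8, 32, 128], "lr": [0.1, 0.01]}


def toy_accuracy(params, task, trial):
    return 1.0 - abs(params["epochs"] - 32) / 256 - abs(params["lr"] - 0.01)


best_params, best_score = select_best(toy_grid, toy_accuracy, ["task_a", "task_b"])
```

Averaging over both tasks and trials before comparing settings yields a single configuration per classifier and feature type, which is how the values reported above are stated.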