Dilated DenseNets for Relational Reasoning

by Antreas Antoniou, et al.

Despite their impressive performance in many tasks, deep neural networks often struggle at relational reasoning. This has recently been remedied with the introduction of a plug-in relational module that considers relations between pairs of objects. Unfortunately, this is combinatorially expensive. In this extended abstract, we show that a DenseNet incorporating dilated convolutions excels at relational reasoning on the Sort-of-CLEVR dataset, allowing us to forgo this relational module and its associated expense.









1 Introduction

Recent advances in evolution (Darwin, 1909) have allowed humans to excel at image classification (e.g. identifying a furry creature as a Shih Tzu) and relational reasoning (e.g. noticing that Tom Cruise is much shorter than Dwayne “The Rock” Johnson) among other things. Deep Neural Networks similarly excel at image classification but appear to fall short on relational reasoning tasks (Johnson et al., 2017).

In Santoro et al. (2017), the authors present a simple plug-in module that can be appended to existing network architectures to form a relation network. These achieve state-of-the-art performance on various relational reasoning tasks. They postulate that the inclusion of this flexible module allows the convolutional parts of the network to focus on processing local spatial information. Unfortunately, this module is combinatorially expensive, as it performs operations on pairs of features output by a convolutional neural network (CNN).

In this work, we show that this module can be avoided by simply incorporating dilated convolutions (Yu & Koltun, 2016) into a powerful CNN architecture, such as a DenseNet (Huang et al., 2017). These Dilated DenseNets exhibit comparable performance to relation networks on the Sort-of-CLEVR (Santoro et al., 2017) dataset without the need for a relational reasoning module.

Our contributions are as follows:

  • We introduce a modification to DenseNets that enables multi-scale feature learning and relational reasoning.

  • We empirically show that this Dilated DenseNet can achieve strong relational reasoning results without the need for an explicit relational module.

  • We show that an explicit relational module is redundant, allowing us to avoid the computational cost of training a model that contains one.

2 Background


In this work we train our models to solve SORT-of-CLEVR. This dataset — introduced in Santoro et al. (2017) — consists of 10,000 images each containing squares and circles in random locations. These shapes can be one of six colours. Given an image, the task is to answer a question which can be relational (e.g. What shape is the object closest to the blue object?) or non-relational (e.g. What shape is the red object?). There are 10 relational and 10 non-relational questions in total.

Relation networks

We benchmark against relation networks (Santoro et al., 2017). These consist of a CNN, an LSTM (Hochreiter & Schmidhuber, 1997), and a relational module. The image in each image-question pair is passed through the CNN to produce a set of objects: one for each 2D spatial location of the CNN output, with a corresponding feature vector given by the values of each channel at that location. The question is fed into the LSTM, and its output is appended to each vector of object-pair features. Finally, the relational module passes all of these vectors through an MLP and sums the results to produce an answer. All weights are learnt in an end-to-end manner.
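The combinatorial expense of this design comes from the pairing step: a d×d feature map yields d² objects, and the module evaluates its MLP on every ordered pair, i.e. d⁴ pairs. A minimal sketch of this cost (the function name and self-pair-inclusive counting are our illustrative assumptions, not details from the paper):

```python
def relational_module_pairs(feature_map_size):
    """Count the pairwise MLP evaluations a relation network performs.

    A d x d CNN output gives d**2 objects; the relational module
    considers every ordered pair of objects, i.e. d**4 pairs in total
    (counting self-pairs, one common formulation).
    """
    num_objects = feature_map_size ** 2
    return num_objects ** 2

# Doubling the spatial resolution multiplies the pairwise work by 16.
assert relational_module_pairs(8) == 4_096    # 8x8 feature map
assert relational_module_pairs(16) == 65_536  # 16x16 feature map
```

This quartic growth in the feature-map side length is what the paper seeks to avoid.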

Dilated Convolutions

Dilated convolutions are effectively convolutions with expanded receptive fields. For example, in a standard convolution with 3×3 kernels and a stride of 1, each filter scans the image in 3×3 regions of adjacent pixels. This filter has dilation 1: centre-to-centre, there is a 1-pixel distance between each filtered pixel and its nearest neighbour. Now consider the case where there is a 2-pixel distance: the filter is only applied to pixels that are in both odd-numbered rows and columns of each 5×5 region. This is dilation 2. For dilation 3, the filter is applied only to pixels in every third row and column of each 7×7 region, and so on. These are illustrated in Figure 1. These dilations allow a model to learn higher-order abstractions without the need for dimensionality reduction. They are frequently used in segmentation networks (Yu & Koltun, 2016; Romera et al., 2017a, b), but can also be used for model compression (Crowley et al., 2018) and audio generation (van den Oord et al., 2016), among other things.

Figure 1: An illustration showing the effect of increasing the dilation of a filter. For a given image patch, each coloured square corresponds to the locations at which the filter is placed. Notice that this has the effect of increasing the receptive field of the filter as dilation is incremented.
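The geometry above can be checked with a short sketch (function names are ours): a kernel of size k with dilation d samples positions spaced d apart, so it covers a region of width d·(k−1)+1.

```python
def dilated_offsets(kernel_size, dilation):
    """Pixel offsets (along one axis) at which a dilated filter samples."""
    return [i * dilation for i in range(kernel_size)]

def receptive_field(kernel_size, dilation):
    """Width of the region covered by a dilated kernel:
    dilation * (kernel_size - 1) + 1."""
    return dilation * (kernel_size - 1) + 1

# A 3-tap filter with dilation 1 reads adjacent pixels (0, 1, 2);
# with dilation 2 it skips every other pixel (0, 2, 4) over a 5-wide region;
# with dilation 3 it reads every third pixel (0, 3, 6) over a 7-wide region.
assert dilated_offsets(3, 1) == [0, 1, 2]
assert dilated_offsets(3, 2) == [0, 2, 4]
assert receptive_field(3, 1) == 3
assert receptive_field(3, 2) == 5
assert receptive_field(3, 3) == 7
```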


DenseNets

DenseNets (Huang et al., 2017) are a powerful family of architectures that encourage feature reuse. They consist of a series of repeating convolutional blocks; the outputs of earlier blocks are concatenated and form the input to later blocks. Typically this block contains either (i) a single 3×3 convolution, usually referred to as a basic block, or (ii) a 1×1 bottleneck convolution followed by a 3×3 convolution (a bottleneck block). Each block has the same number of output channels; this is the network's growth rate — it controls the rate at which the network expands. A standard DenseNet comprises 3 dense stages, each consisting of a number of blocks. After each stage, a transition layer is applied, which carries out spatial dimensionality reduction using an average pooling layer and channel-wise dimensionality reduction using a 1×1 bottleneck convolution.
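The concatenation scheme means each block's input width grows linearly with depth. A sketch of the channel bookkeeping (the function name and the 64-channel starting point are illustrative assumptions; the growth rate of 32 matches the model used later):

```python
def dense_stage_channels(in_channels, growth_rate, num_blocks):
    """Input channel count seen by each block in one dense stage.

    Every block emits `growth_rate` channels, which are concatenated
    onto its input before being fed to the next block.
    """
    channels = [in_channels]
    for _ in range(num_blocks):
        channels.append(channels[-1] + growth_rate)
    return channels

# A 5-block stage with growth rate 32, starting from 64 channels,
# feeds its blocks 64, 96, 128, 160, 192 channels and hands
# 224 channels to the transition layer.
assert dense_stage_channels(64, 32, 5) == [64, 96, 128, 160, 192, 224]
```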

3 Model

Our proposed model consists of a standard DenseNet utilising basic blocks, modified so that the per-dense-stage convolutions follow an exponentially increasing dilation rate. The dilation rate assignment scheme can be expressed as:

d_{j,i} = 2^{i-1},

where d_{j,i} is the dilation rate of the convolution at the i-th block (for i = 1, …, N) of the j-th dense stage. For example, if each stage consists of 5 blocks, each consisting of 1 convolutional layer, then those layers will have dilation rates of 1, 2, 4, 8, 16 respectively.
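This per-stage assignment, which matches the 5-block example in the text, can be sketched as (function name is ours):

```python
def stage_dilations(num_blocks):
    """Exponentially increasing dilation rates within one dense stage:
    block i (1-indexed) gets dilation 2**(i - 1)."""
    return [2 ** (i - 1) for i in range(1, num_blocks + 1)]

# The 5-block example from the text:
assert stage_dilations(5) == [1, 2, 4, 8, 16]
```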

Specifically, we use a 16-layer DenseNet with a growth rate of 32. A compression factor of 1 is used for the transition layers.

4 Experiments

In this section we run experiments on the SORT-of-CLEVR dataset to test the relational and non-relational accuracy of different models, including our Dilated DenseNet.

We first reproduce the original relation network from Santoro et al. (2017). It consists of a standard 4-layer convolutional network with batch normalisation and ReLU activation functions, followed by a relational module and a final softmax layer. This is denoted as CNN + RN in Table 1. We also train the same network with the relational module replaced by a 2-layer MLP (CNN + MLP). We can see from Table 1 that while both networks succeed at non-relational reasoning, the network with a relational module attains a significantly higher relational accuracy than the one without it (87.7% vs. 65.4%). It is practically a required component in this case.

We compare these to our network consisting of a Dilated DenseNet and a 2-layer MLP (Dilated DenseNet + MLP). Notice that it attains a relational accuracy of 83.7%, a few points shy of that achieved by the relation network — and a gigantic step up from CNN + MLP — without any need for a relational reasoning module.

Finally, we train a DenseNet without dilations (DenseNet + MLP) to determine whether the dilations are required for relational reasoning. It transpires that they are: not only does this network fail to perform as well as its dilated counterpart (it is only marginally better than CNN + MLP at relational reasoning), it also exhibits very high deviation in performance across three independent training runs, particularly on non-relational questions.

Implementation Details

All networks are trained using the Adam optimiser (Kingma & Ba, 2015) with default momenta; the initial learning rate is cosine annealed over 250 epochs. For the DenseNets we use weight decay and a dropout rate of 0.2 in every layer. For the standard CNNs we use no weight decay and a dropout rate of 0.5 between the 2 layers of the MLP. After each epoch, we evaluate our model on a validation set. The model that performs best on the validation set across a training run is then evaluated on the test set. We perform three independent runs per network, with different seeds for the model initialisation and the data provider.
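The cosine-annealing schedule has the standard form below; the initial and final rates here are placeholders (the paper's exact values were not recoverable from this copy), so treat this as a sketch of the schedule shape only:

```python
import math

def cosine_annealed_lr(epoch, total_epochs, lr_initial, lr_final=0.0):
    """Cosine-anneal the learning rate from lr_initial down to lr_final.

    lr(t) = lr_final + 0.5 * (lr_initial - lr_final) * (1 + cos(pi * t / T))
    """
    progress = epoch / total_epochs
    return lr_final + 0.5 * (lr_initial - lr_final) * (1 + math.cos(math.pi * progress))

# Starts at the initial rate and decays smoothly to the final rate.
assert cosine_annealed_lr(0, 250, 1e-3) == 1e-3
assert abs(cosine_annealed_lr(250, 250, 1e-3)) < 1e-12
```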

Model                     Non-relational acc.   Relational acc.   Combined acc.
CNN + MLP                                       65.4%
CNN + RN                                        87.7%
DenseNet + MLP
Dilated DenseNet + MLP                          83.7%

Table 1: Results on the test set of SORT-of-CLEVR. We display accuracy on the non-relational questions, accuracy on the relational questions, and the combination thereof. Means and standard deviations are given across three independent runs. CNN + MLP is a simple 4-layer neural network with an MLP. CNN + RN is the same neural network with a relational reasoning module. Our proposed network (Dilated DenseNet + MLP) achieves strong relational reasoning results, very close to those of CNN + RN, without the need for an explicit relational reasoning module. We also compare this to its undilated counterpart (DenseNet + MLP) to show that the dilations are indeed necessary.

5 Conclusion

Relational reasoning is an important task, one at which neural networks were believed to fail without the addition of an expensive, tailored module. In this work we have demonstrated that this is not the case: by taking a powerful network architecture and incorporating dilations, we are able to forgo this module and its associated costs. Future work could entail applying our network to further relational reasoning tasks (Weston et al., 2015; Johnson et al., 2017).