1 Definitions
Receptive Field (RF): a local region (including its depth) on the output volume of the previous layer that a neuron is connected to. The term has been prevalent in neuroscience since the study of Hubel and Wiesel [1], in which they suggested that local features are detected in early areas of the visual cortex and are then progressively combined to create more complex patterns in a hierarchical manner.
As an example, assume that the input RGB image to a CNN has size W x H x 3. For a filter of size f x f, each neuron in the first convolutional layer will be connected to an f x f x 3 region of the input volume. Thus, a total of f * f * 3
weights (+1 bias parameter) needs to be learned. Notice that the RF is a 3D tensor whose depth equals the depth of the volume in the previous layer. Here, for simplicity, we discard the depth in our calculations.
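The parameter count above can be sketched as a small helper; the function name and the 5 x 5 example below are illustrative assumptions, not taken from the original:

```python
def conv_params(filter_size: int, in_depth: int = 3) -> int:
    """Learnable parameters for one conv filter: one weight per position
    in its f x f x depth receptive field, plus one bias term."""
    return filter_size * filter_size * in_depth + 1

# e.g., a 5x5 filter over an RGB input: 5*5*3 weights + 1 bias = 76
print(conv_params(5))  # 76
```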
Effective Receptive Field (ERF): the area of the original image that can possibly influence the activation of a neuron. One important point to note here is that the RF and ERF are the same for the first convolutional layer. However, they differ as we move along the CNN hierarchy. The RF is simply the filter size over the previous layer, but the ERF traces the hierarchy back to the input image and indicates the extent of the input image that can modulate the activity of a neuron. Here, we focus on ERF calculation. It is worth noting that ERF and RF are sometimes used interchangeably (and hence confused) in the computer vision community.
Projective Field (PF): the set of neurons to which a neuron projects its output [3].
Figure 1 illustrates these definitions.
2 Calculating the ERF
In convolutional neural networks [2], the ERF of a neuron indicates which area of the input image is considered by a filter. Calculating the size of the ERF helps in choosing suitable filter sizes using domain knowledge, enhancing the performance of CNNs.
There are two ways to calculate the ERF size: 1) bottom-up, and 2) top-down. Both ways produce the same result; the intermediate values, however, have different meanings in each approach.
2.1 Bottom-Up Approach
The bottom-up approach calculates the ERF of a neuron at layer k projected onto the input image. Let r_k be the ERF of a neuron at layer k. Given the ERF of the previous layer, r_{k-1}, where r_0 = 1 is the ERF at the input image layer, the ERF for a neuron at the current layer can be computed by adding the non-overlapped area A to r_{k-1}:
r_k = r_{k-1} + A    (1)
Let f_k represent the filter size of layer k. There are f_k - 1 filters that overlap with each other. Since a filter can be convolved with a stride greater than one, the stride can significantly increase the non-overlapped area. Thus, it is necessary to account for the number of pixels each extra filter contributes to the ERF. Since the strides of the lower layers also affect the ERF of the higher layer, the pixel contributions of all lower layers must be accumulated. Therefore, the non-overlapped area is calculated as:
A = (f_k - 1) ∏_{i=1}^{k-1} s_i    (2)
where s_i is the stride of layer i. Combining Equations 1 and 2, the ERF can be computed as:
r_k = r_{k-1} + (f_k - 1) ∏_{i=1}^{k-1} s_i    (3)
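As a sketch, the bottom-up recurrence of Equation 3 can be implemented directly; the function name and the example layer configuration are illustrative assumptions:

```python
def erf_bottom_up(filters, strides):
    """Bottom-up ERF (Equation 3): start from r_0 = 1 at the input image;
    layer k adds (f_k - 1) times the product of all lower-layer strides."""
    assert len(filters) == len(strides)
    erf = 1   # r_0: a single pixel of the input image
    jump = 1  # running product of strides s_1 ... s_{k-1}
    for f, s in zip(filters, strides):
        erf += (f - 1) * jump
        jump *= s
    return erf

# Illustrative stack: three 3x3 convolutions with stride 1
print(erf_bottom_up([3, 3, 3], [1, 1, 1]))  # 7
```

Because the running stride product is carried along, the ERF of every intermediate layer falls out of the same single pass.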
2.2 Top-Down Approach
In this approach, the ERF is computed by calculating the RF of a neuron at layer k projected onto a lower layer j; the RF projected all the way down to the input layer is the ERF. Let R_j denote the RF of the layer-k neuron projected onto layer j. Given the RF at the higher layer, R_{j+1}, if there is no overlap (i.e., the stride equals the filter size), then the RF at the current layer is:
R_j = R_{j+1} f_{j+1}    (4)
where f_{j+1} is the filter size of the higher layer. The RF is 1 when j = k (a single neuron). When the filters overlap with each other, the overlapped area must be subtracted from this value. Imagine placing down a filter; every filter placed after it overlaps with the previously placed filter. Since the RF of the higher layer is the one being projected down, the number of filters that overlap with each other, N_o, is simply the RF of the higher layer minus one:
N_o = R_{j+1} - 1    (5)
Each subsequent filter placed down is shifted by the stride, so the overlapped area depends on the filter size and the stride. Larger strides yield less overlap; larger filters result in more overlap. The overlapped area O of each filter is the difference between the filter size and the stride of the higher layer:
O = f_{j+1} - s_{j+1}    (6)
Having the number of overlapping filters and the area each filter overlaps, the RF at the current layer can be computed by combining Equations 4, 5, and 6:
R_j = R_{j+1} f_{j+1} - (R_{j+1} - 1)(f_{j+1} - s_{j+1})    (7)
Expanding and simplifying the above equation gives the final top-down equation:
R_j = s_{j+1} (R_{j+1} - 1) + f_{j+1}    (8)
The top-down approach is helpful during analysis, as it can be computed relatively quickly. Also, given a point on a feature map, it is possible to identify the nodes that contributed to its output. For deconvolutional networks, the top-down approach can be used to control the resolution of the output image. Thus, instead of using inverted CNN layers, the deconvolutional layers can be designed to incorporate any domain knowledge about the problem. Figures 4 and 5 show examples of the progression of the RF being projected back to lower layers.
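The top-down recursion of Equation 8 can likewise be sketched in a few lines; the helper name and the example layers are assumptions for illustration. Starting from R = 1 for a single neuron at the top layer, each step projects the RF one layer down:

```python
def rf_top_down(filters, strides):
    """Top-down RF (Equation 8): start from R_k = 1 for a single neuron
    at the top layer and project down via R_j = s * (R_{j+1} - 1) + f."""
    rf = 1
    # walk the layers from the top of the stack back toward the input
    for f, s in zip(reversed(filters), reversed(strides)):
        rf = s * (rf - 1) + f
    return rf

# Same illustrative stack as before: three 3x3 convolutions, stride 1
print(rf_top_down([3, 3, 3], [1, 1, 1]))  # 7
```

Run over the full stack, this yields the same number as the bottom-up recurrence; to obtain the ERF of an intermediate layer k, the recursion is simply run over the first k layers only.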
2.3 Case Study
Here, we calculate the ERF of neurons for the CNN from Wei et al. [4]. In their paper, Wei et al. proposed a method for pose estimation (known as the Convolutional Pose Machine). Figure 6 shows the original architecture. Here, we focus on calculating the ERF for the part of the network shown in Figure 7, with the 1x1 filters omitted.
Bottom-up approach: The ERF of each layer is computed progressively, skipping the 1x1 filters as they do not have any effect on the ERF size. The process of computing the ERF of the architecture in Figure 7, according to Equation 3, is shown below:
(9) 
Top-down approach: Here, the ERF is calculated for each layer separately, so for a network with n layers, n passes back to the image are needed. In other words, intermediate values cannot be reused. The process of computing the ERF for the architecture in Figure 6, according to Equation 8, is shown below, where each layer's ERF is its RF projected back to the input image. Notice that a separate computation is needed for each layer's ERF.
(10) 
3 Projective field size
In this section, we discuss the calculation of the PF size of a neuron. For the example in Section 1, assuming 10 filters of size 5 x 5 and a stride equal to 5 in the first convolutional layer, the PF of each image pixel (i.e., input neuron, in each of the R, G, or B channels) would be 1 x 1 x 10. Notice that this calculation is independent of the filter size but, as we will show below, depends on the stride size. Further, notice that, as in the calculation of the RF, there is a depth component involved as well. For simplicity, we discard the depth in what follows.
The size of the projective field can be calculated by sliding the filter over an area and updating the counter of each neuron when it overlaps with the filter (Figures 10 and 11). However, sliding the filter is error-prone, and it is difficult to keep track of the values in the x and y directions. A simpler method to determine the projective field is shown in Figures 8 and 9.
With a stride of 1, the immediate PF is the same size as the filter of the next layer. For example, if the filter size is f x f, then a neuron will influence f x f nodes in the output feature map. For the above example, assuming 10 filters in the first convolutional layer, the PF of each image pixel (i.e., input neuron, in each R, G, or B channel) would be f x f x 10. The pixels at the corners and edges would have slightly smaller PFs. Here, for simplicity, we assume that the input image has been zero-padded.
With a stride greater than 1, some neurons will have bigger PFs than other neurons. For example, for a filter size of 5 x 5 and a stride of 2, the center neuron (see Figure 8) has a PF of 3 x 3. The neurons on the x-axis or y-axis of the center neuron would have PFs of 3 x 2 or 2 x 3, respectively. The neurons diagonal to the center neuron would have PF sizes of 2 x 2.
From the above analysis (see Figures 9, 10, and 11), the projective field of a node at layer k of a CNN is bounded by a set of four pairs of values:
PF ∈ { (⌈f/s⌉, ⌈f/s⌉), (⌈f/s⌉, ⌊f/s⌋), (⌊f/s⌋, ⌈f/s⌉), (⌊f/s⌋, ⌊f/s⌋) }    (11)
where f and s are the filter size and stride of the next layer. According to Equation 11, if the remainder of f/s is zero, then all nodes have equal projective field sizes. Otherwise, depending on the locations of the nodes, the projective field sizes differ. In other words, when f/s does not yield an integer value, there is a disparity in the influence of the nodes on the next layer. This is perhaps why researchers tend not to use strides greater than 1 in convolutional layers (or use strides equal to the filter size in pooling layers). Nonetheless, it is unclear whether such disparity causes any practical problems.
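As a sanity check on Equation 11, a short one-dimensional simulation can slide a filter across an input row and count how often each position is covered; the function name and the 11-pixel input length are illustrative assumptions:

```python
def pf_sizes(n, f, s):
    """For n input positions along one dimension, count how many output
    neurons of a conv layer with filter size f and stride s cover each."""
    counts = [0] * n
    start = 0
    while start + f <= n:            # slide the filter by the stride
        for i in range(start, start + f):
            counts[i] += 1
        start += s
    return counts

# Filter 5, stride 2: away from the borders, each position is covered
# either floor(5/2) = 2 or ceil(5/2) = 3 times, matching Equation 11
print(pf_sizes(11, 5, 2))
```

Positions near the borders are covered fewer times, mirroring the smaller PFs of corner and edge pixels noted above.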
Deconv nets are inverted versions of CNNs. Therefore, their projective field can be calculated using the ERF formulas. Similarly, their ERF is the same as the projective field in CNNs.
4 Discussion
Here, we discussed how the receptive, effective receptive, and projective fields of neurons in CNNs are calculated. Understanding these quantities has important implications for deep learning research. First, it helps set parameters such as filter size, number of filters, and stride more effectively. Second, it allows analyzing how objects are represented by CNNs and investigating whether some image information is lost along the CNN hierarchy.
5 Acknowledgment
We wish to thank all participants in the Advanced Computer Vision course at UCF who contributed to discussions.
References
 [1] D. H. Hubel and T. N. Wiesel. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. The Journal of physiology, 160(1):106–154, 1962.
 [2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [3] S. R. Lehky and T. Sejnowski. Network model of shape-from-shading: neural function arises from both receptive and projective fields. Nature, 333(6172):452–454, 1988.
 [4] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724–4732, June 2016.