In Euclidean space, the geometric definition of the inner product is the product of the Euclidean lengths of the two vectors and the cosine of the angle between them. That is, $P(w, x) = \|w\|_2 \|x\|_2 \cos\theta$, where $P(w, x)$ denotes the inner product of the weight vector $w$ and the input vector $x$, $\|\ast\|_2$ the Euclidean length of vector $\ast$, and $\theta$ the angle between $w$ and $x$. From this formulation, it is clear that $\theta$ has a significant impact on the dynamics of neural networks. The gradient of $P(w, x)$ with respect to $\theta$ is $-\|w\|_2 \|x\|_2 \sin\theta$, which has several limitations during backpropagation. First, this gradient becomes smaller and smaller as $\theta$ gets closer to 0 or $\pi$. Second, as $\theta$ is a function of the unit vectors of $w$ and $x$, the smaller gradient of $\theta$ hampers the update of the direction of the weight vector $w$. Finally, it also discounts the direction gradient of $x$ and weakens the gradient flow to the downstream. As a result, the optimization becomes more and more difficult as the training of neural networks progresses. Several recent investigations of backpropagation [4, 42] focus on modifying the gradient of the activation function. However, few studies propose variants of the backpropagation of the inner product function.
In this paper, we propose the PR Product, a substitute for the inner product, which changes the backpropagation of the inner product while keeping the same forward propagation. From the perspective of vector orthogonal decomposition, the vector $x$ can be decomposed into the vector projection $P_x$ on $w$ and the vector rejection $R_x$ from $w$, as shown in Figure 1. We prove that the conventional inner product of $w$ and $x$ only contains the information of $P_x$, while the proposed Projection and Rejection Product (PR Product) involves the information of both the vector projection $P_x$ and the vector rejection $R_x$. We further analyze the gradients of $\theta$ and $w$ in the PR Product, proving that the length of the direction gradient of $w$ is changed from $\|x\|_2 |\sin\theta|$ in the conventional inner product to $\|x\|_2$ in the PR Product, as shown in Figure 1.
There are several advantages of using the PR Product: (a) Compared with the behavior of the conventional inner product, the length of the direction gradient of $w$ is always larger and independent of $\theta$; (b) As the PR Product maintains the linear property, it can be an honest substitute for the inner product operation in the fully connected layer, convolutional layer, and recurrent layer. By honest, we mean it does not introduce any additional parameters and matches the original configurations such as the activation function, batch normalization, and dropout operation; (c) As we show in our experiments, the PR Product can robustly improve the performance of many models on multiple applications.
We showcase the effectiveness of PR Product on image classification and image captioning tasks. For both tasks, we replace all the fully connected layers, convolutional layers and recurrent layers of the backbone models with their PR Product version. Experiments on image classification demonstrate that the PR Product can typically improve the accuracy of the state-of-the-art classification models. Moreover, our analysis on image captioning confirms that the PR Product can change the dynamics of neural networks. Without any tricks of improving the performance, like scene graph and ensemble strategy, our PR Product version of captioning model achieves results on par with the state-of-the-art models.
In summary, the main contributions of this paper are:
- We propose the PR Product, a substitute for the inner product of the weight vector $w$ and the input vector $x$ in neural networks, which involves the information of both the vector projection $P_x$ and the vector rejection $R_x$ while keeping the forward propagation identical;
- We develop PR-FC, PR-CNN, and PR-LSTM, which apply the PR Product to the fully connected layer, convolutional layer and LSTM layer respectively;
- Our experiments on image classification and image captioning suggest that the PR Product is generally useful and can become a basic operation of neural networks.
2 Related Work
Variants of Backpropagation. Several recent investigations have considered variants of standard backpropagation. In particular, [22] presents a surprisingly simple backpropagation mechanism that assigns blame by multiplying error signals with random weights, instead of the synaptic weights on each neuron, and further downstream. [2] exhaustively considers many Hebbian learning algorithms. The straight-through estimator proposed in [4] heuristically copies the gradient with respect to the stochastic output directly as an estimator of the gradient with respect to the sigmoid argument. [42] proposes Linear Backprop, which backpropagates error terms only linearly. Different from these methods, our proposed PR Product changes the local gradients of weights during backpropagation while maintaining the identical forward propagation.
Image Classification. Deep convolutional neural networks have become the dominant machine learning approaches for image classification. To train very deep networks, shortcut connections have become an essential part of modern networks. For example, Highway Networks [33, 34] present shortcut connections with gating functions, while variants of ResNet [11, 12, 44, 36] use identity shortcut connections. DenseNet [13], a more recent network with several parallel shortcut connections, connects each layer to every other layer in a feed-forward fashion.
Image Captioning. Early approaches [7, 20] generate caption templates whose slots are filled in by the outputs of object detection, attribute classification and scene recognition, which results in captions that sound unnatural. Recently, inspired by advances in the NLP field, models based on the encoder-decoder architecture [17, 16, 35, 14] have achieved striking advances. These approaches typically use a pretrained CNN model as the image encoder, combined with an RNN decoder trained to predict the probability distribution over a set of possible words. To better incorporate the image information into the language processing, visual attention for image captioning was first introduced by [37], which allows the decoder to automatically focus on the image subregions that are important for the current time step. Because of the remarkable improvement in performance, many extensions of the visual attention mechanism [43, 5, 39, 9, 27, 1] have been proposed to push the limits of this framework for caption generation tasks. Apart from those extensions of the visual attention mechanism, several attempts [31, 26] have been made to adapt reinforcement learning to address the discrepancy between the training and the testing objectives for image captioning. More recently, some methods [40, 15, 25, 38] exploit scene graphs to incorporate visual relationship knowledge into captioning models for better descriptive abilities.
3 The Projection and Rejection Product
In this section, we begin by briefly revisiting the inner product of the weight vector $w$ and the input vector $x$ from the perspective of vector orthogonal decomposition. Then we formally propose the Projection and Rejection Product (PR Product), which involves the information of both the vector projection of $x$ onto $w$ and the vector rejection of $x$ from $w$. Moreover, we analyze the local gradient of the weight vector $w$ in the PR Product. Finally, we show the implementation of the PR Product and develop PR-FC, PR-CNN, and PR-LSTM. In the following, for simplicity of derivation, we only consider a single input vector $x$ and a single weight vector $w$, except for the last subsection.
3.1 Revisit the Inner Product in Neural Networks
In Euclidean space, the inner product of the two Euclidean vectors $w$ and $x$ is defined by:
$$P(w, x) = \|w\|_2 \|x\|_2 \cos\theta \tag{1}$$
where $\|\ast\|_2$ is the Euclidean length of vector $\ast$, and $\theta$ is the angle between $w$ and $x$. From this formulation, we can observe that the angle $\theta$ explicitly affects the state of neural networks.
The gradient of $P$ with respect to $\theta$ is:
$$\frac{\partial P}{\partial \theta} = -\|w\|_2 \|x\|_2 \sin\theta \tag{2}$$
When $\theta$ is close to 0 or $\pi$, this gradient is close to 0. We argue that this is one of the reasons that the optimization becomes more and more difficult as the training progresses.
From the perspective of vector orthogonal decomposition, the vector $x$ can be decomposed into the vector projection on $w$ and the vector rejection from $w$, as shown in Figure 1. The former is the orthogonal projection of $x$ onto $w$, and the latter is the orthogonal projection of $x$ onto the hyperplane orthogonal to $w$. We denote the vector projection of $x$ onto $w$ by $P_x$ and the vector rejection of $x$ from $w$ by $R_x$. The length of $P_x$ is:
$$\|P_x\|_2 = \|x\|_2 |\cos\theta| \tag{3}$$
And the length of $R_x$ is:
$$\|R_x\|_2 = \|x\|_2 |\sin\theta| \tag{4}$$
Equation (1) can then be rewritten as:
$$P(w, x) = \mathrm{sign}(\cos\theta)\, \|w\|_2 \|P_x\|_2 \tag{5}$$
where $\mathrm{sign}(\ast)$ denotes the sign of $\ast$. We can observe that this formulation only contains the information of the vector projection of $x$ on $w$, $P_x$. As shown in Figure 1, the vector projection $P_x$ changes very little when $\theta$ is near 0 or $\pi$. That is the reason for the optimization difficulty from a geometric perspective. Although the length of the rejection vector $R_x$ is small when $\theta$ is close to 0 or $\pi$, it varies greatly and is able to support the optimization of neural networks. That is our basic motivation for the proposed Projection and Rejection Product (PR Product).
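The decomposition above can be checked on a small example. The following is a minimal sketch in plain Python; the helper names (`dot`, `norm`, `decompose`) and the sample vectors are illustrative assumptions, not part of the original method description. It splits an input vector into its projection onto a weight vector and its rejection from it, then compares the lengths and the rewritten inner product with their trigonometric forms.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def decompose(x, w):
    """Split x into its projection onto w and its rejection from w."""
    scale = dot(w, x) / dot(w, w)
    proj = [scale * wi for wi in w]
    rej = [xi - pi for xi, pi in zip(x, proj)]
    return proj, rej

# Illustrative vectors (hypothetical values).
w = [3.0, 0.0]
x = [1.0, 2.0]
p_x, r_x = decompose(x, w)

cos_theta = dot(w, x) / (norm(w) * norm(x))
sign = 1.0 if cos_theta >= 0 else -1.0

print(p_x, r_x)                                            # projection, rejection
print(norm(p_x), norm(x) * abs(cos_theta))                 # ||P_x|| = ||x|| |cos|
print(norm(r_x), norm(x) * math.sqrt(1 - cos_theta ** 2))  # ||R_x|| = ||x|| |sin|
print(dot(w, x), sign * norm(w) * norm(p_x))               # inner product, rewritten form
```

Each printed pair should agree up to floating-point rounding, and the projection and rejection are orthogonal by construction.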
3.2 The PR Product
To take advantage of both the vector projection $P_x$ and the vector rejection $R_x$ while maintaining the linear property, we reformulate the inner product of $w$ and $x$ as follows:
For clarity, we denote by PR the proposed product function. Note that some factors in this formulation are detached from the neural network. By detaching, we mean that a factor is considered a constant rather than a variable during backward propagation. Compared with the conventional inner product formulation (Equation (5) or (1)), this formulation involves not only the information of the vector projection $P_x$ but also that of the vector rejection $R_x$, without any additional parameters. We call this formulation the Projection and Rejection Product, or PR Product for brevity.
The PR Product keeps the same forward propagation as the conventional inner product, which means it maintains the linear property. In the following, we theoretically derive the gradients of $\theta$ and $w$ during backpropagation in the PR Product.
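One form of Equation (6) that is consistent with the properties stated above (it uses both the projection length and the rejection length, detaches some factors, and leaves the forward value unchanged) can be sketched as follows. Here $P_x$ and $R_x$ denote the vector projection and vector rejection of the input $x$ with respect to the weight $w$, and $\theta$ is the angle between them. The overline notation for detached factors is an assumption of this sketch, and the expression should be read as a reconstruction under those constraints rather than a verbatim formula.

```latex
\begin{aligned}
PR(w, x) &= \|w\|_2 \Big[\, \overline{|\sin\theta|}\;\mathrm{sign}(\cos\theta)\,\|P_x\|_2
            + \overline{\cos\theta}\,\big(\|x\|_2 - \|R_x\|_2\big) \Big] \\
         &= \|w\|_2 \|x\|_2 \Big[\, \overline{|\sin\theta|}\,\cos\theta
            + \overline{\cos\theta}\,\big(1 - |\sin\theta|\big) \Big]
\end{aligned}
```

In the forward pass the detached factors keep their values, so the bracket evaluates to $|\sin\theta|\cos\theta + \cos\theta - \cos\theta\,|\sin\theta| = \cos\theta$, and the result equals $\|w\|_2\|x\|_2\cos\theta$, the conventional inner product.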
The gradient of . From Figure 1, we can see that neither the weight vector nor the input vector is the function of . So we just need to calculate the gradients of trigonometric functions except for the detached ones in Equation (6). When is in the range of , the gradient of wrt is:
When is in the range of , the gradient of wrt is:
We use the following unified form to express the above two cases:
Compared with the conventional one (Equation (2)), the PR Product changes the gradient with respect to $\theta$ from a smooth function to a hard one. One advantage of this is that the gradient of $\theta$ does not decrease as $\theta$ gets closer to 0 or $\pi$, providing continuous power for the optimization of neural networks.
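The effect of detaching on the gradient with respect to the angle can be checked numerically. The sketch below uses a tiny forward-mode automatic differentiation class (`Dual`, a hypothetical helper written only for this illustration) and assumes the PR Product takes the form $\|w\|_2\|x\|_2[\overline{|\sin\theta|}\cos\theta + \overline{\cos\theta}(1 - |\sin\theta|)]$, where the overline marks detached factors. That form is an assumption consistent with the text, not a quoted formula, and the norms 2.0 and 3.0 are arbitrary.

```python
import math

class Dual:
    """A value and its derivative with respect to theta (forward-mode AD)."""
    def __init__(self, val, der=0.0):
        self.val, self.der = float(val), float(der)
    def __add__(self, o):
        o = lift(o)
        return Dual(self.val + o.val, self.der + o.der)
    __radd__ = __add__
    def __sub__(self, o):
        o = lift(o)
        return Dual(self.val - o.val, self.der - o.der)
    def __rsub__(self, o):
        return lift(o) - self
    def __mul__(self, o):
        o = lift(o)
        return Dual(self.val * o.val, self.der * o.val + self.val * o.der)
    __rmul__ = __mul__

def lift(v):
    return v if isinstance(v, Dual) else Dual(v)

def detach(d):
    """Keep the value, drop the derivative (treat as a constant)."""
    return Dual(d.val)

def d_sin(d):
    return Dual(math.sin(d.val), math.cos(d.val) * d.der)

def d_cos(d):
    return Dual(math.cos(d.val), -math.sin(d.val) * d.der)

def d_abs(d):
    return Dual(abs(d.val), math.copysign(1.0, d.val) * d.der)

def inner_product(theta, norm_w, norm_x):
    return norm_w * norm_x * d_cos(theta)

def pr_product(theta, norm_w, norm_x):
    s, c = d_abs(d_sin(theta)), d_cos(theta)
    return norm_w * norm_x * (detach(s) * c + detach(c) * (1 - s))

norm_w, norm_x = 2.0, 3.0
results = []
for angle in (0.1, 1.0, 2.0, 3.0):
    theta = Dual(angle, 1.0)  # derivative of theta with respect to itself is 1
    p = inner_product(theta, norm_w, norm_x)
    q = pr_product(theta, norm_w, norm_x)
    results.append((angle, p.val, p.der, q.val, q.der))
    print(f"theta={angle:.1f}  P={p.val:+.4f} dP={p.der:+.4f}  "
          f"PR={q.val:+.4f} dPR={q.der:+.4f}")
```

For angles in $(0, \pi)$ the forward values of the two products coincide, the inner-product derivative shrinks with $\sin\theta$, and the derivative of the assumed PR form stays at $-\|w\|_2\|x\|_2$.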
The gradient of $w$. Above we discussed the gradient of $\theta$, an implicit variable in neural networks. In this part, we explicitly look at the differences between the gradients of $w$ in the conventional inner product and in our proposed PR Product.
For the conventional inner product, the gradient of $P$ with respect to $w$ is:
$$\frac{\partial P}{\partial w} = x = P_x + R_x \tag{10}$$
The vector projection $P_x$ is parallel to the weight vector $w$ and will update the length of $w$ in the next training iteration; we call it the length gradient of $w$. The vector rejection $R_x$ is orthogonal to $w$ and will change the direction of $w$; we call it the direction gradient. As $\theta$ gets closer to 0 or $\pi$, the direction gradient becomes smaller and smaller, so it is increasingly difficult to update the direction of $w$.
For the PR Product, the gradient with respect to $w$ is:
$$\frac{\partial PR}{\partial w} = M_w x + \|x\|_2 E_r \tag{11}$$
where $M_w$ is the projection matrix that projects onto the weight vector $w$, which means $M_w x = P_x$, and $E_r$ is the unit vector along the vector rejection $R_x$. Similar to Equation (10), $M_w x$ is the length gradient part and $\|x\|_2 E_r$ is the direction gradient part. For the length gradient, the cases in $P$ and $PR$ are identical. For the direction gradient part, however, the one in $PR$ is consistently larger than the one in $P$, except for the almost impossible case when $|\sin\theta| = 1$. In addition, the length of the direction gradient in $PR$ is independent of the value of $\theta$. Figure 1 shows the comparison of the gradients of $w$ in the two formulations.
3.3 Implementation of PR Product
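As mentioned above, $\theta$ is an implicit variable in neural networks, so we cannot directly implement the PR Product according to Equation (6). By Equation (4) and the Pythagorean theorem, the length of the vector rejection can be derived without $\theta$ as follows:
$$\|R_x\|_2 = \sqrt{\|x\|_2^2 - \|P_x\|_2^2} = \sqrt{\|x\|_2^2 - \left(\frac{w^\top x}{\|w\|_2}\right)^2}$$
Substituting it into Equation (6), we can get the implementation of the PR Product in practice:
As before, the detached factors are treated as constants during backward propagation.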
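An angle-free implementation can be sketched in plain Python. The expression used here, $\overline{s}\,w^\top x + \overline{c}\,\|w\|_2(\|x\|_2 - \|R_x\|_2)$ with $s = \|R_x\|_2/\|x\|_2$ and $c = w^\top x/(\|w\|_2\|x\|_2)$ detached, is an assumed form chosen to match the properties described in the text, not a quoted equation. Since plain Python has no autograd, detaching is emulated by evaluating the detached factors at a frozen copy of the weights (`w_const`, a hypothetical parameter) while finite differences perturb only the live weights.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def rejection_length(w, x):
    """||R_x||_2 from the Pythagorean theorem, with no explicit angle."""
    proj_len = dot(w, x) / norm(w)
    return math.sqrt(max(dot(x, x) - proj_len ** 2, 0.0))

def pr_product(w, x, w_const=None):
    """PR Product value; detached factors are evaluated at w_const."""
    w0 = w if w_const is None else w_const
    sin_bar = rejection_length(w0, x) / norm(x)   # detached |sin(theta)|
    cos_bar = dot(w0, x) / (norm(w0) * norm(x))   # detached cos(theta)
    return (sin_bar * dot(w, x)
            + cos_bar * norm(w) * (norm(x) - rejection_length(w, x)))

def grad_w(w, x, eps=1e-6):
    """Central differences in w, keeping the detached factors frozen at w."""
    grad = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += eps
        down[i] -= eps
        grad.append((pr_product(up, x, w) - pr_product(down, x, w)) / (2 * eps))
    return grad

# Illustrative vectors (hypothetical values).
w = [1.0, 2.0, -1.0]
x = [0.5, -1.0, 3.0]

scale = dot(w, x) / dot(w, w)
p_x = [scale * wi for wi in w]                  # vector projection
r_x = [xi - pi for xi, pi in zip(x, p_x)]       # vector rejection
unit_r = [ri / norm(r_x) for ri in r_x]
expected = [pi + norm(x) * ui for pi, ui in zip(p_x, unit_r)]

print(pr_product(w, x), dot(w, x))  # forward values agree
print(grad_w(w, x))                 # numerical gradient of the surrogate
print(expected)                     # projection plus ||x||_2 times unit rejection
```

In an autograd framework the frozen copy would be replaced by the framework's own stop-gradient operation, for example `Tensor.detach()` in PyTorch or `jax.lax.stop_gradient` in JAX.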
The PR Product is a substitute for the conventional inner product operation, so it can be applied to many existing deep learning modules, such as the fully connected layer (FC), the convolutional layer (CNN) and the LSTM layer. We denote the module X with the PR Product by PR-X. In this section, we show the implementation of PR-FC, PR-CNN, and PR-LSTM.
PR-FC. To get PR-FC, we just replace the inner product of the input vector $x$ and each weight vector in the weight matrix with the PR Product. Suppose the weight matrix $W$ contains a set of $n$ column vectors, $W = (w_1, w_2, \dots, w_n)$, so the output vector of PR-FC can be calculated as follows:
$$\text{PR-FC}(W, x) = \big(PR(w_1, x), PR(w_2, x), \dots, PR(w_n, x)\big) + b$$
where $b$ represents an additive bias vector if any.
PR-CNN. To apply the PR Product to CNNs, we convert the weight tensor of the convolutional kernel and the input tensor in the sliding window into vectors in Euclidean space, and then use the PR Product to calculate the output. The output at position $(i, j)$ is:
$$y_{i,j} = PR\big(\mathrm{vec}(W), \mathrm{vec}(X_{i,j})\big) + b$$
where $\mathrm{vec}(\ast)$ flattens a tensor into a vector, $W$ is the weight tensor of the convolutional kernel, $X_{i,j}$ represents the input tensor in the sliding window corresponding to output position $(i, j)$, and $b$ represents an additive bias if any.
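The flattening step can be illustrated with a minimal sketch in plain Python. It assumes a single-channel image, stride 1 and no padding, and it uses an ordinary inner product for the forward value (which the PR Product leaves unchanged); the function names are hypothetical.

```python
def flatten(window):
    """Turn a 2-D window (list of rows) into a flat vector."""
    return [v for row in window for v in row]

def conv2d_as_inner_products(image, kernel, bias=0.0):
    """Valid, stride-1 convolution written as vector inner products."""
    kh, kw = len(kernel), len(kernel[0])
    w = flatten(kernel)                      # kernel as a weight vector
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            window = [r[j:j + kw] for r in image[i:i + kh]]
            x = flatten(window)              # sliding window as an input vector
            row.append(sum(a * b for a, b in zip(w, x)) + bias)
        out.append(row)
    return out

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
result = conv2d_as_inner_products(image, kernel)
print(result)
```

Each output position is one inner product between the flattened kernel and one flattened window, so swapping in a PR Product only changes how gradients flow through that product, not the layer's forward output.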
PR-LSTM. To get the PR Product version of LSTM, we just replace all the perceptrons in each gate function with PR-FC.
In the following, we conduct experiments on image classification to validate the effectiveness of PR-CNN. We then show the effectiveness of PR-FC and PR-LSTM on the image captioning task.
4 Experiments on Image Classification
4.1 Classification Models
We test various classic networks such as ResNet [11], PreResNet [12], WideResNet [44] and DenseNet-BC [13] as the backbone networks in our experiments. In particular, we consider ResNet with 110 layers denoted by ResNet110, PreResNet with 110 layers denoted by PreResNet110, and WideResNet with 28 layers and a widen factor of 10 denoted by WRN-28-10, as well as DenseNet-BC with 100 layers and a growth rate of 12 denoted by DenseNet-BC-100-12. For ResNet110 and PreResNet110, we use the classic basic block. To get the corresponding PR Product version models, all the fully connected layers and the convolutional layers in the backbone models are replaced by our PR-FC and PR-CNN respectively, and we denote them by PR-X, namely PR-ResNet110, PR-PreResNet110, PR-WRN-28-10, and PR-DenseNet-BC-100-12.
4.2 Dataset and Settings
We conduct our image classification experiments on the CIFAR dataset [19], which consists of 50k and 10k images of $32 \times 32$ pixels for the training and test sets respectively. The images are labeled with 10 and 100 categories, namely the CIFAR10 and CIFAR100 datasets. We present experiments trained on the training set and evaluated on the test set. We follow the simple data augmentation in [21] for training: 4 pixels are padded on each side and a $32 \times 32$ crop is randomly sampled from the padded image or its horizontal flip. For testing, we only evaluate the single view of the original image. Note that our focus is on the effectiveness of our proposed PR Product rather than on pushing the state-of-the-art results, so we do not use any further data augmentation or training tricks to improve accuracy.
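The augmentation described above can be sketched in plain Python. This is a minimal illustration on a single-channel image stored as a list of rows; zero-valued padding and the function name `augment` are assumptions, since the text does not specify the padding value.

```python
import random

def augment(image, pad=4, rng=random):
    """Zero-pad, take a random crop of the original size, flip with p=0.5."""
    h, w = len(image), len(image[0])
    padded_w = w + 2 * pad
    padded = ([[0] * padded_w for _ in range(pad)]
              + [[0] * pad + list(row) + [0] * pad for row in image]
              + [[0] * padded_w for _ in range(pad)])
    top = rng.randint(0, 2 * pad)    # randint bounds are inclusive
    left = rng.randint(0, 2 * pad)
    crop = [row[left:left + w] for row in padded[top:top + h]]
    if rng.random() < 0.5:           # horizontal flip
        crop = [row[::-1] for row in crop]
    return crop

image = [[1] * 32 for _ in range(32)]
out = augment(image, rng=random.Random(0))
print(len(out), len(out[0]))
```

The crop always has the same size as the input, so the augmented image can be fed to the network unchanged; at test time the original image is used directly.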
4.3 Results and Analysis
For fair comparison, not only are the PR-X models trained from scratch but also the corresponding backbone models, so our results may be slightly different from the ones presented in the original papers due to some hyper-parameters like random number seeds. The strategies and hyper-parameters used to train the respective backbone models, such as the optimization solver, learning rate schedule, parameter initialization method, random seed for initialization, batch size and weight decay, are adopted to train the corresponding PR-X models. The results are shown in Table 1, from which we can see that the PR-X can typically improve the corresponding backbone models on both CIFAR10 and CIFAR100. On average, it reduces the top-1 error by 0.27% on CIFAR10 and 0.16% on CIFAR100. It is worth emphasizing that the PR-X models don’t introduce any additional parameters and keep the same hyper-parameters as the corresponding backbone models.
5 Experiments on Image Captioning
5.1 Captioning Model
Encoder. We use the Bottom-Up model proposed in [1] to generate the regional representations and the global representation of a given image. The Bottom-Up model employs Faster R-CNN [29] in conjunction with ResNet-101 [11] to generate a variably-sized set of representations, such that each representation encodes a salient region of the image. We use the global average pooled image representation as our global image representation. For modeling convenience, we use a single layer of PR-FC with a rectifier activation function to transform the representation vectors into new vectors of a common dimension:
where the weight parameters are learned. The transformed regional vectors are our regional image representations and the transformed global vector is our global image representation.
Decoder. To decode the image representations into a sentence description, we utilize a visual attention model with two PR-LSTM layers following recent methods [1, 28, 40], which are characterized as the Attention PR-LSTM and the Language PR-LSTM respectively. We initialize the hidden state and memory cell of each PR-LSTM as zero.
Given the output of the Attention PR-LSTM, we generate the attended regional image representation through the attention model, which is broadly adopted in recent work [5, 27, 1]. Here, we use the PR Product version of the visual attention model, expressed as follows:
where and are learned parameters, and are the outputs of the first layer and the second layer in the attention model respectively. is the attention weight over regional image representations, and is the attended image representation at time step t.
5.2 Dataset and Settings
Dataset. We evaluate our proposed method on the MS COCO dataset [23]. The MS COCO dataset contains 123287 images, each labeled with at least 5 captions. There are 82783 training images and 40504 validation images, and it provides 40775 images as the test set for online evaluation as well. For offline evaluation, we use a set of 5000 images for validation, a set of 5000 images for test and the remainder for training, as given in [16]. We truncate captions longer than 16 words and then build a vocabulary of words that occur at least 5 times in the training set, resulting in 9487 words.
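The caption preprocessing can be sketched as follows. This is a minimal illustration in plain Python; lowercasing, whitespace tokenization and the function name `build_vocab` are assumptions, since the text only specifies the truncation length and the minimum count.

```python
from collections import Counter

def build_vocab(captions, max_len=16, min_count=5):
    """Truncate captions to max_len words, keep words seen >= min_count times."""
    truncated = [c.lower().split()[:max_len] for c in captions]
    counts = Counter(word for caption in truncated for word in caption)
    vocab = sorted(w for w, n in counts.items() if n >= min_count)
    return truncated, vocab

# Toy captions (hypothetical data).
long_caption = " ".join(f"tok{i}" for i in range(20))
captions = ["a dog runs"] * 5 + ["a cat sits"] * 2 + [long_caption]
truncated, vocab = build_vocab(captions)
print(vocab)
print(len(truncated[-1]))
```

Counting is done after truncation, following the order given in the text, so words that appear only beyond the 16th position do not contribute to the vocabulary.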
Evaluation Metrics. We report results using the COCO captioning evaluation toolkit, which reports the widely used automatic evaluation metrics: BLEU (including BLEU-1, BLEU-2, BLEU-3, BLEU-4), METEOR, ROUGE-L, CIDEr, and SPICE.
Implementation Details. In the captioning model, we set the number of hidden units in each LSTM or PR-LSTM to 512, the embedding dimension of a word to 512, and the embedding dimension of image representation to 512. All of our models are trained according to the following recipe. We train all models under the cross-entropy loss using ADAM optimizer with an initial learning rate of
and a momentum parameter of 0.9. We anneal the learning rate using cosine decay schedule and increase the probability of feeding back a sample of the word posterior by 0.05 every 5 epochs until we reach a feedback probability 0.25. We evaluate the model at every 6000 iterations on the validation set and select the last evaluated model as initialization for REINFORCE training. We then run REINFORCE training to optimize the CIDEr metric using ADAM with a learning rate with cosine decay schedule and a momentum parameter of 0.9. During CIDEr optimization mode and testing mode, we use a beam size of 5. Note that in all our model variants, the untransformed image representations and from the Encoder are fixed and not fine-tuned. As our focus is on the effectiveness of our proposed PR Product, so we just exploit the widely used backbone model and settings, without any additional tricks of improving the performance, like scene graph and ensemble strategy.
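The two schedules in this recipe can be sketched as small functions. This is a minimal illustration: the feedback probability is assumed to start at 0, the cosine schedule is assumed to decay to 0 over the whole run, and `base_lr` and `total_iterations` are hypothetical parameters because the learning-rate values are not given here.

```python
import math

def feedback_probability(epoch, step=0.05, every=5, cap=0.25):
    """Scheduled-sampling probability: +0.05 every 5 epochs, capped at 0.25."""
    return min(cap, step * (epoch // every))

def cosine_decay(base_lr, iteration, total_iterations):
    """Cosine annealing from base_lr down to 0 over total_iterations."""
    progress = min(iteration, total_iterations) / total_iterations
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

for epoch in (0, 4, 5, 12, 25, 40):
    print(epoch, feedback_probability(epoch))

for it in (0, 50, 100):
    print(it, cosine_decay(1.0, it, 100))
```

The feedback probability is a step function of the epoch, while the learning rate falls smoothly from its initial value to zero.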
5.3 Performance Comparison and Experimental Analysis
The effectiveness of PR Product. To test the effectiveness of PR Product, we first compare the performance of models using the following different substitutes of inner product on Karpathy’s split of MS COCO dataset:
P Product: This is just the conventional inner product, which only involves the information of the vector projection of $x$ on $w$, $P_x$, as shown in Equation (5). In Euclidean geometry, it is also called the projection product, so we abbreviate it as P Product.
R Product: Contrary to the P Product, the R Product only involves the information of the vector rejection of $x$ from $w$, $R_x$. To keep the same range and sign as the P Product, we formulate the R Product as follows:
PR Product: This is the proposed PR Product, which involves not only the information of the vector projection $P_x$ but also that of the vector rejection $R_x$, as shown in Equation (6). The PR Product is the combination of the P Product and the R Product, with the relationship as follows:
For fair comparison, results are reported for models trained with cross-entropy loss and models optimized for CIDEr score on Karpathy’s split of MS COCO dataset, as shown in Table 2. Although the R Product does not perform as well as the P Product or PR Product, the results show that the vector rejection of input vector from weight vector can be used to optimize neural networks. Compared with the P Product and R Product, the PR Product achieves a remarkable performance improvement across all metrics regardless of cross-entropy training or CIDEr optimization, which experimentally proves the cooperation of vector projection and vector rejection is a great help to the optimization of neural networks. To intuitively illustrate the advantage of the PR Product, we show some examples of image captioning in supplementary material.
To better understand how the PR Product affects neural networks, we plot the minimum of and the maximum of in some layers of the PR Product version of captioning model, which can reflect the dynamics of neural networks to some extent. Figure 3 shows the two statics of the hidden-hidden transfer part in the Attention LSTM. Compared with the backbone model, the PR Product version changes the dynamics of the model. Plots for more layers can be found in the supplementary material.
Comparison with State-of-the-Art Methods. To further verify the effectiveness of our proposed method, we also compare the PR Product version of our captioning model with some state-of-the-art methods on Karpathy’s split of MS COCO dataset. Results are reported in Table 3, of which the top part is for cross-entropy loss and the bottom part is for CIDEr optimization.
Among those methods, SCN-LSTM [8] and SCST:Att2all [31] use the ensemble strategy. GCN-LSTM [40], CAVP [25] and SGAE [38] exploit information from visual scene graphs. Even though we do not use any of the above means of improving performance, our PR Product version of the captioning model achieves the best performance on most of the metrics, regardless of cross-entropy training or CIDEr optimization. In addition, we also report our results on the official MS COCO evaluation server in Table 4. As scene graph models can greatly improve the performance, for fair comparison, we only report the results of methods without scene graph models. It is noteworthy that we use the same model as reported in Table 3, without retraining on the whole set of training and validation images of the MS COCO dataset. We can see that our single model achieves competitive performance compared with the state-of-the-art models, even though some models exploit the ensemble strategy.
6 Conclusion
In this paper, we propose a substitute for the inner product of the weight vector $w$ and the input vector $x$, the PR Product, which involves the information of both the vector projection $P_x$ and the vector rejection $R_x$. The length of the local direction gradient of $w$ in the PR Product is consistently larger than the one in the conventional inner product. In particular, we show the PR Product version of the fully connected layer, convolutional layer, and LSTM layer. Applying these PR Product version modules to image classification and image captioning, the results demonstrate the robust effectiveness of our proposed PR Product. As the PR Product can be viewed as a basic operation in neural networks, we will apply the PR Product to other tasks like object detection.
References
-  P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 3, page 6, 2018.
-  P. Baldi and P. Sadowski. A theory of local learning, the learning channel, and the optimality of backpropagation. Neural Networks, 83:51–74, 2016.
-  S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.
-  Y. Bengio, N. Léonard, and A. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
-  L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6298–6306. IEEE, 2017.
-  X. Chen, L. Ma, W. Jiang, J. Yao, and W. Liu. Regularizing rnns for caption generation by reconstructing the past with the present. arXiv preprint arXiv:1803.11439, 2018.
-  A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In European conference on computer vision, pages 15–29. Springer, 2010.
-  Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng. Semantic compositional networks for visual captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, 2017.
-  J. Gu, J. Cai, G. Wang, and T. Chen. Stack-captioning: Coarse-to-fine learning for image captioning. arXiv preprint arXiv:1709.03376, 2017.
-  K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016.
-  G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
-  W. Jiang, L. Ma, Y.-G. Jiang, W. Liu, and T. Zhang. Recurrent fusion network for image captioning. arXiv preprint arXiv:1807.09986, 2018.
-  J. Johnson, A. Gupta, and L. Fei-Fei. Image generation from scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1219–1228, 2018.
-  A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137, 2015.
-  R. Kiros, R. Salakhutdinov, and R. Zemel. Multimodal neural language models. In International Conference on Machine Learning, pages 595–603, 2014.
-  A. Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.
-  A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
-  G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2891–2903, 2013.
-  C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. In Artificial Intelligence and Statistics, pages 562–570, 2015.
-  T. P. Lillicrap, D. Cownden, D. B. Tweed, and C. J. Akerman. Random synaptic feedback weights support error backpropagation for deep learning. Nature communications, 7:13276, 2016.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
-  C. Liu, F. Sun, C. Wang, F. Wang, and A. Yuille. Mat: A multimodal attentive translator for image captioning. arXiv preprint arXiv:1702.05658, 2017.
-  D. Liu, Z.-J. Zha, H. Zhang, Y. Zhang, and F. Wu. Context-aware visual policy network for sequence-level image captioning. In 2018 ACM Multimedia Conference on Multimedia Conference, pages 1416–1424. ACM, 2018.
-  S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy. Improved image captioning via policy gradient optimization of spider. In Proc. IEEE Int. Conf. Comp. Vis, volume 3, page 3, 2017.
-  J. Lu, C. Xiong, D. Parikh, and R. Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 6, page 2, 2017.
-  J. Lu, J. Yang, D. Batra, and D. Parikh. Neural baby talk. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7219–7228, 2018.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
-  Z. Ren, X. Wang, N. Zhang, X. Lv, and L.-J. Li. Deep reinforcement learning-based image captioning with embedding reward. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 290–298, 2017.
-  S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, page 3, 2017.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
-  R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. In Advances in neural information processing systems, pages 2377–2385, 2015.
-  Q. Wu, C. Shen, L. Liu, A. Dick, and A. van den Hengel. What value do explicit high level concepts have in vision to language problems? In Proceedings of computer vision and pattern recognition, pages 203–212, 2016.
-  S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500, 2017.
-  K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057, 2015.
-  X. Yang, K. Tang, H. Zhang, and J. Cai. Auto-encoding scene graphs for image captioning, 2018.
-  Z. Yang, Y. Yuan, Y. Wu, W. W. Cohen, and R. R. Salakhutdinov. Review networks for caption generation. In Advances in Neural Information Processing Systems, pages 2361–2369, 2016.
-  T. Yao, Y. Pan, Y. Li, and T. Mei. Exploring visual relationship for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 684–699, 2018.
-  T. Yao, Y. Pan, Y. Li, Z. Qiu, and T. Mei. Boosting image captioning with attributes. In IEEE International Conference on Computer Vision, pages 22–29, 2017.
-  M. Yazdani. Linear backprop in non-linear networks. In Advances in Neural Information Processing Systems Workshop, 2018.
-  Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image captioning with semantic attention. In Proceedings of computer vision and pattern recognition, pages 4651–4659, 2016.
-  S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.