Feature Fusion Effects of Tensor Product Representation on (De)Compositional Network for Caption Generation for Images

12/17/2018 ∙ by Chiranjib Sur, et al. ∙ 0

Progress in image captioning is becoming increasingly complex as researchers try to generalize the model and define the representation between visual features and natural language processing. This work defines such a relationship in the form of a representation called Tensor Product Representation (TPR), which generalizes the scheme of language modeling and structures the linguistic attributes (related to the grammar and parts of speech of the language), providing better-structured and grammatically correct sentences. TPR enables a better and unique representation and structuring of the feature space, and enables better sentence composition from these representations. Different ways of defining and improving these TPRs are discussed, and their performance is evaluated against traditional procedures and feature representations for the image captioning application. The new models achieved considerable improvement over the corresponding previous architectures.




1 Introduction

Image captioning applications are required for transferability between visual features, related to images and videos, and textual content. Recent works in image captioning rely on traditional visual features, where the architecture tries to evolve the lower-level features of the whole image into captions. However, their effectiveness has saturated, and in this work we introduce other procedures that can enhance the effectiveness of these architectures for better caption generation. A bidirectional multi-modal retriever using visual features and language embeddings was proposed in [1], while [2] proposed a bidirectional LSTM decoder. [3] introduced a generative model with advanced features. Gradually, attention mechanisms were introduced: [4] proposed an attention model based on adaptive regional emphasis, [5] utilized a high-level concept based attribute attention layer, [6] introduced review attention for decoding, [7] hybridized image attention with an LSTM as a sequence-to-sequence model, and [8] utilized self-critical sequence attention training with mixed reinforcement learning for performance improvement. Meanwhile, [9] pioneered a Semantic Compositional Network (SCN) for sentence generation from images, with additional semantic concepts derived from image features. As the research moved further, [10] introduced a bottom-up top-down mechanism with region-based image feature tensor attention from the Faster R-CNN model. [11] introduced a sentence-template-based approach with explicit slots correlated with specific image regions using R-CNN object predictors. Another instance of higher-level attribute attention was introduced by [12], where the attributes were the objects detected in the images, and a separate RNN was used to detect these objects in a sequence that is favorable for better caption generation. These models leveraged co-occurrence dependencies among object attributes and used an inference representation based on them. In contrast, our approach utilizes only image features and emphasizes better representation and caption generation. A direct comparison of these external attention models with our approach would be unjustified, but we report competitive performance with our approach.

The rest of the document is arranged as follows: TPR is described in Section 2, the details of our architecture in Section 3, the results and analysis in Section 4, and the conclusion in Section 5.

2 Tensor Product Representation

Tensor Product [13] is the systematic composition of a series of tensors that can be utilized for special representations with structured interpretation, and it has convenient algebraic properties that allow the nearest composite components to be retrieved. For our applications, however, we deal with a special case of tensor products in which one of the factors consists of orthogonal structures. This creates perfectly orthogonal segments of the feature space to represent the data: the cumulative representation can be used for inference, while multiplying it again by the orthogonal representation retrieves the original space representations. While there are different ways of generating tensor products, we concentrated on deterministic approaches with the Hadamard matrix and, due to its limitations, switched to deterministic approximation techniques for tensor product generation. Suppose we have a sentence with words $w_1, \dots, w_T$ and a word embedding matrix $W_e$. Transferring the one-hot vector $x_t$ of each word $w_t$ as $f_t = W_e x_t$, we have,

$$S = \sum_{t=1}^{T} f_t u_t^\top$$

for the orthonormal vector $u_t$ at position $t$, and $S$ is the TPR. Conversely, to retrieve the information from the TPR, for each $u_t$, we have,

$$\hat{f}_t = S u_t = \sum_{k=1}^{T} f_k \left(u_k^\top u_t\right) = f_t$$

and if we consider the nearest neighbor for $\hat{f}_t$ in $W_e$, we find,

$$\arg\min_{i} \left\lVert \hat{f}_t - [W_e]_{:,i} \right\rVert_2 = i_t,$$

where $[W_e]_{:,i}$ is the $i$-th column of $W_e$ and $i_t$ is the vocabulary index of $w_t$.
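The binding and retrieval steps above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names are hypothetical, the embedding table is random data standing in for Word2Vec or GloVe vectors, and scaling the Hadamard rows by $1/\sqrt{n}$ to make them orthonormal is an assumption made here so that retrieval needs no extra normalization.

```python
import numpy as np

def sylvester_hadamard(n):
    """Return an n x n Hadamard matrix via Sylvester's construction (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h

def bind(embeddings, roles):
    """Build the TPR: the sum over positions of (embedding row) x (role row)^T."""
    return embeddings.T @ roles          # shape (dim, n_roles)

def unbind(tpr, roles):
    """Recover one embedding per position by multiplying the TPR with each role vector."""
    return (tpr @ roles.T).T             # shape (n_positions, dim)

def nearest_words(vectors, embedding_table):
    """Index of the closest vocabulary embedding for each recovered vector."""
    diff = vectors[:, None, :] - embedding_table[None, :, :]
    return (diff ** 2).sum(axis=-1).argmin(axis=1)

rng = np.random.default_rng(0)
vocab_size, dim, n_roles = 50, 16, 8

# Assumption: random vectors stand in for a pretrained embedding table.
embedding_table = rng.standard_normal((vocab_size, dim))
sentence = np.array([3, 17, 42, 8, 25])  # word indices of a toy sentence

# One orthonormal role vector per position (rows of a scaled Hadamard matrix).
roles = sylvester_hadamard(n_roles)[: len(sentence)] / np.sqrt(n_roles)

tpr = bind(embedding_table[sentence], roles)
recovered = unbind(tpr, roles)
decoded = nearest_words(recovered, embedding_table)
```

In this sketch `decoded` reproduces `sentence` because the role vectors are exactly orthonormal, so every cross term cancels. The Hadamard order `n_roles` bounds the number of positions that can be bound this way, which is one practical limitation of the purely deterministic construction.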


We have tested that the retrieval is 100% correct for word embeddings such as Word2Vec and GloVe [14] for any dimension. The accuracy of the retrieval is not due to the dimension or the embedding, but due to the mutually orthogonal matrix $U$, whose rows are the orthogonal vectors $u_t$ assigned to each position $t$, which creates a separate space for each real-valued word embedding $f_t$ when $f_t$ is multiplied with $u_t$ as $f_t u_t^\top$. Tensor Product Representation (TPR) [13, 15, 16, 17, 18] is made scalable by approximating the variations, which makes the tensors deviate from being completely orthogonal, and by using a series of non-linearities and memory network parameter estimations. However, the effectiveness of the TPR comes from the uniqueness of the feature space and of the TPR itself. Moreover, the potential of the TPR to help in generalization is immense, as new representations are generated from contexts (image features) and can be regarded as the same representation that would have been generated by the corresponding caption of that image. Consider an image $I$ with caption H. Assume that a caption consists of $T$ words, including the start-of-sentence and end-of-sentence tokens. We define $H = (x_1, \dots, x_T)$, where $x_t$ is a one-hot encoding vector of dimension $V$ and $V$ is the size of the vocabulary. The length $T$ usually varies from caption to caption. $W_e$ is a word embedding matrix, the i-th column of which is the embedding vector of the i-th word in the vocabulary; it is obtained from the GloVe [14] algorithm with zero mean. at time