1. Introduction
In e-commerce search, estimating query-document similarity is essential for retrieval and ranking. Using the vector space model, both queries and documents can be represented as feature vectors, and the similarity between these feature vectors acts as a proxy for query-document similarity. This similarity can either be fed as one of many features when training a machine-learned ranker, or it can be used as a stand-alone measure of relevance. Note that in this work we treat queries as text queries issued with or without a category constraint that limits the search to that category. Likewise, documents are items represented by the item title, the item image, and the listing category¹ of the item. We do not consider any other source of information, such as the price or description of the item.
¹E-commerce sites commonly group listed items into a browsable set of categories.
To estimate query-item similarity, different features are extracted from different facets of the query and the item. For instance, the text of the query can be compared with the text of the item title. Likewise, the category constraint on the query can be compared with the listing category of the item. However, one key source of information that every item has is missing from queries: the image. There is no image associated with a query, and it is not easy to infer one because of the challenges outlined in Section 2.
In this paper, we propose a novel method to derive image information for queries. We use canonical correlation analysis (CCA) to learn a latent subspace where the query projection and the item projection are most predictive of each other. We hypothesize that after projecting onto the learned subspace, the query vector encapsulates information about images as well. Note that we are not learning explicit image features for queries; we are learning a multi-view representation of queries and items in which each carries all the available information about the other, which helps capture the similarity between them. Using the vector space model, we estimate this similarity as the cosine between the query and item vectors. We first propose a baseline cosine similarity between query-item feature vectors that contain no image information. We compare this baseline with our proposed method, which includes image information for both the query and the item using the item images, and show a significant improvement in the relevance of the vector space model.
The rest of the paper is organized as follows. Section 2 describes related research on canonical correlation analysis and on image generation conditioned on text. Section 3 covers the details of the features used in this work. Section 4 explains our proposed method, where we discuss how we use CCA. Dataset descriptions and results are outlined in Section 5. In Section 6 we highlight the main contributions of this work.
2. Related Work
Canonical correlation analysis (CCA) (Hotelling, 1936) (Borga, 1998) is a powerful statistical technique for finding linear relationships between two multidimensional variables. Simply put, if there are two data views of the same object, then CCA learns a latent space in which each view's representation is most predictive of the other, and vice versa. It is commonly used for multi-view representation learning (Arora et al., 2017) (Ge et al., 2016). Different variants have been proposed to extend the capabilities of CCA. Kernel canonical correlation analysis (KCCA) is a nonparametric method that extends CCA to learn nonlinear correlated transformations (Akaho, 2001) (Hardoon et al., 2004). Andrew et al. (Andrew et al., 2013) proposed deep canonical correlation analysis (DCCA), a parametric method that addresses the scalability concerns of KCCA; the authors showed the efficacy of their algorithm on a handwritten digit recognition dataset and a speech dataset. Closest to our application is the work by Yan et al. (Yan and Mikolajczyk, 2015), who used DCCA for matching images with captions. However, since queries are significantly shorter than captions and descriptions, learning a latent representation for queries and item images is extremely challenging.
Recently there has been work on generating images from text. It falls under the multimodal learning framework, where one modality has to be learned conditioned on another. Image generation given text is particularly difficult because many possible pixel configurations can satisfy a given description. Reed et al. (Reed et al., 2016) proposed a text-to-image synthesis approach using a deep convolutional generative adversarial network (DCGAN) conditioned on text features. Later, Zhang et al. (Zhang et al., 2017) proposed StackGAN, which generates high-quality realistic images from text descriptions using stacked generative adversarial networks. However, both of these approaches used long textual descriptions to generate images. These methods suffer from the same problem as CCA-based methods: they cannot easily be extended to search applications, where queries are usually short. Queries are not descriptive in the traditional natural-language sense. Moreover, e-commerce queries often consist of a bag of independent words and phrases identifying desired attributes, for example brand and product names, colors, sizes, and units of measure. This makes image generation a significantly challenging problem for e-commerce queries. In this work we attempt to learn image information from item images instead of explicitly learning image features or generating an image for queries.
3. Features Used
We use the vector space model to represent queries and items as vectors. We derive a hashed tf-idf vector from both the query text and the item title. We also extract image features from the item image and concatenate the hashed tf-idf vector of the item with its image feature vector. Since query-item similarity is defined as the cosine similarity between the query vector and the item vector, it is imperative that both vectors be equidimensional. However, in the absence of image information for queries, the query and item vectors are not equidimensional. We use CCA to obtain equidimensional feature vectors for both the query and the item by projecting them onto a learned subspace where the cosine similarity between the projected vectors can be computed. We present the details of feature extraction in the following subsections.
3.1. Hashed Tf-Idf
There are different ways to generate tf-idf vectors for the query and the item title. One option is to consider all the items in the index and generate tf-idf vectors the size of the index vocabulary. With this approach there is a risk of generating sparse vectors and diluting the information captured in the feature vector, as e-commerce sites often have a wide variety of items in their index. Instead, we use the listing category of the item to generate the tf-idf vector: for every query-item pair, we consider all the item titles present in the listing category of the corresponding item to generate the tf-idf vectors for both the query and the item. We make the assumption that the query and the item title come from the same distribution. In reality these distributions can differ, but we ignore this effect here both to simplify the tf-idf vector generation process and so that we can objectively evaluate the effect of adding the image signal by keeping it the only major differentiating factor. The impact of distribution differences can be explored in follow-on studies.
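The per-category tf-idf featurization described above, combined with the hashing trick discussed next, can be sketched in plain Python. This is a minimal illustration, not the production pipeline: the tokenization, hash function, and idf smoothing here are our own illustrative choices.

```python
import math
from collections import Counter

def hashed_tfidf(doc_tokens, category_titles, n_features=1000):
    """Sketch of a hashed tf-idf vector (hashing trick).

    doc_tokens: token list for the query or the item title.
    category_titles: list of token lists for all titles in the item's
    listing category, used as the corpus for idf statistics.

    Note: Python's built-in hash() is randomized per process; a real
    system would use a stable hash (e.g., MurmurHash) so vectors are
    reproducible across runs.
    """
    # Document frequency per hash bucket over the category's titles.
    df = Counter()
    for title in category_titles:
        for bucket in {hash(tok) % n_features for tok in title}:
            df[bucket] += 1
    n_docs = len(category_titles)

    # Term frequency per bucket for this document, weighted by idf.
    vec = [0.0] * n_features
    tf = Counter(hash(tok) % n_features for tok in doc_tokens)
    for bucket, freq in tf.items():
        idf = math.log((1 + n_docs) / (1 + df[bucket])) + 1.0  # smoothed idf
        vec[bucket] = freq * idf
    return vec
```

Because both the query and the item title are hashed into the same `n_features` buckets, the resulting vectors are equidimensional regardless of the category's vocabulary size.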
Since we formulated query-item similarity as vector similarity, the tf-idf vectors for a query-item pair should be equidimensional. Our current approach would result in query-item vectors of different sizes depending on the vocabulary size of the corresponding listing category. Note that the vocabulary size would explode if bigrams, trigrams, and/or skip-grams were included in the vocabulary. This creates another issue: keeping the mapping from words in the vocabulary to their corresponding feature indices in memory. This memory constraint is accentuated by the fact that we have to maintain such a mapping for every listing category, and eBay has over 15,000 listing categories. Therefore, we applied the hashing trick (Weinberger et al., 2009) to the tf-idf vectors of both the query and the item title, which addresses all of these concerns by hashing the high-dimensional tf-idf vector to a lower-dimensional feature space. However, this benefit of equidimensional vectors with a low memory footprint comes at a cost: there is a possibility of collision, where more than one word in the vocabulary maps to the same index. This probability depends on the dimensionality of the lower-dimensional vector. Keeping this in mind, we balanced the number of collisions against the different vocabulary sizes of the different categories and created all tf-idf vectors with 1000 dimensions. Another drawback of hashing is the inability to recover the original representation, as the idf weights are not stored; storing them would mean a higher memory footprint.
3.2. Image Features
We trained a ResNet-50 model (He et al., 2016) for image-to-category prediction and use the features from the layer before the prediction layer. In eBay's category tree we have over 15,000 listing categories, and every item is listed in at least one listing category. We collected around 5000 images per listing category for training the model and around 200 images per listing category for validation. This model is used for several applications that require image-to-category prediction. We query the trained model with the item image and take the output of the layer before the prediction layer as the image feature vector.
4. Proposed Approach
For $t$ query-item pairs, we have item image features and tf-idf features for the item titles representing the items. An item can be represented as a random variable $y \in \mathbb{R}^{n}$, obtained by concatenating its title and image features. Likewise, the tf-idf features for the query can be represented as a random variable $x \in \mathbb{R}^{m}$. We would like to learn a common subspace between $x$ and $y$ so that we can calculate the similarity between the two in the learned subspace. Canonical correlation analysis (CCA) is a statistical technique that learns a linear relationship between two multidimensional variables such as $x$ and $y$. It learns a set of basis vectors for $x$ and $y$ such that the correlations between the projections of the variables onto these basis vectors are mutually maximized (Borga, 1998). Given that the random variables $x$ and $y$ have zero mean, we can write the total covariance matrix as

$C = \begin{pmatrix} C_{xx} & C_{xy} \\ C_{yx} & C_{yy} \end{pmatrix}$

Here, $C_{xx}$ and $C_{yy}$ are the within-set covariance matrices of the query and item features respectively, and $C_{xy} = C_{yx}^{T}$ are the between-set covariance matrices.

As $x$ is $m$-dimensional and $y$ is $n$-dimensional with $m < n$, CCA learns $m$ basis vectors for the query and item variables respectively. Let us take the case where each set has only one basis vector, i.e., we want to learn only the one pair of basis vectors (canonical variate pair) with the largest canonical correlation. Given the linear combinations $x' = w_{x}^{T} x$ for the query and $y' = w_{y}^{T} y$ for the item, the canonical correlation is defined as

$\rho = \frac{E[x' y']}{\sqrt{E[x'^{2}]\, E[y'^{2}]}}$  (1)

$\rho = \frac{w_{x}^{T} C_{xy} w_{y}}{\sqrt{(w_{x}^{T} C_{xx} w_{x})(w_{y}^{T} C_{yy} w_{y})}}$  (2)

For consecutive pairs of basis vectors we have the further constraints that each new pair is uncorrelated with the previous ones:

$w_{x_i}^{T} C_{xx} w_{x_j} = 0, \quad i \neq j$  (3)

$w_{y_i}^{T} C_{yy} w_{y_j} = 0, \quad i \neq j$  (4)

$w_{x_i}^{T} C_{xy} w_{y_j} = 0, \quad i \neq j$  (5)

The canonical correlations between $x$ and $y$ can be calculated by solving the eigenvalue equations

$C_{xx}^{-1} C_{xy} C_{yy}^{-1} C_{yx} \hat{w}_{x} = \rho^{2} \hat{w}_{x}$  (6)

$C_{yy}^{-1} C_{yx} C_{xx}^{-1} C_{xy} \hat{w}_{y} = \rho^{2} \hat{w}_{y}$  (7)

where the eigenvalues $\rho^{2}$ are the squared canonical correlations and the eigenvectors $\hat{w}_{x}$ and $\hat{w}_{y}$ are the basis vectors. Only one of these equations needs to be solved, as the solutions are related by

$C_{xy} \hat{w}_{y} = \rho \lambda_{x} C_{xx} \hat{w}_{x}$  (8)

$C_{yx} \hat{w}_{x} = \rho \lambda_{y} C_{yy} \hat{w}_{y}$  (9)

where

$\lambda_{x} = \lambda_{y}^{-1} = \sqrt{\frac{\hat{w}_{y}^{T} C_{yy} \hat{w}_{y}}{\hat{w}_{x}^{T} C_{xx} \hat{w}_{x}}}$  (10)

We can project both the query and the item feature vectors onto the newly learned basis vectors to obtain new query ($x'$) and item ($y'$) representations that are of equal dimension. Since $x'$ and $y'$ are equidimensional, we can estimate the cosine similarity between the two. We hypothesize that, since item image features were used to learn the new space, even the query representation encapsulates image information despite no image representing the query. Learning image information for queries without any images is our main contribution in this work.
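The derivation above can be sketched numerically. The following is a minimal NumPy implementation of classical CCA via the eigenvalue formulation; the small ridge term `reg` is our own addition for numerical stability and is not part of the paper's formulation.

```python
import numpy as np

def cca(X, Y, k, reg=1e-6):
    """Classical CCA via the eigenvalue equations (Borga, 1998).

    X: (t, m) query features; Y: (t, n) item features; k: number of
    canonical pairs to keep. Returns (Wx, Wy, rhos) where the columns
    of Wx, Wy are basis vectors and rhos are canonical correlations.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    t = X.shape[0]
    # Within-set and between-set covariance matrices (regularized).
    Cxx = X.T @ X / t + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / t + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / t
    # Solve Cxx^{-1} Cxy Cyy^{-1} Cyx w_x = rho^2 w_x.
    M = np.linalg.solve(Cxx, Cxy) @ np.linalg.solve(Cyy, Cxy.T)
    evals, evecs = np.linalg.eig(M)
    order = np.argsort(-evals.real)[:k]
    rhos = np.sqrt(np.clip(evals.real[order], 0.0, 1.0))
    Wx = evecs.real[:, order]
    # Recover the item-side basis from the query-side one (up to scale).
    Wy = np.linalg.solve(Cyy, Cxy.T @ Wx)
    # Normalize so each projection has unit variance.
    Wx /= np.sqrt(np.sum(Wx * (Cxx @ Wx), axis=0, keepdims=True))
    Wy /= np.sqrt(np.sum(Wy * (Cyy @ Wy), axis=0, keepdims=True))
    return Wx, Wy, rhos
```

Projecting with `X @ Wx` and `Y @ Wy` then yields the equidimensional query and item representations used for cosine similarity.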
Table 1. AUROC and AUPRC of the baseline and the proposed method.

             AUROC                            AUPRC
  Baseline   Proposed   Gain       Baseline   Proposed   Gain
  0.5529     0.6182     11.89%     0.8169     0.8423     3.1%
5. Experiments
5.1. Dataset
We used a total of 419,905 query-item pairs from eBay's search logs. Each item appeared in our search result pages for the corresponding query. Note that eBay search supports different sort types in production, such as best-match sort and price-low-to-high sort, among others; these query-item pairs were sampled from all sort types. We had human judges label the relevance of the query-item pairs, with relevance defined by eBay's internal guidelines. Judges marked these query-item pairs as relevant or irrelevant. Out of the 419,905 query-item pairs, 328,574 were marked relevant by judges and 91,331 irrelevant. The dataset contains 378,888 unique items and 226,531 unique queries. All queries occurred on either the US, UK, or German eBay site. In our dataset, queries are represented as tokenized text, and items are represented by the item title, the item image, and the listing category of the item.
5.2. Baseline
For a baseline we used the text of the query and the item title to compute a text similarity measure. We used the listing category to derive hashed tf-idf representations for the item title and the corresponding query, as described in Section 3. Both the query and item feature vectors are 1000-dimensional. We estimate the baseline cosine similarity between the query vector $q \in \mathbb{R}^{m}$ and the item title vector $d \in \mathbb{R}^{v}$ as

$\mathrm{sim}(q, d) = \frac{q^{T} d}{\lVert q \rVert\, \lVert d \rVert}$

with $t$ as 419,905 and both $v$ and $m$ as 1000.
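The baseline score is just the cosine of the two hashed tf-idf vectors; a minimal plain-Python version is below. The zero-norm convention is our own choice for the sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention: no signal if either vector is empty
    return dot / (norm_u * norm_v)
```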
5.3. Including Image Information Using CCA
We will now examine how providing additional image information makes query-item similarity a better measure of relevance. Since we formulated query-item similarity as the cosine between the query and item vectors, we need image information in both vectors. However, images are only available for items; there is no image information available for queries. Therefore, we use the proposed method to include image information for both the item and the query in the current formulation.
We extracted ResNet feature vectors for the item images using the model described in Section 3.2. For CCA, we concatenate the item image and item title features. Therefore, we have an $n$-dimensional feature vector for the item and an $m$-dimensional ($m = 1000$) feature vector for the query. CCA is done over the item vectors and the query vectors with $t$ as 419,905. CCA learns $m$ basis vectors, and we obtain a new item vector and a new query vector by projecting the original item and query vectors onto the learned basis vectors. We then estimate the proposed cosine similarity between the new item vector $y'$ and the new query vector $x'$ as

$\mathrm{sim}(x', y') = \frac{x'^{T} y'}{\lVert x' \rVert\, \lVert y' \rVert}$
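The scoring step above, concatenate the item's title and image features, project both sides through the learned CCA bases, then take the cosine, can be sketched as follows. The argument names are illustrative, and the bases `Wx`, `Wy` are assumed to come from a CCA fit.

```python
import numpy as np

def projected_similarity(q, title_vec, image_vec, Wx, Wy):
    """Score a query-item pair in the learned CCA subspace.

    q: (m,) query tf-idf vector; title_vec: (v,) item title tf-idf
    vector; image_vec: (p,) item image features; Wx: (m, k) query
    basis; Wy: (v + p, k) item basis.
    """
    d = np.concatenate([title_vec, image_vec])  # item vector, n = v + p
    q_new = Wx.T @ q   # k-dimensional query representation
    d_new = Wy.T @ d   # k-dimensional item representation
    denom = np.linalg.norm(q_new) * np.linalg.norm(d_new)
    return float(q_new @ d_new / denom) if denom > 0 else 0.0
```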
5.4. Evaluation
We used the similarity score per query-item pair as a measure of relevance and evaluated the performance of the proposed method against the baseline. We used the Receiver Operating Characteristic (ROC) curve and the precision-recall curve as our evaluation metrics. Recall that 328,574 pairs were marked relevant by judges and 91,331 irrelevant; since there is some class imbalance in the dataset, we decided to use both metrics. ROC and precision-recall focus on different classes, so it is useful to have both: precision-recall compares false positives to true positives, while ROC compares false positives to true negatives, so each metric accentuates the performance of the algorithm on a different class of the data. The ROC and precision-recall curves of the proposed method against the baseline are shown in Figure 1 and Figure 2.
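AUROC has a convenient probabilistic reading that makes it easy to compute directly: it is the probability that a randomly chosen relevant pair scores higher than a randomly chosen irrelevant one. A small sketch via the rank-sum (Mann-Whitney U) formulation, quadratic in the number of pairs and intended only for illustration:

```python
def auroc(scores, labels):
    """AUROC as P(score of a random positive > score of a random
    negative), with ties counted as half a win. O(|pos|*|neg|); a
    production implementation would sort once instead."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```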
The area under the ROC curve (AUROC) and the area under the precision-recall curve (AUPRC) of the proposed algorithm and of the baseline method are reported in Table 1. Our results demonstrate that including image information when computing query-item similarity can significantly improve the relevance of the search engine, validating our novel way of extracting image information for queries from item images.
6. Conclusion
In this paper we presented a method for deriving image information for queries using the images of items. We show an 11.89% relevance improvement in AUROC and a 3.1% relevance improvement in AUPRC on eBay search data. Our method is fully unsupervised, making it highly useful for commercial information retrieval systems that have large amounts of unlabeled data at their disposal. The proposed method can be immensely helpful in obtaining image information for queries, which is otherwise a nontrivial task, and it has a multitude of applications in ranking and retrieval, as estimating the similarity between query-item pairs remains a fundamental task in search.
References
 Akaho (2001) S. Akaho. 2001. A kernel method for canonical correlation analysis. In Proceedings of the International Meeting of the Psychometric Society (IMPS2001). Springer-Verlag.
 Andrew et al. (2013) Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. 2013. Deep canonical correlation analysis. In International Conference on Machine Learning. 1247–1255.
 Arora et al. (2017) Raman Arora, Teodor Vanislavov Marinov, Poorya Mianjy, and Nati Srebro. 2017. Stochastic approximation for canonical correlation analysis. In Advances in Neural Information Processing Systems. 4775–4784.
 Borga (1998) Magnus Borga. 1998. Learning multidimensional signal processing. Ph.D. Dissertation. Linköping University Electronic Press.
 Ge et al. (2016) Rong Ge, Chi Jin, Sham M Kakade, Praneeth Netrapalli, and Aaron Sidford. 2016. Efficient Algorithms for Large-scale Generalized Eigenvector Computation and Canonical Correlation Analysis. arXiv preprint arXiv:1604.03930 (2016).
 Hardoon et al. (2004) David R Hardoon, Sandor Szedmak, and John Shawe-Taylor. 2004. Canonical correlation analysis: An overview with application to learning methods. Neural computation 16, 12 (2004), 2639–2664.

 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
 Hotelling (1936) Harold Hotelling. 1936. Relations between two sets of variates. Biometrika 28, 3/4 (1936), 321–377.
 Reed et al. (2016) Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. 2016. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396 (2016).
 Weinberger et al. (2009) Kilian Weinberger, Anirban Dasgupta, Josh Attenberg, John Langford, and Alex Smola. 2009. Feature hashing for large scale multitask learning. arXiv preprint arXiv:0902.2206 (2009).
 Yan and Mikolajczyk (2015) Fei Yan and Krystian Mikolajczyk. 2015. Deep correlation for matching images and text. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3441–3450.
 Zhang et al. (2017) Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. 2017. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision. 5907–5915.