Snap and Find: Deep Discrete Cross-domain Garment Image Retrieval

04/05/2019 ∙ by Yadan Luo, et al. ∙ IEEE The University of Queensland 0

With the increasing number of online stores, there is a pressing need for intelligent search systems that understand the item photos snapped by customers and search large-scale product databases to find the desired items. However, it is challenging for conventional retrieval systems to match the item photos captured by customers with the ones officially released by stores, especially for garment images. To bridge customer- and store-provided garment photos, existing studies have widely exploited clothing attributes (e.g., black) and landmarks (e.g., collar) to learn a common embedding space for garment representations. Unfortunately, they omit the sequential correlation of attributes and require a large amount of human labor to label the landmarks. In this paper, we propose a deep multi-task cross-domain hashing model, termed DMCH, in which cross-domain embedding and sequential attribute learning are modeled simultaneously. Sequential attribute learning not only provides semantic guidance for the embedding, but also generates rich attention on discriminative local details (e.g., black buttons) of clothing items without requiring extra landmark labels. This leads to promising performance and a 306× boost in efficiency compared with state-of-the-art models, as demonstrated through rigorous experiments on two public fashion datasets.




I Introduction

Throughout the world, people are gradually tiring of shopping in local stores, with the problems of parking, long lines, and wobbly shopping carts. Instead, online buying has grown exponentially, characterized by strong consumer demand and a growing number and variety of goods available. Most online retail shops feature a “garment image search” function, which allows users to submit a photo to look for the corresponding products.

Product image search, a practical example of cross-domain retrieval, requires measuring the similarity of images from two heterogeneous domains, termed the user domain and the shop domain. The user domain consists of the query images taken by users (e.g., street snaps, Instagram posts), and the shop domain includes the database images taken by professional photographers.

Cross-domain garment image retrieval is inherently challenging due to (1) large intra-class variance and (2) minor inter-class variance. For the same product, large intra-class variance arises from the different lighting conditions, noisy backgrounds, poses, and even occlusion and deformation between the user and shop images, which makes garment images from different domains hard to match. Minor inter-class variance is an intrinsic property of garment images. For example, two dresses from different categories can be very similar in color and design yet differ slightly in the shape of the collar, where one is V-shaped and the other is U-shaped. Given a user garment image with a V-shaped collar, returning the dress with a U-shaped collar is not considered a correct search result in our scenario. The accuracy of fine-grained garment image search is strongly affected by the low discriminability among similar garments; because subtle, local differences must be identified, the task is inevitably harder than conventional Content-Based Image Retrieval (CBIR).

Fig. 1: An overview of the proposed framework. During the offline stage, the triplet label (user image A, shop image A and shop image B) and the corresponding sequential attributes guide the DMCH model with spatial attention to generate the binary embedding. At the online stage, given a query user image, DMCH generates its predicted sequential attributes and its binary code, with which the matched shop image can be retrieved from the shop image embedding database.

To alleviate the above-mentioned problems, the recent literature has proposed solutions from two main perspectives: 1) visual objectness learning; 2) semantic attribute learning. Technically, objectness-learning-based methods detect the foreground object and/or its semantic parts [1, 2] before extracting features, mainly to suppress background discrepancy and enhance the recognition of relevant local details. However, this approach relies heavily on annotated bounding boxes or clothing landmarks for supervised detection, which are time- and labor-consuming to obtain. An alternative to detection is to extract saliency or visual attention [3, 4] from raw images. Different from objectness learning, which facilitates fine-grained recognition by detecting visual parts, attribute learning [5, 6] constructs a latent space between fine-grained labels and low-level features. Attributes of clothing items such as color, texture, fabric or even style help the model find the inter-class and intra-class correlations between garment categories. Existing attribute learning models generally treat attribute prediction as a multi-label classification problem, where each attribute is considered a category or class. In practice, however, each garment image used for training is associated with a sequence of attributes such as “silk pocket shirt”, and this sequential information is omitted by conventional attribute learning models.

In this paper, we propose an end-to-end deep multi-task cross-domain hashing (DMCH) model that jointly models the sequential correlation among clothing attributes and learns attention-aware visual features of garment images for both the cross-domain retrieval and the sequential attribute learning tasks. From the linguistic perspective, the adjectives that precede a noun usually follow a particular order. For instance, people usually say a “little” “black” “dress” rather than a “black” “little” “dress”. Two popular fashion datasets [5, 1] apply a similar order to create their attribute lists, based on which each garment image is described by a sequence of attributes in a certain order. Instead of assigning discrete attributes to a query image, our model considers the sequential correlation among attributes and predicts a sequence of attributes in a meaningful order for each query, which largely enhances the confidence of each selected attribute. Additionally, most existing work plainly aggregates features from different convolutional layers and applies the Euclidean distance to the concatenated feature vectors, unavoidably resulting in inefficient query processing and unnecessary storage waste. The learning-to-hash component of the proposed model addresses these issues. Figure 1 gives an overview of the proposed framework, which includes the offline training of the DMCH model and the online process for cross-domain retrieval and attribute generation. The main contributions are summarized below:

  • A deep discrete embedding framework is proposed to learn discriminative representations of garment images, being supervised by two objectives simultaneously. Cross-domain correlation and visual-semantic association are utilized jointly in garment image representation learning to facilitate both tasks.

  • The spatial attention of garment images is activated by attribute descriptions, which significantly enhances the recognition of subtle, local details.

  • Different from the conventional attribute learning, which is usually positioned as a multi-label classification problem, we open a new direction of sequential attribute learning, where the strong linguistic hint is leveraged.

  • The learned binary embedding of garment images is efficient and effective with strong discriminative power, which enables our algorithm to accelerate querying by 306× without compromising accuracy.

II Related Work

II-A Cross-domain Image Retrieval

Product image search is one type of Content-Based Image Retrieval (CBIR), which has been widely studied for decades [7, 8]. Hashing-based methods [9, 10, 11, 12, 13] serve as effective solutions: they transform high-dimensional samples into compact binary codes such that similar samples receive similar codes, and data similarity can then be measured by bit-wise XOR operations. Recently, end-to-end deep hashing [14, 15, 16] has shown its superiority in both feature representation and embedding quality over previous two-stage supervised approaches.
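The XOR-based similarity measure mentioned above can be sketched as follows. This is a minimal illustration rather than the implementation used in the paper; the helper names `pack_codes` and `hamming_distances` and the toy 8-bit codes are assumptions made for the example.

```python
import numpy as np

def pack_codes(codes_pm1: np.ndarray) -> np.ndarray:
    """Pack {-1,+1} codes of shape (n, L) into bytes of shape (n, ceil(L/8))."""
    bits = (codes_pm1 > 0).astype(np.uint8)
    return np.packbits(bits, axis=1)

# Lookup table: number of set bits for every possible byte value.
_POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def hamming_distances(query: np.ndarray, database: np.ndarray) -> np.ndarray:
    """Hamming distance between one packed query and every packed database code."""
    xor = np.bitwise_xor(database, query)      # bits that differ become 1
    return _POPCOUNT[xor].sum(axis=1, dtype=np.int64)

# Toy database of three 8-bit codes (hypothetical values).
db = np.array([[ 1,  1, -1, -1,  1, -1,  1,  1],
               [-1, -1,  1,  1, -1,  1, -1, -1],
               [ 1,  1, -1, -1,  1, -1,  1, -1]])
q = np.array([[1, 1, -1, -1, 1, -1, 1, 1]])
dists = hamming_distances(pack_codes(q)[0], pack_codes(db))
print(dists)  # distance of the query to each database code
```

Because each comparison reduces to an XOR and a bit count over a few bytes, ranking a database by Hamming distance is much cheaper than computing Euclidean distances between long real-valued vectors.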

However, along with the explosive growth of web data, even deep image retrieval still faces enormous cross-domain challenges, i.e., heterogeneous query-result pairs and fine-grained classes. For instance, using a street snap of a lady in a white coat as a query to find the shop image of the exact same item rarely succeeds, especially when the backgrounds and poses differ.

WTBI [17] handles the fine-grained class problem with an easy-to-hard strategy, that is, training a generic network with five main classes followed by fine-tuning for each sub-class individually. DARN [5] has two NIN subnets, one for the street domain and one for the shop domain, that enable the model to learn domain-specific representations. To discover the common semantic features, DARN fuses the feature maps and adds a tree-structured layer on top for category and attribute prediction. FashionNet [1] detects garment landmarks in various scenarios and uses the results to subsample the feature maps of the last convolutional layer, considerably suppressing background noise and disturbance. Nevertheless, the above-mentioned methods directly concatenate outputs from different layers to form a query vector, unavoidably resulting in a high computational cost during querying. In contrast, our method follows the deep hashing approach, which learns discrete embeddings for samples and accelerates querying by up to 306× without compromising accuracy.

II-B Attribute Learning

Research on attribute-based visual representation has received wide attention from the computer vision community in the past few years, especially in person re-identification [18, 19], captioning [20, 21, 22], and retrieval [23, 24, 25]. Attributes usually refer to semantic properties of objects or scenes that are shared across categories, so that attributes can serve as a latent and interpretable connection between image content and abstract labels. Previous work [5, 1] views attribute learning purely as a multi-label classification problem with a global representation, yet its performance is limited, especially when confronted with an excessive number of fine-grained attributes. [4] learns a model for automatically grouping garment attributes into an upper-level concept list (e.g., the neckline concept might consist of attributes like v-neck and round-neck), while [5] constructs a tree structure for the attribute hierarchy. However, a fixed tree-structured design is not flexible enough for learning newly emerging attributes. From the linguistic point of view, people describe a noun with a set of adjectives conforming to a certain order. Exploiting a sequence of attributes rather than isolated attributes preserves a stronger sequential connection between attributes; for example, “floral” usually comes after “bohemian”. Therefore, in order to further explore the word order, we derive the decoder part of the proposed model for generating attribute descriptions, in light of the fast development of Long Short-Term Memory (LSTM) networks.


II-C Multi-task Learning

Multi-task learning (MTL) aims to improve the generalization performance of multiple prediction tasks by appropriately sharing relevant information across them. In the context of deep neural networks, this idea is often realized by hand-designed network architectures with layers that are shared across tasks and branches that encode task-specific features [5, 27]. [28] integrates multiple face matching criteria to transform images into task-specific subspaces, which helps to learn a common projection for metric-balanced face retrieval. [6, 5, 1] jointly preserve the category and attribute similarities for image retrieval by applying a cross-entropy objective for attribute prediction and a softmax loss for classification. The experiments we conducted also show that multi-task learning benefits both related tasks by sharing knowledge across them.

III Our Approach

In this paper, we propose a joint framework that simultaneously performs (1) sequential attribute learning and (2) cross-domain garment image embedding. After presenting the problem statement, we describe the generic neural encoder-decoder framework for sequential attribute learning, followed by the proposed attention-based cross-domain embedding.

III-A Problem Statement

We denote a training image from the user domain, its corresponding shop image (i.e., positive sample) and an irrelevant shop image (i.e., negative sample) from the database as $I_u$, $I_s^+$ and $I_s^-$, respectively. One objective of the proposed model is to generate a $T$-gram description of the image $I_u$, $y = \{y_1, \dots, y_T\}$. Meanwhile, the model jointly encodes the images $I_u$, $I_s^+$ and $I_s^-$ as a set of $L$-length binary codes $b_u$, $b_s^+$ and $b_s^-$, respectively, such that the embeddings of positive user-shop image pairs are similar or close in the projected Hamming space.

III-B Encoder-Decoder for Sequential Attribute Learning

Given an image and its corresponding attribute description, the encoder-decoder model directly maximizes the following objective:

$$\theta^{*} = \arg\max_{\theta} \sum_{(I, y)} \log p(y \mid I; \theta), \tag{1}$$

where $\theta$ are the parameters of the model, $I$ denotes the image and $y = \{y_1, \dots, y_T\}$ is the corresponding attribute description, consisting of $T$ words. Using the chain rule, the log-likelihood of the joint probability distribution can be decomposed into the ordered conditionals:

$$\log p(y \mid I) = \sum_{t=1}^{T} \log p(y_t \mid y_1, \dots, y_{t-1}, I), \tag{2}$$

where we drop the dependency on the model parameters for convenience. It is natural to model $p(y_t \mid y_1, \dots, y_{t-1}, I)$ with a recurrent neural network (RNN), where the variable number of words we condition upon up to time $t-1$ is expressed by a fixed-length hidden state or memory $h_t$. Hence, as (3) shows, we adopt a nonlinear function $f$ to model the transform from $h_{t-1}$ to $h_t$. To make the RNN valid and effective, it is crucial to decide what the exact form of $f$ is and how the images and words are fed as inputs at time $t$.

$$h_t = f(x_t, h_{t-1}, m_{t-1}), \qquad p(y_t \mid y_1, \dots, y_{t-1}, I) = p(y_t \mid h_t, c_t). \tag{3}$$

For $f$ we use a Long Short-Term Memory (LSTM) network, which has shown powerful performance on sequence tasks. Here $c_t$ represents the context vector, $x_t$ is the input vector and $m_{t-1}$ is the memory cell vector at time $t-1$. In conventional encoder-decoder settings, the context vector $c_t$ depends only on the output of the encoder, a Convolutional Neural Network (CNN). The output features extracted from the last fully connected layer give the global visual information of the input image. During the decoder stage, $c_t$ usually remains constant.

Things are different in the attention-based framework. $c_t$ is updated throughout the whole training procedure because the attended part of the image changes as the predicted words shift. To compute the context vector $c_t$, inspired by [29], we form a spatial attention model, which is defined as

$$c_t = g(V, h_t), \qquad V = [v_1, \dots, v_k], \quad v_i \in \mathbb{R}^{d}, \tag{4}$$

where $g$ is an attention function, and $v_i$ is a $d$-dimensional spatial feature vector corresponding to the $i$-th region.

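Given the spatial image features $V \in \mathbb{R}^{d \times k}$ and the hidden state $h_t$ of the LSTM, we feed them through a single-layer neural network followed by a softmax function to generate the attention distribution over the $k$ regions of the image:

$$z_t = w_h^{\top} \tanh\!\big(W_v V + (W_g h_t)\mathbf{1}^{\top}\big), \qquad \alpha_t = \mathrm{softmax}(z_t), \tag{5}$$

where $\mathbf{1}$ is a vector with all elements equal to 1, $W_v$, $W_g$ and $w_h$ are parameters to be learned, and $\alpha_t$ is the attention weight over the features in $V$. Therefore, the context vector $c_t$ can be obtained by a weighted sum:

$$c_t = \sum_{i=1}^{k} \alpha_{ti}\, v_i. \tag{6}$$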
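The attention step described above can be sketched in NumPy as follows. This is a minimal illustration under assumed shapes, not the authors' implementation: the attention dimension `a`, the hidden size `n`, the random parameters and the function name `spatial_attention` are all hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def spatial_attention(V, h, W_v, W_g, w_h):
    """Soft attention over k image regions.

    V: (d, k) region features; h: (n,) LSTM hidden state.
    W_v: (a, d), W_g: (a, n), w_h: (a,) are the learned parameters.
    Returns the attention weights (k,) and the context vector (d,).
    """
    # One score per region: single-layer network on features and hidden state.
    scores = w_h @ np.tanh(W_v @ V + (W_g @ h)[:, None])
    alpha = softmax(scores)
    context = V @ alpha          # weighted sum of the region features
    return alpha, context

rng = np.random.default_rng(0)
d, k, n, a = 8, 49, 16, 12       # hypothetical sizes, e.g. a 7x7 feature map
V = rng.normal(size=(d, k))
h = rng.normal(size=n)
alpha, c = spatial_attention(V, h,
                             rng.normal(size=(a, d)),
                             rng.normal(size=(a, n)),
                             rng.normal(size=a))
print(alpha.sum(), c.shape)
```

With all parameters set to zero every region receives the same score, so the context vector reduces to the plain average of the region features; learning the parameters is what lets the decoder concentrate on a few regions per word.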


So far, we have obtained a location-aware visual feature that allows the decoder to “speak” from different views. Yet spatial-attention-based decoders still cannot determine when to rely on the visual signal and when to rely on the language model. We adopt the visual sentinel, a gate that allows the decoder to choose whether to focus on the linguistic rules of the attribute description or on the visual content of the image:

$$g_t = \sigma(W_x x_t + W_h h_{t-1}), \qquad s_t = g_t \odot \tanh(m_t), \tag{7}$$

where $W_x$ and $W_h$ are weight parameters to be learned, and $g_t$ is the gate applied to the memory cell $m_t$. $\odot$ represents the element-wise product and $\sigma$ is the logistic sigmoid activation.

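Therefore, the new context vector, defined as $\hat{c}_t$, can be calculated as follows:

$$\hat{c}_t = \beta_t s_t + (1 - \beta_t)\, c_t, \tag{8}$$

where $\beta_t$ is the new sentinel gate at time $t$, which produces a scalar in the range [0,1]. The scalar balances the importance of the visual sentinel information and the image spatial attention. To compute $\beta_t$, we modify equation (5):

$$\hat{\alpha}_t = \mathrm{softmax}\big(\big[z_t;\ w_h^{\top} \tanh(W_s s_t + W_g h_t)\big]\big), \tag{9}$$

where $W_s$ is an additional weight parameter and $\hat{\alpha}_t$ is the attention distribution over both the spatial features and the visual sentinel vector. Its last element is taken as the gate value, $\beta_t = \hat{\alpha}_t[k+1]$.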
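The sentinel mechanism described above can be sketched as follows. The sketch assumes, in the style of the adaptive attention model of [29], that the sentinel gate is read off as the last entry of the extended attention distribution and that the sentinel and the region features share the same dimension `d`; the function names and all sizes are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def visual_sentinel(x_t, h_prev, m_t, W_x, W_h):
    """Gate the memory cell: s_t = sigmoid(W_x x_t + W_h h_prev) * tanh(m_t)."""
    gate = sigmoid(W_x @ x_t + W_h @ h_prev)
    return gate * np.tanh(m_t)

def adaptive_context(V, h, s, W_v, W_g, W_s, w_h):
    """Mix the sentinel s (d,) with the spatial context built from V (d, k)."""
    z = w_h @ np.tanh(W_v @ V + (W_g @ h)[:, None])   # region scores, (k,)
    z_s = w_h @ np.tanh(W_s @ s + W_g @ h)            # sentinel score, scalar
    alpha_hat = softmax(np.append(z, z_s))            # distribution over k+1 entries
    beta = alpha_hat[-1]                              # weight given to the sentinel
    c = V @ softmax(z)                                # spatial context vector
    return beta * s + (1.0 - beta) * c, beta

rng = np.random.default_rng(1)
d, k, e, a = 8, 49, 6, 12        # hypothetical sizes; hidden size equals d here
V = rng.normal(size=(d, k))
x_t, h_prev, h, m_t = (rng.normal(size=e), rng.normal(size=d),
                       rng.normal(size=d), rng.normal(size=d))
s = visual_sentinel(x_t, h_prev, m_t, rng.normal(size=(d, e)), rng.normal(size=(d, d)))
W_v, W_g, W_s, w_h = (rng.normal(size=(a, d)), rng.normal(size=(a, d)),
                      rng.normal(size=(a, d)), rng.normal(size=a))
c_hat, beta = adaptive_context(V, h, s, W_v, W_g, W_s, w_h)
print(beta, c_hat.shape)
```

A gate value close to 1 means the decoder leans on the language model (the sentinel), while a value close to 0 means the prediction is driven by the attended image regions.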

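The probability over the set of possible attributes at time $t$ can be calculated as

$$p_t = \mathrm{softmax}\big(W_p(\hat{c}_t + h_t)\big), \tag{10}$$

where $W_p$ is the weight parameter to be learned. Hence the loss function of the attribute learning part is

$$\mathcal{L}_{att} = -\sum_{t=1}^{T} \log p(y_t \mid y_1, \dots, y_{t-1}, I). \tag{11}$$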


Fig. 2: (Left) An illustration of our proposed encoder-decoder structure. The dotted lines denote the sum of the outputs of the attention modules. (Right) An illustration of the process through which the scaled tanh function gradually approximates the signum function.

III-C Attention-aware Embedding for Cross-domain Retrieval

In this part, we elaborate on how sequential attribute learning helps the image embedding. Previously, the binary hash code was learned directly by converting the $L$-length feature $u$ from the last fully connected layer, which is continuous in nature, to a binary code taking values of either $+1$ or $-1$. This binarization can only be realized by taking the signum function $\mathrm{sgn}(\cdot)$ as the activation function on top of the embedding layer:

$$\mathrm{sgn}(u) = \begin{cases} +1, & u \ge 0, \\ -1, & \text{otherwise}. \end{cases} \tag{12}$$


Unfortunately, as the signum function is non-smooth and non-convex, its gradient is zero for all nonzero inputs and is ill-defined at zero, which makes standard back-propagation infeasible for training deep networks. This is known as the vanishing gradient problem, which has been a key difficulty in training deep neural networks via back-propagation [30, 31, 32]. Approximate solutions that relax the binary constraints are not good alternatives, as they lead to a large quantization error and therefore to suboptimal performance.

To alleviate the optimization problem caused by the non-smooth signum activation, we draw inspiration from recent studies on continuation methods [33, 34]. These studies gradually reduce the amount of smoothing during training, which results in a sequence of optimization problems that eventually converges to the original optimization problem. Following this strategy, if we find a smooth approximation of $\mathrm{sgn}(\cdot)$ and then progressively make the smoothed objective function non-smooth as training proceeds, the final solution should converge to the desired optimization target.

Motivated by the continuation methods, the function $\tanh(\gamma u)$ is applied to approximate $\mathrm{sgn}(u)$. There is a critical relationship between $\tanh(\gamma u)$ and $\mathrm{sgn}(u)$. As illustrated in Figure 2, as the scaling parameter $\gamma$ increases, the scaled tanh function becomes more non-smooth and more saturated, so that deep networks using $\tanh(\gamma u)$ as the activation function become more difficult to optimize. As $\gamma \to \infty$, the optimization problem converges to the original deep hashing problem with the $\mathrm{sgn}(u)$ activation function:

$$\lim_{\gamma \to \infty} \tanh(\gamma u) = \mathrm{sgn}(u), \quad u \neq 0. \tag{13}$$
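The limiting behaviour described above can be checked numerically. The following sketch uses a handful of hypothetical pre-activation values and an illustrative schedule of scaling parameters; it reports how far the scaled tanh is from the signum function and how large its gradient gets.

```python
import numpy as np

u = np.array([-2.0, -0.1, 0.05, 1.5])        # hypothetical pre-activations
gaps = []
for gamma in (1.0, 10.0, 100.0, 1000.0):      # illustrative scaling schedule
    relaxed = np.tanh(gamma * u)
    # Largest deviation between the smooth surrogate and the hard sign.
    gaps.append(float(np.max(np.abs(relaxed - np.sign(u)))))
    # Derivative of tanh(gamma * u) with respect to u.
    grad = gamma * (1.0 - relaxed ** 2)
    print(f"gamma={gamma:7.1f}  max|tanh-sgn|={gaps[-1]:.2e}  max grad={grad.max():.2e}")
```

The gap shrinks monotonically as the scale grows, while the gradient concentrates around zero and vanishes elsewhere. This is why the scale is increased step by step during training rather than being set to a large value from the start.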


In order to highlight local details of garment images, we fuse the global image feature with the weighted local features and use $\tanh(\gamma\,\cdot)$ to approximate the discretization:

$$b = \tanh\!\Big(\gamma\, W \big(v^{g} \oplus \lambda \sum_{t=1}^{T} \hat{c}_t\big)\Big), \tag{14}$$

where $v^{g}$ is the global image feature, $\sum_{t} \hat{c}_t$ is the attended feature, $W$ is the parameter to be learned, $\gamma$ is a scaling parameter that is gradually enlarged each time the network converges, and $\oplus$ denotes concatenation. $\lambda$ is the weight that balances the global features and the attended features. As discussed in Section I, cross-domain retrieval involves thousands of fine-grained classes, for which a simple cross-entropy loss is not suitable. We therefore adopt the triplet loss as the training objective for the embedding:

$$\mathcal{L}_{tri} = \max\big(0,\ m + d(b_u, b_s^+) - d(b_u, b_s^-)\big), \tag{15}$$


where $I_u$ is the anchor image from the user domain, $I_s^+$/$I_s^-$ is the positive/negative image from the shop domain, and $b_u$, $b_s^+$, $b_s^-$ are their embeddings. $m$ is a constant that sets the margin and $d(\cdot,\cdot)$ measures the distance between two image embeddings. The loss penalizes a triplet if $d(b_u, b_s^+) + m > d(b_u, b_s^-)$, so that matched image embeddings become close and unmatched image embeddings distant in the common space. Thus, the multi-task loss combines (11) and (15):

$$\mathcal{L} = \mathcal{L}_{tri} + \mu\, \mathcal{L}_{att}, \tag{16}$$

where $\mu$ is a weight, set empirically, that prevents one term from dominating the training.
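A triplet loss with a margin can be sketched as below. The text does not specify the distance function, so this example assumes the Hamming distance between {-1,+1} codes, computed through the inner product; the margin of 1/16 of the code length follows the example given in the experimental setting (2 for 32-bit codes), and all codes are hypothetical.

```python
import numpy as np

def hamming_distance(a, b):
    """Hamming distance between {-1,+1} codes, using d = (L - <a, b>) / 2."""
    code_length = a.shape[-1]
    return 0.5 * (code_length - np.sum(a * b, axis=-1))

def triplet_loss(anchor, positive, negative, margin):
    """Hinge on the gap between the positive and the negative distance."""
    d_pos = hamming_distance(anchor, positive)
    d_neg = hamming_distance(anchor, negative)
    return np.maximum(0.0, margin + d_pos - d_neg)

L = 32
margin = L / 16                                    # 2 for 32-bit codes
anchor = np.ones(L)
positive = anchor.copy(); positive[:1] = -1        # differs from the anchor in 1 bit
hard_negative = anchor.copy(); hard_negative[:2] = -1    # differs in 2 bits
easy_negative = anchor.copy(); easy_negative[:10] = -1   # differs in 10 bits

loss_hard = triplet_loss(anchor, positive, hard_negative, margin)
loss_easy = triplet_loss(anchor, positive, easy_negative, margin)
print(loss_hard, loss_easy)
```

The negative that is only two bits away violates the margin and is penalized, while the negative that is ten bits away contributes nothing. The inner-product form of the distance is also defined for relaxed codes in (-1, 1), which is convenient when the codes come from a tanh activation during training.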

IV Experiments

In this section, we evaluate our DMCH on both tasks of cross-domain retrieval and sequential attribute learning on two large-scale fashion datasets.

IV-A Datasets

IV-A1 DARN [5]

This dataset was created for street-to-shop retrieval, i.e., matching street images taken by users with professional shop photos. After removing corrupted images, a subset of 62,812 street images and 39,756 shop images over 20 categories is obtained. Each street image has a matched shop image. The dataset also provides a sequence of attributes for each clothing item. We follow the settings in [3] and generate 2,076,440 training triplets in the form of (user image, positive sample, negative sample) with a positive-to-negative ratio of 1-to-10. A total of 2,000 distinct user images are randomly selected for testing, each of which corresponds to a unique clothing item. A total of 102 attributes are involved under this setting.

IV-A2 DeepFashion [1]

This dataset includes 105,562 user images and 28,512 shop images of 19,135 unique clothing items. Auxiliary data including category labels, clothing attributes, clothing landmarks, and street-shop image pairs are also provided. Each clothing item may have a set of street images but only a couple of shop images. We generate 1,911,570 training triplets with a positive-to-negative ratio of 1-to-10 that do not overlap with the test set. 237 relevant attributes are chosen as the taxonomy for attribute learning. 4,582 street images are randomly selected as test queries and searched against all 28,512 shop images to evaluate the cross-domain task. Different from [21], landmark data is not used in our experiments as it is not required by the proposed model.

IV-B Experimental Setting

All the experiments are conducted on a server with an Intel Xeon(R) E5-2660 CPU and two Tesla K40c GPU cards. The base model for the encoder is ResNet-152 [30], from which the last two layers are removed for fine-tuning. Mini-batch stochastic gradient descent is used for parameter updating. The hidden size is fixed at 256 and the dimensionality of the word embedding vectors is 256. The batch size is fixed at 32 and the momentum is 0.9. The learning rate is set to 0.001 and decays by a factor of 0.1 every 50 epochs. The margin in the triplet loss function is set to 1/16 of the hash code length (e.g., 2 for 32-bit hash codes) for the experiments. The scaling parameter is enlarged by 10 times each time the network converges.

IV-B1 Evaluation Metric

Following [3], we evaluate retrieval performance by the top-$K$ precision, which is defined as follows:

$$P@K = \frac{1}{Q} \sum_{q=1}^{Q} \mathrm{hit}(q, K), \tag{17}$$

where $Q$ is the total number of queries; $\mathrm{hit}(q, K) = 1$ if at least one image of the same product as the query image $q$ appears in the returned top-$K$ ranking list; otherwise $\mathrm{hit}(q, K) = 0$. For most queries, there is only one matched shop image in both the DARN and DeepFashion datasets.
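The top-K precision described above can be computed with a few lines of Python. This is a minimal sketch; the function name `precision_at_k` and the toy shop image identifiers are hypothetical.

```python
def precision_at_k(ranked_lists, relevant_sets, k):
    """Fraction of queries with at least one matching item in their top-k results."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_lists, relevant_sets)
        if any(item in relevant for item in ranked[:k])
    )
    return hits / len(ranked_lists)

# Three toy queries: ranked shop image ids and the ids showing the same product.
ranked_lists = [["s3", "s1", "s7"], ["s2", "s9", "s4"], ["s5", "s6", "s8"]]
relevant_sets = [{"s1"}, {"s4"}, {"s0"}]

for k in (1, 2, 3):
    print(k, precision_at_k(ranked_lists, relevant_sets, k))
```

Because a query counts as a hit as soon as one correct item appears, the measure is non-decreasing in K, which is why the curves in the P@K figures rise from left to right.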

To evaluate sequential attribute learning, we employ the BLEU, ROUGE-L, and CIDEr metrics. BLEU computes the geometric mean of the n-gram precisions,

$$\mathrm{BLEU}_N(\hat{y}, y) = \exp\Big(\frac{1}{N} \sum_{n=1}^{N} \log p_n(\hat{y}, y)\Big), \tag{18}$$

where $y$ and $\hat{y}$ denote the ground truth and the prediction of $T$-length attribute sequences, respectively, and $p_n$ is the n-gram precision. Similarly, ROUGE-L [35] is a recall-oriented measure that evaluates the quality of the longest common subsequence, and CIDEr [36] measures the overall quality of the generated attribute sequence against the ground truth provided by humans.
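The effect of attribute order on BLEU can be illustrated with a simplified sketch. It implements clipped n-gram precision and its geometric mean against a single reference and, as a stated simplification, omits the brevity penalty of the standard metric; the attribute sequences are hypothetical.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(prediction, reference, n):
    """Clipped n-gram precision of the prediction against one reference."""
    pred, ref = ngram_counts(prediction, n), ngram_counts(reference, n)
    total = sum(pred.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in pred.items())
    return matched / total

def bleu(prediction, reference, max_n):
    """Geometric mean of the 1..max_n gram precisions (no brevity penalty)."""
    precisions = [ngram_precision(prediction, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = ["silk", "pocket", "shirt"]
in_order = ["silk", "pocket", "shirt"]
shuffled = ["pocket", "silk", "shirt"]

print(bleu(in_order, reference, 1), bleu(in_order, reference, 2))
print(bleu(shuffled, reference, 1), bleu(shuffled, reference, 2))
```

Both predictions contain the right attributes and therefore reach the same unigram score, but only the correctly ordered one keeps its score once bigrams are taken into account. This is the sense in which the higher-order metrics reward sequence-aware prediction.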

IV-B2 Compared Methods

For a fair comparison of the accuracy and efficiency of cross-domain garment image retrieval, we compare the proposed model with state-of-the-art methods in the literature, namely WTBI [17], DARN [5], FashionNet [1], TagYNIN and CtxNIN [3]. Note that the length of the feature vector varies from 17,920-D (4,096-D after PCA) [1, 5, 17] to 1024-D [3]. Additionally, we compare DMCH with deep hashing methods at a fixed embedding length (i.e., 128-bit) to observe whether the defined multi-task objectives contribute positively.

IV-C Evaluation on Cross-domain Retrieval

IV-C1 Comparisons with the State-of-the-art Methods

Fig. 3: Comparison of P@K on the DeepFashion Dataset.

Figure 3 shows the top-K precision of the baseline methods and our method on the DeepFashion dataset. We can clearly see that WTBI is inferior to the other cross-domain retrieval methods, since it only applies attributes to expand the categories rather than exploiting their correlations. DARN performs better than WTBI, as it learns different NIN (Network in Network [37]) branches for the user domain and the shop domain and employs a tree-structured layer for attribute prediction, which potentially guides the model towards strong discrimination. However, it is not competitive with FashionNet, which further applies clothes detection as an initialization step to suppress the noise from the image background. FashionNet shares the convolutional layers between both domains while using different top branches for its two tasks (i.e., attribute prediction and landmark prediction). Though YNet similarly locates the major attention of the image beforehand, it chooses to utilize attribute information as input and constructs two subnetworks for the separate domains. The representation of user images in YNet is strongly connected with both positive and negative shop image representations. As a result, YNet shows superior performance among the aforementioned methods; however, it uses attributes as inputs to explore garment attention, which makes it impractical for commercial applications. As for the deep-hashing-based methods, we fix the code length to 128 bits for fair comparison. As we can see, DPH performs better than DPSH, since DPH adopts an attribute-preserving strategy, which potentially helps to adjust the embeddings for fine-grained categories. However, as discussed in Section II-A, they are not specifically designed for the cross-domain task, where disturbances and noise from the image background exist. Our DMCH model consistently achieves the best performance on the cross-domain garment image retrieval task, mainly owing to the attention mechanism for recognizing local details, the sequential modeling of attributes and the iterative optimization of the embedding.

Fig. 4: Comparison of P@K on the DARN dataset.

Figure 4 shows the top-K precision of the baseline methods and our method on the DARN dataset. The proposed method remains superior, while the gap to the other attribute-based methods is smaller than on the DeepFashion dataset. A possible reason is that, in the DARN dataset, each item is associated with a shorter attribute sequence, so that sequential attribute learning partly degrades to conventional attribute classification. Moreover, the ground-truth attributes in both the DeepFashion and DARN datasets are still incomplete and vague (e.g., “thick” vs “regular thickness”); even human annotators can easily misjudge these words. Better performance is expected if the attributes are refined before learning.

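| Components | DeepFashion | DARN |
|---|---|---|
| $\mathcal{L}_{tri}$ | 2.1 | 2.8 |
| $\mathcal{L}_{tri}$ + $\mathcal{L}_{cls}$ | 13.7 | 19.6 |
| $\mathcal{L}_{tri}$ + $\mathcal{L}_{att}$ | 20.8 | 25.8 |

TABLE I: The Precision@200 (%) of DMCH on DeepFashion and DARN using different combinations of loss with 128-bit code.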

IV-C2 Analysis on Multi-task Loss

In this subsection, we study the effect of the loss components on the cross-domain retrieval task with a fixed code length (128-bit). The loss function designed for DMCH mainly consists of the triplet embedding loss ($\mathcal{L}_{tri}$) and the cross-entropy loss of sequential attribute learning ($\mathcal{L}_{att}$). Conventional attribute learning is positioned as a multi-label classification problem, which can be learned with a fully connected layer supervised by a cross-entropy loss ($\mathcal{L}_{cls}$). We report the retrieval results for the different combinations of loss functions in Table I. The observation is two-fold: 1) attribute supervision contributes positively to bridging fine-grained classes and endows the embeddings with rich interpretable semantics; 2) by taking the attribute order into consideration, the proposed sequential attribute learning improves the top-200 precision by up to 51.8% in relative terms compared with conventional attribute learning. Usually, when the number of classes goes up, the “long-tail” distribution problem emerges in classification, that is, rare attributes from the tail of the distribution become much harder to predict. However, our sequential attribute learning exploits conditional probabilities to adjust the attribute distribution at every time step $t$, which potentially alleviates the “long-tail” problem.
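The relative improvement quoted above can be reproduced from the Precision@200 values of Table I. The short check below assumes that the second and third rows of the table correspond to conventional and sequential attribute supervision, respectively; the dictionary names are illustrative.

```python
# Precision@200 (%) with 128-bit codes, copied from Table I.
conventional = {"DeepFashion": 13.7, "DARN": 19.6}   # assumed: multi-label attribute loss
sequential = {"DeepFashion": 20.8, "DARN": 25.8}     # assumed: sequential attribute loss

relative_gain = {
    name: 100.0 * (sequential[name] - conventional[name]) / conventional[name]
    for name in conventional
}
for name, gain in relative_gain.items():
    print(f"{name}: +{gain:.1f}% relative")
```

The larger of the two gains is the one obtained on DeepFashion, which matches the "up to 51.8%" figure; the gain on DARN is smaller, in line with the shorter attribute sequences of that dataset.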

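| Methods | DeepFashion 32-bit | DeepFashion 64-bit | DeepFashion 128-bit | DARN 32-bit | DARN 64-bit | DARN 128-bit |
|---|---|---|---|---|---|---|
| two-stage solution (post-processing $\mathrm{sgn}$) | 6.8 | 11.5 | 17.1 | 5.6 | 12.6 | 20.3 |
| vanilla $\tanh$ | 8.6 | 13.2 | 19.1 | 7.6 | 14.4 | 22.2 |
| iterative optimization (scaled $\tanh$) | 10.1 | 14.6 | 20.8 | 9.8 | 16.7 | 25.8 |

TABLE II: The Precision@200 (%) of DMCH on the DeepFashion and DARN dataset. We compare the model with various numbers of bits and with different discretization strategies.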

IV-C3 Study on Binary Optimization and Code Length

The proposed DMCH is capable of generating hash codes directly, while most previous hashing methods follow a two-stage strategy, i.e., first learning continuous representations and then discretizing them with a non-smooth signum function. In this subsection, we investigate the effect of direct binary code optimization and of the code length of the embedding on retrieval performance, as shown in Table II. We observe that the post-processing signum function and the vanilla tanh function are both sub-optimal, as they suffer from quantization error to different degrees. The proposed iterative optimization alleviates the quantization error of hashing and outperforms the other discretization strategies. As for the code length, longer embeddings preserve more visual and semantic information, thus leading to better retrieval performance.

IV-C4 Study on Querying Efficiency of DMCH

To study the efficiency of the proposed model, we measure the running time of 1,000 query samples on the DARN dataset, which is shown in Figure 5. DARN and FashionNet concatenate the local features from the convolutional layers and the global features from the fully connected layers (i.e., 17,920-D). Consequently, their feature dimensionality is much larger than that of YNet [3] (i.e., 1000-D for the NIN structure). Even after dimensionality reduction by PCA, the feature dimensionality of DARN is still as high as 4,096-D. In contrast, the proposed method jointly embeds images into binary codes of much shorter length (e.g., 128-bit), and query processing is significantly accelerated because Hamming distances are computed instead of expensive Euclidean distances. We observe that our approach is considerably faster than Y-Net, DARN-PCA and DARN, with a speed-up of up to 306×.
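The storage side of this comparison can be made concrete with a short calculation. The sketch assumes that real-valued features are stored as 32-bit floats, which the text does not state; the dimensionalities are the ones quoted above.

```python
BYTES_PER_FLOAT32 = 4   # assumption: real-valued features stored as 32-bit floats

bytes_per_image = {
    "DARN (17,920-D)": 17920 * BYTES_PER_FLOAT32,
    "DARN-PCA (4,096-D)": 4096 * BYTES_PER_FLOAT32,
    "DMCH (128-bit)": 128 // 8,
}

dmch_bytes = bytes_per_image["DMCH (128-bit)"]
for name, size in bytes_per_image.items():
    print(f"{name}: {size} bytes per image, {size // dmch_bytes}x the size of a DMCH code")
```

Under this assumption a 128-bit code occupies 16 bytes per image, three orders of magnitude less than the concatenated real-valued features, in addition to the cheaper distance computation.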

Fig. 5: Running time for 1,000 query samples with different cross-domain retrieval methods on the DARN dataset. Note feature vectors or embeddings are of various length.
Fig. 6: Visualization of the spatial attention maps. Correct attribute prediction is shown in blue captions, wrong ones in red captions, and unknown ones in black captions.
Methods DeepFashion DARN
B-1 B-2 B-3 B-4 ROUGE-L CIDEr B-1 B-2 B-3 B-4 ROUGE-L CIDEr
WTBI 14.1 11.4 7.1 5.5 16.7 13.6 17.3 14.2 10.1 7.6 19.6 24.6
DARN 28.5 19.6 11.6 7.6 33.4 24.2 34.6 24.7 15.8 9.3 32.5 53.2
FashionNet 36.8 21.3 13.7 9.8 35.8 26.4 42.3 26.6 18.7 11.5 42.0 76.5
DMCH 35.7 25.9 19.1 14.2 44.1 31.5 47.7 34.2 25.7 18.5 51.0 98.6
TABLE III: The performance of sequential attribute learning on the DeepFashion and DARN dataset. The code length of DMCH is fixed at 128-bit.
Fig. 7: A 2D visualization of the embedding Hamming space of cross-domain images. The query user image is bounded by a black box, and the yellow arrow represents the pre-defined search radius. The shop images falling within the yellow circle are returned as the matched clothing items for the query image. Each shop image is associated with a product ID. A clothing item may have more than one shop image.

IV-D Evaluation on Sequential Attribute Learning

IV-D1 Comparison with State-of-the-art Methods

In this part, we verify the accuracy and comprehensiveness of our sequential attribute learning task. A detailed comparison among different methods is given in Table III. The attributes defined in both datasets follow a certain order, similar to human linguistic convention. WTBI achieves the lowest scores across all metrics, while DARN performs moderately better. FashionNet is quite competitive with the proposed method: in terms of the unigram metric (i.e., BLEU-1), FashionNet achieves a slight gain over DMCH on the DeepFashion dataset. It is natural for a multi-label classification model to hit single words accurately, as it is not required to consider context or order. On the context-aware metrics (e.g., BLEU-2 to BLEU-4, ROUGE-L, CIDEr), however, our model performs much better, since it learns the correlation within the attribute sequence rather than only isolated attributes. We also notice that the performance on the DARN dataset is much better than on DeepFashion. This is potentially because the DARN dataset contains much shorter attribute sequences drawn from a smaller taxonomy than the DeepFashion dataset.
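
The difference between the unigram and the context-aware metrics can be illustrated with the modified n-gram precision that underlies BLEU. This is a simplified sketch: it omits the brevity penalty and the geometric mean over n-gram orders used by the full BLEU score, and the attribute sequences are hypothetical examples rather than dataset samples.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate: Sequence[str],
                       reference: Sequence[str], n: int) -> float:
    """Clipped n-gram precision of the candidate against one reference."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


# Hypothetical ground-truth attribute sequence and two predictions.
reference = ["long", "lace", "thin", "round-collar", "short-sleeve"]
in_order = ["long", "lace", "thin", "round-collar", "sleeveless"]
shuffled = ["thin", "long", "round-collar", "lace", "short-sleeve"]

for name, pred in (("in_order", in_order), ("shuffled", shuffled)):
    print(name,
          modified_precision(pred, reference, 1),
          modified_precision(pred, reference, 2))
```

The shuffled prediction contains every correct attribute and so gets a perfect unigram precision, yet it shares no bigram with the reference. The in-order prediction misses one attribute but keeps most of the bigrams, which mirrors how an order-agnostic multi-label classifier can lead on BLEU-1 and still trail on the higher-order metrics.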

IV-D2 Attention Visualization

To better understand the validity of our attention modules, we visualize the attention maps of two sample images from the DeepFashion dataset in Figure 6. The visualization demonstrates the capability of the proposed model in attribute learning, including discovering spatial correlation (e.g., two attentive parts for “short sleeves”) and preserving semantic correlation (e.g., “lace” and “lacing clothing”). Take our sequential attribute learning result in Figure 6 as an example: the first clothing item is described in the order of “length”, “details and style of clothes”, “thickness of clothes”, and “collar and sleeve type”, where the linguistic order of descriptors is well preserved. The failure case shown in the second example is caused by an incomplete view of the clothing item and by its color being easily confused with the background.

IV-E Exemplars of Cross-domain Search

In this subsection, we use real garment images from the DeepFashion dataset to visualize the learnt embedding space with t-SNE [38] in Figure 7, which gives an intuitive understanding of the cross-domain task and the performance. By projecting the 128-bit binary codes onto a 2D plane, three potential issues with the training dataset can be clearly observed: 1) highly similar shop images are used by different online merchants; for instance, items “123” and “149” use the same shop image; 2) shop images labeled with the same product ID show large variance; taking items “169” and “159” as an example, the major patterns, the color, and the text on the clothing items are quite different; 3) the product categories are numerous, so it is nearly impossible to cover all mutual relationships in our training triplets. Better cross-domain garment image retrieval performance is expected once the above data quality issues are addressed in future work.
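
The radius-based lookup pictured in Figure 7 can be sketched as follows, assuming each shop image is stored as a (product ID, binary code) pair. The 8-bit codes, the product IDs and the function name `products_within_radius` are hypothetical and serve only to illustrate that several shop images may map to one product.

```python
from typing import Dict, List, Sequence, Tuple


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two integer-packed codes."""
    return bin(a ^ b).count("1")


def products_within_radius(query_code: int,
                           shop_index: Sequence[Tuple[str, int]],
                           radius: int) -> List[str]:
    """Product IDs having at least one shop image within the Hamming radius.

    Products are ordered by their closest shop image, then by ID.
    """
    best: Dict[str, int] = {}
    for product_id, code in shop_index:
        d = hamming_distance(query_code, code)
        if d <= radius and d < best.get(product_id, radius + 1):
            best[product_id] = d
    return sorted(best, key=lambda pid: (best[pid], pid))


# Toy index: product "A" has two shop images, "B" and "C" have one each.
shop_index = [
    ("A", 0b10110010),
    ("A", 0b10110000),
    ("B", 0b10010011),
    ("C", 0b01001101),
]
query = 0b10110011

print(products_within_radius(query, shop_index, radius=2))
```

Deduplicating by product ID matters here because a match is judged at the product level, while the index is built at the image level.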

V Conclusion and Future Work

To address the problem of cross-domain garment image search, we have proposed a novel joint learning framework that shares visual and verbal knowledge to exploit the relationships among fine-grained classes and attributes. Unlike the state-of-the-art models in the literature, we treat attribute descriptions as rich context for activating spatial attention, which enhances the recognition of garment details. Meanwhile, we embed images as short binary codes to significantly improve query efficiency. Given the potential of our model, we will further explore multi-modal retrieval in the near future.


References

  • [1] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang, “Deepfashion: Powering robust clothes recognition and retrieval with rich annotations,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, pp. 1096–1104.
  • [2] H. Zhan, B. Shi, and A. C. Kot, “Cross-domain shoe retrieval with a semantic hierarchy of attribute classification network,” IEEE Trans. Image Processing, vol. 26, no. 12, pp. 5867–5881, 2017.
  • [3] X. Ji, W. Wang, M. Zhang, and Y. Yang, “Cross-domain image retrieval with attention modeling,” in Proceedings of the 2017 ACM on Multimedia Conference, MM 2017, Mountain View, CA, USA, October 23-27, 2017, 2017, pp. 1654–1662.
  • [4] X. Han, Z. Wu, P. X. Huang, X. Zhang, M. Zhu, Y. Li, Y. Zhao, and L. S. Davis, “Automatic spatially-aware fashion concept discovery,” in IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017, pp. 1472–1480.
  • [5] J. Huang, R. S. Feris, Q. Chen, and S. Yan, “Cross-domain image retrieval with a dual attribute-aware ranking network,” in 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, 2015, pp. 1062–1070.
  • [6] H. Liu, R. Wang, S. Shan, and X. Chen, “Learning multifunctional binary codes for both category and attribute oriented retrieval tasks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, 2017, pp. 6259–6268.
  • [7] J. Wang, T. Zhang, J. Song, N. Sebe, and H. T. Shen, “A survey on learning to hash,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 769–790, 2018.
  • [8] B. Wang, Y. Yang, X. Xu, A. Hanjalic, and H. T. Shen, “Adversarial cross-modal retrieval,” in Proceedings of the 2017 ACM on Multimedia Conference, MM 2017, Mountain View, CA, USA, October 23-27, 2017, 2017, pp. 154–162.
  • [9] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin, “Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 12, pp. 2916–2929, 2013.
  • [10] F. Shen, C. Shen, W. Liu, and H. T. Shen, “Supervised discrete hashing,” in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, 2015, pp. 37–45.
  • [11] W. Liu, J. Wang, R. Ji, Y. Jiang, and S. Chang, “Supervised hashing with kernels,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 16-21, 2012, 2012, pp. 2074–2081.
  • [12] Y. Luo, Y. Yang, F. Shen, Z. Huang, P. Zhou, and H. T. Shen, “Robust discrete code modeling for supervised hashing,” Pattern Recognition, vol. 75, pp. 128–135, 2018.
  • [13] Y. Yang, Y. Luo, W. Chen, F. Shen, J. Shao, and H. T. Shen, “Zero-shot hashing via transferring supervised knowledge,” in Proceedings of the 2016 ACM Conference on Multimedia Conference, MM 2016, Amsterdam, The Netherlands, October 15-19, 2016, 2016, pp. 1286–1295.
  • [14] W. Li, S. Wang, and W. Kang, “Feature learning based deep supervised hashing with pairwise labels,” in Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, 2016, pp. 1711–1717.
  • [15] H. Liu, R. Wang, S. Shan, and X. Chen, “Learning multifunctional binary codes for both category and attribute oriented retrieval tasks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, 2017, pp. 6259–6268.
  • [16] F. Shen, X. Gao, L. Liu, Y. Yang, and H. T. Shen, “Deep asymmetric pairwise hashing,” in Proceedings of the 2017 ACM on Multimedia Conference, MM 2017, Mountain View, CA, USA, October 23-27, 2017, 2017, pp. 1522–1530.
  • [17] M. H. Kiapour, X. Han, S. Lazebnik, A. C. Berg, and T. L. Berg, “Where to buy it: Matching street clothing photos in online shops,” in 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, 2015, pp. 3343–3351.
  • [18] C. Su, S. Zhang, F. Yang, G. Zhang, Q. Tian, W. Gao, and L. S. Davis, “Attributes driven tracklet-to-tracklet person re-identification using latent prototypes space mapping,” Pattern Recognition, vol. 66, pp. 4–15, 2017.
  • [19] C. Su, S. Zhang, J. Xing, W. Gao, and Q. Tian, “Deep attributes driven multi-camera person re-identification,” in Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II, 2016, pp. 475–491.
  • [20] Y. Pan, T. Yao, H. Li, and T. Mei, “Video captioning with transferred semantic attributes,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, 2017, pp. 984–992.
  • [21] T. Yao, Y. Pan, Y. Li, Z. Qiu, and T. Mei, “Boosting image captioning with attributes,” in IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017, pp. 4904–4912.
  • [22] Y. Bin, Y. Yang, J. Zhou, Z. Huang, and H. T. Shen, “Adaptively attending to visual attributes and linguistic knowledge for captioning,” in Proceedings of the 2017 ACM on Multimedia Conference, MM 2017, Mountain View, CA, USA, October 23-27, 2017, 2017, pp. 1345–1353.
  • [23] J. Li, Y. Wei, X. Liang, F. Zhao, J. Li, T. Xu, and J. Feng, “Deep attribute-preserving metric learning for natural language object retrieval,” in Proceedings of the 2017 ACM on Multimedia Conference, MM 2017, Mountain View, CA, USA, October 23-27, 2017, 2017, pp. 181–189.
  • [24] Y. Li, R. Wang, H. Liu, H. Jiang, S. Shan, and X. Chen, “Two birds, one stone: Jointly learning binary code for large-scale face image retrieval and attributes prediction,” in 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, 2015, pp. 3819–3827.
  • [25] J. Chen, C. Ngo, and T. Chua, “Cross-modal recipe retrieval with rich food attributes,” in Proceedings of the 2017 ACM on Multimedia Conference, MM 2017, Mountain View, CA, USA, October 23-27, 2017, 2017, pp. 1771–1779.
  • [26] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, 2015, pp. 2048–2057.
  • [27] G. Lu, Y. Yan, L. Ren, J. Song, N. Sebe, and C. Kambhamettu, “Localize me anywhere, anytime: A multi-task point-retrieval approach,” in 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, 2015, pp. 2434–2442.
  • [28] B. Bhattarai, G. Sharma, and F. Jurie, “Cp-mtml: Coupled projection multi-task metric learning for large scale face retrieval,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, pp. 4226–4235.
  • [29] L. Chen and T. T. Rogers, “Knowing where to look: Conceptual knowledge guides fixation in an object categorization task,” in Proceedings of the 34th Annual Meeting of the Cognitive Science Society, CogSci 2012, Sapporo, Japan, August 1-4, 2012, 2012.
  • [30] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, pp. 770–778.
  • [31] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, 2015, pp. 448–456.
  • [32] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
  • [33] Z. Cao, M. Long, J. Wang, and P. S. Yu, “Hashnet: Deep learning to hash by continuation,” in IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017, pp. 5609–5618.
  • [34] J. Song, T. He, L. Gao, X. Xu, A. Hanjalic, and H. T. Shen, “Binary generative adversarial networks for image retrieval,” 2018.
  • [35] C. Lin and F. J. Och, “Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics,” in Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, 21-26 July, 2004, Barcelona, Spain, 2004, pp. 605–612.
  • [36] L. Du, T. Wo, R. Yang, and C. Hu, “Cider: a rapid docker container deployment system through sharing network storage,” in 19th IEEE International Conference on High Performance Computing and Communications; 15th IEEE International Conference on Smart City; 3rd IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2017, Bangkok, Thailand, December 18-20, 2017, 2017, pp. 332–339.
  • [37] M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.
  • [38] L. van der Maaten, “Accelerating t-sne using tree-based algorithms,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 3221–3245, 2014.