Fashion has recently become one of the most prominent topics of interdisciplinary study in Computer Science. With the emergence of deep learning based solutions, fashion-related research has started to produce promising results on various tasks, including clothing recognition, attribute prediction, clothing retrieval, body segmentation, and style prediction. Retrieving the desired clothing image from a collection is one of the most challenging tasks in the fashion domain, and it is typically tackled with a mechanism that learns to capture different notions of similarity between images in a common subspace.
Recent studies tend to employ Convolutional Neural Networks (CNNs) in their solutions. However, CNNs, by their nature, have some limitations, such as losing the hierarchical spatial information of objects and not being robust to affine transformations. Recently, an alternative deep learning architecture, namely Capsule Networks, and a novel dynamic routing algorithm have been proposed by Sabour and Hinton. In this design, with the help of the routing-by-agreement algorithm, it is possible to learn more descriptive information about objects without losing the intrinsic spatial relationship between an object and its parts. Therefore, Capsule Networks have the capacity to recognize images regardless of the viewing angle and without requiring additional transformations of the input, since this architecture can inherently learn the higher-dimensional pose configuration of the images.
In this study, we apply Capsule Networks to the clothing retrieval problem and extend their capabilities with several improvements. First, we extract the features of larger clothing images with more powerful methods (stacked or residual-connected convolutional layers) and forward these features to fully-connected capsules. Next, we introduce a Triplet-based design of Capsule Networks that learns the similarity between triplets. Lastly, we train our proposed architectures on the in-shop partition of the DeepFashion data set, and compare our results with the baseline study, namely FashionNet, and the other SOTA methods.
2 Related Works
Clothing retrieval has become more important after some major developments in Computer Science and the emergence of e-commerce. Recent studies generally tackle this task with deep convolutional networks. The Exact Street to Shop task was introduced as a particularly challenging problem, where the goal is to match an item in photos captured by users to the exact same item in online shopping photos. The Dual Attribute-aware Ranking Network (DARN) was proposed to address the cross-domain image matching problem. The DeepFashion data set provides a vast number of clothing images annotated with numerous attributes, landmark information, and cross-domain image correspondences. Corbière et al. demonstrate that integrating a bag-of-words approach into a weakly-supervised learning process can achieve promising results on the clothing retrieval task. The Visual Attention Model (VAM) introduces a novel Dropout-like connection after its attention layers. Hard-Aware Deeply Cascaded Embedding addresses the issues of defining a model with the right complexity and choosing hard samples carefully during training. BIER shows how to improve the robustness of feature embeddings by exploiting the independence within ensembles. Hierarchical triplet loss (HTL) was introduced to address the random sampling issue that arises when training with a triplet loss. Finally, a multiple-way attention-based ensemble architecture has been proposed that learns feature embeddings with multiple attention masks.
Capsules are groups of neurons that convey higher-dimensional information through the network in a more refined way. This information is interpreted as the pose configuration and the existence probability of an instance. Each higher-level capsule is formed by routing the incoming votes from the capsules in the level below, where each vote is computed as a linear transformation of a lower-level pose configuration. During dynamic routing, the linear combination of the incoming votes, weighted by their coupling coefficients, forms the non-activated output of each higher-level capsule. At each iteration, the weights of these votes are updated according to the dot product between the incoming votes and the outputs of the higher-level capsules, which is referred to as the agreement between capsules. Finally, the output of each higher-level capsule is obtained by applying the squashing function proposed by Sabour and Hinton.
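The routing procedure described above can be illustrated with a minimal NumPy sketch. It is not the implementation used in this study: the function names, tensor shapes, and the `eps` constant are assumptions made for illustration, and the loop follows the generic routing-by-agreement scheme (softmax over routing logits, weighted sum of votes, squashing, agreement update).

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Squashing non-linearity: keeps the direction, maps the norm into [0, 1)."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def dynamic_routing(u, W, num_iters=3):
    """
    u: (n_lower, d_lower) poses of the lower-level capsules.
    W: (n_lower, n_higher, d_higher, d_lower) transformation matrices (hypothetical layout).
    Returns the higher-level outputs v (n_higher, d_higher) and the
    coupling coefficients c (n_lower, n_higher) of the last iteration.
    """
    # Votes: u_hat[i, j] = W[i, j] @ u[i]
    u_hat = np.einsum('ijkl,il->ijk', W, u)
    b = np.zeros(u_hat.shape[:2])  # routing logits, start uniform
    for _ in range(num_iters):
        c = softmax(b, axis=1)                      # coupling coefficients
        s = np.einsum('ij,ijk->jk', c, u_hat)       # non-activated outputs
        v = squash(s)                               # activated outputs
        b = b + np.einsum('ijk,jk->ij', u_hat, v)   # agreement (dot product)
    return v, c
```

With `num_iters=3` this mirrors the three routing iterations used later in the text; the norm of each row of `v` plays the role of the existence probability.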
3.2 Proposed Architectures
In our design, we adjust the original Capsule Network structure to a Triplet-based version, so that the network can learn the similarity between two images by feeding the objective function with the embedded representations extracted by the capsules. Our Capsule Network design aims to minimize the Triplet loss shown in Equation 1,

$$\mathcal{L}(A, P, N) = \max\big(0,\; d(f_A, f_P) - d(f_A, f_N) + m\big) \tag{1}$$

where $d$ is the Euclidean distance metric, $m$ is the distance margin, and $f_A$, $f_P$, $f_N$ are the latent capsule embeddings extracted from the anchor image $A$, the positive image $P$, and the negative image $N$, respectively. While forming these embeddings, we normalize the latent capsules by their L2-norm and then set all capsules to zero except the one that belongs to the correct class.
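As a rough illustration of this objective, the sketch below builds a masked, L2-normalized embedding from a set of class capsules and evaluates a hinge-style Triplet loss on Euclidean distances. The function names, the per-capsule normalization, and the margin value are assumptions for illustration and are not taken from the study.

```python
import numpy as np

def capsule_embedding(class_caps, label, eps=1e-12):
    """
    class_caps: (n_classes, dim) latent class capsules of one image.
    Each capsule is L2-normalized (assumed to be per capsule), then every
    capsule except the one at index `label` is set to zero.
    Returns a flat vector of length n_classes * dim.
    """
    norms = np.linalg.norm(class_caps, axis=1, keepdims=True)
    normed = class_caps / (norms + eps)
    mask = np.zeros((class_caps.shape[0], 1))
    mask[label] = 1.0
    return (normed * mask).ravel()

def triplet_loss(f_a, f_p, f_n, margin=1.0):
    """Hinge on the gap between anchor-positive and anchor-negative distances."""
    d_ap = np.linalg.norm(f_a - f_p)
    d_an = np.linalg.norm(f_a - f_n)
    return max(0.0, d_ap - d_an + margin)
```

Because the mask keeps a single unit-norm capsule, two embeddings from different classes occupy disjoint slots and are always at distance sqrt(2), while embeddings from the same class differ only in the direction of the surviving capsule.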
As illustrated in Figure 2, Capsule Networks essentially contain two main blocks: a feature extraction block and the capsule layers. In the original design proposed by Sabour and Hinton, there is only one feature extraction block, consisting of a single convolutional layer with 64 filters. Extracting features with such a shallow structure may be enough for small one-channel handwritten digit images. However, fully-connected capsules need more complex features to achieve better results on more complicated image-related problems. Therefore, we design two different feature extraction blocks to form more powerful features as the input of the capsules. In the first, a number of convolutional layers are stacked without any pooling operation between them; in the second, these layers are connected residually. In both designs, the leaky form of the linear rectifier [5] is applied between convolutional layers.
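The difference between the two feature extraction blocks can be sketched in a few lines. This is a deliberately simplified illustration: plain matrix multiplications stand in for the convolutional layers, the slope `alpha` is a hypothetical value, and the exact placement of the skip connection in the actual architecture is an assumption.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Leaky rectifier: identity for positive inputs, small slope otherwise."""
    return np.where(x > 0, x, alpha * x)

def stacked_block(x, weights, alpha=0.01):
    """Layers applied one after another, with no pooling in between."""
    for w in weights:
        x = leaky_relu(x @ w, alpha)
    return x

def residual_block(x, weights, alpha=0.01):
    """Same layers, but each layer's output is added to its own input."""
    for w in weights:
        x = x + leaky_relu(x @ w, alpha)
    return x
```

In the residual variant the input always has a direct path to the output, so a layer whose weights are zero leaves the features unchanged instead of erasing them.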
Furthermore, the capsule layers are kept identical in both designs. There are two fully-connected capsule layers, namely Primary Capsule and Class Capsule. Primary Capsule is the layer where the extracted features are grouped with respect to the capsule dimensionality. In our designs, this layer has 32 channels of 16-dimensional capsules that are fully-connected to Class Capsule. Next, the Class Capsule layer contains $C$ 16-dimensional capsules, where $C$ is the number of classes in the data set. The activations and the latent capsule vectors of Class Capsule are calculated via dynamic routing with 3 iterations. No reconstruction method (as used in the original design) is applied in our Capsule Network designs.
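The grouping performed by the Primary Capsule layer amounts to reshaping a feature map into capsule vectors. The sketch below assumes a channels-last feature map whose channel count equals 32 x 16; the function name and the spatial size used in the example are hypothetical.

```python
import numpy as np

def to_primary_capsules(features, n_channels=32, caps_dim=16):
    """
    features: (H, W, n_channels * caps_dim) feature map, channels last.
    Returns (H * W * n_channels, caps_dim): one capsule vector per
    spatial location and capsule channel.
    """
    h, w, c = features.shape
    if c != n_channels * caps_dim:
        raise ValueError("channel count must equal n_channels * caps_dim")
    return features.reshape(h * w * n_channels, caps_dim)

# Example: a hypothetical 6 x 6 feature map with 32 * 16 = 512 channels
features = np.arange(6 * 6 * 512, dtype=float).reshape(6, 6, 512)
primary = to_primary_capsules(features)  # shape (1152, 16)
```

Each of the resulting 16-dimensional vectors would then cast a vote for every Class Capsule during routing.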
The experiments for the proposed Stacked-convolutional (SCCapsNet) and Residual-connected (RCCapsNet) architectures are conducted on the in-shop partition of the DeepFashion data set. Both are trained on 25k training images, and tests are performed using 14k query and 12k gallery images. Since this is an information retrieval task, the performance is measured by the Recall@K metric, where K is 1 or a multiple of 10 up to 50. Moreover, as noted by Schroff et al., a hard negative sampling strategy significantly improves the convergence behavior of the model. Following this strategy, the negative image is picked as the closest image to the anchor among those of a different category, whereas every possible positive image in the data set is used as the positive one.
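The evaluation metric and the negative selection rule described above can be sketched as follows. The function names and array layouts are assumptions; in particular, a query is counted as a hit when an item with the same ID appears among its K nearest gallery embeddings, which is a common convention and not a detail confirmed by the text.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distances between rows of a (n, d) and rows of b (m, d)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

def recall_at_k(query_emb, query_ids, gallery_emb, gallery_ids,
                ks=(1, 10, 20, 30, 40, 50)):
    """Fraction of queries with at least one matching ID in the top-K results."""
    dist = pairwise_dist(query_emb, gallery_emb)
    order = np.argsort(dist, axis=1)       # nearest gallery items first
    ranked_ids = gallery_ids[order]        # (n_query, n_gallery)
    hits = ranked_ids == query_ids[:, None]
    return {k: float(np.mean(hits[:, :k].any(axis=1))) for k in ks}

def hardest_negative(anchor_idx, emb, categories):
    """
    Index of the closest embedding whose category differs from the anchor's.
    Assumes at least one image of another category is present.
    """
    dist = np.linalg.norm(emb - emb[anchor_idx], axis=1)
    dist[categories == categories[anchor_idx]] = np.inf  # drop same category
    return int(np.argmin(dist))
```

Computing the full distance matrix is fine for a sketch; with tens of thousands of query and gallery images it would normally be done in batches.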
As shown in Table 1, SCCapsNet and RCCapsNet achieve better retrieval performance than all variants of the baseline study (FashionNet) by a wide margin. It is important to note that both of our proposed architectures use only images during training, in contrast to the baseline study, where the network is supported by varying numbers of attributes and by landmark information. These experiments demonstrate that our Capsule Network designs can inherently learn the pose configuration of objects without any need to recover pose information explicitly.
| Models | Top-20 (%) | Top-50 (%) |
Table 2 summarizes the in-shop clothing retrieval results of SCCapsNet, RCCapsNet, and the SOTA methods. These figures indicate how successful our proposed designs are and what their main limitations are when compared to the SOTA CNN-based architectures. First, both of our designs outperform the earlier methods (i.e. WTBI and DARN), both of which use semantic attributes to improve the overall performance but neglect the pose configurations of the images during training. According to the Top-20 Recall@K scores, SCCapsNet improves on these two methods by 31% and 14%, while RCCapsNet performs even better, with margins of 34% and 17%, respectively. Another approach that falls behind ours is the method proposed by Corbière et al., which leverages weakly-annotated textual descriptors of the images. In this design, the textual descriptors (i.e. bag-of-words) represent different coarse semantic concepts such as texture, color, and shape. Capsules can learn these concepts directly from the images in a sophisticated way, and hence SCCapsNet and RCCapsNet achieve higher Recall@K scores than this approach without taking advantage of bag-of-words descriptors.
| Corbière et al. | 25 | 39.0 | 71.8 | 78.1 | 81.6 | 83.8 | 85.6 |
On the other hand, our proposed architectures do not reach the performance of more advanced CNN-based architectures. In these designs, various techniques are applied to CNNs to boost the overall performance, namely alternative hard sampling strategies, more advanced objective functions [2, 9], network ensembling [9, 6], and attention-based mechanisms [12, 6]. Although these techniques may significantly improve the overall performance of CNNs, in principle they either increase the model complexity by a wide margin or increase the training time considerably. For comparison, SCCapsNet and RCCapsNet have 2.5 and 4.5 million trainable parameters, respectively, while the SOTA methods have twice as many trainable parameters in their models. Capsule Networks also need more training time than CNNs, since dynamic routing is a relatively slow mechanism compared to the pooling variants. Therefore, given limited computational resources, these techniques have not yet been applied to our Capsule Network designs and are left as future research.
In this study, we present two different Triplet-based designs of Capsule Networks with more powerful feature extraction blocks and apply them to the clothing retrieval task. Experiments show promising results: both of our designs outperform all FashionNet variants without any extra information besides the images. Moreover, when compared to the SOTA methods, our designs perform comparably well with only half as many parameters, which shows the potential of the Capsule idea once its computational burden is lightened.
- Leveraging weakly annotated data for fashion image retrieval and label prediction. 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), pp. 2268–2274.
- (2018-09) Deep Metric Learning with Hierarchical Triplet Loss. In The European Conference on Computer Vision (ECCV).
- (2015-12) Where to Buy It: Matching Street Clothing Photos in Online Shops. pp. 3343–3351.
- (2015) Cross-domain image retrieval with a dual attribute-aware ranking network. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1062–1070.
- (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. pp. 448–456.
- (2018-09) Attention-based ensemble for deep metric learning. In The European Conference on Computer Vision (ECCV).
- DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2013) Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing.
- (2017) BIER: Boosting Independent Embeddings Robustly. 2017 IEEE International Conference on Computer Vision (ICCV), pp. 5199–5208.
- (2017) Dynamic routing between capsules. In Advances in Neural Information Processing Systems 30, pp. 3856–3866.
- FaceNet: A unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 815–823.
- (2017) Clothing retrieval with visual attention model. 2017 VCIP, pp. 1–4.
- (2017-10) Hard-Aware Deeply Cascaded Embedding. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 814–823.