We participated in the Fashion IQ Challenge 2019 by building image+text to image retrieval systems on fashion items in three pre-defined categories: dress, shirt and toptee.
Our baseline system consists of an image encoder, a text encoder and a composition module as shown in Figure 1. The image encoder is based on the VGG network [vgg] with landmark-driven attention layers [Liu], and we use BERT [bert] as the text encoder. The composition module is based on the TIRG method introduced in [Nam].
We use the Fashion IQ dataset [iq] for training our system on this task. As additional data, we use the Deepfashion dataset [deepfashion] to pre-train the image encoder and use an in-house fashion-domain corpus to pre-train the text encoder.
We trained our systems with data augmentation and with regularization techniques such as dropout and label smoothing.
We evaluated our systems by ensembling several models trained separately. As specified by the task, the evaluation measures recall at the top 10 and top 50 ranked results for each of the three categories: dress, shirt and toptee.
In the test phase, our system achieved 43.67% average recall and ranked third among participants.
In the following sections, we describe our method, experimental settings, evaluation results and conclusions.
Our baseline model includes three main parts: the image encoder, the text encoder and the composition layers.
For each candidate and target image pair, we proceed with the following steps:
1) Feed both the candidate and target images into the image encoder and linearly project the outputs into the same 1024-dimensional space.
2) Convert the caption tokens into 384-dimensional vectors using pre-trained text embeddings, feed them to the text encoder, and use the hidden representation of the “CLS” token as the representation of the whole document.
3) Compose the candidate image vector and the caption representation into the same space as the image vectors using TIRG composition. By doing this, we expect the candidate image vector to be shifted toward the target image vector, so that the cosine similarity between the composed vector and the target image vector is larger than that between the composed vector and any non-target image.
4) Normalize the composed vector and the target image vectors to unit L2 norm.
5) Compute cosine similarities between the composed vector and the target image vectors for the final ranking.
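The last two steps can be sketched in a few lines. This is a minimal illustration in plain Python, not our actual implementation: the function names `l2_normalize` and `rank_by_cosine` are hypothetical, and toy 3-dimensional vectors stand in for the 1024-dimensional ones.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def rank_by_cosine(composed, gallery):
    """Return gallery indices sorted by descending cosine similarity to `composed`."""
    q = l2_normalize(composed)
    scores = []
    for idx, image_vec in enumerate(gallery):
        g = l2_normalize(image_vec)
        # After L2 normalization, the dot product equals the cosine similarity.
        scores.append((sum(a * b for a, b in zip(q, g)), idx))
    scores.sort(key=lambda pair: -pair[0])
    return [idx for _, idx in scores]

# Toy vectors: the second gallery image points almost in the same direction
# as the composed query, so it should be ranked first.
composed = [1.0, 2.0, 0.0]
gallery = [[0.0, 0.0, 1.0], [2.0, 4.0, 0.1], [1.0, 0.0, 0.0]]
ranking = rank_by_cosine(composed, gallery)
```

Because both sides are normalized first, ranking by dot product and ranking by cosine similarity give the same order.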
In the following subsections, we describe the details of our sub-modules.
2.1 Image encoder
The image encoder, a VGG-based convolutional network, is identical to the model introduced in [Liu] except for two differences: 1) we add batch normalization layers to VGG; 2) in the attribute prediction layers used for pre-training, instead of directly predicting 1000 attributes from a single feature vector, we use the Attribute Prediction Network described in [iq] to predict attributes separately from five feature vectors, each corresponding to an attribute category.
Before applying the image encoder to the main task, we pre-trained it on the Deepfashion dataset [deepfashion] with three training objectives: attribute prediction, category prediction and landmark prediction. We optimized attribute prediction with binary cross-entropy loss, category prediction with negative log-likelihood loss and landmark prediction with mean squared error loss. We weighted these losses by 20, 1 and 10, respectively, and summed the weighted losses into a single objective that is optimized in one step.
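The weighted multi-task objective can be illustrated as follows. This is a sketch in plain Python under simplifying assumptions: the helper names are hypothetical, the losses are computed on toy values, and only the weights 20, 1 and 10 come from the description above.

```python
import math

# Loss weights from the text: attribute, category, landmark.
LOSS_WEIGHTS = {"attribute": 20.0, "category": 1.0, "landmark": 10.0}

def binary_cross_entropy(probs, labels):
    """Mean binary cross-entropy over independent attribute predictions."""
    eps = 1e-12  # guards against log(0)
    terms = [-(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
             for p, y in zip(probs, labels)]
    return sum(terms) / len(terms)

def negative_log_likelihood(class_probs, target_index):
    """Negative log-probability assigned to the correct category."""
    return -math.log(class_probs[target_index])

def mean_squared_error(pred, target):
    """Mean squared error between predicted and true landmark coordinates."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def pretraining_loss(attr_loss, cat_loss, lm_loss, weights=LOSS_WEIGHTS):
    """Weighted sum of the three losses, optimized jointly as one objective."""
    return (weights["attribute"] * attr_loss
            + weights["category"] * cat_loss
            + weights["landmark"] * lm_loss)

attr = binary_cross_entropy([0.9, 0.2], [1, 0])
cat = negative_log_likelihood([0.1, 0.7, 0.2], 1)
lm = mean_squared_error([0.5, 0.5], [0.4, 0.7])
total = pretraining_loss(attr, cat, lm)
```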
2.2 Text encoder
The text encoder, a Transformer-based network, is similar to BERT [bert] except in three respects: 1) we reduce the size of the model because, in our task, the average text length is shorter and the text structure is simpler than in the original paper; 2) we distinguish the role of each layer in the encoder by restricting the attention range of the self-attention mechanism; 3) in addition to masked language modeling, we use item category prediction as a sub-task instead of the next sentence prediction used in the original paper.
Our text encoder has four self-attention layers, each with 384 hidden dimensions, six attention heads and 1536 intermediate hidden dimensions. Because the input text is a document consisting of several sentences that are not necessarily in a fixed order, we assign different roles to the layers in the encoder. The first two layers are sentence layers, in which self-attention is applied only among tokens within the same sentence. The last two layers are document layers, in which self-attention is applied to all tokens in the document in order to capture information from the whole document. The input format is the same as in the original paper: sentences are separated by a special token “SEP” and the first position is always the class prediction token “CLS”. We also randomly shuffle the sentence order during training for better robustness.
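The difference between the two kinds of layers comes down to the attention mask. The sketch below builds both masks in plain Python, reading the sentence layers as restricting attention to tokens of the same sentence. The function names are hypothetical, and the handling of the special tokens is an assumption: “CLS” is grouped with the first sentence and each “SEP” with the sentence it terminates.

```python
def sentence_ids(tokens, sep="SEP"):
    """Assign each token the index of the sentence it belongs to.

    Assumption: "CLS" is grouped with the first sentence and each "SEP"
    is grouped with the sentence it terminates.
    """
    ids, current = [], 0
    for tok in tokens:
        ids.append(current)
        if tok == sep:
            current += 1
    return ids

def sentence_layer_mask(tokens):
    """mask[i][j] is True when token i may attend to token j (same sentence)."""
    ids = sentence_ids(tokens)
    return [[a == b for b in ids] for a in ids]

def document_layer_mask(tokens):
    """Every token may attend to every token in the document."""
    n = len(tokens)
    return [[True] * n for _ in range(n)]

tokens = ["CLS", "is", "red", "SEP", "has", "long", "sleeves", "SEP"]
sent_mask = sentence_layer_mask(tokens)  # used by the first two layers
doc_mask = document_layer_mask(tokens)   # used by the last two layers
```

With this layout the sentence-layer mask is block-diagonal, one block per sentence, while the document-layer mask is all ones.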
2.3 Image-text composition layers
We use TIRG [Nam] to compose the candidate image and the caption text. In addition to the original TIRG function, we add category embeddings to distinguish composition behaviour across the three fashion categories. The TIRG function used in our system therefore takes three inputs: the candidate image vector, the caption vector and the category embedding vector.
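For reference, a gated-residual composition of this kind can be written as below. The gating and residual structure follows the original TIRG formulation in [Nam]. The symbols and the way the category embedding enters (by concatenation with the other two inputs) are assumptions made for illustration only, and the exact form used in our system may differ.

```latex
\begin{aligned}
\phi_{xt} &= w_g \, f_{\text{gate}}(\phi_x, \phi_t, \phi_c) + w_r \, f_{\text{res}}(\phi_x, \phi_t, \phi_c) \\
f_{\text{gate}}(\phi_x, \phi_t, \phi_c) &= \sigma\big(W_{g2}\,\mathrm{ReLU}(W_{g1}[\phi_x; \phi_t; \phi_c])\big) \odot \phi_x \\
f_{\text{res}}(\phi_x, \phi_t, \phi_c) &= W_{r2}\,\mathrm{ReLU}(W_{r1}[\phi_x; \phi_t; \phi_c])
\end{aligned}
```

Here $\phi_x$, $\phi_t$ and $\phi_c$ stand for the candidate image vector, the caption vector and the category embedding vector, $[\cdot\,;\cdot]$ is concatenation, $\sigma$ is the sigmoid function, $\odot$ is the element-wise product, $W_{g1}, W_{g2}, W_{r1}, W_{r2}$ are learned linear maps, and $w_g$, $w_r$ are learnable scalar weights. The gate decides how much of the candidate image feature to keep, and the residual term adds the modification described by the caption.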
2.4 Loss function
Our loss function for deep metric learning is similar to that of [rankedlist], except that we use cosine similarity instead of Euclidean distance. We found that this loss achieves a slightly higher recall than the N-pair loss [n-pair] used in [Nam].
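To give an idea of what a margin-based loss over cosine similarities looks like, the sketch below shows a simplified, unweighted variant in the spirit of the ranked list loss. It is not our exact loss: the function name and both thresholds are hypothetical, and the negative-example weighting of [rankedlist] is omitted.

```python
def ranked_list_style_loss(pos_sims, neg_sims, pos_threshold=0.8, neg_threshold=0.4):
    """Hinge penalties on cosine similarities for one query.

    Positives are penalized when their similarity falls below `pos_threshold`;
    negatives are penalized when theirs rises above `neg_threshold`.
    Both thresholds are hypothetical values chosen for illustration.
    """
    pos_terms = [max(0.0, pos_threshold - s) for s in pos_sims]
    neg_terms = [max(0.0, s - neg_threshold) for s in neg_sims]
    # Average only over violating examples, so easy examples do not dilute the loss.
    active_pos = [t for t in pos_terms if t > 0]
    active_neg = [t for t in neg_terms if t > 0]
    pos_loss = sum(active_pos) / len(active_pos) if active_pos else 0.0
    neg_loss = sum(active_neg) / len(active_neg) if active_neg else 0.0
    return pos_loss + neg_loss

# One positive that is not yet similar enough, and two negatives that are too similar.
loss = ranked_list_style_loss(pos_sims=[0.6], neg_sims=[0.7, 0.1, 0.5])
```

Using similarity instead of distance flips the direction of both hinges: positives are pushed above a threshold and negatives below one.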
We use the Fashion-IQ dataset for the main task. For training, we additionally create pseudo training pairs from images that are never used as a target in the training set. For each such image, we construct a pseudo example in which both the candidate and the target are the image itself. Using hand-written rules, we also gather phrases indicating equivalence, such as “exactly same” or “is the same item”, from the captions of the training data. Two of these phrases are randomly selected as the caption for each pseudo example.
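The construction of pseudo examples can be sketched as follows. The function name, the data layout and the third phrase (“is identical”) are hypothetical, and drawing the two phrases without replacement is an assumption.

```python
import random

def build_pseudo_examples(all_images, training_triplets, same_item_phrases, seed=0):
    """Create (candidate, target, captions) examples where candidate == target.

    all_images: iterable of image ids in the training split.
    training_triplets: list of (candidate_id, target_id, captions).
    same_item_phrases: phrases expressing equivalence, e.g. "exactly same".
    """
    rng = random.Random(seed)
    used_as_target = {target for _, target, _ in training_triplets}
    pseudo = []
    for image_id in all_images:
        if image_id in used_as_target:
            continue  # only images never used as a target are eligible
        # Two distinct phrases are drawn at random as the pseudo caption.
        captions = rng.sample(same_item_phrases, 2)
        pseudo.append((image_id, image_id, captions))
    return pseudo

triplets = [("a", "b", ["is longer", "is darker"])]
phrases = ["exactly same", "is the same item", "is identical"]  # last one is made up
pseudo = build_pseudo_examples(["a", "b", "c"], triplets, phrases)
```

In this toy case image "b" is skipped because it already serves as a target, while "a" and "c" each yield one pseudo example.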
For pre-training of the image encoder, we use the Deepfashion dataset [deepfashion].
Table 1: Statistics of Designovel’s Fashion Corpus.
|Statistic|Value|
|---|---|
|Maximum document length|140|
|Average document length|52|
For pre-training of the text encoder, we use our in-house fashion-domain corpus, built by crawling online shopping malls. Details and statistics of the corpus are shown in Table 1. Each document in the corpus is a description of a unique fashion item. The description includes information such as the motivation of the designer or brand, visual details, colors, components, materials and stitching methods.
3.2 Data pre-processing and augmentation
For all images used in our experiments, we use the MMDetection tool [mm] to calculate the bounding box of fashion objects appearing in an image and crop the image using these boundaries. The object detector is also trained with the Deepfashion dataset [deepfashion].
For augmentation, we used random horizontal flips, random-angle affine transformations, random horizontal and vertical translations, random distortion and random erasing. We found that data augmentation significantly improved performance.
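Two of these augmentations, horizontal flipping and random erasing, are simple enough to sketch on a toy image stored as a list of rows. This is an illustration only: the function names, the flip probability `p` and the `max_fraction` limit are hypothetical, and a real pipeline would rely on an image library rather than nested lists.

```python
import random

def random_horizontal_flip(image, rng, p=0.5):
    """Mirror each row of a 2-D image (list of rows) with probability p."""
    if rng.random() < p:
        return [row[::-1] for row in image]
    return [row[:] for row in image]

def random_erasing(image, rng, max_fraction=0.5, fill=0):
    """Overwrite a randomly placed rectangle with `fill`."""
    height, width = len(image), len(image[0])
    erase_h = rng.randint(1, max(1, int(height * max_fraction)))
    erase_w = rng.randint(1, max(1, int(width * max_fraction)))
    top = rng.randint(0, height - erase_h)
    left = rng.randint(0, width - erase_w)
    out = [row[:] for row in image]
    for r in range(top, top + erase_h):
        for c in range(left, left + erase_w):
            out[r][c] = fill
    return out

rng = random.Random(0)
image = [[r * 4 + c + 1 for c in range(4)] for r in range(4)]  # pixel values 1..16
augmented = random_erasing(random_horizontal_flip(image, rng), rng)
```

Both functions return a new image and leave the input untouched, so the same source image can be augmented differently in every epoch.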
3.3 Hyperparameters and learning curriculum
We used Adam [adam] as the optimizer and set the initial learning rate to 5e-5 for composition layers and 5e-6 for image and text encoders.
We separated the training data by fashion category and trained the model on the categories in the order “dress, shirt, toptee”. Accordingly, we define one epoch as one pass over all three categories.
We experimented on the Fashion-IQ dataset with various settings, as shown in Table 2. We began with a baseline that uses the same learning rate of 5e-5 for all modules, and then successively lowered the learning rate of the text and image encoders to 5e-6 (as suggested in [Nam]), applied data augmentation to the training data and ensembled several models trained in different runs. For model ensembling, we first computed similarity scores with each sub-model separately and then simply averaged these scores.
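Score averaging across sub-models amounts to the following. This is a minimal sketch with a hypothetical function name and made-up scores for two sub-models over a gallery of three images.

```python
def ensemble_scores(per_model_scores):
    """Average similarity scores element-wise across sub-models.

    per_model_scores: list of score lists, one per sub-model, all aligned
    to the same gallery order.
    """
    n_models = len(per_model_scores)
    return [sum(scores) / n_models for scores in zip(*per_model_scores)]

# Made-up similarity scores of one query against three gallery images.
model_a = [0.9, 0.2, 0.4]
model_b = [0.3, 0.4, 0.9]
averaged = ensemble_scores([model_a, model_b])
# Final ranking: gallery indices sorted by descending averaged score.
ranking = sorted(range(len(averaged)), key=lambda i: -averaged[i])
```

Averaging assumes that the scores of all sub-models are on a comparable scale, which holds for cosine similarities.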
As a final result, we achieved an average recall of 39.12% with a single model and 43.67% with an ensemble of 16 models on the test dataset.
Table 2: Average recall (%) under different settings.
|Setting|Validation|Test|
|---|---|---|
|+Small LR on encoder|37.28|36.49|
|+Ensemble (8 models)|45.00|43.52|
|+Ensemble (16 models)|45.86|43.67|
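The average recall reported here can be computed as sketched below, assuming it is the mean of the recall at the top 10 and the top 50 over the three categories. The function names and the toy data are hypothetical, and small cut-offs are used so that the example stays readable.

```python
def recall_at_k(rankings, targets, k):
    """Percentage of queries whose target appears in the top-k ranked results."""
    hits = sum(1 for ranked, target in zip(rankings, targets) if target in ranked[:k])
    return 100.0 * hits / len(targets)

def average_recall(results_by_category, ks=(10, 50)):
    """Mean of Recall@k over all categories and all cut-offs in `ks`."""
    values = [recall_at_k(rankings, targets, k)
              for rankings, targets in results_by_category.values()
              for k in ks]
    return sum(values) / len(values)

# Toy example: each category maps to (ranked result lists, target ids).
toy = {
    "dress": ([["d1", "d2", "d3"], ["d4", "d5", "d6"]], ["d2", "d6"]),
    "shirt": ([["s1", "s2", "s3"]], ["s1"]),
}
score = average_recall(toy, ks=(1, 3))
```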
We participated in the Fashion IQ Challenge 2019 by building an image+text to image retrieval system using methods from recent works and achieved a 43.67% average recall with an ensemble of 16 models in the test phase.
By experimenting with various settings, we found that simple data augmentation and model ensembling can significantly improve recall.
As future work, we will focus on positive example mining: each candidate can have multiple matching targets, but the given training and validation datasets indicate only a single target, which may lead to overfitting. We will also try ensembling methods other than simple score averaging.