Recent advances in artificial intelligence and image recognition enable a whole new set of services that improve the Internet shopping experience [he2016deep, simonyan2015very]. Among those new services, visual search is probably one of the most promising techniques, as it provides an effective and natural way to search through a catalog with a simple picture [huang2015systems].
Improving visual recommendation algorithms requires access to large labeled image datasets, possibly specialized in the core business they address. Available generic image datasets include TinyImage [Torralba08], LabelMe [Russell08], Lotus Hill [Yao07], Microsoft Common Objects in Context (COCO) [Lin14] and OpenImages [Krasin17]. Of course, ImageNet [Jia09] is the de facto standard for benchmarking image classification algorithms involving extremely large numbers of labels.
Most of the public datasets that are of direct use to on-line retailers are specialized in fashion items: the Exact Street2Shop [Kiapour15] dataset identifies around 40,000 clothing items worn by people on real-world street photos, and provides their exact match amongst hundreds of thousands of images from shopping websites; DeepFashion consists of over 800,000 annotated images that contain clothes [Liu16].
To the best of our knowledge, no comprehensive image dataset covering products typically sold by generalist retailers is yet available to the community. This is the reason why we are releasing a large dataset of such categorized product images. With more than 12M images of 7M products classified into 5270 categories, this dataset should help the community to leverage state-of-the-art neural network architectures in order to develop better recommendation systems.
In this article, we present several aspects of the dataset, such as the way it was built and organized and some specific features that one might want to consider before training a model on it. We also give an overview of the approaches followed by the 3 winning teams of the Kaggle competition we organized on this dataset.
The full dataset can be downloaded from the Cdiscount challenge page on the Kaggle platform [kaggle] (the url is given in the reference).
2 E-retailer product catalog
The large e-retailer image dataset we present was extracted from the full list of products available on our web site in July 2017. Products therefore come either from our own catalog or from our Market Place, where independent resellers can put their products up for sale. Since our own catalog holds approximately 200,000 products, the vast majority of the 7M products in the dataset originate from the nearly 10,000 independent resellers present on our Market Place.
Our catalog is organized according to a 3-level aggregation tree with categories labeled in French. The first level of aggregation is referred to as Cat I, for Category level I, and each of its categories contains a diversity of products comparable to that of a physical store such as a drugstore or a wine shop. It is the most generic level of aggregation and as such is of particular interest if one wishes to focus on a particular subset of images such as CHILDCARE (PUERICULTURE), BAGS (BAGAGERIE) or INTERIOR DESIGN (DECORATION). The 49 distinct Cat I categories in the dataset are listed in table 1 with the corresponding English translation.
|ABONNEMENT / SERVICES||SUBS. / SERVICES|
|AMNG URB.-VOIRIE||URBAN PLANNING-ROADWAY|
|ART DE LA TABLE-ART. CUL.||TABLEWARE-COOK. UST.|
|ARTICLES POUR FUMEUR||SMOKER TOOLS|
|BATEAU MOTEUR-VOILIER||MOTOR BOAT-SAILING BOAT|
|COFFRET CADEAU BOX||GIFT BOX|
|INSTRUMENTS DE MUSIQUE||MUSICAL INSTRUMENT|
|JEUX VIDEO||VIDEO GAMES|
|LSRS CREA.-BX ARTS-PAPET.||CRAFT-ARTS-STATIONERY|
|MATERIEL DE BUREAU||OFFICE EQUIPMENT|
|MATERIEL MEDICAL||MEDICAL DEVICE|
|PT DE VTE-COMM.-ADMIN.||SALES OUT-COMM.-ADMIN.|
|PRODUITS FRAIS||FRESH PRODUCE|
|PRODUITS SURGELES||FROZEN FOODS|
|TENUE PROFESSIONNELLE||WORKING CLOTHES|
|Category level||Cat I||Cat II||Cat III|
|Nb of categories||49||483||5270|
|Nb of images per product||1||2||3||4|
|Nb of products||4,369,441||1,128,588||542,792||1,029,075|
The second level category (Cat II) is of lesser importance. It is just an intermediate step before the third and most specific level, which gathers identical products or objects. Examples of those third level categories (Cat III) belonging to the 3 stores mentioned above would be BABY BOTTLE (BIBERON), TRAVEL BAG (SAC DE VOYAGE) and PHOTO FRAME (CADRE PHOTO). The number of categories for each level is given in table 2. A ratio of roughly 1 to 10 is observed in the number of categories from one level to the next, leading to 5270 distinct Cat III categories. It is worth noting that those 5270 Cat III categories actually take only 5263 distinct names: 7 pairs of them share the same name while belonging to different Cat II categories. However, the combination of Cat II and Cat III is unique throughout the dataset. Finally, each of the 5270 combinations of Cat I, Cat II and Cat III is encoded with an integer index in the dataset.
At the product level, between 1 and 4 images of 180x180 pixels are associated with each product. There is no specific rule defining that number: within our Market Place, a reseller is simply free to attach 1, 2, 3 or 4 images to each product. Table 3 summarizes the distribution of products according to the number of associated images. More than half of the products have only 1 image, but these products account for only about 35 % of the images in the dataset. In total, we count precisely 12,371,293 images for 7,069,896 products.
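The totals quoted above can be cross-checked against the per-product image counts of table 3. The short sketch below recomputes them; the variable names are purely illustrative and are not part of the dataset.

```python
# Number of products for each number of associated images, as reported in table 3.
products_by_image_count = {1: 4_369_441, 2: 1_128_588, 3: 542_792, 4: 1_029_075}

# Every product counts once; every product contributes as many images as its key.
total_products = sum(products_by_image_count.values())
total_images = sum(n * count for n, count in products_by_image_count.items())

# Share of products, and share of images, coming from single-image products.
single_image_product_share = products_by_image_count[1] / total_products
single_image_image_share = products_by_image_count[1] / total_images

print(total_products, total_images)
print(f"{single_image_product_share:.1%} of products, "
      f"{single_image_image_share:.1%} of images")
```

Both totals agree with the figures given in the text, which confirms that the four columns of table 3 cover the whole dataset.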
The initial product classification on which the dataset is based was made using the textual descriptions we have for each product in our catalog. The classification process is semi-automated: a k-nearest-neighbors (K-NN) classifier is applied to every product and, if the required confidence level is not met for a given product, it is sent to manual classification. Finally, the overall quality of the classification is assessed by frequent sampling operations in which a trained expert is asked to visually check the classification. The overall misclassification rate measured with this sampling technique is around 10 % in each category. This number gives the order of magnitude of the label noise in our image dataset.
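The general idea of a K-NN classifier with a fallback to manual review can be illustrated with the minimal sketch below. It is not the production pipeline: the bag-of-words features, the cosine similarity, the value of k, the confidence threshold and the toy catalog are all assumptions made for the sake of the example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Hypothetical feature extractor: lowercase token counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(description, labeled_examples, k=3, min_confidence=0.66):
    """Return (category, confidence).

    The category is None when the vote among the k nearest neighbors is
    too weak, meaning the product should be sent to manual classification.
    """
    query = bag_of_words(description)
    neighbors = sorted(
        labeled_examples,
        key=lambda example: cosine_similarity(query, bag_of_words(example[0])),
        reverse=True,
    )[:k]
    votes = Counter(label for _, label in neighbors)
    label, count = votes.most_common(1)[0]
    confidence = count / len(neighbors)
    return (label if confidence >= min_confidence else None), confidence


# Toy catalog of (description, Cat III label) pairs, invented for this example.
catalog = [
    ("biberon verre 240 ml", "BIBERON"),
    ("biberon anti colique 150 ml", "BIBERON"),
    ("sac de voyage cabine", "SAC DE VOYAGE"),
    ("sac de voyage a roulettes", "SAC DE VOYAGE"),
    ("cadre photo bois 10x15", "CADRE PHOTO"),
    ("cadre photo mural noir", "CADRE PHOTO"),
]

confident = classify("biberon 240 ml verre", catalog)
ambiguous = classify("cadre sac biberon", catalog)
print(confident, ambiguous)
```

In this sketch the confidence is simply the fraction of neighbors voting for the winning label; a description whose neighbors disagree is returned without a category and would be routed to a human operator.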
Figure 1 shows some illustrative examples of images found in the dataset. The background may vary a lot from one image to the other. A product might be presented on a white or colored background, or it might be shown in an illustrative situation, like the wall decoration or the dresser. Images might be views of the same object from different angles, as for the helmet, or they might show a close-up of some specific detail of the product, as for the couch. The product may also be represented more than once, like the watch. Finally, for some specific products, one of the images might not show the product at all. This is the case for the fridge since, according to European Union regulation, electrical goods all have to carry an EU energy label.
3 Detailed features of the dataset
In this section, we focus on 2 characteristics of the dataset that one should consider when using it to train a neural network. The first one is the imbalance of the dataset and the second is the presence of duplicates among the images.
3.1 Distribution of products
Our product catalog is highly diverse and the categorization tree we use is not aimed at balancing the number of products among the categories. Rather, it is aimed at gathering products with similar characteristics and usages. This results in a highly unbalanced number of products per category.
Figure 2 shows 3 distributions of products per category, one for each hierarchical level. Note that the bin widths used to draw those histograms vary on a logarithmic scale, to facilitate the visualization of several orders of magnitude on the same plot.
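Logarithmic binning of this kind can be sketched as follows. The exact bin edges used for the figure are not given in the text, so the example assumes one bin per decade and uses made-up category sizes.

```python
from collections import Counter


def decade_histogram(products_per_category):
    """Count categories falling in each decade bin [10**d, 10**(d + 1))."""
    bins = Counter()
    for n in products_per_category:
        if n < 1:
            raise ValueError("category sizes must be positive integers")
        decade = len(str(n)) - 1  # equals floor(log10(n)) for positive integers
        bins[(10 ** decade, 10 ** (decade + 1))] += 1
    return dict(sorted(bins.items()))


# Made-up category sizes spanning several orders of magnitude.
sizes = [12, 85, 150, 940, 3_000, 3_200, 70_000, 500_000]
histogram = decade_histogram(sizes)
for (low, high), n_categories in histogram.items():
    print(f"[{low}, {high}): {n_categories}")
```

Counting digits instead of calling a floating-point logarithm keeps the bin assignment exact at the bin edges, at the cost of supporting only one bin per decade.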
At the Cat I level, the spread is considerable. There is a small cluster of 5 categories with fewer than 200 products each (APICULTURE, PRODUITS SURGELES, ABONNEMENTS/SERVICES, PRODUITS FRAIS, FUNERAIRE). The remaining 44 categories gather between roughly and products each. The last bin alone contains 9 categories, which together hold 4.5M products (more than half of the products).
At the Cat II level, the spread remains large. The number of products per category varies between 10 and half a million, the latter being reached by a single category named PARTS (PIECES). The mode of the distribution for this level, as shown in figure 2, is around 3,000 products per category.
Finally, at the Cat III level, the most populated categories contain more than 10,000 products each (figure 2). The top 5 are POP ROCK MUSIC, PRINTER TONER, PRINTER CARTRIDGE, FRENCH LITERATURE and OTHER BOOKS, with nearly 70,000 items each. A large share of the categories (nearly 2000 of them) count between 50 and 500 products.
Another way to look at the imbalance of the dataset is to consider the share of products held by the most populated categories. This is shown in figure 3, where the cumulative percentage of products is displayed as a function of the number of Cat III categories. The behavior is nearly exponential, with 75 % of the products gathered in only 10 % of the categories. At the other end, the least populated 75 % of the categories account for only 10 % of the total number of products.
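A cumulative share curve of this kind can be computed as in the sketch below. The function names and the category sizes are illustrative only and do not come from the dataset.

```python
def cumulative_share(products_per_category):
    """Cumulative fraction of products covered by the k largest categories."""
    ordered = sorted(products_per_category, reverse=True)
    total = sum(ordered)
    shares, running = [], 0
    for n in ordered:
        running += n
        shares.append(running / total)
    return shares


def share_of_top(products_per_category, fraction):
    """Share of products held by the top `fraction` of the categories."""
    shares = cumulative_share(products_per_category)
    k = max(1, round(fraction * len(shares)))
    return shares[k - 1]


# Made-up, strongly unbalanced category sizes (1000 products in total).
sizes = [700, 100, 60, 40, 30, 25, 20, 15, 7, 3]
top_10_percent = share_of_top(sizes, 0.10)
top_50_percent = share_of_top(sizes, 0.50)
print(top_10_percent, top_50_percent)
```

Applied to the real number of products per Cat III category, `share_of_top(sizes, 0.10)` would give the quantity reported above as 75 %.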
3.2 Duplicated images
A second aspect specific to this dataset is the presence of duplicated images. Indeed, a given product may be offered by several resellers, and nothing prevents them from using similar if not identical images to describe their products. From an image classification point of view, the presence of duplicated images might be considered either a downside or an upside. It reduces the effective size of the dataset, but it may also help to classify some products and to make links between categories that contain identical images.
To trace identical images, we use the MD5 hash function as defined in [rivest1992md5]. Images with the same hash key are labeled as identical. Although the MD5 hash function has limitations that prevent nearly identical images from being detected, it is sufficient in our case, where there is no intent to hide duplicates by altering the image.
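Duplicate detection by hash key can be sketched with Python's standard `hashlib` module. The helper name and the placeholder byte strings are assumptions; in practice the bytes would be the encoded image files read from the dataset.

```python
import hashlib
from collections import defaultdict


def group_by_md5(images):
    """Group image identifiers by the MD5 digest of their raw bytes.

    `images` is an iterable of (image_id, image_bytes) pairs.
    """
    groups = defaultdict(list)
    for image_id, image_bytes in images:
        digest = hashlib.md5(image_bytes).hexdigest()
        groups[digest].append(image_id)
    return groups


# Placeholder byte strings standing in for encoded image files.
sample = [
    ("a", b"\xff\xd8 jpeg bytes 1"),
    ("b", b"\xff\xd8 jpeg bytes 2"),
    ("c", b"\xff\xd8 jpeg bytes 1"),  # byte-identical to "a"
]
duplicates = [ids for ids in group_by_md5(sample).values() if len(ids) > 1]
print(duplicates)
```

Because the digest depends on every byte, two images that differ by a single pixel or by their compression settings end up in different groups, which is the limitation mentioned above.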
The distribution of the measured MD5 hash keys over the entire dataset is shown in figure 4. Again, logarithmic scales have been used to account for the large range of values. Among the 12M images in the dataset, 6.85M images are unique and 75 % of the images appear at most 10 times in the dataset. The most replicated image appears 16,643 times.
4 Image classification challenge
In this section, we present the image classification challenge that was organized around this dataset and review the 3 winning solutions of the challenge.
The challenge aimed at building an image categorizer based on the dataset presented in this document. It was held between September and December 2017 on the well-known Kaggle platform [kaggle] and involved more than 600 teams from all around the world. The evaluation metric was the classification accuracy on a test set whose categories are not disclosed (anyone wishing to evaluate a categorizer on the test set is invited to contact the author of the paper). The name, final rank and score of the 3 winning teams are given in table 4. The final results were quite impressive, as the best solutions were able to correctly classify almost 80 % of the images of the test set.
|Team name||bestfitting||convoluted prediction||dylan|
4.2 Common features of the 3 winning solutions
The 3 winning solutions share several characteristics. First, they are all ensemble models that aggregate sub-models in different ways: either they use an elaborate aggregation method such as XGBoost [chen2016xgboost], or they simply compute a geometric mean or a weighted mean.
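The two simple aggregation schemes can be illustrated as follows. This is a generic sketch and not the code of any of the winning teams: the function names, the weights and the class probabilities are invented, and the stacking of sub-models with XGBoost is not shown.

```python
import math


def weighted_mean(predictions, weights):
    """Weighted arithmetic mean of per-class probabilities across models."""
    total = sum(weights)
    n_classes = len(predictions[0])
    return [
        sum(w * p[c] for w, p in zip(weights, predictions)) / total
        for c in range(n_classes)
    ]


def geometric_mean(predictions, eps=1e-12):
    """Geometric mean of per-class probabilities, renormalized to sum to 1."""
    n_models = len(predictions)
    n_classes = len(predictions[0])
    raw = [
        math.exp(sum(math.log(p[c] + eps) for p in predictions) / n_models)
        for c in range(n_classes)
    ]
    norm = sum(raw)
    return [r / norm for r in raw]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Made-up class probabilities from 3 sub-models for a single product.
model_outputs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.5, 0.4, 0.1],
]

uniform = weighted_mean(model_outputs, [1, 1, 1])
weighted = weighted_mean(model_outputs, [2, 1, 2])
geometric = geometric_mean(model_outputs)
print(argmax(uniform), argmax(weighted), argmax(geometric))
```

With these numbers the weighted mean changes the predicted class when the first and third sub-models are given more weight, which shows why the choice of weights matters in such ensembles.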
All 3 solutions also rely on neural network architectures pre-trained on ImageNet and fine-tuned on our dataset. Different frameworks were used: PyTorch [paszke2017pytorch], Keras [chollet2015keras, abadi2016tensorflow] and MXNet [chen2015mxnet], and heavy-duty GPUs were required to perform transfer learning [donahue2014decaf].
Also, all the state-of-the-art neural network architectures are present:
the first solution uses ResNet [he2016deep] and InceptionResNetV2 [szegedy2017inception];
the second solution makes heavy use of squeeze-and-excitation (SE) blocks [hu2017squeeze] on InceptionV3 [szegedy2017inception], ResNet [he2016deep] and ResNeXt [xie2017aggregated] architectures. Dual Path Networks are also used [chen2017dual];
the third solution uses SE blocks on the ResNeXt architecture and densely connected networks [huang2017densely].
It is worth noting that none of the above-mentioned neural network techniques or architectures had been described in the literature before 2016, while the challenge took place in 2017.
Finally, all 3 solutions made use of the dropout technique [srivastava2014dropout] to prevent over-fitting and used various data augmentation techniques such as cropping, flipping and resizing.
4.3 Solution specific aspects
At a more detailed level, the winning solution presents an interesting approach in which the trained neural networks are specialized by the number of images available for each product. When more than one image is available, the images are simply aggregated side by side into a larger image and fed into the dedicated model.
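The side-by-side aggregation can be pictured with the toy sketch below, where an image is a plain list of pixel rows. The exact layout used by the winning team is not described here, so a simple horizontal concatenation is assumed and the function name is hypothetical.

```python
def concatenate_side_by_side(images):
    """Join images of equal height horizontally.

    Each image is a list of rows and each row is a list of pixel values.
    """
    heights = {len(image) for image in images}
    if len(heights) != 1:
        raise ValueError("all images must have the same height")
    height = heights.pop()
    # For every row index, chain the corresponding rows of all the images.
    return [sum((image[row] for image in images), []) for row in range(height)]


# Two tiny 2x2 "images" standing in for 180x180 product pictures.
first = [[1, 2], [3, 4]]
second = [[5, 6], [7, 8]]
combined = concatenate_side_by_side([first, second])
print(combined)
```

Under this assumption, two 180x180 images would yield a single input that is 180 pixels high and 360 pixels wide, which is why a dedicated model is needed for each number of images.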
Another winning solution took a similar approach for one of its sub-models by specializing neural networks for each number and rank of images. Indeed, the competitor realized that neither the number of images per product nor the order of those images was random, and thus that both should be considered for classification.
Finally, one last original feature worth mentioning was the use by the winning team of OCR for products that were recognized as being books or CDs. This last trick probably made the small difference that made this solution the winning one.
A large e-retailer image dataset was presented in this document. This dataset is publicly released in order to stimulate the use of deep neural network architectures for creative use cases in recommendation systems. The dataset is made up of a large diversity of product images organized according to a 3-level aggregation tree. We count 12M images of 7M products arranged in 5270 categories. The dataset is unbalanced and contains duplicated images. An image classification challenge organized around this dataset on the Kaggle platform led to a classification accuracy of nearly 80 %.