Large e-retailer image dataset for visual search and product classification

09/18/2019 ∙ by Arnaud Belletoile, et al. ∙ 0

Recent results of deep convolutional networks in visual recognition challenges open the path to a whole new set of disruptive user experiences such as visual search or recommendation. The list of companies offering this type of service is growing every day, but the adoption rate and the relevance of results vary widely. We believe that the availability of large and diverse datasets is a necessary condition to improve the relevance of such recommendation systems and facilitate their adoption. For that purpose, we share with the community a dataset of more than 12M images of the 7M products of our online store, classified into 5K categories. This article introduces the dataset and describes several of its features. We also present some aspects of the winning solutions of the image classification challenge that we organized on the Kaggle platform around this set of images.


1 Introduction

Recent advances in artificial intelligence and image recognition enable a whole new set of services that improve the Internet shopping experience [he2016deep, simonyan2015very]. Among those new services, visual search is probably one of the most promising techniques, as it provides an effective and natural way to search through a catalog with a simple picture [huang2015systems].

Improving visual recommendation algorithms requires access to large labeled image datasets, possibly specialized in the core business they address. Available generic image datasets include TinyImage [Torralba08], LabelMe [Russell08], Lotus Hill [Yao07], Microsoft Common Objects in Context (COCO) [Lin14] and OpenImages [Krasin17]. Of course, ImageNet [Jia09] is the de facto standard for benchmarking image classification algorithms involving extremely large numbers of labels.

Most of the public datasets that are of direct use to on-line retailers are specialized in fashion items: the Exact Street2Shop [Kiapour15] dataset identifies around 40,000 clothing items worn by people on real-world street photos, and provides their exact match amongst hundreds of thousands of images from shopping websites; DeepFashion consists of over 800,000 annotated images that contain clothes [Liu16].

To the best of our knowledge, no comprehensive image dataset covering products typically sold by generalist retailers is yet available to the community. This is the reason why we are releasing a large dataset of such categorized product images. With more than 12M images of 7M products classified into 5270 categories, this dataset should help the community to leverage state-of-the-art neural network architectures in order to develop better recommendation systems.

In this article, we present several aspects of the dataset, such as the way it was built and organized, and some specific features that should be considered before training a model on it. We also outline the approaches followed by the 3 winning teams of the Kaggle competition we organized on this dataset.

The full dataset can be downloaded from the Cdiscount challenge page on the Kaggle platform [kaggle] (the url is given in the reference).

2 E-retailer product catalog

The large e-retailer image dataset we present was extracted from the full list of products available on our web site in July 2017. Products may therefore come either from our own catalog or from our Market Place, where independent resellers can put their products up for sale. Since our own catalog contains approximately 200,000 products, the vast majority of the 7M products in the dataset originate from the nearly 10,000 independent resellers present on our Market Place.

Our catalog is organized according to a 3-level aggregation tree with categories labeled in French. The first level of aggregation is referred to as Cat I, for Category level I, and each of its categories contains a diversity of products comparable to that of a physical store such as a drugstore or a wine shop. It is the most generic level of aggregation and as such is of particular interest if one wishes to focus on a particular subset of images such as CHILDCARE (PUERICULTURE), BAGS (BAGAGERIE) or INTERIOR DESIGN (DECORATION). The 49 distinct Cat I categories in the dataset are listed in table 1 with the corresponding English translations.

| Cat I | Translation |
|---|---|
| ABONNEMENT / SERVICES | SUBS. / SERVICES |
| AMNG URB.-VOIRIE | URBAN PLANING-ROADWAY |
| ANIMALERIE | PET SHOP |
| APICULTURE | APICULTURE |
| ART DE LA TABLE-ART. CUL. | TABLEWARE-COOK. UST. |
| ARTICLES POUR FUMEUR | SMOKER TOOLS |
| AUTO-MOTO | CAR-MOTORCYCLE |
| BAGAGERIE | BAGS |
| BATEAU MOTEUR-VOILIER | MOTOR BOAT-SAILING BOAT |
| BJX-LUNETTES-MONTRES | JEWEL.-GLASS.-WATCHES |
| BRICO.-OUTIL.-QUINC. | DIY-TOOLING-HARDWARE |
| CHAUSSURES-ACCESS. | SHOES-ACCESSORIES |
| COFFRET CADEAU BOX | GIFT BOX |
| CONDITIONNEMENT | PACKAGING |
| DECO-LINGE-LUMINAIRE | INT. DESIGN-TEXT.-LUM |
| DROGUERIE | DRUGSTORE |
| DVD-BLURAY | DVD-BLURAY |
| ELECTROMENAGER | APPLIANCES |
| ELECTRONIQUE | ELECTRONIC |
| EPICERIE | GROCERY |
| FUNERAIRE | FUNERAL |
| HYGIENE-BEAUTE-PARFUM | HYG-BEAUTY-PERF |
| INFORMATIQUE | COMPUTER EQUIPMENT |
| INSTRUMENTS DE MUSIQUE | MUSICAL INSTRUMENT |
| JARDIN-PISCINE | GARDEN-SWIMMING POOL |
| JEUX-JOUETS | GAMES-TOYS |
| JEUX VIDEO | VIDEO GAMES |
| LIBRAIRIE | BOOKSTORE |
| LITERIE | BEDDING |
| LSRS CREA.-BX ARTS-PAPET. | CRAFT-ARTS-STATIONERY |
| MANUTENTION | MATERIAL HANDLING |
| MATERIEL DE BUREAU | OFFICE EQUIPMENT |
| MATERIEL MEDICAL | MEDICAL DEVICE |
| MERCERIE | HABERDASHERY |
| MEUBLE | FURNITURE |
| MUSIQUE | MUSIC |
| PARAPHARMACIE | DRUGSTORE |
| PHOTO-OPTIQUE | PHOTO-OPTIC |
| PT DE VTE-COMM.-ADMIN. | SALES OUT-COMM.-ADMIN. |
| PRODUITS FRAIS | FRESH PRODUCES |
| PRODUITS SURGELES | FROZEN FOODS |
| PUERICULTURE | CHILDCARE |
| SONO-DJ | SOUND SYSTEM-DJ |
| SPORT | SPORT |
| TATOUAGE-PIERCING | TATOO-PIERCING |
| TELEPHONIE-GPS | TELEPHONY-GPS |
| TENUE PROFESSIONNELLE | WORKING CLOTHES |
| TV-VIDEO-SON | TV-VIDEO-HIFI |
| VIN-ALCOOL-LIQUIDES | WINE-ALCOHOL-LIQUIDS |

Table 1: French labeled categories of the first level of aggregation in the dataset and the corresponding translations.

| Category level | Cat I | Cat II | Cat III |
|---|---|---|---|
| Nb of categories | 49 | 483 | 5270 |

Table 2: Number of values taken by the 3 levels of the categorization tree.

| Nb of images | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Nb of products | 4,369,441 | 1,128,588 | 542,792 | 1,029,075 |

Table 3: Number of products having exactly 1, 2, 3 or 4 associated images.

The second level (Cat II) is of lesser importance. It is just an intermediate step before the third and most specific level, which gathers identical products or objects. Examples of third-level categories (Cat III) belonging to the 3 stores mentioned above would be BABY BOTTLE (BIBERON), TRAVEL BAG (SAC DE VOYAGE) and PHOTO FRAME (CADRE PHOTO). The number of categories for each level is given in table 2. A ratio of roughly 1 to 10 is observed in the number of categories from one level to the next, leading to 5270 distinct Cat III categories. It is worth noting that these 5270 Cat III categories take only 5263 distinct names: 7 pairs of them share the same name while belonging to different Cat II categories. However, the combination of Cat II and Cat III is unique throughout the dataset. Finally, each of the 5270 Cat I & Cat II & Cat III categories is encoded with an integer index in the dataset.
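The naming collision described above can be illustrated with a small sketch. The category names below are hypothetical placeholders, not actual entries of the catalog, and the way indices are assigned here is an assumption rather than the encoding actually used in the dataset. The sketch only shows why a Cat III name alone is not a safe key while the (Cat II, Cat III) pair is.

```python
# Hypothetical rows of the category tree: (Cat I, Cat II, Cat III) names.
# These names are illustrative only and are not taken from the dataset.
CATEGORY_TREE = [
    ("STORE A", "GROUP A1", "ACCESSORIES"),
    ("STORE A", "GROUP A2", "CABLES"),
    ("STORE B", "GROUP B1", "ACCESSORIES"),  # same Cat III name, other Cat II
]


def build_category_index(tree):
    """Map each (Cat II, Cat III) pair to a unique integer index."""
    index = {}
    for _cat1, cat2, cat3 in tree:
        key = (cat2, cat3)
        if key in index:
            raise ValueError(f"duplicate (Cat II, Cat III) pair: {key}")
        index[key] = len(index)
    return index


category_index = build_category_index(CATEGORY_TREE)
distinct_cat3_names = {cat3 for _, _, cat3 in CATEGORY_TREE}

print(len(category_index))       # number of leaf categories
print(len(distinct_cat3_names))  # number of distinct Cat III names
```

In this toy tree there are 3 leaf categories but only 2 distinct Cat III names, which mirrors the 5270 versus 5263 situation reported for the real catalog.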

At the product level, between 1 and 4 images of 180x180 pixels are associated with each product. There is no specific rule defining that number: within our Market Place, a reseller is simply given the choice to provide 1, 2, 3 or 4 images with each product. Table 3 summarizes the distribution of products according to the number of associated images. More than half of the products have only 1 image, but these products account for only about 35 % of the images in the dataset (4,369,441 of 12,371,293). In total, we count precisely 12,371,293 images for 7,069,896 products.
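The totals quoted in this paragraph can be cross-checked against table 3. The short sketch below uses only the four counts from that table; the variable names are ours.

```python
# Number of products having exactly 1, 2, 3 or 4 images (values from table 3).
products_by_image_count = {1: 4_369_441, 2: 1_128_588, 3: 542_792, 4: 1_029_075}

# Every product counts once; every product contributes as many images as it has.
total_products = sum(products_by_image_count.values())
total_images = sum(n * count for n, count in products_by_image_count.items())

share_single_image_products = products_by_image_count[1] / total_products
share_images_from_single = products_by_image_count[1] / total_images

print(total_products)                        # 7069896
print(total_images)                          # 12371293
print(f"{share_single_image_products:.1%}")  # 61.8%
print(f"{share_images_from_single:.1%}")     # 35.3%
```

Both totals match the figures given in the text, and the last two lines show that single-image products are a majority of products but only about a third of the images.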

Figure 1: Examples of images from the dataset. From top to bottom and from left to right: 4 images of a watch, 2 images of a wall decoration; 4 images of a couch, 2 images of a dresser; 3 images associated with a fridge (2 views and 1 label), 3 images of a clock; 3 images of a helmet, 1 image of a necklace, 1 image of a camera flash, 1 image of a pendant.

The initial product classification on which the dataset is based was made using the textual descriptions we have for each product in our catalog. The classification process is semi-automated: a k-nearest-neighbors (k-NN) classifier is applied to every product and, if the required confidence level is not met for a given product, that product is sent to manual classification. Finally, the overall quality of the classification is assessed by frequent sampling operations in which a trained expert is asked to visually check the classification. The overall misclassification rate measured with this sampling technique is around 10 % in each category. This number gives the order of magnitude of the label noise in our image dataset.
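The semi-automated workflow can be sketched as follows. This is only an illustration of the general idea of a k-NN classifier with a confidence threshold and a manual fallback. The text features (token sets compared with Jaccard similarity), the value of k, the threshold and the example descriptions are all assumptions made for this sketch and are not details of the production system.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity between two sets of tokens."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def knn_classify(text, labeled_examples, k=3, min_confidence=0.75):
    """Return (label, confidence), or (None, confidence) for manual review.

    labeled_examples is a list of (description, category) pairs. The token
    features, k and min_confidence are assumptions made for this sketch.
    """
    tokens = set(text.lower().split())
    ranked = sorted(
        labeled_examples,
        key=lambda ex: jaccard(tokens, set(ex[0].lower().split())),
        reverse=True,
    )
    votes = Counter(category for _, category in ranked[:k])
    label, count = votes.most_common(1)[0]
    confidence = count / k  # share of the k neighbors that agree
    if confidence < min_confidence:
        return None, confidence  # route the product to manual classification
    return label, confidence


# Hypothetical labeled descriptions, for illustration only.
EXAMPLES = [
    ("glass baby bottle 240 ml", "BABY BOTTLE"),
    ("anti-colic baby bottle", "BABY BOTTLE"),
    ("plastic baby bottle with teat", "BABY BOTTLE"),
    ("wooden photo frame 10x15", "PHOTO FRAME"),
    ("black photo frame", "PHOTO FRAME"),
    ("large travel bag with wheels", "TRAVEL BAG"),
]

confident = knn_classify("baby bottle 330 ml", EXAMPLES)
uncertain = knn_classify("travel photo bag", EXAMPLES)
print(confident)
print(uncertain)
```

In the first call all 3 nearest neighbors agree, so the label is accepted. In the second call the neighbors are split between two categories, the vote share falls below the threshold, and the product would be handed to a human.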

Figure 1 shows some illustrative examples of images that can be found in the dataset. The background may vary a lot from one image to another. A product might be presented on a white or colored background, or it might be shown in an illustrative setting, as with the wall decoration or the dresser. Images might be views of the same object from different angles, as for the helmet, or they might zoom in on a specific detail of the product, as for the couch. The product may also be represented more than once, as with the watch. Finally, for some specific products, one of the images might not show the product at all. This is the case for the fridge, for which one image is the EU energy label that, according to European Union regulation, electrical goods have to carry.

3 Detailed features of the dataset

In this section, we focus on 2 characteristics of the dataset that one should consider when using it to train a neural network. The first is the class imbalance of the dataset and the second is the presence of duplicates among the images.

3.1 Distribution of products

Figure 2: Distributions of products per category for each category level. From left to right: Cat I, Cat II and Cat III.

Our product catalog is highly diverse and the categorization tree we use is not aimed at balancing the number of products among the categories. Rather, it is aimed at gathering products with similar characteristics and usages. This results in a highly unbalanced number of products per category.

Figure 2 shows 3 distributions of products per category, one for each hierarchical category level. Note that the bin widths used to draw those histograms vary on a logarithmic scale, so that several orders of magnitude can be visualized on the same plot.
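A histogram with logarithmically spaced bins can be built as in the sketch below. The product counts are hypothetical and the choice of one bin per decade is an assumption; the binning actually used for figure 2 is not specified in the text.

```python
import math


def log_bin_edges(min_value, max_value, bins_per_decade=1):
    """Bin edges evenly spaced on a log10 scale covering [min_value, max_value]."""
    start = math.floor(math.log10(min_value))
    stop = math.ceil(math.log10(max_value))
    n_bins = (stop - start) * bins_per_decade
    return [10 ** (start + i / bins_per_decade) for i in range(n_bins + 1)]


def histogram(values, edges):
    """Count values in half-open bins [edges[i], edges[i+1]); last bin is closed."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    return counts


# Hypothetical products-per-category counts, for illustration only.
products_per_category = [12, 45, 80, 150, 300, 900, 2_500, 3_100, 40_000, 500_000]
edges = log_bin_edges(min(products_per_category), max(products_per_category))
bin_counts = histogram(products_per_category, edges)
print(edges)
print(bin_counts)
```

With linear bins of equal width, almost all of these values would fall into the first bin; with one bin per decade each order of magnitude gets its own bar.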

At the Cat I level, the spread is considerable. There is a small cluster of 5 categories with fewer than 200 products each (APICULTURE, PRODUITS SURGELES, ABONNEMENTS/SERVICES, PRODUITS FRAIS, FUNERAIRE). The remaining 44 categories gather between roughly and products each. The last bin alone contains 9 categories, which together hold 4.5M products (more than half of the products).

At the Cat II level, the spread remains large. The number of products per category varies between 10 and half a million, the latter being reached by a single category named PARTS (PIECES). The mode of the distribution for this level, as shown in figure 2, is around 3,000 products per category.

Finally, at the Cat III level, the most populated categories contain more than 10,000 products each (figure 2). The top 5 are POP ROCK MUSIC, PRINTER TONER, PRINTER CARTRIDGE, FRENCH LITERATURE and OTHER BOOKS, with nearly 70,000 items each. The most common range, covering nearly 2000 categories, is between 50 and 500 products.

Figure 3: Cumulative percentage of products as a function of the number of Cat III categories.

Another way to look at the imbalance of the dataset is to consider the share of products held by the most populated categories. This is shown in figure 3, where the cumulative percentage of products is displayed as a function of the number of Cat III categories. The behavior is nearly exponential, with 75 % of the products gathered in only 10 % of the categories. At the other end, the least populated 75 % of the categories account for only 10 % of the total number of products.
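A cumulative curve of this kind is obtained by sorting categories from most to least populated and accumulating their product counts. The sketch below uses hypothetical counts chosen to be strongly skewed; they are not the real per-category figures.

```python
def cumulative_share(counts):
    """Cumulative fraction of products, largest categories first."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    shares, running = [], 0
    for c in ordered:
        running += c
        shares.append(running / total)
    return shares


def share_in_top_fraction(counts, fraction):
    """Fraction of products held by the top `fraction` of categories."""
    n_top = max(1, round(len(counts) * fraction))
    return cumulative_share(counts)[n_top - 1]


# Hypothetical, strongly skewed product counts for 10 categories.
counts = [7_500, 800, 500, 400, 300, 200, 150, 80, 50, 20]
top_10_percent_share = share_in_top_fraction(counts, 0.10)
print(top_10_percent_share)  # 0.75
```

Applied to the real Cat III counts, the same computation would give the curve of figure 3.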

3.2 Duplicated images

A second aspect specific to this dataset is the presence of duplicated images. Indeed, a given product might be sold by several resellers, and nothing prevents them from using similar if not identical images to describe their products. From an image classification point of view, the presence of duplicated images might be considered either a downside or an upside. It reduces the effective size of the dataset, but it may also help to classify some products and to make links between categories that contain identical images.

Figure 4: Distribution of images according to their number of duplicates.

To trace identical images, we use the MD5 hash function as defined in [rivest1992md5]. Images with the same hash key are labeled as identical. Although an exact hash such as MD5 cannot detect nearly identical images, it is sufficient in our case, where resellers have no intention of hiding duplicates by altering the images.
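Exact-duplicate detection by hashing can be sketched in a few lines with Python's standard `hashlib` module. The identifiers and byte strings below are placeholders standing in for encoded image files; how the images are actually stored and read is not covered here.

```python
import hashlib
from collections import defaultdict


def group_by_md5(images):
    """Group image identifiers by the MD5 digest of their raw bytes.

    `images` maps an identifier to the encoded image bytes.
    """
    groups = defaultdict(list)
    for image_id, data in images.items():
        groups[hashlib.md5(data).hexdigest()].append(image_id)
    return dict(groups)


# Placeholder byte strings stand in for encoded image files.
images = {
    "product_1_img_0": b"\x89PNG...same-bytes",
    "product_2_img_0": b"\x89PNG...same-bytes",
    "product_3_img_0": b"\x89PNG...same-bytez",  # a single byte differs
}
groups = group_by_md5(images)
duplicates = [ids for ids in groups.values() if len(ids) > 1]
print(duplicates)
```

The third entry differs by a single byte and therefore lands in its own group, which illustrates why this approach finds exact copies only and misses re-encoded or slightly modified images.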

The distribution of the measured MD5 hash keys over the entire dataset is shown in figure 4. Again, logarithmic scales are used to account for the large range of values. Among the 12M images in the dataset, 6.85M images are unique and 75 % of the images appear at most 10 times. The most replicated image appears 16,643 times.

4 Image classification challenge

In this section, we present the image classification challenge that was organized around this dataset and review the 3 winning solutions of the challenge.

4.1 Description

The challenge aimed at building an image categorizer based on the dataset presented in this document. It was held between September and December 2017 on the Kaggle platform [kaggle] and involved more than 600 teams from all around the world. The evaluation metric was the classification accuracy on a test set whose categories are not disclosed (anyone wishing to evaluate a categorizer on the test set may contact the author of the paper). The name, final rank and score of the 3 winning teams are given in table 4. Final results were quite impressive, as the best solutions were able to correctly classify almost 80 % of the images of the test set.

| Team name | bestfitting | convoluted prediction | dylan |
|---|---|---|---|
| Rank | 1 | 2 | 3 |
| Score | 0.79567 | 0.79352 | 0.79046 |

Table 4: Name, rank and final score of the 3 winning solutions of the image classification challenge.
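The scores in table 4 are classification accuracies, that is, the fraction of test items whose predicted category matches the true one. A minimal sketch with hypothetical category indices:

```python
def accuracy(predicted, actual):
    """Fraction of items whose predicted category equals the true category."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty test set")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical integer category indices for five test items.
true_categories = [12, 7, 7, 301, 45]
predicted_categories = [12, 7, 9, 301, 45]
score = accuracy(predicted_categories, true_categories)
print(score)  # 0.8
```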

4.2 Common features of the 3 winning solutions

The 3 winning solutions share several characteristics. First, they are all ensemble models that aggregate sub-models in different ways: either they use an elaborate aggregation method such as XGBoost [chen2016xgboost], or they simply compute a geometric mean or a weighted mean.
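The two simple aggregation schemes mentioned here can be sketched as follows. The probabilities and weights are hypothetical, and the renormalization of the geometric mean is a choice made for this sketch; the exact formulas used by the competitors are not given in the text. The XGBoost-based stacking is not shown.

```python
import math


def weighted_mean(prob_lists, weights):
    """Weighted arithmetic mean of per-model class probabilities."""
    total = sum(weights)
    n_classes = len(prob_lists[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, prob_lists)) / total
        for c in range(n_classes)
    ]


def geometric_mean(prob_lists, eps=1e-12):
    """Geometric mean of per-model class probabilities, renormalized to sum to 1."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    raw = [
        math.exp(sum(math.log(max(probs[c], eps)) for probs in prob_lists) / n_models)
        for c in range(n_classes)
    ]
    norm = sum(raw)
    return [r / norm for r in raw]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical class probabilities from three sub-models over three categories.
model_probs = [
    [0.60, 0.30, 0.10],
    [0.20, 0.70, 0.10],
    [0.50, 0.40, 0.10],
]
equal_weight_choice = argmax(weighted_mean(model_probs, [1.0, 1.0, 1.0]))
skewed_weight_choice = argmax(weighted_mean(model_probs, [3.0, 1.0, 1.0]))
geometric_choice = argmax(geometric_mean(model_probs))
print(equal_weight_choice, skewed_weight_choice, geometric_choice)
```

With equal weights the ensemble picks category 1; giving the first sub-model three times more weight flips the decision to category 0, which shows how the choice of weights can matter.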

All 3 solutions also rely on neural network architectures pre-trained on ImageNet and fine-tuned on our dataset. Different frameworks were used: PyTorch [paszke2017pytorch], Keras [chollet2015keras], TensorFlow [abadi2016tensorflow] and MXNet [chen2015mxnet]. Heavy-duty GPUs were required to perform transfer learning [donahue2014decaf].

Also, the solutions cover the state-of-the-art neural network architectures:

  • one solution uses ResNet [he2016deep] and InceptionResNetV2 [szegedy2017inception];

  • another solution makes heavy use of squeeze-and-excitation (SE) blocks [hu2017squeeze] on InceptionV3 [szegedy2017inception], ResNet [he2016deep] and ResNeXt [xie2017aggregated] architectures. Dual Path Networks are also used [chen2017dual];

  • the remaining solution uses SE blocks on the ResNeXt architecture and densely connected networks [huang2017densely].

It is worth noting that none of the above-mentioned neural network techniques or architectures had been described in the literature before 2016, for a challenge that took place in 2017.

Finally, all 3 solutions made use of the dropout technique [srivastava2014dropout] to prevent over-fitting and used various data augmentation techniques such as cropping, flipping and resizing.
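The three augmentation operations named above can be sketched on plain Python lists, which stand in here for image arrays. The crop size, the flip probability and the nearest-neighbour resizing are assumptions made for this sketch; the actual augmentation pipelines of the competitors are not described in the text.

```python
import random


def horizontal_flip(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def resize_nearest(image, out_h, out_w):
    """Resize with nearest-neighbour sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def augment(image, rng, crop=160, size=180):
    """Random crop, resize back to size x size, then flip with probability 0.5."""
    out = resize_nearest(random_crop(image, crop, crop, rng), size, size)
    return horizontal_flip(out) if rng.random() < 0.5 else out


# Synthetic 180x180 single-channel image whose pixel value encodes its position.
image = [[y * 180 + x for x in range(180)] for y in range(180)]
augmented = augment(image, random.Random(0))
print(len(augmented), len(augmented[0]))
```

Each call with a different random state yields a slightly different view of the same product, while the output keeps the 180x180 size of the dataset images.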

4.3 Solution specific aspects

At a more detailed level, the winning solution presents an interesting approach in which the trained neural networks are specialized by the number of images available for each product. When more than one image is available, the images are simply placed side by side in a larger image and fed into the dedicated model.
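The side-by-side aggregation can be sketched as below, again with plain Python lists standing in for image arrays. The helper names and the idea of selecting a model by the number of images through a simple key are illustrative assumptions; the winning team's actual implementation is not described in the text.

```python
def concat_side_by_side(images):
    """Join equally tall images (lists of pixel rows) into one wider image."""
    height = len(images[0])
    if any(len(img) != height for img in images):
        raise ValueError("all images must have the same height")
    return [[px for img in images for px in img[y]] for y in range(height)]


def make_image(value, size=180):
    """Synthetic size x size image filled with a single pixel value."""
    return [[value] * size for _ in range(size)]


# A hypothetical product with 3 images, each filled with a distinct value.
product_images = [make_image(0), make_image(1), make_image(2)]
wide = concat_side_by_side(product_images)

# The number of images would select the specialized model (hypothetical key).
model_key = len(product_images)
print(model_key, len(wide), len(wide[0]))  # 3 180 540
```

A product with 3 images of 180x180 pixels thus becomes a single 180x540 input for the model dedicated to 3-image products.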

Another solution took a similar approach for one of its sub-models by specializing neural networks for each number and rank of images. Indeed, the competitor realized that neither the number of images per product nor the order of those images was random, and thus that both should be considered for classification.

Finally, one last original feature worth mentioning is the winning team's use of OCR for products recognized as books or CDs. This last trick probably made the small difference that made this solution the winning one.

5 Conclusion

A large e-retailer image dataset was presented in this document. This dataset is publicly released in order to stimulate the use of deep neural network architectures for creative use cases in recommendation systems. The dataset is made up of a large diversity of product images organized according to a 3-level aggregation tree. We count 12M images of 7M products arranged in 5270 categories. The dataset is unbalanced and contains duplicated images. An image classification challenge organized around this dataset on the Kaggle platform led to a classification accuracy of nearly 80 %.

References