
Multi-Label Product Categorization Using Multi-Modal Fusion Models

In this study, we investigated multi-modal approaches using images, descriptions, and titles to categorize e-commerce products on Amazon. Specifically, we examined late fusion models, where the modalities are fused at the decision level. Products were each assigned multiple labels, and the hierarchy in the labels was flattened and filtered. For our individual baseline models, we modified a CNN architecture to classify the description and title, and then modified Keras' ResNet-50 to classify the images, achieving F1 scores of 77.0%, 82.7%, and 61.0%, respectively. In comparison, our tri-modal late fusion model can classify products more accurately than single-modal models can, improving the F1 score to 88.2%. Each modality complemented the shortcomings of the other modalities, demonstrating that increasing the number of modalities can be an effective method for improving the accuracy of multi-label classification problems.



1 Introduction

1.1 Background

To sell products on many e-commerce systems, sellers are tasked with providing categories for their products. Automating product classification can reduce manual labor time and cost, giving sellers a better experience when uploading new products. Such auto-labeling can also benefit the buyers, as sellers manually tagging their own products may be inaccurate or sub-optimal. An accurate classifier is important, as mislabelled products may lead to missed sales opportunities due to buyers not being able to effectively locate the things they want to buy.

Prior studies have approached product categorization as a text-classification task [1, 2, 3]. Ideally, however, multiple types of inputs would be considered, including title, description, image, audio, video, item-to-item relationships, and other metadata. Although a few recent studies have explored product categorization using both text and images [4, 5], here we report on a strategy for combining an arbitrary number of inputs and modes. We specifically demonstrate a multi-modal model based on images, titles, and descriptions. Our task differs from a regular multi-class classification problem, as a product may be labeled with more than one class. Most products appear in many classes and sub-classes, and a class can be nested in another class, which may itself be nested in yet another. Given the large number of products uploaded and the numerous applicable labels, machine learning can be used to classify the products automatically and more efficiently.

Figure 1: The diagram on the left represents late fusion and the diagram on the right represents early fusion.

As images, titles, and descriptions are different modalities of data that can each capture unique aspects of a product, we explored fusing individual models trained for each modality. Note that although fusing and ensembling both involve combining multiple models, for the purposes of our discussion each model in a fusion uses a different modality of data, whereas all of the models in an ensemble are trained on the same data. There are two common ways to fuse different modal networks: late fusion and early fusion (Fig. 1). Late fusion refers to combining the predictions (outcome probabilities) of multiple networks using a certain policy. Such a policy can be, for example, taking the maximum or minimum of the outcomes. In contrast, in early fusion, vector representations of each modality are extracted at an early level and fused with one another through concatenation or addition to produce a multi-modal representation vector. The model then performs classification on the resulting multi-modal representation vector.
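The distinction can be made concrete with a minimal sketch. The helper names and all numbers below are hypothetical and chosen only for illustration; they are not taken from the models in this study.

```python
# Illustrative sketch only: helper names and numbers are hypothetical.

def late_fusion(prob_vectors, policy=max):
    """Late fusion: combine per-class probabilities from several models.

    `policy` is applied class by class (e.g. max or min)."""
    return [policy(class_probs) for class_probs in zip(*prob_vectors)]


def early_fusion_concat(feature_vectors):
    """Early fusion: concatenate per-modality feature vectors into one
    multi-modal representation that a single classifier would consume."""
    fused = []
    for features in feature_vectors:
        fused.extend(features)
    return fused


# Late fusion operates on predictions: one probability per class.
image_probs = [0.2, 0.9, 0.1]
text_probs = [0.7, 0.4, 0.3]
late_max = late_fusion([image_probs, text_probs], policy=max)
late_min = late_fusion([image_probs, text_probs], policy=min)

# Early fusion operates on intermediate features, whose sizes may differ.
image_features = [0.5, -1.0]
text_features = [0.1, 0.2, 0.3]
early = early_fusion_concat([image_features, text_features])
```

In the late case each network has already made its decision before anything is combined, whereas in the early case a downstream classifier would still have to be trained on the concatenated vector.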

1.2 Related Works

Zahavy et al. [5] used a convolutional neural network (CNN) architecture based on the architecture of Kim [6] to classify the title of the products. The first layer uses random word embedding. In addition, they used a VGG network for image classification [7]. While they experimented with both early and late fusion, only late fusion resulted in an improvement in accuracy. The image and text classifiers were trained separately to achieve maximal performance individually before being combined by a policy network. The policy network that achieved the highest accuracy is a neural network with 2 fully connected layers that takes in the top-3 class probabilities from the image and text CNNs as input. Their dataset contained 1.2 million images and 2890 possible shelves. On average, each product falls in 3 shelves. Their model is considered accurate when the network correctly outputs one of the three shelves.

Åberg [4] is one of the first authors to use the image, title, and description of an ad/product to classify products into single categories. Åberg concatenated the title and description and used fastText (Joulin et al.) [8] as the baseline model for text classification, while using Inception V3 for image classification. Åberg also explored a similar implementation of Kim’s CNN architecture [6] but could not match the accuracy of fastText.

The dataset contained 96,806 products belonging to 193 different classes. Note that each product was assigned to one class. Hence Åberg applied a softmax function in the final layer before outputting the class probabilities. As in Zahavy et al. [5], both late and early fusion were explored, and late fusion yielded better results. Both heuristic policies and network policies were explored. A heuristic policy is a static rule, for example taking the mean of the probabilities from the different modalities. A network policy is a trained neural network that takes the output probabilities from the different networks and produces a new probability vector.

2 Dataset

Our dataset comprises Amazon products extracted by SNAP [9]. There are 9.4 million products in total. The class hierarchy information was not available, as the classes and subclasses were provided pre-flattened. We randomly sampled 119,073 products from this dataset, of which the first 90,000 are kept for the training set. After pre-processing, there are 122 possible classes to which a product can belong. Unlike many previous studies, here each product can be assigned multiple labels. Each product in the dataset contains the image, description, title, price, and co-purchasing network.

Product categorization systems can be challenging to build due to the trade-off between the number of classes and accuracy. As an example, adding more classes and sub-classes to a product might make it easier to discover, but more classes would also increase the likelihood of an incorrect class being applied. To address this issue, some studies [5, 10] reduced the number of sub-classes. One method is to create a shelf and categorize the products based on the shelves they are in. A shelf is a group of products presented together on the same e-commerce webpage, which usually contains products under the same categories [5]. Since our dataset does not contain the webpage information necessary to form shelves, our method was to remove the classes containing fewer than 400 products.
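The class-filtering step can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the function name is hypothetical, and dropping products that end up with no remaining label is an assumption that the text does not state.

```python
from collections import Counter


def filter_rare_classes(product_labels, min_products):
    """Remove every class that has fewer than `min_products` products.

    product_labels: list of label lists, one per product.
    Assumption: products left without any label are dropped as well.
    """
    # Count each class at most once per product.
    counts = Counter(
        label for labels in product_labels for label in set(labels)
    )
    kept = {label for label, n in counts.items() if n >= min_products}
    filtered = [
        [label for label in labels if label in kept]
        for labels in product_labels
    ]
    return [labels for labels in filtered if labels]


# Toy data with a threshold of 2 (the study uses 400).
products = [
    ["Pets", "Chew Toys"],
    ["Pets"],
    ["Office"],
    ["Pets", "Office"],
    ["Rare"],
]
result = filter_rare_classes(products, min_products=2)
```

Here "Chew Toys" and "Rare" each occur once and are removed, and the last product disappears because it has no label left.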

Figure 2: The $x$-axis represents the number of products in a category, whereas the $y$-axis represents the number of categories with that number of products.

On average, each product belongs to 3 categories after pre-processing. The maximum number of products in a category is 37,102 and the minimum number of products in a category is 558. On average, there are 2,919 products per category. In addition, we can see from Fig. 2 that the number of products per category is not evenly distributed, which could introduce bias into the model.
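As a quick consistency check on these averages, assuming both are computed over the 122 retained classes and the full sample of 119,073 products, the total number of label assignments divided by the number of products should recover the average number of categories per product:

$$\frac{122 \times 2{,}919}{119{,}073} = \frac{356{,}118}{119{,}073} \approx 2.99 \approx 3.$$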

3 Baseline Models

To understand how much we benefit from fusing the different modal classifiers, we report the baseline accuracy for each modality below. We evaluate our accuracy using the F1 score (micro-averaged), which is an accepted metric for multi-label classification and imbalanced datasets [11]. During training, for all classifiers, we used Adam [12] as our optimizer and categorical cross-entropy as our loss function. To accommodate multi-labeling, the final activation for all classifiers is a sigmoid function. Although both titles and descriptions are textual data, we leverage their different use-cases by treating them as different modalities, allowing us to perform different pre-processing steps as described below.
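For reference, the micro-averaged F1 score pools true positives, false positives, and false negatives over all classes before computing precision and recall. The sketch below is a plain-Python illustration; the 0.5 decision threshold applied to the sigmoid outputs is an assumption, as the text does not state the threshold used.

```python
def binarize(probabilities, threshold=0.5):
    """Turn sigmoid outputs into 0/1 predictions (threshold is an assumption)."""
    return [[1 if p >= threshold else 0 for p in row] for row in probabilities]


def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for multi-label data.

    y_true, y_pred: lists of equal-length 0/1 lists, one row per product
    and one column per class. Counts are pooled over all classes.
    """
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: 2 products, 3 classes.
y_true = [[1, 0, 1], [0, 1, 0]]
probs = [[0.9, 0.2, 0.4], [0.1, 0.8, 0.7]]
score = micro_f1(y_true, binarize(probs))
```

In the toy example there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and so is the F1 score.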

3.1 Description Classifier

The description was pre-processed to remove stop words, excessive whitespace, digits, punctuation, and words longer than 30 characters. In addition, sentences were truncated to 300 words. To classify the pre-processed descriptions, we slightly modified Kim’s CNN architecture for sentence classification. Kim’s architecture is a CNN with one layer of convolution on top of word vectors initialized using Word2Vec [6, 13]. Max-pooling over time is then applied [14], which serves to capture the most important features. Finally, dropout is employed in the penultimate layer.
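The description pre-processing can be sketched roughly as below. This is an illustrative approximation, not the code used in the study: the stop-word list is a deliberately tiny hypothetical one, and lowercasing and the order of the steps are assumptions.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "for", "is", "with", "this"}


def preprocess_description(text, max_words=300, max_word_len=30):
    """Clean a product description and return a list of word tokens."""
    text = text.lower()                        # assumption: case is folded
    text = re.sub(r"\d+", " ", text)           # remove digits
    text = re.sub(r"[^\w\s]|_", " ", text)     # remove punctuation
    words = text.split()                       # also collapses whitespace
    words = [
        w for w in words
        if w not in STOP_WORDS and len(w) <= max_word_len
    ]
    return words[:max_words]                   # truncate to 300 words


tokens = preprocess_description("This  is a GREAT chew-toy for 2 dogs!!!")
```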

Unlike Kim, we used GloVe as our embedding. Words not covered by GloVe were initialized randomly. For our dataset, GloVe covers only 61.0% of the vocabulary from the description. Our first convolution layer uses a kernel of size 5 with 200 filters. We then performed global max pooling, followed by a fully connected layer of 170 units with ReLU activations. Our final layer is another densely connected layer of 122 units with sigmoid activation. This model achieves an F1 score of 77.0% on the test set.

3.2 Title Classifier

Although the title classifier is identical in architecture to the description classifier, the title data was pre-processed differently. For the title, we did not remove the stop words, and we truncated or padded the text to 57 words. We again chose GloVe for the embedding, with words not covered initialized randomly. GloVe covers 77.0% of the vocabulary from the title. This model achieves an F1 score of 82.7% on the test set.
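Fixing the title length to 57 words amounts to a truncate-or-pad step such as the one sketched below. The padding token and the choice to pad at the end rather than the beginning are assumptions; the text does not specify either.

```python
PAD_TOKEN = "<pad>"  # hypothetical padding token


def pad_or_truncate(tokens, length=57, pad_token=PAD_TOKEN):
    """Return exactly `length` tokens: cut long titles, pad short ones.

    Assumption: padding is appended at the end of the sequence.
    """
    tokens = tokens[:length]
    return tokens + [pad_token] * (length - len(tokens))


short_title = pad_or_truncate("the best chew toy for dogs".split())
long_title = pad_or_truncate(["word"] * 80)
```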

3.3 Image Classifier

We modified the ResNet-50 architecture from Keras by removing the final densely connected layer and adding a densely connected layer with 122 units to match the number of labels we have. In addition, we changed the final activation to be sigmoidal. ResNet-50 is based on the architecture of He et al. [15], which achieves competitive results compared to other state-of-the-art models. We also used the pre-trained ImageNet weights, which were obtained by training on the ImageNet dataset [16], containing more than 14 million images. We kept the earlier layers frozen and trained only the deeper/later layers [17]. We experimented with the number of trainable layers; our top model, in which only the last 40 layers were trained, achieved 61% accuracy on the test set.

3.4 Summary

Modal | Accuracy (%)
Description | 77.0
Title | 82.7
Image | 61.0
Table 1: The accuracy for each individual classifier.

The results summarized in Table 1 underscore that the classifiers differ in discriminative power, as the title and description classifiers significantly outperform the image classifier. This result is consistent with Zahavy et al., whose results also demonstrated a significant difference between the image and title classifiers [5]. Moreover, we have shown that the description classifier also significantly outperforms the image classifier. Such results suggest that text can provide more information regarding a product’s categories.

4 Error Analysis

Image | Description | Title
Martial Arts | Women | Horses
Ballpoint Pens | Accessories | Novelty, Costumes & More
Reptiles & Amphibians | Clothing, Shoes & Jewelry | Women
Small Animals | Boating | Accessories
Chew Toys | Novelty, Costumes & More | Feeding
Squeak Toys | Parts & Components | Clothing, Shoes & Jewelry
Cards & Card Stock | Men | Hunting & Tactical Knives
Filter Accessories | Chew Toys | Balls
Other Sports | Balls | Hunting Knives
Bedding | Boating & Water Sports | Boating
Tape, Adhesives & Fasteners | Tape, Adhesives & Fasteners | Small Animals
Birds | Office Furniture & Lighting | Men
Pumps & Filters | Forms, Recordkeeping & Money Handling | Chew Toys
Cages & Accessories | Hunting Knives | Boating & Water Sports
Horses | Team Sports | Carriers & Travel Products
Table 2: The top 15 most misclassified categories/classes for each classifier. The fraction represents the number of products that should be predicted as class $c$ but are not, over the total number of products in $c$.

From Table 2 we can see that the top misclassified categories for each classifier generally reflect their inadequate representation in the dataset. Recall that the average number of products per category is 2,919. The Accessories category contains the most products (924) out of all the misclassified categories, but it is still far below the average. In addition, we can see that the top misclassified categories for each classifier seldom overlap between the modal classifiers. For the categories that the image classifier is classifying inaccurately, the description and title classifiers are classifying more accurately and vice versa. This suggests that we should be able to combine the classifiers to effectively complement each other’s shortcomings for a more accurate result.
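The per-class fraction described in the caption of Table 2 is the share of a class's products that the classifier fails to assign to that class. A minimal sketch of how such a ranking could be computed is given below; the function names and toy data are hypothetical.

```python
def miss_fraction_per_class(y_true, y_pred, class_names):
    """For each class c: products labeled c but not predicted as c,
    divided by the total number of products labeled c."""
    fractions = {}
    for j, name in enumerate(class_names):
        total = sum(row[j] for row in y_true)
        missed = sum(
            1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0
        )
        if total > 0:  # skip classes with no products in this split
            fractions[name] = missed / total
    return fractions


def top_misclassified(fractions, k=15):
    """Class names sorted by miss fraction, highest first."""
    return sorted(fractions, key=fractions.get, reverse=True)[:k]


class_names = ["Chew Toys", "Horses", "Men"]
y_true = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 0]]
y_pred = [[0, 0, 1], [0, 1, 0], [0, 1, 1], [1, 0, 0]]
fractions = miss_fraction_per_class(y_true, y_pred, class_names)
worst = top_misclassified(fractions, k=1)
```

Running the same ranking separately on each modality's predictions would yield one column per classifier, which is how overlap between the columns can be inspected.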

5 Multi-Modality

As Åberg and Zahavy et al. found that late fusion models were more accurate than early-fusion models [5, 4], here we focus our studies on improving late fusion.

5.1 Predefined Policies

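Since both Åberg and Zahavy et al. experimented with predefined rules [5, 4], we included predefined rules to compare with other non-static policies. We experimented with a max policy and a mean policy over the outputs of the classifiers. The max policy selects the highest output for each class prediction from among the image, description, and title classifiers. This can be represented as

$$\hat{y} = \max\left(\hat{y}_{\text{image}},\ \hat{y}_{\text{description}},\ \hat{y}_{\text{title}}\right),$$

where $\hat{y}_{\text{image}}$, $\hat{y}_{\text{description}}$, and $\hat{y}_{\text{title}}$ represent the output from each classifier and the maximum is taken element-wise over the classes.

The mean policy can be represented as

$$\hat{y} = \frac{1}{3}\left(\hat{y}_{\text{image}} + \hat{y}_{\text{description}} + \hat{y}_{\text{title}}\right).$$
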
Both the mean and max policies resulted in lower accuracies than the top individual classifier, which is the title classifier. The mean policy yielded 81.7%, while the max policy yielded 78.8%. Intuitively, each classifier contributes equally to the mean policy, so we would expect the averaged performance to be lower than that of the best performer. For the max policy, erroneous maximal outputs from the low-performing classifiers degrade the final predictions.
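A small sketch shows how a single confident but wrong output can hurt the max policy more than the mean policy. The probabilities below are invented for illustration, and the 0.5 decision threshold is an assumption.

```python
def max_policy(image, description, title):
    """Element-wise maximum of the three classifiers' outputs."""
    return [max(p) for p in zip(image, description, title)]


def mean_policy(image, description, title):
    """Element-wise mean of the three classifiers' outputs."""
    return [sum(p) / 3 for p in zip(image, description, title)]


def threshold(probs, cutoff=0.5):
    """Convert probabilities to 0/1 labels (cutoff is an assumption)."""
    return [1 if p >= cutoff else 0 for p in probs]


# Hypothetical sigmoid outputs for 3 classes; the true labels are [1, 0, 0].
image = [0.30, 0.90, 0.10]        # confident but wrong on the second class
description = [0.80, 0.20, 0.10]
title = [0.90, 0.10, 0.20]

labels_max = threshold(max_policy(image, description, title))
labels_mean = threshold(mean_policy(image, description, title))
```

With these numbers the max policy inherits the image classifier's false positive on the second class, while the mean policy is outvoted into the correct prediction.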

5.2 Linear Regression

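We trained a simple ridge linear regression model to fuse the individual classifiers into a single classifier. The model achieves 83.0% on the test set. The model can be written as follows

$$\min_{W}\ \lVert y - \hat{y} \rVert_2^2 + \lambda \lVert W \rVert_2^2, \qquad \hat{y} = W x,$$

where $y$ is the true label and $\hat{y}$ is the predicted label, $x$ is the concatenation of the outputs of the individual classifiers, $W$ is the weight matrix, and $\lambda$ is the regularization strength. Even this simple non-static policy outperforms the static policies above.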
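Ridge regression minimizes the squared error plus an L2 penalty on the weights and has a closed-form solution. The sketch below fits such a fusion model with NumPy on random stand-in data; the toy sizes, the regularization strength `alpha`, and the omission of an intercept are all assumptions made for brevity, not details from the study.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge solution W = (X^T X + alpha I)^{-1} X^T Y.

    X: (n_products, n_inputs) concatenated classifier outputs.
    Y: (n_products, n_classes) 0/1 label matrix.
    No intercept term is fitted, for brevity.
    """
    n_inputs = X.shape[1]
    A = X.T @ X + alpha * np.eye(n_inputs)
    return np.linalg.solve(A, X.T @ Y)


rng = np.random.default_rng(0)
n_products, n_classes = 200, 4  # toy sizes; the study has 122 classes

# Random stand-ins for the image, description, and title outputs,
# concatenated into one input row per product.
X = rng.random((n_products, 3 * n_classes))
Y = (rng.random((n_products, n_classes)) > 0.5).astype(float)

W = fit_ridge(X, Y, alpha=1.0)
Y_hat = X @ W  # fused scores, one column per class
```

The fused scores are not constrained to lie between 0 and 1, so a decision threshold would still have to be chosen before computing the F1 score.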

5.3 Bi-Modal Fusion

The work by Zahavy et al. involved two neural networks, one for classifying images and another for classifying titles, using late fusion [5]. For comparison purposes, we examined models developed from fusing two of the three modal networks in this study. The first fused network included the image classifier’s output (as in Section 3.3) and the title classifier’s output (as in Section 3.2). This is essentially the method by Zahavy et al. [5]. We then fused the title classifier’s output and the description classifier’s output for the second fused network, and the image classifier’s output and the description classifier’s output for the third. All three networks were fused the same way: the outputs of the two classifiers were concatenated and fed into a three-layer neural network. The first, second, and third layers contained 200, 150, and 122 units, respectively. All the activations were sigmoidal. The image-description, image-title, and description-title fused networks yielded 82.0%, 85.0%, and 87.0% accuracies, respectively (Table 3).
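The shape of such a fusion network can be illustrated with a NumPy forward pass. This sketch only shows the data flow through layers of 200, 150, and 122 sigmoid units; the weights are random and untrained, and the initialization scheme and batch size are assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_layers(input_dim, layer_sizes, rng):
    """Create (weights, bias) pairs with small random weights (untrained)."""
    layers = []
    for size in layer_sizes:
        W = rng.normal(0.0, 0.1, size=(input_dim, size))
        b = np.zeros(size)
        layers.append((W, b))
        input_dim = size
    return layers


def fuse(outputs_a, outputs_b, layers):
    """Concatenate two classifiers' outputs and pass them through the
    fully connected layers, applying a sigmoid after each one."""
    h = np.concatenate([outputs_a, outputs_b], axis=1)
    for W, b in layers:
        h = sigmoid(h @ W + b)
    return h


rng = np.random.default_rng(0)
n_classes = 122
layers = init_layers(2 * n_classes, [200, 150, n_classes], rng)

# Random stand-ins for a batch of 8 products' image and title outputs.
image_out = rng.random((8, n_classes))
title_out = rng.random((8, n_classes))
fused = fuse(image_out, title_out, layers)
```

Each product's two 122-dimensional probability vectors become one 244-dimensional input, and the final sigmoid layer returns a new 122-dimensional probability vector.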

5.4 Tri-Modal Fusion

Figure 3: The proposed tri-modal fusion architecture. The CNN is based on Yoon Kim’s architecture [6].

Finally, we developed a tri-modal model to include the titles, images, and descriptions. To our knowledge, we are the first to fuse three classifiers/neural networks to categorize products. We fused the three classifiers (as in Sections 3.1, 3.2, and 3.3) using a policy network, which is an additional neural network that takes in the output of each of the classifiers. We varied the number of layers, activation functions, and units of the neural networks. Through hyperparameter optimization, we found that the top policy network consists of three layers. It uses the sigmoidal activation on the first and last layers and the hyperbolic tangent activation on the middle layer. This fused model achieves 88.2%, beating all of the previous methods.

Model | Accuracy (%)
Linear Regression | 83.0
Image-Description Fused | 82.0
Image-Title Fused | 85.0
Title-Description Fused | 87.0
Image-Description-Title Fused | 88.2
Table 3: The accuracy for fused classifiers.

5.4.1 Discussion

1 Chew Toys | 6 Hunting Knives | 11 Balls
2 Accessories | 7 Men | 12 Squeak Toys
3 Women | 8 Clothing, Shoes & Jewelry | 13 Horses
4 Novelty, Costumes & More | 9 Hunting & Tactical Knives | 14 Boating
5 Snacks | 10 Shampoos | 15 Airsoft
Table 4: The top 15 most misclassified categories using our proposed method.

Compared to Table 2, the proportion of misclassified products is significantly lower in Table 4. In examining Accessories, Horses, and Clothing, Shoes & Jewelry, we can see that the proposed method outperforms the individual classifiers by a considerable margin. However, the proposed method fails to significantly reduce the number of misclassified products in certain categories, such as Chew Toys. According to Table 2, each of the individual classifiers performed poorly at predicting products as Chew Toys. This suggests that there remain categories that are underserved across all classifiers. To address this shortcoming, more data or other modes could be considered in future work. On the other hand, the result also suggests that as long as one classifier performs well on some of the tasks, it is sufficient for the overall model. For example, the number of misclassified products in Clothing, Shoes & Jewelry dropped from 384 to 256. Overall, this method improves over the top individual classifier and the top bi-modal fused network by 5.5 and 1.2 percentage points, respectively (Table 3).

6 Conclusion

We have shown that the title classifier can outperform the description classifier, and that the description classifier can outperform the image classifier. Moreover, a tri-modal fused network comprising all three modalities outperformed every bi-modal fused network. The performance improvements can be attributed to the classifiers addressing complementary portions of the task, so that each compensates for the shortcomings of the others. While this study focused on late fusion, an early fusion approach can be explored in the future. In addition, more products, including products that may not fall under the predefined categories, can be added to reduce overfitting. A better text classifier can be built with contextualized word embeddings [18]. Transformers can be considered to replace CNNs and RNNs for both text and images [19, 20, 21]. Finally, one possible extension to our work could be to build a vector representation of the products. Just as word embeddings enabled us to classify text more accurately, a product embedding can be useful for capturing the relationships between products. Such a product embedding could help discover products that are “similar” for recommendation purposes and could be used as input to a model that predicts categories.