To sell products on many e-commerce systems, sellers are tasked with providing categories for their products. Automating product classification can reduce manual labor time and cost, giving sellers a better experience when uploading new products. Such auto-labeling can also benefit the buyers, as sellers manually tagging their own products may be inaccurate or sub-optimal. An accurate classifier is important, as mislabelled products may lead to missed sales opportunities due to buyers not being able to effectively locate the things they want to buy.
Prior studies have approached product categorization as a text-classification task [1, 2, 3]. Ideally, however, multiple types of input can be considered, including the title, description, image, audio, video, item-to-item relationships, and other metadata. Although a few recent studies have explored product categorization using both text and images [4, 5], here we report a strategy for combining an arbitrary number of inputs and modalities, and specifically demonstrate a multi-modal model based on images, titles, and descriptions. Our task differs from a regular multi-class classification problem in that a product may be labeled with more than one class: most products appear in several classes and sub-classes, and a class can itself be nested within another class or sub-class. Given the large number of products uploaded and the many possible labels, machine learning can classify products far more efficiently than manual tagging.
As images, titles, and descriptions are different modalities of data that each capture unique aspects of a product, we explored fusing individual models trained on each modality. Note that although fusing and ensembling both combine multiple models, for the purposes of our discussion each model in a fusion utilizes a different modality of data, whereas all models in an ensemble are trained on the same data. There are two common ways to fuse networks of different modalities: late fusion and early fusion (Fig. 1). Late fusion combines the predictions (outcome probabilities) of multiple networks using a certain policy, such as taking the maximum or minimum of the outcomes. In contrast, in early fusion a vector representation of each modality is extracted at an early layer and fused with the others through concatenation or addition to produce a multi-modal representation vector, on which the model then performs classification.
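As an illustration, the two strategies can be sketched with toy vectors (the vector sizes and the mean-combination policy here are illustrative assumptions, not the models used later):

```python
import numpy as np

# Late fusion: each modality's network already outputs class probabilities;
# a policy (here, the element-wise mean) combines them into one prediction.
p_image = np.array([0.2, 0.7, 0.1])
p_title = np.array([0.1, 0.8, 0.3])
late_fused = (p_image + p_title) / 2  # policy applied to outcome probabilities

# Early fusion: intermediate representation vectors are combined first
# (by concatenation or addition), then a single classifier head is applied.
v_image = np.random.rand(128)   # hypothetical image embedding
v_title = np.random.rand(64)    # hypothetical title embedding
multi_modal = np.concatenate([v_image, v_title])  # 192-dim fused vector

print(late_fused)         # combined probability vector
print(multi_modal.shape)  # (192,)
```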
1.2 Related Works
Zahavy et al. used a convolutional neural network (CNN) architecture based on that of Kim to classify product titles, with randomly initialized word embeddings in the first layer. In addition, they used a VGG network for image classification. While they experimented with both early and late fusion, only late fusion improved accuracy. The image and text classifiers were trained separately to maximize their individual performance before being combined by a policy network. The policy network that achieved the highest accuracy is a neural network with two fully connected layers that takes the top-3 class probabilities from the image and text CNNs as input. Their dataset contained 1.2 million images and 2,890 possible shelves, with each product falling in 3 shelves on average. Their model is considered correct when the network outputs one of a product's three shelves.
Åberg is one of the first authors to use the image, title, and description of an ad/product to classify products into single categories. Åberg concatenated the title and description and used fastText (Joulin et al.) as the baseline model for text classification, while using Inception V3 for image classification. Åberg also explored an implementation similar to Kim's CNN architecture but could not reach the accuracy of fastText.
The dataset contained 96,806 products belonging to 193 different classes. Note that each product was assigned to exactly one class; hence Åberg applied a softmax function in the final layer before outputting the class probabilities. Similar to Zahavy et al., both late and early fusion were explored, and late fusion yielded better results. Both heuristic and network policies were examined: heuristic policies refer to static rules, such as taking the mean of the probabilities from the different modalities, whereas network policies train a neural network that takes the output probabilities from the individual networks and produces a new probability vector.
Our dataset comprises Amazon products extracted by SNAP. There are 9.4 million products in total. Class hierarchy information was not available, as the classes and subclasses were pre-flattened as given. We randomly sampled 119,073 products from this dataset, of which the first 90,000 were kept for the training set. After pre-processing, there are 122 possible classes to which a product can belong. Unlike in many previous studies, here each product can be assigned multiple labels. Each product in the dataset contains the image, description, title, price, and co-purchasing network.
Product categorization systems can be challenging to build due to the trade-off between the number of classes and accuracy. For example, adding more classes and sub-classes to a product might make it easier to discover, but more classes also increase the likelihood of an incorrect class being applied. To address this issue, some studies [5, 10] reduced the number of sub-classes. One method is to create shelves and categorize products by the shelf they are in, where a shelf is a group of products presented together on the same e-commerce webpage, usually under the same categories. Since our dataset does not contain the webpage information necessary to form shelves, we instead removed the classes containing fewer than 400 products.
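This pruning step can be sketched as follows (the 400-product threshold matches the cutoff above; the data layout, a list of label lists, is an illustrative assumption):

```python
from collections import Counter

def prune_rare_classes(product_labels, min_count=400):
    """Drop every class attached to fewer than `min_count` products,
    then drop products left with no labels at all."""
    counts = Counter(label for labels in product_labels for label in labels)
    kept = {c for c, n in counts.items() if n >= min_count}
    pruned = [[c for c in labels if c in kept] for labels in product_labels]
    return [labels for labels in pruned if labels], kept

# Toy example with a threshold of 3: "B" and "C" fall below it and are removed.
toy = [["A", "B"], ["A"], ["A", "B"], ["C"]]
pruned, kept = prune_rare_classes(toy, min_count=3)
```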
On average, each product belongs to 3 categories after pre-processing. The largest category contains 37,102 products and the smallest contains 558, with an average of 2,919 products per category. In addition, Fig. 2 shows that the number of products per category is not evenly distributed, which could introduce bias into the model.
3 Baseline Models
To understand how much we benefit from fusing the different modal classifiers, we report the baseline accuracy for each modality below. We evaluate our models using the F1 score (micro-averaged), an accepted metric for multi-label classification and imbalanced datasets. During training, for all classifiers, we used Adam as our optimizer and categorical cross-entropy as our loss function. To accommodate multi-labeling, the final activation for all classifiers is a sigmoid function. Although both titles and descriptions are textual data, we leverage their different use-cases by treating them as different modalities, which allows us to apply the different pre-processing steps described below.
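For reference, the micro-averaged F1 score pools true positives, false positives, and false negatives across all classes before computing precision and recall; a minimal numpy version (binary indicator matrices are assumed):

```python
import numpy as np

def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over an (n_samples, n_classes) binary label matrix."""
    tp = np.sum((y_true == 1) & (y_pred == 1))  # pooled over all classes
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = np.array([[1, 0, 1], [0, 1, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 1]])
score = micro_f1(y_true, y_pred)  # 3 TP, 0 FP, 1 FN
```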
3.1 Description Classifier
The description was pre-processed to remove stop words, excessive whitespace, digits, punctuation, and words longer than 30 characters. In addition, descriptions were truncated to 300 words. To classify the pre-processed descriptions, we slightly modified Kim's CNN architecture for sentence classification. Kim's architecture is a CNN with one layer of convolution on top of word vectors initialized using Word2Vec [6, 13]. Max-pooling over time is then applied, which serves to capture the most important features. Finally, dropout is employed in the penultimate layer.
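The pre-processing steps described above can be sketched as follows (the stop-word list is a tiny illustrative stand-in for a full list such as NLTK's):

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is"}  # illustrative subset

def preprocess_description(text, max_words=300, max_word_len=30):
    text = re.sub(r"\d", " ", text)        # drop digits
    text = re.sub(r"[^\w\s]", " ", text)   # drop punctuation
    words = [w for w in text.lower().split()  # split() also collapses
             if w not in STOP_WORDS           # excessive whitespace
             and len(w) <= max_word_len]      # drop over-long words
    return words[:max_words]                  # truncate to 300 words

tokens = preprocess_description("The BEST pen, 2-pack of 10!")
```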
Unlike Kim, we used GloVe as our embedding. Words not covered by GloVe were initialized randomly. For our dataset, GloVe covers only 61.0% of the vocabulary from the description. Our first convolution layer uses a kernel of size 5 with 200 filters. We then performed global max pooling, followed by a fully connected layer of 170 units with ReLU activations. Our final layer is another densely connected layer of 122 units with sigmoid activation. This model achieves 77.0% on the test set.
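A forward pass of this configuration can be sketched in plain numpy (random weights stand in for the trained GloVe embeddings and learned filters; the 100-dimensional embedding is an assumption, while the other shapes follow the description above):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, emb_dim, seq_len = 5000, 100, 300        # embedding size is assumed
n_filters, kernel, hidden, n_classes = 200, 5, 170, 122

E = rng.standard_normal((vocab, emb_dim))       # embedding matrix (GloVe init)
W_conv = rng.standard_normal((kernel, emb_dim, n_filters)) * 0.01
W_fc1 = rng.standard_normal((n_filters, hidden)) * 0.01
W_out = rng.standard_normal((hidden, n_classes)) * 0.01

tokens = rng.integers(0, vocab, seq_len)        # one pre-processed description
x = E[tokens]                                   # (300, 100) word vectors

# One convolution layer over time: kernel size 5, 200 filters.
conv = np.stack([np.tensordot(x[i:i + kernel], W_conv, axes=([0, 1], [0, 1]))
                 for i in range(seq_len - kernel + 1)])   # (296, 200)
pooled = conv.max(axis=0)                       # global max pooling over time
h = np.maximum(0, pooled @ W_fc1)               # dense, 170 units, ReLU
probs = 1 / (1 + np.exp(-(h @ W_out)))          # dense, 122 units, sigmoid
```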
3.2 Title Classifier
Although an identical classifier to the description classifier was used for the title, the title data was pre-processed differently. For the title, we did not remove the stop words and limited or padded the text to 57 words. We again chose GloVe for the embedding, in which words not covered were initialized randomly. GloVe covers 77.0% of the vocabulary from the title. This model achieves 82.7% on the test set.
3.3 Image Classifier
We modified the ResNet-50 architecture from Keras by removing the final densely connected layer and adding a densely connected layer with 122 units to match our number of labels, and we changed the final activation to be sigmoidal. ResNet-50 is based on the architecture of He et al., which achieves competitive results compared to other state-of-the-art models. We used the pre-trained ImageNet weights, trained on the ImageNet dataset of more than 14 million images. We kept the earlier layers frozen and trained only the deeper/later layers. We experimented with the number of trainable layers; our top model trained only the last 40 layers, achieving 61% accuracy on the test set.
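The head replacement can be sketched as follows: the 2048-dimensional pooled feature vector that ResNet-50 produces is fed to a new 122-unit sigmoid layer (random values stand in for the backbone features and fine-tuned weights):

```python
import numpy as np

rng = np.random.default_rng(1)
features = rng.standard_normal(2048)  # pooled output of the (frozen) backbone
W_head = rng.standard_normal((2048, 122)) * 0.01  # new dense layer, 122 units
b_head = np.zeros(122)

logits = features @ W_head + b_head
probs = 1 / (1 + np.exp(-logits))     # sigmoidal activation, one per label
predicted = probs > 0.5               # independent per-label decisions
```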
The results summarized in Table 1 underscore that the classifiers differ in discriminative power, as the title and description classifiers significantly outperform the image classifier. This is consistent with Zahavy et al., whose results also demonstrated a significant difference between the image and title classifiers. Moreover, we have shown that the description classifier also significantly outperforms the image classifier. These results suggest that text can provide more information regarding a product's categories.
4 Error Analysis
Table 2: Top misclassified categories for each modal classifier (one column per classifier).

|  |  |  |
|---|---|---|
| Ballpoint Pens | Accessories | Novelty, Costumes & More |
| Reptiles & Amphibians | Clothing, Shoes & Jewelry | Women |
| Chew Toys | Novelty, Costumes & More | Feeding |
| Squeak Toys | Parts & Components | Clothing, Shoes & Jewelry |
| Cards & Card Stock | Men | Hunting & Tactical Knives |
| Filter Accessories | Chew Toys | Balls |
| Other Sports | Balls | Hunting Knives |
| Bedding | Boating & Water Sports | Boating |
| Tape, Adhesives & Fasteners | Tape, Adhesives & Fasteners | Small Animals |
| Birds | Office Furniture & Lighting | Men |
| Pumps & Filters | Forms, Recordkeeping & Money Handling | Chew Toys |
| Cages & Accessories | Hunting Knives | Boating & Water Sports |
| Horses | Team Sports | Carriers & Travel Products |
From Table 2 we can see that the top misclassified categories for each classifier generally reflect their inadequate representation in the dataset. Recall that the average number of products per category is 2,919. The Accessories category contains the most products (924) out of all the misclassified categories, but it is still far below the average. In addition, we can see that the top misclassified categories for each classifier seldom overlap between the modal classifiers. For the categories that the image classifier is classifying inaccurately, the description and title classifiers are classifying more accurately and vice versa. This suggests that we should be able to combine the classifiers to effectively complement each other’s shortcomings for a more accurate result.
5.1 Predefined Policies
Since both Åberg and Zahavy et al. experimented with predefined rules [5, 4], we included predefined rules to compare with the non-static policies. We experimented with the max policy and the mean policy over the outputs of the classifiers. The max policy selects the highest output for each class prediction from among the image, title, and description classifiers. This can be represented as

$$\hat{y} = \max\left(y_{image},\, y_{title},\, y_{description}\right),$$

where $y_{image}$, $y_{title}$, and $y_{description}$ represent the output probability vectors from each classifier and the maximum is taken element-wise.
The mean policy can be represented as

$$\hat{y} = \tfrac{1}{3}\left(y_{image} + y_{title} + y_{description}\right).$$
Both the mean and max policies resulted in lower accuracies than the top individual classifier, the title classifier: the mean policy yielded 81.7%, while the max policy yielded 78.8%. Intuitively, each classifier contributes equally to the mean policy, so we would expect its performance to fall below that of the best performer. For the max policy, erroneously high outputs from the lower-performing classifiers degrade the final predictions.
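Both predefined policies are one-liners over the classifiers' probability vectors; a numpy sketch with toy three-class outputs:

```python
import numpy as np

# Toy output probabilities from the three modal classifiers (3 classes).
y_image = np.array([0.10, 0.60, 0.30])
y_title = np.array([0.05, 0.90, 0.20])
y_desc  = np.array([0.40, 0.70, 0.10])

stacked = np.stack([y_image, y_title, y_desc])
y_max = stacked.max(axis=0)    # max policy: element-wise maximum
y_mean = stacked.mean(axis=0)  # mean policy: element-wise average
```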
5.2 Linear Regression
We trained a simple ridge linear regression model to fuse the individual classifiers into a single classifier. The model achieves 83.0% on the test set. The model can be written as

$$\hat{y} = Wx + b,$$

where $x$ is the concatenation of the three classifiers' output probabilities, $y$ is the true label vector, $\hat{y}$ is the predicted label vector, and $W$ and $b$ are fit with an L2 (ridge) penalty on $W$. Even this simple non-static policy outperforms the static policies above.
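A minimal version of this fusion, using the closed-form ridge solution on the concatenated classifier outputs (toy sizes and regularization strength are illustrative assumptions; the paper's label space has 122 classes):

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_classes = 500, 4                 # toy sizes for illustration
X = rng.random((n, 3 * n_classes))    # concatenated image/title/desc outputs
Y = (rng.random((n, n_classes)) > 0.5).astype(float)  # multi-hot true labels

lam = 1.0                             # ridge penalty strength (assumed)
d = X.shape[1]
# Closed-form ridge solution: W = (X^T X + lam I)^{-1} X^T Y
W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
Y_hat = X @ W                         # fused predictions, thresholded later
```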
5.3 Bi-Modal Fusion
The work by Zahavy et al. involved two neural networks, one for classifying images and another for classifying titles, combined using late fusion. For comparison purposes, we examined models developed by fusing two of the three modal networks in this study. The first fused network combined the image classifier's output (as in Section 3.3) and the title classifier's output (as in Section 3.2), which is essentially the method of Zahavy et al. We then fused the title and description classifiers' outputs for the second network, and the image and description classifiers' outputs for the third. All three networks were fused the same way, using a three-layer neural network on the concatenated outputs of the classifiers. The first, second, and third layers contained 200, 150, and 122 units, respectively, all with sigmoidal activations. The image-description, image-title, and description-title fused networks yielded 82.0%, 85.0%, and 87.0% accuracies, respectively (Table 3).
5.4 Tri-Modal Fusion
Finally, we developed a tri-modal model that includes titles, images, and descriptions. To our knowledge, we are the first to fuse three classifiers/neural networks to categorize products. We fused the three classifiers (as in Sections 3.1, 3.2, and 3.3) using a policy network, an additional neural network that takes in the output of each classifier. We varied the number of layers, activation functions, and units of the network. Through hyperparameter optimization, we found that the top policy network consists of three layers, with sigmoidal activations on the first and last layers and a hyperbolic tangent activation on the middle layer. This fused model achieves 88.2%, beating all of the previous methods.
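A forward pass of the winning policy network can be sketched as follows (random weights stand in for the trained ones, and the 200- and 150-unit hidden widths are illustrative assumptions, as only the depth, activations, and 122-unit output are specified above):

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

# Concatenated outputs of the three 122-class modal classifiers.
x = np.concatenate([rng.random(122), rng.random(122), rng.random(122)])

W1 = rng.standard_normal((366, 200)) * 0.05
W2 = rng.standard_normal((200, 150)) * 0.05
W3 = rng.standard_normal((150, 122)) * 0.05

h1 = sigmoid(x @ W1)       # first layer: sigmoidal activation
h2 = np.tanh(h1 @ W2)      # middle layer: hyperbolic tangent
probs = sigmoid(h2 @ W3)   # final layer: sigmoid, one probability per label
```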
Table 4: Top misclassified categories (by rank) for the tri-modal fused model.

| Rank | Category | Rank | Category | Rank | Category |
|---|---|---|---|---|---|
| 1 | Chew Toys | 6 | Hunting Knives | 11 | Balls |
| 3 | Women | 8 | Clothing, Shoes & Jewelry | 13 | Horses |
| 4 | Novelty, Costumes & More | 9 | Hunting & Tactical Knives | 14 | Boating |
Compared to Table 2, the proportion of misclassified products is reduced significantly in Table 4. Examining Accessories, Horses, and Clothing, Shoes & Jewelry, we can see that the proposed method outperforms the individual classifiers by a considerable margin. However, the proposed method fails to significantly reduce the number of misclassified products in certain categories, such as Chew Toys. According to Table 2, each of the individual classifiers performed poorly when predicting products as Chew Toys, which suggests that some categories remain underserved across all classifiers. To address this shortcoming, more data or additional modalities could be considered in future work. On the other hand, the results also suggest that as long as one classifier performs well on part of the task, that is sufficient for the overall model; for example, the number of misclassified products in Clothing, Shoes & Jewelry dropped from 384 to 256. Overall, this method improves over the top individual classifier and the top bi-modal fused network by 5.5% and 1.2%, respectively (Table 3).
We have shown that the title classifier can outperform the description classifier, and that the description classifier can outperform the image classifier. Moreover, a tri-modal fused network comprising all three modalities outperformed all of the bi-modal fused networks. The performance improvements can be attributed to the classifiers addressing complementary portions of the task, compensating for each other's individual shortcomings. While this study focused on late fusion, an early fusion approach can be explored in the future. In addition, more products, including products that may not fall under the predefined categories, can be added to reduce overfitting. A better text classifier could be built with contextualized word embeddings. Transformers can be considered to replace CNNs and RNNs for both text and images [19, 20, 21]. Finally, one possible extension to our work is to build a vector representation of the products. Just as word embeddings enabled us to classify text more accurately, a product embedding could capture the relationships between products, helping to discover "similar" products for recommendation purposes and serving as input to a model that predicts categories.
-  Zornitsa Kozareva. Everyone likes shopping! multi-class product categorization for e-commerce. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1329–1333, 2015.
-  Hsiang-Fu Yu, Chia-Hua Ho, Prakash Arunachalam, Manas Somaiya, and Chih-Jen Lin. Product title classification versus text classification. 2012.
-  Damir Vandic, Flavius Frasincar, and Uzay Kaymak. A framework for product description classification in e-commerce. J. Web Eng., 17(1&2):1–27, 2018.
-  Ludvig Åberg. Multimodal Classification of Second-Hand E-Commerce Ads. PhD thesis, 2018.
-  Tom Zahavy, Alessandro Magnani, Abhinandan Krishnan, and Shie Mannor. Is a picture worth a thousand words? A deep multi-modal fusion architecture for product classification in e-commerce. CoRR, abs/1611.09534, 2016.
-  Yoon Kim. Convolutional neural networks for sentence classification. CoRR, abs/1408.5882, 2014.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition, 2014.
-  Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. CoRR, abs/1607.01759, 2016.
-  Julian J. McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. Image-based recommendations on styles and substitutes. CoRR, abs/1506.04757, 2015.
-  Fengjiao Lyu, Joseph Lee, and Yaqing Li. Category classification for amazon items using hidden state node embeddings of large graphs. 2017.
-  Yiming Yang and Xin Liu. A re-examination of text categorization methods. In Proceedings of the 22Nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’99, pages 42–49, New York, NY, USA, 1999. ACM.
-  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space, 2013.
-  Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel P. Kuksa. Natural language processing (almost) from scratch. CoRR, abs/1103.0398, 2011.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
-  Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? CoRR, abs/1411.1792, 2014.
-  Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
-  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv, abs/1706.03762, 2017.
-  Artit Wangperawong. Attending to mathematical language with transformers. arXiv, abs/1812.02825, 2019.
-  Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. arXiv, abs/1802.05751, 2018.