Product Classification in E-Commerce using Distributional Semantics

06/20/2016 ∙ by Vivek Gupta, et al. ∙ Indian Institute of Technology Kanpur ∙ Flipkart

Product classification is the task of automatically predicting a taxonomy path for a product in a predefined taxonomy hierarchy, given a textual product description or title. Efficient product classification requires a suitable feature-vector representation for a document (the textual description of a product) and efficient, fast prediction algorithms. To address these challenges, we propose a new distributional semantics representation for document vector formation. We also develop a new two-level ensemble approach utilizing path-wise, node-wise and depth-wise classifiers (with respect to the taxonomy tree) for error reduction in the final product classification. Our experiments show the effectiveness of the distributional representation and the ensemble approach on data sets from a leading e-commerce platform, achieving better results on various evaluation metrics compared to earlier approaches.




1 Introduction

Existing e-commerce platforms have evolved into large B2C and/or C2C marketplaces with inventories of millions of products. Products in e-commerce are generally organized into a hierarchical taxonomy of multilevel hierarchical categories. Product classification is an important task in catalog formation and plays a vital role in customer oriented services like search and recommendation, and seller oriented services like seller utilities on a seller platform. Product classification is a hierarchical classification problem and presents the following challenges: a) a large number of categories have data that is extremely sparse with a skewed long tailed distribution, b) a hierarchical taxonomy imposes constraints on activation of labels: if a child label is active then its parent label must also be active, c) for practical use the prediction should happen in real time, ideally within a few milliseconds.

Traditionally, documents have been represented as a weighted bag-of-words (BoW) or tf-idf feature vector, which contains weighted information about the presence or absence of words in a document by using a fixed length vector. Words that define the semantic content of a document are expected to be given higher weight. While tf-idf and BoW representations perform well for simple multi-class classification tasks, they generally do not do as well for more complex tasks because the BoW representation ignores word ordering and polysemy, is extremely sparse and high dimensional and does not encode word meaning.
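The tf-idf weighting described above can be sketched in a few lines; this toy example uses a plain logarithmic idf without smoothing, and the corpus is invented for illustration:

```python
# Minimal sketch of a tf-idf bag-of-words representation over a toy corpus.
import math
from collections import Counter

docs = [
    "leather wallet for men",
    "mens leather belt",
    "science fiction paperback book",
]

tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

def idf(word):
    # inverse document frequency: rare words get higher weight
    df = sum(1 for doc in tokenized if word in doc)
    return math.log(len(tokenized) / df)

def tfidf_vector(doc):
    # fixed-length vector over the whole vocabulary -> sparse, high dimensional
    tf = Counter(doc)
    return [tf[w] * idf(w) for w in vocab]

vectors = [tfidf_vector(doc) for doc in tokenized]
```

Even on this tiny corpus the vector length equals the vocabulary size, which illustrates why BoW representations become extremely sparse and high dimensional on real catalogs.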

Such disadvantages have motivated continuous, low-dimensional, non-sparse distributional representations. A word is encoded as a vector in a low dimensional vector space, typically of a few hundred dimensions. The vector encodes local context and therefore is sensitive to local word order and captures word meaning to some extent. It relies on the ‘Distributional Hypothesis’ [Harris1954], i.e. “Similar words occur in similar contexts”. Similarity between two words can be calculated via the cosine distance between their vector representations. Le and Mikolov [Le and Mikolov2014] proposed paragraph vectors, which use global context together with local context to represent documents. But paragraph vectors suffer from the following problems: a) current techniques embed paragraph vectors in the same space (dimension) as word vectors, although a paragraph can consist of words belonging to multiple topics (senses), b) current techniques also ignore the importance and distinctiveness of words across documents, assuming all words contribute equally both quantitatively (weight) and qualitatively (meaning).

In this paper we describe a new compositional technique for forming document vectors from semantically enriched word vectors to address the above problems. Further, to capture the importance, weight and distinctiveness of words across documents we use a graded weights approach, inspired by the work of Singh and Mukerjee [Pranjal Singh2015], for our compositional model. We also propose a new two-level approach for product classification which uses an ensemble of classifiers for label paths, node labels and depth-wise labels (with respect to the taxonomy) to decrease classification error. Our new ensemble technique efficiently exploits the catalog hierarchy and achieves improved results in top taxonomy path prediction. We show the effectiveness of the new representation and classification approach for product classification on two e-commerce data sets containing book and non-book descriptions.

2 Related Work

2.1 Distributional Semantic Word Representation

The distributional word embedding method was first introduced by Bengio et al. as the Neural Probabilistic Language Model [Bengio et al.2003]. Later, Mikolov et al. [Mikolov et al.2013a] proposed simple log-linear models which considerably reduced training time: the Word2Vec Continuous Bag-of-Words (CBoW) model and the Skip-Gram with Negative Sampling (SGNS) model. Figure 1 shows the architecture for CBoW (left) and Skip-Gram (right).

Later, GloVe [Jeffrey Pennington2014], a log-bilinear model with a weighted least-squares objective, was proposed; it uses ratios of global word-word co-occurrence statistics in the corpus to train word vectors. The word vectors learned using the skip-gram model are known to encode many linear linguistic regularities and patterns [Levy and Goldberg2014b].

While the above methods look very different, they implicitly factorize a shifted positive point-wise mutual information (PPMI) matrix with tuned hyper-parameters, as shown by Levy and Goldberg [Levy and Goldberg2014c]. Some variants incorporate ordering information in context words to capture syntactic information by replacing the summation of context word vectors with concatenation during training of the CBoW and SGNS models [Wang Ling2015].
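The shifted PPMI matrix mentioned above can be computed directly from co-occurrence counts; the counts below are invented for illustration, and the shift corresponds to log(k) for k negative samples in SGNS:

```python
# Toy illustration of the (shifted) positive PMI matrix that SGNS
# implicitly factorizes, built from invented co-occurrence counts.
import math

# word -> {context word: co-occurrence count}
cooc = {
    "phone": {"case": 8, "charger": 6, "novel": 0},
    "book":  {"case": 1, "charger": 0, "novel": 9},
}
contexts = ["case", "charger", "novel"]

total = sum(c for row in cooc.values() for c in row.values())
word_count = {w: sum(row.values()) for w, row in cooc.items()}
ctx_count = {c: sum(row.get(c, 0) for row in cooc.values()) for c in contexts}

def ppmi(w, c, shift=0.0):
    # PMI = log( P(w,c) / (P(w) P(c)) ), clipped at zero after shifting
    n = cooc[w].get(c, 0)
    if n == 0:
        return 0.0
    pmi = math.log(n * total / (word_count[w] * ctx_count[c]))
    return max(pmi - shift, 0.0)  # shift = log(k) for k negative samples

ppmi_matrix = [[ppmi(w, c) for c in contexts] for w in cooc]
```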

2.2 Distributional Paragraph Representation

Most models for learning distributed representations for long text such as phrases, sentences or documents that try to capture semantic composition do not go beyond a simple weighted average of word vectors. This approach is analogous to a bag-of-words approach and neglects word order while representing documents. Socher et al. [Socher et al.2013] propose a recursive tensor neural network where the dependency parse-tree of the sentence is used to compose word vectors in a bottom-up fashion to represent sentences or phrases. This approach considers syntactic dependencies but cannot go beyond sentences as it depends on parsing.

Le and Mikolov proposed a distributional framework called paragraph vectors, which are trained in a manner similar to word vectors. They proposed two models: the Distributed Memory Model of Paragraph Vectors (PV-DM) and the Distributed Bag of Words paragraph vectors (PV-DBoW) [Le and Mikolov2014]. In PV-DM the model is trained to predict the center word using the context words in a small window together with the paragraph vector. In PV-DBoW the paragraph vector is trained to predict the context words directly. Figure 2 shows the network architecture for PV-DM (left) and PV-DBoW (right).

The paragraph vector presumably represents the global semantic meaning of the paragraph and also incorporates properties of word vectors, i.e. the meanings of the words used. A paragraph vector exhibits a close resemblance to an n-gram model with a large n. This property is crucial because the n-gram model preserves a lot of information in a sentence (and the paragraph) and is sensitive to word order. This model mostly performs better than BoW models, which usually create a very high-dimensional representation leading to poorer generalization.

Figure 1: Neural Network Architecture for CBoW and Skip Gram Model
Figure 2: Neural Network Architecture for Distributed Memory version of Paragraph Vector (PV-DM) and Distributed BoWs version of paragraph vectors (PV-DBoW)

2.3 Problem with Paragraph Vectors

Paragraph vectors obtained from PV-DM and PV-DBoW are shared across context words generated from the same paragraph but not across paragraphs; a word vector, on the other hand, is shared across paragraphs. Paragraph vectors are also represented in the same space (dimension) as word vectors, though a paragraph can contain words belonging to multiple topics (senses). The formulation for paragraph vectors ignores the importance and distinctiveness of a word across documents, i.e. it assumes all words contribute equally both quantitatively (weight wise) and qualitatively (meaning wise). Quantitatively, only binary weights are used: zero weight for stop-words and non-zero weight for other words. Intuitively, one would expect the paragraph vector to be embedded in a larger and enriched space.

2.4 Hierarchical Product Categorization

Most methods for hierarchical classification follow a “gates-and-experts” method with a two-level classifier. The high-level classifier serves as a “gate” to a lower level classifier called the “expert” [Shen et al.2011]. The basic idea is to decompose the problem into two models: the first model is simple and does coarse-grained classification, while the second model is more complex and does more fine-grained classification. The coarse-grained classification deals with a huge number of examples, while the fine-grained distinction is learned within a subtree under every top level category, with better feature generation and classification algorithms, and deals with fewer categories.

Kumar et al. [Kumar et al.2002], proposed an approach that learnt a tree structure over the set of classes. They used a clustering algorithm based on Fisher’s discriminant that clustered training examples into mutually exclusive groups inducing a partitioning on the classes. As a result the prediction by this method is faster but the training process is slow as it involves solving many clustering problems.

Later, Xue et al. [Xue et al.2008] suggested an interesting two stage strategy called “deep classification”. The first stage (search) groups documents in the training set that are similar to a given document. In the second stage (classification) a classifier is trained on these classes and used to classify the document. In this approach a specific classifier is trained for each document making the algorithm computationally inefficient.

For large scale classification Bengio et al. [Bengio et al.2010] use the confusion matrix for estimating class similarity instead of clustering data samples. Two classes are assumed to be similar if they are often confused by a classifier. Spectral clustering, where the edges of the similarity graph are weighted by class confusion probabilities, is used to group similar classes together.

Shen and Ruvini [Shen et al.2012] [Shen et al.2011] extend the previous approach by using a mixture of simple and complex classifiers, rather than spectral clustering methods, for separating confused classes, which yields faster training times. They approximate the similarity of two classes by the probability that the classifier incorrectly predicts one of the categories when the correct label is the other category. Graph algorithms are used to generate connected groups from the estimated confusion probabilities. They represent the relationship among classes using an undirected graph G = (V, E), where the vertex set V is the set of all classes and E is the set of all edges. Two vertices are connected by an edge if their confusion probability is greater than a given threshold [Shen et al.2012].

Other simple approaches like flat classification and top down classification are intractable due to the large number of classes and give poor results due to error propagation as described in [Shen et al.2012].

3 Graded Weighted Bag of Word Vectors

We propose a new method to form a composite document vector using word vectors, i.e. distributional meaning, together with tf-idf, and call it the Graded Weighted Bag of Words Vector (gwBoWV). gwBoWV is inspired by the computer vision literature, where a bag of visual words is used to form feature vectors. gwBoWV is calculated as follows:

  1. Each document is represented in a lower dimensional space of K × d dimensions, where K is the number of semantic clusters and d is the dimension of the word vectors.

  2. Each document vector is also concatenated with K inverse cluster frequency (icf) values, which are calculated using the idf values of the words present in the document.

Idf values from the training corpus are used directly for weighting the test corpus. Word vectors are first separated into a pre-defined number of semantic clusters using a suitable clustering algorithm (e.g. k-means). For each document, we add the word vectors of all words in the document belonging to a cluster to form a cluster vector. We finally concatenate the cluster vector and the icf for each of the K clusters to obtain the document vector. Algorithm 1 describes this in more detail.

Data: Documents D_n, n = 1 … N
Result: Document vectors gwBoWV(D_n), n = 1 … N
1 Train SGNS model on all documents D_n to obtain word vector representations wv_w for every word w;
2 Calculate idf values idf(w) for all words w in the vocabulary;
3 Use the k-means algorithm to cluster all words in the vocabulary, using their word vectors, into K clusters;
4 for n = 1 … N do
5      Initialize cluster vectors cv_k = 0, k = 1 … K;
6      Initialize inverse cluster frequencies icf_k = 0, k = 1 … K;
7      while not at end of document D_n do
8           read the current word w and obtain its word vector wv_w;
9           obtain the cluster index k for wv_w;
10           update the cluster vector: cv_k = cv_k + wv_w;
11           update the cluster frequency: icf_k = icf_k + idf(w);
12           end while
13      obtain gwBoWV(D_n) = cv_1 ⊕ … ⊕ cv_K ⊕ icf_1 ⊕ … ⊕ icf_K;
        /* ⊕ is concatenation */
14      end for
Algorithm 1 Graded Weighted Bag of Word Vectors

Since semantically different vectors are in separate clusters we avoid averaging of semantically different words during Bag of Words Vector formation. Incorporation of idf values captures the weight of each cluster vector which tries to model the importance and distinctiveness of words across documents.
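The composition step of Algorithm 1 can be sketched as follows. Word vectors, cluster assignments and idf values here are invented toy values; in the paper they come from SGNS training, k-means over the learned vectors, and the training corpus respectively:

```python
# Minimal sketch of gwBoWV composition with toy inputs.
import numpy as np

d, K = 4, 2  # word-vector dimension, number of semantic clusters
rng = np.random.default_rng(0)

vocab = ["phone", "charger", "novel", "paperback"]
word_vecs = {w: rng.normal(size=d) for w in vocab}
cluster_of = {"phone": 0, "charger": 0, "novel": 1, "paperback": 1}
idf = {"phone": 1.2, "charger": 2.0, "novel": 1.5, "paperback": 2.3}

def gwbowv(doc_words):
    cluster_vec = np.zeros((K, d))
    icf = np.zeros(K)
    for w in doc_words:
        if w not in word_vecs:
            continue  # out-of-vocabulary word
        k = cluster_of[w]
        cluster_vec[k] += word_vecs[w]   # sum word vectors per cluster
        icf[k] += idf[w]                 # inverse cluster frequency
    # concatenate K cluster vectors with their K icf weights -> K*d + K dims
    return np.concatenate([cluster_vec.ravel(), icf])

doc_vec = gwbowv(["phone", "charger", "novel"])
```

Because words from different clusters accumulate into separate slots, semantically different words are never averaged together, matching the motivation above.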

4 Ensemble of Multitype Predictors

We propose a two level ensemble technique to combine multiple classifiers predicting product paths, node labels and depth-wise labels respectively. We construct an ensemble over multi-type features for categorization, inspired by the recent work of Kozareva from Yahoo Labs [Kozareva2015]. Below are the details of each classifier used at level one:

  • Path-Wise Prediction Classifier: We take each possible path in the catalog taxonomy tree, from leaf node to root node, as a possible class label and train a classifier (C_path) using these labels.

  • Node-Wise Prediction Classifier: We take each possible node in the catalog taxonomy tree as a possible prediction class and train a classifier (C_node) using these class labels.

  • Depth-Wise Node Prediction Classifiers: We train multiple classifiers (C_1, …, C_K), one for each depth level of the taxonomy tree. Each possible node in the catalog taxonomy tree at that depth is a possible class label. All data samples whose path has a node at depth k, plus a 10% sample of data points whose path has no node at depth k (i.e. whose path ends before depth k), are used for training C_k.

We use the output probabilities of these level one classifiers (C_path, C_node, C_1, …, C_K) as a feature vector and train a level two classifier on it after some dimensionality reduction.

The increase in training time can be reduced by training all level one classifiers in parallel. The algorithm for training the ensemble is described in Algorithm 2. The testing algorithm is similar to training and described in supplementary section 3.

Data: Catalog Taxonomy Tree (T) of depth K and training data = (d, y) where d is the product description and y is the taxonomy path label.
Result: Set of level one classifiers C = {C_path, C_node, C_1, …, C_K} and level two classifier C_final.
1 Obtain a gwBoWV feature vector for each product description d;
2 Train the Path-Wise Prediction Classifier (C_path) with the product taxonomy paths as possible classes;
3 Train the Node-Wise Prediction Classifier (C_node) with the nodes in the taxonomy paths as possible classes. Here each description has multiple node labels;
4 for k = 1 … K do
5      Train the Depth-Wise Node Classifier C_k for depth k, with the nodes at depth k as labels;
6      end for
7 Obtain the output probabilities over all classes for each level one classifier, i.e. from C_path, C_node and C_1, …, C_K;
8 Obtain the level two feature vector for each description by concatenating these probability vectors;
  /* ⊕ is the concatenation operation */
9 Reduce the feature dimension using a suitable supervised feature selection technique based on a mutual information criterion;
10 Train the final Path-Wise Prediction Classifier (C_final) using the reduced feature vector, with the product taxonomy paths as possible class labels
Algorithm 2 Training Two Level Boosting Approach
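The wiring of the two levels can be sketched structurally. The classifier objects below are stubs that emit random probability vectors; in the paper they are trained probabilistic classifiers (e.g. random forests), and the taxonomy sizes are invented:

```python
# Structural sketch of the two-level ensemble: level-one classifiers emit
# class-probability vectors, which are concatenated into the feature vector
# for the level-two path classifier.
import numpy as np

class StubClassifier:
    """Stands in for a trained level-one model with predict_proba output."""
    def __init__(self, n_classes, seed):
        self.n_classes = n_classes
        self.rng = np.random.default_rng(seed)
    def predict_proba(self, x):
        p = self.rng.random(self.n_classes)
        return p / p.sum()  # a valid probability distribution over classes

n_paths, n_nodes, depths = 30, 50, [10, 25, 15]  # toy taxonomy sizes
path_clf = StubClassifier(n_paths, 1)
node_clf = StubClassifier(n_nodes, 2)
depth_clfs = [StubClassifier(n, 3 + i) for i, n in enumerate(depths)]

def level_two_features(x):
    parts = [path_clf.predict_proba(x), node_clf.predict_proba(x)]
    parts += [c.predict_proba(x) for c in depth_clfs]
    # dimension = #paths + #nodes + sum of depth-wise label sets;
    # this vector is reduced and fed to the final path classifier
    return np.concatenate(parts)

feat = level_two_features(None)
```

Since each level-one model only contributes a probability vector, they can be trained independently, which is what allows the parallel training mentioned above.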

5 Dataset

Level #Categories %Data Samples
1 21 34.9%
2 278 22.64%
3 1163 25.7%
4 970 12.9%
5 425 3.85%
6 18 0.10%
Table 1: Percentage of Book Data ending at each depth level of the book taxonomy hierarchy which had a maximum depth of 6.

We use seller product descriptions and title samples from a leading e-commerce site for experimentation (this data is proprietary to the e-commerce company). The data set had two product taxonomies: non-book and book. Non-book data is more discriminative, with an average description + title length of around 10 to 15 words, whereas book descriptions have an average length greater than 200 words. To give more importance to the title we generally weight it at three times the description value. The distribution of items over leaf categories (verticals) exhibits high skewness and a heavy tailed nature and suffers from sparseness, as shown in Figure 3. We use random forest and k nearest neighbor as base classifiers as they are less affected by data skewness.
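The 3× title weighting mentioned above can be realized by simple repetition before vectorization; the factor 3 is from the text, but this particular implementation is our assumption:

```python
# Hypothetical sketch: weight the title three times the description by
# repeating it in the text fed to the document-vector pipeline.
def weighted_text(title, description, title_weight=3):
    return " ".join([title] * title_weight + [description])

text = weighted_text(
    "mens leather wallet",
    "durable bifold wallet with coin pocket",
)
```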

We have removed data samples with multiple paths to simplify the problem to single path prediction. Overall, we have 0.16 million training and 0.11 million testing samples for book data and 0.5 million training and 0.25 million testing samples for non-book data. Since the taxonomy evolved over time, not all category nodes are semantically mutually exclusive. Some ambiguous leaf categories are even meta categories. We handle this by giving a unique id to every node in the category tree of the book data. Furthermore, there are also category paths with different categories at the top and similar categories at the leaf nodes, i.e. reduplication of the same path with synonymous labels.

The quality of the descriptions and titles also varies a lot. There are titles and descriptions that do not contain enough information to decide a unique appropriate category. There were labels like Others and General at various depths in the taxonomy tree which carry no specific semantic meaning. Also, descriptions with the special label ‘wrong procurement’ were removed manually for consistency.

Figure 3: Figure shows distribution of items over sub-categories and leaf category (verticals) for non-book dataset
Figure 4: Comparison of prediction accuracy for path prediction using different methods for document vector generation.

6 Results

The classification system is evaluated using the usual precision metric, defined as the fraction of products from the test data for which the classifier predicts the correct taxonomy path. Since there are multiple similar paths in the data set, predicting a single path is not always appropriate. One solution is to predict more than one path, or better, a ranked list of 3 to 6 paths whose predicted label coverage matches the labels in the true path. The ranking is obtained using the confidence score of the predictor. We also calculate the confidence score of the correct prediction path using the (3 to 6) confidence scores of the individual predicted paths. For the purpose of measuring accuracy when more than one path is predicted, the classifier result is counted as correct when the correct class (i.e. the path assigned by the seller) is one of the returned classes (paths). Thus we calculate Top 1, Top 3 and Top 6 prediction accuracy when 1, 3 and 6 paths are predicted respectively.

6.1 Non-Book Data Result

We compare our results with document vectors formed by averaging the word vectors of the words in a document, i.e. Average Word Vectors (AWV); the Distributed Bag of Words version of Paragraph Vectors (PV-DBoW); and a frequency histogram of the word distribution over word clusters, i.e. the Bag of Cluster Vector (BoCV). We keep the classifier (random forest with 20 trees) common for all document vector representations. We compare performance with respect to the number of clusters, word-vector dimension, document vector dimension and vocabulary dimension (tf-idf) for the various models.

Figure 4 shows results for a random forest (20 trees) trained on document vectors produced by the various methods, on 0.2 million training and 0.2 million testing samples with 3589 classes. It compares our approach gwBoWV with the PV-DBoW and PV-DM models with varying word vector dimension and number of clusters. The word vector dimension for gwBoWV and BoCV is 200. Note that AWV, PV-DM and PV-DBoW are independent of the cluster number and have dimension 600. Clearly gwBoWV performs much better than the other methods, especially PV-DBoW and PV-DM.

Table 3 shows the effect of varying the number of clusters on accuracy for Non Book Data, for 0.2 million training and 0.2 million testing samples using 200 dimensional word vectors.

#Cluster CBoW SGNS
10 81.35% 82.12%
20 82.29% 82.74%
50 83.66% 83.92%
65 83.85% 84.07%
80 83.91% 84.06%
100 84.40% 84.80%
Table 3: Result of classification for varying cluster numbers with a fixed word vector size of 200 on Non Book Data for the CBoW and SGNS architectures. #Train Samples = 0.2 million, #Test Samples = 0.2 million

We use the notation given below to define our evaluation metrics for Top K path prediction:

  • T represents the true path for a product description.

  • P_i represents the i-th path predicted by our algorithm, where i ≤ K.

  • nodes(T) represents the set of nodes in the true path T.

  • nodes(P_i) represents the set of nodes in the predicted path P_i, where i ≤ K.

  • p(T) represents the probability predicted by our algorithm for the true path T; p(T) = 0 if T is not among the predicted paths.

  • p(P_i) represents the probability predicted by our algorithm for the predicted path P_i, where i ≤ K.

We use four evaluation metrics to measure performance for the top predictions as described below:

  1. Prob Precision K (PP): PP = p(T) / (p(P_1) + p(P_2) + … + p(P_K)).

  2. Count Precision K (CP): CP = 1 if T ∈ {P_1, …, P_K}, else CP = 0.

  3. Label Recall K (LR): LR = |nodes(T) ∩ (nodes(P_1) ∪ … ∪ nodes(P_K))| / |nodes(T)|. Here |S| represents the number of elements in set S.

  4. Label Correlation K (LC): LC = |nodes(P_1) ∩ … ∩ nodes(P_K)| / |nodes(P_1) ∪ … ∪ nodes(P_K)|.
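The four metrics can be sketched in code for a single product; since the printed formulas are incomplete, the set operations below follow our reading of the definitions and should be treated as an approximation:

```python
# Sketch of the four top-K metrics for one product, with paths written as
# slash-separated node strings (a toy encoding).
def nodes(path):
    return set(path.split("/"))

def metrics(true_path, predicted, probs):
    """predicted: list of K paths; probs: their predicted probabilities."""
    p_true = probs[predicted.index(true_path)] if true_path in predicted else 0.0
    pp = p_true / sum(probs)                        # Prob Precision@K
    cp = 1.0 if true_path in predicted else 0.0     # Count Precision@K
    union = set().union(*(nodes(p) for p in predicted))
    inter = set.intersection(*(nodes(p) for p in predicted))
    lr = len(nodes(true_path) & union) / len(nodes(true_path))  # Label Recall@K
    lc = len(inter) / len(union)                    # Label Correlation@K
    return pp, cp, lr, lc

pp, cp, lr, lc = metrics(
    "books/fiction/thriller",
    ["books/fiction/thriller", "books/fiction/romance"],
    [0.6, 0.3],
)
```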

Table 4 shows the results on all evaluation metrics with varying word vector dimension and number of clusters. Table 5 shows the results of top 6 path prediction for the tf-idf baseline with varying dimension.

#Clus, #Dim %PP %CP %LR %LC
40, 50 82.07 96.43 98.27 34.50
40, 100 83.18 96.67 98.39 34.91
100, 50 82.05 96.40 98.26 34.41
100,100 83.13 96.75 98.42 34.88
Table 4: Result for top 6 paths predicted for multiple Bag of Word Vectors with varying dimension and number of clusters with weighting on Non-Book Data with #Train Samples = 0.50 million, #Test Samples = 0.35 million.
#Dim %PP %CP %LR %LC
2000 81.10 94.04 96.85 35.37
4000 82.74 94.78 97.33 35.61
Table 5: Result of top 6 paths prediction for tfidf with varying dimension on Non Book Data #Train Samples = 0.50 million, #Test Samples = 0.35 million.

6.2 Book Data Result

Book data is harder to classify. There are more cases of improper paths and labels in the taxonomy, and hence we had to do a lot of pre-processing. Around 51% of the books did not have labels at all and 15% of books were given extremely ambiguous labels like ‘general’ and ‘others’. To maintain consistency we prune the above 66% of data samples and work with the remaining 34%, i.e. 0.37 million samples.

To handle improper labels and ambiguity in the taxonomy we use multiple classifiers: one predicting the path (or leaf) label, another predicting node labels, and multiple classifiers, one at each depth level of the taxonomy tree, that predict node labels at that level. For depth-wise node classification we also introduce a ‘none’ label to denote missing labels at a particular level, i.e. for paths that end at earlier levels. However, we only take a random strata sample for this ‘none’ label.

6.3 Ensemble Classification

We use the ensemble of multi-type predictors as described in Section 4 for the final classification. For dimensionality reduction we use feature selection methods based on a mutual information criterion (the ANOVA F-value, i.e. analysis of variance). We obtain improved results on all four evaluation metrics with the new ensemble technique, as shown in Table 6 for Book Data. The list below describes how the first column in Table 6 should be interpreted:

  • tfidf (A-2-C): term frequency and inverse document frequency features with the #A top 1- and 2-gram words and #C random forest trees.

  • path-1 (A-C): path prediction model without ensemble, trained with gwBoWV with #A features (cluster × word vector dimension) using #C trees.

  • Dep (A+B-C): trained with gwBoWV with A features; B is the size of the output probability vectors (#total nodes) over all depths from the level 1 depth classifiers, using #C trees.

  • node (A+B): trained with gwBoWV with A features; B is the size of the output probability vector (#total nodes) from the level 1 node classifier.

  • comb-2 (A): level two combined ensemble classifier with A reduced features (21706 original features).

Method PP CP LR LC
tfidf(4000-2-20) 41.33 75.00 86.86 22.14
tfidf(8000-2-20) 41.39 74.95 86.83 22.16
tfidf(10000-2-20) 41.39 74.96 86.85 22.18
path-1(4100-15) 39.86 74.17 86.37 22.19
path-1(8080-20) 41.08 74.83 86.60 22.19
Dep(5100+2875-20) 41.54 75.34 87.08 22.47
node(4100+2810) 41.54 74.68 86.65 22.34
comb-2(8000) 45.64 77.26 88.86 24.57
comb-2(6000) 46.68 75.74 87.67 25.08
comb-2(10000) 42.82 75.83 87.62 23.08
Table 6: Results from various approaches for Top 6 predictions for Book Data

7 Conclusions

We presented a novel compositional technique using embedded word vectors to form appropriate document vectors. Further, to capture the importance, weight and distinctiveness of words across documents we used a graded weighting approach for composition, based on recent work by Singh and Mukerjee [Pranjal Singh2015], where instead of weighting individual words we weight cluster vectors using inverse cluster frequency. Our document vectors are embedded in a vector space different from the word embedding vector space. This document vector space is higher dimensional and tries to encode the intuition that a document has more topics or senses than a word.

We also developed a new technique which uses an ensemble of multiple classifiers predicting label paths, node labels and depth-wise labels to decrease classification error. We tested our method on data sets from a leading e-commerce platform and showed improved performance compared with competing approaches.

8 Future Work

Instead of using k-means we could use the Chinese Restaurant Process for clustering. Another direction is extending the gwBoWV approach to learn supervised class document vectors that consider the label in some fashion during embedding.


  • [Bengio et al.2003] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155.
  • [Bengio et al.2010] Samy Bengio, Jason Weston, and David Grangier. 2010. Label embedding trees for large multi-class tasks. In Advances in Neural Information Processing Systems 23, pages 163–171. Curran Associates, Inc.
  • [Bi and Kwok2011] Wei Bi and James T. Kwok. 2011. Multi-label classification on tree- and dag-structured hierarchies. In In ICML.
  • [Harris1954] Zellig Harris. 1954. Distributional structure. Word, 10:146–162.
  • [Jeffrey Pennington2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543. ACL.
  • [Kozareva2015] Zornitsa Kozareva. 2015. Everyone likes shopping! multi-class product categorization for e-commerce. Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL, pages 1329–1333.
  • [Kumar et al.2002] Shailesh Kumar, Joydeep Ghosh, and M. Melba Crawford. 2002. Hierarchical fusion of multiple classifiers for hyperspectral data analysis. Pattern Analysis and Applications, 5:210–220.
  • [Le and Mikolov2014] Quoc V Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053.
  • [Levy and Goldberg2014a] Omer Levy and Yoav Goldberg. 2014a. Dependencybased word embeddings. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 2:302–308.
  • [Levy and Goldberg2014b] Omer Levy and Yoav Goldberg. 2014b. Linguistic regularities in sparse and explicit word representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 171–180. Association for Computational Linguistics.
  • [Levy and Goldberg2014c] Omer Levy and Yoav Goldberg. 2014c. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27, pages 2177–2185.
  • [Mikolov et al.2013a] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  • [Mikolov et al.2013b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
  • [Pranjal Singh2015] Pranjal Singh and Amitabha Mukerjee. 2015. Words are not equal: Graded weighting model for building composite document vectors. In Proceedings of the Twelfth International Conference on Natural Language Processing (ICON-2015). BSP Books Pvt. Ltd.
  • [Shen et al.2011] Dan Shen, Jean David Ruvini, Manas Somaiya, and Neel Sundaresan. 2011. Item categorization in the e-commerce domain. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM ’11, pages 1921–1924.
  • [Shen et al.2012] Dan Shen, Jean-David Ruvini, and Badrul Sarwar. 2012. Large-scale item categorization for e-commerce. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM ’12, pages 595–604.
  • [Socher et al.2013] Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the conference on empirical methods in natural language processing (EMNLP), volume 1631, page 1642. Citeseer.
  • [Wang Ling2015] Wang Ling and Chris Dyer. 2015. Two/too simple adaptations of word2vec for syntax problems. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics.
  • [Xue et al.2008] Gui-Rong Xue, Dikan Xing, Qiang Yang, and Yong Yu. 2008. Deep classification in large-scale text hierarchies. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’08, pages 619–626.

9 Supplementary Material

9.1 Label Embedding Based Approach

Apart from tree-based approaches, there are label-embedding approaches for product classification. Bi and Kwok [Bi and Kwok2011] suggested a label-embedding approach that exploits label dependency in tree-structured hierarchies for hierarchical classification. Kernel Dependency Estimation (KDE) is used to first project (embed) the multi-label vector into a smaller number of orthogonal dimensions. An advantage of this approach is that all m learners in the projected space can learn from the full training data; in contrast, in tree-based methods the training data shrinks as we descend towards the leaf nodes.

To preserve label dependencies during prediction, the authors suggest a greedy approach: the problem can be solved using a greedy algorithm called the Condensing Sort and Select Algorithm. However, the algorithm is computationally intensive.

9.1.1 Dependency Based Word Vectors

SGNS and CBoW both use a linear bag-of-words context for training word vectors [Mikolov et al.2013b]. Levy and Goldberg [Levy and Goldberg2014a] suggested using an arbitrary functional context instead, such as the syntactic dependencies obtained from a parse of the sentence. Each word and its modifiers are extracted from the sentence parse. For every sentence, contexts of the form (m, rel) are generated, where rel is the dependency relation between a word and its modifier m, and rel⁻¹ is used to denote the inverse relationship. Figure 5 shows dependency-based contexts for the words in a given sentence.

Figure 5: Example for Dependency-based context extraction

The dependency-based word vectors use the same training method as SGNS. Compared to vectors learned with a linear context, the dependency-based vectors are found to perform better on functional similarity. However, for the task of topical similarity estimation, the linear-context word vectors encode better distributional semantic content.
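The context extraction above can be sketched in a few lines of Python. The dependency parse below is hard-coded for illustration (in practice it would come from a dependency parser), and uses the "Australian scientist discovers star" example common in the dependency-embedding literature; the function names are our own.

```python
# Sketch of dependency-based context extraction (Levy & Goldberg style).
# For each (modifier, head, relation) triple in a parse, the head receives
# the context "modifier/rel" and the modifier receives "head/rel-1",
# where rel-1 denotes the inverse relation.

def dependency_contexts(parse):
    """parse: list of (modifier, head, relation) triples.
    Returns a dict mapping each word to its list of contexts."""
    contexts = {}
    for modifier, head, rel in parse:
        contexts.setdefault(head, []).append(f"{modifier}/{rel}")
        contexts.setdefault(modifier, []).append(f"{head}/{rel}-1")
    return contexts

# Toy parse of "Australian scientist discovers star" (assumed, not from the paper)
parse = [
    ("Australian", "scientist", "amod"),
    ("scientist", "discovers", "nsubj"),
    ("star", "discovers", "dobj"),
]
ctx = dependency_contexts(parse)
```

Note how "scientist" collects both a modifier context ("Australian/amod") and an inverse-relation context from its head ("discovers/nsubj-1"), which is what gives these vectors their functional (syntactic) flavor.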

Data: Catalog taxonomy tree T of depth K and test data (d, y), where d is a product description and y its taxonomy path; set of level-one classifiers C = {C1, C2, ..., Ck} (path-wise, node-wise and depth-wise) and the final level-two classifier Cf
Result: Top-m predicted paths pi, i = 1 ... m, for each test description d
1 Obtain features for each product description d in the test data;
2 Get prediction probabilities from all level-one classifiers to obtain the level-two feature vector φ(d) using Equation 1;
3 Obtain the reduced feature vector φ'(d);
4 Output the top-m paths from the final prediction using the output probabilities of the level-two classifier for description d.
Algorithm 3 Testing Two Level Boosting Approach
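The testing procedure can be sketched as follows. The level-one and level-two classifiers are stubbed here with fixed probability tables (in the paper they are trained models), and the taxonomy paths are invented for illustration.

```python
import numpy as np

# Minimal sketch of the two-level testing procedure (Algorithm 3):
# concatenate level-one prediction probabilities into a level-two feature
# vector, then rank taxonomy paths with the level-two classifier.

def level_one_probs(features, classifiers):
    """Concatenate prediction probabilities of all level-one classifiers
    into the level-two feature vector."""
    return np.concatenate([clf(features) for clf in classifiers])

def top_m_paths(level_two_clf, phi, paths, m=3):
    """Score all taxonomy paths with the level-two classifier and
    return the m highest-scoring paths."""
    scores = level_two_clf(phi)
    order = np.argsort(scores)[::-1][:m]
    return [paths[i] for i in order]

# Stub classifiers over 4 hypothetical taxonomy paths (illustrative only).
paths = ["books/fiction", "books/academic", "mobiles/cases", "mobiles/phones"]
c_path  = lambda x: np.array([0.6, 0.2, 0.1, 0.1])   # path-wise classifier
c_node  = lambda x: np.array([0.5, 0.3, 0.1, 0.1])   # node-wise classifier
c_depth = lambda x: np.array([0.4, 0.4, 0.1, 0.1])   # depth-wise classifier

phi = level_one_probs(None, [c_path, c_node, c_depth])
# Stub level-two classifier: here it simply sums the three probability blocks.
preds = top_m_paths(lambda p: p[:4] + p[4:8] + p[8:], phi, paths, m=2)
```

The actual level-two classifier in the paper is trained on these concatenated probabilities; the block-sum lambda above only stands in for it to keep the sketch self-contained.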

9.2 Example of gwBoWV Approach

  1. Assume there are four clusters c1, c2, c3 and c4, where ck represents the k-th cluster of word vectors.

  2. Let D = (w1, w2, ..., w10) be a document consisting of ten words in order, whose document vector is to be composed from the word vectors wv1, ..., wv10 respectively. Assume, for illustration, the word-cluster assignment for document D given in Table 7.

    Word: w1 w2 w3 w4 w5 w6 w7 w8 w9 w10
    Cluster: c1 c1 c1 c1 c2 c3 c3 c3 c4 c4
    Table 7: Document Word Cluster Assignment
  3. Obtain cluster ck's contribution cvk to document D by summing the word vectors of the words of D assigned to cluster ck:

    • cv1 = wv1 + wv2 + wv3 + wv4

    • cv2 = wv5

    • cv3 = wv6 + wv7 + wv8

    • cv4 = wv9 + wv10

    Similarly, calculate the idf value icvk of each cluster for document D:

    • icv1 = idf(w1) + idf(w2) + idf(w3) + idf(w4)

    • icv2 = idf(w5)

    • icv3 = idf(w6) + idf(w7) + idf(w8)

    • icv4 = idf(w9) + idf(w10)

  4. Concatenate the four cluster vectors to form the Bag of Words Vector of dimension 4d, where d is the word-vector dimension: BoWV(D) = (cv1, cv2, cv3, cv4)

  5. Append the per-cluster idf values to form the graded weighted Bag of Words Vector of dimension 4d + 4: gwBoWV(D) = (cv1, cv2, cv3, cv4, icv1, icv2, icv3, icv4)
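A minimal numpy sketch of this composition follows; the word vectors, cluster assignments and idf values below are synthetic stand-ins chosen only to make the example concrete.

```python
import numpy as np

# Illustrative gwBoWV composition: per-cluster sums of word vectors,
# concatenated with per-cluster idf sums.

def gwbowv(words, wordvecs, cluster_of, idf, num_clusters):
    d = len(next(iter(wordvecs.values())))
    cluster_sums = np.zeros((num_clusters, d))  # cv_k: sum of word vectors in cluster k
    cluster_idf = np.zeros(num_clusters)        # icv_k: sum of idf values in cluster k
    for w in words:
        k = cluster_of[w]
        cluster_sums[k] += wordvecs[w]
        cluster_idf[k] += idf[w]
    # Concatenate: K*d-dimensional BoWV part plus K idf entries.
    return np.concatenate([cluster_sums.ravel(), cluster_idf])

words = [f"w{i}" for i in range(1, 11)]
wordvecs = {w: np.full(2, i, dtype=float) for i, w in enumerate(words, 1)}
cluster_of = dict(zip(words, [0, 0, 0, 0, 1, 2, 2, 2, 3, 3]))  # the 4+1+3+2 split
idf = {w: 1.0 for w in words}
doc_vec = gwbowv(words, wordvecs, cluster_of, idf, num_clusters=4)
```

With four clusters and 2-dimensional toy vectors the result has dimension 4·2 + 4 = 12, matching the 4d + 4 formula above.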


9.3 Quality of Word2Vec Clusters

Below are examples of words contained in some clusters, along with a plausible topic for each cluster, for the book data. Each cluster is formed by clustering word vectors, so the words in a cluster tend to belong to a particular topic. We number the clusters by the distance of their centroid from the origin to avoid confusion.

  1. Cluster #0 contains crime- and punishment-related terms such as "accused, arrest, assault, attempted, beaten, attorney, brutal, confessions, convicted, cops, corrupt, custody, dealer, gang, investigative, gangster, guns, hated, jails, judge, mob, undercover, trail, police, prison, lawyer, torture, witness", etc.

  2. Cluster #10 contains terms related to scientific experiments such as "yield, valid, variance, alternatives, analyses, calculating, comparing, assumptions, criteria, determining, descriptive, evaluation, formulation, experiments, measures, model, parameters, inference, hypothesis", etc.

Similarly, Cluster #13 covers dating and marriage, Cluster #11 tools and tutorials, and Cluster #15 persons. The other clusters likewise represent one or more related topics. The similarity of words within a cluster reflects the effective distributional semantic representation of the word vectors trained with the SGNS model.
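The clustering step behind these topic groups can be illustrated with a tiny k-means run. The word vectors below are synthetic 2-d points (real runs would use trained Word2Vec vectors), and the deterministic initialization is a simplification of standard seeding.

```python
import numpy as np

# Toy illustration of clustering word vectors into topic-like groups
# with a few Lloyd iterations of k-means.

def kmeans(X, init_idx, iters=10):
    """X: (n, d) array of vectors; init_idx: indices of initial centroids.
    Returns the cluster label of each vector."""
    centroids = X[init_idx].copy()
    for _ in range(iters):
        # assign each vector to its nearest centroid (squared distance)
        labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        # recompute centroids as cluster means
        centroids = np.array([X[labels == j].mean(0) for j in range(len(init_idx))])
    return labels

words = ["arrest", "prison", "judge", "hypothesis", "variance", "experiment"]
# synthetic "word vectors": crime terms near (0, 0), science terms near (5, 5)
X = np.array([[0.1, 0.2], [0.0, 0.1], [0.2, 0.0],
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])
labels = kmeans(X, init_idx=[0, 3])
clusters = {j: [w for w, l in zip(words, labels) if l == j] for j in set(labels)}
```

On well-separated vectors like these, the crime terms and the science terms fall into distinct clusters, which is the behavior the cluster listings above rely on.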

9.4 Two Level Classification Approach

We also experimented with a modified version of the two-level classification approach of Shen and Ruvini [Shen et al.2011, Shen et al.2012] described in Section 2.4. However, instead of assigning edge directions randomly and then finding dense sub-graphs via strongly connected components, we determined the edge directions from the misclassifications and used methods such as weakly connected components, bi-connected components and articulation points to find highly connected components. We followed this approach to improve sensitivity and cover missing edges, as discussed in Section 2.4. The confusion probabilities and edge directions are determined by the corresponding entries of the confusion matrix CM.

Data: Set of categories C and threshold τ
Result: Set of dense sub-graphs Gi representing highly connected groups
1 Train a weak classifier H on all possible categories in C;
2 Compute pairwise confusion probabilities Conf(ci, cj) between classes using the values of the confusion matrix (CM); note that Conf(ci, cj) may not equal Conf(cj, ci) owing to the non-symmetric nature of CM;
3 Construct the confusion graph G with the confused categories as vertices (V) and edges (E) with weight wij = Conf(ci, cj);
4 Apply a bi-connected, strongly connected or weakly connected component finding graph algorithm on G to obtain the set of dense sub-graphs Gi.
Algorithm 4 Modified Connected Component Grouping
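A self-contained sketch of this grouping follows. The confusion matrix is a toy example, and plain BFS connected components stand in for the bi-/strongly-/weakly-connected variants discussed in the text.

```python
from collections import defaultdict, deque

# Sketch of Algorithm 4: build a confusion graph from a confusion matrix
# and extract groups of frequently confused categories.

def confused_groups(categories, cm, tau):
    """cm[i][j]: number of class-i examples predicted as class j.
    Add an undirected edge when either direction exceeds tau, then
    return the connected components as category groups."""
    n = len(categories)
    adj = defaultdict(set)
    for i in range(n):
        for j in range(n):
            if i != j and cm[i][j] > tau:
                adj[i].add(j)
                adj[j].add(i)
    seen, groups = set(), []
    for s in range(n):
        if s in seen:
            continue
        comp, queue = [], deque([s])
        seen.add(s)
        while queue:  # BFS over the confusion graph
            u = queue.popleft()
            comp.append(categories[u])
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        groups.append(sorted(comp))
    return groups

cats = ["Hard Disk", "Hard Drive", "Hard Disk Case", "Laptop"]
cm = [[90, 8, 1, 0],   # toy confusion counts, rows = true class
      [9, 85, 2, 0],
      [0, 6, 92, 1],
      [0, 0, 1, 99]]
groups = confused_groups(cats, cm, tau=5)
```

Here the three hard-disk categories end up in one group while "Laptop" stays isolated, mirroring the latent groups discussed in Section 9.5.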

9.5 Confused Category Group Discovery

Figure 6 shows that "Hard Disk, Hard Drive, Hard Disk Case and Hard Drive Enclosure" are misclassified as each other and form a latent group in Computer and Computer Accessories, extracted by finding bi-connected components in the misclassification graph.

Figure 6: Misclassification graph with latent groups in Computer and Computer Accessories. Each edge C1 → C2 represents misclassification of C1 as C2 above the correct-prediction threshold; isolated vertices represent almost correctly predicted classes.

Figure 7 shows the final latent groups discovered (colored bubbles) in the non-book data using the graded weighted Bag of Words Vector method and a random forest classifier, without class balancing, on raw data, with varying thresholds on the number of misclassifications used to drop edges based on edge weight.

Figure 7: Final misclassification graph and latent groups (colored bubbles) discovered during the search phase on multiple categories, with a threshold of 800 (weighted) examples for weakly bi-connected components.
Figure 8: Final misclassification graph and latent groups (colored bubbles) discovered during the search phase on multiple categories, with a threshold of 1000 (weighted) examples and above for weakly bi-connected components. Multiple edges are represented by a single edge for image clarity.

9.6 Data Visualization Tree Maps

We visualize the taxonomy of some main categories using Tree Maps. Figures 9 and 10 show tree maps at various depths of the book taxonomy. It is evident from these maps that the tree exhibits high skewness and a heavy-tailed nature.

Figure 9: Book data visualization using Tree Map at root

Figure 10: Book data visualization using Tree Map at depth 1 for Academic and Professional
Figure 11: Example result on Non Book Data, input description and output labels with top categories
Figure 12: Example result on Non Book Data, input description and output labels with top categories
Figure 13: Results of GloVe vs Word2Vec on the Non-Book Dataset

9.7 More Results for Ensemble Approach

We use kNN and random forest rather than SVM for the initial classification because of their better stability under class imbalance and their better performance in generating a good set of meta-features; SVM also does not perform well with a very large number of classes. Table 8 confirms this empirically.

Algorithm %Accuracy
kNN 74.2%
multiclass svm 77.4%
random forest 79.6%
Table 8: Comparison of flat classification with multiple classifiers on the non-book (Computer) data set, using 95,000 training samples, 9,000 test samples and 460 probable classes with tf-idf features
Data #Class #Train #Test %CP@1
Computer 54 0.1 2.5 93.0%
Computer 54 0.66 3.3 98.5%
Home 49 0.1 2.5 97.2%
8-top cat 460 0.09 0.95 77.3%
8-top cat* 460 0.09 0.95 79.0%
20-top cat 900 0.09 0.95 74.2%
Table 9: Performance of flat classification using the kNN classifier on various sampled non-book categories (cat). *Ensemble, i.e. a level-two classifier trained on the output probabilities of the flat weak level-one classifiers.

We observed an improvement in classification accuracy by using the two-level classification approach of Shen et al. [Shen et al.2011] to discover latent groups and run fine classifiers on them. Table 10 shows the improvement in accuracy due to the level-two classifier.

Algorithm %Accuracy
kNN-SVM 91.0%
kNN 78.0%
Table 10: Coarse-fine level classification results on the highly connected category set {Hard Disk Cases, Hard Drive Enclosures, Internal Hard Disk, External Hard Drive} with 1457 training and 728 test samples

To show that book categories are more easily confused than non-book categories, we performed a small experiment: we sampled all products from Computer and Computer Accessories together with computer-related books, labeled them with a binary (book vs. non-book) label, and compared this binary classifier with a direct classifier over the full set of classes. The results are given in Table 11.

Model #Classes #Accuracy
Binary 2 97%
Non-Binary 700 73%
Table 11: Accuracy drop due to misclassification within book categories, with #Training = 25000 and #Testing = 10000
Red-Dim %PP %CP %LP %LC
1000 40.30 74.45 86.38 22.47
2000 40.88 74.92 86.87 22.56
3000 40.99 75.24 87.06 22.60
4000 41.11 75.24 87.07 22.53
Table 12: Results from reduced gwBoWV vectors for Top 6 path prediction (Original dimension: 8080)
Red-Dim %PP %CP %LP %LC
1000 46.26 72.46 84.84 24.85
2000 47.05 72.26 84.52 25.05
2500 47.70 72.77 84.58 24.81
3000 44.45 73.84 85.83 23.74
Table 13: Results of varying the reduced dimension for the level-one node classifier, where the level-two classifier uses Top 6 prediction

Results from the various approaches for top-3 taxonomy prediction are given in Table 14. Table 13 shows the results of the level-two node classifier on vectors of various reduced dimensions (reduced using ANOVA), where the original vectors were the concatenated node prediction probabilities obtained with gwBoWV. Table 12 shows the results of the level-one classifier on gwBoWV vectors of various reduced dimensions (using ANOVA). The ensemble and gwBoWV perform better than the other approaches.

Method PP CP LP LC
tfidf(4000-2-20) 44.68 71.34 83.36 40.02
tfidf(8000-2-20) 44.70 71.18 83.30 40.08
tfidf(10000-2-20) 44.69 71.21 83.30 40.13
path-1(4100-15) 42.67 70.46 82.78 37.85
path-1(8080-20) 44.48 71.13 83.21 40.26
depth(7975-20) 44.91 71.49 83.52 40.54
node(4100+2810) 44.86 71.04 83.23 40.34
comb-2(8000) 47.62 73.01 85.16 41.69
comb-2(6000) 48.17 72.07 84.36 41.52
comb-2(10000) 45.78 71.85 84.03 40.81
Table 14: Results from various approaches for Top 3 prediction

9.8 Classification Example from Book Data

Description : ignorance is bliss or so hopes antoine the lead character in martin pages stinging satire how i became stupida modern day candide with a darwin award like sensibility a twenty five year old aramaic scholar antoine has had it with being brilliant and deeply self aware in todays culture so tortured is he by the depth of his perception and understanding of himself and the world around him that he vows to renounce his intelligence by any means necessary in order to become stupid enough to be a happy functioning member of society what follows is a dark and hilarious odyssey as antoine tries everything from alcoholism to stock trading in order to lighten the burden of his brain on his soul. how i became stupid. how i became stupid. how i became stupid
Actual Class : books-tree literature and fiction

Predictions, Probability

books-tree literature and fiction literary collections essays 0.1
books-tree reference bibliographies and indexes 0.1
books-tree hobbies and interests travel other books reference 0.1
books-tree children children literature fairy tales and bedtime stories 0.1
books-tree dummy 0.2

Description : harpercollins continues with its commitment to reissue maurice sendaks most beloved works in hardcover by making available again this 1964 reprinting of an original fairytale by frank r stockton as illustrated by the incomparable maurice sendak in the ancient country of orn there lived an old man who was called the beeman because his whole time was spent in the company of bees one day a junior sorcerer stopped at the hut of the beeman the junior sorcerer told the beeman that he has been transformed if you will find out what you have been transformed from i will see that you are made all right again said the sorcerer could it have been a giant or a powerful prince or some gorgeous being whom the magicians or the fairies wish to punish the beeman sets out to discover his original form. the beeman of orn. the beeman of orn. the beeman of orn.
Actual Class : books-tree children knowledge and learning animals books reptiles and amphibians

Predictions, Probability

books-tree children knowledge and learning animals books reptiles and amphibians , 0.28
books-tree children fun and humor, 0.72

Description : a new york times science reporter makes a startling new case that religion has an evolutionary basis for the last 50000 years and probably much longer people have practiced religion yet little attention has been given to the question of whether this universal human behavior might have been implanted in human nature in this original and thought provoking work nicholas wade traces how religion grew to be so essential to early societies in their struggle for survival how an instinct for faith became hardwired into human nature and how it provided an impetus for law and government the faith instinct offers an objective and non polemical exploration of humanity’s quest for spiritual transcendence. the faith instinct how religion evolved and why it endures. the faith instinct how religion evolved and why it endures. the faith instinct how religion evolved and why it endures
Actual Class : books-tree academic texts humanities

Predictions, Probability
books-tree academic texts humanities 0.067
books-tree religion and spirituality new age and occult witchcraft and wicca 0.1
books-tree health and fitness diet and nutrition diets 0.1
books-tree dummy 0.4

Description : behavioral economist and new york times bestselling author of predictably irrational dan ariely returns to offer a much needed take on the irrational decisions that influence our dating lives our workplace experiences and our general behaviour up close and personal in the upside of irrationality behavioral economist dan ariely will explore the many ways in which our behaviour often leads us astray in terms of our romantic relationships our experiences in the workplace and our temptations to cheat blending everyday experience with groundbreaking research ariely explains how expectations emotions social norms and other invisible seemingly illogical forces skew our reasoning abilities among the topics dan explores are what we think will make us happy and what really makes us happy why learning more about people make us like them less how we fall in love with our ideas what motivates us to cheat dan will emphasize the important role that irrationality plays in our daytoday decision making not just in our financial marketplace but in the most hidden aspects of our livesabout the author an ariely is the new york times bestselling author of predictably irrational over the years he has won numerous scientific awards and his work has been featured in leading scholarly journals in psychology economics neuroscience and in a variety of popular media outlets including the new york times the wall street journal the washington post the new yorker scientific american and science. the upside of irrationality. the upside of irrationality. the upside of irrationality
Actual Class : books-tree business, investing and management business economics

Predictions, Probability

books-tree business, investing and management business economics 0.15
books-tree philosophy logic 0.175
books-tree self-help personal growth 0.21
books-tree academic texts mathematics 0.465