HeteGCN: Heterogeneous Graph Convolutional Networks for Text Classification

08/19/2020 ∙ by Rahul Ragesh, et al. ∙ Microsoft 0

We consider the problem of learning efficient and inductive graph convolutional networks for text classification with a large number of examples and features. Existing state-of-the-art graph embedding based methods such as predictive text embedding (PTE) and TextGCN have shortcomings in terms of predictive performance, scalability and inductive capability. To address these limitations, we propose a heterogeneous graph convolutional network (HeteGCN) modeling approach that unites the best aspects of PTE and TextGCN together. The main idea is to learn feature embeddings and derive document embeddings using a HeteGCN architecture with different graphs used across layers. We simplify TextGCN by dissecting into several HeteGCN models which (a) helps to study the usefulness of individual models and (b) offers flexibility in fusing learned embeddings from different models. In effect, the number of model parameters is reduced significantly, enabling faster training and improving performance in small labeled training set scenario. Our detailed experimental studies demonstrate the efficacy of the proposed approach.



There are no comments yet.


This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. Introduction

Text classification has been an important class of machine learning problems for several decades with challenges arising from different dimensions including a large number of documents, features and labels, label sparsity, availability of unlabeled data and side information, training and inference speed, type of classification problem (binary, multi-class, multi-label). Many statistical models 

(Rennie et al., 2003; Joachims, 2002; Cesa-Bianchi et al., 2006; Aggarwal and Zhai, 2012) and machine learning techniques (Hsieh et al., 2008; Selvaraj et al., 2011; Keerthi et al., 2008; Liu et al., 2017) have been proposed to address these challenges.

Traditional document classification approaches use Bag-of-Words (BoW) sparse representation of documents (Rennie et al., 2003) and contributed towards designing models for binary, multi-class and multi-label classification problems (Joachims, 2002; Cesa-Bianchi et al., 2006; Selvaraj et al., 2011; Liu et al., 2017) and speeding-up learning algorithms (Hsieh et al., 2008)

for large scale applications. Semi-supervised learning methods 

(Joachims, 2002; Blum and Mitchell, 1998) became an important area of research with the availability of an extensive collection of unlabeled data (e.g., web pages). Furthermore, web pages and publications brought in rich information through link graphs enabling the use of auxiliary information. Development of new graph-based learning methodologies (Lu and Getoor, 2003; Sindhwani et al., 2005)

started with the goal of improving classifier model performance.

For nearly a decade, there has been a surge in the development of learning representation of text data using distributed representation models 

(Mikolov et al., 2013; Pennington et al., 2014; Le and Mikolov, 2014)

and deep learning models (e.g., convolutional and recurrent neural networks). See 

(Minaee et al., 2020) for a comprehensive review. These models achieve superior performance compared to traditional models and exploit large volume of available unlabeled data to learn word, sentence and document embeddings. The key idea is to learn better representation (embedding) for documents that help to get significantly improved performance using even off-the-shelf linear classifier models. Recently, there has been tremendous progress in learning embeddings for entities in relational graphs (Perozzi et al., 2014; Dong et al., 2017; Tang et al., 2015b, a) and designing graph convolutional and neural network models (Kipf and Welling, 2017) that exploit rich relational information present among entities in the graph. Many variants of GCNs and GNNs (Kipf and Welling, 2017; Yao et al., 2019; He et al., 2020; Shi et al., 2019; Wang et al., 2019) have been proposed and explored for solving classification and recommendation problems.

Our interest lies in learning text classifier models that make use of underlying relations among words and documents. In Predictive Text Embedding (Tang et al., 2015a)

modeling approach, a document corpus is viewed as a heterogeneous graph that encodes relationship among words, documents and labels. The method learns embeddings for words using unsupervised learning techniques using a large volume of unlabeled data and some labeled data. It derives document embedding using learned word embeddings and learns a simple linear classifier model for class prediction. Like

PTE, TextGCN (Yao et al., 2019) uses a heterogeneous graph but learns a text classifier model with a graph convolutional network and it outperforms many popular neural network models (Joulin et al., 2017; Liu et al., 2016) and PTEon several benchmark datasets. PTE’s modelling approach is simple, efficient and inductive, but it suffers in performance compared to the complex, slow and transductive TextGCN. Therefore, we focus on designing a text classification modeling approach that unites the best aspects of PTEand TextGCNtogether.

A key contribution of this work is a proposal to compose different heterogeneous graph convolution networks (HeteGCN) with individual graphs used in TextGCN. Unlike traditional GCNand TextGCN, HeteGCNmakes use of compatible graphs across layers. They are simple and efficient as the number of model parameters is reduced significantly, enabling faster training and better generalization performance when the number of labeled training examples are small. Our HeteGCNmodeling approach helps to the understand effectiveness of different HeteGCNmodel variants and their usefulness in different application scenarios (e.g., availability of auxiliary information, compute and storage constrained settings). HeteGCNoffers flexibility in designing networks with fusion and shared learning capabilities.

Following PTE, we suggest a simple idea of using learned feature embeddings from both TextGCNand HeteGCNfor inductive inference. Our work also raises a few research questions from inductive inference perspective. Further, LightGCNwork talks about simplifying GCNfor recommendation application (He et al., 2020). We show how LightGCNcan be used for the text classification task as a competitive baseline for TextGCN.

We conduct a detailed experimental study on several benchmark datasets and compare HeteGCNwith many state-of-the-art methods. HeteGCNoutperforms these methods when the number of labeled training examples is small and gives as high as improvement over TextGCNand PTEfor several datasets. It provides competitive performance in transductive large labeled training data scenario. In inductive setting, we find that the idea of using only learned feature embeddings is quite effective and HeteGCNachieves significantly improved performance with lifts on several datasets compared to TextGCNand PTE. Training time comparison shows that HeteGCNis faster than TextGCNby times. Overall, we demonstrate how HeteGCNunites the best aspects of PTEand TextGCNby offering a high-performance text classification solution with lower model complexity, faster training and inductive capabilities.

The paper is organized as follows. We present notation, problem formulation and background for graph embedding methods in Sections 2 & 3, our approach in Section 4, experimental details and results in Section 5 followed by related work in Section 6. The paper ends with discussion on investigations and possible extensions in Section 7 followed by conclusion.

2. Notation and Problem Formulation

We introduce notation used throughout the paper. We use to denote

-dimensional feature vector of

example, to denote the corresponding binary representation of target class and to denote the number of classes. We use to denote the document-feature matrix of the entire corpus (where each row represents the feature vector of an example) and to denote the set of labeled training examples. We use to denote a feature-feature relation matrix and to denote a document-document relation matrix. The relation matrices are either explicitly provided or implicitly constructed from . We view each matrix as a graph (and vice-versa) with rows and columns representing nodes. For example, is a graph connecting document nodes and feature nodes with edge weights that may be set to TF-IDF scores. We use () to denote the document embedding of the document and to denote the embedding of the feature (). The corresponding embedding matrices are denoted as and respectively. Finally, and denote the

dimensional identity matrix and the network model weights respectively. We use

to denote the class probability distribution of documents over

classes. The terms document and example are used interchangeably. Similarly, we use feature and word interchangeably. In some privacy sensitive applications, only hashed feature identifiers are available and pre-trained word embeddings (e.g.,Word2Vec) cannot be used.

Problem Formulation. We are given a document corpus ( with a set of labeled and unlabeled examples. Additionally, we may have access to and . Our goal is to learn feature embeddings () using simple and efficient graph convolutional networks and learn an inductive classifier model that makes use of document embeddings () computed using to predict the target class.

3. Background and Motivation

Predictive Text Embedding (PTE). Tang et al. (Tang et al., 2015a) proposed a semi-supervised embedding learning method that makes use of both labeled and unlabeled documents. They construct a heterogeneous text graph which is a composition of three graphs, document-word (), word-word () and word-label (). Note that and are available for the entire corpus and is constructed using only labeled training examples. The core idea is to learn shared word embeddings () by leveraging labeled information available through jointly with and . The embedding of any document is computed as an average of the embeddings of words present in the document. Finally, a linear (Softmax

) classifier model is learned by minimizing cross-entropy loss function with labeled training examples:


Note that the embeddings and classifier model are learned sequentially. PTEis efficient and has the inductive capability.

Graph Convolutional Network . Kipf et al. (Kipf and Welling, 2017)

proposed a semi-supervised learning method using graph convolutional networks. A graph convolutional network is a multi-layer network and each layer takes node embeddings (

of the layer as input and produces either embeddings ( for the

layer or predictions for a given task (e.g., target class probability distribution) at the final layer. Formally,

where and denote the embedding and weight matrices of the layer respectively. is a transformation function (e.g., ReLU or Softmax), and is the input feature matrix (e.g., ) or 1-hot encoding representation of nodes (i.e., ). denotes an adjacency matrix and is fixed across layers. Like PTE, the model weights are learned using only labeled training examples (see 1). However, GCNis only transductive, as embeddings for unseen documents cannot be computed during inference.

TextGCN. Yao et al.(Yao et al., 2019) proposed a text graph convolutional network (TextGCN) method by using the matrix:


in GCN across layers. Note that it makes use of both word-word and document-word relational graphs excluding (which may not be explicitly available). With a two-layer network, document-document relations are inferred and used along with word embeddings to obtain document embeddings. TextGCN (Yao et al., 2019) achieved significantly improved accuracy performance over PTE. TextGCNdoes not make use of , and the gain mainly comes from using GCN. Nevertheless, TextGCNhas several limitations which we explain next.

Consider the first layer output of TextGCN. With as given in 2, the feature and document embedding matrices are given by:


where 1-hot representation is used for features and documents. The subscripts in and denote embedding type in the first (aka input) layer. In other layers, we use the subscript in the weight matrix (e.g., ) to match the type of multiplicand (e.g., document embedding, ).

We note that the number of model parameters is dependent on the number of features () and documents (). This implies TextGCNis not suitable for large scale applications when is very large with . Furthermore, when the number of labeled training examples is small, it is difficult to learn a large number of model parameters reliably, resulting in poor generalization. Next, learning document embedding using (2) makes TextGCNtransductive.

Motivated by the success of PTE and TextGCN in semi-supervised text classification, yet recognizing their limitations, we focus on designing a graph convolutional network-based approach that brings the best capabilities of PTE (i.e., efficient training and inductive) and TextGCN (i.e., superior performance) together.

4. Proposed Approach

We present our main idea of constructing a novel heterogeneous graph convolution network (HeteGCN) variants with individual graphs used in TextGCN. Decomposition of TextGCNinto different HeteGCNmodels helps to understand the usefulness and importance of each model. HeteGCNmodeling approach offers flexibility in fusing embeddings from different HeteGCNmodels with or without layer sharing possibilities. Furthermore, taking a cue from , we suggest a simple method to make inference on unseen documents. Finally, we explain how HeteGCNand TextGCNcan be simplified by removing intermediate non-linearity and transformations, as suggested in (Wu et al., 2019) and (He et al., 2020).

4.1. Heterogeneous Graph Convolutional Network (HeteGCN)

We start with the observation that feature embeddings are computed using both word-word ( and word-document () matrices (see in Equation 3) in TextGCN. This happens mainly because of how (see Equation 2) is used in GCN. Our proposal is to consider individual matrices , and used in separately, decompose the embedding computation operations and fuse the embeddings from different layer outputs if required. We illustrate our main idea of composing different HeteGCNmodels in Figure  1 using the following set of GCNlayers:

  • - ReLU GCN: This layer has feature embeddings as both input and output. For example, with in the first layer and 1-hot encoding for feature embeddings, we have: with the subscript indicating the graph used and .

  • - ReLU GCN: This layer takes document embeddings as input and produces feature embeddings as output. We use the prefix to denote the transpose operation in . An example is:

  • - ReLU GCN: This layer takes feature embeddings as input and produces document embeddings as output. In this case, we may have the second layer as: .

  • - ReLU GCN: This layer has document embeddings as both input and output. With in the first layer and 1-hot encoding for document embeddings, we have: .

  • - Softmax GCN: This layer takes feature embeddings as input and produces probability distribution of documents over classes as output: .

  • - Softmax GCN: This layer takes document embeddings as input and produces the output: .

The basic Softmax layer produces as output. Note that neither PTEnor TextGCNmakes use of , while we allow the possibility of using or any other graph (e.g., the word-label relational graph used in PTE) when available. We define a homogeneous layer as a layer that consumes and produces embeddings of the same entity type (e.g., features or documents). Similarly, a heterogeneous layer consumes embeddings of one entity type (e.g., feature) and produces embeddings of another entity type (e.g., document). Thus, - ReLU GCN is a homogeneous layer and - ReLU GCN is a heterogeneous layer.

Figure 1. HeteGCN Architecture. (a),(b),(c) and (d) show networks starting from F, X, TX and N matrices, producing document embeddings, used to predict class probabilities. (e) and (f) show two possibilities of fusing feature and document embeddings coming from two different but compatible networks. and can be 1-hot representations of features and documents.

4.2. HeteGCNModels, Complexity and Implications

Given the basic homogeneous or heterogeneous layers, the idea is to compose different HeteGCNmodels using compatible layers. We illustrate four HeteGCNmodels using Figure 1. In the top row (Figure 1(a)), we have a HeteGCN() model where and are the graphs used in the first and second layer. The feature embeddings (1-hot representation at the input) and Softmax classifier model weights are learned using Equation 1. Unlike TextGCN, we only learn feature embeddings. Therefore, the model complexity is with and denoting the feature embedding dimension and number of classes. Similarly, we have a HeteGCN() model (Figure 1(c)) where and are the graphs used in consecutive layers and the document embeddings are learned and the model complexity is . HeteGCNis heterogeneous in the sense of using different graphs across layers. This is different from traditional GCNand TextGCN. However, two consecutive layers have to be compatible in terms of output-input relation. For example, a - ReLU GCN layer cannot be followed by a - ReLU GCN. This is because the first layer produces feature embeddings as outputs and the second layer consumes document embeddings as inputs.

We observe that the model size plays an important role from two perspectives. (1) It affects the training time and tuning with a large number of hyper-parameter configurations becomes difficult. (2) When the number of labeled examples is small, it is preferable to learn models with lesser number of parameters, so that good generalization performance is achievable. We note that TextGCNlearns both feature embeddings and document embeddings. Therefore, its model complexity is where is the embedding dimension, assuming the same dimension for features and documents. Therefore, we may expect HeteGCNmodels explained above to take lesser time to train and generalize well for small labeled training set scenario.

Besides model complexity, the generalization performance of each HeteGCNnetwork is dependent on the information available. For example, HeteGCN() has rich side information (knowledge) available through and may help to get significantly improved performance over HeteGCN() when the number of labeled examples are small. The main reason why we may expect such performance difference is that the quality of learned feature embeddings in the former model is highly likely to be better due to feature co-occurrence knowledge. On the other hand, this knowledge has to be implicitly learned by the first two layers to derive feature embeddings in the latter mode; with a limited number of labeled examples. Similarly, HeteGCN() may perform better than HeteGCN() when the quality of is better than .

It is important to note that it may not be possible to experiment with models such as HeteGCN() due to several practical reasons. While this model may prove to be very effective, it may not always be feasible because it assumes that is accessible or can be explicitly computed and stored. However, may not be accessible or cannot be pre-computed and stored when is very large. Similarly, it may not be feasible to pre-compute and store when is very large. In such situations, we are constrained to use only models such as HeteGCN() or HeteGCN() with only available .

Finally, we also note that our proposed approach is flexible. For example, we can learn a composite HeteGCNmodel by fusing feature embedding outputs at the layers and of two HeteGCN() and HeteGCN() models and passing the fused embedding to an X - Softmax GCNas shown in the middle row (Figure 1(e)). This composite HeteGCNmodel is equivalent to performing nonlinear operation on each of the terms in (see Equation 3) separately and passing to the classifier model layer. Another possibility is to compose a composite model where the output embedding of one layer (e.g., X - ReLU GCN) is passed to two subsequent layers (e.g., TX - ReLU GCN and N - ReLU GCN) layers enabling shared learning of the first layer.

4.3. Inductive Inference

We observe that PTEis inductive because the feature embeddings are pre-computed and used during inference on unseen documents using a Softmax GCNlayer. Building upon this, our simple idea is to store the feature embeddings available at the appropriate layer outputs and make inference like PTE. For example, we store feature embeddings at the layer for HeteGCN(). Similarly, we store embeddings at the and layers for HeteGCN() and HeteGCN() models respectively. It is easy to see that the same idea can be used with TextGCNby storing . Note that any out-of-vocabulary features found in the test data will be ignored as the corresponding embeddings are not available. Experimental results in Section 4 show that this simple idea works reasonably well in practice. However, designing a good inductive method for small labeled set scenario is still a challenge when feature embeddings in the unseen data cannot be learned or are not explicitly available from other data sources (e.g., GloVe (Pennington et al., 2014) or Word2Vec (Mikolov et al., 2013)).

We note that the above method does not make TextGCNinductive in a strict sense because the way document embedding is inferred during testing and training is different, as no 1-hot document representations are available for test documents – the same holds for HeteGCN(). Conversely, the scenario is quite different for HeteGCNmodels with the first layer learning only the feature embeddings (e.g., and ). For these paths, it is possible to update and as learned model weights are used for inference on new documents. and may require updates because IDF factors or feature co-occurrence based may change. However, the quality of inference with updated matrices will be dependent on the degree of changes and the sensitivity of the learned HeteGCNmodel with respect to the changes. More analysis and investigations are needed and are left as future work.

4.4. Simplifying HeteGCNand C-Light Gcn(C-LightGCN) for classification

There has been some effort (Wu et al., 2019; He et al., 2020) to simplify GCNmodel by understanding the usefulness of feature and nonlinear transformations. (Wu et al., 2019) presented some evidence that feature smoothing (i.e., ) is the most important function in GCNand performance degradation due to removal of nonlinear activation is not much. These experiments were conducted using several benchmark datasets for a classification task. The simplified GCNclassifier model turns out to be linear with a smoothed feature matrix () fed as input where denotes the number of GCNlayers. The same idea can be used to simplify all HeteGCNmodels (i.e., removing non-linearity).

Using a similar idea, LightGCN (He et al., 2020) proposed to learn embeddings for users and items (starting with 1-hot representation) for recommendation problems and demonstrated that dropping both nonlinear and feature transformation operations does not degrade performance much. The resultant model is linear and a weighted linear combiner model was suggested to fuse embeddings of different layers (i.e., ) where and the weights are tuned by treating them as hyper-parameters. Like TextGCN, LightGCNuses (see Equation 2) with representing the user-item interaction matrix but without . One simple idea is to extend LightGCNfor the text classification problem by using only off-diagonal block matrices (i.e., ) in (ref. Equation 2). It is straight-forward to derive expression for . It turns out that each document embedding is dimensional with contributions coming from both document-document and document-feature relations. The C-LightGCNclassification model is interesting and useful because it can be interpreted as a simplified TextGCNmodel and compared against simplified HeteGCNmodels.

5. Experiments

We conducted several experiments to bring out different aspects of the proposed approach in comparison with the baselines and state-of-the-art graph embedding methods and answer several research questions related to network architecture.

5.1. Experimental Setup

5.1.1. Datasets

We consider five benchmark datasets for our experiments. 20NG consists of long text documents, categorized into 20 newsgroups. MR consists of movie reviews classified into positive and negative sentiments. R8 and R52 consist of documents that appear on Reuters-21578 newswire grouped into 8 and 52 categories, respectively. Ohsumed consists of medical abstracts, which are categorized into 23 cardiovascular diseases.

5.1.2. Dataset Preparation

For consistency, we prepare the datasets in an identical setup as described in TextGCN(Yao et al., 2019). We leveraged the code111https://github.com/yao8839836/text_gcn provided by the authors to prepare the dataset. Each of the raw text is cleaned and tokenized. Stop words and low-frequency words (less than 5 occurrences) are removed in 20NG, R8, R52 and Ohsumed except for MR, as they are short text. The statistics of the pre-processed datasets are detailed in Table 1.

Large Labeled Data. We use the standard train/test split. 10% of the train documents were randomly sampled to form the val set.

Small Labeled Data. We do a stratified sampling (1%, 5%, 10% and 20%) of the above training documents to form small labeled sets. Additionally, we enforce that smaller labeled training documents are included in the higher labeled training set for consistency. This is repeated 5 times to create 5 splits for each label percent. We use the val/test split as above.

Dataset Words Docs Train Docs Test Docs Classes Avg. Length
20NG 42,757 18,846 11,314 7,532 20 221.26
MR 18,764 10,662 7,108 3,554 2 20.39
R8 7,688 7,674 5,485 2,189 8 65.72
R52 8,892 9,100 6,532 2,568 52 69.82
Ohsumed 14,157 7,400 3,357 4,043 23 135.82
Table 1. Dataset Statistics (Yao et al., 2019)

5.1.3. Graph Construction

is a word-word Pointwise Mutual Information (PMI) graph. is a document-word Term Frequency-Inverse Document Frequency (TF-IDF) graph. Refer (Yao et al., 2019) for further details. is a document-document nearest neighbor graph constructed from

with top 25 neighbours obtained using cosine similarity score. PMI and IDF are computed over all the documents in the transductive setting while only train documents are used to estimate in the inductive setting. Unseen words are removed from the validation and test documents.

5.1.4. Methods of comparison

We compared the methods given below and all models were trained by minimizing the cross-entropy loss function (1). Methods that learn embeddings are set to learn embeddings of size 200 for consistency.


: We trained Logistic Regression classifier on the

TF-IDF transformed word vectors. We tuned hyper-parameter (- regularization) over [1e-5, 1e5] in powers of 10 on the validation set.

PTE: PTE (Tang et al., 2015a) learns words’ embeddings, generates text embedding from word embeddings and then utilizes them to train a logistic regression model. We used the code provided by the authors222https://github.com/mnqu/PTE to learn the embeddings with following parameters: window = 5, min count = 0, and negative samples = 5. A logistic regression model was trained with these embeddings using LibLinear (Fan et al., 2008) where the regularization parameter was swept over [1e-4, 1e4] in powers of 10. The parameter was tuned on the validation set.

TextGCN: We used the code provided by the authors333https://github.com/yao8839836/text_gcn to set up the experiments. Apart from the best configuration suggested by the authors, we swept learning rate over [1e-1, 1e-3] in powers of 10, weight decay over [1e-2, 1e-4] in logarithmic steps and dropout over [0, 0.75] in steps of 0.25.

C-LightGCN: LightGCN (He et al., 2020) adapted to classification problems (Section 4.4) is trained with the aggregation of adjacency matrices , , and . The relevant rows of aggregated adjacency matrix are treated as document features, and a Logistic classifier is trained from them. The hyper-parameters , , , of LightGCN (for aggregating adjacency matrices) and the regularization hyper-parameter of Logistic classifier are tuned using the validation set.

GCN: We run GCN (Kipf and Welling, 2017) on our datasets using as the adjacency graph and as input features. GCNis equivalent to HeteGCN() with as input. We used the code provided by the authors444https://github.com/tkipf/gcn to run our experiments. We swept over hyper-parameters ranges as suggested in (Shchur et al., 2018) (except for embedding dimension). Additionally, graph normalization was also treated as a hyper-parameter with raw graph, row normalization and symmetric normalization as options.

HeteGCNModels: We consider , , and sequences of layers in our experiments. All HeteGCN

models were trained for a maximum of 300 epochs using Adam optimizer and training stops when there is no increase in validation accuracy for consecutive 30 epochs. Learning rate was decayed by 0.99 after every 50 epochs. Learning rate was swept over [1e-2, 1e-4], weight decay over [0, 1e2], embedding regularization over [0, 1e2] in logarithmic steps and dropout over [0, 0.75] in steps of 0.25. Graph normalization was treated as a hyper-parameter as done for


5.1.5. Evaluation Metrics

We evaluated the performance of all classifier models using Micro-F1 and Macro-F1 scores (Manning et al., 2008). We use model accuracy (Micro-F1) evaluated on a held-out validation set to select the best model from various hyper-parameter configurations.

5.2. Large Labeled Data Scenario

We present results obtained from our experiments on five benchmark datasets in Table 2. We observe that thorough tuning of hyper-parameters gives us better performance for LR, PTEand TextGCNmodels than compared to performance reported in (Yao et al., 2019). For nonlinear models (i.e., GCN, TextGCNand HeteGCN), we repeated experiments for different seeds and report average performance. In (Yao et al., 2019), TextGCNgave competitive or better performance compared to many models. For this reason, we only compare the performance of our models with TextGCN. The proposed HeteGCN() model achieves similar or slightly better performance compared to TextGCNon all datasets, suggesting more complex TextGCNis unnecessary. The HeteGCN() model gives only slightly inferior performance on four datasets (20NG, MR, R8 and R52) compared to TextGCN. The main reason is that models other than HeteGCN() do not have direct access to the word-word relational information and this information is learned only through labeled training examples. HeteGCN() and models give similar performance and their inferior performance is due to the quality of . We found the document - document matrix to be noisy in the sense that more documents belonging to different classes are connected. Finally, LightGCNperforms reasonably well on 20NG, MR and R8 but is inferior to TextGCNand we believe this is due to over-smoothing with higher powers of and a lack of nonlinear transformation. We found that LightGCNis sensitive to hyper-parameter tuning. We also conducted experiments with simplified HeteGCNmodels (i.e., removing non-linearity) and observed performance closer to that of HeteGCNmodels.

Dataset Method Micro F1 Macro F1
76.32 (0.15)
85.86 (0.13)
87.15 (0.15)
84.39 (0.15)
84.18 (0.07)
76.22 (0.14)
76.07 (0.14)
85.30 (0.12)
86.59 (0.16)
83.86 (0.13)
83.71 (0.06)
75.99 (0.13)
74.13 (1.29)
77.03 (0.10)
76.71 (0.33)
75.48 (0.30)
75.21 (0.37)
74.23 (0.82)
74.12 (1.30)
77.02 (0.10)
76.71 (0.33)
75.46 (0.32)
75.20 (0.37)
74.20 (0.84)
94.10 (0.57)
96.65 (0.21)
97.24 (0.51)
97.09 (0.15)
96.98 (0.21)
94.39 (0.18)
85.29 (0.83)
87.39 (1.71)
92.95 (2.01)
90.99 (1.08)
91.84 (1.46)
86.14 (0.28)
90.41 (0.54)
93.80 (0.09)
94.35 (0.25)
92.05 (0.41)
92.87 (0.60)
91.93 (0.11)
61.77 (2.07)
68.62 (0.84)
68.42 (1.76)
52.75 (2.41)
57.69 (5.67)
64.61 (2.05)
62.23 (0.32)
68.11 (0.19)
68.11 (0.70)
61.44 (1.31)
66.14 (0.45)
62.46 (0.20)
54.87 (0.44)
60.61 (0.22)
60.62 (1.54)
48.23 (4.17)
58.52 (2.31)
54.59 (0.36)
Table 2.

Test Micro and Macro F1 scores on the text classification task in the large labeled setting. Models with random initializations were run with 5 different seeds, and standard deviations are reported for them in bracket wherever applicable.

5.3. Small Labeled Training Data Scenario

We report our results from small labeled training data scenario specific experiment in Figure 2. We see that HeteGCN() performs significantly better than TextGCNand other models in almost all datasets and varying percentage of labeled data. The performance gains are in the range of . The superior performance of HeteGCN() can be attributed to: (1) learning with lesser number of model parameters, (2) using

information and (3) neighborhood aggregation and non-linear transformation advantages of

GCN. We find that the performance gap reduces as the percentage labeled data increases in several cases. GCNand LightGCNperform better compared to PTEon MR and R8. performed reasonably well only in 20NG. Note that we show only HeteGCN() results and do not include results obtained from other HeteGCNmodels to make Figure 2 clutter-free. However, the observations made on other HeteGCNmodels in Table 2 with respect to rest of the models (i.e., TextGCN, PTE, etc.) nearly hold even in this scenario. We also observed that the performance of HeteGCN() model is close to (within ) when the percentage of labeled examples is very small (i.e., and . Thus, HeteGCN() model is a useful alternative to HeteGCN() when (1) there are memory and compute constraints related to and (2) we have very small labeled set. We also conducted experiments with HeteGCN() model with PTEembeddings fed as input. This model has lower model complexity () and gave performance lift of on 20NG and MR datasets for and cases.

Figure 2. Test Micro-F1 on the text classification task plotted by varying training data sizes. The training data sizes are varied in steps of 1%, 5%, 10% and 20%.

5.4. Inductive Experimental Study

We conducted this experiment on large labeled data and performed inference using the inductive inference method explained in Section 4.3. Results in Table 3 show that this method is effective and useful even for TextGCN. We find that HeteGCN() generalizes significantly better achieving () lifts on all datasets (except MR) compared to TextGCNand outperforms PTEon all datasets. We find that HeteGCN() gives improvement on all datasets over TextGCNexcept 20NG and is a useful alternative to HeteGCN(), as explained earlier. PTEgives surprisingly much lower performance than LR for e.g. in MR. We observed that as we increased the embedding dimension size in PTE, it tends to reach towards LR performance, suggesting that direct factorization of co-occurrence matrix may lead to a loss in information and that convolutional models are better from that perspective.

Dataset Method Micro F1 Macro F1
81.61 (0.09)
80.88 (0.54)
84.59 (0.14)
79.83 (0.54)
80.91 (0.08)
80.47 (0.48)
83.95 (0.12)
79.14 (0.57)
68.78 (0.18)
74.60 (0.43)
75.62 (0.26)
75.15 (0.31)
68.76 (0.18)
74.55 (0.48)
75.62 (0.26)
75.15 (0.31)
92.73 (0.14)
94.00 (0.40)
97.17 (0.33)
96.32 (0.60)
82.94 (0.45)
78.29 (0.59)
92.33 (0.86)
89.32 (1.08)
87.97 (0.12)
89.39 (0.38)
93.89 (0.45)
91.39 (0.42)
53.10 (0.65)
47.30 (2.09)
66.53 (4.06)
51.31 (2.57)
56.93 (0.12)
56.32 (1.36)
63.79 (0.80)
59.12 (1.46)
47.51 (0.45)
36.74 (1.50)
50.17 (2.33)
41.76 (3.01)
Table 3. Test Micro and Macro F1 scores on the text classification task in inductive setting. Models with random initializations were run with 5 different seeds, and standard deviations are reported for them in bracket wherever applicable.
Avg. Time (s) 20NG Ohsumed
HeteGCN(F-X) 4.23 1.09
HeteGCN(TX-X) 0.78 0.18
HeteGCN(X-TX-X) 1.17 0.27
TextGCN 6.86 1.70
Table 4. Average Per Epoch Time: HeteGCN, TextGCN

5.5. Timing Comparison

We made training time comparisons on the various HeteGCN variants and TextGCN

. All the analysis was done on a machine with Intel Xeon 2.60Ghz processor, 112 GB RAM, Ubuntu 16.04 OS, python 3.5 and tensorflow 1.14 (CPU). We are reporting average time taken for an epoch on 20NG and Ohsumed in Table 

4. We observe that we get speed up by using HeteGCN(), speed up by using HeteGCN() and speed up using HeteGCN(. Similar speedups were observed for other datasets as well. The obtained speedups are proportional to the sparsity of the graphs involved ( is denser than ).

5.6. Visualization of learned embeddings

Figure 3. TSNE Plots of 20NG document embeddings obtained from the following 4 models: LABEL: HeteGCN(F-X); LABEL: TextGCN; LABEL: PTE; and LABEL: GCN.

In this subsection, we discuss the effectiveness of the learned word representation. We, first, provide a tSNE (van der Maaten and Hinton, 2012) transformed document representations computed from the HeteGCNtrained on R8-1% labeled dataset in Figure 3 and compare it against those from TextGCN, PTEand GCN. Two things to note: (1) the majority classes (cyan and violet) are much better separated in HeteGCN() embeddings than other models, (2) the minority classes get quite scattered in TextGCNand GCNmodels, and although PTEdoes somewhat better, HeteGCN() shows significantly better clustering of these points even in the 1% labeled setting. We also qualitatively analyse word embeddings by training a logistic regression model on aggregated training document embeddings and predicting words’ labels using word embeddings. We show top-10 words with the highest probabilities for few a classes in 20NG 1% labeled dataset in Table 5. We note that the top-10 words are interpretable even under low-labeled setting.

comp.graphics sci.space sci.med rec.autos
Table 5. Top-10 words per class in 20NG as computed using embeddings trained on 1% labeled data.

6. Related Work

Traditional text classification models (Aggarwal and Zhai, 2012) and neural models (Minaee et al., 2020) discussed in Section 1 require large amount of labeled data and/or pre-trained embeddings. In practice, large labeled data is not always available. Also, raw text information might be inaccessible due to privacy concerns making it infeasible to associate any pre-trained embeddings with this data. In such cases, the various models discussed in Section 1 cannot work effectively.

PTE (Tang et al., 2015a) addresses these problems by learning word representations from given data by constructing a heterogeneous graph of documents, words and labels. This method can work even in a low labeled setting as long as it also has access to some unlabeled data. It can be shown that PTEfactorizes a joint heterogeneous graph to learn word representations (Qiu et al., 2018). Note that utilizing unlabeled data to improve models, and using graphs to improve performance in the low labeled setting has been studied significantly (Chapelle and Zien, 2005; Nigam et al., 2006; Belkin et al., 2006). PTEbuilds on these ideas to learn better representations, while earlier models focused on improving classifier performance.

TextGCN (Yao et al., 2019) combines ideas from PTEwith Graph Convolutional Network (GCN) to give better performance. GCN (Kipf and Welling, 2017) has shown excellent performance on text classification datasets. However, it assumes access to a graph structure among documents like citation networks to provide a boost in performance. Such graphs may not always be available. TextGCN, like PTE, constructs a heterogeneous graph of documents and words (excluding labels) and uses it along with GCN. Unlike PTE, TextGCNjointly learns word representations and classifier together, thereby getting good performance. However, TextGCNhas three issues: (1) it cannot scale to large datasets, (2) it force-fits a heterogeneous graph in GCNdefined for homogeneous graph, (3) it is transductive without a natural inductive formulation. Our proposed approach solves all these issues by (1) decomposing TextGCNsuch that the parameters are independent of the number of documents, (2) individual layers deal with one graph, either document-word or word-word, thereby ensuring consistency in the layer, giving rise to a novel heterogeneous formulation of GCNand (3) finally the approach yields a natural inductive formulation. In this paper, we present preliminary results of the inductive formulation, and there are nuances that need further investigation.

Other recent works in text classification include (Huang et al., 2019; Zhang et al., 2020) and these models construct graphs at text level and use GNNs to exploit local structure in the raw text to learn text embeddings from pre-trained word embeddings. Since we assume that we do not have access to raw text information and pre-trained word embeddings, thereby, comparing against these models would not be fair.

7. Discussion and Future Work

In this work, we showed how different HeteGCNmodel variants could be composed and demonstrated their effectiveness compared to complex TextGCNin different scenarios. HeteGCNcan be extended to work with recommendation problems by modifying the loss function. Another application area of interest is to use HeteGCNmodels for learning embeddings for metapaths (Shi et al., 2019). A metapath is defined as a sequence of entity types (e.g., user movie director, user movie genre) with each edge in the path specifying the relationship between entities and each metapath encodes distinct semantics. Note that embeddings for several metapaths can be learned using multiple HeteGCN

models with fusion and shared capabilities. Finally, deep learning models using knowledge graphs for recommendation 

(Wang et al., 2019) has been an active area of research where our approach can be used and extended with mechanisms such as attention. We intend to explore these directions as future work.

8. Conclusion

We proposed a HeteGCNmodeling approach to construct simpler models by using GCNlayers with different graphs. HeteGCNmodel using word-word graph outperforms state-of-the-art models on several benchmark datasets when the number of labeled examples is small. It is quite effective in terms of model complexity, training time and performance under different labeled training data scenarios compared to TextGCN. We also suggested simpler HeteGCNmodels that are useful when there is storage and compute constraints arising from a large number of features and documents. Finally, we also demonstrated how inductive inference is made with HeteGCNand TextGCNmodels.


  • (1)
  • Aggarwal and Zhai (2012) Charu C. Aggarwal and ChengXiang Zhai. 2012. A Survey of Text Classification Algorithms. Springer US, Boston, MA, 163–222.
  • Belkin et al. (2006) Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. 2006. Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples. Journal of Machine Learning Research (2006), 2399–2434.
  • Blum and Mitchell (1998) Avrim Blum and Tom Mitchell. 1998. Combining Labeled and Unlabeled Data with Co-Training. In

    Proceedings of the Eleventh Annual Conference on Computational Learning Theory

    . Association for Computing Machinery, New York, NY, USA, 92–100.
  • Cesa-Bianchi et al. (2006) Nicolò Cesa-Bianchi, Claudio Gentile, and Luca Zaniboni. 2006. Hierarchical Classification: Combining Bayes with SVM. In Proceedings of the 23rd International Conference on Machine Learning. Association for Computing Machinery, New York, NY, USA, 177–184.
  • Chapelle and Zien (2005) Olivier Chapelle and Alexander Zien. 2005. Semi-Supervised Classification by Low Density Separation. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, AISTATS 2005, Bridgetown, Barbados, January 6-8, 2005

    , Robert G. Cowell and Zoubin Ghahramani (Eds.). Society for Artificial Intelligence and Statistics.

  • Dong et al. (2017) Yuxiao Dong, Nitesh V. Chawla, and Ananthram Swami. 2017. Metapath2vec: Scalable Representation Learning for Heterogeneous Networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, New York, NY, USA, 135–144.
  • Fan et al. (2008) Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A Library for Large Linear Classification. JMLR 9 (2008), 1871–1874.
  • He et al. (2020) Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, YongDong Zhang, and Meng Wang. 2020. LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. Association for Computing Machinery, New York, NY, USA, 639–648.
  • Hsieh et al. (2008) Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and S. Sundararajan. 2008. A Dual Coordinate Descent Method for Large-Scale Linear SVM. In Proceedings of the 25th International Conference on Machine Learning. Association for Computing Machinery, New York, NY, USA, 408–415.
  • Huang et al. (2019) Lianzhe Huang, Dehong Ma, Sujian Li, Xiaodong Zhang, and Houfeng Wang. 2019. Text Level Graph Neural Network for Text Classification. In

    Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

    . Association for Computational Linguistics, Hong Kong, China, 3444–3450.
  • Joachims (2002) Thorsten Joachims. 2002.

    Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms

    Kluwer Academic Publishers, USA.
  • Joulin et al. (2017) Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of Tricks for Efficient Text Classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, Valencia, Spain, 427–431.
  • Keerthi et al. (2008) S. Sathiya Keerthi, S. Sundararajan, Kai-Wei Chang, Cho-Jui Hsieh, and Chih-Jen Lin. 2008. A sequential dual method for large scale multi-class linear svms. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 408–416.
  • Kipf and Welling (2017) Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, Online.
  • Le and Mikolov (2014) Quoc Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. In Proceedings of the 31st International Conference on International Conference on Machine Learning. JMLR.org, II–1188–II–1196.
  • Liu et al. (2017) Jingzhou Liu, Wei-Cheng Chang, Yuexin Wu, and Yiming Yang. 2017. Deep Learning for Extreme Multi-Label Text Classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. Association for Computing Machinery, New York, NY, USA, 115–124.
  • Liu et al. (2016) Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016. Recurrent Neural Network for Text Classification with Multi-Task Learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016. IJCAI/AAAI Press, 2873–2879.
  • Lu and Getoor (2003) Qing Lu and Lise Getoor. 2003. Link-based Classification. In Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003). AAAI Press, 496–503.
  • Manning et al. (2008) Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, USA.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems. Curran Associates, Inc., Red Hook, NY, USA, 3111–3119.
  • Minaee et al. (2020) Shervin Minaee, Nal Kalchbrenner, Erik Cambria, Narjes Nikzad, Meysam Chenaghlu, and Jianfeng Gao. 2020. Deep Learning Based Text Classification: A Comprehensive Review. arXiv:2004.03705 [cs.CL]
  • Nigam et al. (2006) Kamal Nigam, Andrew McCallum, and Tom M. Mitchell. 2006. Semi-Supervised Text Classification Using EM. In Semi-Supervised Learning, Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien (Eds.). The MIT Press, 32–55.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, Alessandro Moschitti, Bo Pang, and Walter Daelemans (Eds.). ACL, 1532–1543.
  • Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online Learning of Social Representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, New York, NY, USA, 701–710.
  • Qiu et al. (2018) Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. 2018. Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and Node2vec. In Proceedings of the 11th ACM International Conference on Web Search and Data Mining. Association for Computing Machinery, New York, NY, USA, 459–467.
  • Rennie et al. (2003) Jason D. M. Rennie, Lawrence Shih, Jaime Teevan, and David R. Karger. 2003.

    Tackling the Poor Assumptions of Naive Bayes Text Classifiers. In

    Proceedings of the Twentieth International Conference on International Conference on Machine Learning. AAAI Press, 616–623.
  • Selvaraj et al. (2011) Sathiya Keerthi Selvaraj, Bigyan Bhar, Sundararajan Sellamanickam, and Shirish Shevade. 2011. Semi-Supervised SVMs for Classification with Unknown Class Proportions and a Small Labeled Dataset. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management. Association for Computing Machinery, New York, NY, USA, 653–662.
  • Shchur et al. (2018) Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. 2018. Pitfalls of Graph Neural Network Evaluation. Relational Representation Learning Workshop, NeurIPS 2018 (2018).
  • Shi et al. (2019) C. Shi, B. Hu, W. Zhao, and P. S. Yu. 2019. Heterogeneous Information Network Embedding for Recommendation. IEEE Transactions on Knowledge and Data Engineering 31, 02 (2019), 357–370.
  • Sindhwani et al. (2005) V. Sindhwani, P. Niyogi, and M. Belkin. 2005. A Co-Regularization Approach to Semi-supervised Learning with Multiple Views. In Proceedings of the 22nd International Conference on Machine Learning. New York, NY, USA.
  • Tang et al. (2015a) Jian Tang, Meng Qu, and Qiaozhu Mei. 2015a. PTE: Predictive Text Embedding through Large-Scale Heterogeneous Text Networks. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, New York, NY, USA, 1165–1174.
  • Tang et al. (2015b) Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015b. LINE: Large-Scale Information Network Embedding. In Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 1067–1077.
  • van der Maaten and Hinton (2012) Laurens van der Maaten and Geoffrey E. Hinton. 2012. Visualizing non-metric similarities in multiple maps. Mach. Learn. 87, 1 (2012), 33–55.
  • Wang et al. (2019) Xiang Wang, Xiangnan He, Yixin Cao, Meng Liu, and Tat-Seng Chua. 2019. KGAT: Knowledge Graph Attention Network for Recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, New York, NY, USA, 950–958.
  • Wu et al. (2019) Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. 2019. Simplifying Graph Convolutional Networks. In Proceedings of the 36th International Conference on Machine Learning.
  • Yao et al. (2019) Liang Yao, Chengsheng Mao, and Yuan Luo. 2019. Graph Convolutional Networks for Text Classification. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019. AAAI Press, 7370–7377.
  • Zhang et al. (2020) Yufeng Zhang, Xueli Yu, Zeyu Cui, Shu Wu, Zhongzhen Wen, and Liang Wang. 2020. Every Document Owns Its Structure: Inductive Text Classification via Graph Neural Networks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 334–339.