1 Introduction
Graph neural networks (GNNs) have emerged as state-of-the-art models for semi-supervised node classification on graphs perozzi2014deepwalk; kipf2016semi; velivckovic2017graph; wu2019simplifying; hamilton2017inductive. This task aims to leverage a small subset of labeled nodes together with a large number of unlabeled nodes to train an accurate classifier. Most modern GNNs rely on an iterative message passing procedure that aggregates and transforms the features of neighboring nodes to learn node embeddings, which are then used for node classification. However, in extreme cases where very few labels are available (e.g., only a handful of labeled nodes per class), popular GNN architectures (e.g., graph convolutional networks (GCNs), typically with two layers) are ineffective in propagating the limited training labels to learn discriminative node embeddings, resulting in inferior classification performance. Recently, a central theme of studies has been to improve classification accuracy by designing deeper GNNs or new network architectures
qu2019gmnn; verma2019graphmix. However, the challenge of how to effectively learn GNNs with few labels remains underexplored.

Recently, pseudo-labeling, also called self-training, has been proposed as a prevalent semi-supervised method to explicitly tackle the label scarcity problem on graphs. Pseudo-labeling expands the label set by assigning pseudo labels to high-confidence unlabeled nodes, and iteratively retrains the model with both given labels and pseudo labels. Li et al. li2018deeper first proposed a self-trained GCN that chooses the most confident unlabeled nodes to enlarge the training set for model retraining. Sun et al. Sun2020MultiStageSL pointed out a shallow GCN's ineffectiveness in propagating label information under few-label settings; a multi-stage approach was then proposed, which applies deep clustering techniques to assign pseudo labels to unlabeled nodes with high prediction confidence. Zhou et al. zhou2019dynamic proposed a dynamic self-training framework, which assigns soft label confidences to the pseudo-label loss to control the pseudo labels' contribution to the gradient update.
Despite offering promising results, existing pseudo-labeling approaches on GNNs have not fully explored the power of self-training, due to two major limitations. First, these methods impose strict constraints such that only unlabeled nodes with high prediction probabilities are selected for pseudo labeling. However, these selected nodes often convey information similar to that of the given labels, causing information redundancy in the expanded label set. Conversely, if unlabeled nodes with lower prediction probabilities are allowed to enlarge the label set, more pseudo-label noise is incurred, significantly degrading classification accuracy. This creates a dilemma for pseudo-labeling strategies seeking desirable performance improvements. Second, the existing methods all treat pseudo labels and genuine labels as equally important. They are incorporated into the same loss function, such as the standard cross entropy loss, for node classification, neglecting their distinct contributions to the classification task. In the presence of unreliable or noisy pseudo labels, model performance might deteriorate during retraining.

Motivated by the above observations, in this paper we propose a novel informative pseudo-labeling framework, called InfoGNN, for semi-supervised node classification with few labels. Our aim is to fully harness the power of self-training by incorporating more pseudo labels, while alleviating the possible negative impact of noisy (i.e., incorrect) pseudo labels. To address information redundancy
, we define node representativeness via neural estimation of the mutual information between a node and its local context subgraph in the embedding space. This offers two advantages: 1) it provides an informativeness measure to select unlabeled nodes for pseudo labeling, such that the added pseudo labels bring more information gain; 2) it implicitly encourages each node to approximate its own local neighborhood and depart from other neighborhoods. The intuition is that an unlabeled node is considered informative when it can maximally reflect its local neighborhood. By integrating this informativeness measure with the model prediction probabilities, our approach can selectively pseudo-label the nodes offering maximum performance gains. To mitigate the negative impact of
noisy pseudo labels, we also propose a generalized cross entropy loss on pseudo labels to improve model robustness against noise. This loss allows us to maximize the pseudo-labeling capacity while minimizing the risk of model collapse. In addition, to cope with the potential class-imbalance problem caused by pseudo labeling under extremely few-label settings, we propose a class-balanced regularization that keeps the numbers of pseudo labels across classes in relative equilibrium.

Our main contributions can be summarized as follows:

Our study analyzes the ineffectiveness of existing pseudo-labeling strategies and proposes a novel pseudo-labeling framework for semi-supervised node classification with extremely few labels;

Our approach has the unique advantages of incorporating an MI-based informativeness measure for pseudo-label candidate selection and of alleviating the negative impact of noisy pseudo labels via a generalized cross entropy loss;

We validate our proposed approach on six real-world graph datasets of various types, showing its superior performance over state-of-the-art baselines.
2 Related Work
2.1 Graph Learning with Few Labels
GNNs have emerged as a new class of deep learning models on graphs kipf2016semi; velivckovic2017graph. The principle of GNNs is to learn node embeddings by recursively aggregating and transforming continuous feature vectors from local neighborhoods wu2019simplifying; chen2020simple; gao2019graph; chen2018fastgcn. The generated node embeddings can then be used as input to any differentiable prediction layer, for example a softmax layer for node classification. Recently, a series of semi-supervised GNNs, such as GCNs and their variants, have been proposed for node classification. The success of these models relies on a sufficient number of labeled nodes for training; how to train GNNs with a very small set of labeled nodes remains a challenging task.
Pseudo-Labeling on Graphs.
To tackle label scarcity, pseudo-labeling has been proposed as one of the prevalent semi-supervised methods. It refers to a specific training regime where the model is bootstrapped with additional labeled data obtained using a confidence-based thresholding method lee2013pseudo; rosenberg2005semi. Recently, pseudo-labeling has shown promising results on semi-supervised node classification. Li et al. li2018deeper proposed a self-trained GCN that enlarges the training set by assigning pseudo labels to the most confident unlabeled nodes, and then retrains the model using both given labels and pseudo labels. They also proposed a co-training method that utilizes two models to complement each other, where the pseudo labels are given by a random walk model rather than the GNN classifier itself. A similar method was proposed in zhan2021mutual. Sun et al. Sun2020MultiStageSL showed that a shallow GCN is ineffective in propagating label information under few-label settings, and proposed a multi-stage self-training framework that relies on a deep clustering model to assign pseudo labels. Zhou et al. zhou2019dynamic proposed a dynamic pseudo-labeling approach called DSGCN, which selects unlabeled nodes with prediction probabilities above a pre-specified threshold for pseudo labeling and assigns them soft label confidences as label weights.
We argue that the existing pseudo-labeling methods on GNNs all share two major problems: information redundancy and noisy pseudo labels. Our work explicitly overcomes these limitations, with a focus on developing a robust pseudo-labeling framework that expands the pseudo-label set with more informative nodes while simultaneously mitigating the negative impact of noisy pseudo labels.
Graph Few-shot Learning.
Originally designed for image classification, few-shot learning primarily focuses on tasks where a classifier must accommodate new classes unseen during training, given only a few labeled examples per class snell2017prototypical. Several recent studies ding2020graph; huang2020graph; wang2020generalizing have attempted to generalize few-shot learning to graph domains. For example, Ding et al. ding2020graph proposed a graph prototypical network for node classification, which learns a transferable metric space via meta-learning, such that the model can extract meta-knowledge and generalize well to the target few-shot classification task. Huang et al. huang2020graph proposed to transfer subgraph-specific information and learn transferable knowledge via meta gradients.
Although few-shot learning and our work both tackle the label scarcity problem, their problem settings and learning objectives are fundamentally different: in few-shot learning, the training and test sets typically reside in different class spaces, so few-shot learning aims to learn transferable knowledge that enables rapid generalization to new tasks. By contrast, our work follows the transductive GNN setting, where the training and test sets share the same class space, and our objective is to improve model training in the face of very few labels.
Graph Self-supervised Learning.
Our work is related to self-supervised learning on graphs velickovic2019deep, which also investigates how to best leverage unlabeled data. However, there is a clear distinction in the objectives: the primary aim of self-supervised learning is to learn node/graph representations by designing pretext tasks without label-related supervision, such that the generated representations facilitate specific classification tasks Liu2021GraphSL. For example, You et al. you2020graph demonstrated that self-supervised learning can provide regularization for graph-related classification tasks, proposing three pretext tasks (i.e., node clustering, graph partitioning, and graph completion) based on graph properties. Other works attempted to learn better node/graph representations by creating contrastive views, such as the local node vs. global graph view in velickovic2019deep, or by performing graph augmentation zhu2020deep.

In contrast, our work resorts to explicitly augmenting label-specific supervision for node classification. This is achieved by expanding the existing label set with reliable pseudo labels to best boost model performance in a semi-supervised manner.
2.2 Mutual Information Maximization
The Infomax principle was first proposed to encourage an encoder to learn effective representations that share maximized Mutual Information (MI) with the input linsker1988self; belghazi2018mutual; hjelm2018learning. Recently, this MI maximization idea has been applied to improve graph representations. Velickovic et al. velickovic2019deep applied MI maximization to learn node embeddings by contrasting local subgraphs with high-level, global graph representations. Qiu et al. qiu2020gcc proposed to learn intrinsic and transferable structural representations by contrasting subgraphs from different graphs via a discriminator. Hassani et al. hassani2020contrastive contrasted node representations from a local view with graph representations from a global view to learn more informative node embeddings. In our context, we leverage the idea of contrastive learning to maximize the MI between each node and its neighboring context. The estimated MI enables selecting more representative unlabeled nodes in local neighborhoods for pseudo labeling, so as to further advance model performance.
3 Problem Statement
Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ represent an undirected graph, where $\mathcal{V}$ denotes a set of $n$ nodes, and $\mathcal{E}$ denotes the set of edges connecting nodes. $X \in \mathbb{R}^{n \times d}$ denotes the node feature matrix, and $x_i$ is the $d$-dimensional feature vector of node $v_i$. The graph structure is represented by the adjacency matrix $A \in \{0, 1\}^{n \times n}$, where $A_{ij} = 1$ iff $(v_i, v_j) \in \mathcal{E}$. We assume that only a small fraction of nodes in the node set are labeled, where $\mathcal{L}$ denotes the set of labeled nodes, and $\mathcal{U}$ denotes the set of unlabeled nodes. $y_i \in \{0, 1\}^{c}$ is the one-hot encoding of node $v_i$'s class label, and $c$ is the number of classes.

We consider the semi-supervised node classification problem kipf2016semi; velivckovic2017graph under a pseudo-labeling paradigm, which is formally defined as follows:
Problem 1
Given an undirected graph $\mathcal{G}$ together with a small subset of labeled nodes $\mathcal{L}$, we aim to design a strategy for expanding the label set with a pseudo-label set $\mathcal{P}$ drawn from the unlabeled nodes $\mathcal{U}$, a method for generating reliable pseudo labels $\tilde{y}$, and an exclusive loss function $\ell_p$ for pseudo labels, such that $\mathcal{L}$, $\mathcal{P}$, and $\ell_p$ can be combined together with the task-specific loss $\ell_{task}$ to maximize the classification performance of a graph neural network $f_\theta$. This problem can be formally formulated as Eq. (1):

$\theta^{*} = \arg\min_{\theta} \; \ell_{task}\big(f_\theta; \mathcal{L}\big) + \ell_{p}\big(f_\theta; \mathcal{P}\big)$  (1)
4 Methodology
4.1 Framework Overview
The primary aim of our work is to develop a robust pseudo-labeling framework for GNN training with few labels. As shown in Fig. 1, our proposed InfoGNN framework comprises four primary components: 1) a GNN encoder; 2) candidate selection via MI maximization; 3) pseudo labeling; 4) GNN retraining with a generalized cross entropy loss (GCE) and a class-balanced regularization (CBR) on pseudo labels.
Taking a graph as input, a GNN encoder is first utilized to generate node embeddings and class predictions. Based on the generated node embeddings and the graph structure, we derive a measure of node representativeness via MI maximization to assess the informativeness of nodes, which serves for node selection in pseudo labeling. According to this informativeness measure and the class prediction probabilities, we assign pseudo labels to selected reliable nodes and use them to augment the existing label set for model retraining. During the GNN retraining phase, we propose a GCE loss to improve model robustness against potentially noisy pseudo labels. Furthermore, a KL-divergence loss is used as a regularizer to mitigate the potential class-imbalance problem caused by pseudo labeling, which can be exacerbated by label scarcity. Finally, the standard cross entropy loss (SCE) on labeled nodes and the GCE and CBR losses on pseudo labels are integrated to retrain the GNN.
4.2 The GNN Encoder
The GNN encoder is the backbone of our framework. It mainly serves to generate node embeddings and class prediction probabilities that reflect model confidence in its predictions. Any GNN designed for node classification can be utilized here for embedding learning and classification. A GNN encoder generally learns node embeddings by recursively aggregating and transforming node features from topological neighborhoods. In our work, we utilize GCN kipf2016semi as our GNN encoder $f_\theta$. For $f_\theta$, the node embeddings at the $l$-th propagation layer are obtained by:

$H^{(l+1)} = \sigma\big(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\big)$  (2)

where $\sigma(\cdot)$ is the activation function, $\tilde{A} = A + I_n$ is the adjacency matrix of $\mathcal{G}$ with added self-connections, $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, and $W^{(l)}$ is a layer-specific trainable weight matrix. We use the SCE loss to optimize the GCN for node classification:

$\mathcal{L}_{sce} = -\sum_{v_i \in \mathcal{L}} \sum_{j=1}^{c} y_{ij} \log f_j(x_i)$  (3)
Finally, according to the class prediction probabilities, we can obtain the confidence score for each node $v_i$:

$s^{c}_i = \max_j f_j(x_i)$  (4)
The confidence score is utilized for node selection in combination with the representativeness score, which is detailed below.
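To make the encoder concrete, below is a minimal NumPy sketch of a two-layer GCN forward pass with the symmetric normalization of Eq. (2), followed by a confidence score taken as the maximum softmax probability, in the spirit of Eq. (4). The weight matrices here are illustrative placeholders, not trained parameters:

```python
import numpy as np

def gcn_forward(A, X, W1, W2):
    """Two-layer GCN forward pass with the symmetric normalization of
    Eq. (2); W1 and W2 are illustrative (untrained) weight matrices."""
    n = A.shape[0]
    A_tilde = A + np.eye(n)                      # add self-connections
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(d ** -0.5)
    S = D_inv_sqrt @ A_tilde @ D_inv_sqrt        # D^-1/2 (A + I) D^-1/2
    H = np.maximum(S @ X @ W1, 0.0)              # ReLU hidden layer
    logits = S @ H @ W2
    Z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)      # row-wise softmax

def confidence_scores(probs):
    """Confidence per node as the maximum class probability."""
    return probs.max(axis=1)
```

On a toy path graph, `gcn_forward` returns one probability row per node, and `confidence_scores` yields a value in (0, 1] per node.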
4.3 Candidate Selection for Pseudo Labelling
Most existing pseudo-labeling methods select unlabeled nodes based only on model confidence or uncertainty zhou2019dynamic; li2018deeper; mukherjee2020uncertainty. These methods favor nodes with high prediction probabilities, expecting to bring less noise into the pseudo labels used for model retraining. However, such high-confidence nodes tend to carry information redundant with the given labels, resulting in limited capacity to improve model performance. Therefore, besides model confidence, we propose to take node informativeness into account for node selection so as to maximally boost model performance. The key problem then lies in how to measure node informativeness.
Informativeness Measure by MI Maximization
We define node informativeness as the representativeness of a node in relation to its contextual neighborhood. The intuition is that a node is informative when it can maximally represent its surrounding neighborhood while minimally reflecting other arbitrary neighborhoods. Hence, the representativeness of a node can be measured by its mutual information with its neighborhood, with which it is positively correlated. On this account, we employ MI maximization techniques belghazi2018mutual to estimate the MI by measuring how well a node represents its surrounding neighborhood and discriminates arbitrary other neighborhoods. This provides a score that quantifies the representativeness of each node. We achieve this by formulating it as a subgraph-based contrastive learning task, which contrasts each node with its positive and negative context subgraphs.
Given a graph $\mathcal{G}$ with learned node embeddings $H$, for each node $v_i$ we define its local $k$-hop subgraph $\mathcal{G}^k_i$ as the positive sample, and an arbitrary $k$-hop subgraph $\mathcal{G}^k_j$ centred at another node $v_j$ as the negative sample. The mutual information between node $v_i$ and its neighborhood can then be measured by a GAN-like divergence nowozin2016f:

$\hat{I}\big(h_i; \mathcal{G}^k_i\big) \propto \mathbb{E}_{\mathbb{P}}\big[\log T_\omega\big(h_i, \mathcal{G}^k_i\big)\big] + \mathbb{E}_{\mathbb{N}}\big[\log\big(1 - T_\omega\big(h_i, \mathcal{G}^k_j\big)\big)\big]$  (5)

where $h_i$ is node $v_i$'s embedding generated from the GNN encoder, and $\mathcal{G}^k_i$ and $\mathcal{G}^k_j$ are the embedding sets of the subgraphs centred at nodes $v_i$ and $v_j$. $T_\omega$ is a trainable neural network parameterized by $\omega$. The first term indicates the affinity between positive pairs, while the second indicates the discrepancy between negative pairs. Our objective is to estimate the MI by maximizing the first term while minimizing the second, which is in essence a contrastive objective.
This contrastive objective is achieved by employing a discriminator, as illustrated in Figure 1. At each iteration, after obtaining the learned node embeddings $H$, both a positive and a negative subgraph are first sampled and paired for each node. The nodes and their paired subgraphs are then passed to the discriminator after being separately processed by an MLP encoder $\phi$ and a subgraph encoder $\psi$. The discriminator finally produces a representativeness score for each node by distinguishing a node's embedding from its subgraph embeddings.
Formally, we specify $T_\omega\big(h_i, \mathcal{G}^k_i\big) = D\big(\phi(h_i), \psi(\mathcal{G}^k_i)\big)$. Here, $\phi$ is an MLP encoder for node embedding transformation, and $\psi$ is a subgraph encoder that aggregates the embeddings of all nodes in the subgraph to generate an embedding of the subgraph, implemented using a one-layer GCN on the $k$-hop subgraph:

$\psi\big(\mathcal{G}^k_i\big) = \Big(\hat{D}_k^{-1} \hat{A}_k H W_\psi\Big)_i, \qquad \hat{A}_k = \mathrm{Bin}\Big(\sum\nolimits_{l=0}^{k} A^{l}\Big)$  (6)

where $A$ is the original graph adjacency matrix, $\mathrm{Bin}(\cdot)$ is a binary function that guarantees $\hat{A}_k \in \{0, 1\}^{n \times n}$, and $\hat{D}_k$ is the corresponding degree matrix of $\hat{A}_k$. Note that we use the $k$-hop adjacency matrix $\hat{A}_k$, instead of the original adjacency matrix $A$, for feature aggregation, and the aggregated embedding of the centre node of the subgraph serves as the subgraph embedding. With regard to the discriminator $D$, we implement it using a bilinear layer:
$D\big(\phi(h_i), \psi(\mathcal{G}^k_i)\big) = \sigma\big(\phi(h_i)^{\top} W_D\, \psi(\mathcal{G}^k_i)\big)$  (7)

where $W_D$ is a learnable parameter matrix. To enable the discriminator to measure the affinity between node $v_i$ and its corresponding local subgraph $\mathcal{G}^k_i$, we minimize the binary cross entropy loss between positive and negative pairs, formulated as the contrastive loss:
$\mathcal{L}_{c} = -\frac{1}{2n} \sum_{i=1}^{n} \Big[ \log D\big(\phi(h_i), \psi(\mathcal{G}^k_i)\big) + \log\big(1 - D\big(\phi(h_i), \psi(\mathcal{G}^k_j)\big)\big) \Big]$  (8)
By minimizing $\mathcal{L}_c$, the discriminator learns to maximally distinguish a node from any arbitrary subgraph it does not belong to in the embedding space. This process is equivalent to maximizing their MI in the sense of Eq. (5).
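As a concrete illustration of the pieces above, the sketch below builds a binarized $k$-hop adjacency (the role of Eq. (6)), a mean-aggregating subgraph encoder, a bilinear discriminator (Eq. (7)), and the BCE contrastive loss (Eq. (8)) in NumPy. The function names, the plain mean aggregation, and the permutation-based negative pairing are simplifying assumptions for exposition, not the paper's exact implementation:

```python
import numpy as np

def k_hop_adjacency(A, k):
    """Binarized k-hop adjacency with self-loops: entry (i, j) is 1 iff
    v_j is reachable from v_i within k hops."""
    n = A.shape[0]
    P = np.eye(n)
    for _ in range(k):
        P = P @ (A + np.eye(n))
    return (P > 0).astype(float)

def subgraph_embeddings(A_k, H):
    """Mean aggregation over each node's k-hop subgraph; the centre
    node's aggregated vector serves as its subgraph embedding."""
    D_inv = np.diag(1.0 / A_k.sum(axis=1))
    return D_inv @ A_k @ H

def discriminator(h, s, W):
    """Bilinear scoring sigmoid(h^T W s), as in Eq. (7)."""
    return 1.0 / (1.0 + np.exp(-(h @ W @ s)))

def contrastive_loss(H, S, W, neg_perm):
    """BCE over positive pairs (node, own subgraph) and negative pairs
    (node, another node's subgraph), as in Eq. (8)."""
    n = H.shape[0]
    total = 0.0
    for i in range(n):
        pos = discriminator(H[i], S[i], W)
        neg = discriminator(H[i], S[neg_perm[i]], W)
        total += -np.log(pos + 1e-12) - np.log(1.0 - neg + 1e-12)
    return total / (2 * n)
```

Negative subgraphs are obtained here by permuting which subgraph each node is paired with; any sampling of subgraphs the node does not belong to serves the same purpose.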
Pseudo Labeling.
The discriminator measures the affinity between each node and its local subgraph. We utilize this affinity to define the informativeness score for each node:

$s^{r}_i = D\big(\phi(h_i), \psi(\mathcal{G}^k_i)\big)$  (9)

where $s^{r}_i$ indicates to what extent a node reflects its neighborhood, and a higher score means that the node is more informative. Therefore, by considering both the informativeness score and the model prediction confidence, we derive the selection criterion to construct the pseudo-label set $\mathcal{P}$:
$\mathcal{P} = \big\{\, v_i \in \mathcal{U} \;\big|\; s^{c}_i > \kappa \ \text{or}\ s^{r}_i > \kappa \,\big\}$  (10)

where $s^{c}_i$ is the confidence score as in Eq. (4), and $\kappa$ is a hyperparameter whose value can be empirically determined (see Fig. 4(b) in Section 5.6). We then produce the pseudo labels for $\mathcal{P}$ utilizing the GNN encoder $f_\theta$:

$\tilde{y}_i = \arg\max_j f_j(x_i), \quad v_i \in \mathcal{P}$  (11)
where the pseudo label $\tilde{y}_i$ is simply the label predicted by the GNN encoder.
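The selection-and-labeling step can be sketched as follows. The exact rule for combining the two scores in Eq. (10) and the threshold value are assumptions here (each score compared against a shared threshold `kappa`):

```python
import numpy as np

def select_pseudo_nodes(conf, info, unlabeled, kappa=0.55):
    """Candidate selection in the spirit of Eq. (10): keep unlabeled
    nodes whose confidence or informativeness score exceeds a threshold
    kappa. The 'or' combination and the default kappa are assumptions."""
    return [i for i in unlabeled if conf[i] > kappa or info[i] > kappa]

def pseudo_labels(probs, selected):
    """Eq. (11)-style labeling: the pseudo label is the class predicted
    by the encoder."""
    return {i: int(np.argmax(probs[i])) for i in selected}
```

For example, a node with low confidence but a high informativeness score is still selected, which is precisely what distinguishes this criterion from confidence-only thresholding.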
4.4 Mitigating Noisy Pseudo Labels
During the retraining phase, existing pseudo-labeling methods regard given labels and pseudo labels as equally important, so an identical loss function, e.g., the SCE loss, is applied. However, as more nodes are added, it is inevitable to introduce unreliable or noisy (i.e., incorrect) pseudo labels. If the same SCE loss is still applied to unreliable pseudo labels, it would degrade model performance. This is because the SCE loss implicitly puts more weight on difficult nodes, whose predictions deviate from the supervised labels, during the gradient update zhang2018generalized; van2015learning. This is beneficial for training with clean labels and ensures faster convergence. However, when noisy pseudo labels exist in the label set, more emphasis is put on them, as they are harder to fit than correct ones. This would ultimately cause the model to overfit incorrect labels, thereby degrading model performance.
To address this issue, inspired by zhang2018generalized, we propose to apply the negative Box-Cox transformation box1964analysis to the loss function on the pseudo-label set $\mathcal{P}$. The transformed loss function is given as follows:

$\mathcal{L}_{gce} = \sum_{v_i \in \mathcal{P}} \ell_q\big(f(x_i), \tilde{y}_i\big), \qquad \ell_q\big(f(x_i), \tilde{y}_i\big) = \frac{1 - f_{\tilde{y}_i}(x_i)^{q}}{q}$  (12)

where $q \in (0, 1]$ and $\tilde{y}_i$ is the pseudo label of node $v_i$. To further elaborate how this loss impacts the parameter update, we give its gradient as follows:
$\nabla_\theta\, \ell_q\big(f(x_i), \tilde{y}_i\big) = f_{\tilde{y}_i}(x_i)^{q} \Big( -\frac{1}{f_{\tilde{y}_i}(x_i)} \nabla_\theta f_{\tilde{y}_i}(x_i) \Big)$  (13)

where $f_{\tilde{y}_i}(x_i)^{q} \in (0, 1]$ for $q \in (0, 1]$. Compared with the SCE loss, this weighs each gradient by an additional factor $f_{\tilde{y}_i}(x_i)^{q}$, which reduces the gradient descent on unreliable pseudo labels with lower prediction probabilities. In fact, $\ell_q$ can be regarded as a generalization of the SCE loss and the unhinged loss: it is equivalent to SCE as $q$ approaches zero, and becomes the unhinged loss when $q$ is equal to 1. Thus, this loss allows the network to collect more information from a larger number of pseudo labels while alleviating their potential negative effect.
In practice, we apply a truncated version of $\ell_q$ to filter out the potential impact of unlabeled nodes with low prediction probabilities. Let

$\ell_q(\tau) = \frac{1 - \tau^{q}}{q}$  (14)

where $\tau \in (0, 1)$ is a truncation threshold. Formally, the truncated loss is derived as:

$\widetilde{\mathcal{L}}_{gce} = \sum_{v_i \in \mathcal{P}} \lambda_i\, \ell_q\big(f(x_i), \tilde{y}_i\big) + (1 - \lambda_i)\, \ell_q(\tau)$  (15)

where $\lambda_i = 1$ if $f_{\tilde{y}_i}(x_i) > \tau$, otherwise $\lambda_i = 0$. Intuitively, when a node's prediction probability is lower than $\tau$, its truncated loss is a constant; as the gradient of a constant loss is zero, the node makes no contribution to the gradient update, thus eliminating the negative effect of pseudo labels with low confidence.
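A minimal sketch of the generalized and truncated losses for a single node, assuming the standard Box-Cox form $(1 - p^q)/q$ from zhang2018generalized; the default values of $q$ and $\tau$ below are illustrative assumptions, not the paper's tuned settings:

```python
def gce_loss(p_y, q=0.7):
    """Generalized cross entropy (negative Box-Cox transform):
    (1 - p^q) / q, where p_y is the probability assigned to the pseudo
    label. As q -> 0 this recovers -log(p_y); at q = 1 it is the linear
    unhinged loss 1 - p_y."""
    return (1.0 - p_y ** q) / q

def truncated_gce_loss(p_y, q=0.7, tau=0.5):
    """Truncated variant: below the threshold tau the loss is the
    constant (1 - tau^q)/q, so low-confidence pseudo labels contribute
    zero gradient."""
    return gce_loss(tau, q) if p_y <= tau else gce_loss(p_y, q)
```

The two limiting cases can be checked numerically: with $q = 1$ the loss is exactly $1 - p$, and with very small $q$ it approaches $-\log p$.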
4.5 Classbalanced Regularization
In extreme cases where only very few labels are available for training, a severe class-imbalance problem can occur during pseudo labeling: one or two particular classes may dominate the whole pseudo-label set, in turn impairing model retraining. To mitigate this, we apply a class-balanced regularizer that prevents the numbers of pseudo labels in different classes from deviating greatly from each other. For this purpose, we apply a KL-divergence between the default label distribution and the pseudo-label distribution:

$\mathcal{L}_{kl} = \sum_{j=1}^{c} p_j \log \frac{p_j}{\bar{p}_j}$  (16)

where $p_j$ is the default probability of class $j$. Since it is desirable to have roughly equal numbers of pseudo labels from each class, we set the default label distribution to be uniform, $p_j = 1/c$. $\bar{p}$ is the mean of the prediction probability distributions over pseudo-labeled nodes:

$\bar{p}_j = \frac{1}{|\mathcal{P}|} \sum_{v_i \in \mathcal{P}} f_j(x_i)$  (17)
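The regularizer can be sketched as follows, assuming a uniform default distribution $1/c$ and the mean prediction of Eq. (17); the small epsilon guarding the logarithm is a numerical convenience, not part of the formulation:

```python
import numpy as np

def class_balance_kl(probs_pseudo):
    """Class-balanced regularizer: KL(default || mean prediction) as in
    Eq. (16), with a uniform default p_j = 1/c and the mean prediction
    over pseudo-labeled nodes from Eq. (17)."""
    c = probs_pseudo.shape[1]
    p_bar = probs_pseudo.mean(axis=0)              # Eq. (17)
    p_def = np.full(c, 1.0 / c)                    # uniform default
    return float(np.sum(p_def * np.log(p_def / (p_bar + 1e-12))))
```

The term is zero when pseudo labels are spread evenly across classes and grows as one class starts to dominate the pseudo-label set.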
4.6 Model Training and Computational Complexity
Our proposed InfoGNN framework is given in Algorithm 1, which consists of a pre-training phase and a formal training phase. The pre-training phase (Steps 2-4) is used to train a parameterized GNN with the given labels. Accordingly, the network parameters are updated by:

$\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}_{sce}$  (18)

where $\eta$ is the learning rate.
At the beginning of the formal training phase, the pre-trained GNN is applied to generate prediction probabilities and an informativeness score for each node, which are then used to produce pseudo labels (Steps 6-8). Finally, both given labels and pseudo labels are used to retrain the GNN by minimizing the following loss function (Step 9):

$\mathcal{L} = \mathcal{L}_{sce} + \widetilde{\mathcal{L}}_{gce} + \alpha \mathcal{L}_{kl} + \beta \mathcal{L}_{c}$  (19)

where $\alpha$ and $\beta$ are trade-off hyperparameters.
In terms of computational complexity, compared with GNN models based on the SCE loss alone, InfoGNN incurs only slight extra computational overhead in its attempt to mitigate label noise. This is mainly due to the calculation of the contrastive loss with the subgraph encoder. Since we utilize a one-layer GCN as the subgraph encoder on a $k$-hop subgraph, its computational complexity is linear in the number of edges of the $k$-hop subgraph, which is reasonably acceptable.
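The two-phase schedule of Algorithm 1 can be sketched as the skeleton below. The three callables are caller-supplied stubs standing in for the actual loss minimization and pseudo-label update, and the default epoch counts mirror the settings reported later in the experimental setup:

```python
def train_infognn(pretrain_epochs=200, formal_epochs=200, update_every=5,
                  pretrain_step=None, formal_step=None, update_pseudo=None):
    """Two-phase schedule in the spirit of Algorithm 1: pretrain_step
    minimizes the SCE loss (Eq. (18)), update_pseudo re-selects the
    pseudo-label set (Eq. (10)), and formal_step minimizes the full
    loss (Eq. (19))."""
    for _ in range(pretrain_epochs):               # pre-training phase
        if pretrain_step:
            pretrain_step()
    for epoch in range(formal_epochs):             # formal training phase
        if update_pseudo and epoch % update_every == 0:
            update_pseudo()                        # refresh pseudo labels
        if formal_step:
            formal_step()
```

Refreshing the pseudo-label set only every few epochs, rather than every step, keeps the training targets stable between updates.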
5 Experiments
To validate the effectiveness of the proposed pseudo-labeling framework, we carry out extensive experiments on six real-world graph datasets and compare against state-of-the-art baselines. We also conduct an ablation study and sensitivity analysis to better understand the key ingredients of our approach.
5.1 Datasets
Our experiments use six benchmark graph datasets in three different domains:

Citation networks: Cora, Citeseer kipf2016semi and Dblp (https://github.com/abojchevski/graph2gauss) bojchevski2017deep. In these three networks, each node represents a paper with a certain label, and each edge represents a citation link between two papers. Node features are bag-of-words vectors of the papers.

Webpage networks: Wikics (https://github.com/pmernyei/wiki-cs-dataset/raw/master/dataset) mernyei2020wiki is a computer-science-related Webpage network from Wikipedia. Nodes represent articles about computer science, and edges represent hyperlinks between articles. Node features are the mean vectors of the GloVe word embeddings of the articles.

Co-author networks: Coauthor_CS and Coauthor_Phy (https://github.com/shchur/gnn-benchmark) shchur2018pitfalls. They are co-author networks in the domains of computer science and physics. Nodes are authors, and an edge indicates that two authors have co-authored a paper. Node features are paper keywords from each author's papers.
Detailed dataset statistics are listed in Table 1 below.
Dataset  Nodes  Edges  Classes  Features 

Citeseer  3327  4732  6  3703 
Cora  2708  5429  7  1433 
Dblp  17716  105734  4  1639 
Wikics  11701  216123  10  300 
Coauthor_CS  18333  81894  15  6805 
Coauthor_Phy  34493  247962  5  8415 
5.2 Baselines
For comparison, we use 12 representative methods as our baselines. Since all methods are based on the original GCN, GCN kipf2016semi is selected as the benchmark. The other 11 recently proposed methods on graphs are used as strong competitors, and can be categorized into two groups:

Pseudo-labeling methods: M3S Sun2020MultiStageSL, Self-training li2018deeper, Co-training li2018deeper, Union li2018deeper, Intersection li2018deeper, and DSGCN zhou2019dynamic;

Self-supervised methods: SuperGCN Kim2021HowTF, GMI peng2020graph, SSGCNclu You2020WhenDS, SSGCNcomp You2020WhenDS, and SSGCNpar You2020WhenDS.
We run all experiments 10 times with different random seeds, and report the mean Micro-F1 scores for comparison.
5.3 Experimental Setup
Model Specification.
For a fair comparison, all baselines are adapted to use a two-layer GCN with 16 hidden units. The hyperparameters are the same as for the GCN in kipf2016semi, with L2 regularization of $5 \times 10^{-4}$, a learning rate of 0.01, and a dropout rate of 0.5. For the subgraph encoder $\psi$, we utilize a one-layer GCN with $c$ outputs, where $c$ is the number of classes. Both positive and negative subgraphs share the same subgraph encoder. $\phi$ is a one-layer MLP with 16-dimensional output. The discriminator is a one-layer bilinear network with one-dimensional output. For the dataset split, we randomly choose {1, 3, 5, 10, 15, 20} nodes per class for training as different settings. We then randomly pick 30 nodes per class as the validation set, and the remaining nodes are used for testing.
Hyperparameter Specification.
We specify hyperparameters conforming to the following rules:
setting  

1.0  1.0  1.0  0.55  
0.2  0.2  0.1  0.55 
Generally, larger values of the trade-off hyperparameters are beneficial to model training when the given labels are very scarce, while smaller values are more likely to achieve better performance as the number of given labels increases. We empirically find that our model has relatively low sensitivity to the weight of the class-balanced regularization term, so its value can be fixed for most settings. Specifically, we adopt the first row of values in the table above when 1 label per class is given, and the second row for all other situations; the last hyperparameter is fixed at 0.55 for all settings. The best number of hops $k$ for the subgraph embedding in the contrastive loss $\mathcal{L}_c$ depends on the edge density of the input graph: we use one setting of $k$ for the edge-sparse graphs {Cora, Citeseer, Dblp, Coauthor_CS}, and different settings for Wikics and for Coauthor_Phy.
Implementation Details.
When training InfoGNN, we first pre-train the network to generate reliable predictions using Eq. (18) for 200 epochs, and then proceed with formal training using the full loss function of Eq. (19) for another 200 epochs. During formal training, to keep the model steady, we update the pseudo-label set every 5 epochs using Eq. (10). When updating the pseudo-label set, we use the mean scores of unlabeled nodes over the last 10 training epochs, rather than the current prediction and informativeness scores. Our framework is implemented in PyTorch. All experiments are run on a machine with an Intel(R) Xeon(R) Gold 6126 @ 2.60GHz CPU and two Nvidia Tesla V100 32GB cards with CUDA version 10.2.
Method  Cora  Citeseer  

1  3  5  10  15  20  1  3  5  10  15  20  
GCN  0.416  0.615  0.683  0.742  0.784  0.797  0.379  0.505  0.568  0.602  0.660  0.682 
SuperGCN  0.522  0.673  0.720  0.760  0.788  0.799  0.499  0.610  0.665  0.700  0.706  0.712 
GMI  0.502  0.672  0.715  0.757  0.783  0.797  0.497  0.568  0.621  0.632  0.670  0.683 
SSGCNclu  0.407  0.684  0.739  0.776  0.797  0.810  0.267  0.388  0.507  0.616  0.634  0.647 
SSGCNcomp  0.451  0.609  0.676  0.741  0.772  0.794  0.433  0.547  0.638  0.682  0.692  0.709 
SSGCNpar  0.444  0.649  0.692  0.734  0.757  0.770  0.457  0.578  0.643  0.693  0.705  0.716 
Cotraining  0.533  0.661  0.689  0.741  0.764  0.774  0.383  0.469  0.563  0.601  0.640  0.649 
Selftraining  0.399  0.608  0.693  0.761  0.789  0.793  0.324  0.463  0.526  0.647  0.683  0.685 
Union  0.505  0.663  0.713  0.764  0.792  0.797  0.366  0.491  0.560  0.631  0.663  0.667 
Intersection  0.408  0.596  0.674  0.736  0.770  0.775  0.337  0.497  0.582  0.671  0.694  0.699 
M3S  0.439  0.651  0.688  0.754  0.763  0.789  0.307  0.515  0.635  0.674  0.683  0.695 
DSGCN  0.596  0.712  0.745  0.777  0.792  0.795  0.463  0.613  0.652  0.674  0.681  0.684 
InfoGNN  0.602  0.737  0.775  0.792  0.814  0.829  0.541  0.654  0.717  0.723  0.725  0.734 
Method  Dblp  Wikics  
1  3  5  10  15  20  1  3  5  10  15  20  
GCN  0.469  0.583  0.628  0.652  0.688  0.718  0.384  0.548  0.639  0.682  0.713  0.721 
SuperGCN  0.472  0.583  0.685  0.708  0.729  0.738  0.399  0.552  0.599  0.683  0.712  0.721 
GMI  0.544  0.597  0.656  0.728  0.739  0.754  0.325  0.484  0.546  0.654  0.683  0.700 
SSGCNclu  0.369  0.528  0.649  0.692  0.721  0.744  0.335  0.579  0.627  0.694  0.714  0.725 
SSGCNcomp  0.458  0.525  0.598  0.634  0.674  0.707             
SSGCNpar  0.418  0.545  0.639  0.683  0.708  0.733  0.332  0.593  0.659  0.706  0.732  0.740 
Cotraining  0.545  0.646  0.634  0.674  0.703  0.701  0.367  0.584  0.645  0.692  0.724  0.737 
Selftraining  0.437  0.580  0.634  0.707  0.738  0.759  0.350  0.602  0.655  0.701  0.725  0.738 
Union  0.485  0.618  0.652  0.712  0.737  0.746  0.351  0.584  0.646  0.694  0.723  0.740 
Intersection  0.458  0.581  0.566  0.665  0.715  0.734  0.359  0.599  0.654  0.706  0.726  0.740 
M3S  0.547  0.635  0.672  0.733  0.749  0.752  0.401  0.593  0.621  0.685  0.711  0.734 
DSGCN  0.587  0.671  0.720  0.738  0.744  0.764  0.414  0.607  0.635  0.705  0.716  0.728 
InfoGNN  0.597  0.669  0.748  0.765  0.772  0.789  0.462  0.611  0.649  0.723  0.740  0.742 
Method  Coauthor_cs  Coauthor_phy  
Labels/class  1  3  5  10  15  20  1  3  5  10  15  20  
GCN  0.642  0.800  0.847  0.893  0.901  0.909  0.699  0.851  0.868  0.901  0.912  0.918 
SuperGCN  0.668  0.841  0.869  0.895  0.897  0.897  0.688  0.848  0.891  0.908  0.923  0.923 
GMI  OOM  OOM  OOM  OOM  OOM  OOM  OOM  OOM  OOM  OOM  OOM  OOM 
SSGCN-Clu  0.770  0.886  0.890  0.905  0.908  0.911  0.889  0.923  0.930  0.935  0.936  0.936 
SSGCN-Comp  0.711  0.858  0.888  0.904  0.907  0.909  0.798  0.892  0.904  0.927  0.921  0.928 
SSGCN-Par  0.737  0.860  0.881  0.898  0.901  0.903  0.824  0.915  0.919  0.925  0.931  0.931 
Co-training  0.643  0.745  0.810  0.849  0.864  0.885  0.758  0.842  0.850  0.898  0.891  0.917 
Self-training  0.592  0.770  0.828  0.873  0.892  0.895  0.744  0.865  0.890  0.908  0.914  0.921 
Union  0.621  0.772  0.812  0.856  0.864  0.885  0.750  0.855  0.870  0.908  0.902  0.910 
Intersection  0.650  0.775  0.851  0.887  0.893  0.898  0.612  0.763  0.854  0.901  0.904  0.926 
M3S  0.648  0.818  0.879  0.897  0.909  0.912  0.828  0.868  0.895  0.914  0.922  0.930 
DSGCN  0.743  0.829  0.863  0.879  0.883  0.892  0.781  0.812  0.862  0.896  0.908  0.916 
InfoGNN  0.682  0.866  0.892  0.906  0.913  0.918  0.842  0.924  0.934  0.938  0.942  0.942 
5.4 Comparison with State-of-the-art Baselines
Table 3 reports the mean Micro-F1 scores of our method and all baselines with respect to various label rates. For each setting, the best performer is highlighted in bold and the second best is underlined. On the whole, our proposed InfoGNN outperforms the baseline methods by a large margin over almost all settings, and achieves clear average performance improvements over GCN on the six datasets at each label rate. With the help of our pseudo-labeling method, InfoGNN achieves better results with fewer labeled nodes. In particular, by leveraging fewer than 10 labels per class, InfoGNN matches the Micro-F1 scores that GCN achieves using 20 labels per class on the six datasets. The self-supervised methods exhibit unstable performance across datasets. For example, although SSGCN-Clu obtains strong results on Coauthor_cs/phy, it performs relatively worse on the other four datasets, and SSGCN-Comp even fails to work on the Wikics dataset. This is likely because their pretext tasks are not always able to produce representations that generalize well on graphs with different properties. The table also shows that, although most baseline methods gradually lose their edge as more labels become available for training, our proposed method still achieves relatively better classification accuracy. Taking 20 labels per class as an example, when all baselines hardly improve classification accuracy over GCN, our method still achieves further gains. This indicates that our algorithm effectively relieves the information redundancy problem even when label information is relatively sufficient.
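For reference, the Micro-F1 metric reported in Table 3 pools true positives, false positives, and false negatives over all classes before computing a single F1 score; for single-label multi-class classification it coincides with overall accuracy. A minimal sketch (the function name is ours):

```python
def micro_f1(y_true, y_pred, num_classes):
    """Micro-averaged F1: pool TP/FP/FN across all classes, then
    compute one precision/recall/F1 from the pooled counts."""
    tp = fp = fn = 0
    for c in range(num_classes):
        for t, p in zip(y_true, y_pred):
            tp += (t == c and p == c)  # booleans sum as 0/1
            fp += (t != c and p == c)
            fn += (t == c and p != c)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# For single-label predictions, micro-F1 equals plain accuracy:
# 3 of 4 predictions correct -> 0.75
assert abs(micro_f1([0, 1, 2, 2], [0, 2, 2, 2], 3) - 0.75) < 1e-9
```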
Method  Cora  Citeseer  Dblp  Wikics  Coauthor_cs  Coauthor_phy  
Labels/class  3  10  3  10  3  10  3  10  3  10  3  10  
GCN  0.615  0.742  0.505  0.602  0.583  0.652  0.548  0.682  0.800  0.893  0.851  0.901 
InfoGNN-I  0.683  0.763  0.582  0.694  0.599  0.739  0.548  0.695  0.823  0.887  0.885  0.928 
InfoGNN-IT  0.696  0.791  0.589  0.723  0.619  0.768  0.586  0.726  0.827  0.892  0.899  0.941 
InfoGNN-ITS  0.720  0.792  0.624  0.728  0.645  0.766  0.592  0.723  0.826  0.886  0.905  0.941 
InfoGNN  0.737  0.792  0.654  0.723  0.669  0.765  0.611  0.723  0.866  0.906  0.924  0.942 
5.5 Ablation Study
To further analyze how the different components of the proposed method take effect, we conduct a series of ablation experiments. Due to space limits, we only report experimental results for the settings where 3 and 10 nodes are labeled per class. The ablations are designed as follows:

InfoGNN-I: only the contrastive loss is applied on top of GCN, which is used to evaluate the role of the contrastive loss;

InfoGNN-IT: both the contrastive loss and the GCE loss are applied, which is used to evaluate the impact of the GCE loss by comparison with InfoGNN-I. Note that only the prediction confidence score is used for candidate selection here;

InfoGNN-ITS: on the basis of InfoGNN-IT, the informativeness score, i.e., Eq. (10), is also applied for candidate selection, which tests the efficacy of the informativeness score by comparison with InfoGNN-IT. The impact of the class-balanced regularization loss can then be revealed by comparing InfoGNN-ITS with the full InfoGNN.
The ablation results are reported in Table 4. From this table, we can see that each component of InfoGNN makes its own contribution, though the contributions differ between the two settings. The contrastive loss makes a similar contribution in both settings, achieving average improvements of 3.6% and 3.9% over GCN on the six datasets. This indicates that it produces node embeddings beneficial to node classification by maximizing the MI between nodes and their neighborhood. On top of the contrastive loss, applying the GCE loss further boosts performance. Taking Dblp and Wikics as examples, it boosts accuracy by 1.6% and 3.8% respectively with 3 given labels per class, and by 2.9% and 3.1% with 10 given labels per class. This confirms the effectiveness of the GCE loss. With regard to the CBR, it improves performance by 2.48% on average, and by nearly 4% on Coauthor_cs when 3 labels per class are given. Yet it contributes less with 10 given labels per class than with 3, which is in line with our expectation: at lower label rates, the limited label information makes the GNN more prone to generating class-imbalanced predictions, so the contribution of the CBR is more remarkable. A similar effect can be seen in the contribution of candidate selection using representativeness scores. When few labels are given, the selected nodes bring in relatively more auxiliary information. In contrast, when more labels are available, a larger number of unlabeled nodes can be selected due to our slack pseudo-labeling constraints, which, to some extent, counteracts the effect of representativeness scoring.
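The MI-maximizing behavior of the contrastive loss can be illustrated with a DGI-style binary objective, in which true node/neighborhood pairs are scored as positives and corrupted pairs as negatives. This is a sketch of the general technique only, not the exact loss used by InfoGNN:

```python
import math

def contrastive_mi_loss(pos_scores, neg_scores):
    """Binary cross-entropy over discriminator scores, a common lower
    bound on the MI between a node and its neighborhood summary.
    pos_scores: discriminator scores of true (node, neighborhood) pairs;
    neg_scores: scores of corrupted (shuffled) pairs."""
    def log_sigmoid(x):
        return -math.log1p(math.exp(-x))
    pos = sum(log_sigmoid(s) for s in pos_scores) / len(pos_scores)
    neg = sum(log_sigmoid(-s) for s in neg_scores) / len(neg_scores)
    return -(pos + neg)

# A discriminator that separates true from corrupted pairs yields a
# lower loss than an uninformative one:
assert contrastive_mi_loss([4.0, 5.0], [-4.0, -5.0]) < contrastive_mi_loss([0.0], [0.0])
```

Minimizing this loss drives the discriminator to distinguish real neighborhoods from corrupted ones, which is what tightens the MI lower bound.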
5.6 Sensitivity Analysis
We also conduct experiments to test the impact of the hyperparameters on the performance of InfoGNN. We take turns testing the effect of each hyperparameter while fixing the values of the rest. Due to space limits, we only report the results on Cora, Citeseer and Dblp.
The first hyperparameter controls the contribution of the contrastive loss to the total loss. Its impact on model performance is shown in Fig. 2. With 3 given labels per class, we find that a larger weight leads to better performance up to a certain value, after which performance stays at a good level with only slight changes. With 10 labels per class, except on Dblp, varying the weight does not greatly impact performance on Cora and Citeseer. This indicates that, when label information is very limited, our model requires stronger structural regularization to help generate discriminative node embeddings. On the other hand, when label information is relatively sufficient, network training is dominated by the supervised loss from the given labels. Thus, the contrastive loss mainly takes effect when given labels are scarce.
Fig. 3 shows performance comparisons for different values of the CBR weight. A trend similar to that of the contrastive loss weight can be observed in both settings. With only 3 labels per class, the class-imbalance problem is more likely to occur during pseudo-labeling. Thus, our model favors a larger weight to regularize the number of pseudo labels in each class to be relatively equal, as shown in Fig. 3(a). As the weight increases from 0.1 to 1.0, our model boosts its classification accuracy by around 3% on Citeseer and Cora. When 10 labels are given, as more label information can be exploited, the class-imbalance problem is less likely to arise. Hence, changing the weight does not have much impact on model performance.
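One common way to implement such a class-balanced regularizer, shown here purely as an illustrative assumption rather than the paper's exact formulation, is to penalize the KL divergence between the average predicted class distribution over pseudo-labeled nodes and the uniform distribution:

```python
import math

def class_balance_reg(probs):
    """KL(mean predicted distribution || uniform). Zero when the
    average prediction is perfectly balanced across classes, and
    grows as predictions collapse onto a few classes.
    probs: list of per-node class-probability vectors."""
    k = len(probs[0])
    mean = [sum(p[c] for p in probs) / len(probs) for c in range(k)]
    return sum(m * math.log(m * k) for m in mean if m > 0)

balanced = [[0.5, 0.5], [0.5, 0.5]]
skewed = [[0.9, 0.1], [0.9, 0.1]]
assert class_balance_reg(balanced) < 1e-9   # no penalty when balanced
assert class_balance_reg(skewed) > 0.3      # penalized when skewed
```

Adding this term to the training objective pushes the pseudo-label distribution back toward balance, which matches the behavior discussed above at low label rates.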
The next hyperparameter is the generalization coefficient in the GCE loss. Fig. 4(a) illustrates how model performance changes as this coefficient increases when one label per class is given. We can see that, as it rises, the performance of our method gradually increases on the three datasets. This is because the severe lack of label information makes noise in the pseudo labels more probable, and a larger coefficient is able to decay the gradient updates on unreliable samples that have lower prediction probabilities. This reduces the sensitivity of our model to incorrect pseudo labels, leading to better performance. On the other hand, as the coefficient approaches zero, the GCE loss approaches the standard cross entropy (SCE), and the model suffers a significant performance degradation. This further proves the superiority of the GCE loss over the SCE loss when only few labels are given for training.
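The generalized cross entropy of Zhang and Sabuncu takes the form L_q(p_y) = (1 - p_y^q)/q for the predicted probability p_y of the assigned label: as q approaches 0 it recovers the standard cross entropy, and at q = 1 it becomes the mean absolute error. A minimal sketch of this interpolation:

```python
import math

def gce_loss(p_y, q):
    """Generalized cross entropy: (1 - p_y**q) / q.
    Its gradient w.r.t. p_y is -p_y**(q - 1), a factor of p_y**q
    smaller than the CE gradient -1/p_y, so low-confidence (likely
    noisy) pseudo labels update the model less aggressively."""
    return (1.0 - p_y ** q) / q

p = 0.3
# As q -> 0, GCE approaches the standard cross entropy -log(p_y):
assert abs(gce_loss(p, 1e-8) - (-math.log(p))) < 1e-6
# At q = 1, GCE is the mean absolute error 1 - p_y:
assert gce_loss(p, 1.0) == 1.0 - p
```

This is why a larger coefficient dampens gradients from unreliable pseudo labels, matching the trend in Fig. 4(a).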
The final hyperparameter is the threshold on the candidate selection score, which controls how many unlabeled nodes are selected for pseudo-labeling. Fig. 4(b) depicts the performance changes when varying this threshold with one given label per class. As the figure shows, a medium threshold achieves the best accuracy, while a threshold that is either too small or too large undermines model performance.
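The selection step itself can be sketched as follows, where each unlabeled node's candidate score combines its prediction confidence with its informativeness score; the multiplicative combination and the names used here are our assumption for illustration:

```python
def select_pseudo_candidates(confidence, informativeness, threshold):
    """Return indices of unlabeled nodes whose combined score clears
    the threshold; these nodes receive pseudo labels for retraining."""
    scores = [c * r for c, r in zip(confidence, informativeness)]
    return [i for i, s in enumerate(scores) if s >= threshold]

# With a higher threshold, fewer (but more reliable) nodes qualify:
conf = [0.95, 0.60, 0.85]
info = [0.90, 0.90, 0.50]
assert select_pseudo_candidates(conf, info, 0.8) == [0]
assert select_pseudo_candidates(conf, info, 0.4) == [0, 1, 2]
```

This makes the trade-off in Fig. 4(b) concrete: a small threshold admits many noisy pseudo labels, while a large one leaves too little auxiliary supervision.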
6 Conclusion
In this paper, we propose an informativeness-augmented pseudo-labeling framework, called InfoGNN, to address semi-supervised node classification with few labels. We argue that existing pseudo-labeling approaches on GNNs suffer from two major pitfalls: information redundancy and noisy pseudo labels. To address these issues, we design a representativeness measuring method to assess node informativeness based on MI estimation and maximization. Taking both informativeness and prediction confidence into consideration, more informative unlabeled nodes are selected for pseudo-labeling. We then propose a generalized cross entropy loss on pseudo labels to mitigate the negative effect of unreliable pseudo labels. Furthermore, we propose a class-balanced regularization in response to the potential class-imbalance problem caused by pseudo-labeling. Extensive experimental results and ablation studies verify the effectiveness of the proposed framework and demonstrate its superior performance over state-of-the-art baselines, especially under very few-label settings.