Feature-Attention Graph Convolutional Networks for Noise Resilient Learning

12/26/2019 ∙ by Min Shi, et al. ∙ Florida Atlantic University

Noise and inconsistency commonly exist in real-world information networks, due to the inherent error-prone nature of human input or user privacy concerns. To date, tremendous efforts have been made to advance feature learning from networks, including the most recent graph convolutional networks (GCN) and attention GCN, by integrating node content and topology structures. However, all existing methods consider networks as error-free sources and treat the feature content of each node as independent and equally important for modeling node relations. Erroneous node content, combined with sparse features, poses essential challenges for applying existing methods to real-world noisy networks. In this paper, we propose FA-GCN, a feature-attention graph convolution learning framework, to handle networks with noisy and sparse node content. To tackle noisy and sparse content in each node, FA-GCN first employs a long short-term memory (LSTM) network to learn a dense representation for each feature. To model interactions between neighboring nodes, a feature-attention mechanism is introduced to allow neighboring nodes to learn and vary feature importance with respect to their connections. Through a spectral-based graph convolution aggregation process, each node is allowed to concentrate on the neighborhood features most relevant to the corresponding learning task. Experiments and validations w.r.t. different noise levels demonstrate that FA-GCN achieves better performance than state-of-the-art methods on both noise-free and noisy networks.




I Introduction

Many real-world applications involve knowledge mining and analysis of network or graph-based data, such as citation networks, social networks, telecommunication networks, and biological networks, where data are often collected from noisy channels with erroneous/inconsistent labels or features [11]. To carry out pattern mining on networks, such as community detection [22], node classification [28], and link prediction [30], network representation learning (or embedding learning) [29] is commonly used to construct features to represent nodes for learning.

To capture node relations, early network embedding learning mainly focuses on topology features [16, 22], where nodes sharing similar topology structures are enforced to have close feature representations. For example, two scholarly publications citing the same set of literature in a citation network would be represented by similar feature vectors [8], and two users interacting with many common friends in a social network would share similar features in the learned representation space [13]. However, structure-based methods can only model the explicit node relations already reflected by the network edges; they fail to capture implicit relationships between nodes because of sparse graph connections. For example, two users in a social network may lack an immediate link not because they are not friends in reality, but because they are unaware of each other's existence online. To mitigate this problem, recent studies propose to embed the content information associated with a network to enhance node structure modeling [26, 12]. Indeed, networks with rich textual content are ubiquitous in the real world, such as citation networks and Wikipedia networks, where nodes are usually described by substantial text. In general, content features are able to reveal relationships between nodes aligned with the network structures (e.g., two nodes with many shared content features are highly likely to form a neighborhood) [2, 23], but in a more fine-grained and interpretable fashion, i.e., the affinity between two linked nodes can be measured by the number of shared content features rather than by just a single edge in the graph.

In addition to the above adjacency-matrix or random-walk based network embedding learning, the spectral-based graph convolutional networks (GCN) [7, 18] have recently shown impressive performance in directly embedding graphs with rich content features via semi-supervised node classification training. GCN relies on the assumption that each node tends to have the same label as its neighbors, which is enforced by aggregating features from all neighborhood nodes, where different features are typically treated as independent and equally important. However, such a learning mechanism may be challenged by the following two realities. First, in graphs (e.g., citation networks and Wikipedia networks) where contents are word sequences, each word feature usually does not represent a complete semantic alone but correlates with others to form a unique meaning [15], i.e., the meaning of a sentence is usually conveyed not by every single word, but by the context of all words having dependencies or correlations with each other. Second, while nodes that have connections are assumed to have dependencies through their content features, not all features function equally to trigger interactions between nodes, i.e., although a research publication may have rich text information such as a title and abstract, many words are actually not distinguishing features that reveal its citation relationships with others.

Indeed, existing methods have made significant progress in network embedding learning, but they all consider networks as error-free sources and treat features in each node's content as independent and equally important while modeling node relations. Ignoring the impact of erroneous content, combined with sparse node features, poses essential challenges for existing methods to handle real-world noisy networks. In fact, a recent work has empirically validated various embedding approaches on sparse and noisy knowledge graphs, and concluded that "embeddings are sensitive to sparse and unreliable data" [14].

In summary, existing methods are sensitive to noise and ineffective on sparse content mainly because of the following challenges.

  • Noise Propagation: When noise is imposed on a node (e.g., incorrect words), it forces existing methods, such as GCN or attention networks, to learn deteriorated weight values corresponding to the noisy features. Such noise propagation directly deteriorates network embedding results, as we will show in Section 5.

  • Feature Interaction: While existing methods take node content into consideration, they treat all features equally for embedding learning. In reality, features interact differently with respect to neighboring nodes, and should be differentiated when learning each node's representation.

  • Sparsity and Dimensionality: Most networks have high-dimensional and sparse node content (e.g., each node contains only about 1% of the features in the whole feature space). The impact of noise in a high-dimensionality, sparse-content setting is even more severe, because the underlying models are highly vulnerable to errors.

To address the aforementioned problems, we propose a novel feature-based attention GCN (FA-GCN) model to perform noise resilient learning for networks with sparse and noisy node content. Figure 1 shows an illustrative example of our proposed approach, where each content feature is first represented as a dense semantic vector, with feature correlations/dependencies preserved by a bi-directional long short-term memory (LSTM) network. In other words, the representation learning for each feature depends on the semantic representations of the other features in a node's content. Meanwhile, to minimize the impact of noisy content features, we introduce an attention layer over the LSTM network to determine the importance of various neighborhood features, aligned with the corresponding node classification task. As a result, noisy features/words in a node receive reduced attention values, minimizing their impact and resulting in noise-resilient learning.

It is worth noting that our work is different from the recently proposed graph attention networks (GAT) [21], wherein features are modeled as independent and attention is calculated at the node level. In comparison, we argue that features in a node's content can interact with each other to reveal richer and more accurate node semantics, i.e., the same word feature may have different meanings in different contexts, and different word features may indicate the same meaning within a similar context. Meanwhile, the node-level attention in GAT assumes that all features in the node contents contribute equally to edge connections, whereas our feature-level attention enables differentiation of the relevant features triggering node interactions in a network. Specifically, our main contributions are as follows:

  • We propose to model node relations at the feature level, where each node's interactions with different neighbors are attributed to the most influential node features.

  • We propose to model feature correlations for enhanced node representation learning and classification, which is more reasonable for graphs where node contents or features are word sequences.

  • We propose a noise-resilient learning framework for networks with sparse and noisy text features. It models feature correlations with a bi-directional LSTM network and, meanwhile, conducts differentiable neighborhood feature aggregation via an attention layer on top of the LSTM network.

The rest of the paper is organized as follows. Section 2 discusses related work, followed by problem definition and preliminaries related to LSTM network and spectral-based graph convolutional networks in Section 3. The proposed algorithm is described in Section 4, and experiments and comparisons are reported in Section 5. Finally, Section 6 concludes the paper.

II Related Work

Graph representation learning [29, 24] aims to represent each node of a target network as a low-dimensional vector, such that various downstream analytic tasks can benefit. Early work in the area mainly focuses on shallow neural models that preserve only the node structures [29]. To capture high-order neighborhood relationships between nodes, DeepWalk [13] performs a random walk process over the whole graph to generate a collection of fixed-length node sequences similar to natural language sentences. It then explores a widely used neural model, Skip-Gram [10], to learn node representations from these node sequences. However, Node2vec [5] demonstrates that DeepWalk does not fully preserve the connectivity patterns between nodes, and thus proposes to combine breadth-first and depth-first sampling in the random walk process, so that the community properties between nodes can be well preserved. LINE [19] is proposed for large-scale network representation learning by preserving the first- and second-order node relations, where the first-order relation is determined by immediate links and the second-order relation between two nodes is created by their shared neighbors. However, in addition to the complex graph structures that encode node relations, graphs are usually associated with rich content information, such as attributes and texts, that also reveals the affinities between nodes [8, 20]. For example, the relational topic model (RTM) [8] is utilized to model both the documents and link relationships, assuming that documents with links also have similar topic distributions and semantic representations. TADW [26] leverages rich texts to enhance structure-based representation learning, based on a matrix factorization method equivalent to DeepWalk. TriDNR [12] integrates node structure, content and labels in a unified framework, which enforces node representations to be learned simultaneously from the network structure and the text content under shared model parameters.

Because shallow models have limitations in learning complex relational patterns between nodes [24], an increasing number of efforts explore graph neural networks, which take a graph as input and learn node representations through a supervised training process [17]. Recently, inspired by the huge success of convolutional neural networks on grid-like data such as images, many works have emerged that seek to adopt a similar convolutional feature extraction process directly on arbitrarily structured data such as graphs [4, 1]. To date, graph convolutional networks (GCN) [7] appear to achieve state-of-the-art performance in many graph-related analytic tasks, as they can naturally learn node representations from graph structures and contents. For example, Schlichtkrull et al. [18] proposed the relational GCN and applied it to two standard knowledge base completion tasks: link prediction and entity classification. Yao et al. [28] proposed the Text GCN for text classification, where the text graph is built based on word co-occurrence and document-word relations. Yan et al. [25] proposed the spatial-temporal GCN for skeleton-based action recognition. It is formulated on top of a sequence of skeleton graphs, where each node corresponds to a joint of the human body and edges represent the connectivity between joints. In general, while a node may connect with many others in a graph, different neighbors may contribute differently when generating the representation of this node. Graph attention networks [21] were proposed to solve this problem by using a self-attention strategy to assign large weights to important neighbors for feature aggregation. However, since interactions between nodes usually result from fine-grained features [20], node-level attention may be insufficient to characterize node relations.

In comparison, all existing work considers networks as quality sources without accounting for noise or errors in the networks. As a recent study [14] has shown that all embedding methods suffer significantly from "sparse and unreliable data", we propose a feature-attention mechanism to differentiate and aggregate neighbor features for noise resilient learning.

Fig. 1: An illustrative example of the proposed approach: Feature representation is used to explore feature correlation and learn a dense vector for each feature. Feature attention is used to differentiate feature interaction between each node and its neighboring node features, allowing better feature aggregation for noise resilient embedding learning.

III Problem Definition & Preliminaries

III-A Problem Definition & Motivation

A network is represented as G = (V, E), where V is a set of unique nodes and E is a set of edges. Let A denote the adjacency matrix representation of the edges, with A_ij = 1 if (v_i, v_j) ∈ E and A_ij = 0 otherwise. Let D and I be the degree matrix and identity matrix, respectively, where the diagonal entry D_ii is calculated by D_ii = Σ_j A_ij. For each node v_i, we use C_i to denote its content, which is a sequence of word features represented by {w_1, w_2, …, w_{|C_i|}}. For all nodes in V, their contents form the content corpus C. We use |W| to denote the vocabulary (or number of unique words) in the content corpus. It is common that each node has very sparse content, so |C_i| ≪ |W|.
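To make the notation concrete, here is a minimal numpy sketch (ours, not the paper's code) that builds the adjacency matrix A, degree matrix D, and identity matrix I for a toy 4-node network:

```python
import numpy as np

# Toy undirected network with n = 4 nodes and edges E = {(0,1), (1,2), (2,3)}.
edges = [(0, 1), (1, 2), (2, 3)]
n = 4

A = np.zeros((n, n))        # adjacency matrix: A[i, j] = 1 iff (v_i, v_j) in E
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

D = np.diag(A.sum(axis=1))  # degree matrix: D[i, i] = sum_j A[i, j]
I = np.eye(n)               # identity matrix, used later for self-loops
```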

In this paper, we refer to noise as erroneous node content, where the content of each node contains some errors (e.g., erroneous feature values or words). Under this sparse and noisy node content setting, our goal is to learn a good feature representation for each node in the network for classification.

In order to tackle sparse and noisy node content, our motivation, as shown in Figure 1, is to employ an optimization approach to address the sparsity and errors: (1) learning a dense vector to represent each content word, and (2) using feature attention to learn weight values for each word based on node-node interactions, and then using feature attention to aggregate neighboring nodes for noise resilient representation learning of each node.

As shown in Figure 1, node v_i has three neighbors, each containing some content words. Our first learning objective is to learn a dense feature vector for each word. Then feature attention is used to learn weight values that quantify node v_i's feature-level interactions with each neighbor on each word. After that, feature aggregation is used to aggregate all of v_i's neighbors to learn a good feature representation for v_i. It is worth noting that the learning of feature/word representations and node representations is carried out simultaneously under a unified optimization goal, as we will detail in Eq. (28).

III-B Long Short-Term Memory

To learn vector representations for words, long short-term memory (LSTM) networks [6] have achieved significant success [3, 27], thanks to their recurrent learning capacity on sequential data like text. In an LSTM, features are not modeled independently but can interact with each other through the memory and state transmission mechanisms. When two reverse-order LSTM networks are combined, each feature is allowed to semantically correlate with any other feature within the same sequence. Figure 2 shows the interior structure of an LSTM unit/cell [6], where f_t, i_t and o_t are the forget, input and output gates, respectively. h_t denotes the cell output at time t, and c_t is the global cell state that enables the sharing of different cell outputs throughout the LSTM network. Features are sequentially fed into the LSTM network, where the corresponding parameters for a feature x_t at time t are updated by:


f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)

where W_f, W_i, W_o and W_c are weight parameters for the corresponding gates, and b_f, b_i, b_o and b_c are their biases, respectively.
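The gate updates above can be sketched as a single step in plain numpy; the stacked layout of the weight tensor W and bias b below is our own assumption for compactness, not something the paper prescribes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # z concatenates the previous output and the current feature input.
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W[0] @ z + b[0])       # forget gate f_t
    i = sigmoid(W[1] @ z + b[1])       # input gate i_t
    o = sigmoid(W[2] @ z + b[2])       # output gate o_t
    c_hat = np.tanh(W[3] @ z + b[3])   # candidate cell state
    c_t = f * c_prev + i * c_hat       # updated global cell state
    h_t = o * np.tanh(c_t)             # cell output at time t
    return h_t, c_t

rng = np.random.default_rng(0)
d_in, d_h = 5, 3
W = rng.normal(size=(4, d_h, d_h + d_in)) * 0.1  # stacked gate weights (f, i, o, c)
b = np.zeros((4, d_h))                           # stacked gate biases
h_t, c_t = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), W, b)
```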

III-C Graph Convolutional Networks

GCN is an efficient variant of convolutional neural networks operating directly on graphs [7] by encoding both the graph structures and node features. Given a network G with n vertices, where each node has m-dimensional feature values (X ∈ R^{n×m} denotes the feature matrix), GCN intends to learn low-dimensional node representations through a convolutional learning process. In general, with one convolutional layer, GCN is able to preserve the first-order neighborhood relations between nodes, where each node is represented as a d-dimensional vector and the node feature matrix H^(1) is computed by:

H^(1) = σ(Â X W^(0))

Fig. 2: Structure of an LSTM unit/cell.
Fig. 3: The proposed FA-GCN model. Feature representations of nodes are dynamically learned by the LSTM network with attention mechanisms. Then, with the spectral-based convolutional filter, each node can gather the most important content features from itself and its neighbors to form the node representation. The feature and node representations are trained in a unified manner to optimize the collective classification objective of Eq. (28).

where Â = D̃^{-1/2}(A + I)D̃^{-1/2} is the normalized symmetric adjacency matrix, X is the initial input feature matrix of G, and W^(0) ∈ R^{m×d} is the weight matrix for the first convolutional layer. σ(·) is an activation function such as the ReLU, represented by σ(x) = max(0, x). If it is necessary to encode higher-order (e.g., k-hop) neighborhood relationships, one can easily stack multiple GCN layers, where the output node features of the (l+1)th (0 ≤ l ≤ k−1) layer are computed by:

H^(l+1) = σ(Â H^(l) W^(l))

where W^(l) ∈ R^{d_l × d_{l+1}} (or R^{d_l × c} if it is the last layer, with c the number of classes) is the weight matrix for the lth layer.
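A minimal numpy sketch of this propagation rule, assuming the usual renormalization trick with self-loops (function names are ours):

```python
import numpy as np

def normalize_adj(A):
    """Symmetric normalization with self-loops: D~^{-1/2} (A + I) D~^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_norm, H, W, last=False):
    """One spectral GCN layer: H^{l+1} = sigma(A_norm H^l W^l).

    ReLU for hidden layers; identity for the last layer (a softmax
    classifier is typically applied separately)."""
    Z = A_norm @ H @ W
    return Z if last else np.maximum(Z, 0.0)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # 3-node path graph
X = rng.normal(size=(3, 8))                                   # node feature matrix
H1 = gcn_layer(normalize_adj(A), X, rng.normal(size=(8, 4)))
H2 = gcn_layer(normalize_adj(A), H1, rng.normal(size=(4, 2)), last=True)
```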

IV The Proposed Approach

Existing GCN-based methods model node features as independent (e.g., using a one-hot representation of content features, where each element in the feature matrix indicates whether a corresponding feature (e.g., a word) appears or not), making them sensitive to noisy and sparse input.

In order to tackle high-dimensional, sparse, and noisy node content, we propose to represent each single node feature as a dense semantic vector, so that various features can influence each other through interactions of their semantic vectors during learning. In addition, each feature is assigned an importance weight, which allows each node to aggregate important neighbor features for representation learning.

The proposed feature attention-based GCN (FA-GCN) model for the above two purposes is shown in Figure 3. First, the feature representations are learned by a bi-directional LSTM network, where feature correlations are preserved. Then, with an attention layer on top, each feature is weighted based on its importance to the target node classification task. In this paper, we investigate the effectiveness of two types of feature attention mechanisms, which either adopt a self-transformation process or introduce a context-aware bilinear term. Finally, each node dynamically aggregates a weighted sum of neighborhood features to form its representation. The feature representations and node representations are learned integrally and can enhance each other to optimize a collective classification loss at the end of this framework.

IV-A Feature Representation Learning

Network node features contain rich semantics, and they frequently correlate with each other to trigger complex node connections in a graph. For example, a word feature may have different semantics when correlating with different words under different sentence contexts, and these fine-grained semantics can help differentiate the links of the corresponding node with its different neighbors.

As shown in Figure 3, we use a bi-directional LSTM network to learn the representation of each feature from the content corpus C. Let the content features of node v_i be represented by {w_1, w_2, …, w_{|C_i|}}; we initialize the representations of these features in the input layer following a uniform range distribution. Assume the input semantic vector for feature w_t at time t is represented by x_t; it then undergoes a series of non-linear transformations in temporal order, following the LSTM gate and cell-state updates, to produce the forward output:

→h_t = LSTM(x_t, →h_{t-1}, →c_{t-1})     (14)

where x_t represents the feature at time t. Then, we use the element-wise sum to combine the outputs of feature w_t from the two opposite LSTM layers to form its final semantic vector representation:

e_t = →h_t + ←h_t     (16)

where ←h_t represents the vector output of w_t from the opposite temporal order, which is calculated in a similar way as Eq. (14).

IV-B Feature-Attention Mechanisms

Attention mechanisms have been widely used in many sequence-based tasks such as sentiment classification [27] and machine translation [9]; they are favorable designs allowing models to learn alignments between different modalities, i.e., to focus on the most relevant neighborhood features that are helpful to the node classification task. In this paper, we investigate two types of attention mechanisms, both at the feature level.
Attention 1. Let the semantic vectors of all features in node content C_i be represented by E = [e_1, e_2, …, e_{|C_i|}], where each e_t is calculated from Eq. (16). Inspired by a recent attention design that aims to capture the most important/relevant words in a given sentence for relation classification [31], we calculate a weight vector α for all features in C_i by:

α = softmax(u^T tanh(E))

where α_t is the attention weight for feature w_t, tanh(E) is the non-linear transformation of E, u is a trained parameter vector shared across all nodes, and u^T is its transpose.
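This self-transformation attention can be sketched in numpy as follows (the dimensions and the helper name `self_attention` are our own illustrative assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # stabilized softmax
    return e / e.sum()

def self_attention(E, u):
    """Attention 1: alpha = softmax(u^T tanh(E)), one weight per feature.

    E is the (d x T) matrix of feature semantic vectors from the Bi-LSTM;
    u is the trained parameter vector shared across all nodes."""
    return softmax(u @ np.tanh(E))

rng = np.random.default_rng(0)
E = rng.normal(size=(6, 4))   # 4 content features, 6-dim semantic vectors
u = rng.normal(size=6)
alpha = self_attention(E, u)  # one attention weight per feature
```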

Attention 2. The above way of obtaining attention scores adopts a simple self-transformation without considering the neighborhood relationships. Therefore, we propose a context-aware attention calculation, which determines the importance of neighborhood features by taking the corresponding context node into consideration. In Figure 3, assume all neighbors of node v_i (each node also aggregates information from itself) are represented by a collection, and that they share a contextual semantic vector c_i computed by the element-wise sum of all individual feature vectors of node content C_i:

c_i = Σ_t e_t

where the e_t are the semantic vectors of all features of node v_i, calculated based on Eq. (16). Then, the weight score for each feature e_t^j of a neighborhood node v_j is computed by using a bilinear term:

α_t^j = softmax(c_i^T W_a e_t^j)     (20)

where W_a is a trained parameter matrix and c_i^T is the transpose of c_i.

By incorporating the attention weights, the final feature representation x̃_j for neighborhood node v_j is formed by a weighted sum of all the corresponding individual features:

x̃_j = Σ_t α_t^j e_t^j     (21)
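The context vector, bilinear scoring, and weighted aggregation above can be sketched together in numpy (dimensions and function names are our illustrative assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def context_attention(E_j, E_i, W_a):
    """Attention 2: score each feature of neighbor v_j against the contextual
    vector of node v_i via a bilinear term c_i^T W_a e_t."""
    c_i = E_i.sum(axis=1)  # contextual vector: element-wise sum of v_i's features
    scores = np.array([c_i @ W_a @ E_j[:, t] for t in range(E_j.shape[1])])
    return softmax(scores)

def aggregate(E_j, alpha):
    """Weighted sum of a neighbor's feature vectors: its final representation."""
    return E_j @ alpha

rng = np.random.default_rng(0)
d = 5
E_i = rng.normal(size=(d, 3))   # feature vectors of the target node v_i
E_j = rng.normal(size=(d, 4))   # feature vectors of a neighbor v_j
W_a = rng.normal(size=(d, d))   # trained bilinear parameter matrix
alpha = context_attention(E_j, E_i, W_a)
x_j = aggregate(E_j, alpha)
```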
IV-C Node Representation Learning

The spectral-based convolutional filter [7] is chosen as a key building component in our framework to gather the features dynamically learned by the LSTM network for node representation learning, where each node only aggregates features from itself and its first-order neighbors. In this paper, we adopt a two-layer convolutional node representation learning process, where the embeddings of all nodes in the first layer are computed by:

H^(1) = σ(Â X̃ W^(1))     (22)

where X̃ is the matrix of aggregated node feature representations from Eq. (21), W^(1) is the weight matrix, and d_1 is the dimension of node embeddings in the first layer. With the first-layer embeddings H^(1), the node embeddings in the second layer are computed by:

H^(2) = σ(Â H^(1) W^(2))     (23)

where Â = D̃^{-1/2}(A + I)D̃^{-1/2} is the normalized symmetric adjacency matrix with self-loops. W^(2) is the parameter matrix that transforms each node embedding to a c-length vector, with c the number of classes. Finally, the output node embeddings of the last layer are passed through a softmax classifier to perform the multi-class classification task:

Z = softmax(H^(2))


Let Y be the one-hot label indicator matrix of all nodes; the classification loss is defined as the cross-entropy error:

L_class = − Σ_{i∈Y_L} Σ_{j=1}^{c} Y_ij ln Z_ij

where Y_L is the set of node indices that have labels.
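A minimal sketch of this cross-entropy computed over labeled nodes only (the helper name and the toy values are ours):

```python
import numpy as np

def masked_cross_entropy(Z, Y, labeled_idx):
    """Cross-entropy over labeled nodes only: -sum_{i in Y_L} sum_j Y_ij ln Z_ij.

    The small epsilon guards against log(0) for near-zero probabilities."""
    return -np.sum(Y[labeled_idx] * np.log(Z[labeled_idx] + 1e-12))

Z = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])  # softmax outputs for 3 nodes, 3 classes
Y = np.eye(3)                    # one-hot label indicator matrix
loss = masked_cross_entropy(Z, Y, labeled_idx=[0, 1])  # node 2 is unlabeled
```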

Algorithm 1 FA-GCN: Feature-attention GCN for noise resilient learning

Input: An information network G, training epochs T, and labeled nodes Y_L
Output: The node embeddings H^(2)

t = 0
while t < T do
        for each vertex v_i ∈ V do
               Learn the dense semantic vector for each feature in C_i by Eq. (16);
               Learn the original node feature representation by Eq. (21).
        end for
        Learn node embeddings by Eq. (23);
        Calculate the classification loss by Eq. (28);
        Update the feature representation learning weights of the LSTM network, the attention weights in Eq. (20), and the node embedding learning weights W^(1) and W^(2) in Eqs. (22) and (23);
        t = t + 1.
end while

The weight parameters for feature representation learning (e.g., the LSTM gate weights) and node representation learning (e.g., W^(1) and W^(2)) are trained collectively using the gradient descent algorithm, as in [7] and [28]. Since the feature representation learning of one node can be influenced by that of other neighborhood nodes, both the LSTM network parameters and the GCN network parameters could vary dramatically without regularization, which might lead to over-fitting and instability problems. To mitigate these issues, we add L2-norm regularization terms to the loss function:

L = L_class + λ_1 ||Θ_LSTM||_2^2 + λ_2 ||Θ_GCN||_2^2     (28)

where λ_1 and λ_2 are penalty terms that control the magnitudes of the regularization on the feature and node representation learning weight parameters, respectively. The collective learning process of FA-GCN is summarized in Algorithm 1.

V Experiments

V-A Datasets

We choose three widely used benchmark datasets described as follows:
Citeseer dataset contains 3,312 publications from 6 categories and 4,732 links between them. Each publication is described by text with an average of 32 words, where the word features in each node's content are not ordered in a meaningful sequence (they appear in alphabetical order). There are 3,703 unique words in the vocabulary.

Cora dataset contains 2,708 research papers from 7 machine learning directions, such as Reinforcement Learning and Genetic Algorithms. Each paper corresponds to a category label. There are 5,214 citation relations between these papers. Each paper is described by an abstract in the form of a word sequence. There are 14,694 unique words in the vocabulary, and the average number of words per node is 90.

DBLP dataset contains 10,310 publications from 4 research areas in computer science: database, data mining, artificial intelligence and computer vision. There are 52,890 edges in total, and each publication is associated with a title in the form of a word sequence. There are 15,135 unique words in the vocabulary, and the average number of words per publication is 8.

Items                     Citeseer   Cora     DBLP
# Nodes                   3,312      2,708    10,310
# Edges                   4,732      5,214    52,890
# Unique words            3,703      14,694   15,135
# Average words per node  32         90       8
# Categories              6          7        4
TABLE I: Benchmark network characteristics.

The detailed statistics are summarized in Table I. It is necessary to mention that since both Cora and DBLP have sequential word features, it is reasonable to consider feature dependencies or correlations for more accurate relationship modeling between nodes. In addition, to further evaluate the capacity of the proposed approach to model feature correlation and feature attention in a more general setting, we also use the Citeseer dataset, in which features are unordered, i.e., the features of all nodes are sorted in alphabetical order. As we adopt a bi-directional LSTM network to learn the feature representations, each feature is able to reach and interact with the others within the same node content.

TABLE II: Node classification results on Citeseer (10%–50% denote the percentage of labeled nodes).
TABLE III: Node classification results on Cora (10%–50% denote the percentage of labeled nodes).

V-B Baselines

We choose the following state-of-the-art comparison methods, classified into three categories.

Structure only:

  • DeepWalk [13] preserves only the neighborhood relations between nodes via the truncated random walk, and uses the Skip-Gram model to learn the node embeddings.

  • Node2vec [5] adopts a more flexible neighborhood sampling process than DeepWalk, i.e., biased random walk, to better capture the local structure (the second-order node proximity) and the global structure (the high-order node proximity).

Both structure and content without attention:

  • TriDNR [12] is a method that exploits network structure, node content and label information for node representation learning. It is based on the assumption that network structures and contents can enhance each other to collectively determine the affinities between nodes.

  • GCN [7] is a state-of-the-art method that can efficiently model node relations from network structures and contents, where each node generates its representation by adopting a spectral-based convolutional filter to recursively aggregate information from all its neighbors.

  • FA-GCN (no attention) is a variant of our proposed method that models the feature correlations with a bi-directional LSTM network and then learns node representations with the graph convolutional filter, as in GCN. The attention mechanism is not incorporated in this variant.

Both structure and content with attention:

  • GAT [21] is a method built on the GCN model. It introduces an attention mechanism at the node level, which allows each node to assign different weights to different nodes in its neighborhood.

  • FA-GCN (attention 1) is a variant of our proposed method that learns feature representations with the bi-directional LSTM network combined with a self-transformation attention mechanism (Attention 1, described in Section 4.2). It then learns node representations with the graph convolutional filter.

  • FA-GCN (attention 2) is our proposed method, which learns feature representations with the bi-directional LSTM network and node representations with the graph convolutional filter. The context-aware attention mechanism (Attention 2, proposed in Section 4.2) is adopted in this method.

TABLE IV: Classification results on DBLP (10%–50% denote the percentage of labeled nodes).
Fig. 4: Algorithm performance comparisons w.r.t. different levels of injected noisy content features. The x-axis denotes noise levels, where 0.1 means adding 10% random noise to each node (e.g., a node with 10 words would be injected with one random word as noisy node content).

V-C Experimental Settings

Node Classification. We first perform supervised node classification based on the learned node representations, which is a widely used way to demonstrate graph learning performance [13, 12, 7]. A portion of labeled nodes is randomly selected to train the model (e.g., the classifier), which is then used to predict labels for the remaining nodes. Similar to the literature [7, 21], we adopt Accuracy to measure classification performance. Experiments are repeated 5 times w.r.t. different portions of training data, and the average performance and standard deviation are reported.
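The evaluation protocol above, repeated random splits with mean accuracy and standard deviation, can be sketched as follows (a plain-Python sketch; `predict_fn` is a hypothetical model interface standing in for any of the compared classifiers):

```python
import random
import statistics

def evaluate(labels, predict_fn, train_ratio, n_runs=5, seed=0):
    """Repeated random split: train on `train_ratio` of the labeled nodes,
    predict the rest, and report mean accuracy and standard deviation."""
    accs = []
    for run in range(n_runs):
        rng = random.Random(seed + run)          # fresh split each run
        nodes = list(range(len(labels)))
        rng.shuffle(nodes)
        n_train = int(len(nodes) * train_ratio)
        train, test = nodes[:n_train], nodes[n_train:]
        preds = predict_fn(train, test)          # dict: node id -> predicted label
        correct = sum(preds[i] == labels[i] for i in test)
        accs.append(correct / len(test))
    return statistics.mean(accs), statistics.stdev(accs)

# toy usage with a perfect "classifier", just to exercise the protocol
labels = [i % 3 for i in range(30)]
mean_acc, std_acc = evaluate(labels, lambda train, test: {i: labels[i] for i in test}, 0.4)
```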

Noise Intervention. We further test the ability of the proposed models to handle sparse and noisy content networks against various baselines. Two types of noise intervention on the original node content are adopted: 1) inject different ratios (ranging from 0.1 to 1.0) of random noisy features into each node's content; 2) replace different ratios (between 0.05 and 0.5) of the original features of each node's content with randomly sampled noisy features. Note that the first intervention makes each node's content contain more irrelevant features, while the second makes each node's content more erroneous and, at the same time, sparser (the original correct content features are removed). Since the content of the DBLP network is already very sparse (about 7 words per node), we only report the impact of the second noise intervention on the Citeseer and Cora datasets.
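The two interventions can be sketched concretely. Below is a minimal Python sketch (function names and the toy vocabulary are ours, not part of the paper) treating each node's content as a list of words:

```python
import random

def inject_noise(words, vocab, ratio, seed=0):
    """Intervention 1: append ratio * len(words) random words,
    adding irrelevant features without removing correct ones."""
    rng = random.Random(seed)
    n_noise = max(1, int(len(words) * ratio))
    return words + [rng.choice(vocab) for _ in range(n_noise)]

def replace_noise(words, vocab, ratio, seed=0):
    """Intervention 2: overwrite ratio * len(words) positions with random
    words, so correct features are lost and the content becomes both
    more erroneous and sparser in true features."""
    words = list(words)
    rng = random.Random(seed)
    n_replace = max(1, int(len(words) * ratio))
    for i in rng.sample(range(len(words)), n_replace):
        words[i] = rng.choice(vocab)
    return words

content = ["graph", "convolution", "network", "attention", "feature",
           "learning", "node", "relation", "noise", "sparse"]
vocab = ["random%d" % i for i in range(1000)]  # hypothetical noise vocabulary
noisy = inject_noise(content, vocab, 0.1)      # 10 words -> 1 injected word
replaced = replace_noise(content, vocab, 0.1)  # 1 of the 10 words replaced
```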

Parameter Setting. Extensive experiments are designed to test the sensitivity of various parameters. We vary the input and output feature representation dimensions of the LSTM network between 20 and 200, and the training ratio of labeled nodes between 0.1 and 0.5. For comparison, the default settings of the two dimensions and the training ratio are 80, 80 and 0.4, respectively. The hidden node vector dimensions for Citeseer, Cora and DBLP are set as 6, 7 and 4, respectively. Both the LSTM and GCN networks use dropout to reduce over-fitting, with dropout probabilities of 0.2 for the LSTM and 0.3 for the GCN. The L2-norm regularization weight decay parameters are set as 5e-4 and 5e-4 for DBLP, and 5e-3 and 5e-4 for the other datasets. We use the Adam optimizer to train the model, with the learning rate and number of training epochs set as 2e-3 and 200, respectively.
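For reference, the default settings above can be consolidated into a single configuration. This is our own hypothetical grouping of the reported values (the key names are ours, not identifiers from the paper's code):

```python
# Default hyper-parameters reported in the paper, gathered for readability.
CONFIG = {
    "d_in": 80,              # input feature vector dimension (LSTM input)
    "d_out": 80,             # feature representation dimension (LSTM output)
    "train_ratio": 0.4,      # default fraction of labeled nodes
    "hidden_dim": {"Citeseer": 6, "Cora": 7, "DBLP": 4},  # hidden node vector dim
    "dropout": {"lstm": 0.2, "gcn": 0.3},
    # two L2 weight-decay parameters per dataset group
    "weight_decay": {"DBLP": (5e-4, 5e-4), "others": (5e-3, 5e-4)},
    "optimizer": "Adam",
    "learning_rate": 2e-3,
    "epochs": 200,
}
```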

V-D Experimental Results

Node Classification Performance. Table 2, Table 3 and Table 4 present the classification accuracy of all baselines on the three datasets, where the top three best results are bold-faced, italicized and underscored, respectively. From the results, we have the following four main observations:


  • From Table 2 and Table 3, we can conclude that methods incorporating both network structures and content generally perform better than methods preserving structures only. For example, after modeling the text content of the Cora network, the average performance of TriDNR improved 52.5% over DeepWalk and 2.0% over Node2vec, respectively. This is likely because both Citeseer and Cora are sparse networks, where structures alone fail to capture the holistic relations between nodes; in such cases, the rich network content can be leveraged to enhance node relationship modeling. This observation is reinforced by the results in Table 4, which show that Node2vec performs significantly better than TriDNR on the DBLP network, whose connectivity is denser and content sparser compared with the Citeseer and Cora networks. On the other hand, shallow models such as DeepWalk, Node2vec and TriDNR suffer from limitations in modeling complex node relations [29]: they all use random walks over the network to capture node relationships, but random-walk techniques cannot differentiate the affinities of neighborhood relations at varying hops. In comparison, the GCN-based methods (e.g., GCN, GAT and FA-GCN) enforce rigid neighborhood relation modeling through the efficient spectral-based convolutional feature aggregation process, so the learned node representations naturally and precisely preserve both network structures and content.

    Fig. 5: Algorithm performance comparisons w.r.t. different levels of replaced noisy content features. The x-axis denotes the noise level, where 0.1 means replacing 10% of the words in each node with noise (e.g., a node with 10 words would have one word replaced with a random word).
  • Based on the classification results over all three datasets, FA-GCN outperforms GCN in most cases, e.g., with an improvement of 1.7% on the Citeseer dataset. This improvement stems from a more reasonable way of modeling the network content features in the proposed model. Existing GCN-based methods typically take static content features as input, where features are treated as independent and linked nodes are assumed to have dependencies through their shared individual features. However, in many situations (especially for text-described networks) each feature (e.g., a word) not only carries its own meaning but also correlates with other features to convey complete and complex semantics. In comparison, the proposed approach adopts a Bi-directional LSTM network to effectively model feature correlations for more accurate node relation modeling. The performance gain demonstrates the benefit of the proposed model in learning accurate feature semantics.

  • As can be seen from Table 2, Table 3 and Table 4, models incorporating either a node-level attention mechanism or a feature-level attention mechanism (e.g., GAT and FA-GCN) generally perform better than the basic GCN model. For example, on the Citeseer dataset, the average performance of GAT improved 0.9% over GCN, and the two FA-GCN variants improved 2.2% and 2.6%, respectively. In general, edges in a graph can reveal complex relationships between nodes: in a citation network, a paper may cite many others on various subjects, and in a social network, a user may connect to many friends with different degrees of affinity. GCN forces each node to indiscriminately aggregate information from all neighbors, which is inflexible and insufficient for modeling neighborhood relations between nodes. In comparison, GAT and the proposed attention model allow each node to attend to important neighboring nodes or their features for differentiable neighborhood feature aggregation, which helps accurately model node relations and, at the same time, learn the alignment between the aggregated features and the node classification task.

  • Among the methods with attention models, Table 2 and Table 4 show that FA-GCN outperforms GAT in most cases (e.g., the average performance improved 1.3% on the Citeseer dataset). The reason is that FA-GCN adopts a more fine-grained attention mechanism at the feature level, compared with GAT's node-level attention. Features in a node's content can function differently: two nodes sharing many identical features are not guaranteed to be highly similar. For example, in a citation network, the abstract of a publication contains rich word features, many of which are not distinguishing features that reveal the topics involved. Node-level attention assumes that all features in a node's content contribute equally, while the feature-level attention in FA-GCN is able to assign higher weights to the most influential features for node representation learning and classification. In addition, as can be seen from all three result tables, the proposed context-aware attention variant achieves even better performance than the self-transformation variant, because it considers contextual information when calculating feature attention across all neighboring nodes, giving each convolution node more capacity to select useful features aligned with the node classification task. The experimental results demonstrate the effectiveness of the proposed attention models.
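To make the node-level versus feature-level distinction concrete, the core idea of feature-level attention can be sketched as follows. This is a minimal numpy illustration of scoring each neighbor's dense feature vectors against a context vector and aggregating the attention-weighted features; it is our own simplification, not the paper's exact attention 2 formulation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def feature_attention_aggregate(F_neighbors, context):
    """Toy feature-level attention.

    F_neighbors: (n_neighbors, n_features, d) dense feature vectors
                 (e.g., LSTM outputs) for each neighbor.
    context:     (d,) context vector of the convolution node.

    For each neighbor, each individual feature gets its own weight,
    instead of a single weight per neighbor as in node-level attention.
    """
    aggregated = []
    for F in F_neighbors:
        scores = F @ context          # relevance of each feature to the context
        alpha = softmax(scores)       # per-feature attention weights, sum to 1
        aggregated.append(alpha @ F)  # weighted sum over features -> (d,)
    return np.mean(aggregated, axis=0)  # simple aggregation over neighbors

rng = np.random.default_rng(0)
F_neighbors = rng.random((3, 5, 8))   # 3 neighbors, 5 features each, dim 8
context = rng.random(8)
z = feature_attention_aggregate(F_neighbors, context)
```

A noisy feature tends to score low against the shared context, so its weight shrinks rather than polluting the aggregated representation, which is the intuition behind the noise resilience reported above.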

Fig. 6: Impact of the input feature vector dimension.
Fig. 7: Impact of the output feature vector dimension.

Noise Intervention Performance. Figure 4 shows the performance of the GCN-based methods on all three graphs w.r.t. different ratios of injected noisy features. As can be seen, as the noise level increases, the performance of all GCN-based methods (GCN, GAT and FA-GCN) declines to some degree. This is mainly because, in the spectral-based graph convolution learning process, nodes are forced to aggregate content features from their respective neighbors so as to maintain the complex link relationships in the graph; once node content is flooded with noisy features, content similarities between nodes become less likely to reflect accurate neighborhood relations. Nevertheless, from Figure 4 (a) we can observe that the proposed attention models are less sensitive to noise, because the introduced feature-level attention helps select the important features from the noisy ones, which, to some extent, preserves accurate neighborhood relation modeling. In addition, as can be seen from Figure 4 (b) and (c), both FA-GCN variants outperform the other baselines in most cases. Figure 5 shows the impact of the second type of noise intervention, which significantly affects the performance of all comparative methods; e.g., as the noise ratio increases from 0.1 to 0.5, the performance of GCN decreases by 3.4% and 2.3% on Citeseer and Cora, respectively. Nevertheless, the proposed FA-GCN model still performs better than the other methods in most cases, which again demonstrates the effectiveness of the proposed models for learning from networks with sparse and noisy content.

Fig. 8: Impact of the hidden node vector dimension.

Parameter Influence. There are three hyper-parameters that are important in the feature and node representation learning process: the input feature vector dimension, the output feature representation dimension, and the hidden node vector dimension. Extensive experiments test their sensitivity on the Citeseer and DBLP datasets. The first controls the dimension of the feature vectors fed into the LSTM network; Figure 6 shows that on both datasets the performance changes within a limited range, first increasing and then dropping for larger values. The second is the dimension of the feature representation output by the LSTM network, whose influence is shown in Figure 7: performance first slightly increases and then decreases within a very small range after the dimension reaches 60 for Citeseer and 100 for DBLP. Figure 8 shows the impact of the hidden node vector dimension, i.e., the dimension of the node representation output by the first convolutional layer in FA-GCN; we test it on Citeseer between 6 and 60 and on DBLP between 4 and 13. The results show it has a significant impact on performance, which declines dramatically after 6 for Citeseer and 5 for DBLP.

Vi Conclusions

In this paper, we studied noise resilient learning for networks with sparse and noisy node content. We argued that sparse, noisy, and erroneous graph content is ubiquitous and presents critical challenges to many graph learning methods that rely on network content to constrain and measure node relationships. To tackle feature sparsity, we first proposed to represent content features as dense vectors with an LSTM network, which leverages feature semantic correlation and dependency to learn a dense vector for each feature. After that, we introduced a feature attention mechanism that allows each node to vary feature weights with respect to different neighbors, which minimizes noise impact and emphasizes consistent features between connected nodes. As a result, each node can gather the most important content features from itself and its neighbors to learn its representation. The effectiveness of the proposed models has been validated on three sparse-content benchmark networks. Experiments on noise-free and noisy networks, with different noise interventions that either inject noise into the node content or replace correct content features with erroneous ones, confirm that the proposed method outperforms state-of-the-art methods such as GCN and GAT. Our method is less sensitive to erroneous graph content and is noise resilient for learning node representations.


  • [1] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun (2013) Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203.
  • [2] J. Chang and D. Blei (2009) Relational topic models for document networks. In Artificial Intelligence and Statistics, pp. 81–88.
  • [3] J. Cheng, L. Dong, and M. Lapata (2016) Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733.
  • [4] M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844–3852.
  • [5] A. Grover and J. Leskovec (2016) Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864.
  • [6] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
  • [7] T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
  • [8] T. M. Le and H. W. Lauw (2014) Probabilistic latent document network embedding. In 2014 IEEE International Conference on Data Mining, pp. 270–279.
  • [9] M. Luong, H. Pham, and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
  • [10] T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  • [11] H. NT, J. J. Choong, and T. Murata (2019) Learning graph neural networks with noisy labels. In Proc. of Intl. Conf. on Learning Representation (ICLR) the 2nd Learning from Limited Labeled Data (LLD) Workshop.
  • [12] S. Pan, J. Wu, X. Zhu, C. Zhang, and Y. Wang (2016) Tri-party deep network representation. In Proc. of IJCAI.
  • [13] B. Perozzi, R. Al-Rfou, and S. Skiena (2014) Deepwalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701–710.
  • [14] J. Pujara, E. Augustine, and L. Getoor (2017) Sparsity and noise: where knowledge graph embeddings fall short. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1751–1756.
  • [15] G. Qu, S. Hariri, and M. Yousif (2005) A new dependency and correlation analysis for features. IEEE Transactions on Knowledge and Data Engineering 17 (9), pp. 1199–1207.
  • [16] L. F. Ribeiro, P. H. Saverese, and D. R. Figueiredo (2017) Struc2vec: learning node representations from structural identity. In Proc. of the 23rd ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, pp. 385–394.
  • [17] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2009) The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80.
  • [18] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling (2018) Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pp. 593–607.
  • [19] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei (2015) Line: large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pp. 1067–1077.
  • [20] C. Tu, H. Liu, Z. Liu, and M. Sun (2017) Cane: context-aware network embedding for relation modeling. In Proc. of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1722–1731.
  • [21] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903.
  • [22] X. Wang, P. Cui, J. Wang, J. Pei, W. Zhu, and S. Yang (2017) Community preserving network embedding. In Thirty-First AAAI Conference on Artificial Intelligence.
  • [23] Z. Wang, J. Zhang, J. Feng, and Z. Chen (2014) Knowledge graph and text jointly embedding. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1591–1601.
  • [24] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu (2019) A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596.
  • [25] S. Yan, Y. Xiong, and D. Lin (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • [26] C. Yang, Z. Liu, D. Zhao, M. Sun, and E. Chang (2015) Network representation learning with rich text information. In Twenty-Fourth International Joint Conference on Artificial Intelligence.
  • [27] M. Yang, W. Tu, J. Wang, F. Xu, and X. Chen (2017) Attention based LSTM for target dependent sentiment classification. In Thirty-First AAAI Conference on Artificial Intelligence.
  • [28] L. Yao, C. Mao, and Y. Luo (2018) Graph convolutional networks for text classification. arXiv preprint arXiv:1809.05679.
  • [29] D. Zhang, J. Yin, X. Zhu, and C. Zhang (2018) Network representation learning: a survey. IEEE Transactions on Big Data.
  • [30] M. Zhang and Y. Chen (2018) Link prediction based on graph neural networks. In Proc. of 32nd Conf. on Neural Info. Proc. Systems (NeurIPS).
  • [31] P. Zhou, W. Shi, J. Tian, Z. Qi, B. Li, H. Hao, and B. Xu (2016) Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vol. 2, pp. 207–212.