Condensing Graphs via One-Step Gradient Matching

by Wei Jin, et al.
Michigan State University

As training deep learning models on large datasets takes substantial time and resources, it is desirable to construct a small synthetic dataset with which we can train deep learning models sufficiently well. Recent works have explored solutions for condensing image datasets through complex bi-level optimization. For instance, dataset condensation (DC) matches network gradients w.r.t. large-real data and small-synthetic data, where the network weights are optimized for multiple steps at each outer iteration. However, existing approaches have inherent limitations: (1) they are not directly applicable to graphs where the data is discrete; and (2) the condensation process is computationally expensive due to the involved nested optimization. To bridge the gap, we investigate efficient dataset condensation tailored for graph datasets, where we model the discrete graph structure as a probabilistic model. We further propose a one-step gradient matching scheme, which performs gradient matching for only a single step without training the network weights. Our theoretical analysis shows this strategy can generate synthetic graphs that lead to lower classification loss on real graphs. Extensive experiments on various graph datasets demonstrate the effectiveness and efficiency of the proposed method. In particular, we are able to reduce the dataset size by 90% and our method is significantly faster than multi-step gradient matching (e.g., 15× on CIFAR10 when synthesizing 500 graphs).





1. Introduction

Graph-structured data plays a key role in various real-world applications. For example, by exploiting graph structural information, we can predict the chemical property of a given molecular graph (Ying et al., 2018), detect fraud activities in a financial transaction graph (Wang et al., 2019), or recommend new friends to users in a social network (Fan et al., 2019). Due to its prevalence, graph neural networks (GNNs) (Kipf and Welling, 2017; Velickovic et al., 2018; Battaglia et al., 2018; Wu et al., 2019b) have been developed to effectively extract meaningful patterns from graph data and thus tremendously facilitate computational tasks on graphs. Despite their effectiveness, GNNs are notoriously data-hungry like traditional deep neural networks: they usually require massive datasets to learn powerful representations. Thus, training GNNs is often computationally expensive. Such cost even becomes prohibitive when we need to repeatedly train GNNs, e.g., in neural architecture search (Liu et al., 2019) and continual learning (Li and Hoiem, 2017).

One potential solution to alleviate the aforementioned issue is dataset condensation, also known as dataset distillation. It targets constructing a small synthetic training set that can provide sufficient information to train neural networks (Wang et al., 2018; Zhao et al., 2021a; Zhao and Bilen, 2021a; Nguyen et al., 2021a, b). In particular, one representative method, DC (Zhao et al., 2021a), formulates the condensation goal as matching the gradients of the network parameters between small-synthetic and large-real training data. It has been demonstrated that such a solution can greatly reduce the training set size of image datasets without significantly sacrificing model performance; for example, a model trained on images generated by DC can achieve test accuracy on MNIST close to that obtained on the original dataset. These condensed samples can significantly save space for storing datasets and speed up retraining neural networks in many critical applications, e.g., continual learning and neural architecture search. In spite of the recent advances in dataset distillation/condensation for images, limited attention has been paid to domains involving graph structures.

To bridge this gap, we investigate the problem of condensing graphs such that GNNs trained on the condensed graphs can achieve comparable performance to those trained on the original dataset. However, directly applying existing solutions (Wang et al., 2018; Zhao et al., 2021a; Zhao and Bilen, 2021a; Nguyen et al., 2021a) for dataset condensation to the graph domain faces several challenges. First, existing solutions are designed for images where the data is continuous, and they cannot output binary values to form the discrete graph structure. Thus, we need to develop a strategy that can handle the discrete nature of graphs. Second, they usually involve a complex bi-level problem that is computationally expensive to optimize: they require multiple iterations (inner iterations) of updating the neural network parameters before updating the synthetic data, repeated over multiple outer iterations. This can be catastrophically inefficient when learning pairwise relations among nodes, whose complexity is quadratic in the number of nodes.

To address the aforementioned challenges, we propose an efficient condensation method for graphs, where we follow DC (Zhao et al., 2021a) to match the gradients of GNNs between synthetic graphs and real graphs. In order to produce discrete values, we model the graph structure as a probabilistic graph model and optimize the discrete structures in a differentiable manner. Based on this formulation, we further propose a one-step gradient matching strategy which only performs gradient matching for one single step. Consequently, the advantages of the proposed strategy are twofold. First, it significantly speeds up the condensation process while providing reasonable guidance for synthesizing condensed graphs. Second, it removes the burden of tuning hyper-parameters such as the number of outer/inner iterations of the bi-level optimization as required by DC. Furthermore, we demonstrate the effectiveness of the proposed one-step gradient matching strategy both theoretically and empirically. Our contributions can be summarized as follows:

  1. We study a novel problem of learning discrete synthetic graphs for condensing graph datasets, where the discrete structure is captured via a graph probabilistic model that can be learned in a differentiable manner.

  2. We propose a one-step gradient matching scheme that significantly accelerates the vanilla gradient matching process.

  3. Theoretical analysis is provided to understand the rationality of the proposed one-step gradient matching. We show that learning with one-step matching produces synthetic graphs that lead to a smaller classification loss on real graphs.

  4. Extensive experiments demonstrate the effectiveness and efficiency of the proposed method. In particular, we are able to reduce the dataset size by 90% while largely preserving the original performance, and our method is significantly faster than multi-step gradient matching (e.g., 15× on CIFAR10 when synthesizing 500 graphs).

2. The Proposed Framework

Before detailing the framework, we first introduce the main notations used in this paper. We mainly focus on the graph classification task where the goal is to predict the labels of given graphs. Specifically, we denote a graph dataset as $\mathcal{T} = \{G_1, \ldots, G_N\}$ with ground-truth label set $\mathcal{Y}$. Each graph in $\mathcal{T}$ is associated with a discrete adjacency matrix and a node feature matrix; let $A_i$ and $X_i$ represent the adjacency matrix and the feature matrix of the $i$-th real graph, respectively. Similarly, we use $\mathcal{S} = \{G'_1, \ldots, G'_{N'}\}$ and $\mathcal{Y}'$ to indicate the synthetic graphs and their labels, respectively. Note that the number of synthetic graphs $N'$ is essentially much smaller than that of real graphs $N$. We use $d$ and $n'$ to denote the number of feature dimensions and the number of nodes in each synthetic graph, respectively (we set $n'$ to the average number of nodes in the original dataset). Let $C$ denote the number of classes and $\ell$ denote the cross-entropy loss. The goal of our work is to learn a set of synthetic graphs $\mathcal{S}$ such that a GNN trained on $\mathcal{S}$ can achieve comparable performance to the one trained on the much larger dataset $\mathcal{T}$.

In the following subsections, we first introduce how to apply the vanilla gradient matching to condensing graphs for graph classification (Section 2.1). However, it cannot generate discrete graph structure and is highly inefficient. To correspondingly address these two limitations, we discuss the approach to handling the discrete nature of graphs (Section 2.2) and propose an efficient solution, one-step gradient matching, which significantly accelerates the condensation process (Section 2.3).

2.1. Gradient Matching as the Condensation Objective

Since we aim at learning synthetic graphs that are highly informative, one solution is to allow GNNs trained on synthetic graphs to imitate the training trajectory on the original large dataset. Dataset condensation (Zhao et al., 2021a; Zhao and Bilen, 2021a) introduces a gradient matching scheme to achieve this goal. Concretely, it tries to reduce the difference of model gradients w.r.t. large-real data and small-synthetic data for model parameters at every training epoch. Hence, the model parameters trained on synthetic data will be close to those trained on real data at every training epoch. Let $\theta_t$ denote the network parameters at the $t$-th epoch and $f_{\theta_t}$ indicate the neural network parameterized by $\theta_t$. The condensation objective is expressed as:

$$\min_{\mathcal{S}} \sum_{t=0}^{T-1} D\Big(\nabla_\theta \ell\big(f_{\theta_t}(\mathcal{S}), \mathcal{Y}'\big),\; \nabla_\theta \ell\big(f_{\theta_t}(\mathcal{T}), \mathcal{Y}\big)\Big) \quad \text{s.t.} \quad \theta_{t+1} = \mathrm{opt}_\theta(\theta_t, \mathcal{S}), \tag{1}$$

where $D(\cdot,\cdot)$ is a distance function, $T$ is the number of steps of the whole training trajectory, and $\mathrm{opt}_\theta$ is the optimization operator for updating parameter $\theta$. Note that Eq. (1) is a bi-level problem where we need to learn the synthetic graphs $\mathcal{S}$ in the outer optimization and update the model parameters $\theta_t$ in the inner optimization. To learn synthetic graphs that generalize to a distribution of model parameters $P_{\theta_0}$, we sample $\theta_0 \sim P_{\theta_0}$ and rewrite Eq. (1) as:

$$\min_{\mathcal{S}} \; \mathbb{E}_{\theta_0 \sim P_{\theta_0}}\Bigg[\sum_{t=0}^{T-1} D\Big(\nabla_\theta \ell\big(f_{\theta_t}(\mathcal{S}), \mathcal{Y}'\big),\; \nabla_\theta \ell\big(f_{\theta_t}(\mathcal{T}), \mathcal{Y}\big)\Big)\Bigg] \quad \text{s.t.} \quad \theta_{t+1} = \mathrm{opt}_\theta(\theta_t, \mathcal{S}). \tag{2}$$
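In practice, the distance $D$ between the two gradient sets is computed layer by layer; a common choice in DC-style gradient matching is one minus the cosine similarity between corresponding gradient slices. A minimal PyTorch sketch (the function name and exact distance are illustrative, not the authors' exact implementation):

```python
import torch
import torch.nn.functional as F

def gradient_distance(grads_real, grads_syn):
    """Layer-wise gradient distance: one minus cosine similarity per
    output slice, summed over layers (a common DC-style choice; other
    distances such as plain L2 also work)."""
    total = torch.zeros(())
    for g_r, g_s in zip(grads_real, grads_syn):
        # treat each output unit's gradient as one vector
        g_r = g_r.reshape(g_r.shape[0], -1) if g_r.dim() > 1 else g_r.unsqueeze(0)
        g_s = g_s.reshape(g_s.shape[0], -1) if g_s.dim() > 1 else g_s.unsqueeze(0)
        total = total + (1 - F.cosine_similarity(g_r, g_s, dim=1)).sum()
    return total
```

The distance is zero when the two gradient sets point in the same per-slice directions, so minimizing it pulls the synthetic-data gradients toward the real-data gradients.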
Discussion. The aforementioned strategy has demonstrated promising performance on condensing image datasets (Zhao et al., 2021a; Zhao and Bilen, 2021a). However, it is not clear how to model the discrete graph structure. Moreover, the inherent bi-level optimization inevitably hinders its scalability. To tackle these shortcomings, we propose DosCond that models the structure as a probabilistic graph model and is optimized through one-step gradient matching. In the following subsections, we introduce the details of DosCond.

2.2. Learning Discrete Graph Structure

For graph classification, each graph in the dataset is composed of an adjacency matrix and a feature matrix. For simplicity, we use $X'$ to denote the node features in all synthetic graphs and $A'$ to indicate the graph structure information in $\mathcal{S}$. Note that $f_\theta$ can be instantiated as any graph neural network and it takes both graph structure and node features as input. Then we rewrite the objective in Eq. (2) as follows:

$$\min_{A', X'} \; \mathbb{E}_{\theta_0 \sim P_{\theta_0}}\Bigg[\sum_{t=0}^{T-1} D\Big(\nabla_\theta \ell\big(f_{\theta_t}(A', X'), \mathcal{Y}'\big),\; \nabla_\theta \ell\big(f_{\theta_t}(A, X), \mathcal{Y}\big)\Big)\Bigg], \tag{3}$$
where we aim to learn both the graph structure $A'$ and the node features $X'$. However, Eq. (3) is challenging to optimize as it requires a function that outputs binary values. To address this issue, we propose to model the graph structure as a probabilistic graph model with Bernoulli distribution. Note that in the following, we reshape $A'$ from a three-dimensional tensor to a two-dimensional matrix for the purpose of demonstration only. Specifically, each entry $A'_{ij}$ in the adjacency matrix $A'$ follows a Bernoulli distribution:

$$P(A'_{ij}) = A'_{ij}\,\sigma(\Omega_{ij}) + (1 - A'_{ij})\big(1 - \sigma(\Omega_{ij})\big), \tag{4}$$

where $\sigma(\cdot)$ is the sigmoid function and $\sigma(\Omega_{ij})$ is the success probability of the Bernoulli distribution as well as the parameter to be learned. Since $A'_{ij}$ is independent of all other entries, the distribution of $A'$ can be modeled as:

$$P(A') = \prod_{i,j} P(A'_{ij}). \tag{5}$$
Then, the objective in Eq. (2) needs to be modified to

$$\min_{\Omega, X'} \; \mathbb{E}_{A' \sim P(A')}\Bigg[\mathbb{E}_{\theta_0 \sim P_{\theta_0}}\Bigg[\sum_{t=0}^{T-1} D\Big(\nabla_\theta \ell\big(f_{\theta_t}(A', X'), \mathcal{Y}'\big),\; \nabla_\theta \ell\big(f_{\theta_t}(A, X), \mathcal{Y}\big)\Big)\Bigg]\Bigg]. \tag{6}$$

With the new parameterization, we obtain a function that outputs discrete values, but it is not differentiable due to the involved sampling process. Thus, we employ the reparameterization method (Maddison et al., 2016), the binary concrete distribution, to refactor the discrete random variable into a differentiable function of its parameters and a random variable with a fixed distribution. Specifically, we first sample $\epsilon \sim \mathrm{Uniform}(0, 1)$, and the edge weight $A'_{ij}$ is calculated by:

$$A'_{ij} = \sigma\Big(\big(\log \epsilon - \log(1 - \epsilon) + \Omega_{ij}\big) / \tau\Big), \tag{7}$$

where $\tau$ is the temperature parameter that controls the continuous relaxation. As $\tau \to 0$, the random variable $A'_{ij}$ smoothly approaches the Bernoulli distribution; in other words, we have $\lim_{\tau \to 0} P(A'_{ij} = 1) = \sigma(\Omega_{ij})$. While a small $\tau$ is necessary for obtaining discrete samples, a large $\tau$ is useful in getting large gradients, as suggested by (Maddison et al., 2016). In practice, we employ an annealing schedule (Abid et al., 2019) to gradually decrease the value of $\tau$ during training. With the reparameterization trick, the objective function becomes differentiable w.r.t. $\Omega$ with well-defined gradients. Then we rewrite our objective as:

$$\min_{\Omega, X'} \; \mathbb{E}_{\epsilon \sim \mathrm{Uniform}(0,1)}\Bigg[\mathbb{E}_{\theta_0 \sim P_{\theta_0}}\Bigg[\sum_{t=0}^{T-1} D\Big(\nabla_\theta \ell\big(f_{\theta_t}(A'(\Omega, \epsilon), X'), \mathcal{Y}'\big),\; \nabla_\theta \ell\big(f_{\theta_t}(A, X), \mathcal{Y}\big)\Big)\Bigg]\Bigg]. \tag{8}$$
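The reparameterized edge-weight sampling above can be sketched in a few lines of PyTorch. The symmetrization step is our assumption for undirected graphs and is not spelled out in the text:

```python
import torch

def sample_adjacency(omega, tau):
    """Draw a relaxed adjacency matrix via the binary concrete
    distribution: sigmoid((log(eps) - log(1 - eps) + omega) / tau)."""
    eps = torch.rand_like(omega).clamp(1e-6, 1 - 1e-6)  # eps ~ Uniform(0, 1)
    a = torch.sigmoid((torch.log(eps) - torch.log(1 - eps) + omega) / tau)
    # keep the sample symmetric for undirected graphs (our assumption)
    return (a + a.transpose(-1, -2)) / 2
```

As the temperature is annealed toward zero, the entries concentrate near 0 or 1, so thresholding at 0.5 recovers a discrete graph at test time.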
2.3. One-Step Gradient Matching

The vanilla gradient matching scheme in Eq. (2) presents a bi-level optimization problem: we need to update the synthetic graphs in the outer loop and then optimize the network parameters in the inner loop. The nested loops heavily impede the scalability of the condensation method, which motivates us to design a new strategy for efficient condensation. In this work, we propose a one-step gradient matching scheme where we only match the network gradients for the model initializations $\theta_0$ while discarding the training trajectory of $\theta_t$. Essentially, this strategy approximates the overall gradient matching loss for $\theta_t$ with the initial matching loss at the first epoch, which we term the one-step matching loss. The intuition is that the one-step matching loss informs us about the direction to update the synthetic data, in which we have empirically observed a strong decrease in the cross-entropy loss (on real samples) obtained from the model trained on synthetic data. Hence, we can drop the summation over $t$ in Eq. (8) and simplify it as follows:

$$\min_{\Omega, X'} \; \mathbb{E}_{\epsilon \sim \mathrm{Uniform}(0,1)}\Big[\mathbb{E}_{\theta_0 \sim P_{\theta_0}}\Big[D\Big(\nabla_\theta \ell\big(f_{\theta_0}(A'(\Omega, \epsilon), X'), \mathcal{Y}'\big),\; \nabla_\theta \ell\big(f_{\theta_0}(A, X), \mathcal{Y}\big)\Big)\Big]\Big], \tag{9}$$

where we sample $\theta_0 \sim P_{\theta_0}$ and $\epsilon \sim \mathrm{Uniform}(0, 1)$. Compared with Eq. (8), one-step gradient matching avoids the expensive nested-loop optimization and directly updates the synthetic graphs. It greatly simplifies the condensation process. In practice, as shown in Section 3.3, we find this strategy yields comparable performance to its bi-level counterpart while enabling much more efficient condensation. Next, we provide theoretical analysis to understand the rationality of the proposed one-step gradient matching scheme.
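The whole one-step scheme fits in a few lines: sample a fresh initialization, compute the two gradient sets once, and backpropagate the matching distance into the synthetic data only. A hedged sketch with a generic model and a caller-supplied distance (`dist_fn` stands for any gradient distance; this is an illustration, not the authors' code):

```python
import torch

def one_step_matching_loss(model, loss_fn, dist_fn, real, syn):
    """Match gradients at a freshly sampled theta_0; the network weights
    are never trained, only the synthetic inputs receive updates."""
    for m in model.modules():                      # theta_0 ~ P_theta0
        if hasattr(m, "reset_parameters"):
            m.reset_parameters()
    (x_real, y_real), (x_syn, y_syn) = real, syn
    params = tuple(model.parameters())
    g_real = torch.autograd.grad(loss_fn(model(x_real), y_real), params)
    # create_graph=True keeps the graph so the distance stays
    # differentiable w.r.t. the synthetic inputs
    g_syn = torch.autograd.grad(loss_fn(model(x_syn), y_syn), params,
                                create_graph=True)
    return dist_fn([g.detach() for g in g_real], g_syn)
```

One optimizer step on the synthetic parameters per call replaces the entire inner training loop of the bi-level formulation.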

Theoretical Understanding. We denote the cross-entropy loss on the real graphs as $\mathcal{L}_{\mathcal{T}}$, and that on synthetic graphs as $\mathcal{L}_{\mathcal{S}}$. Let $\theta^*$ denote the optimal parameter and $\theta_t^{\mathcal{S}}$ be the parameter trained on $\mathcal{S}$ at the $t$-th epoch by optimizing $\mathcal{L}_{\mathcal{S}}$. For notational simplicity, we assume that the input matrices are already normalized. When not specified, the matrix norm is the Frobenius norm. We focus on the GNN of Simple Graph Convolution (SGC) (Wu et al., 2019a) to study our problem, since SGC has a simpler architecture but shares a similar filtering pattern with GCN.
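The SGC architecture assumed in this analysis reduces to one feature propagation step, pooling, and a linear map, i.e., $f_\theta(A, X) = \mathrm{Pool}(AX)\,\theta$. A small sketch for a single graph (names are ours, for illustration):

```python
import torch

def sgc_forward(A, X, theta, pool="mean"):
    """One-layer SGC for graph classification: propagate once (A @ X),
    pool node representations, then apply the linear classifier theta."""
    H = A @ X                                # single propagation, no nonlinearity
    g = H.mean(dim=0) if pool == "mean" else H.sum(dim=0)
    return g @ theta                         # class logits, shape (C,)
```

The mean/sum pooling switch is exactly the distinction that changes the constant in the bound below.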

Theorem 1.

When we use a one-layer SGC as the GNN in condensation, i.e., $f_\theta(A, X) = \mathrm{Pool}(AX)\,\theta$ with $\theta \in \mathbb{R}^{d \times C}$, and assume that all network parameters satisfy $\|\theta_t\| \le B_\theta$, we have

$$\min_{t} \; \mathcal{L}_{\mathcal{T}}(\theta_t^{\mathcal{S}}) - \mathcal{L}_{\mathcal{T}}(\theta^{*}) \;\le\; \frac{2 B_\theta}{T} \sum_{t=1}^{T} \Big\| \nabla_\theta \mathcal{L}_{\mathcal{T}}(\theta_t^{\mathcal{S}}) - \nabla_\theta \mathcal{L}_{\mathcal{S}}(\theta_t^{\mathcal{S}}) \Big\| + \mathcal{O}\Big(\frac{c}{\sqrt{T}}\Big),$$

where $c = \max_i \|A'_i X'_i\|$ if we use sum pooling in $\mathrm{Pool}$; $c = \max_i \frac{1}{n'_i}\|A'_i X'_i\|$ if we use mean pooling, with $n'_i$ as the number of nodes in the $i$-th synthetic graph.

We provide the proof of Theorem 1 in Appendix B.1. Theorem 1 suggests that the smallest gap between the resulting loss (obtained by training on synthetic graphs) and the optimal loss has an upper bound. This upper bound depends on two terms: (1) the difference of gradients w.r.t. real data and synthetic data, and (2) the norm of the input matrices. Thus, the theorem justifies that reducing the gradient difference between real and synthetic graphs helps learn desirable synthetic data that preserves sufficient information to train GNNs well. Based on Theorem 1, we have the following proposition.

Proposition 1.

Assume the largest gradient gap happens at the $k$-th epoch, i.e., $k = \arg\max_t \big\| \nabla_\theta \mathcal{L}_{\mathcal{T}}(\theta_t^{\mathcal{S}}) - \nabla_\theta \mathcal{L}_{\mathcal{S}}(\theta_t^{\mathcal{S}}) \big\|$. Then we have

$$\min_{t} \; \mathcal{L}_{\mathcal{T}}(\theta_t^{\mathcal{S}}) - \mathcal{L}_{\mathcal{T}}(\theta^{*}) \;\le\; 2 B_\theta \Big\| \nabla_\theta \mathcal{L}_{\mathcal{T}}(\theta_k^{\mathcal{S}}) - \nabla_\theta \mathcal{L}_{\mathcal{S}}(\theta_k^{\mathcal{S}}) \Big\| + \mathcal{O}\Big(\frac{c}{\sqrt{T}}\Big).$$
We omit the proof for the proposition since it is straightforward. The above proposition suggests that the smallest gap between $\mathcal{L}_{\mathcal{T}}(\theta_t^{\mathcal{S}})$ and $\mathcal{L}_{\mathcal{T}}(\theta^{*})$ is bounded by the one-step matching loss and the norm of the input matrices. As we will show in Section 3.3.4, when using mean pooling, the second term tends to have a smaller scale than the first one and can be neglected; the second term matters more when we use sum pooling. Hence, we solely optimize the one-step gradient matching loss for GNNs with mean pooling and additionally include the second term (the norm of the input matrices) as a regularization for GNNs with sum pooling. As such, when we consider the optimal loss $\mathcal{L}_{\mathcal{T}}(\theta^{*})$ as a constant, reducing the one-step matching loss indeed learns synthetic graphs that lead to a smaller loss on real graphs. This demonstrates the rationality of one-step gradient matching from a theoretical perspective.

Remark 1.

Note that the spectral analysis from (Wu et al., 2019a) demonstrated that GCN and SGC share similar graph filtering behaviors. Thus, in practice we extend the one-step gradient matching loss from one-layer SGC to multi-layer GCN and observe that the proposed framework works well in the non-linear scenario.

Remark 2.

While we focus on the graph classification task, it is straightforward to extend our framework to node classification and we obtain similar conclusions for node classification as shown in Theorem 2 in Appendix B.2.

2.4. Final Objective and Training Algorithm

In this subsection, we describe the final objective function and the detailed training algorithm. Since the objective in Eq. (8) involves two nested expectations, we adopt Monte Carlo sampling to approximately optimize the objective function. Together with one-step gradient matching, we have

$$\min_{\Omega, X'} \; \frac{1}{K_1 K_2} \sum_{k=1}^{K_1} \sum_{m=1}^{K_2} D\Big(\nabla_\theta \ell\big(f_{\theta_0^{(k)}}(A'^{(m)}, X'), \mathcal{Y}'\big),\; \nabla_\theta \ell\big(f_{\theta_0^{(k)}}(A, X), \mathcal{Y}\big)\Big), \tag{10}$$

where $K_1$ is the number of sampled model initializations $\theta_0^{(k)} \sim P_{\theta_0}$ and $K_2$ is the number of sampled graphs $A'^{(m)}$. We find that small values of $K_1$ and $K_2$ are able to yield good performance in our experiments.

Regularization. In addition to the one-step gradient matching loss, we note that the proposed DosCond can be easily integrated with various priors as regularization terms. In this work, we focus on exerting a sparsity regularization on the adjacency matrix, since a denser adjacency matrix leads to higher cost for training graph neural networks. Specifically, we penalize the difference between the sparsity of $\sigma(\Omega)$ and a given sparsity $\delta$:

$$\mathcal{L}_{\mathrm{sparsity}} = \Big\| \tfrac{1}{|\Omega|} \textstyle\sum_{i,j} \sigma(\Omega_{ij}) - \delta \Big\|. \tag{11}$$

We initialize $X'$ and $\Omega$ from randomly sampled training graphs (if an entry in the real adjacency matrix is 1, the corresponding value in $\Omega$ is initialized as a large value, e.g., 5) and set $\delta$ to the average sparsity of the initialized $\sigma(\Omega)$ so as to maintain low sparsity. On top of that, as we discussed earlier in Section 2.3, we include the norm of the input matrices, $\|A'X'\|$, as an additional regularization for GNNs with sum pooling.
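Both regularizers can be sketched compactly; the exact forms are our reading of the text (the sparsity term penalizes the gap between the expected edge density and δ, and the norm term is the Frobenius norm of the pooled inputs for sum pooling):

```python
import torch

def sparsity_reg(omega, delta):
    """Penalize deviation of the expected edge density sigma(Omega)
    from the target sparsity delta (our reading of the text)."""
    return (torch.sigmoid(omega).mean() - delta).abs()

def norm_reg(a_syn, x_syn):
    """Norm of the input matrices ||A'X'||, added for sum pooling."""
    return torch.norm(a_syn @ x_syn)
```

Both terms are scalar and differentiable in Ω and X', so they can simply be added to the one-step matching loss before the gradient-descent update.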
Algorithm 1: DosCond for Condensing Graphs
1:  Input: Training data $\mathcal{T}$
2:  Required: Pre-defined condensed labels $\mathcal{Y}'$, graph neural network $f_\theta$, temperature $\tau$, desired sparsity $\delta$, regularization coefficient, learning rates $\eta_1, \eta_2$, number of epochs $K$.
3:  Initialize $\Omega$ and $X'$ from randomly sampled training graphs
4:  for $k = 0, \ldots, K-1$ do
5:      Sample $\theta_0 \sim P_{\theta_0}$
6:      Sample $\epsilon \sim \mathrm{Uniform}(0, 1)$
7:      Compute $A'$ from $\Omega$ and $\epsilon$ via Eq. (7)
8:      for each class $c$ do
9:          Sample a batch of real graphs of class $c$ and retrieve the synthetic graphs of class $c$
10:         Compute the one-step gradient matching loss via Eq. (9)
11:         Compute the regularization terms
12:         Compute the total loss $\mathcal{L}$
13:         Update $\Omega \leftarrow \Omega - \eta_1 \nabla_\Omega \mathcal{L}$
14:         Update $X' \leftarrow X' - \eta_2 \nabla_{X'} \mathcal{L}$
15:     end for
16: end for
17: Return: $\Omega$ and $X'$
| Dataset | Graphs/Cls. | Ratio | Random | Herding | K-Center | DCG | DosCond | Whole Dataset |
|---|---|---|---|---|---|---|---|---|
| ogbg-molbace (ROC-AUC) | 1 | 0.2% | 0.580±0.067 | 0.548±0.034 | 0.548±0.034 | 0.623±0.046 | 0.657±0.034 | 0.714±0.005 |
| | 10 | 1.7% | 0.598±0.073 | 0.639±0.039 | 0.591±0.056 | 0.655±0.033 | 0.674±0.035 | |
| | 50 | 8.3% | 0.632±0.047 | 0.683±0.022 | 0.589±0.025 | 0.652±0.013 | 0.688±0.012 | |
| ogbg-molbbbp (ROC-AUC) | 1 | 0.1% | 0.519±0.016 | 0.546±0.019 | 0.546±0.019 | 0.559±0.044 | 0.581±0.005 | 0.646±0.004 |
| | 10 | 1.2% | 0.586±0.040 | 0.605±0.019 | 0.530±0.039 | 0.568±0.032 | 0.605±0.008 | |
| | 50 | 6.1% | 0.606±0.020 | 0.617±0.003 | 0.576±0.019 | 0.579±0.032 | 0.620±0.007 | |
| ogbg-molhiv (ROC-AUC) | 1 | 0.01% | 0.719±0.009 | 0.721±0.002 | 0.721±0.002 | 0.718±0.013 | 0.726±0.003 | 0.757±0.007 |
| | 10 | 0.06% | 0.720±0.011 | 0.725±0.006 | 0.713±0.009 | 0.728±0.002 | 0.728±0.005 | |
| | 50 | 0.3% | 0.721±0.014 | 0.725±0.003 | 0.725±0.006 | 0.726±0.010 | 0.731±0.004 | |
| DD (Accuracy) | 1 | 0.2% | 57.69±4.92 | 61.97±1.32 | 61.97±1.32 | 58.81±2.90 | 70.42±2.21 | 78.92±0.64 |
| | 10 | 2.1% | 64.69±2.55 | 69.79±2.30 | 63.46±2.38 | 61.84±1.44 | 73.53±1.13 | |
| | 50 | 10.6% | 67.29±1.53 | 73.95±1.70 | 67.41±0.92 | 61.27±1.01 | 77.04±1.86 | |
| MUTAG (Accuracy) | 1 | 1.3% | 67.47±9.74 | 70.84±7.71 | 70.84±7.71 | 75.00±8.16 | 82.21±1.61 | 88.63±1.44 |
| | 10 | 13.3% | 77.89±7.55 | 80.42±1.89 | 81.00±2.51 | 82.66±0.68 | 82.76±2.31 | |
| | 20 | 26.7% | 78.21±5.13 | 80.00±1.10 | 82.97±4.91 | 82.89±1.03 | 83.26±2.34 | |
| NCI1 (Accuracy) | 1 | 0.1% | 51.27±1.22 | 53.98±0.67 | 53.98±0.67 | 51.14±1.08 | 56.58±0.48 | 71.70±0.20 |
| | 10 | 0.6% | 54.33±3.14 | 57.11±0.56 | 53.21±1.44 | 51.86±0.81 | 58.02±1.05 | |
| | 50 | 3.0% | 58.51±1.73 | 58.94±0.83 | 56.58±3.08 | 52.17±1.90 | 60.07±1.58 | |
| CIFAR10 (Accuracy) | 1 | 0.06% | 15.61±0.52 | 22.38±0.49 | 22.37±0.50 | 21.60±0.42 | 24.70±0.70 | 50.75±0.14 |
| | 10 | 0.2% | 23.07±0.76 | 28.81±0.35 | 20.93±0.62 | 29.27±0.77 | 30.70±0.23 | |
| | 50 | 1.1% | 30.56±0.81 | 33.94±0.37 | 24.17±0.51 | 34.47±0.52 | 35.34±0.14 | |
| E-commerce (Accuracy) | 1 | 0.2% | 51.31±2.89 | 52.18±0.25 | 52.36±0.38 | 57.14±1.72 | 60.82±1.23 | 69.25±0.50 |
| | 10 | 0.9% | 54.99±2.74 | 56.83±0.87 | 56.49±0.36 | 61.03±1.32 | 64.73±1.34 | |
| | 20 | 3.6% | 57.80±3.58 | 62.56±0.71 | 62.76±0.45 | 64.92±1.35 | 67.71±1.22 | |

Table 1. The classification performance comparison to baselines. We report ROC-AUC for the first three datasets and accuracy (%) for the others. Whole Dataset indicates the performance achieved with the original dataset.

Training Algorithm. We provide the details of our proposed framework in Algorithm 1. Specifically, we sample model initializations to perform one-step gradient matching. Following the convention in DC (Zhao et al., 2021a), we match gradients and update synthetic graphs for each class separately in order to make matching easier. For each class, we first retrieve the synthetic graphs of that class and sample a batch of real graphs of the same class. We then forward them to the graph neural network and calculate the one-step gradient matching loss together with the regularization terms. Afterwards, the structure parameters and synthetic features are updated via gradient descent. It is worth noting that the training process for each class can run in parallel, since the graph updates for one class are independent of those for another class.
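The class-wise matching can be organized as below; plain Python lists stand in for the actual graph data structures, so this is a simplification of the real batching:

```python
import random
from collections import defaultdict

def class_batches(graphs, labels, batch_size):
    """Yield one batch of real graphs per class, so that gradient
    matching and synthetic-graph updates can run class by class
    (and hence in parallel across classes)."""
    by_class = defaultdict(list)
    for g, y in zip(graphs, labels):
        by_class[y].append(g)
    for c in sorted(by_class):
        items = by_class[c]
        yield c, random.sample(items, min(batch_size, len(items)))
```

Each yielded batch is matched only against the synthetic graphs carrying the same label, which keeps the per-class objectives decoupled.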

Comparison with DC. Recall that the gradient matching scheme in DC involves a complex bi-level optimization. If we denote the number of inner iterations as $T_i$ and that of outer iterations as $T_o$, the computational complexity of DC is roughly $T_o \times T_i$ times that of our method, so DC is significantly slower than DosCond. In addition to speeding up condensation, DosCond removes the burden of tuning several hyper-parameters, i.e., the numbers of outer/inner iterations and the learning rate for updating the network parameters, which can save enormous training time when learning larger synthetic sets.

Comparison with Coreset Methods. Coreset methods (Welling, 2009; Sener and Savarese, 2018) select representative data samples based on heuristics calculated on pre-trained embeddings, and thus require training the model first. Given the cheap cost of calculating and ranking heuristics, the major computational bottleneck for coreset methods is pre-training the neural network for a certain number of iterations. Likewise, our proposed DosCond has comparable complexity because it also needs to run forward and backward passes of the neural network for multiple iterations. Thus, their efficiency difference mainly depends on how many epochs we run for learning synthetic graphs in DosCond and for pre-training the model embedding in coreset methods. In practice, we find that DosCond requires even less training cost than the coreset methods, as shown in Section 3.2.2.

3. Experiment

In this section, we conduct experiments to evaluate DosCond. Particularly, we aim to answer the following questions: (a) how well can we condense a graph dataset and (b) how efficient is DosCond. Our code can be found in the supplementary files.

3.1. Experimental settings

Datasets. To evaluate the performance of our method, we use multiple molecular datasets from Open Graph Benchmark (OGB) (Hu et al., 2020) and TU Datasets (DD, MUTAG and NCI1) (Morris et al., 2020) for graph-level property classification, as well as the superpixel dataset CIFAR10 (Dwivedi et al., 2020). We also introduce a real-world e-commerce dataset. In particular, we randomly sample 1,109 sub-graphs from a large, anonymized internal knowledge graph. Each sub-graph is created from the ego network of a randomly selected product on the e-commerce website. We form a binary classification problem aiming at predicting the product category of the central product node in each sub-graph. We use the public splits for the OGB datasets and CIFAR10. For the TU Datasets and the e-commerce dataset, we randomly split the graphs into 80%/10%/10% for training/validation/test. Detailed dataset statistics are shown in the Appendix.


Baselines. We compare our proposed methods with four baselines that produce discrete structures: three coreset methods (Random, Herding (Welling, 2009) and K-Center (Farahani and Hekmatfar, 2009; Sener and Savarese, 2018)), and a dataset condensation method DCG (Zhao et al., 2021a): (a) Random: it randomly picks graphs from the training dataset. (b) Herding: it selects samples that are closest to the cluster center. Herding is often used in replay-based methods for continual learning  (Rebuffi et al., 2017; Castro et al., 2018). (c) K-Center: it selects the center samples to minimize the largest distance between a sample and its nearest center. (d) DCG: As vanilla DC (Zhao et al., 2021a) cannot generate discrete structure, we randomly select graphs from training and apply DC to learn the features for them, which we term as DCG. We use the implementations provided by Zhao et al. (2021a) for Herding, K-Center and DCG. Note that coreset methods only select existing samples from training while DCG learns the node features.

Evaluation Protocol.

To evaluate the effectiveness of the proposed method, we test the classification performance of GNNs trained with condensed graphs on the aforementioned graph datasets. Concretely, this involves three stages: (1) learning the synthetic graphs, (2) training a GCN on the synthetic graphs, and (3) testing the performance of the GCN. We first generate the condensed graphs following the procedure in Algorithm 1. Then we train a GCN classifier with the condensed graphs. Finally, we evaluate its classification performance on the real graphs from the test set. For baseline methods, we first obtain the selected/condensed graphs and then follow the same procedure. We repeat the generation process of condensed graphs 5 times with different random seeds and train GCN on these graphs with 10 different random seeds. In all experiments, we report the mean and standard deviation of these results.

Parameter Settings. When learning the synthetic graphs, we adopt a 3-layer GCN with 128 hidden units as the model for gradient matching. The learning rates for structure and feature parameters are set to 1.0 (0.01 for ogbg-molbace and CIFAR10) and 0.01, respectively. We set the number of epochs to 1000 and the regularization coefficient to 0.1. Additionally, we use mean pooling to obtain the graph representation for all datasets except ogbg-molhiv; we use sum pooling for ogbg-molhiv as it achieves better classification performance on the real dataset. During the test stage, we use a GCN with the same architecture and train the model for 500 epochs (100 epochs for ogbg-molhiv) with an initial learning rate of 0.001.

3.2. Performance with Condensed Graphs

3.2.1. Classification Performance Comparison.

To validate the effectiveness of the proposed framework, we measure the classification performance of GCN trained on condensed graphs. Specifically, we vary the number of learned synthetic graphs per class in the range of {1, 10, 50} ({1, 10, 20} for MUTAG and E-commerce) and train a GCN on these graphs. Then we evaluate the classification performance of the trained GCN on the original test graphs. Following the convention in OGB (Hu et al., 2020), we report the ROC-AUC metric for ogbg-molbace, ogbg-molbbbp and ogbg-molhiv; for the other datasets we report the classification accuracy (%). The results are summarized in Table 1. Note that the Ratio column presents the ratio of synthetic graphs to original graphs, which we name the condensation ratio; the Whole Dataset column shows the GCN performance achieved by training on the original dataset. From the table, we make the following observations:

  1. The proposed DosCond consistently achieves better performance than the baseline methods under different condensation ratios and different datasets. Notably, when generating only 2 graphs on ogbg-molbace dataset (0.2%), we achieve an ROC-AUC of 0.657 while the performance on full training set is 0.714, which means we approximate 92% of the original performance with only 0.2% data. Likewise, we are able to approximate 96.5% of the original performance on ogbg-molhiv with 0.3% data. By contrast, baselines underperform our method by a large margin. Similar observations can be made on other datasets, which demonstrates the effectiveness of learned synthetic graphs in preserving the information of the original dataset.

  2. Increasing the number of synthetic graphs can improve the classification performance. For example, on DD we can approximate the original performance by 89%/93%/98% with 0.2%/2.1%/10.6% of the data. More synthetic samples provide more learnable parameters that can preserve the information residing in the original dataset and present more diverse patterns that help train GNNs better. This observation is in line with our experimental results in Section 3.3.1.

  3. The performance on CIFAR10 is less promising due to the limited number of synthetic graphs. We posit that this dataset has more complex topology and feature information and thus requires more parameters to preserve sufficient information. However, our method still outperforms the baseline methods, especially when producing only 1 sample per class, which suggests that our method is much more data-efficient. Moreover, we are able to improve the performance on CIFAR10 by learning a larger synthetic set, as shown in Section 3.3.1.

  4. Learning both the synthetic graph structure and node features is necessary for preserving the information in the original graph datasets. By checking the performance of DCG, which only learns node features on top of randomly selected graph structures, we see that DCG underperforms DosCond by a large margin in most cases. This indicates that learning node features alone is sub-optimal for condensing graphs.

3.2.2. Efficiency Comparison

Since one of our goals is to enable scalable dataset condensation, we now evaluate the efficiency of DosCond. We compare DosCond with the coreset method Herding, as it is less time-consuming than DCG and generally achieves better performance than the other baselines. We adopt the same setting as in Table 1: 1000 iterations for DosCond, and 500 epochs (100 epochs for ogbg-molhiv) for pre-training the graph convolutional network as required by Herding. We also note that pre-training the neural network needs to go over the whole dataset at every epoch, while DosCond only processes a batch of graphs. In Table 2, we report the running time on an NVIDIA V100 GPU for CIFAR10, ogbg-molhiv and DD. From the table, we make the following observations:

  1. DosCond can be faster than Herding. In fact, DosCond requires less training time in all cases except DD with 50 graphs per class. Herding needs to fully train the model on the whole dataset to obtain good-quality embeddings, which can be quite time-consuming. On the contrary, DosCond only requires matching gradients for sampled initializations and does not need to fully train the model on the large real dataset.

  2. The running time of DosCond increases with the number of synthetic graphs. This is because DosCond processes all the condensed graphs at each iteration, and the forward cost of an L-layer GCN grows with the number and size of these graphs. Thus, the additional complexity depends on the size of the synthetic set. By contrast, increasing the number of selected samples has little impact on Herding, since selecting samples based on a pre-defined heuristic is very fast.

  3. The average number of nodes per synthetic graph also impacts the training cost of DosCond. For instance, the training cost on ogbg-molhiv (about 26 nodes on average) is much lower than that on DD (about 285), and thus the cost gap between the two methods differs considerably between ogbg-molhiv and DD. As mentioned earlier, this is because the complexity of the GCN forward pass on the condensed graphs grows with their node size.

To summarize, the efficiency difference between Herding and DosCond depends on the number of condensed/selected samples and the number of training iterations adopted in practice; empirically, we found that DosCond incurs lower training cost.

CIFAR10 ogbg-molhiv DD
G./Cls. Herding DosCond Herding DosCond Herding DosCond
1 44.5m 4.7m 4.3m 0.66m 1.6m 1.5m
10 44.5m 4.9m 4.3m 0.67m 1.6m 1.5m
50 44.5m 5.7m 4.3m 0.68m 1.6m 2.0m
Table 2. Comparison of running time (minutes).
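For context on where Herding's time goes: the selection step itself is cheap once embeddings are available, and the expensive part is pre-training the encoder. Below is a minimal numpy sketch of herding-style selection in the spirit of (Welling, 2009); the function name and details are illustrative, not taken from the paper's code:

```python
import numpy as np

def herding_select(embeddings: np.ndarray, k: int) -> list:
    """Greedily select k samples whose running mean tracks the class mean."""
    mu = embeddings.mean(axis=0)          # target: class mean embedding
    w = mu.copy()
    selected = []
    for _ in range(k):
        scores = embeddings @ w
        scores[selected] = -np.inf        # sample without replacement
        i = int(np.argmax(scores))
        selected.append(i)
        w += mu - embeddings[i]           # herding update (Welling, 2009)
    return selected
```

Given embeddings, selecting k samples costs only O(kNd), which is consistent with the observation above that Herding's runtime is dominated by model pre-training rather than by selection.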
Figure 1. (a) Learning a larger synthetic set; (b) one-step v.s. bi-level matching; (c) varying the sparsity coefficient on DD; (d) varying the sparsity coefficient on NCI1. Panels (c) and (d) show the parameter analysis w.r.t. the sparsity regularization.
Figure 2. T-SNE visualizations of embeddings learned with condensed graphs on DD: (a) Random; (b) DCG; (c) DosCond; (d) Whole Dataset.

3.3. Further Investigation

In this subsection, we perform further investigations to provide a better understanding of our proposed method.

3.3.1. Increasing the Number of Synthetic Graphs.

We study whether the classification performance can be further boosted by using a larger synthetic set. Concretely, we vary the size of the learned set from 1 to 300 graphs and report the absolute accuracy and the relative accuracy w.r.t. whole-dataset training for CIFAR10 in Figure 1(a). Both Random and DosCond achieve better performance as we increase the number of samples used for training. Moreover, our method outperforms the random baseline under all condensed dataset sizes. It is worth noting that the performance gap between the two methods diminishes as the number of samples increases. This is because the random baseline eventually approaches whole-dataset training if we continue to enlarge the condensed set, whose performance can be considered an upper bound for DosCond.

3.3.2. Ablation Study.

To examine how different components affect model performance, we perform an ablation study on the proposed one-step gradient matching and regularization terms. We create an ablation of our method, namely DosCond-Bi, which adopts the vanilla gradient matching scheme involving bi-level optimization. Without loss of generality, we compare the training time and classification accuracy of DosCond and DosCond-Bi when learning 50 synthetic graphs per class on CIFAR10. The results are summarized in Figure 1(b): DosCond needs approximately 5 minutes to reach the performance of DosCond-Bi trained for 75 minutes, i.e., only about 6.7% of the training cost. This further demonstrates the efficiency of the proposed one-step gradient matching strategy.
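To make the contrast concrete, the core of one-step matching can be sketched for a toy linear softmax classifier: at a sampled weight initialization, compute the loss gradients on real and synthetic data and update only the synthetic data to bring the two gradients closer; the bi-level variant would additionally train the weights for several inner steps before each match. The following numpy sketch uses hypothetical helper names and a finite-difference gradient to stay dependency-free; it is an illustration, not the paper's implementation:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ce_grad(W, X, Y):
    """Gradient of the cross-entropy loss of a linear softmax classifier."""
    return X.T @ (softmax(X @ W) - Y) / len(X)

def match_loss(W, Xr, Yr, Xs, Ys):
    """Squared distance between real-data and synthetic-data gradients at W."""
    return float(np.sum((ce_grad(W, Xr, Yr) - ce_grad(W, Xs, Ys)) ** 2))

def one_step_update(W, Xr, Yr, Xs, Ys, lr=0.01, eps=1e-5):
    """One-step scheme: at a fixed (untrained) initialization W, nudge the
    synthetic features to reduce the gradient-matching loss. Uses a
    finite-difference gradient w.r.t. Xs to keep the sketch self-contained."""
    base = match_loss(W, Xr, Yr, Xs, Ys)
    g = np.zeros_like(Xs)
    for idx in np.ndindex(*Xs.shape):
        Xp = Xs.copy()
        Xp[idx] += eps
        g[idx] = (match_loss(W, Xr, Yr, Xp, Ys) - base) / eps
    return Xs - lr * g                    # update synthetic data only
```

In this sketch the one-step scheme simply repeats `one_step_update` over freshly sampled initializations W, whereas the bi-level scheme would nest several weight-training steps inside each outer iteration.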

Next we study the effect of the sparsity regularization on DosCond. Specifically, we vary the sparsity coefficient and report the classification accuracy and graph sparsity on the DD and NCI1 datasets in Figures 1(c) and 1(d). Note that graph sparsity is defined as the ratio of the number of edges to the square of the number of nodes. As shown in the figure, a larger coefficient exerts a stronger regularization on the learned graphs and the graphs become sparser. Furthermore, the increased sparsity does not hurt the classification performance. This is a desirable property, since sparse graphs save storage space and reduce the training cost of GNNs. When we remove the regularization of Eq. (14) for ogbg-molhiv, we obtain performances of 0.724/0.727/0.731 for 1/10/50 graphs per class, slightly worse than with the regularization.
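For concreteness, the sparsity metric used above, together with one plausible L1-style penalty on Bernoulli edge probabilities, can be sketched as follows; the penalty is an illustrative form, not necessarily the paper's exact Eq. (14):

```python
import numpy as np

def graph_sparsity(adj: np.ndarray) -> float:
    """Ratio of the number of edges to the square of the number of nodes."""
    n = adj.shape[0]
    return float(adj.sum()) / (n * n)

def sparsity_penalty(edge_logits: np.ndarray, coeff: float) -> float:
    """Penalize the expected edge density of a probabilistic (Bernoulli)
    structure. Illustrative L1-style form on the edge probabilities."""
    probs = 1.0 / (1.0 + np.exp(-edge_logits))   # sigmoid of learnable logits
    return float(coeff * probs.sum())
```

Since the penalty acts on the expected edge probabilities, increasing the coefficient directly pushes the sampled discrete structures toward lower `graph_sparsity`, matching the trend in Figures 1(c) and 1(d).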

              Cora (r=2.6%)   Citeseer (r=1.8%)   Pubmed (r=0.3%)   Arxiv (r=0.25%)   Flickr (r=0.1%)
GCond         80.1 (75.9s)    70.6 (71.8s)        77.9 (51.7s)      59.2 (494.3s)     46.5 (51.9s)
DosCond       80.0 (3.5s)     71.0 (2.8s)         76.0 (1.3s)       59.0 (32.9s)      46.1 (14.3s)
Whole Dataset 81.5            71.7                79.3              71.4              47.2
Table 3. Node classification accuracy (%) comparison. The numbers in parentheses indicate the running time for 100 epochs, and r indicates the ratio of the number of nodes in the condensed graph to that in the original graph.

3.3.3. Visualization.

We further investigate whether GCN can learn discriminative representations from the synthetic graphs learned by DosCond. Specifically, we use t-SNE (Van der Maaten and Hinton, 2008) to visualize the graph representations of a GCN trained on different condensed graphs. We train a GCN on graphs produced by each method and use it to extract latent representations for real graphs from the test set. Without loss of generality, we provide the t-SNE plots on the DD dataset with 50 graphs per class in Figure 2. We observe that the representations learned with randomly selected graphs are mixed across classes, suggesting that randomly selected graphs cannot help GCN learn discriminative features. Similarly, DCG graphs also result in a poorly trained GCN that outputs indistinguishable representations. By contrast, the representations learned with DosCond graphs (Figure 2(c)) are well separated across classes and are as discriminative as those learned on the whole training dataset (Figure 2(d)). This demonstrates that the graphs learned by DosCond preserve sufficient information of the original dataset to recover the original performance.

Figure 3. Scale of the two terms in Eq. (11): (a) DD; (b) ogbg-molhiv.

3.3.4. Scale of the two terms in Eq. (11).

As mentioned in Section 2.3, the scale of the first term in Eq. (11) is substantially larger than that of the second term. We now perform an empirical study to verify this statement. Since both terms share a common factor, we drop it and focus on the two remaining quantities. Specifically, we set the corresponding sizes to 500 and 50, respectively, and plot the changes of the two terms during the training of DosCond. The results on DD (with mean pooling) and ogbg-molhiv (with sum pooling) are shown in Figure 3. We observe that the scale of the first term is much larger than that of the second at the first few epochs when using mean pooling, as shown in Figure 3(a). By contrast, the second term is not negligible when using sum pooling, as shown in Figure 3(b), and it is thus desirable to include it as a regularization term in this case. These observations support our theoretical analysis in Section 2.3.

3.4. Node Classification

Next, we investigate whether the proposed method works well for node classification, so as to support our analysis in Theorem 2 in Appendix B.2. Specifically, following GCond (Jin et al., 2022b), a condensation method for node classification, we use 5 node classification datasets: Cora, Citeseer, Pubmed (Kipf and Welling, 2017), ogbn-arxiv (Hu et al., 2020) and Flickr (Zeng et al., 2020). The dataset statistics are shown in Table 5. We follow the settings in GCond to generate one condensed graph for each dataset, train a GCN on the condensed graph, and evaluate its classification performance on the original test nodes. To adapt DosCond to node classification, we replace the bi-level gradient matching scheme in GCond with our proposed one-step gradient matching. The results of classification accuracy and running time for 100 epochs are summarized in Table 3. From the table, we make the following observations:

  1. The proposed DosCond achieves performance similar to GCond, and both are comparable to training on the original dataset. For example, we recover 99% of the original training performance with only 2.6% of the data on Cora. This demonstrates the effectiveness of DosCond for node classification and empirically justifies Theorem 2.

  2. The training cost of DosCond is substantially lower than that of GCond, as DosCond avoids the expensive bi-level optimization. Examining their running times, we see that DosCond is up to 40 times faster than GCond.

We further note that GCond produces weighted graphs, which require storing the edge weights as floating-point values, while DosCond outputs discrete graph structures that can be stored as binary values. Hence, the graphs learned by DosCond are more memory-efficient.
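The storage argument can be quantified with a small numpy illustration (the sizes here are illustrative, not from the paper): a dense weighted adjacency in float32 uses 32 bits per entry, while a binary structure can be bit-packed to 1 bit per entry, roughly a 32x saving.

```python
import numpy as np

n = 285                                            # e.g., average graph size on DD
rng = np.random.default_rng(0)
weighted = rng.random((n, n)).astype(np.float32)   # weighted edges (float32)
binary = weighted > 0.5                            # discrete structure (bool)
packed = np.packbits(binary, axis=None)            # 1 bit per possible edge

float_bytes = weighted.nbytes                      # n*n*4 bytes
packed_bytes = packed.nbytes                       # ceil(n*n/8) bytes
```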

4. Related Work

Graph Neural Networks. As the generalization of deep neural networks to graph data, graph neural networks (GNNs) (Kipf and Welling, 2017; Klicpera et al., 2019; Velickovic et al., 2018; Wu et al., 2019b, a; Tang et al., 2020; Jin et al., 2020; Liu et al., 2022; Wang et al., 2022b) have revolutionized the field of graph representation learning by effectively exploiting graph structural information. GNNs have achieved remarkable performance in basic graph-related tasks such as graph classification (Xu et al., 2019; Guo et al., 2021), link prediction (Fan et al., 2019) and node classification (Kipf and Welling, 2017). Recent years have also witnessed their great success in many real-world applications such as recommender systems (Fan et al., 2019), computer vision (Li et al., 2019) and drug discovery (Duvenaud et al., 2015). GNNs take both the adjacency matrix and the node feature matrix as input and output node-level or graph-level representations. Essentially, they follow a message-passing scheme (Gilmer et al., 2017) where each node first aggregates information from its neighborhood and then transforms the aggregated information to update its representation. Furthermore, there has been significant progress in developing deeper GNNs (Liu et al., 2020; Jin et al., 2022a), self-supervised GNNs (You et al., 2021, 2020; Wang et al., 2022a) and graph data augmentation (Zhao et al., 2021b; Ding et al., 2022; Zhao et al., 2022).

Dataset Distillation & Dataset Condensation. It is widely recognized that training neural networks on large datasets can be prohibitively costly. To alleviate this issue, dataset distillation (DD) (Wang et al., 2018) aims to distill the knowledge of a large training dataset into a small number of synthetic samples. DD formulates the distillation process as a learning-to-learn problem and solves it through bi-level optimization. To improve the efficiency of DD, dataset condensation (DC) (Zhao et al., 2021a; Zhao and Bilen, 2021a) learns the small synthetic dataset by matching the gradients of the network parameters w.r.t. the large-real and small-synthetic training data. It has been demonstrated that these condensed samples can facilitate critical applications such as continual learning (Zhao et al., 2021a; Zhao and Bilen, 2021a; Kim et al., 2022; Lee et al., 2022; Zhao and Bilen, 2021b), neural architecture search (Nguyen et al., 2021a, b; Yang et al., 2022) and privacy-preserving scenarios (Dong et al., 2022). Recently, following the gradient matching scheme in DC, Jin et al. (2022b) proposed a condensation method that condenses a large graph into a small graph for node classification. Different from (Jin et al., 2022b), which learns a weighted graph structure, we aim to solve the challenge of learning discrete structure, and we mainly target graph classification. Moreover, our method avoids the costly bi-level optimization and is much more efficient than the previous work. A detailed comparison is included in Section 3.4.

5. Conclusion

Training graph neural networks on large-scale graph datasets incurs high computational cost. One solution to alleviate this issue is to condense the large graph dataset into a small synthetic one. In this work, we propose DosCond, a novel framework that adopts a one-step gradient matching strategy to efficiently condense real graphs into a small number of informative graphs with discrete structures. We further justify the proposed method from both theoretical and empirical perspectives. Notably, our experiments show that we are able to reduce the dataset size by 90% while approximating up to 98% of the original performance. In the future, we plan to investigate interpretable condensation methods and diverse applications of the condensed graphs.


Wei Jin and Jiliang Tang are supported by the National Science Foundation (NSF) under grant numbers IIS1714741, CNS1815636, IIS1845081, IIS1907704, IIS1928278, IIS1955285, IOS2107215, and IOS2035472, the Army Research Office (ARO) under grant number W911NF-21-1-0198, and, Inc.


  • A. Abid, M. F. Balin, and J. Zou (2019) Concrete autoencoders for differentiable feature selection and reconstruction. arXiv preprint arXiv:1901.09346. Cited by: §2.2.
  • P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. (2018) Relational inductive biases, deep learning, and graph networks. ArXiv preprint. Cited by: §1.
  • F. M. Castro, M. J. Marín-Jiménez, N. Guil, C. Schmid, and K. Alahari (2018) End-to-end incremental learning. In ECCV, Cited by: §3.1.
  • K. Ding, Z. Xu, H. Tong, and H. Liu (2022) Data augmentation for deep graph learning: a survey. arXiv preprint arXiv:2202.08235. Cited by: §4.
  • T. Dong, B. Zhao, and L. Lyu (2022) Privacy for free: how does dataset condensation help privacy?. In ICML, Cited by: §4.
  • D. Duvenaud, D. Maclaurin, J. Aguilera-Iparraguirre, R. Gómez-Bombarelli, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams (2015) Convolutional networks on graphs for learning molecular fingerprints. In NeurIPS, Cited by: §4.
  • V. P. Dwivedi, C. K. Joshi, T. Laurent, Y. Bengio, and X. Bresson (2020) Benchmarking graph neural networks. arXiv preprint arXiv:2003.00982. Cited by: §3.1.
  • W. Fan, Y. Ma, Q. Li, Y. He, Y. E. Zhao, J. Tang, and D. Yin (2019) Graph neural networks for social recommendation. In WWW, Cited by: §1, §4.
  • R. Z. Farahani and M. Hekmatfar (2009) Facility location: concepts, models, algorithms and case studies. Cited by: §3.1.
  • J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. In ICML, Cited by: §4.
  • Z. Guo, C. Zhang, W. Yu, J. Herr, O. Wiest, M. Jiang, and N. V. Chawla (2021) Few-shot graph learning for molecular property prediction. In Proceedings of the Web Conference 2021, pp. 2559–2567. Cited by: §4.
  • W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec (2020) Open graph benchmark: datasets for machine learning on graphs. In NeurIPS, Cited by: §3.1, §3.2.1, §3.4.
  • W. Jin, X. Liu, Y. Ma, C. Aggarwal, and J. Tang (2022a) Feature overcorrelation in deep graph neural networks: a new perspective. In KDD, Cited by: §4.
  • W. Jin, Y. Ma, X. Liu, X. Tang, S. Wang, and J. Tang (2020) Graph structure learning for robust graph neural networks. In KDD, Cited by: §4.
  • W. Jin, L. Zhao, S. Zhang, Y. Liu, J. Tang, and N. Shah (2022b) Graph condensation for graph neural networks. In ICLR 2022, Cited by: §3.4, §4.
  • K. Killamsetty, D. S, G. Ramakrishnan, A. De, and R. Iyer (2021) GRAD-match: gradient matching based data subset selection for efficient deep model training. In ICML, Cited by: §B.1.
  • J. Kim, J. Kim, S. J. Oh, S. Yun, H. Song, J. Jeong, J. Ha, and H. O. Song (2022) Dataset condensation via efficient synthetic-data parameterization. arXiv:2205.14959. Cited by: §4.
  • T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In ICLR, Cited by: §1, §3.4, §4.
  • J. Klicpera, A. Bojchevski, and S. Günnemann (2019) Predict then propagate: graph neural networks meet personalized pagerank. In ICLR 2019, Cited by: §4.
  • S. Lee, S. Chun, S. Jung, S. Yun, and S. Yoon (2022) Dataset condensation with contrastive signals. In ICML, Cited by: §4.
  • G. Li, M. Müller, A. K. Thabet, and B. Ghanem (2019) DeepGCNs: can gcns go as deep as cnns?. In ICCV, Cited by: §4.
  • Z. Li and D. Hoiem (2017) Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence 40 (12), pp. 2935–2947. Cited by: §1.
  • H. Liu, K. Simonyan, and Y. Yang (2019) DARTS: differentiable architecture search. In ICLR, Cited by: §1.
  • M. Liu, H. Gao, and S. Ji (2020) Towards deeper graph neural networks. In KDD, Cited by: §4.
  • M. Liu, Y. Luo, K. Uchino, K. Maruhashi, and S. Ji (2022) Generating 3d molecules for target protein binding. In ICML, Cited by: §4.
  • C. J. Maddison, A. Mnih, and Y. W. Teh (2016) The concrete distribution: a continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712. Cited by: §2.2.
  • C. Morris, N. M. Kriege, F. Bause, K. Kersting, P. Mutzel, and M. Neumann (2020) Tudataset: a collection of benchmark datasets for learning with graphs. arXiv preprint arXiv:2007.08663. Cited by: §3.1.
  • T. Nguyen, Z. Chen, and J. Lee (2021a) Dataset meta-learning from kernel ridge-regression. In ICLR, Cited by: §1, §4.
  • T. Nguyen, R. Novak, L. Xiao, and J. Lee (2021b) Dataset distillation with infinitely wide convolutional networks. NeurIPS 34. Cited by: §1, §4.
  • S. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert (2017) ICaRL: incremental classifier and representation learning. In CVPR, Cited by: §3.1.
  • O. Sener and S. Savarese (2018) Active learning for convolutional neural networks: A core-set approach. In ICLR, Cited by: §2.4, §3.1.
  • X. Tang, Y. Li, Y. Sun, H. Yao, P. Mitra, and S. Wang (2020) Transferring robustness for graph neural network against poisoning attacks. In WSDM, Cited by: §4.
  • L. Van der Maaten and G. Hinton (2008) Visualizing data using t-sne.. Journal of machine learning research (11). Cited by: §3.3.3.
  • P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. In ICLR, Cited by: §1, §4.
  • D. Wang, J. Lin, P. Cui, Q. Jia, Z. Wang, Y. Fang, Q. Yu, J. Zhou, S. Yang, and Y. Qi (2019) A semi-supervised graph attentive network for financial fraud detection. In ICDM, Cited by: §1.
  • T. Wang, J. Zhu, A. Torralba, and A. A. Efros (2018) Dataset distillation. ArXiv preprint. Cited by: §1, §1, §4.
  • Y. Wang, W. Jin, and T. Derr (2022a) Graph neural networks: self-supervised learning. In Graph Neural Networks: Foundations, Frontiers, and Applications, pp. 391–420. Cited by: §4.
  • Y. Wang, Y. Zhao, Y. Dong, H. Chen, J. Li, and T. Derr (2022b) Improving fairness in graph neural networks via mitigating sensitive attribute leakage. In KDD, Cited by: §4.
  • J. Watt, R. Borhani, and A. K. Katsaggelos (2020) Machine learning refined: foundations, algorithms, and applications. Cambridge University Press. Cited by: §B.1.
  • M. Welling (2009) Herding dynamical weights to learn. In ICML, Cited by: §2.4, §3.1.
  • F. Wu, A. H. S. Jr., T. Zhang, C. Fifty, T. Yu, and K. Q. Weinberger (2019a) Simplifying graph convolutional networks. In ICML, Cited by: §2.3, Remark 1, §4.
  • Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu (2019b) A comprehensive survey on graph neural networks. ArXiv preprint. Cited by: §1, §4.
  • K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2019) How powerful are graph neural networks?. In ICLR, Cited by: §4.
  • S. Yang, Z. Xie, H. Peng, M. Xu, M. Sun, and P. Li (2022) Dataset pruning: reducing training data by examining generalization influence. arXiv preprint arXiv:2205.09329. Cited by: §4.
  • R. Yedida, S. Saha, and T. Prashanth (2021) LipschitzLR: using theoretically computed adaptive learning rates for fast convergence. Applied Intelligence 51 (3), pp. 1460–1478. Cited by: §B.1, §B.2.
  • Z. Ying, J. You, C. Morris, X. Ren, W. L. Hamilton, and J. Leskovec (2018) Hierarchical graph representation learning with differentiable pooling. In NeurIPS, Cited by: §1.
  • Y. You, T. Chen, Y. Shen, and Z. Wang (2021) Graph contrastive learning automated. In ICML, Cited by: §4.
  • Y. You, T. Chen, Z. Wang, and Y. Shen (2020) When does self-supervision help graph convolutional networks?. In ICML, Cited by: §4.
  • H. Zeng, H. Zhou, A. Srivastava, R. Kannan, and V. K. Prasanna (2020) GraphSAINT: graph sampling based inductive learning method. In ICLR, Cited by: §3.4.
  • B. Zhao and H. Bilen (2021a) Dataset condensation with differentiable siamese augmentation. In ICML, Proceedings of Machine Learning Research. Cited by: §1, §1, §2.1, §4.
  • B. Zhao and H. Bilen (2021b) Dataset condensation with distribution matching. arXiv preprint arXiv:2110.04181. Cited by: §4.
  • B. Zhao, K. R. Mopuri, and H. Bilen (2021a) Dataset condensation with gradient matching. In ICLR, Cited by: §1, §1, §1, §2.1, §2.4, §3.1, §4.
  • T. Zhao, G. Liu, S. Günnemann, and M. Jiang (2022) Graph data augmentation for graph machine learning: a survey. arXiv:2202.08871. Cited by: §4.
  • T. Zhao, Y. Liu, L. Neves, O. Woodford, M. Jiang, and N. Shah (2021b) Data augmentation for graph neural networks. In AAAI, Cited by: §4.

Appendix A Experimental Setup

a.1. Dataset Statistics and Code

Dataset statistics are shown in Tables 4 and 5. We provide our code in the supplementary file for reproducibility.

Dataset Type #Classes #Graphs Avg. Nodes Avg. Edges
CIFAR10 Superpixel 10 60,000 117.6 941.07
ogbg-molhiv Molecule 2 41,127 25.5 54.9
ogbg-molbace Molecule 2 1,513 34.1 36.9
ogbg-molbbbp Molecule 2 2,039 24.1 26.0
MUTAG Molecule 2 188 17.93 19.79
NCI1 Molecule 2 4,110 29.87 32.30
DD Molecule 2 1,178 284.32 715.66
E-commerce Transaction 2 1,109 33.7 56.3
Table 4. Graph classification dataset statistics.
Dataset #Nodes #Edges #Classes #Features
Cora 2,708 5,429 7 1,433
Citeseer 3,327 4,732 6 3,703
Pubmed 19,717 44,338 3 500
Arxiv 169,343 1,166,243 40 128
Flickr 89,250 899,756 7 500
Table 5. Node classification dataset statistics.

Appendix B Proofs

b.1. Proof of Theorem 1

Let $A_i$ and $X_i$ denote the adjacency matrix and the feature matrix of the $i$-th real graph, respectively. We denote the cross entropy loss on the real samples as $\mathcal{L}_{\mathcal{T}}(\theta)$ and that on synthetic samples as $\mathcal{L}_{\mathcal{S}}(\theta)$. Let $\theta^*$ denote the optimal parameter and let $\theta_t$ be the parameter trained on condensed data at the $t$-th epoch by optimizing $\mathcal{L}_{\mathcal{S}}$. For simplicity of notation, we assume $A_i$ and $X_i$ are already normalized. Part of the proof is inspired by the work (Killamsetty et al., 2021).

Theorem 1.

When we use a linearized $K$-layer SGC as the GNN used in condensation, i.e., $f_{\theta}(A, X) = \mathrm{Pool}(A^K X)\,\theta$, and assume that all network parameters satisfy $\|\theta_t\| \le c$, we have

$$\min_{0 \le t < T} \mathcal{L}_{\mathcal{T}}(\theta_t) - \mathcal{L}_{\mathcal{T}}(\theta^*) \;\le\; \frac{2c\sigma}{\sqrt{T}} + \frac{2c}{T} \sum_{t=0}^{T-1} \big\| \nabla \mathcal{L}_{\mathcal{T}}(\theta_t) - \nabla \mathcal{L}_{\mathcal{S}}(\theta_t) \big\|,$$

where $\sigma$ bounds $\|\nabla \mathcal{L}_{\mathcal{S}}\|$, and $\mathrm{Pool}(A^K X) = \mathbf{1}^{\top} A^K X$ if we use sum pooling in $f_{\theta}$; $\mathrm{Pool}(A^K X) = \frac{1}{n'_i} \mathbf{1}^{\top} A^K X$ if we use mean pooling, with $n'_i$ being the number of nodes in the $i$-th synthetic graph.


We start by proving that the loss is convex and Lipschitz continuous when we use the linearized SGC $f_{\theta}$ as the mapping function. Before proving these two properties, we first rewrite $f_{\theta}$ as:

$$f_{\theta}(A, X) = c_{\mathrm{pool}}\, \mathbf{1}^{\top} A^{K} X \theta,$$

where $n$ is the number of nodes in $A$, $\mathbf{1}$ is an $n \times 1$ matrix filled with constant one, and $c_{\mathrm{pool}} = 1$ for sum pooling while $c_{\mathrm{pool}} = 1/n$ for mean pooling. From the above equation we can see that $f_{\theta}$ with different pooling methods only differs in the multiplication factor $c_{\mathrm{pool}}$. Thus, in the following we focus on $f_{\theta}$ with sum pooling to derive the major proof.
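The claim that sum and mean pooling differ only by a multiplicative factor can be sanity-checked numerically for the linearized model; a small numpy check with arbitrary sizes (the variable names mirror the notation above and are otherwise illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 5, 3, 2                          # nodes, features, propagation steps
A = rng.random((n, n))                     # adjacency (normalized in the paper)
X = rng.random((n, d))                     # node features

H = np.linalg.matrix_power(A, K) @ X       # A^K X: linearized K-step propagation
sum_pool = np.ones(n) @ H                  # 1^T A^K X   (sum pooling)
mean_pool = H.mean(axis=0)                 # (1/n) 1^T A^K X  (mean pooling)
```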

I. For $f_{\theta}$ with sum pooling:

With sum pooling, the mapping function becomes $f_{\theta}(A, X) = \mathbf{1}^{\top} A^{K} X \theta$. Next we show that $\mathcal{L}_{\mathcal{S}}(\theta)$ is convex and Lipschitz continuous for this choice of $f_{\theta}$.

(a) Convexity of $\mathcal{L}_{\mathcal{S}}$. From Chapter 4 of the book (Watt et al., 2020), we know that softmax classification with cross entropy loss is convex w.r.t. the parameters. In our case, the mapping function $f_{\theta}$ is an affine function of $\theta$, and composition with an affine function does not change convexity. Hence $\mathcal{L}_{\mathcal{S}}(\theta)$ is convex.

(b) Lipschitz continuity of $\mathcal{L}_{\mathcal{S}}$. It is shown in (Yedida et al., 2021) that the Lipschitz constant of softmax regression with cross entropy loss is determined by the input feature matrix, the number of classes, and the number of samples. Since $\mathcal{L}_{\mathcal{S}}$ is a cross entropy loss and $f_{\theta}$ is linear in $\theta$, $\mathcal{L}_{\mathcal{S}}$ is Lipschitz continuous, i.e., there exists a constant $\sigma$ such that

$$\|\nabla \mathcal{L}_{\mathcal{S}}(\theta)\| \le \sigma. \quad (17)$$
With (a) and (b), we are able to proceed with our proof. First, from the convexity of $\mathcal{L}_{\mathcal{T}}$ we have

$$\mathcal{L}_{\mathcal{T}}(\theta_t) - \mathcal{L}_{\mathcal{T}}(\theta^*) \le \nabla \mathcal{L}_{\mathcal{T}}(\theta_t)^{\top} (\theta_t - \theta^*). \quad (18)$$
We can rewrite $\nabla \mathcal{L}_{\mathcal{T}}(\theta_t)$ as follows:

$$\nabla \mathcal{L}_{\mathcal{T}}(\theta_t) = \nabla \mathcal{L}_{\mathcal{S}}(\theta_t) + \big( \nabla \mathcal{L}_{\mathcal{T}}(\theta_t) - \nabla \mathcal{L}_{\mathcal{S}}(\theta_t) \big). \quad (19)$$
Given that we use gradient descent to update the network parameters, we have $\theta_{t+1} = \theta_t - \eta \nabla \mathcal{L}_{\mathcal{S}}(\theta_t)$, where $\eta$ is the learning rate. Then we have

$$\nabla \mathcal{L}_{\mathcal{S}}(\theta_t)^{\top} (\theta_t - \theta^*) = \frac{\|\theta_t - \theta^*\|^2 - \|\theta_{t+1} - \theta^*\|^2}{2\eta} + \frac{\eta}{2} \|\nabla \mathcal{L}_{\mathcal{S}}(\theta_t)\|^2. \quad (20)$$
Combining Eq. (18) and Eq. (20) we have

$$\mathcal{L}_{\mathcal{T}}(\theta_t) - \mathcal{L}_{\mathcal{T}}(\theta^*) \le \frac{\|\theta_t - \theta^*\|^2 - \|\theta_{t+1} - \theta^*\|^2}{2\eta} + \frac{\eta}{2} \|\nabla \mathcal{L}_{\mathcal{S}}(\theta_t)\|^2 + \big( \nabla \mathcal{L}_{\mathcal{T}}(\theta_t) - \nabla \mathcal{L}_{\mathcal{S}}(\theta_t) \big)^{\top} (\theta_t - \theta^*). \quad (21)$$
We sum up the two sides of the above inequality for $t = 0, 1, \ldots, T-1$:

$$\sum_{t=0}^{T-1} \big( \mathcal{L}_{\mathcal{T}}(\theta_t) - \mathcal{L}_{\mathcal{T}}(\theta^*) \big) \le \frac{\|\theta_0 - \theta^*\|^2 - \|\theta_T - \theta^*\|^2}{2\eta} + \sum_{t=0}^{T-1} \Big( \frac{\eta}{2} \|\nabla \mathcal{L}_{\mathcal{S}}(\theta_t)\|^2 + \big( \nabla \mathcal{L}_{\mathcal{T}}(\theta_t) - \nabla \mathcal{L}_{\mathcal{S}}(\theta_t) \big)^{\top} (\theta_t - \theta^*) \Big). \quad (22)$$
Since $\min_{0 \le t < T} \mathcal{L}_{\mathcal{T}}(\theta_t) \le \frac{1}{T} \sum_{t=0}^{T-1} \mathcal{L}_{\mathcal{T}}(\theta_t)$ and $\|\theta_T - \theta^*\|^2 \ge 0$, we have

$$\min_{0 \le t < T} \mathcal{L}_{\mathcal{T}}(\theta_t) - \mathcal{L}_{\mathcal{T}}(\theta^*) \le \frac{\|\theta_0 - \theta^*\|^2}{2\eta T} + \frac{1}{T} \sum_{t=0}^{T-1} \Big( \frac{\eta}{2} \|\nabla \mathcal{L}_{\mathcal{S}}(\theta_t)\|^2 + \big( \nabla \mathcal{L}_{\mathcal{T}}(\theta_t) - \nabla \mathcal{L}_{\mathcal{S}}(\theta_t) \big)^{\top} (\theta_t - \theta^*) \Big). \quad (23)$$
As we assume that $\|\theta_t\| \le c$, we have $\|\theta_t - \theta^*\| \le 2c$. Then Eq. (23) can be rewritten as

$$\min_{0 \le t < T} \mathcal{L}_{\mathcal{T}}(\theta_t) - \mathcal{L}_{\mathcal{T}}(\theta^*) \le \frac{2c^2}{\eta T} + \frac{\eta}{2T} \sum_{t=0}^{T-1} \|\nabla \mathcal{L}_{\mathcal{S}}(\theta_t)\|^2 + \frac{2c}{T} \sum_{t=0}^{T-1} \big\| \nabla \mathcal{L}_{\mathcal{T}}(\theta_t) - \nabla \mathcal{L}_{\mathcal{S}}(\theta_t) \big\|. \quad (24)$$
Recall that $\mathcal{L}_{\mathcal{S}}$ is Lipschitz continuous as shown in Eq. (17), i.e., $\|\nabla \mathcal{L}_{\mathcal{S}}(\theta_t)\| \le \sigma$; combining this with Eq. (24) gives

$$\min_{0 \le t < T} \mathcal{L}_{\mathcal{T}}(\theta_t) - \mathcal{L}_{\mathcal{T}}(\theta^*) \le \frac{2c^2}{\eta T} + \frac{\eta \sigma^2}{2} + \frac{2c}{T} \sum_{t=0}^{T-1} \big\| \nabla \mathcal{L}_{\mathcal{T}}(\theta_t) - \nabla \mathcal{L}_{\mathcal{S}}(\theta_t) \big\|. \quad (25)$$
Then we choose $\eta = \frac{2c}{\sigma \sqrt{T}}$ and we can get:

$$\min_{0 \le t < T} \mathcal{L}_{\mathcal{T}}(\theta_t) - \mathcal{L}_{\mathcal{T}}(\theta^*) \le \frac{2c\sigma}{\sqrt{T}} + \frac{2c}{T} \sum_{t=0}^{T-1} \big\| \nabla \mathcal{L}_{\mathcal{T}}(\theta_t) - \nabla \mathcal{L}_{\mathcal{S}}(\theta_t) \big\|. \quad (26)$$
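The single algebraic identity driving the telescoping argument, expanding $\|\theta_{t+1} - \theta^*\|^2$ for a gradient step as used around Eq. (20), can be verified numerically; below is a quick numpy sanity check with arbitrary vectors standing in for the parameters and the gradient:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = rng.normal(size=8)                 # current parameters theta_t
theta_star = rng.normal(size=8)            # reference parameters theta*
g = rng.normal(size=8)                     # gradient of the synthetic loss
eta = 0.1                                  # learning rate

theta_next = theta - eta * g               # gradient descent step theta_{t+1}
lhs = g @ (theta - theta_star)
rhs = ((np.sum((theta - theta_star) ** 2)
        - np.sum((theta_next - theta_star) ** 2)) / (2 * eta)
       + (eta / 2) * (g @ g))              # right-hand side of the identity
```

Expanding the squared norm of `theta_next - theta_star` shows the two sides agree exactly for any vectors and any positive step size.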
II. For $f_{\theta}$ with mean pooling:

Following similar derivation as in the case of sum pooling, we have