1. Introduction
Graph-structured data plays a key role in various real-world applications. For example, by exploiting graph structural information, we can predict the chemical properties of a given molecular graph (Ying et al., 2018), detect fraudulent activities in a financial transaction graph (Wang et al., 2019), or recommend new friends to users in a social network (Fan et al., 2019). Due to its prevalence, graph neural networks (GNNs)
(Kipf and Welling, 2017; Velickovic et al., 2018; Battaglia et al., 2018; Wu et al., 2019b) have been developed to effectively extract meaningful patterns from graph data and thus tremendously facilitate computational tasks on graphs. Despite their effectiveness, GNNs are notoriously data-hungry, like traditional deep neural networks: they usually require massive datasets to learn powerful representations. Thus, training GNNs is often computationally expensive. Such cost even becomes prohibitive when we need to repeatedly train GNNs, e.g., in neural architecture search (Liu et al., 2019) and continual learning (Li and Hoiem, 2017). One potential solution to alleviate the aforementioned issue is dataset condensation, also known as dataset distillation. It aims to construct a small synthetic training set that provides sufficient information to train neural networks (Wang et al., 2018; Zhao et al., 2021a; Zhao and Bilen, 2021a; Nguyen et al., 2021a, b). In particular, one representative method, DC (Zhao et al., 2021a), formulates the condensation goal as matching the gradients of the network parameters between the small synthetic and the large real training data. It has been demonstrated that such a solution can greatly reduce the training set size of image datasets without significantly sacrificing model performance. For example, using images generated by DC can achieve
a test accuracy on MNIST close to that obtained on the full original dataset. These condensed samples can significantly reduce the space needed to store datasets and speed up retraining neural networks in many critical applications, e.g., continual learning and neural architecture search. Despite the recent advances in dataset distillation/condensation for images, limited attention has been paid to domains involving graph structures. To bridge this gap, we investigate the problem of condensing graphs such that GNNs trained on the condensed graphs achieve performance comparable to those trained on the original dataset. However, directly applying existing dataset condensation solutions (Wang et al., 2018; Zhao et al., 2021a; Zhao and Bilen, 2021a; Nguyen et al., 2021a) to the graph domain faces some challenges. First, existing solutions have been designed for images, where the data is continuous; they cannot output binary values to form the discrete graph structure. Thus, we need to develop a strategy that can handle the discrete nature of graphs. Second, they usually involve a complex bilevel problem that is computationally expensive to optimize: they require multiple iterations (inner iterations) of updating the neural network parameters before updating the synthetic data for multiple iterations (outer iterations). This can be catastrophically inefficient for learning pairwise relations between nodes, whose complexity is quadratic in the number of nodes.
To address the aforementioned challenges, we propose an efficient condensation method for graphs, where we follow DC (Zhao et al., 2021a) to match the gradients of GNNs between synthetic graphs and real graphs. In order to produce discrete values, we model the graph structure as a probabilistic graph model and optimize the discrete structures in a differentiable manner. Based on this formulation, we further propose a one-step gradient matching strategy that performs gradient matching for only a single step. The advantages of the proposed strategy are twofold. First, it significantly speeds up the condensation process while providing reasonable guidance for synthesizing condensed graphs. Second, it removes the burden of tuning hyperparameters such as the number of outer/inner iterations of the bilevel optimization required by DC. Furthermore, we demonstrate the effectiveness of the proposed one-step gradient matching strategy both theoretically and empirically. Our contributions can be summarized as follows:

We study a novel problem of learning discrete synthetic graphs for condensing graph datasets, where the discrete structure is captured via a probabilistic graph model that can be learned in a differentiable manner.

We propose a one-step gradient matching scheme that significantly accelerates the vanilla gradient matching process.

We provide theoretical analysis to understand the rationale behind the proposed one-step gradient matching, showing that learning with one-step matching produces synthetic graphs that lead to a smaller classification loss on real graphs.

Extensive experiments demonstrate the effectiveness and efficiency of the proposed method. In particular, we are able to drastically reduce the dataset size while approximating most of the original performance, and our method is significantly faster than multi-step gradient matching, e.g., when synthesizing graphs on CIFAR10.
2. The Proposed Framework
Before detailing the framework, we first introduce the main notation used in this paper. We mainly focus on the graph classification task, where the goal is to predict the labels of given graphs. Specifically, we denote a graph dataset as $\mathcal{T}$ with ground-truth label set $\mathcal{Y}$. Each graph in $\mathcal{T}$ is associated with a discrete adjacency matrix and a node feature matrix; let $\mathbf{A}_i$ and $\mathbf{X}_i$ represent the adjacency matrix and the feature matrix of the $i$-th real graph, respectively. Similarly, we use $\mathcal{S}$ and $\mathcal{Y}'$ to denote the synthetic graphs and their labels, respectively. Note that the number of synthetic graphs $N'$ is essentially much smaller than the number of real graphs $N$. We use $d$ and $n'$ to denote the number of feature dimensions and the number of nodes in each synthetic graph, respectively²; $C$ denotes the number of classes and $\ell$ denotes the cross-entropy loss. The goal of our work is to learn a set of synthetic graphs $\mathcal{S}$ such that a GNN trained on $\mathcal{S}$ achieves performance comparable to one trained on the much larger dataset $\mathcal{T}$.

²We set $n'$ to the average number of nodes in the original dataset.
In the following subsections, we first introduce how to apply vanilla gradient matching to condensing graphs for graph classification (Section 2.1). However, it cannot generate discrete graph structures and is highly inefficient. To address these two limitations, we discuss how to handle the discrete nature of graphs (Section 2.2) and propose an efficient solution, one-step gradient matching, which significantly accelerates the condensation process (Section 2.3).
2.1. Gradient Matching as the Condensation Objective
Since we aim to learn synthetic graphs that are highly informative, one solution is to let GNNs trained on the synthetic graphs imitate the training trajectory on the original large dataset. Dataset condensation (Zhao et al., 2021a; Zhao and Bilen, 2021a) introduces a gradient matching scheme to achieve this goal. Concretely, it reduces the difference between the model gradients w.r.t. the large real data and the small synthetic data at every training epoch, so that the model parameters trained on synthetic data stay close to those trained on real data. Let $\theta_t$ denote the network parameters at the $t$-th epoch and $f_{\theta_t}$ denote the neural network parameterized by $\theta_t$. The condensation objective is expressed as:

$$\min_{\mathcal{S}} \sum_{t=0}^{T-1} D\Big(\nabla_\theta \ell\big(f_{\theta_t}(\mathcal{S}), \mathcal{Y}'\big),\; \nabla_\theta \ell\big(f_{\theta_t}(\mathcal{T}), \mathcal{Y}\big)\Big) \quad \text{s.t.} \quad \theta_{t+1} = \operatorname{opt}_\theta(\theta_t, \mathcal{S}) \tag{1}$$
where $D(\cdot,\cdot)$ is a distance function, $T$ is the number of steps of the whole training trajectory, and $\operatorname{opt}_\theta$ is the optimization operator for updating the parameters $\theta$. Note that Eq. (1) is a bilevel problem where we learn the synthetic graphs $\mathcal{S}$ in the outer optimization and update the model parameters $\theta$ in the inner optimization. To learn synthetic graphs that generalize to a distribution of model parameters $P_{\theta_0}$, we sample $\theta_0 \sim P_{\theta_0}$ and rewrite Eq. (1) as:

$$\min_{\mathcal{S}} \operatorname{E}_{\theta_0 \sim P_{\theta_0}}\Big[\sum_{t=0}^{T-1} D\Big(\nabla_\theta \ell\big(f_{\theta_t}(\mathcal{S}), \mathcal{Y}'\big),\; \nabla_\theta \ell\big(f_{\theta_t}(\mathcal{T}), \mathcal{Y}\big)\Big)\Big] \quad \text{s.t.} \quad \theta_{t+1} = \operatorname{opt}_\theta(\theta_t, \mathcal{S}) \tag{2}$$
Discussion. The aforementioned strategy has demonstrated promising performance on condensing image datasets (Zhao et al., 2021a; Zhao and Bilen, 2021a). However, it is not clear how to model the discrete graph structure. Moreover, the inherent bilevel optimization inevitably hinders its scalability. To tackle these shortcomings, we propose DosCond, which models the structure as a probabilistic graph model and is optimized through one-step gradient matching. In the following subsections, we introduce the details of DosCond.
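As a toy illustration of the matching objective above, the sketch below computes a distance between model gradients on real and synthetic data. A plain linear model stands in for the GNN, and the distance choice (1 − cosine similarity) and all names are our assumptions, not the paper's implementation.

```python
import numpy as np

def model_grad(W, X, y):
    """Gradient of the squared loss 0.5 * ||X @ W - y||^2 w.r.t. W
    (a linear-model stand-in for the GNN gradient)."""
    return X.T @ (X @ W - y)

def grad_distance(g_real, g_syn):
    """One common choice for the matching distance D: 1 - cosine similarity."""
    num = float(np.sum(g_real * g_syn))
    den = float(np.linalg.norm(g_real) * np.linalg.norm(g_syn)) + 1e-12
    return 1.0 - num / den

rng = np.random.default_rng(0)
X_real, y_real = rng.normal(size=(100, 8)), rng.normal(size=(100, 1))
X_syn, y_syn = rng.normal(size=(5, 8)), rng.normal(size=(5, 1))
W = rng.normal(size=(8, 1))
d = grad_distance(model_grad(W, X_real, y_real), model_grad(W, X_syn, y_syn))
```

Condensation then amounts to minimizing such a distance with respect to the synthetic data.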
2.2. Learning Discrete Graph Structure
For graph classification, each graph in the dataset is composed of an adjacency matrix and a feature matrix. For simplicity, we use $\mathbf{X}'$ to denote the node features of all synthetic graphs and $\mathbf{A}'$ to denote the graph structure information in $\mathcal{S}$. Note that $f_\theta$ can be instantiated as any graph neural network, and it takes both the graph structure and the node features as input. Then we rewrite the objective in Eq. (2) as follows:

$$\min_{\mathbf{A}',\mathbf{X}'} \operatorname{E}_{\theta_0 \sim P_{\theta_0}}\Big[\sum_{t=0}^{T-1} D\Big(\nabla_\theta \ell\big(f_{\theta_t}(\mathbf{A}', \mathbf{X}'), \mathcal{Y}'\big),\; \nabla_\theta \ell\big(f_{\theta_t}(\mathcal{T}), \mathcal{Y}\big)\Big)\Big] \tag{3}$$
where we aim to learn both the graph structure $\mathbf{A}'$ and the node features $\mathbf{X}'$. However, Eq. (3) is challenging to optimize, as it requires a function that outputs binary values. To address this issue, we propose to model the graph structure as a probabilistic graph model with Bernoulli distributions. Note that, in the following, $\mathbf{A}'$ is reshaped for the purpose of demonstration only. Specifically, each entry $\mathbf{A}'_{ij}$ of the adjacency matrix follows a Bernoulli distribution:

$$\mathbf{A}'_{ij} \sim \operatorname{Bern}\big(\sigma(\mathbf{\Omega}_{ij})\big) \tag{4}$$
where $\sigma(\cdot)$ is the sigmoid function and $\sigma(\mathbf{\Omega}_{ij})$ is the success probability of the Bernoulli distribution, with $\mathbf{\Omega}_{ij}$ the parameter to be learned. Since each entry $\mathbf{A}'_{ij}$ is independent of all other entries, the distribution of $\mathbf{A}'$ can be modeled as:

$$P(\mathbf{A}') = \prod_{i,j} \sigma(\mathbf{\Omega}_{ij})^{\mathbf{A}'_{ij}} \big(1-\sigma(\mathbf{\Omega}_{ij})\big)^{1-\mathbf{A}'_{ij}} \tag{5}$$
Then, the objective in Eq. (2) needs to be modified to

$$\min_{\mathbf{\Omega},\mathbf{X}'} \operatorname{E}_{\theta_0 \sim P_{\theta_0}}\Big[\operatorname{E}_{\mathbf{A}' \sim P(\mathbf{A}')}\Big[\sum_{t=0}^{T-1} D\Big(\nabla_\theta \ell\big(f_{\theta_t}(\mathbf{A}', \mathbf{X}'), \mathcal{Y}'\big),\; \nabla_\theta \ell\big(f_{\theta_t}(\mathcal{T}), \mathcal{Y}\big)\Big)\Big]\Big] \tag{6}$$
With this new parameterization, we obtain a function that outputs discrete values, but it is not differentiable due to the sampling process involved. Thus, we employ the reparameterization method of the binary concrete distribution (Maddison et al., 2016) to refactor the discrete random variable into a differentiable function of its parameters and a random variable with a fixed distribution. Specifically, we first sample $\epsilon_{ij} \sim \operatorname{Uniform}(0,1)$, and the edge weight is calculated by:

$$\mathbf{A}'_{ij} = \sigma\big((\log \epsilon_{ij} - \log(1-\epsilon_{ij}) + \mathbf{\Omega}_{ij})/\tau\big) \tag{7}$$

where $\tau \in (0, \infty)$ is the temperature parameter that controls the continuous relaxation. As $\tau \to 0$, the random variable $\mathbf{A}'_{ij}$ smoothly approaches the Bernoulli distribution. While a small $\tau$ is necessary for obtaining discrete samples, a large $\tau$ is useful for getting large gradients, as suggested by (Maddison et al., 2016). In practice, we employ an annealing schedule (Abid et al., 2019) to gradually decrease the value of $\tau$ during training. With this reparameterization trick, the objective function becomes differentiable w.r.t. $\mathbf{\Omega}$ with well-defined gradients. We then rewrite our objective as:

$$\min_{\mathbf{\Omega},\mathbf{X}'} \operatorname{E}_{\theta_0 \sim P_{\theta_0}}\Big[\operatorname{E}_{\epsilon \sim \operatorname{Uniform}(0,1)}\Big[\sum_{t=0}^{T-1} D\Big(\nabla_\theta \ell\big(f_{\theta_t}(\mathbf{A}'(\mathbf{\Omega},\epsilon), \mathbf{X}'), \mathcal{Y}'\big),\; \nabla_\theta \ell\big(f_{\theta_t}(\mathcal{T}), \mathcal{Y}\big)\Big)\Big]\Big] \tag{8}$$
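The reparameterized sampling in Eq. (7) can be sketched in NumPy as follows. This is a minimal illustration under our own naming (`omega` plays the role of the learnable logits Ω), not the authors' code.

```python
import numpy as np

def sigmoid(x):
    """Numerically stable sigmoid (avoids overflow for large |x|)."""
    z = np.exp(-np.abs(x))
    return np.where(x >= 0, 1.0 / (1.0 + z), z / (1.0 + z))

def binary_concrete_sample(omega, tau, rng):
    """Relaxed, differentiable sample in [0, 1] of Bernoulli(sigmoid(omega))
    via the binary concrete distribution (Maddison et al., 2016)."""
    eps = rng.uniform(1e-6, 1.0 - 1e-6, size=omega.shape)  # fixed-distribution noise
    logistic = np.log(eps) - np.log(1.0 - eps)
    return sigmoid((logistic + omega) / tau)

rng = np.random.default_rng(0)
omega = rng.normal(size=(6, 6))
A_soft = binary_concrete_sample(omega, tau=0.5, rng=rng)   # smooth relaxation
A_hard = binary_concrete_sample(omega, tau=1e-3, rng=rng)  # near-binary as tau -> 0
```

Lowering the temperature pushes the sampled entries toward {0, 1}, matching the annealing schedule described above.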
2.3. One-Step Gradient Matching
The vanilla gradient matching scheme in Eq. (2) presents a bilevel optimization problem: we need to update the synthetic graphs in the outer loop and then optimize the network parameters in the inner loop. The nested loops heavily impede the scalability of the condensation method, which motivates us to design a new strategy for efficient condensation. In this work, we propose a one-step gradient matching scheme where we only match the network gradients at the model initializations $\theta_0$ while discarding the training trajectory of $\theta_t$. Essentially, this strategy approximates the overall gradient matching loss with the initial matching loss at the first epoch, which we term the one-step matching loss. The intuition is that the one-step matching loss informs us of the direction in which to update the synthetic data; empirically, moving in this direction strongly decreases the cross-entropy loss (on real samples) of the model trained on the synthetic data. Hence, we can drop the summation over training steps in Eq. (8) and simplify it as follows:
$$\min_{\mathbf{\Omega},\mathbf{X}'} \operatorname{E}_{\theta_0 \sim P_{\theta_0}}\Big[\operatorname{E}_{\epsilon}\Big[ D\Big(\nabla_\theta \ell\big(f_{\theta_0}(\mathbf{A}', \mathbf{X}'), \mathcal{Y}'\big),\; \nabla_\theta \ell\big(f_{\theta_0}(\mathcal{T}), \mathcal{Y}\big)\Big)\Big]\Big] \tag{9}$$

where we sample $\theta_0 \sim P_{\theta_0}$ and $\epsilon \sim \operatorname{Uniform}(0,1)$. Compared with Eq. (8), one-step gradient matching avoids the expensive nested-loop optimization and directly updates the synthetic graphs, which greatly simplifies the condensation process. In practice, as shown in Section 3.3, this strategy yields performance comparable to its bilevel counterpart while enabling much more efficient condensation. Next, we provide a theoretical analysis to understand the rationale behind the proposed one-step gradient matching scheme.
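To make the one-step scheme concrete, here is a toy NumPy sketch in which synthetic data are updated by descending the gradient-matching loss at freshly sampled initializations. A linear model stands in for the GNN, the closed-form derivative is specific to this toy setup, and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
X_real, y_real = rng.normal(size=(20, 4)), rng.normal(size=(20, 1))
X_syn, y_syn = rng.normal(size=(10, 4)), rng.normal(size=(10, 1))  # learnable

def grad_W(X, y, W):
    """Model gradient of 0.5 * ||X @ W - y||^2 w.r.t. W."""
    return X.T @ (X @ W - y)

def mismatch():
    """Held-out measure of gradient mismatch: since g_syn - g_real =
    (X_s'X_s - X_r'X_r) W - (X_s'y_s - X_r'y_r), track both factors."""
    dM = X_syn.T @ X_syn - X_real.T @ X_real
    db = X_syn.T @ y_syn - X_real.T @ y_real
    return float(np.sum(dM ** 2) + np.sum(db ** 2))

before = mismatch()
lr = 5e-4
for _ in range(500):
    W0 = rng.normal(size=(4, 1))           # fresh initialization, theta_0 ~ P
    g_diff = grad_W(X_syn, y_syn, W0) - grad_W(X_real, y_real, W0)
    r = X_syn @ W0 - y_syn
    # closed-form d||g_syn - g_real||^2 / dX_syn for this linear model
    dX = 2.0 * (r @ g_diff.T + X_syn @ g_diff @ W0.T)
    X_syn -= lr * dX
after = mismatch()
```

Each iteration matches gradients at a single, freshly sampled initialization, so no inner loop over model training steps is needed.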
Theoretical Understanding. We denote the cross-entropy loss on the real graphs as $\mathcal{L}_{\mathcal{T}}(\theta)$ and that on the synthetic graphs as $\mathcal{L}_{\mathcal{S}}(\theta)$. Let $\theta^{*}$ denote the optimal parameters and $\theta_t$ be the parameters trained on $\mathcal{S}$ at the $t$-th epoch by optimizing $\mathcal{L}_{\mathcal{S}}$. For notational simplicity, we assume that the adjacency and feature matrices are already normalized. When not specified, the matrix norm is the Frobenius norm. We focus on the GNN of Simple Graph Convolutions (SGC) (Wu et al., 2019a) to study our problem, since SGC has a simpler architecture but shares a similar filtering pattern with GCN.
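As a reference point for this analysis, the following is a minimal sketch of an SGC-style graph-level model with mean or sum pooling. It is our own simplified form, assuming a single propagation step and a pre-normalized adjacency; it is not the authors' exact architecture.

```python
import numpy as np

def sgc_forward(A, X, W, pooling="mean"):
    """Minimal SGC-style graph-level readout: one propagation step A @ X,
    a linear map W (no nonlinearity), then a graph pooling."""
    H = A @ X @ W
    if pooling == "sum":
        return H.sum(axis=0)
    return H.mean(axis=0)

rng = np.random.default_rng(0)
A = np.eye(4) + np.ones((4, 4)) / 4.0   # toy, assumed already normalized
X = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))
z_mean = sgc_forward(A, X, W, "mean")
z_sum = sgc_forward(A, X, W, "sum")     # sum pooling scales with graph size
```

Note that the sum-pooled readout grows with the number of nodes, which is why the pooling choice changes the constants in the bound below.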
Theorem 1.
When we use an SGC as the GNN in condensation, i.e., a simple graph convolution followed by a pooling layer, and assume that all network parameters are bounded in norm, we have
(10) 
where the constant factor depends on the pooling function: it takes one value if we use sum pooling in $f$ and another if we use mean pooling, with $n'_i$ denoting the number of nodes in the $i$-th synthetic graph.
We provide the proof of Theorem 1 in Appendix B.1. The theorem suggests that the smallest gap between the resulting loss (obtained by training on synthetic graphs) and the optimal loss has an upper bound that depends on two terms: (1) the difference between the gradients w.r.t. real data and synthetic data, and (2) the norm of the input matrices. Thus, the theorem justifies that reducing the gradient difference between real and synthetic graphs helps learn synthetic data that preserves sufficient information to train GNNs well. Based on Theorem 1, we have the following proposition.
Proposition 1.
Assume that the largest gradient gap occurs at the $t$-th epoch; then we have
(11) 
We omit the proof of the proposition since it is straightforward. The proposition suggests that the smallest gap between the loss on real graphs and the optimal loss is bounded by the one-step matching loss and the norm of the input matrices. As we will show in Section 3.3.4, when using mean pooling, the second term tends to have a smaller scale than the first and can be neglected; the second term matters more when we use sum pooling. Hence, we solely optimize the one-step gradient matching loss for GNNs with mean pooling and additionally include the second term (the norm of the input matrices) as a regularization for GNNs with sum pooling. As such, when we treat the optimal loss as a constant, reducing the one-step matching loss indeed learns synthetic graphs that lead to a smaller loss on real graphs. This demonstrates the rationale of one-step gradient matching from a theoretical perspective.
Remark 1.
Note that the spectral analysis from (Wu et al., 2019a) demonstrates that GCN and SGC share similar graph filtering behaviors. Thus, in practice, we extend the one-step gradient matching loss from SGC to GCN and observe that the proposed framework works well in this nonlinear scenario.
Remark 2.
While we focus on the graph classification task, it is straightforward to extend our framework to node classification, and we obtain similar conclusions for node classification, as shown in Theorem 2 in Appendix B.2.
2.4. Final Objective and Training Algorithm
In this subsection, we describe the final objective function and the detailed training algorithm. Since the objective in Eq. (8) involves two nested expectations, we adopt Monte Carlo sampling to approximately optimize it. Together with one-step gradient matching, we have

$$\min_{\mathbf{\Omega},\mathbf{X}'} \frac{1}{K}\sum_{k=1}^{K} \frac{1}{B}\sum_{b=1}^{B} D\Big(\nabla_\theta \ell\big(f_{\theta_0^{(k)}}(\mathbf{A}'^{(b)}, \mathbf{X}'), \mathcal{Y}'\big),\; \nabla_\theta \ell\big(f_{\theta_0^{(k)}}(\mathcal{T}), \mathcal{Y}\big)\Big) \tag{12}$$
where $K$ is the number of sampled model initializations and $B$ is the number of sampled graphs $\mathbf{A}'^{(b)}$. We find that a small number of samples already yields good performance in our experiments.
Regularization. In addition to the one-step gradient matching loss, we note that the proposed DosCond can easily be integrated with various priors as regularization terms. In this work, we focus on imposing a sparsity regularization on the adjacency matrix, since a denser adjacency matrix leads to a higher cost for training graph neural networks. Specifically, we penalize the difference between the sparsity of the learned structure $\sigma(\mathbf{\Omega})$ and a given target sparsity $\delta$:

$$\mathcal{L}_{\text{reg}} = \big|\, \overline{\sigma(\mathbf{\Omega})} - \delta \,\big| \tag{13}$$
We initialize $\mathbf{X}'$ and $\mathbf{\Omega}$ from randomly sampled training graphs (if an entry in the real adjacency matrix is 1, the corresponding value in $\mathbf{\Omega}$ is initialized to a large value, e.g., 5) and set $\delta$ to the average sparsity of the initialized structure so as to maintain low sparsity. On top of that, as discussed in Section 2.3, we include the following regularization for GNNs with sum pooling:
(14) 
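A sketch of how the sparsity penalty of Eq. (13) can be computed; the exact functional form here (the absolute gap between the expected edge density of σ(Ω) and the target δ) is our reading of the description above, and all names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sparsity_reg(omega, delta):
    """Penalize the gap between the expected edge density of the
    probabilistic adjacency sigmoid(omega) and a target density delta."""
    density = float(sigmoid(omega).mean())
    return abs(density - delta)

omega = np.full((8, 8), -4.0)           # mostly-absent edges: density ~ 0.018
reg_tight = sparsity_reg(omega, delta=0.02)   # target close to current density
reg_loose = sparsity_reg(omega, delta=0.50)   # distant target -> larger penalty
```

In training, this term would simply be added (with a coefficient) to the one-step matching loss.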
Table 1. Classification performance (ROC-AUC or accuracy %) of GCNs trained on condensed/selected graphs.

| Dataset (metric) | Graphs/Cls. | Ratio | Random | Herding | K-Center | DCG | DosCond | Whole Dataset |
|---|---|---|---|---|---|---|---|---|
| ogbg-molbace (ROC-AUC) | 1 | 0.2% | 0.580±0.067 | 0.548±0.034 | 0.548±0.034 | 0.623±0.046 | 0.657±0.034 | 0.714±0.005 |
| | 10 | 1.7% | 0.598±0.073 | 0.639±0.039 | 0.591±0.056 | 0.655±0.033 | 0.674±0.035 | |
| | 50 | 8.3% | 0.632±0.047 | 0.683±0.022 | 0.589±0.025 | 0.652±0.013 | 0.688±0.012 | |
| ogbg-molbbbp (ROC-AUC) | 1 | 0.1% | 0.519±0.016 | 0.546±0.019 | 0.546±0.019 | 0.559±0.044 | 0.581±0.005 | 0.646±0.004 |
| | 10 | 1.2% | 0.586±0.040 | 0.605±0.019 | 0.530±0.039 | 0.568±0.032 | 0.605±0.008 | |
| | 50 | 6.1% | 0.606±0.020 | 0.617±0.003 | 0.576±0.019 | 0.579±0.032 | 0.620±0.007 | |
| ogbg-molhiv (ROC-AUC) | 1 | 0.01% | 0.719±0.009 | 0.721±0.002 | 0.721±0.002 | 0.718±0.013 | 0.726±0.003 | 0.757±0.007 |
| | 10 | 0.06% | 0.720±0.011 | 0.725±0.006 | 0.713±0.009 | 0.728±0.002 | 0.728±0.005 | |
| | 50 | 0.3% | 0.721±0.014 | 0.725±0.003 | 0.725±0.006 | 0.726±0.010 | 0.731±0.004 | |
| DD (Accuracy) | 1 | 0.2% | 57.69±4.92 | 61.97±1.32 | 61.97±1.32 | 58.81±2.90 | 70.42±2.21 | 78.92±0.64 |
| | 10 | 2.1% | 64.69±2.55 | 69.79±2.30 | 63.46±2.38 | 61.84±1.44 | 73.53±1.13 | |
| | 50 | 10.6% | 67.29±1.53 | 73.95±1.70 | 67.41±0.92 | 61.27±1.01 | 77.04±1.86 | |
| MUTAG (Accuracy) | 1 | 1.3% | 67.47±9.74 | 70.84±7.71 | 70.84±7.71 | 75.00±8.16 | 82.21±1.61 | 88.63±1.44 |
| | 10 | 13.3% | 77.89±7.55 | 80.42±1.89 | 81.00±2.51 | 82.66±0.68 | 82.76±2.31 | |
| | 20 | 26.7% | 78.21±5.13 | 80.00±1.10 | 82.97±4.91 | 82.89±1.03 | 83.26±2.34 | |
| NCI1 (Accuracy) | 1 | 0.1% | 51.27±1.22 | 53.98±0.67 | 53.98±0.67 | 51.14±1.08 | 56.58±0.48 | 71.70±0.20 |
| | 10 | 0.6% | 54.33±3.14 | 57.11±0.56 | 53.21±1.44 | 51.86±0.81 | 58.02±1.05 | |
| | 50 | 3.0% | 58.51±1.73 | 58.94±0.83 | 56.58±3.08 | 52.17±1.90 | 60.07±1.58 | |
| CIFAR10 (Accuracy) | 1 | 0.06% | 15.61±0.52 | 22.38±0.49 | 22.37±0.50 | 21.60±0.42 | 24.70±0.70 | 50.75±0.14 |
| | 10 | 0.2% | 23.07±0.76 | 28.81±0.35 | 20.93±0.62 | 29.27±0.77 | 30.70±0.23 | |
| | 50 | 1.1% | 30.56±0.81 | 33.94±0.37 | 24.17±0.51 | 34.47±0.52 | 35.34±0.14 | |
| Ecommerce (Accuracy) | 1 | 0.2% | 51.31±2.89 | 52.18±0.25 | 52.36±0.38 | 57.14±1.72 | 60.82±1.23 | 69.25±0.50 |
| | 10 | 0.9% | 54.99±2.74 | 56.83±0.87 | 56.49±0.36 | 61.03±1.32 | 64.73±1.34 | |
| | 20 | 3.6% | 57.80±3.58 | 62.56±0.71 | 62.76±0.45 | 64.92±1.35 | 67.71±1.22 | |
Training Algorithm. We provide the details of our proposed framework in Algorithm 1. Specifically, we sample model initializations to perform one-step gradient matching. Following the convention in DC (Zhao et al., 2021a), we match gradients and update the synthetic graphs for each class separately in order to make matching easier. For class $c$, we first retrieve the synthetic graphs of that class, then sample a batch of real graphs of the same class. We forward them through the graph neural network and calculate the one-step gradient matching loss together with the regularization terms. Afterwards, $\mathbf{\Omega}$ and $\mathbf{X}'$ are updated via gradient descent. It is worth noting that the training process for each class can be run in parallel, since the graph updates for one class are independent of those for another class.
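The structure of this loop can be sketched as follows. This is a structural illustration only: a toy feature-mean matching loss stands in for the one-step gradient matching loss so the sketch stays self-contained, and all names are ours, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, iters, lr = 2, 50, 0.5
syn = {c: rng.normal(size=(3, 4)) for c in range(num_classes)}       # synthetic set
real = {c: rng.normal(size=(64, 4)) + c for c in range(num_classes)} # real pool

def match_loss(S, batch):
    """Toy stand-in for the matching loss: distance between feature means."""
    return float(np.sum((S.mean(axis=0) - batch.mean(axis=0)) ** 2))

for _ in range(iters):                     # sampled initializations (outer loop)
    for c in range(num_classes):           # per-class matching, parallelizable
        batch = real[c][rng.choice(64, size=16, replace=False)]
        # analytic gradient of the toy loss w.r.t. each synthetic row
        g = 2.0 * (syn[c].mean(axis=0) - batch.mean(axis=0)) / syn[c].shape[0]
        syn[c] -= lr * g                   # broadcast update over rows
```

Because each class only reads its own synthetic graphs and its own real batches, the inner loop over classes has no cross-class dependencies, which is what makes the per-class parallelism possible.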
Comparison with DC. Recall that the gradient matching scheme in DC involves a complex bilevel optimization. If we denote the number of inner iterations as $T_i$ and the number of outer iterations as $T_o$, its computational complexity can be roughly $T_o \times T_i$ times that of our method; thus, DC is significantly slower than DosCond. In addition to speeding up condensation, DosCond removes the burden of tuning several hyperparameters, i.e., the number of iterations for the outer/inner optimization and the learning rate for updating the model parameters, which can potentially save enormous training time when learning larger synthetic sets.
Comparison with Coreset Methods. Coreset methods (Welling, 2009; Sener and Savarese, 2018) select representative data samples based on heuristics calculated on pretrained embeddings; thus, they require training the model first. Given the cheap cost of calculating and ranking heuristics, the major computational bottleneck for coreset methods lies in pretraining the neural network for a certain number of iterations. Likewise, our proposed DosCond has comparable complexity because it also needs to run forward and backward passes of the neural network for multiple iterations. Thus, their efficiency difference mainly depends on how many epochs we run for learning synthetic graphs in DosCond and for pretraining the model embedding in coreset methods. In practice, we find that DosCond requires even less training cost than the coreset methods, as shown in Section 3.2.2.
3. Experiment
In this section, we conduct experiments to evaluate DosCond. Particularly, we aim to answer the following questions: (a) how well can we condense a graph dataset and (b) how efficient is DosCond. Our code can be found in the supplementary files.
3.1. Experimental Settings
Datasets. To evaluate the performance of our method, we use multiple molecular datasets from Open Graph Benchmark (OGB) (Hu et al., 2020) and TU Datasets (DD, MUTAG and NCI1) (Morris et al., 2020) for graph-level property classification, and one superpixel dataset, CIFAR10 (Dwivedi et al., 2020). We also introduce a real-world e-commerce dataset. In particular, we randomly sample 1,109 subgraphs from a large, anonymized internal knowledge graph. Each subgraph is created from the ego network of a randomly selected product on the e-commerce website. We form a binary classification problem aiming to predict the product category of the central product node in each subgraph. We use the public splits for the OGB datasets and CIFAR10. For TU Datasets and the e-commerce dataset, we randomly split the graphs into 80%/10%/10% for training/validation/test. Detailed dataset statistics are shown in Appendix A.1.
Baselines. We compare our proposed method with four baselines that produce discrete structures: three coreset methods (Random, Herding (Welling, 2009) and K-Center (Farahani and Hekmatfar, 2009; Sener and Savarese, 2018)), and a dataset condensation method, DCG (Zhao et al., 2021a): (a) Random: it randomly picks graphs from the training dataset. (b) Herding: it selects samples that are closest to the cluster center; Herding is often used in replay-based methods for continual learning (Rebuffi et al., 2017; Castro et al., 2018). (c) K-Center: it selects center samples to minimize the largest distance between a sample and its nearest center. (d) DCG: since vanilla DC (Zhao et al., 2021a) cannot generate discrete structures, we randomly select graphs from the training set and apply DC to learn features for them, which we term DCG. We use the implementations provided by Zhao et al. (2021a) for Herding, K-Center and DCG. Note that the coreset methods only select existing samples from the training set, while DCG learns the node features.
Evaluation Protocol.
To evaluate the effectiveness of the proposed method, we test the classification performance of GNNs trained with condensed graphs on the aforementioned datasets. Concretely, the protocol involves three stages: (1) learning the synthetic graphs, (2) training a GCN on the synthetic graphs, and (3) testing the performance of the GCN. We first generate the condensed graphs following the procedure in Algorithm 1. Then we train a GCN classifier on the condensed graphs. Finally, we evaluate its classification performance on the real graphs from the test set. For the baseline methods, we first obtain the selected/condensed graphs and then follow the same procedure. We repeat the generation process of condensed graphs 5 times with different random seeds and train GCNs on these graphs with 10 different random seeds. In all experiments, we report the mean and standard deviation of these results.
Parameter Settings. When learning the synthetic graphs, we adopt a 3-layer GCN with 128 hidden units as the model for gradient matching. The learning rates for the structure and feature parameters are set to 1.0 (0.01 for ogbg-molbace and CIFAR10) and 0.01, respectively. We set the number of sampled initializations to 1000 and the regularization coefficient to 0.1. Additionally, we use mean pooling to obtain the graph representation for all datasets except ogbg-molhiv; we use sum pooling for ogbg-molhiv as it achieves better classification performance on the real dataset. During the test stage, we use a GCN with the same architecture and train the model for 500 epochs (100 epochs for ogbg-molhiv) with an initial learning rate of 0.001.
3.2. Performance with Condensed Graphs
3.2.1. Classification Performance Comparison.
To validate the effectiveness of the proposed framework, we measure the classification performance of GCNs trained on condensed graphs. Specifically, we vary the number of learned synthetic graphs per class in the range of {1, 10, 50} ({1, 10, 20} for MUTAG and Ecommerce) and train a GCN on these graphs. Then we evaluate the classification performance of the trained GCN on the original test graphs. Following the convention in OGB (Hu et al., 2020), we report the ROC-AUC metric for ogbg-molbace, ogbg-molbbbp and ogbg-molhiv; for the other datasets we report classification accuracy (%). The results are summarized in Table 1. Note that the Ratio column presents the ratio of synthetic graphs to original graphs, which we name the condensation ratio, and the Whole Dataset column shows the GCN performance achieved by training on the original dataset. From the table, we make the following observations:

The proposed DosCond consistently achieves better performance than the baseline methods across different condensation ratios and datasets. Notably, when generating only 2 graphs on the ogbg-molbace dataset (0.2% of the training set), we achieve an ROC-AUC of 0.657 while the performance on the full training set is 0.714; in other words, we approximate 92% of the original performance with only 0.2% of the data. Likewise, we approximate 96.5% of the original performance on ogbg-molhiv with 0.3% of the data. By contrast, the baselines underperform our method by a large margin. Similar observations can be made on the other datasets, which demonstrates the effectiveness of the learned synthetic graphs in preserving the information of the original dataset.

Increasing the number of synthetic graphs can improve the classification performance. For example, on DD we approximate 89%/93%/98% of the original performance with 0.2%/2.1%/10.6% of the data. More synthetic samples mean more learnable parameters that can preserve the information residing in the original dataset, and they present more diverse patterns that help train GNNs better. This observation is in line with our experimental results in Section 3.3.1.

The performance on CIFAR10 is less promising due to the limited number of synthetic graphs. We posit that this dataset has more complex topology and feature information and thus requires more parameters to preserve sufficient information. However, our method still outperforms the baselines, especially when producing only 1 sample per class, which suggests that it is much more data-efficient. Moreover, we are able to improve the performance on CIFAR10 by learning a larger synthetic set, as shown in Section 3.3.1.

Learning both the synthetic graph structure and the node features is necessary for preserving the information in the original graph datasets. Checking the performance of DCG, which only learns node features on top of randomly selected graph structures, we see that DCG underperforms DosCond by a large margin in most cases. This indicates that learning node features alone is suboptimal for condensing graphs.
3.2.2. Efficiency Comparison
Since one of our goals is to enable scalable dataset condensation, we now evaluate the efficiency of DosCond. We compare DosCond with the coreset method Herding, as it is less time-consuming than DCG and generally achieves better performance than the other baselines. We adopt the same setting as in Table 1: 1000 iterations for DosCond, and 500 epochs (100 epochs for ogbg-molhiv) for pretraining the graph convolutional network as required by Herding. We also note that pretraining the neural network needs to go over the whole dataset at every epoch, while DosCond only processes a batch of graphs. In Table 2, we report the running time on an NVIDIA V100 GPU for CIFAR10, ogbg-molhiv and DD. From the table, we make the following observations:

DosCond can be faster than Herding. In fact, DosCond requires less training time in all cases except on DD with 50 graphs per class. Herding needs to fully train the model on the whole dataset to obtain good-quality embeddings, which can be quite time-consuming. By contrast, DosCond only matches gradients at sampled initializations and does not need to fully train the model on the large real dataset.

The running time of DosCond increases with the number of synthetic graphs. This is because DosCond processes the condensed graphs at each iteration, and the forward cost of the GCN grows with the number and size of the condensed graphs. By contrast, increasing the number of selected samples has little impact on Herding, since selecting samples based on a predefined heuristic is very fast.

The average number of nodes in the synthetic graphs also impacts the training cost of DosCond. For instance, the training cost on ogbg-molhiv ($n'=26$) is much lower than that on DD ($n'=285$), and the gap in cost between the two methods differs considerably between ogbg-molhiv and DD. As mentioned earlier, this is because the complexity of the forward process in GCN grows quadratically with the number of nodes in the condensed graphs.
To summarize, the efficiency difference between Herding and DosCond depends on the number of condensed/selected samples and the number of training iterations adopted in practice; empirically, we find that DosCond incurs less training cost.
Table 2. Condensation/selection time on an NVIDIA V100 GPU (m = minutes).

| Dataset | Method | 1 G./Cls. | 10 G./Cls. | 50 G./Cls. |
|---|---|---|---|---|
| CIFAR10 | Herding | 44.5m | 44.5m | 44.5m |
| CIFAR10 | DosCond | 4.7m | 4.9m | 5.7m |
| ogbg-molhiv | Herding | 4.3m | 4.3m | 4.3m |
| ogbg-molhiv | DosCond | 0.66m | 0.67m | 0.68m |
| DD | Herding | 1.6m | 1.6m | 1.6m |
| DD | DosCond | 1.5m | 1.5m | 2.0m |
3.3. Further Investigation
In this subsection, we perform further investigations to provide a better understanding of our proposed method.
3.3.1. Increasing the Number of Synthetic Graphs.
We study whether the classification performance can be further boosted with a larger synthetic set. Concretely, we vary the size of the learned set from 1 to 300 graphs and report the absolute and relative accuracy w.r.t. whole-dataset training accuracy on CIFAR10 in Figure (a). Both Random and DosCond achieve better performance as the number of training samples increases. Moreover, our method outperforms the random baseline under all condensed dataset sizes. It is worth noting that the performance gap between the two methods diminishes as the number of samples grows. This is because the random baseline would eventually approach whole-dataset training if we continued to enlarge the condensed set, whose performance can be regarded as an upper bound for DosCond.
3.3.2. Ablation Study.
To examine how different components affect model performance, we perform an ablation study on the proposed one-step gradient matching and the regularization terms. We create an ablation of our method, namely DosCond-Bi, which adopts the vanilla gradient matching scheme involving bilevel optimization. Without loss of generality, we compare the training time and classification accuracy of DosCond and DosCond-Bi when learning 50 synthetic graphs per class on the CIFAR10 dataset. The results are summarized in Figure (c): DosCond needs approximately 5 minutes to reach the performance that DosCond-Bi attains after 75 minutes of training, i.e., DosCond requires only 6.7% of the training cost. This further demonstrates the efficiency of the proposed one-step gradient matching strategy.
Next, we study the effect of the sparsity regularization on DosCond. Specifically, we vary the sparsity coefficient and report the classification accuracy and graph sparsity on the DD and NCI1 datasets in Figure (d). Note that graph sparsity is defined as the ratio of the number of edges to the square of the number of nodes. As shown in the figure, as the coefficient gets larger, we exert a stronger regularization on the learned graphs and the graphs become sparser. Furthermore, the increased sparsity does not hurt the classification performance. This is a desirable property, since sparse graphs save storage space and reduce the training cost of GNNs. When we remove the regularization of Eq. (14) for ogbg-molhiv, we obtain performance of 0.724/0.727/0.731 for 1/10/50 graphs per class, which is slightly worse than with this regularization.
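For concreteness, the sparsity measure as defined here (number of edges over the square of the number of nodes, which we read as the fraction of nonzero adjacency entries) can be computed as:

```python
import numpy as np

def graph_sparsity(A):
    """Sparsity as the ratio of nonzero adjacency entries to n^2."""
    n = A.shape[0]
    return float(np.count_nonzero(A)) / (n * n)

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
s = graph_sparsity(A)   # 4 nonzero entries over 9 possible entries
```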
Table 3. Node classification accuracy (%) and running time per epoch (in parentheses).
Method  Cora  Citeseer  Pubmed  ogbn-arxiv  Flickr

GCond  80.1 (75.9s)  70.6 (71.8s)  77.9 (51.7s)  59.2 (494.3s)  46.5 (51.9s)
DosCond  80.0 (3.5s)  71.0 (2.8s)  76.0 (1.3s)  59.0 (32.9s)  46.1 (14.3s)
Whole Dataset  81.5  71.7  79.3  71.4  47.2
3.3.3. Visualization.
We further investigate whether GCN can learn discriminative representations from the synthetic graphs learned by DosCond. Specifically, we use t-SNE (Van der Maaten and Hinton, 2008) to visualize the graph representations produced by GCNs trained on different condensed graphs: we train a GCN on the graphs produced by each method and use it to extract latent representations of real graphs from the test set. Without loss of generality, we provide the t-SNE plots on the DD dataset with 50 graphs per class in Figure 2. The graph representations learned with randomly selected graphs are mixed across classes, which suggests that randomly selected graphs cannot help GCN learn discriminative features. Similarly, DCG graphs also result in a poorly trained GCN that outputs indistinguishable graph representations. By contrast, the representations are well separated for different classes when learned with DosCond graphs (Figure 2c), and they are as discriminative as those learned on the whole training dataset (Figure 2d). This demonstrates that the graphs learned by DosCond preserve sufficient information of the original dataset to recover the original performance.
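The visualization step can be sketched as follows (scikit-learn t-SNE on random stand-ins for GCN graph embeddings; the shapes and class offsets are illustrative assumptions):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-ins for graph-level representations extracted by a trained GCN:
# two classes of 16-dimensional embeddings, shifted apart.
emb = np.vstack([rng.normal(0.0, 1.0, size=(50, 16)),
                 rng.normal(4.0, 1.0, size=(50, 16))])
labels = np.array([0] * 50 + [1] * 50)

# Project to 2D; the result can be scattered with matplotlib, colored
# by `labels`, to inspect how well the two classes separate.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(emb)
print(coords.shape)  # (100, 2)
```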
3.3.4. Scale of the two terms in Eq. (11).
As mentioned in Section 2.3, the scale of the first term in Eq. (11) is substantially larger than that of the second term. We now perform an empirical study to verify this statement. Since both terms contain a common factor, we simply drop it and study the two remaining quantities. Specifically, we set the two associated hyperparameters to 500 and 50, respectively, and plot how the two terms change during the training of DosCond. The results on DD (with mean pooling) and ogbg-molhiv (with sum pooling) are shown in Figure 3. We observe that the scale of the first term is much larger than that of the second during the first few epochs when using mean pooling, as shown in Figure (a). By contrast, the second term is not negligible when using sum pooling, as shown in Figure (b), and it is desirable to include it as a regularization term in this case. These observations support our theoretical analysis in Section 2.3.
3.4. Node Classification
Next, we investigate whether the proposed method works well for node classification, so as to support our analysis in Theorem 2 in Appendix B.2. Specifically, following GCond (Jin et al., 2022b), a condensation method for node classification, we use 5 node classification datasets: Cora, Citeseer, Pubmed (Kipf and Welling, 2017), ogbn-arxiv (Hu et al., 2020) and Flickr (Zeng et al., 2020). The dataset statistics are shown in Table 5. We follow the settings in GCond to generate one condensed graph for each dataset, train a GCN on the condensed graph, and evaluate its classification performance on the original test nodes. To adapt DosCond to node classification, we replace the bilevel gradient matching scheme in GCond with our proposed one-step gradient matching. The results of classification accuracy and running time per epoch are summarized in Table 3. From the table, we make the following observations:

The proposed DosCond achieves performance similar to GCond, and this performance is also comparable to training on the original dataset. For example, on Cora we approximate 99% of the original training performance with only 2.6% of the data. This demonstrates the effectiveness of DosCond for node classification and empirically justifies Theorem 2.

The training cost of DosCond is substantially lower than that of GCond, as DosCond avoids the expensive bilevel optimization. Comparing their running times, we can see that DosCond is up to 40 times faster than GCond.
We further note that GCond produces weighted graphs, which require storing edge weights as floats, while DosCond outputs discrete graph structures that can be stored as binary values. Hence, the graphs learned by DosCond are more memory-efficient.
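The memory argument can be made concrete with a rough sketch (a hypothetical 1000-node condensed graph; actual savings depend on the serialization format):

```python
import numpy as np

n = 1000  # nodes in a hypothetical condensed graph
rng = np.random.default_rng(0)

weighted = rng.random((n, n)).astype(np.float32)  # weighted structure (floats)
binary = weighted > 0.9                           # discrete structure (0/1)

packed = np.packbits(binary)                      # 1 bit per adjacency entry
print(weighted.nbytes)  # 4000000 bytes
print(packed.nbytes)    # 125000 bytes, i.e., 32x smaller
```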
4. Related Work
Graph Neural Networks. As the generalization of deep neural networks to graph data, graph neural networks (GNNs) (Kipf and Welling, 2017; Klicpera et al., 2019; Velickovic et al., 2018; Wu et al., 2019b, a; Tang et al., 2020; Jin et al., 2020; Liu et al., 2022; Wang et al., 2022b) have revolutionized the field of graph representation learning by effectively exploiting graph structural information. GNNs have achieved remarkable performance in basic graph-related tasks such as graph classification (Xu et al., 2019; Guo et al., 2021), link prediction (Fan et al., 2019) and node classification (Kipf and Welling, 2017). Recent years have also witnessed their success in many real-world applications such as recommender systems (Fan et al., 2019)
(Li et al., 2019) and drug discovery (Duvenaud et al., 2015). GNNs take both the adjacency matrix and the node feature matrix as input and output node-level or graph-level representations. Essentially, they follow a message-passing scheme (Gilmer et al., 2017) where each node first aggregates information from its neighborhood and then transforms the aggregated information to update its representation. Furthermore, there has been significant progress in developing deeper GNNs (Liu et al., 2020; Jin et al., 2022a), self-supervised GNNs (You et al., 2021, 2020; Wang et al., 2022a) and graph data augmentation (Zhao et al., 2021b; Ding et al., 2022; Zhao et al., 2022).

Dataset Distillation & Dataset Condensation. It is widely recognized that training neural networks on large datasets can be prohibitively costly. To alleviate this issue, dataset distillation (DD) (Wang et al., 2018) aims to distill the knowledge of a large training dataset into a small number of synthetic samples. DD formulates the distillation process as a learning-to-learn problem and solves it through bilevel optimization. To improve the efficiency of DD, dataset condensation (DC) (Zhao et al., 2021a; Zhao and Bilen, 2021a) learns the small synthetic dataset by matching the gradients of the network parameters w.r.t. the large real and small synthetic training data. It has been demonstrated that these condensed samples can facilitate critical applications such as continual learning (Zhao et al., 2021a; Zhao and Bilen, 2021a; Kim et al., 2022; Lee et al., 2022; Zhao and Bilen, 2021b), neural architecture search (Nguyen et al., 2021a, b; Yang et al., 2022) and privacy-preserving scenarios (Dong et al., 2022). Recently, following the gradient matching scheme in DC, Jin et al. (2022b) proposed a condensation method that condenses a large-scale graph into a small graph for node classification.
Different from (Jin et al., 2022b), which learns a weighted graph structure, we aim to solve the challenge of learning discrete structures, and we mainly target graph classification. Moreover, our method avoids the costly bilevel optimization and is much more efficient than the previous work. A detailed comparison is included in Section 3.4.
5. Conclusion
Training graph neural networks on a large-scale graph dataset incurs high computational cost. One solution to alleviate this issue is to condense the large graph dataset into a small synthetic one. In this work, we propose a novel framework, DosCond, which adopts a one-step gradient matching strategy to efficiently condense real graphs into a small number of informative synthetic graphs with discrete structures. We further justify the proposed method from both theoretical and empirical perspectives. Notably, our experiments show that we can reduce the dataset size by 90% while approximating up to 98% of the original performance. In the future, we plan to investigate interpretable condensation methods and diverse applications of the condensed graphs.
Acknowledgement
Wei Jin and Jiliang Tang are supported by the National Science Foundation (NSF) under grant numbers IIS-1714741, CNS-1815636, IIS-1845081, IIS-1907704, IIS-1928278, IIS-1955285, IOS-2107215, and IOS-2035472, the Army Research Office (ARO) under grant number W911NF-21-1-0198, and Amazon.com, Inc.
References

Concrete autoencoders for differentiable feature selection and reconstruction. arXiv preprint arXiv:1901.09346.
Relational inductive biases, deep learning, and graph networks. arXiv preprint.
End-to-end incremental learning. In ECCV.
Data augmentation for deep graph learning: a survey. arXiv preprint arXiv:2202.08235.
Privacy for free: how does dataset condensation help privacy? In ICML.
Convolutional networks on graphs for learning molecular fingerprints. In NeurIPS.
Benchmarking graph neural networks. arXiv preprint arXiv:2003.00982.
Graph neural networks for social recommendation. In WWW.
Facility location: concepts, models, algorithms and case studies.
Neural message passing for quantum chemistry. In ICML.
Few-shot graph learning for molecular property prediction. In Proceedings of the Web Conference 2021, pp. 2559–2567.
Open graph benchmark: datasets for machine learning on graphs. In NeurIPS.
Feature overcorrelation in deep graph neural networks: a new perspective. In KDD.
Graph structure learning for robust graph neural networks. In KDD.
Graph condensation for graph neural networks. In ICLR.
GRAD-MATCH: gradient matching based data subset selection for efficient deep model training. In ICML.
Dataset condensation via efficient synthetic-data parameterization. arXiv preprint arXiv:2205.14959.
Semi-supervised classification with graph convolutional networks. In ICLR.
Predict then propagate: graph neural networks meet personalized PageRank. In ICLR.
Dataset condensation with contrastive signals. In ICML.
DeepGCNs: can GCNs go as deep as CNNs? In ICCV.
Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(12), pp. 2935–2947.
DARTS: differentiable architecture search. In ICLR.
Towards deeper graph neural networks. In KDD.
Generating 3D molecules for target protein binding. In ICML.
The concrete distribution: a continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712.
TUDataset: a collection of benchmark datasets for learning with graphs. arXiv preprint arXiv:2007.08663.
Dataset meta-learning from kernel ridge-regression. In ICLR.
Dataset distillation with infinitely wide convolutional networks. In NeurIPS.
iCaRL: incremental classifier and representation learning. In CVPR.
Active learning for convolutional neural networks: a core-set approach. In ICLR.
Transferring robustness for graph neural network against poisoning attacks. In WSDM.
Visualizing data using t-SNE. Journal of Machine Learning Research (11).
Graph attention networks. In ICLR.
A semi-supervised graph attentive network for financial fraud detection. In ICDM.
Dataset distillation. arXiv preprint.
Graph neural networks: self-supervised learning. In Graph Neural Networks: Foundations, Frontiers, and Applications, pp. 391–420.
Improving fairness in graph neural networks via mitigating sensitive attribute leakage. In KDD.
Machine learning refined: foundations, algorithms, and applications. Cambridge University Press.
Herding dynamical weights to learn. In ICML.
Simplifying graph convolutional networks. In ICML.
A comprehensive survey on graph neural networks. arXiv preprint.
How powerful are graph neural networks? In ICLR.
Dataset pruning: reducing training data by examining generalization influence. arXiv preprint arXiv:2205.09329.
LipschitzLR: using theoretically computed adaptive learning rates for fast convergence. Applied Intelligence 51(3), pp. 1460–1478.
Hierarchical graph representation learning with differentiable pooling. In NeurIPS.
Graph contrastive learning automated. In ICML.
When does self-supervision help graph convolutional networks? In ICML.
GraphSAINT: graph sampling based inductive learning method. In ICLR.
Dataset condensation with differentiable Siamese augmentation. In ICML.
Dataset condensation with distribution matching. arXiv preprint arXiv:2110.04181.
Dataset condensation with gradient matching. In ICLR.
Graph data augmentation for graph machine learning: a survey. arXiv preprint arXiv:2202.08871.
Data augmentation for graph neural networks. In AAAI.
Appendix A Experimental Setup
a.1. Dataset Statistics and Code
Dataset statistics are shown in Tables 4 and 5. We provide our code in the supplementary file for the purpose of reproducibility.
Dataset  Type  #Classes  #Graphs  Avg. Nodes  Avg. Edges 

CIFAR10  Superpixel  10  60,000  117.6  941.07 
ogbg-molhiv  Molecule  2  41,127  25.5  54.9 
ogbg-molbace  Molecule  2  1,513  34.1  36.9 
ogbg-molbbbp  Molecule  2  2,039  24.1  26.0 
MUTAG  Molecule  2  188  17.93  19.79 
NCI1  Molecule  2  4,110  29.87  32.30 
DD  Molecule  2  1,178  284.32  715.66 
E-commerce  Transaction  2  1,109  33.7  56.3 
Dataset  #Nodes  #Edges  #Classes  #Features 

Cora  2,708  5,429  7  1,433 
Citeseer  3,327  4,732  6  3,703 
Pubmed  19,717  44,338  3  500 
Arxiv  169,343  1,166,243  40  128 
Flickr  89,250  899,756  7  500 
Appendix B Proofs
b.1. Proof of Theorem 1
Let $A_i$, $X_i$ denote the adjacency matrix and the feature matrix of the $i$-th real graph, respectively. We denote the cross-entropy loss on the real samples as $L(\theta)$ and that on the synthetic samples as $L'(\theta)$. Let $\theta^*$ denote the optimal parameter and let $\theta_t$ be the parameter trained on the condensed data at the $t$-th epoch by optimizing $L'(\theta)$. For simplicity of notation, we assume the $A_i$ and $X_i$ are already normalized. Part of the proof is inspired by the work (Killamsetty et al., 2021).
Theorem 1.
When we use a one-layer linearized SGC as the GNN in condensation, i.e., $f_\theta(A, X) = \mathrm{Pool}(AX\theta)$ with parameter $\theta$, and assume that all network parameters satisfy $\|\theta_t\| \le \theta_{\max}$, we have
(15) 
where $c_i = 1$ if we use sum pooling in $\mathrm{Pool}(\cdot)$; $c_i = 1/N_i$ if we use mean pooling, with $N_i$ being the number of nodes in the $i$-th synthetic graph.
Proof.
We start by proving that $L'(\theta)$ is convex and $\nabla L'(\theta)$ is Lipschitz continuous when we use $f_\theta$ as the mapping function. Before proving these two properties, we first rewrite $f_\theta(A', X')$ as:
(16) 
where $N'$ is the number of nodes in $A'$ and $\mathbf{1}$ is a matrix filled with constant ones. From the above equation we can see that $f_\theta$ with different pooling methods only differs in a multiplicative factor. Thus, in the following we focus on $f_\theta$ with sum pooling to derive the main proof.
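The multiplicative-factor claim can be checked numerically for a linearized model (a toy check; the random graph, features, and parameters are stand-ins, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, c = 6, 4, 3                                  # nodes, feature dim, classes
A = rng.integers(0, 2, size=(N, N)).astype(float)  # adjacency matrix
X = rng.normal(size=(N, d))                        # node features
theta = rng.normal(size=(d, c))                    # linear parameters

H = A @ X @ theta            # node representations of a linearized GNN
sum_pool = H.sum(axis=0)     # sum pooling
mean_pool = H.mean(axis=0)   # mean pooling

# Mean pooling equals sum pooling scaled by 1/N.
print(np.allclose(mean_pool, sum_pool / N))  # True
```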
I. For $f_\theta$ with sum pooling:
Substituting sum pooling into the above expression, we obtain $f_\theta(A', X') = \mathbf{1}^\top A' X' \theta$ for the case with sum pooling. Next we show that $L'(\theta)$ is convex and $\nabla L'(\theta)$ is Lipschitz continuous when we use this $f_\theta$.
(a) Convexity of $L'(\theta)$. From Chapter 4 of the book (Watt et al., 2020), we know that softmax classification with the cross-entropy loss is convex w.r.t. the parameters $\theta$. In our case, the mapping function $f_\theta$ applies an affine function to $\theta$. Given that composing with an affine function preserves convexity, $L'(\theta)$ is convex.
(b) Lipschitz continuity of $\nabla L'(\theta)$. (Yedida et al., 2021) shows that the Lipschitz constant of softmax regression with the cross-entropy loss depends on the input feature matrix, the number of classes, and the number of samples. Since $L'$ is the cross-entropy loss and $f_\theta$ is linear in $\theta$, we know that $\nabla L'(\theta)$ is Lipschitz continuous and satisfies:
(17) 
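This Lipschitz property can be probed numerically (a sanity check on random data; the bound used below, the squared spectral norm of X over 2m, is a standard Hessian bound for softmax regression and is only our assumption here, not the exact constant of (Yedida et al., 2021)):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, c = 50, 5, 3                     # samples, feature dim, classes
X = rng.normal(size=(m, d))
y = rng.integers(0, c, size=m)

def grad_ce(theta):
    """Gradient of the mean cross-entropy loss of softmax regression."""
    logits = X @ theta
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(m), y] -= 1.0                    # softmax minus one-hot
    return X.T @ p / m

# The ratio ||grad(t1) - grad(t2)|| / ||t1 - t2|| should stay below the
# Hessian bound ||X||_2^2 / (2m) for softmax + cross-entropy.
bound = np.linalg.norm(X, 2) ** 2 / (2 * m)
ratios = []
for _ in range(200):
    t1, t2 = rng.normal(size=(d, c)), rng.normal(size=(d, c))
    ratios.append(np.linalg.norm(grad_ce(t1) - grad_ce(t2))
                  / np.linalg.norm(t1 - t2))
print(max(ratios) < bound)  # True
```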
With (a) and (b), we are able to proceed with our proof. First, from the convexity of $L'$ we have
(18)  $L'(\theta_t) - L'(\theta^*) \le \nabla L'(\theta_t)^\top (\theta_t - \theta^*)$
We can rewrite $\nabla L'(\theta_t)^\top (\theta_t - \theta^*)$ as follows:
(19)  $\nabla L'(\theta_t)^\top (\theta_t - \theta^*) = \frac{1}{2\eta}\left(\|\theta_t - \theta^*\|^2 - \|\theta_t - \eta\nabla L'(\theta_t) - \theta^*\|^2\right) + \frac{\eta}{2}\|\nabla L'(\theta_t)\|^2$
Given that we use gradient descent to update the network parameters, we have $\theta_{t+1} = \theta_t - \eta \nabla L'(\theta_t)$, where $\eta$ is the learning rate. Then we have
(20)  $\nabla L'(\theta_t)^\top (\theta_t - \theta^*) = \frac{1}{2\eta}\left(\|\theta_t - \theta^*\|^2 - \|\theta_{t+1} - \theta^*\|^2\right) + \frac{\eta}{2}\|\nabla L'(\theta_t)\|^2$
Combining Eq. (18) and Eq. (20), we have
(21)  $L'(\theta_t) - L'(\theta^*) \le \frac{1}{2\eta}\left(\|\theta_t - \theta^*\|^2 - \|\theta_{t+1} - \theta^*\|^2\right) + \frac{\eta}{2}\|\nabla L'(\theta_t)\|^2$
We sum up both sides of the above inequality for $t = 0, 1, \ldots, T-1$:
(22)  $\sum_{t=0}^{T-1}\left(L'(\theta_t) - L'(\theta^*)\right) \le \frac{1}{2\eta}\left(\|\theta_0 - \theta^*\|^2 - \|\theta_T - \theta^*\|^2\right) + \frac{\eta}{2}\sum_{t=0}^{T-1}\|\nabla L'(\theta_t)\|^2$
Since $\min_t \left(L'(\theta_t) - L'(\theta^*)\right) \le \frac{1}{T}\sum_{t=0}^{T-1}\left(L'(\theta_t) - L'(\theta^*)\right)$ and $\|\theta_T - \theta^*\|^2 \ge 0$, we have
(23)  $\min_t \left(L'(\theta_t) - L'(\theta^*)\right) \le \frac{1}{2\eta T}\|\theta_0 - \theta^*\|^2 + \frac{\eta}{2T}\sum_{t=0}^{T-1}\|\nabla L'(\theta_t)\|^2$
As we assume that $\|\theta_t\| \le \theta_{\max}$, we have $\|\theta_0 - \theta^*\|^2 \le 4\theta_{\max}^2$. Then Eq. (23) can be rewritten as
(24)  $\min_t \left(L'(\theta_t) - L'(\theta^*)\right) \le \frac{2\theta_{\max}^2}{\eta T} + \frac{\eta}{2T}\sum_{t=0}^{T-1}\|\nabla L'(\theta_t)\|^2$
Recall that $\nabla L'(\theta)$ is Lipschitz continuous as shown in Eq. (17); combining this with Eq. (24) gives:
(25) 
Then, choosing an appropriate learning rate $\eta$, we obtain:
(26) 
II. For $f_\theta$ with mean pooling:
Following a derivation similar to the sum-pooling case, we have