I. Introduction
Radio resource management, e.g., power control [5] and beamforming [3], plays a crucial role in wireless networks. Unfortunately, many of these problems are nonconvex and computationally challenging. Moreover, they need to be solved in real time, given the time-varying wireless channels and the latency requirements of many mobile applications. Great efforts have been made to develop effective algorithms for these challenging problems. Existing algorithms are mainly based on convex optimization approaches [31, 29], which have a limited capability in dealing with nonconvex problems and scale poorly with the problem size. Problem-specific algorithms can be developed, but doing so is a laborious process that requires much problem-specific knowledge.
Inspired by the recent successes of deep learning in many application domains, e.g., computer vision and natural language processing [11], researchers have attempted to apply deep-learning-based methods, particularly "learning to optimize" approaches, to solve difficult optimization problems in wireless networks [33, 17, 19, 6, 26, 36, 16, 9, 8]. The goal of such methods is to achieve near-optimal performance in real time without domain knowledge, i.e., to automate the algorithm design process. There are two common paradigms on this topic [2, 20]. The first one is "end-to-end learning", which directly employs a neural network to approximate the optimal solution of an optimization problem. For example, to solve the power control problem, a multi-layer perceptron (MLP) was used in [33] to approximate the input-output mapping of the classic weighted minimum mean square error (WMMSE) algorithm
[27] to speed up the computation. The second paradigm is "learning alongside optimization", which replaces an ineffective policy in a traditional algorithm with a neural network. For example, an MLP was utilized in [26] to replace the pruning policy in the branch-and-bound algorithm. Accordingly, a significant speedup and performance gain in the access point selection problem was achieved compared with the optimization-based methods in [30, 28].

A key design ingredient underlying both paradigms of "learning to optimize" is the neural network architecture. Most of the existing works adopt MLPs [33, 19, 26, 32]
or convolutional neural networks (CNNs)
[17, 36]. These architectures are inherited from the ones developed for image processing tasks and thus are not tailored to problems in wireless networks. Although near-optimal performance is achieved for small-scale wireless networks, they fail to exploit the wireless network structure and thus suffer from poor scalability and generalization in large-scale radio resource management problems. Specifically, the performance of these methods degrades dramatically when the wireless network size becomes large. For example, it was shown in [33] that the performance gap to the WMMSE algorithm, small for a modest number of users, grows substantially as the number of users increases. Moreover, these methods generalize poorly when the number of agents in the test dataset is larger than that in the training dataset. In dense wireless networks, resource management may involve thousands of users simultaneously, and the number of users changes dynamically, which makes the wide application of these learning-based methods very difficult.

A long-standing idea to improve scalability and generalization is to incorporate the structure of the target task into the neural network architecture [2, 38, 32]. A prominent example is the development of CNNs for computer vision, which is inspired by the fact that neighboring pixels of an image are useful when they are considered together [4]. To achieve better scalability, structures in single-antenna systems with homogeneous agents have recently been exploited for effective neural network architecture design [6, 9]. In static channels, observing that channel states are deterministic functions of users' geolocations in a 2D Euclidean space, spatial convolution was developed in [6], which is applicable in wireless networks with thousands of users but cannot handle fading channels. With fading channels, it was observed that the channel matrix can be viewed as the adjacency matrix of a graph [9]. From this perspective, a random edge graph neural network (REGNN) operating on such a graph was developed, and it was demonstrated to exhibit a good generalization property when the number of users in the wireless network changes. However, in a multi-antenna system, or a single-antenna system with heterogeneous agents, the channel matrix no longer fits the form of an adjacency matrix and the REGNN cannot be applied.
In this paper, we address the limitations of existing works by modeling wireless networks as wireless channel graphs and developing neural networks that exploit the graph topology. Specifically, we treat the agents as nodes in a graph, communication channels as directed edges, agent-specific parameters as node features, and channel-related parameters as edge features. Low-complexity neural network architectures operating on wireless channel graphs will then be proposed.
Existing works (e.g., [33, 26, 16]) also have another major limitation, namely, they treat the adopted neural network as a black box. Despite the superior performance in specific applications, it is hard to interpret what is learned by the neural networks. To ensure reliability, it is crucial to understand when the algorithm works and when it fails. Thus, a good theoretical understanding of learning-based radio resource management methods is needed. Compared with learning-based methods, conventional optimization-based methods are well studied. This inspires us to build a relationship between these two types of methods. In particular, we shall prove the equivalence between the proposed neural networks and a favorable class of optimization-based methods. This equivalence will allow the development of a tractable analysis of the performance and generalization of the learning-based methods through the study of their equivalent optimization-based methods.
I-A. Contributions
In this paper, we develop scalable learningbased methods to solve radio resource management problems in dense wireless networks. The major contributions are summarized as follows:

We model wireless networks as wireless channel graphs and formulate radio resource management problems as graph optimization problems. We then show that a permutation equivariance property holds in general radio resource management problems, which can be exploited for effective neural network architecture design.

We identify a favorable class of neural networks operating on wireless channel graphs, namely message passing graph neural networks (MPGNNs). In such neural networks, the feature of each node is updated by aggregating information from local nodes and edges with a low-complexity permutation-invariant function. Thus, MPGNNs satisfy the permutation equivariance property, and have the ability to generalize to large-scale problems while enjoying high computational efficiency.

For an effective implementation, we propose a wireless channel graph convolution network (WCGCN) within the MPGNN class. Besides inheriting the advantages of MPGNNs, the WCGCN enjoys several unique advantages for solving radio resource management problems. First, it can effectively exploit both agent-related features and channel-related features. Second, it is insensitive to corruption of the input features, e.g., the channel state information (CSI), implying that it can be applied with partial and imperfect CSI.

To provide interpretability and theoretical guarantees, we prove the equivalence between MPGNNs and a class of distributed optimization algorithms, which includes many classic algorithms for radio resource management, e.g., WMMSE [27]. Based on this equivalence, we analyze the performance and generalization of MPGNN-based methods for the weighted sum rate maximization problem.

We test the effectiveness of the WCGCN on power control and beamforming problems, training with unlabeled data. Extensive simulations demonstrate that the proposed WCGCN matches or outperforms classic optimization-based algorithms without domain knowledge, and with significant speedups. Remarkably, the WCGCN can solve beamforming problems with thousands of users within milliseconds on a single GPU.^1

^1 The code to reproduce the simulation results will be made available soon.
I-B. Notations
Throughout this paper, superscripts $(\cdot)^H$, $(\cdot)^T$, and $(\cdot)^{-1}$ denote conjugate transpose, transpose, and inverse, respectively. The set symbol in this paper denotes a multiset. A multiset is a pair $\mathcal{X} = (S, m)$, where $S$ is the underlying set formed from the distinct elements of $\mathcal{X}$, and $m: S \to \mathbb{Z}_{\geq 1}$ gives the multiplicity of each element. For example, $\{\!\{a, a, b\}\!\}$ is a multiset in which element $a$ has multiplicity $2$ and element $b$ has multiplicity $1$.
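As a small illustration (ours, not part of the paper), Python's `collections.Counter` realizes a multiset, and the aggregations used later for neural network design are invariant to element order:

```python
from collections import Counter

# Illustrative only: a Counter realizes a multiset (S, m) by mapping each
# distinct element to its multiplicity.
ms = Counter(['a', 'a', 'b'])          # 'a' has multiplicity 2, 'b' has multiplicity 1
assert ms == Counter(['b', 'a', 'a'])  # a multiset carries no element order

# Permutation-invariant aggregations of the kind used later for MPGNNs:
values = [1.0, 1.0, 2.0]
assert sum(values) == sum(reversed(values))  # SUM aggregation
assert max(values) == max(reversed(values))  # MAX aggregation
```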
II. Graph Modeling of Wireless Networks
In this section, we model wireless networks as graphs, and formulate radio resource management problems as graph optimization problems. Key properties of radio resource management problems will be identified, which will then be exploited to design effective neural network architectures.
II-A. Directed Graphs and the Permutation Equivariance Property
A directed graph can be represented as an ordered pair $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. Node $u$ is adjacent to node $v$ if $(u, v) \in E$, denoted by $u \to v$. Two graphs $G = (V, E)$ and $G' = (V', E')$ are isomorphic if there is a bijection $\pi: V \to V'$ such that $(u, v) \in E$ if and only if $(\pi(u), \pi(v)) \in E'$, denoted by $G \cong G'$. The adjacency matrix of $G$ is an $n \times n$ matrix $\mathbf{A}$, where $A_{u,v} = 1$ if and only if $u \to v$, for all $u, v \in V$. Thus, a directed graph can be represented by its adjacency matrix. A permutation $\pi$ corresponds to a permutation matrix $\mathbf{P}$: the rows (or columns) of $\mathbf{A}$ are rearranged if $\mathbf{P}$ is left (or right) multiplied with $\mathbf{A}$. The matrix $\mathbf{P}\mathbf{A}\mathbf{P}^T$ is also an adjacency matrix, and the graphs corresponding to $\mathbf{A}$ and $\mathbf{P}\mathbf{A}\mathbf{P}^T$ are isomorphic, since applying the permutation is merely a reordering of nodes; we denote this by $\mathbf{A} \cong \mathbf{P}\mathbf{A}\mathbf{P}^T$.

We now introduce optimization problems defined on directed graphs, and identify their permutation invariance and equivariance properties. We assign each node $v$ an optimization variable $x_v$. Denote $\mathbf{x} = [x_1, \cdots, x_n]^T$; then an optimization problem defined on graph $G$ can be written as

$$\begin{aligned} \underset{\mathbf{x}}{\text{minimize}} \quad & f(\mathbf{x}, \mathbf{A}) \\ \text{subject to} \quad & Q(\mathbf{x}, \mathbf{A}) \leq 0, \end{aligned} \tag{1}$$

where $f$ represents the objective function and $Q$ represents the constraint.
As $\mathbf{A} \cong \mathbf{P}\mathbf{A}\mathbf{P}^T$, optimization problems defined on graphs have the permutation invariance property stated below.
Proposition II.1.
(Permutation invariance) For any permutation matrix $\mathbf{P}$, the optimization problem defined in (1) has the following property:
$$f(\mathbf{P}\mathbf{x}, \mathbf{P}\mathbf{A}\mathbf{P}^T) = f(\mathbf{x}, \mathbf{A}), \qquad Q(\mathbf{P}\mathbf{x}, \mathbf{P}\mathbf{A}\mathbf{P}^T) = Q(\mathbf{x}, \mathbf{A}).$$
Proof.
Since the adjacency matrices $\mathbf{A}$ and $\mathbf{P}\mathbf{A}\mathbf{P}^T$ represent the same graph, permuting $\mathbf{x}$ and $\mathbf{A}$ simultaneously is simply a reordering of the variables. As a result, we have $f(\mathbf{P}\mathbf{x}, \mathbf{P}\mathbf{A}\mathbf{P}^T) = f(\mathbf{x}, \mathbf{A})$ and $Q(\mathbf{P}\mathbf{x}, \mathbf{P}\mathbf{A}\mathbf{P}^T) = Q(\mathbf{x}, \mathbf{A})$. ∎
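To make Proposition II.1 concrete, the following NumPy check (our illustration; $f$ is a toy SINR-style objective, not one from the paper) verifies that consistently permuting the variables and the adjacency matrix leaves the objective unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.random((n, n))  # toy "adjacency" of channel gains, A[i, j]: link j -> i
x = rng.random(n)       # per-node optimization variables (e.g., powers)

def f(x, A, sigma2=1.0):
    # toy sum-rate-style objective built from per-node SINR-like ratios
    signal = np.diag(A) * x
    interference = A @ x - signal
    return np.sum(np.log2(1.0 + signal / (interference + sigma2)))

pi = rng.permutation(n)
P = np.eye(n)[pi]       # permutation matrix: (P @ x)[i] = x[pi[i]]
assert np.isclose(f(x, A), f(P @ x, P @ A @ P.T))  # permutation invariance
```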
The permutation invariance property of the objective value and constraint leads to the corresponding property of sublevel sets. We first define the sublevel sets.
Definition II.1.
(Sublevel sets) The $\alpha$-sublevel set of a function $f(\cdot, \mathbf{A})$ is defined as
$$C_\alpha(\mathbf{A}) = \{\mathbf{x} \in \mathcal{D} : f(\mathbf{x}, \mathbf{A}) \leq \alpha\},$$
where $\mathcal{D}$ is the feasible domain.
Denote the optimal objective value of (1) as $f^*$, and the set of $\epsilon$-accurate solutions as the $(f^* + \epsilon)$-sublevel set $C_{f^* + \epsilon}(\mathbf{A})$. Thus, the properties of sublevel sets imply the properties of near-optimal solutions. Specifically, the permutation invariance property of the objective function implies the permutation equivariance property of the sublevel sets, which is stated in the next proposition.
Proposition II.2.
(Permutation equivariance) Denote $C_\alpha(\mathbf{A})$ as the $\alpha$-sublevel set of $f(\cdot, \mathbf{A})$ in (1), and define $\mathbf{P} C_\alpha(\mathbf{A}) = \{\mathbf{P}\mathbf{x} : \mathbf{x} \in C_\alpha(\mathbf{A})\}$. Then,
$$C_\alpha(\mathbf{P}\mathbf{A}\mathbf{P}^T) = \mathbf{P}\, C_\alpha(\mathbf{A}),$$
where $\mathbf{P}$ is any permutation matrix.
Remark.
The permutation equivariance property of sublevel sets is a direct result of the permutation invariance in the objective function. Please refer to Appendix A for a detailed proof.
In the next subsection, by modeling wireless networks as graphs, we show that the permutation equivariance property is universal in radio resource management problems.
II-B. Wireless Network as a Graph
A wireless network can be modeled as a directed graph with node and edge features. Naturally, we treat each agent, e.g., a mobile user or a base station, in the wireless network as a node in the graph. An edge is drawn from node $u$ to node $v$ if there is a direct communication or interference link with node $u$ as the transmitter and node $v$ as the receiver. The node feature incorporates the properties of the agent, e.g., the users' weights in the weighted sum rate maximization problem [27]. The edge feature includes the properties of the corresponding channel, e.g., a scalar (or matrix) denoting the channel state of a single-antenna (or multi-antenna) system. We call the graphs generated from the wireless network topology wireless channel graphs. Formally, a wireless channel graph is an ordered tuple $G = (V, E, s, t)$, where $V$ is the set of nodes, $E$ is the set of edges, $s: V \to \mathbb{R}^{d_1}$ maps a node to its feature, and $t: E \to \mathbb{R}^{d_2}$ maps an edge to its feature. Denote $n = |V|$. Also define the node feature array as $\mathbf{Z} \in \mathbb{R}^{n \times d_1}$ with $\mathbf{Z}_{v,:} = s(v)$, and the adjacency feature array $\mathbf{A} \in \mathbb{R}^{n \times n \times d_2}$ as

$$\mathbf{A}_{u,v,:} = \begin{cases} \mathbf{0}, & (u, v) \notin E, \\ t((u, v)), & (u, v) \in E, \end{cases} \tag{2}$$

where $\mathbf{0}$ is the zero vector in $\mathbb{R}^{d_2}$.

We assign each node $v$ an optimization variable $X_v$. Let $\mathbf{X} = [X_1, \cdots, X_n]^T$; then an optimization problem defined on a wireless channel graph can be written as

$$\begin{aligned} \underset{\mathbf{X}}{\text{minimize}} \quad & g(\mathbf{X}, \mathbf{Z}, \mathbf{A}) \\ \text{subject to} \quad & Q(\mathbf{X}, \mathbf{Z}, \mathbf{A}) \leq 0, \end{aligned} \tag{3}$$

where $g$ denotes the objective function and $Q$ denotes the constraint.
Next, we elaborate the properties of radio resource management problems on wireless channel graphs. Without node or edge features, a wireless channel graph reduces to a directed graph, so the properties of wireless channel graphs follow those of directed graphs. We call the three dimensions of $\mathbf{A}$ the row, column, and depth. The permutation operators for $\mathbf{Z}$ and $\mathbf{A}$ are defined as follows: the left permutation operator $\rho_1$ rearranges the rows, and the right permutation operator $\rho_2$ rearranges the columns, according to a permutation $\pi$. Similar to optimization problems on directed graphs, those defined on wireless channel graphs have the permutation invariance property. As a result, the sublevel sets of $g$ in (3) also have the permutation equivariance property, which is stated below.
Proposition II.3.
(Permutation equivariance) Let $C_\alpha(\mathbf{Z}, \mathbf{A})$ denote the $\alpha$-sublevel set of $g$ in (3), and define $\mathbf{P} C_\alpha(\mathbf{Z}, \mathbf{A}) = \{\mathbf{P}\mathbf{X} : \mathbf{X} \in C_\alpha(\mathbf{Z}, \mathbf{A})\}$. Then,
$$C_\alpha(\rho_1(\mathbf{Z}), \rho_1 \circ \rho_2(\mathbf{A})) = \mathbf{P}\, C_\alpha(\mathbf{Z}, \mathbf{A}),$$
where the permutation matrix $\mathbf{P}$, the left permutation operator $\rho_1$, and the right permutation operator $\rho_2$ are associated with the same permutation $\pi$.
Remark.
This result establishes a general permutation equivariance property for radio resource management problems. Proposition II.3 reduces to the results in [9] if $\mathbf{Z}$ is a constant array and the edge features are scalars. Compared with [9], Proposition II.3 is able to handle heterogeneous agents (e.g., users with different resource constraints) and more general channels (e.g., multi-antenna channels), as the heterogeneity can be modeled as node features and multi-antenna channel states can be modeled as edge features. The proof is the same as that of Proposition II.2 up to a change of notation.
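The permutation operators are straightforward to state in code; the following NumPy fragment (ours, for illustration) shows how $\rho_1$ and $\rho_2$ act on the node feature array $\mathbf{Z}$ and the adjacency feature array $\mathbf{A}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d1, d2 = 4, 2, 3
Z = rng.random((n, d1))     # node feature array
A = rng.random((n, n, d2))  # adjacency feature array (row, column, depth)

pi = rng.permutation(n)
rho1 = lambda T: T[pi]      # left operator: permutes rows
rho2 = lambda T: T[:, pi]   # right operator: permutes columns

Zp = rho1(Z)                # node features follow the node reordering
Ap = rho1(rho2(A))          # rows and columns permuted by the same pi
assert np.allclose(Ap[0, 1], A[pi[0], pi[1]])  # edge features move consistently
```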
II-C. Graph Modeling of $K$-User Interference Channels
In this subsection, as a specific example, we present the graph modeling of a classic radio resource management problem, i.e., beamforming for weighted sum rate maximization in a $K$-user interference channel. It will be used as the main test setting for the theoretical study in Section IV-C and the simulations in Section V. There are in total $K$ transceiver pairs, where each transmitter is equipped with $N_t$ antennas and each receiver is equipped with a single antenna. Let $\mathbf{v}_k \in \mathbb{C}^{N_t}$ denote the beamformer of the $k$-th transmitter. The received signal at receiver $k$ is $y_k = \mathbf{h}_{k,k}^H \mathbf{v}_k s_k + \sum_{j \neq k} \mathbf{h}_{k,j}^H \mathbf{v}_j s_j + n_k$, where $\mathbf{h}_{k,j} \in \mathbb{C}^{N_t}$ denotes the channel state from transmitter $j$ to receiver $k$, $s_k$ is the data symbol for receiver $k$, and $n_k$ denotes the additive noise following the complex Gaussian distribution $\mathcal{CN}(0, \sigma_k^2)$. The signal-to-interference-plus-noise ratio (SINR) for receiver $k$ is given by
$$\mathrm{SINR}_k = \frac{|\mathbf{h}_{k,k}^H \mathbf{v}_k|^2}{\sum_{j \neq k} |\mathbf{h}_{k,j}^H \mathbf{v}_j|^2 + \sigma_k^2}.$$
Denote $\mathbf{V} = [\mathbf{v}_1, \cdots, \mathbf{v}_K]$ as the beamforming matrix. The objective is to find the optimal beamformers that maximize the weighted sum rate, and the problem is formulated as

$$\begin{aligned} \underset{\mathbf{V}}{\text{maximize}} \quad & \sum_{k=1}^{K} w_k \log_2(1 + \mathrm{SINR}_k) \\ \text{subject to} \quad & \|\mathbf{v}_k\|_2^2 \leq P_{\max}, \quad \forall k, \end{aligned} \tag{4}$$

where $w_k$ is the weight for the $k$-th pair.
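For reference, the objective and constraint in (4) transcribe directly into NumPy; the sketch below (array shapes and names are our own choices) computes the SINR and the weighted sum rate for a given beamforming matrix:

```python
import numpy as np

def weighted_sum_rate(H, V, w, sigma2):
    """H: (K, K, Nt) with H[k, j] = h_{k,j} (transmitter j -> receiver k);
    V: (K, Nt) with V[k] = v_k; w: (K,) weights; sigma2: noise variance."""
    rx = np.abs(np.einsum('kjn,jn->kj', H.conj(), V)) ** 2  # |h_{k,j}^H v_j|^2
    signal = np.diag(rx)
    interference = rx.sum(axis=1) - signal
    sinr = signal / (interference + sigma2)
    return np.sum(w * np.log2(1.0 + sinr))

K, Nt, Pmax = 4, 2, 1.0
rng = np.random.default_rng(2)
H = (rng.normal(size=(K, K, Nt)) + 1j * rng.normal(size=(K, K, Nt))) / np.sqrt(2)
V = rng.normal(size=(K, Nt)) + 1j * rng.normal(size=(K, Nt))
# project each beamformer onto the power constraint ||v_k||^2 <= Pmax
V *= np.sqrt(Pmax) / np.maximum(np.linalg.norm(V, axis=1, keepdims=True), np.sqrt(Pmax))
print(weighted_sum_rate(H, V, w=np.ones(K), sigma2=1.0))
```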
Graph Modeling
We view the $k$-th transceiver pair as the $k$-th node in the graph. As distant agents cause little interference, we draw a directed edge from node $j$ to node $k$ only if the distance between transmitter $j$ and receiver $k$ is below a certain threshold. An illustration of such a graph modeling is shown in Fig. 1. The node feature array is given by
$$\mathbf{Z}_{k,:} = [w_k, \mathbf{h}_{k,k}^T],$$
and the adjacency feature array is given by
$$\mathbf{A}_{j,k,:} = \begin{cases} \mathbf{0}, & (j, k) \notin E, \\ \mathbf{h}_{k,j}^T, & (j, k) \in E, \end{cases}$$
where $\mathbf{0}$ is a zero vector. Problem (4) has the permutation equivariance property with respect to $\mathbf{Z}$, $\mathbf{A}$, and $\mathbf{V}$. To solve this problem efficiently and effectively, the adopted neural network should exploit the permutation equivariance property and incorporate both node features and edge features. We shall develop an effective neural network architecture to achieve this goal in the next section.
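A sketch of this graph construction is given below (the function name, array shapes, and the distance-threshold rule are our rendering of the description above); complex channels are split into real and imaginary parts to form real-valued feature arrays:

```python
import numpy as np

def build_wireless_channel_graph(tx_pos, rx_pos, H, w, threshold):
    """tx_pos, rx_pos: (K, 2) locations; H: (K, K, Nt) complex channels with
    H[k, j] = h_{k,j}; w: (K,) weights. Returns node features, edges, edge features."""
    K = tx_pos.shape[0]
    diag = H[np.arange(K), np.arange(K)]            # direct channels h_{k,k}
    node_feat = np.concatenate([w[:, None], diag.real, diag.imag], axis=1)
    edges, edge_feat = [], []
    for j in range(K):                              # directed edge j -> k
        for k in range(K):
            if j != k and np.linalg.norm(tx_pos[j] - rx_pos[k]) < threshold:
                edges.append((j, k))
                h = H[k, j]                         # interference channel h_{k,j}
                edge_feat.append(np.concatenate([h.real, h.imag]))
    return node_feat, np.array(edges).T, np.array(edge_feat)
```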
III. Neural Network Architecture Design for Radio Resource Management
In this section, we develop a scalable neural network architecture for radio resource management problems. A favorable class of GNNs, named message passing graph neural networks (MPGNNs), will be identified. Their key properties and an effective implementation will also be discussed.
III-A. Optimizing Wireless Networks via Graph Neural Networks
Most existing works on "learning to optimize" approaches for problems in wireless networks adopt MLPs as the neural network architecture [33, 19, 26]. Although MLPs can approximate well-behaved functions [13], they suffer from poor data efficiency, robustness, and generalization. A long-standing idea for improving performance and generalization is to incorporate the structure of the target task into the neural network architecture. In this way, there is no need for the neural network to learn such structures from data, which leads to more efficient training, and to better generalization both empirically [22, 32] and provably [38].
As discussed above, radio resource management problems can be formulated as optimization problems on wireless channel graphs, which enjoy the permutation equivariance property. In machine learning, there are two classes of neural networks that are able to exploit the permutation equivariance property, i.e., graph neural networks (GNNs) [35] and Deep Sets [39]. Compared with Deep Sets, GNNs not only respect the permutation equivariance property but can also model the interactions among the agents. In wireless networks, the agents interact with each other through channels. Thus, GNNs are more favorable than Deep Sets in wireless networks. This motivates us to adopt GNNs to solve radio resource management problems.

III-B. Message Passing Graph Neural Networks
In this subsection, we shall identify a favorable class of GNNs for radio resource management problems, which extends CNNs to wireless channel graphs. In traditional machine learning tasks, the data can typically be embedded in a Euclidean space, e.g., images. Recently, an increasing number of applications involve data from non-Euclidean spaces that can be naturally modeled as graphs, e.g., point clouds [34] and combinatorial problems [18]. This motivates researchers to develop GNNs [35], which effectively exploit the graph structure. GNNs generalize traditional CNNs, recurrent neural networks, and autoencoders to graph tasks. In wireless networks, while the agents are located in a Euclidean space, channel states cannot be embedded in one. Thus, the data in radio resource management problems is also non-Euclidean, and neural networks operating on non-Euclidean data are necessary when adopting "learning to optimize" approaches in wireless networks.
As background, we first introduce CNNs, which operate on Euclidean data. Compared with MLPs, CNNs have shown superior performance in image processing tasks. The motivation for CNNs is that adjacent pixels are meaningful when considered together in images [4]. Like MLPs, CNNs have a layer-wise structure. In each layer, a 2D convolution is applied to the input. Here we consider a simple CNN with rectified linear unit (ReLU) activations and without pooling. In the $k$-th layer, for a pixel located at $(i, j)$, the update is
$$x_{(i,j)}^{(k)} = \sigma\left( \sum_{(i',j') \in N(i,j)} \mathbf{W}_{(i'-i,\, j'-j)}^{(k)}\, x_{(i',j')}^{(k-1)} \right), \tag{5}$$
where $x_{(i,j)}^{(0)}$ denotes pixel $(i, j)$ of the input image, $x_{(i,j)}^{(k)}$ denotes the hidden state of pixel $(i, j)$ at the $k$-th layer, $\mathbf{W}^{(k)}$ denotes the weight matrix in the $k$-th layer, $\sigma$ denotes the ReLU, and $N(i, j)$ denotes the neighbor pixels of pixel $(i, j)$. Specifically, for a convolution kernel of size $F \times F$, we have
$$N(i, j) = \{(i', j') : |i' - i| \leq \lfloor F/2 \rfloor,\ |j' - j| \leq \lfloor F/2 \rfloor\},$$
and a common choice is $F = 3$.
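A minimal sketch of the update (5) for a single channel with a $3 \times 3$ kernel and zero padding (all names are ours; written for clarity rather than speed):

```python
import numpy as np

def cnn_layer(x, W):
    """x: (H, W) feature map; W: (3, 3) kernel. One layer of (5) with ReLU."""
    h, w = x.shape
    xp = np.pad(x, 1)                 # zero padding at the borders
    out = np.zeros_like(x)
    for i in range(h):
        for j in range(w):
            # aggregate the 3x3 neighborhood N(i, j), then apply the ReLU
            out[i, j] = max(0.0, np.sum(W * xp[i:i + 3, j:j + 3]))
    return out
```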
Despite the great success of CNNs in computer vision, they cannot be applied to non-Euclidean data. In [37], CNNs were extended to graphs from a spatial perspective; the resulting architecture is as efficient as CNNs while enjoying performance guarantees on the graph isomorphism test. We refer to this architecture as spatial graph convolutional networks (SGNNs). In each layer of a CNN (5), each pixel aggregates information from neighboring pixels and then updates its state. As an analogy, in each layer of an SGNN, each node updates its representation by aggregating features from its neighboring nodes. Specifically, the update rule of the $k$-th layer at vertex $v$ in an SGNN is
$$x_v^{(k)} = \psi^{(k)}\left( x_v^{(k-1)},\ \phi^{(k)}\left( \{\!\{ x_u^{(k-1)} : u \in N(v) \}\!\} \right) \right), \tag{6}$$
where $x_v^{(0)}$ denotes the input feature of node $v$, $x_v^{(k)}$ denotes the hidden state of node $v$ at the $k$-th layer, $N(v)$ denotes the set of neighbors of $v$, $\phi^{(k)}$ is a set function that aggregates information from the node's neighbors, and $\psi^{(k)}$ is a function that combines the aggregated information with the node's own information. An illustration of the extension from CNNs to SGNNs is shown in Fig. 2. In particular, SGNNs include spatial deep learning for wireless scheduling [6] as a special case.
Despite the success of SGNNs on graph problems, it is difficult to directly apply them to radio resource allocation problems, as they cannot exploit edge features. This means that they cannot incorporate channel states in wireless networks. We modify the definition in (6) to exploit edge features, and refer to the result as message passing graph neural networks (MPGNNs). The update rule for the $k$-th layer at vertex $v$ in an MPGNN is
$$x_v^{(k)} = \psi^{(k)}\left( x_v^{(k-1)},\ \phi^{(k)}\left( \{\!\{ [x_u^{(k-1)}, e_{u,v}] : u \in N(v) \}\!\} \right) \right), \tag{7}$$
where $e_{u,v}$ denotes the feature of edge $(u, v)$ (i.e., $\mathbf{A}_{u,v,:}$ in (2)). The output of an $L$-layer MPGNN is $\mathbf{X}^{(L)} = [x_1^{(L)}, \cdots, x_n^{(L)}]^T$.
The extension from SGNNs to MPGNNs is simple but crucial, for the following two reasons. First, MPGNNs respect the permutation equivariance property in Proposition II.3. Second, MPGNNs enjoy theoretical guarantees in radio resource management problems (as discussed in Section IV). These two properties are unique to MPGNNs and are not enjoyed by SGNNs.
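The update (7) translates almost literally into code. The following sketch (our illustration; `h` and `psi` are arbitrary learnable callables, and the node loop is written for clarity rather than speed) applies one MPGNN layer, realizing the set function $\phi$ as a max over per-edge messages:

```python
import torch

def mpgnn_layer(x, edge_index, edge_attr, h, psi):
    """x: (n, d) node states; edge_index: (2, m) with directed edges u -> v;
    edge_attr: (m, d_e) edge features. One application of the update (7)."""
    src, dst = edge_index
    msgs = h(torch.cat([x[src], edge_attr], dim=1))  # one message per edge (u, v)
    agg = x.new_zeros(x.size(0), msgs.size(1))
    for v in range(x.size(0)):                       # permutation-invariant MAX aggregation
        mask = dst == v
        if mask.any():
            agg[v] = msgs[mask].max(dim=0).values
    return psi(torch.cat([x, agg], dim=1))           # combine with the node's own state
```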
III-C. Key Properties of MPGNNs
MPGNNs enjoy several properties that are favorable for solving large-scale radio resource management problems, as discussed in the sequel.
Permutation equivariance
We first show that MPGNNs satisfy the permutation equivariance property, which leads to easier training and better generalization.
Proposition III.1.
(Permutation equivariance in MPGNNs) Viewing the input-output mapping of an MPGNN defined in (7) as $\mathbf{X}^{(L)} = \Phi(\mathbf{Z}, \mathbf{A})$, we have
$$\mathbf{P}\,\Phi(\mathbf{Z}, \mathbf{A}) = \Phi(\rho_1(\mathbf{Z}), \rho_1 \circ \rho_2(\mathbf{A}))$$
for any permutation matrix $\mathbf{P}$, where $\rho_1$ and $\rho_2$ are associated with the same permutation.
Remark.
Please refer to Appendix B for a detailed proof.
Ability to generalize to different problem scales
In MLPs, the input and output sizes must be the same during training and testing. Hence, the problem size in the test dataset can be no larger than that in the training dataset [33]. This means that MLP-based methods cannot be directly applied to a different problem size. In MPGNNs, each node has a copy of two sub-neural-networks, i.e., $\phi^{(k)}$ and $\psi^{(k)}$, whose input and output dimensions are invariant to the problem scale. Thus, we can train MPGNNs on small-scale problems and apply them to large-scale problems.
Fewer training samples
The required number of training samples for MPGNNs is much smaller than that for MLPs. The first reason is training sample reuse: for each training sample, every node processes a permuted version of it with $\phi^{(k)}$ and $\psi^{(k)}$. Thus, each training sample is effectively reused $n$ times for training $\phi^{(k)}$ and $\psi^{(k)}$, where $n$ is the problem scale. Second, the input and output dimensions of the aggregation and combination functions in MPGNNs are much smaller than those of the original problem, which allows the use of far fewer parameters in the neural networks.
High computational efficiency
In each layer, an aggregation function is applied to all the edges and a combination function is applied to all the nodes. Thus, the time complexity of each layer is $O(|V| + |E|)$, and the overall time complexity of an $L$-layer MPGNN is $O(L(|V| + |E|))$. The time complexity grows linearly with the number of agents when the maximal degree of the graph is bounded. Note that in MPGNNs, the aggregation and combination functions at each node can be executed in parallel. When an MPGNN is fully parallelized, e.g., on powerful GPUs, the time complexity is $O(L\, d_{\max})$, where $d_{\max}$ is the maximal degree of the graph. This is a constant time complexity when the maximal degree of the graph is bounded. We will verify this observation via simulations in Fig. 4.
III-D. An Effective Implementation of MPGNNs
In this subsection, we propose an effective implementation of MPGNNs for radio resource management problems, named the wireless channel graph convolution network (WCGCN), which is able to incorporate both agent-related and channel-related features. The design space of MPGNNs (7) consists of the choice of the set aggregation function $\phi^{(k)}$ and the combination function $\psi^{(k)}$.
As general set functions are difficult to implement, an efficient implementation of $\phi$ was proposed in [10], which has the form
$$\phi\left( \{\!\{x_1, \cdots, x_m\}\!\} \right) = \mathrm{AGG}\left( h(x_1), \cdots, h(x_m) \right),$$
where $x_1, \cdots, x_m$ are the elements of the set, $\mathrm{AGG}$ is a simple permutation-invariant function, e.g., max or sum, and $h$ is some existing neural network architecture, e.g., a linear mapping or an MLP. For $h$ and $\psi$, linear mappings are adopted in popular GNN architectures (e.g., GCN [14] and S2V [7]). Nevertheless, as discussed in Section IV of [16], linear mappings have difficulty handling continuous features, which are ubiquitous in wireless networks (e.g., CSI). We adopt MLPs as $h$ and $\psi$ for their approximation ability [13]. The MLP processing units enable the WCGCN to exploit complicated agent-related and channel-related features in wireless networks.
For the aggregation function $\mathrm{AGG}$, we notice that a robustness property holds if we use $\mathrm{MAX}$: the output of $\phi$ remains the same under corruption of the input as long as a critical subset of the input features is preserved and the number of corrupted features is limited (Theorem III.1). Specializing this to problems in wireless networks, the output of a layer remains unchanged even when the CSI is heavily corrupted on some links. In other words, it is robust to missing CSI.
We next specify the architecture of the WCGCN, which aligns with traditional optimization algorithms. First, in traditional optimization algorithms, each iteration outputs an updated version of the optimization variables; in the WCGCN, each layer does the same. Second, these algorithms are often time-invariant systems, e.g., gradient descent, WMMSE [27], and FPlinQ [24]. Thus, we share weights among the layers of the WCGCN, and the updates are
$$\begin{aligned} y_v^{(k)} &= \underset{u \in N(v)}{\mathrm{MAX}}\left\{ \mathrm{MLP1}\left( [x_u^{(k-1)}, e_{u,v}] \right) \right\}, \\ x_v^{(k)} &= \beta\left( \mathrm{MLP2}\left( [x_v^{(k-1)}, y_v^{(k)}] \right) \right), \end{aligned} \tag{8}$$
where MLP1 and MLP2 are two different MLPs, and $\beta$ is a normalization function that depends on the application. For example, for the power control problem, we constrain the power between $0$ and $P_{\max}$, and $\beta$ can be a scaled sigmoid function, i.e., $\beta(x) = P_{\max}/(1 + e^{-x})$.

Besides the benign properties of MPGNNs, the WCGCN enjoys several desirable properties for solving large-scale radio resource management problems. First, the WCGCN can effectively exploit features in multi-antenna systems with heterogeneous agents (e.g., channel states in multi-antenna systems and users' weights in weighted sum rate maximization). This is because the WCGCN adopts MLPs as processing units instead of linear mappings, which enables it to solve a wider class of radio resource management tasks than existing works [6, 16, 9] (e.g., beamforming problems and weighted sum rate maximization). Second, it is robust to partial and imperfect CSI, as suggested by Theorem III.1.
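As an illustration, the updates in (8) map naturally onto the `MessagePassing` interface of PyTorch Geometric (the library used for the simulations in Section V); the hidden sizes and the plain sigmoid below are placeholder choices, not the paper's exact settings:

```python
import torch
from torch import nn
from torch_geometric.nn import MessagePassing

class WCGCNLayer(MessagePassing):
    """One layer of (8): per-edge MLP1, MAX aggregation, per-node MLP2, then beta."""
    def __init__(self, state_dim, edge_dim, hidden=64):
        super().__init__(aggr='max')  # MAX aggregation, cf. Theorem III.1
        self.mlp1 = nn.Sequential(nn.Linear(state_dim + edge_dim, hidden),
                                  nn.ReLU(), nn.Linear(hidden, hidden))
        self.mlp2 = nn.Sequential(nn.Linear(state_dim + hidden, hidden),
                                  nn.ReLU(), nn.Linear(hidden, state_dim))

    def forward(self, x, edge_index, edge_attr):
        y = self.propagate(edge_index, x=x, edge_attr=edge_attr)
        # beta: sigmoid keeps powers in [0, 1] (scale by Pmax as needed)
        return torch.sigmoid(self.mlp2(torch.cat([x, y], dim=1)))

    def message(self, x_j, edge_attr):
        # MLP1([x_u, e_{u,v}]) computed per edge
        return self.mlp1(torch.cat([x_j, edge_attr], dim=1))
```

Since the weights are shared among layers in (8), a full WCGCN corresponds to applying the same `WCGCNLayer` instance repeatedly.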
IV. Theoretical Analysis of MPGNN-Based Radio Resource Management
In this section, we investigate the performance and generalization of MPGNNs. We first prove the equivalence between MPGNNs and a class of distributed algorithms, which includes many classic algorithms for radio resource management as special cases, e.g., WMMSE [27] and FPlinQ [24]. Based on this observation, we analyze the performance of MPGNN-based methods for the weighted sum rate maximization problem.
IV-A. Simplifications
To provide theoretical guarantees for "learning to optimize" approaches to radio resource management problems, it is critical to understand the performance and generalization of neural-network-based methods. Unfortunately, the training and generalization of neural networks are still open problems. We make several commonly adopted simplifications to make the performance analysis tractable. First, we focus on the MPGNN class instead of any specific neural network architecture such as GCNs. Following Lemma 5 and Corollary 6 in [37], we can design an MPGNN with MLP processing units that is as powerful as the MPGNN class, and thus this simplification serves our purpose well. Second, we aim at proving the existence of an MPGNN with a performance guarantee. Because we train the neural network with stochastic gradient descent on limited training samples in the simulations, we may not find the corresponding neural network parameters. While this may leave some gap between theory and practice, our result is an important first step. These two simplifications have been commonly adopted in the performance analysis of GNNs [37, 23, 1].

IV-B. Equivalence of MPGNNs and Distributed Optimization
Compared with neural-network-based radio resource management, optimization-based radio resource management has been well studied. Thus, it is desirable to make connections between these two types of methods. In [23], the equivalence between some special types of GNNs and graph optimization algorithms was proved. Inspired by this result, we shall establish the equivalence between MPGNNs and a class of distributed radio resource management algorithms.
We first give a brief introduction to distributed local algorithms, following [12]. The maximal degree of the nodes in the graph is assumed to be bounded. Distributed local algorithms are a class of iterative algorithms in a multi-agent system. In each iteration, each agent sends messages to its neighbors, receives messages from its neighbors, and updates its state based on the received messages. The algorithm terminates after a constant number of iterations.
We focus on a subclass of distributed local algorithms, termed multiset broadcasting distributed local algorithms (MBDLAs) [12], which includes a wide range of radio resource management algorithms, e.g., DTP [15], WMMSE [27], and FPlinQ [24]. "Multiset" and "broadcasting" refer to the ways of receiving and sending messages, respectively. Denote $s_v^{(t)}$ as the state of node $v$ at the $t$-th iteration; the MBDLA is shown in Algorithm 1.
The equivalence between MPGNNs and MBDLAs is rooted in the similarity of their definitions. In each iteration of an MBDLA, each agent aggregates messages from neighboring agents and updates its local state. In each layer of an MPGNN, each node aggregates features from neighboring nodes. The equivalence can be drawn by viewing the agents as nodes in a graph and the messages as features. The following theorem states the equivalence of MPGNNs and MBDLAs formally.
Theorem IV.1.
Let MBDLA($T$) denote the class of MBDLAs with $T$ iterations and MPGNN($L$) the class of MPGNNs with $L$ layers; then the following two conclusions hold.

For any MPGNN in MPGNN($L$), there exists a distributed local algorithm in MBDLA($L$) that solves the same set of problems.

For any algorithm in MBDLA($T$), there exists an MPGNN in MPGNN($T$) that solves the same set of problems as this algorithm.
Remark.
Please refer to Appendix C for a detailed proof.
This equivalence allows us to analyze the performance of MPGNNs by studying the performance of MBDLAs. The first result shows that MPGNNs are at most as powerful as MBDLAs. The implication is that if we can prove that no MBDLA is capable of solving a specific radio resource management problem, then MPGNNs cannot solve it either. This can be used to prove performance upper bounds for MPGNNs. The second result shows that MPGNNs are as powerful as MBDLAs in radio resource management problems. This implies that if we can identify an MBDLA that solves a radio resource management problem well, then there exists an MPGNN that performs better or is at least competitive. The generalization is also as good as that of the corresponding MBDLA. We shall give a specific example on sum rate maximization in the next subsection.
IV-C. Performance and Generalization of MPGNNs
In this subsection, we use the tools developed in the last subsection to analyze the performance and generalization of MPGNNs for the sum rate maximization problem. The analysis is built on the observation that a classic algorithm for this problem, i.e., WMMSE, is an MBDLA under some conditions, as formally stated below. We shall refer to the MBDLA corresponding to WMMSE as WMMSE-DLA.
Proposition IV.1.
When the maximal number of interfering neighbors is bounded by some constant, WMMSE with a constant number of iterations is an MBDLA.
Remark.
When the problem sizes in the training and test datasets are the same, we can always assume that the number of interfering neighbors is a common constant. The restriction to a constant number of interfering neighbors only influences the generalization. Please refer to Appendix D for a detailed proof.
Performance
Proposition IV.1 shows that WMMSE is an MBDLA. Thus, when the problem sizes in the training and test datasets are the same, there exists an MPGNN whose performance is as good as that of WMMSE. As WMMSE is handcrafted, it is not optimal in terms of the number of iterations. By employing an unsupervised loss function, we expect MPGNNs to learn an algorithm that uses fewer iterations and possibly enjoys better performance. In Fig. 3, we observe that an MPGNN with only a few layers outperforms WMMSE run for many more iterations.

Generalization
To avoid excessive training costs, it is desirable to first train a neural network on small-scale problems and then generalize it to large-scale ones. An intriguing question is when such generalization is reliable. Compared with WMMSE, WMMSE-DLA has two constraints: both the number of iterations and the maximal number of interfering neighbors must be bounded by constants. As agents that are far away cause little interference, the number of interfering neighbors can be assumed fixed when the user density is kept the same. As a result, compared with WMMSE with a fixed number of iterations, the performance of MPGNNs is stable when the user density in the test dataset remains within a constant factor of that in the training dataset. We will verify this by simulations in Table IV and Table VII.
V. Simulation Results
In this section, we provide simulation results to verify the effectiveness of the proposed neural network architecture in three applications. The first application is sum rate maximization in a Gaussian interference channel, which is a classic application for deep-learning-based methods. We use this application to compare the proposed method with MLP-based methods [19] and optimization-based methods [27]. The second application is weighted sum rate maximization, and the third is beamformer design. The last two problems cannot be solved by the existing methods in [6, 16, 9].
For the neural network setting, we adopt a multi-layer WCGCN, implemented with PyTorch Geometric [10]. We apply unsupervised training without labeled samples; the loss function is the negative expected (weighted) sum rate,
$$\ell(\theta) = -\mathbb{E}\left[ \sum_{k=1}^{K} w_k \log_2(1 + \mathrm{SINR}_k) \right],$$
where the expectation is taken over the channel realizations. To optimize the neural network, we adopt the Adam optimizer with a fixed learning rate.
V-A. Sum Rate Maximization
We first consider the sum rate maximization problem in a single-antenna Gaussian interference channel. This problem is a special case of (4) with single-antenna transmitters ($N_t = 1$) and equal weights ($w_k = 1$).
We consider the following benchmarks for comparison.

WMMSE [27]: This is a classic optimization-based algorithm for sum utility maximization in MIMO interfering broadcast channels. We run WMMSE for a sufficiently large number of iterations; a NumPy sketch of its scalar updates is given after this list.

Strongest: We find a fixed proportion of the pairs with the largest direct channel gains, set the power of these pairs to $P_{\max}$, and set the power of the remaining pairs to $0$. This is a simple baseline that uses no knowledge of the interference links.

PCNet [19]: PCNet is an MLP-based method particularly designed for the sum rate maximization problem in single-antenna channels.
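For concreteness, the following NumPy sketch (our rendering; variable names and the full-power initialization are our own choices) implements the scalar, single-antenna WMMSE updates used for this benchmark, following the algorithm of [27]:

```python
import numpy as np

def wmmse_siso(h, Pmax, sigma2, alpha, iters=100):
    """h: (K, K) complex channels with h[k, j] from transmitter j to receiver k;
    alpha: (K,) rate weights. Returns transmit powers p_k = v_k^2."""
    K = h.shape[0]
    g = np.abs(h) ** 2                  # channel gains
    d = np.abs(np.diag(h))              # direct-link amplitudes |h_{k,k}|
    v = np.sqrt(Pmax) * np.ones(K)      # full-power initialization
    for _ in range(iters):
        u = d * v / (g @ v ** 2 + sigma2)       # MMSE receive coefficients
        w = 1.0 / (1.0 - u * d * v)             # MSE weights
        v = alpha * w * u * d / (g.T @ (alpha * w * u ** 2))
        v = np.clip(v, 0.0, np.sqrt(Pmax))      # project onto the power constraint
    return v ** 2
```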
Both WCGCN and PCNet are trained with unlabeled samples. For a specific parameter setting of the WCGCN (8), MLP1 and MLP2 are implemented as MLPs with a few hidden layers, and $\beta$ is the sigmoid function.^2 The performance of the different methods is shown in Table I. The SNR and the number of users are kept the same in the training and test datasets. For all the tables in this section, the entries are the (weighted) sum rates achieved by the different methods, normalized by the sum rate of WMMSE. We see that both PCNet and WCGCN achieve near-optimal performance when the problem scale is small. As the problem scale becomes large, the performance of PCNet approaches that of Strongest, which shows that it can hardly learn any valuable information about the interference links. Nevertheless, the performance of WCGCN remains stable as the problem size increases. Thus, GNNs are more favorable than MLPs for medium-scale and large-scale problems.

^2 The performance of the WCGCN is not sensitive to the number of hidden units.
[Table I. Normalized sum rates of WCGCN, PCNet, and Strongest under different SNR levels.]
We further compare the performance of the WCGCN and WMMSE with different numbers of iterations. The results for a fixed system setting are shown in Fig. 3. From the figure, we see that a WCGCN with only a few layers outperforms WMMSE run for many more iterations. This indicates that, by adopting the unsupervised loss function, the WCGCN can learn a much better message-passing algorithm than the handcrafted WMMSE.
V-B. Weighted Sum Rate Maximization
In this application, we consider single-antenna transceiver pairs in a square area. The transmitters are randomly located in the area, while each receiver is uniformly distributed at a distance between $d_{\min}$ and $d_{\max}$ from its corresponding transmitter. We adopt the channel model from [30]. To reduce the CSI training overhead, we assume that $h_{k,j}$ is available to the WCGCN only if the distance between transmitter $j$ and receiver $k$ is within a certain threshold. To provide a performance upper bound, global CSI is assumed to be available to WMMSE. The weights for weighted sum rate maximization, i.e., $w_k$ in (4), are generated from a uniform distribution in both the training and test datasets. For a specific parameter setting of the WCGCN (8), MLP1 and MLP2 are implemented as MLPs with a few hidden layers, and $\beta$ is the sigmoid function.

Performance comparison
We first test the performance of the WCGCN when the number of pairs is the same in the training and test datasets. We test the performance of the WCGCN under different values of $(d_{\min}, d_{\max})$, as shown in Table II. The entries in the table are the sum rates achieved by the different methods. We observe that the WCGCN with local CSI achieves performance competitive with WMMSE with global CSI.
[Table II. Weighted sum rates of WMMSE (global CSI) and the WCGCN (local CSI) under different settings of $(d_{\min}, d_{\max})$: (2 m, 65 m), (10 m, 50 m), (30 m, 70 m), (30 m, 30 m).]
Next, to test the generalization capability of the proposed method, we train WCGCN on a wireless network with tens of users and test it on wireless networks with hundreds or thousands of users, as shown in the following two simulations.
Generalization to larger scales
We first train the WCGCN on a fixed number of pairs in a fixed region. We then increase the number of pairs in the test set while keeping the density of users fixed. The results are shown in Table III. It can be observed that the performance is stable as the number of users increases, which shows that the WCGCN generalizes well to larger problem scales, consistent with our analysis.
[Table III. Generalization to larger scales: normalized sum rates of the WCGCN for increasing numbers of links and field lengths, for $(d_{\min}, d_{\max}) =$ (10 m, 50 m) and (30 m, 30 m).]
[Table IV. Generalization to higher densities: normalized sum rates of the WCGCN for increasing numbers of links in a fixed field, for $(d_{\min}, d_{\max}) =$ (10 m, 50 m) and (30 m, 30 m); the performance loss is shown in brackets.]
Generalization to higher densities
In this test, we first train the WCGCN on a fixed number of pairs in a fixed region. We then increase the number of pairs in the test set while keeping the area size fixed. The results are shown in Table IV, with the performance loss shown in brackets. The performance is stable up to a several-fold increase in density, and good performance is achieved even under much larger increases.
V-C. Beamformer Design
In this subsection, we consider beamforming for sum rate maximization in (4). Specifically, we consider transceiver pairs in a square area, where each transmitter is equipped with multiple antennas and each receiver with a single antenna. The transmitters are generated uniformly in the area and the receivers are generated uniformly at a distance between $d_{\min}$ and $d_{\max}$ from their corresponding transmitters. We adopt the channel model in [30]. The assumption on the CSI available to the WCGCN and WMMSE is the same as in the previous subsection. In the WCGCN, a complex number is treated as two real numbers. For a specific parameter setting of the WCGCN (8), MLP1 and MLP2 are implemented as MLPs with a few hidden layers, and $\beta$ normalizes the output to satisfy the power constraint.
Performance comparison
We first test the performance of the WCGCN when the numbers of pairs in the training and test datasets are the same, with each transmitter equipped with multiple antennas. We test the performance of the WCGCN under different $(d_{\min}, d_{\max})$. The results are shown in Table V. We observe that the WCGCN achieves performance comparable to WMMSE while using only local CSI, demonstrating the applicability of the proposed method to multi-antenna systems.
[Table V. Beamforming: normalized sum rates of WMMSE and the WCGCN under different settings of $(d_{\min}, d_{\max})$: (2 m, 65 m), (10 m, 50 m), (30 m, 70 m), (30 m, 30 m).]
Generalization to larger scales
We first train the WCGCN on a fixed number of pairs in a fixed region. We then increase the number of pairs while keeping the density of users fixed. The results are shown in Table VI. The performance is stable as the number of users increases, which is consistent with our theoretical analysis.
[Table VI. Beamforming, generalization to larger scales: normalized sum rates of the WCGCN for increasing numbers of links and field lengths, for $(d_{\min}, d_{\max}) =$ (2 m, 65 m) and (10 m, 50 m).]
Generalization to larger densities
We first train the WCGCN on a fixed number of pairs in a fixed region. We then increase the number of pairs while fixing the area size. The results are shown in Table VII, with the performance loss shown in brackets. The performance is stable up to a several-fold increase in density, and satisfactory performance is achieved under even larger increases. The performance deteriorates as the density grows further, which indicates that extra training is needed when the density in the test dataset is much larger than that in the training dataset.
[Table VII. Beamforming, generalization to higher densities: normalized sum rates of the WCGCN for increasing numbers of links in a fixed field, for $(d_{\min}, d_{\max}) =$ (2 m, 65 m) and (10 m, 50 m); the performance loss is shown in brackets.]
Computation time comparison
This test compares the running times of the different methods at different problem scales. We run "WCGCN GPU" on a GeForce GTX 1080 Ti, while the other methods run on an Intel(R) Xeon(R) CPU E5-2643 v4 @ 3.40 GHz. The neural network implementation exploits the parallel computation of the GPU, while WMMSE cannot do so due to its sequential computation flow. The running time, averaged over multiple problem instances, is shown in Fig. 4. The speedup over WMMSE grows as the problem scale increases, which benefits from the low computational complexity of the WCGCN. As shown in the figure, the computational complexity of the WCGCN on the CPU is linear and on the GPU is nearly constant, consistent with our analysis in Section III-C. Remarkably, the WCGCN is able to solve problems with thousands of users within milliseconds.
VI. Conclusions
In this paper, we developed a scalable neural network architecture based on GNNs to solve radio resource management problems. In contrast to existing learning-based methods, we focused on the neural architecture design to meet the key performance requirements, including low training cost, high computational efficiency, and good generalization. Moreover, we theoretically connected learning-based and optimization-based methods, which sheds light on the performance guarantees of "learning to optimize" approaches. We believe that this investigation will lead to profound implications in both theory and practice. As for future directions, it will be interesting to investigate the distributed deployment of MPGNNs for radio resource management in wireless networks, and to extend our theoretical results to more general application scenarios.
Appendix A: Proof of Proposition II.2
Following Proposition II.1, we have
$$f(\mathbf{P}\mathbf{x}, \mathbf{P}\mathbf{A}\mathbf{P}^T) = f(\mathbf{x}, \mathbf{A}) \tag{9}$$
for any variable $\mathbf{x}$, adjacency matrix $\mathbf{A}$, and permutation matrix $\mathbf{P}$. Hence, $f(\mathbf{x}, \mathbf{A}) \leq \alpha$ if and only if $f(\mathbf{P}\mathbf{x}, \mathbf{P}\mathbf{A}\mathbf{P}^T) \leq \alpha$, i.e., $\mathbf{x} \in C_\alpha(\mathbf{A})$ if and only if $\mathbf{P}\mathbf{x} \in C_\alpha(\mathbf{P}\mathbf{A}\mathbf{P}^T)$. This gives $C_\alpha(\mathbf{P}\mathbf{A}\mathbf{P}^T) = \mathbf{P} C_\alpha(\mathbf{A})$, which completes the proof. ∎
Appendix B: Proof of Proposition III.1
In the original graph, denote the input feature of node $v$ as $z_v$, the feature of edge $(u, v)$ as $e_{u,v}$, and the output of the $k$-th layer at node $v$ as $x_v^{(k)}$. In the permuted graph, denote the corresponding quantities as $\tilde{z}_v$, $\tilde{e}_{u,v}$, and $\tilde{x}_v^{(k)}$. Due to the permutation relationship, we have
$$\tilde{z}_{\pi(v)} = z_v, \qquad \tilde{e}_{\pi(u), \pi(v)} = e_{u,v}. \tag{11}$$
We prove the result by induction. First, we have $\tilde{x}_{\pi(v)}^{(0)} = x_v^{(0)}$ by (11). Assume $\tilde{x}_{\pi(v)}^{(k-1)} = x_v^{(k-1)}$. In the $k$-th layer, the update rule (7) is applied:
$$\tilde{x}_{\pi(v)}^{(k)} = \psi^{(k)}\left( \tilde{x}_{\pi(v)}^{(k-1)},\ \phi^{(k)}\left( \{\!\{ [\tilde{x}_{\pi(u)}^{(k-1)}, \tilde{e}_{\pi(u),\pi(v)}] : \pi(u) \in N(\pi(v)) \}\!\} \right) \right). \tag{12}$$
Since $\pi$ maps $N(v)$ bijectively onto $N(\pi(v))$ and $\phi^{(k)}$ is a multiset function, (11) and the induction hypothesis give $\tilde{x}_{\pi(v)}^{(k)} = x_v^{(k)}$, which completes the proof. ∎
Appendix C: Proof of Theorem IV.1
In MBDLAs, the maximal degree of the nodes is bounded by some constant. The update of an MBDLA at the $t$-th iteration can be written as
$$s_v^{(t)} = c^{(t)}\left( s_v^{(t-1)},\ a^{(t)}\left( \{\!\{ [s_u^{(t-1)}, e_{u,v}] : u \in N(v) \}\!\} \right) \right). \tag{13}$$
The update of an MPGNN at the $t$-th layer can be written as
$$x_v^{(t)} = \psi^{(t)}\left( x_v^{(t-1)},\ \phi^{(t)}\left( \{\!\{ [x_u^{(t-1)}, e_{u,v}] : u \in N(v) \}\!\} \right) \right). \tag{14}$$
1) We first show that the inference stage of an MPGNN can be viewed as an MBDLA, which is proved by induction. Before the algorithm and the neural network start, both $s_v^{(0)}$ and $x_v^{(0)}$ equal the node features, and thus $s_v^{(0)} = x_v^{(0)}$. We assume $s_v^{(t-1)} = x_v^{(t-1)}$. At the $t$-th iteration, the message $[x_u^{(t-1)}, e_{u,v}]$ is passed from agent $u$ to agent $v$. Then agent $v$ updates its local state with $a^{(t)} = \phi^{(t)}$ and $c^{(t)} = \psi^{(t)}$. By doing so, we have $s_v^{(t)} = x_v^{(t)}$.
2) We show that (13) can be written in the form of (14). Before the algorithm and the neural network start, both $s_v^{(0)}$ and $x_v^{(0)}$ equal the node features, and thus $x_v^{(0)} = s_v^{(0)}$. We assume $x_v^{(t-1)} = s_v^{(t-1)}$. At the $t$-th iteration, node $v$ aggregates features from its neighboring nodes, which form a multiset. We order the elements of the multiset according to their first coordinates, and let $\mathrm{sel}_i(\cdot)$ denote the function that selects the $i$-th element of the ordered multiset. Taking $\phi^{(t)}$ and $\psi^{(t)}$ to implement $a^{(t)}$ and $c^{(t)}$ through these selection functions, we obtain $x_v^{(t)} = s_v^{(t)}$. This completes the proof. ∎
Appendix D: Proof of Proposition IV.1
WMMSE [27] is a classic algorithm for weighted sum rate maximization in MIMO interfering broadcast channels. The WMMSE algorithm considers an interfering broadcast channel in which each base station (BS) serves multiple users. Denote $\mathbf{h}_{i,j}$ as the channel from BS $j$ to user $i$, $\mathbf{v}_i$ as the beamformer that the serving BS uses to transmit to user $i$, $\alpha_i$ as the weight of user $i$, and $\sigma_i^2$ as the noise variance at user $i$. The problem is to maximize the weighted sum rate subject to per-BS power constraints.
The WMMSE algorithm is shown in Algorithm 2.
We first model this system as a graph. We treat the $i$-th user as the $i$-th node in the graph. The node features consist of the weight $\alpha_i$, the direct channel, and the noise variance. The internal state of node $i$ at the $t$-th iteration consists of the WMMSE variables associated with user $i$. An edge is drawn from the $j$-th node to the $i$-th node if there is an interference link between the $j$-th BS and the $i$-th user. The feature of this edge is the interference channel $\mathbf{h}_{i,j}$.
We show that a WMMSE algorithm with $T$ iterations is an MBDLA with at most $2T$ iterations. We update the receive coefficients and MSE weights at the odd iterations, while updating the beamformers at the even iterations. Specifically, at the $t$-th iteration with $t$ odd, the $i$-th node broadcasts its state along its edges. Each edge appends its edge feature to the message, and the node receives the resulting message multiset. The agent first sums over the messages to form the aggregate interference term; then the $i$-th node updates its internal state via the WMMSE updates of the receive coefficient and the MSE weight. The even iterations update the beamformers analogously, which completes the proof. ∎