I Introduction
The defining feature of wireless communication is fading and the role of optimal wireless system design is to allocate resources across fading states to optimize long term system properties. Mathematically, we have a random variable
that represents the instantaneous fading environment, a corresponding instantaneous allocation of resources , and an instantaneous performance outcome resulting from the allocation of resources when the channel realization is . The instantaneous system performance tends to vary too rapidly from the perspective of end users for whom the long term average is a more meaningful metric. This interplay between instantaneous allocation of resources and long term performance results in distinctive formulations where we seek to maximize a utility of the long term average subject to the constraint . Problems of this form range from the simple power allocation in wireless fading channels – the solution of which is given by water filling – to the optimization of frequency division multiplexing [1], beamforming [2, 3], and random access [4, 5].
Optimal resource allocation problems are as widespread as they are challenging. This is because of the high dimensionality that stems from the variable being a function over a dense set of fading channel realizations and the lack of convexity of the constraint
. For resource allocation problems, such as interference management, heuristic methods have been developed
[6, 7, 8]. Generic solution methods are often undertaken in the Lagrangian dual domain. This is motivated by the fact that the dual problem is not functional, as it has as many variables as constraints, and is always convex whether the original problem is convex or not. A key property that enables this solution is the lack of duality gap, which allows dual operation without loss of optimality. The duality gap has long been known to be null for convex problems – e.g., the water level in water filling solutions is a dual variable – and has more recently been shown to be null under mild technical conditions despite the presence of the nonconvex constraint [9, 10]. This permits dual domain operation in a wide class of problems and has led to formulations that yield problems that are more tractable, although not necessarily tractable without resorting to heuristics [11, 12, 13, 14, 15, 16, 17].
The inherent difficulty of resource allocation problems makes the use of machine learning tools appealing. One may collect a training set composed of optimal resource allocations
for some particular instances and utilize the learning parametrization to interpolate solutions for generic instances
. The bottleneck step in this learning approach is the acquisition of the training set. In some cases this set is available by reverse engineering as it is possible to construct a problem having a given solution [18, 19]. In some other cases heuristics can be used to find approximate solutions to construct a training set [20, 21, 22]. This limits the performance of the learning solution to the performance of the heuristic, though the methodology has proven to work well at least in some particular problems.
Instead of acquiring a training set, one could exploit the fact that the expectation has a form that is typical of learning problems. Indeed, in the context of learning,
represents a feature vector,
the regression function to be learned, a loss function to be minimized, and the expectation
the statistical loss over the distribution of the dataset. We may then learn without labeled training data by directly minimizing the statistical loss with stochastic optimization methods which merely observe the loss at sampled pairs. This setting is typical of, e.g., reinforcement learning problems
[23], and is a learning approach that has been taken in several unconstrained problems in wireless optimization [24, 25, 26, 27]. In general, wireless optimization problems do have constraints as we are invariably trying to balance capacity, power consumption, channel access, and interference. Still, the fact remains that wireless optimization problems have a structure that is inherently similar to learning problems. This realization is the first contribution of this paper:
(C1) Parametrizing the resource allocation function yields an optimization problem with the structure of a learning problem in which the statistical loss appears as a constraint (Section II).
This observation is distinct from existing work in learning for wireless resource allocation. Whereas existing works apply machine learning methods to wireless resource allocation, such as via supervised training, here we identify that the wireless resource allocation is itself a statistical learning problem. This motivates the use of learning methods to directly solve the resulting optimization problems bypassing the acquisition of a training set. To do so, it is natural to operate in the dual domain where constraints are linearly combined to create a weighted objective (Section III). The first important question that arises in this context is the price we pay for learning in the dual domain. Our second contribution is to show that this question depends on the quality of the learning parametrization. In particular, if we use learning representations that are near universal, meaning that they can approximate any function up to a specified accuracy (Definition 1), we can show that dual training is close to optimal:

 (C2)
A second question that we address is the design of training algorithms for optimal resource allocation in wireless systems. The reformulation in the dual domain gives natural rise to a gradient-based, primal-dual learning method (Section III-B). The primal-dual method cannot be implemented directly, however, because computing gradients requires unavailable model knowledge. This motivates a third contribution:

 (C3)
This model-free approach additionally includes the policy gradient method for efficiently estimating the gradients of a function of a policy (Section IV-A). We remark that since the optimization problem is not convex, the primal-dual method does not converge to the optimal solution of the learning problem but to a stationary point of the KKT conditions [28]. This is analogous to unconstrained learning where stochastic gradient descent is known to converge only to a local minimum.
The quality of the learned solution inherently depends on the ability of the learning parametrization to approximate the optimal resource allocation function. In this paper we advocate for the use of neural networks:

(C4) We consider the use of deep neural networks (DNN) and conclude that since they are universal parameterizations, they can be trained in the dual domain without loss of optimality (Section V).
Together, the Lagrangian dual formulation, model-free algorithm, and DNN parameterization provide a practical means of learning in resource allocation problems with near-optimality. We conclude with a series of simulation experiments on a set of common wireless resource allocation problems, in which we demonstrate the near-optimal performance of the proposed DNN learning approach (Section VI).
II Optimal Resource Allocation in Wireless Communication Systems
Let be a random vector representing a collection of
stationary wireless fading channels drawn according to the probability distribution
. Associated with each fading channel realization, we have a resource allocation vector and a function . The components of the vector valued function represent performance metrics that are associated with the allocation of resources when the channel realization is . In fast time varying fading channels, the system allocates resources instantaneously but users get to experience the average performance across fading channel realizations. This motivates considering the vector ergodic average , which, for formulating optimal wireless design problems, is relaxed to the inequality
(1) 
In (1), we interpret as the level of service that is available to users and as the level of service utilized by users. In general we will have at optimal operating points, but this is not required a priori.
The goal in optimally designed wireless communication systems is to find the instantaneous resource allocation that optimizes the performance metric in some sense. To formulate this problem mathematically we introduce a vector utility function and a scalar utility function , taking values and , that measure the value of the ergodic average . We further introduce the set and , where is the set of functions integrable with respect to , to constrain the values that can be taken by the ergodic average and the instantaneous resource allocation, respectively. We assume contains bounded functions, i.e., that the resources being allocated are finite. With these definitions, we let the optimal resource allocation problem in wireless communication systems be a program of the form
(2) 
In (2) the utility is the one we seek to maximize while the utilities are required to be nonnegative. The constraint relates the instantaneous resource allocations with the long term average performances as per (1). The constraints and are set restrictions on and . The utilities and are assumed to be concave and the set is assumed to be convex. However, the function is not assumed convex or concave and the set is not assumed to be convex either. In fact, the realities of wireless systems make it so that they are typically nonconvex [10]. We present three examples below to clarify ideas and proceed to motivate and formulate learning approaches for solving (2).
Example 1 (Point-to-point wireless channel).
In a point-to-point channel we measure the channel state and allocate power to realize a rate assuming the use of capacity achieving codes. The metrics of interest are the average rate and the average power consumption . These two constraints are of the ergodic form in (1). We can formulate a rate maximization problem subject to power constraints with the utility and the set . Observe that the utility is concave (linear) and the set is convex (a segment). In this particular case the instantaneous performance functions and are concave. A similar example in which the instantaneous performance functions are not concave is when we use a set of adaptive modulation and coding modes. In this case the rate function is a step function [10].
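As a rough illustration of the water-filling solution that this kind of power allocation problem admits, the following Python sketch allocates power over sampled channel states by bisecting on the water level. The rate form log(1 + h p), the exponentially distributed fading samples, and the names `water_filling`, `p_max`, and `h_samples` are assumptions made for illustration only and are not taken from the text.

```python
import numpy as np

def water_filling(h, p_max, tol=1e-9):
    """Water filling over sampled channel gains h (assumed rate log(1 + h * p)).

    Returns the per-sample powers p = max(0, w - 1/h) whose sample average
    matches p_max, together with the water level w found by bisection.
    """
    h = np.asarray(h, dtype=float)
    inv = 1.0 / h
    lo, hi = 0.0, p_max + inv.max()  # at w = hi every sample gets at least p_max
    while hi - lo > tol:
        w = 0.5 * (lo + hi)
        avg_power = np.maximum(0.0, w - inv).mean()
        if avg_power > p_max:
            hi = w      # too much power spent, lower the water level
        else:
            lo = w      # budget not exhausted, raise the water level
    w = 0.5 * (lo + hi)
    return np.maximum(0.0, w - inv), w

rng = np.random.default_rng(0)
h_samples = rng.exponential(1.0, size=10_000) + 1e-6  # hypothetical fading gains
p, level = water_filling(h_samples, p_max=1.0)
avg_rate = np.log1p(h_samples * p).mean()              # sample-average rate
```

In this sketch the water level plays the role of the dual variable mentioned in the introduction: strong channels receive more power, and channels whose inverse gain exceeds the level receive none.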
Example 2 (Multiple access interference channel).
A set of terminals communicates with associated receivers. The channel linking terminal to its receiver is and the interference channel to receiver is given by . The power allocated in this channel is where . The instantaneous rate achievable by terminal depends on the signal to interference plus noise ratio (SINR) . Again, the quantity of interest for each terminal is the long term rate which, assuming use of capacity achieving codes, is
(3) 
The constraint in (3) has the form of (1) as it relates instantaneous rates with long term rates. The problem formulation is completed with a set of average power constraints . Power constraints can be enforced via the set and the utility can be chosen to be the weighted sum rate or a proportional fair utility . Observe that the utility is concave but the instantaneous rate function is not convex. A twist on this problem formulation is to make in which case individual terminals are either active or not for a given channel realization. Although this set is not convex, it is allowed in (2).
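The instantaneous quantities in this example can be sketched numerically as follows. The indexing convention `H[i, j]` (gain from transmitter j to receiver i), the unit noise power, the log(1 + SINR) rate, and the function names are assumptions for illustration; the exact expressions of the original equations are not reproduced here.

```python
import numpy as np

def sinr(H, p, noise=1.0):
    """SINR per terminal. H[i, j] is the assumed gain from transmitter j to receiver i."""
    H = np.asarray(H, dtype=float)
    p = np.asarray(p, dtype=float)
    received = H * p                        # received[i, j] = H[i, j] * p[j]
    signal = np.diag(received)              # power of the intended link
    interference = received.sum(axis=1) - signal
    return signal / (noise + interference)

def weighted_sum_rate(H, p, w, noise=1.0):
    """Weighted sum of instantaneous rates log(1 + SINR_i) for one channel realization."""
    return float(np.dot(w, np.log1p(sinr(H, p, noise))))

H = np.array([[1.0, 0.1],
              [0.2, 2.0]])
p = np.array([1.0, 0.5])
values = sinr(H, p)
total = weighted_sum_rate(H, p, w=np.array([0.5, 0.5]))
```

Because each terminal's power appears in the denominator of the other terminals' SINR, the rate is not concave in the power vector, which is the nonconvexity the example refers to.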
Example 3 (Time division multiple access).
In Example 2 terminals are allowed to transmit simultaneously. Alternatively, we can request that only one terminal be active at any point in time. This can be modeled by introducing the scheduling variable and rewriting the rate expression in (3) as
(4) 
where the interference term does not appear because we restrict channel occupancy to a single terminal. To enforce this constraint we define the set . This is a problem formulation in which, different from Example 2, we not only allocate power but channel access as well.
II-A Learning formulations
The problem in (2), which formally characterizes the optimal resource allocation policies for a diverse set of wireless problems, is generally a very difficult optimization problem to solve. In particular, two well known challenges in solving (2) directly are:

(i) The optimization variable is a function.
(ii) The channel distribution is unknown.
Challenge (ii) is of little concern as it can be addressed with stochastic optimization algorithms. Challenge (i) makes (2) a functional optimization problem, which, compounded with the fact that (1) defines a nonconvex constraint, entails large computational complexity. This is true even if we settle for a local minimum because we need to sample the dimensional space of fading realizations . If each channel is discretized to values the number of resource allocation variables to be determined is . As it is germane to the ideas presented in this paper, we point out that (2) is known to have null duality gap [10]. This, however, does not generally make the problem easy to solve and moreover requires having model information.
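To give a sense of the scaling involved in discretizing the fading space, the short sketch below counts the points of a full grid. It assumes every channel is quantized to the same number of values and that one allocation is stored per grid point; the names `levels` and `num_channels` are hypothetical.

```python
def grid_points(levels: int, num_channels: int) -> int:
    """Number of points in a full grid with `levels` values per channel."""
    return levels ** num_channels

# Growth of the grid for 10 quantization levels per channel.
counts = {m: grid_points(10, m) for m in (2, 5, 10)}
```

Under these assumptions the count grows exponentially with the number of channels, which is what makes direct tabulation of the allocation function impractical.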
This brings a third challenge in solving (2), namely the availability of the wireless system functions:

(iii) The form of the instantaneous performance function , utility , and constraint may not be known.
As we have seen in Examples 1–3, the function models instantaneous achievable rates. Although these functions may be available in ideal settings, there are difficulties in measuring the radio environment that make them uncertain. This issue is often neglected but it can cause significant discrepancies between predicted and realized performances. Moreover, with less idealized channel models or performance rate functions, such as bit error rate, reliable models may not even be available to begin with. While the functions and are sometimes known or designed by the user, we assume they are not here for complete generality.
Challenges (i)–(iii) can all be overcome with the use of a learning formulation. This is accomplished by introducing a parametrization of the resource allocation function so that for some we make
(5) 
With this parametrization the ergodic constraint in (1) becomes
(6) 
If we now define the set , the optimization problem in (2) becomes one in which the optimization is over and
(7) 
Since the optimization is now carried over the parameter and the ergodic variable , the number of variables in (7) is . This comes at a loss of optimality because (5) restricts resource allocation functions to adhere to the parametrization . E.g., if we use a linear parametrization it is unlikely that the solutions of (2) and (7) are close. In this work, we focus our attention on a widely-used class of parameterizations we define as near-universal, which are able to model any function in to within a stated accuracy. We present this formally in the following definition.
Definition 1.
A parameterization is an universal parameterization of functions in if, for some , there exists for any a parameter such that
(8) 
A number of popular machine learning models are known to exhibit the universality property in Definition 1, such as radial basis function networks (RBFNs) [29] and reproducing kernel Hilbert spaces (RKHS) [30]. This work focuses in particular on deep neural networks (DNNs), which can be shown to exhibit a universal function approximation property [31] and are observed to work remarkably well in practical problems; see, e.g., [32, 33]. The specific details regarding the use of DNNs in the proposed learning framework of this paper are discussed in Section V.
While the reduction of the dimensionality of the optimization space is valuable, the most important advantage of (7) is that we can use training to bypass the need to estimate the distribution and the functions . The idea is to learn over a time index across observed channel realizations and probe the channel with tentative resource allocations . The resulting performance is then observed and utilized to learn the optimal parametrized resource allocation as defined by (7). The major challenge to realize this idea is that existing learning methods operate in unconstrained optimization problems. We will overcome this limitation by operating in the dual domain where the problem is unconstrained (Section III). Our main result on learning for constrained optimization is to show that, its lack of convexity notwithstanding, the duality gap of (7) is small for near-universal parameterizations (Theorem 1). This result justifies operating in the dual domain as it does not entail a significant loss of optimality. A model-free primal-dual method to train (7) is then introduced in Section IV and neural network parameterizations are described in Section V.
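A parametrized resource allocation of the kind described above might look like the following minimal sketch, in which a small one-hidden-layer network maps channel states to bounded allocations. The architecture, the sigmoid output scaling, and the names `init_params`, `phi`, and `p_max` are illustrative assumptions rather than the parametrization used in the paper.

```python
import numpy as np

def init_params(m, hidden, n_out, rng):
    """Randomly initialize the weights of a one-hidden-layer network."""
    return {
        "W1": rng.normal(scale=0.1, size=(hidden, m)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(scale=0.1, size=(n_out, hidden)),
        "b2": np.zeros(n_out),
    }

def phi(h, theta, p_max=1.0):
    """Map channel states h (shape [batch, m]) to allocations in [0, p_max]."""
    z = np.maximum(0.0, h @ theta["W1"].T + theta["b1"])   # ReLU hidden layer
    out = z @ theta["W2"].T + theta["b2"]                  # linear output layer
    return p_max / (1.0 + np.exp(-out))                    # sigmoid keeps outputs bounded

rng = np.random.default_rng(0)
theta = init_params(m=4, hidden=16, n_out=4, rng=rng)
h = rng.exponential(1.0, size=(8, 4))                      # hypothetical channel samples
p = phi(h, theta)
num_parameters = sum(v.size for v in theta.values())
```

The number of optimization variables is now the parameter count (148 in this toy configuration) rather than one allocation per fading state, and the bounded output is one simple way to respect the assumption that allocated resources are finite.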
III Lagrangian Dual Problem
Solving the optimization problem in (7) requires learning both the parameter and the ergodic average variables over a set of both convex and nonconvex constraints. This can be done by formulating and solving the Lagrangian dual problem. To do so, introduce the nonnegative multiplier dual variables and , respectively associated with the constraints and . The Lagrangian of (7) is an average of objective and constraint values weighted by their respective multipliers:
(9)  
With the Lagrangian so defined, we introduce the dual function as the maximum Lagrangian value attained over all and
(10) 
We think of (10) as a penalized version of (7) in which the constraints are not enforced but their violation is penalized by the Lagrangian terms and . This interpretation is important here because the problem in (10) is unconstrained except for the set restrictions and . This renders (10) analogous to conventional learning objectives and, as such, a problem that we can solve with conventional learning algorithms.
It is easy to verify and well-known that for any choice of and we have . This motivates definition of the dual problem in which we search for the multipliers that make as small as possible
(11) 
The dual optimum is the best approximation we can have of when using (10) as a proxy for (7). It follows that the two concerns that are relevant in utilizing (10) as a proxy for (7) are: (i) evaluating the difference between and and (ii) designing a method for finding the optimal multipliers that attains the minimum in (11). We address (i) in Section III-A and (ii) in Section III-B.
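The penalized objective described in this section can be evaluated on samples as in the sketch below. The specific form used here, the objective utility plus multiplier-weighted constraint utilities plus multiplier-weighted slack of the ergodic constraint, is an assumption consistent with the verbal description; the function and argument names are hypothetical.

```python
import numpy as np

def lagrangian(x, f_samples, lam, mu, g0, g):
    """Sample-average Lagrangian (assumed form):

        g0(x) + mu . g(x) + lam . (mean_k f_k - x)

    where f_samples[k] is the performance vector observed for channel sample k.
    """
    ergodic_avg = np.mean(f_samples, axis=0)
    return g0(x) + np.dot(mu, g(x)) + np.dot(lam, ergodic_avg - x)

g0 = lambda x: np.sum(x)          # toy sum utility
g = lambda x: x - 0.5             # toy minimum-service constraints g(x) >= 0
f_samples = np.array([[1.0, 2.0],
                      [3.0, 2.0]])
x = np.array([1.0, 1.0])
value = lagrangian(x, f_samples,
                   lam=np.array([1.0, 0.5]),
                   mu=np.array([0.0, 2.0]),
                   g0=g0, g=g)
```

For a feasible point and nonnegative multipliers, every penalty term is nonnegative, so the Lagrangian is at least the objective value; this is the mechanism behind the bound that the dual function dominates the primal optimum.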
III-A Suboptimality of the dual problem
The duality gap is the difference between the dual and primal optima. For convex optimization problems this gap is null, which implies that one can work with the Lagrangian as in (10) without loss of optimality. The optimization problem in (7), however, is not convex as it incorporates the nonconvex constraint in (6). We will show here that despite the presence of this nonconvex constraint the duality gap is small when using parametrizations that are near universal in the sense of Definition 1. In proving this result we need to introduce some restrictions to the problem formulation that we state as assumptions next.
Assumption 1.
The probability distribution is nonatomic in . I.e., for any set of nonzero probability there exists a nonzero probability strict subset of lower probability, .
Assumption 2.
Assumption 3.
The objective utility function is monotonically nondecreasing in each component. I.e., for any it holds .
Assumption 4.
The expected performance function is expectation-wise Lipschitz on for all fading realizations . Specifically, for any pair of resource allocations and there is a constant such that
(13) 
Although Assumptions 1–4 restrict the scope of problems (2) and (7), they still allow consideration of most problems of practical importance. Assumption 2 simply states that service demands can be provisioned with some slack. We point out that an inequality analogous to (12) holds for the other constraints in (2) and (7). However, it is only the slack that appears in the bounds we will derive. Assumption 3 is a restriction on the utilities , namely that increasing performance values result in increasing utility. Assumption 4 is a continuity statement on each of the dimensions of the expectation of the constraint function , which we point out is weaker than general Lipschitz continuity. Referring back to the problems discussed in Examples 1–3, it is evident that they satisfy the monotonicity assumption in Assumption 3. Furthermore, the continuity assumption in Assumption 4 is immediately satisfied by the continuous capacity function in Examples 1 and 2, and is also satisfied by the binary problem in Example 3 due to the bounded expectation of the capacity function.
Assumption 1 states that there are no points of strictly positive probability in the distributions . This requires that the fading state take values in a dense set with a proper probability density – no distributions with delta functions are allowed. This is the most restrictive assumption in principle if we consider systems with a finite number of fading states. We observe that in reality fading does take on a continuum of values, though the channel estimation algorithms may quantize estimates to a finite number of fading states. We stress, however, that the learning algorithm we develop in the following sections does not depend upon this property, and may be directly applied to channels with discrete states.
The duality gap of the original (unparameterized) problem in (2) is known to be null – see Appendix A and [10]. Given the validity of Assumptions 1–4 and using a parametrization that is nearly universal in the sense of Definition 1, we show that the duality/parametrization gap between problems (2) and (11) is small as we formally state next.
Theorem 1.
Consider the parameterized resource allocation problem in (7) and its Lagrangian dual in (11) in which the parametrization is universal in the sense of Definition 1. If Assumptions 1–4 hold, then the dual value is bounded by
(14) 
where the multiplier norm can be bounded as
(15) 
in which is the strictly feasible point of Assumption 2.
Proof : See Appendix A.
Given any near-universal parameterization that achieves accuracy with respect to all resource allocation policies in , Theorem 1 establishes an upper and lower bound on the dual value in (11) relative to the optimal primal of the original problem in (2). The dual value is not greater than and, more importantly, not worse than a bias on the order of . These bounds justify the use of the parametrized dual function in (10) as a means of solving the (unparameterized) wireless resource allocation problem in (2). Theorem 1 shows that there exists a set of multipliers – those that attain the optimal dual value – that yield a problem that is within of optimal.
It is interesting to observe that the duality gap has a very simple dependence on problem constants. The factor comes from the error of approximating arbitrary resource allocations with parametrized resource allocations . The Lipschitz constant translates this difference into a corresponding difference between the functions and . The norm of the Lagrange multiplier captures the sensitivity of the optimization problem with respect to perturbations, which in this case comes from the difference between and . This latter statement is clear from the bound in (15). For problems in which the constraints are easy to satisfy, we can find feasible points close to the optimum so that and is not too small. For problems where constraints are difficult to satisfy, a small slack results in a meaningful variation in and a large value for the ratio . We point out that (15) is a classical bound in optimization theory that we include here for completeness.
III-B Primal-Dual learning
In order to train the parametrization on the problem (7) we propose a primal-dual optimization method. A primal-dual method performs gradient updates directly on both the primal and dual variables of the Lagrangian function in (9) to find a local stationary point of the KKT conditions of (7). In particular, consider that we successively update both the primal variables and dual variables over an iteration index . At each index of the primal-dual method, we update the current primal iterates by adding the corresponding partial gradients of the Lagrangian in (9), i.e. , and projecting to the corresponding feasible set, i.e.,
(16)  
(17) 
where we introduce as scalar step sizes. Likewise, we perform a gradient update on current dual iterates in a similar manner, by subtracting the partial stochastic gradients and projecting onto the positive orthant to obtain
(18)  
(19) 
with associated step sizes . The gradient primal-dual updates in (16)–(19) successively move the primal and dual variables towards maximum and minimum points of the Lagrangian function, respectively.
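The mechanics of such projected primal-dual updates can be seen on a toy scalar problem with known gradients. The sketch below is not the wireless problem of the paper: it assumes the problem of maximizing log(1 + x) subject to 1 - x >= 0 with x restricted to [0, 10], and the function name and step sizes are hypothetical. Both gradients are evaluated at the current iterate.

```python
import numpy as np

def primal_dual(steps=20000, gamma_x=0.05, gamma_lam=0.05):
    """Projected primal-dual gradient method on an assumed toy problem.

    maximize log(1 + x)  subject to  1 - x >= 0,  x in [0, 10]
    Lagrangian: L(x, lam) = log(1 + x) + lam * (1 - x)
    """
    x, lam = 0.0, 1.0
    for _ in range(steps):
        grad_x = 1.0 / (1.0 + x) - lam     # dL/dx
        grad_lam = 1.0 - x                 # dL/dlam
        # Primal ascent step followed by projection onto [0, 10].
        x = float(np.clip(x + gamma_x * grad_x, 0.0, 10.0))
        # Dual descent step followed by projection onto lam >= 0.
        lam = max(0.0, lam - gamma_lam * grad_lam)
    return x, lam

x_star, lam_star = primal_dual()
```

For this concave toy problem the iterates spiral into the saddle point x = 1, lam = 0.5, where the constraint is active and the multiplier equals the marginal utility 1 / (1 + x).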
The above gradient-based updates provide a natural manner by which to search for the optimal point of the dual function . However, direct evaluation of these updates requires both the knowledge of the functions , as well as the wireless channel distribution . We cannot always assume this knowledge is available in practice. Indeed, existing models for, e.g., capacity functions, do not always capture the true physical performance in practice. The primal-dual learning method presented is thus considered here only as a baseline method upon which we can develop a completely model-free algorithm. The details of model-free learning are discussed further in the following section.
IV Model-Free Learning
In this section, we consider that often in practice, we do not have access to explicit knowledge of the functions , , and , along with the distribution , but rather observe noisy estimates of their values at given operating points. While this renders the direct implementation of the standard primal-dual updates in (16)–(19) impossible, given their reliance on gradients that cannot be evaluated, we can use these updates to develop a model-free approximation. Consider that given any set of iterates and channel realization , we can observe stochastic function values , , and . For example, we may pass test signals through the channel at a given power or bandwidth to measure its capacity or packet error rate. These observations are, generally, unbiased estimates of the true function values.
We can then replace the updates in (16)–(19) with so-called zeroth-ordered updates, in which we construct estimates of the function gradients using observed function values. Zeroth-ordered gradient estimation can be done naturally with the method of finite differences, in which unbiased gradient estimators at a given point are constructed through random perturbations. Consider that we draw random perturbations and
from a standard Gaussian distribution and a random channel state
from . Finite-difference gradient estimates , , and can be constructed using function observations at given points and the sampled perturbations as
(20)  
(21)  
(22)  
where we define scalar step sizes . The expressions in (20)–(22) provide estimates of the gradients that can be computed using only two function evaluations. Indeed, the finite difference estimators can be shown to be unbiased, meaning that they coincide with the true gradients in expectation; see, e.g., [34]. Note also in (22) that, by sampling both the function and a channel state , we directly estimate the expectation . We point out that these estimates can be further improved by using batches of samples, , and averaging over the batch. We focus on the simple stochastic estimates in (20)–(22), however, for clarity of presentation.
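A two-evaluation estimator of this kind can be sketched as follows. The estimator form, a finite difference along a Gaussian direction multiplied by that direction, is a common construction and is assumed here rather than copied from (20)–(22); the toy linear function and all names are hypothetical.

```python
import numpy as np

def fd_gradient(f, x, alpha, rng):
    """One-sample zeroth-order gradient estimate from two function observations:

        (f(x + alpha * u) - f(x)) / alpha * u,   u ~ N(0, I)
    """
    u = rng.standard_normal(x.shape)
    return (f(x + alpha * u) - f(x)) / alpha * u

rng = np.random.default_rng(0)
a = np.array([1.0, -2.0, 0.5])
f = lambda x: float(a @ x)        # toy linear function whose true gradient is a
x0 = np.zeros(3)

# Averaging many single-sample estimates (a batch) reduces the variance.
estimate = np.mean([fd_gradient(f, x0, 1e-2, rng) for _ in range(100_000)], axis=0)
```

Each individual estimate is very noisy, so the batch average is what approaches the true gradient; for nonlinear functions the estimate additionally depends on the perturbation size `alpha`.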
Note that, while using the finite difference method to estimate the gradients of the deterministic function and is relatively simple, estimating the stochastic policy function is often a computational burden in practice when the parameter dimension is very large; indeed, this is often the case in, e.g., deep neural network models. An additional complication arises in that the function must be observed multiple times for the same sample channel state to obtain the perturbed value. This might be impossible to do in practice if the channel state changes rapidly. There indeed exists, however, an alternative model-free approach for estimating the gradient of a policy function, which we discuss in the next subsection.
IV-A Policy gradient estimation
The ubiquity of computing the gradients of policy functions such as in machine learning problems has motivated the development of a more practical estimation method. The so-called policy gradient method exploits a likelihood ratio property found in such functions to allow for an alternative zeroth-ordered gradient estimate. To derive the details of the policy gradient method, consider that a deterministic policy can be reinterpreted as a stochastic policy drawn from a distribution with density function defined with a delta function, i.e., . It can be shown that the Jacobian of the policy constraint function with respect to can be rewritten using this density function as
(23) 
where is a random variable drawn from distribution ; see, e.g., [35]. Observe in (23) that the computation of the Jacobian reduces to a function evaluation multiplied by the gradient of the policy distribution . Indeed, in the deterministic case where the distribution is a delta function, the gradient cannot be evaluated without knowledge of and . However, we may approximate the delta function with a known density function centered around , e.g., a Gaussian distribution. If an analytic form for is known, we can estimate by instead directly estimating the left-hand side of (23). In the context of reinforcement learning, this is called the REINFORCE method [35]. By using the previous function observations, we can obtain the following policy gradient estimate,
(24) 
where is a sample drawn from the distribution .
The policy gradient estimator in (24) can be taken as an alternative to the finite difference approach in (22) for estimating the gradient of the policy constraint function, provided the gradient of the density function can itself be evaluated. Observe in the above expression that the policy gradient approach replaces a sampling of the parameter with a sampling of a resource allocation . This is indeed preferable for many sophisticated learning models in which . We stress that while policy gradient methods are preferable in terms of sampling complexity, they come at the cost of introducing an additional approximation through the use of stochastic policies with analytical density functions .
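The likelihood-ratio idea can be illustrated with a scalar Gaussian policy, for which the gradient of the log-density is available in closed form. The sketch below assumes a policy p ~ N(theta, sigma^2), a toy quadratic performance function, and hypothetical names; it is a generic score-function estimator rather than a transcription of (24).

```python
import numpy as np

def policy_gradient(f_obs, theta, sigma, rng, batch=1):
    """Score-function estimate of d/dtheta E[f(p)] with p ~ N(theta, sigma^2).

    Only observed values f_obs(p_hat) are used; for a Gaussian policy
    grad_theta log pi(p_hat) = (p_hat - theta) / sigma^2.
    """
    p_hat = theta + sigma * rng.standard_normal(batch)   # sampled allocations
    score = (p_hat - theta) / sigma**2                    # gradient of the log-density
    return float(np.mean(f_obs(p_hat) * score))

rng = np.random.default_rng(0)
f_obs = lambda p: -(p - 3.0) ** 2    # toy performance function, treated as a black box
grad = policy_gradient(f_obs, theta=1.0, sigma=0.5, rng=rng, batch=400_000)
```

For this toy function the exact gradient of the expected performance at theta = 1 is 4, and the batch estimate lands close to it. Note that only one function observation per sample is needed, and that the samples are drawn in the allocation space rather than in the parameter space.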
IV-B Model-free primal-dual method
Using the gradient estimates in (20)–(22), or (24), we can derive model-free, or zeroth-ordered, stochastic updates to replace those in (16)–(19). By replacing all function evaluations with the function observations and all gradient evaluations with the finite difference estimates, we can perform the following stochastic updates
(25)  
(26)  
(27)  
(28) 
The expressions in (25)–(28) provide a means of updating both the primal and dual variables in a primal-dual manner without requiring any explicit knowledge of the functions or channel distribution through observing function realizations at the current iterates. We may say this method is model-free because all gradients used in the updates are constructed entirely from measurements, rather than analytic computation done via model knowledge. The complete model-free primal-dual learning method is summarized in Algorithm 1. The method is initialized in Step 1 through the selection of parameterization model and form of the stochastic policy distribution and in Step 2 through the initialization of the primal and dual variables. For every step , the algorithm begins in Step 4 by drawing random samples (or batches) of the primal and dual variables. In Step 5, the model functions are sampled at both the current primal and dual iterates and at the sampled points. These function observations are then used in Step 6 to form gradient estimates via finite difference (or policy gradient). Finally, in Step 7 the model-free gradient estimates are used to update both the primal and dual iterates.
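The overall loop can be sketched on a single point-to-point channel as below. This is a simplified illustration and not Algorithm 1 itself: the ergodic variable is eliminated, there is a single average power constraint, the policy is a Gaussian centered at softplus(theta[0] + theta[1] * h), and the rate log(1 + h p), the fading model, the step sizes, and all names are assumptions.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)

def sigmoid(z):
    return 0.5 * (1.0 + np.tanh(0.5 * z))   # numerically stable sigmoid

def train(steps=2000, batch=64, p_max=1.0, sigma=0.3,
          gamma_theta=0.01, gamma_lam=0.01, seed=0):
    """Model-free primal-dual sketch: policy-gradient ascent on theta,
    projected gradient descent on the power multiplier lam."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)   # policy mean is softplus(theta[0] + theta[1] * h)
    lam = 1.0             # multiplier of the constraint E[p] <= p_max
    for _ in range(steps):
        h = rng.exponential(1.0, size=batch)                # sampled channel states
        z = theta[0] + theta[1] * h
        mean = softplus(z)
        p_raw = mean + sigma * rng.standard_normal(batch)   # sample the stochastic policy
        p = np.maximum(p_raw, 0.0)                          # power actually transmitted
        rate = np.log1p(h * p)                              # "observed" rate; model unknown to learner
        reward = rate - lam * p                             # Lagrangian terms that depend on theta
        dmean = sigmoid(z)[:, None] * np.stack([np.ones(batch), h], axis=1)
        score = ((p_raw - mean) / sigma**2)[:, None] * dmean   # grad_theta log pi
        grad_theta = np.mean(reward[:, None] * score, axis=0)
        theta = theta + gamma_theta * grad_theta            # primal ascent
        lam = max(0.0, lam - gamma_lam * (p_max - p.mean()))   # dual descent + projection
    return theta, lam

theta, lam = train()
```

The learner never differentiates the rate function; it only multiplies observed rewards by the score of its own sampling distribution, while the multiplier rises whenever the observed average power exceeds the budget and falls otherwise.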
We briefly comment on the known convergence properties of the model-free learning method in (25)-(28). Due to the nonconvexity of the Lagrangian defined in (9), the stochastic primal-dual descent method will converge only to a local optimum and is not guaranteed to converge to a point that achieves . These are the same convergence properties as those of general unconstrained nonconvex learning problems. We instead demonstrate the performance of the proposed learning method in practical wireless resource allocation problems through the numerical simulations in Section VI.
Remark 1.
The algorithm presented in Algorithm 1 is generic in nature and can be supplemented with more sophisticated learning techniques that improve the learning process. Examples include the use of entropy regularization to improve policy optimization in nonconvex problems [36]. Policy optimization can also be improved using actor-critic methods [23], while a model function estimate can be used to obtain “supervised” training signals with which to initialize the parameterization vector . The use of such techniques in optimal wireless design is not explored in detail here and is left for future work.
V Deep Neural Networks
We have so far discussed theoretical and algorithmic means of learning in wireless systems by employing any near-universal parametrization as defined in Definition 1. In this section, we restrict our attention to the increasingly popular set of parameterizations known as deep neural networks (DNNs), which are often observed in practice to exhibit strong performance in function approximation. In particular, we discuss the details of the DNN parametrization model and both the theoretical and practical implications within our constrained learning framework.
The exact form of a particular DNN is described by what is commonly referred to as its architecture. The architecture consists of a prescribed number of layers, each of which consists of a linear operation followed by a pointwise nonlinearity, also known as an activation function. In particular, consider a DNN with layers, labelled and each with a corresponding dimension . The layer is defined by the linear operation followed by a nonlinear activation function . If layer receives as an input from the layer , the resulting output is then computed as . The final output of the DNN, , is then related to the input by propagating through each layer of the DNN as .
An illustration of a fully connected example DNN architecture is given in Figure 1. In this example, the inputs are passed through a single hidden layer, following which is an output layer. The grey lines between layers reflect the linear transformation , while each node contains an additional elementwise activation function . This general DNN structure has been observed to have remarkable generalization and approximation properties in a variety of functional parameterization problems.
The goal in learning DNNs in general then reduces to learning the linear weight functions . Common choices of activation functions include a sigmoid function, a rectifier function (commonly referred to as ReLU), as well as a smooth approximation to the rectifier known as softplus. For the parameterized resource allocation problem in (IIA), the policy can be defined by an layer DNN as
(29) 
where contains the entries of with . Note that by construction.
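The layered structure described above can be sketched as a short forward pass. The sketch assumes bias-free linear layers stored as nested lists and applies the activation elementwise; the helper names (`matvec`, `dnn_forward`) and the example weights are hypothetical and chosen only for illustration.

```python
import math

def relu(z):
    return [max(v, 0.0) for v in z]

def softplus(z):
    # Smooth approximation to the rectifier.
    return [math.log1p(math.exp(v)) for v in z]

def sigmoid(z):
    return [1.0 / (1.0 + math.exp(-v)) for v in z]

def matvec(W, x):
    """Linear operation of one layer: multiply the weight matrix W by the vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def dnn_forward(weights, activations, h):
    """Propagate the input h through each layer: linear map, then activation."""
    out = h
    for W, act in zip(weights, activations):
        out = act(matvec(W, out))
    return out

if __name__ == "__main__":
    # Two inputs, one hidden layer of three nodes, one output node.
    weights = [
        [[1.0, -1.0], [0.5, 0.5], [-1.0, 0.0]],
        [[1.0, 2.0, 3.0]],
    ]
    print(dnn_forward(weights, [relu, relu], [2.0, 1.0]))
```

In this representation the learnable parameter is simply the collection of all weight matrix entries, and any of the activation functions defined here can be supplied per layer.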
To contextualize the primal-dual algorithm in (16)-(19) with respect to traditional neural network training, observe that the update in (16) requires computation of the gradient . Using the chain rule, this can be expanded as
(30)  
Thus, the computation of the full gradient requires evaluating the gradient of the policy function as well as the gradient of the DNN model . For the DNN structure in (29), the evaluation of may itself also require a chain rule expansion to compute partial derivatives at each layer of the network. This process of computing gradients layer by layer in order to perform gradient descent on the DNN weights is commonly referred to as backpropagation.
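As a small worked instance of this chain rule expansion, the sketch below differentiates a performance function through a one-hidden-layer network and checks the result against central differences. The rate function `log(1 + h * p)`, the sigmoid hidden layer with a linear output, and all names and numerical values are assumptions chosen for illustration rather than the model used in the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def policy(h, w1, w2):
    """Toy policy: p = sum_j w2[j] * sigmoid(w1[j] * h)."""
    return sum(b * sigmoid(a * h) for a, b in zip(w1, w2))

def rate(h, p):
    return math.log(1.0 + h * p)

def chain_rule_gradient(h, w1, w2):
    """Gradient of rate(h, policy(h, w1, w2)) with respect to w1 and w2."""
    hidden = [sigmoid(a * h) for a in w1]
    p = sum(b * s for b, s in zip(w2, hidden))
    df_dp = h / (1.0 + h * p)                 # gradient of the performance function
    grad_w2 = [df_dp * s for s in hidden]     # output layer partial derivatives
    grad_w1 = [df_dp * b * s * (1.0 - s) * h  # hidden layer, one more chain rule step
               for b, s in zip(w2, hidden)]
    return grad_w1, grad_w2

def numerical_gradient(h, w1, w2, eps=1e-6):
    """Central difference check of the same gradient."""
    m = len(w1)
    params = list(w1) + list(w2)

    def objective(q):
        return rate(h, policy(h, q[:m], q[m:]))

    grad = []
    for i in range(len(params)):
        up, down = params[:], params[:]
        up[i] += eps
        down[i] -= eps
        grad.append((objective(up) - objective(down)) / (2.0 * eps))
    return grad[:m], grad[m:]

if __name__ == "__main__":
    print(chain_rule_gradient(1.5, [0.3, -0.7], [0.8, 0.4]))
    print(numerical_gradient(1.5, [0.3, -0.7], [0.8, 0.4]))
```

The two printed gradients should agree to several decimal places, which is a common sanity check before relying on a hand-derived backward pass.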
We further note how our learning approach differs from a more traditional, supervised training of DNNs. As in (30), the backpropagation is performed with respect to the given policy constraint function , rather than with respect to a Euclidean loss function over a set of given training data. Furthermore, due to the constraints, the backpropagation step in (16) is performed in sequence with the more standard primal and dual variable updates in (17)-(19). In this way, the DNN is trained indirectly within the broader optimization algorithm used to solve (IIA). This is in contrast with other approaches to training DNNs in constrained wireless resource allocation problems (see, e.g., [20, 21, 22]), which train a DNN to approximate the complete constrained maximization function in (II) directly. Doing so requires the ability to solve (II), either exactly or approximately, enough times to acquire a labeled training set. The primal-dual learning approach taken here is preferable in that it does not require the use of training data. The dual problem can be seen as a simplified reinforcement learning problem, one in which the actions do not affect the next state.
For DNNs to be a valid parametrization with respect to the result in Theorem 1, we must first verify that they satisfy the near-universality property in Definition 1. Indeed, deep neural networks are popular parameterizations for arbitrary functions precisely due to the richness inherent in (29), which in general grows with the number of layers and associated layer sizes . This richness property of DNNs has been the subject of mathematical study and is formally referred to as complete universal function approximation [37, 31]. In words, this property implies that a large class of functions can be approximated to arbitrarily small error using a DNN parameterization of the form in (29) with only a single layer of arbitrarily large size. With this property in mind, we can present the following theorem, which extends the result in Theorem 1 to the special case of DNNs.
Theorem 2.
Consider the DNN parametrization in (29) with nonconstant, continuous activation functions for . Define the vector of layer lengths and a DNN defined in (29) with lengths as . Now consider the set of possible layer DNN parameterization functions . If Assumptions 1–4 hold, then the optimal dual value of the parameterized problem satisfies
(31) 
Proof : See Appendix B.
With Theorem 2 we establish the null duality gap property of a resource allocation problem of the form in (IIA) given a DNN parameterization that achieves arbitrarily small function approximation error as the dimension of the DNN parameter, i.e., the number of hidden nodes, grows to infinity. While such a parametrization is indeed guaranteed to exist through the universal function approximation theorem, one would require a DNN of arbitrarily large size to obtain such a network in practice. As such, the suboptimality bounds presented in Theorem 1, which require only a DNN approximation of given accuracy , provide the more practical characterization of (IIA), while the result in Theorem 2 suggests DNNs can be used to find parameterizations of arbitrarily high accuracy.
VI Simulation Results
In this section, we provide simulation results on using the proposed primal-dual learning method to solve for DNN parameterizations of resource allocation in a number of common problems in wireless communications that take the form in (II). For the simulations performed, we employ a stochastic policy and implement the REINFORCE-style policy gradient described in Section IV-A. In particular, we select the policy distribution as a truncated Gaussian distribution. The truncated Gaussian distribution has fixed support on the domain . The output layer of the DNN is the set of means and standard deviations to specify the respective truncated Gaussian distributions, i.e. . Furthermore, to represent policies that are bounded on the support interval, the output of the last layer is fed into a scaled sigmoid function such that the mean lies in the area of support and the variance is no more than the square root of the support region. In the following experiments, this interval is [0, 10].
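A minimal sketch of such an output stage is given below. It reflects one possible reading of the description above: the raw network outputs are squashed by scaled sigmoids so that the mean lies in [0, 10] and the scale parameter is at most the square root of 10, and samples are drawn by simple rejection. The function names and the rejection sampler are assumptions made for illustration, not the implementation used in the experiments.

```python
import math
import random

def scaled_sigmoid(z, upper):
    """Map a raw network output z to the open interval (0, upper)."""
    return upper / (1.0 + math.exp(-z))

def policy_parameters(raw_mean, raw_scale, support=10.0):
    # Assumed reading: the mean is confined to the support and the scale
    # parameter is bounded by the square root of the support length.
    mean = scaled_sigmoid(raw_mean, support)
    std = scaled_sigmoid(raw_scale, math.sqrt(support))
    return mean, std

def sample_truncated_gaussian(mean, std, support=10.0, rng=random):
    """Draw from a Gaussian truncated to [0, support] by rejection sampling."""
    while True:
        p = rng.gauss(mean, std)
        if 0.0 <= p <= support:
            return p

if __name__ == "__main__":
    rng = random.Random(0)
    mean, std = policy_parameters(0.0, 0.0)
    print(mean, std, [sample_truncated_gaussian(mean, std, rng=rng) for _ in range(3)])
```

Rejection sampling is adequate here because the mean always lies inside the support, so roughly half of the proposals or more are accepted. A library routine for truncated normal sampling could be substituted where one is available.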
For updating the primal and dual variables, we use a batch size of 32. The primal-dual method is performed with an exponentially decaying step size for dual updates and the ADAM optimizer [38] for the DNN parameter update. Both updates start with a learning rate of 0.0005, while random channel conditions are generated with an exponential distribution with parameter (to represent the square of a unit variance Rayleigh fading channel state).
VI-A Simple AWGN channel
To begin, we simulate the learning of a DNN to solve the problem of maximizing total capacity over a set of simple AWGN wireless fading channels. In this case, each user is given a dedicated channel to communicate, and we wish to allocate resources between users within a total expected power budget . The capacity over the channel can be modeled as , where is the signal-to-noise ratio experienced by user and is the noise variance. The capacity function for each user is thus given by . We are interested in maximizing the weighted aggregate throughput across all users, with user weighted by . The total capacity problem can be written as
(32)  
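For reference, the objective of this problem can be evaluated numerically as in the sketch below. It assumes the standard AWGN capacity form `log(1 + h * p / v)` in nats, unit-rate exponential fading, and an equal-power baseline that splits the budget evenly across users. The function names, the fading rate, and the baseline are assumptions for illustration and are not taken from the experiments.

```python
import math
import random

def weighted_capacity(weights, fades, powers, noise_vars):
    """Weighted sum over users of log(1 + h * p / v), measured in nats."""
    return sum(
        w * math.log(1.0 + h * p / v)
        for w, h, p, v in zip(weights, fades, powers, noise_vars)
    )

def average_equal_power_capacity(weights, noise_vars, p_max,
                                 samples=10000, seed=0):
    """Monte Carlo average of the weighted capacity under an equal power split."""
    rng = random.Random(seed)
    m = len(weights)
    powers = [p_max / m] * m                  # meets the power budget with equality
    total = 0.0
    for _ in range(samples):
        fades = [rng.expovariate(1.0) for _ in range(m)]  # assumed unit-rate fading
        total += weighted_capacity(weights, fades, powers, noise_vars)
    return total / samples

if __name__ == "__main__":
    print(average_equal_power_capacity([1.0, 2.0], [1.0, 1.0], p_max=2.0))
```

A fixed split of this kind ignores the channel state entirely, so it serves only as a simple baseline against which a learned, channel-dependent policy can be compared.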