1 Introduction
To address the upcoming complex challenges in aerospace and communications research, recent studies are looking into Machine Learning (ML) methods as potential problem-solvers [Abbas2015]. One such method is Deep Reinforcement Learning (DRL), which has been widely adopted in the community [Luong2019, Ferreira2019]. In the specific case of satellite communications, DRL has already shown its usefulness in problems like channel allocation [Hu2018], beam-hopping [Hu2020], and dynamic power allocation [luis2019deep, Zhang2020].
The motivation behind the use of DRL-based decision-making agents for satellite communications originates mainly from the forthcoming need to automate the control of a large number of satellite variables and beams in real time. Large flexible constellations, with thousands of beams and a broad range of different users, will need to autonomously reallocate resources such as bandwidth or power in real time in order to address a highly fluctuating demand [NorthernSkyResearch2019]. However, the new time and dimensionality requirements pose a considerable challenge to previously adopted optimization approaches such as mathematical programming [HengWang2013] or metaheuristics [Aravanis2015, Cocco2018, Durand2017]. In contrast, due to the training frameworks common in ML algorithms, DRL has the potential to meet these operational constraints [Luis2020].
Despite the positive results of DRL for satellite communications and many other fields, most applied research studies focus on best-case scenarios, provide little insight into modelling decisions, and fail to address the operational considerations of deploying DRL-based systems in real-world environments. In addition to the inherent reproducibility problems of DRL algorithms [Henderson2018], recent studies show the drastic consequences of not properly addressing phenomena present in real-world environments [DulacArnold2020], such as nonstationarity, high dimensionality, partial observability, or safety constraints.
As mentioned, high dimensionality and nonstationarity are especially important in satellite communications. Mega-constellations are already a reality, and they require models that scale to hundreds or thousands of beams. Studies like [Hu2018, Hu2020, Zhang2020] validate the proposed DRL models for cases with fewer than 50 beams, but do not address performance in high-dimensional scenarios. In addition, user bases are becoming volatile and their demand highly fluctuating. Relying on models that assume static user distributions could be detrimental during real-time operations. Therefore, it is crucial that applied research studies also discuss how their models behave under user nonstationarity and how they propose to address any negative effects.
One of the challenges of upcoming large constellations is how to efficiently assign a part of the frequency spectrum to each beam while respecting the interference constraints present in the system. This is known as the Frequency Plan Design (FPD) problem. Since users’ behavior is highly dynamic, new frequency plans must be computed in real time to satisfy the demand. Given these runtime and adaptability requirements, in this paper we propose a DRL-based solution able to make the frequency allocation decisions required at every moment.
In our study we shift the target from the problem to the implementation methodology. Our aim is to provide a holistic view of how the different elements in a DRL model affect the performance and how their relevance changes depending on the scenario considered. In addition to nominal conditions, we also study the effect of dimensionality and nonstationarity on our models. Our goal is to contribute to the interpretability of DRL models in order to make progress towards real-world deployments that guarantee a robust performance.
The remainder of this paper is structured as follows: Section 2 presents a short overview of DRL and its main performance drivers; Section 3 introduces the FPD problem, focusing on its decisions and constraints; Section 4 outlines our approach to solving the problem with a DRL method; Section 5 discusses the results of the model in multiple scenarios; and finally Section 6 outlines the conclusions of the paper.
2 Deep Reinforcement Learning
Deep Reinforcement Learning is a Machine Learning subfield that combines the Reinforcement Learning (RL) paradigm [sutton2018reinforcement] and Deep Learning frameworks [goodfellow2016deep] to address complex sequential decision-making problems. The RL paradigm involves the interaction between an agent and an environment that follows a Markov Decision Process and is external to the agent. This interaction is sequential and divided into a series of timesteps. At each of these timesteps, the agent observes the state of the environment and takes an action based on it. As a consequence of that action, the agent receives a reward that quantifies the value of the action according to the high-level goal of the interaction (e.g., flying a drone or winning a game). Then, the environment updates its state taking into account the action from the agent.
The interaction goes on until a terminal state is reached (e.g., the drone lands). In this state, the agent cannot take any further action. The sequence of timesteps that leads to a terminal state is known as an episode. Based on the experience from multiple episodes, the objective of the agent is to optimize its policy, which defines the mapping from states to actions. The agent’s optimal policy is the policy that maximizes the expected average or cumulative reward over an episode. This policy might be stochastic depending on the nature of the environment. Table 1 summarizes the parameters defined so far and includes the symbols generally used in the literature.
Symbol  Parameter

$s_t$  State of the environment at timestep $t$
$a_t$  Action taken by the agent at timestep $t$
$r_t$  Reward received at timestep $t$
$\mathcal{S}$  Set of possible states
$\mathcal{A}$  Set of possible actions
$\mathcal{A}(s_t)$  Set of available actions at state $s_t$
$s_T$  Terminal state
$\pi$  Policy
$\pi^*$  Optimal policy
$\pi(a_t \mid s_t)$  Probability of action $a_t$ given state $s_t$
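As an illustration, the agent-environment interaction described above can be sketched in Python. This is a minimal sketch and not the setup used in this study: the `ToyEnv` class, its parity-based reward, the five-step horizon, and both policies are hypothetical and serve only to show the observe-act-reward loop that runs until a terminal state closes the episode.

```python
import random


class ToyEnv:
    """Hypothetical episodic environment: the state is a step counter."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Illustrative reward: 1 when the action matches the state's parity.
        reward = 1.0 if action == self.state % 2 else 0.0
        self.state += 1
        done = self.state >= self.horizon  # terminal state reached
        return self.state, reward, done


def run_episode(env, policy):
    """Run one episode; return the cumulative reward and the episode length."""
    state = env.reset()
    total_reward, steps, done = 0.0, 0, False
    while not done:
        action = policy(state)                  # agent observes the state and acts
        state, reward, done = env.step(action)  # environment returns reward, new state
        total_reward += reward
        steps += 1
    return total_reward, steps


def parity_policy(state):
    """Deterministic policy that is optimal for ToyEnv."""
    return state % 2


def make_random_policy(seed=0):
    """Stochastic policy that ignores the state."""
    rng = random.Random(seed)
    return lambda state: rng.choice([0, 1])
```

Calling `run_episode(ToyEnv(), parity_policy)` collects the maximum cumulative reward for this toy environment, whereas the random policy typically collects less over the same number of timesteps.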
In simple environments, the most effective policy is usually given by a table that maps each unique state to an action or a probability distribution over actions. However, when the state and/or action spaces are large or continuous, using a tabular policy becomes impractical. Deep Reinforcement Learning (DRL) addresses such cases by replacing tabular policies with neural network-based policies that approximate the mapping from states to actions. Those policies are then updated following Deep Learning algorithms, such as Stochastic Gradient Descent and backpropagation. DRL has shown significant results in a wide variety of areas like games [mnih2015human], molecular design [Popova2018], or internet recommendation systems [Theocharous2015].
There are six elements that jointly drive the performance of a DRL model:

State representation. We want to look for state representations that capture as much information about the environment as possible and can be easily fed into a neural network. In the case of vision-based robots, that might simply be RGB images from cameras. In other contexts, the process sometimes involves more complex state design strategies. It is also referred to as the state space.

Action space. Although flexibility matters, reducing the action space is generally beneficial to the algorithm. While, for instance, video games’ actions are straightforward, other environments might require changes to the action space, such as discretization or pruning. It is also referred to as the action representation.

Reward function. The reward function is relevant during the training stage of the algorithm, and should also be informative with respect to the goal of the agent. There might be more than one reward function that correctly guides the learning, although there is usually a performance gap between classes of strategies: constant reward, sparse reward, rollout policy-based reward, etc.

Policy. In DRL the policy is mainly represented by the neural network architecture chosen. DRL policies commonly include fully connected layers in addition to convolutional and/or recurrent layers. Each of these layer classes includes multiple subclassifications to consider.

Policy optimization algorithm. There are many algorithms that rely on completely different design choices. In model-free DRL, alternatives include policy gradient methods, Q-learning, or hybrid approaches.

Training procedure. The training dataset and training environment constitute an important part of the performance. We want to reflect all phenomena that might take place during testing and make sure that the training procedure guarantees a robust performance. Sometimes, however, the possibilities of interaction with real environments are limited and models must resort to data-efficient strategies.
3 Frequency Plan Design Problem
The Frequency Plan Design problem consists of assigning a portion of the available spectrum to every beam in a multibeam satellite system, with the goal of satisfactorily serving all the system’s users. Although it is a well-studied problem, the high dimensionality and flexible nature of new satellites add a layer of complexity that motivates the exploration of new algorithmic solutions such as DRL to address it.
3.1 Decisions
In this paper we consider a multibeam satellite constellation with satellites. We assume a beam placement is provided; we define the total number of beams as . Each of these beams has an individual data rate requirement. To satisfy this demand, spectrum resources must be allocated to each beam. For the purpose of this paper, we consider that the amount of bandwidth needed by every beam is given, although our framework could be adapted to satisfy dynamic bandwidth requirements. While bandwidth is a continuous resource, most satellites divide up their frequency spectrum into frequency chunks or slots of equal bandwidth, and therefore we consider that each beam is assigned a certain discrete number of frequency slots. We denote the number of slots that beam needs as . The only remaining decisions are therefore which specific frequency slots to assign and which frequency reuse mechanisms to use.
In this work we assume satellites can reuse spectrum. Each satellite of the constellation has an equal available spectrum consisting of consecutive frequency slots. In addition, there is access to frequency reuse mechanisms in the form of reuse groups and two polarizations. A combination of a specific reuse group and polarization is termed a frequency group. The constellation has a total of frequency groups, which is twice the number of reuse groups (we consider left-handed and right-handed circular polarization available for each beam). For each frequency group, there is an equal number of frequency slots available.
To better represent this problem, we use a grid in which the columns and the rows represent the frequency slots and the frequency groups, respectively. This is pictured in Figure 1. In the case of the frequency groups, we assume these are sorted first by reuse group and then by polarization (e.g., row 1 corresponds to reuse group 1 and left-handed polarization, and row 2 corresponds to reuse group 1 and right-handed polarization).
The frequency assignment operation for beam consists of, first, selecting one of the frequency groups and, second, picking consecutive frequency slots from the total in that group. In other words, two decisions need to be made per beam: 1) the frequency group and 2) the first frequency slot this beam will occupy in the group. This is represented in Figure 1 by the black squares that designate the assignment decision for each beam.
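To make the two per-beam decisions concrete, the grid and a single assignment can be sketched as follows. This is an illustrative sketch under stated assumptions, not the data structure used in this study: the function names, the zero-based indexing, and the choice to store a per-cell occupancy count are all hypothetical.

```python
def make_grid(n_groups, n_slots):
    """Empty grid: rows are frequency groups, columns are frequency slots."""
    return [[0] * n_slots for _ in range(n_groups)]


def occupied_slots(first_slot, n_needed):
    """Consecutive slots a beam occupies, given its first slot and slot demand."""
    return list(range(first_slot, first_slot + n_needed))


def assign(grid, group, first_slot, n_needed):
    """Apply the two decisions for one beam: its group and its first slot.

    Indices are zero-based (an assumption of this sketch). Each cell holds the
    number of beams occupying it, so overlapping beams can be detected later.
    """
    n_groups, n_slots = len(grid), len(grid[0])
    if not 0 <= group < n_groups:
        raise ValueError("frequency group outside the grid")
    if first_slot < 0 or first_slot + n_needed > n_slots:
        raise ValueError("beam does not fit in the available slots")
    for slot in occupied_slots(first_slot, n_needed):
        grid[group][slot] += 1
    return grid
```

For instance, a beam needing three slots that starts at zero-based slot 5 of group 0 occupies slots 5, 6, and 7 of that row.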
3.2 Constraints
We identify two types of constraints. On the one hand, we do not assume a specific orbit and therefore take into account that handover operations might occur. The frequency plan must account for handover constraints, which entail that some beams assigned to the same frequency group cannot overlap in the frequency domain because they are powered from the same satellite at some point in time. We define this type of constraint as intra-group constraints.
To better understand intra-group constraints and following the example in Figure 2, let’s assume beam (green beam in the figure) is powered from satellite 1 at time instant . This beam is assigned to frequency group 1 and slot 6, and needs a bandwidth of 3 slots; it therefore occupies slots 6, 7, and 8 in group 1. At time instant , this beam undergoes a handover operation from satellite 1 to satellite 2. When this occurs, the beam is assigned the same group 1 and slot 6 in satellite 2 and frees them in satellite 1. While we could change the frequency assignment during the handover, this is not always possible due to various factors. As a consequence, a safe strategy is to make sure that, at the moment beam switches from satellite 1 to satellite 2, slots 6, 7, and 8 in group 1 are available in satellite 2 as well.
On the other hand, we define the other type of restrictions as inter-group constraints. These relate to the cases in which beams that point to nearby locations might negatively interfere with each other. This situation is more restrictive, as those beams must not overlap not only if they share the same frequency group but also if they share the same polarization. For example, if beams and had an inter-group constraint and beam occupied slots 6, 7, and 8 in group 5, then, following our previous numbering convention, could occupy any slot except slots 6, 7, and 8 in groups 1, 3, 5, 7, etc. This interaction is represented in Figure 3.
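Both constraint types can be expressed as a pairwise check. The sketch below is one possible reading of the two definitions and not the code used in this study: the function names and the `(group, first_slot, width)` tuple layout are hypothetical. It uses the one-based group numbering of the text, in which odd rows share one polarization and even rows share the other.

```python
def slots_overlap(first_a, width_a, first_b, width_b):
    """True if two runs of consecutive slots share at least one slot."""
    return first_a < first_b + width_b and first_b < first_a + width_a


def same_polarization(group_a, group_b):
    """Groups are sorted by reuse group, then polarization (one-based rows),
    so rows with equal parity share the same polarization."""
    return group_a % 2 == group_b % 2


def violates(kind, beam_a, beam_b):
    """Check one constrained pair of beams.

    Each beam is a (group, first_slot, width) tuple; kind is 'intra' or 'inter'.
    """
    group_a, first_a, width_a = beam_a
    group_b, first_b, width_b = beam_b
    if not slots_overlap(first_a, width_a, first_b, width_b):
        return False
    if kind == "intra":
        return group_a == group_b
    if kind == "inter":
        # Sharing a group implies sharing a polarization, so parity covers both.
        return same_polarization(group_a, group_b)
    raise ValueError("unknown constraint type: " + kind)
```

With a beam occupying slots 6, 7, and 8 in group 5, an inter-group-constrained beam placed on slot 7 of group 1 violates the constraint, while the same placement in group 2 does not.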
4 Methods
The objective of this paper is to design and implement a DRL-based agent capable of making frequency assignment decisions for all beams of a constellation, to evaluate its performance, and to understand its modelling tradeoffs. While there exist algorithms that could address all of the decisions simultaneously, this approach makes the problem challenging for dynamic cases with large values for , , and . The total number of possible different plans is , which poses a combinatorial problem that does not scale well.
As an alternative, we take a sequential decision-making approach in which, following the DRL terminology previously introduced, a timestep is defined as the frequency assignment for just one beam. We then define an episode as the complete frequency assignment of all beams. By doing this, the dimensionality of the problem reduces to for each timestep.
The next step in our formulation involves the definition of the remaining problem-specific elements present in an RL setup, i.e., state, action, and reward. Generally, the application of DRL to domain-specific problems involves the empirical study of which representation for each element better suits the problem. However, it might be the case that no single representation outperforms the rest in all scenarios, and therefore the representation can be regarded as an additional hyperparameter that depends on the environment conditions. Most domain-specific DRL studies leave these considerations out of their analyses. To highlight the importance of the representation selection, in this section we propose alternative representations for each element and study how they affect the overall performance of the model.
Regarding the action, two different action spaces are studied. These are pictured in Figure 4, which shows a scenario with and . First, we consider directly choosing a cell in the grid as the action. This action space, which we define as Grid, consists of different actions. The second action space is defined as Tetris-like and only contemplates five possible actions. In this space, a random frequency assignment is first made for a beam, i.e., we randomly choose a cell in the grid. Then, the agent is able to move it up, down, left, and right across the grid until the action new is chosen and a new beam undergoes the same procedure. Note that with this latter approach, episodes take a longer and random number of timesteps, since one beam can take more than one timestep to be assigned and there is no restriction on how many intermediate actions can be taken before taking action new. The advantage of this representation is that the action space is substantially reduced, which benefits the learning algorithm.
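The two action spaces can be contrasted with a short sketch. It is illustrative only and rests on assumptions that the text does not specify: moves that would leave the grid are clipped at the border, up decreases the row index, and all names and the zero-based indexing are hypothetical.

```python
TETRIS_ACTIONS = ("up", "down", "left", "right", "new")


def grid_action_to_cell(index, n_slots):
    """Grid action space: one action per cell, decoded as (group, first_slot)."""
    return divmod(index, n_slots)


def apply_tetris_action(position, action, n_groups, n_slots, width):
    """Tetris-like action space: move the current beam or commit it.

    Returns (new_position, committed). Border clipping is an assumption of
    this sketch; the source does not describe the border behavior.
    """
    group, first_slot = position
    if action == "new":
        return position, True  # placement is final; the next beam is drawn
    if action == "up":
        group = max(group - 1, 0)
    elif action == "down":
        group = min(group + 1, n_groups - 1)
    elif action == "left":
        first_slot = max(first_slot - 1, 0)
    elif action == "right":
        first_slot = min(first_slot + 1, n_slots - width)
    else:
        raise ValueError("unknown action: " + action)
    return (group, first_slot), False
```

In this sketch the Grid action space has one action per grid cell, so it grows with the grid, while the Tetris-like space always has the five actions in `TETRIS_ACTIONS`, at the cost of longer episodes.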
Next, we also consider two possible representations for the state space. In both, the state is defined as a 3-dimensional tensor in which the first dimension size is 1, 2, or 3, and the second and third dimensions correspond to the dimensions of the grid. We refer to a slice of that tensor along the first dimension as a layer. To better understand both representations, we consider the context of a certain timestep, in which we are making the assignment for one specific beam , we have already assigned beams, and beams remain to be assigned. In both representations the first layer stores which cells conflict with beam , since they are “occupied” by at least one of the beams, among the already assigned, that have some kind of constraint with . If the action space is Tetris-like, the second layer stores the current assignment of beam ; this is done regardless of the state representation chosen. Finally, the last layer serves as a lookahead layer and is optional. Whether it is included in the tensor or not defines the two state representations considered. This layer contains information regarding the remaining beams, such as the number , or the amount of bandwidth that will be compromised in the future for beam due to the beams remaining to be assigned which have an intra-group or inter-group constraint with . Figure 5 shows an example of what these three layers can look like. We therefore run simulations with and without the lookahead layer.
Note that our selection of state and action representations entails an additional benefit: we could start from incomplete frequency plans. There might be situations in which we do not want to entirely reconfigure a plan but just reallocate a couple of beams, due to moving users, handover reconfiguration, etc. Furthermore, adding beams to or removing beams from the constellation would not be a problem either. In that sense, the model would be robust against changes in the beam placement.
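The first state layer, which marks the cells that conflict with the beam being assigned, can be sketched as follows. This is a hedged illustration rather than the implementation used in this study: the function names, the dictionary-based inputs, the binary encoding, and the zero-based rows (where rows of equal parity share a polarization) are assumptions, and the contents of the optional lookahead layer are left to the caller.

```python
def conflict_layer(n_groups, n_slots, assigned, constraints):
    """Mark cells occupied by already-assigned beams constrained with the current beam.

    assigned:    beam id -> (group, first_slot, width), zero-based
    constraints: beam id -> 'intra' or 'inter', only for beams that share a
                 constraint with the current beam
    """
    layer = [[0] * n_slots for _ in range(n_groups)]
    for beam_id, kind in constraints.items():
        if beam_id not in assigned:
            continue  # constrained beam not placed yet, nothing to block
        group, first_slot, width = assigned[beam_id]
        if kind == "intra":
            rows = [group]
        else:
            # Inter-group: block the same slots in every same-polarization row.
            rows = [g for g in range(n_groups) if g % 2 == group % 2]
        for g in rows:
            for slot in range(first_slot, first_slot + width):
                layer[g][slot] = 1
    return layer


def build_state(conflicts, current=None, lookahead=None):
    """Stack the available layers; the result has 1, 2, or 3 layers."""
    return [layer for layer in (conflicts, current, lookahead) if layer is not None]
```

Beams that are assigned but share no constraint with the current beam leave the layer untouched, which is what lets the same representation start from a partially completed plan.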
Lastly, three alternative functions are considered to define the reward. For all three, if the action space is Tetris-like, the reward function is only applied whenever the action new is chosen; otherwise, the value is given as a reward. This is done to prevent the agent from finding local optima that consist of just moving a specific beam around without calling the action new.
Once a beam is assigned (i.e., any action is taken in the Grid action space or the action new is taken in the Tetris-like action space), the first reward function consists of computing the difference in the number of successfully assigned beams between the states at the current timestep and the previous timestep. We consider a beam to be successfully assigned if it does not violate any constraint. The second alternative is to only compute the final quantity of beams that are successfully assigned once the final state is reached and give a reward of zero at all timesteps before that. Finally, inspired by its use in other RL papers [Silver2017], the third reward function is Monte Carlo-based and uses a rollout policy that randomly assigns the remaining beams at each timestep. This is only done to compute the reward; the outcome of the rollout policy is not taken as the agent’s actions. We define each of these strategies as Each, Final, and Monte Carlo (MC), respectively. Figure 6 shows a visual comparison between the three options considered.
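The Each and Final strategies can be sketched from their definitions above. The functions below are illustrative: their names and the input format (the number of successfully assigned beams after each assignment) are hypothetical, and the Monte Carlo strategy is omitted because its rollout details are not specified here.

```python
def each_rewards(success_counts):
    """Each: reward is the change in successfully assigned beams at every assignment."""
    rewards, previous = [], 0
    for count in success_counts:
        rewards.append(count - previous)
        previous = count
    return rewards


def final_rewards(success_counts):
    """Final: zero reward until the last assignment, then the final count."""
    if not success_counts:
        return []
    return [0] * (len(success_counts) - 1) + [success_counts[-1]]


# Hypothetical episode with five beams: the fourth placement breaks an earlier
# beam, so the count of successfully assigned beams drops before recovering.
counts = [1, 2, 2, 1, 2]
print(each_rewards(counts))   # per-assignment feedback, can be negative
print(final_rewards(counts))  # single sparse signal at the end of the episode
```

Both strategies add up to the same undiscounted total over an episode; they differ in when the agent receives the signal.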
Before discussing the results, in Table 2 we summarize the variations that we consider for each of the six elements introduced in Section 2: state, action, reward, policy, policy optimization algorithm, and training procedure. In this section we have described the variations for the first three; in the next section we address the remaining elements, since they specifically relate to the learning framework.
DRL Element  Variations considered

State repr.  With and Without (the lookahead layer)
Action repr.  Grid and Tetris-like
Reward  Each, Final, and Monte Carlo
Policy  CNN+MLP and CNN+LSTM
Policy Opt.  DQN and PPO
Training proc.  Same as test and harder than test
5 Results
To test the presented framework, we simulate a scenario with , , , and . From a set of 5,000 beams based on real data, we randomly create a train and a test dataset, disjoint with respect to each other. For each episode of the training stage, we randomly select 100 beams from the train dataset. Then, during the test stage we randomly select 100 beams from the test dataset and evaluate the policy on those beams. Since the nature of the problem is discrete, we initially select Deep Q-Network [mnih2015human] as the policy optimization algorithm, and later in this section we compare it against a policy gradient algorithm. We use 8 different environments in parallel; this way we can share the experience throughout the training stage and have statistics over the results during the test stage. We start with a policy consisting of a Convolutional Neural Network (CNN) with 2 layers (first layer with 64 5x5 filters, second layer with 128 3x3 filters), then 2 fully connected layers (first layer with 512 units, second layer with 256 units), and an output layer. The fully connected layers are commonly referred to as a Multi-Layer Perceptron (MLP). ReLU activation units and normalization layers are used in all cases.
5.1 Full enumeration analysis
We do a full enumeration analysis and test each possible combination of action-state-reward representation for a total of 50k timesteps. At this point, we also focus on the training procedure and test which improvements there might be if we train the agent to do the assignment with 200 beams instead of 100, to make the task harder and to potentially obtain a more robust policy. In this latter case, the test evaluation is still carried out on 100 beams. In addition, we extend the grid search to include the discount factor [sutton2018reinforcement], a DRL hyperparameter, and evaluate three possible values: 0.1, 0.5, and 0.9. A larger discount factor implies that the agent takes into account longer-term effects of its actions, whereas lower discount factors are related to greedier policies.
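The effect of the discount factor can be illustrated with the standard discounted return, in which a reward received k timesteps in the future is weighted by the discount factor raised to the power k. The snippet is a generic sketch; the function name and the constant-reward episode are illustrative and are not taken from the experiments.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each weighted by gamma raised to its delay in timesteps."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode in which every assignment yields a reward of 1.
rewards = [1.0] * 10
for gamma in (0.1, 0.5, 0.9):
    weight_far = gamma ** 9  # weight given to the reward nine timesteps ahead
    print(gamma, round(discounted_return(rewards, gamma), 3), weight_far)
```

With a discount factor of 0.1 the return is dominated by the immediate reward, which corresponds to the greedier behavior mentioned above, whereas with 0.9 rewards several assignments ahead still contribute noticeably.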
                         100-beam training               200-beam training
                         Grid            Tetris-like     Grid            Tetris-like
Reward  Discount factor  With  Without   With  Without   With  Without   With  Without
Each    0.1              99.8  95.6      92.0  95.8      98.9  97.9      87.5  98.2
Each    0.5              98.9  96.2      90.5  96.6      97.6  94.4      80.8  90.9
Each    0.9              96.6  93.1      81.5  66.9      95.2  93.4      70.2  66.6
Final   0.1              47.5  53.9      68.8  67.4      23.8  20.9      52.5  45.5
Final   0.5              46.6  56.0      39.9  70.1      21.1  21.2      67.0  69.9
Final   0.9              54.9  67.2      71.9  63.6      25.8  22.6      60.0  67.1
MC      0.1              19.5  17.6      45.2  45.2      2.5   2.9       50.9  34
MC      0.5              9.1   3.0       67.9  36.1      2.6   1.9       41.1  20.2
MC      0.9              2.2   2.6       37.8  47.5      2.2   3.2       34.6  24.1
The results of the enumeration analysis for the 72 different simulations are shown in Table 3, which contains the average number of successfully assigned beams during test time (out of 100) for each combination of action, state, reward, discount factor, and number of training beams. The average is computed across the 8 parallel environments. For reference, we compare these results against a totally random policy, which achieves an average of 83.5 successfully assigned beams.
It is observed that using the reward strategy Each leads to better outcomes, and using the Grid action space with the With state space does better on average. For our case, there is no apparent advantage in using reward strategies that delay the reward until the end of the episode or rely on rollout policies; individual timesteps provide enough information to guide the policy optimization. To better understand the impact of the modelling decisions, we also do several significance tests and further analyze the tradeoffs:

There is no advantage in training with 200 beams instead of 100 (p-value = 0.08). Therefore, training with more beams does not make a more robust policy for our case. This is even less impactful when (p = 0.90), since the policy behaves greedily regardless of how many beams remain to be assigned. The strategy for placing the first 100 beams is similar in both cases.

The discount factor affects the performance of the policy (p = 10), but there is no significant performance difference when using or (p = 0.83). The performance worsens when . The learned policy relies more on a greedy behavior to be successful, which is consistent with other methods that have addressed the problem [PachlerdelaOsa2020].

With , the Each reward strategy, and Grid action space, the state space affects the performance (p = 10). It is better to include the lookahead layer.

With , the Each reward strategy, and Tetris-like action space, the state space affects the performance (p = 10). Not including the lookahead layer in the state achieves better results in this case.

With and the Each reward strategy, using the Grid action space is a better alternative than using Tetris-like (p = 10), since it offers more flexibility to make an assignment.

Overall, the Tetris-like action space appears to be less sensitive to the reward strategy chosen, especially when comparing across the simulations using the Monte Carlo reward strategy.
The lookahead layer plays an important role together with the Grid action space, since the agent has total freedom to place a beam, and therefore being informed of the remaining assignments is beneficial. In contrast, in the case of the Tetris-like action space, the agent is limited by the initial random placement of the beam, and therefore the information of what is to come is not as useful.
5.2 Scalability analysis
Given the high dimensionality of future satellite systems, we also analyze the impact scalability has on our framework and run new simulations, this time for 500 beams (), with , , , , and the Each reward strategy, and compare all 4 combinations for the state and action representations. In this case we train each model for a total of 200k timesteps. These results are shown in Table 4 and can be compared against a totally random policy, which achieves 429.9 successfully assigned beams.
The main conclusion of this analysis is that when the dimensionality of the problem increases, we can actually achieve a better outcome using the Tetris-like action space, as opposed to the 100-beam case. The Grid action space does even worse than random, which is due to the large number of actions (1,280 for this case) and an inappropriate exploration-exploitation balance. Overall, there is no impact in terms of the state representation in this example (p = 0.64), but if we focus on the Tetris-like action space, there is (p = 0.001); we still do better without the lookahead layer. These results show that relying on a single representation for a specific problem might not be a robust strategy during operation. Understanding the limitations of specific representations is essential in order to make these systems deployable and, more importantly, reliable.
Action and state representation  Average successfully assigned beams (out of 500)

Grid and With  320.1
Grid and Without  328.1
Tetris-like and With  461.6
Tetris-like and Without  478.9
5.3 Policy and Policy Optimization Algorithm
Two of the key elements of a DRL model listed in Section 2 and Table 2, the policy and the policy optimization algorithm, remain to be studied for our problem. These heavily rely on the progress made by the Deep Learning and Reinforcement Learning research communities. To address them, we now extend the analyses to compare the results of the Deep Q-Network (DQN) algorithm, and the CNN plus the MLP policy, both used in all simulations so far, against other alternatives.
In the case of the policy optimization algorithm, we choose the Proximal Policy Optimization (PPO) [schulman2017proximal] method, a Policy Gradient and on-policy algorithm, to be compared against DQN [mnih2015human], a Q-learning and off-policy algorithm. PPO optimizes the policy in an end-to-end fashion, by doing gradient ascent on its parameters according to a function of the cumulative reward over an episode. Additionally, it clips the probability ratio between the new and old policies in its objective to avoid drastic changes to the policy. DQN focuses on optimizing the prediction of the value of taking a certain action in a given state; it then uses these predictions to choose an action. Also, it stores the agent’s experience and makes use of it over time by “replaying” past experience and training on it. We refer to the original papers for a full description of each method. We use OpenAI’s baselines [baselines] to implement each method.
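The core quantity each algorithm optimizes can be written in a few lines. The functions below follow the standard formulations in the original PPO and DQN papers for a single sample; they are a sketch for illustration and not the OpenAI baselines code. The function names are hypothetical, and the clipping range of 0.2 is an illustrative default rather than a value reported in this study.

```python
def ppo_clipped_term(ratio, advantage, epsilon=0.2):
    """PPO clipped surrogate for one sample.

    ratio is the probability of the action under the new policy divided by its
    probability under the old policy; clipping removes the incentive to move
    the ratio outside [1 - epsilon, 1 + epsilon].
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


def dqn_target(reward, next_q_values, gamma, done):
    """One-step Q-learning target: reward plus discounted best next action value."""
    if done:
        return reward  # no future value after a terminal state
    return reward + gamma * max(next_q_values)
```

The DQN target depends on the value estimates of every action available in the next state, whereas the PPO term depends only on the action that was actually sampled.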
Regarding the policy network, we replace the MLP with a 256-unit Long Short-Term Memory (LSTM) [Hochreiter1997] network, which belongs to the recurrent networks class. We compare a policy constituted by CNN+MLP against a CNN+LSTM one to evaluate the impact of using a hidden state in our formulation. Recurrent neural networks take advantage of temporal dependencies in the data and therefore can potentially perform better in sequential problems.
Table 5 shows the results of using the different policies and the different policy optimization algorithms for 4 different cases: the 100-beam scenario using the Grid action space with both state representations, the 500-beam scenario using the Tetris-like action space with both state representations, and two additional 1,000-beam and 2,000-beam scenarios for the Tetris-like action space and the Without state representation. For each case the random policy successfully assigns 83.5, 429.9, 799.7, and 1,137.7 beams on average, respectively. Since we have concluded the Tetris-like action space better suits high-dimensional scenarios, we want to explore its performance in the thousand-beam range as well. In all cases the Each reward strategy and is used. All simulations belonging to the same scenario are trained for an equal number of timesteps.
Action and state representation

100 beams, Random: 83.5
Grid, With  99.8  94.9  94.9
Grid, Without  95.6  97.5  91.5
500 beams, Random: 429.9
Tetris-like, With  461.6  460.5  470.0
Tetris-like, Without  478.9  478.0  465.1
1,000 beams, Random: 799.7
Tetris-like, Without  962.3  936.0  957.0
2,000 beams, Random: 1,137.7
Tetris-like, Without  1,746.0  1,139.0  967.0
In the scenarios with 1,000 beams or fewer, we can observe that there is no significant advantage in using PPO over DQN, although DQN consistently outperforms PPO in each scenario. Looking at the 2,000-beam scenario, however, the performance difference emerges. Note that the 1,000-beam and 2,000-beam simulations use the same values for and . Therefore, in the 2,000-beam case, the performance worsening occurs during the assignment of the last 1,000 beams. During the last timesteps of such a high-dimensional scenario, having a prediction over multiple actions, as DQN does, proves to be a better approach, as opposed to PPO’s method, which relies on the single action that the policy outputs.
The same occurs if we compare the CNN+MLP policy (MLP in the table) against the CNN+LSTM policy (LSTM in the table). Given the greedy behaviour of the policy, having the LSTM’s hidden state does not offer any advantage when placing the last 1,000 beams in the 2,000-beam scenario. Although allowing more training iterations could help reduce the performance gap, these results show that the choices of the policy and the policy optimization algorithm are especially important under certain circumstances. For the FPD problem in the context of mega-constellations, we need to care about the thousand-beam range. To help visualize the performance of these models, Figure 7 shows the assignment results for one of the 8 environments in the 2,000-beam, DQN, and CNN+MLP scenario. A total of 1,766 beams are successfully assigned.
5.4 Nonstationarity
We now address the issue of nonstationarity and its impact on trained DRL models. In the context of the FPD problem, we are interested in the phenomena that might change from training to operation, such as the bandwidth distribution, the total number of beams, or the constraint distribution. We choose the bandwidth distribution for our analyses. Specifically, we study how the performance changes when the test environment includes beams with a higher average bandwidth demand than those in the training set. We carry out simulations in which the average bandwidth demand per beam in the test set is 2 times and 4 times larger. We do this for the 100-beam scenario with the Grid action space, as well as for the 500-beam scenario with the Tetris-like action space. In both cases, we compare both state representations and also run the random policy. The outcome of this analysis can be found in Table 6.
Action, State representation    Training demand    2-times demand    4-times demand
100 beams
Grid, With                      99.8               84.9              52.6
Grid, Without                   95.6               84.4              65.3
Random                          83.5               81.3              71.9
500 beams
Tetris-like, With               461.6              453.0             249.0
Tetris-like, Without            478.9              434.0             337.0
Random                          429.9              400.0             227.0
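As an illustration of the kind of distribution shift studied here, the following minimal sketch builds test sets whose average per-beam bandwidth demand is 2 and 4 times that of a reference set. It is only one possible construction: the function name `scale_demand`, the list-of-floats representation of demand, and the sample values are assumptions for illustration and are not taken from the simulations in this paper.

```python
from statistics import mean


def scale_demand(demands, factor):
    """Return a copy of the per-beam demands multiplied by `factor`.

    Scaling every beam by the same factor multiplies the average
    demand per beam by that factor as well.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    return [d * factor for d in demands]


# Hypothetical per-beam bandwidth demands (arbitrary units), for illustration only.
reference_demands = [10.0, 20.0, 30.0, 40.0]

# One test set per demand multiplier: unchanged, 2 times, and 4 times.
test_sets = {k: scale_demand(reference_demands, k) for k in (1, 2, 4)}

for k, demands in test_sets.items():
    print(f"{k}-times demand: mean per-beam demand = {mean(demands)}")
```

Uniform scaling is the simplest way to shift the mean; a shift that also changes the shape of the distribution would need a different construction.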
The results show that, as expected based on other DRL studies [DulacArnold2020], nonstationarity does negatively affect the performance of trained DRL models for the FPD problem. In our case, it correlates with the difference in average bandwidth demand per beam between the train and test data. In the 100-beam scenario, the performance gap between the agent and the random policy shrinks in the 2-times case, and the random policy performs better than the agent in the 4-times case.
Something similar occurs in the 500-beam scenario, although the random policy does not beat the agent in any case. This is a consequence of having a bigger search space and of the more frequent use of the up and down actions in the Tetris-like action space, which change the resource group assignment, rather than changing the assignment within the same resource group. If we analyze the actions taken during test time for this latter scenario, the up and down actions were taken, on average, 2.5 times more often than the left and right actions. Furthermore, nonstationarity affects the usefulness of the lookahead layer for the Tetris-like action space. In the analysis from Table 4 we have seen that not using it was a better alternative, whereas in the 2-times demand case the agent does better with it. There is a certain advantage in knowing what is to come in nonstationary scenarios, although there is a limit to how beneficial the lookahead layer is, as observed in the 4-times demand case.
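The gaps between the agent and the random policy discussed above can be recomputed directly from the values in Table 6. The short sketch below does so, assuming that the three result columns correspond, in order, to the training demand distribution, the 2-times case, and the 4-times case. The dictionary layout and the function name `gaps_over_random` are illustrative choices.

```python
# Average number of successfully assigned beams, copied from Table 6.
# Tuple order (assumed): training demand, 2-times demand, 4-times demand.
TABLE_6 = {
    "100 beams": {
        "Grid, With": (99.8, 84.9, 52.6),
        "Grid, Without": (95.6, 84.4, 65.3),
        "Random": (83.5, 81.3, 71.9),
    },
    "500 beams": {
        "Tetris-like, With": (461.6, 453.0, 249.0),
        "Tetris-like, Without": (478.9, 434.0, 337.0),
        "Random": (429.9, 400.0, 227.0),
    },
}


def gaps_over_random(scenario):
    """Return agent-minus-random beam counts for each demand case."""
    rows = TABLE_6[scenario]
    random_row = rows["Random"]
    return {
        name: tuple(round(a - r, 1) for a, r in zip(values, random_row))
        for name, values in rows.items()
        if name != "Random"
    }


for scenario in TABLE_6:
    for name, gaps in gaps_over_random(scenario).items():
        print(f"{scenario} | {name}: {gaps}")
```

A negative entry means the random policy assigns more beams than the agent, which in these numbers only happens for the 100-beam scenario in the 4-times case.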
There are different alternatives to be robust in nonstationary environments. The first involves, after identifying the sources of nonstationarity, devising adequate training datasets that include episodes capturing those sources. In this case, it is essential to use a state representation that successfully incorporates information on the condition of each source. Secondly, we can make use of algorithms that refine the agent’s actions under contingency scenarios. These algorithms could guarantee a more robust behavior at the expense of computing time, such as in [garau20a], where a DRL model is combined with a subsequent Genetic Algorithm to maximize robustness. Finally, we can rely not on a single agent but on an ensemble of agents that specialize in different scenarios and are trained in different ways. In all cases, retraining the models over time from experience collected during operation is a good practice to capture any possible change in the environment.
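To make the ensemble alternative more concrete, the following sketch selects, among several specialized agents, the one whose training data had the average per-beam demand closest to what is currently observed. This selection rule is an assumption made for illustration, not a method evaluated in this paper; the agent names, the demand values, and the function `pick_specialist` are hypothetical.

```python
from statistics import mean


def pick_specialist(observed_demands, specialists):
    """Return the name of the agent trained on the closest mean demand.

    `specialists` maps an agent name to the mean per-beam demand of the
    data it was trained on (a hypothetical summary of each agent).
    """
    observed_mean = mean(observed_demands)
    return min(
        specialists,
        key=lambda name: abs(specialists[name] - observed_mean),
    )


# Hypothetical ensemble: agent name -> mean per-beam demand seen in training.
specialists = {"agent_1x": 25.0, "agent_2x": 50.0, "agent_4x": 100.0}

# Hypothetical demands observed during operation (arbitrary units).
print(pick_specialist([45.0, 60.0, 52.0], specialists))
```

In practice the selection could rely on any summary of the environment that the state representation exposes; the mean demand is used here only because it is the source of nonstationarity analyzed above.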
5.5 Impact on real-world scenarios
The proposed models have shown a quantitatively good performance in the majority of the scenarios considered. On average, the agent successfully assigns 99.8% of beams in the 100-beam case, 95.8% in the 500-beam case, 96.2% in the 1,000-beam case, and 87.3% in the 2,000-beam case. The main advantage of this approach is that evaluating a neural network is substantially faster than running other algorithms such as metaheuristics, while still performing well. This aligns with what most applied DRL studies conclude. To handle the beams that do not fulfill all constraints, we could apply a subsequent “repairing” algorithm to the conflicting beams, which could make up 1 to 15% of the total based on our numbers.
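The percentages above appear to correspond to the best entry for each scenario in Table 5. The following sketch reproduces that arithmetic and also reports the share of beams that would be left for a repairing step. The dictionary name `BEST_ASSIGNED` and the function `success_rate` are illustrative.

```python
# Best average number of successfully assigned beams per scenario,
# taken from Table 5 (key: total number of beams in the scenario).
BEST_ASSIGNED = {100: 99.8, 500: 478.9, 1000: 962.3, 2000: 1746.0}


def success_rate(total_beams, assigned):
    """Percentage of beams that are successfully assigned."""
    return 100.0 * assigned / total_beams


for total, assigned in BEST_ASSIGNED.items():
    rate = success_rate(total, assigned)
    print(
        f"{total} beams: {rate:.1f}% assigned, "
        f"{100.0 - rate:.1f}% left for a repairing step"
    )
```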
However, the main consideration behind these numbers is that they correspond to different models that use different state and action representations. Unlike many applied DRL studies, we have regarded representations as additional hyperparameters at every stage of our analyses. Given the importance of high-dimensionality in real-world operations, we have also focused on cases with hundreds to thousands of beams, as opposed to works that exclusively consider cases with fewer than 100 beams. We could not have reached our conclusions without making these decisions for our analyses. When studying a specific DRL model with defined hyperparameters, achieving or beating state-of-the-art performance is as important as understanding the limitations of those hyperparameters and testing the models under real-world conditions.
We have also studied the limitations resulting from environment nonstationarity. We have validated the Machine Learning community’s findings on the negative effects of nonstationary sources in the environment. It is not common to find this kind of analysis in domain-specific DRL papers. If the goal is to make DRL deployable, it is essential that we address contingency cases in our problems, either by capturing the sources of nonstationarity in the training datasets, or by devising strategies to mitigate the impact during real-time operations. Otherwise we fail to meet the requirements of operating in real-world environments and, more importantly, reduce the reliability of our systems.
6 Conclusions
In this paper we have addressed the design and implementation tradeoffs of a Deep Reinforcement Learning (DRL) agent to carry out frequency assignment tasks in a multibeam satellite constellation. DRL models are getting attention in the aerospace community mostly due to their fast decision-making and their adaptability to complex nonlinear optimization problems. In our work we have chosen the Frequency Plan Design (FPD) problem as a use case and identified six elements that drive the performance of DRL models: the state representation, the action representation, the reward function, the policy, the policy optimization algorithm, and the training strategy. We have defined multiple variations for each of these elements and compared the performance differences in separate scenarios. We have put a special focus on high-dimensionality and nonstationarity, two of the main phenomena present in the upcoming satellite communications landscape.
The results show that DRL is an adequate method to address the FPD problem, since it successfully assigns 85% to 99% of the beams for cases with 100 to 2,000 beams. However, no single state-action combination outperforms the rest for all cases. When the dimensionality of the problem is low, the Grid action space and the With state representation perform better. In contrast, the Tetris-like action space and the Without state representation are a better option for high-dimensional scenarios. These findings validate our hypothesis that representation should be strongly considered as an additional hyperparameter in applied DRL studies. We have also seen that using the Deep Q-Network algorithm in combination with a convolutional neural network and a fully-connected neural network as the policy works best for all scenarios, being especially advantageous for the 2,000-beam case. Regardless of the scenario, the obtained policy has shown a greedy behavior that benefits from informative rewards at each timestep.
At the end of the paper, we have reflected on different considerations that are usually left out of applied DRL studies. Our analyses on the effect of nonstationarity have helped motivate the discussion. When the distribution of average bandwidth demand per beam differs substantially between the train and test data, the number of beams successfully assigned by the agent decreases, and the agent performs worse than the random policy in some cases. We emphasize the need to identify the potential sources of nonstationarity, understand their potential effects on the DRL model during real operation cycles, and propose solutions to mitigate any negative influences. A compromise between topping performance metrics and characterizing the limitations of the models and their hyperparameters is the only way to advance in the successful deployment of DRL and other Machine Learning-based technologies.