Applicability and Challenges of Deep Reinforcement Learning for Satellite Frequency Plan Design

10/15/2020, by Juan Jose Garau Luis, et al.

The study and benchmarking of Deep Reinforcement Learning (DRL) models has become a trend in many industries, including aerospace engineering and communications. Recent studies in these fields propose these kinds of models to address certain complex real-time decision-making problems in which classic approaches do not meet time requirements or fail to obtain optimal solutions. While the good performance of DRL models has been demonstrated for specific use cases or scenarios, most studies do not discuss the compromises of such models. In this paper we explore the tradeoffs of different elements of DRL models and how they might impact the final performance. To that end, we choose the Frequency Plan Design (FPD) problem in the context of multibeam satellite constellations as our use case and propose a DRL model to address it. We identify six different core elements that have a major effect on its performance: the policy, the policy optimizer, the state, action, and reward representations, and the training environment. We analyze different alternatives for each of these elements and characterize their effect. We also use multiple environments to account for different scenarios in which we vary the dimensionality or make the environment non-stationary. Our findings show that DRL is a potential method to address the FPD problem in real operations, especially because of its speed in decision-making. However, no single DRL model is able to outperform the rest in all scenarios, and the best approach for each of the six core elements depends on the features of the operation environment. While we agree on the potential of DRL to solve future complex problems in the aerospace industry, we also reflect on the importance of designing appropriate models and training procedures, understanding the applicability of such models, and reporting the main performance tradeoffs.


1 Introduction

In order to address the upcoming complex challenges in aerospace and communications research, recent studies are looking into Machine Learning (ML) methods as potential problem-solvers [Abbas2015]. One such method is Deep Reinforcement Learning (DRL), which has had a wide adoption in the community [Luong2019, Ferreira2019]. In the specific case of satellite communications, DRL has already shown its usefulness in problems like channel allocation [Hu2018], beam-hopping [Hu2020], and dynamic power allocation [luis2019deep, Zhang2020].

The motivation behind the use of DRL-based decision-making agents for satellite communications originates mainly from the forthcoming need to automate the control of a large number of satellite variables and beams in real-time. Large flexible constellations, with thousands of beams and a broad range of different users, will need to autonomously reallocate resources such as bandwidth or power in real-time in order to address a highly-fluctuating demand [NorthernSkyResearch2019]. However, the new time and dimensionality requirements pose a considerable challenge to previously-adopted optimization approaches such as mathematical programming [HengWang2013] or metaheuristics [Aravanis2015, Cocco2018, Durand2017]. In contrast, due to the training frameworks common in ML algorithms, DRL has the potential to meet these operational constraints [Luis2020].

Despite the positive results of DRL for satellite communications and many other fields, the majority of applied research studies mostly focus on best-case scenarios, provide little insight on modelling decisions, and fail to address the operational considerations of deploying DRL-based systems in real-world environments. In addition to the inherent reproducibility problems of DRL algorithms [Henderson2018], recent studies show the drastic consequences of not properly addressing phenomena present in real-world environments [Dulac-Arnold2020] such as non-stationarity, high-dimensionality, partial-observability, or safety constraints.

As mentioned, high-dimensionality and non-stationarity are especially important in satellite communications. Mega constellations are already a reality, and they require models that scale to hundreds or thousands of beams. Studies like [Hu2018, Hu2020, Zhang2020] validate the proposed DRL models for cases with fewer than 50 beams, but do not report their performance in high-dimensional scenarios. In addition, user bases are becoming volatile and their demand highly fluctuating. Relying on models that assume static user distributions could be detrimental during real-time operations. Therefore, it is crucial that applied research studies also discuss how their models behave under user non-stationarity and how they propose to address any negative influences.

One of the challenges of upcoming large constellations is how to efficiently assign a part of the frequency spectrum to each beam while respecting the interference constraints present in the system. This is known as the Frequency Plan Design (FPD) problem. Since users’ behavior is highly-dynamic, new frequency plans must be computed in real time to satisfy the demand. Given these runtime and adaptability requirements, in this paper we propose looking into a DRL-based solution able to make the frequency allocation decisions required at every moment.

In our study we shift the target from the problem to the implementation methodology. Our aim is to provide a holistic view of how the different elements in a DRL model affect the performance and how their relevance changes depending on the scenario considered. In addition to nominal conditions, we also study the effect of dimensionality and non-stationarity on our models. Our goal is to contribute to the interpretability of DRL models in order to make progress towards real-world deployments that guarantee a robust performance.

The remainder of this paper is structured as follows: Section 2 presents a short overview on DRL and its main performance drivers; Section 3 introduces the FPD problem, focusing on its decisions and constraints; Section 4 outlines our approach to solve the problem with a DRL method; Section 5 discusses the results of the model in multiple scenarios; and finally Section 6 outlines the conclusions of the paper.

2 Deep Reinforcement Learning

Deep Reinforcement Learning is a Machine Learning subfield that combines the Reinforcement Learning (RL) paradigm [sutton2018reinforcement] and Deep Learning frameworks [goodfellow2016deep] to address complex sequential decision-making problems. The RL paradigm involves the interaction between an agent and an environment that follows a Markov Decision Process and is external to the agent. This interaction is sequential and divided into a series of timesteps. At each of these timesteps, the agent observes the state of the environment and takes an action based on it. As a consequence of that action, the agent receives a reward that quantifies the value of the action according to the high-level goal of the interaction (e.g., flying a drone or winning a game). Then, the environment updates its state taking into account the action from the agent.

The interaction goes on until a terminal state is reached (e.g., the drone lands). In this state, the agent cannot take any further action. The sequence of timesteps that leads to a terminal state is known as an episode. Based on the experience from multiple episodes, the objective of the agent is to optimize its policy, which defines the mapping from states to actions. The agent's optimal policy is the policy that maximizes the expected average or cumulative reward over an episode. This policy might be stochastic depending on the nature of the environment. Table 1 summarizes the parameters defined so far and includes the symbols generally used in the literature.

Symbol    Parameter
s_t       State of the environment at timestep t
a_t       Action taken by the agent at timestep t
r_t       Reward received at timestep t
S         Set of possible states
A         Set of possible actions
A(s)      Set of available actions at state s
s_T       Terminal state
π         Policy
π*        Optimal policy
π(a|s)    Probability of action a given state s
Table 1: Principal Reinforcement Learning parameters and their symbols.

In simple environments, the most effective policy is usually given by a table that maps each unique state to an action or a probability distribution over actions. However, when the state and/or action spaces are large or continuous, using a tabular policy becomes impractical. Deep Reinforcement Learning (DRL) addresses such cases by replacing tabular policies with neural network-based policies that approximate the mapping from states to actions. Those policies are then updated following Deep Learning algorithms, such as Stochastic Gradient Descent and backpropagation. DRL has shown significant results in a wide variety of areas like games [mnih2015human], molecular design [Popova2018], or internet recommendation systems [Theocharous2015].

There are six elements that jointly drive the performance of a DRL model:

  1. State representation. We want to look for state representations that capture as much information about the environment as possible and can be easily fed into a neural network. In the case of vision-based robots, that might simply be RGB images from cameras. In other contexts, the process sometimes involves more complex state design strategies. It is also referred to as the state space.

  2. Action space. Although flexibility matters, reducing the action space is generally beneficial to the algorithm. While, for instance, the actions in videogames are straightforward, other environments might require changes to the action space, such as discretization or pruning. It is also referred to as the action representation.

  3. Reward function. The reward function is relevant during the training stage of the algorithm, and should also be informative with respect to the goal of the agent. There might be more than one reward function that correctly guides the learning, although there is usually a performance gap between classes of strategies: constant reward, sparse reward, rollout policy-based reward, etc.

  4. Policy. In DRL the policy is mainly represented by the chosen neural network architecture. DRL policies commonly include fully-connected layers in addition to convolutional and/or recurrent layers. Each of these layer classes includes multiple variants to consider.

  5. Policy optimization algorithm. There are many algorithms that rely on completely different design choices. In model-free DRL, alternatives include policy gradient methods, Q-learning, or hybrid approaches.

  6. Training procedure. The training dataset and training environment constitute an important part of the performance. We want to reflect all phenomena that might take place during testing and make sure that the training procedure guarantees a robust performance. Sometimes, however, the possibilities of interaction with real environments are limited and models must resort to data-efficient strategies. The sketch after this list illustrates where each of these six elements enters a generic DRL training loop.
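To make the role of each element concrete, the following minimal sketch (our own illustration, not the model proposed later in this paper) shows where the six elements appear in a generic training loop. It assumes a toy Gym-style environment and a simple REINFORCE-style policy-gradient update in PyTorch; all names and sizes are placeholders.

```python
import torch
import torch.nn as nn

class DummyEnv:
    """Toy stand-in for the training environment (element 6)."""
    def reset(self):
        self.t = 0
        return torch.zeros(4)                      # (1) state representation
    def step(self, action):
        self.t += 1
        next_state = torch.randn(4)
        reward = 1.0 if action == 0 else 0.0       # (3) reward function
        done = self.t >= 10                        # terminal state reached
        return next_state, reward, done

policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))   # (4) policy network
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)              # (5) policy optimizer

env = DummyEnv()
for episode in range(100):                         # (6) training procedure
    state, done = env.reset(), False
    log_probs, rewards = [], []
    while not done:
        dist = torch.distributions.Categorical(logits=policy(state))
        action = dist.sample()                     # (2) action space: 2 discrete actions
        log_probs.append(dist.log_prob(action))
        state, reward, done = env.step(action.item())
        rewards.append(reward)
    # REINFORCE-style update on the undiscounted episode return
    loss = -torch.stack(log_probs).sum() * sum(rewards)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```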

3 Frequency Plan Design Problem

The Frequency Plan Design problem consists in the assignment of a portion of the available spectrum to every beam in a multibeam satellite system, with the goal of satisfactorily serving all the system’s users. Although it is a well-studied problem, the high-dimensionality and flexible nature of new satellites add an additional layer of complexity that motivates the exploration of new algorithmic solutions such as DRL to address it.

3.1 Decisions

In this paper we consider a multibeam satellite constellation composed of multiple satellites. We assume a beam placement is provided, which fixes the total number of beams in the system. Each of these beams has an individual data rate requirement. To satisfy this demand, spectrum resources must be allocated to each beam. For the purpose of this paper, we consider that the amount of bandwidth needed by every beam is given, although our framework could be adapted to satisfy dynamic bandwidth requirements. While bandwidth is a continuous resource, most satellites divide their frequency spectrum into frequency chunks or slots of equal bandwidth, and therefore we consider that each beam is assigned a certain discrete number of frequency slots, which we refer to as the beam's slot demand. The only remaining decisions are therefore which specific frequency slots to assign and which frequency reuse mechanisms to apply.

In this work we assume satellites can reutilize spectrum. Each satellite of the constellation has an equal available spectrum consisting of a fixed number of consecutive frequency slots. In addition, there is access to frequency reuse mechanisms in the form of reuse groups and two polarizations. A combination of a specific reuse group and polarization is coined as a frequency group. The constellation therefore has twice as many frequency groups as reuse groups (we consider left-handed and right-handed circular polarization available for each beam). For each frequency group, there is an equal number of frequency slots available.

To better represent this problem, we use a grid in which the columns and the rows represent the frequency slots and the frequency groups, respectively. This is pictured in Figure 1. In the case of the frequency groups, we assume these are sorted first by reuse group and then by polarization (e.g., row 1 corresponds to reuse group 1 and left-handed polarization, and row 2 corresponds to reuse group 1 and right-handed polarization).

Figure 1: Example of the decision-making space, with frequency groups as rows and frequency slots as columns. Black squares represent the assignment decision for each beam; colored cells indicate the complete frequency assignment for three different beams.

The frequency assignment operation for a beam consists of, first, selecting one of the frequency groups and, second, picking as many consecutive frequency slots as the beam demands from those available in that group. In other words, two decisions need to be made per beam: 1) the frequency group and 2) the first frequency slot this beam will occupy in the group. This is represented in Figure 1 by the black squares that designate the assignment decision for each beam.
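As an illustration of this decision space (a sketch under assumed dimensions; the grid size, beam names, and slot demands below are hypothetical, not values from our scenarios), a frequency plan can be stored as a mapping from each beam to its (frequency group, first slot) pair:

```python
# Illustrative encoding of the assignment grid in Figure 1 (hypothetical values).
N_GROUPS, N_SLOTS = 8, 20          # rows (frequency groups) and columns (slots)
demand = {"b1": 3, "b2": 2}        # hypothetical slot demand of each beam

def assign(plan, beam, group, first_slot):
    """Store the two decisions made for a beam: frequency group and first slot."""
    if first_slot + demand[beam] > N_SLOTS:
        raise ValueError("assignment exceeds the available spectrum")
    plan[beam] = (group, first_slot)
    return plan

def occupied_slots(plan, beam):
    """Frequency group and set of slots occupied by a beam under the current plan."""
    group, first = plan[beam]
    return group, set(range(first, first + demand[beam]))

plan = assign({}, "b1", group=0, first_slot=5)   # beam b1 -> group 0, slots 5, 6, 7
```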

3.2 Constraints

We identify two types of constraints. On one hand, we do not assume a specific orbit and therefore take into account that handover operations might occur. The frequency plan must account for handover constraints, which entail that some beams assigned to the same frequency group cannot overlap in the frequency domain because they are powered from the same satellite at some point in time. We define this type of constraint as an intra-group constraint.

To better understand intra-group constraints and following the example in Figure 2, let's assume a beam (green in the figure) is powered from satellite 1 at a given time instant. This beam is assigned to frequency group 1 and slot 6, and needs a bandwidth of 3 slots; it therefore occupies slots 6, 7, and 8 in group 1. At a later time instant, this beam undergoes a handover operation from satellite 1 to satellite 2. When this occurs, the beam is assigned the same group 1 and slot 6 in satellite 2 and frees them in satellite 1. While we could change the frequency assignment during the handover, this is not always possible due to various factors. As a consequence, a safe strategy is to make sure that, at the moment the beam switches from satellite 1 to satellite 2, slots 6, 7, and 8 in group 1 are available in satellite 2 as well.

Figure 2: Handover operation example between two consecutive time instants.

On the other hand, we define the other type of restriction as an inter-group constraint. These relate to cases in which beams that point to close locations might negatively interfere with each other. This situation is more restrictive, as those beams must not overlap not only if they share the same frequency group but also if they share the same polarization. For example, if two beams had an inter-group constraint and the first occupied slots 6, 7, and 8 in group 5, then, following our previous numbering convention, the second could occupy any slot except slots 6, 7, and 8 in groups 1, 3, 5, 7, etc. This interaction is represented in Figure 3.

Figure 3: Inter-group constraint example between two beams. The diagram shows the moment one beam is to be assigned while the other has already been assigned.
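Continuing the sketch above (and reusing its hypothetical occupied_slots helper), the two constraint types could be checked as follows; the polarization test assumes the row ordering described earlier, so that rows with the same parity share a polarization:

```python
def violates_intra_group(plan, beam_a, beam_b):
    """Intra-group constraint: beams in the same frequency group must not
    overlap in the frequency domain."""
    group_a, slots_a = occupied_slots(plan, beam_a)
    group_b, slots_b = occupied_slots(plan, beam_b)
    return group_a == group_b and bool(slots_a & slots_b)

def violates_inter_group(plan, beam_a, beam_b):
    """Inter-group constraint: beams sharing a polarization (same row parity
    under the assumed ordering) must not overlap in the frequency domain."""
    group_a, slots_a = occupied_slots(plan, beam_a)
    group_b, slots_b = occupied_slots(plan, beam_b)
    same_polarization = (group_a % 2) == (group_b % 2)
    return same_polarization and bool(slots_a & slots_b)
```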

4 Methods

The objective of this paper is to design and implement a DRL-based agent capable of making frequency assignment decisions for all beams of a constellation, to evaluate its performance, and to understand its modelling tradeoffs. While there exist algorithms that could address all of the decisions simultaneously, this approach makes the problem challenging for dynamic cases with large numbers of beams, frequency groups, and frequency slots. The total number of possible frequency plans grows exponentially with the number of beams, which poses a combinatorial problem that does not scale well.

As an alternative, we take a sequential decision-making approach in which, following the DRL terminology previously introduced, a timestep is defined as the frequency assignment for just one beam. We then define an episode as the complete frequency assignment of all beams. By doing this, the dimensionality of the problem at each timestep reduces to the number of cells in the grid, i.e., the number of frequency groups times the number of frequency slots.

The next step in our formulation involves the definition of the remaining problem-specific elements present in an RL setup, i.e., the state, action, and reward. Generally, the application of DRL to domain-specific problems involves an empirical study of which representation of each element best suits the problem. However, it might be the case that no single representation outperforms the rest in all scenarios, and therefore the representation can be regarded as an additional hyperparameter that depends on the environment conditions. Most domain-specific DRL studies leave these considerations out of their analyses. To highlight the importance of representation selection, in this section we propose alternative representations for each element and study how they affect the overall performance of the model.

Regarding the action, two different action spaces are studied. These are pictured in Figure 4, which shows a scenario with a small grid. First, we consider directly choosing a cell in the grid as the action. This action space, which we define as Grid, consists of as many actions as there are cells in the grid (frequency groups times frequency slots). The second action space is defined as Tetris-like and only contemplates five possible actions. In this space, a random frequency assignment is first made for a beam, i.e., we randomly choose a cell in the grid. Then, the agent is able to move it up, down, left, and right across the grid until the action new is chosen and a new beam undergoes the same procedure. Note that with this latter approach, episodes take a larger and variable number of timesteps, since one beam can take more than one timestep to be assigned and there is no restriction on how many intermediate actions can be taken before taking the action new. The advantage of this representation is that the action space is substantially reduced, which benefits the learning algorithm.

Figure 4: Two action spaces considered: Grid and Tetris-like.
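The following sketch contrasts the two action spaces (our own illustration with hypothetical grid dimensions): a Grid action directly indexes a cell, whereas a Tetris-like action moves a randomly initialized assignment around the grid until new is chosen.

```python
import random

N_GROUPS, N_SLOTS = 8, 20                          # hypothetical grid dimensions

def grid_action_to_cell(action_id):
    """Grid space: one action per cell, N_GROUPS * N_SLOTS actions in total."""
    return divmod(action_id, N_SLOTS)              # -> (group, first_slot)

TETRIS_ACTIONS = ("up", "down", "left", "right", "new")

def apply_tetris_action(cell, action):
    """Tetris-like space: move the current (group, first_slot) cell around;
    'new' freezes the assignment and starts a new beam at a random cell."""
    group, slot = cell
    if action == "up":
        group = max(group - 1, 0)
    elif action == "down":
        group = min(group + 1, N_GROUPS - 1)
    elif action == "left":
        slot = max(slot - 1, 0)
    elif action == "right":
        slot = min(slot + 1, N_SLOTS - 1)
    elif action == "new":
        return (random.randrange(N_GROUPS), random.randrange(N_SLOTS)), True
    return (group, slot), False
```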

Next, we also consider two possible representations for the state space. In both, the state is defined as a 3-dimensional tensor in which the size of the first dimension is 1, 2, or 3, and the second and third dimensions span the frequency groups and the frequency slots of the grid, respectively. We refer to a slice of that tensor along the first dimension as a layer. To better understand both representations, we consider the context of a certain timestep, in which we are making the assignment for one specific beam, some beams have already been assigned, and the rest remain to be assigned. In both representations the first layer stores which cells conflict with the current beam, since they are "occupied" by at least one of the already-assigned beams that have some kind of constraint with it. If the action space is Tetris-like, the second layer stores the current assignment of the beam; this is done regardless of the state representation chosen. Finally, the last layer serves as a lookahead layer and is optional; whether it is included in the tensor or not defines the two state representations considered. This layer contains information regarding the remaining beams, such as how many there are, or the amount of bandwidth that will be compromised in the future for the current beam due to the beams remaining to be assigned that have an intra-group or inter-group constraint with it. Figure 5 shows an example of what these three layers can look like. We therefore run simulations with and without the lookahead layer.

Figure 5: Possible layers of the state space.
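A possible construction of this state tensor is sketched below (our own illustration; the dimensions are hypothetical and the lookahead layer is reduced to a single scalar summary for simplicity):

```python
import numpy as np

N_GROUPS, N_SLOTS = 8, 20                          # hypothetical grid dimensions

def build_state(conflict_cells, current_cells=None, lookahead=None):
    """Stack 1, 2, or 3 layers of shape (N_GROUPS, N_SLOTS)."""
    conflict = np.zeros((N_GROUPS, N_SLOTS), dtype=np.float32)
    for g, s in conflict_cells:                    # cells occupied by constrained beams
        conflict[g, s] = 1.0
    layers = [conflict]
    if current_cells is not None:                  # only used with Tetris-like actions
        current = np.zeros((N_GROUPS, N_SLOTS), dtype=np.float32)
        for g, s in current_cells:
            current[g, s] = 1.0
        layers.append(current)
    if lookahead is not None:                      # optional "With" representation
        layers.append(np.full((N_GROUPS, N_SLOTS), lookahead, dtype=np.float32))
    return np.stack(layers)                        # shape: (1|2|3, N_GROUPS, N_SLOTS)
```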

Note that our selection of state and action representations entails an additional benefit: we could start from incomplete frequency plans. There might be situations in which we do not want to entirely reconfigure a plan but just reallocate a couple of beams – due to moving users, handover reconfiguration, etc. Furthermore, adding or removing beams to the constellation would not be a problem either. In that sense, the model would be robust against changes in the beam placement.

Lastly, three alternative functions are considered to define the reward. For all three, if the action space is Tetris-like, the reward function is only applied whenever the action new is chosen; otherwise a fixed default value is given as the reward. This is done to prevent the agent from finding local optima that consist of just moving a specific beam around without ever choosing the action new.

Once a beam is assigned (i.e., any action is taken in the Grid action space or the action new is taken in the Tetris-like action space), the first reward function computes the difference in the number of successfully assigned beams between the current and the previous timestep. We consider a beam to be successfully assigned if it does not violate any constraint. The second alternative is to only compute the final number of beams that are successfully assigned once the final state is reached and to give a reward of zero at all timesteps before that. Finally, inspired by its use in other RL papers [Silver2017], the third reward function is Monte Carlo-based and uses a rollout policy that randomly assigns the remaining beams at each timestep. The rollout is only used to compute the reward; its outcome is not taken as the agent's actions. We define these strategies as Each, Final, and Monte Carlo (MC), respectively. Figure 6 shows a visual comparison between the three options considered.

Figure 6: Three potential reward function definitions.
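The three strategies could be expressed as follows (a sketch; count_successful and random_assign are assumed helpers that count constraint-free beams in a plan and randomly assign a beam, respectively):

```python
import copy

def reward_each(plan_prev, plan_now, count_successful):
    """Each: difference in successfully assigned beams between consecutive timesteps."""
    return count_successful(plan_now) - count_successful(plan_prev)

def reward_final(plan_now, is_terminal, count_successful):
    """Final: zero until the terminal state, then the total number of successes."""
    return count_successful(plan_now) if is_terminal else 0.0

def reward_monte_carlo(plan_now, remaining_beams, random_assign, count_successful):
    """Monte Carlo: complete the plan with a random rollout, score it, then
    discard the rollout (its assignments are not kept as the agent's actions)."""
    rollout = copy.deepcopy(plan_now)
    for beam in remaining_beams:
        rollout = random_assign(rollout, beam)
    return count_successful(rollout)
```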

Before discussing the results, in Table 2 we summarize the variations that we consider for each of the six elements introduced in Section 2: state, action, reward, policy, policy optimization algorithm, and training procedure. In this section we have described the variations for the first three; in the next section we address the remaining elements, since they specifically relate to the learning framework.

DRL Element Variations considered
State repr. With and Without (the lookahead layer)
Action repr. Grid and Tetris-like
Reward Each, Final, and Monte Carlo
Policy CNN+MLP and CNN+LSTM
Policy Opt. DQN and PPO
Training proc. Same as test and harder than test
Table 2: Principal DRL parameters and the variations we test in this paper.

5 Results

To test the presented framework, we first simulate a scenario with 100 beams per episode. From a set of 5,000 beams based on real data, we randomly create a train and a test dataset, disjoint from each other. For each episode of the training stage, we randomly select 100 beams from the train dataset. Then, during the test stage we randomly select 100 beams from the test dataset and evaluate the policy on those beams. Since the nature of the problem is discrete, we initially select Deep Q-Network [mnih2015human] as the policy optimization algorithm; later in this section we compare it against a policy gradient algorithm. We use 8 different environments in parallel; this way we can share experience throughout the training stage and gather statistics over the results during the test stage. We start with a policy consisting of a Convolutional Neural Network (CNN) with 2 layers (first layer with 64 5x5 filters, second layer with 128 3x3 filters), followed by 2 fully-connected layers (first layer with 512 units, second layer with 256 units) and an output layer. The fully-connected layers are commonly referred to as a Multilayer Perceptron (MLP). ReLU activation units and normalization layers are used in all cases.
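For illustration, the described CNN+MLP policy could be written as follows (a sketch assuming a PyTorch implementation and "same" padding; the experiments in this paper rely on the OpenAI baselines implementations, so exact details may differ):

```python
import torch.nn as nn

class CnnMlpPolicy(nn.Module):
    def __init__(self, in_layers, n_groups, n_slots, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_layers, 64, kernel_size=5, padding=2),   # 64 5x5 filters
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, padding=1),         # 128 3x3 filters
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(                                # the MLP part
            nn.Linear(128 * n_groups * n_slots, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, n_actions),     # Q-values (DQN) or action logits (PPO)
        )

    def forward(self, x):                  # x: (batch, layers, N_GROUPS, N_SLOTS)
        return self.head(self.features(x))
```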

5.1 Full enumeration analysis

We do a full enumeration analysis and test each possible combination of action-state-reward representation for a total of 50k timesteps. At this point, we also focus on the training procedure and test which improvements there might be if we train the agent to do the assignment with 200 beams instead of 100, to make the task harder and to potentially obtain a more robust policy. In this latter case, the test evaluation is still carried out on 100 beams. In addition, we extend the grid search to include the discount factor γ [sutton2018reinforcement], a DRL hyperparameter, and evaluate three possible values: 0.1, 0.5, and 0.9. A larger discount factor implies that the agent takes into account longer-term effects of its actions, whereas lower discount factors are related to greedier policies.
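As a reminder (a textbook definition [sutton2018reinforcement], not specific to our model), the discount factor weighs future rewards in the return the agent maximizes,

$$G_t = \sum_{k=0}^{T-t} \gamma^{k}\, r_{t+k}, \qquad \gamma \in [0, 1],$$

so with γ = 0.1 rewards more than a few timesteps ahead contribute almost nothing, whereas γ = 0.9 weighs long-term consequences heavily.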

                        100-beam training                    200-beam training
                     Grid            Tetris-like          Grid            Tetris-like
Reward    γ       With  Without    With  Without       With  Without    With  Without
Each      0.1     99.8  95.6       92.0  95.8          98.9  97.9       87.5  98.2
          0.5     98.9  96.2       90.5  96.6          97.6  94.4       80.8  90.9
          0.9     96.6  93.1       81.5  66.9          95.2  93.4       70.2  66.6
Final     0.1     47.5  53.9       68.8  67.4          23.8  20.9       52.5  45.5
          0.5     46.6  56.0       39.9  70.1          21.1  21.2       67.0  69.9
          0.9     54.9  67.2       71.9  63.6          25.8  22.6       60.0  67.1
MC        0.1     19.5  17.6       45.2  45.2           2.5   2.9       50.9  34.0
          0.5      9.1   3.0       67.9  36.1           2.6   1.9       41.1  20.2
          0.9      2.2   2.6       37.8  47.5           2.2   3.2       34.6  24.1
Table 3: Average number of successfully-assigned beams out of 100 in the test data. A random policy achieves 83.5.

The results of the enumeration analysis for the 72 different simulations are shown in Table 3, which contains the average number of successfully-assigned beams during test time (out of 100) for each combination of action, state, reward, discount factor γ, and number of training beams. The average is computed across the 8 parallel environments. For reference, we compare these results against a totally random policy, which achieves an average of 83.5 successfully-assigned beams.

It is observed that using the reward strategy Each leads to better outcomes, and using the Grid action space with the With state space does better on average. For our case, there is no apparent advantage in using reward strategies that delay the reward until the end of the episode or rely on rollout policies; individual timesteps provide enough information to guide the policy optimization. To better understand the impact of the modelling decisions, we also do several significance tests and further analyze the tradeoffs:

  • There is no advantage in training with 200 beams instead of 100 (p-value = 0.08). Therefore, training with more beams does not produce a more robust policy for our case. This is even less impactful when the discount factor is low (p = 0.90), since the policy behaves greedily regardless of how many beams remain to be assigned. The strategy for placing the first 100 beams is similar in both cases.

  • The discount factor affects the performance of the policy (the difference is statistically significant), but there is no significant performance difference between γ = 0.1 and γ = 0.5 (p = 0.83). The performance worsens when γ = 0.9. The learned policy relies more on a greedy behavior to be successful, which is consistent with other methods that have addressed the problem [PachlerdelaOsa2020].

  • With a low discount factor, the Each reward strategy, and the Grid action space, the state representation has a statistically significant effect on the performance. It is better to include the lookahead layer.

  • With a low discount factor, the Each reward strategy, and the Tetris-like action space, the state representation also has a statistically significant effect on the performance. Not including the lookahead layer in the state achieves better results in this case.

  • With a low discount factor and the Each reward strategy, using the Grid action space is a better alternative than using the Tetris-like one (the difference is statistically significant), since it allows more flexibility to make an assignment.

  • Overall, the Tetris-like action space appears to be less sensitive to the reward strategy chosen, especially when comparing across the simulations using the Monte-Carlo reward strategy.

The lookahead layer plays an important role together with the Grid action space, since the agent has total freedom to place a beam, and therefore being informed of the remaining assignments is beneficial. In contrast, in the case of the Tetris-like action space, the agent is limited by the initial random placement of the beam, and therefore the information of what is to come is not as useful.

5.2 Scalability analysis

Given the high dimensionality of future satellite systems, we also analyze the impact scalability has on our framework and run new simulations, this time for 500 beams, keeping the Each reward strategy and a low discount factor, and compare all 4 combinations of state and action representations. In this case we train each model for a total of 200k timesteps. These results are shown in Table 4 and can be compared against a totally random policy, which achieves 429.9 successfully-assigned beams.

The main conclusion of this analysis is that when the dimensionality of the problem increases, we can actually achieve a better outcome using the Tetris-like action space, as opposed to the 100-beam case. The Grid action space does even worse than random, which is due to the large number of actions (1,280 for this case) and an inappropriate exploration-exploitation balance. Overall, there is no impact of the state representation in this example (p = 0.64), but if we focus on the Tetris-like action space, there is (p = 0.001); we still do better without the lookahead layer. These results show that relying on a single representation for a specific problem might not be a robust strategy during operation. Understanding the limitations of specific representations is essential in order to make these systems deployable and, more importantly, reliable.

Action and state representation     Number of successfully-assigned beams
Grid and With                       320.1
Grid and Without                    328.1
Tetris-like and With                461.6
Tetris-like and Without             478.9
Table 4: Average number of successfully-assigned beams out of 500 in the test data using the Each reward function and a low discount factor. A random policy achieves 429.9.

5.3 Policy and Policy Optimization Algorithm

Two of the key elements of a DRL model listed in Section 2 and Table 2, the policy and the policy optimization algorithm, remain to be studied for our problem. These heavily rely on the progress made by the Deep Learning and Reinforcement Learning research communities. To address them, we now extend the analyses and compare the Deep Q-Network (DQN) algorithm and the CNN+MLP policy, both used in all simulations so far, against other alternatives.

In the case of the policy optimization algorithm, we choose the Proximal Policy Optimization (PPO) [schulman2017proximal] method, a policy gradient and on-policy algorithm, to be compared against DQN [mnih2015human], a Q-learning and off-policy algorithm. PPO optimizes the policy in an end-to-end fashion, by doing gradient ascent on its parameters according to a function of the cumulative reward over an episode. Additionally, it clips the objective's probability ratio to avoid drastic changes to the policy. DQN focuses on optimizing the prediction of the value of taking a certain action in a given state; it then uses these predictions to choose an action. It also stores the agent's experience and makes use of it over time by "replaying" past transitions and training on them. We refer to the original papers for a full description of each method. We use OpenAI's baselines [baselines] to implement each method.
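As a rough illustration of the difference between the two update rules (standard, simplified formulations in PyTorch; not the baselines code used in our experiments), DQN regresses its action-value predictions toward a bootstrapped target, while PPO maximizes a clipped surrogate objective over the policy's probability ratio:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, s, a, r, s_next, done, gamma):
    """DQN: regress Q(s, a) toward the bootstrapped TD target."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, target)

def ppo_loss(new_log_prob, old_log_prob, advantage, clip_eps=0.2):
    """PPO: clipped surrogate objective on the policy probability ratio."""
    ratio = torch.exp(new_log_prob - old_log_prob)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantage, clipped * advantage).mean()
```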

Regarding the policy network, we replace the MLP with a 256-unit Long Short-Term Memory (LSTM) [Hochreiter1997] network, which belongs to the class of recurrent networks. We compare a policy constituted by a CNN+MLP against a CNN+LSTM one to evaluate the impact of using a hidden state in our formulation. Recurrent neural networks take advantage of temporal dependencies in the data and therefore can potentially perform better in sequential problems.
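A possible shape of the recurrent variant is sketched below (assumed layer sizes, mirroring the CNN+MLP sketch above; the 256-unit LSTM hidden state would be carried across the timesteps of an episode):

```python
import torch.nn as nn

class CnnLstmPolicy(nn.Module):
    def __init__(self, in_layers, n_groups, n_slots, n_actions, hidden=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_layers, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.lstm = nn.LSTM(128 * n_groups * n_slots, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_actions)

    def forward(self, x_seq, state=None):
        # x_seq: (batch, time, layers, N_GROUPS, N_SLOTS)
        b, t = x_seq.shape[:2]
        feats = self.features(x_seq.flatten(0, 1)).view(b, t, -1)
        h, state = self.lstm(feats, state)
        return self.out(h), state          # per-timestep outputs and hidden state
```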

Table 5 shows the results of using the different policies and the different policy optimization algorithms for 4 different cases: the 100-beam scenario using the Grid action space with both state representations, the 500-beam scenario using the Tetris-like action space with both state representations, and two additional 1,000-beam and 2,000-beam scenarios for the Tetris-like action space and the Without state representation. For each case the random policy successfully assigns 83.5, 429.9, 799.7, and 1,137.7 beams on average, respectively. Since we have concluded that the Tetris-like action space better suits high-dimensionality scenarios, we want to explore its performance in the thousand-beam range as well. In all cases the Each reward strategy and a low discount factor are used. All simulations belonging to the same scenario are trained for an equal number of timesteps.

                                   DQN       PPO       PPO
Action and state representation    MLP       MLP       LSTM
100 beams (Random: 83.5)
  Grid, With                       99.8      94.9      94.9
  Grid, Without                    95.6      97.5      91.5
500 beams (Random: 429.9)
  Tetris-like, With                461.6     460.5     470.0
  Tetris-like, Without             478.9     478.0     465.1
1,000 beams (Random: 799.7)
  Tetris-like, Without             962.3     936.0     957.0
2,000 beams (Random: 1,137.7)
  Tetris-like, Without             1,746.0   1,139.0   967.0
Table 5: Average number of successfully-assigned beams in the test data when comparing different combinations of policy and policy optimization algorithm. The Each reward function and a low discount factor are used in all cases.

In the scenarios with 1,000 beams or fewer, we can observe that there is no significant advantage in using PPO over DQN, although DQN consistently outperforms PPO in each scenario. Looking at the 2,000-beam scenario, however, the performance difference emerges. Note that the 1,000-beam and 2,000-beam simulations use the same number of frequency groups and frequency slots. Therefore, in the 2,000-beam case, the performance worsening occurs during the assignment of the last 1,000 beams. During the last timesteps of such a high-dimensional scenario, having a prediction over multiple actions, as DQN does, proves to be a better approach, as opposed to PPO's method, which relies on the single action that the policy outputs.

The same occurs if we compare the CNN+MLP policy (MLP in the table) against the CNN+LSTM policy (LSTM in the table). Given the greedy behaviour of the policy, having the LSTM's hidden state does not offer any advantage when placing the last 1,000 beams in the 2,000-beam scenario. Although allowing more training iterations could help reduce the performance gap, these results show that the choices of the policy and the policy optimization algorithm are especially important under certain circumstances. For the FPD problem in the context of megaconstellations, we need to care about the thousand-beam range. To help visualize the performance of these models, Figure 7 shows the assignment results for one of the 8 environments in the 2,000-beam, DQN, and CNN+MLP scenario. A total of 1,766 beams are successfully assigned.

Figure 7: 2,000-beam frequency plan example obtained by the DQN-based agent using the CNN+MLP policy, the Tetris-like action space, the Without state representation, the Each reward function, and a low discount factor. Gold cells indicate non-successful assignments (i.e., a constraint is violated); any other color indicates a successful assignment. 1,766 beams are successfully assigned in this case.

5.4 Non-stationarity

We now address the issue of non-stationarity and its impact on trained DRL models. In the context of the FPD problem, we are interested in the phenomena that might change from training to operation, such as the bandwidth distribution, the number of total beams, or the constraint distribution. We choose the bandwidth distribution for our analyses. Specifically, we study how the performance changes when the test environment includes beams with more average bandwidth demand than those in the training set. We carry out simulations in which the average bandwidth demand per beam in the test set is 2-times and 4-times larger. We do this for the 100-beam scenario and the Grid action space, as well as for the 500-beam scenario and the Tetris-like action space. For both cases, we compare both state representations and also run the random policy. The outcome of this analysis can be found in Table 6.

Action, State representation     Same demand   2-times demand   4-times demand
100 beams
  Grid, With                     99.8          84.9             52.6
  Grid, Without                  95.6          84.4             65.3
  Random                         83.5          81.3             71.9
500 beams
  Tetris-like, With              461.6         453.0            249.0
  Tetris-like, Without           478.9         434.0            337.0
  Random                         429.9         400.0            227.0
Table 6: Average number of successfully-assigned beams in the test data when comparing different demand imbalances between train and test data. The Each reward function, the DQN algorithm, the CNN+MLP policy, and a low discount factor are used in all cases.

The results show that, as expected based on other DRL studies [Dulac-Arnold2020], non-stationarity does negatively affect the performance of trained DRL models for the FPD problem. In our case, the degradation correlates with the difference in average bandwidth demand per beam between the train and test data. In the 100-beam scenario, the performance gap between the random policy and the agent is reduced for the 2-times case, and then random performs better in the 4-times case.

Something similar occurs in the 500-beam scenario, although the random policy does not beat the agent in any case. This is a consequence of the bigger search space and of the more frequent use of the up and down actions in the Tetris-like action space, which change the frequency group assignment rather than the assignment within the same frequency group. If we analyze the actions taken during test time for this latter scenario, the up and down actions were taken, on average, 2.5 times more often than the left and right actions. Furthermore, non-stationarity affects the usefulness of the lookahead layer for the Tetris-like action space. In the analysis from Table 4 we saw that not using it was a better alternative, whereas in the 2-times demand case the agent does better with it. There is a certain advantage in knowing what is to come in non-stationary scenarios, although there is a limit to how beneficial the lookahead layer is, as observed in the 4-times demand case.

There are different alternatives to achieve robustness in non-stationary environments. The first involves, after identifying the sources of non-stationarity, devising adequate training datasets that include episodes capturing those sources. In this case, it is essential to use a state representation that successfully incorporates information on the condition of each source. Secondly, we can make use of algorithms that refine the agent's actions under contingency scenarios. These algorithms could guarantee a more robust behavior at the expense of computing time, such as in [garau20a], where a DRL model is combined with a subsequent Genetic Algorithm to maximize robustness. Finally, we can rely not only on a single agent but on an ensemble of agents that specialize in different scenarios and are trained in precisely different ways. In all cases, retraining the models over time from experience collected during operation is a good practice to capture any possible change in the environment.

5.5 Impact on real-world scenarios

The proposed models have shown quantitatively good performance in the majority of the scenarios considered. On average, the agent successfully assigns 99.8% of the beams in the 100-beam case, 95.8% in the 500-beam case, 96.2% in the 1,000-beam case, and 87.3% in the 2,000-beam case. The main advantage of this approach is that evaluating a neural network is substantially faster than relying on other algorithms such as metaheuristics, while still showing a good performance. This aligns with what most applied DRL studies conclude. To handle the constraints that are not fulfilled, we could add a subsequent "repairing" algorithm that addresses the conflicting beams, which make up 1 to 15% of the total based on our numbers.
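A hypothetical sketch of such a repairing pass is shown below (a greedy illustration, not a method evaluated in this paper; violates_any and candidate_assignments are assumed helpers):

```python
def repair(plan, beams, violates_any, candidate_assignments):
    """Greedily reassign each conflicting beam to the first candidate that
    removes its constraint violations, leaving the rest of the plan untouched."""
    for beam in beams:
        if violates_any(plan, beam):
            for candidate in candidate_assignments(beam):
                trial = {**plan, beam: candidate}
                if not violates_any(trial, beam):
                    plan = trial
                    break
    return plan
```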

However, the main consideration behind these numbers is that they correspond to different models that use different state and action representations. Unlike many applied DRL studies, we have regarded representations as additional hyperparameters at every stage of our analyses. Given the importance of high dimensionality in real-world operations, we have also focused on cases with hundreds to thousands of beams, as opposed to works that exclusively consider cases with fewer than 100 beams. We could not have drawn our conclusions without making these decisions for our analyses. When studying a specific DRL model with defined hyperparameters, achieving or beating state-of-the-art performance is as important as understanding the limitations of those hyperparameters and testing the models under real-world conditions.

We have also studied the limitations resulting from environment non-stationarity. We have validated the Machine Learning community's findings on the negative effects of non-stationary sources in the environment. It is not common to find this kind of analysis in domain-specific DRL papers. If the goal is to make DRL deployable, it is essential that we address contingency cases in our problems, either by capturing the sources of non-stationarity in the training datasets, or by devising strategies to mitigate the impact during real-time operations. Otherwise we fail to meet the requirements of operating in real-world environments and, more importantly, reduce the reliability of our systems.

6 Conclusions

In this paper we have addressed the design and implementation tradeoffs of a Deep Reinforcement Learning (DRL) agent that carries out frequency assignment tasks in a multibeam satellite constellation. DRL models are getting attention in the aerospace community mostly due to their fast decision-making and their adaptability to complex non-linear optimization problems. In our work we have chosen the Frequency Plan Design (FPD) problem as a use case and identified six elements that drive the performance of DRL models: the state representation, the action representation, the reward function, the policy, the policy optimization algorithm, and the training strategy. We have defined multiple variations for each of these elements and compared the performance differences in separate scenarios. We have put a special focus on high-dimensionality and non-stationarity, as these are two of the main phenomena present in the upcoming satellite communications landscape.

The results show that DRL is an adequate method to address the FPD problem, since it successfully assigns 85% to 99% of the beams for cases with 100 to 2,000 beams. However, no single state-action combination outperforms the rest for all cases. When the dimensionality of the problem is low, the Grid action space and the With state representation perform better. In contrast, the Tetris-like action space and the Without state representation are a better option for high-dimensional scenarios. These findings validate our hypothesis that representation should be strongly considered as an additional hyperparameter in applied DRL studies. We have also seen that using the Deep Q-Network algorithm in combination with a convolutional neural network and a fully-connected neural network as the policy works best for all scenarios, being especially advantageous for the 2,000-beam case. Regardless of the scenario, the obtained policy has shown a greedy behavior that benefits from informative rewards at each timestep.

At the end of the paper, we have reflected on different considerations that are usually left out of applied DRL studies. Our analyses on the effect of non-stationarity have helped motivate the discussion. When the average bandwidth demand per beam substantially differs between the train and test data, the number of beams successfully assigned by the agent decreases, performing worse than random in some cases. We emphasize the need to identify the potential sources of non-stationarity, understand their potential effects on the DRL model during real operation cycles, and propose solutions to mitigate any negative influences. A compromise between topping performance metrics and characterizing the limitations of the models and their hyperparameters is the only way to advance the successful deployment of DRL and other Machine Learning-based technologies.

Acknowledgements.
This work was supported by SES. The authors would like to thank SES for their input to this paper and their financial support. The simulations are based on a representative example of a MEO constellation system, which was provided by SES [SES]. The authors would also like to thank Iñigo del Portillo and Markus Guerster, for contributing to the initial discussions that motivated this work; and Skylar Eiskowitz and Nils Pachler, for reviewing the manuscript.

References