# Solving Continual Combinatorial Selection via Deep Reinforcement Learning

We consider the Markov Decision Process (MDP) of selecting a subset of items at each step, termed the Select-MDP (S-MDP). The large state and action spaces of S-MDPs make them intractable to solve with typical reinforcement learning (RL) algorithms, especially when the number of items is huge. In this paper, we present a deep RL algorithm that addresses this issue by adopting the following key ideas. First, we convert the original S-MDP into an Iterative Select-MDP (IS-MDP), which is equivalent to the S-MDP in terms of optimal actions. The IS-MDP decomposes a joint action of selecting K items simultaneously into K iterative selections, which reduces the number of actions at the expense of an exponential increase in the number of states. Second, we overcome this state space explosion by exploiting a special symmetry in IS-MDPs with novel weight-shared Q-networks, which provably maintain sufficient expressive power. Various experiments demonstrate that our approach works well even when the item space is large and that it scales to environments with item spaces different from those used in training.


## 1 Introduction

Imagine yourself managing a football team in a league of many matches. Your goal is to maximize the total number of matches won during the league. For each match, you decide a lineup (action) by selecting $K$ players among $N$ candidates to participate in it and allocating one of the positions (command) to each of them, with possible duplication. You can observe a collection (state) of the current status (information) of each candidate player (item). During the match, you cannot intervene until you receive the result (reward), as well as the changed collection of the statuses (next state) of the players, which are stochastically determined by a transition probability function. In order to win the long league, you should pick a proper combination of selected players and their positions that accounts not only for the myopic result of the next match but also for a long-term plan such as the rotation of the members. We model an MDP for these kinds of problems, termed Select-MDP (S-MDP), where an agent needs to make combinatorial selections sequentially.

There are many applications that can be formulated as an S-MDP, including recommendation systems [Ricci et al.2015, Zheng et al.2018], contextual combinatorial semi-bandits [Qin et al.2014, Li et al.2016], mobile network scheduling [Kushner and Whiting2004], and fully-cooperative multi-agent systems controlled by a centralized agent [Usunier et al.2017] (when ). However, learning a good policy is challenging because the state and action spaces increase exponentially in $N$ and $K$. For example, our experiment shows that the vanilla DQN [Mnih et al.2015], which was proposed to tackle the large state space issue, fails to learn the Q-function in our test environment of , even for the simplest case of . This motivates the research on a scalable RL algorithm for tasks modeled by an S-MDP.

In this paper, we present a novel DQN-based RL algorithm for S-MDPs by adopting a synergic combination of the following two design ideas:

1. For a given S-MDP, we convert it into a divided but equivalent one, called Iterative Select-MDP (IS-MDP), where the agent iteratively selects an (item, command) pair one by one during $K$ steps rather than selecting all at once. IS-MDP significantly relieves the complexity of the joint action space per state in S-MDP; the agent only needs to evaluate actions during consecutive steps in IS-MDP, while it considers actions for each step in S-MDP. We design $K$-cascaded deep Q-networks for IS-MDP, where each Q-network selects an item with an assigned command while considering the selections made by the previous cascaded networks.

2. Although we significantly reduce the per-state action space in IS-MDP, the state space is still large as $N$ or $K$ grows. To have scalable and fast training, we consider two levels of weight parameter sharing for Q-networks: intra-sharing (I-sharing) and unified-sharing (U-sharing). In practice, we propose to use a mixture of I- and U-sharing, which we call progressive sharing (P-sharing), by starting from a single parameter set as in U-sharing and then progressively increasing the number of parameter sets, approaching that of I-sharing.

The superiority of our ideas is discussed and evaluated in two ways. First, despite the drastic parameter reduction, we theoretically claim that I-sharing does not hurt the expressive power for IS-MDP by proving (i) relative local optimality and (ii) universality of I-sharing. Note that this analytical result is not limited to a Q-function approximator in RL, but also applies to any neural network with parameter sharing in other contexts such as supervised learning. Second, we evaluate our approach on two self-designed S-MDP environments (circle selection and selective predator-prey) and observe a significant performance improvement, especially with large $N$ (e.g., ), over other baselines. Moreover, the trained parameters can generalize to other environments of much larger item sizes without additional training, where we use the trained parameters in for those in .

### 1.1 Related Work

#### Combinatorial Optimization via RL

Recent works on deep RL have addressed NP-hard combinatorial optimization problems on graphs [Dai et al.2017], Traveling Salesman problems [Kool et al.2019], and recommendation systems [Chen et al.2018, Deudon et al.2018]. Many works on combinatorial optimization problems do not consider the future state after selecting a combination of items and some other commands. [Chen et al.2018] suggests similar cascaded Q-networks, but without efficient weight sharing, which is crucial in handling large-dimensional items. [Usunier et al.2017] suggests a centralized MARL algorithm where the agent randomly selects an item first and then considers the command. Independent Deep Q-network (IDQN) [Tampuu et al.2017] is a MARL algorithm where each item independently chooses its command using its own Q-network. To summarize, our contribution is to extend and integrate those combinatorial optimization problems successfully and to provide a scalable RL algorithm using weight-shared Q-networks.

#### Parameter Sharing on Neural Networks and Analysis

Parameter-shared neural networks have been studied on various structured data domains such as graphs [Kipf and Welling2017] and sets [Qi et al.2017]. These networks not only save significant memory and computational cost but also usually perform better than networks without parameter sharing. For the case of set-structured data, there are two major categories: equivariant [Ravanbakhsh et al.2017a, Jason and Devon R Graham2018] and invariant networks [Qi et al.2017, Zaheer et al.2017, Maron et al.2019]. In this paper, we develop a parameter-shared network (I-sharing) which has both permutation equivariant and invariant properties. Empirical successes of parameter sharing have led many works to delve into its mathematical properties. [Qi et al.2017, Zaheer et al.2017, Maron et al.2019] show the universality of invariant networks for various symmetries. As for equivariant networks, a relatively small number of works analyze their performance. [Ravanbakhsh et al.2017b, Zaheer et al.2017, Jason and Devon R Graham2018] find necessary and sufficient conditions for equivariant linear layers. [Yarotsky2018] designs a universal equivariant network based on polynomial layers. However, these polynomial layers are different from the widely used linear layers. In our paper, we prove two theorems which mathematically guarantee the performance of our permutation equi-invariant networks in different ways. Both theorems can be applied to other similar related works.

## 2 Preliminary

### 2.1 Iterative Select-MDP (IS-MDP)

Given an S-MDP, we formally describe an IS-MDP as a tuple that makes a selection of items and corresponding commands in an S-MDP through consecutive selections. Fig. 3 shows an example of the conversion from an S-MDP to its equivalent IS-MDP. In IS-MDP, given a tuple of the -item information , with being the information of the item , the agent selects one item and assigns a command at every ‘phase’ for . After phases, it forms a joint selection of items and commands, and a probabilistic transition of the -item information and the associated reward are given.

To elaborate, at each phase , the agent observes a state which consists of a set of pairs of information and command which are selected in prior phases, denoted as , with being a pair selected in the th phase, and a tuple of information of the unselected items up to phase , denoted as . From the observation at phase , the agent selects an item among the unselected items and assigns a command , i.e., a feasible action space for state is given by , where represents a selection . As a result, the state and action spaces of an IS-MDP are given by and , respectively. We note that any state in an S-MDP belongs to , i.e., the th phase. In an IS-MDP, action for state results in the next state (footnote 2: we use as and .) and a reward ,

 k

Recall that is the transition probability of the S-MDP. The decomposition of a joint action in S-MDPs (i.e., selecting $K$ items at once) into $K$ consecutive selections in IS-MDPs is equivalent in terms of the optimal policy [Maes et al.2009]. An important advantage of the decomposition is that IS-MDPs have an action space of size while the action space of S-MDPs is .
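To make the gap between the two action spaces concrete, the following sketch counts actions under two assumptions that are ours rather than stated above: a joint S-MDP action picks an unordered set of K out of N items and gives each one of C commands, and phase k of the IS-MDP (counted from 0) offers every unselected item paired with every command. The function names and the example sizes are hypothetical.

```python
from math import comb


def joint_action_count(n_items, k_select, n_commands):
    """Assumed S-MDP count: choose K of N items, then one of C commands for each."""
    return comb(n_items, k_select) * n_commands ** k_select


def per_phase_action_counts(n_items, k_select, n_commands):
    """Assumed IS-MDP counts: phase k offers (N - k) unselected items times C commands."""
    return [(n_items - k) * n_commands for k in range(k_select)]


# Hypothetical sizes chosen only for illustration.
n_items, k_select, n_commands = 50, 6, 5
print("joint actions per step:", joint_action_count(n_items, k_select, n_commands))
print("actions per phase:", per_phase_action_counts(n_items, k_select, n_commands))
```

Under these assumptions the per-phase counts grow linearly in the number of items, while the joint count grows combinatorially, which is the trade-off the decomposition exploits.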

### 2.2 Deep Q-network (DQN)

We provide background on the DQN [Mnih et al.2015], one of the standard deep RL algorithms, whose key ideas such as the target network and replay buffer are also used in our proposed method. The goal of RL is to learn an optimal policy that maximizes the expected discounted return. We denote the optimal action-value function (Q-function) under the optimal policy by $Q^\star(s,a)$. The deep Q-network (DQN) parameterizes and approximates the optimal Q-function using the so-called Q-network $Q(s,a;\theta)$, i.e., a deep neural network with a weight parameter vector $\theta$. In DQN, the parameter $\theta$ is learned by sampling minibatches of experience $(s,a,r,s')$ from the replay buffer and using the following loss function:

$$l(\theta) = \Big(Q(s,a;\theta) - \big(r + \gamma \max_{a'} Q(s',a';\theta')\big)\Big)^2, \tag{2}$$

where $\theta'$ is the target parameter, which follows the main parameter $\theta$ slowly. It is common to approximate $Q(s;\theta)$ rather than $Q(s,a;\theta)$ using a neural network so that all action values can be easily computed at once.
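As a minimal sketch of the loss in (2), the snippet below computes the squared temporal-difference error of the standard DQN [Mnih et al.2015] on a small minibatch. The linear "Q-network" `linear_q`, the averaging over the minibatch, and the omission of terminal-state handling are simplifying assumptions made for illustration; they are not details given in the text.

```python
import numpy as np


def dqn_loss(q_main, q_target, s, a, r, s_next, gamma=0.99):
    """Mean squared TD error of a minibatch, with the target computed by q_target."""
    batch = np.arange(len(a))
    q_sa = q_main(s)[batch, a]                         # Q(s, a; theta)
    target = r + gamma * q_target(s_next).max(axis=1)  # r + gamma * max_a' Q(s', a'; theta')
    return float(np.mean((q_sa - target) ** 2))


def linear_q(weights):
    """Hypothetical stand-in for a Q-network: maps states (B, d) to Q-values (B, A)."""
    return lambda states: states @ weights


rng = np.random.default_rng(0)
s, s_next = rng.normal(size=(3, 2)), rng.normal(size=(3, 2))
a = np.array([0, 2, 1])
r = np.array([1.0, 2.0, 3.0])
theta = rng.normal(size=(2, 4))  # main parameters
theta_target = theta.copy()      # target parameters, refreshed only occasionally in practice
print(dqn_loss(linear_q(theta), linear_q(theta_target), s, a, r, s_next))
```

Returning all action values from one forward pass, as `linear_q` does, mirrors the remark above that the network outputs the values of every action at once.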

## 3 Methods

In this section, we present a symmetric property of IS-MDP, which is referred to as Equi-Invariance (EI), and propose an efficient RL algorithm to solve IS-MDP by constructing cascaded Q-networks with two levels of parameter sharing.

### 3.1 IS-MDP: Equi-Invariance

As mentioned in Sec. 2.1, a state at phase includes two sets and of observations, so that we have some permutation properties related to the ordering of elements in each set, i.e., for all , , and ,

$$P(s', r \mid s, a) \equiv P(s', r \mid \sigma_s(s), \sigma_i(a)). \tag{3}$$

We denote as a permutation of a state at phase , which is defined as

$$\sigma_s(s) := (\sigma_x(x), \sigma_i(i)), \tag{4}$$

where is a group of permutations of a set with elements. From (3), we can easily deduce that if the action is the optimal action for , then for state , an optimal policy should know that a permuted action is also optimal. As a result, we have ,

$$Q^\star(s, a) = Q^\star(\sigma_s(s), \sigma_i(a)). \tag{5}$$

Focusing on Q-value function , as discussed in Sec. 2.2, a permutation of a state permutes the output of the function according to the permutation . In other words, a state and the permutation thereof, , have equi-invariant optimal Q-value function . This is stated in the following proposition which is a rewritten form of (5).

###### Proposition 1 (Equi-Invariance of IS-MDP).

In IS-MDP, the optimal Q-function of any state is invariant to the permutation of a set and equivariant to the permutation of a set , i.e. for any permutation ,

$$Q^\star(\sigma_s(s)) = \sigma_i(Q^\star(s)). \tag{6}$$

As we will discuss later, this EI property in (6) plays a critical role in reducing state and action spaces by considering pairs and permutations thereof to be the same. We follow the idea in [Zinkevich and Balch2001] to prove Proposition 1.
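The property in (6) can be checked numerically on a toy function. In the sketch below, `toy_q` is a hypothetical stand-in for a Q-function with a single command and scalar item features: it returns one value per unselected item and uses the selected items only through a symmetric summary. The second function, `order_dependent_q`, is a deliberately broken counterexample. Neither function comes from the paper.

```python
import numpy as np


def toy_q(x, i):
    """Hypothetical equi-invariant function: one value per unselected item.

    It depends on x and on the other unselected items only through sums and means.
    """
    return np.tanh(i + x.sum() + 0.5 * i.mean())


def order_dependent_q(x, i):
    """Counterexample: reading x[0] makes the output depend on the order of x."""
    return i + x[0]


x = np.array([0.5, -1.0])            # selected items
i = np.array([0.3, 1.2, -0.7, 2.0])  # unselected items
perm_x = [1, 0]                      # a permutation of the selected set
perm_i = [2, 0, 3, 1]                # a permutation of the unselected set

# Left side of (6): evaluate on the permuted state.
lhs = toy_q(x[perm_x], i[perm_i])
# Right side of (6): permute the outputs of the original state.
rhs = toy_q(x, i)[perm_i]
print("toy_q satisfies (6):", np.allclose(lhs, rhs))

bad_lhs = order_dependent_q(x[perm_x], i[perm_i])
bad_rhs = order_dependent_q(x, i)[perm_i]
print("order_dependent_q satisfies (6):", np.allclose(bad_lhs, bad_rhs))
```

The first check passes because permuting the selected items leaves the output unchanged (invariance) and permuting the unselected items permutes the output in the same way (equivariance).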

### 3.2 Iterative Select Q-learning (ISQ)

As mentioned in Sec. 2.1, the dimensions of state and action spaces differ over phases. In particular, as the phase progresses, the set of the state increases while the set and the action space decrease. Recall that the action space of state is . Then, Q-value function at each phase , denoted as for , is characterized by a mapping from a state space to , where the -th output element corresponds to the value of .

To solve IS-MDP using a DQN-based scheme, we construct deep Q-networks that are cascaded, where the th Q-network, denoted as , approximates the Q-value function with a learnable parameter vector . We denote by and the collections of the main and target weight vectors for all -cascaded Q-networks, respectively. With these -cascaded Q-networks, DQN-based scheme can be applied to each Q-network for using the associated loss function as in (2) with and (since ), which we name Iterative Select Q-learning (ISQ).

Clearly, a naive ISQ algorithm would have training challenges due to the large scale of and since (i) the number of parameters in each network increases as increases and (ii) the size of the parameter set also increases as increases. To overcome these, we propose parameter sharing ideas, which are described next.

Figure 4: A simple example of the parameter-shared Q-networks $Q_k(\cdot;\theta_k)$ when $K=3$, $N=4$, $|C|=1$. Red and blue colored nodes represent the nodes equivariant to the selected items $x$ and the unselected items $i$, respectively. Each black node represents the Q-value for selecting the corresponding (item, command) pair.

#### Intra Parameter Sharing (I-sharing)

To overcome the parameter explosion for large in each Q-network, we propose a parameter sharing scheme, called intra parameter sharing (I-sharing). Focusing on the th Q-network without loss of generality, the Q-network with I-sharing has a reduced parameter vector (footnote 3: to distinguish the parameters of Q-networks with and without I-sharing, we use notations and for each case, respectively), yet it satisfies the EI property in (6), as discussed shortly.

The Q-network with I-sharing is a multi-layered neural network constructed by stacking two types of parameter-shared layers: and . As illustrated in Fig. 4, where the same colored and dashed weights are tied together, the layer is designed to preserve an equivariance of the permutation , while the layer is designed to satisfy invariance of as well as equivariance of , i.e.,

$$\phi_k(\sigma_s(x,i)) = \sigma_s(\phi_k(x,i)), \qquad \psi_k(\sigma_s(x,i)) = \sigma_i(\psi_k(x,i)).$$

Then, we construct the Q-network with I-sharing by first stacking multiple layers of followed by a single layer of as

$$Q_k(s;\theta_k) := \psi_k \circ \phi_k \circ \cdots \circ \phi_k(s),$$

where is properly set to have tied values. Since composition of the permutation equivariant/invariant layers preserves the permutation properties, we obtain the following EI property

$$Q_k(\sigma_s(x,i);\theta_k) = \sigma_i(Q_k(x,i;\theta_k)).$$

The ISQ algorithm with I-sharing, termed ISQ-I, achieves a significant reduction of the number of parameters from to , where is the collection of the parameters. We refer the readers to our technical report for a more mathematical description.

#### Unified Parameter Sharing (U-sharing)

We propose another level of weight sharing for ISQ, called unified parameter sharing (U-sharing). We observe that each I-shared Q-network has a fixed number of parameters regardless of phase . This is well described in Fig. 4, where the number of different edges is the same in and . From this observation, we additionally share among the different Q-networks , i.e., . U-sharing enables the reduction of the number of weights from for to for . Our intuition for U-sharing is that since the order of the selected items does not affect the transition of the S-MDP, an item which must be selected during phases has the same Q-values in every phase (footnote 5: note that we set the discount factor during except the final phase ). This implies that the weight vectors may also have similar values. However, overly aggressive sharing, such as sharing all the weights, may significantly reduce expressive power.

#### Progressive Parameter Sharing (P-sharing)

To take advantage of both I- and U-sharing, we propose a combined method called progressive parameter sharing (P-sharing). In P-sharing, we start with a single parameter set (as in U-sharing) and then progressively double the number of sets until it reaches $K$ (the same as I-sharing). The Q-networks with nearby phases ( and ) tend to share a parameter set longer, as visualized in Fig. 5, which we believe is because they have a similar criterion. In the early unstable stage of learning, the Q-networks are trained sample-efficiently as they exploit the advantages of U-sharing. As training continues, the Q-networks can be trained more elaborately, with more accurate expressive power, by increasing the number of parameter sets. In P-sharing, the total number of weight parameters ranges from to during training.

Figure 5: Illustration of P-sharing for $K=4$. In the beginning, all Q-networks share the same weights. As the training progresses, we double the number of parameter sets until each Q-network $Q_k$ is trained with its own parameter vector $\theta_k$.
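The doubling schedule can be sketched as a mapping from phases to parameter-set indices. The contiguous-block rule used below (nearby phases fall into the same set) is our assumption based on the description and Fig. 5. The function names are hypothetical, and when a doubling takes place during training is not modelled.

```python
def p_sharing_groups(num_phases, num_sets):
    """Assign each phase k = 0..K-1 to one of num_sets contiguous parameter sets (assumed rule)."""
    if not 1 <= num_sets <= num_phases:
        raise ValueError("num_sets must lie between 1 and num_phases")
    return [k * num_sets // num_phases for k in range(num_phases)]


def p_sharing_schedule(num_phases):
    """List the successive groupings, doubling the number of sets until it reaches K."""
    schedule, num_sets = [], 1
    while True:
        schedule.append(p_sharing_groups(num_phases, num_sets))
        if num_sets == num_phases:
            return schedule
        num_sets = min(2 * num_sets, num_phases)


for stage, groups in enumerate(p_sharing_schedule(4)):
    print("stage", stage, "phase -> parameter set:", groups)
```

For four phases, the first stage corresponds to U-sharing (one set for all phases) and the last stage to I-sharing (one set per phase), with an intermediate stage in which phases 0 and 1 share one set and phases 2 and 3 share another.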

## 4 Intra-Sharing: Relative Local Optimality and Universal Approximation

One may naturally raise the question of whether the I-shared Q-network has enough expressive power to represent the optimal Q-function of the IS-MDP despite the large reduction in the number of parameters from to . In this section, we present two theorems that show has enough expressive power to approximate with the EI property in (6). Theorem 1 states how I-sharing affects local optimality, and Theorem 2 states whether the network still satisfies universal approximation even with the equi-invariance property. Due to space constraints, we present the proofs of the theorems in the technical report. We comment that both theorems can be directly applied to other similar weight-shared neural networks, e.g., [Qi et al.2017, Zaheer et al.2017, Ravanbakhsh et al.2017b]. For presentational convenience, we denote as , as , and as .

#### Relative Local Optimality

We compare the expressive power of the I-shared Q-network and the vanilla Q-network of the same structure when approximating a function that satisfies the EI property. Let and denote weight vector spaces for and , respectively. Since both and have the same network structure, we can define a projection mapping such that for any . Now, we introduce a loss surface function of the weight parameter vector :

$$l_\Omega(\omega) := \sum_{s \in B} |Q_\omega(s) - Q^\star(s)|^2,$$

where is a batch of state samples at phase and denotes the true Q-values to be approximated. Note that this loss surface is different from the loss function of DQN in (2). However, from the EI property in (6), we can augment additional true state samples and the true Q-values by using equivalent states for all ,

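$$L_\Omega(\omega) := \sum_{\sigma_s \in \mathbf{S}_k \times \mathbf{S}_{N-k}} \Big( \sum_{s \in B} |Q_\omega(\sigma_s(s)) - Q^\star(\sigma_s(s))|^2 \Big).$$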

We denote the loss surface in the weight shared parameter space .

###### Theorem 1 (Relative Local Optimality).

If is a local optimal parameter vector of the loss surface , then the projected parameter is also a local optimal point of .

It is notoriously hard to find a local optimal point by using gradient descent methods because of the many saddle points in high-dimensional deep neural networks [Dauphin et al.2014]. However, we are able to efficiently seek a local optimal parameter on the smaller-dimensional space , rather than exploring . The quality of the local optimal parameters found is reported to be reasonable, in that most local optimal parameters give nearly optimal performance in high-dimensional neural networks [Dauphin et al.2014, Kawaguchi2016, Laurent and Brecht2018]. To summarize, Theorem 1 implies that has similar expressive power to if both have the same architecture.

#### Universal Approximation

We now present a result related to the universality of when it approximates .

###### Theorem 2 (Universal Approximation).

Let satisfy the EI property. If the domain spaces and are compact, then for any , there exists a -layered I-shared neural network with a finite number of neurons, which satisfies

$$\forall s \in \mathcal{S}_k, \quad |Q^\star(s) - Q_\theta(s)| < \epsilon.$$

Both Theorems 1 and 2 characterize the expressive power of the I-shared neural network for approximating an equi-invariant function. However, they differ in the sense that Theorem 1 directly compares the expressive power of the I-shared network to that of the network without parameter sharing, whereas Theorem 2 states the potential power of the I-shared network: any function with the EI property can be approximated well as the number of nodes in the hidden layers increases sufficiently.

## 5 Simulations

### 5.1 Environments and Tested Algorithms

#### Circle Selection (CS)

In the Circle Selection (CS) task, there are selectable and unselectable circles, where each circle moves randomly and its radius increases with random noise. The agent observes the positions and radius values of all the circles as a state, selects circles among the selectable ones, and chooses one out of the commands: move up, down, left, right, or stay. Then, the agent receives a negative or zero reward if the selected circles overlap with unselectable or other selected circles, respectively; otherwise, it can receive a positive reward. The amount of reward is related to the sum of the selected circles' areas. All selected circles and any overlapping unselectable circle are replaced by new circles, which are initialized at random locations with a small initial radius. Therefore, the agent needs to avoid overlaps by carefully choosing circles and their movement commands.

#### Selective Predator-Prey (PP)

In this task, multiple predators capture randomly moving prey. The agent observes the positions of all the predators and prey, selects predators, and assigns the commands as in the CS task. Only selected predators can move according to the assigned command and capture the prey. The number of prey caught by the predators is given as a reward, where a prey is caught if and only if more than two predators catch it simultaneously.

#### Tested Algorithms and Setup

We compare the three variants of ISQ (ISQ-I, ISQ-U, and ISQ-P) with three DQN-based schemes: (i) a vanilla DQN [Mnih et al.2015], (ii) a sorting DQN that reduces the state space by sorting the order of items based on a pre-defined rule, and (iii) a myopic DQN which learns to maximize the instantaneous reward for the current step, but follows all other ideas of ISQ. We also consider three other baselines motivated by value-based MARL algorithms in [Tampuu et al.2017, Usunier et al.2017, Chen et al.2018]: Independent DQN (IDQN), Random-Select DQN (RSQ), and Element-wise DQN (EQ). In IDQN, each item observes the whole state and has its own Q-function with an action space equal to . In RSQ, the agent randomly selects items first and chooses commands from their Q-functions. EQ uses only local information to calculate each Q-value. We evaluate the models by averaging rewards with independent episodes. The shaded area in each plot indicates confidence intervals in different trials. All details of the hyperparameters are provided in our technical report.

### 5.2 Single Item Selection

To see the impact of I-sharing, we consider the CS task with , , and , and compare ISQ-I with a vanilla DQN and a sorting DQN. Fig. (a) illustrates the learning performance of the algorithms for , and .

#### Impact of I-sharing

The vanilla DQN performs well when , but it fails to learn when and because it does not account for equi-invariance in the IS-MDP. Compared to the vanilla DQN, the sorting DQN learns better policies under large $N$ by reducing the state space through sorting. However, ISQ-I still outperforms the sorting DQN when $N$ is large. This result stems from the fact that the sorting DQN is strongly affected by the choice of the sorting rule. In contrast, ISQ-I exploits equi-invariance with the I-shared Q-network, so it can outperform the other baselines for all values of $N$, especially when $N$ is large. The result is consistent with our mathematical analysis in Theorem 1 and Theorem 2, which guarantee the expressive power of the I-shared Q-network for IS-MDP.

### 5.3 Multiple Item Selection

To exploit the symmetry in the tasks, we apply I-sharing to all the baselines. For the CS task, the environment settings are , and . For the PP task, we test with predators () and prey in a grid world for . The learning curves in both the CS task (Fig. 9) and the PP task (Fig. 13) clearly show that ISQ-I outperforms the other baselines (except other ISQ variants) in most of the scenarios even though we modify all the baselines to apply I-sharing. This demonstrates that ISQ successfully considers the requisites for S-MDP or IS-MDP: a combination of the selected items, command assignment, and the future state after the combinatorial selection.

#### Power of ISQ: Proper Selection

Though I-shared Q-networks give all the baselines the ability to handle large $N$, ISQs outperform all others in every task. This is because only ISQ can handle all the requisites to compute correct Q-values. IDQN and RSQ perform poorly in many tasks since they do not select the items intelligently. RSQ performs much worse than ISQ when in both tasks since it only focuses on assigning proper commands but not on selecting good items. Even when (Fig. (c)), ISQ-I is better than RSQ since RSQ needs to explore all combinations of selection, while ISQ-I only needs to explore specific combinations. The other baselines show the importance of future prediction, action selection, and full observation. First, MQ (the myopic DQN) shares the parameters like ISQ-I, but it only considers a reward for the current state. Their difference in performance shows the gap between considering and not considering future prediction in both tasks. In addition, ISQ-I (1) only needs to select items but still has lower performance compared to ISQ-I. This shows that ISQ-I is able to exploit the large action space. Finally, EQ estimates Q-functions using each item's information. The performance gap between EQ and ISQ-I shows the effect of considering full observation in calculating Q-values.

#### Impact of P-sharing

By sharing the parameters in the beginning, ISQ-P learns significantly faster than ISQ-I in all cases, as illustrated by the learning curves in Fig. 9 and 13. ISQ-P also outperforms ISQ-U in the PP task because of the increase in the number of parameters at the end of the training process. With these advantages, ISQ-P achieves two goals at once: fast training in the early stage and good final performance.

#### Power of ISQ: Generalization Capability

Another advantage of ISQ is powerful generality under environments with different number of items, which is important in real situations. When the number of items changes, a typical Q-network needs to be trained again. However, ISQ has a fixed number of parameters regardless of . Therefore, we can re-use the trained for an item size to re-construct another model for a different item size . From the experiments of ISQ-P on different CS scenarios, we observe that for the case , ISQ-P shows an performance compared to . In contrast, for the case and , it shows an performance compared to and . These are remarkable results since the numbers of the items are fourfold different (). We conjecture that ISQ can learn a policy efficiently in an environment with a small number of items and transfer the knowledge to a different and more difficult environment with a large number of items.

## 6 Conclusion

In this paper, we develop a highly efficient and scalable algorithm to solve continual combinatorial selection by converting the original MDP into an equivalent MDP and leveraging two levels of weight sharing for the neural network. We provide mathematical guarantees for the expressive power of the weight-shared neural network. Progressive sharing shares additional weight parameters among the cascaded Q-networks. We demonstrate that our design of progressive sharing outperforms other baselines in various large-scale tasks.

## Acknowledgements

This work was supported by the Institute for Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2016-0-00160, Versatile Network System Architecture for Multi-dimensional Diversity).

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (No. 2016R1A2A2A05921755).

## References

• [Chen et al.2018] Xinshi Chen, Shuang Li, Hui Li, Shaohua Jiang, Yuan Qi, and Le Song. Neural model-based reinforcement learning for recommendation. arXiv preprint arXiv:1812.10613, 2018.
• [Dai et al.2017] Hanjun Dai, Elias B Khalil, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In NeurIPS, 2017.
• [Dauphin et al.2014] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In NeurIPS, 2014.
• [Deudon et al.2018] Michel Deudon, Pierre Cournut, Alexandre Lacoste, Yossiri Adulyasak, and Louis-Martin Rousseau. Learning heuristics for the tsp by policy gradient. In CPAIOR. Springer, 2018.
• [Gybenko1989] G Gybenko. Approximation by superposition of sigmoidal functions. In Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
• [Jason and Devon R Graham2018] Jason Hartford, Devon R Graham, Kevin Leyton-Brown, and Siamak Ravanbakhsh. Deep models of interactions across sets. In ICML, 2018.
• [Kawaguchi2016] Kenji Kawaguchi. Deep learning without poor local minima. In NeurIPS, 2016.
• [Kipf and Welling2017] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
• [Kool et al.2019] Wouter Kool, Herke van Hoof, and Max Welling. Attention, learn to solve routing problems! In ICLR, 2019.
• [Kushner and Whiting2004] Harold J Kushner and Philip A Whiting. Convergence of proportional-fair sharing algorithms under general conditions. IEEE Transactions on Wireless Communications, 3(4):1250–1259, 2004.
• [Laurent and Brecht2018] Thomas Laurent and James Brecht. Deep linear networks with arbitrary loss: All local minima are global. In ICML, 2018.
• [Li et al.2016] Shuai Li, Baoxiang Wang, Shengyu Zhang, and Wei Chen. Contextual combinatorial cascading bandits. In ICML, 2016.
• [Maes et al.2009] Francis Maes, Ludovic Denoyer, and Patrick Gallinari. Structured prediction with reinforcement learning. Machine learning, 77(2-3):271, 2009.
• [Maron et al.2019] Haggai Maron, Ethan Fetaya, Nimrod Segol, and Yaron Lipman. On the universality of invariant networks. arXiv preprint arXiv:1901.09342, 2019.
• [Mnih et al.2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
• [Qi et al.2017] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, 2017.
• [Qin et al.2014] Lijing Qin, Shouyuan Chen, and Xiaoyan Zhu. Contextual combinatorial bandit and its application on diversified online recommendation. In ICDM. SIAM, 2014.
• [Ravanbakhsh et al.2017a] Siamak Ravanbakhsh, Jeff Schneider, and Barnabas Poczos. Deep learning with sets and point clouds. In ICLR, workshop track, 2017.
• [Ravanbakhsh et al.2017b] Siamak Ravanbakhsh, Jeff Schneider, and Barnabás Póczos. Equivariance through parameter-sharing. In ICML, 2017.
• [Ricci et al.2015] Francesco Ricci, Lior Rokach, and Bracha Shapira. Recommender systems: Introduction and challenges. In Recommender systems handbook. Springer, 2015.
• [Tampuu et al.2017] Ardi Tampuu, Tambet Matiisen, Dorian Kodelja, Ilya Kuzovkin, Kristjan Korjus, Juhan Aru, Jaan Aru, and Raul Vicente. Multiagent cooperation and competition with deep reinforcement learning. PLoS ONE, 12(4):e0172395, 2017.
• [Usunier et al.2017] Nicolas Usunier, Gabriel Synnaeve, Zeming Lin, and Soumith Chintala. Episodic exploration for deep deterministic policies for starcraft micromanagement. In ICLR, 2017.
• [Yarotsky2018] Dmitry Yarotsky. Universal approximations of invariant maps by neural networks. arXiv preprint arXiv:1804.10306, 2018.
• [Zaheer et al.2017] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan R Salakhutdinov, and Alexander J Smola. Deep sets. In NeurIPS, 2017.
• [Zheng et al.2018] Guanjie Zheng, Fuzheng Zhang, Zihan Zheng, Yang Xiang, Nicholas Jing Yuan, Xing Xie, and Zhenhui Li. Drn: A deep reinforcement learning framework for news recommendation. In WWW, 2018.
• [Zinkevich and Balch2001] Martin Zinkevich and Tucker Balch. Symmetry in markov decision processes and its implications for single agent and multi agent learning. In ICML, 2001.

## Appendix A Intra-Parameter Sharing

### A.1 Single Channel

In this section, we formally redefine the two types of previously defined weight-shared layers $\phi_k$ and $\psi_k$ with the EI property, i.e., for all $\sigma_s$,

$$\phi_k(\sigma_s(x,i)) = \sigma_s(\phi_k(x,i)), \qquad \psi_k(\sigma_s(x,i)) = \sigma_i(\psi_k(x,i)).$$

We start with the simplest case when and . This case can be regarded as a state where and in Section 3.2. Let $I_x$ and $I_i$ be the identity matrices. We denote by $\mathbf{1}_{x,x}$, $\mathbf{1}_{x,i}$, $\mathbf{1}_{i,x}$, and $\mathbf{1}_{i,i}$ the matrices of ones.

#### Layer ϕk

Let $\phi_k(x,i) := (X, I)$ with $X \in \mathbb{R}^{|x|}$ and $I \in \mathbb{R}^{|i|}$, where the outputs $X$ and $I$ of the layer are defined as

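$$X := \rho(\mathbf{W}_x x + \mathbf{W}_{x,x}\, x + \mathbf{W}_{x,i}\, i + \mathbf{b}_x), \qquad I := \rho(\mathbf{W}_i i + \mathbf{W}_{i,i}\, i + \mathbf{W}_{i,x}\, x + \mathbf{b}_i) \tag{7}$$

with a non-linear activation function $\rho$. The parameter-shared matrices are defined as follows:

$$\begin{aligned}
\mathbf{W}_x &:= W_x I_x, & \mathbf{W}_{x,x} &:= \frac{W_{x,x}}{|x|}\mathbf{1}_{x,x}, & \mathbf{W}_{x,i} &:= \frac{W_{x,i}}{|i|}\mathbf{1}_{x,i}, & \mathbf{b}_x &:= b_x \mathbf{1}_{x,1},\\
\mathbf{W}_i &:= W_i I_i, & \mathbf{W}_{i,i} &:= \frac{W_{i,i}}{|i|}\mathbf{1}_{i,i}, & \mathbf{W}_{i,x} &:= \frac{W_{i,x}}{|x|}\mathbf{1}_{i,x}, & \mathbf{b}_i &:= b_i \mathbf{1}_{i,1}.
\end{aligned}$$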

The entries in the weight matrices are tied by the real-valued parameters $W_x, W_{x,x}, W_{x,i}, b_x, W_i, W_{i,i}, W_{i,x}, b_i$, respectively. The weight matrices built from matrices of ones carry a normalizing term ($1/|x|$) or ($1/|i|$). In our empirical simulation results, these normalizations stabilize training and improve the generalization capability of the Q-networks.
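The layer in (7) can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the authors' implementation: the function name `phi_layer`, the dictionary keys, and the choice of `tanh` for $\rho$ are hypothetical. Multiplying by a normalized matrix of ones is written as a mean, and the final check confirms numerically that permuting the inputs permutes the outputs in the same way.

```python
import math
import random

def phi_layer(x, i, p, rho=math.tanh):
    """Single-channel intra-sharing layer in the spirit of Eq. (7).

    x, i : lists of floats (the two item groups).
    p    : dict holding the eight tied scalars (hypothetical key names).
    rho  : activation; tanh is an arbitrary choice for this sketch.
    """
    mean_x = sum(x) / len(x)  # effect of (1/|x|) * ones-matrix applied to x
    mean_i = sum(i) / len(i)  # effect of (1/|i|) * ones-matrix applied to i
    X = [rho(p["W_x"] * xj + p["W_xx"] * mean_x + p["W_xi"] * mean_i + p["b_x"])
         for xj in x]
    I = [rho(p["W_i"] * ij + p["W_ii"] * mean_i + p["W_ix"] * mean_x + p["b_i"])
         for ij in i]
    return X, I

def permute(v, perm):
    """Reorder v so that entry k of the result is v[perm[k]]."""
    return [v[j] for j in perm]

rng = random.Random(0)
names = ["W_x", "W_xx", "W_xi", "b_x", "W_i", "W_ii", "W_ix", "b_i"]
params = {name: rng.uniform(-1, 1) for name in names}
x = [rng.gauss(0, 1) for _ in range(3)]
i = [rng.gauss(0, 1) for _ in range(5)]
perm_x, perm_i = [2, 0, 1], [4, 2, 0, 1, 3]

X, I = phi_layer(x, i, params)
Xp, Ip = phi_layer(permute(x, perm_x), permute(i, perm_i), params)

# Equivariance: layer(permuted input) equals permuted(layer(input)).
expected = permute(X, perm_x) + permute(I, perm_i)
equivariant = all(math.isclose(a, b, abs_tol=1e-12)
                  for a, b in zip(Xp + Ip, expected))
```

Only eight scalars are stored regardless of $|x|$ and $|i|$, so the same `params` can be applied to inputs of any size.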

#### Layer ψk

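The only difference of $\psi_k$ from $\phi_k$ is that the range of $\psi_k$ is restricted to $I$ of (7), i.e., $\psi_k(x,i) := I$, where $I$ is computed as in (7). The weight matrices are defined as in the $\phi_k$ case:

$$\mathbf{W}_i := W_i I_i, \qquad \mathbf{W}_{i,i} := \frac{W_{i,i}}{|i|}\mathbf{1}_{i,i}, \qquad \mathbf{W}_{i,x} := \frac{W_{i,x}}{|x|}\mathbf{1}_{i,x}, \qquad \mathbf{b}_i := b_i \mathbf{1}_{i,1}.$$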

#### Deep Neural Network with Stacked Layers

Recall that the I-shared network is formed as follows:

$$Q_\theta(\cdot\,;\theta_k) := \psi_k \circ \phi_k^{D} \circ \cdots \circ \phi_k^{1}(\cdot)$$

where $D$ denotes the number of stacked layers belonging to . Therefore, the weight parameter vector for consists of for and for . In contrast, the projected vector consists of high-dimensional weight parameter vectors such as for and for .

### A.2 Multiple Channels

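In the section above, we described simplified versions of the intra-sharing layers

$$\phi_k: \mathbb{R}^{|x|} \times \mathbb{R}^{|i|} \to \mathbb{R}^{|x|} \times \mathbb{R}^{|i|}, \qquad \psi_k: \mathbb{R}^{|x|} \times \mathbb{R}^{|i|} \to \mathbb{R}^{|i|}.$$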

In this section, we extend this to

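$$\phi_k: \mathbb{R}^{|x| \cdot P_x + |i| \cdot P_i} \to \mathbb{R}^{|x| \cdot O_x + |i| \cdot O_i}, \qquad \psi_k: \mathbb{R}^{|x| \cdot P_x + |i| \cdot P_i} \to \mathbb{R}^{|i| \cdot O_i} \tag{8}$$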

where $P_x, P_i$ and $O_x, O_i$ are the numbers of features for the input and the output of each layer, respectively. The role of these numbers is similar to that of channels in convolutional neural networks, which increase the expressive power and handle multiple feature vectors. This added width gives more expressive power through the larger number of hidden nodes, in line with the universal approximation theorem [Gybenko1989]. Furthermore, our Theorem 2 also holds with proper feature numbers in the hidden layers. Without loss of generality, we handle the case $P_x = P_i = P$ and $O_x = O_i = O$. We use superscripts $\langle p \rangle$ and $\langle o \rangle$ for $1 \le p \le P$ and $1 \le o \le O$ to denote such channels. In our architecture, cross-channel interactions are fully connected. Layer $\phi_k$ with multiple channels is as follows:

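$$\begin{aligned}
X^{\langle o \rangle} &:= \rho\Big(\sum_{p=1}^{P}\big(\mathbf{W}_x^{\langle o,p \rangle} x^{\langle p \rangle} + \mathbf{W}_{x,x}^{\langle o,p \rangle} x^{\langle p \rangle} + \mathbf{W}_{x,i}^{\langle o,p \rangle} i^{\langle p \rangle} + \mathbf{b}_x^{\langle o \rangle}\big)\Big),\\
I^{\langle o \rangle} &:= \rho\Big(\sum_{p=1}^{P}\big(\mathbf{W}_i^{\langle o,p \rangle} i^{\langle p \rangle} + \mathbf{W}_{i,i}^{\langle o,p \rangle} i^{\langle p \rangle} + \mathbf{W}_{i,x}^{\langle o,p \rangle} x^{\langle p \rangle} + \mathbf{b}_i^{\langle o \rangle}\big)\Big)
\end{aligned}$$

where

$$\begin{aligned}
\mathbf{W}_x^{\langle o,p \rangle} &:= W_x^{\langle o,p \rangle} I_x, & \mathbf{W}_{x,x}^{\langle o,p \rangle} &:= \frac{W_{x,x}^{\langle o,p \rangle}}{|x|}\mathbf{1}_{x,x}, & \mathbf{W}_{x,i}^{\langle o,p \rangle} &:= \frac{W_{x,i}^{\langle o,p \rangle}}{|i|}\mathbf{1}_{x,i}, & \mathbf{b}_x^{\langle o \rangle} &:= b_x^{\langle o \rangle}\mathbf{1}_{x,1},\\
\mathbf{W}_i^{\langle o,p \rangle} &:= W_i^{\langle o,p \rangle} I_i, & \mathbf{W}_{i,i}^{\langle o,p \rangle} &:= \frac{W_{i,i}^{\langle o,p \rangle}}{|i|}\mathbf{1}_{i,i}, & \mathbf{W}_{i,x}^{\langle o,p \rangle} &:= \frac{W_{i,x}^{\langle o,p \rangle}}{|x|}\mathbf{1}_{i,x}, & \mathbf{b}_i^{\langle o \rangle} &:= b_i^{\langle o \rangle}\mathbf{1}_{i,1}.
\end{aligned}$$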

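Similar to the above cases, the entries in the weight matrices are tied together by the real-valued parameters $W_x^{\langle o,p \rangle}, W_{x,x}^{\langle o,p \rangle}, W_{x,i}^{\langle o,p \rangle}, b_x^{\langle o \rangle}, W_i^{\langle o,p \rangle}, W_{i,i}^{\langle o,p \rangle}, W_{i,x}^{\langle o,p \rangle}, b_i^{\langle o \rangle}$, respectively. The weight parameter vector for with multiple channels consists of for and for . In contrast, the projected vector consists of high-dimensional weight parameter vectors such as for and for .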
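A minimal sketch of the multi-channel layer, assuming $P$ input and $O$ output channels shared by both item groups, may help make the parameter tying concrete. The names `phi_layer_mc` and `random_params`, the nested-list layout, and the `tanh` activation are assumptions made for this illustration. The bias is kept inside the sum over $p$, following the equation as printed. The example applies one parameter set to inputs with different numbers of items.

```python
import math
import random

def phi_layer_mc(x, i, W, b, rho=math.tanh):
    """Multi-channel intra-sharing layer (illustrative sketch).

    x[p][j], i[p][j] : value of item j in input channel p.
    W[name][o][p]    : tied scalar for output channel o and input channel p,
                       name in {"x", "xx", "xi", "i", "ii", "ix"}.
    b[name][o]       : tied bias scalar, name in {"x", "i"}.
    """
    P, O = len(x), len(b["x"])
    mean_x = [sum(ch) / len(ch) for ch in x]  # normalized ones-matrix times x^<p>
    mean_i = [sum(ch) / len(ch) for ch in i]  # normalized ones-matrix times i^<p>
    X, I = [], []
    for o in range(O):
        # Terms shared by every item in a group; bias inside the sum, as printed.
        shared_x = sum(W["xx"][o][p] * mean_x[p] + W["xi"][o][p] * mean_i[p]
                       + b["x"][o] for p in range(P))
        shared_i = sum(W["ii"][o][p] * mean_i[p] + W["ix"][o][p] * mean_x[p]
                       + b["i"][o] for p in range(P))
        X.append([rho(sum(W["x"][o][p] * x[p][j] for p in range(P)) + shared_x)
                  for j in range(len(x[0]))])
        I.append([rho(sum(W["i"][o][p] * i[p][j] for p in range(P)) + shared_i)
                  for j in range(len(i[0]))])
    return X, I

def random_params(P, O, rng):
    """Draw the tied scalars; their count does not depend on the item counts."""
    W = {n: [[rng.uniform(-1, 1) for _ in range(P)] for _ in range(O)]
         for n in ("x", "xx", "xi", "i", "ii", "ix")}
    b = {n: [rng.uniform(-1, 1) for _ in range(O)] for n in ("x", "i")}
    return W, b

def random_items(P, n, rng):
    return [[rng.gauss(0, 1) for _ in range(n)] for _ in range(P)]

rng = random.Random(0)
P, O = 2, 3
W, b = random_params(P, O, rng)
n_params = 6 * O * P + 2 * O

small = phi_layer_mc(random_items(P, 2, rng), random_items(P, 4, rng), W, b)
large = phi_layer_mc(random_items(P, 5, rng), random_items(P, 50, rng), W, b)
shapes = [(len(out), len(out[0])) for out in small + large]
```

Here `shapes` records (channels, items) for each output, showing that the layer accepts 4 or 50 items with the same 42 scalars.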

## Appendix B Proofs of the theorems

### B.1 Relative Local Optimality: Theorem 1

To simplify the explanation, we only consider the case when phase so and is permutation equivariant to the order of . Furthermore, we consider the case of a single channel described in Section A.1. Therefore, we omit to notate in this subsection and denote rather than . However, our idea for the proof can be easily adapted to extended cases such as or multiple channels. To summarize, our goal is to show relative local optimality in Theorem 1 where the loss function is defined as

$$L_\Omega(\omega) := \sum_{\sigma \in S_N} \sum_{i \in B} \big\lVert Q_\omega(\sigma(i)) - Q^\star(\sigma(i)) \big\rVert.$$

#### Sketch of Proof

Arguing by contradiction, we assume that there exists at least one local minimum in the loss function for the I-shared network while is not a local minimum in the loss function for the non-weight-shared network . Therefore, there must be a vector in which makes the directional derivative negative. We first extend the definition of each to the corresponding mapping . We can then generate one more derivative vector for each such that . Therefore, the sum of all the permuted vectors is also a negative derivative vector, while belongs to since has the effect of I-sharing from the summation of all the permuted derivative vectors. This fact guarantees the existence of a derivative vector such that and , which contradicts the aforementioned assumption that is a local minimum of .

#### Extended Definition for σ∈SN

In this paragraph, we will extend the concept of the permutation from the original definition on the set to the permutation on the weight parameter vector in non-shared weight parameter vector space , i.e., to satisfy the below statement,

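$$\forall \sigma \in S_N,\ \forall \omega \in \Omega,\ \forall i \in \mathbb{R}^N, \quad \sigma(Q_\omega(i)) = Q_{\sigma(\omega)}(\sigma(i)). \tag{9}$$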

To define the permutation with the property in (9), we shall describe how $\sigma$ permutes the weight parameters in a layer $\phi_\omega$ in $Q_\omega$, which can be represented as

$$\phi_\omega(i) = Wi + b \tag{10}$$

where $W$ is a weight matrix and $b$ is a bias vector. In the permuted layer , the weight matrix $W$ and the bias $b$ in (10) convert to and , respectively. is a permutation matrix defined as where is a standard dimensional basis vector in . With the permuted weights, we can easily see for all and . Therefore, the network which is a composite of the layers $\phi_\omega$ satisfies (9). Figure 16 describes an example of the permutation on .
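As a numerical sanity check of (9) for a single affine layer of the form (10), the sketch below uses one concrete weight permutation that makes the identity hold: rows and columns of $W$ and the entries of $b$ are reordered by the same index map. This particular action, and the names `permute_layer` and `permute_vec`, are assumptions chosen for illustration and are not taken from the text above. The second check shows that a tied matrix of the form $a I + c \mathbf{1}$ with a constant bias is left unchanged, in the spirit of (11).

```python
import math
import random

def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def layer(W, b, v):
    """Affine layer of Eq. (10): W v + b."""
    return [y + c for y, c in zip(matvec(W, v), b)]

def permute_vec(v, perm):
    """sigma(v)[k] = v[perm[k]]."""
    return [v[j] for j in perm]

def permute_layer(W, b, perm):
    """Assumed action of sigma on the weights: reorder the rows and columns
    of W, and the entries of b, by the same index map."""
    n = len(b)
    Wp = [[W[perm[k]][perm[l]] for l in range(n)] for k in range(n)]
    return Wp, permute_vec(b, perm)

rng = random.Random(0)
n = 4
perm = [2, 0, 3, 1]
W = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
b = [rng.uniform(-1, 1) for _ in range(n)]
v = [rng.gauss(0, 1) for _ in range(n)]

Wp, bp = permute_layer(W, b, perm)
lhs = permute_vec(layer(W, b, v), perm)    # sigma(phi_omega(i))
rhs = layer(Wp, bp, permute_vec(v, perm))  # phi_{sigma(omega)}(sigma(i))
equivariant = all(math.isclose(p, q, abs_tol=1e-12) for p, q in zip(lhs, rhs))

# A tied layer a*Identity + c*Ones with a constant bias is a fixed point.
a, c, beta = 0.7, -0.3, 0.1
W_tied = [[(a if k == l else 0.0) + c for l in range(n)] for k in range(n)]
b_tied = [beta] * n
tied_invariant = permute_layer(W_tied, b_tied, perm) == (W_tied, b_tied)
```

Because a composition of layers that each satisfy the identity also satisfies it, the same check extends layer by layer to a stacked network.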

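Note that the projected weight parameter vector $\omega(\theta)$ for an arbitrary $\theta \in \Theta$ is invariant to the permutation $\sigma$, since $\omega(\theta)$ satisfies the symmetry among the weights from I-sharing, i.e.,

$$\forall \theta \in \Theta,\ \forall \sigma \in S_N, \quad \omega(\theta) = \sigma(\omega(\theta)). \tag{11}$$
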
###### Lemma 1 (Permutation Invariant Loss Function).

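For any weight parameter vectors $\omega \in \Omega$ and $\theta \in \Theta$, and any permutation $\sigma \in S_N$, the equation below holds.

$$L_\Omega(\omega(\theta) + \omega) = L_\Omega(\omega(\theta) + \sigma(\omega)). \tag{12}$$
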
###### (Proof of Lemma 1).

We can derive the result of Lemma 1 from the below statement.
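One way to spell out this step is sketched below. It rests on assumptions that are not restated in this appendix: the extended $\sigma$ acts linearly on $\Omega$, the target $Q^\star$ is itself permutation equivariant, i.e., $\sigma(Q^\star(i)) = Q^\star(\sigma(i))$, and the per-sample error term in $L_\Omega$ is unchanged when the coordinates of its argument are permuted. Write $\omega' := \omega(\theta) + \omega$, so that (11) and linearity give $\omega(\theta) + \sigma(\omega) = \sigma(\omega')$. Applying (9) to the input $\sigma^{-1}\tau(i)$ then yields

$$\begin{aligned}
L_\Omega(\sigma(\omega')) &= \sum_{\tau \in S_N} \sum_{i \in B} \big\lVert Q_{\sigma(\omega')}(\tau(i)) - Q^\star(\tau(i)) \big\rVert \\
&= \sum_{\tau \in S_N} \sum_{i \in B} \big\lVert \sigma\big(Q_{\omega'}(\sigma^{-1}\tau(i))\big) - \sigma\big(Q^\star(\sigma^{-1}\tau(i))\big) \big\rVert \\
&= \sum_{\tau \in S_N} \sum_{i \in B} \big\lVert Q_{\omega'}(\sigma^{-1}\tau(i)) - Q^\star(\sigma^{-1}\tau(i)) \big\rVert \\
&= \sum_{\tau' \in S_N} \sum_{i \in B} \big\lVert Q_{\omega'}(\tau'(i)) - Q^\star(\tau'(i)) \big\rVert = L_\Omega(\omega').
\end{aligned}$$

The last line re-indexes the outer sum with $\tau' := \sigma^{-1}\tau$, which ranges over all of $S_N$ as $\tau$ does.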

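###### (Proof of Theorem 1).

We argue by contradiction, assuming that there exists a local minimum $\theta^\star$ of the I-shared loss while $\omega(\theta^\star)$ is not a local minimum of $L_\Omega$. Since $\omega(\theta^\star)$ is not a local minimum of $L_\Omega$, there exists a vector $\omega_0$ such that the directional derivative of $L_\Omega$ along $\omega_0$ is negative, i.e., $D_{\omega_0}(L_\Omega(\omega(\theta^\star))) < 0$. We can find additional vectors which have a negative derivative by permuting $\omega_0$ and exploiting the result of Lemma 1.

$$D_{\sigma(\omega_0)}\big(L_\Omega(\omega(\theta^\star))\big) = \lim_{h \to 0} \frac{L_\Omega(\omega(\theta^\star) + h\sigma(\omega_0)) - L_\Omega(\omega(\theta^\star))}{h} = \lim_{h \to 0} \frac{L_\Omega(\omega(\theta^\star) + h\omega_0) - L_\Omega(\omega(\theta^\star))}{h}$$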