AlphaSeq: Sequence Discovery with Deep Reinforcement Learning

09/26/2018 ∙ by Yulin Shao, et al. ∙ The Chinese University of Hong Kong

Sequences play an important role in many applications and systems. Discovering sequences with desired properties has long been an interesting intellectual pursuit. This paper puts forth a new paradigm, AlphaSeq, to discover desired sequences algorithmically using deep reinforcement learning (DRL) techniques. AlphaSeq treats the sequence discovery problem as an episodic symbol-filling game, in which a player fills symbols in the vacant positions of a sequence set sequentially during an episode of the game. Each episode ends with a completely-filled sequence set, upon which a reward is given based on the desirability of the sequence set. AlphaSeq models the game as a Markov Decision Process (MDP), and adapts the DRL framework of AlphaGo to solve the MDP. Sequences discovered improve progressively as AlphaSeq, starting as a novice, learns to become an expert game player through many episodes of game playing. Compared with traditional sequence construction by mathematical tools, AlphaSeq is particularly suitable for problems with complex objectives intractable to mathematical analysis. We demonstrate the searching capabilities of AlphaSeq in two applications: 1) AlphaSeq successfully rediscovers a set of ideal complementary codes that can zero-force all potential interferences in multi-carrier CDMA systems. 2) AlphaSeq discovers new sequences that triple the signal-to-interference ratio -- benchmarked against the well-known Legendre sequence -- of a mismatched filter estimator in pulse compression radar systems.


I Introduction

A sequence is a list of elements arranged in a certain order. The prime numbers arranged in ascending order, for example, form a sequence [1]. The arrangements of nucleic acids in DNA polynucleotide chains are also sequences [2].

Discovering sequences with desired properties is an intellectual pursuit with important applications [1]. In particular, sequences are critical components in many information systems. For example, cellular code division multiple access (CDMA) systems make use of spread spectrum sequences to distinguish signals from different users [3]; pulse compression radar systems make use of probe pulses modulated by phase-coded sequences [4] to enable high-resolution detection of objects at a large distance.

Sequences in information systems are commonly designed by algebraists and information theorists using mathematical tools such as finite field theory, algebraic number theory, and character theory. However, the design criterion for a good sequence may be complex and may not reduce to a clean mathematical expression that the available mathematical tools can solve. Faced with this problem, sequence designers may do two things: 1) Overlook the practical criterion and simplify the requirements to make the problem analytically tractable. In so doing, a disconnect between reality and theory may be created. 2) Introduce additional but artificial constraints absent in the original practical problem. In this case, the analytical solution is only valid for a subset of the sequences of interest. For example, the protocol sequences in [5] are constructed by means of the Chinese Remainder Theorem (CRT) [6]; hence, the number of supported users is restricted to a prime number.

Yet a third approach is to find the desired sequences algorithmically. This approach rids us of the confines imposed by analytical mathematical tools. On the other hand, the issue becomes whether good sequences can be found within a reasonable time by algorithms. Certainly, to the extent that desired sequences can be found by a random search algorithm within a reasonable time, then the problem is solved. Most desired sequences, however, cannot be found so easily and algorithms with complexity polynomial in the length of the sequences are not available.


Fig. 1: In episodic reinforcement learning, the agent-environment interactions are broken into sessions called episodes. Each episode starts anew from an initial state $s_0$. The agent takes actions in successive discrete time steps $t = 0, 1, 2, \ldots$, resulting in the state of the environment traversing through states $s_1, s_2, \ldots$ until a terminal state $s_T$ is reached, whereupon a reward is given. The next episode begins independently of how the previous episode ended [7].

Reinforcement Learning (RL) is an important branch of machine learning [7] known for its ability to derive solutions for Markov Decision Processes (MDPs) [8] through a learning process. A salient feature of RL is “learning from interactions”. Fig. 1 illustrates the framework of episodic RL. (The agent-environment interaction in RL can also be non-episodic, in which case the interactions go on indefinitely without an end state and a reward is given in each time step rather than at the end of an episode. We focus on episodic RL because it fits the problem of sequence discovery better.) In the framework, an agent interacts with an environment in a sequence of discrete time steps $t = 0, 1, 2, \ldots$. At time step $t$, the agent observes that the environment is in state $s_t$. Based on the observation of $s_t$, the agent then takes an action $a_t$, which results in the environment moving to state $s_{t+1}$ in time step $t+1$. The environment feeds back a reward to the agent at the terminal state $s_T$, i.e., at the end of one episode.

The mapping from $s_t$ to $a_t$ is referred to as a policy function. The aim of the policy is generally to maximize the expected reward received at the end of the episode. This policy function could be deterministic, in which case a specific action $a_t$ is always taken upon a given state $s_t$. The policy could also be probabilistic, in which case the action taken upon a given state is described by a conditional probability $P(a_t \mid s_t)$. The objective of the agent is to learn an expected-reward maximizing policy after going through multiple episodes. The agent may begin with bad policies early on, but as it gathers experiences from successive episodes, the policy gets better and better.

The latest trend in RL research is to integrate the recent advances of deep learning [9] into the RL framework [10, 11, 12]. RL that makes use of deep neural networks (DNNs) to approximate the optimal policy function – directly or indirectly – is referred to as deep reinforcement learning (DRL). DRL allows RL algorithms to be applied when the number of possible state-action pairs is enormous and traditional function approximators cannot approximate the policy function accurately. The recent success of DRL in game playing, natural language processing, and autonomous vehicle steering (see the excellent survey in [12]) has demonstrated its power in solving complex problems that thwart conventional approaches.

This paper puts forth a DRL-based paradigm, referred to as AlphaSeq, to discover a set of sequences with desired properties algorithmically. The essence of AlphaSeq is as follows:

  • AlphaSeq treats sequence-set discovery – a sequence set consists of one or more sequences – as an episodic symbol-filling game. In each episode of the game, AlphaSeq fills symbols into vacant sequence positions in a consecutive manner until the sequence set is completely filled, whereupon a reward with value between $-1$ and $1$ is returned. The reward is a nonlinear function of a metric that quantifies the desirability of the sequence set. AlphaSeq aims to maximize the reward. It learns to do so by playing many episodes of the game, improving itself along the way.

  • AlphaSeq treats each intermediate state with some sequence positions filled and others vacant as an image. Each position is a pixel of the image. Given an input image (state), AlphaSeq makes use of a DNN to approximate the optimal policy that maximizes the reward. AlphaSeq uses a DRL framework similar to that of AlphaGo [13], in which DNN-guided MCTS (Monte-Carlo Tree Search [14]) is used to select each move in the game. As in AlphaGo, there is an iterative self-learning process in AlphaSeq in that the experiences from the DNN-guided MCTS game playing are used to train the DNN; and the trained DNN in turn improves future game playing by the DNN-guided MCTS.

  • We introduce two techniques in AlphaSeq that are absent in AlphaGo. The first technique is to allow AlphaSeq to make $\ell$ moves at a time (i.e., filling $\ell$ sequence positions at a time). Obviously, this technique is not applicable to the game of Go, and hence not to AlphaGo. The choice of $\ell$ is a complexity tradeoff between the MCTS and the DNN. The second technique, dubbed “segmented induction”, is to change the reward function progressively to guide AlphaSeq toward good sequences in its learning process. In essence, we set a low target for AlphaSeq initially so that many sequence sets can have rewards close to $1$, with few having rewards close to $-1$. As AlphaSeq plays more and more episodes of the game, we progressively raise the target so that fewer and fewer sequence sets have rewards close to $1$, with more having rewards close to $-1$. In other words, the game becomes more and more demanding as AlphaSeq, starting as a novice, learns to become an expert player.

We demonstrate the capability of AlphaSeq to discover two types of sequences:

  1. We use AlphaSeq to rediscover a set of complementary codes for multi-carrier CDMA systems. In this application, AlphaSeq aims to discover a sequence set for which potential interferences in the multi-carrier CDMA system can be cancelled by simple signal processing. This particular problem already has analytical solutions. Our goal here is to test if AlphaSeq can rediscover these analytical solutions algorithmically rather than analytically.

  2. We use AlphaSeq to discover new phase-coded sequences superior to the known sequences for pulse compression radar systems. Specifically, our goal is to find phase-coded sequences commensurate with the mismatched filter (MMF) estimator so that the estimator can yield output with a high signal-to-interference ratio (SIR). The optimal sequences for MMF are not known, and there are currently no known sequences that are provably optimal when the sequence length is large. Benchmarked against the Legendre sequence [15], the sequence discovered by AlphaSeq triples the SIR, achieving mean square error (MSE) gains (in dB) for the estimation of radar cross sections in pulse compression radar systems.

The remainder of this paper is organized as follows. Section II formulates the sequence discovery problem and outlines the DRL framework of AlphaSeq. Sections III and IV present the applications of AlphaSeq in multi-carrier CDMA systems and pulse compression radar systems, respectively. Section V concludes this paper. Throughout the paper, lowercase bold letters denote vectors and uppercase bold letters denote matrices.

II Methodology

II-A Problem Formulation

We consider the problem of discovering a sequence set $\mathcal{C}$, the desirability of which is quantified by a metric $\mathcal{M}(\mathcal{C})$. Set $\mathcal{C}$ consists of $N$ different sequences of the same length $K$, i.e., $\mathcal{C} = \{\mathbf{c}_n : n = 0, 1, \ldots, N-1\}$, where the $n$-th sequence is given by $\mathbf{c}_n = (c_n[0], c_n[1], \ldots, c_n[K-1])$. Each symbol of the sequences in $\mathcal{C}$ (i.e., $c_n[k]$) is drawn from a discrete alphabet $\mathcal{A}$. Without loss of generality, this paper focuses on binary sequences. That is, alphabet $\mathcal{A}$ is two-valued, and we can simply denote these two values by $1$ and $-1$. The metric function $\mathcal{M}(\mathcal{C})$ varies with application scenarios. It is generally a function of all sequences in $\mathcal{C}$. The optimal metric value $\mathcal{M}^*$ (i.e., the desired metric value) is achieved when $\mathcal{C} = \mathcal{C}^*$. Our objective is to find an optimal sequence set $\mathcal{C}^*$ that yields $\mathcal{M}^*$. For binary sequences, the complexity of exhaustive search for $\mathcal{C}^*$ is $O(2^{NK})$, which is prohibitive for large $N$ and $K$.

This sequence discovery problem can be transformed into an MDP. Specifically, we treat sequence-set discovery as a symbol-filling game. One play of the game is one episode, and each episode contains a series of time steps. In each episode, the player (agent) starts from an all-zero state (i.e., all the symbols in the set are $0$), and takes one action per time step based on its current action policy. In each time step, $\ell$ symbols in the sequence set are assigned the value $1$ or $-1$, replacing the original value $0$. We emphasize that the player can only determine the values of the symbols, but not their positions. The positions are predetermined: a simple rule is to place symbols sequence by sequence. Specifically, we first place symbols in one sequence; when this sequence is completely filled, we turn to fill the next sequence, and so on. This rule is used throughout the paper unless specified otherwise. For a set of $N$ sequences of length $K$, an episode ends at a terminal state after $\lceil NK/\ell \rceil$ time steps, whereupon a complete set $\mathcal{C}$ is obtained. In the terminal state, we measure the goodness of $\mathcal{C}$ by the metric $\mathcal{M}(\mathcal{C})$, and return a reward $\mathcal{R}(\mathcal{C})$ for this episode to the player, where $\mathcal{R}$ is in general a nonlinear function of $\mathcal{M}(\mathcal{C})$. It is the player’s objective to learn a policy that makes sequential decisions to maximize the reward, as more and more games are played.
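
The game described above can be sketched as a small environment. This is a minimal illustration rather than the authors' implementation: the class name `SymbolFillingGame`, its method names, and the toy reward are assumptions introduced here. Only the rules come from the text: an all-zero start, predetermined positions filled sequence by sequence, a fixed number of binary symbols per step, and a reward returned only at the terminal state.

```python
class SymbolFillingGame:
    """Toy episodic symbol-filling game (hypothetical helper, not from the paper)."""

    def __init__(self, n_seq, seq_len, ell, reward_fn):
        self.n_seq = n_seq          # number of sequences in the set
        self.seq_len = seq_len      # length of each sequence
        self.ell = ell              # symbols filled per time step
        self.reward_fn = reward_fn  # maps a complete sequence set to a reward
        self.total = n_seq * seq_len
        self.reset()

    def reset(self):
        """Start a new episode from the all-zero state (0 marks a vacant position)."""
        self.state = [0] * self.total
        self.filled = 0
        return tuple(self.state)

    def sequences(self):
        """View the flat state as n_seq sequences, filled sequence by sequence."""
        k = self.seq_len
        return [self.state[i * k:(i + 1) * k] for i in range(self.n_seq)]

    def step(self, symbols):
        """Fill the next predetermined positions; return (state, reward, done)."""
        width = min(self.ell, self.total - self.filled)
        if width == 0:
            raise RuntimeError("episode already finished; call reset()")
        if len(symbols) != width or any(s not in (1, -1) for s in symbols):
            raise ValueError(f"expected {width} symbols from {{+1, -1}}")
        self.state[self.filled:self.filled + width] = symbols
        self.filled += width
        done = self.filled == self.total
        # The reward is only available once the set is completely filled.
        reward = self.reward_fn(self.sequences()) if done else None
        return tuple(self.state), reward, done


# Toy reward (an assumption for illustration): +1 if both sequences match, else -1.
def toy_reward(seqs):
    return 1.0 if seqs[0] == seqs[1] else -1.0


game = SymbolFillingGame(n_seq=2, seq_len=3, ell=2, reward_fn=toy_reward)
for move in [(1, -1), (1, 1), (-1, 1)]:
    state, reward, done = game.step(move)
print(state, reward, done)
```

With 2 sequences of length 3 and 2 symbols per step, the episode above ends after 3 steps, matching the step count $\lceil NK/\ell \rceil$.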

II-B Methodology

Given the MDP, a tree can be constructed from all possible states in the game. In particular, the root vertex is the all-zero state, and each vertex of the tree corresponds to a possible state, i.e., a partially-filled sequence-set pattern (completely filled at a terminal state). The depth of the tree equals the number of time steps in an episode (i.e., $\lceil NK/\ell \rceil$ for $N$ sequences of length $K$ filled $\ell$ symbols at a time), and each non-leaf vertex has exactly $2^\ell$ branches, one for each way of filling $\ell$ binary symbols. In each episode, the player starts from the root vertex and makes sequential decisions along the tree based on its current policy until reaching a leaf vertex, whereupon a reward is obtained. Given any vertex and an action, the next vertex is conditionally independent of all previous vertices and actions, i.e., the transitions on the tree satisfy the Markov property.

The objective of the player is then to reach a leaf vertex with the maximum reward. Toward this objective, the player performs the following:

  1. Distinguishing good states from bad states – A reward is given to the player only upon its reaching a terminal state. While traversing the intermediate states, the player must distinguish good intermediate states from bad ones so that it can navigate toward a good terminal state. In particular, the player must learn to approximate the expected end rewards of intermediate states: this is in fact a process of value function approximation (in RL, the value of a state refers to the expected reward of being in that state, and a value function is a mapping from states to values). Moreover, we can imagine each state to be an image with each symbol being a pixel, and make use of a DNN to approximate the expected rewards of the “images”.

  2. Improving action policy based on cognition of subsequent states. Starting as a tabula rasa, the player’s initial policy in earlier episodes is rather random. To gradually improve the action policy, the player can leverage the instrument of MCTS. MCTS is a simulated look-ahead tree search. At a vertex, MCTS can estimate the prospects of subsequent vertices by simulating multiple actions along the tree. The information collected during the simulations can then be used to decide the real action to be taken at this vertex.

A successful combination of DNN and MCTS has been demonstrated in AlphaGo [11, 13, 16], where the authors use a DNN to assess the vertices during the MCTS simulation, as opposed to using random rollouts in standard MCTS. (More details on the standard MCTS can be found in [14]. Throughout the paper, when we refer to MCTS, we mean the DNN-guided MCTS rather than the standard MCTS.) In this paper, we adapt the DRL framework of AlphaGo to solve the sequence-set discovery problem associated with the underlying MDP. (AlphaGo itself is evolving; the DRL framework in this paper is based on AlphaGo Zero [13] and AlphaZero [16].) In deference to AlphaGo, we refer to this sequence-discovery framework as “AlphaSeq”.


Fig. 2: The iterative algorithmic framework of AlphaGo/AlphaSeq. Improved DNN promotes the MCTS so that “game-play” generates experiences with higher quality; higher quality experiences can further enhance the DNN.

The overall algorithmic framework of AlphaGo/AlphaSeq can be outlined as an iterative “game-play with MCTS” and “DNN-update” process, as shown in Fig. 2. On the one hand, “game-play with MCTS” provides experiences to train the DNN so that the DNN can improve its assessments of the goodness of the states in the game. On the other hand, better evaluation of the states by the DNN allows the MCTS to make better decisions, which in turn provide higher quality experiences to train the DNN. Through this iterative process, the MCTS and the DNN mutually enhance each other in a progressive manner over an underlying reinforcement learning process.

In what follows, we dissect these two components and describe the relationship between them in more detail. Differences between AlphaSeq and AlphaGo are presented at the end of this section. Further implementation details can be found in Appendix A.

Input and output of DNN – The DNN is designed to estimate the value function and policy function of an intermediate state. (The DNN only evaluates intermediate states, not terminal states. For terminal states, the value is known to the player, namely the reward, and there is no policy.) The value function is the estimated expected terminal reward given the intermediate state. Specifically, the output of the DNN can be expressed as $(\boldsymbol{p}, v) = f_\theta(s)$: each time we feed an intermediate state $s$ into the DNN with coefficients $\theta$, it outputs a reward estimation $v$ (value function estimation) and a probabilistic move-selection policy $\boldsymbol{p}$ (policy function estimation; the policy is a distribution over all possible next moves given the current state $s$).


Fig. 3: An episode of game, where , , and . The positions are represented by the coloured squares: grey means that the positions are filled while white means that the positions are vacant. At each time step, following the output of MCTS, the player fills positions with value or .

Game-play with MCTS – The first part of the algorithm iteration in Fig. 2 is game-play with MCTS. As illustrated in Fig. 3, we play the game under the guidance of MCTS. The upper half of Fig. 3 presents all the states in an episode, where squares represent the positions in the sequence set: grey squares mean that the position has already been filled (with value $1$ or $-1$); white squares mean that the position is still vacant (with value $0$). The initial state of each episode is the all-zero state $s_0$. In state $s_t$, the player follows a probabilistic policy $\boldsymbol{\pi}_t$ (not the raw policy output of the DNN) to choose $\ell$ symbols to fill in the next $\ell$ positions in the sequence set. This action yields a new state $s_{t+1}$. The policy $\boldsymbol{\pi}_t$ is a distribution over the $2^\ell$ possible moves, and is given by MCTS.

The bottom half of Fig. 3 shows the MCTS process at each state $s_t$, where each circle (vertex) represents a possible state in the look-ahead search. In the MCTS for state $s_t$, we first set the root vertex to be $s_t$, and initialize a “visited tree” (this visited tree records all the vertices visited in the MCTS, and is initialized to have only the root vertex). Look-ahead simulations are then performed along the visited tree starting at the root vertex. Each simulation traces out a path of the visited tree, and terminates when an unseen vertex is encountered. This unseen vertex is then evaluated by the DNN and added to the visited tree (i.e., a newly added vertex is assigned the DNN's evaluation so as to aid future simulations in deciding which next move to select if the same vertex is visited again). As more and more simulations are performed, the tree grows in size. The metric used in selecting the next move at each vertex (i.e., equations (20) and (21) in Appendix A) also changes as the vertices are visited more and more in successive simulations. In a nutshell, vertices estimated to be good are visited frequently, while vertices estimated to be bad are visited rarely. The resulting move-selection distribution at state $s_t$, i.e., $\boldsymbol{\pi}_t$, is generated from the visit counts of the root vertex’s children in the MCTS at state $s_t$ (i.e., equation (22) in Appendix A).
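
The look-ahead search can be sketched in a few dozen lines. Equations (20) to (22) of Appendix A are not reproduced in this section, so the sketch below substitutes an AlphaGo Zero-style PUCT score and a plain visit-count distribution; these forms, the names `Node`, `select_action`, and `run_mcts`, the constant `c_puct`, and the stub evaluator standing in for the DNN are all assumptions made for illustration. For brevity one symbol is filled per move, and a state is represented by the tuple of symbols filled so far.

```python
import math


class Node:
    def __init__(self, prior):
        self.prior = prior       # prior probability from the evaluator
        self.visits = 0          # visit count
        self.value_sum = 0.0     # sum of backed-up values
        self.children = {}       # action -> Node, empty until expanded

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0


def select_action(node, c_puct):
    """Pick the child maximizing Q + U (PUCT-style score, assumed form)."""
    total = sum(child.visits for child in node.children.values())

    def score(item):
        _, child = item
        u = c_puct * child.prior * math.sqrt(total) / (1 + child.visits)
        return child.q() + u

    return max(node.children.items(), key=score)[0]


def run_mcts(root_state, length, evaluate, reward, n_sim=200, c_puct=1.0):
    """Return a move distribution at a non-terminal root_state from visit counts."""
    root = Node(1.0)
    priors, _ = evaluate(root_state)
    root.children = {a: Node(p) for a, p in priors.items()}
    for _ in range(n_sim):
        node, state, path = root, root_state, []
        while node.children:                 # follow the visited tree
            action = select_action(node, c_puct)
            node = node.children[action]
            state = state + (action,)
            path.append(node)
        if len(state) == length:             # terminal vertex: use the true reward
            value = reward(state)
        else:                                # unseen vertex: evaluate and expand
            priors, value = evaluate(state)
            node.children = {a: Node(p) for a, p in priors.items()}
        for visited in path:                 # back up the value along the path
            visited.visits += 1
            visited.value_sum += value
    # Every simulation passes through exactly one child of the root.
    return {a: child.visits / n_sim for a, child in root.children.items()}


# Stand-ins for illustration: a uniform "DNN" and a reward favoring all +1.
def uniform_evaluator(state):
    return {1: 0.5, -1: 0.5}, 0.0


def toy_reward(state):
    return 1.0 if all(s == 1 for s in state) else -1.0


policy = run_mcts((), length=3, evaluate=uniform_evaluator, reward=toy_reward)
print(policy)
```

Because the game has a single player, values are backed up without the sign flip used in two-player Go. Even with an uninformative evaluator, the search concentrates its visits on the branch that leads to the rewarded leaf.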

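Back to the upper part of Fig. 3, after $\lceil NK/\ell \rceil$ time steps, the player obtains a complete sequence set $\mathcal{C}$ with metric value $\mathcal{M}(\mathcal{C})$ that gives a reward $z = \mathcal{R}(\mathcal{C})$. Then, we feed the reward $z$ to each state in this episode and store $(s_t, \boldsymbol{\pi}_t, z)$ as an experience, where $\boldsymbol{\pi}_t$ is the MCTS policy used at state $s_t$. One episode of game-play gives us $\lceil NK/\ell \rceil$ experiences.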

DNN update – The second part of the algorithm iteration in Fig. 2 is the training of the DNN based on the accumulated experiences over successive episodes. From the description above, we know that MCTS is guided by the DNN. The capability of the DNN determines the performance of MCTS since a better DNN yields more accurate evaluation of the vertices in MCTS. In the extreme, if the DNN perfectly knows which sequence-set patterns are good and which are bad, then the MCTS will always head in an optimal direction, hence the chosen moves are also optimal. However, the DNN is randomly initialized, and its evaluations of vertices are quite random and inaccurate initially. Thus, our goal is to improve this DNN using the experiences generated from game-play with MCTS.

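In the process of DNN update, the DNN is updated by learning from the latest experiences accumulated in the game-play. Given an experience $(s_t, \boldsymbol{\pi}_t, z)$, where $z$ is the episode reward and $\boldsymbol{\pi}_t$ the MCTS policy at state $s_t$, and the DNN output $(\boldsymbol{p}, v) = f_\theta(s_t)$, 1) the real reward $z$ can be used to improve the value-function approximation of the DNN; 2) the policy $\boldsymbol{\pi}_t$ given by MCTS at state $s_t$ can be used to improve the policy estimation of the DNN. (The policy generated by MCTS is more powerful than the raw output of the DNN [13]. Thus, $\boldsymbol{\pi}_t$ can be used to improve $\boldsymbol{p}$.) Thus, the training process is to make $v$ and $\boldsymbol{p}$ more closely match $z$ and $\boldsymbol{\pi}_t$.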
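
One common way to express "make the DNN outputs match the experience" is the loss used by AlphaGo Zero: a squared error on the value plus a cross-entropy on the policy. The sketch below assumes that form (without the weight-regularization term); the function name `alphazero_style_loss` and the small constant `eps` are illustrative choices, and the exact loss used by AlphaSeq is specified in Appendix A, which is not reproduced here.

```python
import math


def alphazero_style_loss(z, v, pi, p, eps=1e-12):
    """Per-experience loss: (z - v)^2 - sum_a pi[a] * log p[a] (assumed form).

    z  : real reward of the episode
    v  : value estimated by the DNN for the state
    pi : move distribution produced by MCTS at the state
    p  : move distribution produced by the DNN at the state
    """
    value_loss = (z - v) ** 2
    # eps guards against log(0) when the DNN assigns zero probability to a move.
    policy_loss = -sum(t * math.log(q + eps) for t, q in zip(pi, p))
    return value_loss + policy_loss


# A prediction that matches the experience gives a loss near zero;
# an uninformed prediction is penalized on both terms.
print(alphazero_style_loss(1.0, 1.0, [1.0, 0.0], [1.0, 0.0]))
print(alphazero_style_loss(1.0, 0.0, [1.0, 0.0], [0.5, 0.5]))
```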

Remark:

When we play games with MCTS to generate experiences, Dirichlet noise is added to the prior probabilities of the root node to induce exploration, as in AlphaGo [13]. These games are also called noisy games. Instead of noisy games, we can also play noiseless games in which the Dirichlet noise is removed. Following the practice of AlphaGo, we play noisy games to generate the training experiences, but play noiseless games to evaluate the performance of AlphaSeq whose MCTS is guided by a particular trained DNN.
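
As a rough illustration of the noisy-game idea, root priors can be mixed with a Dirichlet sample as in AlphaGo Zero. The mixing weight `epsilon`, the concentration `alpha`, and the function names below are placeholders chosen for this sketch, not the values used by AlphaSeq. The Dirichlet sample is drawn with the standard construction of normalizing independent Gamma draws.

```python
import random


def sample_dirichlet(alpha, n, rng):
    """Draw from a symmetric Dirichlet(alpha) by normalizing Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def add_root_noise(priors, epsilon=0.25, alpha=0.3, rng=None):
    """Return (1 - epsilon) * priors + epsilon * noise for the root vertex."""
    rng = rng or random.Random()
    noise = sample_dirichlet(alpha, len(priors), rng)
    return [(1 - epsilon) * p + epsilon * n for p, n in zip(priors, noise)]


priors = [0.5, 0.25, 0.25]
noisy = add_root_noise(priors, rng=random.Random(0))   # noisy game
clean = add_root_noise(priors, epsilon=0.0)            # noiseless game
print(noisy, clean)
```

Setting `epsilon` to zero recovers the noiseless game used for evaluation.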

Overall, in one iteration, we (i) play a number of episodes of noisy games with $f_\theta$-guided MCTS to generate experiences, where $f_\theta$ is the current DNN; (ii) use the experiences gathered in the most recent episodes to train a new DNN $f_{\theta'}$; (iii) assess the new DNN by running noiseless games with $f_{\theta'}$-guided MCTS. In the next iteration, we generate further experiences by playing noisy games with $f_{\theta'}$-guided MCTS. These experiences are then used to train yet another new DNN, and so on. The pseudocode for AlphaSeq is given in Table I.
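
The three steps of one iteration can be laid out as a loop skeleton. This is a structural sketch only, not a reproduction of Table I: the function `train_alphaseq`, its parameters, and the three callables it receives are hypothetical placeholders for the game-play, training, and evaluation routines described above.

```python
from collections import deque


def train_alphaseq(dnn, play_noisy_episode, train_dnn, evaluate_noiseless,
                   iterations, episodes_per_iter, buffer_episodes):
    """Iterate game-play and DNN update (hypothetical skeleton).

    play_noisy_episode(dnn) -> list of experiences from one noisy episode
    train_dnn(dnn, experiences) -> new DNN
    evaluate_noiseless(dnn) -> score from noiseless games
    """
    replay = deque(maxlen=buffer_episodes)   # keeps only the latest episodes
    history = []
    for _ in range(iterations):
        # (i) generate experiences with the current DNN
        for _ in range(episodes_per_iter):
            replay.append(play_noisy_episode(dnn))
        # (ii) train a new DNN on the experiences of the latest episodes
        experiences = [exp for episode in replay for exp in episode]
        dnn = train_dnn(dnn, experiences)
        # (iii) assess the new DNN with noiseless games
        history.append(evaluate_noiseless(dnn))
    return dnn, history


# Stubs for illustration: the "DNN" is a counter of experiences seen so far.
final_dnn, scores = train_alphaseq(
    dnn=0,
    play_noisy_episode=lambda dnn: [("state", "policy", 1.0)] * 2,
    train_dnn=lambda dnn, exps: dnn + len(exps),
    evaluate_noiseless=lambda dnn: dnn,
    iterations=3, episodes_per_iter=2, buffer_episodes=3,
)
print(final_dnn, scores)
```

The bounded `deque` is one simple way to restrict training to the most recent episodes; other buffer policies would fit the same skeleton.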

TABLE I:

In the following, we highlight some differences between AlphaSeq and AlphaGo.

  • In AlphaGo, the player can choose any of the legal positions (the Go board is $19 \times 19$) to place its black or white stone, following the rules of Go [11]. As a result, the overall number of states is at most $3^{NK}$ (in Go, $N = 19$ and $K = 19$; each position can have three possible states: occupied by no stone, a white stone, or a black stone). On the other hand, in AlphaSeq, the positions to place symbols are predetermined in each time step, and the player only needs to determine the values of the symbols in the predetermined positions. This restriction brings about two benefits: a) for MCTS, the overall number of legal states is reduced to

    $$\sum_{t=0}^{NK/\ell} 2^{\ell t}, \qquad (1)$$

    that is, the state at the beginning of time step $t$ has $2^{\ell t}$ possible values, where $N$ sequences of length $K$ are filled $\ell$ symbols at a time and $\ell$ is taken to divide $NK$ for simplicity; b) the simpler rule reduces the amount of knowledge the DNN needs to acquire, hence the DNN is easier to train compared with that in AlphaGo. (The knowledge that the DNNs in AlphaGo and AlphaSeq are supposed to learn also differs: in AlphaGo, the DNN needs to decide which position is more promising to place its stone, while in AlphaSeq, the DNN decides which symbols to place in the next $\ell$ predetermined positions.)

  • In AlphaSeq, the choice of $\ell$ (the number of symbols filled per time step) is a complexity tradeoff between MCTS and DNN; in AlphaGo, $\ell$ is always $1$. As mentioned above, the universe of all states in the game forms a tree. The depth of the tree is $\lceil NK/\ell \rceil$, which is the number of steps in Fig. 3 from left to right. This is exactly the number of MCTS runs needed in an episode. Thus, the larger $\ell$ is, the fewer MCTS runs are needed. On the other hand, a large $\ell$ yields more legal moves (i.e., $2^\ell$) in each state, hence burdening the DNN with a larger action space. Overall, for given $N$ and $K$ and a small $\ell$, for example $\ell = 1$, the task of the DNN is light since it only needs to decide whether to place $1$ or $-1$ in the next position. However, the number of MCTS runs in an episode is as large as $NK$. In contrast, for a large $\ell$, the number of MCTS runs in an episode is reduced to $\lceil NK/\ell \rceil$, but the DNN is burdened with a heavier task because it needs to evaluate $2^\ell$ possible moves for each state.

  • In the game of Go, the board is invariant to rotation and reflection. Thus, we should augment the training data to let DNN learn these features. Specifically, in AlphaGo Zero, each experience (board state and move distribution) can be transformed by rotation and reflection to obtain extra training data, and the state in an experience is randomly transformed before the experience is fed to the DNN [13]. On the other hand, in our game, no rotation or reflection is required because all positions are predetermined. Any rotated or reflected state is an illegal state.

  • Compared with AlphaGo, our computational power is rather limited. Thus, for large sequence sets beyond our computational power, a new technique, dubbed segmented induction, is devised to progressively discover better sequence sets. We show in Section IV that segmented induction performs well when applied to AlphaSeq.
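
The tradeoff governed by the number of symbols filled per step can be made concrete with a short calculation. The helper `tree_stats` and the example sizes below are hypothetical; the counts follow from the tree structure described above, with one level of the tree per time step and one branch per way of filling the next binary symbols.

```python
import math


def tree_stats(n_seq, seq_len, ell):
    """Count MCTS runs per episode, moves per state, and states in the game tree."""
    total = n_seq * seq_len
    steps = math.ceil(total / ell)
    # Number of filled positions at the beginning of each time step
    # (the final step may fill fewer than ell positions).
    filled_per_level = [min(t * ell, total) for t in range(steps + 1)]
    n_states = sum(2 ** f for f in filled_per_level)
    return {"mcts_runs": steps, "moves_per_state": 2 ** ell, "states": n_states}


for ell in (1, 2, 4):
    print(ell, tree_stats(n_seq=2, seq_len=4, ell=ell))
```

For this toy size, moving from 1 to 4 symbols per step cuts the MCTS runs per episode from 8 to 2 while raising the number of moves the DNN must rank from 2 to 16.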

In the following sections, we will demonstrate the searching capabilities of AlphaSeq in two applications: in Section III, we use AlphaSeq to rediscover an ideal complementary code set for multi-carrier CDMA systems; in Section IV, we use AlphaSeq to discover a new phase-coded sequence for pulse compression radar systems.

III Rediscover Ideal Complementary Code for Multi-Carrier CDMA

Code division multiple access (CDMA) is a multiple-access technique that enables numerous users to communicate in the same frequency band simultaneously [3]. The fundamental principle of CDMA communications is to distinguish different users (or channels) by unique codes pre-assigned to them [17]. Thus, CDMA code design lies at the heart of CDMA technology.

III-A Codes in Legacy CDMA Systems

Existing cellular CDMA systems work on a one-code-per-user basis [3, 18]. That is, the code set is designed such that exactly one code is assigned to each user, e.g., the orthogonal variable spreading factor (OVSF) code set used in W-CDMA downlink, the m-sequence set used in CDMA2000 uplink, and the Gold sequence set used in W-CDMA uplink [19, 20]. However, legacy CDMA systems are self-jamming systems since their code sets cannot guarantee user orthogonality under practical constraints and considerations, such as user asynchronies, multipath effects, and random signs of consecutive bits of user data streams [21]. (In CDMA, “bit” refers to the baseband modulated information symbols. Only BPSK/QPSK modulated symbols are considered in this paper; in general it can be shown that the codes discussed in this section are applicable to higher-order modulations. “Chip” refers to the entries of the spread spectrum code. Thus, a “chip” in CDMA corresponds to a “symbol” of a code sequence in Section II.)


Fig. 4: The interferences caused by user asynchronies (misalignments of bit boundaries), multi-paths, and random signs of consecutive bits, in CDMA uplink. To decode user A’s data, the receiver correlates the received signal with user A's code $\mathbf{c}_A$. Interferences are induced by (a) cyclic auto-correlation of $\mathbf{c}_A$; (b) flipped auto-correlation of $\mathbf{c}_A$; (c) cyclic cross-correlation between $\mathbf{c}_A$ and user B's code $\mathbf{c}_B$; (d) flipped cross-correlation between $\mathbf{c}_A$ and $\mathbf{c}_B$.

In CDMA uplink, each user spreads its signal bits by modulating the assigned code, and the signals from multiple users overlap at the receiver. To decode a user A’s signal bit, as shown in Fig. 4, the receiver cross-correlates the received signal with the locally generated code of user A. However, due to user asynchronies, multi-paths, and random signs in consecutive bits, the correlation results can suffer from interferences introduced by multiple paths of user A’s signal or signal from another user B. The potential interferences can be computed by the correlations between the signal bit and two overlapping interfering bits: when the signs of the two interfering bits are the same, the interferences are cyclic correlation functions (i.e., (a) and (c) in Fig. 4); when the signs of the two interfering bits are different, the interferences are flipped correlation functions (i.e., (b) and (d) in Fig. 4). On the other hand, CDMA downlink is a synchronous CDMA system and there are no asynchronies among signals of different users. However, multi-path and random signs in consecutive bits can still cause interferences through the above correlations among codes.

Mathematically, it has been proven that the ideal one-code-per-user code set that simultaneously zero-forces the above correlation functions does not exist [22]. Code sets used in legacy CDMA systems trade off among these correlation functions. For example, the m-sequence set has a nearly ideal cyclic auto-correlation property (to be exact, the cyclic auto-correlation function of the m-sequence is $-1$ for any non-zero shift, hence it is “nearly” optimal), while its cyclic cross-correlation and flipped correlation functions are unbounded. The Gold sequence set and the Kasami sequence set (a candidate in W-CDMA) have better cyclic cross-correlation properties and acceptable cyclic auto-correlation properties, but their flipped correlations are unbounded (see the excellent survey [20] on the correlation functions of these sequences).
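
The m-sequence property quoted above is easy to check numerically on a small case. The sketch below generates one period of a length-7 m-sequence from the linear recurrence $s[n+3] = s[n+1] \oplus s[n]$ (primitive polynomial $x^3 + x + 1$) and evaluates its cyclic auto-correlation. The function names and the choice of this particular short sequence are illustrative; the sequences used in deployed CDMA systems are much longer.

```python
def lfsr_m_sequence():
    """One period (length 7) of the m-sequence from s[n+3] = s[n+1] XOR s[n]."""
    bits = [1, 0, 0]                       # any non-zero initial register works
    while len(bits) < 7:
        bits.append(bits[-2] ^ bits[-3])
    return [1 - 2 * b for b in bits]       # map bit 0 -> +1, bit 1 -> -1


def cyclic_correlation(x, y, shift):
    """Cyclic correlation of two equal-length +1/-1 sequences at a given shift."""
    n = len(x)
    return sum(x[k] * y[(k + shift) % n] for k in range(n))


seq = lfsr_m_sequence()
autocorr = [cyclic_correlation(seq, seq, v) for v in range(len(seq))]
print(seq)
print(autocorr)
```

The auto-correlation equals the sequence length at zero shift and $-1$ at every other shift, which is the "nearly ideal" behavior described in the text.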

III-B Multi-Carrier CDMA and Ideal Complementary Codes

The limitations of legacy CDMA systems motivate researchers to develop multi-carrier CDMA (MC-CDMA) systems where complementary codes can be used to simultaneously null all correlation functions among codes that may cause interferences [21, 23].

The basic idea of complementary codes is to assign a flock of element codes to each user, as opposed to just one code in legacy CDMA systems. In MC-CDMA uplink, the signal bits of a user are spread by each of its element codes and sent over different subcarriers. When passing through the channel, the subcarriers can be viewed as separate virtual channels that have the same delay. The receiver first de-spreads the received signal in each individual subcarrier (i.e., correlates the received signal in each subcarrier with the corresponding element code), and then sums up the de-spreading outcomes of all subcarriers. In other words, the operations in each individual channel are the same as in legacy CDMA systems: the new step is the summing of the outputs of the virtual channels, which cancels out the interferences induced by individual correlations in the underlying subcarriers.

To be specific, let us consider a MC-CDMA system with users, where a flock of element codes of length are assigned to each user. An ideal complementary code set that can enable interference-free MC-CDMA systems is a code set that meets the following criteria simultaneously:

  1. Ideal cyclic auto-correlation function (CAF): for the element codes assigned to a user , i.e., , the sum of the cyclic auto-correlation function of each code is zero for any non-zero shift:

    (2)

    where delay (chip-level) . Hereinafter, the index additions in the square brackets refer to modulo- additions.

  2. Ideal cyclic cross-correlation function (CCF): for two flocks of codes assigned to users and , i.e., , the sum of their cyclic cross-correlation functions is always zero irrespective of the relative shift:

    (3)

    where delay and .

  3. Ideal flipped correlation function (FCF): for two flocks of codes assigned to users and , i.e., , the sum of their flipped correlation functions is always zero for any non-zero shift (flipped correlation is only defined for non-zero delay):

    (4)

    where delay ; and can be the same (flipped auto-correlation function) or different (flipped cross-correlation function).
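The first two criteria can be checked numerically. The following is a minimal sketch, assuming each flock is stored as a list of equal-length ±1 element codes (one per subcarrier) and that the cyclic correlation takes the standard form with modulo index addition. The function names and the toy code set are illustrative assumptions and are not taken from the paper; the flipped correlation of criterion 3 is deliberately left out because its defining expression is not reproduced here.

```python
def cyclic_correlation(x, y, shift):
    """Cyclic correlation of two equal-length sequences at the given shift."""
    n = len(x)
    return sum(x[l] * y[(l + shift) % n] for l in range(n))


def flock_correlation_sum(flock_j, flock_k, shift):
    """Sum, over the element codes of two flocks, of their cyclic correlations."""
    return sum(cyclic_correlation(cj, ck, shift) for cj, ck in zip(flock_j, flock_k))


def satisfies_caf_and_ccf(code_set):
    """Check criteria 1 and 2 only; criterion 3 (flipped correlation) is not modelled."""
    n = len(code_set[0][0])
    for j, flock_j in enumerate(code_set):
        # Criterion 1: the summed auto-correlation vanishes at every non-zero shift.
        if any(flock_correlation_sum(flock_j, flock_j, v) != 0 for v in range(1, n)):
            return False
        # Criterion 2: the summed cross-correlation vanishes at every shift.
        for flock_k in code_set[j + 1:]:
            if any(flock_correlation_sum(flock_j, flock_k, v) != 0 for v in range(n)):
                return False
    return True


# Hypothetical toy example (not the code set in (7)): a length-2 Golay pair and its mate.
toy_set = [
    [[1, 1], [1, -1]],    # flock assigned to user 1
    [[-1, 1], [-1, -1]],  # flock assigned to user 2
]
print(satisfies_caf_and_ccf(toy_set))
```

In the toy set, the individual auto-correlations of the two element codes of a user are non-zero at shift 1, but they cancel when summed across subcarriers, which is the mechanism described above.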

Some known mathematical constructions of ideal complementary codes are available in [18]. In this section, we make use of AlphaSeq to rediscover a set of ideal complementary codes. Our aim is to investigate and evaluate the searching capability of AlphaSeq: i.e., whether it can rediscover an ideal complementary code set and how it goes about doing so. Further, we would like to investigate the impact of the hyperparameters used in the search algorithm on the overall performance of AlphaSeq, so as to obtain useful insights for discovering other unknown sequences (e.g., in Section IV, we will make use of AlphaSeq to discover phase-coded sequences for pulse compression radar systems).

III-C AlphaSeq for MC-CDMA

In this subsection, we use AlphaSeq to rediscover an ideal complementary code set for MC-CDMA systems. As stated above, the ideal complementary code set is the code set that fulfills the three criteria in (2), (3), and (4). In this context, given a sequence set , we define the following metric function to measure how good set is for MC-CDMA systems.

Metric Function: For a sequence set consisting of sequences of the same length , the metric function below reflects how good is for MC-CDMA systems:

(5)

Note that our desired metric value . For AlphaSeq, the objective is then to discover the sequence set that minimizes this metric function.

As an essential part of the training paradigm in AlphaSeq, a reward function is needed to map a found sequence set to a reward . In general, we could design this reward function to be a linear (or non-linear) mapping from the value range of the metric function to the interval . This is in fact a normalization process to fit general objectives to the architecture of AlphaSeq (specifically, normalizing the rewards of different problems allows these problems to share the same underlying hyperparameters in the DNN and MCTS of the AlphaSeq architecture). To rediscover the ideal complementary code, we define the reward function as follows:

Reward Function: For any sequence set with metric , the reward for MC-CDMA systems is defined as

(6)

where is some sort of a worst-case . That is, when , then ; and when , then . We initially set (see Appendix B for the derivation), and initialize the DNN to (i.e., the parameters in the DNN are randomly set to ) to play noiseless games. Then, is set as the mean metric of the sequences found by these noiseless games, i.e., . After this, will not be changed in future games. We note that the initial games do not find good sequences, but the sequences nevertheless yield a much lower than . Using as increases the slope of the first line in (6).

Based on the metric function and reward function defined above, we implemented AlphaSeq and trained DNN to rediscover an ideal complementary code for MC-CDMA. A known ideal complementary code [18] is chosen as benchmark.

Benchmark: When , , and , the ideal complementary code set exists. The mathematical constructions in [18] give us

(7)

As can be seen, there are flocks of codes in , each flock contains codes and the length of each code is . It can be verified that .

To rediscover the code set, there are symbols to be filled in the game, and the number of all possible sequence-set patterns is . Discovering the global optimum out of possible patterns is in fact not a difficult problem based on brute-force exhaustive search (even though it takes several days on our computer). The results of exhaustive search indicate that in (7) is not the only optimal pattern (that achieves ) when , , and . There are in fact optimal patterns that can be divided into non-isomorphic types (i.e., each pattern has other isomorphic patterns, see Appendix B for the definition of isomorphic pattern).

Implementation: We implemented and ran AlphaSeq on a computer with a single CPU (Intel Core i7-) and a single GPU (NVIDIA GeForce GTX Ti). The parameter settings are listed in Table II.

TABLE II:

For the symbol filling game, we set , , and . In other words, in each time step, symbols were placed in the sequence set, and an episode ended after time steps when we obtained a complete sequence set. The metric function and reward function were then calculated following (5) and (6). An episode gave us experiences.

For DNN-guided MCTS, at each state , we first set as the root node , and then ran look-ahead simulations starting from . For each simulation, Dirichlet noise was added to the prior probability of to introduce exploration, where the parameters for the Dirichlet distribution were set as . After simulations, the probabilistic move-selection policy was then calculated by (22), where we set for the first one third of the time steps (the probability of choosing a move is proportional to its visit count), and for the rest of the time steps (deterministically choose the move with the highest visit count).

The DNN implemented in AlphaSeq is a deep convolutional network (ConvNets). This DNN consists of six convolutional layers together with batch normalization and rectifier nonlinearities (the detailed architecture of this ConvNets can be found in Appendix A). The DNN update cycle and . That is, every episodes, we trained the ConvNets using the experiences accumulated in the latest episodes (i.e., experiences) by stochastic gradient descent. In particular, the mini-batch size was set to , and we randomly sampled mini-batches without replacement from the experiences to train the ConvNets. For each mini-batch, the loss function is defined by (23) in Appendix A.
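As an illustration of sampling mini-batches without replacement from the accumulated experiences, consider the minimal sketch below. The function name, the treatment of a leftover partial batch (it is dropped), and the toy sizes in the usage lines are assumptions made for illustration; the paper's actual settings are those of Table II.

```python
import random


def minibatches_without_replacement(experiences, batch_size, rng=None):
    """Shuffle the experiences once and cut them into disjoint mini-batches.

    Every experience appears in at most one mini-batch; a final partial
    batch, if any, is dropped (an assumption of this sketch).
    """
    if rng is None:
        rng = random.Random()
    shuffled = list(experiences)
    rng.shuffle(shuffled)
    last_start = len(shuffled) - batch_size
    return [shuffled[i:i + batch_size] for i in range(0, last_start + 1, batch_size)]


# Hypothetical usage with toy sizes: 10 experiences, mini-batches of 3.
batches = minibatches_without_replacement(range(10), 3, random.Random(0))
print(len(batches))
```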

Remark: In Table II, the width and length of the input image fed into the DNN are chosen to match and , i.e., and . However, it should be emphasized that this is not an absolute necessity. In general, we find that setting the input of the DNN to be an image can speed up the learning process of the DNN. For example, if we had set instead of in this experiment, then it would be better to set and (i.e., the DNN takes an image as input, and in each time step, one row of the image is filled). Accordingly, any intermediate state (i.e., a partially-filled sequence set pattern) must first be transformed to a image before it is fed into the ConvNets (the last symbols in the set will be padded with because the original set has fewer symbols).

III-D Performance Evaluation

Over the course of training, AlphaSeq ran episodes, in which experiences were generated. To monitor the evolution of AlphaSeq, every episodes when the DNN was updated, we evaluated the searching capability of AlphaSeq by using it (with the updated DNN) to play noiseless games (these noiseless games are in addition to the noisy games used to provide experiences to train the DNN). The mean metric and the minimum metric of the found sequence sets were recorded and plotted in Fig. 5.


Fig. 5: The reinforcement learning process of AlphaSeq to rediscover a set of ideal complementary codes for MC-CDMA systems. Mean metric , minimum metric and the number of visited states versus episodes, where the DNN update cycle and .

As can be seen from Fig. 5, with the continuous training of DNN, AlphaSeq gradually discovered sequence sets with smaller and smaller metric values. After episodes, AlphaSeq rediscovered an ideal complementary code set given by

(8)

It is straightforward to see that is an isomorphic version of : i.e., if we denote by , then . We found that AlphaSeq could find different ideal sequence sets in different runs. For example, in another run, AlphaSeq eventually discovered an ideal sequence set that is non-isomorphic to , giving

(9)

The complexity of AlphaSeq is measured by means of distinct states that have been visited. Specifically, we stored all the states (including intermediate states and terminal states) encountered over the course of training in a Hash table. Every episodes, we recorded the length of the Hash table (i.e., the total number of visited states by then) and plotted them in Fig. 5 as the training goes on.
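The bookkeeping described above can be sketched as follows. This is a minimal illustration, assuming a state is represented as a flat tuple of symbols with 0 marking a vacant position; the class and method names are hypothetical and the actual implementation may differ.

```python
class VisitedStates:
    """Tracks the distinct states seen so far (a Python set is a hash table)."""

    def __init__(self):
        self._seen = set()
        self.history = []  # number of distinct states at each checkpoint

    def visit(self, state):
        # Assumed encoding: a flat sequence of symbols, 0 marking a vacant position.
        self._seen.add(tuple(state))

    def checkpoint(self):
        """Record and return the current number of distinct visited states."""
        self.history.append(len(self._seen))
        return self.history[-1]


# Hypothetical usage: two distinct states, one of them visited twice.
tracker = VisitedStates()
for state in ([1, 0, 0], [1, 0, 0], [1, -1, 0]):
    tracker.visit(state)
print(tracker.checkpoint())
```

Calling `checkpoint()` once per DNN update yields the curve of visited states versus episodes.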

An interesting observation is that there is a turning point on the curve of the number of distinct visited states. The slope of this curve corresponds to the extent to which AlphaSeq is exploring new states in its choice of actions. Under the framework of AlphaSeq, there are two kinds of exploration: 1) Inherent exploration – This is introduced by the variance of the action-selection policy. That is, the more random the action-selection policy is, the more new states are likely to be explored by AlphaSeq. 2) Artificial exploration – We deliberately add extra artificial randomness to AlphaSeq to let it explore more states. For example, the Dirichlet noise added to the root vertex in DNN-guided MCTS and the temperature parameter that determines how to calculate the policy both add to the randomness. At the beginning of the game (i.e., episode ), the policy of AlphaSeq is inherently quite random because the DNN is randomly initialized. Thus, both inherent exploration and artificial exploration contribute to the slope of this curve. At the end of the game (i.e., episode ), the policy converges, hence the inherent exploration drops off, and only artificial exploration remains.

This turning point was in fact observed in all simulations of AlphaSeq in the various applications we tried (not just the application for rediscovering complementary codes here; see Section IV on the application of AlphaSeq to discover phase-coded sequences for pulse compression radar). In general, we can then divide the overall reinforcement learning process of AlphaSeq into two phases based on this turning point. Phase I is an exploration-dominant phase (before the turning point), in which the behaviors of AlphaSeq are quite random. As a result, AlphaSeq actively explores increasingly more states per episode in the overall solution space. After gaining familiarity with the whole solution space, AlphaSeq enters an exploitation-dominant phase (after the turning point), in which instead of exploring for more states, AlphaSeq tends to focus more on exploitation.

Remark: The DNN update cycle is important to guarantee that the algorithmic iteration proceeds in a direction of performance improvement. In AlphaSeq, given a DNN , the move-selection policy given by the -guided MCTS is usually much stronger than the raw policy output of . Thus, we first run -guided MCTS to play games and generate experiences. Then, we use these experiences to train a new DNN , so that can learn the stronger move given by -guided MCTS.

In this context, the DNN update cycle must be chosen so that the experiences are sufficient to capture the fine details of given by -guided MCTS. In particular, parameter is closely related to : a larger means more elements in (i.e., must capture possible moves in each step), and hence a larger is needed to guarantee that is well represented by the experiences.


Fig. 6: The polynomial fitted convergence curve for AlphaSeq and DNN player, where the DNN update cycle . The positive direction of -axis is a direction of performance improvement for DNN, while the positive direction of -axis is a direction of performance improvement for AlphaSeq.

As stated in Section II, the essence of AlphaSeq is a process of iterative “game-play with DNN-guided MCTS” and “DNN update”: the improvement of the DNN brings about improvement of the DNN-guided MCTS, and the experiences generated by the improved MCTS in turn bring about further improvement of the DNN through training. To verify this, each time the DNN is updated, we assess the new DNN by using it (without MCTS, and with no noise) to discover sequences and record their mean metric . Specifically, at each state , the player directly adopts the raw policy output of the DNN, i.e., , to sample the next move without relying on the MCTS outputs .

Fig. 6 presents all the pairs in the exploitation phase, and the corresponding polynomial fitted convergence curve. In particular, the positive direction of -axis in Fig. 6 is a direction of performance improvement for DNN, and the positive direction of -axis is a direction of performance improvement for AlphaSeq. The convergence curve in Fig. 6 reflects how the two ingredients, “MCTS-guided game-play” and “DNN update”, interplay and mutually improve in the reinforcement learning process of AlphaSeq.

IV AlphaSeq for Pulse Compression Radar

Radar radiates radio pulses for the detection and location of reflecting objects [4]. A classical dilemma in radar systems arises from the choice of pulse duration: given a constant power, longer pulses have higher energy, providing greater detection range; shorter pulses, on the other hand, have larger bandwidth, yielding higher resolution. Thus, there is a trade-off between distance and resolution. Pulse compression radar can enable high-resolution detection over a large distance [4, 24, 25]. The key is to use modulated pulses (e.g., phase-coded pulse) rather than conventional non-modulated pulses.

IV-A Pulse Compression Radar and Phase Codes

The transmitter of a binary phase-coded pulse compression radar system transmits a pulse modulated by rectangular subpulses. The subpulses are a binary phase code of length . Each entry of the code is or , corresponding to phase and . Following the definition in [25] and [26], after subpulse-matched filtering and analog-to-digital conversion, the received sequence is

(10)

where 1) are coefficients proportional to the radar cross sections of different range bins [25]. In particular, corresponds to the range bin of interest, and the radar’s objective is to estimate given the received sequence ; 2) is the white Gaussian noise; 3) Matrix , as given in (11), is a shift matrix capturing the different propagation time needed for the clutter to return from different range bins [26].

(11)

where and . That is, in matrix , all entries except for that on the -th off-diagonal are . The effect of matrix is to right-shift or left-shift the phase code with zero padding: when , is a right-shifted version of ; when , is a left-shifted version of .
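The effect of the shift matrix can be illustrated with a minimal sketch. It assumes that a positive index places the ones on the sub-diagonal that shifts the code to the right (the sign convention of (11) is an assumption here), and the helper names are hypothetical.

```python
def shift_matrix(n, i):
    """n-by-n matrix with ones on the i-th off-diagonal and zeros elsewhere.

    Assumed convention: entry (row, col) is 1 when row - col == i, so that
    i > 0 shifts a vector to the right and i < 0 shifts it to the left.
    """
    return [[1 if row - col == i else 0 for col in range(n)] for row in range(n)]


def apply_matrix(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


code = [1, 2, 3]  # toy entries chosen only to make the shift visible
print(apply_matrix(shift_matrix(3, 1), code))   # right shift with zero padding
print(apply_matrix(shift_matrix(3, -1), code))  # left shift with zero padding
```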

To estimate the coefficient , a widely studied estimator is the matched filtering (MF) estimator:

(12)

where the AWGN noise is ignored since the received signal is interference-limited (i.e., the interference power dominates over the noise power). Given the fact that we have no information on , the problem is then to discover a phase code that can maximize the signal-to-interference ratio (SIR) (larger SIR yields better estimation performance):

(13)

In fact, this is the well-known “merit factor problem” occurring in various guises in many disciplines [27, 28, 29]. In the past few decades, a variety of phase codes have been devised to achieve large SIR (merit factor), e.g., the Rudin-Shapiro sequences (asymptotically, ), m-sequences (asymptotically, ), and Legendre sequences (asymptotically, ) (see the excellent surveys [28, 29] and the references therein). Overall, the merit factor problem remains open. Experimental results show that does not increase as the sequence length increases. So far, the best-known merit factor of is achieved by the Barker sequence of length .
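For concreteness, the classical merit factor of a binary code can be computed as in the sketch below. It assumes the standard definition, namely the squared code length divided by twice the sum of squared aperiodic auto-correlation sidelobes, which is what an SIR of the form in (13) reduces to when shifts of both signs are counted. The function names are illustrative, and the example uses the standard length-13 Barker code.

```python
def aperiodic_autocorrelation(code, shift):
    """Aperiodic (zero-padded) auto-correlation of the code at a positive shift."""
    return sum(code[l] * code[l + shift] for l in range(len(code) - shift))


def merit_factor(code):
    """Squared length over twice the sidelobe energy (standard definition)."""
    n = len(code)
    sidelobe_energy = sum(aperiodic_autocorrelation(code, k) ** 2 for k in range(1, n))
    return n * n / (2 * sidelobe_energy)


# Standard Barker code of length 13: every sidelobe has magnitude at most 1.
BARKER_13 = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1]
print(merit_factor(BARKER_13))
```

The factor of two in the denominator accounts for left and right shifts contributing the same sidelobe values.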

The motivation of the MF estimator comes from the fact that matched filtering provides the highest signal-to-noise ratio (SNR) in the presence of white Gaussian noise [30]. However, in the case of radar, the received signal is interference-limited, hence interference suppression is much more important. This motivates researchers to devise a mismatched filtering (MMF) estimator [25, 31, 26].

Instead of using the transmitted phase-code , the MMF estimator uses a general real-valued code to correlate the received sequence, giving

(14)

where the real-valued sequence is to be optimized at the receiver. The problem is then to find a pair of sequences so that the signal-to-interference ratio (SIR) in (15) can be maximized.

(15)

It has been shown in [26] that, given a phase code , the optimal sequence that maximizes is , where matrix is given by

(16)

Substituting in (15) gives

(17)

Notice that only depends on the phase code , hence, the objective for the design of the MMF estimator is then to discover a phase-code that can maximize in (17).

Remark: The MMF estimator is superior to the MF estimator since is no less than given the same phase code . However, the problem of discovering a phase code that maximizes (17) has not received much attention from the research community compared with the merit factor problem (i.e., discovering a code that maximizes (13)). This is perhaps due to the more complex criterion and the lack of suitable mathematical tools [29]. In this section, we make use of AlphaSeq to discover phase codes for pulse compression radar with the MMF estimator.

IV-B AlphaSeq for Pulse Compression Radar

We choose (17) as the metric function of AlphaSeq:

(18)

where matrix is given in (16). For AlphaSeq, the objective is to discover the sequence that can maximize this metric function. Additional analyses on the structure of this metric function can be found in Appendix C.

Given a phase code with metric , we define a linear reward function as follows:

(19)

where and are an upper bound and a lower bound of , giving, and (see the derivations and analyses in Appendix C).
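A linear reward of this kind can be sketched as below. This is a minimal illustration that assumes the reward interval is [-1, 1] (as in AlphaGo Zero) and that metrics outside the bounds are clipped; both choices, as well as the function and parameter names, are assumptions rather than the exact form of (19).

```python
def linear_reward(metric, lower, upper, reward_min=-1.0, reward_max=1.0):
    """Map a metric linearly from [lower, upper] to [reward_min, reward_max].

    Metrics outside the bounds are clipped (an assumption of this sketch).
    """
    clipped = min(max(metric, lower), upper)
    fraction = (clipped - lower) / (upper - lower)
    return reward_min + fraction * (reward_max - reward_min)


# Hypothetical bounds chosen for illustration only.
print(linear_reward(5.0, lower=0.0, upper=10.0))
```

The slope of the mapping is inversely proportional to the width of the range, so a narrower range separates sequences of similar quality more sharply.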

Remark: Generally, if is large in a problem, that means we are asking AlphaSeq to search over a large solution space all at once. We found empirically that setting the and as above can result in AlphaSeq not being able to zoom in to a good solution within an acceptable time. We will later introduce a technique dubbed “segmented induction” to induce AlphaSeq to zoom in to a good solution. In essence, segmented induction uses a smaller range of , but progressively changes and as better is obtained.

Based on the metric function and reward function defined above, we implemented AlphaSeq and trained DNN to discover a phase code for the MMF estimator. A Legendre sequence [15] is chosen as the benchmark.

Benchmark: We choose the Legendre sequence of length as our benchmark:

For the MMF estimator, this Legendre sequence yields SIR . For reference, yields a merit factor of when the MF estimator is used.

For the corresponding AlphaSeq game, there are symbols to fill. The number of all possible sequence-set patterns is . The complexity of exhaustive search for the global optimum is , and it would take more than one million years for our computer to find the optimal solution. In other words, the optimal solution of when is unavailable. In this context, the second benchmark we choose is random search. For random search, we randomly create -symbol sequences and record the maximum SIR obtained given a fixed budget of random trials.

TABLE III:

Implementation: In the AlphaSeq implementation, the parameter settings are listed in Table III. As seen in the table, we aim to discover one sequence of length wherein and in the AlphaSeq game. The number of symbols filled in each time step is set to , and the ConvNets takes images as input. To feed an intermediate state (i.e., a partially-filled pattern) into the ConvNets, we first transform it to a image (the missing symbol will be padded with ). A complete sequence is obtained after time steps, where symbols are obtained. Then, we ignore the last symbol and calculate the metric function and reward function following (18) and (19). The DNN update cycle is set to and . That is, every episodes, DNN will be updated using the experiences accumulated in the latest episodes.

Given the huge solution space, it is challenging for our computer to train AlphaSeq to find the optimal solution. For one thing, each episode in this problem consumes much more time than in the complementary code rediscovery problem in Section III, because of the larger number of MCTSs run in each episode and the larger number of simulations run in each MCTS. For another, the large solution space in this problem requires a massive number of exploration-dominant episodes so that AlphaSeq can visit enough states to gain familiarity with the whole solution space. As a result, the exploration phase will last a long time before AlphaSeq enters the exploitation phase. To tackle the above challenges, we use the following two techniques to accelerate the training process:

  1. Make more efficient use of experiences. Every episodes, we trained the DNN using the experiences accumulated in the latest episodes ( experiences in total) by stochastic gradient descent. In Section III, the mini-batches were randomly sampled without replacement. That gave us mini-batches ( was the mini-batch size). Here, we want to make more efficient use of experiences. To this end, every episodes, we randomly sample mini-batches with replacement from the latest experiences to train the ConvNets.

  2. Segmented induction. This technique is particularly useful when the upper and lower bounds of the metric function span a large range, or when there is no way to bound the metric function. The essence of segmented induction is to segment the large range of the metric function into several small ranges, and define the linear reward in small ranges rather than in a single large range. To be more specific, assume a metric function with values within the range . Then, rather than initializing and in (19), we segment into three small overlapping ranges , , , and define the linear reward in these small ranges: in episode , we define the reward function in the first small range and initialize and . With the training of DNN, AlphaSeq is able to discover better and better sequences in the range . When AlphaSeq discovers sequences with reward approaching (i.e., the mean metric function of the found sequences approaches ), we then redefine the reward with the second range . That is, we set , , and let AlphaSeq discover sequences in the second small range. When AlphaSeq is again able to discover sequences with reward approaching , we redefine the reward in the third small range, and so on. Overall, with a smaller range at a given time, the slope of the reward function in (19) increases, allowing AlphaSeq to distinguish the relative quality of different sequences with higher contrast. Two notes on the ranges: a) non-overlapping intervals are inadvisable, since experimental results show that AlphaSeq cannot learn well when using them; b) the small ranges segmented here are for illustration purposes only, and in general the ranges need to be designed according to the specifics of each problem.
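The segmented-induction schedule can be sketched as follows. This is a minimal illustration under several assumptions: a larger metric is better, the reward within each range is linear on [-1, 1] with clipping, and the switch to the next range happens once the reward of the mean metric reaches a threshold. The class name, the threshold value and the example ranges are hypothetical.

```python
class SegmentedReward:
    """Linear reward over a sequence of overlapping metric ranges."""

    def __init__(self, ranges, switch_threshold=0.9):
        self.ranges = list(ranges)  # [(lower, upper), ...], ordered from easy to hard
        self.index = 0              # range currently used for the reward
        self.switch_threshold = switch_threshold

    def reward(self, metric):
        """Linear reward in [-1, 1] over the current range, with clipping."""
        lower, upper = self.ranges[self.index]
        clipped = min(max(metric, lower), upper)
        return -1.0 + 2.0 * (clipped - lower) / (upper - lower)

    def update(self, mean_metric):
        """Move to the next range once the mean metric nears the current upper bound."""
        at_last_range = self.index == len(self.ranges) - 1
        if not at_last_range and self.reward(mean_metric) >= self.switch_threshold:
            self.index += 1
        return self.ranges[self.index]


# Hypothetical overlapping ranges, for illustration only.
schedule = SegmentedReward([(0, 10), (5, 20), (15, 40)])
print(schedule.update(9.6))  # mean metric close to 10, so the second range takes over
```

Because consecutive ranges overlap, a sequence that earned a high reward just before a switch still receives a non-minimal reward just after it.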

IV-C Performance Evaluation

For training, we ran AlphaSeq over episodes, generating experiences in total. As in Section III, to monitor the evolution of AlphaSeq, every episodes when the DNN was updated, we evaluated the searching capability of AlphaSeq by using AlphaSeq (with the updated DNN) to play noiseless games, and recorded their mean metric and maximum metric . Fig. 7 presents the and versus episodes during the process of reinforcement learning.


Fig. 7: The reinforcement learning process of AlphaSeq to discover a phase-coded sequence for pulse compression radar. Mean metric , maximum metric and total number of visited states versus episodes, where the DNN update cycle and .

As can be seen, the first episodes are the exploration-dominant phase and the episodes after that are the exploitation-dominant phase. After episodes, AlphaSeq discovers a sequence with metric :

Compared with the Legendre sequence, triples the SIR at the output of an MMF estimator.

Remark: In this implementation, the value range of , i.e., , is segmented to three small ranges , , and . In the first episodes, the linear reward is defined in the first small range : metric corresponds to reward , and corresponds to reward ; from episode to , the linear reward is defined in the second small range ; after episode , the linear reward is defined in the last small range .


Fig. 8: The searching capability comparison of AlphaSeq and random search. The AlphaSeq curve is the maximal metric versus number of visited states.

We next compare the searching capability of AlphaSeq with random search given the same complexity budget, where complexity is measured by the number of distinct visited states. (Directly evaluating the complexity through the amount of computation time consumed would not be fair, because AlphaSeq uses both GPU and CPU in the implementation, with a CPU/GPU load that varies over time, while random search uses only CPU (almost ).) For AlphaSeq, the visited states include both intermediate states and terminal states, while for random search, only terminal states (i.e., completely-filled sequences) will be searched.

In Fig. 8, the AlphaSeq curve is the maximal metric versus number of visited states. This curve is a transcription of two curves in Fig. 7: we combine the two curves, versus episodes and number of visited states versus episodes, into one curve here. Fig. 8 also shows the maximal metric versus number of visited states for random search. To get this curve, given a state-visit budget, we performed runs of the experiments. For each run , we traced the maximum metric value obtained after a given number of random trials, denoted by , where is the number of trials, which correspond to the number of visited (terminal) states. The black curve in Fig. 8 is (i.e., a mean-max curve).
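The mean-max curve for random search can be reproduced in outline as follows. This is a minimal sketch: the metric is passed in as a function so that any objective can be plugged in, and the function name, the seeding and the toy metric in the usage lines are assumptions for illustration.

```python
import random


def mean_max_curve(metric, length, num_trials, num_runs, seed=0):
    """Average, over independent runs, of the best metric found after t random trials."""
    rng = random.Random(seed)
    totals = [0.0] * num_trials
    for _ in range(num_runs):
        best = float("-inf")
        for t in range(num_trials):
            # One random trial visits one terminal state: a fully filled ±1 sequence.
            sequence = [rng.choice((-1, 1)) for _ in range(length)]
            best = max(best, metric(sequence))
            totals[t] += best  # running maximum of this run after t + 1 trials
    return [total / num_runs for total in totals]


# Hypothetical toy metric (sum of entries), used only to exercise the function.
curve = mean_max_curve(sum, length=8, num_trials=50, num_runs=20)
print(curve[0], curve[-1])
```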

As can be seen from Fig. 8, the largest metric that random search can find is on average a log-linear function of the number of visited states. After randomly visiting states, the best sequence random search can find is on average with metric . On the other hand, AlphaSeq discovers sequences with after visiting only states.


Fig. 9: The MSE of and for estimation in pulse compression radar systems.

Finally, we assess the estimation performance of benchmarked against the Legendre sequence when used in a pulse compression radar system. In the simulation, we assume the radar radiates pulses internally modulated by or . The received signal is given by equation (10), where are Gaussian random variables with zero mean and the same variance , and AWGN noise is ignored. The receiver estimates using an MMF estimator, and we measure the estimation performance by mean square error (MSE) . Fig. 9 presents MSE versus for and . As can be seen, outperforms , and the MSE gains are up to about dB.

V Conclusion

This paper has demonstrated the power of deep reinforcement learning (DRL) for sequence discovery. We believe that sequence discovery by DRL is a good supplement to sequence construction by mathematical tools, especially for problems with complex objectives intractable to mathematical analysis.

Our specific contributions and results are as follows:

  1. We proposed a new DRL-based paradigm, AlphaSeq, to algorithmically discover a set of sequences with desired property. AlphaSeq leverages the DRL framework of AlphaGo to solve a Markov Decision Process (MDP) associated with the sequence discovery problem. The MDP is a symbol-filling game, where a player follows a policy to consecutively fill symbols in the vacant positions of a sequence set. In particular, AlphaSeq treats the intermediate states in the MDP as images, and makes use of deep neural network (DNN) to recognize them.

  2. We introduced two new techniques in AlphaSeq, both absent in AlphaGo, to accelerate the training process. The first technique is to allow AlphaSeq to make moves at a time (i.e., filling sequence positions at a time). The choice of is a complexity tradeoff between the MCTS and the DNN. The second technique, dubbed segmented induction, is to change the reward function progressively to guide AlphaSeq to good sequences in its learning process.

  3. We demonstrated the searching capabilities of AlphaSeq in two applications: a) We used AlphaSeq to rediscover a set of ideal complementary codes that can zero-force all potential interferences in multi-carrier CDMA systems. b) We used AlphaSeq to discover new sequences that triple the signal-to-interference ratio – benchmarked against the well-known Legendre sequence – of a mismatched filter estimator in pulse compression radar systems. The mean square error (MSE) gains are up to dB for the estimation of radar cross sections.

Appendix A

This appendix describes the implementation details of AlphaSeq. Other than some custom features for our purpose, the general implementation follows AlphaGo Zero [13] and AlphaZero [16]. The source code can be found at GitHub [32].

A-A MCTS

MCTS is performed at each intermediate state to determine policy , and this is achieved by multiple look-ahead simulations along the tree. In the simulations, more promising vertices are visited frequently, while less promising vertices are visited less frequently. The problem is how to determine which vertices are more promising and which are less promising in the simulations, i.e., how to evaluate a vertex in MCTS. In standard MCTS algorithms, this vertex-evaluation is achieved by means of random rollouts. That is, for a new vertex encountered in each simulation, we run random rollout from this vertex to a leaf vertex such that a reward can be obtained (see [14] for more details). The randomly sampled rewards over all simulations are then used to evaluate a vertex.

In AlphaGo/AlphaSeq, instead of random rollouts, DNN is introduced to evaluate a vertex. The only two ingredients needed for MCTS are a root vertex and a DNN . First, given the root vertex , a search tree can be constructed where each vertex contains edges (since there are possible moves for each state). Each edge, denoted by , stores three statistics: a visit count , a mean reward , and an edge-selection prior probability . Second, MCTS uses DNN to evaluate each vertex (state). The input of is and the output is . Specifically, each time we feed a vertex into the DNN, it outputs a policy estimation and a reward estimation . Each entry in distribution is exactly the prior probability for each edge of vertex , and will be used for updating the mean reward , given by (21) later.

MCTS is operated by means of look-ahead simulations. Specifically, at a root vertex , MCTS first initializes a “visited tree” (this visited tree is used to record all the vertices visited in the MCTS. It is initialized to have only one root vertex) and runs simulations on the visited tree. Each simulation proceeds as follows [13]:

  1. Select – all the simulations start from the root vertex and finish when a vertex that has not been seen is encountered for the first time. During a simulation, we always choose the edge that yields a maximum upper confidence bound. Specifically, at each vertex , the simulation selects edge to visit, and

    where is a constant that controls the tradeoff between exploration and exploitation.

  2. Expand and Evaluate – when encountering a previously unseen vertex (for the first simulation, this is in fact ), the simulation evaluates it using DNN, giving, , where the policy distribution . Then, we add this new vertex to the visited tree, and the statistics of ’s edges are initialized by , , and for .

  3. Backup – After adding vertex to the visited tree, the simulation updates all the vertices along the trajectory of encountering . Specifically, for each edge on the trajectory (including ), we update

    (20)
    (21)
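The select and backup steps can be sketched in a few lines. The sketch assumes the upper-confidence rule and the running-mean update of AlphaGo Zero [13], since the corresponding expressions are not reproduced above; the class, function and parameter names are hypothetical.

```python
import math


class Edge:
    """Statistics stored on an edge: visit count, mean reward, prior probability."""

    def __init__(self, prior):
        self.visit_count = 0
        self.mean_reward = 0.0
        self.prior = prior


def select_edge(edges, c_puct=1.0):
    """Index of the edge with the largest upper confidence bound (assumed PUCT form)."""
    total_visits = sum(edge.visit_count for edge in edges)

    def score(edge):
        exploration = c_puct * edge.prior * math.sqrt(total_visits) / (1 + edge.visit_count)
        return edge.mean_reward + exploration

    return max(range(len(edges)), key=lambda i: score(edges[i]))


def backup(path, value):
    """Update every edge on the simulated trajectory with the evaluated reward."""
    for edge in path:
        edge.visit_count += 1
        # Incremental form of the running mean of the rewards seen through this edge.
        edge.mean_reward += (value - edge.mean_reward) / edge.visit_count


# Hypothetical usage: a poor first visit to edge 0 steers the next selection to edge 1.
edges = [Edge(0.2), Edge(0.8)]
backup([edges[0]], -1.0)
print(select_edge(edges))
```

A larger `c_puct` weights the prior-driven exploration term more heavily relative to the mean reward.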

After simulations, MCTS then outputs a move selection probability for root vertex by

(22)

That is, the move selection probability is determined by the visit counts of the root vertex’s edges. Parameter is a temperature parameter as in AlphaGo Zero [13]. In an episode, we set (i.e., the move-selection probability is proportional to the visit count of each edge, yielding more exploration) for the first one third of the time steps and (deterministically choose the move that has the highest visit count) for the rest of the time steps.
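The temperature-controlled policy can be sketched as below, assuming (22) takes the AlphaGo Zero form in which each visit count is raised to the power of the inverse temperature and then normalized. A temperature of 1 gives probabilities proportional to the visit counts, and a temperature of 0 is treated here as the deterministic limit. The function name is hypothetical.

```python
def move_probabilities(visit_counts, temperature):
    """Move-selection probabilities from the root edges' visit counts."""
    if temperature == 0:
        # Deterministic limit: all probability mass on the most visited move.
        best = max(range(len(visit_counts)), key=lambda i: visit_counts[i])
        return [1.0 if i == best else 0.0 for i in range(len(visit_counts))]
    weights = [count ** (1.0 / temperature) for count in visit_counts]
    total = sum(weights)
    return [weight / total for weight in weights]


print(move_probabilities([1, 3], temperature=1))  # proportional to visit counts
print(move_probabilities([1, 3], temperature=0))  # greedy choice
```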

In the training iterations, when we play games to provide experiences for the DNN, Dirichlet noise with positive real parameters is added to the prior probabilities of the root vertex to guarantee additional exploration. These games are therefore called noisy games. Accordingly, there are also noiseless games, in which the Dirichlet noise is removed. Usually, we play noiseless games to evaluate the performance of AlphaSeq with a trained DNN.
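As a rough illustration of how such noise might be injected, the sketch below mixes the root priors with a sample from a symmetric Dirichlet distribution. The mixing weight `epsilon` and the convex-combination form are assumptions borrowed from AlphaGo Zero [13]; the text above does not specify them, and the function names are hypothetical. The Dirichlet sample is drawn by normalizing independent gamma variates, which needs only the standard library.

```python
import random

def sample_dirichlet(alphas, rng=random):
    """Draw one sample from a Dirichlet distribution via normalized gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def add_root_noise(priors, alpha, epsilon, rng=random):
    """Mix the root priors with symmetric Dirichlet noise (assumed mixing rule)."""
    noise = sample_dirichlet([alpha] * len(priors), rng)
    return [(1 - epsilon) * p + epsilon * n for p, n in zip(priors, noise)]
```

Setting `epsilon` to 0 recovers the noiseless priors, which corresponds to the noiseless games used for evaluation.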

A-B DNN

The DNN implemented in AlphaSeq is a deep convolutional network (ConvNet). This ConvNet consists of six convolutional layers together with batch normalization and rectifier nonlinearities, the details of which are shown in Fig. 10.


Fig. 10: The deep ConvNet implemented in AlphaSeq. This ConvNet consists of six convolutional layers together with batch normalization and rectifier nonlinearities.
  • Input – The ConvNet takes an image stack as input. For a state (i.e., a partially filled sequence-set pattern), we first transform it into an image, applying zero-padding where needed, and then perform feature extraction to transform this image into an image stack.

  • Feature extraction – Feature extraction transforms an image into an image stack comprising three binary feature planes, constructed as follows. The first plane indicates the presence of ‘1’ in the image: an entry of the plane is 1 if the corresponding intersection has value ‘1’ in the image, and 0 elsewhere. The second plane indicates the presence of ‘-1’ in the image: an entry is 1 if the corresponding intersection has value ‘-1’ in the image, and 0 elsewhere. The third plane indicates the presence of ‘0’ in the image: an entry is 1 if the corresponding intersection has value ‘0’ in the image, and 0 elsewhere.

  • Output – For each state, the DNN outputs a policy estimation (i.e., a probability distribution over the possible moves), which serves as the prior probability for the edges of that state, and a scalar estimation of the expected reward of that state.

  • Training – Every games, we use the experiences accumulated in the most recent games (i.e., experiences) to update the DNN by stochastic gradient descent. The mini-batch size is set to , and we randomly sample mini-batches without replacement from the