Introduction
Chaos can be defined as random oscillations generated by deterministic dynamics [lorenz1963deterministic]. Chaotic time series are highly sensitive to initial conditions, which renders long-term predictions unachievable unless infinite observation accuracy is attained at the outset [fischer2000synchronization]. The close relationship between lasers and chaos has been known for a long time; the output of a laser exhibits chaotic oscillations when time-delayed optical feedback is injected back into the laser cavity [uchida2012optical]. Laser chaos exhibits ultrafast dynamics beyond the GHz regime; hence, various engineering applications have been examined in the literature. Examples range from optical secure communication [fischer2000synchronization] and fast physical random bit generation [uchida2008fast] to secure key distribution using correlated randomness [yoshimura2012secure].
The present study relates to the application of laser chaos to the multi-armed bandit (MAB) problem [naruse2018scalable]. Reinforcement learning (RL), a branch of machine learning along with supervised and unsupervised learning, studies optimal decision-making rules. It differs from other machine learning tasks (e.g. image recognition) in that the notion of reward comes into play. The goal of RL is to construct decision-making rules that maximize the obtained rewards; hence, gaming AI is a well-known application of RL [littman2015reinforcement]. In 2015, AlphaGo, developed by Google DeepMind, defeated a human professional Go player for the first time [deepmind].
The MAB is a sequential decision problem of maximizing total rewards where there are multiple arms, or selections, whose reward probabilities are unknown. The MAB is one of the simplest problems in RL. In an MAB, a player receives reward information only for the arm selected at each time step and cannot obtain reward information for a non-selected arm. The MAB therefore exhibits a trade-off between exploration and exploitation. Sufficient exploration is necessary to estimate the best arm accurately, but it entails selecting low-reward arms, so excessive exploration can lead to significant losses. Conversely, to maximize rewards, one needs to keep choosing the best arm (exploitation). However, if the search for the best arm has failed, a non-best option is likely to be chosen by mistake. Therefore, it is important to balance exploration and exploitation.
An algorithm for the MAB using laser chaos time series was proposed in 2018 [naruse2018scalable]. This algorithm sets two goals: to maximize the total rewards and to identify the best arm. However, in real-world applications, maximizing the rewards and finding the optimal arm may not be enough to solve a problem. For example, one study improves communication throughput by treating channel selection in wireless communications as an MAB [shungo2020dynamic]. When there are multiple channel users, not all users can use the best channel simultaneously; accordingly, there may be situations where compromises must be made, i.e., other channels must be selected. In such situations, information on the performance ranking of the channels is useful for choosing among the non-best channels.
Conversely, when there are no other users, a player (the single user) can simultaneously utilize the top-ranking options to increase the communication capacity, similar to channel bonding in local area networks [bejarano2013ieee]. The purpose of this study is to accurately recognize the order of the expected rewards of different arms using a chaotic laser time series and to minimize the reduction of accumulated rewards caused by overly detailed exploration.
Principles
Definition and Assumption
We consider an MAB problem in which a player selects one of N slot machines, where N = 2^M and M is a natural number. The slot machines are distinguished by identities numbered from 0 to N - 1, which are also represented in M-bit binary code. For example, in an eight-armed bandit (M = 3), the slot machines are numbered from 000 to 111. In this study, we assume that no two arms have the same expected reward, and we define the k-th max and k-th argmax operators, which return the k-th largest value and the index that attains it, respectively. The variables used in the study are defined as described below:

: Obtained reward from arm at time step (independent at each time step. is observed value.)

. (Consistent regardless of time step)


: Number of selections of arm by the end of time step ( is observed value).

: Arm selected at time step ( is the observed value).

: th best arm.
We estimate the arm order of reward expectations by calculating the sample mean of the accumulated reward at each time step. Specifically, the sample mean of the rewards obtained from each arm up to the current time step is calculated as follows:
(1) 
At each time step, we estimate the arm with the k-th largest sample mean to be the k-th best arm.
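The ordering step described above can be sketched in a few lines. The following Python fragment is a minimal illustration, not the authors' implementation: the function name `estimate_arm_order` and the convention of assigning a sample mean of 0 to an arm that has never been played are assumptions made for this example.

```python
def estimate_arm_order(reward_sums, counts):
    """Return (order, means): arm indices sorted by decreasing sample mean.

    reward_sums[i] is the accumulated reward of arm i and counts[i] is the
    number of times arm i has been selected so far.
    Assumption: an arm that has never been selected gets a sample mean of 0.
    """
    means = [s / n if n > 0 else 0.0 for s, n in zip(reward_sums, counts)]
    # sorted() is stable, so arms with equal means keep their index order.
    order = sorted(range(len(means)), key=lambda i: means[i], reverse=True)
    return order, means


# Four arms, each played 10 times, with 3, 9, 5 and 1 rewards respectively.
order, means = estimate_arm_order([3, 9, 5, 1], [10, 10, 10, 10])
print(order)  # [1, 2, 0, 3]: arm 1 is estimated best, arm 3 worst
```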
Time-division multiplexing of laser chaos
The proposed method is based on the MAB algorithm reported in 2018 [naruse2018scalable]. This method consists of the following steps: [STEP 1] decision making for each bit of the slot machines, [STEP 2] playing the selected slot machine, and [STEP 3] updating the threshold values.
[STEP 1] Decision for each bit of the slot machine
First, the chaotic signal measured at the first sampling time is compared to the threshold value for the most significant bit, and the bit is assigned 0 or 1 according to whether the signal exceeds the threshold. To determine each subsequent bit, the chaotic signal measured at the next sampling time is compared to the threshold value associated with the bits already decided, and the bit is assigned in the same way. After this process, the slot machine whose number is represented by the resulting binary code is selected.
[STEP 2] Slot machine play
Play the selected slot machine.
[STEP 3] Threshold values adjustment
If the selected slot machine yields a reward, then the threshold values are adjusted so that the same decision becomes more likely. For example, if a bit is assigned a given value and the player gets a reward, then the threshold used for that bit should be shifted in the direction that increases the likelihood of obtaining the same value again. All of the other threshold values involved in determining the decision are updated in the same manner.
If the selected slot machine does not yield a reward, then the threshold values are adjusted so that the same decision becomes less likely. For example, if a bit is assigned a given value and the player does not get a reward, then the threshold used for that bit should be shifted in the direction that decreases the likelihood of obtaining that value. Again, all of the other threshold values involved in determining the decision are updated in the same manner.
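The three steps above can be sketched as follows. This is only an illustrative sketch under stated assumptions: a bit is taken to be 0 when the signal is less than or equal to the threshold and 1 otherwise, thresholds are stored in a dictionary keyed by the bits already decided, and each threshold is moved by a fixed hypothetical step `delta`. The actual update rule in [naruse2018scalable] may differ in its details.

```python
def select_arm(signals, thresholds, n_bits):
    """STEP 1: decide the arm bit by bit, most significant bit first.

    signals[k] is the chaotic signal sample used for bit k.
    thresholds maps the bits decided so far (a string) to a threshold value;
    a missing threshold starts at 0 (assumption).
    Assumed convention: signal <= threshold gives bit 0, otherwise bit 1.
    """
    bits = ""
    for k in range(n_bits):
        th = thresholds.setdefault(bits, 0.0)
        bits += "0" if signals[k] <= th else "1"
    return bits


def update_thresholds(thresholds, bits, rewarded, delta=0.1):
    """STEP 3: shift every threshold that took part in the decision.

    delta is a hypothetical fixed step size.
    """
    for k, b in enumerate(bits):
        prefix = bits[:k]
        # Raising a threshold makes "signal <= threshold" (bit 0) more likely.
        favour_same = 1.0 if b == "0" else -1.0
        direction = favour_same if rewarded else -favour_same
        thresholds[prefix] += direction * delta


thresholds = {}
bits = select_arm([-0.2, 0.3, 0.0], thresholds, n_bits=3)  # "010"
arm = int(bits, 2)                                         # arm number 2
# STEP 2 (playing the machine) is replaced here by a fixed outcome.
update_thresholds(thresholds, bits, rewarded=True)
```

After the rewarded play, the thresholds along the path `""`, `"0"` and `"01"` have each moved so that the bits 0, 1 and 0 become more likely at the next decision.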
Arm order recognition algorithm with confidence intervals
Confidence intervals.
For each threshold value (, ) and , the following values and are calculated:
(2)  
(3) 
For each threshold, two subsets of machine arms are defined. A machine is included in the first subset if it can be selected when the signal is more than the threshold; otherwise, it is not included. In the same way, a machine is included in the second subset if it can be selected when the signal is less than or equal to the threshold; otherwise, it is not included. An example for an eight-armed bandit problem is shown in Fig. 1b.
For each subset, the first quantity is the sample mean of the rewards obtained from the machines in that subset, and the second is the confidence interval width of this estimate. The narrower the confidence interval, the higher the estimation accuracy. The exploration-degree parameter controls this width: a higher value means that more exploration is needed to reach a given confidence interval width.
Coarseness/fineness of exploration adjustments by confidence intervals.
At each threshold, if the confidence intervals of the two subsets overlap, we suppose that the order relationship between the two sample means may still change; that is, their order is not yet known. Therefore, exploration should be executed more carefully: the threshold value should be kept closer to 0, which is the balanced situation, or further exploration should be performed, so that the threshold adjustment becomes finer. Conversely, if the two intervals do not overlap, we suppose that a wrong estimate of the order relationship is unlikely. Hence, exploration should continue more coarsely so that the threshold adjustment is accelerated (Fig. 1c).
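The overlap test described above can be illustrated with a short sketch. The expression used for the confidence interval width is an assumption: a Hoeffding-style width scaled by a hypothetical exploration-degree parameter `gamma` stands in for the expression given in the paper. The scale factors `fine` and `coarse` are likewise placeholders.

```python
import math


def ci_width(n_selected, t, gamma=1.0):
    """Assumed Hoeffding-style confidence interval width.

    n_selected: number of plays of the machines in the subset.
    t: current time step (t >= 1).
    gamma: hypothetical exploration-degree parameter; larger values widen
    the interval, so more plays are needed to reach a given width.
    """
    if n_selected == 0:
        return math.inf
    return math.sqrt(gamma * math.log(t) / (2 * n_selected))


def intervals_overlap(mean_a, width_a, mean_b, width_b):
    """True if [mean_a +/- width_a] and [mean_b +/- width_b] intersect."""
    return (mean_a - width_a <= mean_b + width_b
            and mean_b - width_b <= mean_a + width_a)


def adjustment_scale(overlap, fine=0.1, coarse=1.0):
    """Finer threshold adjustment while the order is still uncertain."""
    return fine if overlap else coarse


t = 1000
early = intervals_overlap(0.7, ci_width(50, t), 0.3, ci_width(50, t))
late = intervals_overlap(0.7, ci_width(500, t), 0.3, ci_width(500, t))
print(early, adjustment_scale(early))  # True 0.1
print(late, adjustment_scale(late))    # False 1.0
```

With 50 plays per subset the intervals still overlap and the adjustment stays fine; with 500 plays they separate and the adjustment becomes coarse.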
Results
Experimental settings.
We evaluated the performance of the methods for two cases: a four-armed bandit and an eight-armed bandit. First, the reward of each arm is assumed to follow a Bernoulli distribution:
. Each reward environment is set to satisfy the following conditions: (i) , (ii) . In this experiment, a variety of assignments of reward probabilities satisfying the above conditions were prepared, and the performance was evaluated under every reward environment . We have defined the reward, regret, and correct order rate (COR) as metrics to quantitatively evaluate the performance of the method.(4)  
(5)  
(6) 
Here, the quantities entering these definitions are the number of time steps, the number of selections of each arm up to a given time step, and the number of measurements in one reward environment. For the accuracy of arm order recognition, we considered the estimation accuracy of the top four arms regardless of the total number of arms. We prepared all 144 reward environments (all combinations satisfying the above conditions) for the four-armed bandit problems and 100 randomly selected reward environments for the eight-armed bandit problems. The performances of four methods were compared: RoundRobin (the arms are selected in turn, one per time step), UCB1 (a method for maximizing the total rewards proposed in 2002 [auer2002finite]), Chaos (the previous method using the laser chaos time series [naruse2018scalable], which only finds the best arm and does not recognize the order), and ChaosCI (the proposed method using the laser chaos time series together with confidence intervals).
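As a rough illustration of the regret and COR metrics named above, the following sketch uses the standard expected-regret definition and counts a measurement as correct when the estimated order of the top four arms matches the true order. Both definitions are assumptions made for this example and may differ in detail from the equations of the paper; the function names are hypothetical.

```python
def regret(probs, counts):
    """Expected regret: reward lost relative to always playing the best arm.

    probs[i] is the reward probability of arm i and counts[i] the number of
    times arm i was selected.
    """
    total_steps = sum(counts)
    return total_steps * max(probs) - sum(p * n for p, n in zip(probs, counts))


def correct_order_rate(true_probs, estimated_orders, top_k=4):
    """Fraction of measurements whose estimated top-k order is correct.

    estimated_orders holds one estimated arm order (best first) per
    measurement.
    """
    true_order = sorted(range(len(true_probs)),
                        key=lambda i: true_probs[i], reverse=True)[:top_k]
    hits = sum(1 for est in estimated_orders
               if list(est[:top_k]) == true_order)
    return hits / len(estimated_orders)


probs = [0.9, 0.7, 0.5, 0.3]
print(regret(probs, [70, 15, 10, 5]))  # approximately 10
cor = correct_order_rate(probs, [[0, 1, 2, 3], [0, 2, 1, 3],
                                 [0, 1, 2, 3], [0, 1, 2, 3]])
print(cor)  # 0.75
```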
Evaluation under one reward environment.
The curves in Figs. 2a and b show the time evolutions of the evaluation metrics over repeated measurements under specific reward environments. Columns (i) and (ii) pertain to four-armed bandit problems, whereas columns (iii) and (iv) depict eight-armed bandit problems. The curves are colour coded by method for easy comparison. In arm order recognition, ChaosCI and RoundRobin achieved high accuracy at early time steps. In terms of total reward, Chaos and UCB1 achieved the greatest rewards.
Evaluation of the whole reward environments.
Figure 3a summarizes the relationship between total rewards and order estimation accuracy: one axis represents the normalized reward, whereas the other represents the COR. Here, a normalized reward is defined as follows:
Each point in the graph indicates the normalized reward and the COR at a given time step under one reward environment.
Figure 3b shows the time evolution of the average value of each metric over the whole ensemble of reward environments.
Discussion
Difficulty of maximizing rewards and arm order recognition.
The results of the numerical simulations on the four-armed and eight-armed bandit problems show similar trends: there is a trade-off between maximizing the total rewards and recognizing the arm order. As RoundRobin selects all arms equally, it attains a perfect COR for any given reward environment once enough time steps have elapsed. However, it cannot maximize rewards because its regret increases linearly with time. In contrast, Chaos achieved normalized rewards of almost unity in many reward environments. However, its arm order recognition accuracy is inferior because the arm selection is heavily biased towards the best arm. In terms of the COR, RoundRobin and ChaosCI (proposed method) quickly converged to unity. In terms of the total rewards, Chaos (previous method) and UCB1 exploit more aggressively and thus obtain greater rewards. The proposed method, ChaosCI, combines rapid arm order recognition with a regret slope lower than that of RoundRobin.
Number of arm selections.
Figures 4a, b, and c show the time evolutions of the number of selections of each arm under UCB1, Chaos, and ChaosCI, respectively (RoundRobin leads to an equal number of selections for all arms at any time). Here, we examine two reward environments in an eight-armed bandit, corresponding to the left and right columns of Fig. 4.
This figure shows that, in UCB1, the selection number of the best arm increases linearly with time, whereas that of each of the other arms increases almost logarithmically. Through this evolution, UCB1 can achieve a logarithmic regret, but the convergence of its arm order recognition is slow. In the proposed ChaosCI, the selection number of every arm evolves in a linear order. Therefore, the arm order recognition accuracy converges faster than with UCB1. Although selecting non-top arms in a linear order causes the regret to increase linearly, the slope of this regret is significantly smaller than that of RoundRobin because better arms are selected more often.
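For reference, the UCB1 baseline [auer2002finite] discussed above selects, at each step, the arm that maximizes its sample mean plus the bonus sqrt(2 ln t / n_i), where n_i is the number of plays of arm i. The sketch below is a minimal simulation of that rule. Python's standard `random` module is used as a stand-in reward source, and the reward probabilities and the seed are arbitrary choices for illustration.

```python
import math
import random


def ucb1(probs, n_steps, seed=0):
    """Run UCB1 on Bernoulli arms and return the selection count per arm."""
    rng = random.Random(seed)
    n_arms = len(probs)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, n_steps + 1):
        if t <= n_arms:
            arm = t - 1  # play every arm once before using the index
        else:
            arm = max(
                range(n_arms),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts


counts = ucb1([0.9, 0.5, 0.3, 0.1], n_steps=5000)
print(counts)  # the best arm (index 0) receives the large majority of plays
```

Because the non-best arms are played only rarely, their sample means are based on few observations, which is consistent with the slow order recognition noted above.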
Environment dependency.
As shown in Figs. 3 and 4, the performance of Chaos differs greatly between the two reward environments examined. This finding is clearly linked with the number of arm selections. In one reward environment, the selection numbers of all arms evolve in a linear order, whereas in the other, a selection number is only approximately 100 at the time step examined. Thus, the performance of Chaos heavily depends on the given reward environment. Table 1 summarizes the sample variance of the metrics over 100 reward environments in an eight-armed bandit. As shown in the table, ChaosCI is less dependent on the reward environment and achieves more stable and higher accuracy than UCB1 and Chaos. In terms of obtained rewards, ChaosCI has a larger variance than UCB1 and Chaos but is more stable than RoundRobin.
Method        COR       Reward
RoundRobin    0         0.8852
UCB1          0.0026    0.0116
Chaos         0.0413    0.3073
ChaosCI       0.0004    0.6140
Conclusions
In this study, we have examined ultrafast decision making with laser chaos time series in reinforcement learning (specifically, the MAB) and set the goal of recognizing the arm order of reward expectations by extending the previous method, that is, time-division multiplexing of laser chaos recordings. In the proposed method, we have introduced exploration-degree adjustments based on confidence intervals of the estimated rewards. The results of the numerical simulations based on experimental time series show that the selection number of each arm increases linearly, leading to high and rapid order recognition accuracy. Furthermore, arms with higher reward expectations are selected more frequently; hence, the slope of the regret is reduced, although the selection number of every arm still increases linearly. Compared with UCB1 and Chaos, ChaosCI (proposed method) is less dependent on the reward environment, indicating its potential robustness to environmental changes. In other words, ChaosCI can make more accurate and stable estimates of the arm order. Such order recognition is useful in applications such as channel selection and resource allocation in information and communications technology, where compromise actions or intelligent arbitration are expected.
Methods
Optical system
The device used was a distributed feedback semiconductor laser mounted on a butterfly package with optical fibre pigtails (NTT Electronics, KELD1C5GAAA). The injection current of the semiconductor laser was set to 58.5 mA, which is 5.37 times the lasing threshold of 10.9 mA. The relaxation oscillation frequency of the laser was 6.5 GHz, and its temperature was maintained at 294.83 K. The optical output power was 13.2 mW. The laser was connected to a variable fibre reflector through a fibre coupler, where a fraction of the light was reflected back to the laser, generating high-frequency chaotic oscillations of the optical intensity [soriano2013relation, ohtsubo2012semiconductor, uchida2012optical]. The length of the fibre between the laser and the reflector was 4.55 m, corresponding to a feedback delay time (round trip) of 43.8 ns. Polarization-maintaining fibres were used for all of the optical fibre components. The optical signal was detected by a photodetector (New Focus, 1474A, 38 GHz bandwidth) and sampled using a digital oscilloscope (Tektronix, DPO73304D, 33 GHz bandwidth, 100 GSample/s, eight-bit vertical resolution). The RF spectrum of the laser was measured by an RF spectrum analyzer (Agilent, N9010A544, 44 GHz bandwidth).
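The two derived figures quoted above can be checked with a short calculation. The sketch assumes a group refractive index of about 1.444 for silica fibre, which is not stated in the text; the remaining values are taken from the description above.

```python
SPEED_OF_LIGHT = 299_792_458.0  # m/s, in vacuum
FIBRE_INDEX = 1.444             # assumed refractive index of silica fibre

fibre_length_m = 4.55           # laser-to-reflector fibre length
injection_current_ma = 58.5
threshold_current_ma = 10.9

# Round-trip feedback delay: the light travels the fibre twice.
delay_ns = 2 * fibre_length_m * FIBRE_INDEX / SPEED_OF_LIGHT * 1e9

# Injection current normalized by the lasing threshold.
current_ratio = injection_current_ma / threshold_current_ma

print(f"round-trip delay: {delay_ns:.1f} ns")    # close to the quoted 43.8 ns
print(f"current / threshold: {current_ratio:.2f}")  # 5.37
```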
Details of the time-division multiplexing of laser chaos
Convergence of Algorithm 1.
For simplicity, we assume that
and the time series used for comparison with thresholds follows a uniform distribution of
at an arbitrary time. We define the value of threshold at the beginning of time step as . The time evolution of can be represented as(7)  
The expectation of is represented as follows.
(8) 
Because we assume that follows a uniform distribution, if ,
(9) 
Equation (9) can lead to
(10) 
where
Equation (10) indicates that and converge to a certain value in if and . In this case, the number of selections for each arm linearly increases. Furthermore, if or , then convergence or divergence occurs at , which leads to or . In this case, one of the arms will be selected intensively as time passes.
The above discussion shows that the convergence and performance of Algorithm 1 depend on learning rate , exploration degree , and reward environment .
Details of the proposed method
Convergence of the proposed method.
In the previous subsection, we found that the performance of Algorithm 1 depends heavily on its parameters. Therefore, in the proposed method, exploration-degree adjustments based on confidence intervals are added to Algorithm 1. If exploration is not yet sufficient, the thresholds are set close to 0 and the threshold adjustment becomes finer, so the thresholds are less likely to diverge, which leads to improved accuracy. If exploration has been applied sufficiently, the threshold adjustment becomes coarser, so the thresholds are more likely to diverge, which leads to an intensive selection of a better arm and a slow increase of regret.
Data availability
The datasets generated during the current study are available from the corresponding author on reasonable request.
References
Acknowledgements
This work was supported in part by the CREST project (JPMJCR17N2) funded by the Japan Science and Technology Agency and Grants-in-Aid for Scientific Research (A) (JP17H01277) funded by the Japan Society for the Promotion of Science. The authors acknowledge Atsushi Uchida and Kazutaka Kanno for the measurements of laser chaos time series and Satoshi Sunada and Hirokazu Hori for their valuable discussions about chaos and order recognition.
Author contributions
M.N. directed the project and N.N. designed the order recognition algorithm and conducted signal processing. N.N., N.C., M.H., and M.N. analyzed the data and N.N. and M.N. wrote the paper.
Competing interests
The authors declare no competing interests.
Additional information
Correspondence and requests for materials should be addressed to N.N. and M.N.