1. Introduction
Researchers have started applying machine learning (ML) algorithms to optimize the runtime performance of computer systems [1]. Networks-on-chip (NoCs) form the communication backbone of manycore systems; learning traffic behavior and optimizing the latency and bandwidth characteristics of the NoC in response to runtime changes is a promising candidate for applying ML. This work explores the opportunities that reinforcement learning (RL) techniques [2] provide for learning optimal routing algorithms under varying traffic within a NoC.
RL techniques learn an optimal policy through continuous interaction with an environment. They have demonstrated promising results in robotics [3], playing Atari games, and computer network traffic control [4]. In this work, we study how classical RL algorithms perform for NoC routing and develop a framework for applying these RL algorithms to NoCs. We further present an extended OpenAI Gym package for studying RL-based routing control in NoC simulations based on gem5 [5]. Our results show that the RL agents are able to learn and pick the optimal routing algorithm for a given traffic pattern, maximizing a customized network objective such as routing throughput.
2. RL-based Routing Optimization
Overview. We develop a framework that uses RL to optimize NoC routing decisions. As shown in Fig. 1, in the NoC environment our RL agent keeps a record of the current network state and its corresponding reward (throughput), and then suggests the action (a choice of routing algorithm) with the highest expected reward, based on the learned information.
Target Task. The goal of our RL agent is to learn an optimal routing algorithm that maximizes throughput for the current application.
Defining a Utility Function for NoCs. RL works by optimizing actions against a utility (reward) function. It treats the problem as a Markov process: given the current state and the learned information, we can decide a future action that optimizes the reward. In our NoC use case, we define the utility objective \(u_t\) as the network throughput:

(1) \( u_t = \frac{\text{received flits}}{N \cdot T} \), where \(N\) is the number of nodes and \(T\) the elapsed simulation cycles.
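As a minimal sketch, assuming throughput is measured as received flits per node per cycle from the simulator's statistics (the function and argument names here are illustrative, not the actual package API):

```python
def throughput_reward(received_flits: int, num_nodes: int, cycles: int) -> float:
    """Throughput utility from Eq. (1): received flits per node per cycle."""
    return received_flits / (num_nodes * cycles)

# e.g., a 64-node mesh simulated for 10,000 cycles receiving 192,000 flits
reward = throughput_reward(192_000, 64, 10_000)
assert abs(reward - 0.3) < 1e-9
```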
Proposed RL Framework.
As a central motivation in RL, value-function approaches attempt to find a policy that maximizes the return by maintaining a set of estimates of the expected reward. We use the utility function designed in Equation 1. The RL agent's action selection (Fig. 1) is modeled as a policy \(\pi\):

(2) \( a_t = \pi(s_t) \)
where the return is calculated as:

(3) \( R_t = \sum_{k=0}^{\infty} \gamma^{k}\, u_{t+k+1} \)
Here \(u_t\) is the temporal utility measured at time \(t\), and \(\gamma\) is the discount factor of the Markov process. The action-value function of an optimal policy is called the optimal action-value function \(Q^{*}\), which attains the maximum expectation of \(R_t\):

(4) \( Q^{*}(s, a) = \max_{\pi} \mathbb{E}\left[ R_t \mid s_t = s,\ a_t = a,\ \pi \right] \)
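As an illustration of the discounted return in Equation (3), a minimal sketch that truncates the infinite sum to the utilities observed within one episode:

```python
def discounted_return(utilities, gamma=0.9):
    """R_t = sum_k gamma^k * u_{t+k+1}, truncated to the observed episode."""
    return sum((gamma ** k) * u for k, u in enumerate(utilities))

# three steps of constant utility 1.0 with gamma = 0.5: 1 + 0.5 + 0.25
assert discounted_return([1.0, 1.0, 1.0], gamma=0.5) == 1.75
```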
RL Algorithms. We consider three temporal-difference approaches: Q-learning, SARSA, and Expected SARSA. We do not use deep reinforcement learning (DRL) methods because of the high runtime memory consumption that DRL exhibits in previous studies [2], which makes it prohibitive in this setting.
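A hedged sketch of the three tabular temporal-difference updates (not the authors' exact implementation; hyperparameter values are placeholders):

```python
import numpy as np

def td_update(Q, s, a, r, s2, a2=None, alpha=0.1, gamma=0.9,
              eps=0.1, method="qlearning"):
    """One tabular TD update of Q[s, a] toward r + gamma * (bootstrap term).

    qlearning      : bootstraps with max_a' Q[s', a']
    sarsa          : bootstraps with Q[s', a'] for the action actually taken
    expected_sarsa : bootstraps with the eps-greedy expectation over Q[s', :]
    """
    if method == "qlearning":
        target = r + gamma * np.max(Q[s2])
    elif method == "sarsa":
        target = r + gamma * Q[s2, a2]
    else:  # expected_sarsa
        n = Q.shape[1]
        probs = np.full(n, eps / n)          # exploration mass, spread evenly
        probs[np.argmax(Q[s2])] += 1.0 - eps  # greedy action gets the rest
        target = r + gamma * (probs @ Q[s2])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```

The only difference between the three methods is the bootstrap term, which is why they can share one action-value table and one training loop.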
3. Experimental Methodology
Extending OpenAI Gym for Interconnection Routing.
OpenAI Gym [3] is a toolkit for developing and benchmarking RL algorithms. We extend it to provide a first scalable environment for fast prototyping of new RL-integrated NoCs. Our proposed environment includes:

– gem5 statistics, including the injected flits, received flits, and average latency

– a set of standard routing algorithms (e.g., XY, oblivious north-last, adaptive north-last, random-adaptive) for RL agents to choose from

– customized network objectives (e.g., latency and throughput) for a selected NoC topology

– a Boolean flag for thresholding at the desired reward
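Put together, these components can be sketched as a minimal Gym-style environment. This is an illustrative mock, not the actual package: the class name, statistics, and the stubbed simulator call are assumptions, and a real implementation would invoke gem5/Garnet2.0 instead of the placeholder.

```python
import random

class NoCRoutingEnv:
    """Illustrative Gym-style NoC environment: an action selects a routing
    algorithm, and the reward is throughput computed from simulator stats."""

    ROUTING_ALGOS = ["xy", "oblivious_north_last",
                     "adaptive_north_last", "random_adaptive"]

    def __init__(self, reward_threshold=0.3):
        self.reward_threshold = reward_threshold  # Boolean thresholding flag
        self.state = 0

    def _run_simulation(self, algo):
        # Placeholder for a gem5/Garnet2.0 run under the chosen routing
        # algorithm; returns (received_flits, num_nodes, cycles).
        return random.randint(100_000, 200_000), 64, 10_000

    def step(self, action):
        received, nodes, cycles = self._run_simulation(self.ROUTING_ALGOS[action])
        reward = received / (nodes * cycles)     # throughput objective
        done = reward >= self.reward_threshold   # desired reward reached?
        self.state += 1                          # advance to next traffic phase
        return self.state, reward, done, {}

    def reset(self):
        self.state = 0
        return self.state
```

Keeping the simulator behind a single method makes it easy to swap the stub for real gem5 statistics without touching the agent-facing `step`/`reset` interface.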
Case Study 1: NoCs with Incremental Injection Rate.
In this scenario, we use Garnet2.0 [6], a detailed NoC network model inside gem5 [5]. Our target topology is an 8×8 mesh. We start packet injection at a low rate and then increase the rate over time. Our goal is to optimize performance by choosing the optimal routing algorithm from the action space at each transition of the environment state. We set the action space in both case studies to four choices: random routing, XY routing, oblivious north-last, and adaptive north-last (which uses the number of free virtual channels at the next router as a proxy for choosing the output port).
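With four routing algorithms as the action space, the agent's exploration/exploitation trade-off during training can be sketched as an ε-greedy selection rule (a standard choice we assume here; the paper does not state its exact exploration scheme):

```python
import numpy as np

def epsilon_greedy(Q, state, eps=0.1, rng=None):
    """Select a routing-algorithm index: with probability eps explore a
    random action, otherwise exploit the best-known action for this state."""
    rng = rng or np.random.default_rng()
    n_actions = Q.shape[1]  # four routing algorithms in our action space
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))
```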
Case Study 2: NoCs with Dynamic Traffic Patterns.
In this scenario, we simulate the workload of a data center network.
For example, in a Google data center, the primary application could change from mail service to video traffic at different times of day. We simulate this scenario by switching from one network traffic pattern to another. We use seven different synthetic traffic patterns provided by Garnet2.0 in the experiments, e.g., uniform random, transpose, and bit-reverse traffic, as shown in Fig. 2 (d). Our environment is then the NoC with continuously changing network traffic.
We apply RL to optimize the routing algorithm decision at each state transition.
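Because tabular TD methods need discrete states, one simple encoding is to map each active traffic pattern to a state index. The pattern names below are illustrative stand-ins for Garnet2.0's synthetic traffics, not the exact gem5 identifiers:

```python
# Illustrative mapping of Garnet2.0-style synthetic traffic patterns to
# discrete RL state indices (names are assumptions, not the exact gem5 set).
TRAFFIC_PATTERNS = ["uniform_random", "tornado", "transpose", "bit_reverse",
                    "bit_rotation", "shuffle", "bit_complement"]

def traffic_to_state(pattern: str) -> int:
    """Encode the currently active traffic pattern as a state index."""
    return TRAFFIC_PATTERNS.index(pattern)
```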
4. Evaluation
In Case Study 1, the reward is defined as throughput. The reward feedback in each episode is shown in Fig. 2 (a). We observe that the rewards of all three RL algorithms converge and remain stable. We examine the learned models by testing their throughput over one episode spanning the entire range of injection-rate states. We compare the throughput of NoCs guided by our RL agents at different injection rates with that of fixed baseline routing algorithms (e.g., random routing, XY routing), as shown in Fig. 2 (c). For example, the throughput of random routing saturates beyond a certain injection rate because of deadlock. Oblivious north-last avoids deadlock and saturates at a higher throughput. As for our RL methods, SARSA chooses adaptive north-last at lower injection rates and oblivious north-last at the highest rate, whereas Q-learning always makes the optimal choice out of the four routing algorithms.
In Case Study 2, we have different traffic patterns (e.g., uniform random, tornado) as our states. We show the result under a fixed injection rate in Fig. 2 (b); the results under other injection rates all converge and follow the same trend. The throughput over an entire state transition is shown in Fig. 2 (d). We observe that all three RL methods deliver near-optimal choices across all states. Through these experiments, we show that our method can serve as a decision agent for a data center facing various workloads.
5. Conclusion
We develop and demonstrate a framework that applies RL as a continually learning agent configuring the routing-algorithm decision in a NoC. We concretely show the effectiveness of value-based RL methods on NoC problems. We hope this work will inspire future extensions that bring more RL algorithms to a wide range of NoC problems for the computer networks community.
References
 [1] Azalia Mirhoseini et al. Device placement optimization with reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2430–2439, 2017.
 [2] Richard S Sutton et al. Reinforcement learning: An introduction. MIT press, 1998.
 [3] Greg Brockman et al. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
 [4] Yiming Kong et al. Improving TCP congestion control with machine intelligence. In ACM NetAI, pages 60–66, 2018.
 [5] Nathan Binkert et al. The gem5 simulator. ACM SIGARCH Computer Architecture News, 39(2):1–7, 2011.
 [6] Niket Agarwal et al. Garnet: A detailed on-chip network model inside a full-system simulator. In 2009 IEEE International Symposium on Performance Analysis of Systems and Software, pages 33–42. IEEE, 2009.