I. Introduction
Active information acquisition is a challenging problem with diverse applications, including many robotics tasks [14, 15, 16, 17]. It is a sequential decision making problem in which an agent is tasked with acquiring information about a certain process of interest (the target). The objective function for such problems typically takes an information-theoretic form, such as mutual information or Shannon entropy. The theory of optimal experiment design also studies cost functions based on the trace, determinant, or eigenvalues of the information matrix that describes the current information state. A major challenge in active information acquisition problems is the computation of cost functions, such as mutual information, that are difficult to evaluate for arbitrary probability distributions. As a result, many approaches can evaluate only short planning horizons, or take greedy actions that are susceptible to local minima [1].

In this paper, we seek to solve the active information acquisition problem using Reinforcement Learning (RL) methods. In RL, a learning agent seeks an optimal or near-optimal behavior through trial-and-error interaction with a dynamic environment [10]. Because it does not require a knowledgeable external supervisor, RL has been applied to many interesting sequential decision making problems, including numerous applications in robotics [11, 12, 13].
The recent successes of RL with deep neural networks have enabled a number of existing RL algorithms to be applied in complex environments, such as Atari games and robotic manipulation [3, 4, 29].

There are several advantages to applying RL to active information acquisition problems. One benefit is that we can avoid over-dependence on system models: the dynamics of the target process can be estimated by various state estimation methods, such as particle filters [2] and learning-based approaches for highly nonlinear systems [25, 18, 19]. This dramatically extends the range of available problem domains. Another benefit is that an RL policy maximizes a discounted sum of future rewards, and thus it is able to handle infinite planning horizons. In contrast, existing planning algorithms require prior knowledge of target models and often need to approximate a cost function online, which requires additional assumptions to make the computation tractable [1]. RL-based approaches demand an extended training stage, but produce policies that are efficient to execute online, especially as compared to long-horizon planning methods. This is particularly important for running a robotic system in real time.

Related Work. A number of methods for solving information acquisition problems for dynamic targets have been studied under various constraints. The solutions vary in the number of robots, the length of the planning horizon, and the probability distribution being tracked. Efficient search-based planning methods have been applied to models assumed to be linear and Gaussian [1, 24], and sampling-based algorithms have been used for more complex problems and longer horizons [28]. Prior work in data-driven information acquisition uses imitation learning via a clairvoyant oracle [9, 26]. These are supervised learning methods that train a policy to mimic provided expert trajectories, and thus require a large, labeled dataset. Such methods can be more sample efficient than RL, but they are limited to problems with access to labeled datasets.
Active object tracking is one of the common tasks in active information acquisition. Following the substantial achievements of deep learning in computer vision, many deep learning methods for computer vision have been studied in this area [21, 23]. End-to-end solutions for active object tracking have been introduced using ConvNet-LSTM architectures and RL algorithms [20, 22]. Their reward functions are designed specifically for the object tracking problem, for example, to reduce the distance between an object and the learning agent. However, maximizing the information obtained about a target does not always require the agent to closely follow the target, especially in the case of multiple targets or if a target is highly dynamic. Moreover, the end-to-end approach requires a large number of training samples.

Contributions. We highlight the following contributions of this paper:

We propose a general RL framework for the active information acquisition problem by formulating it as a Markov Decision Process.

We apply the framework to an active target tracking application and use existing Q-network-based deep RL algorithms to learn an optimal policy.

We compare simulation results in various environments with a search-based information acquisition approach in a target tracking scenario. The results demonstrate that the Q-network-based deep RL algorithms are able to outperform the existing method, while making far fewer assumptions about the underlying problem.
II. Background
II-A. Active Information Acquisition
Suppose that a robot carrying a sensor follows a discrete-time dynamic model:

$x_{t+1} = f(x_t, u_t)$   (1)
where $x_t$ is the robot state and $u_t$ is a control input at time $t$. The goal of the robot is to actively track $N$ targets of interest using noisy measurements from its sensor. The targets are governed by dynamic models of the form:

$y_{j,t+1} = g_j(y_{j,t}, v_{j,t}), \quad j = 1, \dots, N$   (2)

where $y_{j,t}$ is the state of the $j$-th target and we compose the joint target state as $y_t = [y_{1,t}^\top, \dots, y_{N,t}^\top]^\top$. Each target can have its own control policy generating the inputs $v_{j,t}$. We denote the sensor measurement signal about the targets as $z_t$ and its observation model as $h$:
$z_t = h(x_t, y_t) + w_t$   (3)

where $w_t$ is sensor noise.
The information available to the robot at time $t$ is $z_{1:t}$ and $u_{1:t-1}$, where the subscript $t_1\!:\!t_2$ denotes the set of the corresponding variable from time $t_1$ to time $t_2$.
Problem (Active Information Acquisition). Given an initial robot pose $x_0$, a prior distribution of the target state $y_0$, and a planning horizon $T$, the task of the robot is to choose a sequence of functions $u_t = \mu_t(z_{1:t})$ for $t = 1, \dots, T$ which maximizes the mutual information between the final target state and the measurement set:

$\max_{\mu_1, \dots, \mu_T} I(y_T ; z_{1:T} \mid x_{1:T})$   (4)
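For intuition, the mutual information objective in (4) can be expanded using a standard identity, stated here in the notation defined above:

```latex
I(y_T;\, z_{1:T} \mid x_{1:T}) \;=\; H(y_T \mid x_{1:T}) \;-\; H(y_T \mid z_{1:T},\, x_{1:T})
```

When the target motion is independent of the robot path, the first term does not depend on the chosen controls, so maximizing (4) is equivalent to minimizing the posterior differential entropy of the target state; this is the form used to define the RL reward later in the paper.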
II-B. Reinforcement Learning
RL problems can be formulated in terms of a Markov Decision Process (MDP) described by the tuple $(S, A, P, R, \gamma)$, where $S$ and $A$ are the state and action spaces, respectively, $P(s'|s,a)$ is the state transition probability kernel, $R(s,a)$ is a reward function, and $\gamma \in [0,1)$ is a discount factor. A policy $\pi$ determines the behavior of the learning agent at each state, and it can be stochastic or deterministic. Given $\pi$, the value function is defined as $V^\pi(s) = \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s, \pi]$ for all $s \in S$, which is the expected value of cumulative future rewards starting at a state $s$ and following the policy $\pi$ thereafter. The state-action value function $Q^\pi$ is similarly defined as the value for a state-action pair, $Q^\pi(s,a) = \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s, a_0 = a, \pi]$ for all $s \in S$, $a \in A$. The objective of a learning agent in RL is to find an optimal policy $\pi^* = \arg\max_\pi V^\pi$. Finding the optimal values $V^*$ and $Q^*$ requires solving the Bellman optimality equations:
$V^*(s) = \max_{a} \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V^*(s') \right]$   (5)

$Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q^*(s',a')$   (6)

where $s'$ is the subsequent state after executing the action $a$ at the state $s$.
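To make the fixed-point nature of (5)-(6) concrete, here is a minimal sketch of Q-value iteration on a toy two-state, two-action MDP (the transition probabilities and rewards are arbitrary, purely for illustration):

```python
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9

# P[s, a, s'] : transition probabilities; R[s, a] : expected rewards.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

Q = np.zeros((n_states, n_actions))
for _ in range(500):                      # repeatedly apply the Bellman operator
    # (6): Q*(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) max_a' Q*(s',a')
    Q = R + gamma * P @ Q.max(axis=1)

V = Q.max(axis=1)                         # (5): V*(s) = max_a Q*(s,a)
```

Because the Bellman operator is a gamma-contraction, the iteration converges to the unique fixed point satisfying (6), from which the optimal values and the greedy optimal policy follow.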
When an MDP is unknown or too complicated, RL is used to find an optimal policy. One of the most popular RL algorithms is Q-learning, which updates Q-values from a temporal difference error using stochastic approximation. When either or both of the state and action spaces are large or continuous, it is infeasible to represent $Q(s,a)$ for all states and actions in a tabular format. Instead, we can use a function approximator $Q_\theta(s,a)$ to approximately estimate the Q-function. When a neural network is used as the function approximator, $\theta$ corresponds to the neural network weights and biases. Deep Q-network (DQN) is a neural network extension of Q-learning whose network outputs a vector of action values $Q_\theta(s,\cdot)$ for a given state $s$. DQN addresses the difficulty of applying neural networks to Q-learning mainly by introducing an additional target network and experience replay [3]. Double DQN is an extension of Double Q-learning with a neural network, which reduces the overestimation bias of DQN by using two sets of neural network weights [5]. Assumed Density Filtering Q-learning (ADFQ) is a Bayesian counterpart of Q-learning which updates belief distributions over Q-values through an online Bayesian update algorithm [6]. One of the major advantages of ADFQ is that its update rule performs a non-greedy update weighted by its uncertainty measures, which reduces the instability of Q-learning. It has been shown that ADFQ with a neural network outperforms DQN and Double DQN when the number of actions is large. This may make it particularly appropriate for the active information acquisition problem, which can be highly stochastic and may have a large number of actions.

III. RL Framework for Active Information Acquisition
In order to solve the active information acquisition problem using RL, we first formulate the problem as an MDP. Since the robot does not have access to the ground truth of the target states, we could formulate the problem as a Partially Observable Markov Decision Process (POMDP) that maintains beliefs over states [7]. However, it is known that solving a generic discrete POMDP exactly is intractable [8]. Instead, we define a target belief distribution $b_{j,t} = p(y_{j,t} \mid z_{1:t}, x_{1:t})$ for $j = 1, \dots, N$, where $b_{j,t}$ is a tractable parametric distribution with parameters $\psi_{j,t}$, conditioned additionally on the target control input $v_{j,t}$ if one exists. The belief distribution, or $\psi_{j,t}$, can be updated by a Bayes filter using incoming observations. We explicitly include the belief state as part of the MDP state, and thus the problem state is expressed as a function of the robot state and the target belief states, $s_t = \phi(x_t, b_{1,t}, \dots, b_{N,t})$. The map $\phi$ may vary depending on the application. The action in the MDP is defined as the control input to the robot, $a_t = u_t$.
The goal of the RL agent in this problem is to find an optimal policy that maximizes the mutual information (4). Assuming that $y_T$ is independent of the robot path $x_{1:T}$, the optimization problem now seeks to minimize the differential entropy $H(y_T \mid z_{1:T}, x_{1:T})$ [1]. In order to evaluate the entropy resulting from taking an action $a_t$ at the current state $s_t$, the reward is defined by the belief posterior at $t+1$:

$R(s_t, a_t) = -H(b_{t+1})$   (7)

Then, the optimal policy minimizes the discounted cumulative entropy:

$\pi^* = \arg\min_\pi \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t H(b_{t+1}) \right]$   (8)
The RL framework for active information acquisition is summarized in Fig. 1.
IV. Learning a Q-network for Active Target Tracking
In this section, we present a specific RL approach to the active information acquisition problem, focusing on a target tracking application in a two-dimensional environment with Gaussian belief distributions. Let the mean and the covariance of the $j$-th target belief be $\hat{y}_{j,t}$ and $\Sigma_{j,t}$, respectively. The RL state $s_t$ is defined by the target belief states and information about the surroundings. More formally, for each target $j$ the state includes $r_{j,t}$ and $\alpha_{j,t}$, the radial and polar coordinates of the $j$-th target belief mean in the robot frame at time $t$, the belief covariance, and an observability indicator $\mathbb{1}(\hat{y}_{j,t} \in O(x_t))$, where $O(x_t)$ is the space observable from the robot state $x_t$ and $\mathbb{1}(\cdot)$ is a boolean function which returns 1 if its given statement is true and 0 otherwise. $r_{o,t}$ and $\alpha_{o,t}$ are the radial and polar coordinates of the closest obstacle point to the robot in the robot frame, respectively. If no obstacle is detected, $r_{o,t}$ and $\alpha_{o,t}$ are set to the maximum sensor range and a fixed default value, respectively. In a three-dimensional environment, we can use the spherical coordinate system instead. We define the action space with a finite number of motion primitives.
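The state construction above can be sketched as follows. This is our own minimal illustration, not the paper's implementation: the helper names, the per-target use of the log-determinant of the covariance, and the default obstacle angle of pi are assumptions.

```python
import numpy as np

def polar_in_robot_frame(robot_pose, point):
    """Radial and polar coordinates of a 2-D point in the robot frame."""
    dx, dy = point[0] - robot_pose[0], point[1] - robot_pose[1]
    r = np.hypot(dx, dy)
    alpha = np.arctan2(dy, dx) - robot_pose[2]
    return r, np.arctan2(np.sin(alpha), np.cos(alpha))  # wrap to [-pi, pi]

def rl_state(robot_pose, belief_means, belief_covs, in_fov,
             obstacle=None, max_range=10.0):
    """Assemble [r_j, alpha_j, log det Sigma_j, visible_j, ..., r_o, alpha_o]."""
    s = []
    for mean, cov, visible in zip(belief_means, belief_covs, in_fov):
        r, alpha = polar_in_robot_frame(robot_pose, mean[:2])
        s += [r, alpha, np.log(np.linalg.det(cov)), float(visible)]
    if obstacle is None:
        s += [max_range, np.pi]  # no obstacle detected: max range, default angle
    else:
        s += list(polar_in_robot_frame(robot_pose, obstacle))
    return np.array(s)
```

Keeping the target coordinates in the robot frame makes the state representation invariant to the global pose of the robot, which tends to help generalization across initial conditions.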
Since $b_{t+1}$ is a Gaussian belief posterior, the differential entropy in (7) is:

$H(b_{t+1}) = \frac{1}{2} \log\left( (2\pi e)^d \det\Sigma_{t+1} \right) = \frac{1}{2} \log\det\Sigma_{t+1} + c$   (9)

where $c$ is a constant and $d$ is the dimension of the joint target state. Assuming that all target beliefs are independent of each other, $\Sigma_t$ is a block-diagonal matrix of the individual covariances, $\Sigma_t = \mathrm{diag}(\Sigma_{1,t}, \dots, \Sigma_{N,t})$, and $\log\det\Sigma_t = \sum_{j=1}^{N} \log\det\Sigma_{j,t}$. Therefore, we define the reward function in this target tracking problem as:

$R(s_t, a_t) = -\kappa_1\, \mathrm{mean}_j\!\left(\log\det\Sigma_{j,t+1}\right) - \kappa_2\, \mathrm{std}_j\!\left(\log\det\Sigma_{j,t+1}\right) - \kappa_3\, o_{t+1}$   (10)
The first two terms penalize the overall uncertainty of the target beliefs and their dispersion (measured as a standard deviation). The dispersion term prevents the robot from tracking only a few targets when not all targets are within its sensing range at a given time. The last term discourages the robot from approaching obstacles or walls. $\kappa_1$, $\kappa_2$, and $\kappa_3$ are constant factors, and the obstacle term $o_{t+1}$ is set to 0 if no obstacle is detected.

We suggest off-policy temporal difference methods such as DQN, Double DQN, and ADFQ in order to learn an optimal policy for this problem. Although any RL algorithm can be used in this framework, such off-policy temporal difference algorithms are known to be more sample efficient than policy-based RL methods [27]. Moreover, in off-policy methods the action policy can differ from the update policy, which allows safe exploration during learning. The algorithm is summarized in Table 1. Note that the RL agent does not require any knowledge of the system models (1), (2), (3) as long as it can observe its state and a reward. Additionally, the RL update is independent of the Bayes filter, so the framework can leverage various state estimation methods.
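The reward in (10) reduces to a few lines given the per-target belief covariances. The sketch below is our own illustration; the function name, default weights, and the scalar `obstacle_penalty` stand-in for the obstacle term are assumptions.

```python
import numpy as np

def tracking_reward(belief_covs, obstacle_penalty=0.0,
                    kappa1=1.0, kappa2=1.0, kappa3=1.0):
    """Reward (10): penalize mean and spread of per-target log det Sigma_j,
    plus an obstacle term (0 when no obstacle is detected)."""
    logdets = [np.log(np.linalg.det(S)) for S in belief_covs]
    return (-kappa1 * np.mean(logdets)
            - kappa2 * np.std(logdets)
            - kappa3 * obstacle_penalty)
```

Note that when all targets are tracked equally well the dispersion term vanishes, so the agent is only rewarded for shrinking the overall uncertainty; the term activates precisely when the robot neglects some targets in favor of others.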
V. Experiments
To demonstrate the proposed framework, we evaluate it with ADFQ, DQN, and Double DQN in target tracking problems with different numbers of targets. An $\epsilon$-greedy action policy is used with $\epsilon$ annealed from 1.0 to 0.01 for all algorithms. For ADFQ, we additionally used its Thompson sampling (TS) action policy, which exploits its uncertainty estimates for Q-values.
Furthermore, we compare with the Anytime Reduced Value Iteration (ARVI), an open-source target tracking algorithm, which we use as a baseline. ARVI uses a linear system model and the Kalman Filter to predict a target trajectory, and then evaluates the mutual information over a search tree with some pruning to ensure finite execution time. The performance of ARVI has been verified in target tracking simulations and real robot experiments in [24]. The aim is not to show that reinforcement learning outperforms this approach, but rather that it achieves comparable performance while featuring a much more general problem formulation.

The differential drive dynamics of the robot is:
$\begin{bmatrix} x^{(1)}_{t+1} \\ x^{(2)}_{t+1} \\ \theta_{t+1} \end{bmatrix} = \begin{bmatrix} x^{(1)}_t \\ x^{(2)}_t \\ \theta_t \end{bmatrix} + \begin{bmatrix} \tau\nu\cos(\theta_t + \tau\omega/2) \\ \tau\nu\sin(\theta_t + \tau\omega/2) \\ \tau\omega \end{bmatrix}$   (11)

where $\tau$ is a sampling period, and $x^{(1)}_t$, $x^{(2)}_t$, $\theta_t$ correspond to the elements of the robot state $x_t$ along the $x$-axis, the $y$-axis, and the polar coordinate at time $t$, respectively. We discretized the action space with pre-defined motion primitives consisting of linear velocities $\nu$ (m/s) and angular velocities $\omega$ (rad/s). The objective of the robot is to track the positions and velocities of targets, each of which follows double integrator dynamics with Gaussian noise:
$y_{j,t+1} = \begin{bmatrix} I_2 & \tau I_2 \\ 0 & I_2 \end{bmatrix} y_{j,t} + w_{j,t}, \quad w_{j,t} \sim \mathcal{N}\!\left(0,\; q \begin{bmatrix} \tau^3/3\, I_2 & \tau^2/2\, I_2 \\ \tau^2/2\, I_2 & \tau I_2 \end{bmatrix}\right)$   (12)

where $q$ is a noise constant factor. When a target is close to a wall or an obstacle, it reflects its direction with a small Gaussian noise. We assumed that the target model is known to the robot and updated the target belief distributions using the Kalman Filter. Note that the Kalman Filter can simply be replaced by other Bayes filters or learning-based state estimation methods within the proposed RL framework.
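The Kalman filter prediction step for the double integrator model (12) can be sketched as follows; the sampling period and noise factor below are illustrative values of our choosing, not the paper's settings.

```python
import numpy as np

tau, q = 0.5, 0.01  # illustrative sampling period (s) and noise factor

# Double integrator: target state y = [px, py, vx, vy].
I2 = np.eye(2)
A = np.block([[I2, tau * I2],
              [np.zeros((2, 2)), I2]])
W = q * np.block([[tau**3 / 3 * I2, tau**2 / 2 * I2],
                  [tau**2 / 2 * I2, tau * I2]])

def kf_predict(mean, cov):
    """Propagate a Gaussian target belief N(mean, cov) one step forward."""
    return A @ mean, A @ cov @ A.T + W
```

The measurement update then follows the standard Kalman gain computation using the linearized observation model described next; any other Bayes filter could be dropped in here without touching the RL side of the framework.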
The observation model of the sensor for each target is a range-bearing measurement:

$z_{j,t} = h(x_t, y_{j,t}) + w_t = \begin{bmatrix} \sqrt{(y^{(1)}_{j,t} - x^{(1)}_t)^2 + (y^{(2)}_{j,t} - x^{(2)}_t)^2} \\ \arctan\!\big((y^{(2)}_{j,t} - x^{(2)}_t)/(y^{(1)}_{j,t} - x^{(1)}_t)\big) - \theta_t \end{bmatrix} + w_t$   (13)

To be used in the Kalman Filter, this model is linearized by computing the Jacobian matrix of $h$ with respect to $y$:

$\nabla_y h = \begin{bmatrix} (y^{(1)} - x^{(1)})/r & (y^{(2)} - x^{(2)})/r & 0 & 0 \\ -(y^{(2)} - x^{(2)})/r^2 & (y^{(1)} - x^{(1)})/r^2 & 0 & 0 \end{bmatrix}$

where $r$ is the range to the target.
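A sketch of the range-bearing model (13) and its Jacobian, assuming the target state layout [px, py, vx, vy]; the function names are ours.

```python
import numpy as np

def observe(robot_pose, target_state):
    """h(x, y): range and bearing of the target in the robot frame."""
    dx = target_state[0] - robot_pose[0]
    dy = target_state[1] - robot_pose[1]
    r = np.hypot(dx, dy)
    bearing = np.arctan2(dy, dx) - robot_pose[2]
    return np.array([r, bearing])

def observation_jacobian(robot_pose, target_state):
    """Jacobian of h with respect to y = [px, py, vx, vy]."""
    dx = target_state[0] - robot_pose[0]
    dy = target_state[1] - robot_pose[1]
    r2 = dx**2 + dy**2
    r = np.sqrt(r2)
    return np.array([[dx / r,   dy / r,   0.0, 0.0],
                     [-dy / r2, dx / r2,  0.0, 0.0]])
```

The zero columns reflect that range and bearing depend only on the target position, not its velocity; the velocity components are still estimated indirectly through the dynamics model.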
In the experiments, the sensor has a maximum range of 10 meters and its field of view is 120 degrees. We assume that the sensor is able to distinguish between targets and obstacles. The initial robot state $x_0$ is randomly initialized within the given map, and the position components of $y_{j,0}$ are also randomly initialized within a maximum offset of 8 meters from the initial robot state. The initial velocity is 0.0. The target belief state follows a Gaussian distribution. In order to make the experiment more realistic, the belief mean position is randomly initialized within a maximum offset of 5 meters from the target, and the covariance $\Sigma_{j,0}$ is initialized to a fixed value. We use a constant sampling period and a constant observation noise covariance.
For the Q-network, we used 3 layers with 128 units for a single target, and 3 layers with 256 units for multiple targets, with learning rates tuned for each setting. The target network is updated every 50 training steps. The batch size and the replay buffer size are 64 and 1000, respectively.
All results are obtained with 5 different random seeds for the learning algorithms and 10 random seeds for ARVI, and are plotted in Fig. 2. The darker lines show the mean over seeds and the shaded areas represent standard deviation. The current learned policies were evaluated semi-greedily 5 times after training on each single trajectory (every two trajectories for the multi-target experiments). The curves are smoothed by a moving average with window 4.
V-A. Single Target
We tested the single target problem in an empty domain with no obstacles, where the behavior of the target is more predictable (as there is far less noisy reflection behavior from walls). We also tested a domain with four obstacles, shown in the first row of Fig. 3. The noise parameter for the target model and the length of a trajectory are fixed across both cases.
The first plot in Fig. 2 shows that both ADFQ with TS and ADFQ with $\epsilon$-greedy reached the baseline performance after learning from 13 trajectories. ADFQ-TS showed a more stable performance, outperforming the baseline toward the end. Since the belief state mean can quickly diverge from the true state while its covariance remains small, exploration methods based on state-action uncertainty, such as Thompson sampling, lead to better performance than $\epsilon$-greedy. ADFQ outperformed the baseline in the obstacle environment as well. An example of a policy learned by ADFQ-TS is presented in the first row of Fig. 3. As shown, even though the robot briefly missed the target and the belief became inaccurate, it quickly adjusted its direction and followed the target, keeping it within its sensing range.
DQN and Double DQN failed to reach the baseline performance in both environments. Although their performance improved with the number of learning trajectories in the empty environment, it dropped dramatically in the obstacle environment. This is due to the high stochasticity of the environment, as the target changes its path abruptly with noise when it faces an obstacle.
V-B. Multi-Target
We tested the cases of two and three targets in an empty domain. A longer trajectory is used in order to evaluate cases where targets diverge and the robot has to keep traveling to minimize the covariances. In both the two-target and three-target cases, the ADFQ algorithms outperformed or matched the baseline performance, as shown in Fig. 2. Additionally, the baseline showed large variance in its performance in both cases, while the ADFQ algorithms showed considerably lower variance across trials.
The most challenging part of these experiments arises when not all targets are observable at the same time. The results indicate that the RL methods can learn a policy which makes a near-optimal decision on when to keep traveling to track all the targets and when to exploit nearby targets. The learned policy of ADFQ-TS is demonstrated in Fig. 3. When the targets are not simultaneously observable but not too far from each other, the robot must choose to visit each target sequentially to maintain its belief distribution for every target.
VI. Conclusions
In this paper, we introduced a novel RL framework for the active information acquisition problem and developed a detailed approach for solving the active target tracking problem with Q-network-based RL algorithms. The experimental results demonstrated that the RL-based methods can match, and sometimes outperform, the search-based planning algorithm. As an initial approach, we used the Kalman filter with a known linear target model in our experiments. Future work will leverage various existing techniques in Bayesian filtering and state estimation within the framework in order to handle nonlinear or unknown target models. Additionally, since ADFQ maintains belief distributions over Q-values, we intend to further extend our approach by propagating target state uncertainty to the Q-belief distributions.
References
 [1] N. Atanasov, J. Le Ny, and G. Pappas, Distributed Algorithms for Stochastic Source Seeking with Mobile Robot Networks, ASME Journal.
 [2] G. M. Hoffmann, S. L. Waslander, and C. J. Tomlin, Mutual information methods with particle filters for mobile sensor network control, in Proc. of the 45th IEEE Conf. on Decision and Control, 2006.
 [3] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, Playing Atari with deep reinforcement learning, in Advances in Neural Information Processing Systems (NIPS), Deep Learning Workshop, 2013.
 [4] V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. Bellemare, A. Graves, M. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, Human-level control through deep reinforcement learning, Nature, 518:529-533, 2015.

 [5] H. V. Hasselt, A. Guez, and D. Silver, Deep reinforcement learning with double Q-learning, in Proc. of the 30th AAAI Conf. on Artificial Intelligence, 2016.
 [6] H. Jeong, C. Zhang, G. J. Pappas, and D. D. Lee, Assumed Density Filtering Q-learning, in Proc. of the 28th Int. Joint Conf. on Artificial Intelligence (IJCAI), 2019.
 [7] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, Planning and acting in partially observable stochastic domains, Artificial Intelligence, 1998, 101(1-2), pp. 99-134.
 [8] C. H. Papadimitriou and J. N. Tsitsiklis, The complexity of Markov decision processes, Mathematics of Operations Research, 12(3):441-450, 1987.
 [9] S. Choudhury, A. Kapoor, G. Ranade, and D. Dey, Learning to gather information via imitation, in Proc. IEEE Int. Conf. on Robotics and Automation (ICRA), 2017, pp. 908-915, IEEE.
 [10] R. Sutton and A. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.
 [11] J. Kober, J. A. Bagnell, J. Peters, Reinforcement learning in robotics: A survey, Int. J. of Robotics Research, July 2013
 [12] J. Peters, S. Vijayakumar, and S. Schaal, Reinforcement learning for humanoid robotics, in Proc. of the 3rd IEEE-RAS Int. Conf. on Humanoid Robots (HUMANOIDS), Karlsruhe, Germany, 2003, pp. 1-20.
 [13] H. Jeong and D. D. Lee, Efficient learning of stand-up motion for humanoid robots with bilateral symmetry, IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), 2016, pp. 1544-1549, IEEE.
 [14] V. Kumar, D. Rus, and S. Singh, “Robot and Sensor Networks for First Responders,” IEEE Pervasive Computing, vol. 3, no. 4, 2004.
 [15] R. Sim and N. Roy, Global A-Optimal Robot Exploration in SLAM, in Proc. IEEE Int. Conf. on Robotics and Automation (ICRA), 2005, IEEE.
 [16] V. Karasev, A. Chiuso, and S. Soatto, Controlled Recognition Bounds for Visual Learning and Exploration, in Advances in Neural Information Processing Systems (NIPS), 2012.
 [17] N. Atanasov, B. Sankaran, J. Le Ny, T. Koletschka, G. Pappas, and K. Daniilidis, Hypothesis Testing Framework for Active Object Detection, in Proc. IEEE Int. Conf. on Robotics and Automation (ICRA), 2013, IEEE.

 [18] P. Ondruska and I. Posner, Deep tracking: Seeing beyond seeing using recurrent neural networks, in Proc. of the 30th AAAI Conf. on Artificial Intelligence, 2016.
 [19] A. Milan, S. H. Rezatofighi, A. Dick, I. Reid, and K. Schindler, Online multi-target tracking using recurrent neural networks, in Proc. of the 31st AAAI Conf. on Artificial Intelligence, 2017.
 [20] W. Luo, P. Sun, F. Zhong, W. Liu, and Y. Wang, End-to-end active object tracking via reinforcement learning, arXiv preprint arXiv:1705.10561, 2017.

 [21] J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, and P. H. Torr, End-to-end representation learning for correlation filter based tracking, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, pp. 2805-2813, 2017.
 [22] D. Zhang, H. Maei, X. Wang, and Y. F. Wang, Deep reinforcement learning for visual object tracking in videos, arXiv preprint arXiv:1701.08936, 2017.
 [23] J. Choi, J. Kwon, and K. M. Lee, Real-time visual tracking by deep reinforced decision making, Computer Vision and Image Understanding, 171, pp. 10-19, 2018.
 [24] B. Schlotfeldt, D. Thakur, N. Atanasov, V. Kumar, and G. J. Pappas, Anytime planning for decentralized multirobot active information gathering, IEEE Robotics and Automation Letters, 3(2), pp. 1025-1032, 2018.
 [25] T. Haarnoja, A. Ajay, S. Levine, and P. Abbeel, Backprop KF: Learning discriminative deterministic state estimators, in Advances in Neural Information Processing Systems, pp. 4376-4384, 2016.
 [26] H. He, P. Mineiro, and N. Karampatziakis, Active information acquisition, arXiv preprint arXiv:1602.0218, 2016.
 [27] O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans, Bridging the gap between value and policy based reinforcement learning, in Advances in Neural Information Processing Systems, pp. 2775-2785, 2017.
 [28] G. Hollinger and G. Sukhatme, Sampling-based Motion Planning for Robotic Information Gathering, in Proc. Robotics: Science and Systems (RSS), 2013.
 [29] S. Gu, E. Holly, T. Lillicrap, and S. Levine, Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates, IEEE Int. Conf. on Robotics and Automation (ICRA), pp. 3389-3396, 2017.
 [30] A. Singh, A. Krause, C. Guestrin, and W. Kaiser. Efficient informative sensing using multiple robots, J. Artificial Intelligence Research, 34:707–755, Apr. 2009.
 [31] G. Hollinger, S. Singh, J. Djugash, and A. Kehagias, Efficient multi-robot search for a moving target, Int. J. Robotics Research, 28(2):201-219, Feb. 2009.