I Introduction
Active Object Recognition (AOR) refers to the problem of predicting an object's label from images while being able to change the pose of the object relative to the camera in order to increase prediction certainty. A robot rotating an in-hand object to refine its label prediction is an example of an AOR system. Ambiguity in object recognition arises because different objects can look similar from some views. AOR aims to find the sequence of actions that reduces label ambiguity and improves recognition performance in the fewest steps. Despite its wide applicability and its potential to improve performance, AOR has not been applied widely and has remained secluded from mainstream computer vision progress in recent years.
Existing approaches to AOR change the sensor position to reduce the ambiguity of label prediction [1, 4, 9]. Most of these methods rely on uncertainty about the object label, and use greedy best-next-action selection [7] at test time to decrease the entropy of the label probabilities. A few methods aim for optimal action selection at test time using dynamic programming [2] or Monte Carlo planning [25]. However, these methods require a model of the object and are computationally heavy at test time. We propose a method that reduces optimal action selection to a simple classification of beliefs at test time and does not require planning. We show that the proposed method generalizes well to novel views of familiar objects, and also to novel objects. Moreover, we show that the active perception paradigm can be used to improve the accuracy of the object recognition system itself, by selecting for training the images that are more likely to result in higher rewards.
In the first contribution of this paper, we formulate AOR as a POMDP and adapt a Belief Tree Search (BTS) algorithm [16] to discover near-optimal values for object poses on the training set. We infer a policy from these values, and use it to train an LSTM network to predict the best action given the current belief about the object. At test time, we use the actions suggested by this LSTM to explore the objects. We show that this supervised approach generalizes well to novel objects and novel views of familiar objects, and results in higher AOR accuracy than reinforcement learning and guided policy search methods.
In our second contribution, we derive an update rule to learn the parameters of the POMDP likelihood function with the goal of maximizing the total reward. This update rule emphasizes views of objects that will produce higher rewards in the future. We show that by retraining the likelihood function using the proposed method, the performance of the AOR system significantly improves.
In the next section, we review previous approaches to AOR. We then present the BTS algorithm and the observation function update rule. In the results section, we report the implementation details of these methods and their performance on the GERMS dataset [18]. GERMS has proved to be a challenging AOR dataset, and we improve on the state-of-the-art performance on it. The final section gives concluding remarks.
II Literature Review
A large category of AOR models try to minimize the predicted label uncertainty through best next-action planning [29, 5, 8, 9, 6, 7]. These models predict the object label probabilities using the current view, and search for the next action that minimizes the expected entropy of object labels. In these methods, learning object appearance is performed by fitting a generative model offline, while best action selection is carried out online at test time. Uncertainty measures such as conditional entropy and mutual information are computationally expensive to evaluate for all possible observations. Therefore, these methods usually resort to approximations of these measures, which might result in poor AOR performance.
A second category of models use techniques such as REINFORCE [32] or Neurally Fitted Q-Iteration (NFQ) [28] to find a good policy or action-value function for object exploration [24, 20, 18]. A parametric function that encodes the object exploration policy or action-values is learned offline by using an exploration policy and collecting rewards that depend on the label prediction accuracy. The model then updates the parameters of the policy or action-value function to maximize its total expected reward. These methods suffer from high variance of prediction due to sampling of actions, only guarantee convergence to a local optimum, and require a lot of training to explore and discover optimal sequences of actions.
More recently, deep convolutional neural networks (CNNs) have been applied in AOR as a tool for modeling object appearance along with action-value prediction [14, 17, 10, 12]. Malmir et al. trained a deep CNN using the NFQ update rule [17]. In this work, a layer of Dirichlet distributions is embedded into the network for modeling the distribution of beliefs for different object-action pairs. Johns et al. used deep CNNs for entropy regression and action prediction for the set of next viewpoints [14]. Finding the optimal trajectory for object inspection is then approximated by maximizing the sum of cross entropy over pairs of adjacent views. Haque et al. trained LSTM networks with the REINFORCE algorithm to recognize subjects from 3D point clouds [10]. Jayaraman and Grauman modeled the object exploration policy as a neural network and trained it using classification accuracy as reward [12]. They found that predicting the next state of the environment based on the current state and action improves the overall AOR accuracy. All these methods show improved performance over random exploration strategies and non-active methods. However, they suffer from the same problem as previous methods, which is the lack of a performance guarantee even on the training set.
There are very few approaches to AOR that aim to find optimal exploration policies. Atanasov et al. [2] adapted an active hypothesis testing approach [23] for camera viewpoint selection for object segmentation. This approach learns a model of object appearance, and uses it to plan a sequence of actions that minimizes the cost of motor movements, object classification and viewpoint prediction. A dynamic programming approach is used to discover the best sequence of actions that minimizes this cost. This method depends on the representation of object appearance for efficient planning, while our method acts in the belief space and is completely independent of object representation and classification. Patten et al. [25] use Monte Carlo planning for active exploration and perception of outdoor objects. This method uses rollouts that depend on predicting the point cloud of objects for different actions. Compared to this method, our method is more intuitive and does not require a model of objects.
Of special interest to this paper are works on belief tree search and Monte Carlo POMDP planning. Lee et al. [16] proposed clustering of beliefs in a belief tree search algorithm to reduce the width of the tree. DESPOT [31] uses sampling of observations to reduce the width of the belief tree for optimal action selection. POMCP [30] adapts Monte Carlo sampling and the Upper Confidence Trees algorithm [15] for efficient POMDP planning. We base our method on [16] because of desirable properties such as acting on the belief space and performance guarantees on the reachable space of beliefs.
An important feature of the proposed approach is that it does not need to maintain an object model. Our method acts in the belief space of objects, and only requires a black-box simulator for training that returns the belief resulting from performing different actions on objects. Another important property of this method is that at test time, the next action is predicted by a simple classification of the current belief. We show that this approach is more effective in learning object exploration policies than reinforcement learning and actor-critic methods. The next section describes the proposed approach in more detail.
III Monte Carlo Belief Trees
Active object recognition can be formulated as a POMDP denoted by the tuple $(S, A, O, T, Z, R, \gamma)$, where $S$ is the set of states (object labels), $A$ is the set of actions for object examination, and $O$ is the set of observations (captured images of objects). We do not perceive the identity of the object directly, but rather collect information about it through observations. The transition function $T$ marks the transition between different states, and the observation function $Z$ relates the observations to different object identities through $Z(o \mid s, a)$, which is the probability of observing $o$ after taking action $a$ when the object label is $s$. The reward function $R(s, a)$ determines the reward for taking action $a$ when the object label is $s$. Finally, $\gamma$ is the reward discount factor.
Note that in this POMDP, the transition function reduces to the identity function. The observation space, on the other hand, includes all images of objects, and is prohibitively large for value iteration techniques [27]. Another approach is to use Monte Carlo planning, which builds a search tree for one-step action selection [30]. However, this method suffers from the curse of history, which is exacerbated by high-dimensional observations. We seek a solution to the AOR POMDP that represents history compactly and allows a tractable search for the optimal actions.
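It has been shown that the solution to a POMDP can be found by solving the equivalent belief MDP, where the belief $b_t$ denotes the posterior probability of states given the action-observation history $h_t = (a_1, o_1, \ldots, a_t, o_t)$,
$$b_t(s) = P(s \mid h_t). \tag{1}$$
In the belief MDP, states represent the posterior probabilities of POMDP states given the action-observation history. For this MDP, the transition from belief $b$ to belief $b'$ given action $a$ is defined as,
$$P(b' \mid b, a) = \sum_{o \in O(b, a, b')} \sum_{s \in S} Z(o \mid s, a)\, b(s), \tag{2}$$
where $O(b, a, b')$ denotes the set of observations that can result in changing beliefs from $b$ to $b'$,
$$O(b, a, b') = \{\, o \in O : b^{a,o} = b' \,\}. \tag{3}$$
The belief MDP reward is defined by calculating the expected reward over states,
$$R(b, a) = \sum_{s \in S} b(s)\, R(s, a). \tag{4}$$
And finally, given an initial belief $b$, action $a$ and observation $o$, the updated belief $b^{a,o}$ is calculated by,
$$b^{a,o}(s) = \frac{Z(o \mid s, a)\, b(s)}{\sum_{s' \in S} Z(o \mid s', a)\, b(s')}. \tag{5}$$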
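As an illustration of the belief update and the belief reward described above, here is a minimal sketch in plain Python. It assumes that the object label does not change between views, so the update is a Bayes rule over labels. The function names `update_belief` and `belief_reward` and the numbers are hypothetical and are not taken from the paper.

```python
def update_belief(belief, likelihoods):
    """Bayes update of a belief over object labels.

    belief:      prior probabilities b(s), one per label.
    likelihoods: observation probabilities Z(o | s, a) of the observed
                 image o, one per label.
    """
    unnormalized = [z * b for z, b in zip(likelihoods, belief)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [u / total for u in unnormalized]


def belief_reward(belief, rewards):
    """Expected reward of one action under a belief: sum_s b(s) * R(s, a)."""
    return sum(b * r for b, r in zip(belief, rewards))


# Three candidate labels, a uniform prior, and one view that favours label 0.
prior = [1 / 3, 1 / 3, 1 / 3]
posterior = update_belief(prior, [0.6, 0.3, 0.1])
```

With a uniform prior, the posterior is simply the normalized likelihood vector; repeated calls accumulate evidence across views.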
Having defined the equivalent belief MDP, we aim to solve the planning problem using the Belief Tree Search algorithm. This algorithm constructs a search tree for a given belief $b_0$, where different branches represent actions and the resulting observations. Each node in the tree represents a belief about the underlying POMDP states, and each edge captures an action and the resulting observation. The algorithm starts with $b_0$ as the root and exhaustively performs all actions to collect new observations and form new beliefs. These beliefs are then added as children of the root, and the process iterates with the new beliefs until stop states are reached in the leaves. The values are then backed up from the leaves to the root to estimate the value of $b_0$. Belief tree search is normally used for online planning in POMDPs where the dynamics of the environment are known. Here, however, we adapt it to calculate the optimal values for the images of the training set.
The plain belief tree search algorithm is computationally intractable for the AOR belief MDP, since it examines all observations after each action. Instead, we adapt the algorithm in theorem 1 of [16], which sacrifices optimality of the predicted values in exchange for computational tractability. This algorithm uses the smoothness of the optimal value function to cluster the belief space. More specifically, for a given $\delta$, each node in the tree represents the approximate value of all beliefs in the $\delta$-neighborhood of its belief $b$. By clustering the belief space, the width of the belief tree decreases to a manageable size. Another approximation is that the algorithm calculates the approximate optimal value of a given belief $b_0$ only in $\mathcal{R}(b_0)$, the reachable space of $b_0$, defined as the set of beliefs that are reachable from $b_0$ by arbitrary sequences of actions. Finally, the algorithm uses the discount factor $\gamma$ to limit the height of the belief tree.
We adapt this algorithm to find an approximately optimal value for each image in the training set. The algorithm is given in pseudocode in algorithm 1. Using these values, training an active object recognition system reduces to a supervised learning and knowledge transfer problem. In each node of the tree, the algorithm expands all actions and receives the new observations. New beliefs are then calculated using (5). For each new belief $b'$, if there is an already expanded belief $b''$ at that height of the tree that lies within distance $\delta$ of $b'$, the algorithm sets the value of $b'$ equal to that of $b''$ and backtracks. Otherwise, the search continues in the children of $b'$.
Our algorithm builds a belief tree over $\mathcal{R}(b_0)$ by sampling from images in the training set and maintaining a $\delta$-packing of $\mathcal{R}(b_0)$ at each level of the tree. A belief tree with root $b_0$ contains all the possible actions and observations that are encountered while inspecting an object with the initial belief $b_0$. Because a belief tree captures all possible actions and observations, constructing the full belief tree is prohibitive for active object recognition, where the observation space is extremely large. One modification that we made in algorithm 1 compared to theorem 1 of [16] is that at the root node, we calculate the value of the $\delta$-neighborhood of $b_0$ rather than of $b_0$ alone. This is to reduce the overfitting of values to specific beliefs. In our algorithm, if two beliefs are very similar but result in vastly different rewards, that should be considered in calculating the value of $b_0$. In AOR this happens when the classifier is uncertain about some examples; their beliefs are then close to each other, and this should be reflected in their calculated values. Figure 1 depicts the proposed algorithm in graphical form.
Theorem 1 provides a guarantee on the optimality of the values found for images by algorithm 1.
Theorem 1. For a given maximum error $\epsilon$ and the optimal value function $V^*$, algorithm 1 finds the value $\hat{V}(b_0)$ of a given belief $b_0$ such that $|V^*(b_0) - \hat{V}(b_0)| \le \epsilon$ by setting the parameter values as,
(6)  
(7) 
Proof. See supplementary materials.
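The clustered tree search described in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm 1: it assumes a deterministic black-box simulator (as when stepping through recorded training images), an L1 distance for the $\delta$ test, a fixed maximum depth in place of the depth derived from the discount factor, and a reward equal to the belief mass on the true label. All names (`belief_tree_value`, `simulate`, `reward`, `LIKELIHOOD`) and all numbers are hypothetical.

```python
def l1_distance(p, q):
    """L1 distance between two belief vectors of equal length."""
    return sum(abs(x - y) for x, y in zip(p, q))


def belief_tree_value(belief, state, simulate, reward, actions,
                      delta, gamma, max_depth, depth=0, cache=None):
    """Approximate value of `belief` by depth-limited search with clustering.

    simulate(state, belief, action) -> (next_state, next_belief)
    reward(state, belief) -> float
    cache[d] stores (belief, value) pairs already expanded at depth d.
    """
    if cache is None:
        cache = [[] for _ in range(max_depth + 1)]
    # Reuse the value of an already expanded belief within distance delta.
    for seen, value in cache[depth]:
        if l1_distance(seen, belief) <= delta:
            return value
    value = reward(state, belief)
    if depth < max_depth:
        children = []
        for action in actions:
            next_state, next_belief = simulate(state, belief, action)
            children.append(belief_tree_value(
                next_belief, next_state, simulate, reward, actions,
                delta, gamma, max_depth, depth + 1, cache))
        value += gamma * max(children)  # back up the best child value
    cache[depth].append((belief, value))
    return value


# Toy problem: 4 poses on a ring, 2 labels, true label 0.
LIKELIHOOD = {0: [0.5, 0.5], 1: [0.6, 0.4], 2: [0.9, 0.1], 3: [0.55, 0.45]}


def simulate(pose, belief, action):
    """Rotate to the next pose and apply the Bayes belief update."""
    next_pose = (pose + action) % 4
    unnorm = [z * b for z, b in zip(LIKELIHOOD[next_pose], belief)]
    total = sum(unnorm)
    return next_pose, [u / total for u in unnorm]


def reward(pose, belief):
    """Assumed reward: belief mass assigned to the true label (label 0)."""
    return belief[0]


cache = [[], []]
root_value = belief_tree_value([0.5, 0.5], 0, simulate, reward, [1, -1],
                               delta=0.2, gamma=0.9, max_depth=1, cache=cache)
```

In this toy run the two child beliefs lie within $\delta$ of each other, so only one of them is expanded and stored at depth 1; the second reuses its value. A larger $\delta$ gives a narrower tree at the cost of coarser values.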
III-A Optimizing Observation Function
It is important to understand that the observation function in the AOR POMDP is an approximation to the actual likelihood of object views under different classes, and that it depends on the model that one fits to the data. For example, Borotschnig et al. use a Gaussian mixture on the eigenspace to model the likelihood of images from the view sphere of an object under different classes [5], while Malmir et al. approximate the observation function using a deep convolutional network [17]. Calculating the observation function value for an image usually requires feature extraction from the image and density estimation for different classes. We assume a parametric observation function $Z_\theta$, which is a function of parameters $\theta$,
(8)
Different values of the parameters $\theta$ change the approximation to the observation function. One may improve the observation function by using a different estimate of $\theta$, which changes the feature extraction or density estimation. The improved observation function then results in a POMDP with different environment dynamics. In the ideal case, the observation function value given an image is 1 for the correct object label and 0 for other labels, in which case the POMDP reduces to a trivial MDP.
For a given observation function, theorem 1 finds the image values for policies arbitrarily close to the optimal policy. However, these values depend on the POMDP dynamics, in particular on the observation function. We propose to improve the observation function by increasing the total reward collected by the near-optimal policy. For any policy $\pi$ and observation function $Z_\theta$, the total reward is defined as,
(9) 
where the integration domain is the belief simplex over the $|S|$ object labels. Changing the observation function parameters may result in increased likelihood of images for the corresponding object label. Theorem 2 presents a gradient ascent update rule for the parameters of the observation function, with the goal of increasing the total reward.
Theorem 2. Given a policy $\pi$ and the corresponding value function $V^\pi$, the gradient of the total reward with respect to the parameters $\theta$ of the observation function is given by,
(10) 
Proof. See the supplementary notes.
Intuitively speaking, the update rule in (10) weights each parameter by the value of the belief that is reached by observing the corresponding observation $o$. Changing the observation function parameters changes the belief MDP dynamics, as the transition probabilities in (2) depend on $Z_\theta$. After updating the observation function, a new belief MDP is reached, which can be solved approximately using theorem 1. In practice, evaluating (10) is computationally intractable. In the results section, we describe a simple procedure for updating the observation function based on value-weighted updates.
IV Results
IV-A BTS for Training Set
In this section, we apply the proposed method in algorithm 1 to active recognition on GERMS [18]. This is a medium-size dataset with images of 136 different objects collected by a robot. The robot grabs each object in 10 different orientations and examines it by rotating it in front of the camera. The goal is to recognize objects in 4 test orientations, given the other 6 in-hand orientations. GERMS has proved to be a challenging dataset, since separating its objects requires extraction of small visual cues and fine categorization.
We extract visual features from GERMS images using the ResNet deep CNN model [11]. A softmax layer is trained on top of these features to predict the object label. We then convert the train and test images into belief vectors, and train our AOR method in the belief space. We normalize the output of the softmax layer for each class to sum to 1 over all GERMS images and use that as the observation probability. This ensures that the deep CNN outputs a likelihood, and maintains the integrity of (2) and (5).
We calculate the likelihood of each image and use algorithm 1 to calculate the value of the near-optimal policy for each image in the training set. In order to use these values in planning, [16] proposes a sampling approach that repeatedly executes algorithm 1 for different simulations, augments the tree with newly discovered beliefs, and finally uses the action-values of the root of the resulting tree for planning. Algorithm 1 is similar to the approach proposed in [16]; however, we make use of the similar belief vectors in the dataset to run the simulations. We found the action-values of the root of the tree to be very effective for AOR.
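The per-class normalization of softmax outputs described above can be sketched as follows: each class column is divided by its sum over all images, so that for a fixed class the values form a distribution over images. This is a minimal illustration; the function name `softmax_to_likelihood` and the example numbers are hypothetical.

```python
def softmax_to_likelihood(posteriors):
    """Turn softmax outputs into per-class observation probabilities.

    posteriors[i][s] is the softmax output for image i and class s.
    Returns likelihood[i][s] = posteriors[i][s] / sum_j posteriors[j][s],
    so every class column sums to 1 over the images.
    """
    num_classes = len(posteriors[0])
    column_sums = [sum(row[s] for row in posteriors)
                   for s in range(num_classes)]
    return [[row[s] / column_sums[s] for s in range(num_classes)]
            for row in posteriors]


# Two images, two classes.
likelihood = softmax_to_likelihood([[0.9, 0.1], [0.3, 0.7]])
```

The rows of the result no longer sum to 1; they are meant to be used as likelihood terms in the Bayes belief update, where the normalization over labels is applied afterwards.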
After we extract the action-values for each training image, we transfer the knowledge of these action-values to the test set. We compare three approaches for learning a policy from these action-values. In the first and second approaches, we use Neurally Fitted Q-learning (NFQ) [28] and Actor-Critic (AC) [26], guided by a probabilistic policy that uses the action-values from BTS. We show that guiding these two approaches results in a slight improvement of average performance on the test set compared to the plain versions. In the third approach, we use an LSTM network to learn to predict the best action from the BTS action-values. We show that the LSTM network is superior in performance to the NFQ and AC approaches.
IV-B Guided Neurally Fitted Q-learning
Neurally Fitted Q-learning (NFQ) trains a neural network to predict the action-values using the reward signal from the environment [28]. This algorithm has been successfully applied to reinforcement learning benchmarks [28], playing Atari games [22] and active object recognition [18]. At the heart of this approach is an iterative update rule for the network parameters $\theta$,
(11) 
where the network outputs action-values $Q(s, a; \theta)$ for action $a$ in state $s$. In the above, the gradient operator on the right side applies only to $Q(s_t, a_t; \theta)$. We previously observed that the plain NFQ algorithm may fail to discover optimal policies for active object recognition [17]. Instead, we employ the $n$-step extension of this algorithm, proposed in [19], in which the update rule in (11) is applied to action sequences of length $n$. The $n$-step NFQ speeds up learning by updating $n$ action-values in each iteration, compared to a single action-value update in the original NFQ. All experiments reported here are obtained using $n$-step sequences of actions.
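To make the multi-step idea concrete, the sketch below computes one regression target per step of an action sequence by accumulating discounted rewards backwards from the end, in the spirit of the multi-step returns of [19]. It is an assumption about how the targets could be formed, not the exact target of the update rule; the function name `n_step_targets` and the numbers are hypothetical.

```python
def n_step_targets(rewards, bootstrap_value, gamma):
    """Discounted targets for every step of one action sequence.

    rewards[t] is the reward collected after the t-th action.
    bootstrap_value estimates the value of the state reached after the
    last action (use 0.0 when the sequence ends the episode).
    """
    targets = []
    running = bootstrap_value
    for r in reversed(rewards):
        running = r + gamma * running  # one backward accumulation step
        targets.append(running)
    targets.reverse()
    return targets


# One sequence of three actions, no bootstrap value, discount 0.5.
targets = n_step_targets([1.0, 0.0, 2.0], bootstrap_value=0.0, gamma=0.5)
```

Each entry can serve as the regression target for the action-value predicted at the corresponding step, which is why a single sequence updates several action-values at once.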
We further improve the performance of NFQ by employing the importance sampling framework for policy improvement [13]. The idea behind this approach is to use an auxiliary policy $\pi'$ to acquire sequences of actions and states $\tau = (s_1, a_1, s_2, a_2, \ldots)$, and to update the parameters of the target policy $\pi$ using these sequences. In order to obtain an unbiased estimator, the gradients of the policy are weighted by their importance, defined as,
$$\rho(\tau) = \frac{P(\tau \mid \pi)}{P(\tau \mid \pi')} = \prod_{t} \frac{\pi(a_t \mid s_t)}{\pi'(a_t \mid s_t)}, \tag{12}$$
where $P(\tau \mid \pi)$ is the probability of sequence $\tau$ under policy $\pi$, and $\pi(a \mid s)$ is the probability of action $a$ in state $s$ under policy $\pi$. We implement the guided NFQ (GNFQ) by drawing sequences of actions from a stochastic policy acquired by applying a softmax to the BTS action-values. The gradients in (11) are then multiplied by their importance in (12) and applied to the network. Figure 2 shows the comparison of the mean AOR performance of NFQ and GNFQ on the GERMS test data. For both approaches, we show the interval of performance as a shaded area. For comparison, we report the performance of a random policy (RND), where at each time step a randomly chosen action is taken to explore the object. We see that the plain NFQ algorithm fails to perform any better than RND, while GNFQ performs better than random. The advantage of GNFQ over NFQ and RND is most significant for the first action, and it gradually decreases over the next four actions. After the last action, where the majority of evidence has been accumulated, all three methods perform similarly.
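The guiding scheme can be illustrated with a short sketch: a guide policy is obtained by a softmax over action-values, and the importance of a sampled action sequence is the product of per-step probability ratios between the target policy and the guide policy. The `temperature` parameter, the function names and the numbers are assumptions made for this example.

```python
import math


def softmax(values, temperature=1.0):
    """Softmax over action-values; `temperature` is an assumed extra knob."""
    m = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def importance_weight(target_probs, guide_probs, actions):
    """Product over time of pi_target(a_t | s_t) / pi_guide(a_t | s_t).

    target_probs[t] and guide_probs[t] are the action distributions of the
    two policies at step t; actions[t] is the action that was sampled.
    """
    weight = 1.0
    for t, a in enumerate(actions):
        weight *= target_probs[t][a] / guide_probs[t][a]
    return weight


# Two steps, two actions; the guide policy comes from equal action-values.
guide = [softmax([1.0, 1.0]), softmax([1.0, 1.0])]
target = [[0.8, 0.2], [0.5, 0.5]]
weight = importance_weight(target, guide, actions=[0, 1])
```

Sequences that the target policy finds more likely than the guide policy receive a weight above 1, and the corresponding gradient terms are scaled up accordingly.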
IV-C Guided Actor-critic
Actor-critic is a policy learning method which updates the policy parameters using gradients of the expected reward [26]. To reduce the variance of gradient estimation, the predicted value is subtracted from the reward,
(13) 
(14) 
where $r_t$ is the collected reward at time $t$, $V(s_t; \theta_v)$ is the predicted value for $s_t$ parameterized by $\theta_v$, and $\pi(a_t \mid s_t; \theta)$ is the probability that policy $\pi$, parameterized by $\theta$, assigns to action $a_t$ in state $s_t$. In figure 3, we compare the AOR performance of the $n$-step actor-critic method with (ACG) and without (AC) guiding. We used the same guiding scheme as described above, multiplying the gradient terms of (13) and (14) by their importance (12). We see that plain AC fails to perform better than random, while ACG shows higher performance than RND after the second and third actions.
IV-D Supervised Learning of Action Values
To transfer the knowledge of near-optimal values from the training set to the test set, we train a neural network to directly predict the action-values given a belief. We use a stack of 3 LSTM layers with 128 units in each layer, followed by a softmax layer that predicts action-values (see Figure 4). The training sequences are produced by following a probabilistic policy derived from the BTS action-values. For each belief vector, the target action is the one with the highest action-value. We found that it is crucial to the performance of AOR to use sequences of data for supervised action prediction. Figure 5 compares the performance of the LSTM and random policies. The LSTM has a clear advantage in performance over the random policies. Moreover, the variance of the learned policies is significantly smaller than for the actor-critic and NFQ methods. Figure 6 compares the average performance of the LSTM with the previous methods. We see that supervised learning for action-value prediction is clearly superior to the policy learning methods. Table I shows the comparison of all methods in more detail.
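The construction of supervised training sequences can be sketched as follows: actions are sampled from a softmax over the stored action-values to move between beliefs, while the label at each step is the action with the highest action-value. The node layout (`belief`, `q`, `next`), the function names and the toy values are hypothetical; a sequence model such as the LSTM would then be trained on the resulting pairs.

```python
import math
import random


def sample_action(action_values, rng):
    """Sample an action from a softmax over action-values."""
    m = max(action_values)
    weights = [math.exp(q - m) for q in action_values]
    return rng.choices(range(len(action_values)), weights=weights, k=1)[0]


def build_sequence(start_node, nodes, length, rng):
    """Roll out `length` steps and return (beliefs, targets).

    nodes[i] is a dict with keys "belief" (belief vector), "q" (stored
    action-values) and "next" (index of the node reached by each action).
    The target at each step is the argmax action, not the sampled one.
    """
    beliefs, targets = [], []
    node = start_node
    for _ in range(length):
        q = nodes[node]["q"]
        beliefs.append(nodes[node]["belief"])
        targets.append(max(range(len(q)), key=q.__getitem__))
        node = nodes[node]["next"][sample_action(q, rng)]
    return beliefs, targets


# Toy graph with two beliefs and two actions.
NODES = [
    {"belief": [0.5, 0.5], "q": [0.2, 0.9], "next": [0, 1]},
    {"belief": [0.9, 0.1], "q": [0.7, 0.1], "next": [1, 0]},
]
beliefs, targets = build_sequence(0, NODES, length=4, rng=random.Random(0))
```

Sampling the rollout actions while labelling with the best action exposes the model to beliefs off the greedy path, yet always teaches it the greedy choice.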
Method   0      1      2      3      4      5
RND      0.590  0.677  0.714  0.736  0.749  0.758
AC       0.590  0.677  0.713  0.735  0.748  0.757
ACG      0.590  0.678  0.717  0.741  0.754  0.760
NFQ      0.590  0.677  0.713  0.736  0.750  0.758
NFQG     0.590  0.688  0.717  0.738  0.748  0.758
LSTM     0.590  0.694  0.732  0.746  0.754  0.757
LSTMi2   0.614  0.715  0.751  0.769  0.778  0.785
LSTMi3   0.617  0.718  0.758  0.776  0.790  0.793
IV-E Generalization to Novel Objects
In this section, we test the generalization of the proposed AOR method to novel objects. The goal of this experiment is to understand how much object inspection knowledge is transferable among different objects of GERMS. For this purpose, we use 60% of the objects in GERMS for training the LSTM method, and the rest of the objects for testing. The results, averaged over 20 different experiments, are shown in figure 7. Overall, the variance of the results is high because of the large variations in accuracy across GERMS objects. We see that the proposed method achieves better performance than random. The random strategy has more difficulty finding informative moves here than in the novel views experiment.
IV-F Improving Observation Function
In this section, we retrain the observation function using the proposed gradient update rule in (10). In order to implement this update rule, we adopt a sampling strategy, generating a set of rollouts using the policy derived from the BTS action-values. Let $b_i$ denote the belief vector corresponding to image $o_i$. Let $H_i$ denote the set of beliefs and actions that resulted in $b_i$ in the rollouts. The retraining weight of $o_i$ is then calculated using a sample average of (10) on these rollouts,
(15) 
We retrain the softmax layer by weighting the cross-entropy cost of each image using (15). After retraining, we run the BTS algorithm on the resulting beliefs and train an LSTM on the resulting action-values. We show the performance of the retrained LSTM as LSTMi2 in figure 8. The retrained LSTM achieves a significant improvement over LSTM. We can repeat this procedure and see how much improvement can be achieved. In practice, we observed that the performance of LSTM starts to decline after the second iteration. The performance of the retrained LSTM is shown in table I.
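A value-weighted retraining step of this kind can be sketched as a weighted cross-entropy. The sketch makes two assumptions that are not confirmed by the text: the per-image weight is taken as the plain sample mean of the values recorded when rollouts reach that image (a stand-in for the weight in (15)), and the loss is normalized by the total weight. All names and numbers are hypothetical.

```python
import math


def rollout_weights(visits):
    """Sample-mean weight per image.

    visits maps an image id to the list of values recorded each time a
    rollout reached that image; images that were never reached are skipped.
    """
    return {img: sum(vals) / len(vals) for img, vals in visits.items() if vals}


def weighted_cross_entropy(probs, labels, weights):
    """Cross-entropy where each image's term is scaled by its weight.

    probs[i] is the predicted class distribution for image i, labels[i] its
    true class index, weights[i] its retraining weight.
    """
    total_weight = sum(weights)
    loss = 0.0
    for p, y, w in zip(probs, labels, weights):
        loss -= w * math.log(p[y])
    return loss / total_weight


weights = rollout_weights({"img_a": [1.0, 3.0], "img_b": []})
loss = weighted_cross_entropy(
    probs=[[0.5, 0.5], [0.25, 0.75]], labels=[0, 1], weights=[1.0, 0.0])
```

An image with zero weight contributes nothing to the loss, so the classifier is pushed towards views that carry value during exploration.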
V Conclusions
Active object recognition has received little attention from the mainstream computer vision and machine learning communities despite its potential for improving recognition performance. Progress on AOR has been slow due to the reliance of AOR models on semi-supervised or heuristic methods for discovering object inspection policies. In this work, we proposed a method that learns the object exploration policy in a supervised manner using training data. The proposed method has desirable properties: it is very fast at test time, and it can generalize to novel objects since it does not require a generative model of objects.
Recently there has been a large interest in attention models, mostly for reasons besides object recognition. These models rely on reinforcement learning [21] or variational inference [3] to guide the system to discover suitable policies. In contrast, our proposed method benefits from reduced training time and is free of the local optima that are a large problem in these approaches. By reducing optimal policy inference to a supervised learning problem, one can use recent advances in supervised visual recognition for learning policies for test data.
We developed a weighting scheme for training on single images that emphasizes images that result in higher value during exploration. The weight of each image denotes how useful this image is in achieving correct classification in the future. Such a weighting scheme potentially reduces overfitting of the single-image classification model to images that have no discriminative information. Because of the complex backgrounds in GERMS images, there is a high chance of overfitting to background cues during training. By performing BTS, the AOR system has the opportunity to calculate the value of each image, and to direct the single-image classification model to invest in more informative views of objects.
Acknowledgment
The research presented here was funded by NSF IIS 0968573 SoCS, IIS INT2Large 0808767, and NSF SBE0542013 and in part by US NSF ACI1541349 and OCI1246396, the University of California Office of the President, and the California Institute for Telecommunications and Information Technology (Calit2).
References
 [1] J. Aloimonos, I. Weiss, and A. Bandyopadhyay. Active vision. International journal of computer vision, 1(4):333–356, 1988.
 [2] N. Atanasov, B. Sankaran, J. Le Ny, G. J. Pappas, and K. Daniilidis. Nonmyopic view planning for active object classification and pose estimation. IEEE Transactions on Robotics, 30(5):1078–1090, 2014.
 [3] J. Ba, V. Mnih, and K. Kavukcuoglu. Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755, 2014.
 [4] R. Bajcsy. Active perception. Proceedings of the IEEE, 76(8):966–1005, 1988.
 [5] H. Borotschnig, L. Paletta, M. Prantl, and A. Pinz. Appearance-based active object recognition. Image and Vision Computing, 18(9):715–727, 2000.
 [6] B. Browatzki, V. Tikhanoff, G. Metta, H. H. Bülthoff, and C. Wallraven. Active object recognition on a humanoid robot. In Robotics and Automation (ICRA), 2012 IEEE International Conference on, pages 2021–2028. IEEE, 2012.
 [7] B. Browatzki, V. Tikhanoff, G. Metta, H. H. Bülthoff, and C. Wallraven. Active in-hand object recognition on a humanoid robot. Robotics, IEEE Transactions on, 30(5):1260–1269, 2014.
 [8] F. G. Callari and F. P. Ferrie. Active object recognition: Looking for differences. International Journal of Computer Vision, 43(3):189–204, 2001.
 [9] J. Denzler, C. M. Brown, and H. Niemann. Optimal camera parameter selection for state estimation with applications in object recognition. In Pattern Recognition, pages 305–312. Springer, 2001.
 [10] A. Haque, A. Alahi, and L. Fei-Fei. Recurrent attention models for depth-based person identification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
 [11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
 [12] D. Jayaraman and K. Grauman. Look-ahead before you leap: end-to-end active recognition by forecasting the effect of motion. In European Conference on Computer Vision, pages 489–505. Springer, 2016.
 [13] T. Jie and P. Abbeel. On a connection between importance sampling and the likelihood ratio policy gradient. In Advances in Neural Information Processing Systems, pages 1000–1008, 2010.
 [14] E. Johns, S. Leutenegger, and A. J. Davison. Pairwise decomposition of image sequences for active multi-view recognition. arXiv preprint arXiv:1605.08359, 2016.
 [15] L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In European conference on machine learning, pages 282–293. Springer, 2006.
 [16] W. S. Lee, N. Rong, and D. J. Hsu. What makes some pomdp problems easy to approximate? In Advances in neural information processing systems, pages 689–696, 2007.
 [17] M. Malmir, K. Sikka, D. Forster, I. Fasel, J. R. Movellan, and G. W. Cottrell. Deep active object recognition by joint label and action prediction. Computer Vision and Image Understanding, 2016.
 [18] M. Malmir, K. Sikka, D. Forster, J. Movellan, and G. W. Cottrell. Deep Q-learning for active recognition of GERMS: Baseline performance on a standardized dataset for active learning. In Proceedings of the British Machine Vision Conference (BMVC), pages 161–1. BMVA, 2015.
 [19] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016.
 [20] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, pages 2204–2212, 2014.
 [21] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2204–2212. Curran Associates, Inc., 2014.
 [22] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 [23] M. Naghshvar, T. Javidi, et al. Active sequential hypothesis testing. The Annals of Statistics, 41(6):2703–2738, 2013.
 [24] L. Paletta and A. Pinz. Active object recognition by view integration and reinforcement learning. Robotics and Autonomous Systems, 31(1):71–86, 2000.
 [25] T. Patten, W. Martens, and R. Fitch. Monte carlo planning for active object classification. Autonomous Robots, pages 1–31, 2017.
 [26] J. Peters and S. Schaal. Reinforcement learning of motor skills with policy gradients. Neural networks, 21(4):682–697, 2008.
 [27] J. Pineau, G. Gordon, S. Thrun, et al. Point-based value iteration: An anytime algorithm for POMDPs. In IJCAI, volume 3, pages 1025–1032, 2003.
 [28] M. Riedmiller. Neural fitted q iteration–first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pages 317–328. Springer, 2005.
 [29] B. Schiele and J. L. Crowley. Transinformation for active object recognition. In Computer Vision, 1998. Sixth International Conference on, pages 249–254. IEEE, 1998.
 [30] D. Silver and J. Veness. Monte-Carlo planning in large POMDPs. In Advances in neural information processing systems, pages 2164–2172, 2010.
 [31] A. Somani, N. Ye, D. Hsu, and W. S. Lee. Despot: Online pomdp planning with regularization. In Advances in neural information processing systems, pages 1772–1780, 2013.
 [32] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.