
Deep Active Object Recognition by Joint Label and Action Prediction

An active object recognition system has the advantage of being able to act in its environment to capture images that are better suited for training and that lead to better performance at test time. In this paper, we propose a deep convolutional neural network for active object recognition that simultaneously predicts the object label and selects the next action to perform on the object with the aim of improving recognition performance. We treat active object recognition as a reinforcement learning problem and derive the cost function used to train the network for joint prediction of the object label and the action. A generative model of object similarities based on the Dirichlet distribution is proposed and embedded in the network for encoding the state of the system. Training is carried out by simultaneously minimizing the label and action prediction errors using gradient descent. We empirically show that the proposed network is able to predict both the object label and the actions on GERMS, a dataset for active object recognition. We compare the test label prediction accuracy of the proposed model under Dirichlet and Naive Bayes state encodings. The experimental results suggest that the model equipped with Dirichlet state encoding is superior in performance, and selects images that lead to better training and higher accuracy of label prediction at test time.





1 Introduction

A robot interacting with its environment can collect large volumes of dynamic sensory input to overcome many challenges presented by static data. A robot that manipulates an object while controlling its camera orientation, for example, constitutes an active object recognition system. In such dynamic interactions, the robot can select the training data for its models of the environment, with the goal of maximizing the accuracy with which it perceives its surroundings. In this paper, we focus on active object recognition (AOR) with the goal of developing a model that a robot can use to recognize an object held in its hand.

There are a variety of approaches to active object recognition, the goal of which is to re-position sensors or change the environment so that the new inputs to the system become less ambiguous for label prediction Aloimonos1988 ; Bajcsy1988 ; Denzler2001 . An issue with previous approaches to active object recognition is that they mostly used small, simplistic datasets, which were not reflective of the challenges in real-world applications mmalmir2015 . To avoid this problem, we collected a large dataset for active object recognition, called GERMS, which contains more than 120K high resolution (1920x1080) RGB images of 136 different plush toys. This paper extends our previous work on Deep Q-learning mmalmir2015 , where an action selection network was trained on top of a pre-trained convolutional neural network. Here we extend the model to train the network end-to-end on GERMS images to jointly predict object labels and action values.

This paper makes two primary contributions. First, we develop a deep active object recognition (DAOR) model to jointly predict the label and the best next action on an input image. We propose a deep convolutional neural network that outputs the object labels and action-values in different layers of the network. We use reinforcement learning to teach the network to predict the action values, and minimize the action value prediction error along with the label prediction cross entropy. The visual features in the early stages of this network are learned to minimize both errors. The second contribution of this work is to embed a generative Dirichlet model of object similarities for encoding the state of the system. This model integrates information from different images into a vector, based on which actions are calculated to optimize performance. We embed this model as a layer in the network and derive the learning rule for updating the Dirichlet parameters using gradient descent. We conduct a series of experiments on the GERMS dataset to (1) test whether the model can be trained jointly for label and action prediction, (2) measure how effective the proposed Dirichlet state encoding is compared to the more traditional Naive Bayes approach, and (3) examine some of the properties of the learned policies.

In the next section, we review some of the previous approaches to active object recognition and examine the datasets they used. Next, we introduce the GERMS dataset and describe the training and testing data used for the experiments in this paper. After that, we describe the details of the proposed network and Dirichlet state encoding, going into the details of the cost functions and update rules for the different layers of the network. In the results section, we report the properties of the proposed network and compare its performance in different scenarios. The final section presents concluding remarks.

2 Literature Review

Active object recognition systems include two modules: A recognition module and a control module. Given a sequence of images, the recognition module produces a belief state about the objects that generated those images. Given this belief state, the control module produces actions that will affect the images observed in the future Denzler2001 . The controller is typically designed to improve the speed and accuracy of the recognition module.
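The two-module loop can be sketched as follows; `classify`, `update_belief`, and `choose_action` are hypothetical stand-ins for the recognition and control modules described above, not an implementation from the paper.

```python
import random

NUM_CLASSES = 3

def classify(image):
    # Stand-in recognition module: returns a normalized belief vector.
    rng = random.Random(image)            # deterministic toy "classifier"
    b = [rng.random() + 1e-6 for _ in range(NUM_CLASSES)]
    s = sum(b)
    return [v / s for v in b]

def update_belief(state, belief):
    # Accumulate evidence: elementwise product, then normalize.
    new = [s * b for s, b in zip(state, belief)]
    z = sum(new)
    return [v / z for v in new]

def choose_action(state, actions):
    # Stand-in control module: a real controller would map the belief
    # state to the action expected to reduce label ambiguity.
    return actions[0]

def aor_episode(images, actions, num_moves=3):
    state = [1.0 / NUM_CLASSES] * NUM_CLASSES     # uniform initial belief
    for image in images[:num_moves]:
        state = update_belief(state, classify(image))
        _action = choose_action(state, actions)   # would drive the sensor
    return state.index(max(state))                # predicted label
```

The controller here is a placeholder; the rest of the paper is about learning it from reward.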

One of the earliest active systems for object recognition was developed by Wilkes and Tsotsos Wilkes1992 . They used a heuristic procedure to bring the object into a 'standard' view using a robotic-arm-mounted camera. In a series of experiments on 8 Origami objects, they qualitatively report promising results for achieving the standard view and retrieving the correct object labels. Seibert and Waxman explicitly model the views of an object by clustering the images acquired from the object's view-sphere into aspects Seibert1992 . The correlation matrices between these aspects are then used in an aspect network to predict the correct object label. Using three model aircraft objects, they show that the belief over the correct object improves with the number of observed transitions compared to randomly generated paths on the view sphere of these objects.

Schiele and Crowley developed a framework for active object recognition by making an analogy between object recognition and information transmission Schiele1998 . They try to minimize the conditional entropy between the original object and the observed signal. They used the COIL-100 dataset for their experiments, which consists of 7200 images of 100 toy objects rotated in depth Nene1996 . This dataset has been appealing for active object recognition because it provides systematically defined views of objects. At test time, by sequentially moving to the most and second most discriminative views of each object, Schiele and Crowley achieved almost perfect recognition accuracy on this dataset.

Borotschnig et al. formulate observation planning in terms of maximizing the expected entropy loss over actions Borotschnig2000 . Larger entropy loss is equivalent to less ambiguity in interpreting the image. With an active vision system consisting of a turntable and a moving camera, they report improvements in object recognition over random selection of the next viewing pose on a small set of objects. Callari and Ferrie take into account the object modeling error and search for actions that simultaneously minimize both modeling variance and uncertainty of the belief over objects. Using a set of 10 custom clay objects, they report a decrease in the entropy of the classifier output and in the Kullback-Leibler divergence between the posterior distribution of each object and the corresponding true distribution.

Browatzki et al. use a particle filter approach to determine the viewing pose of an object held in-hand by an iCub humanoid robot Browatzki2012 ; Browatzki2014 . For selecting the next best action, instead of maximizing the expected information gain, which is computationally expensive, they maximize a measure of variance of observations across different objects. They show that their method is superior to random action selection on small sets of custom objects. Atanasov et al. focus on the comparison of myopic greedy action selection, which looks ahead only one step, with non-myopic action selection, which considers several time steps into the future Atanasov2014 . They formulate the problem as a Partially Observable Markov Decision Process, showing their method is superior to random and greedy selection of actions on a small set of household objects.

Rebguns et al. used acoustic properties of objects to learn an infomax controller to recognize a set of 10 objects Rebguns2011 . In this work, they proposed a Dirichlet-based model to fuse information from different observations into a single belief vector. Using this latent variable mixture model for acoustic similarities, the robot learned to rapidly reduce uncertainty about the categories of the objects in a room. The state encoding of our system is similar to the mixture model of this work; however, we embed the model in the network and train its parameters using gradient descent, which is better suited to neural networks.

Paletta & Pinz Paletta2000 treat active object recognition as an instance of a reinforcement learning problem, using Q-learning to find the optimal policy. They used an RBF neural network with the reward function depending on the amount of entropy loss between the current and the next state.

A common trend in many of these approaches is the use of small, sometimes custom-designed sets of objects. There are medium-sized datasets such as COIL-100, which consists of 7200 images of 100 toy objects rotated in depth Nene1996 . This dataset is not adequately challenging for several reasons, including the simplicity of the image background and the high similarity of different views of the objects due to single-track recording sessions. What is missing is a challenging dataset for active object recognition with inherent similarities among different object categories. The dataset should also be large enough to train models with a large number of parameters, such as deep convolutional neural networks. In the next section, we describe GERMS, the large and challenging dataset for active object recognition used for the experiments in this paper.

3 The GERMS Dataset

The GERMS dataset was collected in the context of the RUBI project, whose goal is to develop robots that interact with toddlers in early childhood education environments Malmir2012 ; Movellan2013 ; mmalmir2015 . This dataset consists of 1365 video recordings of give-and-take trials using 136 different objects. The objects are soft toys depicting various human cell types, microbes and disease-related organisms. Figure 1 shows the entire set of these toys. Each video consists of the robot (RUBI) bringing the grasped object to its center of view, rotating it, and then returning it. The dataset was recorded from RUBI's head-mounted camera at 30 frames per second.

Figure 1: The GERMS dataset. The objects represent human cell types, microbes and disease-related organisms.

The data for GERMS were collected over two days. On the first day, each object was handed to RUBI in one of 6 pre-determined poses, 3 for each arm, after which RUBI grabbed the object and captured images while rotating it. The robot also recorded the positions of its joints for every captured image. On the second day, we asked a set of human subjects to hand the GERM objects to RUBI in poses they considered natural. A total of 12 subjects participated in test data collection, with each subject handing between 10 and 17 objects to RUBI. For each object, at least 4 different test poses were captured. The background of the GERMS dataset was provided by a large-screen TV displaying video scenes from the classroom in which RUBI operates, including toddlers and adults moving around.

We use half of the data collected on each of days 1 and 2 for training and the other half for testing. More specifically, three of the six tracks for each object from day 1 and two randomly selected tracks for each object from day 2 were used for training the network; the rest were used for testing. Table 1 shows the statistics of the training and testing data used for the experiments in this paper.

        Number of tracks    Images per track    Total number of images
Day 1                                           76,722
Day 2                                           51,561
Table 1: GERMS dataset statistics (mean ± std).

4 Proposed Network

The traditional view of an active object recognition pipeline usually treats the visual recognition and action learning problems separately, with visual features being fixed when learning actions. In this work, we try to solve both problems simultaneously to reduce the training time of an AOR model. By incorporating the errors from action prediction into visual feature extraction, we hope to acquire features that are suited for both label and action prediction.

The proposed network is shown in figure 2. The input image is first transformed into a set of beliefs over different object labels by a classification network. This belief is then combined with the previously observed belief vectors to produce an encoding of the state of the system; this is done by the mixture belief update layer of the network. The accumulated belief is then transformed into action-values, which are used to select the next input image.

Figure 2: The proposed network for active object recognition. The red arrows representing the target values indicate the layer at which the target values are used to train the network. The numbers represent the number of units in each layer of the network. See table 2 for more details.

We next detail each part of the network, describing the challenges and their corresponding solutions. We first address the transformation of images into beliefs over object classes. Then we tackle belief accumulation over observed images, followed by action learning, and finally we present the full description of the algorithm used to train this model.

4.1 Single Image Classification

The goal of this part of the network is to transform a single image into beliefs over different object labels. The feature extraction stage comprises 3 convolution layers followed by 3 fully connected layers. The dimensions of each layer are shown in figure 2, and the number of parameters in each layer of the network is shown in table 2. The operations of each layer are inspired by the model proposed in alexnet . Each convolution layer is followed by rectification, normalization across channels, and max pooling with a stride of 1. Dropout is applied to ReLU1 and ReLU2.
We denote the GERMS dataset by $D = \{(x_i, y_i, g_i)\}$, where $x_i$ is the image captured by the robot camera, $y_i$ is the object label, and $g_i$ is a positive integer denoting the pose of the robot's gripper mmalmir2015 . In order to learn the weights of the single image classification part, we perform gradient descent on the action prediction and cross-entropy costs, denoted by $L_a$ and $L_c$ respectively. The cross-entropy classification cost is:

$$L_c = -\sum_{j} \mathbb{1}[y_i = j] \log p_j(x_i) \qquad (1)$$

Here $\mathbb{1}[y_i = j]$ is the indicator function for the class of the object and $p_j(x_i)$ is the predicted label belief for the image corresponding to class $j$. The next subsection describes the action prediction cost $L_a$.
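As a toy numeric sketch of this cost (the function and variable names are ours, not the paper's):

```python
import math

def cross_entropy_cost(belief, label):
    """Negative log-probability the model assigns to the true class."""
    return -math.log(belief[label])

belief = [0.7, 0.2, 0.1]             # softmax output for a 3-class toy case
low = cross_entropy_cost(belief, 0)  # confident and correct: small loss
high = cross_entropy_cost(belief, 2) # true class only got 0.1: large loss
```

Only the term for the true class survives the indicator, so the cost reduces to the negative log of the belief in the correct label.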

              Number of units    Input to unit    Num. parameters
Conv1         64x30x30                            9K
Conv2         128x13x13                           204K
Conv3         256x11x11                           294K
ReLU1         256                30976            7M
ReLU2         256                256              65K
Softmax       136                256              34K
State update  1360               136              184K
ReLU3         256                1360+256         413K
ReLU4         256                256              65K
LU2           10                 256              2K

Table 2: Number of units and parameters for the proposed network.

4.2 Action Value Prediction

Active object recognition can be treated as a reinforcement learning problem, whose goal is to learn an optimal policy $\pi: S \rightarrow A$ from states $S$ to actions $A$. The optimal policy is expected to maximize the total reward for every interaction sequence with the environment, where $s_t \xrightarrow{a_t} s_{t+1}$ denotes the transition from $s_t$ to $s_{t+1}$ by performing action $a_t$. The total reward for an interaction sequence is $R = \sum_{t} \gamma^{t} r(s_t, a_t)$, where $r$ is a reward function and $\gamma \in [0, 1]$ is a discount factor used to emphasize rewards closer in time. For an AOR system, an interaction sequence starts by observing an image of the object in its initial orientation in the robot's gripper. The state of the system is then updated by the observed image, and an action is selected to perform on the object so as to maximize the total reward. The reward at each step is determined by the accuracy of the predicted label for the images observed up to that step.
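As a one-line check of the discounted return described above (the reward values and discount factor here are illustrative):

```python
def total_reward(rewards, gamma=0.9):
    # R = sum_t gamma^t * r_t; smaller gamma emphasizes earlier rewards.
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

r = total_reward([1, 1, 1], gamma=0.5)   # 1 + 0.5 + 0.25 = 1.75
```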

In order to learn the optimal policy, we use the Q-learning algorithm to train the network to predict actions for improved classification watkins1989 . This is a model-free method that learns to predict the expected reward of actions in each state. More specifically, let $Q(s, a)$ be the action value for state $s$ and action $a$, which is the expected total reward for performing action $a$ in state $s$. Let the agent interact with the environment to produce a set of interaction sequences. Q-learning then learns a policy by applying the following update rule to every observed transition $s_t \xrightarrow{a_t} s_{t+1}$,

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \eta \left( r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right) \qquad (2)$$

where $\eta$ is the learning rate, and action $a_t$ is selected using an epsilon-greedy version of the learned policy. We interpret this iterative update in the following way in order to train a neural network. Let the output layer of the network predict $Q(s, a)$ under the learned policy for every possible action $a$. Then a practical approximation of the optimal policy is obtained by minimizing the reinforcement learning cost,

$$L_a = \sum_{t} \left( Q(s_t, a_t) - r_t - \gamma \max_{a} Q(s_{t+1}, a) \right)^2 \qquad (3)$$

In the proposed network, action value prediction is done by transforming the state of the system at time $t$ through layers ReLU3, ReLU4 and LU2. We train the weights of the network in these layers by minimizing $L_a$. In the next subsection, we go into the details of state encoding, and after that we describe the set of actions.
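A minimal tabular sketch of the Q-learning update described above; the network version instead minimizes the squared difference between the two sides of the update. Names are ours:

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, eta=0.1, gamma=0.9):
    # Q(s,a) += eta * (r + gamma * max_a' Q(s',a') - Q(s,a))
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += eta * (target - Q[(s, a)])

Q = defaultdict(float)        # all action values start at zero
actions = [0, 1]
q_update(Q, "s0", 0, 1.0, "s1", actions)   # Q("s0", 0) becomes 0.1
```

With all values initialized to zero, the first update moves Q("s0", 0) a fraction eta of the way toward the observed reward.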

4.3 State Encoding

State encoding has a prominent effect on the performance of an AOR system. Based on the current state of the system, an action is selected that is expected to decrease the ambiguity about the object label. An appealing choice is to transform images into beliefs over different target classes and use them as the state of the system. Based on the target label beliefs, the system decides to perform an action to improve its target label prediction. What we expect from the AOR system is to guide the robot to pick object views that are more discriminative among target classes.

We first transform the input image $x_t$ into a belief vector $b_t$ over target classes using the first 7 layers of the network. The produced label belief vector is then combined with the previously observed belief vectors from this interaction sequence to form the state of the system. The motivation for this encoding is that the combined belief encodes the ambiguity of the system about target classes and thus can be used to navigate to more discriminative views of objects. Active object recognition methods usually adopt a Naive Bayes approach to combining beliefs from different observations. Assume that in an interaction sequence, a sequence of images $x_1, \dots, x_t$ has been observed and their corresponding beliefs $b_1, \dots, b_t$ have been calculated. The state of the system at time $t$ is calculated using Naive Bayes belief combination, which takes the product of the individual belief vectors and then normalizes,

$$P(c \mid x_1, \dots, x_t) \propto \prod_{\tau=1}^{t} b_\tau(c)$$

where $c$ is the target label and $b_\tau(c)$ is the belief produced for class $c$ by single image classification. Here we assumed a uniform prior over images and target labels. The problem with Naive Bayes is that if an image is observed repeatedly, the resulting state changes with the number of repetitions. This is undesirable, since the state of the system changes under repeated observations of an image even though no new information is added to the system. If a specific image is good for classification, the system can visit that image more often to artificially increase its apparent performance. To avoid this problem, we adopt a generative model based on the Dirichlet distribution to combine different belief vectors.
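A small sketch of the Naive Bayes combination described above, illustrating the repeated-observation problem (function names are ours):

```python
def naive_bayes_combine(beliefs):
    # Product of belief vectors over the same classes, then normalize.
    state = [1.0] * len(beliefs[0])
    for b in beliefs:
        state = [s * p for s, p in zip(state, b)]
    z = sum(state)
    return [s / z for s in state]

b = [0.6, 0.3, 0.1]
once = naive_bayes_combine([b])      # ~[0.6, 0.3, 0.1]: unchanged
twice = naive_bayes_combine([b, b])  # sharpened toward the leading class
```

Feeding the very same belief vector in twice sharpens the combined state, even though the second observation carries no new information; this is exactly the failure mode the Dirichlet encoding is meant to avoid.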

We use a generative model similar to Rebguns2011 to calculate the state of the system given a set of images. The intuition behind this model is that performing an action on an object produces a distribution of belief vectors. We model the observed belief vectors given the object and action as a Dirichlet distribution, the parameters of which are learned from the data. The model is shown in figure 3. Here $a$ is a discrete variable representing the action from the repertoire of actions $A$, $o$ represents the object label, and $\alpha_{o,a}$ is the vector of parameters of the Dirichlet distribution from which the belief vector $b$ over target labels is drawn,

$$b \mid o, a \sim \mathrm{Dirichlet}(\alpha_{o,a})$$

The state of the system is calculated by computing the posterior probability of object-action beliefs using the model in figure 3. Let $P(o, a \mid b)$ denote the posterior probability of an object-action pair given the performed action and the observed belief vector. Assuming a uniform prior over objects and a deterministic policy for choosing actions,

$$P(o, a \mid b) \propto \int P(b \mid \alpha)\, P(\alpha \mid o, a)\, d\alpha$$

The notation $\alpha_{o,a}$ is to make clear that there is a separate parameter vector for each object-action pair. Instead of the full posterior probability, we use $\hat{\alpha}_{o,a}$, the maximum likelihood estimate of $\alpha_{o,a}$, and replace the integral above by $P(b \mid \hat{\alpha}_{o,a})$,

$$P(o, a \mid b) \propto P(b \mid \hat{\alpha}_{o,a})$$

For an interaction sequence with observed beliefs $b_1, \dots, b_t$, the posterior probability of an object-action pair is,

$$P(o, a \mid b_1, \dots, b_t) \propto \prod_{\tau=1}^{t} P(b_\tau \mid \hat{\alpha}_{o,a})$$

The state of the system is comprised of the vector of object-action posterior beliefs for every object and action, plus the features and belief extracted from the latest image $x_t$,

$$s_t = \big[\, P(o, a \mid b_1, \dots, b_t)\ \text{for all}\ (o, a)\,;\ \phi(x_t) \,\big] \qquad (9)$$

where $\phi(x_t)$ denotes the features and belief extracted from the latest image. Note that the vector of object-action posteriors has length $136 \times 10 = 1360$.

4.4 Training Network for Joint Label and Action Prediction

Our goal is to train the network jointly for action and label prediction. We achieve this by minimizing the total cost, which is the sum of the costs for label prediction (1) and action prediction (3). Note that the errors for action value prediction are backpropagated through the entire network, reaching the visual feature extraction units. The total cost function for action value and label prediction is,

$$L = L_c + L_a \qquad (10)$$

The weights of the network in the visual feature extraction layers (Conv1, Conv2, Conv3, ReLU1, ReLU2, LU1) are trained using backpropagation on (10), while the action prediction layers (ReLU3, ReLU4 and LU2) are trained by gradient descent on the action prediction error (3).

To learn the parameters of the belief update, that is, the Dirichlet parameter vectors $\alpha_{o,a}$, we use gradient descent on the negative log-likelihood of the data; the maximum likelihood estimate of the Dirichlet parameters can thus be found with the same machinery used for the rest of the network. For a set of $N$ belief vectors $b_1, \dots, b_N$ observed by performing action $a$ on object $o$, the gradient of the log-likelihood with respect to the parameters is,

$$\frac{\partial \log L}{\partial \alpha_k} = N \left( \psi\Big(\sum_{j} \alpha_j\Big) - \psi(\alpha_k) \right) + \sum_{n=1}^{N} \log b_{n,k} \qquad (11)$$

where $\psi$ is the digamma function. We use one unit per Dirichlet distribution in the belief update layer. These units receive the current belief and their own output for the previously observed beliefs, and produce an updated belief. A schematic of the belief update layer of the network is shown in figure 3. Learning is carried out simultaneously with the rest of the network weights in one training procedure.
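A minimal numerical sketch of this gradient, plus the log-density used to score a belief vector under one Dirichlet unit. Function names are ours, and a real implementation would use a library digamma (e.g. scipy.special.psi) rather than the finite-difference stand-in here:

```python
import math

def digamma(x, h=1e-5):
    # Numerical digamma via a central difference of log-gamma;
    # adequate for illustration only.
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def dirichlet_log_likelihood_grad(alpha, beliefs):
    # d logL / d alpha_k =
    #   N * (psi(sum_j alpha_j) - psi(alpha_k)) + sum_n log b_{n,k}
    n = len(beliefs)
    s = sum(alpha)
    return [n * (digamma(s) - digamma(a_k))
            + sum(math.log(b[k]) for b in beliefs)
            for k, a_k in enumerate(alpha)]

def dirichlet_log_pdf(b, alpha):
    # log p(b | alpha): scores a belief vector under one unit's Dirichlet,
    # as used when computing the object-action posteriors.
    return (math.lgamma(sum(alpha))
            - sum(math.lgamma(a) for a in alpha)
            + sum((a - 1) * math.log(p) for a, p in zip(alpha, b)))
```

Ascending this gradient (or equivalently descending its negative) updates each unit's parameter vector alongside the rest of the network weights.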

Figure 3: Dirichlet belief update layer. Each unit in this layer represents a Dirichlet distribution for a pair of object-action. The parameters of this layer are the vectors of Dirichlet parameters for each unit.

4.5 Reward Function

Another component with an important effect on the performance of our AOR system is the reward function, which maps the state of the system (section 4.3) into rewards. A simple choice is to reward the system when its label prediction is correct and penalize it otherwise. We call this the correct-label reward function: a positive (negative) reward is given to the system if at time step $t$ the action brings the object to a pose for which the predicted label is correct (wrong). The intention behind this reward function is to drive the AOR system to pick actions that lead to the best next view of the object in terms of label prediction.

4.6 Action Coding

In order to be able to reach every position in the rotation range of the robot's gripper, we use a set of relative rotations as the actions of the system. More specifically, we use 10 actions that rotate the gripper from its current position by one of a fixed set of positive or negative offsets. The actions are selected to be fine grained enough that the robot can reach any position with the minimum possible number of movements. This encoding is simple and flexible in the range of positions the robot can reach; however, we found that the learned policies can become stuck cycling among a few actions without trying the rest. Encoding the states with the Dirichlet belief update alleviates this issue to some degree but does not completely remove it. We deal with this problem by forcing the algorithm to pick the next best action whenever the best action leads to an image that has already been seen.
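The relative-rotation actions and the next-best fallback can be sketched as follows; the offset values and rotation range here are illustrative placeholders, not the robot's actual values:

```python
OFFSETS = [-50, -30, -20, -10, -5, 5, 10, 20, 30, 50]  # degrees, illustrative

def next_pose(pose, action, lo=0, hi=180):
    # Apply a relative rotation, clipped to the gripper's rotation range.
    return min(max(pose + OFFSETS[action], lo), hi)

def select_action(q_values, pose, visited):
    # Greedy over action values, but skip actions that lead to an
    # already-visited pose; fall back to the best action if all are visited.
    ranked = sorted(range(len(q_values)), key=lambda a: -q_values[a])
    for a in ranked:
        if next_pose(pose, a) not in visited:
            return a
    return ranked[0]
```

For example, if the highest-valued action would return the gripper to a pose already seen in this interaction sequence, the second-highest-valued action is taken instead.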

5 Experimental Results

5.1 Training Details

We trained the network by minimizing the costs of classification (1), action value prediction (3) and the negative log-likelihood of the Dirichlet distributions (11). We used backpropagation with minibatches of size 128 to train the network. The learning rate was initialized to a constant value, multiplied by a constant factor after a fixed number of iterations, and then held constant. For each training iteration, an interaction sequence of length 5 is followed. The full training algorithm is shown in algorithm 1. For action selection, we used an epsilon-greedy policy in the training stage, with epsilon decreasing step-wise from 0.9 to 0.1. We found that using a non-zero epsilon at the test stage hurts performance; therefore we used a greedy policy during testing. The number of actions is 10 as described above, and there are a total of 136 object classes, resulting in a total of 1360 Dirichlet distributions for the state encoding (9).

procedure Train
    for iteration = 1 to N do
        x ← NextImage(iteration)
        for t = 1 to NumMoves do
            if ... then
                ...
            for ... do
                ...
            for ... do
                ...
            for ... do
                ...
Algorithm 1: Training the network for joint label and action prediction.
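A rough sketch of the training loop of Algorithm 1, based on the description in section 5.1; `env`, `model`, and all of their methods are hypothetical stand-ins, not the paper's code:

```python
import random

def epsilon_for(iteration, n_total):
    # Step-wise decrease of epsilon from 0.9 to 0.1 over training.
    frac = iteration / max(n_total - 1, 1)
    return 0.9 - 0.8 * frac

def train(env, model, n_iterations=10, num_moves=5):
    # One interaction sequence of num_moves epsilon-greedy actions per
    # iteration, with a gradient step on the joint cost after each move.
    for it in range(n_iterations):
        image = env.reset()
        state = model.init_state()
        for _t in range(num_moves):
            state = model.update_state(state, image)
            q = model.q_values(state)
            if random.random() < epsilon_for(it, n_iterations):
                action = random.randrange(len(q))               # explore
            else:
                action = max(range(len(q)), key=q.__getitem__)  # exploit
            image, reward = env.step(action)
            model.gradient_step(state, action, reward)  # L_c + L_a + (11)
```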

5.2 Learning the Parameters of Dirichlet Distributions

Figure 4 shows the average negative log-likelihood of the data under the Dirichlet distributions during training of a DN model. The negative log-likelihood decreases sharply over the first 1000 iterations, after which the rate of decrease slows but does not stop.

Figure 4: Average negative log-likelihood of the data under the Dirichlet distributions. The decrease in negative log-likelihood indicates learning in the belief update layer.

5.3 Label Prediction Accuracy

5.3.1 Comparing Naive Bayes and Dirichlet State Encoding

In the first experiment, we compare the effectiveness of the Dirichlet and Naive Bayes state encodings in terms of label prediction accuracy. For the Naive Bayes models (NB), the state of the system is updated using the Naive Bayes rule of section 4.3, while the size and configuration of the rest of the network remain the same. Dirichlet state encoding is implemented using (9); we refer to the Dirichlet models as DR. For each encoding and for each arm, we train 10 different models and report the average test label prediction accuracy as a function of the number of observed images, comparing the Deep Active Object Recognition (DAOR) and random (Rnd) action selection policies. Figure 5 plots the performance of these models. The Dirichlet model is clearly superior to Naive Bayes in label prediction accuracy.

Figure 5: Test label prediction accuracy as a function of the number of observed images for the left and right arms, for Naive Bayes (NB) and Dirichlet (DR) state encoding.

The first point to notice in figure 5 is the performance difference between Naive Bayes and Dirichlet belief updates on single images (action 0): the NB models achieve markedly lower single-image accuracy than the Dirichlet models (compare the frame-0 columns of table 3). One interpretation of this result is that the Naive Bayes models pick actions that bounce between a subset of training images, leading to underfitting of the model; the next subsection provides some evidence for this explanation. On the other hand, the performance of the DR-DAOR model tends to saturate after 3 actions, while DR-Rnd keeps improving with subsequent actions. This might be because DR-DAOR also bounces between subsets of images at test time. We can avoid such behavior by forcing the policies to pick actions that lead to joint poses that have not already been visited in the same interaction sequence.

5.3.2 Removing Duplicate Visits

We train a set of models using Dirichlet state encoding while forcing the policy to pick non-duplicate joint poses at every step of an interaction sequence. This approach is easy to implement by keeping a history of the joint poses visited during an interaction sequence and picking the action with the highest action value that does not lead to a previously visited joint position. We refer to this model as Dirichlet with non-repeated visits (DN). A comparison between DN and DR for the Rnd and DAOR policies (both forced to visit novel poses) is shown in figure 6.

Figure 6: Test label prediction accuracy as a function of the number of observed images for the left and right arms, for Dirichlet state encoding with repeated visits (DR) and non-repeated visits (DN).

A comparison between the models mentioned above is shown in table 3. The best performing model is DN-DAOR, with the exception of action 1 for the right arm, for which DR-DAOR achieves the best performance. For both arms, the Dirichlet models perform significantly better than the Naive Bayes models, substantially improving average label prediction accuracy on both the right and the left arm.

Right Arm
Frames     0     1     2     3     4     5
NB-Rnd    31.3  38.1  41.3  43.4  45.0  46.1
NB-DAOR   31.3  42.1  45.8  48.0  48.3  49.0
DR-Rnd    40.3  48.7  51.9  53.6  54.6  55.2
DR-DAOR   40.3  49.7  51.6  53.0  52.5  52.6
DN-Rnd    39.4  47.8  50.8  52.5  53.6  54.3
DN-DAOR   39.3  48.4  53.1  55.4  57.0  57.1

Left Arm
Frames     0     1     2     3     4     5
NB-Rnd    32.7  39.5  42.9  44.9  46.3  47.4
NB-DAOR   32.7  43.7  47.5  49.6  50.0  50.6
DR-Rnd    43.7  52.5  55.8  57.5  58.6  59.3
DR-DAOR   43.7  53.0  54.9  55.9  55.5  55.4
DN-Rnd    45.4  54.5  58.0  60.0  61.1  61.9
DN-DAOR   45.4  56.3  60.7  62.8  64.1  64.6

Table 3: Test label prediction accuracy (%) for the NB, DR and DN state encodings under random (Rnd) and learned (DAOR) policies, as a function of the number of observed frames.

5.3.3 Visualizing Policies

It may help us understand the weaknesses and strengths of the different models to take a closer look at the learned policies. For this purpose, we visualize consecutive actions in interaction sequences of length 5, as shown for training data in figure 7 and for test data in figure 8. Each plot represents actions in different rows, with the magnitude and orientation of each action depicted by the length and direction of the corresponding arrow on the left side. Each time step of the interaction sequence is shown as a numbered column. The colored lines in each plot connect an action in one column to an action in the next column only if those actions appeared consecutively in interaction sequences at those time steps. The thickness of a line depicts the relative frequency with which the two actions were observed in the data.

Figure 7 visualizes the DN-DAOR and NB-DAOR policies on the training data. This figure helps clarify the lower performance of the NB models described earlier. For NB-DAOR, shown on the left side of figure 7, we see thick lines connecting the actions that rotate the object with the largest magnitude in opposite directions. The relative thickness of these lines indicates that the model tends to go to one end of the joint's rotation range, go back with one large rotation, and then repeat. Despite the presence of other actions, this back-and-forth dominates the training process, leading to lower accuracy on test label prediction for single images. On the right side of figure 7 we see that DN-DAOR picks a wide range of actions, which leads to better examination of the training images and thus higher performance on single images.

Figure 7: Visualization of the (left) NB and (right) DN models on the training data. Each row represents an action and each column a move performed by the policy in an interaction sequence. The color of the lines connecting two columns differs between consecutive time steps for clarity, while the thickness of a line indicates the frequency of that transition in the interaction sequences.

Figure 8 visualizes the learned policies of NB-DAOR and DN-DAOR at test time. On the left side we see that NB-DAOR only swings between the two large rotations in opposite directions, while DN-DAOR prefers a few larger actions (thick purple and blue lines connecting columns 2, 3 and 4) followed by a few smaller actions in different directions. There is no back-and-forth between visited joint positions for DN-DAOR, which leads to better performance on the test set.

Figure 8: Visualization of the (left) NB and (right) DN models on the test data. The NB model prefers to repeat the same two actions, swinging between two joint poses at one end of the joint range. The DN model usually performs a few larger rotations on the object, followed by a few smaller rotations in different directions.

6 Conclusions

In this paper, we proposed a model for deep active object recognition based on convolutional neural networks. The model is trained by jointly minimizing the action and label prediction costs, so that the visual features in the early stages of the network are shaped by both objectives. The difference between the work presented here and deeply supervised networks [20] is that in the latter, training is carried out by minimizing the classification error at different layers, while in our approach we minimize the action learning cost along with the classification error.

We also adopted an alternative to the common Naive Bayes belief update rule for encoding the state of the system. Naive Bayes has the potential to overfit to subsets of training images, which can lead to lower accuracy at test time. We used a generative model based on the Dirichlet distribution to model the belief over target classes and the actions performed on them. This model was embedded in the network, which allowed training the network in one pass jointly for label and action prediction. The results of our experiments confirm that the proposed Dirichlet model is superior to the Naive Bayes approach in test label prediction.
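The Naive Bayes baseline referred to here is the standard recursive Bayes update over object classes: each new observation's class likelihood multiplies the current belief, which is then renormalized. A minimal sketch (the function name and vectorized form are ours; the paper's Dirichlet alternative instead accumulates per-image evidence under a Dirichlet generative model, which is not reproduced here):

```python
import numpy as np

def naive_bayes_update(belief, likelihood):
    """One Naive Bayes belief update over object classes:
    b_t(c) proportional to b_{t-1}(c) * p(o_t | c), renormalized.

    Repeatedly multiplying per-image likelihoods is what allows this
    update to overfit to particular subsets of training images.
    """
    b = belief * likelihood
    return b / b.sum()
```

For example, starting from a uniform belief over two classes, an observation with likelihoods (0.8, 0.2) shifts the belief to (0.8, 0.2) after renormalization.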

A common trend we observed in the models trained in this paper was a strong preference for a few actions, which led to limited examination of the objects and thus lower performance on label prediction. This preference was strongest in the Naive Bayes state encoding models. Employing the Dirichlet distribution for state encoding helped alleviate this problem, mainly on the training data and less so on the test data. We observed that the strong preference for a limited set of actions weakens during training for the DR-DAOR model, and as a result the test label prediction accuracy improved. We hypothesize that, in addition to the state encoding, learning actions on training images for which label prediction accuracy is already high contributes to this strong preference. In training our models, the training accuracy saturates at a high value after 1000 iterations. This may cause the model to reward every action, which can eventually lead to one action taking over and always producing the highest action value.

7 Acknowledgments

The research presented here was funded by NSF IIS 0968573 SoCS, IIS INT2-Large 0808767, and NSF SBE-0542013 and in part by US NSF ACI-1541349 and OCI-1246396, the University of California Office of the President, and the California Institute for Telecommunications and Information Technology (Calit2).

8 References


  • (1) J. Aloimonos, J. I. Weiss, and A. Bandyopadhyay, Active vision, International J. Computer Vision, vol. 1, no. 4, pp. 333-356, 1988.
  • (2) R. Bajcsy, Active perception, Proceedings of the IEEE, vol. 76, no. 8, pp. 966-1005, 1988.
  • (3) D. Wilkes and J. K. Tsotsos, Active object recognition, Proceedings CVPR’92., 1992 IEEE Computer Society Conference on, pp. 136-141. IEEE, 1992.
  • (4) M. Seibert and A. M. Waxman, Adaptive 3-D object recognition from multiple views, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14, no. 2, pp. 107-124, 1992.
  • (5) B. Schiele and J. L. Crowley, Transinformation for active object recognition, In Computer Vision, Sixth International Conference on, pp. 249-254. IEEE, 1998.
  • (6) S. A. Nene, S. K. Nayar and H. Murase, Columbia object image library (COIL-100), Technical Report CUCS-006-96, Columbia University, 1996.
  • (7) H. Borotschnig, L. Paletta, M. Prantl and A. Pinz, Appearance-based active object recognition, Image and Vision Computing, vol. 18, no. 9, pp. 715-727, 2000.
  • (8) L. Paletta and A. Pinz, Active object recognition by view integration and reinforcement learning, Robotics and Autonomous Systems, vol. 31, no. 1, pp. 71-86, 2000.
  • (9) F. G. Callari and F. P. Ferrie, Active object recognition: Looking for differences, International J. Computer Vision, vol. 43, no. 3, pp. 189-204, 2001.
  • (10) B. Browatzki, V. Tikhanoff, G. Metta, H. H. Bulthoff and C. Wallraven, Active object recognition on a humanoid robot, In Robotics and Automation (ICRA), 2012 IEEE International Conference on, pp. 2021-2028, IEEE, 2012.
  • (11) B. Browatzki, V. Tikhanoff, G. Metta, H. H. Bulthoff, C. Wallraven, Active In-Hand Object Recognition on a Humanoid Robot, Robotics, IEEE Transactions on , vol. 30, no. 99, pp. 1-9, 2014.
  • (12) N. Atanasov, B. Sankaran, J. L. Ny, G. J. Pappas and K. Daniilidis, Nonmyopic View Planning for Active Object Classification and Pose Estimation, Robotics, IEEE Transactions on , vol. 30, no. 99, pp. 1078-1090, 2014.
  • (13) M. Malmir, D. Forster, K. Youngstrom, L. Morrison and J. R. Movellan, Home Alone: Social Robots for Digital Ethnography of Toddler Behavior, Computer Vision Workshops (ICCVW), 2013 IEEE International Conference on, pp. 762-768, 2013.
  • (14) J. R. Movellan, M. Malmir and D. Forester, HRI as a tool to monitor socio-emotional development in early childhood education, In proc. HRI 2nd Workshop on Applications for Emotional Robots, Bielefeld, Germany, 2014.
  • (15) M. Malmir, K. Sikka, D. Forster, J. Movellan and G. W. Cottrell, Deep Q-learning for Active Recognition of GERMS: Baseline performance on a standardized dataset for active learning, In Xianghua Xie, Mark W. Jones, and Gary K. L. Tam, editors, Proceedings of the British Machine Vision Conference (BMVC), pp. 161.1-161.11, BMVA Press, September 2015.
  • (16) C. J. C. H. Watkins, Learning from Delayed Rewards, Ph.D. thesis, Cambridge University, 1989.
  • (17) A. Krizhevsky, I. Sutskever and G. E. Hinton, Imagenet classification with deep convolutional neural networks, In Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.
  • (18) A. Rebguns, D. Ford and I. R. Fasel, Infomax control for acoustic exploration of objects by a mobile robot, In Workshops at the Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011.
  • (19) J. Denzler, C. M. Brown and H. Niemann, Optimal camera parameter selection for state estimation with applications in object recognition, In Pattern Recognition, pp. 305-312, Springer Berlin Heidelberg, 2001.
  • (20) C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang and Z. Tu, Deeply-Supervised Nets, In Proceedings of AISTATS, 2015.