For a robot to operate in the real world from its own perspective, recognizing the 3-dimensional environment is one of the most important problems to address. A robot can learn about its environment through various sensors, such as vision, laser, or pressure sensors. We choose to work with a vision sensor because it provides richer information than the other sensors. Input data from a vision sensor contains not only edges, corners, and lines, but also RGB information that helps the robot perceive objects in its surroundings. Since the input given to the robot is continuous, we use a sequence of images as input data.
Although the input data contains all the information about the environment, including background and foreground, the robot cannot extract any useful information unless the data is processed manually. Thus, a method that filters information from the given input data is necessary. In this paper, we propose a method that helps a robot learn the motion of dynamic objects. To address this problem, Xu and Kuipers [1, 2] introduced a method called the Object Semantic Hierarchy (OSH), which consists of multiple representations with different ontologies. The OSH by itself can provide an analysis of a given image sequence. Through this analysis, the robot can detect dynamic objects and recognize their existence.
However, the OSH is only a framework that helps segment the background information. It requires additional techniques to extract the properties that are invariant under camera motion and to detect objects. Previous work using this framework applied SIFT features and HOG, or homography transformations. Here, we incorporate a Deep Belief Network (DBN) model so that the robot can not only detect an object but also understand it. Reconstructing the 3D model by combining the structure of the OSH, which differentiates the noise, with the DBN technique would give the robot a better understanding of its environment. Using our proposed method, a robot could learn the 3-dimensional environment and recognize the real world as a human does.
2 Related Work
2.1 Review of Previous Work
Since background subtraction, or background modeling, has a long history, there are many approaches to this problem. The simplest methods compute the difference between the current frame and a background estimate such as the median, average, or running average of the previous frames. However, these methods are highly sensitive to the threshold, and they cannot cope with multi-modal background distributions. A number of probabilistic methods have been proposed to solve this problem, such as mixtures of Gaussians [4] and mean shift. Mixtures of Gaussians handle multi-modal background distributions by accumulating supporting evidence. In another approach, the background probability density function is given by the histogram of the most recent pixel values. Mean shift is a gradient-ascent method that can detect the modes of a multi-modal distribution together with their covariance. Several other papers also contribute to background modeling. In one, the background is modeled using a joint spatial-color model. More recently, a method was described that performs background subtraction under a freely moving camera using the RANSAC algorithm and a Markov Random Field. Apart from these methods, since the vision sensor plays a significant role in mobile robotics, running the background subtraction algorithm in real time is crucial. A method that adaptively updates the background model and supports real-time tracking has been described, and a statistical approach to real-time background subtraction has also been introduced.
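The running-average baseline mentioned above can be sketched in a few lines. This is only an illustration of the classical approach, not part of the proposed method; the function names, the blending rate `alpha`, and the `threshold` value are assumptions chosen for the toy example.

```python
def update_background(background, frame, alpha=0.05):
    """Blend the new frame into the running-average background model."""
    return [
        [(1.0 - alpha) * b + alpha * f for b, f in zip(b_row, f_row)]
        for b_row, f_row in zip(background, frame)
    ]


def foreground_mask(background, frame, threshold=30.0):
    """Mark pixels whose intensity differs from the background by more than threshold."""
    return [
        [abs(f - b) > threshold for b, f in zip(b_row, f_row)]
        for b_row, f_row in zip(background, frame)
    ]


# Toy 3x3 grayscale frames: a static scene, then a bright object at the centre.
static = [[10.0, 10.0, 10.0] for _ in range(3)]
moving = [[10.0, 10.0, 10.0], [10.0, 200.0, 10.0], [10.0, 10.0, 10.0]]

background = [row[:] for row in static]
mask = foreground_mask(background, moving)          # only the centre pixel is flagged
background = update_background(background, moving)  # centre drifts toward the new value
```

The single fixed threshold and the unimodal background estimate are exactly the weaknesses the paragraph above points out.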
Unlike the approaches mentioned above, our method neither uses traditional computer vision techniques nor assumes a model of the background. Instead, it learns the pattern of motion using Restricted Boltzmann Machines (RBMs) and obtains the global motion from sequential images. Once the global motion between frames is identified, we can select the regions that violate it. These regions can be considered foreground, as suggested in the OSH [1, 2].
2.1.1 Restricted Boltzmann Machines
The Deep Belief Network is a cutting-edge technique that originated from neural networks, motivated by the goal of mimicking the human perception process. DBNs are probabilistic generative models composed of multiple layers of stochastic latent variables. To obtain the probabilistic model, the RBM borrows the concept of an energy-based model from thermodynamics. The energy function of the RBM describes the relationship between the visible units, which come from the input images, and the hidden units, which are latent variables. Given visible units x and hidden units h, the energy function of the RBM can be written as follows:
where b is the visible biases and c is the hidden biases. Using the energy function in Eq (1), the probability distribution can be derived as such:
Here, the activation function can be a sigmoid function, a hyperbolic tangent function, or a Gaussian function, depending on the input data type.
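For readers who want a concrete instance, the following sketch uses the standard binary RBM form, E(x, h) = -b·x - c·h - hᵀWx, with sigmoid conditionals. The weight matrix name `W`, its indexing (`W[j][i]` links hidden unit j to visible unit i), and the toy values are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def energy(x, h, W, b, c):
    """E(x, h) = -b.x - c.h - h^T W x for one visible/hidden configuration."""
    interaction = sum(
        h[j] * W[j][i] * x[i] for j in range(len(h)) for i in range(len(x))
    )
    visible_term = sum(b_i * x_i for b_i, x_i in zip(b, x))
    hidden_term = sum(c_j * h_j for c_j, h_j in zip(c, h))
    return -visible_term - hidden_term - interaction


def hidden_probabilities(x, W, c):
    """P(h_j = 1 | x) for binary hidden units."""
    return [
        sigmoid(c[j] + sum(W[j][i] * x[i] for i in range(len(x))))
        for j in range(len(c))
    ]


def visible_probabilities(h, W, b):
    """P(x_i = 1 | h) for binary visible units."""
    return [
        sigmoid(b[i] + sum(W[j][i] * h[j] for j in range(len(h))))
        for i in range(len(b))
    ]


# Toy model: 3 visible units, 2 hidden units (values are arbitrary).
W = [[1.0, -1.0, 0.5], [0.0, 2.0, -0.5]]
b = [0.1, 0.0, -0.1]
c = [0.0, -1.0]
x = [1, 0, 1]

p_hidden = hidden_probabilities(x, W, c)
```

Because the units within a layer are conditionally independent, each hidden probability equals the sigmoid of the energy gap between switching that unit off and on.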
The DBN is known to perform well in classification because it can learn meaningful bases from the given input. For example, several experiments have extracted bases that explain different objects. The DBN is represented as a hierarchical model in which each layer contains higher-level information than the previous layer. In the first hidden layer, the hidden units contain edge-like bases from the objects. In the second hidden layer, there is information about specific parts of the objects. As the depth of the hidden layers increases, the layers contain higher-level features of the objects. In addition, since the DBN is a machine learning technique, it involves training and testing stages. Many experiments have been performed using the RBM, and they have reported that unsupervised or semi-supervised learning outperforms supervised learning in object detection and classification.
2.1.2 Higher-Order Restricted Boltzmann Machines
A conventional RBM focuses only on the mean intensity of each pixel. However, to capture the correlation between multiple images, the model must accept multiple sets of visible units. As shown in Eq (1) and (2), the energy function and probability expressions of the RBM have only one set of visible units. We should use higher-order Boltzmann machines to accept multiple inputs.
One drawback of higher-order Boltzmann machines is that computing the hidden units is too expensive for them to be applied to real-time data. More specifically, the weight tensor that connects the visible units and hidden units is an n-dimensional tensor, and finding this tensor requires high computational cost. To reduce this cost, a method that reduces the complexity of the weight tensor through factorization has been proposed. Moreover, a new type of RBM called the covariance RBM, which can capture the correlations within an image, has been introduced, and it applies the same factorization idea. In particular, the 3-way factored RBM model is introduced to compute the covariance of the input data. By applying this model to a still image, the method can capture the self-correlation of the given image, using duplicated copies of the image as the two sets of visible units.
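The saving from factorization can be made concrete with a small sketch. It assumes the common three-matrix form, in which each tensor entry is a sum over factors of products of three two-dimensional weights; the names `wx`, `wy`, `wh` and the sizes used below are hypothetical and are not taken from the paper.

```python
def factored_tensor(wx, wy, wh):
    """Rebuild W[i][j][k] = sum_f wx[i][f] * wy[j][f] * wh[k][f]."""
    n_factors = len(wx[0])
    return [
        [
            [
                sum(wx[i][f] * wy[j][f] * wh[k][f] for f in range(n_factors))
                for k in range(len(wh))
            ]
            for j in range(len(wy))
        ]
        for i in range(len(wx))
    ]


def parameter_counts(n_x, n_y, n_h, n_factors):
    """Number of free weights in the full tensor versus the factored form."""
    full = n_x * n_y * n_h
    factored = n_factors * (n_x + n_y + n_h)
    return full, factored


# Tiny example: 2 input pixels, 2 output pixels, 1 hidden unit, 2 factors.
wx = [[1.0, 0.0], [0.0, 2.0]]
wy = [[1.0, 1.0], [0.0, 3.0]]
wh = [[1.0, -1.0]]
W = factored_tensor(wx, wy, wh)

# Hypothetical sizes: 100 input pixels, 100 output pixels, 50 hidden units, 60 factors.
full, factored = parameter_counts(100, 100, 50, 60)
```

In practice the full tensor is never materialized; the three matrices are used directly, which is where the reduction in computation comes from.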
2.1.3 Object Semantic Hierarchy
In [1, 2], the robot initially treats every input from its sensors as noise (Layer 0: Noisy world). Then, using the static background model, the first layer extracts the static background (Layer 1: Static background). During this stage, the static model of the background is assumed to be time-invariant, and dynamic change is considered noise. In the second layer, the OSH extracts the set of constant 2D object views and the time-variant 2D object poses (Layer 2: 2D object in 2D space). In the third layer, it collects the constant 2D image components to reconstruct the time-variant 3D poses (Layer 3: 2D object in 3D space). Finally, in the last layer, the same collection of 2D components, together with the invariant relations among 3D poses and the time-invariant 3D pose of the objects, is extracted as a whole (Layer 4: 3D object in 3D space). The framework of the OSH is displayed in Figure 1. In this paper, we use the OSH framework to perform background subtraction, which is described as one stage of the OSH.
2.2 Main Contribution
As mentioned in the previous section, the hidden layers can contain high-level features such as edges, parts of objects, and whole objects from the given image. However, this technique has so far been applied mainly to single still images. A higher-order RBM that uses two sets of visible units has also been introduced, whereas earlier RBM models use a single set of visible units. Our intuition in this paper is to apply such a model to extract the motion information between two sequential images, whereas the earlier work uses a single image duplicated as the two sets of visible units. More specifically, if two sequential images are used as inputs, the hidden units in the RBM may contain meaningful information such as the relationship between the two images. More precisely, the factors could capture each pixel’s motion information and correlation, rather than the edges or objects obtained from a single still image.
Once we have the motion information, we can extract an optical-flow-like field between the two images, namely the max-flow field. This is a fairly unconventional but probably quite biologically plausible representation of what happened between two images. There are a number of possible ways of using this flow field, including background subtraction. Since the flow of every pixel is estimated, we can determine the global motion of each image pair. Based on the assumption that background regions are large compared with foreground regions, we can expect the regions that follow the global motion to represent the background. Using the corresponding stage of the OSH, background subtraction can be done by dividing the flow field into two motion subfields: the background motion regions and the regions that violate the background motion.
3 Technical Content
3.1 Summary of the Technical Solution
As previously discussed, a method using two sets of visible units as inputs has been introduced. However, unlike that work, we take two different images as inputs. More specifically, by taking two sequential images as inputs to the 3-way RBM model, we obtain the motion information between the given inputs. After obtaining the motion information, we display the max-flow field using the weight matrices computed by the 3-way RBM. Then, based on the max-flow field, we extract the global motion between the two sequential images. Once the flow field has been estimated at the pixel level, we can apply the corresponding stage of the OSH framework to segment foreground objects.
3.2 Details of the Technical Solution
3.2.1 Spatial Covariance and Temporal Covariance
The energy function of the higher-order RBM involves two sets of visible units, one set of hidden units, and a three-dimensional weight tensor. Note that in the spatial covariance model the two sets of visible units are the same still image. Using the factorization method, the weight tensor can be approximated by a sum of products of two-dimensional weight matrices as:
where the three factor matrices are the weight matrix between the first set of visible units and the factors, the weight matrix between the second set of visible units and the factors, and the weight matrix between the hidden units and the factors. Since the three-dimensional tensor is factorized into a product of two-dimensional weight matrices, the computational complexity is reduced. Moreover, the two visible-to-factor weight matrices are identical in the spatial covariance case, since duplicated images are used as the two visible inputs. The energy function in Eq (4) can therefore be simplified as
Using this energy function, we can obtain the probability distribution by the same inference as in the conventional RBM:
Eq (7) gives the probability distribution based on the energy function in Eq (6). The relationship between the visible units and hidden units is shown in Figure 2, which presents the factored 3-way RBM model conceptually. The visible units and hidden units are each connected to the factors through their own weight matrices. The weight matrices between the visible units and the factors can be computed using Eq (6) and (7). To obtain concrete results, the original work used the Hybrid Monte Carlo (HMC) method, running multiple leapfrog steps per epoch and rejecting abnormal samples with a given probability. As a result, the weights computed from the visible and hidden units converge.
We expect these weight matrices to contain information that explains the correlation between the inputs. Since the inputs are the same image in this case, the useful information for explaining the image is edge-like features, and we call this the spatial covariance method. As shown in Figure 2, we adopt a similar concept from the 3-way factored RBM model. However, our proposed method uses time-sequential images as inputs, and we call it the temporal covariance method.
Since our inputs are two sequential images, denoted the previous image (x) and the current image (y), respectively, the weight matrix between the previous image and the factors should differ from the matrix between the current image and the factors. Therefore, the energy function for different image inputs is
Here, x and y are the visible unit inputs and h is the hidden units. Unlike Eq (4), we use a conditional energy function to reduce the computational complexity. From this energy function, the probability distributions are computed analogously to Eq (2) and (3):
Here, the activation function can be a sigmoid function when the input images are binary data, and a Gaussian distribution when the input images are continuous data.
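The conditional inference described above can be sketched for the binary case as follows. This is a minimal illustration under assumed notation: `wx`, `wy`, and `wh` are the input-to-factor, output-to-factor, and hidden-to-factor weight matrices, `b` and `c` are output and hidden biases, and all numeric values are toy choices rather than learned weights.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def project(units, weights):
    """Factor activations: result[f] = sum_i units[i] * weights[i][f]."""
    n_factors = len(weights[0])
    return [sum(u * row[f] for u, row in zip(units, weights)) for f in range(n_factors)]


def hidden_given_xy(x, y, wx, wy, wh, c):
    """P(h_k = 1 | x, y): each factor multiplies its input and output projections."""
    fx, fy = project(x, wx), project(y, wy)
    return [
        sigmoid(c[k] + sum(wh[k][f] * fx[f] * fy[f] for f in range(len(fx))))
        for k in range(len(wh))
    ]


def output_given_xh(x, h, wx, wy, wh, b):
    """P(y_j = 1 | x, h): the hidden units gate how x is mapped onto y."""
    fx, fh = project(x, wx), project(h, wh)
    return [
        sigmoid(b[j] + sum(wy[j][f] * fx[f] * fh[f] for f in range(len(fx))))
        for j in range(len(wy))
    ]


# Toy weights under which the single hidden unit responds to a one-pixel swap.
wx = [[1.0, 0.0], [0.0, 1.0]]
wy = [[0.0, 1.0], [1.0, 0.0]]
wh = [[2.0, 2.0]]
x = [1, 0]

p_swapped = hidden_given_xy(x, [0, 1], wx, wy, wh, c=[-1.0])
p_same = hidden_given_xy(x, [1, 0], wx, wy, wh, c=[-1.0])
y_prob = output_given_xh(x, [1], wx, wy, wh, b=[0.0, 0.0])
```

With these toy weights the hidden unit is more likely to be on when y is the swapped version of x, and switching it on makes the swapped pixel the more probable output.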
After computing these hidden probabilities, the algorithm calculates the error between the original output (y) and the reconstructed output (y’). When we applied the Hybrid Monte Carlo (HMC) technique to our two input images, the reconstruction error failed to decrease and the learned weights no longer showed edge-like features. Therefore, we had to find a different method to obtain robust weight matrices. We chose k-step Contrastive Divergence (CD-k), and the results appear to converge well with only a small k. With CD-k, the reconstruction error decreases and converges. A brief implementation is given in Algorithm 1 in Section 3.3.
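A single CD-1 update for a factored, gated model can be sketched as below. This is a simplified illustration and not the Matlab implementation used in this work: biases, momentum, and the sparsity target are omitted, and the names `wx`, `wy`, `wh`, `cd1_step`, and the toy sizes are assumptions.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def project(units, weights):
    """Factor activations: result[f] = sum_i units[i] * weights[i][f]."""
    n_factors = len(weights[0])
    return [sum(u * row[f] for u, row in zip(units, weights)) for f in range(n_factors)]


def gated(target_weights, f_a, f_b):
    """Sigmoid of sum_f w[u][f] * f_a[f] * f_b[f] for every unit u."""
    return [sigmoid(sum(w * a * b for w, a, b in zip(row, f_a, f_b))) for row in target_weights]


def cd1_step(x, y, wx, wy, wh, lr, rng):
    """One CD-1 update on a single (x, y) pair; returns the squared reconstruction error."""
    fx = project(x, wx)
    # Positive phase: hidden probabilities given the observed pair, then a binary sample.
    fy_pos = project(y, wy)
    h_pos = gated(wh, fx, fy_pos)
    h_sample = [1.0 if rng.random() < p else 0.0 for p in h_pos]
    # Negative phase: reconstruct y from (x, sampled h), then re-infer the hidden units.
    y_neg = gated(wy, fx, project(h_sample, wh))
    fy_neg = project(y_neg, wy)
    h_neg = gated(wh, fx, fy_neg)
    fh_pos, fh_neg = project(h_pos, wh), project(h_neg, wh)
    # Gradient estimate: data-driven statistics minus reconstruction-driven statistics.
    for f in range(len(fx)):
        for i, x_i in enumerate(x):
            wx[i][f] += lr * x_i * (fy_pos[f] * fh_pos[f] - fy_neg[f] * fh_neg[f])
        for j in range(len(y)):
            wy[j][f] += lr * fx[f] * (y[j] * fh_pos[f] - y_neg[j] * fh_neg[f])
        for k in range(len(wh)):
            wh[k][f] += lr * fx[f] * (h_pos[k] * fy_pos[f] - h_neg[k] * fy_neg[f])
    return sum((a - b) ** 2 for a, b in zip(y, y_neg))


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
wx, wy, wh = random_matrix(4, 5, rng), random_matrix(4, 5, rng), random_matrix(3, 5, rng)
x = [1.0, 0.0, 0.0, 1.0]
y = [0.0, 1.0, 0.0, 1.0]  # hypothetical second frame
error = cd1_step(x, y, wx, wy, wh, lr=0.1, rng=rng)
```

Repeating this step over many image pairs and monitoring the returned error is one way to observe the convergence behavior discussed above.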
3.2.2 Max Flow Field and Background Subtraction
Just as we expect to obtain edge-like bases from spatial covariance, the hidden units computed from two sequential images carry information about translation and rotation. Once we obtain the motion bases through the temporal covariance method, max-flow fields are extracted from them. More specifically, given the motion bases, which are factored into three weight matrices, we can estimate optical flow as a binary vector or as a vector of Bernoulli probabilities. The hidden states inferred from the positive hidden probabilities give the relationship between the input image and the output image. For every pixel in the input image, the strongest outgoing connection is found, and the output pixel to which this connection leads is declared the target of the corresponding input pixel. This defines a max-flow field that maps every input pixel to an output pixel.
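The extraction of a max-flow field can be sketched as follows, assuming the hidden state and the three factor matrices are combined into a pixel-to-pixel matrix T with T[i][j] = Σ_f wx[i][f]·wy[j][f]·(Σ_k h[k]·wh[k][f]). The function names, the row-major pixel indexing, and the toy weights are assumptions for illustration.

```python
def transformation_matrix(h, wx, wy, wh):
    """T[i][j]: connection strength from input pixel i to output pixel j given h."""
    n_factors = len(wx[0])
    fh = [sum(h[k] * wh[k][f] for k in range(len(h))) for f in range(n_factors)]
    return [
        [sum(wx[i][f] * wy[j][f] * fh[f] for f in range(n_factors)) for j in range(len(wy))]
        for i in range(len(wx))
    ]


def max_flow_field(T, width):
    """For each input pixel, the (dx, dy) displacement to its strongest output pixel."""
    flow = []
    for i, row in enumerate(T):
        j = max(range(len(row)), key=row.__getitem__)  # strongest outgoing connection
        flow.append((j % width - i % width, j // width - i // width))
    return flow


# Toy 1x2 image whose learned transformation swaps the two pixels.
wx = [[1.0, 0.0], [0.0, 1.0]]
wy = [[0.0, 1.0], [1.0, 0.0]]
wh = [[1.0, 1.0]]
T = transformation_matrix([1.0], wx, wy, wh)
flow = max_flow_field(T, width=2)
```

Keeping only the strongest connection per pixel is what makes the field easy to draw as arrows, at the cost of discarding weaker alternatives.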
To apply this max-flow field to background subtraction, we need two assumptions: the background moves consistently as a rigid body under the camera motion, and the background occupies a larger portion of the image than the dynamic objects. Under these assumptions, the global motion can be found by analyzing the max-flow field at the pixel level. Since the robot moves in the real world, its camera should not be treated as fixed. Our method should segment foreground objects from the background while accounting for arbitrary camera motion. To subtract the background, we need to know the background motion.
After the global motion is estimated, since the background region is assumed to be larger than the foreground regions, the regions governed by the global motion can be regarded as background. Similar to the OSH framework, the regions that violate the invariant property (the global motion) are treated as foreground. In this way, the background can be subtracted using the max-flow field. A brief implementation is given in Algorithm 2 in Section 3.3.
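One simple way to realize this idea is to take the most frequent flow vector as the global motion and flag every pixel that deviates from it. This is a sketch under that assumption (the mode as the global-motion estimate, which suits pure translation only); the function names and the `tolerance` parameter are hypothetical.

```python
from collections import Counter


def global_motion(flow):
    """Most frequent (dx, dy) vector, taken as the background (global) motion."""
    return Counter(flow).most_common(1)[0][0]


def foreground_mask(flow, tolerance=0):
    """True where a pixel's flow deviates from the global motion by more than tolerance."""
    gx, gy = global_motion(flow)
    return [abs(dx - gx) > tolerance or abs(dy - gy) > tolerance for dx, dy in flow]


# Toy flow field: seven pixels move right (background), two move down (foreground).
flow = [(1, 0)] * 7 + [(0, 1)] * 2
dominant = global_motion(flow)
mask = foreground_mask(flow)
```

A global rotation would need a parametric motion model instead of a single dominant vector, since the per-pixel displacement then varies with position.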
3.3 Main Implementation Algorithm
We briefly describe the 3-way RBM algorithm for learning the weights and biases. The implementation is done in Matlab, following Memisevic’s paper. The numbers of factors and mapping units, the momentum, the learning rate, and the target hidden probability used in our implementation are detailed in Algorithm 1. These values were chosen by parameter tuning on a development set of inference tasks. After training, the model can infer optical flow from a pair of test images using the factored weights by computing the hidden unit activations, as detailed in Algorithm 2. The model represents optical flow as a binary vector, and this binary vector over the hidden units defines a matrix that maps the input image to the output image. For example, for every input pixel one can find the strongest outgoing connection and take the output pixel it leads to as the target of that input pixel. This defines a max-flow field that maps every input pixel to an output pixel.
4.1 Learning Weights
As the first stage of the experiments, we trained on different types of translations and rotations of binary random-dot images, which were generated with about ten percent of the pixels on. The generated images were used for training on translation and rotation, and the second frame of each sequential pair was obtained by transforming the first in a random direction. The learned results for translation and rotation are shown in Figure 3 and Figure 4, respectively.
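The generation of translated training pairs can be sketched as follows. The image size, the maximum shift, and the wrap-around handling at the borders are assumptions made for this example, since they are not specified here; only the roughly ten percent dot density comes from the text.

```python
import random


def random_dot_image(size, on_fraction, rng):
    """Square binary image in which each pixel is on with probability on_fraction."""
    return [[1 if rng.random() < on_fraction else 0 for _ in range(size)] for _ in range(size)]


def shift_image(image, dx, dy):
    """Translate by (dx, dy) pixels, wrapping around at the borders (an assumption)."""
    size = len(image)
    return [[image[(r - dy) % size][(c - dx) % size] for c in range(size)] for r in range(size)]


def make_training_pair(size, on_fraction, max_shift, rng):
    """Return (first frame, shifted second frame, applied shift)."""
    first = random_dot_image(size, on_fraction, rng)
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    return first, shift_image(first, dx, dy), (dx, dy)


rng = random.Random(0)
first, second, (dx, dy) = make_training_pair(size=8, on_fraction=0.1, max_shift=2, rng=rng)
```

Rotated pairs could be produced in the same spirit by resampling the first frame under a random rotation angle.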
Figure 3 shows the results of finding the temporal covariance between sequential frames when translation is trained, and Figure 4 shows the corresponding results for rotation. Since the training images are binary, the probabilities in Eq (9) and (10) should be sigmoid functions. Note that Eq (10) should be Gaussian if the input image is real-valued. The number of training epochs was set empirically, and separately, for the translation and rotation training. As shown in Figures 3 and 4, the corresponding motion information was extracted. From the learned weights, we can infer optical flow as well as reconstruct new output images, as shown in the next section.
4.2 Max-Flow Field and Reconstructed Output
To verify whether the model can infer transformations such as translation and rotation, we constructed and displayed max-flow fields. More specifically, given the learned weights shown in Figures 3 and 4, we can infer the optical flow pattern from pairs of test images by computing the hidden unit activations. Motion flow in the hidden units can be visualized by drawing arrows from input pixels to output pixels. To express the flow field compactly, we drew only one arrow from each input pixel, to the output pixel with the strongest connection. This mapping is the max-flow field discussed in the previous section. Note that this process may lose some information, since it cannot visualize the case where an input pixel is mapped to multiple output pixels, and it loses all information about uncertainty.
Figures 5 and 6 show four image pairs each for translation and rotation, respectively, together with the inferred max-flow fields. For translation, given a test set of binary random-dot images, we constructed output images using transformations such as no shift, up, down, left, right, and the four diagonal directions. After the weights are learned, the hidden units are used to map input images to output images. Hence, we could extract the max-flow fields shown in the rightmost picture of each pair in Figure 5. For rotation, given a test set of binary MNIST images, we rotated the images over a range of angles to produce the output images. With the same configuration and procedure, we could extract the max-flow fields shown in the rightmost picture of each pair in Figure 6.
From the max-flow fields, we can observe a global motion between two sequential images. In the translation examples in Figure 5, each flow field indicates one dominant direction: up, lower-left diagonal, right, and upper-left diagonal, respectively. Similarly, in the rotation examples in Figure 6, we can observe a global rotation. The input and output pairs show counterclockwise rotation, clockwise rotation, no rotation, and a reverse turn, respectively.
Moreover, once we compute the correspondence between an input and output image pair, we can apply the transformation to previously unseen images. Whatever the input image is, the output image can be reconstructed by analogy, since we know the max-flow field. Figure 7 illustrates the reconstructed images.
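Reconstruction by analogy can be illustrated by moving each pixel of an unseen image to the target selected by a previously inferred max-flow field. In this sketch, `targets[i]` is assumed to hold the index of the output pixel chosen for input pixel i, and images are flattened binary lists; both conventions are hypothetical.

```python
def apply_flow(image_flat, targets):
    """Move each on pixel to its max-flow target; output pixels nobody maps to stay 0."""
    output = [0] * len(image_flat)
    for source, target in enumerate(targets):
        if image_flat[source]:
            output[target] = image_flat[source]
    return output


# Flow inferred from some earlier pair on a 1x4 strip: shift right by one with wrap-around.
targets = [1, 2, 3, 0]

# A previously unseen image is transformed by the same flow.
unseen = [1, 0, 0, 1]
reconstructed = apply_flow(unseen, targets)
```

Because the flow is a property of the image pair rather than of the image content, the same `targets` list can be reused for any input of the same size.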
5 Future Work and Conclusion
We have proposed a method that combines the OSH framework and DBN techniques to learn foreground motion information in a video sequence. Our first step was to reproduce an existing DBN technique, the 3-way factored RBM. We then modified it to accept two different images as visible units. We have shown that the DBN can not only extract useful information about a single image but also find the correlation between two sequential images, for example, in the form of max-flow fields. With these max-flow fields, we could estimate the global motion.
Remaining future work includes background and foreground segmentation. As defined in the OSH [1, 2], we consider the regions that move along with the global motion to be background. In contrast, the foreground is defined as the regions that violate the global motion. Since we have extracted the flow field from the given test set and estimated the direction of the global motion, it becomes possible to determine the background region and then segment the foreground regions. We further believe that building on our proposed technique could enable 3D modeling, since the background information would no longer need to be considered.
- [1] C. Xu and B. Kuipers, “Towards the object semantic hierarchy,” Proc. of the ICDL, 2010.
- [2] C. Xu and B. Kuipers, “Construction of the object semantic hierarchy,” Fifth International Cognitive Vision Workshop, 2009.
- [3] C. Stauffer and W. E. L. Grimson, “Adaptive background mixture models for real-time tracking,” Proc. of the CVPR, pp. 246-252, 1999.
- [4] A. Elgammal, D. Harwood, and L. S. Davis, “Non-parametric model for background subtraction,” Proc. of the ICCV, Frame-rate Workshop, 1999.
- [5] D. Comaniciu and P. Meer, “Mean Shift: A robust approach toward feature space analysis,” IEEE Transactions on PAMI, vol. 24, no. 5, pp. 603-619, 2002.
- [6] Y. Sheikh and M. Shah, “Bayesian object detection in dynamic scenes,” IEEE Transactions on PAMI, 2005.
- [7] Y. Sheikh, O. Javed, and T. Kanade, “Background subtraction for freely moving cameras,” Proc. of the ICCV, pp. 1219-1225, 2009.
- [8] C. Stauffer and W. E. L. Grimson, “Adaptive background mixture models for real-time tracking,” Proc. of the CVPR, pp. 246-252, 1999.
- [9] T. Horprasert, D. Harwood, and L. Davis, “A statistical approach for real-time robust background subtraction and shadow detection,” Proc. of the ICCV, 1999.
- [10] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations,” Proc. of the ICML, 2009.
- [11] Y. Bengio, “Learning deep architectures for AI,” Technical Report 1312, University of Montreal, 2007.
- [12] M. Ranzato, A. Krizhevsky, and G. E. Hinton, “Factored 3-Way restricted Boltzmann machines for modeling natural images,” Proc. of the AISTATS, 2010.
- [13] R. Memisevic and G. E. Hinton, “Unsupervised learning of image transformations,” Proc. of the CVPR, 2007.
- [14] R. Memisevic and G. E. Hinton, “Learning to represent spatial transformations with factored higher-order Boltzmann machines,” Neural Computation, June 2010.