I Introduction
Inspired by human vision, event-based cameras asynchronously capture an event whenever there is a brightness change in a scene [1]. An event is simply composed of a pixel coordinate, its binary polarity value, and the timestamp at which the event occurs. This differs from frame-based cameras, where an entire image is acquired at a fixed time interval. Owing to this design, an event-based camera can stream events rapidly (i.e., at microsecond resolution), whereas frame-based cameras usually sample images at millisecond rates [2]. This ability makes event cameras more suitable for high-speed robotic applications that require low latency and high dynamic range from the visual data.
Although the event camera creates a paradigm shift in solving real-time visual problems, its data arrive extremely quickly and without the intensity information usually found in an image. Each event also carries very little information (i.e., the pixel coordinate, the polarity value, and the timestamp). Therefore, it is not trivial to apply standard computer vision techniques to event data. The event camera has recently become more popular in the computer vision and robotics communities, and many problems such as camera calibration and visualization [3], 3D reconstruction [4], simultaneous localization and mapping (SLAM) [5], and pose tracking [6] have been actively investigated.
Our goal in this work is to develop a new method that relocalizes the 6 Degrees of Freedom (6DOF) pose of the event camera using a deep learning approach. Effectively and accurately estimating the pose of the camera plays an important role in many robotic applications such as navigation and manipulation. In practice, however, it is challenging to estimate the pose of the event camera: it captures a large number of events in a short time interval, yet each event alone does not carry enough information to perform the estimation. We propose to form a list of events into an event image and regress the camera pose from this image with a deep neural network. The proposed approach can accurately recover the camera pose directly from the input events, without the need for additional information such as a 3D map of the scene or inertial measurement data.
In computer vision, Kendall et al. [7] introduced the first deep learning framework to retrieve the 6DOF camera pose from a single image. The authors in [7] showed that, compared to traditional keypoint approaches, using a CNN to learn deep features results in a system that is more robust in challenging scenarios such as noisy or unclear images. Recently, the work in [8] introduced a method that uses a geometry loss function to learn the spatial dependencies. In this paper, we employ the same concept of using a CNN to learn deep features. However, unlike [8], which builds a geometry loss function based on the 3D points in the scene, we use an SP-LSTM network to encode the geometry information. Our approach is fairly simple but shows significant improvement over the state-of-the-art methods.
The rest of the paper is organized as follows. We review related work in Section II, followed by a description of the event data and event images in Section III. The SP-LSTM network is introduced in Section IV. In Section V, we present extensive experimental results. Finally, we conclude the paper and discuss future work in Section VI.
II Related Work
The event camera is particularly suitable for real-time motion analysis or high-speed robotic applications since it has low latency [3]. Early work on event cameras used this property to track an object and provide fast visual feedback to control a simple robotic system [9]. The authors in [6] set up an onboard perception system with an event camera for 6DOF pose tracking of a quadrotor. Using the event camera, the quadrotor's poses can be estimated with respect to a known pattern during high-speed maneuvers. Recently, a 3D SLAM system was introduced in [5] that fuses frame-based RGB-D sensor data with event data to produce a sparse stream of 3D points. This sparse stream is a compact representation of the input events, hence it uses fewer computational resources and enables fast tracking.
In [10], the authors presented a method to estimate the rotational motion of the event camera using two probabilistic filters. Recently, Kim et al. [4] extended this system with three filters that simultaneously estimate the 6DOF pose of the event camera, the depth, and the brightness of the scene. The work in [11] introduced a method to directly estimate the angular velocity of the event camera based on a contrast maximization design, without requiring optical flow or image intensity estimation. Reinbacher et al. [12] introduced a method to track an event camera in a panoramic setting that relies only on the geometric properties of the event stream. More recently, the authors in [13] [14] proposed to fuse events with IMU data to accurately track the 6DOF camera pose.
In computer vision, 6DOF camera pose relocalization is a well-known problem. Recent research investigates the capability of deep learning for this problem [7] [8] [15]. Kendall et al. [7] introduced the first deep learning framework to regress the 6DOF camera pose from a single input image. The work of [16] used Bayesian uncertainty to correct the camera pose. Recently, the authors in [8] introduced a geometry loss function based on 3D points from a scene, to let the network encode the geometry information during the training phase. Walch et al. [15] used a CNN and four parallel LSTMs together to learn the spatial relationship in the image feature space. The main advantage of the deep learning approach is that the deep network can effectively encode the features from the input images, without relying on hand-designed features.
This paper follows the recent trend in computer vision by using a deep network to estimate the pose of the event camera. We first create an event image from a list of events. A deep network composed of a CNN and an SP-LSTM is then trained end-to-end to regress the 6DOF camera pose. Unlike [8], which used only a CNN with a geometry loss function that required the 3D points from the scene, or [15], which used four parallel LSTMs to encode the geometry information, we propose to use a Stacked Spatial LSTM to learn spatial dependencies from event images. To the best of our knowledge, this is the first deep learning approach that successfully relocalizes the pose of the event camera.
III Event Data
III-A Event Camera
Instead of capturing an entire image at a fixed time interval as in standard frame-based cameras, event cameras capture a single event at a timestamp based on the brightness change at a local pixel. In particular, an event $e$ is a tuple $e = \langle e_t, (e_x, e_y), e_\rho \rangle$, where $e_t$ is the timestamp of the event, $(e_x, e_y)$ is the pixel coordinate, and $e_\rho$ is the polarity that denotes the brightness change at the current pixel. The events are transmitted asynchronously with their timestamps using sophisticated digital circuitry. Recent event cameras such as the DAVIS 240C [1] also provide IMU data and global-shutter images. In this work, we only use the event stream as the input for our deep network.
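As a minimal sketch of the event description above, one event can be held in a small record type. The field names, the use of seconds for the timestamp, and the encoding of polarity as +1 or -1 are illustrative assumptions, not details taken from the camera's interface.

```python
from typing import List, NamedTuple


class Event(NamedTuple):
    """One event; field names and units are illustrative assumptions."""
    t: float       # timestamp (assumed to be in seconds)
    x: int         # pixel column
    y: int         # pixel row
    polarity: int  # assumed encoding: +1 brighter, -1 darker


def sort_by_time(events: List[Event]) -> List[Event]:
    """Return the events in ascending timestamp order."""
    return sorted(events, key=lambda e: e.t)


stream = [Event(0.000012, 10, 5, 1), Event(0.000003, 7, 2, -1)]
ordered = sort_by_time(stream)
```

Keeping the timestamp as the sort key makes it straightforward to slice the stream into short time windows later on.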
III-B From Events to Event Images
Since a single event only contains the binary polarity value of a pixel and its timestamp, it does not carry enough information to estimate the 6DOF pose of the camera. To make pose relocalization feasible using only the event data, we assume, similar to [11], that events in a very short time interval share the same camera pose. This assumption is based on the fact that the event camera can capture many events in a short period, during which the pose of the camera does not change significantly. From a list of $n$ events, we reconstruct an event image $I \in \mathbb{R}^{h \times w}$ (where $h$ and $w$ are the height and width resolution of the event camera) based on the value of the polarity as follows:
(1) 
This conversion allows us to transform a list of events into an image and apply traditional computer vision techniques to event data. Since the events mainly occur around the edges of the scene, the event images are clearer on simple scenes and more disordered on cluttered scenes. Fig. 2 shows some examples of event images. In practice, the number of events $n$ used to form each image plays an important role since it affects the quality of the event images, which are used to train and infer the camera pose. We analyze the effect of this parameter on the pose relocalization results in Section V-E.
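The conversion from events to an event image can be sketched as follows. This is only an illustration under stated assumptions: the event layout `(timestamp, x, y, polarity)`, the polarity encoding of +1 or -1, and the three gray levels (`neg`, `pos`, `empty`) are hypothetical choices, not values confirmed by the text. As in the simple scheme described in this paper, a later event at the same pixel overwrites an earlier one.

```python
from typing import List, Tuple

# Assumed event layout: (timestamp, x, y, polarity) with polarity in {-1, +1}.
Event = Tuple[float, int, int, int]


def events_to_image(events: List[Event], height: int, width: int,
                    neg: float = 0.0, pos: float = 1.0,
                    empty: float = 0.5) -> List[List[float]]:
    """Build a height x width image; later events overwrite earlier ones."""
    image = [[empty] * width for _ in range(height)]
    for _, x, y, polarity in events:
        if 0 <= x < width and 0 <= y < height:
            image[y][x] = pos if polarity > 0 else neg
    return image


events = [(0.001, 1, 0, 1), (0.002, 2, 1, -1), (0.003, 1, 0, -1)]
img = events_to_image(events, height=2, width=3)
```

In this toy input the pixel at column 1, row 0 receives two events of opposite polarity, so only the later one is visible in the image.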
IV Pose Relocalization for Event Camera
IV-A Problem Formulation
Inspired by [7] [16], we solve the 6DOF pose relocalization task as a regression problem using a deep neural network. Our network is trained to regress a pose vector $y = [p, q]$, where $p$ represents the camera position and $q$ represents the orientation in 3D space. We choose a quaternion to represent the orientation since we can easily normalize its four values to unit length to obtain a valid rotation. In practice, the pose vector is seven-dimensional and is defined relative to an arbitrary global reference frame. The ground-truth pose labels are obtained through an external camera system [3] or structure from motion [7].
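A small sketch of the seven-dimensional pose vector may help. It concatenates a 3D position with a quaternion scaled to unit length. The helper names `normalize_quaternion` and `make_pose` are hypothetical, and the ordering of the quaternion components is not specified by the text.

```python
import math
from typing import List, Sequence


def normalize_quaternion(q: Sequence[float]) -> List[float]:
    """Scale a 4-vector to unit length so it is a valid rotation quaternion."""
    norm = math.sqrt(sum(c * c for c in q))
    if norm == 0.0:
        raise ValueError("zero quaternion cannot be normalized")
    return [c / norm for c in q]


def make_pose(position: Sequence[float],
              quaternion: Sequence[float]) -> List[float]:
    """Concatenate a 3D position and a unit quaternion into a 7D pose."""
    if len(position) != 3 or len(quaternion) != 4:
        raise ValueError("expected 3 position and 4 quaternion values")
    return list(position) + normalize_quaternion(quaternion)


pose = make_pose([0.1, -0.2, 1.5], [2.0, 0.0, 0.0, 0.0])
```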
IV-B Stacked Spatial LSTM
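We first briefly describe the Long Short-Term Memory (LSTM) network [17], then introduce the Stacked Spatial LSTM and the architecture to estimate the 6DOF pose of event cameras. The core of the LSTM is a memory cell with a gate mechanism that encodes the knowledge of previous inputs at every time step. In particular, the LSTM takes an input $x_t$ at each time step $t$, and computes the hidden state $h_t$ and the memory cell state $c_t$ as follows:
$$
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \\
g_t &= \phi(W_{xg} x_t + W_{hg} h_{t-1} + b_g) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \phi(c_t)
\end{aligned}
\tag{2}
$$
where $i_t$, $f_t$, $o_t$ and $g_t$ are the input gate, forget gate, output gate and candidate update; $\odot$ represents element-wise multiplication; the function $\sigma$ is the sigmoid nonlinearity, and $\phi$ is the hyperbolic tangent nonlinearity. The weights $W$ and biases $b$ are trained parameters. With this gate mechanism, the LSTM network can choose to remember or forget information for long periods of time, while remaining robust against vanishing or exploding gradient problems.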
Although the LSTM network is widely used to model temporal sequences, in this work we use it to learn spatial dependencies in the image feature space. The spatial LSTM has the same architecture as a normal LSTM. However, unlike a normal LSTM, where the input comes from the time axis of the data (e.g., a sequence of words in a sentence or a sequence of frames in a video), the input of the spatial LSTM comes from the feature vectors of the image. Recent work showed that the spatial LSTM can further improve the results in many tasks such as music classification [18] or image modeling [19]. A Stacked Spatial LSTM is simply a stack of several LSTM layers, in which each layer aims at learning the spatial information from image features. The intuition is that higher LSTM layers can capture more abstract concepts in the image feature space, hence improving the results.
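The idea of feeding image features to stacked LSTM layers can be illustrated with a small pure-Python sketch. It uses the standard LSTM update with input, forget, and output gates plus a candidate update, splits a flat feature vector into equal-length pieces, and passes them through two layers. All sizes (16 features, 4 steps, 3 hidden units), the random weights, and the names `LSTMLayer` and `reshape` are toy assumptions for illustration and do not reflect the trained network.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class LSTMLayer:
    """Plain LSTM layer; the i, f, o, g gates are stacked in one matrix."""

    def __init__(self, input_size: int, hidden_size: int, rng: random.Random):
        self.hidden_size = hidden_size
        self.w_x = random_matrix(4 * hidden_size, input_size, rng)
        self.w_h = random_matrix(4 * hidden_size, hidden_size, rng)
        self.b = [0.0] * (4 * hidden_size)

    def run(self, sequence: List[Vector]) -> List[Vector]:
        """Return the hidden state after each element of the sequence."""
        n = self.hidden_size
        h, c = [0.0] * n, [0.0] * n
        outputs = []
        for x in sequence:
            z = [a + b + d for a, b, d in
                 zip(matvec(self.w_x, x), matvec(self.w_h, h), self.b)]
            i = [sigmoid(v) for v in z[:n]]
            f = [sigmoid(v) for v in z[n:2 * n]]
            o = [sigmoid(v) for v in z[2 * n:3 * n]]
            g = [math.tanh(v) for v in z[3 * n:]]
            c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
            h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
            outputs.append(h)
        return outputs


def reshape(features: Vector, steps: int) -> List[Vector]:
    """Split a flat feature vector into `steps` equal-length pieces."""
    if len(features) % steps != 0:
        raise ValueError("feature length must be divisible by steps")
    size = len(features) // steps
    return [features[k * size:(k + 1) * size] for k in range(steps)]


rng = random.Random(0)
features = [rng.random() for _ in range(16)]    # stand-in for CNN features
sequence = reshape(features, steps=4)           # 4 pieces of length 4
layer1 = LSTMLayer(input_size=4, hidden_size=3, rng=rng)
layer2 = LSTMLayer(input_size=3, hidden_size=3, rng=rng)
encoded = layer2.run(layer1.run(sequence))[-1]  # last state of the top layer
```

The second layer consumes the hidden states of the first, which is what "stacking" means here; only the final hidden state of the top layer is kept as the encoded feature.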
IV-C Pose Relocalization with Stacked Spatial LSTM
Our pose regression network is composed of two components: a deep CNN and an SP-LSTM network. The CNN is used to learn deep features from the input event images. After the last layer of the CNN, we add a dropout layer to avoid overfitting. The output of the CNN is reshaped and fed to the SP-LSTM module. A fully connected layer is then used to discard the relationships in the output of the LSTM. Here, we note that we only want to learn the spatial dependencies in the image features through the input of the LSTM, while the relationships in the output of the LSTM should be discarded since the components of the pose vector are independent. Finally, a linear regression layer is appended to regress the seven-dimensional pose vector. Fig. 3 shows an overview of our approach.
Sequence | PoseNet [7] Median Error | PoseNet [7] Average Error | Bayesian PoseNet [16] Median Error | Bayesian PoseNet [16] Average Error | SP-LSTM (ours) Median Error | SP-LSTM (ours) Average Error
shapes_rotation | , | , | , | , | , | ,
box_translation | , | , | , | , | , | ,
shapes_translation | , | , | , | , | , | ,
dynamic_6dof | , | , | , | , | , | ,
hdr_poster | , | , | , | , | , | ,
poster_translation | , | , | , | , | , | ,
Average | , | , | , | , | , | ,
In practice, we choose the VGG16 [20] network as our CNN. We first discard its last softmax layer and add a dropout layer with the rate of to avoid overfitting. The event image features are stored in the last fully connected layer in a dimensional vector. We reshape this vector to in order to feed it to the LSTM module with hidden units. Here, we can consider that the inputs of the LSTM are from “feature sentences”, each has “words”, and the spatial dependencies are learned from these sentences. We then add another LSTM network to create an SP-LSTM with two layers. The output of the SP-LSTM module is fed to a fully connected layer with neurons, followed by another fully connected layer with seven neurons to regress the pose vector. We choose the SP-LSTM network with two layers since it is a good balance between accuracy and training time.
IV-D Training
To train the network end-to-end, we use the following objective loss function:
(3) 
where $\hat{p}$ and $\hat{q}$ are the predicted position and orientation from the network. In [8], the authors proposed a geometry loss function to encode the spatial dependencies from the input. However, this approach requires careful initialization and needs a list of 3D points to measure the projection error of the estimated pose, which is not available in the ground truth of the dataset we use in our experiments.
For simplicity, we choose to normalize the quaternion to unit length during the testing phase, and use the Euclidean distance to measure the difference between two quaternions as in [7]. Theoretically, this distance should be measured in spherical space; however, in practice the deep network outputs a predicted quaternion $\hat{q}$ close enough to the ground-truth quaternion $q$, making the difference between the spherical and Euclidean distance insignificant. We train the network for epochs using stochastic gradient descent with momentum and weight decay. The learning rate is empirically set to and kept unchanged during training. It takes approximately days to train the network from scratch on a Tesla P100 GPU.
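A pose regression loss of the kind described here, Euclidean distance on position plus Euclidean distance on the quaternion, might be sketched as below. The weighting factor `beta` and its default of 1.0 are assumptions added for illustration; the exact form of the objective is not recoverable from the text above, so this should be read as one plausible instance rather than the loss used in the experiments.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def pose_loss(pred: Sequence[float], target: Sequence[float],
              beta: float = 1.0) -> float:
    """Position error plus beta times quaternion error for 7D poses.

    `beta` is a hypothetical balancing weight, not a value from the paper.
    """
    return (euclidean(pred[:3], target[:3])
            + beta * euclidean(pred[3:], target[3:]))


loss = pose_loss([0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
                 [3.0, 4.0, 0.0, 1.0, 0.0, 0.0, 0.0])
```

In this toy call the orientations match, so the loss reduces to the 3-4-5 position distance.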
V Experiments
V-A Dataset
We use the event camera dataset that was recently introduced in [3] for our experiments. This dataset includes a collection of scenes captured by a DAVIS camera in indoor and outdoor environments. The indoor scenes of this dataset have ground-truth camera poses from a motion-capture system with sub-millimeter precision at Hz. We use the timestamps of the motion-capture system to create event images. All the events with a timestamp between and of the motion-capture system are grouped as one event image. Without loss of generality, we take the ground-truth pose of this event image to be the camera pose recorded by the motion-capture system at time . This assumption technically limits the speed of the event camera to the speed of the motion-capture system (i.e., Hz); however, it allows us to use the ground-truth poses with sub-millimeter precision from the motion-capture system.
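Grouping events between consecutive motion-capture timestamps can be sketched with a binary search over the pose times. The event layout, the half-open interval convention, and the function name `group_events` are assumptions made for this illustration; which interval endpoint supplies the pose label is left open.

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple

# Assumed event layout: (timestamp, x, y, polarity).
Event = Tuple[float, int, int, int]


def group_events(events: Sequence[Event],
                 pose_times: Sequence[float]) -> List[List[Event]]:
    """Bucket events into the intervals (pose_times[k], pose_times[k + 1]].

    pose_times must be sorted in ascending order. Events that fall outside
    the covered time range are dropped.
    """
    groups: List[List[Event]] = [[] for _ in range(max(0, len(pose_times) - 1))]
    for event in events:
        # The first pose time >= the event time closes the event's interval.
        k = bisect_left(pose_times, event[0]) - 1
        if 0 <= k < len(groups):
            groups[k].append(event)
    return groups


pose_times = [0.0, 0.5, 1.0]
events = [(0.25, 3, 4, 1), (0.5, 1, 1, -1), (0.75, 2, 2, 1), (1.5, 0, 0, 1)]
groups = group_events(events, pose_times)
```

Each resulting group would then be turned into one event image and paired with a single ground-truth pose.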
Random Split. As is standard practice in the pose relocalization task [7], we randomly select of the event images for training and the remaining for testing. We use six sequences (shapes_rotation, box_translation, shapes_translation, dynamic_6dof, hdr_poster, poster_translation) for this experiment. These sequences are selected to cover different camera motions and scene properties.
Novel Split. To demonstrate the generalization ability of our SP-LSTM network, we also conduct the experiment using the novel split. In particular, from the original sequence of event images, we select the first of the event images for training, then the rest for testing. In this way, we have two independent sequences on the same scene (i.e., the training sequence is selected from timestamp to , and the testing sequence is from timestamp to ). We use three sequences from the shapes scene (shapes_rotation, shapes_translation, shapes_6dof) in this novel split experiment to compare the results when different camera motions are used.
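The difference between the two splitting strategies can be made concrete with image indices. In the sketch below the training fraction is a free parameter, since the actual percentages are not given here, and the function names are hypothetical.

```python
import random
from typing import List, Tuple


def random_split(n_images: int, train_fraction: float,
                 seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle indices, then cut: test images may neighbor training ones."""
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)
    cut = int(n_images * train_fraction)
    return sorted(indices[:cut]), sorted(indices[cut:])


def novel_split(n_images: int,
                train_fraction: float) -> Tuple[List[int], List[int]]:
    """Cut in time order: the test images form a separate, later sequence."""
    cut = int(n_images * train_fraction)
    return list(range(cut)), list(range(cut, n_images))


train_r, test_r = random_split(8, 0.75)
train_n, test_n = novel_split(8, 0.75)
```

With the novel split every test index is later than every training index, so no test image has a temporal neighbor in the training set.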
We note that in both the random split and novel split strategies, once the training and testing sets are formed, our SP-LSTM network selects event images randomly for training and testing, and no sequential information between event images is needed. Moreover, unlike the methods in [13] [14] that need inertial measurement data, our SP-LSTM uses only the event images as input.
V-B Baseline
We compare our experimental results with two recent state-of-the-art methods in computer vision: PoseNet [7] and Bayesian PoseNet [16]. We note that our SP-LSTM, PoseNet, and Bayesian PoseNet all use only the event images as input, and no further information such as a 3D map of the environment or inertial measurements is needed.
For each sequence, we report the median and average error of the estimated poses in position and orientation separately. The predicted position is compared with the ground truth using the Euclidean distance, while the predicted orientation is normalized to unit length before being compared with the ground truth. The median and average error are measured in and for the position and orientation, respectively.
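The per-sequence statistics can be sketched as follows. The position error is the Euclidean distance stated above. For the orientation error, the angular distance between unit quaternions shown here is an assumption: it is a common choice for this kind of evaluation, but the text does not spell out the exact formula. The function names are hypothetical.

```python
import math
from statistics import mean, median
from typing import Dict, List, Sequence


def position_error(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Euclidean distance between predicted and ground-truth positions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(pred, truth)))


def orientation_error_deg(pred_q: Sequence[float],
                          truth_q: Sequence[float]) -> float:
    """Rotation angle in degrees between two quaternions.

    The prediction is normalized first; the ground truth is assumed to be
    a unit quaternion. The absolute value treats q and -q as one rotation.
    """
    norm = math.sqrt(sum(c * c for c in pred_q))
    dot = abs(sum(a / norm * b for a, b in zip(pred_q, truth_q)))
    return math.degrees(2.0 * math.acos(min(1.0, dot)))


def summarize(errors: List[float]) -> Dict[str, float]:
    """Median and average of a list of per-image errors."""
    return {"median": median(errors), "average": mean(errors)}


half = math.pi / 4  # half-angle of a 90 degree rotation about the x axis
angle = orientation_error_deg([2.0, 0.0, 0.0, 0.0],
                              [math.cos(half), math.sin(half), 0.0, 0.0])
stats = summarize([1.0, 2.0, 6.0])
```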
V-C Random Split Results
Table I summarizes the median and average error on the six sequences using the random split strategy. From this table, we notice that the pose relocalization results are significantly improved by our SP-LSTM network in comparison with the baselines that use only a CNN [7] [16]. Our SP-LSTM achieves the lowest median and average error in all sequences. In particular, SP-LSTM achieves , in median error on average over all sequences, while the PoseNet and Bayesian PoseNet results are , and , , respectively. Overall, this improvement is around times in position error and times in orientation error. This demonstrates that the spatial dependencies play an important role in the camera pose relocalization process and that our SP-LSTM successfully learns these dependencies, hence significantly improving the results. We also notice that PoseNet performs slightly better than Bayesian PoseNet, and that the uncertainty estimation in Bayesian PoseNet does not improve the pose relocalization results for event data.
From Table I, we notice that the pose relocalization results also depend on the properties of the scene in each sequence. Due to the design of the event-based camera, the events are mainly captured around the contours of the scene. In cluttered scenes, these contours are ambiguous due to non-meaningful texture edge information. Therefore, the event images created from events in these scenes are very noisy. As a result, we observe that for sequences in cluttered or dense scenes (e.g., hdr_poster), the pose relocalization error is higher than for sequences from clear scenes (e.g., shapes_rotation, shapes_translation). We also notice that dynamic objects (e.g., in the dynamic_6dof scene) affect the pose relocalization results. While PoseNet and Bayesian PoseNet are unable to handle the dynamic objects and have high position and orientation errors, our SP-LSTM gives reasonable results in this sequence. This demonstrates that by effectively learning the spatial dependencies with SP-LSTM, the results in such difficult cases can be improved.
Error Distribution. Fig. 4 shows the position and orientation error distributions of our SP-LSTM network. Each box plot represents the error for one sequence. We recall that the bottom and top of a box are the first and third quartiles, which delimit the interquartile range (IQR). The band inside the box is the median. We notice that the IQR of the position error of all sequences (except hdr_poster) is around to , while the maximum error is around . The IQR of the orientation error is in the range to , and the maximum orientation error is only . Among all sequences in our experiment, hdr_poster gives the worst results. This is explainable since this scene is dense, hence the event images have unclear structure and are very noisy. Therefore, it is more difficult for the network to learn and predict the camera pose from these images.
V-D Novel Split Results
Table II summarizes the median and average error on sequences using the novel split strategy. This table clearly shows that our SP-LSTM outperforms both PoseNet and Bayesian PoseNet by a substantial margin. Our SP-LSTM achieves the lowest median and average error in all sequences in this experiment, while the errors of PoseNet and Bayesian PoseNet remain high. In particular, the median error of our SP-LSTM is only and on average, compared to , and , for PoseNet and Bayesian PoseNet, respectively. These results confirm that by learning the spatial relationship in the image feature space, the pose relocalization results can be significantly improved. Table II also shows that the dominant motion of the sequence affects the results: for example, the translation error in the shapes_translation sequence is higher than in shapes_rotation, and vice versa for the orientation error.
Compared to the pose relocalization errors using the random split (Table I), the relocalization errors using the novel split are generally higher. This is explainable since the testing set from the novel split is much more challenging. We recall that in the novel split, the testing set is selected from the last of the event images. This means we do not have the “neighborhood” relationship between the training and testing images. In the random split strategy, the testing images can be very close to the training images since we select the images randomly from the whole sequence for training and testing. This does not happen in the novel split strategy since the training and testing sets are two separate sequences. Despite this challenging setup, our SP-LSTM is still able to regress the camera pose and achieves reasonable results. This shows that the network successfully encodes the geometry of the scene during training, and hence generalizes well during the testing phase.
To conclude, the extensive experimental results from both the random split and novel split setups show that our SP-LSTM network successfully relocalizes the event camera pose using only the event image. The key reason for the improvement is the use of the stacked spatial LSTM to learn the spatial relationship in the image feature space. The experiments using the novel split setup also confirm that our SP-LSTM successfully encodes the geometry of the scene during training and generalizes well during testing. Furthermore, our SP-LSTM has very fast inference time and requires only the event image as input to relocalize the camera pose.
Reproducibility
We implement the proposed method using the TensorFlow framework [21]. The testing time for each new event image using our implementation is around on a Tesla P100 GPU, which is comparable to the real-time performance of PoseNet, while Bayesian PoseNet takes longer (approximately ) due to the uncertainty analysis process. To encourage further research, we will release our source code and trained models that allow reproducing the results in this paper.
V-E Robustness to Number of Events
In this work, we assume that events occurring between two timestamps of the external camera system share the same camera pose. Although this assumption is necessary to use the ground-truth poses to train the network, it limits the speed of the event-based camera to the sampling rate of the external camera system. To analyze the effect of the number of events on the pose relocalization results, we perform the following study: during the testing phase using the random split strategy, instead of using all events between two consecutive timestamps, we gradually use only , , …, of these events to create the event images (the events are chosen in order from the current timestamp to the previous timestamp). Fig. 5 shows the position and orientation errors of our SP-LSTM network in this experiment. From the figure, we notice that both the position and orientation errors of all sequences become consistent when we use around number of events. When we use more events to create the event images, the errors drop slightly but not significantly. This suggests that the SP-LSTM network still performs well when we use fewer events. We also note that our current method to create the event image from the events is fairly simple, since some events may be overwritten when a later event occurs at the same coordinates with a different polarity value. Despite this, our SP-LSTM network still successfully relocalizes the camera pose from the event images.
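The subsampling used in this study, keeping only the most recent part of the events in a window, can be sketched as below. The function name `keep_latest`, the event layout, and the rounding rule (ceiling, with at least one event kept) are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple

# Assumed event layout: (timestamp, x, y, polarity), sorted by timestamp.
Event = Tuple[float, int, int, int]


def keep_latest(events: Sequence[Event], fraction: float) -> List[Event]:
    """Keep the most recent `fraction` of a time-sorted event list."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not events:
        return []
    count = max(1, math.ceil(len(events) * fraction))
    return list(events[len(events) - count:])


window = [(0.001 * k, k, k, 1) for k in range(8)]
quarter = keep_latest(window, 0.25)
half = keep_latest(window, 0.5)
```

The kept events are the ones closest to the current timestamp, matching the order described in the study.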
VI Conclusions and Future Work
In this paper, we introduce a new method to relocalize the 6DOF pose of the event camera with a deep network. We first create event images from the event stream. A deep convolutional neural network is then used to learn features from the event image. These features are reshaped and fed to a Stacked Spatial LSTM network. We have demonstrated that by using the Stacked Spatial LSTM network to learn spatial dependencies in the feature space, the pose relocalization results can be significantly improved. The experimental results show that our network generalizes well under challenging testing strategies and also gives reasonable results when fewer events are used to create event images. Furthermore, our method has fast inference time and needs only the event image to relocalize the camera pose.
Currently, we employ a fairly simple method to create the event image from a list of events. Our forming method does not check whether an event has already occurred at the local pixel. Since the input of the deep network is the event image, a better forming method could improve the pose relocalization results, especially on cluttered scenes where the data from event cameras are very disordered. Although our network achieves inference time, which can be considered real-time performance as in PoseNet, it may still not be fast enough for high-speed robotic applications using event cameras. Therefore, another interesting problem is to study compact network architectures that can achieve competitive pose relocalization results while having fewer layers and parameters. This would improve the speed of the network and allow it to be used in more realistic scenarios.
Acknowledgment
Anh Nguyen, Darwin G. Caldwell and Nikos G. Tsagarakis are supported by the European Union Seventh Framework Programme (FP7-ICT-2013-10) under grant agreement no 611832 (WALK-MAN). Thanh-Toan Do is supported by the Australian Research Council through the Australian Centre for Robotic Vision (CE140100016).
References
[1] C. Brandli, R. Berner, M. Yang, S.-C. Liu, and T. Delbruck, “A 240×180 130 dB 3 µs latency global shutter spatiotemporal vision sensor,” IEEE Journal of Solid-State Circuits, 2014.
[2] A. Censi and D. Scaramuzza, “Low-latency event-based visual odometry,” in ICRA, 2014.
[3] E. Mueggler, H. Rebecq, G. Gallego, T. Delbruck, and D. Scaramuzza, “The event-camera dataset and simulator: Event-based data for pose estimation, visual odometry, and SLAM,” IJRR, 2017.
[4] H. Kim, S. Leutenegger, and A. J. Davison, “Real-time 3D reconstruction and 6-DoF tracking with an event camera,” in ECCV, 2016.
[5] D. Weikersdorfer, D. B. Adrian, D. Cremers, and J. Conradt, “Event-based 3D SLAM with a depth-augmented dynamic vision sensor,” in ICRA, 2014.
[6] E. Mueggler, B. Huber, and D. Scaramuzza, “Event-based, 6-DOF pose tracking for high-speed maneuvers,” in IROS, 2014.
[7] A. Kendall, M. Grimes, and R. Cipolla, “PoseNet: A convolutional network for real-time 6-DOF camera relocalization,” in ICCV, 2015.
[8] A. Kendall and R. Cipolla, “Geometric loss functions for camera pose regression with deep learning,” in CVPR, 2017.
[9] J. Conradt, M. Cook, R. Berner, P. Lichtsteiner, R. J. Douglas, and T. Delbruck, “A pencil balancing robot using a pair of AER dynamic vision sensors,” in International Symposium on Circuits and Systems (ISCAS), 2009.
[10] H. Kim, A. Handa, R. Benosman, S.-H. Ieng, and A. Davison, “Simultaneous mosaicing and tracking with an event camera,” in BMVC, 2014.
[11] G. Gallego and D. Scaramuzza, “Accurate angular velocity estimation with an event camera,” RAL, 2017.
[12] C. Reinbacher, G. Munda, and T. Pock, “Real-time panoramic tracking for event cameras,” in International Conference on Computational Photography (ICCP), 2017.
[13] A. Zihao Zhu, N. Atanasov, and K. Daniilidis, “Event-based visual inertial odometry,” in CVPR, 2017.
[14] H. Rebecq, T. Horstschaefer, and D. Scaramuzza, “Real-time visual-inertial odometry for event cameras using keyframe-based nonlinear optimization,” in BMVC, vol. 3, 2017.
[15] F. Walch, C. Hazirbas, L. Leal-Taixé, T. Sattler, S. Hilsenbeck, and D. Cremers, “Image-based localization with spatial LSTMs,” CoRR, vol. abs/1611.07890, 2016.
[16] A. Kendall and R. Cipolla, “Modelling uncertainty in deep learning for camera relocalization,” in ICRA, 2016.
[17] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, 1997.
[18] K. Choi, G. Fazekas, M. B. Sandler, and K. Cho, “Convolutional recurrent neural networks for music classification,” CoRR, vol. abs/1609.04243, 2016.
[19] L. Theis and M. Bethge, “Generative image modeling using spatial LSTMs,” in NIPS, 2015.
[20] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
[21] M. Abadi et al., “TensorFlow: Large-scale machine learning on heterogeneous systems,” software available from tensorflow.org. [Online]. Available: http://tensorflow.org/