The direct interaction between a robot and its surroundings is one of the major challenges in robotics. The iCub, designed to resemble a three-and-a-half-year-old child, primarily uses vision to measure the state of the external environment; visual motion estimation is therefore fundamental.
Unlike standard cameras, which read the full sensor array to produce images at a fixed frame rate, event cameras only report changes in pixel-level brightness that exceed a threshold. The “events” are produced independently and asynchronously for each pixel sensor. They offer a higher dynamic range than standard cameras, together with low latency and high temporal resolution (both on the order of microseconds). Processing of redundant information is avoided, as pixels that do not experience a change simply do not activate.
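The thresholded, per-pixel behaviour described above can be illustrated with a minimal sketch. It emulates events from sampled intensity frames, whereas a real sensor operates asynchronously in hardware; the function name `generate_events`, the use of log-intensity, and the threshold value are assumptions made for illustration only.

```python
import math


def generate_events(reference_log, frame, t, threshold=0.2):
    """Emit (x, y, polarity, t) for pixels whose log-intensity changed enough.

    reference_log holds, per pixel, the log-intensity at the last event and is
    updated in place for pixels that fire. Intensities must be positive.
    """
    events = []
    for y, row in enumerate(frame):
        for x, intensity in enumerate(row):
            log_i = math.log(intensity)
            diff = log_i - reference_log[y][x]
            if abs(diff) >= threshold:
                polarity = 1 if diff > 0 else -1
                events.append((x, y, polarity, t))
                reference_log[y][x] = log_i  # only firing pixels are updated
    return events
```

Pixels whose brightness does not change produce no output at all, which is the source of the data reduction mentioned above.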
As such, event cameras are a promising technology for fast, accurate and low-power vision algorithms for robots in dynamic environments. Since the output of event cameras is fundamentally different from that of standard cameras (a continuous, asynchronous stream of events instead of a sequence of images), new algorithms are required to deal with these data. The iCub [1] is a humanoid robot designed with sensory and actuation capabilities to interact with a dynamic environment, in which objects and the robot itself move simultaneously. The neuromorphic iCub is equipped with a stereo pair of ATIS event cameras [2].
A problem such as visual tracking can be solved almost trivially with a stationary event camera: only the motion of the target causes events to be produced, so the problem of segmenting the target from the background is inherently solved by the sensor. However, the movement of an event camera mounted on a robot causes events to be generated by all contrast in the field of view, and the problem of tracking or recognising the motion of a target object becomes more difficult. Differentiating the event camera signal caused by ego-motion from the signal caused by independent motion has many potential uses in event-driven robotics.
The problem of Independent Motion Detection (IMD) was first studied in [3, 4]. However, as noted by [5], the apparent motion induced by the ego-motion has “some degree of statistical regularity” and “the structure of the environment [..] is far from arbitrary” for many applications (in their case, autonomous ground vehicles). Exploiting constraints on the vehicle motion has also been suggested [6].
Besides visual sensing, most robots are also equipped with proprioceptive sensors such as inertial measurement units and joint encoders. Thus, a typical strategy is to use the knowledge of the robot kinematics to predict the apparent motion induced by ego-motion [7, 8, 9]. However, due to imprecision in the mechanical structure and sensor acquisition errors, significant noise is present in these predictions. Therefore, it has been proposed to learn the correlation between joint velocities and flow statistics, without requiring a kinematic model of the robot. Independent motion is found by predicting a probability distribution of the optical flow and identifying flow vectors that do not belong to this distribution. This approach estimates ego-motion correctly even if the majority of the image plane is covered by independent motion (e.g., a large object in front of the camera), which is not possible with methods relying on vision alone.
In this paper, we present a method to segment events caused by ego-motion from those caused by independent object motion. Such a technique has wide applicability for event-driven algorithms in robotics. For example, identifying independent object motion makes detection and tracking of moving objects simple. In a dynamic environment, moving objects are usually highly relevant to behaviour (e.g. for avoidance or attentional interaction with human collaborators handling objects). Alternatively, segmentation of ego-motion events can be used to remove outliers and improve event-based visual odometry methods (e.g. [10, 11]) in dynamic scenes. The use of event cameras for this task is driven by the strong potential for low-latency, low-power robotic vision. Our algorithm takes advantage of previous work in event-based corner detection [12] as well as traditional approaches to ego-motion segmentation [8, 9].
II Related Work
II-A Independent Motion Detection for Robot Applications
In [13], the authors proposed the MotionCUT framework, based on Lucas-Kanade tracker failures. They observed that the Lucas-Kanade algorithm typically fails around rotations and occlusions, which are likely caused by independently moving objects. This approach assumes that visual motion is dominated by ego-motion and that only a small portion of it is produced by independent motion. For large objects or objects close to the camera, the camera translation component is not negligible and can cause a Lucas-Kanade tracking failure, just as an independently moving object would. The assumption that ego-motion is the dominant motion in the scene is often violated in dynamic environments where many objects move at the same time.
In [14], the kinematics of the robot were assumed to be known, but with input-dependent noise due to mechanical imperfections and sensor noise. To address these issues, the correlation between predicted and tracked features was learned. For the prediction, the depth of the features was estimated using stereo. The statistics were learned in situations without independently moving objects, i.e., the apparent motion was due only to the ego-motion. In the testing phase, these statistics were used to detect anomalies that correspond to independent motion.
II-B Event-based Motion Estimation
Much work on motion estimation with event cameras has focused on SLAM and visual odometry algorithms. Estimating only the rotational component of the camera motion has been performed using various methods [15, 16, 17]. However, the motion of the iCub’s eyes includes a non-negligible amount of translational motion, so these algorithms are not suitable, especially when objects are close to the camera.
Event-based 6-DOF SLAM and visual odometry methods [10, 11] assume static scenes, and sufficiently large independently moving objects introduce errors. In this paper, we consider environments in which objects are also moving, and we are interested in their motion rather than solely in the robot’s ego-motion. It was shown that specific algorithmic adjustments were required to perform tracking in a highly dynamic environment with simultaneous camera and object motion [18]. When using an event-based camera, the difficulty of the problem increases dramatically. In addition, those experiments relied solely on visual data, as they did not have access to the robot kinematics. Previously, event camera vision has been combined with external sensors by using a gyroscope for event stabilisation [19]. Again, this method can only estimate the apparent motion due to camera rotation, not translation.
To segment events induced by ego-motion from those induced by an independently moving object, a measure of the optical flow is first required. Optical flow can be easily calculated using event cameras by fitting planes in the three-dimensional spatio-temporal space in which events exist [20]. However, as with frame-based cameras, when only small spatial windows of events are considered, the aperture problem causes incorrect flow estimation along long uniform edges. With event-based cameras, the spatially local plane-fitting method, in combination with the temporally asynchronous nature of events, makes it difficult to apply global correction methods.
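As an illustration of the plane-fitting idea, the sketch below fits a plane t = a·x + b·y + c to a small neighbourhood of events by least squares and converts the spatial gradient of that plane into a flow vector. It is a minimal sketch under stated assumptions rather than the method of the cited work: the function names `fit_plane` and `normal_flow` are hypothetical, and the normal equations are solved with Cramer's rule for simplicity.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_plane(events):
    """Least-squares fit of t = a*x + b*y + c to a list of (x, y, t) events."""
    n = len(events)
    sx = sum(x for x, y, t in events)
    sy = sum(y for x, y, t in events)
    st = sum(t for x, y, t in events)
    sxx = sum(x * x for x, y, t in events)
    sxy = sum(x * y for x, y, t in events)
    syy = sum(y * y for x, y, t in events)
    sxt = sum(x * t for x, y, t in events)
    syt = sum(y * t for x, y, t in events)
    m = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]
    rhs = [sxt, syt, st]
    d = det3(m)
    if abs(d) < 1e-12:
        return None  # degenerate neighbourhood, e.g. all events collinear
    coeffs = []
    for col in range(3):
        mc = [row[:] for row in m]
        for r in range(3):
            mc[r][col] = rhs[r]  # Cramer's rule: swap in the right-hand side
        coeffs.append(det3(mc) / d)
    return tuple(coeffs)


def normal_flow(events):
    """Flow (vx, vy) in pixels per time unit from the fitted plane gradient."""
    plane = fit_plane(events)
    if plane is None:
        return None
    a, b, _ = plane
    g2 = a * a + b * b
    if g2 < 1e-12:
        return None  # flat plane: no measurable motion
    # The gradient (a, b) has magnitude 1/speed, so v = g / |g|^2.
    return a / g2, b / g2
```

The result is the flow component normal to the local edge only, which is why the aperture problem arises along long uniform edges.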
Unique features, from which the true optical flow can be calculated, are typically used to avoid the aperture problem. Event-based corner detection methods have been proposed [21, 12]. In this paper, we use [12], which adapted the Harris method to the event data stream. It has also been shown that tracking corners in event-space is possible, resulting in a faster-than-frame-rate update of corner positions.
The pipeline for detecting independent motion consists of visual corner detection, tracking and velocity estimation, in parallel with the estimation of the joint velocities from motor encoders, as shown in Fig. 1. In a learning phase using a completely static scene, a model of the correlation between motor velocities and the resulting visual motion is developed (following [8, 9]). During operation, computed visual motion is compared to the expected visual motion given the model and the movement of the robot. Large discrepancies between estimated motion and computed motion can be classified as being caused by an independently moving object.
III-A Corner Detection
The dynamic vision circuitry of the ATIS camera provides a stream of events, each consisting of a pixel position, a polarity, and a timestamp. The stream of events is first reduced to the subspace of corner events, detected using [12], with the following implementation changes. Previously, a fixed event window was used that stored a fixed number of the most recent events globally across the sensor space in the form of a “surface”. Good results were achieved with the size of the window tuned according to the scene complexity. However, this can be problematic when motion is not uniform over the visual scene (e.g. in our case, an object moving independently of the camera), as higher velocities produce more events per second, which appear in the event window as “thicker edges”. Instead of a single global event window, we implemented the same fixed event surface on a local scale, one for each pixel location. The window size no longer depends on the particular scene and object motion, but on the feature type used: in our case, corners. As the features are constant, the window size can also be constant and independent of the scene. For corner detection we set the window size to be , where is the radius of the window used.
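One possible way to organise such per-pixel windows is sketched below: each pixel keeps its own fixed-length queue of the most recent events that fell within a given radius, and a binary patch is built from that queue on demand. This is an illustrative sketch only; the class name `LocalEventWindows`, the square (2r+1)-pixel patch, and the parameter names are assumptions and are not taken from the original implementation.

```python
from collections import deque


class LocalEventWindows:
    """One fixed-length event queue per pixel (a local 'surface')."""

    def __init__(self, width, height, radius, max_events):
        self.width = width
        self.height = height
        self.radius = radius
        self.max_events = max_events
        self.queues = {}  # (px, py) -> deque of recent nearby event positions

    def add_event(self, x, y):
        """Store the event in the window of every pixel within `radius`."""
        r = self.radius
        for py in range(max(0, y - r), min(self.height, y + r + 1)):
            for px in range(max(0, x - r), min(self.width, x + r + 1)):
                queue = self.queues.setdefault(
                    (px, py), deque(maxlen=self.max_events))
                queue.append((x, y))  # oldest entry drops out when full

    def patch(self, x, y):
        """Binary (2r+1) x (2r+1) patch centred on pixel (x, y)."""
        r = self.radius
        side = 2 * r + 1
        grid = [[0] * side for _ in range(side)]
        for ex, ey in self.queues.get((x, y), ()):
            grid[ey - y + r][ex - x + r] = 1
        return grid
```

Because every local queue holds the same number of events regardless of how quickly they arrive, a fast-moving region does not crowd out events from slower regions, which is the property motivated in the text.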
III-B Corner Clustering and Tracking
Events that are labelled as corners are clustered into corner tracks, from which the velocity of each corner can be estimated. The first cluster is initialized with the first corner event, and subsequent corner events are assigned based on their spatial distance to the existing clusters: an event is added to a cluster if the distance is less than a threshold; if no cluster lies within that distance of the current event, a new cluster is created.
Such a greedy corner allocation method can be applied because the corner detection error is typically limited to 2 pixels and we can assume that the full trajectory of an object is observed pixel by pixel in the event space (i.e. an observed object cannot jump more than one pixel when using an event-based camera, whereas this commonly occurs with objects that move faster than the frame rate of a traditional camera).
Each cluster is updated according to a first-in first-out rule: when the maximum size is reached, the oldest corner event is removed from the cluster. To avoid tracking corner events that do not reflect the current motion, clusters are deleted if they have not been updated for longer than a set time limit.
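The clustering rules above (distance threshold, first-in first-out update, deletion of stale clusters) can be sketched as follows. The class name `CornerClusters`, the parameter names, and the choice to measure distance to the most recent event of the nearest cluster are assumptions made for this example; the source only specifies that a spatial distance threshold is used.

```python
import math
from collections import deque


class CornerClusters:
    """Greedy clustering of corner events into tracks."""

    def __init__(self, max_dist, max_size, timeout):
        self.max_dist = max_dist  # distance threshold in pixels
        self.max_size = max_size  # maximum number of events per cluster
        self.timeout = timeout    # maximum time without an update
        self.clusters = []        # each cluster is a deque of (x, y, t)

    def add(self, x, y, t):
        """Assign a corner event to a cluster and return that cluster."""
        # Delete clusters that have not been updated recently.
        self.clusters = [c for c in self.clusters
                         if t - c[-1][2] <= self.timeout]
        # Find the nearest cluster closer than the distance threshold.
        best, best_dist = None, self.max_dist
        for cluster in self.clusters:
            last_x, last_y, _ = cluster[-1]
            dist = math.hypot(x - last_x, y - last_y)
            if dist < best_dist:
                best, best_dist = cluster, dist
        if best is None:
            # No cluster is close enough: start a new one.
            best = deque(maxlen=self.max_size)  # FIFO: oldest event drops out
            self.clusters.append(best)
        best.append((x, y, t))
        return best
```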
Corner positions are tracked over time using regression to fit a line in the spatio-temporal space. Given the set of corner events that belong to the current cluster, we find the line that minimizes the sum of the squared deviations of those events from it. The fitted line defines the direction of motion and provides the components of the flow in both image directions. To minimise the error on velocity estimation, we define a minimum number of events for the clusters to be informative.
The corner event is augmented with the additional information from the velocity calculation, generating a flow event.
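A minimal sketch of the velocity estimation step is given below. It assumes the line is parameterised by time, i.e. x(t) = vx·t + x0 and y(t) = vy·t + y0, and fits each coordinate by ordinary least squares; this parameterisation, the function name `estimate_velocity`, and the flow-event layout `(x, y, t, vx, vy)` are assumptions for illustration.

```python
def estimate_velocity(cluster, min_events=3):
    """Fit x(t) and y(t) by least squares; return (vx, vy) or None.

    cluster is a sequence of (x, y, t) corner events.
    """
    n = len(cluster)
    if n < min_events:
        return None  # too few events for an informative estimate
    t_mean = sum(t for _, _, t in cluster) / n
    x_mean = sum(x for x, _, _ in cluster) / n
    y_mean = sum(y for _, y, _ in cluster) / n
    var_t = sum((t - t_mean) ** 2 for _, _, t in cluster)
    if var_t == 0:
        return None  # all events share one timestamp: slope undefined
    vx = sum((t - t_mean) * (x - x_mean) for x, _, t in cluster) / var_t
    vy = sum((t - t_mean) * (y - y_mean) for _, y, t in cluster) / var_t
    return vx, vy


def make_flow_event(cluster, min_events=3):
    """Augment the most recent corner event with the cluster velocity."""
    velocity = estimate_velocity(cluster, min_events)
    if velocity is None:
        return None
    x, y, t = cluster[-1]
    return (x, y, t, velocity[0], velocity[1])
```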
The algorithm is detailed in Algorithm 1.
Corner clusters are updated and an estimation of visual velocity is calculated asynchronously as events occur, with a higher than microsecond resolution. The flow of the entire scene can be found by querying all event clusters that are active at any point in time. The most recent flow event within the cluster holds the most up-to-date velocity calculation for each corner cluster.
III-C Model Learning
A model of average visual motion given motor encoder velocities is learned from data. Robot joint velocities can be estimated by differentiating encoder positions and applying a filter [22]. Supervised learning is performed every time joint velocities are read, such that the input is the vector of joint velocities and the learning signal is the optical flow statistics computed from corner tracking. These statistics are the mean scene velocities and the covariance matrix of all active clusters queried at the time of the update. Importantly, the model must be learned (once) in a completely static environment, such that ego-motion alone contributes to the visual flow.
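The joint velocity estimate can be sketched as a finite difference of encoder positions followed by a smoothing filter. Note that this example uses a plain moving average, not the adaptive windowing filter cited in the text; the class name `JointVelocityEstimator` and the `window` parameter are hypothetical.

```python
from collections import deque


class JointVelocityEstimator:
    """Finite-difference joint velocities smoothed by a moving average."""

    def __init__(self, window=5):
        self.previous = None                 # (positions, timestamp)
        self.history = deque(maxlen=window)  # recent raw velocity vectors

    def update(self, positions, t):
        """positions: joint angles; t: timestamp in seconds."""
        if self.previous is not None:
            prev_positions, prev_t = self.previous
            dt = t - prev_t
            if dt > 0:
                raw = [(p - q) / dt
                       for p, q in zip(positions, prev_positions)]
                self.history.append(raw)
        self.previous = (list(positions), t)
        if not self.history:
            return None  # need at least two readings
        n = len(self.history)
        # Average each joint's velocity over the stored window.
        return [sum(column) / n for column in zip(*self.history)]
```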
Joint velocities are updated at a fixed period, while the estimated velocity from the corner clusters is asynchronously updated for each event, after an initial latency equal to the minimum number of events required per cluster. When at least half of the clusters reach this minimum number of events, we read the encoder velocities and update the statistics, considering only clusters that satisfy the requirement. To avoid relying on a single cluster, which would not be informative for learning scene statistics, we also define a minimum number of clusters that need to be active at the same time. For fast motion, the speed of the algorithm is limited by the encoder update frequency; for slower motion, the update is tailored to the dynamics of the scene and performed only when sufficient information has been gathered, saving computation and power.
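The update rule and the flow statistics can be sketched together. The function below returns the two mean velocity components and the three independent entries of the 2×2 covariance matrix, or `None` when the gating conditions described above are not met. The name `flow_statistics`, the input layout, and the use of the population covariance (dividing by the number of clusters) are assumptions for illustration.

```python
def flow_statistics(clusters, min_events, min_clusters):
    """Return (mu_x, mu_y, s_xx, s_xy, s_yy) or None if not enough data.

    clusters is a list of (n_events, vx, vy), one entry per active cluster.
    """
    if not clusters:
        return None
    # Keep only clusters with enough events to give a reliable velocity.
    ready = [(vx, vy) for n, vx, vy in clusters if n >= min_events]
    # Require at least half of the clusters, and a minimum number of them.
    if 2 * len(ready) < len(clusters) or len(ready) < min_clusters:
        return None
    k = len(ready)
    mu_x = sum(vx for vx, _ in ready) / k
    mu_y = sum(vy for _, vy in ready) / k
    s_xx = sum((vx - mu_x) ** 2 for vx, _ in ready) / k
    s_yy = sum((vy - mu_y) ** 2 for _, vy in ready) / k
    s_xy = sum((vx - mu_x) * (vy - mu_y) for vx, vy in ready) / k
    return mu_x, mu_y, s_xx, s_xy, s_yy
```

Each returned tuple, paired with the joint velocities read at the same moment, would form one training example for the regressors.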
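Joint velocities and the associated cluster velocities are then used as training examples for SVM regression [23] with an RBF kernel, to learn five regressors: one for each of the two components of the mean velocity and one for each of the three independent entries of the covariance matrix.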
The algorithm is detailed in Algorithm 2.
III-D Independent Motion Classification
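During robot operation, we compare the computed velocity of corner events with the expected distribution predicted by the model given the joint velocities. The Mahalanobis distance is used as a metric of how likely it is that the calculated velocities belong to the ego-motion distribution:

$$d = \sqrt{(\mathbf{v} - \boldsymbol{\mu})^{\top}\,\boldsymbol{\Sigma}^{-1}\,(\mathbf{v} - \boldsymbol{\mu})} \qquad (2)$$

where $\mathbf{v}$ is the computed velocity of a corner event, and $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are the predicted mean and covariance of the ego-motion velocity distribution.
Classification is performed using a distance threshold, such that flow events below the threshold can be assumed to be created by the motion of the robot itself, while those that exceed the threshold can be labelled as independent motion events.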
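The classification step can be sketched for the two-dimensional case as follows. The function names `mahalanobis_2d` and `is_independent` are hypothetical, and the closed-form inverse of the 2×2 covariance matrix is used to keep the example free of external dependencies.

```python
import math


def mahalanobis_2d(v, mu, cov):
    """Mahalanobis distance of velocity v from a 2D Gaussian (mu, cov).

    cov is a 2x2 matrix given as ((a, b), (c, d)).
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx, dy = v[0] - mu[0], v[1] - mu[1]
    # (dx, dy) * inverse(cov) * (dx, dy)^T, with the 2x2 inverse expanded.
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(max(q, 0.0))


def is_independent(v, mu, cov, threshold):
    """Label a flow event as independent motion if it exceeds the threshold."""
    return mahalanobis_2d(v, mu, cov) > threshold
```

In use, `mu` and `cov` would be assembled from the outputs of the learned regressors for the current joint velocities.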
IV Experiments and Results
We characterized and tested the algorithm on data collected from the event camera (ATIS) mounted on the iCub robot. The iCub head has a total of 6 Degrees of Freedom (DoFs): 3 for the neck (pitch, roll and yaw) and 3 for the eyes (tilt, version and vergence). The iCub GazeController [24] was used to move the robot head, controlling both neck and eyes independently. Ego-motion was generated by defining 3D target positions in the environment to gaze at, and different speeds were achieved by specifying the time for the trajectory to be completed.
During the learning phase, in order to get a representative dataset for training, we sampled a static environment selecting targets for the controller distributed in a rectangular area, while changing the times for gazing. The data collected were used to train the ego-motion prediction model.
For the testing phase, we performed two sets of experiments. In the first, we moved the head randomly in the environment at a fixed speed and simultaneously moved the hand around the yaw axis while it held an object (a tea box). This method was used to control the velocity of the object and to obtain the ground truth. The hand moved at different speeds, in order to evaluate the detection response with different velocities of the independently moving object. In the second experiment, we changed the velocity of the head while keeping the velocity of the hand fixed, in order to evaluate the detection response with different velocities of the ego-motion. Fig. 2 shows the experimental setup. We controlled the hand and the head velocities, respectively at and and and .
For all testing datasets, the region-of-interest defining the position of the object that underwent independent motion was labelled by hand. The corner events falling within the region-of-interest formed the ground-truth true-positive detections.
We empirically selected the following parameters: .
Sparse corner flow events were compared to the learned model using the metric defined in Eq. 2. Fig. 3 shows the distance between the velocity vectors and the predicted motion computed according to Eq. 2, grouped into background and independent motion (blue and red line) according to the ground truth. The average distance over for each group is shown. In general, velocity vectors that belong to independent motion exhibit higher distances to the model than background vectors. This indicates the potential for separating ego-motion and independent motion using the proposed method. At some points in the dataset the distance for the independent motion corners drops to a similar level as the ego-motion corners (red and blue lines overlap, for example, between , , and in the last ). This happens as the object stops moving and becomes indistinguishable from the background motion. These points do not correspond to failure in the detection algorithm, but represent an intrinsic limitation that originates from the use of motion to detect the target.
The performance of the algorithm, on an event-by-event basis, as the detection threshold changes, was evaluated in terms of precision and recall, shown in Fig. 4. At high thresholds (i.e. low recall), the precision is , which indicates that a strong “independent motion” response was always present in the system. Such a response is caused by noise in the detection algorithm; however, a precision of may not be required for many robotic applications. The precision is stable over a wide range of thresholds, until recall rate is achieved. We can therefore select a threshold to achieve a precision of . Performance is consistent across different speeds of the target and the iCub head. The algorithm is therefore robust to changes in velocity, and a single threshold can be chosen that remains valid under speed variation.
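The event-by-event evaluation can be sketched as below, assuming that each flow event carries its distance to the model and a ground-truth label from the hand-labelled region-of-interest. The function name `precision_recall` and the convention of returning `None` when a quantity is undefined are choices made for this example.

```python
def precision_recall(distances, labels, threshold):
    """Event-by-event precision and recall for one detection threshold.

    distances: distance of each flow event to the ego-motion model.
    labels: True where the event lies on the independently moving object.
    """
    tp = fp = fn = 0
    for distance, is_object in zip(distances, labels):
        detected = distance > threshold
        if detected and is_object:
            tp += 1
        elif detected and not is_object:
            fp += 1
        elif not detected and is_object:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else None  # no detections made
    recall = tp / (tp + fn) if (tp + fn) else None     # no object events
    return precision, recall


def sweep(distances, labels, thresholds):
    """Precision-recall pairs over a range of thresholds."""
    return [precision_recall(distances, labels, th) for th in thresholds]
```

Raising the threshold lowers the number of detections, so recall can only decrease, while precision depends on how many background events still exceed the threshold.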
Example snapshots of events (accumulated over a short time window and labelled according to the selected threshold) are shown in Fig. 5, along with the corresponding motion distributions (in orientation and magnitude). In Fig. 4(a), corner events on the target fall within the ego-motion distribution, as the target is not moving. Consistent with Fig. 3, a suitable threshold can be selected to separate the independent motion from the background, both when motion magnitudes and orientations are separable and when only one of the two components is separable (for example, a different magnitude with an orientation that falls in the same distribution), as shown in Fig. 5.
Some corner events are not labelled as independent motion even though they belong to the moving object (Fig. 5). This happens mainly when the motion direction changes: in this case, the relative motion between the object and the background approaches zero, and the independent motion is similar to the flow generated by ego-motion. Additionally, during sharp changes in motion, there is a small latency in recovering the clusters and re-evaluating the new velocity.
Finally, we analysed the trajectories traced by corner events labelled as independent motion, grouped according to the ground truth (Fig. 6). Only trajectories along the sensor plane are shown for clarity. Ideally, all corner events within the ground truth region should be classified as independent motion (i.e. in Fig. 5(a)), and all corner events from the background should be ego-motion (i.e. in Fig. 5(b)). Despite the limited recall, a consistent detection is still achieved over time, indicating that a segmentation algorithm could potentially achieve a consistent result. Importantly, false positives are sparse and do not form a coherent pattern, so a simple filter could easily reject them.
In this work, we have presented an event-based independent motion detector using the event camera, which disentangles the independent motion that occurs in the visual scene from the robot ego-motion. As background clutter can induce many additional events (which are irrelevant for certain tasks), this task is crucial for event-driven scenarios where cameras are non-stationary (on a robot). The use of cluster events reduces the data flow to the most informative events, enabling efficient, real-time implementation of many different event-driven vision algorithms for robotics. We detect and track corners in the space of events and learn the correlation between their motion and the robot’s joint velocities when there is no moving object in the scene. We then label corner events whose motion does not agree with the predicted velocity as belonging to independent motion.
We model ego-motion with first-order statistics, relying on the assumption of negligible motion parallax (which depends on the structure of the scene) and negligible motion induced by rotation around the optical axis of the cameras (the head roll), as in [8, 9]. This assumption did not affect the result, as the algorithm was able to detect independent motion with consistent precision as the speed of both the target and the head changed. However, we plan to model and learn the ego-motion using an affine motion model to be robust to head rotations. The detection can be problematic when the object changes direction, as the relative motion with respect to the background approaches zero. However, we do not need dense detection in time, as sparse detections of independent motion can be used as triggering locations for visual tracking. Finally, we show that sparse optical flow can be used effectively to address independent motion detection, thereby reducing the amount of data to process.
We plan to use these sparse detections to segment a moving object in a cluttered scene on the iCub, implementing an event-based attention mechanism driven by the motion of the target, which would facilitate visual tracking.
This research was supported by the Swiss National Science Foundation through the National Center of Competence in Research Robotics.
- [1] G. Metta, L. Natale, F. Nori, G. Sandini, D. Vernon, L. Fadiga, C. von Hofsten, K. Rosander, M. Lopes, J. Santos-Victor, A. Bernardino, and L. Montesano, “The iCub humanoid robot: An open-systems platform for research in cognitive development,” Neural Netw., vol. 23, pp. 1125–1134, 2010.
- [2] C. Posch, D. Matolin, and R. Wohlgenannt, “A QVGA 143 dB dynamic range frame-free PWM image sensor with lossless pixel-level video compression and time-domain CDS,” IEEE J. Solid-State Circuits, vol. 46, no. 1, pp. 259–275, Jan. 2011.
- [3] J. P. Costeira and T. Kanade, “A multibody factorization method for independently moving objects,” Int. J. Comput. Vis., vol. 29, pp. 159–179, 1998.
- [4] M. Irani and P. Anandan, “A unified approach to moving object detection in 2D and 3D scenes,” IEEE Trans. Pattern Anal. Machine Intell., vol. 20, pp. 577–589, 1998.
- [5] R. Roberts, C. Potthast, and F. Dellaert, “Learning general optical flow subspaces for egomotion estimation and detection of motion anomalies,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recog., 2009, pp. 57–64.
- [6] R. Sabzevari and D. Scaramuzza, “Multi-body motion estimation from monocular vehicle-mounted cameras,” IEEE Trans. Robot., vol. 32, pp. 638–651, 2016.
- [7] T. Viéville, E. Clergue, R. Enciso, and H. Mathieu, “Experimenting with 3D vision on a robotic head,” J. Robot. and Auton. Syst., vol. 14, pp. 1–27, 1995.
- [8] S. R. Fanello, C. Ciliberto, L. Natale, and G. Metta, “Weakly supervised strategies for natural object recognition in robotics,” in IEEE Int. Conf. Robot. Autom. (ICRA), 2013, pp. 4223–4229.
- [9] S. Kumar, F. Odone, N. Noceti, and L. Natale, “Object segmentation using independent motion detection,” in IEEE Int. Conf. Humanoid Robot. (Humanoids), 2015, pp. 94–100.
- [10] H. Kim, S. Leutenegger, and A. J. Davison, “Real-time 3D reconstruction and 6-DoF tracking with an event camera,” in Eur. Conf. Comput. Vis. (ECCV), 2016, pp. 349–364.
- [11] H. Rebecq, T. Horstschäfer, G. Gallego, and D. Scaramuzza, “EVO: A geometric approach to event-based 6-DOF parallel tracking and mapping in real-time,” IEEE Robot. Autom. Lett., vol. 2, pp. 593–600, 2017.
- [12] V. Vasco, A. Glover, and C. Bartolozzi, “Fast event-based Harris corner detection exploiting the advantages of event-driven cameras,” in IEEE/RSJ Int. Conf. Intell. Robot. Syst. (IROS), 2016.
- [13] C. Ciliberto, U. Pattacini, L. Natale, F. Nori, and G. Metta, “Reexamining Lucas-Kanade method for real-time independent motion detection: Application to the iCub humanoid robot,” in IEEE/RSJ Int. Conf. Intell. Robot. Syst. (IROS), 2011, pp. 4154–4160.
- [14] C. Ciliberto, S. R. Fanello, L. Natale, and G. Metta, “A heteroscedastic approach to independent motion detection for actuated visual sensors,” in IEEE/RSJ Int. Conf. Intell. Robot. Syst. (IROS), 2012, pp. 3907–3913.
- [15] M. Cook, L. Gugelmann, F. Jug, C. Krautz, and A. Steger, “Interacting maps for fast visual interpretation,” in Int. Joint Conf. Neural Netw. (IJCNN), 2011, pp. 770–776.
- [16] H. Kim, A. Handa, R. Benosman, S.-H. Ieng, and A. J. Davison, “Simultaneous mosaicing and tracking with an event camera,” in British Machine Vis. Conf. (BMVC), 2014.
- [17] G. Gallego and D. Scaramuzza, “Accurate angular velocity estimation with an event camera,” IEEE Robot. Autom. Lett., vol. 2, pp. 632–639, 2017.
- [18] A. Glover and C. Bartolozzi, “Event-driven ball detection and gaze fixation in clutter,” in IEEE/RSJ Int. Conf. Intell. Robot. Syst. (IROS), 2016, pp. 2203–2208.
- [19] T. Delbruck, V. Villanueva, and L. Longinotti, “Integration of dynamic vision sensor with inertial measurement unit for electronically stabilized event-based vision,” in IEEE Int. Symp. Circuits Syst. (ISCAS), Jun. 2014, pp. 2636–2639.
- [20] R. Benosman, C. Clercq, X. Lagorce, S. H. Ieng, and C. Bartolozzi, “Event-based visual flow,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 2, pp. 407–417, Feb. 2014.
- [21] X. Clady, S.-H. Ieng, and R. Benosman, “Asynchronous event-based corner detection and matching,” Neural Networks, vol. 66, pp. 91–106, 2015.
- [22] F. Janabi-Sharifi, V. Hayward, and C.-S. J. Chen, “Discrete-Time Adaptive Windowing for Velocity Estimation,” IEEE Transactions on Control Systems Technology, vol. 8, pp. 1003–1009, 2000.
- [23] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1–27:27, 2011, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
- [24] A. Roncone, U. Pattacini, G. Metta, and L. Natale, “A cartesian 6-DoF gaze controller for humanoid robots,” in Robotics: Science and Systems (RSS), 2016.