Event-based cameras are asynchronous sensors that capture changes in pixel intensity as binary events, at a very high frequency compared to RGB sensors. This makes them suitable for high-speed applications, such as robotics (kim2016real; dimitrova2019towards) and other safety-critical scenarios. The Dynamic Vision Sensor (DVS) (dvs_tobi_2008) is an event camera that, compared to traditional sensors, offers low power consumption, high dynamic range, no motion blur, and microsecond latency.
Unfortunately, because of their asynchronous and binary format, there is no obvious choice of model class for handling DVS data, unlike the predominant use of convolution-based models for RGB images. In this paper, we propose a hybrid deep-learning and differential-equation method for such tasks, inspired by Neural Ordinary Differential Equations (NODE) (Chen2018NeuralOD). NODE pioneered a machine learning approach in which the data is modeled as an ODE in latent space, which can in principle be adapted to process multiple asynchronous inputs.
Most recent works that use machine learning to model DVS data integrate individual events to convert them into formats that existing models accept as input, but in doing so lose precise timing information. The work of (akolkar_2015_question) studies the benefit of using precise temporal event data over aggregated event techniques. In particular, the study states:
"The use of information theory to characterize separability between classes for each temporal resolution shows that high temporal acquisition provides up to 70% more information than conventional spikes generated from frame-based acquisition as used in standard artificial vision, thus drastically increasing the separability between classes of objects."
This provides motivation to research methods that can directly handle asynchronous data.
Summary of contributions.
This work develops a novel real-time online classification model for event-based camera data streams. Moreover, it proposes INODE, an extension of the NODE architecture, which can directly take as input the stream of a possibly high-frequency signal. This can be seen as a continuous-time extension of Recurrent Neural Networks (RNNs). INODE is trained to perform continuous-time event filtering in order to infer classification labels online, based on its hidden state at a given moment. At test time, the classification prediction and the hidden state are updated as each (asynchronous) camera event is received. The event polarity and spatial coordinates are fed directly as inputs to the network without using convolutional layers or event integration. Importantly, we do not process the input data in any form beyond normalization.
Summary of experiments.
We demonstrate that the proposed approach excels in sample efficiency and real-time performance, significantly outperforming several LSTM architectures when short sequences are used during online inference at test time. Furthermore, our method works with raw, noisy camera readings and is also invariant to the camera resolution used to capture the data.
2 Related Works
We review previous works related to our method, first describing alternative approaches to process events and discussing their relative advantages, then briefly introducing NODE methods.
2.1 Learning from event data
Event data from DVS cameras, being asynchronously streamed per sensor array pixel, requires careful processing to be compatible with traditional machine learning models. Methods for handling event data can be, in general, divided into grouped-event-based and per-event-based. The former employ a scheme to integrate multiple events into a single data structure that can be handled by spatially-based (e.g., convolutional) models, while the latter process the data stream on an event-by-event basis. Figure 1 illustrates the main differences between the reviewed works and the proposed approach.
One of the more evident strategies in the grouped-event category is to integrate time windows of data into grayscale intensity images, then apply existing computer vision techniques to these reconstructions. This is used, for example, in optical flow estimation (bardow2016simultaneous), SLAM (kim2016real), and face recognition (barua_2016). Such a process requires various filtering, tracking, and/or inertial measurement integration to properly compute frame offsets. This integration method itself is also the subject of (rebecq2019events), which uses RNNs to obtain usable intensity video from events. The main advantage of these methods is the possibility of directly plugging in existing algorithms on top of grayscale images. This comes at the cost of pipeline buffering (latency) due to event collection over some time window, losing the timestamp information, and potentially needing external IMU integration for long-term odometry.
Many techniques avoid the reconstruction of a full intensity image over a long buffer, but still rely on machine learning methods made for image data, such as Convolutional Neural Networks (fukushima1980neocognitron; lecun1998gradient), and thus require formatting events into a sparse 2D grid structure. This has been applied to optical flow estimation (zhu_ev-flownet_2018; cannici_matrix-lstm_2020), object detection (cannici_asynchronous_2018; cannici_matrix-lstm_2020), and depth estimation (tulyakov2019learning). Various aggregation schemes can be used, such as time-window binning or voxel volumes. Different grid sampling schemes are proposed in (gehrig_end--end_2019) and (cannici_matrix-lstm_2020). The advantage of these methods is compatibility with image-based learning algorithms; the disadvantages are inefficiency over sparse grids and, once again, loss of precise event timings and the delay required to collect frames over time windows.
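As an illustration of the simplest aggregation scheme mentioned above, time-window binning, the following sketch accumulates signed polarities into one 2D grid per window. The `Event` layout, the function name `bin_events`, and the signed-count convention are assumptions made for this example and are not taken from any of the cited works.

```python
from collections import namedtuple

# Hypothetical event layout: timestamp, pixel column, pixel row, polarity.
Event = namedtuple("Event", ["t", "x", "y", "p"])


def bin_events(events, width, height, window):
    """Accumulate signed polarities into one 2D grid per time window.

    `events` must be sorted by timestamp. Polarity is treated as positive
    when p > 0, which covers both the {0, 1} and {-1, +1} conventions.
    """
    if not events:
        return []
    t0 = events[0].t
    n_frames = int((events[-1].t - t0) // window) + 1
    frames = [[[0] * width for _ in range(height)] for _ in range(n_frames)]
    for ev in events:
        k = int((ev.t - t0) // window)  # index of the window this event falls in
        frames[k][ev.y][ev.x] += 1 if ev.p > 0 else -1
    return frames
```

Within a window, all events at the same pixel collapse into a single count, which is where the precise timing is lost, and no frame is complete until its window has elapsed.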
A distinct approach, evaluated on image classification, samples events until they form a connected graph, with a combination of spatial and temporal distances as a measure of edge length (Bi2019Graph). A neural network able to work on graph data (bronstein2017geometric) is then used to process the inputs. The use of spatial graph convolutions addresses the issue of sparsity found in grid-based approaches but still requires collecting data over a time window.
Since event cameras are considered neuromorphic systems, researchers theorized they would go hand-in-hand with a more biologically-grounded model for processing. Spiking Neural Networks (SNNs) (maass1997networks) are a class of neural networks based on human-vision perception principles, asynchronously activating specific neurons. This makes them a theoretical candidate for processing DVS events, one at a time (akolkar_2015; paulun_2018). In their original form, SNNs are non-differentiable and thus incompatible with backpropagation-based training; therefore, most SNN methods require either proxy-based procedures (stromatias2017event) or an approximation of the original SNN formulation (lee2016training). Nevertheless, these models tend to have lower performance than more modern methods.
Another clear choice for event-by-event classification is the RNN (elman1990finding), a neural network specifically designed to handle sequential data. Such models, however, usually assume evenly-spaced series inputs, therefore neglecting one of the main features of DVS sensors. To address this, an extension of the LSTM (hochreiter1997long) architecture, named PhasedLSTM (neil_phased_2016), was devised. This model added time gates to the previous and current intermediate hidden states. These gates open cyclically, modulated by the current input timestamp. PhasedLSTM was tested on event classification, using an embedding for the event coordinates, and showed improved performance over the LSTM on the same task. Note that this is the closest existing method to our own.
2.2 Neural ODEs
NODEs are a recent methodology for modeling data as a dynamical system, governed by a neural network and solved using traditional ODE solvers (Chen2018NeuralOD). Inference is performed using gradient-based optimization through several time steps of the discretized ODE, typically using explicit time-stepping schemes (butcher_runge-kutta_1996). To reduce memory requirements, researchers have proposed using the adjoint method (Chen2018NeuralOD; Gholami2019). NODEs have been applied to the time-series domain (rubanova_latent_2019) by employing an LSTM to preprocess irregularly-spaced samples before feeding them into a NODE solver. This adds flexibility to the original formulation, at the cost of additional parameters and increased processing time. Moreover, there is a high risk that the conditioning network performs most of the inference, leaving the NODE with only an integration task. In this work, we instead consider ODEs with an input connection, similarly to the SNODE architecture in (Quaglino2020SNODE).
3 Input-filtering Neural ODE (INODE)
The proposed approach builds upon the architecture proposed in (Quaglino2020SNODE), with the difference that here we do not focus on improving training efficiency and instead use standard back-propagation through time. We implement a batch Euler ODE solver so that our network can be treated as an RNN. This allows the state to be unmeasured (hidden), as in LSTMs. The result is a recurrent architecture with skip connections that can handle unevenly-spaced points in time. We also add a decoder network as a classifier.
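To make the RNN view concrete, the following is a minimal, dependency-free sketch of an explicit Euler update unrolled over a sequence of inputs. It is not the authors' implementation: the two-layer `tanh` network, the layer sizes, and all function names are assumptions, and the sketch handles a single sequence rather than a batch.

```python
import math
import random


def make_linear(n_in, n_out, rng):
    """Random weights and zero biases for one fully-connected layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(layer, v):
    """Apply a fully-connected layer to the vector v."""
    weights, bias = layer
    return [sum(w * a for w, a in zip(row, v)) + b
            for row, b in zip(weights, bias)]


def f(params, h, x):
    """Hypothetical dynamics network: a two-layer MLP on the concatenation [h, x]."""
    hidden = [math.tanh(a) for a in linear(params[0], h + x)]
    return linear(params[1], hidden)


def euler_step(params, h, x, dt):
    """One explicit Euler update: h_next = h + dt * f(h, x)."""
    dh = f(params, h, x)
    return [h_i + dt * dh_i for h_i, dh_i in zip(h, dh)]


def run_filter(params, h0, inputs, dts):
    """Unroll Euler steps over a stream, one input per step, like an RNN."""
    h, states = list(h0), []
    for x, dt in zip(inputs, dts):
        h = euler_step(params, h, x, dt)
        states.append(h)
    return states
```

The additive form `h + dt * f(h, x)` is the skip connection, and because `dt` may differ at every step, unevenly-spaced inputs need no special handling.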
Input-filtering Neural ODE.
Consider the constrained differential optimization problem,
where is the hidden state, is the input, is the predicted output, is the desired output, the loss is given, and are neural networks with a fixed architecture defined by, respectively, , and which are parameters that have to be learned. The first two equality constraints in (1) define an ODE. Problems of this form have been used to represent several inverse problems, for instance in machine learning, estimation, filtering and optimal control (stengel_optimal_1994; law2015data; Ross2015). Since this architecture can act as a general filter for the input signal, , we refer to it as the Input-filtering Neural ODE (INODE). We consider this as a general framework for handling event data in a machine-learning scenario.
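For reference, a problem of this type can be written as follows. The symbol names ($h$ for the hidden state, $x$ for the input, $\hat{y}$ and $y$ for the predicted and desired outputs, $L$ for the loss, $f$ and $g$ for the networks with parameters $\theta_f$ and $\theta_g$, $h_0$ for the initial state, and $[t_0, t_1]$ for the time horizon) are assumptions chosen to match the roles listed above and may differ from the original notation.

```latex
\begin{aligned}
\min_{\theta_f,\,\theta_g}\quad & \int_{t_0}^{t_1} L\big(\hat{y}(t),\, y(t)\big)\, dt \\
\text{s.t.}\quad & \dot{h}(t) = f\big(h(t),\, x(t);\, \theta_f\big), \\
& h(t_0) = h_0, \\
& \hat{y}(t) = g\big(h(t);\, \theta_g\big).
\end{aligned}
```

In this form the first two constraints (dynamics and initial condition) define the ODE, and the third maps the hidden state to the prediction.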
Application to DVS cameras.
We propose to use INODE to build a system that predicts (labels) online by filtering a live-stream of DVS-camera events. The aim is to learn the ODE in problem (1), given short excitation event sequences . Ideally, this model should produce the fastest trajectory from the initial state to an appropriate (unknown) state such that , where serves as a classification layer and are the labels to be predicted. Hence, we fix the target to , .
Events are high-frequency signals, and solving a high-frequency ODE is difficult. Event streams are also extremely dense: the time between events is, in general, very small (often s). We propose the use of a sample-and-hold approach, where events are held constant for up to a maximum delta-time . In the rare case that no events occur after , then we simply wait for the next event and hold the previous result without running the forward pass.
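One way to read the sample-and-hold rule is that each event is held as a constant input until the next event arrives, while the ODE is integrated over at most the maximum delta-time; for any remaining gap the state and prediction are left unchanged. The sketch below computes the resulting integration lengths under that reading. The names `hold_steps` and `dt_max` are illustrative and not taken from the paper.

```python
def hold_steps(timestamps, dt_max):
    """Integration length for each held event.

    Each event is held until the next one arrives, but the ODE is advanced
    by at most dt_max; beyond that the previous result is kept as-is.
    """
    return [min(t_next - t, dt_max)
            for t, t_next in zip(timestamps, timestamps[1:])]
```

For example, with timestamps `[0, 3, 4, 20]` and `dt_max = 5`, the last gap of 16 is capped at 5, so the state is advanced by 5 and then simply held until the next event.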
A neuromorphic dataset is a collection of events , where is the number of events considered for a given sample (typically on the order of thousands), and labels for classes. A digit is represented by a tuple and the dataset by , where is the number of samples. Thus, the integral in (1) is discretized for each sample using a subset of size evaluation points as:
where is the cross-entropy loss. For each evaluation point, a new input event is used, i.e., . Finally, the sample loss is averaged over the dataset and used for optimization.
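The discretized objective can be sketched as below, under two assumptions that the text does not settle: the per-step cross-entropy terms are combined by an unweighted sum, and the same class label is the target at every evaluation point. Function names are illustrative.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of softmax(logits) against an integer class label."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]


def sample_loss(logits_per_step, label):
    """Sum the per-step losses; the label is the target at every step (assumed)."""
    return sum(cross_entropy(logits, label) for logits in logits_per_step)


def dataset_loss(batch):
    """Average the sample losses over (logits_per_step, label) pairs."""
    return sum(sample_loss(lp, y) for lp, y in batch) / len(batch)
```

Because a loss term is attached to every evaluation point rather than only the last, the optimization favors trajectories that reach a correctly classified state after as few events as possible.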
Time step normalization.
To accurately use the time-steps , they can be normalized to values smaller than one (timestamps are recorded in microseconds and thus quickly reach very large values). At the same time, should not be very small to avoid optimization issues, such as vanishing gradients. We compute
from the raw time-steps and divide by the 98th percentile from the empirical distribution of for each training dataset, pre-computed and fixed, with an upper threshold at 1. The normalized step is . The complete training procedure is summarized in Algorithm 1.
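The normalization described above can be sketched as follows: the 0.98 quantile of inter-event gaps is computed once on the training data, then every gap is divided by it and clipped at 1. The linear-interpolation quantile and all function names are assumptions; the text does not specify how the quantile is estimated.

```python
def quantile(values, q):
    """Empirical quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def fit_scale(train_timestamps, q=0.98):
    """Pre-compute the q-quantile of inter-event gaps over all training sequences."""
    gaps = [b - a for seq in train_timestamps for a, b in zip(seq, seq[1:])]
    return quantile(gaps, q)


def normalize_steps(timestamps, scale):
    """Divide each gap by the fixed scale and clip the result at 1."""
    return [min((b - a) / scale, 1.0)
            for a, b in zip(timestamps, timestamps[1:])]
```

Using a high quantile rather than the maximum keeps a few very long gaps from compressing typical steps towards zero, while the clip bounds those outliers at 1.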
We consider multiple classification tasks to validate our method, benchmarking against LSTM variants. In all of them, we learn from short event subsequences (up to 100 events). Performance is evaluated with the same number of events used during training. This allows for potential real-time classification (when properly optimized), as inference time increases with the number of events processed.
We use the same configurations, architectures, and hyper-parameters for all of the datasets and model variants. We train all models with different levels, where is the fraction of the training dataset used for training. For each sequence, we sample a random offset and the corresponding sub-sequence of length . In all of the experiments we set . We then use this sub-sequence as input for the model with batch size .
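The sampling procedure can be sketched as below. The helper names, the handling of sequences shorter than the requested length, and the use of a seeded `random.Random` are assumptions made for illustration.

```python
import random


def sample_subsequence(events, length, rng):
    """Pick a random offset and return the contiguous window of `length` events."""
    if len(events) <= length:
        return list(events)  # assumed fallback: use the whole sequence
    offset = rng.randrange(len(events) - length + 1)
    return events[offset:offset + length]


def subsample_dataset(samples, fraction, rng):
    """Keep a random fraction of the training samples (at least one)."""
    keep = max(1, int(round(fraction * len(samples))))
    return rng.sample(samples, keep)
```

Drawing a fresh offset each time a sequence is visited exposes the model to different portions of the same recording, which matches the test-time setting where classification may start at an arbitrary point in the stream.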
At test time, we consider different scenarios: a standard case, where the models are evaluated with on the test set, and more challenging ones, in which they are evaluated with short sub-sequences in the range .
We first compare INODE against LSTM and bidirectional LSTM (bi-LSTM). The LSTMs and bi-LSTMs receive the event time-step as additional input. We consider three bi-LSTM models with hidden states of dimension . The has approximately the same capacity as INODE, while is 3x larger.
We also consider a variant of LSTM, the PhasedLSTM (neil_phased_2016), without coordinate-grid embedding. This model explicitly handles asynchronous data by learning an additional phase gate. According to the authors, such an approach is fruitful for long sequences (>1000 steps), in which the phase gate can exploit periodic mechanisms in the data. Given our use case of short event sequences (<100), we do not expect improvements over a standard LSTM. To the best of our knowledge, this is the only known method which, like ours, inherently handles asynchronous timing within the model and does not need to learn an external transition model. Unfortunately, our initial results with standard PhasedLSTM were rather poor. However, combining phased and bidirectional LSTM seemed promising. We denote this as P-bi-LSTM.
The number of states, parameters, and input features for each model are summarized in Table 1.
| model | n states | n params | input |
We consider three neuromorphic datasets:
1) NMNIST
The NMNIST dataset (orchard2015converting) is a neuromorphic version of MNIST. It is an artificial dataset, generated by moving a DVS sensor in front of an LCD monitor displaying static images. It consists of 60k training samples and 10k test samples, for 10 different digits on a grid of 34 × 34 pixels. We consider only the first 2,000 (of potentially up to 6,000) events for each sequence. We neither stabilize the events spatially nor attempt to remove noisy events, which are options available in the dataset.
2) ASL (12-16k)
The ASL-DVS dataset is a neuromorphic dataset obtained from a stream of real-world events (Bi2019Graph). It consists of around 100k samples for 24 different letters from the American Sign Language, with spatial resolution 180 × 240. Its sequences range from 1 to 500k events, with the length distribution peaking in the 12-16k range. To avoid inconsistencies, we consider a subset containing only samples with a number of events between 12k and 16k. The resulting dataset contains 12,275 training samples plus 1,364 test samples.
3) NCALTECH
Similarly, the NCALTECH dataset (orchard2015converting) is the neuromorphic version of CALTECH101, produced in the same fashion as NMNIST. It consists of 100 heavily unbalanced classes of objects plus a background, with spatial resolution 172 × 232. The dataset contains 6,634 training samples and 1,608 test samples, after removing the background images. As with NMNIST, we again avoid stabilizing/denoising the images.
We train each model using ADAM for 300 epochs, with and learning rate of 1e-3. The batch size is for NMNIST, and for the other datasets. We consider a simple multi-layer perceptron for:
where denotes the concatenation operation, FC is a fully-connected layer, and is the activation.
| output dim | 128 | 128 | 128 | 30 | n classes |
| model | dataset % | n events test |
When testing the models, we vary both the size of the training dataset and the number of test events used for the classification (). The former is used to show INODE’s learning efficiency when using a small amount of training data, while the latter demonstrates INODE’s real-time scenario usability. Tables 3, 4, and 5 report accuracies for each of our datasets.
The LSTM with 164 states outperforms the proposed architecture on NMNIST (see Table 5). On the ASL dataset (Table 4), our approach consistently outperforms all of the unidirectional baselines by a margin of 20%. We believe this is important since, among the considered datasets, ASL contains by far the most realistic data, being the only one not generated from static images. For NCALTECH, our approach is either on par with or better than the LSTM when a small percentage of events is used (Table 3).
For the bidirectional baselines with approximately the same capacity ( and ), INODE performs better than the bi-LSTMs on all of the datasets. When the baseline capacity is increased (), INODE performs better on NCALTECH and ASL, while slightly losing its edge to the on NMNIST. Decreasing the training-set size has essentially no impact on NMNIST for all models, which confirms that it is a relatively simple dataset.
One can also notice that, with a couple of exceptions on NMNIST, INODE outperforms the bidirectional methods regardless of the number of input events. These are as low as and in principle even is possible without modifying our approach. Interestingly, with only 10 events, the model can correctly classify NMNIST digits about half of the time. This demonstrates INODE’s ability to extract information under exceptional sparsity and data unavailability. This could be extremely important in scenarios such as collision avoidance and human-machine interaction, where safety is paramount.
| model | dataset % | n events test |
| model | dataset % | n events test |
This paper presents a novel approach for performing machine learning from event-camera streams. The proposed INODE model is devised to handle high-frequency event data, inherently making use of the precise timing information of each individual event, and does not require processing the raw data into different formats.
We compared the approach to LSTM baselines on multiple DVS camera-based classification tasks. On the ASL task, INODE significantly outperforms the baselines in fewer epochs. The network gains marginal predictive power as the complexity of the dataset increases or as the amount of data decreases. The baselines deliver better performance only for simple datasets (NMNIST) and when a large amount of data is available (NCALTECH).
INODE excels in the most realistic scenarios, when little training data and few events are available. This makes it suitable for real-time, low-computation settings where decisions must be taken with only a few events, such as collision avoidance and high-speed object recognition.
The authors are grateful to Christian Osendorfer for his valuable input and feedback and to everyone at NNAISENSE for contributing to an inspiring R&D environment.
Appendix A Learning Dynamics
The learning curves and online inference trajectories for the proposed method and the bidirectional LSTM baselines are depicted in Figures 5, 7 and 6 for, respectively, the NMNIST, NCALTECH and ASL datasets, and in Figures 8, 10 and 9 for the LSTM baselines. On ASL, our method consistently outperforms the baselines by a large margin.