V2E: From video frames to realistic DVS event camera streams

06/13/2020 ∙ by Tobi Delbruck, et al. ∙ Universität Zürich

To help meet the increasing need for dynamic vision sensor (DVS) event camera data, we developed the v2e toolbox, which generates synthetic DVS event streams from intensity frame videos. Videos can be of any type, either real or synthetic. v2e optionally uses synthetic slow motion to upsample the video frame rate and then generates DVS events from these frames using a realistic pixel model that includes event threshold mismatch, finite illumination-dependent bandwidth, and several types of noise. v2e includes an algorithm that determines the DVS thresholds and bandwidth so that the synthetic event stream statistics match a given reference DVS recording. v2e is the first toolbox that can synthesize realistic low light DVS data. This paper also clarifies misleading claims about DVS characteristics in some of the computer vision literature. The v2e website is https://sites.google.com/view/video2events/home and code is hosted at https://github.com/SensorsINI/v2e.


I Overview

Fig. 1A illustrates how a dynamic vision sensor (DVS) event camera outputs a stream of brightness change events [14]. Each pixel holds a memorized brightness value (log intensity value) and continuously monitors if the brightness changes away from this stored value by a critical event threshold (see Fig. 1A). If so, the pixel asynchronously outputs either an ON or OFF brightness change event and then the pixel memorizes the new brightness value. Fig. 1B shows that a DAVIS event camera concurrently outputs both DVS events (the spiral cloud of points) and standard global-shutter active pixel sensor (APS) intensity frames (background image).

The high dynamic range, high time resolution, and fast, sparse output make DVS attractive sensors for machine vision under difficult lighting conditions and with limited computing power. Since the first DVS cameras, several subsequent generations of DVS-type event cameras have been developed; see [4, 16, 20, 9] for surveys. However, they all share the common characteristic of the original DVS in outputting a stream of brightness change events.

Fig. 1: A: Concept of DVS event camera pixel response; B: DAVIS frame + event camera data from 100 Hz spinning dot.

With the growing commercial development of event cameras and the widespread growth of deep learning, there has come the need for datasets for developing and testing algorithms and for training DVS networks. There are some very useful DVS datasets (see [28] for a good list), but there are far fewer DVS datasets than frame-camera datasets.

It would be useful to be able to synthesize realistic DVS datasets from the vast number of conventional camera datasets and from virtual scenes. However, DVS pixels offer much higher time resolution than most standard video. DVS sensors are asynchronous, which means an event can be triggered at any time. Most DVS cameras quantize the time to microseconds, but real-world DVS event timing jitter is on the order of 100 us, and under low light conditions can become as large as milliseconds. Even with this jitter, in most cases of interest to the event camera community, standard video frame intervals ranging from 10 ms to 100 ms are too large to generate the timing precision of DVS events.

Debunking myths of event cameras: Computer vision papers about event cameras have made rather misleading claims such as “Event cameras [have] no motion blur” and have “latency on the order of microseconds” [22, 21, 17], which were perhaps fueled by the titles (though not the content) of papers like [14, 2, 23]. Review papers like [9] are more accurate in their descriptions of DVS limitations, but are not very explicit about the actual behavior. DVS cameras must obey the laws of physics like any other vision sensor: They must count photons. Under low illumination conditions, photons become scarce and therefore counting them becomes noisy and slow. v2e is aimed at realistic modeling of these conditions, which are crucial for deployment of event cameras in uncontrolled natural lighting. Sec. II discusses the reality of DVS pixel operation under natural lighting conditions.

I-A Prior work

The first emulation of the DVS from standard cameras used a 200 Hz frame rate Sony PlayStation Eye-2 (PS2-Eye) web camera to synthesize DVS events with the 5 ms time resolution of the PS2-Eye [13]. It reported a simple model of DVS pixel operation that generated DVS events from the camera intensity samples. The Event Camera Dataset and Simulator software toolbox [18, 29] and the newer ESIM [21, 30] are great contributions to the event camera community because they enable generating synthetic DVS events from synthetic video, e.g. from Blender scenes or image datasets.

An extension to ESIM called rpg_vid2e drives ESIM from interpolated video frames [10]. rpg_vid2e uses the same idealized model of DVS pixels as ESIM. ESIM assumes that the DVS pixel bandwidth is at least as high as the upsampled video rate, that there is no temporal noise and no leak events, and that the threshold mismatch is uniformly distributed; none of these assumptions holds in practice. Since synthetic slow motion requires sharp, high quality source frames, rpg_vid2e allows the simulation of idealized DVS pixels under good lighting, but not the simulation of real DVS pixels under bad lighting, which is their most important use case. v2e is a step towards this aim. By enabling explicit control of the noise and nonideality ‘knobs’, v2e enables the generation of synthetic datasets covering a range of illumination conditions.

I-B Contributions and outline

The main contributions of v2e are

  1. a detailed description of the operation of the DVS pixel, together with the effects of so-called biases and the behavior of DVS pixels under low illumination;

  2. a demystification of claims in the computer vision literature about DVS latency and lack of motion blur;

  3. the first DVS pixel model that includes temporal noise, leak events, finite bandwidth, and Gaussian threshold distribution;

  4. a method for automatically estimating the correct DVS temporal contrast thresholds from a recording from a DAVIS frame+event camera;

  5. a software toolbox that generates from any high quality video file source a stream of realistic DVS data, which opens the possibility of developing applications that work better under the broad range of lighting conditions for which DVS are suited.

The rest of this report starts with Sec. II, which explains DVS pixel operation with a focus on the effect of biases and low light operation. Then Sec. III explains the v2e steps for generating synthetic DVS events; in particular Sec. III-A describes how v2e upsamples source video to obtain luma, and Sec. III-B explains the steps of generating events. Sec. IV explains and shows results from a simple algorithm to estimate correct DVS temporal contrast threshold values. Sec. V shows results of an experiment to use v2e to model normal and low light DVS output. The report concludes with Sec. VI, which discusses v2e and provides tips for its use.

II DVS pixel operation and biases

Fig. 2 shows the DVS pixel circuit. The continuous-time process of generating events is illustrated in Fig. 2F. The input intensity generates a continuous photoreceptor output voltage. The change amplifier produces an inverted and amplified output voltage; when this output crosses either the ON or OFF threshold voltage, the pixel emits an event (via a shared digital output that is not shown). The event reset (Fig. 2E) memorizes the new log intensity value across the capacitor C. Fig. 2 shows how the pixel bias currents affect its bandwidth, thresholds, and refractory period.

Fig. 2: Principle of operation of the DVS pixel. The operating principle is illustrated in F. The photoreceptor voltage follows the input photocurrent. The bias currents of photoreceptor A and its source-follower buffer B determine the pixel analog bandwidth, i.e. how quickly the photoreceptor output follows the photocurrent. The pixel change event thresholds, in natural log units of intensity change, are set by the ratios of bias currents in the change amplifier C and the ON and OFF comparators D. The log intensity brightness value is memorized on the pixel capacitor. The pixel's refractory period, the dead time after an event is generated during which the change amplifier ignores changes, is set by the bias current in circuit E. G: The junction and parasitic photodiode photocurrent in the PN junction DL causes a low rate of "leak" events that appear as ON events. The leak event rate increases with intensity as more parasitic photocurrent is generated in DL. This figure was adapted from [14, 19]; see these papers for details.

The logarithmic response of the photoreceptor comes from the exponential I-V relationship in the feedback transistor, whose source conductance is proportional to the photocurrent. Since the resistance increases when the photocurrent decreases, the gain also increases as the current decreases, which provides the gain control we want. But it also means that the smaller the photocurrent, the longer the photoreceptor time constant, which is set by this resistance together with the photodiode capacitance. These dynamics are illustrated in Fig. 3 and discussed below.

Event generation is controlled by the DVS pixel bias configuration. ‘Bias’ refers to the so-called bias currents in the pixel circuits. These biases control the pixel event threshold, analog bandwidth, and refractory period between events. These parameters can be estimated from the bias currents, which are in turn set by the bias current generator digital configuration, using the formulas in [19]. The User Friendly control panel in jAER [8, 27] shows users this computed estimate of the DVS event threshold.

DAVIS event cameras concurrently output conventional intensity frames and events from the same pixels [2], which allows us to estimate some DVS parameters from matching data from the frames and events. The method of Sec. IV estimates the event generation thresholds of DVS pixels given a reference frame+event DAVIS recording so that the first-order event count statistics of real and synthetic streams match.

DVS under low lighting: Fig. 3 shows a realistic simulation of what happens in DVS pixels under extremely low illumination conditions when the pixel sees a grating passing by. The grating consists of alternating gray and white sections. Even in the absence of light, the photodiode has a dark current that flows all the time. During the initial moderately bright cycles, the signal photocurrent is much larger than the dark current and the bandwidth of the photoreceptor is fast enough to follow the input current fluctuations. The contrast of the signal was set to 2 so that the white part of the grating produced twice the photocurrent compared to the gray part. The pixel makes about 5 events for the rising and falling edges (the change threshold was set to 0.1 units), but these are spread over time by the rise and fall time of the photoreceptor signal. Suddenly the pixel goes into the shadowed very dark section and the overall illumination is reduced by a factor of 10. Here the contrast of the signal is unchanged (the reflectance of the scene is the same as before), but now the signal photocurrent has become comparable to the photodiode dark current, thus reducing the actual contrast of the current fluctuations. The current is also so small that the bandwidth reduces to the point that the photoreceptor can no longer follow the input current fluctuations. Both effects reduce the number of generated brightness change events (to about 2 per edge) and increase their timing jitter. v2e models these effects to produce realistic low light synthetic DVS events.

Fig. 3: Simulated DVS pixel photoreceptor and resulting ON and OFF events under moderate and extremely low illumination. Both photocurrent and dark current include shot noise which is proportional to the mean current. See [5] for python code.
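The behavior in Fig. 3 can be reproduced qualitatively with a few lines of code. The sketch below is not the code from [5]; the constants (dark current, noise scale, time-constant scaling, threshold) are illustrative assumptions. It combines the three ingredients that matter under low light: shot noise proportional to the mean current, a photoreceptor lowpass whose time constant grows as the photocurrent shrinks, and threshold crossings on the log of the filtered signal.

```python
# Qualitative sketch of a DVS pixel photoreceptor under moderate and low light.
# All constants are illustrative, not the values used for Fig. 3.
import numpy as np

rng = np.random.default_rng(0)

dt = 1e-4                  # simulation time step [s]
t = np.arange(0, 2, dt)    # 2 s of simulated time
dark_current = 0.01        # arbitrary units; sets the low-light floor

# 2:1-contrast grating; overall illumination drops 10x halfway through
grating = np.where(np.sin(2 * np.pi * 5 * t) > 0, 2.0, 1.0)
illumination = np.where(t < 1.0, 1.0, 0.1)
photocurrent = grating * illumination + dark_current

# shot noise with variance proportional to the mean current
noisy_current = photocurrent + rng.normal(0.0, np.sqrt(photocurrent) * 0.05)

# first-order photoreceptor lowpass: time constant grows as the photocurrent shrinks
tau = 0.002 / photocurrent              # illustrative scaling [s]
v = np.empty_like(noisy_current)
v[0] = noisy_current[0]
for i in range(1, len(t)):
    alpha = dt / (tau[i] + dt)
    v[i] = v[i - 1] + alpha * (noisy_current[i] - v[i - 1])

# DVS change detection on the log of the filtered signal
theta = 0.1                             # event threshold in natural log units
log_v = np.log(v)
mem = log_v[0]
events = []                             # list of (time, polarity)
for i in range(len(t)):
    while log_v[i] - mem >= theta:
        mem += theta
        events.append((t[i], +1))
    while log_v[i] - mem <= -theta:
        mem -= theta
        events.append((t[i], -1))

print(f"generated {len(events)} events")
```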

Motion blur: It should be obvious from Fig. 3 that a DVS pixel does not respond instantly to an edge: The finite aperture of the pixel results in a linear change of current as the edge passes over the pixel. Then the finite bandwidth of the photoreceptor can blur the edge even more. The transition from one brightness level to another is like the response of an RC lowpass filter. The bigger the step, the longer it takes for the pixel to settle to the new brightness value. The result is that a passing edge produces an extended series of events as the pixel settles down to the new value. This finite response time over which the pixel continues to emit events is the equivalent "motion blur" of DVS pixels. The v2e website includes videos of high speed motion that make these effects clear. Playing with a 3D space-time display of the events such as the jAER SpaceTimeEventDisplayMethod shows that under good lighting, it is possible to line up the events by looking at the event cloud from the right angle. This notion was used for years in algorithms computing optical flow from DVS events, and was formalized more recently in a series of papers from the computer vision community. But under lower illumination conditions, it might not be possible to disambiguate the blur caused by motion from the blur caused by finite response time, which spreads the events caused by an edge moving over a pixel over a significant period of time. Typical values under bright indoor illumination show that the pixel motion blur is on the order of 1 ms. This is much faster than most imagers, but it is still finite. Under very low illumination conditions, the equivalent motion blur of DVS can extend for tens of milliseconds.

Latency: Quick response time is a clear advantage of DVS cameras, and they have been used to build complete visually servoed robots with total closed loop latencies of under 3 ms [6, 3]. With direct hardware connection to the DVS output (no host computer interface), it should be possible to be even quicker. But it is important to realize the true range of achievable response latency. High-speed USB computer interfaces impose a minimum latency of a few hundred microseconds [6] and software stacks like ROS can increase the average latency to many milliseconds.

Added to these computer and operating system latencies are the DVS sensor chip latencies, which are illustrated in Fig. 4 with real DVS data. In this experiment, we recorded the response latency to turning off a blinking LED. We chose the OFF edge because the starting bandwidth is higher for this condition compared to the ON edge, where the pixel starts with lower bandwidth. The horizontal axis is in units of illumination in lux (visible photons) using two scales: The upper scale is for chip illumination, and the lower scale is for scene illumination, assuming 20% scene reflectance and a relatively fast f/2.8 lens aperture [7]. Typical scenarios are listed below the scene illuminance axis. The DVS was biased in two different ways: The “nominal biases” setup used settings that are meant for everyday use of the DVS where quickness is not critical. With these settings, the DVS pixel bandwidth is limited by the photoreceptor and source follower biases and thus the DVS latency is only a soft function of intensity. This choice limits noise at low light intensities. The “biased for speed” setup used much higher bias currents for the photoreceptor and source follower to optimize the DVS for the quickest possible latency, with the tradeoff of additional noise from shorter integration time. With this setup, we can see that the latency decreases as the reciprocal of intensity at low intensities. We can see from this data that typical users of DVS will experience real world latencies on the order of one to a few ms, with latency jitter on the order of 100 us to 1 ms. The absolute minimum latency is reported in the paper as a figure of merit for such sensors (as is customary in the electronics community), but it clearly does not reflect real world use. The v2e lowpass filtering and noise models (Sec. III-B) model these effects.

Fig. 4: Real DVS latency measurements of the response to turning off a blinking LED. A: definition of response latency. B: measured data. Adapted from [14], with scene illumination axis added based on [7]. See text.

Noise: There exists no complete theory of DVS pixel noise, but circuit operation and observations of noise break it down to two main contributions: Leak events (Secs. II and III-B), which are caused by reset switch leakage [19], and temporal noise, which is caused by the random reception of photons and random flow of electrons. Both effects are easily observed in DVS output.

To illustrate how temporal noise is affected by illumination, bandwidth, and DVS event threshold, we did the simple experiment illustrated in Fig. 5. We aimed a DAVIS346 camera at a white wall, where we mounted a black rectangle. We then recorded 1 minute of data from the still scene under various conditions. An ideal DVS would produce zero output for all these conditions, but accumulating the output for sufficient time shows that there are plenty of noise events. Overall, the data shows that:

  • Leak events dominate under ‘normal’ indoor lighting conditions of a few hundred lux, with nominal thresholds and pixel bandwidth.

  • Significant temporal noise is caused by some combination of low event threshold, high bandwidth, and low light intensity.

Temporal noise can be controlled by some combination of lower bandwidth or higher event threshold. v2e models both of these noise sources. Simple, low-computational-cost correlation-based noise filters, e.g. the jAER BackgroundActivityFilter (BackgroundActivityFilter.java on the jAER GitHub) reported in [8], are also very effective at removing the noise as long as the rate is not excessive.

Fig. 5: Real DVS noise observations. The accumulation time and full-scale count FS are in each DVS histogram sub-figure. A: DAVIS346 still image of scene. The contrast between white and black parts of the scene is about 10:1. DN/ms is the exposure in digital numbers (DN) per millisecond. B: Leak and noise events under indoor illumination with wide open lens. ON and OFF event thresholds were set to about 0.24. There is no temporal noise (there are only leak ON events) and the leak event rate is about 0.1 Hz. The inset shows the frequency distribution of event rates for ON and OFF events. The axes are frequency of occurrence across the pixel array versus log event rate in Hertz. C: The lens aperture was closed to reduce the intensity by a factor of 35X. Now in the dark part of the scene, temporal noise has increased the event rate to above 0.5 Hz, but the bright part is not greatly affected. D: With the same dark conditions, increasing pixel photoreceptor and source follower bias by 30X increases the temporal noise by a factor of about 20X. ON events are green and OFF events are red. E: Same setup as C, but decreasing the bandwidth by 10X reduces the noise to about the starting levels. F: Same setup D, but now increasing threshold to 0.3 decreases noise significantly, but it is still higher than A conditions.

III Method

Fig. 6 shows the steps of the DVS emulation starting from RGB pixel intensity samples. These steps are explained in the following sections.

If v2e is used to process synthetic video, the video can be generated at an arbitrary frame rate. But in the case of standard camera video, v2e first optionally upsamples the base frame rate to a new target frame rate (e.g., 1000 fps) using a state-of-the-art video interpolation network (Super-SloMo [12]) that we trained to optimally estimate intensity values. Super-SloMo interpolates frames between each pair of source frames so that the output frame rate equals the source frame rate multiplied by the chosen upsampling factor. (Section III-A)

Secondly, after the intensity changes are estimated and represented by a high frame rate set of frames, the event stream is produced according to a model of the DVS pixel event generation mechanism (Section III-B).

Iii-a Light Intensity Estimation

v2e starts from a source video consisting of a sequence of frames. The video has a base frame rate such as 30 Hz. The event generation depends on the light intensity changes.

Color to luma conversion: If v2e starts from a color video, it automatically converts the input color frames into luma frames, using the ITU-R recommendation BT.709 digital video (linear, non-gamma-corrected) color space conversion [31] in (1):

Y = 0.2126 R + 0.7152 G + 0.0722 B    (1)

In (1), R, G, and B are the values of the three channels of an RGB pixel, and Y is the output source luma value. When the video consists of grayscale frames, the pixel value is treated as the luma value.

After conversion to luma, frames are scaled to the desired output height and width in pixels.
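For reference, a minimal sketch of this luma conversion step, assuming numpy arrays for frames; the function name and frame size are illustrative, and the actual v2e implementation may differ (e.g. in how it resizes frames):

```python
# Sketch of the BT.709 luma conversion in (1); illustrative, not the exact v2e code.
import numpy as np

def rgb_to_luma_bt709(frame_rgb: np.ndarray) -> np.ndarray:
    """Convert an HxWx3 RGB frame (linear, non-gamma-corrected) to luma."""
    r, g, b = frame_rgb[..., 0], frame_rgb[..., 1], frame_rgb[..., 2]
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

frame = np.random.randint(0, 256, size=(260, 346, 3)).astype(np.float32)
luma = rgb_to_luma_bt709(frame)   # grayscale frames are used as luma directly
```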

Synthetic slow motion: The luma frames are then interpolated by a synthetic slow motion network to go from the source frames to upsampled frames. We adopted the excellent Super-SloMo video interpolation network from [12] (https://github.com/avinashpaliwal/Super-SloMo). Super-SloMo takes consecutive luma frames as input and produces the backward and forward optic flow. Then the bi-directional optic flow vectors are linearly interpolated to an arbitrary time between the two input frames. Finally, the interpolated frame is produced by warping both input frames with the predicted bi-directional optic flow.

To estimate grayscale flow, we forked the official implementation and retrained it on the Adobe240FPS [25] dataset after converting its RGB frames to luma frames. That way, our Super-SloMo network can interpolate frames from grayscale frame cameras like DAVIS.

The user chooses the upsampling ratio to produce DVS events with the desired time resolution. For example, if the source video is at 30 FPS and the upsampling ratio is 10, then the DVS events will have a time resolution of 1/300 s, i.e. about 3.3 ms. (v2e does not adaptively gearshift the upsampling ratio like [10], because many algorithms might break from potentially large quantization of the DVS event timing.)

Iii-B Generating synthetic DVS events from video

For a real DVS sensor [4, 14], an event is triggered when the magnitude of the change of log intensity from the pixel’s memorized value exceeds the threshold.

Linear to logarithmic mapping: The next step is to generate events from frames. Standard digital video usually represents intensity linearly, but DVS pixels detect changes in log intensity. We adopted the method we developed in [13], which modeled a live DVS from a high frame rate Sony PlayStation web camera. To model the DVS sensor, we first convert the light intensity value from linear to logarithmic scale using a lin-log mapping. Eq. (2) converts the digital number (DN) luma intensity sample Y into its logarithmic value L:

L = Y · ln(5)/5 for Y < 5 DN,  L = ln(Y) for Y ≥ 5 DN    (2)
Fig. 6: Steps of the v2e DVS event generation.

This mapping is illustrated in Fig. 6. Standard cameras typically have a linear response to light intensity and a maximum dynamic range of about 60 dB (a factor of 1000). Automotive cameras extend dynamic range to above 100 dB (a factor of 100,000) by using strategies like multiple sampling during the exposure period. Dark pixels output small DNs, even down to DN 0. By default, computer vision uses 8-bit values, which limits the dynamic range to a factor of 256 (about 48 dB). To deal with this limited range and quantization, the mapping in Eq. (2) is a piecewise linear-logarithmic function, since the logarithm is sensitive to small values near zero. For example, the logarithmic change from DN 0 to DN 1 is infinite, and the change from DN 1 to DN 2 is a factor of 2. These huge changes of log intensity could create a huge number of unrealistic noise events. Therefore, for the range less than DN 5, we use a linear mapping from exposure value (intensity) to log intensity. The mapping is joined at 5 DN. The maximum output value is ln(255) ≈ 5.5.

This lin-log mapping limits the smallest event threshold contrast that can be detected in the linear region, for both ON and OFF events. The linearizing part of the conversion function means that small DNs are converted linearly, reducing noise in the synthetic DVS output.
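A minimal sketch of the lin-log mapping described above: the 5 DN join point follows the text, while the linear slope ln(5)/5 is an assumption chosen here so the two pieces meet continuously, and the function name is illustrative.

```python
# Sketch of the piecewise linear-logarithmic mapping (2); illustrative, not the exact v2e code.
import numpy as np

def lin_log(luma_dn: np.ndarray, threshold: float = 5.0) -> np.ndarray:
    """Map luma in digital numbers (DN) to log intensity, linear below `threshold`."""
    luma_dn = luma_dn.astype(np.float64)
    lin = luma_dn * (np.log(threshold) / threshold)   # linear piece for small DN
    log = np.log(np.maximum(luma_dn, 1e-9))           # logarithmic piece
    return np.where(luma_dn < threshold, lin, log)

print(lin_log(np.array([0, 1, 5, 128, 255])))  # maximum output value is ln(255) ~ 5.54
```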

Finite intensity-dependent photoreceptor bandwidth: Since the real DVS pixel has finite analog bandwidth, an optional low-pass filter filters the input values. This cutoff models the DVS pixel response under low illumination as discussed in Sec. II.

DVS pixel bandwidth is proportional to intensity, at least for low photocurrents [14]. v2e models this effect for each pixel by making the filter bandwidth (BW) increase monotonically with the intensity value.

This filter is implemented by an IIR lowpass on the interpolated brightness values. The transfer function of the filter in continuous time form is

H(s) = 1 / (1 + τs)^2    (3)

where τ = 1/(2π f3dB) and f3dB is the cutoff frequency. The shape of this transfer function is illustrated in Fig. 6.

This second-order RC lowpass filter reaches its nominal cutoff frequency for full white pixels. This bandwidth is proportional to the luma Y. To avoid nearly zero bandwidth for small DN pixels, an additive constant limits the minimum bandwidth to about 1/10 of the maximum. The update is done by the steps in (4-9):

(4)
(5)
(6)
(7)
(8)
(9)

where the output of the second filter stage is the lowpass-filtered brightness, an internal state holds the output of the first stage, and the input is the interpolated brightness value. The constant 275 is chosen so that the nominal cutoff frequency results for full-white pixels.
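As a concrete illustration of the update described by (4-9), the sketch below implements a second-order lowpass as two cascaded first-order IIR stages whose per-pixel cutoff rises with luma and is floored at roughly 1/10 of the maximum; the nominal 300 Hz cutoff and the other constants are illustrative assumptions, not the exact v2e values.

```python
# Sketch of an intensity-dependent second-order IIR lowpass; illustrative constants.
import numpy as np

def lowpass_step(new_log_frame, state1, state2, luma, dt, f3db_max=300.0):
    """Advance the two filter stages by one interpolated-frame interval dt [s]."""
    # per-pixel cutoff: proportional to luma, floored at ~1/10 of the maximum
    f3db = f3db_max * (0.1 + 0.9 * luma / 255.0)
    eps = np.clip(dt * 2 * np.pi * f3db, 0.0, 1.0)      # per-step mixing factor
    state1 = state1 + eps * (new_log_frame - state1)    # first RC stage
    state2 = state2 + eps * (state1 - state2)           # second RC stage
    return state1, state2                               # state2 is the filter output
```

Both filter states would be initialized to the first log-luma frame.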

Logarithms and temporal contrast threshold: We define the pixel event thresholds θ_ON and θ_OFF for generating ON and OFF events as thresholds on the change of log intensity from the memorized value:

ΔL = ln(I) - ln(I_mem);  an ON event requires ΔL ≥ θ_ON, and an OFF event requires ΔL ≤ θ_OFF    (10)

The ON threshold θ_ON is positive and the OFF threshold θ_OFF is negative. Typically the magnitudes of θ_ON and θ_OFF are quite similar and take on values from about 0.1 to 0.5, i.e., the typical range of adjustable DVS thresholds is approximately from 10% to 50% light intensity change (but see below to understand what is meant by percentage change).

We think of such thresholds most easily in terms of a logarithmic representation of intensity. Since ln(I2) - ln(I1) = ln(I2/I1), a threshold on the intensity ratio is the same as a fixed threshold on the change of ln(I). For example, for a threshold ON ratio of 1.5, the corresponding log intensity change threshold is θ_ON = ln(1.5) ≈ 0.4; since ln(1/1.5) = -ln(1.5), the corresponding OFF threshold is exactly -θ_ON.

The event thresholds are dimensionless, but they represent a threshold for relative intensity change, i.e., a threshold on the change of the intensity by some factor relative to the memorized value. These relative intensity changes are produced by scene reflectance changes, which is why this representation is useful for producing events that are informative about the visual input.

If θ is small (much less than 1), then it can be stated as a percentage change, since exp(θ) ≈ 1 + θ; stated another way, the threshold corresponds to a factor 1 + θ, so 100·θ is approximately the threshold in percent change. But for large θ, the percentage change for ON and OFF is very different. For example, if θ = 1, it means a change of light intensity by a factor of e ≈ 2.7 or 1/e ≈ 0.37. For ON events it is a percentage increase of about 170%, but for OFF events it means a reduction to about 37% of the memorized value. By the usual measure of percent change, this reduction to 37% is a 63% reduction. So, to avoid confusion, it is easier to consider all changes in logarithmic units from the starting value.
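The ON/OFF asymmetry for large thresholds is easy to check numerically; the snippet below just restates the θ = 1 example.

```python
# Worked example of the ON/OFF asymmetry for a large threshold theta = 1.
import math

theta = 1.0
on_factor = math.exp(theta)      # ~2.72: ON event needs ~172% intensity increase
off_factor = math.exp(-theta)    # ~0.37: OFF event needs a drop to ~37% (a 63% reduction)
small = 0.2
print(math.exp(small) - 1)       # ~0.22: for small theta, exp(theta) - 1 ~ theta
```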

Event generation model: Given that a pixel has a memorized brightness value L_mem, and that the new lowpass-filtered brightness value is L, the basic model generates a signed integer number N_e of positive ON or negative OFF events by the recipe (11):

N_e = fix((L - L_mem)/θ_ON) if L ≥ L_mem;  N_e = fix((L - L_mem)/|θ_OFF|) if L < L_mem    (11)

In (11), N_e denotes the signed number of generated ON or OFF events. If N_e is positive it means ON events, and if negative it means OFF events. The function fix() truncates toward zero for both signs, e.g. fix(2.7) = 2 and fix(-2.7) = -2.

If the change is multiple times the threshold, then multiple DVS events are generated. The memorized brightness value is updated by an integer multiple of the threshold. These events are spread over the time between this input frame and the next one (see below for details).
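A minimal sketch of the recipe (11) applied to whole frames with numpy; the scalar thresholds and function name are illustrative (the per-pixel mismatched thresholds of the next paragraph would replace the scalars), and theta_off is given here as a magnitude.

```python
# Sketch of the event-count recipe (11); illustrative, not the exact v2e code.
import numpy as np

def generate_events(l_new, l_mem, theta_on=0.2, theta_off=0.2):
    """Return signed event counts per pixel and the updated memorized brightness.

    theta_off is the OFF threshold magnitude |theta_OFF|.
    """
    diff = l_new - l_mem
    n_on = np.fix(np.maximum(diff, 0) / theta_on)     # positive counts -> ON events
    n_off = np.fix(np.minimum(diff, 0) / theta_off)   # negative counts -> OFF events
    n_events = (n_on + n_off).astype(int)
    # memorized value moves by an integer multiple of the threshold
    l_mem = l_mem + n_on * theta_on + n_off * theta_off
    return n_events, l_mem
```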

Threshold mismatch: A typical DVS threshold value lies in the range given above. Pixel-to-pixel variation in event threshold is modeled by a frozen Gaussian variation of the thresholds corresponding to a chosen standard deviation σ_θ of the temporal contrast threshold. Measurements of real DVS show that the distribution is close to Gaussian [14]. Typical values of σ_θ are about 3% contrast, i.e. before starting DVS event generation, we store a 2D array of θ_ON and θ_OFF values that are drawn from a Gaussian distribution around the nominal thresholds. The default value is σ_θ = 0.03.

Hot pixels: DVS sensors always have a few ’hot pixels’, which fire events continuously even in the absence of input. Examples can be seen in Fig. 5. Hot pixels can result from abnormally low thresholds or reset switches with very high dark current. v2e arbitrarily limits the minimum threshold to 0.01 to prevent too many hot pixel events.

Event timestamps: The timestamps of the interpolated frames are discrete, and they are also fully determined by the frame rate of input video and the slow-motion upsampling frame interval. Thus, we used the following strategy to assign the DVS event timestamps:

Given two consecutive interpolated frames at times t_i and t_{i+1}:

  • If there is only one event triggered at a pixel, the timestamp of this event is assigned within the interval [t_i, t_{i+1}].

  • If there are N events triggered at a pixel, the N timestamps are evenly distributed between t_i and t_{i+1}.
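A minimal sketch of one way to assign the timestamps as described above; the exact endpoint handling in v2e may differ.

```python
# Sketch of spreading n events at one pixel evenly over the inter-frame interval.
import numpy as np

def event_timestamps(t0: float, t1: float, n: int) -> np.ndarray:
    """Evenly distribute n event timestamps strictly inside [t0, t1]."""
    return t0 + (t1 - t0) * (np.arange(n) + 1) / (n + 1)

print(event_timestamps(0.010, 0.011, 3))  # three events between 10 ms and 11 ms
```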

Leak noise events: DVS pixels emit spontaneous ON events called leak events [19]. They are caused by junction leakage and parasitic photocurrent in the change detector reset switch. They occur at a rate typically about 0.1 Hz. v2e adds these leak events by decreasing the memorized brightness value as illustrated in Fig. 6, by using (12-14):

(12)
(13)
(14)

where the two parameters are the nominal leak rate and the nominal scalar ON threshold. Even if the input does not change, the memorized value eventually drifts away from the photoreceptor output by more than the pixel's individual ON-event threshold, and the pixel emits an ON event. Fig. 6 illustrates one of these leak events being generated by the gradual change of the memorized value. This way, the leak rate varies according to the random variation of the event threshold and the leak events become desynchronized. To make leak events appear from the start of the simulation, if the leak rate is nonzero, then each pixel's memorized value is initialized to a uniformly-distributed random fraction of a threshold below the initial brightness value.
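The leak mechanism can be sketched as follows; the parameter names, default values, and the uniform initialization details are illustrative readings of (12-14), not the exact v2e code.

```python
# Sketch of the leak-event mechanism: the memorized brightness decays at a rate set
# by the nominal leak rate times the nominal ON threshold, so an ON event is
# eventually emitted even for constant input. Illustrative names and defaults.
import numpy as np

rng = np.random.default_rng(0)

def init_leak_state(l_first, theta_on_nominal=0.2):
    """Start each pixel's memorized value a random fraction of a threshold low."""
    return l_first - rng.uniform(0, theta_on_nominal, size=l_first.shape)

def apply_leak(l_mem, dt, theta_on_nominal=0.2, leak_rate_hz=0.1):
    """Decrease the memorized brightness each timestep so leak ON events occur."""
    return l_mem - leak_rate_hz * theta_on_nominal * dt
```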

Temporal noise: The quantal nature of photons results in shot noise: if on average N photons are accumulated in each integration period, then the variance around the average will also be N. Shot noise appears in all vision sensors. In conventional imagers that accumulate for a fixed integration time, shot noise gets larger as the signal gets larger, but its effect on contrast shrinks. That is because as N grows, the standard deviation only grows as sqrt(N), so the relative noise shrinks as 1/sqrt(N).

DVS pixels are different. At low light intensities, DVS pixel integration time is approximately inversely proportional to the intensity, which means that a DVS pixel integrates over a roughly constant number of photons. It means that DVS pixel photoreceptors have total noise power that is constant with intensity. As the intensity increases, the total noise is spread over more bandwidth. It is often observed that DVS recordings show more noise in the dark parts of the scene. The reason for this is that more of the total noise power is concentrated in lower frequencies that lie within the passband of the subsequent change detector.

v2e models temporal noise using a Poisson process. It generates ON and OFF temporal noise events to match an observed noise event rate. Fig. 6 shows how it works: for each sample, a uniformly-distributed number in the range 0-1 is compared with two thresholds to decide if an ON or OFF noise event is generated.

To model the increase of temporal noise with reduced intensity, the observed noise rate for dark parts of the scene is multiplied by a linear function of luma that reduces the noise rate in bright parts by a fixed factor. This modified rate is multiplied by the timestep to obtain the probability that will be applied to the next sample. The complete steps are (15-19):

(15)
(16)
(17)
Generate OFF event (18)
Generate ON event (19)

These noise events are injected into the output, and the pixel is reset the same way that a ‘real’ input would reset it. This way, the noise events do not cause the input signal to be lost.

Since the input video already has noise, adding additional noise is only needed to model low light intensities.
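A minimal sketch of the noise injection described by (15-19); the linear luma scaling, the default bright-pixel reduction factor of 10, and the function name are illustrative assumptions.

```python
# Sketch of per-sample shot-noise event injection; illustrative, not the exact v2e steps.
import numpy as np

rng = np.random.default_rng(0)

def sample_noise_events(luma, dt, rate_hz=1.0, bright_reduction=10.0):
    """Return +1 (ON), -1 (OFF), or 0 noise events per pixel for one timestep."""
    # noise rate falls off linearly with luma, down to 1/bright_reduction at full white
    rate = rate_hz * (1.0 - (1.0 - 1.0 / bright_reduction) * luma / 255.0)
    p = rate * dt                       # per-sample event probability (assumed << 1)
    u = rng.uniform(size=luma.shape)    # one uniform sample per pixel
    events = np.zeros(luma.shape, dtype=int)
    events[u < p / 2] = -1              # OFF noise event
    events[u > 1 - p / 2] = +1          # ON noise event
    return events
```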

v2e output: v2e outputs a variety of formats. The basic output is a stream of events (either in text or jAER .aedat format), and a DVS AVI video that accumulates the signed DVS events starting from a gray image, at a specified frame rate (constant-duration), or with two variable-frame rate count-based exposure strategies, constant-count and area-event [15].

III-C Other DVS non-idealities

Refractory period: The real DVS pixel has an adjustable refractory period, which is used to limit the maximum pixel event rate. After each event is detected, the reset switch transistor in Fig. 2 is connected for a finite time by the Fig. 2E ‘reset and refractory period’ circuit. During this time, the change amplifier ignores changes in the log intensity. To model a finite refractory period, a user could write code to ignore a pixel's changes in subsequent frames for a period after an event is generated.
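As a concrete example of such post-processing, the sketch below drops events that fall within a refractory period of the previous event at the same pixel; it is user code operating on v2e output, not a v2e feature, and the names are illustrative.

```python
# Sketch of a user-side refractory-period filter on an event stream.
import numpy as np

def apply_refractory(events, refractory_s=1e-3, shape=(260, 346)):
    """events: iterable of (timestamp_s, x, y, polarity) sorted by timestamp."""
    last_t = np.full(shape, -np.inf)
    kept = []
    for t, x, y, p in events:
        if t - last_t[y, x] >= refractory_s:
            kept.append((t, x, y, p))
            last_t[y, x] = t            # only passed events restart the dead time
    return kept
```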

Finite event output bandwidth: v2e does not model the fact that DVS sensors have a maximum output event rate, e.g. about 10 MHz for the DAVIS346 camera used here, which is determined by a combination of on-chip arbitration circuits and computer interface limitations.

IV Threshold auto-estimation

For training networks, [24] reported that best results are obtained by using a very wide range of DVS parameters so that the network properly generalizes. However, to obtain output from v2e that is a good approximation to reality, v2e includes a tool that adjusts DVS event thresholds so that the statistics of the v2e output match real statistics. These are obtained from DAVIS camera [2] recordings such as [18, 11] that have concurrent grayscale APS frames and DVS events. v2e uses the APS frames to generate DVS events, and it sweeps the ON and OFF DVS thresholds so that the real and synthetic DVS event count statistics match. For example, if there are too many v2e ON events, the ON threshold is increased, and vice versa.
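The sweep can be sketched as follows; `synthesize_event_count` stands in for a full v2e conversion of the APS frames at a given threshold and is hypothetical.

```python
# Sketch of the threshold auto-estimation loop: sweep a common ON/OFF threshold and
# keep the value minimizing the event-count difference against the real recording.
import numpy as np

def estimate_threshold(real_event_count, synthesize_event_count,
                       thetas=np.arange(0.1, 0.6, 0.05)):
    diffs = [abs(synthesize_event_count(th) - real_event_count) for th in thetas]
    return thetas[int(np.argmin(diffs))]
```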

Fig. 7 shows the results of such a sweep. For this data, ON and OFF thresholds were both swept with the same values. The threshold at which the difference in event counts is minimized is taken as the estimate. When the threshold is small, there is a huge difference because v2e makes far too many events. When it is very large, v2e makes almost no events and so the difference is just the number of real events.

Fig. 8 shows a typical set of DVS event count statistics over time after threshold calibration. The statistics match visually except that the OFF threshold is a bit too large and v2e makes too few OFF events at one point in the recording. Inspection of the output shows this is the result of white saturation of the frame at the trailing edge of the tree.

Fig. 7: Results of a threshold sweep, showing the difference in event counts versus the threshold.
Fig. 8: DVS event statistics from real DVS and v2e conversion. A: From driving data in [1], a region of interest indicated by the black box in the DVS stream is collected for some time. B: The histograms of ON and OFF events over time from the real DVS recording and v2e events. The generated events match quite closely. See the code in [5] for details.

V Results

Readers are invited to inspect videos of conversions on the v2e website. This section shows v2e conversion, with a focus on new features of advanced DVS pixel modeling. The example is from a pendulum recording. The pendulum was a white golf ball suspended on fishing line. We recorded the pendulum using a prototype iniVation DAVIS346 camera with our front-illuminated sensor chip with dual intensity frame and DVS event output [26, 2]. First we recorded the pendulum under good lighting to obtain baseline data and clear, well exposed and sharp intensity frames. We also recorded under low lighting conditions where the DVS pixel slows down substantially (although still much faster than the intensity pixels). We used v2e to generate simulated DVS data from the frames to compare with the real DVS events. For this simulation, we set the leak event rate from (12) to match the observed leak event rate in a part of the scene without motion (leak events are all of ON type so they can be distinguished from shot noise events). Next, to simulate low lighting, we adjusted the DVS pixel bandwidth of (4) and the shot noise rate in (15) of v2e to model low light output of the DVS; specifically, in dark parts of the scene that are changing, we can estimate the shot noise rate from the OFF event rate. We can estimate the bandwidth by observing at what speed fine details of the pendulum disappear.

Fig. 9 shows data from this experiment:

  A. DAVIS APS frames. Exposure time was 6.7 ms and frame rate was about 37 Hz. We use these frames to generate synthetic v2e events. The maximum speed of the pendulum is about 200 pix/s.

  B. Real DVS data under moderate lighting. The APS autoexposure time was 60 ms, so the illumination was about 9X less than when the part A frames were captured. Full scale is 8 events and accumulation time is 10 ms. Average total event rate was about 180 kHz, consisting of about 60 kHz real events with the rest mostly ON leak events.

  C. v2e data simulating this lighting. We upsampled by 10X to 370 Hz. There is no added noise, no threshold mismatch, and no lowpass filtering of photoreceptor output. Same integration time and full scale value as in B. Average simulated event rate was 87 kHz.

  D. Real DVS data under lower lighting. The APS autoexposure time was 191 ms, which indicates it was a factor of 28X darker than the original scene but only 3.1X darker than the part B scene. Same DVS frame accumulation time and full scale as B. The contrast is reduced, indicating fewer events, and in the middle of the swing, when the pendulum is moving at about 400 pixels/s, the details are blurred out. There are also more background noise events.

  E. v2e data after adjusting the parameters for bandwidth, leak event noise, and temporal noise to match the Fig. 9D characteristics.

To summarize, this experiment demonstrates that starting from the same sequence of normal frames, we can realistically model the DVS output under a range of lighting conditions.

Fig. 9: Pendulum data. See text.

V-A Throughput performance

v2e processes video about 20 to 100 times slower than real time on low-end GPU hardware. For example, a laptop Nvidia MX150 GPU on a 2019 Huawei MateBook X Pro running Ubuntu 18.04 with Python 3.7 processed a source video shot at 50 Hz with 6X upsampling and all DVS pixel effects activated at about 1.35 frames/s, i.e. a slowdown of about 37X.

Processing time is dominated by frame interpolation, so faster inference hardware would speed up the processing. Batch mode processing of frames is not currently implemented but would further increase throughput.

VI Discussion

v2e serves a complementary purpose to the useful rpg_vid2e and ESIM toolboxes. ESIM allows generating synthetic DVS data from virtual scenes, which has been used for example to train a network that (rather expensively) reconstructs video from DVS events.

The rpg_vid2e extension of ESIM allows idealized simulation of DVS data from good source video. v2e extends this ability further, by allowing realistic simulation of extreme lighting conditions for DVS sensors. These are increasingly important as DVS are deployed in the kinds of challenging environments to which they are ideally suited.

Like rpg_vid2e, v2e can process video files generated from simulated virtual environments, including super high dynamic range scenes with extremes of low and high illumination. v2e will realistically simulate the variable pixel bandwidth and noise from such scenes.

Acknowledgment

We thank Samsung via the Neuromorphic Processor Project Global Research Program and NCCR Robotics for supporting these developments.

References

  • [1] J. Binas, D. Neil, S. Liu, and T. Delbruck (2017) DDD17: end-to-end DAVIS driving dataset. In ICML'17 Workshop on Machine Learning for Autonomous Vehicles (MLAV 2017), Sydney, Australia.
  • [2] C. Brandli, R. Berner, M. Yang, S. Liu, and T. Delbruck (2014) A 240x180 130 dB 3 us latency global shutter spatiotemporal vision sensor. IEEE J. Solid-State Circuits 49 (10), pp. 2333–2341.
  • [3] J. Conradt, M. Cook, R. Berner, P. Lichtsteiner, R. J. Douglas, and T. Delbruck (2009) A pencil balancing robot using a pair of AER dynamic vision sensors. In IEEE International Symposium on Circuits and Systems (ISCAS) 2009, Taipei, pp. 781–784.
  • [4] T. Delbrück, B. Linares-Barranco, E. Culurciello, and C. Posch (2010) Activity-driven, event-based vision sensors. In Proceedings of the 2010 IEEE International Symposium on Circuits and Systems (ISCAS), Paris, pp. 2426–2429.
  • [5] T. Delbruck, Y. Hu, and Z. He (2020) V2E: from video frames to DVS events. Institute of Neuroinformatics, University of Zurich and ETH Zurich.
  • [6] T. Delbruck and M. Lang (2013) Robotic goalie with 3 ms reaction time at 4% CPU load using event-based dynamic vision sensor. Front. in Neuromorphic Eng. 7, pp. 223.
  • [7] T. Delbruck (1997–2017) Notes on practical photometry. https://www.ini.uzh.ch/~tobi/wiki/doku.php?id=radiometry
  • [8] T. Delbruck (2008) Frame-free dynamic digital vision. In Proceedings of the Intl. Symp. on Secure-Life Electronics, Advanced Electronics for Quality Life and Society, Vol. 1, Tokyo, Japan, pp. 21–26.
  • [9] G. Gallego, T. Delbrück, G. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A. J. Davison, J. Conradt, K. Daniilidis, and D. Scaramuzza (2019) Event-based vision: a survey. CoRR abs/1904.08405.
  • [10] D. Gehrig, M. Gehrig, J. Hidalgo-Carrió, and D. Scaramuzza (2020) Video to events: recycling video datasets for event cameras. In IEEE Conf. Comput. Vis. Pattern Recog. (CVPR).
  • [11] Y. Hu, J. Binas, D. Neil, S. Liu, and T. Delbruck (2020) DDD20 end-to-end event camera driving dataset: fusing frames and events with deep learning for improved steering prediction. In The 23rd IEEE International Conference on Intelligent Transportation Systems (ITSC 2020), accepted.
  • [12] H. Jiang, D. Sun, V. Jampani, M. Yang, E. Learned-Miller, and J. Kautz (2018) Super SloMo: high quality estimation of multiple intermediate frames for video interpolation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [13] M. L. Katz, K. Nikolic, and T. Delbruck (2012) Live demonstration: behavioural emulation of event-based vision sensors. In 2012 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 736–740.
  • [14] P. Lichtsteiner, C. Posch, and T. Delbruck (2008) A 128x128 120 dB 15 us latency asynchronous temporal contrast vision sensor. IEEE J. Solid-State Circuits 43 (2), pp. 566–576.
  • [15] M. Liu and T. Delbruck (2018) Adaptive time-slice block-matching optical flow algorithm for dynamic vision sensors. In BMVC 2018, Newcastle upon Tyne, pp. 12–16.
  • [16] S. Liu, T. Delbruck, G. Indiveri, A. Whatley, and R. Douglas (2014) Event-based neuromorphic systems. John Wiley & Sons.
  • [17] A. Mitrokhin, C. Ye, C. Fermuller, Y. Aloimonos, and T. Delbruck (2019) EV-IMO: motion segmentation dataset and learning pipeline for event cameras. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 6105–6112.
  • [18] E. Mueggler, H. Rebecq, G. Gallego, et al. (2017) The event-camera dataset and simulator: event-based data for pose estimation, visual odometry, and SLAM. The International Journal of Robotics Research.
  • [19] Y. Nozaki and T. Delbruck (2017) Temperature and parasitic photocurrent effects in dynamic vision sensors. IEEE Trans. Electron Devices.
  • [20] C. Posch, T. Serrano-Gotarredona, et al. (2014) Retinomorphic event-based vision sensors: bioinspired cameras with spiking output. Proceedings of the IEEE.
  • [21] H. Rebecq, D. Gehrig, and D. Scaramuzza (2018) ESIM: an open event camera simulator. In Proceedings of The 2nd Conference on Robot Learning, Proceedings of Machine Learning Research, Vol. 87, pp. 969–982.
  • [22] C. Scheerlinck, H. Rebecq, D. Gehrig, N. Barnes, R. Mahony, and D. Scaramuzza (2020) Fast image reconstruction with an event camera. In The IEEE Winter Conference on Applications of Computer Vision, pp. 156–163.
  • [23] T. Serrano-Gotarredona and B. Linares-Barranco (2013) A 128x128 1.5% contrast sensitivity 0.9% FPN 3 us latency 4 mW asynchronous frame-free dynamic vision sensor using transimpedance preamplifiers. IEEE J. Solid-State Circuits 48 (3), pp. 827–838.
  • [24] T. Stoffregen, C. Scheerlinck, D. Scaramuzza, T. Drummond, N. Barnes, L. Kleeman, and R. Mahony (2020) How to train your event camera neural network. arXiv e-prints.
  • [25] S. Su, M. Delbracio, J. Wang, G. Sapiro, W. Heidrich, and O. Wang (2017) Deep video deblurring for hand-held cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1279–1288.
  • [26] G. Taverni, D. Paul Moeys, C. Li, C. Cavaco, V. Motsnyi, D. San Segundo Bello, and T. Delbruck (2018) Front and back illuminated dynamic and active pixel vision sensors comparison. IEEE Trans. Circuits Syst. II: Express Briefs 65 (5), pp. 677–681.
  • [27] Various contributors (2007) jAER open source project. http://jaerproject.org (accessed 2016-05-23).
  • [28] Various contributors (2020) Event-based_vision_resources. GitHub.
  • [29] Various contributors (2020) rpg_davis_simulator. GitHub.
  • [30] Various contributors (2020) rpg_esim. GitHub.
  • [31] Wikipedia contributors (2019) Luma (video). https://en.wikipedia.org/w/index.php?title=Luma_(video)&oldid=904566688 (accessed 2020-04-07).