Autonomous mobile robots are well established in controlled environments such as factories where they rely on external infrastructure such as magnetic tape on the floor or beacons. However, in unstructured and changing environments, robots need to be able to plan their way and interact with their environment which, as a first step, requires accurate positioning [RolandSiegwart2011].
In mobile robotic applications, visual sensors can provide solutions for odometry and Simultaneous Localization and Mapping (SLAM), achieving good accuracy and robustness. Using additional sensor modalities such as inertial measurement units (IMUs) [Bloesch2015, Leutenegger2015, Qin2018VINS-Mono:Estimator] can further improve robustness and accuracy for a wide range of applications. For many frameworks, a time offset or delay between modalities can lead to poor results or even render the whole approach unusable. Some frameworks, such as VINS-Mono [Qin2018VINS-Mono:Estimator], can estimate a time offset online. However, convergence of this estimation can be improved by accurate timestamping, i.e. assigning timestamps to the sensor data that precisely correspond to the measurement time on a clock common to all sensors.
For reliable and accurate sensor fusion, all sensors need to provide timestamps that can be matched across sensor modalities.
Typically, the camera image and the IMU measurement are not read out on the same device. This results in a clock offset between the two timestamps which is hard to predict due to universal serial bus (USB) buffer delay, operating system (OS) scheduling, changing exposure times and internal sensor filtering. Therefore, measurement correspondences between modalities are ambiguous and assignment on the host is not trivial.
Currently, there is no reference sensor synchronization framework for data collection, which makes it hard to compare results obtained in various publications that deal with visual-inertial (VI) SLAM. Commercially and academically established sensors such as the Skybotix VI-Sensor [Nikolic2014], Intel RealSense [IntelCorporation2019IntelT265], the SkyAware sensor based on work from Honegger et al. [Honegger2017EmbeddedStereo] or PIRVS [ZheZhang2018] are either unavailable or limited in hardware configuration regarding image sensor, lens, camera baseline and IMU. Furthermore, it is often impossible to extend those frameworks to enable fusion with other modalities such as LiDAR sensors or external illumination.
In this paper, we introduce VersaVIS (available at www.github.com/ethz-asl/versavis), shown in Figure 5, the first open-source framework that is able to accurately synchronize a large range of camera and IMU sensors. The sensor suite is aimed at the research community to enable rapid prototyping of affordable sensor setups in different fields of mobile robotics where visual navigation is important. Special emphasis is put on easy integration for different applications and on easy extensibility by building on well-known open-source frameworks such as Arduino [ArduinoInc.2019ArduinoZero] and Robot Operating System (ROS) [StanfordArtificialIntelligenceLaboratoryetal.2018RoboticSystem].
The remainder of the paper is organized as follows: In Section II, the sensor suite is described in detail including all of its features. Section III provides evaluations of the synchronization accuracy of the proposed framework. Finally, Section IV showcases the use of the Open Versatile Multi-Camera Visual-Inertial Sensor Suite (VersaVIS) in multiple applications while Section V provides a conclusion with an outlook on future work.
II The Visual-Inertial Sensor Suite
The proposed sensor suite consists of three different parts: (i) the firmware which runs on the microcontroller unit (MCU), (ii) the host driver running on a ROS-enabled machine, and (iii) the hardware trigger printed circuit board (PCB). An overview of the framework is provided in Figure 6. Here, the procedure is described for a reference setup consisting of two cameras and a Serial Peripheral Interface (SPI) enabled IMU.
The core component of VersaVIS is the MCU. First, it periodically triggers the IMU readout, sets the timestamps and sends the data to the host. Furthermore, the MCU sends triggering pulses to the cameras to start image exposure. This holds for both independent cameras and stereo cameras (see Section II-A). After successful exposure, the MCU measures the exposure time by listening to the cameras' exposure signal in order to perform the exposure compensation described in Section II-A1 and to set mid-exposure timestamps. The timestamps are sent to the host together with a strictly increasing sequence number. The image data is sent directly from the camera to the host computer to avoid routing massive amounts of data through the MCU. This makes it possible to use high-resolution cameras even with a low-performance MCU.
Finally, the host computer merges image timestamps from the MCU with the corresponding image messages based on a sequence number (see Section II-A1).
The MCU is responsible for triggering the devices at the correct time and for capturing timestamps of the triggered sensor measurements. This is based on hardware timers and external signal interrupts.
II-A1 Standard cameras
In the scope of the MCU, standard cameras are considered sensors that can be triggered with signal pulses and that have a non-zero data measurement time, i.e. image exposure time. Furthermore, the sensors need to provide an exposure signal (often called strobe signal) which indicates the exposure state of the sensor. While the trigger pulse and the timestamp are both created on the MCU based on its internal clock, the data is transferred via USB or Ethernet directly to the host computer. To enable correct association of the image data and timestamp on the host computer (see Section II-B1), the timestamp and the image data are each assigned an independent sequence number. The mapping between these sequence numbers is determined during initialization, as a simultaneous start of the cameras and the trigger board cannot be guaranteed.
Initialization procedure: After startup of the camera and trigger board, corresponding sequence numbers are found by triggering the camera very slowly without exposure time compensation. Corresponding sequence numbers are then determined by closest timestamps. This holds true if the triggering period is significantly longer than the expected delay and jitter on the host computer. As soon as the sequence number offset is determined, exposure compensation mode at full frame rate can be used.
Exposure time compensation: When performing auto-exposure (AE), the camera adapts its exposure time to the current illumination, resulting in a non-constant exposure time. Furgale et al. [Furgale2013UnifiedSystems] showed that mid-exposure timestamping is beneficial for image-based state estimation, especially when using global shutter cameras. Instead of periodically triggering the camera, a scheme proposed by Nikolic et al. [Nikolic2014] is employed. The idea is to obtain periodic mid-exposure timestamps by starting each exposure half the exposure time before its mid-exposure timestamp, as shown in Figure 7 for cam0, cam1 and cam2. The exposure time return signal is used to measure the current exposure time and to calculate the offset to the mid-exposure timestamp of the next image. Using this approach, corresponding measurements can be obtained even if multiple cameras do not share the same exposure time (e.g. cam0 and cam2 in Figure 7).
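The timing rule described above can be illustrated with a minimal sketch. It is not taken from the VersaVIS firmware: the class name, field names and the use of seconds as floating-point values are assumptions made for illustration only, whereas the firmware relies on hardware timers.

```python
from dataclasses import dataclass


@dataclass
class CompensatedTrigger:
    """Hypothetical helper illustrating mid-exposure compensation."""

    period_s: float     # desired spacing between mid-exposure timestamps
    first_mid_s: float  # mid-exposure timestamp of frame 0

    def mid_exposure_time(self, frame: int) -> float:
        # Mid-exposure timestamps stay strictly periodic.
        return self.first_mid_s + frame * self.period_s

    def trigger_time(self, frame: int, last_exposure_s: float) -> float:
        # Start the exposure half the last measured exposure time
        # before the periodic mid-exposure timestamp.
        return self.mid_exposure_time(frame) - 0.5 * last_exposure_s


schedule = CompensatedTrigger(period_s=0.05, first_mid_s=1.0)
# Two cameras with different exposure times are triggered at different
# instants but share the same mid-exposure timestamp.
short_exposure_start = schedule.trigger_time(2, last_exposure_s=0.004)
long_exposure_start = schedule.trigger_time(2, last_exposure_s=0.010)
```

Because the next exposure time is not known in advance, the sketch uses the last measured one, which matches the description above only as long as the exposure time changes slowly between frames.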
Master-slave mode: Compared to a monocular camera, two cameras in a stereo setup can retrieve metric scale by stereo matching. This can enable certain applications where IMU excitation is not high enough and biases are therefore not fully observable without this scale input, e.g. for rail vehicles as described in Section IV-B. Furthermore, it can also provide more robustness. To perform accurate and efficient stereo matching, it is highly beneficial if keypoints from the same spot have a similar appearance. This can be achieved by using the exact same exposure time on both cameras. To this end, one camera serves as the master performing AE while the other adapts its exposure time. This is achieved by routing the exposure signal from cam0 directly to the trigger of cam1 and also using it to determine the exposure time for compensation.
II-A2 Other triggerable sensors
Some sensors, e.g. thermal cameras or Time-of-Flight (ToF) cameras, enable measurement triggering but neither require nor offer exposure compensation, as they typically have a fixed exposure or integration time. They can be treated the same as a standard camera, but with a fixed exposure time.
Sensors that provide immediate measurements (such as external IMUs) do not need exposure time compensation and can just use a standard timer. Note that the timestamps for those sensors are still captured on the MCU and therefore correspond to the other sensor modalities.
II-A3 Other non-triggerable sensors
In robotics, it is often useful to perform sensor fusion with multiple available sensor modalities such as wheel odometers or LiDAR sensors [Zhang2015Visual-lidarFast]. Most such sensor hardware does not allow triggering or low-level sensor readout but sends the data continuously to a host computer. In order to enable precise and accurate sensor fusion, having corresponding timestamps of all sensor modalities is often crucial.
As most such additional sensors produce their timestamps based on the host clock or synchronized to the host clock, VersaVIS performs time translation to the host. (In this context, clock synchronization refers to modifying the slave clock speed to align to the master clock, while clock translation refers to translating the timestamp of the slave clock to the time of the master clock [Sommer2017ACameras].) An on-board extended Kalman filter (EKF) is deployed to estimate the clock skew and the clock offset. Its model relates the timestamps on the host and on the slave (in this case the VersaVIS MCU) at each update step through the time since the last EKF update and the initial clock offset, which is set at the initial connection between host and slave.
Periodically, VersaVIS performs a filter update by requesting the current time of the host. The measurement uses the time at which the request is sent and the time at which the answer from the host is received, assuming that the communication delay between host and slave is symmetric. The filter update is then performed using the standard EKF equations, with a prediction step, a covariance matrix and separate noise parameters for the clock offset and the clock skew. From the resulting measurement residual, the Kalman gain and the measurement update can be derived using the standard EKF equations.
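As an illustration of such a clock translation filter, the following sketch estimates offset and skew from request/response round trips. It is one plausible formulation and does not reproduce the equations of this paper: the state layout, the noise values, the class name and the method names are all assumptions. Since this assumed model is linear, the filter reduces to a standard Kalman filter.

```python
class ClockTranslationFilter:
    """Illustrative two-state filter: offset [s] and skew [s/s] (assumed model)."""

    def __init__(self, q_offset=1e-12, q_skew=1e-14, r_meas=1e-8):
        self.offset = None  # host time minus slave time at t_last
        self.skew = 0.0     # drift of the offset per second of slave time
        self.t_last = None  # slave time of the last update
        # Symmetric 2x2 covariance stored as three scalars (assumed initial values).
        self.p00, self.p01, self.p11 = 1e-6, 0.0, 1e-8
        self.q_offset, self.q_skew, self.r_meas = q_offset, q_skew, r_meas

    def update(self, t_request, t_response, t_host):
        # Symmetric delay assumption: the host stamped its answer at the
        # midpoint (in slave time) between sending and receiving.
        t_mid = 0.5 * (t_request + t_response)
        measured_offset = t_host - t_mid
        if self.offset is None:
            # Initial offset is set at the first connection.
            self.offset, self.t_last = measured_offset, t_mid
            return 0.0
        dt = t_mid - self.t_last
        # Prediction with F = [[1, dt], [0, 1]] and P = F P F^T + Q * dt.
        self.offset += self.skew * dt
        p00 = self.p00 + dt * (2.0 * self.p01 + dt * self.p11) + self.q_offset * dt
        p01 = self.p01 + dt * self.p11
        p11 = self.p11 + self.q_skew * dt
        # Measurement update with H = [1, 0].
        residual = measured_offset - self.offset
        s = p00 + self.r_meas
        k0, k1 = p00 / s, p01 / s
        self.offset += k0 * residual
        self.skew += k1 * residual
        self.p00 = (1.0 - k0) * p00
        self.p01 = (1.0 - k0) * p01
        self.p11 = p11 - k1 * p01
        self.t_last = t_mid
        return residual

    def to_host_time(self, t_slave):
        # Translate a slave timestamp into the host clock.
        return t_slave + self.offset + self.skew * (t_slave - self.t_last)
```

In use, the slave would call `update` once per request/response round trip and `to_host_time` for every sensor timestamp it publishes. Estimating the skew alongside the offset lets the translation remain valid between updates.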
II-B Host computer driver
The host computer needs to merge the image data coming directly from the camera sensors with the image timestamps from the VersaVIS triggering board.
During initialization, timestamps and image data are associated, separately for each connected camera, based on minimal time difference within a threshold, which yields the sequence number offset between the two streams.
As the images are triggered very slowly during this phase, the USB buffer and OS scheduling jitter are assumed to be negligible.
As soon as the sequence number offset is constant and time offsets are small, the trigger board is notified about the initialization status of the camera. With all cameras initialized, normal triggering mode (e.g. high frequency) can be activated.
During normal mode, image data (directly from the camera) and image timestamps (from the VersaVIS triggering board) are associated based on their sequence numbers and the offset determined during initialization.
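The two-phase association can be sketched as follows. This is a simplified illustration rather than the actual host driver: the function names, the tuple layout `(sequence number, time)` and the assumption that both times are expressed in the host clock are hypothetical.

```python
from collections import Counter


def estimate_sequence_offset(stamps, images, max_dt_s):
    """Return offset with image_seq + offset == stamp_seq, or None.

    stamps: (sequence number, time) pairs from the triggering board.
    images: (sequence number, time) pairs from the camera driver.
    """
    votes = Counter()
    for img_seq, img_t in images:
        # Closest timestamp in time, accepted only within the threshold.
        best = min(stamps, key=lambda s: abs(s[1] - img_t), default=None)
        if best is not None and abs(best[1] - img_t) <= max_dt_s:
            votes[best[0] - img_seq] += 1
    if len(votes) != 1:
        return None  # no match yet, or the offset is not constant
    return next(iter(votes))


def associate(image_seq, offset, stamp_by_seq):
    """Normal mode: look up the timestamp by sequence number only."""
    return stamp_by_seq.get(image_seq + offset)


# Slow triggering during initialization: the period (1 s here) is much
# longer than the assumed host-side delay and jitter.
stamps = [(10, 0.00), (11, 1.00), (12, 2.00)]
images = [(3, 0.02), (4, 1.03), (5, 2.01)]
offset = estimate_sequence_offset(stamps, images, max_dt_s=0.1)
mid_exposure = associate(5, offset, dict(stamps))
```

Once the offset is known, the lookup in normal mode no longer depends on arrival times, so host-side jitter at full frame rate does not affect the association.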
II-B2 IMU receiver
II-C VersaVIS triggering board
One main part of VersaVIS is the triggering board, an MCU-based custom PCB shown in Figure (d) that is used to connect all sensors and to perform sensor synchronization. For easy extensibility and integration, the board is compatible with the Arduino environment [ArduinoInc.2019ArduinoZero]. In the reference design, the board supports up to three independently triggered cameras with a four-pin connector. Furthermore, SPI, Inter-Integrated Circuit (I2C) or Universal Asynchronous Receiver Transmitter (UART) can be used to interface with an IMU or other sensors. Table I shows the specifications of the triggering board. The board is connected to the host computer using USB and communicates with ROS using rosserial [Ferguson2019Rosserial].
In this section, several evaluations are carried out that show the synchronization accuracy of different modules of the VersaVIS framework.
The first important characteristic of a good multi-camera time synchronization is that multiple corresponding camera images capture the same information. This is especially important when multiple cameras are used for state estimation (see Section IV-B).
To evaluate the synchronization accuracy of multi-camera triggering, we captured a stream of images of a light-emitting diode (LED) timing board, shown in Figure 8, with two independently triggered but synchronized and exposure-compensated cameras.
The board features eight counting LEDs (the left and right most LEDs are always on and used for position reference). The board is changing state whenever a trigger is received. The LEDs are organized in two groups. The right group of four indicate one count each, while the left group is binary encoded resulting in a counter overflow at . The board is triggered with aligned with the mid-exposure timestamps of the images. Furthermore, both cameras are operated at a rate of and with a fixed exposure time of .
For a successful synchronization of the sensors, images captured with both cameras should show the same bit count with at most two of the striding LEDs on (since the board changes state at mid-exposure) and also the correct increment between images of .
Figure 12 shows results of three consecutive image pairs. All image pairs (left and right) show the same LED count while consecutive images show the correct increment of . An image stream containing image pairs was inspected without any case of wrong increment or non-matching pairs.
We can therefore conclude that the time synchronization of two cameras has an accuracy better than showing an accurate time synchronization.
For many visual-inertial odometry (VIO) algorithms such as ROVIO [Bloesch2015] or OKVIS [Leutenegger2015], accurate time synchronization of camera image and IMU measurement is crucial for a robust and accurate operation.
Typically, time offsets between camera and IMU can be the result of data transfer delay, OS scheduling, clock offsets of measurement devices, changing exposure times or internal filtering of IMU measurements. Only offsets that are not constant are critical, as constant offsets can be calibrated. Offsets that typically change, such as those caused by OS scheduling, clocks on different measurement devices or an uncompensated, changing exposure time, should therefore be avoided.
Using the camera-IMU calibration framework Kalibr [Furgale2013UnifiedSystems], a time offset between camera and IMU measurement can be determined by optimizing the extrinsic calibration together with a constant offset between both modalities. To test the consistency of the time offset, multiple datasets were recorded with VersaVIS using different configurations for the IMU filtering and exposure times. The window width of the deployed Bartlett window finite impulse response (FIR) filter on the IMU [AnalogDevices2019ADIS16448BLMZ:Sensor] was set to while the exposure time was set to AE or fixed to .
Furthermore, the Skybotix VI-Sensor [Nikolic2014] and the Intel RealSense T265 [IntelCorporation2019IntelT265] were also tested for reference.
Table II and Figure 13 show important indicators of the calibration quality for multiple datasets. The reprojection error represents how well the lens and distortion model fit the actual lens and how well the movement of the camera agrees with the movement of the IMU after calibration. The reprojection errors of VersaVIS and the VI-Sensor are both low and consistent, meaning that the calibration converged to a consistent extrinsic transformation between camera and IMU and that both sensor measurements agree well. Furthermore, the reprojection errors are independent of the filter and exposure time configuration, showing that exposure compensation is working as expected. On the other hand, the RealSense shows high reprojection errors because of the sensor's fisheye lenses, which turn out to be hard to calibrate even with the available fisheye lens models [Usenko2018TheModel] in Kalibr.
This also becomes visible in the acceleration and gyroscope errors, where the errors of the RealSense are very low as a result of the poorly fitting lens model: Kalibr estimates the body spline to be purely represented by the IMU and neglects the image measurements. For VersaVIS, both errors are highly dependent on the IMU filter, showing a decrease in error with more aggressive filtering due to reduced noise. Here too, VersaVIS achieves errors similar to or lower than those of the VI-Sensor.
Finally, the time offset between camera and IMU measurements shows that VersaVIS and VI-Sensor possess a similar accuracy in time synchronization as the standard deviations are low and the time offsets consistent indicating synchronization accuracy below. However, the time offset is highly dependent on the IMU filter configuration. Therefore, the delay should be compensated either on the driver side or on the estimator side when more aggressive filtering is used (e.g. to reduce the influence of vibrations). Furthermore, the time offset is independent of exposure time and camera, indicating again that exposure compensation is working as intended. RealSense shows inconsistent time offset estimations with a bi-modal distribution delimited by half the inter-frame time of indicating that for some datasets, there might be image measurement shifts by one frame.
[Table II: N | Reprojection error | Gyroscope error | Acceleration error | Time offset]
As mentioned in Section II-A3, not all sensors are directly compatible with VersaVIS. The better the clock translation of different measurement devices, the better the sensor fusion.
Thanks to the bi-directional connection between VersaVIS and host computer, clock translation requests can be sent from VersaVIS and the response from the host can be analyzed.
Such requests are sent every second. Figure 14 shows the evolution of the EKF states introduced in Section II-A3. In this experiment, there is a clock skew between the host computer and VersaVIS resulting in a constantly decreasing offset after EKF convergence. This highlights the importance of estimating the skew when using time translation.
Figure 15 shows results of the residual and the innovation terms of the clock offset and skew after startup and in convergence. After approximately , the residual drops below and keeps oscillating at due to USB jitter. However, thanks to the EKF, this error is smoothed to a zero mean innovation of the offset of resulting in a clock translation accuracy of . The influence of the skew innovation can be neglected with the short update time of .
This section validates the flexibility, robustness and accuracy of our system with several different sensor setups using VersaVIS, utilized in different applications.
IV-A Visual-inertial SLAM
The main purpose of a VI sensor is to perform odometry estimation and mapping. For that purpose, we collected a dataset walking around in our lab with different sensor setups including VersaVIS equipped with a FLIR BFS-U3-04S2M-CS camera and the Analog Devices ADIS16448 IMU shown in Figure (a), a Skybotix VI-Sensor [Nikolic2014] and an Intel RealSense T265, all attached to the same rigid body. For reference, we also evaluated the use of the FLIR camera together with the IMU of the VI-Sensor as a non-synchronized sensor setup (since both sensors do some time translation to the host, software time translation is available).
Figure 16 shows an example of the feature tracking window of ROVIO on the dataset. Both VersaVIS and the VI-Sensor show many well-tracked features, even during the fast motions depicted in the image. Due to the wide field of view (FOV) of the RealSense cameras and the imperfectly fitting lens model, some of the keypoints do not reproject correctly to the image plane, resulting in falsely warped keypoints. With a non-synchronized sensor (VersaVIS non-synced), many of the keypoints cannot be correctly tracked as the IMU measurements and the image measurements do not agree well.
Figure 17 shows trajectories obtained with the procedure described above. While all sensors provide useful output depending on the application, VersaVIS shows the lowest drift. VI-Sensor and RealSense suffer from their specific camera hardware: the VI-Sensor has inferior lenses and camera chips (visible in motion blur), and the RealSense has camera lenses for which no well-fitting lens model is available in the used frameworks. Without synchronization, the trajectory becomes more jittery, which can lead to an unstable estimator; this is also visible in the large scale offset and the higher drift. Using batch optimization and loop closure [Schneider2017], the trajectory of VersaVIS can be further optimized (VersaVIS opt). However, the discrepancy between VersaVIS and VersaVIS opt is small, indicating an already good odometry performance.
IV-B Stereo Visual-Inertial Odometry on Rail vehicle
The need for public transportation is heavily increasing while current infrastructure is reaching its limits. Using on-board sensors with reliable and accurate positioning, the efficiency of the infrastructure could be highly increased [Tschopp2019ExperimentalVehiclesb]. However, this requires the fusion of multiple independent positioning modalities of which one could be visual-aided odometry.
For this purpose, VersaVIS was combined with two global-shutter cameras arranged in a fronto-parallel stereo setup using master-slave triggering (see Section II-A) and a compact, precision six degrees of freedom IMU shown in Figure (c). In comparison to many commercial sensors, the camera has to provide a high frame rate to obtain a reasonable number of frames per displacement, even at higher speeds, and has to feature a high dynamic range to deal with the challenging lighting conditions. Furthermore, due to the constrained motion of the vehicle and the low signal-to-noise ratio (SNR), the IMU should be of high quality and should also be temperature calibrated to deal with temperature changes due to direct sunlight. The sensor specifications are summarized in Table III.
The results obtained with VersaVIS (details found in our previous work [Tschopp2019ExperimentalVehiclesb]) show that by using stereo cameras and tightly synchronized IMU measurements we can improve robustness and provide accurate odometry up to of the travelled distance on railway scenarios and speeds up to .
IV-C Multi-modal mapping and reconstruction
VI mapping as described in Section IV-A can provide reliable pose estimates in many different environments. However, for applications that require precise mapping of structures in GPS-denied, visually degraded environments, such as mines and caves, additional sensors are required. The multi-modal sensor setup seen in Figure (a) was specifically developed for mapping research in these challenging conditions. The fact that VersaVIS provides time synchronization to the host computer greatly facilitates the addition of other sensors. The prerequisite is that these additional sensors are time synchronized with the host computer as well, which in our case, for an Ouster OS-1 64-beam LiDAR, is done over Precision Time Protocol (PTP). The absence of light in these underground environments also required the addition of an artificial lighting source that fulfills very specific requirements to support the camera system for pose estimation and mapping. The main challenge was to achieve the maximum amount of light, equally distributed across the environment (i.e. ambient light), while at the same time being bound by power and cooling limitations. To that end, a pair of high-powered, camera-shutter-synchronized LEDs, similar to the system shown by Nikolic et al. [Nikolic2013AFacilities], is employed. VersaVIS provides a trigger signal to the LED control board, which represents the union of all camera exposure signals, ensuring that all images are fully illuminated while at the same time minimizing the power consumption and heat generation. This allows operating the LEDs at a significantly higher brightness level than during continuous operation.
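The notion of an LED trigger that represents the union of all camera exposure signals can be illustrated in software with a small interval-merging sketch. On the actual board this is a hardware signal, so the code below is only a model of the timing: the function names and the representation of exposures as `(start, end)` pairs in seconds are assumptions.

```python
def union_of_exposures(intervals):
    """Merge overlapping (start, end) exposure intervals into LED-on windows."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the current window: extend it if needed.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(window) for window in merged]


def duty_cycle(intervals, period_s):
    """Fraction of one trigger period during which the LEDs are on."""
    on_time = sum(end - start for start, end in union_of_exposures(intervals))
    return on_time / period_s


# Two cameras share a mid-exposure time of 0.100 s with different exposure
# times; a third camera exposes around 0.150 s.
exposures = [(0.098, 0.102), (0.095, 0.105), (0.148, 0.152)]
led_windows = union_of_exposures(exposures)
```

For cameras that share a mid-exposure timestamp, the union coincides with the longest exposure, and the resulting low duty cycle is what leaves thermal headroom for a higher peak brightness.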
Figure 21 shows an example of the processed multi-modal sensor data. VI odometry [Bloesch2015] and mapping [Schneider2017] including local refinements based on LiDAR data and Truncated Signed Distance Function (TSDF)-based surface reconstruction [Oleynikova2017Voxblox:Planning] were used to precisely map the 3D structure of parts of an abandoned iron mine in Switzerland.
IV-D Object Based Mapping
Robots that operate in changing environments benefit from using maps that are based on physical objects instead of abstract points.
Such object based maps are more consistent if individual objects move, as only a movement of an object must be detected instead of each point on the moved object that is part of the map.
Object based maps are also a better representation for manipulation tasks.
The elements in the map are typically directly the objects of interest in such tasks.
For the object based mapping application, a sensor setup with a depth camera, an RGB camera and an IMU was assembled, see Figure (b). A Pico Monstar, which is a ToF camera that provides resolution depth images at up to , was combined with a FLIR Blackfly color camera and an ADIS16448 IMU. To obtain accurate pose estimates, the IMU and color camera were used for odometry and localization [Schneider2017].
With these poses, together with the depth measurements, an approach from [Furrer2019Modelify:Completion] was used to reconstruct the scene and extract object instances. An example of such a segmentation map is shown in Figure (a), and objects that were extracted and inserted into a database are shown in Figures (b) and (c).
We presented a hardware synchronization suite for multi-camera VI sensors consisting of the full hardware design, firmware and host driver software. The sensor suite supports multiple beneficial features such as exposure time compensation and host time translation and can be used in both independent and master-slave multi-camera mode.
The time synchronization performance is analyzed separately for camera-camera synchronization, camera-IMU synchronization and VersaVIS-host clock translation. All modules achieve time synchronization accuracy of which is expected to be accurate enough for most mobile robotic applications.
The benefits and versatility of the sensor suite are demonstrated on multiple applications including hand-held VIO, multi-camera VI applications on rail vehicles as well as large-scale environment reconstruction and object based mapping.
For the benefit of the community, all hardware and software components are completely open-source with a permissive license and based on easily available hardware and development software. This paper and the accompanying framework can also serve as a freely available reference design for research and industry as it summarizes solution approaches to multiple challenges of developing a synchronized multi-modal sensor setup.
The research community can easily adopt, adapt and extend this sensor setup and rapidly prototype custom sensor setups for a variety of robotic applications. Several experimental features showcase the easy extensibility of the framework:
Illumination module: The VersaVIS triggering board can be paired with LEDs, shown in Figure (a), which are triggered corresponding to the longest exposure time of the connected cameras. Thanks to the synchronization, the LEDs can be operated at a higher brightness than would be possible in continuous operation.
LiDAR synchronization: The available auxiliary interface on VersaVIS could be used to tightly integrate LiDAR measurements by getting digital pulses from the LiDAR corresponding to taken measurements. The merging procedure would then be similar to the one described in Section II-B1 for cameras with fixed exposure time.
This work was partly supported by Siemens Mobility, Germany and by the National Center of Competence in Research (NCCR) Robotics through the Swiss National Science Foundation. The authors would like to thank Gabriel Waibel for his work adapting the firmware for a newer MCU.