The evaluation dataset and image processing visualizations are made available here: https://embedded.rwth-aachen.de/doku.php?id=forschung:mobility:infralocalization:itsc2021.
High-precision localization is a key enabler for automated vehicles, as most disclosed automated vehicle (AV) prototypes rely on a high-precision map of the environment they operate in. These maps contain static information about the environment and are generated before driving autonomously in a specific area. They comprise static environment elements such as lane boundaries, traffic light locations, and stop signs, as well as traffic rules. These pieces of information are often required ahead of time for trajectory planning and are difficult to extract from live sensor data with sufficient foresight due to sensor range, occlusions, or non-line-of-sight conditions. Localization within these maps is typically based on heuristic features that have been stored during map generation, on Global Navigation Satellite System (GNSS) approaches, or on a combination of both. At runtime, relevant static environment elements are retrieved from the map and combined with dynamic elements extracted from live sensor data, e.g., other traffic participants, free space, and the state of traffic lights. This environment model is the basis for subsequent decision making, trajectory planning, and control.
A variety of localization approaches for autonomous vehicles have been proposed and used in practice. The majority of the proposed approaches follow two fundamental ideas: GNSS-based systems and systems based on heuristic features.
GNSS-based systems, such as GPS or Galileo, use satellite signals, to which direct line of sight is required, to determine their position via trilateration.
The requirement for line of sight poses a problem in tunnels or urban environments, where tall buildings may reflect or block the signal, resulting in reduced accuracy or an inability to localize.
These issues can be somewhat mitigated by fusing the GNSS position with measurements from other sensors, such as an inertial measurement unit (IMU) and map data.
Such approaches can help to temporarily overcome situations with degraded performance, but suffer from IMU drift and do not reliably solve the robustness problem in general.
As detailed in Section II, camera or LIDAR-based localization often relies on finding and identifying heuristic feature points in images.
Both camera and LIDAR approaches suffer from weak long-term feature stability in changing weather and lighting situations.
Contribution: As illustrated in Fig. 1, we investigate the idea of augmenting the infrastructure with active visual features that are specifically designed to be easily detectable and identifiable at long range. As detailed in Section II, approaches proposed in prior art are not detectable at long ranges and therefore require a high feature density, rendering them impracticable for large-scale outdoor environments. This work presents a proof of concept for a band-pass-filtered camera that is able to detect and identify individual infrared beacons at long range, both day and night. Due to the large detection distance, a sparse distribution of our unobtrusive low-cost beacons is sufficient to augment wide areas for autonomous vehicle deployments or other robotic applications. For example, our detection distance is three times greater than the typical distance between light poles in urban areas in Germany. Our contributions are summarized as follows:
We propose a simple and robust approach for infrastructure-based localization using sparsely distributed infrared beacons.
We describe the hardware and protocol design and image processing pipeline in detail.
We provide a proof-of-concept evaluation for a long-range outdoor scenario for both day and night.
Structure: In Section II, we discuss related work, both for classic approaches as well as other approaches using hand-crafted features. In Section III, we describe design decisions for our camera system, the infrared beacons as well as the image processing pipeline. We provide proof-of-concept evaluation results in Section IV and conclude the paper in Section V.
II Related Work
We first focus on common localization approaches that are based on GNSS or heuristic features.
Prior art on infrastructure-based systems is then discussed in more detail.
GNSS approaches such as GPS or Galileo use satellite signals to determine the current position. While GPS achieves meter-level accuracy [tan_dgps-based_2006-1], performance can be improved through the use of differential GPS (DGPS) or real-time kinematic (RTK) solutions, which use base stations with known locations to improve positioning accuracy [kuutti_survey_2018]. The reliability problem remains in urban environments and tunnels, where signals can be reflected or blocked entirely, resulting in reduced accuracy or an inability to localize. To overcome situations with degraded GNSS availability, inertial measurement units (IMUs) have been used to estimate the vehicle position relative to an initial position. However, IMU integration accumulates errors, which results in a growing drift of the estimated position from the true position. For example, Zhang et al. [zhang_sensor_2012] improve raw GPS performance through fusion with an IMU, but the accumulated root mean square positioning error remains too high for autonomous driving.
Camera-based techniques have been investigated in numerous publications [li_location_2010, sattler_hyperpoints_2015, torii_247_2015, taira_inloc_2018]. Often, these techniques rely on heuristic feature points, which are embedded in a map and matched with features extracted from live camera images for localization. In [sattler_benchmarking_2018], Sattler et al. benchmark various approaches under changing environment conditions and conclude 'that long-term localization is far from solved'. Works proposed so far have failed to solve the issue of long-term feature stability. Problems arise in feature detection and in matching with corresponding map features as the scene appearance changes over time. Environmental changes caused by dynamic objects, such as bicycles and pedestrians, whose landmark features may have been stored during mapping, are also problematic. Frequent map updates and semantic information may accommodate long-term changes [Schoenberger2018Semantic, Xiao2018MonocularSemantic], but performance issues under different lighting and weather conditions remain.
LIDAR-based localization is promising [Levinson2010RobustLIDAR, Wolcott2015FastLIDAR, Rohde2016PreciseLIDAR], but the prohibitively high cost of LIDAR systems is an economic challenge for mass deployment in consumer vehicles. Furthermore, LIDAR suffers from reduced surface reflectivity in rain or snow. Storing raw point clouds consumes large amounts of memory per kilometer, and even with compression the reported storage requirements per kilometer remain substantial [wolcott_robust_2017].
Infrastructure-based approaches try to alleviate recognition problems of heuristic features by purposely placing hand-crafted features or devices in the environment. We first present related work for indoor systems and then discuss prior work for outdoor applications.
Li et al. propose Epsilon, an indoor positioning system that uses regular LED lamps [Li2014Epsilon] and achieves sub-meter accuracy indoors. The authors of [Hijikata2009indoor-led] use infrared LEDs placed on walls, which are detected by a camera system, allowing for mobile robot localization. To simplify detection, an optical filter that blocks visible light is placed in front of the camera. In contrast to our work, the LEDs do not communicate an identifier and are not distinguishable, which can limit the applicability in outdoor scenarios due to potential ambiguities. The achieved accuracy lies in the sub-meter range.
Fiducial tags such as arUco [garrido-jurado_generation_2015] and AprilTags [olson_apriltag_2011] are purposely designed for robust recognition and have been used in numerous robotic applications. These features are printed on paper and are difficult to detect and identify at long-range, especially in low-light situations. Therefore, these passive approaches require a dense distribution to allow for large-scale outdoor deployments. In contrast, our approach aims at minimizing the number of beacons required, which potentially renders the idea economically feasible and without having to clutter the environment with noticeable artifacts.
The Vehicle Information and Communication System (VICS) in Japan has been in operation since the beginning of the 1990s and consists of over 56,000 infrared transceivers installed on roads and highways. The beacons are mounted above the road, have a limited communication range, and are used for traffic monitoring and traffic information purposes. In [hayama_advanced_nodate], the authors were able to extend this communication range. Our proposed system is related to the Visible Light Communication (VLC) domain [yamazato_overview_2017], for which we now present related work. In [eso_experimental_2019], the authors modulate traffic light intensities and capture and decode the transmitted signal using a camera. The authors do not test their system in motion, and traffic lights are too sparsely distributed to be used for localization in wider areas. The authors of [yamazato_image-sensor-based_2014] also investigate the use of traffic lights as communication channels and achieve notable transmission rates. The same authors design a system using an LED-based transmitter capable of high-speed transmissions, but rely on an expensive high-speed camera and much larger beacons [nagura_improved_2010, nagura_tracking_2010]. In [Liu2003positioning-beacons], Liu et al. propose an outdoor localization system incorporating LEDs in traffic lights and visible light beacons. The authors design a coding scheme that causes the human eye to perceive the beacons as permanently turned on. In contrast to our work, the use of optical band-pass filters was not investigated and no outdoor evaluation was conducted. Kim et al. propose a VLC system for localization, but the demonstration and evaluation are carried out only in simulation [kim_vehicle_2016]. Since we do not focus on high-bandwidth transmission, we do not need sophisticated coding schemes and high-speed cameras to achieve tolerable data rates. Instead, we use off-the-shelf cameras and simple coding schemes to increase system performance at long range.
The authors of [rabinovich_cobe_2020] propose coded beacons (CoBe) using infrared LEDs for localization and object tracking. Although convincing results are obtained, the evaluation is carried out indoors and no evaluation of outdoor applicability has been conducted. Our approach also differs fundamentally in the encoding scheme and hardware setup.
III Localization using Infrared Beacons
The underlying idea of this work is to embed active visual beacons into the infrastructure that act as long-term stable, unambiguous, and easily detectable features to facilitate long-term stable vehicle localization in urban environments under changing environment conditions. Our privacy-preserving beacons actively communicate a beacon identifier using infrared light. Thus, their visibility is not dependent on environmental conditions and external light sources, as would be the case with traffic signs or passive QR codes. Furthermore, our system is designed to increase the signal-to-noise ratio, i.e., the visibility of the infrared beacons in contrast to other elements of the environment at the camera sensor. This is achieved by restricting the wavelengths of light that reach the camera sensor to a small window around the wavelength emitted by our beacons through the use of a band-pass filter. We will now provide more detail on design considerations and the resulting hardware setup. Afterwards, we will detail the protocol design and image processing pipeline for beacon detection and identification.
III-A Hardware Design
Signal-to-Noise Ratio: Our hardware design aims at maximizing the signal-to-noise ratio to ease the detection of beacons by avoiding interference with other light sources. During the daytime, the most dominant cause of interference is sunlight. As Fig. 2 shows, the sun as an idealized black-body radiation source emits light across the full wavelength spectrum at varying intensity levels [iqbal_introduction_1983].
The sun radiates light with the highest intensity in the human visible spectrum and intensity drops for ultra-violet and infrared light.
Ideally, the beacon wavelength lies as far as possible in the infrared spectrum so that the beacon signal does not overlap with natural sunlight.
Due to physical limitations imposed by the photoelectric effect, the quantum efficiency of common silicon-based camera sensors drops to zero below the band-gap photon energy of silicon (about 1.1 eV), which corresponds to a wavelength of roughly 1100 nm.
Infrared cameras can overcome this limitation through the use of different sensor principles, but are very expensive and are therefore not considered here.
However, there are a few characteristic drops in sunlight intensity for specific wavelengths caused by interference with molecules in the atmosphere.
One of these drops is particularly strong and lies the furthest into the infrared spectrum while still being detectable by regular camera sensors.
Beacon and Camera Prototype: Fig. 3 depicts our proof-of-concept beacons, which are made up of 48 LEDs emitting at the selected infrared wavelength.
In order to allow for investigating different patterns, the LEDs are divided into 16 groups of 3 LEDs each. Each group can be individually turned on and off. Our prototype camera system consists of two cameras and a Raspberry Pi for recording. The monochrome camera for beacon detection is a Basler dart daA1600-60um. It is equipped with a narrow band-pass filter matched to the beacon wavelength. Sunlight intensity at this wavelength is low but non-zero, so the environment is still visible; to counteract this, the beacon detection camera uses a very low exposure time, resulting in a mostly black image in which only the beacons are visible. The second camera serves evaluation purposes and has no optical filter. The beacon detection camera captures images at a higher frame rate than the evaluation camera. As the beacon-specific code has strict timing requirements, an STM32 microcontroller is used for controlling the LED groups. Fig. 5 shows images captured using the prototype system described above. Most of the environment is removed by the filter, and the beacons are clearly visible at nighttime as well as against direct sunlight during the daytime, thereby dramatically increasing the signal-to-noise ratio and easing beacon detection. Note that both cameras have different lenses and the images have not been registered.
III-B Optical Communication Channel
Preliminary considerations: We considered two approaches for designing the communication channel between beacon and camera.
In a pure spatial encoding approach, each beacon displays a unique pattern for identification (e.g., QR codes).
Global identifier uniqueness would not be required as long as it is unique for a sufficiently large local neighborhood of beacons.
As a result, beacons can potentially be identified from a single camera image instead of having to track and reconstruct the signal over multiple images.
With 16 individually switchable LED groups, this would result in a maximum of 2^16 = 65,536 different IDs for our prototype beacon.
For beacon recognition, this approach would require the camera to clearly image individual LEDs or groups of LEDs, which becomes infeasible after a few meters for a beacon of our size, as the examples in Fig. 5 show.
This approach may be feasible with extreme camera resolutions or significantly larger beacons—options we discarded for practical considerations.
Linecode: We choose a mixture of temporal and spatial encoding, where the identifier is transmitted through the alternation of simple patterns over time. An intuitive approach is turning the beacon on and off to transmit 1 or 0. However, due to the large displacements during the off phase, tracking at typical urban vehicle speeds has proven difficult in experiments, especially with multiple beacons. Instead, we use the diagonal symbols depicted in Fig. 4 to implement a non-return-to-zero (NRZ) line code: depending on the diagonal orientation, a 1 or a 0 is represented. The figure shows the active LED groups as well as real-world images at two distances as observed by the band-pass-filtered camera. Here it becomes obvious that identifying individual LED groups is not tractable for a purely spatial encoding. The chosen approach has the advantage that the beacons are always active, which eases tracking and subsequent signal reconstruction.
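As a sketch, the two NRZ symbols can be modelled as activation masks over the LED groups. The 4x4 group arrangement assumed below is hypothetical; the actual group layout of the prototype is not specified in this excerpt.

```python
import numpy as np

def symbol(bit: int) -> np.ndarray:
    """Illustrative 4x4 LED-group activation mask for one NRZ symbol.

    A 1 is rendered as one diagonal, a 0 as the perpendicular diagonal.
    The 4x4 layout of the 16 groups is an assumption for illustration.
    """
    eye = np.eye(4, dtype=int)
    return eye if bit == 1 else np.fliplr(eye)
```

Since the two masks are perpendicular diagonals, a decoder only needs to recover the dominant image orientation per frame, not individual LED groups.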
Beacon identifier: A 12-bit code is assigned to every beacon and transmitted in an endless loop using the two previously introduced symbols, see Fig. 4. For the sake of simplicity, we do not use a synchronization symbol. Ambiguities may arise if codes are chosen without constraints, e.g., 00110011 and 11001100 generate the same signal if repeated indefinitely. Therefore, we use prefix-free codes that are generated through ambiguity testing of all cyclic shifts of all possible identifiers. The camera records images at a multiple of the symbol rate, which allows for oversampling and more robust signal reconstruction. Transmission of the full 12-bit beacon identifier accordingly takes 12 symbol periods.
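The cyclic-shift ambiguity test can be sketched as follows. This is an illustrative reconstruction, not necessarily the authors' exact procedure: it greedily keeps one representative per rotation class and rejects codes that are ambiguous with a cyclic shift of themselves or of an already accepted code.

```python
def rotations(code: str) -> set:
    """All cyclic shifts of a bit string."""
    return {code[i:] + code[:i] for i in range(len(code))}

def select_identifiers(n_bits: int = 12) -> list:
    """Greedily select codes whose endless repetition is unambiguous.

    A code is rejected if it is periodic (some cyclic shift equals the
    code itself) or if any of its cyclic shifts collides with a shift
    of a previously accepted code.
    """
    used = set()   # all rotations of accepted codes
    ids = []
    for v in range(2 ** n_bits):
        code = format(v, f"0{n_bits}b")
        rots = rotations(code)
        if len(rots) < n_bits:   # periodic, ambiguous with itself
            continue
        if rots & used:          # ambiguous with an accepted code
            continue
        ids.append(code)
        used |= rots
    return ids
```

For 12-bit codes this yields one representative per aperiodic rotation class (the binary Lyndon words of length 12, of which there are 335), so several hundred distinct beacon identifiers remain available.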
III-C Image Processing
We will now describe the image processing pipeline used for beacon detection and identification.
The processing pipeline first generates image patches that potentially contain beacons.
Candidates are then tracked using a tracking-by-detection approach and the resulting tracks are then used for signal reconstruction.
The beacon detection stage first operates on binarized images for filtering purposes, but uses grayscale images for subsequent stages. We first generate a set of proposal image patches that potentially contain beacons and then filter the proposals based on comparison with a reference image. First, the 8-bit grayscale image is converted to a binary image by setting pixel intensities below a threshold to zero and all others to the maximum value. This step removes noise from the image; the threshold value of 5 was determined through histogram analysis. Next, a set of contiguous areas is extracted from the binary image using morphological operations as beacon proposals. Each area is described by its bounding box width, height, and location in the camera image. As beacons have a fixed size, the maximum and minimum area occupied in the camera image depends only on the distance to the camera. Upper and lower bounds can be determined either analytically using the pinhole camera model or experimentally, and the proposals are filtered based on these limits such that both large-scale artifacts and individual white pixels are excluded. The choice of the lower bound significantly affects the detection range: according to the pinhole model, a distant beacon occupies only a few pixels. Individual pixels of low intensity have already been removed by thresholding, but the decoder is still able to determine the orientation correctly in the grayscale image, which explains the choice of a small lower limit. Due to the discrete resolution of the camera sensor, in practice beacons usually irradiate more than two neighboring pixels and thus occupy a larger area.
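A minimal sketch of the proposal stage is shown below, using a simple flood fill in place of the morphological operations; the threshold and area bounds are illustrative values, not the paper's exact parameters.

```python
import numpy as np
from collections import deque

def beacon_proposals(img: np.ndarray, thresh: int = 5,
                     min_area: int = 2, max_area: int = 400) -> list:
    """Binarize a grayscale image and return bounding boxes (x, y, w, h)
    of connected bright regions whose area lies within plausible beacon
    limits. Threshold and area bounds here are illustrative.
    """
    binary = img >= thresh
    seen = np.zeros_like(binary, dtype=bool)
    boxes = []
    h, w = binary.shape
    for y0 in range(h):
        for x0 in range(w):
            if binary[y0, x0] and not seen[y0, x0]:
                # BFS flood fill over 8-connected neighbours
                q = deque([(y0, x0)])
                seen[y0, x0] = True
                ys, xs = [], []
                while q:
                    y, x = q.popleft()
                    ys.append(y)
                    xs.append(x)
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and binary[ny, nx] and not seen[ny, nx]):
                                seen[ny, nx] = True
                                q.append((ny, nx))
                # keep only regions whose area is plausible for a beacon
                if min_area <= len(ys) <= max_area:
                    boxes.append((min(xs), min(ys),
                                  max(xs) - min(xs) + 1,
                                  max(ys) - min(ys) + 1))
    return boxes
```

In a production pipeline the flood fill would typically be replaced by a library routine such as OpenCV's connected-components analysis, which returns the same area and bounding-box statistics.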
The final detection step uses shape matching to compare the proposals with a reference image. To this end, we compute Hu moments, which are translation-, rotation-, and scale-invariant, for each proposal in the original grayscale image. Image moments of order $(p+q)$ are defined as
$M_{pq} = \sum_{x}\sum_{y} x^{p} y^{q} I(x,y)$
and central moments as
$\mu_{pq} = \sum_{x}\sum_{y} (x-\bar{x})^{p} (y-\bar{y})^{q} I(x,y),$
with $\bar{x} = M_{10}/M_{00}$ and $\bar{y} = M_{01}/M_{00}$ being the centroids in each dimension.
Scale-invariant moments are then obtained through normalization as
$\eta_{pq} = \mu_{pq} / \mu_{00}^{\,1+(p+q)/2}.$
The seven rotation-invariant Hu moments $h_1,\dots,h_7$ can then be defined using the $\eta_{pq}$; for example, $h_1 = \eta_{20} + \eta_{02}$ and $h_2 = (\eta_{20}-\eta_{02})^2 + 4\eta_{11}^2$. A complete list of all moment definitions is provided in the literature. The distance between the reference image and a proposal patch is then computed on the basis of the respective Hu moments.
Proposals whose distance to the reference falls below a threshold are regarded as detections and are passed to the tracker. The threshold value has been determined experimentally.
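A numpy-only sketch of the Hu-moment computation and a log-scaled distance is given below. The paper does not specify the exact distance metric in this excerpt; the I1-style form used here (cf. OpenCV's matchShapes) is a common choice and is therefore an assumption.

```python
import numpy as np

def hu_moments(patch: np.ndarray) -> np.ndarray:
    """Compute the seven Hu invariants of a grayscale patch."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    m00 = patch.sum()
    xb, yb = (xs * patch).sum() / m00, (ys * patch).sum() / m00

    def eta(p, q):
        # normalized (scale-invariant) central moment
        mu = ((xs - xb) ** p * (ys - yb) ** q * patch).sum()
        return mu / m00 ** (1 + (p + q) / 2.0)

    e20, e02, e11 = eta(2, 0), eta(0, 2), eta(1, 1)
    e30, e03, e21, e12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    a, b = e30 + e12, e21 + e03
    return np.array([
        e20 + e02,
        (e20 - e02) ** 2 + 4 * e11 ** 2,
        (e30 - 3 * e12) ** 2 + (3 * e21 - e03) ** 2,
        a ** 2 + b ** 2,
        (e30 - 3 * e12) * a * (a ** 2 - 3 * b ** 2)
            + (3 * e21 - e03) * b * (3 * a ** 2 - b ** 2),
        (e20 - e02) * (a ** 2 - b ** 2) + 4 * e11 * a * b,
        (3 * e21 - e03) * a * (a ** 2 - 3 * b ** 2)
            - (e30 - 3 * e12) * b * (3 * a ** 2 - b ** 2),
    ])

def hu_distance(ha: np.ndarray, hb: np.ndarray) -> float:
    """Log-scaled I1-style distance over the Hu invariants (assumption)."""
    d = 0.0
    for a, b in zip(ha, hb):
        if abs(a) < 1e-10 or abs(b) < 1e-10:
            continue  # skip near-zero invariants (numerically unstable)
        d += abs(1.0 / (np.sign(a) * np.log10(abs(a)))
                 - 1.0 / (np.sign(b) * np.log10(abs(b))))
    return d
```

Because the invariants are unchanged under translation, rotation, and scale, the distance between a proposal and the reference stays small as the beacon moves and changes apparent size in the image.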
Tracking: The objective of the tracking stage is to associate beacon detections across multiple images such that the transmitted signal can be reconstructed. Due to the high frame rate, we assume only small shifts of the beacon position between consecutive images. Inputs to this stage are tracked beacons from previous images, denoted as tracks $t_i$, and the detections $d_j$ for the current camera image. We compute a distance matrix over all detection and track combinations based on the Euclidean distance of the bounding box centroids $c(\cdot)$:
$D_{ij} = \lVert c(t_i) - c(d_j) \rVert_2.$
We follow a greedy approach and associate tracks and detections starting with the smallest distance up to a maximum distance threshold.
A new track is created for every detection that cannot be matched to any existing track, either because it exceeds the maximum distance or because all other tracks have already been matched.
Tracks are discarded if they are not matched to a detection for more than 30 frames.
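The greedy association described above can be sketched as follows; the track representation and the maximum-distance value are illustrative choices, not the paper's exact implementation.

```python
import math

def associate(tracks: list, detections: list,
              max_dist: float = 20.0, max_misses: int = 30) -> list:
    """Greedy nearest-first assignment of detections to tracks.

    tracks:     list of dicts {'center': (x, y), 'misses': int,
                'history': [(x, y), ...]} (illustrative representation).
    detections: list of (x, y) bounding-box centroids for the current frame.
    Returns the surviving track list after this frame.
    """
    # all (distance, track index, detection index) pairs, smallest first
    pairs = sorted(
        (math.hypot(t['center'][0] - d[0], t['center'][1] - d[1]), ti, di)
        for ti, t in enumerate(tracks) for di, d in enumerate(detections))
    used_t, used_d = set(), set()
    for dist, ti, di in pairs:
        if dist > max_dist or ti in used_t or di in used_d:
            continue
        used_t.add(ti)
        used_d.add(di)
        tracks[ti]['center'] = detections[di]
        tracks[ti]['misses'] = 0
        tracks[ti]['history'].append(detections[di])
    # unmatched tracks age; unmatched detections start new tracks
    for ti, t in enumerate(tracks):
        if ti not in used_t:
            t['misses'] += 1
    new = [{'center': d, 'misses': 0, 'history': [d]}
           for di, d in enumerate(detections) if di not in used_d]
    return [t for t in tracks + new if t['misses'] <= max_misses]
```

The greedy nearest-first scheme is sufficient here because beacons are sparse and inter-frame displacements are small, so the optimal (e.g. Hungarian) assignment rarely differs.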
Signal Reconstruction: After beacons have been detected and tracked over several images, the decoder attempts to recover the identifier for each track. The two symbols correspond to perpendicular diagonals, so the signal decoding stage first determines the beacon orientation in each frame. For each frame belonging to a track, the orientation is computed from the second-order central moments as
$\theta = \frac{1}{2} \arctan\!\left(\frac{2\mu_{11}}{\mu_{20} - \mu_{02}}\right).$
Orientations are mapped to a bit value of 1 or 0 depending on the sign of $\theta$. The decoder is designed to exploit the oversampled symbol display, i.e., the time between camera images is shorter than the display time of a single bit. For each track, a sliding window sums up the last seven estimated bit values (cf. Eq. 8).
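The per-frame orientation estimate can be sketched as follows; the mapping of the sign of the angle to 0/1 is an assumed convention.

```python
import numpy as np

def orientation_bit(patch: np.ndarray) -> int:
    """Estimate the diagonal orientation of a beacon patch from its
    second-order central moments and map it to a bit value.
    The 0/1 sign convention is an assumption for illustration.
    """
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    m00 = patch.sum()
    xb = (xs * patch).sum() / m00
    yb = (ys * patch).sum() / m00
    mu11 = ((xs - xb) * (ys - yb) * patch).sum()
    mu20 = ((xs - xb) ** 2 * patch).sum()
    mu02 = ((ys - yb) ** 2 * patch).sum()
    # atan2 keeps the correct quadrant when mu20 == mu02
    theta = 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)
    return 1 if theta > 0 else 0
```

Because the two symbols differ only in diagonal direction, this single scalar per frame is all the decoder needs before the sliding-window stage.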
As depicted in Fig. 6, the sliding window produces a signal that is input to a Schmitt trigger with an upper and a lower threshold.
If the sum exceeds the upper threshold, the beacon signal is interpreted as 1; if it falls below the lower threshold, as 0.
Whenever a signal level change has been detected, a single bit is appended to the decoded identifier of the track corresponding to the Schmitt trigger state and the time of signal change is stored.
The decoder also handles repetitions of the same symbol, as the time to sample the Schmitt trigger is derived from the known beacon frequency.
Each track generates a bit sequence that can be matched to a known list of beacon identifiers.
Errors in previous processing stages are compensated during decoding, as recognizing known identifiers in tracks that do not belong to a beacon is unlikely.
Thus, this stage adds to the overall robustness of the system.
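A minimal sketch of the sliding-window Schmitt-trigger decoding is given below. The window length and threshold values are illustrative, and the handling of repeated symbols via the known beacon timing, described above, is omitted for brevity (so only transitions are emitted).

```python
from collections import deque

def decode_track(bit_estimates: list, window: int = 7,
                 high: int = 5, low: int = 2) -> list:
    """Reconstruct the transmitted bit sequence from per-frame
    orientation estimates (one 0/1 value per camera frame).

    A sliding window sums the last `window` estimates; a Schmitt
    trigger with hysteresis (thresholds `high`/`low`, illustrative
    values) emits a bit on each level change.
    """
    win = deque(maxlen=window)
    state = None        # current Schmitt trigger level
    decoded = []
    for b in bit_estimates:
        win.append(b)
        if len(win) < window:
            continue    # wait until the window is full
        s = sum(win)
        if s >= high and state != 1:
            state = 1
            decoded.append(1)
        elif s <= low and state != 0:
            state = 0
            decoded.append(0)
    return decoded
```

The hysteresis between the two thresholds is what makes the decoder robust to single-frame orientation errors: one flipped estimate moves the window sum by only one, which cannot cross both thresholds.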
Camera Pose Estimation: Subsequent processing steps for heuristic-feature-based approaches and our beacon-based system are similar [rabinovich_cobe_2020]. The essential matrix is computed for image pairs based on feature tracks, from which point clouds can be triangulated and further refined using bundle adjustment. For a given point cloud, the current pose can then be estimated by solving the Perspective-n-Point problem, which yields a solution given a sufficient number of observed beacons. The performance is heavily influenced by feature matching and recognition. Therefore, our work focuses on the ability of the system to detect, track, and identify the proposed beacons.
IV Experiments & Results
[Table I (fragment): Correct ID Bits at the five standstill distances: 263, 265, 263, 261, 167]
This section presents the experiments we conducted to determine the feasibility of our proposed approach. We investigate the ability to detect and identify the beacons at both day and night in a long-range setup on a city road. We first describe the experiment setup and then present the results obtained. All datasets and visualizations of system processing are made available.
IV-A Experiment Setup
IV-B Standstill Experiment
[Table II (fragment): Total Bits Decoded per run: 673, 651, 653, 637 (avg. 653.5) and 657, 637, 591, 580 (avg. 616.25). Correct ID Bits per beacon and run: 180, 168, 156, 177 (avg. 170.25) / 228, 195, 161, 180 (avg. 191.0); 228, 206, 198, 201 (avg. 208.25) / 213, 210, 203, 192 (avg. 204.5); 151, 127, 131, 111 (avg. 130.0) / 129, 137, 129, 129 (avg. 131.0)]
This experiment is intended to determine the system performance when the vehicle is not moving. For this purpose, only a single beacon is activated and the test vehicle is parked at five increasing distances from it. We investigate the ability of the system to correctly decode the identifier transmitted by the beacon. This experiment evaluates the end-to-end performance of the system, since a correctly decoded beacon identifier requires all stages of the processing pipeline to function correctly, i.e., detection, tracking, and signal reconstruction. All recordings for this experiment were captured during daytime, and the same beacon identifier is used throughout.
The outcome of this experiment is summarized in Table I. Although the experiment was conducted in daylight, no detections, and thus no tracks, were generated other than those corresponding to the beacon, at all distances. The total number of bits decoded corresponds to the length of the bit string generated by the signal reconstruction stage.
We count all occurrences of the known beacon identifier in the decoded bit string. The first and last bits are also considered correct if they form a postfix or prefix of the identifier, respectively. All remaining bits are considered error bits. For example, let 1100100010011001010001001100100001 be the decoded string; then 11001X1X0001 results after replacing all occurrences of the identifier with X. Since 11001 and 0001 are a postfix and prefix of the identifier, respectively, we count the number of correctly decoded bits as 33 and the number of error bits as 1.
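The bit-accounting rule above can be sketched as follows; the 12-bit identifier used in the test is hypothetical, since the actual identifier values are not given in this excerpt.

```python
def count_correct_bits(decoded: str, ident: str) -> tuple:
    """Count (correct, error) bits in a decoded stream.

    Bits inside full occurrences of the identifier are correct; the
    leading run may additionally match as a postfix and the trailing
    run as a prefix of the identifier, since transmission is cyclic.
    """
    n = len(ident)
    flags = [False] * len(decoded)
    # mark non-overlapping full occurrences, left to right
    i = 0
    while i <= len(decoded) - n:
        if decoded[i:i + n] == ident:
            for j in range(i, i + n):
                flags[j] = True
            i += n
        else:
            i += 1
    # leading unmatched run may be a postfix of the identifier
    lead_end = 0
    while lead_end < len(decoded) and not flags[lead_end]:
        lead_end += 1
    lead = decoded[:lead_end]
    if lead and ident.endswith(lead):
        for j in range(lead_end):
            flags[j] = True
    # trailing unmatched run may be a prefix of the identifier
    tail_start = len(decoded)
    while tail_start > 0 and not flags[tail_start - 1]:
        tail_start -= 1
    tail = decoded[tail_start:]
    if tail and ident.startswith(tail):
        for j in range(tail_start, len(decoded)):
            flags[j] = True
    correct = sum(flags)
    return correct, len(decoded) - correct
```

With a stream built as suffix + identifier + one error bit + identifier + prefix, this reproduces the 33-correct/1-error accounting from the worked example above.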
The results show that the system can derive the beacon identifier error-free up to the fourth distance, and that it is still able to detect and track the beacon at the largest distance. There, the fact that there are fewer detections than frames indicates that the detector occasionally fails to generate beacon candidates, with a negative impact on the decoded bit sequence. Although we were still able to find the identifier in the decoded bit string multiple times at this distance, errors start to occur. For practical applications, this might still be sufficient: while the position of the beacon in the camera image is crucial for ego-pose estimation, querying the geographical location of the beacon using the identifier only has to be done once.
IV-C Driving Experiment
This experiment determines the ability of the system to detect and identify the beacons while driving. The vehicle accelerates from a standstill at the starting position and drives towards the end of the road until it passes the last beacon. All three beacons are active. In order to investigate the performance under drastically changing environmental conditions, we conducted eight runs of this experiment, four during daytime and four at nighttime. The light poles are turned on in the nighttime recordings.
The results for this experiment are shown in Table II. We report the total number of frames, detections, and tracks, as well as the total bits decoded for each recording. The number of detections per run appears large, but corresponds to fewer than three beacons per frame. This is expected, as the first two beacons leave the field of view and the last beacon is picked up only after some distance. Furthermore, we report the number of tracks for each beacon whose generated decoder bit sequence contains the correct beacon identifier. In each run, only one track generates the correct identifier per beacon, with a single exception: in one run, a beacon is recognized very early, and this early track is discarded as the detector subsequently fails to redetect it. Later on, the beacon is picked up at the same distance as in the other experiments.
We also show the vehicle position (cf. Fig. 7) at the first and last recognition of the full beacon identifier. The first two beacons are identified within the recognition range observed during the standstill experiment. The distance covered before the first recognition is non-zero, as the vehicle is in motion and the transmission of the 12-bit identifier takes a finite amount of time, during which the vehicle covers a considerable distance. The observed recognition distances are therefore slightly lower than in the standstill case.
Table II also shows the correct and incorrect bits in the decoded bit string as defined in the standstill experiment. Here, we observe a higher number of error bits. We found that in many cases the errors occur at the end of the track lifetime: as beacons leave the field of view, bits are mistakenly appended because the decoder assumes symbol repetition. We expect this problem to be eliminated with marginal tracker improvements.
The performance of the system does not differ significantly between nighttime and daytime, which becomes clear when comparing decoding errors. This might be attributed to the normalization of the camera image achieved by the band-pass filter. Due to the absence of other light sources at the beacon wavelength, the number of detections and thus the number of tracks is slightly lower at nighttime. It is noteworthy that the last beacon is recognized earlier at nighttime on average than during daytime. The greater recognition distance might be attributed to less noisy detections and therefore more robust template matching in the detector.
V Conclusion & Future Work
We have presented an approach for infrastructure-based localization using infrared beacons and a band-pass-filtered camera system. The hardware design was described in detail and the underlying design considerations were elaborated. We described our image processing pipeline, which uses traditional computer vision techniques. Our experiments demonstrate the ability of the system to detect, track, and identify our beacons at long range, independent of light conditions. As our detection range is greater than the typical distance between light poles in urban areas, the obtrusive and costly construction of new infrastructure is not required. The use of infrared light increases the signal-to-noise ratio and is invisible to the human eye. Our contribution can be regarded as a proof of concept, and many more investigations have to be carried out. We plan to investigate alternative beacon patterns, as our beacon prototype allows displaying up to 2^16 different symbols. Furthermore, experiments in wider urban areas are necessary. Due to the sparse beacon distribution, the system is more vulnerable to detection and identification failures; therefore, strategies have to be devised for handling occlusions and errors in beacon identifier decoding. Future experiments will also investigate the final positioning performance achieved.