Humans have five senses, and of those, visual perception is likely the main source of information relevant for driving a vehicle Sivak1996 . This is one reason why visual perception is essential in robotics and assisted (autonomous) driving. In recent years, visual perception by machines has made huge progress through the use of Neural Networks (NNs) fasterrcnn ; yolo3 ; imagenet and has achieved superhuman performance on some tasks resnet . However, while progress has been strong in specific directions like object detection or action recognition, the abilities remain far behind general-purpose human performance, which is, for instance, reflected in the problem of adversarial robustness Eykholt2018 .
Another reason why computer vision systems cannot compete with humans in general-purpose tasks is that computer vision systems are mostly trained to solve one specific task. For this, the vision task is formulated in a mathematical framework. For example, in object detection, the most studied field in computer vision, the objects are marked by bounding boxes and the task is to predict and classify those bounding boxes. Often, this sufficiently reflects our human visual performance. But, in general, human visual perception is more complex. Humans reason about the environment based on complex learned causalities using all the available information. If we hear a siren, for instance, we expect an ambulance to appear and try to visually estimate where it will appear. Such causalities that enrich our environmental model are present in everyday life: in daylight, we use shadow movements and illumination changes to reason about moving objects without having direct sight, and if we drive a car through a village and see a ball rolling on a street, we expect playing children to appear.
Another example of complex causalities that enrich the environmental model occurs while driving at night. During nighttime, humans show impressive abilities to foresee oncoming cars by analyzing illumination changes in the environment like light reflections on guardrails (see left image in Fig. 1), a brightening of a turn ahead, unnatural glares in trees, and so on. Drivers use this provident information to adapt their driving style proactively, like turning off the high beam in advance to avoid blinding oncoming drivers or adapting their driving trajectory. In the scope of safe and anticipatory driving, where time matters and the earlier information is received the better, this human ability is obviously handy and outperforms current computer vision systems used in vehicles.
Oldenziel et al. Oldenziel2020 analyzed this discrepancy between the human detection capabilities and an in-production computer vision system and quantified that humans are approximately 1.7 s faster. One reason why state-of-the-art object detection systems are behind human capabilities is that object detection systems rely on the assumption that objects have clear, visible object boundaries. Even if this assumption comes with a lot of advantages, like a well-defined description for the enclosing bounding box of an object, it is not inherently applicable to light artifacts, since light artifacts (illuminated areas) usually have no clear object boundaries and their position does not directly correspond to the location of the light source. Therefore, due to this assumption, the earliest point in time an oncoming vehicle can be detected by state-of-the-art computer vision systems is after almost full visibility (see right image in Fig. 1).
Vehicles are increasingly equipped with driver assistance systems, and manufacturers are working on self-driving cars. Therefore, while driving, more and more tasks are controlled or supported by systems, such that the algorithms bear more and more responsibility to operate correctly in our complex environment. For safe and anticipatory driving, time matters, and 1.7 s is a non-negligible, unexplored potential to, for instance, plan driving trajectories, understand the environment, or simply control the high beam to avoid blinding the drivers of oncoming vehicles.
In this paper, we study the task to detect light artifacts at night so that we can reduce the aforementioned time difference. To illustrate the usefulness (and to visualize the detection results in the real-world and in real-time), we equip a test car with such a detection algorithm and use the information to proactively control the matrix beam headlights boke2015mercedes . On the computer vision level, our contributions are:
a simple and fast computer vision algorithm that is able to detect light artifacts;
an investigation of methods to estimate the distance to light artifacts so that we can estimate the three-dimensional position.
On the system level, we investigate:
the tool chain to integrate such a computer vision system in a vehicle to control a component;
the time benefits that can be gained when using such a system.
The outline of the paper is as follows: First, in Section 2, the current state-of-the-art in vehicle detection at night is reviewed. This section also covers state-of-the-art methods for distance estimation for camera-based systems. Limitations regarding the provident nature of such systems are then highlighted in Section 3. New approaches to provident detection are reviewed as well, including the baseline systems used to develop the featured system in this work. In Section 4, a complete pipeline is presented, including the perception systems developed by Saralajew et al. Saralajew2021 . To show the feasibility of such a system in a production car environment, multiple experiments are performed in a real-world environment. The experiments are described and evaluated in Section 5, and Section 6 gives a conclusion and a future outlook.
2 Related work
For autonomous driving and Advanced Driver Assistance Systems (ADAS), the detection of vehicles is of high priority to perform emergency braking maneuvers, automatically control the high beams, and so on. The commonly used sensor for that is a driver assistance camera that captures images in the visual range. The methods used to detect vehicles differ between daylight and nighttime. For example, in daylight, vehicles are detected based on feature descriptors like edge detectors, symmetry arguments, and classifiers like support vector machines, such that vehicles are recognized mainly by their contours Sun2002 ; Sun2006 ; SSTTB2012 , or they are detected by deep NNs Fan2016 . In contrast, at nighttime, due to low contrast, vehicles are detected by their headlamp and rear light singularities in the image space caused by the luminous intensity of the light sources Lopez.2008 ; P.F.Alcantarilla.2011 ; Eum.2013 ; Juric.2014 ; Sevekar.2016 . Even if this detection principle of light sources is highly efficient, it limits the capability of the detection systems to directly visible objects since light artifacts caused by headlamps cannot be sufficiently described as a singularity, as their fuzzy nature and weak intensity do not allow for clear boundaries.
The provident detection of objects was already studied by several authors. At daylight, Naser et al. FelixMaximilianNaser.18.01.2019 providently detected objects by analyzing shadow movements and, at nighttime, Oldenziel et al. Oldenziel2020 and Saralajew et al. Saralajew2021 studied the task to providently detect oncoming vehicles by detecting light artifacts produced by the headlights. In particular, this work extends the analysis of the latter two and deploys the method in a test vehicle to providently control the car’s matrix beam headlights.
Oldenziel et al. Oldenziel2020 analyzed the discrepancy between the human abilities and an in-production computer vision system in detecting oncoming vehicles. Notably, based on the results of a test group study, the authors specified the deficit in detecting oncoming vehicles providently by 1.7 s on average in favor of humans. Since this is a significant amount of time, the authors studied whether it is possible to detect oncoming vehicles based on light artifacts by training a Faster-RCNN architecture fasterrcnn on a small dataset of approximately 700 images annotated by bounding boxes. The presented results showed that the NN learned the task to some extent. However, the analysis of the detection results raised concerns whether an annotation method with bounding boxes is a good annotation scheme for light artifacts due to a high annotation uncertainty because of unclear object boundaries.
Saralajew et al. Saralajew2021 extended the work of Oldenziel et al. Oldenziel2020 and published a dataset containing 59 746 annotated gray-scale images for the task of providently detecting oncoming vehicles at nighttime. In contrast to Oldenziel et al. Oldenziel2020 , the published dataset uses keypoint annotations with a clear annotation hierarchy to allow the investigation of several use cases. Moreover, with the keypoints as initial seeds, the authors further explored methods to extend the keypoint annotations to bounding boxes with low annotation uncertainty. Finally, to show the usefulness of their dataset, they trained several machine learning algorithms for the task of detecting light artifacts (light reflections, glaring of areas above a street, headlamps, etc.). The two types of architectures used for this experiment are conceptually completely different. The first type is based on YoloV5 architectures (https://github.com/ultralytics/yolov5), which are deep NN architectures, whereas the second one is a two-phase framework consisting of a heuristic blob detector followed by a shallow NN. Both methods show promising results and provide a strong baseline for further experiments. Due to its low computational complexity, we build upon their two-phase framework, fine-tune the architecture such that we outperform their published results, and deploy the method in a test car.
To use the light artifact detection results, as an example, for the proactive control of the matrix beam headlights of the test car, the artifacts have to be located in three-dimensional space. For this purpose, several approaches can be used, for example, single image depth estimation eigen2014 ; Laina2016 , depth estimation from video Zhou2017 ; ranftl2016 ; gordon2019 , structure from motion furukawa2004 ; saponaro2014 ; gallardo2017 , and object localization through the ground plane Song2015 . As these methods require the validity of certain assumptions (which are not valid for light artifacts) to provide an accurate estimate, we study a new approach that uses predictive street data to locate the light artifacts.
3 Inherent limitation of current systems
Simply put, the motivation of this work is to provide the information about oncoming vehicles at night earlier than current systems do (in the best case before they are directly visible) to ensure safe and anticipatory driving. Current systems are technically limited in how early they can perceive a vehicle (see 2), which is caused by the commonly used object detection paradigms and the system-related latencies. Within this section, we explain why these limitations exist and are inherent. Knowing these limitations is essential to understand what we can achieve with the presented approach (how fast we can get in detecting vehicles).
3.1 Object detection paradigms
First, it must be noted that current camera-based perception models used to detect vehicles at night are object detectors. As the most reliable information source, the headlamps of other vehicles are used to detect the position of vehicles. So in most vehicle detection systems, headlamps are used as “objects” from which succeeding systems can infer the location of vehicles present in the image (e. g., P.F.Alcantarilla.2011 ). While being a robust reference, the restriction to headlamps limits the performance of such systems, since the earliest point in time they can perceive a vehicle is when they have direct sight of the vehicle (see “object becomes visible” in Fig. 2 and the middle image in Fig. 1). As already mentioned in Section 1, this differs from how humans estimate whether and where a vehicle is oncoming because humans react to light artifacts like the light reflections on the guardrail in Fig. 1. Thus, the question is why current vehicle detection systems do not detect, or at least attempt to detect, light artifacts, considering the obvious discrepancy between humans and systems regarding this task (see the time gap between the human provident and camera-based object detection in Fig. 2). We can only speculate why this is the case but expect that one reason is the object detection paradigm: the algorithms detect objects that match an object definition. For example, if the object detector is a bounding box regressor, we have to be able to specify the object boundaries to define bounding boxes. However, due to the nature of light, light reflections caused by headlamps have no clear object boundaries (see Fig. 1). In vehicle detection at night, as mentioned in Section 2, the computer vision systems are often designed to detect the light blobs of the headlights (e. g., by keypoints). A headlamp can be considered an intensity singularity in the image, meaning an extraordinarily high intensity peak. Again, this assumption does not apply to light artifacts since they illuminate larger areas almost homogeneously with small gradients, and their intensity varies largely depending on their strength and other global light sources.
In summary, light artifacts cannot be treated as objects without further consideration due to their missing object boundaries and annotation difficulties. From now on, we will denote both indirect light objects (e. g., light reflections on guardrails) and direct light objects (e. g., headlights) of oncoming vehicles as light artifacts or sometimes as light instances. In Section 5.1, we briefly recap the work of Saralajew et al. Saralajew2021 and the presented solution for tackling the inherent annotation uncertainty by using keypoints instead of bounding boxes.
3.2 System related latencies
As already mentioned in 2, Oldenziel et al. Oldenziel2020 showed that even if detection systems are able to detect vehicles directly after direct sight, on average they have a system related latency. The precise reasons for this latency are not clear as the in-production perception models operate as black-boxes. However, we suspect that the major cause is the plausibility check, which uses the detection results of several frames to validate the detected objects regarding temporal consistency to increase the overall precision of the system and to prevent oscillating behavior of the main ADAS functionalities—e. g., the glare-free high beam assist.
In general, we assume that the following steps cause the latency in a vehicle detection system—which is also qualitatively visualized in 2:
After the vehicle starts to become visible, there is a system-specific time until the vehicle has a visibility status that fulfills the object definition. After that the object can be potentially detected by the computer vision system (compare the middle and right image in 1).
If an image is captured with an object that fulfills the object definition, a latency in the detection is caused by the image processing time. Usually, this latency is shorter than the frame period of the camera.
Normally, the object detection system detects objects in the image coordinate system. To use the information of detected vehicles (e. g., to control the matrix beam headlights), the two-dimensional information has to be transformed into three-dimensional information by making assumptions or using further sensor information. Again, this causes a processing latency, which is usually shorter than the frame period.
Finally, the plausibility check causes a latency of several frames due to semantic and temporal checks. This step often performs a kind of object tracking in order to safely predict there is an oncoming vehicle.
Not only object detection systems have an internal latency, humans have that as well: reaction time. In 2, the human reaction times during the test group study are illustrated. As Oldenziel et al. Oldenziel2020 showed, the camera-based vehicle detection is approximately 200 ms faster in a fair setting (allowed detection after direct sight) than its human counterpart. However, human detection almost reaches the minimal possible detection time, acting only approximately 800 ms after the first indication (e. g., through glares, reflections) of oncoming vehicles.
In Section 4, we present a system that is able to detect light artifacts and implement all the aforementioned steps. Even if this system reacts to light artifacts, it still has the inherent system-related latencies. Therefore, depending on the scenario, it does not necessarily detect oncoming vehicles before direct sight but shifts as much as possible of the inherent latency before the moment of direct sight and, thus, detects oncoming vehicles faster than current systems do.
This section presents the methodology how light artifacts are detected. For that, in the next section, a detection algorithm is proposed that can detect light reflections. After that, the distance estimation approaches are described that are needed to locate the detected objects in the three-dimensional space. To stabilize the detections and to perform a plausibility check, a tracking algorithm is outlined in the last section.
4.1 Object detector
The first element in the system pipeline is the object detector. The task here is to detect both direct and indirect light artifacts within the camera image. The feasibility of the light artifact detection was shown by Oldenziel et al. Oldenziel2020 through multiple practical examples. The general setup for such a detector can be divided into the following sub-tasks:
Generate region proposals based on local features;
Validate the proposals to reduce the amount of false-positive detections.
This pipeline is used in many state-of-the-art systems as well as in machine learning object detectors (e. g., fasterrcnn ). The usual approach for such a system would be to use a NN-based system in an end-to-end manner. We rejected this approach because of its inevitable in-transparent nature and computational load. Instead, the approach proposed by Saralajew et al. Saralajew2021 with a tailored region proposal algorithm and an NN classifier as a validation step is used. The resulting pipeline is depicted in 3.
First, dynamic thresholding is performed to retrieve high-intensity regions from the image. Bounding box proposals are then inferred through a blob detection in the generated binary image (1: above threshold, 0: below threshold). The validation is performed using a small Convolutional NN (CNN). The results are bounding box representations of light artifacts within the image. Operations are always performed on a half-scale image (640 × 480 pixels) in order to reduce the computational complexity.
Image preparation—The raw image is filtered to reduce the amount of noise present. This is achieved by applying a Gaussian blur over the entire image, which smooths out edges and removes high-frequency noise like salt-and-pepper noise. The effects of the filtering are depicted in Fig. 3, image b).
Dynamic thresholding—Due to the low intensities noticed for light reflections and glares, a global thresholding strategy is not suitable to retrieve interesting regions from the raw image. In contrast, all considered artifacts share the common feature of a higher intensity relative to their surroundings Saralajew2021 . This can be used to perform dynamic thresholding on the image. Therefore, a pixel-wise threshold is calculated to retrieve interesting regions.
The criterion for the dynamic threshold $T(x,y)$ at pixel $(x,y)$ is defined as follows:

$$T(x,y) = \mu(x,y) \cdot \left(1 + \kappa \cdot \left(1 - \frac{\Delta(x,y)}{1 - \Delta(x,y)}\right)\right),$$

with $\mu(x,y)$ being the local mean intensity, calculated over a fixed-sized window around the pixel $(x,y)$, and $\Delta(x,y) = I(x,y) - \mu(x,y)$ being the deviation of the pixel intensity $I(x,y)$ from the local mean. The sensitivity of this threshold can be adjusted using the factor $\kappa$. This criterion is adapted from Singh et al. singh2011thresholding , who originally developed this technique to binarize documents. Comparisons with other threshold techniques showed that this method yielded high-quality results. Also, the usage of the integral image allows for an efficient computation singh2011thresholding . The threshold is calculated for every pixel in the filtered image and used to infer a binary image $B$ with

$$B(x,y) = \begin{cases} 1 & \text{if } I(x,y) > T(x,y), \\ 0 & \text{otherwise.} \end{cases}$$

An example of this binary image is shown in Fig. 3, image c).
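The dynamic thresholding step can be sketched as follows. This is a minimal, unoptimized illustration and not the authors' implementation: the threshold form (a Singh-style criterion with the sign chosen so that pixels brighter than their neighborhood end up above the threshold), the window half-width `half`, and the value of `kappa` are assumptions made for this example. Intensities are assumed to be normalized to [0, 1]. The integral image gives each local mean in constant time.

```python
def integral_image(img):
    """Summed-area table with one leading row/column of zeros."""
    h, w = len(img), len(img[0])
    sat = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += img[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def local_mean(sat, x, y, half, w, h):
    """Mean over a (2*half+1)^2 window, clipped at the image border."""
    x0, x1 = max(0, x - half), min(w, x + half + 1)
    y0, y1 = max(0, y - half), min(h, y + half + 1)
    total = sat[y1][x1] - sat[y0][x1] - sat[y1][x0] + sat[y0][x0]
    return total / ((x1 - x0) * (y1 - y0))


def dynamic_threshold(img, half=1, kappa=0.06):
    """Return a binary image: 1 where I(x, y) > T(x, y), else 0.

    Assumed form: T = mu * (1 + kappa * (1 - delta / (1 - delta))),
    with delta = I - mu. Since the window contains the pixel itself,
    mu > 0 whenever I > 0, so delta < 1 and the division is safe.
    """
    h, w = len(img), len(img[0])
    sat = integral_image(img)
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            mu = local_mean(sat, x, y, half, w, h)
            delta = img[y][x] - mu
            t = mu * (1.0 + kappa * (1.0 - delta / (1.0 - delta)))
            out[y][x] = 1 if img[y][x] > t else 0
    return out
```

In a homogeneous region the threshold sits slightly above the local mean, so no pixel is selected; only pixels that clearly exceed their surroundings survive, which matches the relative-brightness argument above.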
Blob detector—The binary image contains multiple, unconnected regions—the generated proposals of the thresholding step. For ease of use and further handling, these regions are compressed into bounding boxes. This is achieved by applying a standard blob detection routine to find connected areas and allowing gaps of the size —measured with respect to the distance. After the bounding boxes have been computed, they are filtered by removing bounding boxes where the mean absolute deviation of the included intensity values is smaller than a threshold .
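A rough sketch of the proposal step, under stated assumptions: foreground pixels are grouped by a breadth-first search, two pixels are treated as connected when their Chebyshev distance is at most `gap + 1` (the distance metric is an assumption, as is every parameter name), and boxes with a small mean absolute deviation of their intensities are dropped.

```python
from collections import deque


def find_blobs(binary, gap=0):
    """Bounding boxes (x0, y0, x1, y1), inclusive, of connected regions.

    gap=0 corresponds to plain 8-connectivity; larger values bridge
    gaps of up to `gap` background pixels (assumed Chebyshev metric).
    """
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    reach = gap + 1
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if not binary[sy][sx] or seen[sy][sx]:
                continue
            seen[sy][sx] = True
            queue = deque([(sx, sy)])
            x0 = x1 = sx
            y0 = y1 = sy
            while queue:
                x, y = queue.popleft()
                x0, x1 = min(x0, x), max(x1, x)
                y0, y1 = min(y0, y), max(y1, y)
                for ny in range(max(0, y - reach), min(h, y + reach + 1)):
                    for nx in range(max(0, x - reach), min(w, x + reach + 1)):
                        if binary[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes


def mean_abs_deviation(img, box):
    """Mean absolute deviation of the intensities inside a box."""
    x0, y0, x1, y1 = box
    values = [img[y][x] for y in range(y0, y1 + 1) for x in range(x0, x1 + 1)]
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)


def filter_boxes(img, boxes, min_mad):
    """Remove boxes whose intensity deviation is below the threshold."""
    return [b for b in boxes if mean_abs_deviation(img, b) >= min_mad]
```

A production implementation would more likely rely on an optimized connected-components routine; the explicit search is used here only to keep the example free of dependencies.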
Classification—The proposal bounding boxes still contain many false-positive detections, as simply put, all bright objects were retrieved so far. This allows for a high recall of interesting regions, but also yields a low precision and therefore reduces the quality of succeeding modules (e. g., the glare-free high beam functionality). Therefore, a shallow NN is added afterwards, to classify each of the proposals (this strategy is similar to a Faster-RCNN architecture). For that, an enlarged region around each proposal bounding box is passed through a CNN. The network classifies the proposal to be either true-positive or false-positive, leading to a binary classification problem, distinguishing between whether a light artifact belongs to an oncoming vehicle or not. The reason for posing this as a binary classification problem is explained in more detail in 5.1. As this network is equivalent to Saralajew et al. Saralajew2021 , we will not focus on the architecture.
The approach used to detect light artifacts was chosen to allow for a suitable implementation in a production car, where only limited computational resources are available. While many detection and recognition systems designed for the automotive context rely heavily on parallel computing (e. g., through GPUs, TPUs), such hardware is not yet implemented in most production cars, limiting the practical usage of these systems. Even if more and more computational power will be available in the upcoming years, resources will be always limited as the number of functions increases as well. Therefore, two of the major requirements for the detection system are to be computationally efficient and to not rely too heavily on additional hardware. The simple operations used to build the proposal generation are a result of these requirements. The classification is still performed on a GPU, but is still efficient enough to be implemented on a production car’s hardware with only minor adjustments. As shown in literature, learned systems out-perform conventional methods. This is also true for our case, when computational power is unlimited. However, this fact changes with the constraint of restricted computational power. Here, the inference time of more complex systems becomes the major driver of reaction time of the overall pipeline. With this constraint the more efficient pipeline described above was able to outperform learned systems in terms of reaction time.
4.2 Distance estimator
In real-world driving scenarios it is often not sufficient to just provide the spatial information of the detected object in the two-dimensional image space. Only knowing where the object of interest is located in the environment enables the vehicle to react appropriately—for example, for performing emergency brakes or adjusting the adaptive high beams. Therefore, it is necessary to compute an estimate for the three-dimensional position of detected light artifacts.
|Method||Dark images||Arbitrarily deforming objects||Computational complexity|
|Monocular, single-image depth estimation, e. g., Fu2018 ; li2017 ; wofk2019 ; nekrasov2019 ; patil2020 ; Godarf2017 ; Laina2016 ; saxena2008 ; eigen2014||no||yes||high|
|Depth estimation from video, e. g., Zhou2017 ; ranftl2016 ; gordon2019||no||no||high|
|Structure from Motion, e. g., furukawa2004 ; saponaro2014 ; gallardo2017||partly||partly||high|
|Intersection with ground plane, e. g., zia2015 ; Song2015||yes||yes||low|
In the literature, several methods are described for estimating the distance to detected objects (also referred to as object localization or depth estimation). However, the special use case of nighttime images captured by a monocular grayscale camera adds clear restrictions. The general problem is that the images are fairly dark and low-textured, which complicates a proper application of state-of-the-art depth estimation methods. Also, light reflections can be considered non-rigid objects that deform arbitrarily over time. Furthermore, the overall goal of running the method in real-time imposes a constraint on the computational complexity.
Table 1 presents a summary of potentially applicable distance estimation methods. Due to the aforementioned constraints, depth estimation from video and structure from motion are not applicable. Additionally, monocular, single-image depth estimation approaches have too high a computational complexity, so they cannot be used for the studied approach either. (Even though these methods are not a good choice for the studied use case with respect to the computational complexity, we tested some of them in a proof-of-concept investigation on grayscale daylight and nighttime images; the results were not satisfying, which supported excluding them from further studies.) Hence, the only applicable state-of-the-art method is the object localization through ground plane intersection. Note that this approach assumes that an object is located on the ground plane, which is not necessarily true for light artifacts (e. g., a light artifact on a guardrail).
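To make the ground plane assumption concrete, the following sketch intersects the viewing ray of a pixel with a flat road. It assumes an ideal pinhole camera without lens distortion, mounted at a known height and pitched downward by a known angle; all names and the example calibration values are hypothetical and not taken from the test car.

```python
import math


def ground_plane_position(u, v, fx, fy, cx, cy, cam_height, pitch=0.0):
    """Intersect the pixel ray with a flat ground plane.

    Camera frame: x right, y down, z forward. `pitch` is the downward
    tilt of the optical axis in radians. Returns (lateral, forward)
    in meters, or None if the ray does not hit the ground ahead.
    """
    # Ray direction in the camera frame (pinhole model).
    dx, dy, dz = (u - cx) / fx, (v - cy) / fy, 1.0
    # Rotate into a level frame whose y axis points straight down.
    ry = dy * math.cos(pitch) + dz * math.sin(pitch)
    rz = -dy * math.sin(pitch) + dz * math.cos(pitch)
    if ry <= 0.0:
        return None  # ray points at or above the horizon
    scale = cam_height / ry  # ray length factor until y == cam_height
    return (scale * dx, scale * rz)
```

For example, with a focal length of 1000 pixels and a camera 1.2 m above the road, a pixel 60 rows below the principal point maps to a point roughly 20 m ahead. A reflection on a guardrail lies above the road surface, so the same computation would place it too far away, which is the limitation noted above.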
To overcome the limitations of the methods listed in Table 1, we also evaluated a rather unconventional method for estimating the distance of light artifacts by fusing the position of the object in the image with Predictive Street Data (PSD). The PSD protocol contains information about the road geometry ahead of the vehicle (see Fig. 4). With this, the road lying ahead can be projected into the vehicle coordinate system, giving a three-dimensional representation of the road geometry. At the same time, knowing the intrinsic and extrinsic camera calibration, a ray can be projected from the image coordinate frame into the real-world coordinate frame; the object of interest has to lie somewhere along this ray. Assuming that a detected light artifact always lies on or at least close to the road, projecting both the ray and the road ahead into the three-dimensional vehicle coordinate system allows searching for an intersection or closest point. This point is then considered the object position in the vehicle coordinate system.
In summary, we analyze the following four methods in the experiments:
PSD-3D uses the PSD road geometry (three-dimensional) and searches for the closest point along a projected ray of the detected object;
follows the PSD-3D principle but corrects the vehicle orientation (yaw angle and lateral offset) by road markers detected with the camera (the vehicle orientation is calculated using the onboard, in-production vehicle orientation algorithm);
follows the PSD-3D principle but simplifies the problem to a two-dimensional coordinate system by ignoring the elevation information;
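The closest-point search of the PSD-3D idea can be sketched as below. The sketch assumes that the road geometry is already available as a list of three-dimensional points in the vehicle coordinate system and that the camera ray is given by an origin and a direction; the function name and the data layout are hypothetical, and a real implementation would also interpolate between the road points.

```python
import math


def closest_road_point(origin, direction, road_points):
    """Find the road point that lies closest to a viewing ray.

    Returns (distance_to_ray, point), or None if every road point
    lies behind the ray origin.
    """
    norm = math.sqrt(sum(c * c for c in direction))
    d = [c / norm for c in direction]
    best = None
    for p in road_points:
        rel = [p[i] - origin[i] for i in range(3)]
        s = sum(rel[i] * d[i] for i in range(3))  # position along the ray
        if s < 0.0:
            continue  # behind the camera
        foot = [origin[i] + s * d[i] for i in range(3)]
        dist = math.dist(p, foot)
        if best is None or dist < best[0]:
            best = (dist, tuple(p))
    return best
```

Setting the elevation of all road points and of the ray to zero before the search would give a two-dimensional variant in the spirit of the simplified method listed above.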
4.3 Object tracking
The object detection and distance estimation are frame-based computations and, therefore, can be unstable with respect to the temporal context (e. g., if a vehicle gets occluded). To improve the detection stability, we use an object tracking algorithm
to match the objects between different frames,
to predict the position of occluded objects, and
to increase the precision of the vehicle detection.
The implemented tracker is composed of different filters:
α-β filter in the two-dimensional image space to predict and estimate the position of bounding boxes;
α-β filter to predict and estimate the distances to the objects;
moving mean filter to estimate the confidence.
Between different frames the object matching is performed by computing the intersection over union of the tracked objects to the detected objects and assigning detected objects to tracked objects with highest intersection over union. To handle noise in the detections with respect to the bounding box size, the bounding box size of the detected objects is slightly increased before the intersection over union is computed.
If an object is occluded (not detected in the last frame), the prediction of the α-β filter is used to forecast the position of the object for a maximum of three frames before it is removed from the list of tracked objects. Additionally, to increase the precision of the vehicle detection system, an object is only output when it has already been detected for at least five frames and if the estimated confidence is greater than a threshold of 0.5; thus, the tracker also operates as a plausibility checker. Finally, to lower the number of tracked objects, the tracker only considers objects with a confidence value greater than a second threshold of 0.1.
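The tracking logic can be sketched as follows, assuming that the position filters are α-β filters with a unit time step per frame. The frame counts (five hits, three missed frames) and the confidence thresholds (0.5 and 0.1) follow the text; the filter gains, the box margin, the moving-mean window, and all class and function names are assumptions for illustration only.

```python
def iou(a, b):
    """Intersection over union of two boxes (x0, y0, x1, y1)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


class AlphaBeta:
    """Alpha-beta filter for one scalar; gains are assumed values."""

    def __init__(self, x0, alpha=0.85, beta=0.005):
        self.x, self.v = x0, 0.0
        self.alpha, self.beta = alpha, beta

    def predict(self):
        self.x += self.v

    def update(self, measurement):
        residual = measurement - self.x
        self.x += self.alpha * residual
        self.v += self.beta * residual


class Track:
    def __init__(self, box, conf, window=5):
        self.filters = [AlphaBeta(c) for c in box]
        self.confs, self.window = [conf], window
        self.hits, self.missed = 1, 0

    @property
    def box(self):
        return tuple(f.x for f in self.filters)

    def step(self, detection=None):
        """Advance one frame; detection is (box, confidence) or None."""
        for f in self.filters:
            f.predict()
        if detection is None:  # occluded: keep the prediction only
            self.missed += 1
            return
        box, conf = detection
        for f, c in zip(self.filters, box):
            f.update(c)
        self.confs = (self.confs + [conf])[-self.window:]
        self.hits, self.missed = self.hits + 1, 0

    def is_output(self, min_hits=5, min_conf=0.5):
        mean_conf = sum(self.confs) / len(self.confs)
        return self.hits >= min_hits and mean_conf > min_conf

    def is_dead(self, max_missed=3):
        return self.missed > max_missed


def match(tracks, detections, margin=2.0, min_conf=0.1):
    """Greedy matching by highest IoU; returns {track idx: detection idx}."""
    pairs = []
    for ti, track in enumerate(tracks):
        for di, (box, conf) in enumerate(detections):
            if conf <= min_conf:
                continue  # weak detections are not tracked
            grown = (box[0] - margin, box[1] - margin,
                     box[2] + margin, box[3] + margin)
            score = iou(track.box, grown)
            if score > 0.0:
                pairs.append((score, ti, di))
    assigned, used = {}, set()
    for score, ti, di in sorted(pairs, reverse=True):
        if ti not in assigned and di not in used:
            assigned[ti] = di
            used.add(di)
    return assigned
```

The greedy assignment is a simplification; it takes the pairs in order of decreasing overlap, which is sufficient when only a few light artifacts are present per frame.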
The scope of the experiments described in this section is: (a) to optimize the baseline bounding box annotation quality and, therefore, the detector performance presented by Saralajew et al. Saralajew2021 , (b) to evaluate the distance estimation methods, (c) to quantify the time benefit of the proposed system in terms of an early detection of oncoming vehicles with respect to both human performance and an in-production computer vision system for vehicle detection at night, and (d) to demonstrate the utility of the provident vehicle detection information by integrating the proposed detection system into a test car and realizing a glare-free high beam functionality.
In the following section, we describe the datasets and the test car that is used across the experiments. After that, each section describes an experiment mentioned above.
5.1 Datasets and test car
For the evaluation of the object detector performance, the detection times, and the run-times, the PVDN dataset Saralajew2021 was used. This dataset contains 59 746 grayscale images with a resolution of 1280 × 960 pixels in which all light instances of oncoming vehicles, both direct (e. g., headlamps) and indirect (e. g., light reflections on guardrails), are annotated via keypoints. As the authors argue, the keypoint annotations allow for an objective annotation by placing the keypoint on the intensity maximum of each light instance. Also, bounding boxes can be generated automatically from these keypoints, which is useful because most state-of-the-art object detectors currently rely on bounding box annotations. Since those bounding boxes are inferred automatically, it may happen that one bounding box covers both direct and indirect instances at the same time. This is why the task of detecting bounding boxes on the dataset is currently framed only as a binary classification problem, namely whether the bounding box covers a relevant light artifact caused by an oncoming vehicle (either direct or indirect) or not.
The images are taken in coherent sequences such that the temporal relations within sequences are preserved. Each scene is recorded at 18 Hz either with a short exposure (day cycle, darker images) or a long exposure (night cycle, brighter images). For the experiments in this work, we only use the day cycle data, as the shorter exposure results in a stronger contrast between the background and light instances. Within the PVDN dataset, each illumination cycle is split into a train, a validation, and a test dataset in order to enable the development, evaluation, and testing of algorithms. Most importantly, the sequences of the dataset contain tags that mark the timestamps where (a) the oncoming vehicle was first annotated by its light reflections, (b) the driver recognized the oncoming vehicle based on its light reflections, (c) the vehicle was first directly visible, and (d) the in-production vision system first detected the oncoming vehicle. Those tags were collected during the annotation process and the test group study that was performed at the recording time of the dataset.
Distance evaluation data:
Since the PVDN dataset does not contain depth data, an additional small dataset for the evaluation of the distance estimation methods was recorded. The dataset consists of 24 scenes with in total 438 images (181 direct and 257 indirect light instances). Each scene contains five consecutive image frames in order to later allow for time series analyses. The light instances (both direct and indirect ones) were annotated manually with bounding boxes. The ground truth depth data was captured using a Hesai LiDAR sensor and the same camera system that was used to record the PVDN dataset. The single ground truth depth value for each light instance was calculated using the median of all available depth measurements within a respective bounding box.
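The ground truth computation described above can be sketched as follows. This is a minimal illustration, not the recording pipeline itself: the function name `median_depth_in_box`, the representation of the depth map as a nested list with `None` for pixels without a LiDAR return, and the toy values are all assumptions made for this example.

```python
from statistics import median

def median_depth_in_box(depth_map, box):
    """Median of all available depth values inside a bounding box.

    depth_map: list of rows, each a list of depths in metres, with None
               where no depth measurement is available (assumed layout).
    box: (x_min, y_min, x_max, y_max) in pixels, max values exclusive.
    """
    x_min, y_min, x_max, y_max = box
    values = [
        d
        for row in depth_map[y_min:y_max]
        for d in row[x_min:x_max]
        if d is not None
    ]
    # Without any measurement inside the box, no ground truth can be given.
    return median(values) if values else None

# Toy depth map (6 rows, 8 columns) with three sparse LiDAR returns.
depth = [[None] * 8 for _ in range(6)]
depth[2][3] = 60.0
depth[3][4] = 62.0
depth[3][5] = 90.0

gt = median_depth_in_box(depth, (2, 1, 7, 5))
print(gt)
```

Using the median rather than the mean keeps a single far-away return (here 90 m, for instance a measurement on the background behind the light instance) from shifting the ground truth value.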
A test car was used as the platform for deploying the pipeline in a real use case. Note that the same test car was used for recording the PVDN and distance evaluation datasets. Consequently, the same image input specification as for the PVDN dataset holds. Furthermore, the test car has matrix beam headlights and a glare-free high beam assist boke2015mercedes ; Knoechelmann2019 , which is used in the experiment for visual demonstrations and for deploying a provident glare-free high beam. Each matrix beam headlight consists of 84 LEDs; each LED illuminates an almost distinct solid angle (within a headlight) and can be dimmed independently of all the other LEDs. Therefore, if an oncoming vehicle is detected and the glare-free high beam assist is activated, the individual LEDs that illuminate the location of the vehicle can be turned off such that the overall headlight system stays in “high beam mode” without blinding the oncoming vehicle (this vehicle moves in a black corridor).
To perform the experiments an additional computing platform was used consisting of
two Intel Xeon CPUs with a base clock frequency of 3.2 GHz and eight cores per CPU, and
one NVIDIA Tesla V100 GPU with 16 GB RAM.
The implementations on this platform were done using Python and C++ in the Robot Operating System ros . To underline the transferability of the algorithms to a production ECU with hardware acceleration, all algorithms were executed on the CPU except for the shallow NN (which was executed on the GPU).
5.2 Object detector
|Parameter||Description||Search space||Step size||Final value|
| ||Scaling parameter in dynamic thresholding.|| ||0.05||0.4|
| ||Window size in dynamic thresholding.|| ||1||19|
| ||Threshold that the mean absolute deviation of a bounding box has to exceed in order to be proposed.|| ||0.01||0.01|
| ||Maximal distance that is allowed between blobs in order to be considered in the same bounding box.|| ||1||4|

|Generated bounding boxes annotations Saralajew2021||1.00||0.69||0.81||0.42||0.42 ± 0.24||1.00 ± 0.00|
|Optimized generated bounding boxes annotations||1.00||0.87||0.93||0.70||0.70 ± 0.30||1.00 ± 0.00|
The object detector described in Section 4.1 was trained on the PVDN dataset. The performance presented in the original dataset paper of Saralajew et al. Saralajew2021 serves as a baseline. There, the bounding box annotations were created automatically from the original keypoint annotations by using the rule-based region proposal algorithm explained in Section 4.1. Originally, the parameters of the region proposal algorithm were selected by a random search. Here, we performed a hyperparameter search for the region proposal algorithm using the tree-structured Parzen estimator approach bergstra2011 . This approach belongs to the family of sequential model-based optimization approaches and is a common algorithm for hyperparameter optimization. The goal was to minimize an objective function of the parameter vector that combines the bounding box quality Saralajew2021 and the F-score Saralajew2021 . This objective function encourages a good balance between a high detection performance and high-quality generated bounding boxes. (Footnote 5: The quality is the product of two measures: one that represents how often a ground truth keypoint is covered by an automatically inferred true positive bounding box (best possible value is 1), and one that represents how many ground truth keypoints are in an automatically inferred true positive bounding box (best possible value is 1).) The specific search space configuration can be found in Table 2.
We optimized the hyperparameters on the official training set, selected the best set based on the performance on the validation set, and report the results on the test set. With the newly optimized bounding box annotations, we trained the classifier, again using the training dataset for model training, the validation dataset for model selection, and the test dataset for reporting the performance results. We trained for 300 epochs with an initial learning rate of 0.001, a batch size of 64, a weight decay of 0.01, and the binary cross-entropy loss. We used the Adam optimizer kingma2014 and augmented the images with horizontal flips, rotations, crops, and gamma corrections during training. The confidence threshold for a valid classification of a light artifact was set to 0.5. The whole training pipeline is available for public use and reproduction (Footnote 6: https://github.com/larsOhne/pvdn). The performance measures are based on the metrics proposed by Saralajew et al. Saralajew2021 .
Table 2 shows the results of the hyperparameter optimization. Using these parameters to generate the bounding box annotations, the optimized light artifact detector achieved the results reported in Table 3. The optimized region proposal algorithm increases the bounding box quality by 0.28 (from 0.42 to 0.70) and, therefore, clearly improves the automatically inferred bounding box annotations compared to the baseline. Training the object detector on this optimized ground truth also improves the detection performance, simply because more of the light artifacts are captured by the region proposal algorithm. In summary, the optimized and trained light artifact detector sets a new benchmark on the PVDN dataset.
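As a quick plausibility check of the reported numbers, the following sketch recomputes the quality gain and the F-score from the tabulated values. It assumes that the first three numeric columns of the results table are precision, recall, and F-score, that the fourth column is the bounding box quality, and that the F-score is the usual harmonic mean of precision and recall; the helper names are made up for this example.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (assumed F-score definition)."""
    return 2 * precision * recall / (precision + recall)

def improvement(before, after):
    """Return the absolute and the relative change from `before` to `after`."""
    absolute = after - before
    return absolute, absolute / before

# Quality values of the baseline and the optimized annotations.
abs_gain, rel_gain = improvement(0.42, 0.70)
print(f"absolute gain: {abs_gain:.2f}, relative gain: {rel_gain:.0%}")

# F-scores recomputed from the tabulated (rounded) precision and recall.
f_baseline = f_score(1.00, 0.69)
f_optimized = f_score(1.00, 0.87)
print(f"{f_baseline:.3f} {f_optimized:.3f}")
```

Under these assumptions, the reported increase of 28 corresponds to the absolute difference of the two quality values in percentage points; relative to the baseline, the gain is considerably larger. The recomputed F-scores agree with the tabulated ones up to the rounding of the inputs.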
5.3 Distance estimation
We evaluated the proposed methods of Section 4.2 on the dataset described in Section 5.1. First, the general performance of each method was evaluated on the available images. For that, each light object marked by a bounding box was reduced to a single pixel by taking the center of the bounding box. Performance results are shown in 5. It becomes clear that, in terms of the median relative error for both direct and indirect light instances, approach GP, which only models the area ahead as a plane, outperforms approaches PSD-3D, PSD-3D+, and PSD-2D, which try to estimate the road geometry using the PSD. A negative error means that the estimated distance is less than the ground truth distance. An in-depth analysis of the PSD shows that the positioning of the vehicle on the road described by the PSD is often too inaccurate to give the precise representation of the road geometry ahead that the PSD approaches need in order to work. Especially in curves, an accurate positioning of the vehicle on the road segment is mandatory, since even a slight deviation can cause a huge discrepancy between the actual road geometry ahead and the one described by the PSD at a specific time step.
When looking at the performance of the ground plane method, a performance deficit between direct and indirect light instances becomes clear. There are several possible reasons for this. First, the direct light instances are on average located further away from the ego-vehicle than the indirect ones: the direct instances in the underlying data have an average distance of 83 m, whereas indirect instances are on average 63 m away. Therefore, inaccuracies influence the relative error more for the indirect instances. Second, indirect light instances often span a large area (e. g., on the street), where the acquisition of a single ground truth distance value is difficult, as the beginning of the annotated area has a different distance value than the end. This can lead to partly inaccurate ground truth values. Third, all of the mentioned methods strongly depend on the quality of the intrinsic and extrinsic camera calibration. Thus, unknown inaccuracies in the calibration can also affect the result. Finally, the assumption that the environment ahead is a plane may often be inaccurate. If the assumption were true, the expected result for indirect light instances on the road would be nearly perfect compared to the ground truth, whereas all light instances located above the road (e. g., headlights or light reflections on guardrails) would yield too large a distance, as the projected ray would intersect the plane behind the actual light instance. However, the results show that the estimates for the direct instances are nearly perfect (only a slight overestimation of approximately 12%), whereas the distance estimation for indirect instances falls too short. This indicates that the data often contains scenes where the plane assumption does not hold.
In a next step, the distance of each pixel within a bounding box was calculated with the goal of stabilizing the distance estimation using heuristics over the whole set of estimated distances within the bounding box. As the methods using the PSD already did not show satisfying results in the first experiment, the next evaluations are only done for the ground plane method. To retrieve the final distance estimation from all distance values within a bounding box, five simple approaches were compared with each other:
only considering the maximum distance value;
only considering the minimum distance value;
only considering the distance value of the lowest pixel in the bounding box (as it is closest to the estimated plane);
taking the mean over all distance values in the bounding box;
taking the median over all distance values in the bounding box.
The results are shown in 6. Interestingly, the five approaches show hardly any improvement. Only the approach of taking the maximum distance value within a bounding box improves the distance estimation for indirect light instances, which makes sense considering that the original estimation for indirect light instances was often too short.
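The five heuristics listed above can be sketched in a few lines. The function name `aggregate_distances`, the nested-list layout of the per-pixel distances (rows ordered from top to bottom), and the choice of the centre column of the bottom row as "the lowest pixel" are assumptions made for this illustration; the text does not specify these details.

```python
from statistics import mean, median

def aggregate_distances(dist):
    """Reduce per-pixel distance estimates of one bounding box to one value.

    dist: list of rows (top to bottom), each a list of distances in metres.
    Returns one candidate estimate per heuristic.
    """
    flat = [d for row in dist for d in row]
    bottom_row = dist[-1]
    return {
        "max": max(flat),
        "min": min(flat),
        # Assumed choice: centre pixel of the bottom row of the box.
        "lowest_pixel": bottom_row[len(bottom_row) // 2],
        "mean": mean(flat),
        "median": median(flat),
    }

# Toy 3x3 box: under a plane model, rows further up map to larger distances.
box = [
    [80.0, 82.0, 84.0],
    [70.0, 72.0, 74.0],
    [60.0, 62.0, 64.0],
]
result = aggregate_distances(box)
print(result)
```

In this toy box the maximum is the only heuristic that moves the estimate clearly upwards, which is the direction needed when the single-pixel estimate tends to fall short.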
Since the annotation format was chosen so that the correspondence of light instances across multiple frames can be determined, a final experiment attempted to stabilize the distance estimation of the ground plane method by considering a series of consecutive distance estimations for the same instance. The idea was that possible outliers can be filtered this way. For that, the two approaches of taking the median or the mean of a series of distance estimations were compared. One series consists of five consecutive images. The results can be seen in 7 and do not show a significant improvement or stabilization of the distance estimations. The relative errors show an offset by roughly the same positive amount, which is reasonable since the predictions from previous time steps, where the instances were still further away, increase the final estimated distance. Note that this approach requires tracking the detected objects across multiple images.
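The positive offset can be reproduced with a small sketch of the temporal aggregation. The helper `smooth` and the example series (an instance approaching by 2 m per frame) are hypothetical and only illustrate the effect described in the text.

```python
from statistics import mean, median

def smooth(estimates, window=5):
    """Aggregate the last `window` estimates of one tracked instance.

    Returns a list of (mean, median) pairs, one per time step.
    """
    out = []
    for i in range(len(estimates)):
        recent = estimates[max(0, i - window + 1): i + 1]
        out.append((mean(recent), median(recent)))
    return out

# Hypothetical series: the instance comes closer by 2 m in every frame.
series = [100.0, 98.0, 96.0, 94.0, 92.0]
last_mean, last_median = smooth(series)[-1]
offset = last_median - series[-1]
print(last_mean, last_median, offset)
```

For a steadily approaching instance, both the mean and the median of the window lie above the most recent estimate, so the smoothed value is shifted towards larger distances.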
To summarize, the high positioning inaccuracy of the ego-vehicle on the road described by the PSD results in a highly inaccurate distance estimation of light artifacts. However, the core idea itself is promising, as it models the road environment ahead in its actual shape. Yet, the current positioning inaccuracies make the data not usable for this use case. The alternative approach of modeling the world ahead as a simple plane and finding the intersection with the projected ray however shows satisfying results. Another advantage is that this approach does not require any sensor data except for the camera input and its calibration and also comes at a very low computational cost, since calculating the intersection of a line with a plane requires only a few floating point operations. Still, when an object is not located directly on the ground (e. g., light reflections on guardrails) the method becomes inaccurate, too. For future improvements, approaches should be analyzed that try to estimate the surface of the environment ahead in order to account for curvatures of the road surface and thus return a better approximation than the simple plane. However, for the system presented in this paper the accuracy of this method is considered to be sufficient and therefore the ground plane approach is used as the distance estimation module in our system. It has to be noted that the estimated three-dimensional location of the detected light artifact is used as the expected position where the vehicle will appear, which is considered to give a sufficient estimate for the underlying use case.
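To illustrate why the ground plane approach is computationally cheap, the following sketch intersects the viewing ray of a pixel with a flat ground plane. It is a simplified illustration under stated assumptions, not the calibration-aware implementation of the paper: a pinhole camera without distortion, camera axes x to the right, y downwards, and z forwards, an optical axis parallel to the ground, and made-up intrinsics and mounting height.

```python
def ground_plane_point(u, v, fx, fy, cx, cy, cam_height):
    """Intersect the viewing ray of pixel (u, v) with the plane y = cam_height.

    Camera frame (assumed): x right, y down, z forward, origin at the camera.
    Returns the 3D point (x, y, z) in metres, or None if the ray does not
    hit the ground (pixel on or above the horizon).
    """
    # Ray direction from the pinhole model, scaled so that d_z = 1.
    d = ((u - cx) / fx, (v - cy) / fy, 1.0)
    if d[1] <= 0.0:
        return None
    t = cam_height / d[1]  # ray parameter at which y equals cam_height
    return (t * d[0], t * d[1], t * d[2])

# Hypothetical intrinsics and a camera mounted 1.2 m above the road.
p = ground_plane_point(u=640, v=500, fx=1000.0, fy=1000.0,
                       cx=640.0, cy=480.0, cam_height=1.2)
print(p)  # forward distance p[2] is approximately 60 m
```

The computation consists of a handful of multiplications and divisions per pixel. It also shows the limitation discussed above: a light instance above the road is mapped to the point where its ray hits the plane, which lies behind the instance, and pixels on or above the horizon yield no intersection at all.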
5.4 Time benefit
The goal of this experiment is to evaluate the time benefit of the proposed system in terms of an early detection of oncoming vehicles with respect to human performance and an in-production computer vision system for vehicle detection at night. For this purpose, evaluations are performed on the test and validation dataset (to increase the database) of the PVDN dataset (see 5.1). The sequences from the dataset are processed with the proposed system. The first detection times based on a single image and after the tracker of the proposed system were compared with those of the human performance and the in-production computer vision system. These comparisons allow determining the time benefit for each sequence. Based on all sequences, the average time benefit can be specified.
Figure 8 shows the results in the form of a box plot. The evaluations were performed on 39 sequences (test and validation dataset) of the PVDN dataset. In 18 sequences, the in-production computer vision system did not detect a vehicle, so fewer measurement samples are available for it. Within each sequence, time zero was defined by the first indirect sight annotation, so that the plots show how long it took until the vehicle was detected. The time at which each system detected an oncoming vehicle was determined by the first bounding box that included a keypoint from the sequence.
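The detection time criterion can be sketched as follows. The data layout (one pair of bounding boxes and keypoints per image, starting at time zero) and the function name `first_detection_time` are assumptions for this illustration; only the frame rate of 18 Hz and the criterion itself are taken from the text.

```python
def first_detection_time(frames, fps=18.0):
    """Time in seconds until a bounding box first contains a keypoint.

    frames: list of (boxes, keypoints) per image, where index 0 is the
            image of the first indirect sight annotation (time zero).
    boxes: list of (x_min, y_min, x_max, y_max); keypoints: list of (x, y).
    Returns None if no box ever contains a keypoint.
    """
    for idx, (boxes, keypoints) in enumerate(frames):
        for x_min, y_min, x_max, y_max in boxes:
            if any(x_min <= x <= x_max and y_min <= y <= y_max
                   for x, y in keypoints):
                return idx / fps
    return None

# Toy sequence: no box, a box that misses the keypoint, then a hit.
frames = [
    ([], [(100, 200)]),
    ([(300, 300, 320, 320)], [(102, 201)]),
    ([(95, 195, 110, 210)], [(104, 202)]),
]
t = first_detection_time(frames)
print(t)
```

A sequence in which no box ever covers a keypoint yields `None`, which corresponds to the sequences without a detection that reduce the number of measurement samples.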
It can be seen that the proposed system based on the tracker detects oncoming vehicles on average 1.6 s faster than the in-production computer vision system and is on average as fast as a human. The first detection based on a single frame is on average 2.1 s before the detection of the in-production computer vision system and 0.5 s before the detection of a human. The delay between the single frame detection and the tracker is caused by the plausibilization phase of the tracker: an object has to be detected for at least five frames before it is sent as output. Five frames correspond to approximately 277 ms, and this is the minimal delay inherently caused by the plausibilization (compare with 2). Therefore, it is not surprising that the time difference between the single frame detection and the tracker is on average 500 ms. However, overall, the results clearly show the time benefit that can be achieved by such an unconventional sensing system.
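The effect of the plausibilization phase on the latency can be sketched as follows. The sketch assumes that the five detections have to be consecutive, which the text does not state explicitly, and the function name `confirmation_frame` is made up for this example.

```python
FPS = 18.0      # camera frame rate stated in the text
MIN_HITS = 5    # detections required before a track is sent as output

def confirmation_frame(detected, min_hits=MIN_HITS):
    """Index of the frame in which a track is confirmed, or None.

    detected: list of booleans, one per frame, True if the object was
              detected in that frame. Assumption: the hits must be
              consecutive; a miss resets the counter.
    """
    run = 0
    for idx, hit in enumerate(detected):
        run = run + 1 if hit else 0
        if run >= min_hits:
            return idx
    return None

min_delay_s = MIN_HITS / FPS
clean = confirmation_frame([True] * 5)
interrupted = confirmation_frame(
    [True, True, False, True, True, True, True, True])
print(f"{min_delay_s:.3f} s", clean, interrupted)
```

Five frames at 18 Hz give the minimal delay of roughly 0.28 s. Under the stated assumption, a single missed detection restarts the count, which is one way an average delay above this minimum can arise.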
5.5 Provident glare-free high beam
To demonstrate the usefulness of the provident detection information for ADAS functionalities, we integrated the proposed detection system into the test car and used the provident detection information to control the matrix beam headlights. As an example, we implemented a provident glare-free high beam functionality. The results of this experiment provide useful information about the applicability in real use cases:
it shows that the entire workflow of the detection system can run in real-time;
it nicely visualizes the detection results in the real-world;
it shows that glare-free high beam functions can be implemented without blinding oncoming vehicles due to latencies in the computer vision system (see 2).
The glare-free high beam functionality is well suited to visualize the detection results, as the matrix headlights can be considered as projectors that visualize the detected objects by turning off the respective pixels. (Footnote 7: Additionally, if correctly implemented, the function is not safety-critical and can be tested on public roads.) Therefore, any serious inaccuracy in the system becomes immediately visible, and thus the integration serves as a proof-of-concept of whether the object localization uncertainties are in such a range that they still provide useful information for later systems.
As mentioned above, the proposed detection pipeline was integrated into the test car for this experiment. To have a unique detection output and to ensure that oncoming vehicles were not blinded by the matrix beam headlights, only the detected object (after tracking) with the highest intensity value was sent to the glare-free high beam module. Moreover, this concept ensured that, once the oncoming vehicle was in direct sight, only this object was masked out by the matrix beam headlights. During this experiment, we performed test drives on public rural roads at night.
Before the system was tested on public roads, the real-time capability of the pipeline was analyzed by measuring the computation times on the test car hardware (see Section 5.1). Since the camera captures images at 18 Hz, the requirement was that the entire pipeline processes one image in less than 1/18 s (approximately 56 ms). Figure 9 presents the run-time analysis in the form of a box plot. The measurements were performed on the 7 030 images (test and validation dataset) of the PVDN dataset. The run-time of the complete pipeline (from the input of the blob detector to the output of the object tracker) for one image is on average 0.045 s, so that the real-time requirement is fulfilled on average. However, it must be noted that the run-time of the entire pipeline is not constant. For example, the run-time is strongly affected by the number of bounding boxes created by the blob detector. Moreover, with an increasing number of components (after the dynamic thresholding step) during blob detection, the run-time of the bounding box creation (inside the blob detector) increases as well. Overall, the real-time requirement is fulfilled for 75% of the analyzed images (see Figure 9) and, therefore, the system can be deployed in the test car. (Footnote 8: If the computation was still running when the camera captured a new image, that camera image was dropped in the deployed algorithm.)
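The real-time criterion boils down to comparing each per-image run-time with the frame period of the camera. The following sketch shows this check; the function name `realtime_stats` and the four run-time values are hypothetical and are not measurements from the paper.

```python
def realtime_stats(runtimes_s, fps=18.0):
    """Frame budget, mean run-time, and fraction of images within budget."""
    budget = 1.0 / fps  # time available per image at the camera frame rate
    within = sum(1 for t in runtimes_s if t <= budget)
    mean_runtime = sum(runtimes_s) / len(runtimes_s)
    return budget, mean_runtime, within / len(runtimes_s)

# Hypothetical per-image run-times in seconds.
runtimes = [0.030, 0.040, 0.050, 0.070]
budget, avg, frac = realtime_stats(runtimes)
print(f"budget {budget * 1000:.1f} ms, mean {avg * 1000:.1f} ms, "
      f"within budget {frac:.0%}")
```

At 18 Hz the budget is about 55.6 ms per image, so a mean run-time of 0.045 s stays below it, while individual images can still exceed the budget. In the deployed system, such overruns lead to dropped camera images.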
Figure 10 shows an example scene from the test drives on rural roads at night. This scene illustrates very well the accuracy and time benefit of the proposed system in the real world in terms of provident vehicle detection compared to the in-production computer vision system. In Figure 10(a), the first light artifact of the oncoming vehicle can be seen. After 0.5 s, the first detection is made by the proposed system based on a single frame, as shown in Figure 10(b). Then, after 2.6 s, the tracker has validated the object and outputs it correctly to the matrix beam headlights, see Figure 10(c). Based on the result of the tracker, the end of the road is dimmed proactively (a black gap can be seen in the white box) to avoid blinding the oncoming vehicle. The in-production system detects the oncoming vehicle after 3.8 s, when it is fully visible and after a significant latency, see Figure 10(d). Therefore, the in-production system would have caused a short blinding of the oncoming vehicle. In this scene, the proposed system achieves a total time benefit of 1.2 s.
Besides providing a useful visualization interface for the detection results in the real world, this experiment also shows that, despite the localization uncertainties of the proposed detector and distance estimator, the information can be used to realize a provident glare-free high beam assist. However, there are two points that need to be clarified:
Why is the dimmed gap (see white box in Figure 10(c)) larger than the detected light reflection, and why does it tend to the left?
Why does the tracker take “so long” after the first detection to detect the oncoming vehicle?
First, the size of the dimmed area might be due to the low resolution of the matrix beam headlights (see Section 5.1), and the tendency to the left might be caused by inaccuracies of the distance estimation. Second, the tracker detects the vehicle late (in this case 2.1 s after the first detection based on a single frame) because the vehicle disappears behind the trees several times, which makes it difficult for the tracker to continuously track the vehicle over multiple frames.
6 Conclusion and outlook
Extending the work of Oldenziel et al. Oldenziel2020 and Saralajew et al. Saralajew2021 , we presented a complete pipeline designed for automotive use cases that is capable of providently detecting vehicles at night. Our system consists of a set of algorithms solving the tasks of detection, three-dimensional localization, and tracking of both direct light objects (e. g., headlights) and indirect light objects (e. g., light reflections on guardrails) caused by oncoming vehicles. With this, we showed that the proposed system is able to detect oncoming vehicles on average 1.6 s earlier than an in-production vehicle detection system at night, which can be considered a significant amount of time for automotive use cases. Also, by deploying the pipeline in a test car for the use case of providently dimming the matrix beam headlights for oncoming vehicles, we demonstrated the applicability of our system not only under lab conditions but also in real-world scenarios and in real-time. Currently, for further use cases (e. g., trajectory planning, automatic braking), the system might still lack the necessary precision in the three-dimensional localization of the light reflections. Therefore, future work will focus on evaluating new distance estimation methods by extending the currently applied ground plane assumption to a more precise representation of the environmental geometry ahead. Also, as we showed that such a system for providently detecting oncoming vehicles at night can be deployed for a real automotive use case in rural scenarios, the next natural step is to investigate its transition to urban situations (e. g., junctions in cities). With this, we believe we can further improve the performance of ADAS and increase its customer acceptance by bringing computer vision algorithms closer to human behavior.
- (1) M. Sivak, “The information that drivers use: Is it indeed 90% visual?” Perception, vol. 25, no. 9, pp. 1081–1089, 1996, PMID: 8983048. [Online]. Available: https://doi.org/10.1068/p251081
- (2) S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems 28, 2015, pp. 91–99.
- (3) J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
- (4) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in CVPR09, 2009.
- (5) K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
- (6) K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, and D. Song, “Robust physical-world attacks on deep learning visual classification,” in Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition – CVPR 2018. Salt Lake City, UT, USA: IEEE, 2018, pp. 1625–1634.
- (7) S. Saralajew, L. Ohnemus, L. Ewecker, E. Asan, S. Isele, and S. Roos, “A dataset for provident vehicle detection at night,” arXiv preprint arXiv:2105.13236, 2021.
- (8) E. Oldenziel, L. Ohnemus, and S. Saralajew, “Provident detection of vehicles at night,” in 2020 IEEE Intelligent Vehicles Symposium (IV), 2020, pp. 472–479.
- (9) B. Böke, M. Maier, J. Moisel, and F. Herold, “The mercedes-benz headlamp of the future: Higher resolution with greater intelligence for enhanced safety,” in Proc. Int. Symposium on Automotive Lighting, 2015, pp. 49–58.
- (10) Z. Sun, R. Miller, G. Bebis, and D. DiMeo, “A real-time precrash vehicle detection system,” in Sixth IEEE Workshop on Applications of Computer Vision, 2002. (WACV 2002). Proceedings., 2002, pp. 171–176.
- (11) Z. Sun, G. Bebis, and R. Miller, “On-road vehicle detection: A review,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 5, pp. 694–711, 2006.
- (12) S. S. Teoh and T. Bräunl, “Symmetry-based monocular vehicle detection system,” Machine Vision and Applications, vol. 23, pp. 831–842, 2012.
- (13) Q. Fan, L. Brown, and J. Smith, “A closer look at Faster R-CNN for vehicle detection,” in 2016 IEEE Intelligent Vehicles Symposium (IV), 2016, pp. 124–129.
- (14) A. López, J. Hilgenstock, A. Busse, R. Baldrich, F. Lumbreras, and J. Serrat, “Nighttime vehicle detection for intelligent headlight control,” in Advanced Concepts for Intelligent Vision Systems, ser. Lecture notes in Computer Science, J. Blanc-Talon, S. Bourennane, W. Philips, D. Popescu, and P. Scheunders, Eds. Springer, 2008, pp. 113–124.
- (15) P. Alcantarilla, L. Bergasa, P. Jiménez, I. Parra, D. Fernández, M.A. Sotelo, and S.S. Mayoral, “Automatic lightbeam controller for driver assistance,” Machine Vision and Applications, pp. 1–17, 2011.
- (16) S. Eum and H. G. Jung, “Enhancing light blob detection for intelligent headlight control using lane detection,” IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 2, pp. 1003–1011, 2013.
- (17) D. Jurić and S. Lončarić, “A method for on-road night-time vehicle headlight detection and tracking,” in 2014 International Conference on Connected Vehicles and Expo (ICCVE), 2014, pp. 655–660.
- (18) P. Sevekar and S. B. Dhonde, “Nighttime vehicle detection for intelligent headlight control: A review,” in Proceedings of the 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), 2016, pp. 188–190.
- (19) F. M. Naser, “Detection of dynamic obstacles out of the line of sight for autonomous vehicles to increase safety based on shadows,” Master’s thesis, MIT, Boston, 2019.
- (20) D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” in Advances in neural information processing systems, 2014, pp. 2366–2374.
- (21) I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,” in 2016 Fourth International Conference on 3D Vision (3DV), 2016, pp. 239–248.
- (22) T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, “Unsupervised learning of depth and ego-motion from video,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
- (23) R. Ranftl, V. Vineet, Q. Chen, and V. Koltun, “Dense monocular depth estimation in complex dynamic scenes,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4058–4066.
- (24) A. Gordon, H. Li, R. Jonschkowski, and A. Angelova, “Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 8977–8986.
- (25) Y. Furukawa, A. Sethi, J. Ponce, and D. Kriegman, “Structure and motion from images of smooth textureless objects,” in European Conference on Computer Vision. Springer, 2004, pp. 287–298.
- (26) P. Saponaro, S. Sorensen, S. Rhein, A. R. Mahoney, and C. Kambhamettu, “Reconstruction of textureless regions using structure from motion and image-based interpolation,” in 2014 IEEE International Conference on Image Processing (ICIP). IEEE, 2014, pp. 1847–1851.
- (27) M. Gallardo, T. Collins, and A. Bartoli, “Dense non-rigid structure-from-motion and shading with unknown albedos,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 3884–3892.
- (28) S. Song and M. Chandraker, “Joint SFM and detection cues for monocular 3D localization in road scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
- (29) T. R. Singh, S. Roy, O. I. Singh, T. Sinam, and K. M. Singh, “A new local adaptive thresholding technique in binarization,” IJCSI International Journal of Computer Science Issues, vol. 8, no. 6, pp. 271–277, 2011.
- (30) H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, “Deep ordinal regression network for monocular depth estimation,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
- (31) J. Li, R. Klein, and A. Yao, “A two-streamed network for estimating fine-scaled depth maps from single rgb images,” in The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
- (32) D. Wofk, F. Ma, T.-J. Yang, S. Karaman, and V. Sze, “Fastdepth: Fast monocular depth estimation on embedded systems,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 6101–6108.
- (33) V. Nekrasov, T. Dharmasiri, A. Spek, T. Drummond, C. Shen, and I. Reid, “Real-time joint semantic segmentation and depth estimation using asymmetric annotations,” in 2019 International Conference on Robotics and Automation (ICRA), 2019, pp. 7101–7107.
- (34) V. Patil, W. Van Gansbeke, D. Dai, and L. Van Gool, “Don’t forget the past: Recurrent depth estimation from monocular video,” arXiv preprint arXiv:2001.02613, 2020.
- (35) C. Godard, O. M. Aodha, and G. J. Brostow, “Unsupervised monocular depth estimation with left-right consistency,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
- (36) A. Saxena, M. Sun, and A. Y. Ng, “Make3d: Learning 3d scene structure from a single still image,” IEEE transactions on pattern analysis and machine intelligence, vol. 31, no. 5, pp. 824–840, 2008.
- (37) M. Z. Zia, M. Stark, and K. Schindler, “Towards scene understanding with detailed 3d object representations,” International Journal of Computer Vision, vol. 112, no. 2, pp. 188–203, 2015.
- (38) M. Knöchelmann, M. Held, G. Kloppenburg, and R. Lachmayer, “High-resolution headlamps – technology analysis and system design,” Advanced Optical Technologies, vol. 8, no. 1, pp. 33–46, 2019.
- (39) Stanford Artificial Intelligence Laboratory et al., “Robot operating system.” [Online]. Available: https://www.ros.org
- (40) J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, “Algorithms for hyper-parameter optimization,” in 25th annual conference on neural information processing systems (NIPS 2011), vol. 24. Neural Information Processing Systems Foundation, 2011.
- (41) D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.