CUDA Code for our paper "Accelerating Translational Image Registration for HDR Images on GPU", accessible at https://arxiv.org/abs/2007.06483
High Dynamic Range (HDR) images are generated using multiple exposures of a scene. When a hand-held camera is used to capture a static scene, these images need to be aligned by globally shifting each image in both dimensions. For a fast and robust alignment, the shift amount is commonly calculated using Median Threshold Bitmaps (MTB) and creating an image pyramid. In this study, we optimize these computations using a parallel processing approach utilizing GPU. Experimental evaluation shows that the proposed implementation achieves a speed-up of up to 6.24 times over the baseline multi-threaded CPU implementation on the alignment of one image pair. The source code is available at https://github.com/kadircenk/WardMTBCudaREAD FULL TEXT VIEW PDF
CUDA Code for our paper "Accelerating Translational Image Registration for HDR Images on GPU", accessible at https://arxiv.org/abs/2007.06483
In several applications, a High Dynamic Range (HDR) image is obtained by fusing multiple images of the same scene taken with different exposures by a standard camera  . By doing so, the resulting HDR image includes scene detail from well-exposed regions of each image. Using a handheld camera to capture a static scene with multiple exposure levels results in misaligned images due to shaking of the camera. Using a tripod to fix the position of the camera reduces the amount of misalignment; however, this brings a significant restriction and may not be feasible at all shooting conditions. In addition, small shifts may still occur due handling of the camera to adjust its parameters or to shoot the photo. This results in a blurry HDR image when multiple misaligned images are merged (Figure 1).
The main steps of an HDR pipeline are: (i) acquisition of multiple exposure images of a scene, (ii) aligning them to each other, (iii) deriving the camera response function, and (iv) computing the final HDR image . The alignment phase includes computationally expensive 2D image manipulation operations. This study focuses on this alignment phase to decrease the time cost of the overall HDR pipeline.
Traditional image alignment algorithms are not suitable for the alignment of images of the same static scene with varying exposures. Hence, specialized algorithms were proposed for HDR imaging pipeline. A popular algorithm amongst these is the translational image registration method that uses Median Threshold Bitmaps (MTB) . Utilization of MTBs alleviates the effect of varying exposure times and brings the images to a common data format. This way, MTBs of the images can be shifted and error-tested to automatically calculate the globally optimal shift amount.
In this paper, we focus on speeding-up an MTB based alignment algorithm , and image manipulation routines by leveraging the highly parallel nature of GPUs. For this purpose, we optimize the following operations on the GPU: converting multiple exposures into gray-scale, downsampling, shifting, histogram calculations, creation of bitmaps, pixel-wise AND and XOR operations and erroneous pixel counting.
The remainder of this paper is organized as follows: In Section 2, we provide a summary of related work and their limitations. In Section 3, we describe the reference method and proposed optimizations on the GPU. In Section 4, we provide the experimental evaluation and discussion of the results. In Section 5, we provide the conclusion and future work ideas.
The method introduced in  for registration of handheld exposures of a static scene for HDR imaging has become a landmark work due to its robustness to high exposure differences. The fundamental idea in this work is making use of MTBs to eliminate the brightness difference caused by different exposure times. This idea has been found to be highly effective and many follow-up image registration techniques have been introduced to enable registration of images captured with different exposure times.
Ward  claimed that only 10% of the test scenes requires an additional rotational alignment along with the translational alignment. While this work did not implement rotational alignment; it suggests splitting the image into quadrants and then applying the algorithm on each quadrant separately to detect if rotational alignment is needed. This study was further extended by implementation of an additional rotational alignment step, along with the handling of small movements of in-scene objects . Moreover, graphical hardware was used to calculate XOR result of two MTBs by first converting them into textures, and then running a fragment code on them. Another method, extending for rotational alignment, integrates rotation into error tests. By rotating the image with steps of 0.5 degree, pivoting at the center , this approach finds the optimal rotation angle together with the translation offset.
Akyüz  uses correlation maps to find a suitable translational alignment offset, as correlation maps are not dependent on exposure time. Khan et al.  iteratively weight the pixels to detect their contribution to the final HDR image to remove blurry artifacts. Sen et al.  apply a joint optimization equation which solves the alignment and the HDR reconstruction problems simultaneously. Zimmer et al.  use an energy‐based optic flow approach which can deal with different exposure timings to perform the alignment with in-scene object movement.
Guthier et al.  propose parallelization of the histogram-based exposure registration using CUDA where parts of the MTB algorithm are run on the GPU. They register exposure pairs based on , but different to that, they extract row and column histograms from each MTB of the image pair and apply Normalized Cross Correlation on the corresponding histograms to calculate the shift offset in each dimension, without utilizing an image pyramid. Different from , the implementation proposed in this paper adheres to the original algorithm , and also run all image-related calculations on the GPU, eliminating costly GPU-CPU memory copy operations. In addition, as the code is run iteratively for each image pair, the copy operations bring higher overhead as a some images need to be copied twice (once to compare with the previous image and once to compare with the next image) when input image count is higher than two. The implementation proposed in this paper is designed to copy all the input images to the GPU memory once, avoiding multiple copies.
As both  and  are based on MTB, the proposed implementation can also be used to speed-up their translational alignment. In the reference implementation, CPU processes the 2D input image stack in a sequential manner or in the multi-threaded case, it works by assigning parts of the image to different CPU threads. On the other hand, the GPU implementation can assign each pixel to a different GPU thread to massively parallelize the algorithm. A further higher level parallelism can also be introduced by assigning each image to a different CUDA stream and run them in parallel.
In this section, we first describe the MTB based reference algorithm and then describe the proposed optimization of this method.
MTB based algorithm 
first converts the exposures into gray-scale and calculates the median pixel value of each exposure over its computed histogram. Then, a median thresholding is applied to binarize these images into MTBs. This step replaces the pixels having a value less than or equal to the median with a value of zero and the others with a value of one in MTBs. Shift tests are done by applying pixel-wise XOR operation on the MTBs of the image pair in comparison. Alignment error is then found by counting the non-zero values in the XOR result.
A brute-force approach to align two MTBs would go through each possible shift offset to find out the optimal one, which requires alignment tests for an pixel shift in both dimensions. A computationally more efficient alternative approach is using image pyramids. In this approach, a 6-level image pyramid is created by down-sampling the gray-scale images by halving at both dimensions and creating the MTBs at each level  (Figure 2). Then starting from the deepest level, only 1 pixel shift tests are applied both horizontally and vertically. Multiplying the total shift offset by two before starting with the next level until all levels are done, each bit of the global shift offset is found in only 54 shift tests.
In our approach (Figure 4), after the input images are read into the CPU memory, they are sent to the GPU memory and separate CUDA kernels are launched using one CUDA stream for each image to calculate the gray-scale images in parallel 
. Then, these gray-scale images are bound as CUDA textures and downsampled by launching linear interpolation downsampling kernels in all streams. This is done iteratively to create 5 levels of MTB pyramid. Using CUDA texture memory grants a speed-up by decreasing the neighbouring pixel access time on the GPU memory.
|Implementation||Configuration||Input Image Count|
In the next step, pixel histogram calculation kernels are launched to calculate 256-bin histograms on all gray-scale images in the pyramid. The kernels use parallel reduction technique to speed-up the calculation . Then, another kernel is launched to calculate the median value of each histogram in the pyramid. Finally, MTBs of all images are created using the median value by another kernel.
The MTBs of two images to be compared are XORed to get the error bitmap of the current translational shift offset candidate. However, using the XOR result directly is prone to errors, as hard-cutting the pixel’s MTB value right at the median is prone to create noise at the image area with pixels that are close to the median value. For this purpose, Exclusion Bitmaps (EB)  are used. In EBs, each bit is set to one if that pixel’s value is 4 of the median and to zero otherwise. Later, the EBs of both images in comparison are ANDed onto the XOR result to remove the noisy bits that would otherwise negatively affect the robustness of the algorithm.
Each white pixel in the denoised XOR image accounts for an error and they need to be counted to find out the final error. This procedure is repeated by shifting the MTB of the second image to test the remaining shift offset candidates, at each level of the pyramid. By propagating the offset up to the top-most level of the pyramid, global offset is calculated. Lastly, for each image pair in the input stack, the second image is shifted using this offset to match the first image.
AND, XOR, error-counting and gray-scale image shifting functions have been implemented as separate CUDA kernels. These kernels are overlapped whenever possible by placing them into separate CUDA streams. Once the global offset is calculated, RGB image shifting kernel is used to apply this shift offset to each channel of the second image, aligning it with the reference.
The MTB algorithm is iterated to pairwise align the input images if there are more than two images. For instance, an image set with 5 images requires 4 iterations of the algorithm. All the images are sent to the GPU memory at the beginning of the execution and corresponding MTB pyramids are calculated once. So, no CPU-to-GPU memory copies are needed for the further iterations.
In the reference algorithm, one bit is used to hold each pixel’s MTB value and bit-wise operations are used to speed-up the CPU implementation . The OpenCV implementation also uses this idea . However, converting each pixel of the bitmap image from one byte to one bit brings an overhead during the pre-processing phase of the algorithm. On the other hand, in GPU implementations, for such low arithmetic intensity operations, memory transfers are likely to be the bottleneck. So, in our approach, we used uint8_t (unsigned char) type, as GPU memory throughput is better when appropriate variable types are used in CUDA kernels. Due to this fact, we used byte-wise rather than bit-wise operations and eliminated the overhead that would be caused by byte-to-bit conversion. So, in our implementation, instead of bitmaps, we hold byte-maps on the GPU, where we set the pixel value to 255 instead of 1, and 0 otherwise. The source code of the implementation is available at https://github.com/kadircenk/WardMTBCuda
|CPU 1||CPU 2||CPU 3|
We have evaluated the proposed implementation with different hardware configurations and two diverse image sets. OpenCV includes a multi-threaded and publicly available CPU implementation  of the original MTB algorithm  and we use it as the baseline method to compare against.
All hardware configurations are running Ubuntu 18.04 LTS, and they are detailed as follows. CPU 1: 4-core 2.4GHz Intel® i7-4700HQ. CPU 2: 6-core 3.20GHz Intel® i7-8700. CPU 3: 8-core 3.6GHz Intel® Core™ i7-9700K. GPU 1: NVIDIA GeForce GTX 850M. GPU 2: NVIDIA GeForce GTX 1080Ti. GPU 3: NVIDIA GeForce RTX 2080Ti.
The test image set consist of 3.7 megapixel (MP) exposures (Figure 3). For all the results, we executed the codes 10 times and calculated the average run-time. Note that the run-time of the algorithm depends only on the image resolution and it is independent of the image content. While measuring the run-time, we excluded the reading/writing times of the images from/to the disk for both implementations. However, as our approach requires the images to be available in the GPU memory, CPU to GPU data transfer cost is included in the total run-time to compare on a fair-ground.
The algorithm needs at least a single pair of images and as more than 7 exposures in one input set is not common, we experiment with at most 7 images. We provide the run-times of both implementations by varying the number of input images from 2 to 7 in Table I. As both implementations iterate the MTB algorithm for each image pair, number of input images linearly increases the time cost.
Speed-up gains of our implementation over the CPU code on the aforementioned hardware are provided in Table II. The GPU implementation on a low-spec GPU board (GPU 1) cannot grant a speed-up over high-spec CPU models (CPU 2 and CPU 3). On the other hand when a high-spec GPU (GPU 3) is used it can provide a speed-up of 3.13 times against a high-spec CPU (CPU 3) and 3.76 and 6.24 times speed-up against CPU 2 and 1 respectively.
Grosch  aligns five images each having 1024x768 resolution in 4 seconds; together with rotational alignment and object movement removal. Our study, on the other hand, focuses mainly on speeding-up the translational alignment phase. As our runtimes are at most in milliseconds for images with resolution 2560x1440, our work can also be integrated to speed-up this algorithm.
In this study, we have proposed a pipeline which makes use of GPU to translationally register the input stack of handheld exposures using the MTB alignment algorithm . Experimental evaluations show that this multi-threaded and multi-stream implementation has a speed-up of up to 6.24 times compared to the baseline OpenCV multi-threaded MTB implementation on CPU. This speed-up facilitates integration of this algorithm into a real-time HDR video sequence pipeline on a lower cost GPU based system.
The reference MTB alignment algorithm requires the scene to be static  and any in-scene object motion would have a negative effect on the performance by causing erroneous MTB values during alignment testing. As a future work, an object motion detection algorithm could be integrated into our work to enable the alignment of such scenes. In addition, the method could be extended with a basic rotational alignment step, which tests for the optimal rotation angle step by step together with the translation offset.