Optimal Piecewise Linear Function Approximation for GPU-based Applications

Many computer vision and human-computer interaction applications developed in recent years need evaluating complex and continuous mathematical functions as an essential step toward proper operation. However, rigorous evaluation of this kind of functions often implies a very high computational cost, unacceptable in real-time applications. To alleviate this problem, functions are commonly approximated by simpler piecewise-polynomial representations. Following this idea, we propose a novel, efficient, and practical technique to evaluate complex and continuous functions using a nearly optimal design of two types of piecewise linear approximations in the case of a large budget of evaluation subintervals. To this end, we develop a thorough error analysis that yields asymptotically tight bounds to accurately quantify the approximation performance of both representations. It provides an improvement upon previous error estimates and allows the user to control the trade-off between the approximation error and the number of evaluation subintervals. To guarantee real-time operation, the method is suitable for, but not limited to, an efficient implementation in modern Graphics Processing Units (GPUs), where it outperforms previous alternative approaches by exploiting the fixed-function interpolation routines present in their texture units. The proposed technique is a perfect match for any application requiring the evaluation of continuous functions, we have measured in detail its quality and efficiency on several functions, and, in particular, the Gaussian function because it is extensively used in many areas of computer vision and cybernetics, and it is expensive to evaluate.


page 1

page 8


Adaptive Padé-Chebyshev Type Approximation to Piecewise Smooth Functions

The aim of this article is to study the role of piecewise implementation...

Implementing evaluation strategies for continuous real functions

We give a technical overview of our exact-real implementation of various...

On Representing (Anti)Symmetric Functions

Permutation-invariant, -equivariant, and -covariant functions and anti-s...

Representations and evaluation strategies for feasibly approximable functions

A famous result due to Ko and Friedman (1982) asserts that the problems ...

Adaptive Padé-Chebyshev Type Approximation of Piecewise Smooth Functions

A piecewise Padé-Chebyshev type (PiPCT) approximation method is proposed...

More on zeros and approximation of the Ising partition function

We consider the problem of computing ∑_x e^f(x), where f(x)=∑_ij a_ijξ_i...

Automatic Generation of Complete Polynomial Interpolation Hardware Design Space

Hardware implementations of complex functions regularly deploy piecewise...

I Introduction

The design of high-quality and real-time image processing algorithms is a key topic in today’s context, where humans demand more capabilities and control over their rapidly increasing number of advanced imaging devices such as cameras, smart phones, tablets, etc. [1]. Many efforts are being made toward satisfying those needs by exploiting dedicated hardware, such as Graphics Processing Units (GPUs) [2, 3, 4]. Some areas where GPUs are emerging to improve Human Computer Interaction (HCI) include cybernetics (for example, in facial [5] or object [6]

recognition, classification using Support Vector Machines 


and genetic algorithms for clustering 

[8]), and image processing (e.g., unsupervised image segmentation [3, 9], optical coherence tomography systems [4], efficient surface reconstruction from noisy data [10], remote sensing [11], real-time background subtraction [12], etc.). In such hardware-oriented designed algorithms, the computational efficiency of processing tasks is significantly improved by parallelizing the operations.

[width=0.996height=0.747]app-tracking.pdf (a) [width=0.996]app-interface.pdf (b)
Fig. 1: High level tasks such as (a) object/person-tracking or (b) video-based gestural interfaces often require a foreground segmentation method as a base building block.

Automated video analysis applications such as person tracking or video-based gestural interfaces (see Fig. 1) rely on lower-level building blocks like foreground segmentation [13, 14], where the design of efficient computational methods for dedicated hardware has a significant impact. Multimodal nonparametric segmentation strategies have drawn a lot of attention [15, 16] since they are able to provide high-quality results even in complex scenarios (dynamic background, illumination changes, etc.) [17]. However, their main drawback is their extremely high computational cost (requiring the evaluation of billions of multidimensional Gaussian kernels per second), which makes them difficult to integrate in the latest generation of image processing applications [18, 19]. To overcome this important drawback and achieve real-time performance [20], the use of parallel hardware such as GPUs helps but may not be enough by itself, depending on the required resolution, hence the need for algorithms capable of evaluating non-linear (e.g., Gaussian) functions at high speed within a required error tolerance. GPU vendors are aware of this recurrent computing problem and provide hardware implementations of common transcendental functions in Special Function Units (SFUs) [21]; indeed, we can find in the literature examples of successful use of these hardware facilities [22]. However, their ease of use comes at the price of non-customizable reduced numerical precision [23].

I-a Contribution

We propose a novel, fast, and practical method to evaluate any continuous mathematical function within a known interval. Our contributions include, based on error equalization, a nearly optimal design of two types of piecewise linear approximations (linear interpolant and orthogonal projection) in the norm under the constraint of a large budget of evaluation subintervals . Moreover, we provide asymptotically tight bounds for the approximation errors of both piecewise linear representations, improving upon existing ones. Specifically, in addition to the convergence rate typical of these approximations, we quantify their interpolation constants: for the linear interpolant and a further improvement factor in case of the orthogonal projection. The obtained error relations are parameterized by the number of segments used to represent the complex (nonlinear) function, hence our approach allows the user to estimate the errors given or, conversely, estimate the required to achieve a target approximation error.

We also propose an efficient implementation of the technique in a modern GPU by exploiting the fixed-function interpolation routines present in its texture units to accelerate the computation while leaving the rest of the GPU (and, of course, the CPU) free to perform other tasks; this technique is even faster than using SFUs [21].

Although the initial motivation of the developed method was to improve the efficiency of nonparametric foreground segmentation strategies, it must be noted that it can be also used in many other scientific fields such as computer vision [24, 25] or audio signal processing [26, 27], where the evaluation of continuous mathematical functions constitutes a significant computational burden.

I-B Organization

Section II summarizes the basic facts about piecewise linear approximation of real-valued univariate functions and reviews related work in this topic; Section III derives a suboptimal partition of the domain of the approximating function in order to minimize the distance to the original function (proofs of results are given in the Appendixes); Section IV analyzes the algorithmic complexity of the proposed approximation strategies and Section V gives details about their implementation on modern GPUs. Section VI presents the experimental results of the proposed approximation on several functions (Gaussian, Lorentzian and Bessel’s), both in terms of quality and computational times, and its use is demonstrated in an image processing application. Finally, Section VII concludes the paper.

Ii Piecewise Linear Modelization

Ii-a Related Work

In many applications, rigorous evaluation of complex mathematical functions is not practical because it takes too much computational power. Consequently, this evaluation is carried out approximating them by simpler functions such as (piecewise-)polynomial ones. Piecewise linearization has been used as an attractive simplified representation of various complex nonlinear systems [28]. The resulting models fit into well established tools for linear systems and reduce the complexity of finding the inverse of nonlinear functions [29, 30]

. They can also be used to obtain approximate solutions in complex nonlinear systems, for example, in Mixed-Integer Linear Programming (MILP) models 

[31]. Some efforts have been devoted as well to the search for canonical representations in one and multiple dimensions [32, 33] with different goals such as black box system identification, approximation or model reduction.

Previous attempts to address the problem considered here (optimal function piecewise linearization) include [34, 35, 29, 36], the latter two in the context of nonlinear dynamical systems. In [34] an iterative multi-stage procedure based on dynamical programming is given to provide a solution to the problem on sequences of progressively finer 2-D grids. In [29]

the piecewise linear approximation is obtained by using the evolutionary computation approach such as genetic algorithm and evolution strategies; the resulting model is obtained by minimization of a sampled version of the mean squared error and it may not be continuous. In 


the problem is addressed using a hybrid approach based on curve fitting, clustering and genetic algorithms. Here we address the problem from a less heuristic point of view, using differential calculus to derive a more principled approach 

[37]. Our method solely relies on standard numerical integration techniques, which takes few seconds to compute, as opposed to recursive partitioning techniques such as [35], which take significantly longer.

Ii-B Basic Results in Piecewise-Linear Interpolation and Least-Squares Approximation

In this section we summarize the basic theory behind piecewise linear functions [38] and two such approximations to real-valued functions: interpolation and projection (Sections II-B1 and II-B2, respectively), pictured in Fig. 2 along with their absolute approximation error with respect to .


Fig. 2: Top: fifth-degree polynomial and two continuous piecewise linear (CPWL) approximations, the orthogonal projection and the linear interpolant ; bottom: corresponding absolute approximation errors (magnified by a factor).


Fig. 3: Hat functions constitute a basis of . Since all the basis functions are zero outside , the basis functions associated with boundary nodes are only half hats.

In essence, a piecewise function over an interval is a partition of into a set of subintervals , where , and a set of functions , one for each subinterval . In particular we are interested in continuous piecewise linear (CPWL) functions, which means that all the are linear and . CPWL functions of a given partition are elements of a vector space : the addition of such functions and/or multiplication by a scalar yields another CPWL function defined over the same subintervals. A useful basis for the vector space is formed by the set of hat functions or nodal basis functions , pictured in Fig. 3 and defined in general by the formula


The basis functions and associated to the boundary nodes and are only half hats.

These basis functions are convenient since they can represent any function in by just requiring the values of at its nodal points, , in the form


Ii-B1 Linear Interpolation

The piecewise linear interpolant of a continuous function over the interval can be defined in terms of the basis just introduced:


While this CPWL approximation is trivial to construct, and may be suitable for some uses, it is by no means the best possible one. Crucially, for any function that is convex in

. Depending on the application in which this approximation is used, this property could skew the results. However, as we will see in Section 

III, the linear interpolant is useful to analyze other possible approximations. It is also at the heart of the trapezoidal rule for numerical integration.

Ii-B2 Orthogonal Projection onto Vector Space

Let the usual inner product between two square-integrable () functions in the interval be given by


Then, the vector space can be endowed with the above inner product, yielding an inner product space. As usual, let be the norm induced by the inner product, and let be the distance between two functions . The orthogonal projection of the function onto is the function such that


Since , it can be expressed in the nodal basis by


where the coefficients solve the linear system of equations . The matrix has entries , and vector has entries given by . The Gramian matrix is tridiagonal and strictly diagonally dominant. Therefore, the system has exactly one solution , which can be obtained efficiently using the Thomas algorithm (a particular case of Gaussian elimination) [39].

is the element in that is closest to in the sense given by the aforementioned distance , as we recall next. For any ,

where we used property (5) that the vector . Next, applying the Cauchy-Schwarz inequality,

and so , i.e.,


with equality if . This makes most suitable as an approximation of under the norm.

Iii Finding the Optimal Partition

As just established, for a given vector space of CPWL functions , is the function in whose distance to is minimal. However, the approximation error does not take the same value for every possible ; therefore, we would like to find the optimal partition (or a reasonable approximation thereof) corresponding to the space in which the approximation error is minimum,

This is a difficult optimization problem: in order to properly specify in (6) and measure its distance to , we need to solve the linear system of equations , whose coefficients depend on the partition itself, which makes the problem symbolically intractable. Numerical algorithms could be used but it is still a challenging problem.

Let us examine the analogous problem for the interpolant function defined in Section II-B1. Again, we would like to find the optimal partition corresponding to the space in which the approximation error is minimum,

Albeit not as difficult as the previous problem, because

is more straightforward to define, this is still a challenging non-linear optimization problem. Fortunately, as it is shown next, a good approximation can be easily found under some asymptotic analysis.

In the rest of this section we investigate in detail the error incurred when approximating a function by the interpolant and the orthogonal projection defined in Section II (see Fig. 2). Then we derive an approximation to the optimal partition (Section III-B) that serves equally well for and because, as we will show (Section III-A), their approximation errors are roughly proportional under the assumption of a sufficiently large number of intervals.

Iii-a Error in a Single Interval

Iii-A1 Linear Interpolant

First, let us consider the error incurred when approximating a function , twice continuously differentiable, by its linear interpolant in an interval (Fig. 4).


The error between a given function and its linear interpolant


with , in the interval , of length , is bounded:


where .


See Appendix A.

Formula (9) has the following intuitive explanation: the error measures the deviation of from being straight (linear), and this is directly related to the convexity/concavity of the function, thus the presence of the term to bound the amount of bending.

The other interesting part of (9) is the dependence with respect to the interval size or the local density of knots. Other works in the literature use the weaker bound that does not require carrying out integration [39]: , for some constant .

Iii-A2 Best Linear Approximation


Fig. 4: Function in the subinterval and two linear approximations. On the left, the linear interpolant given by (8); on the right, a general linear segment given by (10). is a signed distance with respect to .

Let us now characterize the error of the orthogonal projection . We do so in two steps: first we compute the minimum error of a line segment in the interval , and then we use an asymptotic analysis to approximate the error of the orthogonal projection.

Stemming from , we can write any (linear) segment in as


where and

are extra degrees of freedom (pictured in Fig. 

4) with respect the interpolant that allow the line segment to better approximate the function in . By computing the optimal values of and we obtain the following result.


The minimum squared error between a given function and a line segment (10) in an interval , of length , adopts the expression


for some in .


See Appendix B.


If is sufficiently small so that is approximately constant within it, say , then



Substitute for in (11).

The segments in may not strictly satisfy (11) because must also be continuous across segments. However, as the size of the intervals become smaller (due to a finer partition  of the interval , i.e., the number of segments in increases) we may approximate in by the best (independently-optimized) segment and, consequently, use Corollary III-A2 to yield the following result.


The squared error between a given function and its orthogonal projection in a small interval , of length , is


where is the approximately constant value that takes within the interval .


See Appendix C.

In the same asymptotic situation, the linear interpolant (9), (see (29)) gives a bigger squared error by a factor of six,


Iii-B Approximation to the Optimal Partition

Now we give a procedure to compute a suboptimal partition of the target interval and then derive error estimates for both and on such a partition and the uniform one.

Iii-B1 Procedure to Obtain a Suboptimal Partition

Let us consider a partition of with subintervals . A suboptimal partition for a given is one in which every subinterval has approximately equal contribution to the total approximation error [39, 40], which implies that regions of with higher convexity are approximated using more segments than regions with lower convexity. Let us assume is large enough so that is approximately constant in each subinterval and therefore the bound (9) is tight. Consequently,


for some constant , and the lengths of the subintervals (local knot spacing [40]) should be chosen , i.e., smaller intervals as increases. Hence, the local knot distribution or density is


so that more knots of the partition are placed in the regions with larger magnitude of the second derivative.

Then, the proposed approximation to the optimal partition is as follows: , , and take knots given by


where the monotonically increasing function is


In this procedure, pictured in Fig. 5, the range of is divided into contiguous sub-ranges of equal length, and the values of are given by the abscissas corresponding to the endpoints of the sub-ranges.

Iii-B2 Error Estimates

With this partition, we can further estimate approximation error bounds in the entire interval based on those of the subintervals.


The approximation error of the linear interpolant in the partition given by (17) in  is



The total squared error for any partition is the sum of squared errors over all subintervals , and by (9),


which, under the condition of error equalization (15) for the proposed partition , becomes


To compute , let us sum over all intervals and approximate the result using the Riemann integral:


whose right hand side is independent of . Substituting (22) in (21) gives the desired approximate bound (19) for the error in the interval .


Fig. 5: Graphical summary of the proposed knot placement technique. Top: local knot density (16) obtained from input function (example in Fig. 2); middle: cumulative knot distribution function given by (18) and knots given by the abscissas corresponding to a uniform partition of the range of , as expressed by (17); bottom: approximation of by CPWL interpolant with (32 knots). In this suboptimal partition, knots are distributed according to the amount of local convexity/concavity of given by the in the middle plot. Hence, fewer knots are placed around the zeros of the , which correspond to the less steep regions of .

Since the approximation error of in each interval is roughly proportional to that of , as shown in (13) and (14), the partition (17) is also a very good approximation to the optimal partition for as the number of subintervals increases. This is briefly stated next.


The approximation error of the orthogonal projection in the partition given by (17) in  is


Both CPWL approximations ( and ) converge to the true function at a rate of at least ((19) and (23)).

We use a similar procedure to derive an estimate error bound for the uniform partition that can be compared to that of the optimized one.


The approximation error of the linear interpolant in the uniform partition of the interval  is



For , we may substitute in (20) and approximate the result using the Riemann integral,

Thus, we can estimate how much we can expect to benefit from optimizing a partition by simply dividing (19) by (24). Since (24) also shows a convergence, the profit will not depend on (assuming is large enough).

Iv Computational Analysis

1     Determine the subinterval that contains :

2     Find the fractional distance to the left endpoint of :

3     Return the interpolated value

Algorithm 1 (CPWL function evaluation). Given as input the description of a continuous piecewise linear function (CPWL) by means of the the nodal values at the knots of the partition , and an abscissa , return .

The implementation of the approximations previously discussed is straightforward (Algorithm 1) and we only need to distinguish two different cases depending on whether the partition of has all subintervals of equal width or not. Although their values are different, and are qualitatively the same from the computational viewpoint.

Let us discuss the general case that is non-uniform. Line 1 in Algorithm 1 implies searching in the ordered array for the right index . Since the other steps in the algorithm do not depend on the size of the input vectors, Line 1 is the dominant step and makes its run-time complexity . In the particular case that is uniform, no search is needed to determine the index: and the fractional part . Thus, the run-time complexity of the uniform case is . There is also no need to store the full array but only its endpoints and , roughly halving the memory needed for a non-uniform partition of the same number of intervals.

Consequently, approximations based on a uniform partition are expected to perform better in usual CPUs or GPUs, computationally-wise, than those based on non-uniform partitions. However, optimizing the partition might lead, depending on the specific objective function, to a reduction in the memory requirements for a desired approximation error. If memory constraints are relevant or hardware-aided search routines become available, optimized partitions could become practical.

V Implementation in a GPU

The proposed algorithm is simple to implement either in a CPU or in a GPU. However, there are some implementation details that help further optimize the performance in a GPU.

Modern GPUs implement a Single Instruction, Multiple Data (SIMD) execution model in which multiple threads execute exactly the same instructions (each) over different data. If two threads scheduled together in the same execution group or warp follow different branches in a conditional statement, both branches are not executed in parallel but sequentially, thereby halving throughput. In the case of being non-uniform, the (binary) search performed to find the index could lead to divergent paths for different threads evaluating for different values of , thus reducing throughput.

Since the purpose of the proposed algorithm is to save computation at run-time, it is reasonable to assume that at least , and possibly and too, have been determined at compile time. Then, we can particularize the search code in compile time, unrolling the loop needed to implement the binary search and, most importantly, replacing conditional statements with a fixed number of invocations of the ternary operator, which is implemented in the GPU with a single instruction and no code divergence [41].

In general, accesses to memory in a GPU need to be arranged so that neighbouring threads access neighbouring addresses in memory. This allows the host interface of the GPU to coalesce operations from several threads into fewer transactions bigger in size. In the described algorithm, if each thread is evaluating for different values of , it is not guaranteed that the memory access pattern complies with this restriction. Although some devices provide caching and isolate the programmer from this issue, not all do.

However, there is a case in a regular computer graphics pipeline in which a similar problem arises. Texture data need to be accessed many times in a short period and it is generally impossible to ensure that neighbouring pixels in screen need to read neighbouring texels. Consequently, most GPUs do cache texture data and even have dedicated cache layers for it. We therefore store the arrays of samples in texture memory rather than in global memory to benefit from texture caches.

The use of the texture units to access our samples provides another important benefit. In the usual computer graphics pipeline, GPUs need to constantly filter texture data. In order to efficiently perform that task, texture units are equipped with hardware fixed-function interpolation routines that include linear interpolation [21]. Therefore, we can use them to speed up line 3 in Algorithm 1; results in the next section confirm that the texture filtering unit is indeed faster than performing interpolation “manually”.

Vi Experimental Results

In this section we apply the proposed linearization algorithm to several functions of interest in cybernetics, computer vision and other scientific areas. The section starts with the analysis of the Gaussian function and demonstrates the proposed technique on an image processing application. The section then analyzes two other sample nonlinear functions of interest: the Lorentzian function and the Bessel function of the first kind.

Vi-a Gaussian Function

The Gaussian function is widely used in many applications in every field of science, but it is of particular interest to us in the context of foreground segmentation in video sequences using spatio-temporal nonparametric background models [20]. To estimate these models, the Gaussian function needs to be evaluated approximately 1300 times per pixel and input image, or around 2 billion times per second at modest video quality (CIF, pixels at 15 fps). Therefore it is crucial to lower the computing time of the Gaussian function. In the performed experiments it was enough to approximate the function in the interval to achieve the required quality.


Fig. 6: distance for different approximations to the Gaussian function over the interval .

Fig. 6 shows distances between the Gaussian function and the approximations described in previous sections. It reports actual distances as well as the approximate and tight upper bounds to the distances, i.e., Results III-B2 to III-B2. It can be observed that the distances between and using the uniform and optimized partitions agree well with Results III-B2 to III-B2. All curves in Fig. 6 have similar slope to that of (19), i.e., their convergence rate is , and the ratio between the distances corresponding to and is approximately the value that stems from Result III-A2 and (14).

Number of points () 32 64 128 256 512
CPU exact function using expf from <cmath> 13710
uniform partition 1750 - 1780
optimized partition 8900 11230 13830 16910 20120
GPU exact function using expf from <cmath> 33.3
fast function using __expf from the SFU 14.2
uniform partition, manual interpolation 19.7 - 19.8
uniform partition, hardware interpolation 7.8 - 7.9
optimized partition, manual interpolation 142.0 161.9 181.8 203.1 224.1
optimized partition, hardware interpolation 122.7 142.9 163.2 188.3 211.1
TABLE I: Mean per-evaluation execution times (in picoseconds) for the Gaussian function.

Table I shows mean processing times per evaluation both in the CPU (sequentially, one core only) and in the GPU. All execution times have been measured in a computer equipped with an Intel Core i7-2600K processor, 16 GiB RAM and an NVIDIA GTX 580 GPU. We have exercised the utmost care to ensure that all different strategies achieve the same utilization of the GPU so that measurements are fair.


Fig. 7: Mean per-evaluation execution times of all the proposed variants on a GPU (in picoseconds). The graph clearly shows the and dependencies on the number of intervals for uniform and non-uniform partitions, respectively.

We compare the proposed algorithm against the exact Gaussian implemented using the exponential function of the C standard library (CPU or GPU) and against a fast Gaussian implemented using the hardware-accelerated low-precision exponential function from the SFU of the GPU [23]. In either hardware (CPU or GPU), there are no separate measurements for and because they only differ in value but not in their implementation. As was expected from the computational analysis, execution times for the uniform partition are constant, whereas they are heavily dependent on  for the optimized partition (see Fig. 7).

In the CPU case, the optimized partition approach is faster than the C standard library implementation if , but the speed-up gain is smaller than that of the uniform partition approach, which is 8 times faster than the C standard library (regardless of ).

In the GPU case, we have also measured the proposed algorithm both with and without the texture filtering unit (last four rows of Table I) to prove that the former option is indeed faster than coding the interpolation explicitly, thanks to its dedicated circuitry. The proposed strategy, using a uniform partition (7.9 ps), solidly outperforms both the exact (33.3 ps) and fast SFU (14.2 ps) Gaussians, by approximate speed-up factors of and , respectively, even though the latter is hardware-accelerated. However, we must stress that this is a synthetic benchmark; in a real application such as the one described in the next section, there may exist limitations in memory bandwidth to access operands. Moreover, many resources such as shared memory, cache and hardware registers are being used for other tasks too. This can (and does) affect the execution time of each of the different implementation strategies.

Foreground Segmentation (Sample Application)

[width=0.996height=0.747]mod-original.pdf (a) [width=0.996height=0.747]mod-median.pdf (c) [width=0.996height=0.747]mod-mog.pdf (e) [width=0.996height=0.747]mod-groundt.pdf (b) [width=0.996height=0.747]mod-averageg.pdf (d) [width=0.996height=0.747]mod-nonpar.pdf (f)
Fig. 8: Results from several different foreground segmentation methods for a complex dynamic scene. (a) Original scene; (b) Ground truth; (c) Temporal Median Filter; (d) Running Gaussian Average; (e) Mixture of Gaussians; (f) Nonparametric modeling.

To obtain high-quality detections in complex situations (e.g., dynamic background, illumination changes, etc.), multimodal moving object detection strategies are commonly used because they are able to model the multiple pixel states in such situations. Among multimodal strategies, nonparametric methods have shown to provide the best quality detections [42] because, in contrast to other algorithms, they do not assume that the pixel values conform to a particular distribution, but obtain instead probabilistic models from sets of recent samples.

Fig. 8 illustrates this by comparing the results from different strategies against a difficult scene featuring a dynamic background. The results of two unimodal background modeling methods are shown in the second row of this figure: the Temporal Median Filter (e.g., [43]) and the Running Gaussian Average (e.g., [44]); whereas the segmentations resulting from the application of two multimodal modeling approaches are depicted in the third row: an improved version of the Mixture of Gaussians method [45] and the spatio-temporal nonparametric-based background modeling in [20].

To improve the quality of the results in sequences containing non-static background regions, and/or captured with portable cameras, most recent nonparametric methods use spatio-temporal reference data [46]. However, these strategies involve huge memory and computational costs, since they need to evaluate millions of Gaussian kernels per image. Therefore, lowering the cost of evaluating the Gaussian function has a direct impact in the performance of the algorithm. To demonstrate this, we have implemented our proposed technique in a nonparametric background modeling method [20]. Table II shows mean per-pixel (in order to make measurements resolution-independent) processing times for the three best performing options in Section VI. Processing times are not proportional to those in Table I because the model needs to do many other things aside from evaluating Gaussians, and access to reference data is bandwidth-limited. However, our proposed technique still outperforms any of the alternatives: the exact Gaussian by 27% and the fast SFU Gaussian by 12%, while the final segmentation results of the method are the same in all three cases, as depicted in Fig. 9.

Uniform partition, hardware interpolation 286
Fast Gaussian using __expf from the SFU 324
Exact Gaussian using expf from <cmath> 391
TABLE II: Mean per-pixel processing times (in nanoseconds) for the nonparametric background modeling technique.
[width=0.996height=0.747]res-track-orig.pdf (a) [width=0.996height=0.747]res-iface-orig.pdf (c) [width=0.996height=0.747]res-track-det.pdf (b) [width=0.996height=0.747]res-iface-det.pdf (d)
Fig. 9: Sample detections using nonparametric modeling with CPWL-approximated Gaussian kernels. Subfigures (a) and (b) show the original scene and results of the segmentation, respectively, for the tracking application; subfigures (c) and (d) show the original scene and results of the segmentation, respectively, for the video-based interface application.

Vi-B Lorentzian Function

The performance of the approximation technique has also been tested on other functions. For example, the Lorentzian function with peak at and width is


As a probability distribution, (


) is known as the Cauchy distribution. It is an important function in Physics since it solves the differential equation describing forced resonance. In such case,

is the resonance frequency and depends on the damping of the oscillator (and is inversely proportional to the factor, a measure of the sharpness of the resonance).

[clip,width=84mm] bounds_Cauchy.pdf

Fig. 10: distance for different CPWL approximations to the standard Cauchy distribution in the interval .

Fig. 10 reports the distances between  (25) and the CPWL approximations described in previous sections, in the interval . The measured errors agree well with the predicted approximate error bounds, showing the expected trend.

We have measured the mean per-evaluation execution times of the exact function both in the CPU (576 ps) and in the GPU, both using regular and reduced-precision accelerated division (19.5 ps and 9.2 ps, respectively). The evaluation times of our proposed strategy coincide with those of the Gaussian function (Table I) because the processing time of the CPWL approximation does not depend on the function values.

As expected from the lower complexity of this function (compared to that of Sec. VI-A), which only involves elementary arithmetic operations, the advantage of our proposal vanishes in the CPU because the operations needed to manually perform interpolation, together with the two values that need to be fetched from memory, are actually more than what the direct evaluation requires. However, note that in the GPU our proposal still remains competitive due to these operations being carried out by the dedicated circuitry of the texture unit.

Vi-C Bessel Function

Approximating the Bessel function of the first kind is more challenging than approximating the Gaussian or Lorentzian (bell-shaped) functions because it combines both nearly flat and oscillatory parts (see Fig. 11). Fig. 12 presents the approximation errors for using CPWL functions in the interval . For the measured errors agree well with the predicted approximate error bounds, whereas for the measured errors slightly differ from the predicted ones (specifically in the optimized partition) because in these cases the scarce number of linear segments does not properly represent the oscillations.

A sample optimized partition and the corresponding CPWL interpolant is also represented in Fig. 11. The knots of the partition are distributed according to the local knot density (see Fig. 11, Top), accumulating in the regions of high oscillations (excluding the places around the zeros of the ).

This function is more complex to evaluate in general because it does not have a closed form. However, our approximation algorithm works equally well on it. We have measured the mean per-evaluation execution times in the CPU (39 ns using the POSIX extensions of the GNU C Library and 130 ns—only double-precision provided—using the GNU Scientific Library) and in the GPU (78 ps using the CUDA [23] standard library). We have also measured the execution time of the first term in the asymptotic expansion [47, Eq. 9.57a], valid for , In the GPU the evaluation takes 42 ps using reduced-precision accelerated functions from the SFU for the cosine and multiplicative inverse of the square root. Despite this approximation having a non-customizable error (decreasing as increases) and using the fastest available implementation (SFU) of the operations involved, our strategy still outperforms it by a sizable margin. This clearly illustrates that the more complex the function to be evaluated is, the greater the advantage of our proposal.

[clip,width=84mm] knotdensity_BesselJ0.pdf

Fig. 11: Suboptimal partition for the Bessel function in the interval . Top: local knot density () corresponding to ; bottom: CPWL interpolant with ( knots) overlaid on function .

[clip,width=84mm] bounds_BesselJ0.pdf

Fig. 12: distance for different CPWL approximations to the Bessel function of the first kind in the interval .

Vii Conclusions

We have developed a fast method to numerically evaluate any continuous mathematical function in a given interval using simpler continuous piecewise linear (CPWL) functions and the built-in routines present in the texture units of modern GPUs. Our technique allows real-time implementation of demanding computer vision and cybernetics applications that use such mathematical functions.

For this purpose, we analyzed the CPWL approximations given by the linear interpolant and the orthogonal projection of a function. We carried out a detailed error analysis in the distance to seek a nearly optimal design of both approximations. In the practical asymptotic case of a large number of subintervals , we used error equalization to achieve a suboptimal design (partition ) and derived a tight bound on the approximation error for the linear interpolant, showing a convergence rate that was confirmed by experimental results. The orthogonal projection can only but improve upon the results of the linear interpolant, resulting in a gain factor of .

We discussed the computational complexity and the implementation in a GPU of the numerical evaluation of both CPWL approximations. Our experimental results show that our technique can outperform both the quality and the computational cost of previous similar approaches. In particular, the fastest strategy consists of using the texture units in the GPU to evaluate either of the CPWL approximations defined over a uniform partition. This is normally faster than performing exact function evaluations, even when using the reduced-precision accelerated implementation of the SFUs in a GPU, or evaluating the proposed linear approximations without the assistance of the texture units. In practice, the number of subintervals to be considered for representing the nonlinear function can be decided on its own or based on other considerations such as a target approximation error, speed or memory constraints.

Although mathematically sound, the strategies based on a suboptimal (non-uniform) partition are not practical to implement in current CPU/GPU architectures due to the high cost incurred to find the subinterval that contains the given point of evaluation. Nevertheless, this opens future research paths to explore practical implementations of such approaches using specialized hardware.

Appendix A Proof of Result Iii-A1 (Linear Interpolant Error)

Recall one of the theorems on interpolation errors [48]. Let be a function in , and let be a polynomial of degree or less that interpolates the function at distinct points . Then, for each there exists a point for which


In the interval , the linear interpolant is given by (8). Since interpolates the function at the endpoints of , we can apply theorem (26) (with ); hence, the approximation error solely depends on and , but not on or :


Let us integrate the square of (27) over the interval ,

Next, to simplify the previous integral, let us use the first mean value theorem for integration, which states that if is a continuous function and is an integrable function that does not change sign on the interval , then there exists a number such that


Since and for all , let us apply (28) to compute


for some . Finally, if , it is straightforward to derive the error bound (9) from the square root of (29).

Appendix B Proof of Result Iii-A2 (Line Segment Minimum Error)

The approximation error corresponding to the line segment (10) is


where, by analogy with the form  of we defined .

The proof proceeds by computing the optimal values of and that minimize the squared error over the interval :


The first term is given in (29). The second term is

and the third term is, applying (28) and the change of variables to evaluate the resulting integrals,

for some and in .

Substituting previous results in (31),

We may now find the line segment that minimizes the distance to by taking partial derivatives with respect to and , setting them to zero and solving the corresponding system of equations. Indeed, the previous error is quadratic in and , and attains its minimum at


The resulting minimum squared distance is (11).

Appendix C Proof of Result Iii-A2 (Asymptotic Analysis)

By the triangle inequality, the jump discontinuity at

between two adjacent independently-optimized segments is


where and are the offsets with respect to of the optimized segments (32) at the left and right of , respectively; and lie in the interval . Since we are dealing with functions twice continuously differentiable in a closed interval, the absolute value terms in (33) are bounded, according to the extreme value theorem; therefore, if and decrease (finer partition ), the discontinuity jumps at the knots of the partition also decrease. In the limit, as , , i.e., continuity is satisfied. Therefore the union of the independently-optimized segments , which is the unique piecewise linear function satisfying both continuity () and minimization of the error (7). Consequently, if is large we may approximate ; moreover, if is sufficiently small so that is approximately constant within it, , then we use Corollary III-A2 to get (13).


  • [1] G. Shapiro, “Consumer Electronics Association’s Five Technology Trends to Watch: Exploring New Tech That Will Impact Our Lives,” IEEE Consum. Electron. Mag., vol. 2, no. 1, pp. 32–35, 2013.
  • [2] N. Singhal, I. K. Park, and S. Cho, “Implementation and optimization of image processing algorithms on handheld GPU,” in IEEE Int. Conf. Image Process. (ICIP), 2010, pp. 4481–4484.
  • [3]

    V. Borges, M. Batista, and C. Barcelos, “A soft unsupervised two-phase image segmentation model based on global probability density functions,” in

    IEEE Int. Conf. Systems, Man and Cybernetics (SMC), Oct 2011, pp. 1687–1692.
  • [4] K. Kapinchev, F. Barnes, A. Bradu, and A. Podoleanu, “Approaches to General Purpose GPU Acceleration of Digital Signal Processing in Optical Coherence Tomography Systems,” in IEEE Int. Conf. Systems, Man and Cybernetics (SMC), Oct 2013, pp. 2576–2580.
  • [5] M. Song, D. Tao, Z. Liu, X. Li, and M. Zhou, “Image ratio features for facial expression recognition application,” IEEE Trans. Syst., Man, Cybern. B, vol. 40, no. 3, pp. 779–788, June 2010.
  • [6]

    J. Kim, E. Park, X. Cui, H. Kim, and W. Gruver, “A fast feature extraction in object recognition using parallel processing on CPU and GPU,” in

    IEEE Int. Conf. Systems, Man and Cybernetics (SMC), Oct 2009, pp. 3842–3847.
  • [7] S. Herrero-Lopez, “Accelerating SVMs by integrating GPUs into MapReduce clusters,” in IEEE Int. Conf. Systems, Man and Cybernetics (SMC), Oct 2011, pp. 1298–1305.
  • [8] P. Kromer, J. Platos, and V. Snasel, “Genetic algorithm for clustering accelerated by the CUDA platform,” in IEEE Int. Conf. Systems, Man and Cybernetics (SMC), Oct 2012, pp. 1005–1010.
  • [9] S. Balla-Arabe, X. Gao, and B. Wang, “GPU accelerated edge-region based level set evolution constrained by 2D gray-scale histogram,” IEEE Trans. Image Process., vol. 22, no. 7, pp. 2688–2698, 2013.
  • [10] A. Jalba and J. B. T. M. Roerdink, “Efficient Surface Reconstruction From Noisy Data Using Regularized Membrane Potentials,” IEEE Trans. Image Process., vol. 18, no. 5, pp. 1119–1134, 2009.
  • [11] S. Balla-Arabe, X. Gao, B. Wang, F. Yang, and V. Brost, “Multi-kernel implicit curve evolution for selected texture region segmentation in VHR satellite images,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 8, pp. 5183–5192, Aug 2014.
  • [12] L. Cheng, M. Gong, D. Schuurmans, and T. Caelli, “Real-Time Discriminative Background Subtraction,” IEEE Trans. Image Process., vol. 20, no. 5, pp. 1401–1414, 2011.
  • [13] P. Chiranjeevi and S. Sengupta, “Robust detection of moving objects in video sequences through rough set theory framework,” Image Vision Comput., vol. 30, no. 11, pp. 829 – 842, 2012.
  • [14] N. Liu, H. Wu, and L. Lin, “Hierarchical ensemble of background models for PTZ-based video surveillance,” IEEE Trans. Cybern., vol. 45, no. 1, pp. 89–102, Jan 2015.
  • [15] Y. Sheikh, O. Javed, and T. Kanade, “Background subtraction for freely moving cameras,” in IEEE Int. Conf. Computer Vision (ICCV).   IEEE, 2009, pp. 1219–1225.
  • [16] L. Sigal, M. Isard, H. Haussecker, and M. Black, “Loose-limbed people: Estimating 3D human pose and motion using non-parametric belief propagation,” Int. J. Comput. Vision, vol. 98, no. 1, pp. 15–48, 2012.
  • [17]

    C. Cuevas, R. Mohedano, and N. García, “Adaptable Bayesian classifier for spatiotemporal nonparametric moving object detection strategies.”

    Optics letters, vol. 37, no. 15, pp. 3159–3161, Aug. 2012.
  • [18] Y. Moshe, H. Hel-Or, and Y. Hel-Or, “Foreground detection using spatiotemporal projection kernels,” in

    IEEE Conf. Computer Vision and Pattern Recognition (CVPR)

    , 2012, pp. 3210–3217.
  • [19] C. Cuevas and N. García, “Efficient Moving Object Detection for Lightweight Applications on Smart Cameras,” IEEE Trans. Circuits Syst. Video Technol., vol. 23, no. 1, pp. 1–14, Jan. 2013.
  • [20] D. Berjón, C. Cuevas, F. Morán, and N. García, “GPU-based implementation of an optimized nonparametric background modeling for real-time moving object detection,” IEEE Trans. Consum. Electron., vol. 59, no. 2, pp. 361–369, May 2013.
  • [21] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, “NVIDIA Tesla: A unified graphics and computing architecture,” IEEE Micro, vol. 28, no. 2, pp. 39–55, 2008.
  • [22] M. Sylwestrzak, D. Szlag, M. Szkulmowski, I. Gorczyńska, D. Bukowska, M. Wojtkowski, and P. Targowski, “Real time 3D structural and Doppler OCT imaging on graphics processing units,” in Proc. SPIE, no. 85710Y.   Int. Soc. Optics and Photonics, Mar. 2013.
  • [23] NVIDIA Corp., “CUDA C Programming Guide,” Tech. Rep., 2012.
  • [24] J.