I Introduction
The design of highquality and realtime image processing algorithms is a key topic in today’s context, where humans demand more capabilities and control over their rapidly increasing number of advanced imaging devices such as cameras, smart phones, tablets, etc. [1]. Many efforts are being made toward satisfying those needs by exploiting dedicated hardware, such as Graphics Processing Units (GPUs) [2, 3, 4]. Some areas where GPUs are emerging to improve Human Computer Interaction (HCI) include cybernetics (for example, in facial [5] or object [6]
recognition, classification using Support Vector Machines
[7]and genetic algorithms for clustering
[8]), and image processing (e.g., unsupervised image segmentation [3, 9], optical coherence tomography systems [4], efficient surface reconstruction from noisy data [10], remote sensing [11], realtime background subtraction [12], etc.). In such hardwareoriented designed algorithms, the computational efficiency of processing tasks is significantly improved by parallelizing the operations.Automated video analysis applications such as person tracking or videobased gestural interfaces (see Fig. 1) rely on lowerlevel building blocks like foreground segmentation [13, 14], where the design of efficient computational methods for dedicated hardware has a significant impact. Multimodal nonparametric segmentation strategies have drawn a lot of attention [15, 16] since they are able to provide highquality results even in complex scenarios (dynamic background, illumination changes, etc.) [17]. However, their main drawback is their extremely high computational cost (requiring the evaluation of billions of multidimensional Gaussian kernels per second), which makes them difficult to integrate in the latest generation of image processing applications [18, 19]. To overcome this important drawback and achieve realtime performance [20], the use of parallel hardware such as GPUs helps but may not be enough by itself, depending on the required resolution, hence the need for algorithms capable of evaluating nonlinear (e.g., Gaussian) functions at high speed within a required error tolerance. GPU vendors are aware of this recurrent computing problem and provide hardware implementations of common transcendental functions in Special Function Units (SFUs) [21]; indeed, we can find in the literature examples of successful use of these hardware facilities [22]. However, their ease of use comes at the price of noncustomizable reduced numerical precision [23].
Ia Contribution
We propose a novel, fast, and practical method to evaluate any continuous mathematical function within a known interval. Our contributions include, based on error equalization, a nearly optimal design of two types of piecewise linear approximations (linear interpolant and orthogonal projection) in the norm under the constraint of a large budget of evaluation subintervals . Moreover, we provide asymptotically tight bounds for the approximation errors of both piecewise linear representations, improving upon existing ones. Specifically, in addition to the convergence rate typical of these approximations, we quantify their interpolation constants: for the linear interpolant and a further improvement factor in case of the orthogonal projection. The obtained error relations are parameterized by the number of segments used to represent the complex (nonlinear) function, hence our approach allows the user to estimate the errors given or, conversely, estimate the required to achieve a target approximation error.
We also propose an efficient implementation of the technique in a modern GPU by exploiting the fixedfunction interpolation routines present in its texture units to accelerate the computation while leaving the rest of the GPU (and, of course, the CPU) free to perform other tasks; this technique is even faster than using SFUs [21].
Although the initial motivation of the developed method was to improve the efficiency of nonparametric foreground segmentation strategies, it must be noted that it can be also used in many other scientific fields such as computer vision [24, 25] or audio signal processing [26, 27], where the evaluation of continuous mathematical functions constitutes a significant computational burden.
IB Organization
Section II summarizes the basic facts about piecewise linear approximation of realvalued univariate functions and reviews related work in this topic; Section III derives a suboptimal partition of the domain of the approximating function in order to minimize the distance to the original function (proofs of results are given in the Appendixes); Section IV analyzes the algorithmic complexity of the proposed approximation strategies and Section V gives details about their implementation on modern GPUs. Section VI presents the experimental results of the proposed approximation on several functions (Gaussian, Lorentzian and Bessel’s), both in terms of quality and computational times, and its use is demonstrated in an image processing application. Finally, Section VII concludes the paper.
Ii Piecewise Linear Modelization
Iia Related Work
In many applications, rigorous evaluation of complex mathematical functions is not practical because it takes too much computational power. Consequently, this evaluation is carried out approximating them by simpler functions such as (piecewise)polynomial ones. Piecewise linearization has been used as an attractive simplified representation of various complex nonlinear systems [28]. The resulting models fit into well established tools for linear systems and reduce the complexity of finding the inverse of nonlinear functions [29, 30]
. They can also be used to obtain approximate solutions in complex nonlinear systems, for example, in MixedInteger Linear Programming (MILP) models
[31]. Some efforts have been devoted as well to the search for canonical representations in one and multiple dimensions [32, 33] with different goals such as black box system identification, approximation or model reduction.Previous attempts to address the problem considered here (optimal function piecewise linearization) include [34, 35, 29, 36], the latter two in the context of nonlinear dynamical systems. In [34] an iterative multistage procedure based on dynamical programming is given to provide a solution to the problem on sequences of progressively finer 2D grids. In [29]
the piecewise linear approximation is obtained by using the evolutionary computation approach such as genetic algorithm and evolution strategies; the resulting model is obtained by minimization of a sampled version of the mean squared error and it may not be continuous. In
[36]the problem is addressed using a hybrid approach based on curve fitting, clustering and genetic algorithms. Here we address the problem from a less heuristic point of view, using differential calculus to derive a more principled approach
[37]. Our method solely relies on standard numerical integration techniques, which takes few seconds to compute, as opposed to recursive partitioning techniques such as [35], which take significantly longer.IiB Basic Results in PiecewiseLinear Interpolation and LeastSquares Approximation
In this section we summarize the basic theory behind piecewise linear functions [38] and two such approximations to realvalued functions: interpolation and projection (Sections IIB1 and IIB2, respectively), pictured in Fig. 2 along with their absolute approximation error with respect to .
In essence, a piecewise function over an interval is a partition of into a set of subintervals , where , and a set of functions , one for each subinterval . In particular we are interested in continuous piecewise linear (CPWL) functions, which means that all the are linear and . CPWL functions of a given partition are elements of a vector space : the addition of such functions and/or multiplication by a scalar yields another CPWL function defined over the same subintervals. A useful basis for the vector space is formed by the set of hat functions or nodal basis functions , pictured in Fig. 3 and defined in general by the formula
(1) 
The basis functions and associated to the boundary nodes and are only half hats.
These basis functions are convenient since they can represent any function in by just requiring the values of at its nodal points, , in the form
(2) 
IiB1 Linear Interpolation
The piecewise linear interpolant of a continuous function over the interval can be defined in terms of the basis just introduced:
(3) 
While this CPWL approximation is trivial to construct, and may be suitable for some uses, it is by no means the best possible one. Crucially, for any function that is convex in
. Depending on the application in which this approximation is used, this property could skew the results. However, as we will see in Section
III, the linear interpolant is useful to analyze other possible approximations. It is also at the heart of the trapezoidal rule for numerical integration.IiB2 Orthogonal Projection onto Vector Space
Let the usual inner product between two squareintegrable () functions in the interval be given by
(4) 
Then, the vector space can be endowed with the above inner product, yielding an inner product space. As usual, let be the norm induced by the inner product, and let be the distance between two functions . The orthogonal projection of the function onto is the function such that
(5) 
Since , it can be expressed in the nodal basis by
(6) 
where the coefficients solve the linear system of equations . The matrix has entries , and vector has entries given by . The Gramian matrix is tridiagonal and strictly diagonally dominant. Therefore, the system has exactly one solution , which can be obtained efficiently using the Thomas algorithm (a particular case of Gaussian elimination) [39].
is the element in that is closest to in the sense given by the aforementioned distance , as we recall next. For any ,
where we used property (5) that the vector . Next, applying the CauchySchwarz inequality,
and so , i.e.,
(7) 
with equality if . This makes most suitable as an approximation of under the norm.
Iii Finding the Optimal Partition
As just established, for a given vector space of CPWL functions , is the function in whose distance to is minimal. However, the approximation error does not take the same value for every possible ; therefore, we would like to find the optimal partition (or a reasonable approximation thereof) corresponding to the space in which the approximation error is minimum,
This is a difficult optimization problem: in order to properly specify in (6) and measure its distance to , we need to solve the linear system of equations , whose coefficients depend on the partition itself, which makes the problem symbolically intractable. Numerical algorithms could be used but it is still a challenging problem.
Let us examine the analogous problem for the interpolant function defined in Section IIB1. Again, we would like to find the optimal partition corresponding to the space in which the approximation error is minimum,
Albeit not as difficult as the previous problem, because
is more straightforward to define, this is still a challenging nonlinear optimization problem. Fortunately, as it is shown next, a good approximation can be easily found under some asymptotic analysis.
In the rest of this section we investigate in detail the error incurred when approximating a function by the interpolant and the orthogonal projection defined in Section II (see Fig. 2). Then we derive an approximation to the optimal partition (Section IIIB) that serves equally well for and because, as we will show (Section IIIA), their approximation errors are roughly proportional under the assumption of a sufficiently large number of intervals.
Iiia Error in a Single Interval
IiiA1 Linear Interpolant
First, let us consider the error incurred when approximating a function , twice continuously differentiable, by its linear interpolant in an interval (Fig. 4).
Result
The error between a given function and its linear interpolant
(8) 
with , in the interval , of length , is bounded:
(9) 
where .
Proof:
See Appendix A.
Formula (9) has the following intuitive explanation: the error measures the deviation of from being straight (linear), and this is directly related to the convexity/concavity of the function, thus the presence of the term to bound the amount of bending.
IiiA2 Best Linear Approximation
Let us now characterize the error of the orthogonal projection . We do so in two steps: first we compute the minimum error of a line segment in the interval , and then we use an asymptotic analysis to approximate the error of the orthogonal projection.
Stemming from , we can write any (linear) segment in as
(10) 
where and
are extra degrees of freedom (pictured in Fig.
4) with respect the interpolant that allow the line segment to better approximate the function in . By computing the optimal values of and we obtain the following result.Result
The minimum squared error between a given function and a line segment (10) in an interval , of length , adopts the expression
(11) 
for some in .
Proof:
See Appendix B.
Corollary
If is sufficiently small so that is approximately constant within it, say , then
(12) 
Proof:
Substitute for in (11).
The segments in may not strictly satisfy (11) because must also be continuous across segments. However, as the size of the intervals become smaller (due to a finer partition of the interval , i.e., the number of segments in increases) we may approximate in by the best (independentlyoptimized) segment and, consequently, use Corollary IIIA2 to yield the following result.
Result
The squared error between a given function and its orthogonal projection in a small interval , of length , is
(13) 
where is the approximately constant value that takes within the interval .
Proof:
See Appendix C.
IiiB Approximation to the Optimal Partition
Now we give a procedure to compute a suboptimal partition of the target interval and then derive error estimates for both and on such a partition and the uniform one.
IiiB1 Procedure to Obtain a Suboptimal Partition
Let us consider a partition of with subintervals . A suboptimal partition for a given is one in which every subinterval has approximately equal contribution to the total approximation error [39, 40], which implies that regions of with higher convexity are approximated using more segments than regions with lower convexity. Let us assume is large enough so that is approximately constant in each subinterval and therefore the bound (9) is tight. Consequently,
(15) 
for some constant , and the lengths of the subintervals (local knot spacing [40]) should be chosen , i.e., smaller intervals as increases. Hence, the local knot distribution or density is
(16) 
so that more knots of the partition are placed in the regions with larger magnitude of the second derivative.
Then, the proposed approximation to the optimal partition is as follows: , , and take knots given by
(17) 
where the monotonically increasing function is
(18) 
In this procedure, pictured in Fig. 5, the range of is divided into contiguous subranges of equal length, and the values of are given by the abscissas corresponding to the endpoints of the subranges.
IiiB2 Error Estimates
With this partition, we can further estimate approximation error bounds in the entire interval based on those of the subintervals.
Result
Proof:
The total squared error for any partition is the sum of squared errors over all subintervals , and by (9),
(20) 
which, under the condition of error equalization (15) for the proposed partition , becomes
(21) 
To compute , let us sum over all intervals and approximate the result using the Riemann integral:
(22) 
whose right hand side is independent of . Substituting (22) in (21) gives the desired approximate bound (19) for the error in the interval .
Since the approximation error of in each interval is roughly proportional to that of , as shown in (13) and (14), the partition (17) is also a very good approximation to the optimal partition for as the number of subintervals increases. This is briefly stated next.
Result
Both CPWL approximations ( and ) converge to the true function at a rate of at least ((19) and (23)).
We use a similar procedure to derive an estimate error bound for the uniform partition that can be compared to that of the optimized one.
Result
The approximation error of the linear interpolant in the uniform partition of the interval is
(24) 
Proof:
Iv Computational Analysis
The implementation of the approximations previously discussed is straightforward (Algorithm 1) and we only need to distinguish two different cases depending on whether the partition of has all subintervals of equal width or not. Although their values are different, and are qualitatively the same from the computational viewpoint.
Let us discuss the general case that is nonuniform. Line 1 in Algorithm 1 implies searching in the ordered array for the right index . Since the other steps in the algorithm do not depend on the size of the input vectors, Line 1 is the dominant step and makes its runtime complexity . In the particular case that is uniform, no search is needed to determine the index: and the fractional part . Thus, the runtime complexity of the uniform case is . There is also no need to store the full array but only its endpoints and , roughly halving the memory needed for a nonuniform partition of the same number of intervals.
Consequently, approximations based on a uniform partition are expected to perform better in usual CPUs or GPUs, computationallywise, than those based on nonuniform partitions. However, optimizing the partition might lead, depending on the specific objective function, to a reduction in the memory requirements for a desired approximation error. If memory constraints are relevant or hardwareaided search routines become available, optimized partitions could become practical.
V Implementation in a GPU
The proposed algorithm is simple to implement either in a CPU or in a GPU. However, there are some implementation details that help further optimize the performance in a GPU.
Modern GPUs implement a Single Instruction, Multiple Data (SIMD) execution model in which multiple threads execute exactly the same instructions (each) over different data. If two threads scheduled together in the same execution group or warp follow different branches in a conditional statement, both branches are not executed in parallel but sequentially, thereby halving throughput. In the case of being nonuniform, the (binary) search performed to find the index could lead to divergent paths for different threads evaluating for different values of , thus reducing throughput.
Since the purpose of the proposed algorithm is to save computation at runtime, it is reasonable to assume that at least , and possibly and too, have been determined at compile time. Then, we can particularize the search code in compile time, unrolling the loop needed to implement the binary search and, most importantly, replacing conditional statements with a fixed number of invocations of the ternary operator, which is implemented in the GPU with a single instruction and no code divergence [41].
In general, accesses to memory in a GPU need to be arranged so that neighbouring threads access neighbouring addresses in memory. This allows the host interface of the GPU to coalesce operations from several threads into fewer transactions bigger in size. In the described algorithm, if each thread is evaluating for different values of , it is not guaranteed that the memory access pattern complies with this restriction. Although some devices provide caching and isolate the programmer from this issue, not all do.
However, there is a case in a regular computer graphics pipeline in which a similar problem arises. Texture data need to be accessed many times in a short period and it is generally impossible to ensure that neighbouring pixels in screen need to read neighbouring texels. Consequently, most GPUs do cache texture data and even have dedicated cache layers for it. We therefore store the arrays of samples in texture memory rather than in global memory to benefit from texture caches.
The use of the texture units to access our samples provides another important benefit. In the usual computer graphics pipeline, GPUs need to constantly filter texture data. In order to efficiently perform that task, texture units are equipped with hardware fixedfunction interpolation routines that include linear interpolation [21]. Therefore, we can use them to speed up line 3 in Algorithm 1; results in the next section confirm that the texture filtering unit is indeed faster than performing interpolation “manually”.
Vi Experimental Results
In this section we apply the proposed linearization algorithm to several functions of interest in cybernetics, computer vision and other scientific areas. The section starts with the analysis of the Gaussian function and demonstrates the proposed technique on an image processing application. The section then analyzes two other sample nonlinear functions of interest: the Lorentzian function and the Bessel function of the first kind.
Via Gaussian Function
The Gaussian function is widely used in many applications in every field of science, but it is of particular interest to us in the context of foreground segmentation in video sequences using spatiotemporal nonparametric background models [20]. To estimate these models, the Gaussian function needs to be evaluated approximately 1300 times per pixel and input image, or around 2 billion times per second at modest video quality (CIF, pixels at 15 fps). Therefore it is crucial to lower the computing time of the Gaussian function. In the performed experiments it was enough to approximate the function in the interval to achieve the required quality.
Fig. 6 shows distances between the Gaussian function and the approximations described in previous sections. It reports actual distances as well as the approximate and tight upper bounds to the distances, i.e., Results IIIB2 to IIIB2. It can be observed that the distances between and using the uniform and optimized partitions agree well with Results IIIB2 to IIIB2. All curves in Fig. 6 have similar slope to that of (19), i.e., their convergence rate is , and the ratio between the distances corresponding to and is approximately the value that stems from Result IIIA2 and (14).
Number of points ()  32  64  128  256  512  

CPU  exact function using expf from <cmath>  13710  
uniform partition  1750  1780  
optimized partition  8900  11230  13830  16910  20120  
GPU  exact function using expf from <cmath>  33.3  
fast function using __expf from the SFU  14.2  
uniform partition, manual interpolation  19.7  19.8  
uniform partition, hardware interpolation  7.8  7.9  
optimized partition, manual interpolation  142.0  161.9  181.8  203.1  224.1  
optimized partition, hardware interpolation  122.7  142.9  163.2  188.3  211.1 
Table I shows mean processing times per evaluation both in the CPU (sequentially, one core only) and in the GPU. All execution times have been measured in a computer equipped with an Intel Core i72600K processor, 16 GiB RAM and an NVIDIA GTX 580 GPU. We have exercised the utmost care to ensure that all different strategies achieve the same utilization of the GPU so that measurements are fair.
We compare the proposed algorithm against the exact Gaussian implemented using the exponential function of the C standard library (CPU or GPU) and against a fast Gaussian implemented using the hardwareaccelerated lowprecision exponential function from the SFU of the GPU [23]. In either hardware (CPU or GPU), there are no separate measurements for and because they only differ in value but not in their implementation. As was expected from the computational analysis, execution times for the uniform partition are constant, whereas they are heavily dependent on for the optimized partition (see Fig. 7).
In the CPU case, the optimized partition approach is faster than the C standard library implementation if , but the speedup gain is smaller than that of the uniform partition approach, which is 8 times faster than the C standard library (regardless of ).
In the GPU case, we have also measured the proposed algorithm both with and without the texture filtering unit (last four rows of Table I) to prove that the former option is indeed faster than coding the interpolation explicitly, thanks to its dedicated circuitry. The proposed strategy, using a uniform partition (7.9 ps), solidly outperforms both the exact (33.3 ps) and fast SFU (14.2 ps) Gaussians, by approximate speedup factors of and , respectively, even though the latter is hardwareaccelerated. However, we must stress that this is a synthetic benchmark; in a real application such as the one described in the next section, there may exist limitations in memory bandwidth to access operands. Moreover, many resources such as shared memory, cache and hardware registers are being used for other tasks too. This can (and does) affect the execution time of each of the different implementation strategies.
Foreground Segmentation (Sample Application)
To obtain highquality detections in complex situations (e.g., dynamic background, illumination changes, etc.), multimodal moving object detection strategies are commonly used because they are able to model the multiple pixel states in such situations. Among multimodal strategies, nonparametric methods have shown to provide the best quality detections [42] because, in contrast to other algorithms, they do not assume that the pixel values conform to a particular distribution, but obtain instead probabilistic models from sets of recent samples.
Fig. 8 illustrates this by comparing the results from different strategies against a difficult scene featuring a dynamic background. The results of two unimodal background modeling methods are shown in the second row of this figure: the Temporal Median Filter (e.g., [43]) and the Running Gaussian Average (e.g., [44]); whereas the segmentations resulting from the application of two multimodal modeling approaches are depicted in the third row: an improved version of the Mixture of Gaussians method [45] and the spatiotemporal nonparametricbased background modeling in [20].
To improve the quality of the results in sequences containing nonstatic background regions, and/or captured with portable cameras, most recent nonparametric methods use spatiotemporal reference data [46]. However, these strategies involve huge memory and computational costs, since they need to evaluate millions of Gaussian kernels per image. Therefore, lowering the cost of evaluating the Gaussian function has a direct impact in the performance of the algorithm. To demonstrate this, we have implemented our proposed technique in a nonparametric background modeling method [20]. Table II shows mean perpixel (in order to make measurements resolutionindependent) processing times for the three best performing options in Section VI. Processing times are not proportional to those in Table I because the model needs to do many other things aside from evaluating Gaussians, and access to reference data is bandwidthlimited. However, our proposed technique still outperforms any of the alternatives: the exact Gaussian by 27% and the fast SFU Gaussian by 12%, while the final segmentation results of the method are the same in all three cases, as depicted in Fig. 9.
Uniform partition, hardware interpolation  286 
Fast Gaussian using __expf from the SFU  324 
Exact Gaussian using expf from <cmath>  391 
ViB Lorentzian Function
The performance of the approximation technique has also been tested on other functions. For example, the Lorentzian function with peak at and width is
(25) 
As a probability distribution, (
25) is known as the Cauchy distribution. It is an important function in Physics since it solves the differential equation describing forced resonance. In such case,
is the resonance frequency and depends on the damping of the oscillator (and is inversely proportional to the factor, a measure of the sharpness of the resonance).Fig. 10 reports the distances between (25) and the CPWL approximations described in previous sections, in the interval . The measured errors agree well with the predicted approximate error bounds, showing the expected trend.
We have measured the mean perevaluation execution times of the exact function both in the CPU (576 ps) and in the GPU, both using regular and reducedprecision accelerated division (19.5 ps and 9.2 ps, respectively). The evaluation times of our proposed strategy coincide with those of the Gaussian function (Table I) because the processing time of the CPWL approximation does not depend on the function values.
As expected from the lower complexity of this function (compared to that of Sec. VIA), which only involves elementary arithmetic operations, the advantage of our proposal vanishes in the CPU because the operations needed to manually perform interpolation, together with the two values that need to be fetched from memory, are actually more than what the direct evaluation requires. However, note that in the GPU our proposal still remains competitive due to these operations being carried out by the dedicated circuitry of the texture unit.
ViC Bessel Function
Approximating the Bessel function of the first kind is more challenging than approximating the Gaussian or Lorentzian (bellshaped) functions because it combines both nearly flat and oscillatory parts (see Fig. 11). Fig. 12 presents the approximation errors for using CPWL functions in the interval . For the measured errors agree well with the predicted approximate error bounds, whereas for the measured errors slightly differ from the predicted ones (specifically in the optimized partition) because in these cases the scarce number of linear segments does not properly represent the oscillations.
A sample optimized partition and the corresponding CPWL interpolant is also represented in Fig. 11. The knots of the partition are distributed according to the local knot density (see Fig. 11, Top), accumulating in the regions of high oscillations (excluding the places around the zeros of the ).
This function is more complex to evaluate in general because it does not have a closed form. However, our approximation algorithm works equally well on it. We have measured the mean perevaluation execution times in the CPU (39 ns using the POSIX extensions of the GNU C Library and 130 ns—only doubleprecision provided—using the GNU Scientific Library) and in the GPU (78 ps using the CUDA [23] standard library). We have also measured the execution time of the first term in the asymptotic expansion [47, Eq. 9.57a], valid for , In the GPU the evaluation takes 42 ps using reducedprecision accelerated functions from the SFU for the cosine and multiplicative inverse of the square root. Despite this approximation having a noncustomizable error (decreasing as increases) and using the fastest available implementation (SFU) of the operations involved, our strategy still outperforms it by a sizable margin. This clearly illustrates that the more complex the function to be evaluated is, the greater the advantage of our proposal.
Vii Conclusions
We have developed a fast method to numerically evaluate any continuous mathematical function in a given interval using simpler continuous piecewise linear (CPWL) functions and the builtin routines present in the texture units of modern GPUs. Our technique allows realtime implementation of demanding computer vision and cybernetics applications that use such mathematical functions.
For this purpose, we analyzed the CPWL approximations given by the linear interpolant and the orthogonal projection of a function. We carried out a detailed error analysis in the distance to seek a nearly optimal design of both approximations. In the practical asymptotic case of a large number of subintervals , we used error equalization to achieve a suboptimal design (partition ) and derived a tight bound on the approximation error for the linear interpolant, showing a convergence rate that was confirmed by experimental results. The orthogonal projection can only but improve upon the results of the linear interpolant, resulting in a gain factor of .
We discussed the computational complexity and the implementation in a GPU of the numerical evaluation of both CPWL approximations. Our experimental results show that our technique can outperform both the quality and the computational cost of previous similar approaches. In particular, the fastest strategy consists of using the texture units in the GPU to evaluate either of the CPWL approximations defined over a uniform partition. This is normally faster than performing exact function evaluations, even when using the reducedprecision accelerated implementation of the SFUs in a GPU, or evaluating the proposed linear approximations without the assistance of the texture units. In practice, the number of subintervals to be considered for representing the nonlinear function can be decided on its own or based on other considerations such as a target approximation error, speed or memory constraints.
Although mathematically sound, the strategies based on a suboptimal (nonuniform) partition are not practical to implement in current CPU/GPU architectures due to the high cost incurred to find the subinterval that contains the given point of evaluation. Nevertheless, this opens future research paths to explore practical implementations of such approaches using specialized hardware.
Appendix A Proof of Result IiiA1 (Linear Interpolant Error)
Recall one of the theorems on interpolation errors [48]. Let be a function in , and let be a polynomial of degree or less that interpolates the function at distinct points . Then, for each there exists a point for which
(26) 
In the interval , the linear interpolant is given by (8). Since interpolates the function at the endpoints of , we can apply theorem (26) (with ); hence, the approximation error solely depends on and , but not on or :
(27) 
Let us integrate the square of (27) over the interval ,
Next, to simplify the previous integral, let us use the first mean value theorem for integration, which states that if is a continuous function and is an integrable function that does not change sign on the interval , then there exists a number such that
(28) 
Appendix B Proof of Result IiiA2 (Line Segment Minimum Error)
The approximation error corresponding to the line segment (10) is
(30) 
where, by analogy with the form of we defined .
The proof proceeds by computing the optimal values of and that minimize the squared error over the interval :
(31) 
The first term is given in (29). The second term is
and the third term is, applying (28) and the change of variables to evaluate the resulting integrals,
for some and in .
Substituting previous results in (31),
We may now find the line segment that minimizes the distance to by taking partial derivatives with respect to and , setting them to zero and solving the corresponding system of equations. Indeed, the previous error is quadratic in and , and attains its minimum at
(32) 
The resulting minimum squared distance is (11).
Appendix C Proof of Result IiiA2 (Asymptotic Analysis)
By the triangle inequality, the jump discontinuity at
between two adjacent independentlyoptimized segments is(33) 
where and are the offsets with respect to of the optimized segments (32) at the left and right of , respectively; and lie in the interval . Since we are dealing with functions twice continuously differentiable in a closed interval, the absolute value terms in (33) are bounded, according to the extreme value theorem; therefore, if and decrease (finer partition ), the discontinuity jumps at the knots of the partition also decrease. In the limit, as , , i.e., continuity is satisfied. Therefore the union of the independentlyoptimized segments , which is the unique piecewise linear function satisfying both continuity () and minimization of the error (7). Consequently, if is large we may approximate ; moreover, if is sufficiently small so that is approximately constant within it, , then we use Corollary IIIA2 to get (13).
References
 [1] G. Shapiro, “Consumer Electronics Association’s Five Technology Trends to Watch: Exploring New Tech That Will Impact Our Lives,” IEEE Consum. Electron. Mag., vol. 2, no. 1, pp. 32–35, 2013.
 [2] N. Singhal, I. K. Park, and S. Cho, “Implementation and optimization of image processing algorithms on handheld GPU,” in IEEE Int. Conf. Image Process. (ICIP), 2010, pp. 4481–4484.

[3]
V. Borges, M. Batista, and C. Barcelos, “A soft unsupervised twophase image segmentation model based on global probability density functions,” in
IEEE Int. Conf. Systems, Man and Cybernetics (SMC), Oct 2011, pp. 1687–1692.  [4] K. Kapinchev, F. Barnes, A. Bradu, and A. Podoleanu, “Approaches to General Purpose GPU Acceleration of Digital Signal Processing in Optical Coherence Tomography Systems,” in IEEE Int. Conf. Systems, Man and Cybernetics (SMC), Oct 2013, pp. 2576–2580.
 [5] M. Song, D. Tao, Z. Liu, X. Li, and M. Zhou, “Image ratio features for facial expression recognition application,” IEEE Trans. Syst., Man, Cybern. B, vol. 40, no. 3, pp. 779–788, June 2010.

[6]
J. Kim, E. Park, X. Cui, H. Kim, and W. Gruver, “A fast feature extraction in object recognition using parallel processing on CPU and GPU,” in
IEEE Int. Conf. Systems, Man and Cybernetics (SMC), Oct 2009, pp. 3842–3847.  [7] S. HerreroLopez, “Accelerating SVMs by integrating GPUs into MapReduce clusters,” in IEEE Int. Conf. Systems, Man and Cybernetics (SMC), Oct 2011, pp. 1298–1305.
 [8] P. Kromer, J. Platos, and V. Snasel, “Genetic algorithm for clustering accelerated by the CUDA platform,” in IEEE Int. Conf. Systems, Man and Cybernetics (SMC), Oct 2012, pp. 1005–1010.
 [9] S. BallaArabe, X. Gao, and B. Wang, “GPU accelerated edgeregion based level set evolution constrained by 2D grayscale histogram,” IEEE Trans. Image Process., vol. 22, no. 7, pp. 2688–2698, 2013.
 [10] A. Jalba and J. B. T. M. Roerdink, “Efficient Surface Reconstruction From Noisy Data Using Regularized Membrane Potentials,” IEEE Trans. Image Process., vol. 18, no. 5, pp. 1119–1134, 2009.
 [11] S. BallaArabe, X. Gao, B. Wang, F. Yang, and V. Brost, “Multikernel implicit curve evolution for selected texture region segmentation in VHR satellite images,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 8, pp. 5183–5192, Aug 2014.
 [12] L. Cheng, M. Gong, D. Schuurmans, and T. Caelli, “RealTime Discriminative Background Subtraction,” IEEE Trans. Image Process., vol. 20, no. 5, pp. 1401–1414, 2011.
 [13] P. Chiranjeevi and S. Sengupta, “Robust detection of moving objects in video sequences through rough set theory framework,” Image Vision Comput., vol. 30, no. 11, pp. 829 – 842, 2012.
 [14] N. Liu, H. Wu, and L. Lin, “Hierarchical ensemble of background models for PTZbased video surveillance,” IEEE Trans. Cybern., vol. 45, no. 1, pp. 89–102, Jan 2015.
 [15] Y. Sheikh, O. Javed, and T. Kanade, “Background subtraction for freely moving cameras,” in IEEE Int. Conf. Computer Vision (ICCV). IEEE, 2009, pp. 1219–1225.
 [16] L. Sigal, M. Isard, H. Haussecker, and M. Black, “Looselimbed people: Estimating 3D human pose and motion using nonparametric belief propagation,” Int. J. Comput. Vision, vol. 98, no. 1, pp. 15–48, 2012.

[17]
C. Cuevas, R. Mohedano, and N. García, “Adaptable Bayesian classifier for spatiotemporal nonparametric moving object detection strategies.”
Optics letters, vol. 37, no. 15, pp. 3159–3161, Aug. 2012. 
[18]
Y. Moshe, H. HelOr, and Y. HelOr, “Foreground detection using
spatiotemporal projection kernels,” in
IEEE Conf. Computer Vision and Pattern Recognition (CVPR)
, 2012, pp. 3210–3217.  [19] C. Cuevas and N. García, “Efficient Moving Object Detection for Lightweight Applications on Smart Cameras,” IEEE Trans. Circuits Syst. Video Technol., vol. 23, no. 1, pp. 1–14, Jan. 2013.
 [20] D. Berjón, C. Cuevas, F. Morán, and N. García, “GPUbased implementation of an optimized nonparametric background modeling for realtime moving object detection,” IEEE Trans. Consum. Electron., vol. 59, no. 2, pp. 361–369, May 2013.
 [21] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, “NVIDIA Tesla: A unified graphics and computing architecture,” IEEE Micro, vol. 28, no. 2, pp. 39–55, 2008.
 [22] M. Sylwestrzak, D. Szlag, M. Szkulmowski, I. Gorczyńska, D. Bukowska, M. Wojtkowski, and P. Targowski, “Real time 3D structural and Doppler OCT imaging on graphics processing units,” in Proc. SPIE, no. 85710Y. Int. Soc. Optics and Photonics, Mar. 2013.
 [23] NVIDIA Corp., “CUDA C Programming Guide,” Tech. Rep., 2012.
 [24] J.