Super-resolution reconstruction (SRR) is a well established approach for digital image quality improvement. SRR consists basically in combining multiple low-resolution (LR) images of the same scene or object in order to obtain one or more images of higher resolution (HR), outperforming physical limitations of image sensors. Applications for SRR are numerous and cover diverse areas, including the reconstruction of satellite images in remote sensing, surveillance videos in forensics and images from low-cost digital sensors in standard end-user systems. References [2, 3] review several important concepts and initial results on SRR.
SRR algorithms usually belong to one of two groups. Image SRR algorithms, which reconstruct one HR image from multiple observations, and video SRR algorithms, which reconstruct an entire HR image sequence. Video SRR algorithms often include a temporal regularization that constrains the norm of the changes in the solution between adjacent time instants [4, 5, 6, 7, 8, 9]. This introduces information about the correlation between adjacent frames, and tends to ensure video consistency over time, improving the quality of the reconstructed sequences .
A typical characteristic of SRR algorithms is the high computational cost. Recent developments in both image and video SRR have been mostly directed towards achieving increased reconstruction quality, either using more appropriate a priori information about the underlying image and the acquisition process or learning the relationship between LR and HR images from a set of training data. Examples are the non-parametric methods based on spatial kernel regression , non-local methods [12, 13], variational Bayesian methods [14, 15]
and, more recently, deep-learning-based methods[16, 17, 18].
Although these techniques have led to considerable improvements in the quality of state of the art SRR algorithms, such improvements did not come for free. The computational cost of these algorithms is very high, which makes them unsuitable for real-time SRR applications. While deep-learning methods can be significantly faster than kernel or non-local methods, they rely on extensive training procedures with large amounts of data. Also, the training must be repeated whenever the test conditions change, or their performance may degrade significantly . Moreover, registration errors plague most SRR algorithms. This motivates the use of nonlinear cost functions and more complex methodologies, which further contributes to increase the computational cost of the corresponding algorithms [20, 21].
While these traditional SRR methods achieve good reconstruction results , real-time video SRR applications require simple algorithms. This limitation prompted a significant interest towards developing low complexity SRR algorithms. The regularized least mean squares (R-LMS) [23, 24] is one notable example among the simpler SRR algorithms. Even though other low-complexity algorithms have been proposed for global translational image motions, such as the
norm estimator in and the adaptive Wiener filter of , their computational complexity is still not competitive with that of the R-LMS. The R-LMS presents a reasonable reconstruction quality and follows a systematic mathematical derivation. This enables a formal characterization of its behavior  and the specification of well defined design methodologies . Furthermore, the R-LMS is also naturally robust to registration errors, which were shown to have a regularizing effect in the algorithm . This makes the quality of the R-LMS algorithm in practical situations to be competitive even with that of costly and elaborated algorithms like , as illustrated through an example in Figure 1.
Unfortunately, the performances of these simple algorithms, including the R-LMS, tend to be heavily affected by the presence of outliers such as large innovations caused by moving objects and sudden scene changes. This can lead to reconstructed sequences of worse quality than that of the observations themselves . Common strategies for obtaining robust algorithms often involve the optimization of nonlinear cost functions [4, 25, 7]
, and thus present a computational cost that is not comparable to that of algorithms like the R-LMS. Interpolation algorithms might seem to be a reasonable option, as their performance is not affected by motion related outliers. However, they do not offer a quality improvement comparable to SRR methods[30, 7]. Therefore, it is of interest to develop video SRR algorithms that combine good quality, robustness to outliers and a low computational cost.
This paper proposes a new adaptive video SRR algorithm with improved robustness to outliers when compared to the R-LMS algorithm. The contributions of this paper include: 1) A new interpretation of the R-LMS update equation as the proximal regularization of the associated cost function, linearized about the previous estimate, which leads to a better understanding of its quality performance and robustness in different situations. Using this representation we show that the slow convergence rate of the R-LMS algorithm (typical of stochastic gradient algorithms) establishes a trade-off between its robustness to outliers and the achievable reconstruction quality. 2) A simple model for the statistical properties of typical innovation outliers in natural image sequences is developed, which points to the desirable properties of the proposed technique. 3) A new cost function is then proposed to address the identified problems and two new adaptive algorithms are derived called Temporally Selective Regularized LMS (TSR-LMS) and Linearized Selective Regularized LMS (LTSR-LMS), which present improved robustness and similar quality at a computational cost comparable to that of the R-LMS algorithm.
The paper is organized as follows. In Section II, the image acquisition model is defined. In Section III, the R-LMS algorithm [23, 24] is derived as a stochastic gradient solution to the image estimation problem . In Section IV, an intuitive interpretation of the R-LMS behavior is presented using the proximal-point cost function representation of the gradient descent iterative equation. In Section V, a new robust cost function is proposed based on a statistical model for the innovations, and two adaptive algorithms are derived. In chapter VII, computer simulations are performed to assess the performance of the algorithms. The conclusions are presented in Section VIII.
Ii Image Acquisition Model
Given the matrix representation of an LR (observed) digital image , and an () matrix representation of the original HR digital image , the acquisition process can be modeled as 
where vectors() and () are the lexicographic representations of the degraded and original images, respectively, at discrete time instant . is an decimation matrix and models the sub-sampling taking place in the sensor. is an matrix, assumed known, that models the blurring in the acquisition process111Since is assumed to be estimated independently from the SRR process, the extension of the results in this work to the case of a time-varying is straightforward if an online estimation algorithm for is employed. Here, a possible dependence of on is omitted for notation simplicity.. The vector models the observation (electronic) noise, whose properties are assumed to be determined from camera tests. The dynamics of the input signal is modeled by 
where is the warp matrix that describes the relative displacement from to . Vector models the innovations in .
Iii The R-LMS SRR algorithm
Several SRR solutions are based on the minimization of the estimation error (see  and references therein)
where is the estimated HR image, and can be interpreted as the estimate of in (1). The LMS algorithm attempts to minimize the mean-square value of the norm of (3) conditioned on the estimate [23, 27]. Thus, it minimizes the cost function .
Since natural images are known to be intrinsically smooth, this a priori knowledge can be added to the estimation problem in the form of a regularization to the LMS algorithm by constraining the solution that minimizes . The R-LMS algorithm [29, 24] is the stochastic gradient version of the gradient descent search for the solution to the following optimization problem
Following the steepest descent method, the HR image estimate is updated in the negative direction of the gradient
and thus the iterative update of for a fixed value of is given by
where is the number of iterations of the algorithm, and is the step size used to control the convergence speed. The factor 1/2 is just a convenient scaling.
Between two time updates, (7) is iterated for . The estimate at a given time instant is then given by .
Iv R-LMS Performance in the Presence of Outliers
The R-LMS algorithm is computationally efficient when implemented with few stochastic gradient iterations (small ) per time instant . However, the R-LMS algorithm is derived under the assumption that the solution is only slightly perturbed between time instants, a characteristic known as the minimum disturbance principle. This assumption is satisfied in the R-LMS algorithm close to steady-state, when the estimate has already achieved a reasonable quality (i.e. ). Then, the initialization determined by (8) for the next time instant will be already relatively close to the optimal solution, what explains the good steady-state performance of the algorithm even with a small value of in (7).
The situation is significantly different in the occurrence of innovation outliers. Experience with the R-LMS algorithm shows that the slow convergence of (7) as a function of tends to degrade the quality of the super-resolved images in the presence of innovation outliers. Visible artifacts tend to be created, and the reconstructed images may end up being of inferior quality even when compared to the originally observed LR images. This significantly compromises the quality that can be achieved in real-time super-resolution of video sequences, not just using R-LMS but most of the existing low-complexity algorithms . Super-resolution algorithms designed to be robust under the influence of innovations tend to impose a high computational cost, making them unsuitable for real time applications [25, 28]. In the following we examine the R-LMS recursion under a new light, what leads to a mathematically motivated explanation for its lack of robustness to outliers. This explanation will then motivate the proposition of more robust stochastic video SRR algorithms.
An interesting interpretation of the R-LMS algorithm is possible if we view each iteration of the gradient algorithm (6) (for a fixed value of ) as a proximal regularization of the cost function linearized about the estimation of the previous iteration . Proceeding as in [32, Section 2.2] or [33, p. 546], the gradient iteration (6) can be written as
which, using the previous expressions, yields
where is the expected value of the observation error (3) conditioned on . This equivalence can be verified by differentiating the expression in the curl brackets in (9) with respect to , setting it equal to zero and solving for .
Now, the presence of the squared norm within the curl brackets in (9) and (10) means that the optimization algorithm seeks that minimizes the perturbation at each iteration. Evidencing this property leads to a more detailed understanding of the dynamical behavior of the algorithm, its robustness properties and the reconstruction quality it provides. For instance, this constraint on the perturbation of the solution explains how the algorithm tends to preserve details in that have been estimated in previous time iterations and that are present in . However, this same term also opposes changes from to , slowing down the reduction of the observation error from to since changes in require changes in . Therefore, this algorithm cannot simultaneously achieve a fast convergence rate and preserve the super-resolved details for the practically important case of a small number of iterations per time instant (small ). The time sequence of reconstructed images will either converge fast and yield low temporal correlation between time estimates (leading to a solution that approaches an interpolation of ), or will converge slowly and yield a highly correlated image sequence with better quality in the absence of innovation outliers. The occurrence of outliers will result in a significant deviation from the desired signal.
To illustrate this behavior, consider for instance that the reconstructed image sequence at time instant is reasonably close to the real (desired) sequence, i.e. . If we consider the video sequence to be only slightly perturbed at the next time instant such that in (2), the first iteration of (10) ( at time ) is given by
which, using (8) with , yields
Finally, assuming (no outlier in )
which is small since the energy of the observation noise is much smaller than that of registration errors and outliers in most practical applications [2, 28]. The first term in the r.h.s. of (14) is due to the regularization an promotes the smoothness of the solution. Hence, should be small to avoid compromising the estimation of the details of . The last term will promote a solution close to , especially for small values of . Then, for reasonably small values of and , and , the first and second terms in (14) can be neglected (i.e. ). Then, the solution of the optimization problem (14) will converge to a vector . The same reasoning can be extended to the remaining iterations for , which shows that, for , the algorithm will lead to a reconstructed image of good quality . This explains how the R-LMS algorithm preserves the reconstructed content in time and extracts information from the different observations, attaining good reconstruction results for well behaved sequences, i.e. in the absence of large innovations.
Now, let’s consider the presence of a significant innovation outlier at time , while still assuming a good reconstruction result at time (i.e. ). Due to the outlier at time instant , in (2) will have a significant energy. Then, repeating (14) without the assumption yields
where now the observation error is given by
This result clearly shows that, for large , the solution of (16) can be considerably away from , as it should contain information introduced by . For the desirable case of small , this estimation error will hardly be significantly reduced from to due to the slow convergence of the gradient based recursion. This explains the poor transient performance of the algorithm in the presence of outliers.
Performance improvement in the presence of outliers could be sought by increasing the value of to reduce the influence of the term in (16). However, cannot be made arbitrarily large for stability reasons. Hence, improvement has to come from increasing the importance of the first term in (16).
First, note that does not include information on the outlier , as it is not present in . Moreover, we have already seen that the first term promotes the smoothness of the solution. Thus, increasing the value of in an attempt to speed up convergence in the presence of large innovations by reducing the influence of the last term in (16) will also reduce the temporal correlation of the estimated image sequence, resulting in an overly blurred solution with lower quality in the absence of outliers. Hence, the solution can hardly approach the desired solution in few iterations (small ). If, however, is made sufficiently large, the solution can adapt to track the innovations even with a large weighting for the term . The algorithm could then achieve and maintain a good reconstruction quality both with and without the presence of outliers, but at a prohibitive computational cost.
Iv-a Illustrative example
The behavior of the R-LMS algorithm is illustrated in the following example, where we consider the reconstruction of a synthetic video sequence generated through small translational displacements of a window over a larger natural image. At a specific time instant during the video (the 32nd frame), an outlier is introduced by adding a black square of size to the scene. The square remain in the scene for frames, before disappearing again. The HR sequence is convolved with a uniform blurring mask and down-sampled by a factor of
. Finally, a white Gaussian noise with varianceis added to generate the low resolution video.
The R-LMS algorithm is applied to super-resolve the synthetic LR videos generated. The mean square error (MSE) between the original and reconstructed sequences is estimated by averaging the results from 50 realizations. To illustrate the trade-offs between the effects of different values of the step size and regularization parameter in the cost function, we reconstructed the sequence with both and , for . To evaluate the effect of using different numbers of iterations per time interval, we ran the algorithm with and . The MSE is depicted in Figure 2-(a). From the two curves for one can verify that a large value for (red curve) reduces the MSE in the presence of the outlier, while the greater temporal correlation induced by a small value of (black curve) tends to reduce the error for small innovations and to increase it in the presence of an outlier. Comparing the blue () and the black () curves, both for and , one verifies that the MSE can be substantially decreased by employing the R-LMS algorithm with a large . The MSE is smaller than that obtained for both for small and for large innovations. This performance improvement is because the algorithm is allowed to converge slowly for each time interval. Figure 2-(b) shows the MSE as a function of for time instant , when the outlier is present. These results illustrate the property that a large value of is necessary to achieve a significant MSE reduction for a fixed value of .
In the light of the aforementioned limitations of the R-LMS algorithm, it is of interest to devise an algorithm that performs better both in terms of robustness and quality at a reasonable computational cost.
V Improving the Robustness to Innovations
A temporal regularization of adaptive algorithms such as the R-LMS that constrains the value of in the SRR cost function [10, 4, 7] can be interpreted as the application of the well known least perturbation or minimum disturbance principle. This principle states that the parameters of an adaptive system should only be disturbed in a minimal fashion in the light of new data [34, p. 355]. Using this principle, the one-dimensional LMS algorithm can be shown to correspond not to an approximate solution of a gradient-based optimization problem, but to the exact solution of a constrained optimization problem [35, p. 216].
Differently from simultaneous video SRR methods, the cost function (4) of the R-LMS algorithm is defined for a single time instant. Thus, the proximal regularization described in Section IV only guarantees consistence between consecutive iterations in . As the solution at the previous time instant is only introduced during the initialization in (8), consistence between consecutive time instants is only obtained if the parameters of the R-LMS algorithm are selected such that the solution is not disturbed during all iterations (i.e. ). However, as illustrated in the example in section IV-A, this makes the R-LMS algorithm very sensitive to outliers. A choice of parameters leading to a superior robustness, on the other hand, compromises the estimation of the details in .
To preserve the super-resolved details between consecutive time instants regardless of the choice of the R-LMS parameters, one might be tempted to introduce an additional temporal term to the optimization problem (10) to prevent the loss of content estimated in when reconstructing . This results in the following optimization problem:
where is a weighting factor controlling the temporal disturbance. Albeit removing the dependence of its solution on the time initialization (8), the algorithm in (19) fails to achieve good results. Instead, this new regularization term makes the algorithm less robust since it prevents convergence to the desired solution in the presence of large innovations even for a large number of iterations (large ). This is clearly perceived by assuming again that is large and , and examining the norm of the last term in (19) for
which will be large if not only for , but for all iterations. Furthermore, this term would be unnecessary for small innovations. In this case the R-LMS can retain the temporal consistency even for a large number of iterations (), as illustrated in the example of section IV-A for . Hence, algorithm robustness and quality must be addressed using other approaches.
Most works in single-frame or video SRR seek robustness by considering cost functions including non-quadratic (e.g. ) error norms [25, 4, 7] or signal dependent regularizations [9, 36], which result in non-linear algorithms. Although these techniques achieve good reconstruction results, their increased computational cost makes real-time operation unfeasible even for the fastest algorithms. Differently from the simultaneous SRR methods, the robustness problem of the R-LMS is related with its slow convergence, since a good result is achieved for large . A different approach is therefore required to adequately handle the innovations in the R-LMS algorithm.
In the following, we propose to use meaningful a priori information about the statistical nature of the innovations in deriving a new stochastic SRR method using the least perturbation principle. The proposed approach can improve the robustness of the R-LMS algorithm while retaining a reduced computational cost. By employing statistical information about , which has been overlooked in the design of simple SRR algorithms, it becomes possible to provide robustness to the innovations while maintaining a good reconstruction quality.
V-a Constructing an Innovation-Robust Regularization
To achieve the desired effect, we propose to modify the norm being minimized in the last term of (19) through the inclusion of a weighting matrix properly designed to emphasize the image details in the regularization term. This will allow the resulting algorithm to attain a faster speed of convergence with a good quality, while at the same time reducing the influence of the innovations on the solution of the optimization problem. The new constraint is then given by
where must be designed to preserve only the details of the estimated images, so that innovations will have a minimal effect upon the regularization term. Thus, it is desired that
which means that the image details must lie in the column space of , while the innovations lie in its nullspace. Therefore, if we assume the reconstructed image in time instant to be reasonably close to the real (desired) image, (i.e. ), we can write the modified restriction as
enabling the preservation of the image details even in the presence of large innovations.
The question that arises is how to design the transformation matrix to achieve the required properties. We propose to base the design of on a stochastic model for the innovations.
V-B Statistical Properties of Innovations in Natural Image Sequences
The statistical properties of natural images have been thoroughly studied in the literature. A largely employed probabilistic model for natural images is characterized by a zero-mean and highly leptokurtotic, fat-tailed distribution, with its power spectral density remarkably close to a function, where is the absolute spatial frequency and is close to . This characterization has led, for example, to the development of sparse derivative prior models for natural images  that have been widely employed in image processing algorithms.
When it comes to obtaining accurate probabilistic models for the signals in the dynamic evolution of a video sequence, particularly the innovations, the task becomes more challenging. This is due to the dependence of the signal statistics on the generally unknown movement in the scene. With the motion frequently estimated from a low-resolution observed video sequence, the employed model must distinguish between errors originating from the image registration and errors caused by true changes in the scene, the latter often labeled as outliers.
The modeling of large magnitude changes in a scene has already been considered for the image matching problem. Hasler et. al.  proposed to consider the error patterns generated by non-coinciding regions of an aligned image pair to be similar to the error generated by comparing two random regions of the underlying scene. This relationship clearly arises in a dynamical model for a video sequence when the motion model fails to account for unpredictable changes between two adjacent images, generating an error signal that will consist of the difference between the new image and a misaligned part of the previous image. Considering the case of one dimensional signals for simplicity, the auto-correlation function of the difference between two patches of an image separated by samples can be computed as
where denotes statistical expectation, is a point in the one dimensional image, and is the image auto-correlation function. Thus, is the auto-correlation of the simulated outlier. If the covariance between the image pixels diminishes with their distance, for a sufficiently large value of the terms will become approximately equal to the square of the mean image value. Therefore, the auto-correlation function of the simulated outlier will be similar to that of a natural image.
This interpretation can be more intuitively achieved by considering a different approach, which models the innovations assuming a scene model composed by the interactions of objects in an occlusive environment . Innovations in a video sequence can be broadly described as pixels in that cannot be described as a linear combination of the pixels in (i.e. are statistically orthogonal). These pixels will be here divided as
where consists of small changes in the scene, originated, for instance from specular surfaces. can be modeled as a low power high frequency noise. represents large magnitude changes (outliers) arising due to occlusions or due to objects suddenly appearing in the scene (such as image borders). is usually sparse and compact  222Note that is not to be confused with registration errors due to the ill-posed nature of the motion estimation process. The latter can be shown to originate from a random linear combination of the pixels in ..
A region of the scene corresponding to a dis-occluded area typically reveals part of a background or object at a different depth from the camera. Hence, the nonzero pixels in will consist of highly correlated compact regions. Furthermore, the joint pixel statistics at these locations should actually be similar to that of natural scenes. This conclusion becomes straightforward if we consider, for instance, the Dead Leaves image formation model , which characterizes a natural scene by a superposition of opaque objects of random sizes and positions occluding each other. Here, a dis-occluded area would correspond to the removal of an object (or a “leaf") at random from the topmost of the z-axis. The corresponding region in the new image will therefore be composed of the next objects present on the z-axis. Since the area behind the view plane is completely filled with objects (superimposed "leaves") in this model, there is no difference between the statistical properties of a region in the foremost-top image and those of a region behind an object. This reinforces the notion of correlation obtained by considering the more generic outlier model of Hasler et al. .
To verify the proposed innovations model, we have determined the power spectral density (PSD) of synthetic images representing the innovations. These images were generated by pasting small pieces of the difference between two independent natural images with sizes ranging from to in random positions of a background. We have extracted the small pieces from 20 different natural images, so that they emulate small regions appearing in the occluded regions of a video sequence. The PSD is computed by averaging 200 realizations of a Monte Carlo (MC) simulation. Figure 3 shows the obtained result. It can be clearly seen that the energy is concentrated in the lower frequencies of the spectrum, resulting in a highly correlated signal.
V-B1 Choosing the Operator
Natural scene innovations tend to be highly correlated in space. Thus, their energy tends to be primarily concentrated in the low spatial frequencies. Hence, the operator should in general emphasize the high frequency components to accomplish the design objectives in (21). Unfortunately, the specific scenes to be processed are not known in advance, what hinders the accurate determination of the statistical properties of the innovations, and thus of the optimal operator . A simple solution with reduced computational complexity is to use a basic high-pass filter with small support, such as a differentiator or a Laplacian. For simplicity, the Laplacian filter mask will be employed during the remaining of this work. Thus, we shall use , leaving the search for an optimal operator for a future work.
Vi The Temporally Selective Regularized LMS (TSR-LMS) Algorithm
To derive the new algorithm, we propose a new cost function that minimizes the perturbation only on the details of the reconstructed image, while at the same time observing the objectives of the R-LMS algorithm. Differently from (19), the new cost function allows for more flexibility for the component of the solution in the subspace corresponding to the outlier while retaining its quality. Such strategy leads to an increased algorithm robustness.
We propose to solve the following optimization problem:
Calculating the gradient of the cost function with respect to , setting it equal to 0, solving for and approximating the statistical expectations by their instantaneous values yields the iterative equation for the TSR-LMS algorithm:
Algorithm (26) generalizes R-LMS and the least perturbation approach (19). It collapses to these solutions if or , respectively. It should perform well both with and without outliers, at the cost of little extra computational effort. Though the matrix inversion can be made a priori and the resulting inverse might be sparse, its storage is still rather costly. If is chosen to be block circulant (BC) (such as a Laplacian), then is block circulant  and can be computed as a convolution, leading to important memory savings.
Although (26) may resemble the Gradient Projection Method (GPM) , this is not generally true, as is not necessarily a projection matrix (i.e. ). Hence, the convergence and stability properties of these algorithms are not the same in general.
Vi-a The Linearized Temporally Selective Regularized LMS (LTSR-LMS)
Whereas algorithm (26) should present a good cost-benefit ratio, the aforementioned limitations motivates the pursuit of another algorithm that trades a small performance loss for a simpler implementation and a more predictable performance. This section describes one possible modification.
Since the details of the solution are minimally disturbed between iterations, we can safely assume that . Therefore, we can employ a linear approximation for the quadratic regularization introduced in the last term of (25) using a first-order Taylor series expansion of this norm with respect to the transformed variable about the point . The resulting cost function can be written as:
Note that if the algorithm initialization is selected as , the linearized regularization introduced in the last term of (27) is equal to zero for the first iteration (). Therefore, iterations per time instant are necessary in order to have an improvement over the R-LMS algorithm. This is not the case for the algorithm proposed in (26), where an improvement can be obtained even for .
By ignoring the constant terms in the optimization problem (27) and using (4) and (5), it can be shown that (27) corresponds to the proximal point cost function of the following Lagrangian for a single time time :
where . Note that by using on (VI-A), the algorithm particularizes to the well known temporal regularization, commonly employed in simultaneous video SRR in order to achieve temporal consistency [10, 7]. In this case, the algorithm is not expected to be robust since the innovations are not accounted for.
Calculating the gradient of the cost function in (27) with respect to , setting it equal to 0, solving for and approximating the statistical expectations by their instantaneous values yields the iterative equation for the LTSR-LMS algorithm based on the linearized version of the proposed regularization:
Vi-B Computational Cost of the Proposed Solution
The computational and memory costs of the (L)TSR-LMS algorithms are still comparable to those of the (R)-LMS algorithm. An important characteristic of the problem that allows a fast implementation of both the (R)-LMS and the (L)TSR-LMS methods is the spatial invariance assumption of the operators , , and , which results in them being block-circulant or block-Toeplitz matrices. In this case, the number of nonzero elements of the matrices (denoted by ) scales linearly with the number of HR image pixels (i.e. ), and so does the number of operations of the algorithms. In this case, the computational and memory costs for the algorithms considered can be seen in Tables I and II.
We now present four examples to illustrate the performace of the TSR-LMS and LTSR-LMS methods. Examples 1 and 2 use a controlled environment to assess differences in quality (section VII-A) and robustness (section VII-B) when compared to interpolation, LMS and R-LMS without the influence of unaccountable effects. Example 3 (section VII-C) compares the performances of the proposed methods to those of LMS, R-LMS and interpolation algorithms using real video sequences. Example 4 (section VII-D) evaluates the TSR-LMS and LTSR-LMS methods against state-of-the-art SRR algorithms in practical applications.
Example 1 evaluates the average performance of the algorithms without outliers, in a close-to-ideal environment. We used synthetic video sequences with small translational motion to enable Monte Carlo simulations and to be able to control the occurrence of modeling errors. The motion between frames was assumed known a priori, and the mean squared reconstruction error could be evaluated because we had access to the desired HR images. The simulation was also performed using a typical registration algorithm to evaluate the influence of motion estimation on the performances of algorithms (26) and (VI-A). A decline in performance was expected, as reported in  for the classical temporal regularization algorithms.
Example 2 evaluates the proposed algorithms in the presence of innovation outliers. A synthetic simulation emulates the case of a flying bird when an object suddenly appears in a frame or moves independently of the background, generating occlusions and leading to a high level of innovations in some specific frames of the video sequence.
Example 3 evaluates the performances of the algorithms when super-resolving real video sequences. We super-resolved a set of video sequences containing complex motion patterns and frames with large levels of innovations and registration errors. Finally, Example 4 extends Example 3 by comparing the TSR-LMS and LTSR-LMS methods with state-of-the-art SRR algorithms, namely, a Bayesian method 18]. We employ recent and challenging video sequences and compare the results for robustness, quality and computational cost.
For , the LTSR-LMS algorithm particularizes to the popular classical temporal regularization [10, 7]. We do not report this case here, as we could not obtain any improvement (quantitative or perceptual) when compared to the R-LMS () performance. The matrix employed in both algorithms for all examples shown here was a Laplacian filter. The SRR algorithms were compared to the bicubic interpolation algorithm proposed in . Both the MSE and the structural similarity index (SSIM) were considered in the evaluation. The obtained SSIM results were qualitatively similar to the MSE results. Hence, the SSIM values are provided only for the displayed images.
The boundary condition for the convolution matrices was chosen to be circulant. This simplifies the implementation and results in the inversion of a circulant matrix in (26) . We selected the boundary condition for in the global translational case to be circulant as well to simplify implementation. For the case of a dense motion field, the warped images were computed through the bilinear interpolation of the original image pixels.
Vii-a Example 1
For a Monte Carlo simulation, each HR video sequence was created based on the translation of an window over a static image, resulting in whole-image translational movements. The window displacements consisted in a random walk process (i.i.d. unitary steps) on both horizontal and vertical directions. The still images used to generate each video sequence were natural scenes such as Lena, Cameraman, Baboon and others, and were totally distinct from each other. The resulting sequence was then blurred with a uniform unitary gain mask and decimated by a factor of , resulting in LR images of dimension . Finally, white Gaussian noise with variance was added to the decimated images. The algorithm performances were evaluated by averaging 50 realizations, one for each input image.
The performances of the different algorithms are highly dependent on the parameters selected, as verified in section IV-A). Hence, the parameters were carefully selected to yield an honest comparison. We selected the parameters for each method to achieve the minimum steady-state MSE (i.e. for large ). The steady-state MSE for each set of parameters was estimated by running an exhaustive search over a small, independent set of images and averaging the squared errors for the last 5 frames. Table III shows the parameter values that resulted in the best performance for each algorithm.
We applied both standard and regularized versions of LMS and the proposed methods to super resolve the synthetic sequences, all initialized with as a bicubic (spline) interpolation of the first LR image. First, we considered the motion vectors to be known a priori to avoid the influence of a registration algorithm, and used iterations per time instant. The super-resolved sequences were compared to the original HR sequence and the MSE was computed across all realizations.
The MSE performance is depicted in Figure 4. It can be seen that the proposed methods outperform LMS and R-LMS. Moreover, both proposed algorithms achieve the same MSE given enough time. Finally, the LTSR-LMS algorithm reaches the minimum MSE faster than the original TSR-LMS algorithm in the absence of outliers.
For a more realistic evaluation, we repeated the MC simulation considering the influence of registration errors. The Horn & Schunck registration algorithm [44, 45] was employed333The parameters were set as: lambda=110, pyramid_levels=4, pyramid_spacing=2., with the velocity fields averaged across the entire image to compute the global displacements. The algorithm parameters were the same used in the previous simulation. The resulting MSE results are depicted in Figure 5, and an example of a reconstructed image of a resolution test chart is shown in Figure 6.
It can be verified that the proposed methods still outperform the conventional (R-)LMS algorithms, though by a smaller margin due to the effect of registration errors. It should be noted that large levels of registration errors tend to reduce the effectiveness of the TSR-LMS and LTSR-LMS methods, as we are basically preserving information (details) from previous frames that must be registered. Furthermore, the performance of the TSR-LMS algorithm showed greater sensitivity to unknown registration, as its performance degraded more when compared to the LTSR-LMS algorithm. The images reconstructed by the four evaluated algorithms were found to be perceptually similar, although a careful inspection reveals a slight improvement in the reconstruction result using the LTSR-LMS algorithm. Nevertheless, the following examples will illustrate that the proposed methods perform considerably better than the others in the presence of outliers.
Vii-B Example 2
This example evaluates the robustness to innovation outliers. This was done by super-resolving synthetic video sequences containing a suddenly appearing object, which is independent of the background. The first MC simulation of Example 1 was repeated, this time with the inclusion of an black square appearing in the middle of the 32nd frame and disappearing in the 35th frame of every sequence, thus emulating the behavior of a flying bird outlier on the scene.
The MSE evolution for the parameters shown in Table III is shown in Figures 7-(a) and 7-(c) (zoomed). The improvement provided by the LTSR-LMS algorithm is clearly significant, suggesting its greater robustness to outliers.
To improve the design of the proposed algorithms, we performed exhaustive searches in the parameter space of all algorithms to determine good sets of parameters for reconstructing a small independent set of images. The parameters in Table IV yielded the minimum MSE averaged between frames and for each algorithm. The MSE evolutions for these parameters are shown in Figures 7-(b) and 7-(d).
The proposed methods led to a significant performance gain compared to the other algorithms when in the presence of outliers in frames 32 and 35, with a slightly better results verified for the TSR-LMS method.
While R-LMS led to a MSE similar to that achieved by the TSR-LMS and LTSR-LMS algorithms for frames 33 and 34 (when the black square remained in the scene), its performance degraded considerably for larger . Note also that the loss in steady-state performance as a result of the optimization to handle outliers was higher for LMS and R-LMS. Finally, the LTSR-LMS algorithm showed to be less sensitive to the parameter selection, performing reasonably well in both simulations. Figure 8 shows the reconstructed images for frame 32, which confirm the quantitative results. The black square introduced in the sequence is significantly better represented for the proposed methods (when it is indeed present in the HR image), and a slight improvement can be noticed in the result of the TSR-LMS algorithm when compared to that of the LTSR-LMS.
Vii-C Example 3
This example illustrates the effectiveness of the proposed methods when super-resolving real video sequences. We performed a Monte Carlo simulation with 15 realizations consisting of natural video sequences, like Foreman, Harbour, News and others. In this case, the true motions of the objects and camera are unknown. We used Horn & Shunck algorithm with the same parameters shown in Table IV to estimate the dense velocity field, but now considering the displacement to be unique for each image pixel.
For a quantitative evaluation, the original videos were used as available HR image sequences. For simplicity, only the upper-right region of the original sequence was considered so that the resulting images were square. Like in Example 1, the HR sequences were blurred with an uniform unitary gain mask, decimated by a factor of and contaminated with white Gaussian noise with variance to form the LR images. The standard LMS versions and the proposed methods were used to super-resolve the LR sequences with iterations per time sample and the parameters set at the values in Table IV. Hence, they were not guaranteed to be optimal, as the amount of innovations is different and the motion is not known in advance.
Figure 9 shows the MSE evolutions, which indicate a better performance for the TSR-LMS and LTSR-LMS methods, both of which performed similarly. The improvement offered by the proposed methods can be most clearly observed when a high degree of innovations is present in the scene. It can be noted that the reconstruction error exhibits a more regular behavior across the entire sequence, without being considerably influenced by the outliers, which cause significant spikes in the MSE of the LMS and R-LMS algorithms. To illustrate this scenario, Figure 10 shows the 93rd super-resolved frame of the Foreman sequence. The advantage of the proposed methods becomes apparent through a more clear reconstruction result, as opposed to a vast amount of artifacts found in the images reconstructed by the LMS and R-LMS algorithms, mainly in the regions where innovations are present. The MSE performances of the four methods are less discernible at the time intervals where the amount of innovations is less significant. Nevertheless, the difference in the perceptual quality of the reconstructed images is still noticeable. For instance, Figure 11 shows the reconstruction results of part of the 33rd frame of the Foreman sequence. Although for this frame the MSE differences between the results of the four algorithms are small, the images super-resolved by the proposed methods still offer a better perceptual quality with reduced artifacts where small localized motion occur (e.g. close to the man’s mouth).
|Bicubic||LMS||R-LMS||TSR-LMS||LTSR-LMS||Bayesian ||CNN |
|1.56 s||0.46 s||0.53 s||0.60 s||0.62 s||49.68 s||81.72 s|
Vii-D Example 4
This example illustrates the performance of the algorithms using recent and challenging video sequences taken from the DAVIS dataset . We extracted six reference HR sequences from videos with resolution of pixels, and generated the degraded LR sequences from them as in Example 3.
To compare performances with recent state-of-the-art algorithms, we super-resolved these sequences using interpolation, LMS, R-LMS and two more recent video SRR algorithms, namely, the adaptive Bayesian method from  and the Convolutional Neural Network (CNN) from . For the method of  we fixed the solution to the blur matrix estimation of at the optimal value and used the same registration algorithm of , which was also used for the other methods. The CNN has an embedded registration algorithm which could not be modified. The parameters used for LMS, R-LMS, TSR-LMS and LTSR-LMS methods were the same as in Table IV (which were determined based on the simulations with synthetic sequences in Example 2). The CNN 
was implemented in Python using TensorFlow. The other methods were implemented in Matlab. Codes for methods from  and  were provided by the respective authors. All simulations were executed on a desktop computer with an Intel Core I7 processor with 4.2Ghz and 16Gb of RAM.
We assess the performances of the different methods both quantitatively and visually. The peak signal to noise ratio (PSNR) for all algorithms and video sequences is presented in Table VI. It can be verified that the proposed methods usually led to an image quality that is at least comparable with (usually better than) that resulting from using the competing algorithms. Excerpts from reconstruction results of two sequences444More extensive results are available in the supplementary document. Illustrative video excerpts are also included as supplementary material on https://www.dropbox.com/sh/hmyp5nks2sbj6to/AADnSqRE5YpTDJheLzDlTUI3a?dl=0., shown in Figs. 12 and 13, also support this conclusion. Fig. 12 shows that the proposed methods yield a significantly better resolution than the algorithms in  and . Fig. 13 is an example in which the competing methods performed well, specially the CNN  as can be noticed in the brick wall. Still, the proposed methods provided good reconstruction quality and show less influence of noise.
The methods of  and  also show robustness to innovation outliers, as can be noticed in the horse’s tail in Fig. 13 where the reconstructed images do not exhibit artifacts like the LMS and R-LMS algorithms. The proposed methods also show significantly less artifacts than the LMS and R-LMS algorithms, albeit not being as clear as the methods in  and .
Table V shows the execution times555The execution time of the CNN  includes image registration, since the SRR time cannot be measured separately in the provided implementation. for Example 4. Algorithms  and  were approximately two orders of magnitude slower than the remaining methods (except for bicubic interpolation). These results and the comparable reconstruction quality clearly show that the proposed algorithms are competitive in term of quality and robustness at a much lower computational cost.
The implementations of the proposed algorithms were not optimized, and thus they could not be tested in real-time. However, real-time implementation is perfectly within reach using existing devices. For instance, the LTSR-LMS algorithm needs approximately billion floating point operations (GFLOPS) to process each frame in the current example. Consider using a fast image registration algorithm such as , which has a computational complexity of , with being a small image window and the maximum displacement amplitude. For typical values and , the image registration cost is GFLOPS per frame. Now, for real-time performance (at 30 frames per second), the aggregate cost of image registration and SRR becomes GFLOPS/second, which is well within the capability of graphical processing units released almost a decade ago .
This work proposed a new super-resolution reconstruction method aimed at an increased robustness to innovation outliers in real-time operation. An intuitive interpretation was proposed for the proximal-point cost function representation of the R-LMS gradient descent algorithm. A new regularization was then devised using statistical information on the innovations. This new regularization allowed for faster convergence of the solution in the subspace related to the innovations, while preserving previously estimated details. Two new algorithms were derived which present an increased robustness to outliers when compared to the R-LMS, with only a modest increase in the resulting computational cost. Computer simulations both with synthetic and real video sequences illustrated the effectiveness of the proposed methods.
-  R. A. Borsoi, G. H. Costa, and J. C. M. Bermudez, “A new adaptive video SRR algorithm with improved robustness to innovations,” in Signal Processing Conference (EUSIPCO), 2017 25th European, pp. 1505–1509, IEEE, 2017.
-  S. C. Park, M. K. Park, and M. G. Kang, “Super-resolution image reconstruction: a technical overview,” Signal Processing Magazine, IEEE, vol. 20, no. 3, pp. 21–36, 2003.
-  K. Nasrollahi and T. Moeslund, “Super-resolution: A comprehensive survey,” Mach. Vision & Appl., vol. 25, no. 6, pp. 1423–1474, 2014.
-  S. Borman and R. L. Stevenson, “Simultaneous multi-frame MAP super-resolution video enhancement using spatio-temporal priors,” in Image Processing, 1999. ICIP 99. Proceedings. 1999 International Conference on, vol. 3, pp. 469–473, IEEE, 1999.
-  J. Tian and K.-K. Ma, “A new state-space approach for super-resolution image sequence reconstruction,” in Image Processing, 2005. ICIP 2005. IEEE International Conference on, vol. 1, pp. I–881, IEEE, 2005.
-  J. Tian and K.-K. Ma, “A state-space super-resolution approach for video reconstruction,” Signal, image and video processing, vol. 3, no. 3, pp. 217–240, 2009.
-  M. V. W. Zibetti and J. Mayer, “A robust and computationally efficient simultaneous super-resolution scheme for image sequences,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 17, no. 10, pp. 1288–1300, 2007.
-  S. P. Belekos, N. P. Galatsanos, and A. K. Katsaggelos, “Maximum a posteriori video super-resolution using a new multichannel image prior,” IEEE Trans. Image Process., vol. 19, no. 6, pp. 1451–1464, 2010.
-  H. Su, Y. Wu, and J. Zhou, “Adaptive incremental video super-resolution with temporal consistency,” in Image Processing (ICIP), 2011 18th IEEE International Conference on, pp. 1149–1152, IEEE, 2011.
-  M. G. Choi, N. P. Galatsanos, and A. K. Katsaggelos, “Multichannel regularized iterative restoration of motion compensated image sequences,” Journal of Visual Communication and Image Representation, vol. 7, no. 3, pp. 244–258, 1996.
-  H. Takeda, P. Milanfar, M. Protter, and M. Elad, “Super-resolution without explicit subpixel motion estimation,” Image Processing, IEEE Transactions on, vol. 18, no. 9, pp. 1958–1975, 2009.
-  M. Protter, M. Elad, H. Takeda, and P. Milanfar, “Generalizing the nonlocal-means to super-resolution reconstruction,” Image Processing, IEEE Transactions on, vol. 18, no. 1, pp. 36–51, 2009.
-  N. Barzigar, A. Roozgard, P. Verma, and S. Cheng, “A video super-resolution framework using SCoBeP,” IEEE Transactions on circuits and systems for video technology, vol. 26, no. 2, pp. 264–277, 2016.
-  S. D. Babacan, R. Molina, and A. K. Katsaggelos, “Variational bayesian super resolution,” Image Processing, IEEE Transactions on, vol. 20, no. 4, pp. 984–999, 2011.
-  C. Liu and D. Sun, “On bayesian adaptive video super resolution,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 2, pp. 346–360, 2014.
-  A. Kappeler, S. Yoo, Q. Dai, and A. K. Katsaggelos, “Video super-resolution with convolutional neural networks,” IEEE Transactions on Computational Imaging, vol. 2, no. 2, pp. 109–122, 2016.
-  D. Liu, Z. Wang, Y. Fan, X. Liu, Z. Wang, S. Chang, and T. Huang, “Robust video super-resolution with learned temporal dynamics,” in , pp. 2507–2515, 2017.
-  X. Tao, H. Gao, R. Liao, J. Wang, and J. Jia, “Detail-revealing deep video super-resolution,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
-  A. Bhowmik, S. Shit, and C. S. Seelamantula, “Training-free, single-image super-resolution using a dynamic convolutional network,” IEEE Signal Processing Letters, 2017.
-  S. Farsiu, D. Robinson, M. Elad, and P. Milanfar, “Advances and challenges in super-resolution,” International Journal of Imaging Systems and Technology, vol. 14, no. 2, pp. 47–57, 2004.
-  H. Song, L. Zhang, P. Wang, K. Zhang, and X. Li, “An adaptive l–l hybrid error model to super-resolution,” in Image Processing (ICIP), 17th IEEE International Conference on, pp. 2821–2824, IEEE, 2010.
-  M. K. Ng and N. K. Bose, “Mathematical analysis of super-resolution methodology,” IEEE Signal Processing Magazine, vol. 20, no. 3, pp. 62–74, 2003.
-  M. Elad and A. Feuer, “Superresolution restoration of an image sequence: Adaptive filtering approach,” Trans. Img. Proc., IEEE, vol. 8, pp. 387–395, Mar. 1999.
-  M. Elad and A. Feuer, “Super-resolution reconstruction of image sequences,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 21, no. 9, pp. 817–834, 1999.
-  S. Farsiu, M. D. Robinson, M. Elad, and P. Milanfar, “Fast and robust multiframe super resolution,” Image processing, IEEE Transactions on, vol. 13, no. 10, pp. 1327–1344, 2004.
-  R. Hardie, “A fast image super-resolution algorithm using an adaptive Wiener filter,” Trans. Img. Proc., vol. 16, pp. 2953–2964, Dec. 2007.
-  G. H. Costa and J. C. M. Bermudez, “Statistical analysis of the LMS algorithm applied to super-resolution image reconstruction,” Signal Processing, IEEE Transactions on, vol. 55, no. 5, pp. 2084–2095, 2007.
-  G. H. Costa and J. C. M. Bermudez, “Informed choice of the LMS parameters in super-resolution video reconstruction applications,” Signal Processing, IEEE Transactions on, vol. 56, no. 2, pp. 555–564, 2008.
-  G. H. Costa and J. C. M. Bermudez, “Registration errors: Are they always bad for super-resolution?,” Signal Processing, IEEE Transactions on, vol. 57, no. 10, pp. 3815–3826, 2009.
-  T. Blu, P. Thévenaz, and M. Unser, “MOMS: Maximal-order interpolation of minimal support,” IEEE Transactions on Image Processing, vol. 10, no. 7, pp. 1069–1080, 2001.
-  R. C. Gonzalez and R. E. Woods, Digital image processing. Prentice hall, 1 ed., 2002.
-  A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM J. Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.
-  D. P. Bertsekas, Nonlinear programming. Athena scientific Belmont, 1999.
-  S. Haykin, Adaptive Filter Theory. Prentice Hall, 2 ed., 1991.
-  A. H. Sayed, Fundamentals of adaptive filtering. John Wiley & Sons, 2003.
-  M. Richter, C. Dolar, H. Schroder, P. Springer, and O. Erdler, “Spatio-temporal regularization featuring novel temporal priors and multiple reference motion estimation,” in Broadband Multimedia Systems and Broadcasting (BMSB), 2011 IEEE International Symposium on, pp. 1–6, IEEE, 2011.
-  v. A. Van der Schaaf and J. v. van Hateren, “Modelling the power spectra of natural images: statistics and information,” Vision research, vol. 36, no. 17, pp. 2759–2770, 1996.
-  M. F. Tappen, B. C. Russell, and W. T. Freeman, “Exploiting the sparse derivative prior for super-resolution and image demosaicing,” in IEEE Workshop on Statistical and Computational Theories of Vision, 2003.
-  D. Hasler, L. Sbaiz, S. Süsstrunk, and M. Vetterli, “Outlier modeling in image matching,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 25, no. 3, pp. 301–315, 2003.
-  A. B. Lee, D. Mumford, and J. Huang, “Occlusion models for natural images: A statistical study of a scale-invariant dead leaves model,” Int. J. Comput. Vision, vol. 41, no. 1-2, pp. 35–59, 2001.
-  S. Baker, D. Scharstein, J. Lewis, S. Roth, M. J. Black, and R. Szeliski, “A database and evaluation methodology for optical flow,” International Journal of Computer Vision, vol. 92, no. 1, pp. 1–31, 2011.
-  T. De Mazancourt and D. Gerlic, “The inverse of a block-circulant matrix,” Antennas and Propagation, IEEE Transactions on, vol. 31, no. 5, pp. 808–810, 1983.
-  D. P. Bertsekas, “On the Goldstein-Levitin-Polyak gradient projection method,” Automatic Control, IEEE Transactions on, vol. 21, no. 2, pp. 174–184, 1976.
-  B. K. Horn and B. G. Schunck, “Determining optical flow,” Artificial intelligence, vol. 17, no. 1-3, pp. 185–203, 1981.
-  D. Sun, S. Roth, and M. J. Black, “Secrets of optical flow estimation and their principles,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 2432–2439, IEEE, 2010.
-  J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool, “The 2017 DAVIS challenge on video object segmentation,” arXiv preprint arXiv:1704.00675, 2017.
-  G. Caner, A. M. Tekalp, G. Sharma, and W. Heinzelman, “Local image registration by adaptive filtering,” IEEE Transactions on Image Processing, vol. 15, no. 10, pp. 3053–3065, 2006.
-  V. Surkov, “Parallel option pricing with fourier space time-stepping method on graphics processing units,” Parallel Computing, vol. 36, no. 7, pp. 372–380, 2010.