1 Introduction
Nowadays, many computer vision and image analysis tasks are tackled by means of pattern recognition and machine learning techniques. This work makes some initial steps in the opposite direction. It does not reject the learning approach to computer vision, but it shows how tools from computer vision, and especially variational methods, can aid in efficiently solving some of the basic estimation tasks that machine learners and pattern recognizers come across. The particular issue we consider is the problem of scale that, one way or the other, emerges in any learning task involving images or videos. As so often, however, it is overlooked and/or dealt with in a way that leaves much to be desired. This work sits at the interface of computer vision and learning techniques, and we draw heavily on terminology from both fields. The main concepts from machine learning and pattern recognition that we use are learning (or training) from examples (especially in relation to linear regression) and the ideas underlying artificial neural networks, and convolutional neural networks in particular (see, for instance, [1, 2] and [3, 4]). Nonetheless, we think that researchers schooled in scale space and variational methods should be able to follow our main line of thought.
1.1 Pixel-based Classification and Regression
Supervised classification and regression techniques have been applied to a broad range of challenging image processing and analysis tasks. Learning-based pixel classification has been around at least since the 1960s. Early studies seem to have been conducted particularly within the field of remote sensing and abutting areas [5]. Though these approaches initially seem to have focused primarily on the use of the multiple spectral bands that the observations consisted of, later work also included spatial features based on derivative operators, texture measures, and the like (cf. [6]). An early overview of the general applicability of pixel classification can be found, for instance, in [7].
Training image filters on the basis of given input-output pairs by means of regression techniques seems to have been considered less often. The problem, as opposed to pixel classification, may be obtaining proper input and output image pairs. Possibly for this reason, the most often studied application is the prediction of (supposedly) noiseless images from images corrupted with a known noise model, as in this case the input-output pairs are easily generated. Perhaps the first paper to consider such an option is [8], but the approach may only have become popular now that it has been presented at a more fashionable venue [9]. A few years later, more advanced applications found their way into medical image analysis, in particular for filtering complex image structures out of chest radiographs [10, 11].
Whereas not so many years ago pixel-based methods relied on features extracted by means of more or less complex, linear or nonlinear filter banks, the past decade has seen a trend toward so-called representation learning [12]. The idea is to avoid any initial (explicit) bias in the learning that results from prespecifying the particular image features that are going to be used. Rather, one relies on raw input data (images in our case) and a complex learner that is capable of simulating the necessary filtering based on the raw input. Particular architectures that are used as learners are so-called deep networks, which are simply a specific type of artificial neural network. Two examples of the use of such networks in image denoising can be found in [13] and the earlier mentioned work [9]. A first approach to supervised segmentation using these methods can be found in [14].
We see the current work in the light of these developments in representation learning, although our results are not "deep". In fact, here we deal with a shallow network consisting of a single linear convolutional neuron [12], a basic element in the more complex deep structures referred to above. The problem we focus on is inferring an image-to-image mapping, which is not necessarily limited to image denoising. The core of the issue we study is how to control the complexity of that single neuron. In our case, this is achieved by controlling the aperture scale at which the neuronal mapping operates. In current applications of deep networks, notably convolutional neural networks [4], the spatial extent from which the different layers draw their information is coarsely modeled by a rectangle with preset dimensions. We propose not to fix the spatial range explicitly in advance and, basically, to let every pixel intensity in every location potentially influence any other pixel. We integrate the influence of scale by adding a regularization term to the overall objective function that is used to determine the fit of the neuron to the training data. In this way, we can trade off the influence of the training data and the scale of the aperture in a gradual and controlled way.
1.2 Outline
Section 2 formulates the initial problem setting in mathematical terms. The loss on the data term considered is the regular squared error, so we are basically dealing with standard least squares linear regression. The section shows that our non-regularized prediction problem can be seen as a convolution in which the convolution kernel is to be determined. As it turns out, the formulation allows us to solve regression problems in feature spaces of very high dimensionality (as we are in the setting of representation learning, the features are pixel values here) and with even larger numbers of observations. Section 3, covering the main part of our theory, argues that some form of regularization would typically be necessary, after which it introduces and explains our scale-regularized objective function. It also shows how to reformulate the optimization problem so that its minimization can be performed by means of a variational method, and finally sketches a basic scheme to arrive at an actual solution. Section 4 provides some limited and artificial, yet illustrative, examples, and Section 5 discusses and concludes.
2 Regression and Supervised Filter Learning
Let us initially consider a set of $N$ input-output pairs of images $(x_i, y_i)$ defined on the full image domain $\Omega$: $x_i, y_i : \Omega \to \mathbb{R}$. We do not consider multiband or multispectral images, but our theory is equally applicable to this setting. Given these pairs, which we will refer to as training images, we would like to infer a transformation $T$ that can be applied to any new and unseen input image $x$, such that it optimally predicts its associated, and unobserved, output $y$. The expected least squares loss between the true output $y$ and the prediction by $T$ is typically used to define optimality of the transformation $T$:
(1) $\iint \| T(x) - y \|^2 \, dP(x, y),$
where $\|\cdot\|$ is the $L^2$ norm on $\Omega$ and we tacitly assume the integrals exist. In the absence of any precise knowledge of $P$, the prior over pairs of input and output images, the true expected loss must be approximated. If there is training data available we may rely on the empirical risk, which is determined by substituting the empirical distribution of our observations for $P$, leading (up to the constant factor $1/N$, which does not affect the minimizer) to the objective
(2) $\sum_{i=1}^{N} \| T(x_i) - y_i \|^2.$
In many a setting, the transformation $T$ would be taken to be translation invariant. This is the situation we consider here as well. In fact, since we focus on a single linear convolutional neuron, $T$ reduces to a simple linear convolution by means of a kernel $k$. Equation (2) therefore simplifies to
(3) $\sum_{i=1}^{N} \| k * x_i - y_i \|^2.$
Denoting the Fourier transform by $\mathcal{F}$ or $\hat{\cdot}$, the optimal solution to the above equation can be obtained as
(4) $k = \mathcal{F}^{-1}\!\left( \dfrac{\sum_{i=1}^{N} \hat{x}_i^{*} \, \hat{y}_i}{\sum_{i=1}^{N} \hat{x}_i^{*} \, \hat{x}_i} \right),$
where $^{*}$ denotes complex conjugation and the division is pointwise. This formulation, in fact, allows us to efficiently solve an image regression problem in, potentially, very high-dimensional feature spaces. To see that Equation (3) basically formulates a regular linear regression problem, note that $(k * x_i)(\xi)$ equals $\int_\Omega k(\zeta) \, x_i(\xi - \zeta) \, d\zeta$, where $x_i(\xi - \zeta)$ are the explanatory variables or the feature "vectors" indexed by the variable $\zeta$ and $k$ can be interpreted as an estimate for the true regression parameters. Indeed, instead of using patches of limited size to capture the contextual information around every pixel location, this formulation basically takes the whole image (centered around $\xi$) to be the patch for every location $\xi$.
2.1 Regression Problem Size, an Example
Consider the Brodatz image data set that we are going to experiment with later on [15]. The set consists of 112 images of dimensions $640 \times 640$, which we take as the input images (some examples are shown in Figure 1).
Let us assume that we have corresponding output images for all of the 112 Brodatz images, which are the original images corrupted by an unknown convolution filter and additive Gaussian noise. Finding a filter that is optimal in the empirical least squares sense means that one would actually have to solve a linear regression problem in more than 400 thousand dimensions ($640 \times 640 = 409{,}600$), coming from the patch size we consider. The number of instances we would base the learning on is $112 \times 409{,}600 = 45{,}875{,}200$, which is more than 45 million, as there are $409{,}600$ locations per image and we have $112$ images.
Solving this problem in the standard way by means of linear regression would, among other things, mean that we have to invert a covariance matrix of size $409{,}600 \times 409{,}600$, which is practically impossible. Because of the convolutional structure of the problem, however, explicit matrix inversion can be avoided and the computationally most demanding part in Equation (4) is the Fourier transformation. Relying on the fast Fourier transform, the necessary computations to find the more than 400 thousand weights of our neuron (encoded through $k$) can be done in one or two seconds, even on a modest laptop.
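As an illustration of how Equation (4) might be evaluated in practice, the following is a minimal NumPy sketch, not the authors' implementation. It assumes periodic (circular) convolution so that the discrete Fourier transform diagonalizes the problem; the function name `learn_filter_fft` and the synthetic toy data are hypothetical and only serve the example.

```python
import numpy as np


def learn_filter_fft(inputs, outputs):
    """Unregularized least-squares kernel in the spirit of Equation (4).

    Assumption: convolution is circular (periodic images), so the DFT turns
    it into a pointwise product. `inputs` and `outputs` have shape (N, H, W).
    """
    X = np.fft.fft2(np.asarray(inputs, dtype=float), axes=(-2, -1))
    Y = np.fft.fft2(np.asarray(outputs, dtype=float), axes=(-2, -1))
    A = np.sum(np.abs(X) ** 2, axis=0)   # sum over images of |x_hat_i|^2
    B = np.sum(np.conj(X) * Y, axis=0)   # sum over images of conj(x_hat_i) * y_hat_i
    k_hat = B / A                        # ill-defined wherever A vanishes
    return np.real(np.fft.ifft2(k_hat))


# Toy check on synthetic data (random images, not the Brodatz set).
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 32, 32))
k_true = np.zeros((32, 32))
k_true[0, 0], k_true[0, 1], k_true[1, 0] = 1.0, -0.5, -0.5
y = np.real(np.fft.ifft2(np.fft.fft2(x, axes=(-2, -1)) * np.fft.fft2(k_true)))
k_est = learn_filter_fft(x, y)
print("max abs error:", np.max(np.abs(k_est - k_true)))
```

In this noiseless toy setting the kernel is recovered up to rounding error; the cost is dominated by the FFTs, and no matrix of size (number of pixels) squared is ever formed.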
3 Scale Regularization
Using basic image processing techniques, for every pixel location in an image, we can actually include all other image values as context in its feature vector and still solve the high-dimensional regression problem efficiently. Nonetheless, there is a good reason why a convolutional neural network would restrict the extent of every filter to an area considerably smaller than the whole image. Estimation in such high-dimensional spaces easily leads to overtraining or overfitting, as there are too many free parameters to be estimated compared to the number of observations that may be available. But it seems reasonable to assume that the useful predictive information for a particular location in an image is more likely to come from nearby locations than from pixel values far away.
The current way to exploit this kind of prior knowledge is by explicitly extracting patches of limited size around every pixel location and basing the regression on these features only. Equivalently, the convolutional objective function in Equation (3) can be adapted to do the same by simply restricting the support of the filter $k$, i.e., one can minimize Equation (3) for $k$ under the constraint that its support $\operatorname{supp}(k)$ is an appropriate subset of $\Omega$. Typically, $\operatorname{supp}(k)$ is just taken to be a square patch.
Here, we suggest taking care of scale in what we think is a more proper way. Instead of restricting the influence of surrounding pixel values to a particular region explicitly, we propose to gradually suppress the influence of more and more distant pixel values by means of a scale-sensitive regularization term on the kernel $k$, which we add to our original least squares objective function in Equation (3). In particular, we consider minimizing the following:
(5) $\sum_{i=1}^{N} \| k * x_i - y_i \|^2 + \lambda \int_\Omega |\xi|^2 \, k(\xi)^2 \, d\xi,$
where $\lambda \geq 0$ controls the scale.
The primary characteristic of the regularizing term is that larger values for $|k(\xi)|$ should be discouraged the further away one gets from the center of the kernel. Clearly, various other formulations would have been possible, but the current suggestion has some appealing properties. Firstly, the polynomial $|\xi|^2$ is rotationally invariant. Secondly, it is homogeneous, so changing the unit in which we measure distance to the kernel center can equivalently be accommodated by changing $\lambda$, i.e., the effect of substituting $a\xi$ for $\xi$ can also be achieved by substituting $a^2\lambda$ for $\lambda$. Still, such properties would hold for any choice of power, not only for the square. Choosing the square, however, leads to a variational problem that is relatively easy to solve, which allows us to retain some of the computationally attractive properties of the original formulation in (3).
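The homogeneity property of the regularizer can be checked numerically on a discretized kernel. The sketch below is only illustrative: the function `scale_penalty`, its `unit` parameter, and the choice of placing the kernel center in the middle of the array are assumptions made for this example, and the integral is replaced by a plain sum over pixels.

```python
import numpy as np


def scale_penalty(kernel, lam, unit=1.0):
    """Discrete stand-in for lam * integral of |xi|^2 k(xi)^2.

    Assumptions: the kernel centre sits at index (h // 2, w // 2) and
    `unit` is the length assigned to one pixel step.
    """
    h, w = kernel.shape
    rows = unit * (np.arange(h) - h // 2)
    cols = unit * (np.arange(w) - w // 2)
    r2 = rows[:, None] ** 2 + cols[None, :] ** 2   # squared distance to the centre
    return lam * np.sum(r2 * kernel ** 2)


rng = np.random.default_rng(0)
k_demo = rng.normal(size=(9, 9))
a = 3.0
# Rescaling distances by a has the same effect as rescaling lambda by a^2.
print(scale_penalty(k_demo, 1.0, unit=a), scale_penalty(k_demo, a ** 2, unit=1.0))
```

The same function also makes the intended behavior concrete: a unit weight placed five pixels from the center is penalized 25 times more than the same weight placed one pixel away.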
3.1 Minimization
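The choice of the regularization term in (5) makes the minimization easy via the Fourier transform: using the differentiation properties of the Fourier transform as well as Plancherel's theorem, one gets for an $L^2$ function $f$, up to a constant factor that depends on the normalization of the Fourier transform:
(6) $\int |\xi|^2 \, |f(\xi)|^2 \, d\xi = \sum_{j} \int |\xi_j f(\xi)|^2 \, d\xi$
(7) $\phantom{\int |\xi|^2 \, |f(\xi)|^2 \, d\xi} = \sum_{j} \int \left| \partial_{\omega_j} \hat{f}(\omega) \right|^2 d\omega = \int |\nabla \hat{f}(\omega)|^2 \, d\omega.$
Using the properties of the convolution and the Fourier transform, we can rewrite the criterion in Equation (5) as
(8) $\sum_{i=1}^{N} \int |\hat{k} \, \hat{x}_i - \hat{y}_i|^2 \, d\omega + \lambda \int |\nabla \hat{k}|^2 \, d\omega,$
which is a Tikhonov regularization of the regression problem. By letting $A = \sum_{i=1}^{N} \hat{x}_i^{*} \hat{x}_i = \sum_{i=1}^{N} |\hat{x}_i|^2$ and $B = \sum_{i=1}^{N} \hat{x}_i^{*} \hat{y}_i$, we can rewrite it, up to an additive constant that does not depend on $\hat{k}$, as
(9) $\int \left( A \, |\hat{k}|^2 - B^{*} \hat{k} - B \, \hat{k}^{*} + \lambda \, |\nabla \hat{k}|^2 \right) d\omega.$
Computing the first variation of Equation (9) gives the optimality condition
(10) $A \, \hat{k} - \lambda \, \Delta \hat{k} = B,$
where $z^{*}$ denotes the Hermitian adjoint of $z$, here simply the pointwise complex conjugate $\overline{z}$, and $\Delta$ is the Laplacian with respect to $\omega$. Note that because the $\hat{x}_i$ and $\hat{y}_i$ are Fourier transforms of real functions they are Hermitian, i.e., they satisfy the equation
(11) $\hat{f}(-\omega) = \hat{f}(\omega)^{*}.$
Of course the solution $\hat{k}$ will be Hermitian as well, as is easily checked.
More importantly from a computational perspective, note that $A$ and $B$ are, next to the value of $\lambda$, the only inputs to the optimization one needs. The size of both $A$ and $B$ equals the original image size and does not depend on the number of training images. This makes explicit that no matter how many training images one uses, once we have $A$ and $B$, the computational complexity of getting to a solution for Equation (10) remains the same.
Now, the actual numerical minimization for the 2-dimensional images in our experiment is carried out using a standard 5-point stencil for the Laplacian with periodic boundary conditions. The resulting system is solved by Jacobi relaxation and reads as follows:
(12) $\hat{k}^{(n+1)}_{p,q} = \dfrac{B_{p,q} + \lambda \left( \hat{k}^{(n)}_{p+1,q} + \hat{k}^{(n)}_{p-1,q} + \hat{k}^{(n)}_{p,q+1} + \hat{k}^{(n)}_{p,q-1} \right)}{A_{p,q} + 4\lambda}$
(omitting boundary conditions). Though faster solvers are possible, the use of the Jacobi solver automatically enforces at each iteration the discrete counterpart of the Hermitian relation in Equation (11) and therefore, at each iteration, $\hat{k}$ remains the Fourier transform of a real signal. Finally, note that in Equation (12) the regularizing effect of a positive $\lambda$ is partly visible, as it keeps the denominator bounded away from zero.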
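A Jacobi relaxation of this kind can be sketched in a few lines of NumPy. The code below is a hedged illustration under stated assumptions rather than the implementation used for the experiments: it assumes unit grid spacing in the frequency domain (any other spacing can be absorbed into the regularization weight), periodic boundaries handled with `np.roll`, a fixed number of sweeps, and a zero initial guess. The names `jacobi_scale_regularized` and `residual` are hypothetical.

```python
import numpy as np


def neighbour_sum(k_hat):
    """Sum of the four periodic nearest neighbours (5-point stencil without the centre)."""
    return (np.roll(k_hat, 1, axis=0) + np.roll(k_hat, -1, axis=0)
            + np.roll(k_hat, 1, axis=1) + np.roll(k_hat, -1, axis=1))


def jacobi_scale_regularized(A, B, lam, n_iter=1000):
    """Jacobi sweeps for A * k_hat - lam * Laplacian(k_hat) = B on a periodic grid."""
    k_hat = np.zeros_like(B, dtype=complex)
    for _ in range(n_iter):
        # Every entry is updated from the previous iterate only (Jacobi, not Gauss-Seidel).
        k_hat = (B + lam * neighbour_sum(k_hat)) / (A + 4.0 * lam)
    return k_hat


def residual(A, B, lam, k_hat):
    """Left-hand side minus right-hand side of the discretized optimality condition."""
    laplacian = neighbour_sum(k_hat) - 4.0 * k_hat
    return A * k_hat - lam * laplacian - B


# Toy problem with random "images"; A and B are built as defined in the text.
rng = np.random.default_rng(1)
x = rng.normal(size=(4, 16, 16))
y = rng.normal(size=(4, 16, 16))
X = np.fft.fft2(x, axes=(-2, -1))
Y = np.fft.fft2(y, axes=(-2, -1))
A = np.sum(np.abs(X) ** 2, axis=0)
B = np.sum(np.conj(X) * Y, axis=0)

k_hat = jacobi_scale_regularized(A, B, lam=5.0)
k = np.fft.ifft2(k_hat)
print("max residual:", np.max(np.abs(residual(A, B, 5.0, k_hat))))
print("max imaginary part of k:", np.max(np.abs(k.imag)))
```

Because each sweep treats a frequency and its mirror image symmetrically, the iterate stays Hermitian and the spatial kernel `k` is real up to rounding. Setting `lam=0.0` with a single sweep reduces the update to the pointwise division `B / A`.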
4 Experimental Setup and Results
To illustrate the potential of our scale-regularized filter learning, we set up some elementary experiments on the Brodatz image collection [15]. We take the 112 images in this database as our input images (see Figure 1) and corrupt them to create 112 matching output images. In order to do so, we first construct a kernel to convolve the original images with. Figure 2 depicts this filter: the gray part, which covers the largest part of the image, takes on the value zero. The black part has value and the white part takes on the value , which makes sure the filter integrates to zero. Figure 3 gives the output images after convolving the corresponding input images in Figure 1. As a final step, we add i.i.d. Gaussian noise to every output image. The signal-to-noise ratios (SNRs) we experiment with are dB, dB, and dB. These noise levels are somewhat arbitrary, although it should be clear that if the outputs were noiseless, solving the regression without regularization would provide us with a perfect reconstruction of the original convolution kernel. Figure 4 displays the final noisy output images with the worst SNR of dB, in which case we would expect the worst performance for the unregularized learning scheme.
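The corruption step described above can be sketched as follows. This is an assumption-laden illustration, not the authors' code: convolution is taken to be circular, the SNR is assumed to be defined as ten times the base-10 logarithm of the ratio between the mean squared value of the noiseless output and the noise variance, and the kernel, image, and SNR value in the demo are arbitrary stand-ins. The function name `corrupt` is hypothetical.

```python
import numpy as np


def corrupt(image, kernel, snr_db, rng):
    """Circularly convolve `image` with `kernel`, then add i.i.d. Gaussian noise.

    Assumption: SNR (dB) = 10 * log10(mean(clean^2) / noise_variance).
    `kernel` must have the same shape as `image`, with its origin at index (0, 0).
    """
    clean = np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(kernel)))
    signal_power = np.mean(clean ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noisy = clean + rng.normal(0.0, np.sqrt(noise_power), size=clean.shape)
    return clean, noisy


# Demo with a random image and a small zero-sum kernel (illustrative values only).
rng = np.random.default_rng(2)
image = rng.normal(size=(64, 64))
kernel = np.zeros((64, 64))
kernel[0, 0], kernel[0, 1] = 1.0, -1.0      # entries sum to zero
clean, noisy = corrupt(image, kernel, snr_db=10.0, rng=rng)
measured = 10.0 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
print("measured SNR (dB):", measured)
```

The measured SNR fluctuates slightly around the requested value because the noise variance is only estimated from a finite number of pixels.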
In order to learn the filters, we need to decide on a training set of pairs of input and output images. Once that has been decided, we can test the learned filters by applying them to the remaining images and measuring the squared errors they achieve. As the learning will typically improve with an increasing number of images, we tested our method with training set sizes of 1, 2, and 4 image pairs to get an impression of the behavior with respect to this aspect as well. The values of $\lambda$ considered are $0$ (no regularization), , , , and . Finally, as the learned filter may depend on the particular pairs of images we train on, we repeated all experiments 10 times and report the averaged results. Figure 5 plots the results in three subplots. We note that all log-differences are significant according to a paired $t$-test, never giving $p$-values higher than . The best performing regularized filter typically achieves improvements of about an order of magnitude, especially when the learning is based on one image only (the blue lines in the plots).
For this extreme case of a single training pair, we also display some of the filters that have been inferred for the different noise levels. Figure 6 shows the unregularized filters in the top row and the "optimal" regularized versions in the bottom row (cf. Figure 5). To properly display these images, we decided to clip their gray values at twice the minimum and twice the maximum of the values attained by the original filter from Figure 2. For an SNR of dB, both procedures recover the original filter fairly well, though upon close inspection the unregularized filter is clearly noisier in its off-center parts. This is also reflected by the clearly inferior performance displayed in Figure 5. When the SNR reaches dB, the unregularized filter values get clipped for over 98% of the pixels, while in the case of scale regularization this is less than 0.23%. This is also visually clear: while the latter, irrespective of the immense noise level, still resembles the original filter, the former has basically been reduced to noise. Figures 7 and 8 show their difference in another way and display the images obtained by convolving the input images from Figure 1 with the unregularized and scale-regularized kernels, respectively. While there is little but noise visible in Figure 7, the reader hopefully appreciates the rather close, though certainly not perfect, resemblance between the reconstructed images in Figure 8 and the noiseless outputs in Figure 3.
5 Discussion and Conclusion
We devised a novel scheme that tackles the problem of scale in the elementary learning setting of inferring a linear filter from a set of input-output pairs. Such filtering is a basic building block in many a deep network, notably convolutional neural networks, that deals with signals, images, or any other input having some spatiotemporal ordering. Our approach does not rely on any a priori restriction on the context size taken into account when performing the regression, but it incorporates a way of regulating scale by means of an added scale-regularization term that can be tuned. Relying on variational methods from computer vision to solve the inference, it also enables us to deal with learning in spaces of very high dimensionality, which current regression methods that do not exploit the underlying image structure would be unable to handle. In that respect, the observation that Equation (3) can be optimized very efficiently is already interesting in itself.
Though some of the results are definitely quite striking already, the approach is, indeed, rather elementary. To solve more complex, real-world filtering problems, we expect to need more complex learners. But a basic idea of neural networks is, in fact, that one can build arbitrarily complex regression and classification schemes out of more basic building blocks. What is essential in this, however, is that we do not have to limit ourselves to linear transformations of the data. The next important step in this research should therefore investigate how to incorporate a so-called activation function, which transforms the filter outputs in a nonlinear way, into our setup. Introducing this nonlinearity will take us even further away from the simple-to-solve objective function in Equation (3), and it is as yet unknown to what extent computational efficiency can be retained.
In the past years, contributions to top-tier venues in computer vision have been dominated by methods and solutions that reformulate the problem into pattern recognition and machine learning lingo. We do not trivialize these achievements, but we are convinced that proper scale space and variational methods, and the conceptual ideas underlying these methods, can further the impact and success of learning methods. Our contribution provides a modest step in this direction.
References
 [1] C.M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
 [2] T. Hastie, R. Tibshirani, and J.H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Verlag, 2001.
 [3] C.M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, 1995.
 [4] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, and L.D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
 [5] K.S. Fu, D.A. Landgrebe, and T.L. Phillips. Information processing of remotely sensed agricultural data. Proceedings of the IEEE, 57(4):639–653, 1969.
 [6] G. Nagy. Digital image-processing activities in remote sensing for earth resources. Proceedings of the IEEE, 60(10):1177–1200, 1972.
 [7] F. Holdermann, M. Bohner, B. Bargel, and H. Kazmierczak. Review of automatic image processing. Photogrammetria, 34(6):225–258, 1978.
 [8] Z. Hou and T.S. Koh. Image denoising using robust regression. IEEE Signal Processing Letters, 11(2):243–246, 2004.
 [9] H.C. Burger, C.J. Schuler, and S. Harmeling. Image denoising: Can plain neural networks compete with BM3D? In Proceedings of CVPR, pages 2392–2399, 2012.
 [10] M. Loog, B. van Ginneken, and A.M.R. Schilham. Filter learning: application to suppression of bony structures from chest radiographs. Medical Image Anal., 10(6):826–840, 2006.
 [11] K. Suzuki, H. Abe, H. MacMahon, and K. Doi. Image-processing technique for suppressing ribs in chest radiographs by means of massive training artificial neural network (MTANN). IEEE Transactions on Medical Imaging, 25(4):406–416, 2006.
 [12] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE TPAMI, 35(8):1798–1828, 2013.
 [13] V. Jain and S. Seung. Natural image denoising with convolutional networks. In NIPS, pages 769–776, 2009.

 [14] D. Grangier, L. Bottou, and R. Collobert. Deep convolutional networks for scene parsing. In ICML 2009, Deep Learning Workshop, volume 3, 2009.
 [15] P. Brodatz. Textures: a photographic album for artists and designers. Dover, New York, 1966.