Spatially-Adaptive Reconstruction in Computed Tomography using Neural Networks

by   Joseph Shtok, et al.

We propose a supervised machine learning approach for boosting existing signal and image recovery methods and demonstrate its efficacy on example of image reconstruction in computed tomography. Our technique is based on a local nonlinear fusion of several image estimates, all obtained by applying a chosen reconstruction algorithm with different values of its control parameters. Usually such output images have different bias/variance trade-off. The fusion of the images is performed by feed-forward neural network trained on a set of known examples. Numerical experiments show an improvement in reconstruction quality relatively to existing direct and iterative reconstruction methods.


page 2

page 7

page 8

page 9

page 10

page 11


Spatially-Adaptive Reconstruction in Computed Tomography Based on Statistical Learning

We propose a direct reconstruction algorithm for Computed Tomography, ba...

Combining reconstruction and edge detection in computed tomography

We present two methods that combine image reconstruction and edge detect...

Construct Deep Neural Networks Based on Direct Sampling Methods for Solving Electrical Impedance Tomography

This work investigates the electrical impedance tomography (EIT) problem...

Deep learning for plasma tomography using the bolometer system at JET

Deep learning is having a profound impact in many fields, especially tho...

Probabilistic Graphical Modeling approach to dynamic PET direct parametric map estimation and image reconstruction

In the context of dynamic emission tomography, the conventional processi...

Image reconstruction from dense binary pixels

Recently, the dense binary pixel Gigavision camera had been introduced, ...

Conditional Variational Autoencoder for Learned Image Reconstruction

Learned image reconstruction techniques using deep neural networks have ...

I Introduction

Computed tomography (CT) imaging produces an attenuation map of the scanned object, by sequentially irradiating it with X-rays from several directions. The integral attenuation of the X-rays, measured by comparing the radiation intensity entering and leaving the body, forms the raw data for the CT imaging. In practice, these photon count measurements are degraded by stochastic noise, typically modeled as instances of Poisson random variables. There are also other degradation effects due to a number of physical phenomena – see

e.g. [1] for a detailed account.

Given the projection data, known as the sinogram, a reconstruction process can be performed in order to recover the attenuation map. Various such algorithms exist, ranging from the simple and still very popular Filtered-Back-Projection (FBP) [2], and all the way to the more advanced Bayesian-inspired iterative algorithms (see e.g., [3, 4]) that take the statistical nature of the measurements and the unknown image into account. Since CT relies on X-ray, which is an ionizing radiation known to be dangerous to living tissues, there is a dire and constant need to improve the reconstruction algorithms in an attempt to enable reduction of radiation dose.

In this work we are concerned with the question of image post-processing, following the CT reconstruction, for the purpose of getting better quality CT image, thereby permitting an eventual radiation-dose reduction. The proposed method does not focus on a specific CT reconstruction algorithm, nor the properties of the images it produces. Instead, we take a generic approach which adapts, in an off-line learning process, to any such given algorithm. The only requirement is the access to design parameters of the reconstruction procedure which influence the nature of the output image, such as the resolution-variance trade-off.

We aim to exploit the fact that any reconstruction algorithm can provide more image information if instead of one fixed value of a parameter (or a vector of them) controlling the reconstruction, few different values are used (leading to different versions of the image). In order to extract this information from a collection of image versions, we use an Artificial Neural Network (ANN)

[5]. The proposed method can also use other techniques for computing a non-linear multivariate regression function.

Neural networks have been used extensively in medical imaging, particularly for the purpose of CT reconstruction (see Section III for an overview). Here we propose a new constellation, which consists in a local fusion of the different image versions, aimed at an improved reconstruction quality. We use a set of intensity values from a neighborhood of a pixel , taken from the different versions, as inputs to the network, and train it to compute a (smaller) neighborhood of which values are as close as possible (in Mean-Squared-Error or other sense) to those found in the reference image. As we show in this paper, the proposed approach enables an improvement of the variance-resolution trade-off of a given reconstruction algorithm, i.e. producing images with a reduced amount of noise without compromising the spatial resolution and without introducing artifacts.

This paper is organized as follows: Sections II and III are devoted to a brief and general discussion on CT scan/reconstruction and artificial neural networks. Readers familiar with these topics may skip and start reading at Section IV, where the core concept of this work is detailed. This section also contains an illustration on one-dimensional piece-wise constant signals, where it is easy to appreciate the action of the proposed algorithm and the effect of local fusion performed by a neural network. In the sequel, the proposed method is implemented on two tomographic reconstruction methods: boosting the Filtered Back-Projection (FBP) is presented in Section V and the same for Penalized Weighted Least-Squares (PWLS) method is described in Section VII. We conclude this work by discussing the computational complexity of the proposed algorithms in Section VIII, and a summary of this work and its potential implications in Section IX.

Ii Background on Computed Tomography

Ii-a Mathematical Model of CT Scan

In the process of a CT scan, the object is radiated with X-rays. In this work we consider a reconstruction in a plane from rays incident only to this plane (the two-dimensional tomography). From the mathematical point of view, the considered object is a function in the plane, which values are the attenuation coefficients of composing materials (i.e., tissues). When the measured photon counts are perfect, the measurements are directly related to the X-ray transform of the function as a collection of all the straight lines passing through the object, and the value associated with each such line is the integral of along it. In two dimensions, and under the assumption of a full rotated and parallel beam scan, this coincides with the Radon transform .

Let be a straight line from an X-ray source to a detector. The ideal photon count , measured by the detector is related to via the function


where is the blank scan count. The scanned data is stored in a matrix which columns correspond to the sampled angle ; each such column is referred to as a ”view” or a ”projection”, and is acquired, schematically, by a parallel array of X-rays passing through the object at the corresponding angle. The rows of the matrix, corresponding to the sampled values of the distance , are called the ”bins” of each projection. According to the Equation (II.1), for reconstruction purposes the measurements data undergoes the log transform


We refer to as the sinogram. The name indicates that every point in the image space traces a sine curve in this domain. Since the sinogram matrix is the (sampled) Radon transform of the original image , a discrete version of the image can be reconstructed by applying the inverse Radon transform (see Section II-B).

Each measured photon count is typically interpreted as an instance of the random variable

following a Poisson distribution

[1, 6, 7]. This reflects the photon count statistics at the detectors [8]. For a random variable

, the standard deviation

satisfies , and therefore the signal-to-noise ratio of , monotonously increases with its expected value.

In the sinogram domain, the standard deviation of the error between the ideal sinogram and the one computed from the measurements, , is [9], and this is well approximated by [10]. In Figure 1 we display a sinogram matrix and the corresponding Poisson noise image. Below, one can observe the resulting reconstruction artifacts produced by the standard FBP algorithm (see next sub-section). The sinogram error image has a high-energy regions where the sinogram values are relatively high; this corresponds to the predicted behavior of the noise variance. The reconstruction from the noisy sinogram is contaminated with anisotropic noise, mainly in the form of streaks. Their appearance is related to large errors in sinogram values.

Figure 1: Upper row, left to right: Exact sinogram, and the absolute-valued difference between this and the corrupted sinogram (darker shade corresponds to higher error). Lower left: true (reference) image, corresponding to the true sinogram. Lower right: reconstruction from the noisy data with FBP algorithm. Images are displayed in HU window .

Ii-B Reconstruction Algorithms for Computed Tomography

There are various reconstruction algorithms that aim at computing the attenuation map of the scanned object from its projections. In this paper we shall refer and work with two such algorithms: (i) the Filtered Back-Projection (FBP) ([2], which is a direct Radon inversion approach. This is a popular technique despite its known flaws; and (ii) an iterative reconstruction algorithm that takes the statistical nature of the unknown and the noise into account (e.g. [3]). Bayesian methods achieve better image quality than the direct Radon inversion, at the expense of longer processing time. We now describe these two methods is somewhat more details.

Filtered-Back-Projection Method: Mathematically, FBP is the linear operator of the form


Here is the adjoint of the Radon transform, known in the literature as ”back-projection”. is a 1-D convolution filter, applied to each individual projection (column in the sinogram matrix). It uses the Ram-Lak kernel [11], defined in the Fourier domain by , and is a low-pass filter which prevents the noise amplification at high frequencies, typical for the Ram-Lak action. In clinical CT scanners, the parameters of are tuned for specific needs of the radiologist: different preset values are chosen to view bones, soft tissues, high contrast/smooth images, specific anatomical regions, etc.

Without the low-pass filter, the FBP is an exact inverse of the Radon transform in the continuous domain [12]

for the noiseless case. Moving from theory to practice, the FBP algorithm does not perform very well. The low-pass 1-D convolution filter in the sinogram domain is not an effective remedy for the projections noise. The problem of photon starvation manifests through outlier values in the sinogram, which propagate to the output image in the form of streak artifacts. They corrupt the image contents and jeopardize its diagnostic value. Those artifacts can be explained as follows: each measured line integral is effectively smeared back over that line through the image by the back-projection; an incorrect measurement results in a (partial) line of wrong intensity in the image. Typically, the streaks radiate from bone regions or metal implants.

Statistically-Based Method: The relation between , the sought CT image, and the vector of measured counts can be described as


where approximates the Radon transform and models the scan process in reality. The additive error (which also depends of ) stems from the statistical noise. In the Bayesian framework, the reconstruction is performed by computing the Maximum a-Posteriory (MAP) estimate of the image


For CT, an accurate statistical model for the data is quite complicated and is often replaced by a Gaussian approximation with a suitable diagonal weighting term whose components are inversely proportional to the measurement variances. This leads to a penalized weighted least-squares (PWLS) formulation, see e.g.[13]


where , is a diagonal matrix of weights, which in simplistic model are proportional to photon counts ; The penalty term also referred to as the prior, expresses assumptions on the behavior of the clean CT image. In [14] this expression is chosen as


where for each image location , a scalar function is the convex edge-preserving Huber penalty

In order to minimize (II.6), we have used the L-BFGS optimization method [15]. The Matlab/C implementation of the algorithm is the courtesy of Mark Schmidt.

Iii Artificial Neural Networks (ANN)

For completeness of this paper, we provide here a brief background on ANN, and in particular their role in CT and medical imaging. ANN, mimicking after the biological networks of neurons which comprise the nervous system, are intensively used in many domains of Computer Science. In this work we focus on the multi-layer feed-forward ANN with no cycles. This is best represented by a directed, weighted graph which has an array of input nodes (data inputs), inner nodes (neurons) implementing specific (linear or non-linear) scalar functions, and another array of output nodes. The input argument of each neuron is the weighted sum of all its inputs, where the weights are associated with the edges. Those weights are learned during the network training and, effectively, define the regression function produced by the ANN.

More specifically, the first layer consists of inputs, coming from the outside world; then neurons are situated in the -th layer (), and the last one contains output nodes. Each input is connected to each neuron in the second (hidden) layer by a weighted edge with weight . The output of each neuron is connected to the input of every -th neuron in the second layer by the weight , and so on. Finally, each neuron of the last layer is connected to the output with a weight . We denote by the function implemented in each neuron. There is a number of popular choices for this function, for instance .

For example, here is the explicit definition of a network with one hidden layer:


The weights define the multi-variable regression function which approximates any continuous function implied by the set of training examples111The Universal Approximation Theorem states that a network with just one hidden layer, where each neuron is realized as a monotonically-increasing continuous function, can uniformly approximate any given multivariate continuous function up to an arbitrary small error bound [16]. In practice, adding hidden layers shows an improvement in the ANN performance.. A training set for the network comprises of a collection of examples , where is the vector of inputs and is the true output related to this vector. Training the network consists of optimizing the weights for a minimal error,


where the sum is over the training set, and E(a,b) is an error measure of some sort (e.g.

). The popular method for solution of this problem is the iterative backpropagation method

[17]. A scheme of such network is depicted in Figure 2.

Figure 2: A scheme of a multi-layer feed-forward ANN.

Since the development of the back-propagation algorithm for ANN in mid-eighties, the image processing community (among others) has attained a powerful tool to attack virtually any regression or discrimination task. Among the wealth of applications neural networks found in this area (see [18] for a broad and comprehensive overview), some were designed for medical imaging. As such, Hopfield ANN were used for computer-aided screening for cervical cancer [19], breast tumors [20] and segmentation [21]. ANN are also used for compression and classification in cardiac studies [22] and ECG beat recognition [23]. Tasks of filtering, segmentation and edge detection in medical images are addressed with cellular ANN in [24]. Our group has used neural networks for optimal photon detection in scintillation crystals in PET [25].

As for reconstruction problems, a series of works has appeared in which the ANN replaces the overall reconstruction chain by learning the net contribution of all detector readings to each pixel in the image. For Electron Magnetic Resonance (EMR), such an algorithm is proposed in [26]. Floyd et. al. have used this approach for SPECT reconstruction [27] with feed-forward networks and also for lesion detection in this modality [28]. We remark that such naive application of the ANN for reconstruction is limited to low-resolution images, since the network must have inputs and outputs. For instance, in [26], a image is reconstructed. Application of ANN for SPECT reconstruction was also studied by J. P. Kerr and E. B. Bartlett [29, 30].

Imaging modalities like PET and SPECT, where low-resolution images are produced, are a natural domain for ANN application. However, some works tackle also the problem of CT reconstruction where the image size is larger. Ref. [31] proposes using a neural network structure with training based on a minimization of a maximum entropy energy function. Reconstruction in Electrical Impedance Tomography was treated with ANN in [32]. Another variety, an Electrical Capacitance Tomography and an ANN-based reconstruction method for it, are described in [33].

Despite the abundance of applications, there is still place for innovation in the domain of ANN application for medical imaging. First, the CT reconstruction problem is rarely attacked with this tool due to the high dimensions of raw data and the resulting images, which render the naive application of ANN as the black box converting measurements to image unfeasible. Indeed, in our work we do not propose such a scheme per se – rather, our ANN is employed to perform a locally-adaptive fusion of a number of image versions, produced by a given reconstruction algorithm upon using different configurations. This brings us naturally to the next section where we describe our algorithm.

Iv The Proposed Scheme

Iv-a Local Fusion with a Regression Function

We consider the general setup of the non-linear inverse problem. Assume we are given the measurements vector of the form


where is some transformation, represents the noise, and is the signal to be recovered. Assume further that is some restoration algorithm designed to recover from this type of measurements, i.e.,


The scalar parameter controls the behavior of and therefore influences certain characteristics of the estimate . For example, when is responsible for variance-resolution tradeoff of the algorithm, the estimate may be obtained with different noise levels and corresponding spatial resolution characteristics.

The described situation is common in many signal/image processing scenarios. As a basic example, we consider a simple image denoising algorithm, which recovers the signal from noisy measurements by a shift-invariant low-pass filter, realized as a 2-D convolution with prescribed kernel. For some fixed shape of this kernel (say, a simple boxcar function or a 2-D Gaussian rotation-invariant kernel), its width (spread) can be parameterized by a scalar variable . A wider such kernel will perform a more aggressive noise reduction, by averaging the noisy signal over a larger area, at the cost of reducing the spatial resolution.

A second, and more relevant example to this work, is from the domain of CT reconstruction. Recovery of the attenuation map is classically performed by the Filtered Back-Projection algorithm. The latter involves a 1-D low-pass filter, applied to the individual projections. As in the above example, the cut-off frequency of this filter controls the variance-resolution properties of the reconstructed image. In these examples, and also in a general such situation, no single value for the parameter makes the best of the processing algorithm. For different signals, different values may be optimal in the sense of MSE or other quality measure. Indeed, in the same image, computed with two different values of , different regions will get the best treatment by different values of . For each specific case, ad-hoc considerations for tuning this scalar parameter are applied.

In the domain of non-parametric statistics, there is a noise reduction algorithm with proven near-optimality that devises a switch rule for selecting at each location of the signal an appropriate local filter [34]. In effect, the signal is processed by a low-pass filter adaptive to the local signal smoothness. In the context of our discussion, one can say that this algorithm performs a fusion of a number of filtered versions of a signal with varying filter parameter. The switch rule, developed for this adaptive signal smoothing, is based on the balance of the stochastic and structural noise components and model assumptions, and as such, it is very difficult to devise. Moreover, better output may be obtained if we allow to use some combination of the given image versions in each pixel, rather than selecting one of them alone. To our knowledge, no mathematical theory offering a descriptive rule for such local fusion is available for signal estimators, used for denoising or CT reconstruction.

Borrowing from the above switch-rule idea between filters, the solution we propose for the problem described above is a local fusion of a sequence of estimates with a specific regression function, learned on a training dataset consisting of similar cases. Among known regression methods, we choose to work with ANN, due to their strong adaptivity and generalization properties [5]. The supervised learning is done with a training set of examples: For each location in the processed signals, the features (input vector) are sample values extracted from the corresponding location in the sequence of reconstructed versions for this signal. The output is a small region of sample in the desired destination signal. Contemporary training algorithms employ error back-propagation to optimize the objective function, measuring the discrepancy between the correct output values and those predicted by the ANN [17]. In our work we employ the Matlab Neural Network toolbox; the training was performed with the Levenberg-Marquardt algorithm [35, 36]. Our networks consist of two hidden layers. We use the function , which has similar properties to the classical sigmoid and is computationally cheaper and is more robust to saturation caused by large arguments.

In this work, the outlined general concept is specialized to reconstruction algorithms for CT. Specifically, we consider representatives of the two types of those algorithms: the direct FBP and the iterative PWLS (Section II-B

) methods. For FBP, we propose making a sweep over the cut-off frequency of its low-pass filter in the sinogram domain. This parameter controls the noise-resolution tradeoff and has a major influence on the visual impression of the resulting images. For the iterative PWLS algorithm, a sequence of images is extracted along its execution by saving a version of the CT result every few iterations. In following sections we illustrate this approach on a simple 1-D denoising problem and work out a number of applications for CT reconstruction algorithms, as detailed above. Along the way, we discuss the choice of training set and design of features extracted for the ANN.

Iv-B An Example: ANN Fusion for 1-D Signal Denoising

To illustrate the proposed concept, we begin with the simple signal denoising algorithm as mentioned above. We assume that the original signal is 1-D piece-wise constant (PWC). This choice is beneficial for the test we are about to present, since random PWC signals can easily be generated for training/testing purposes, and the effect of low-pass filter denoising is easily observed. We generate such a signal of length by choosing step locations uniformly in random, and choosing the intensity value for each step uniformly at random as well, in the [0,1] segment.

Assume that such a signal has been created and is contaminated with i.i.d. Gaussian noise with . For the noise reduction, we perform a convolution of with a Gaussian kernel . For some chosen values of the standard deviation we obtain the sequence of estimates


In Figure 3 we display an instance of such a signal, the corresponding noisy version, and a number of signal estimates obtained with convolution filters of different widths.

Figure 3: Left to right, top to bottom: clean train signal , noisy signal , and several restored signals obtained by convolution with a Gaussian kernel of increasing width.

In this setup we train the ANN for a better signal restoration. For each location in , we extract a set of small neighborhoods of the same location from each of the signals . Those are concatenated into one vector which serves as the ANN input. Specifically, we take a -samples window from each processed signal in the sequence of signals. Thus, overall the feature vector for each location is of size samples. In the training stage, every such vector is matched with a label – the correct value , which is provided to the ANN as the desired output.

For the training procedure we generate a signal of size (=number of training samples) and extract the training data as described above. The obtained ANN is tested on another signal of length , randomly generated with the same engine. In Figure 4 such test results are presented. The neural network has improved the SNR of the best linear estimate from dB to dB, and this difference is observed in the fact that the ANN estimate fits the original signal much closer. The SNR values are calculated over an interval of samples in the center of the test signal, so as to avoid boundary problems.

Figure 4: Test results: a zoom-in on a portion of the clean signal , the noisy one (dB), the SNR-best linearly restored signal ( dB), and the restoration by the ANN (dB).

The presented algorithm has various design variables: the number, shape and width of the applied filters, the size and structure of the neural network, the structure of a input vector for each example (set of features). The questions of algorithm design will be pursued in the following sections, where CT reconstruction algorithms, relevant to our study, are invoked in the similar setup of performance boosting by local ANN fusion.

Iv-C Error Measures

Just before we conclude this section and move to present the specific details of boosting CT reconstruction algorithms, we should discuss the choice of the error function to use in the learning process, and the error measure to use when evaluating the quality of the reconstruction.

C.2 Training Risk

In the supervised learning procedure, we design the ANN weights so as to minimize the regression error between the ANN output and the desired labels (training output data). In many cases, the natural choice for this function would be the Mean-Squared-Error (MSE). However, in CT, we should contemplate whether MSE is the proper choice to use. Consider a homogeneous region in a CT image (corresponding to some tissue) with a small detail of a different yet similar intensity (a cavity or a lesion). The MSE penalty paid by an over-smoothing reconstruction filter that blurs this lesion is small, and therefore such faint details may be lost while leading to better MSE. The remedy for this problem could be to penalize not only for the difference in intensity values between the reference image and the reconstruction , but also for the difference in the derivatives of these two images. Alternatively, We can weight the training examples so as to boost the importance of such faint edge regions, at the expense of more pronounced parts of the image, where the edges are sufficiently strong. In this spirit, building on the general error term written in Equation (III.2), we propose to use

In this expression is the training data consisting of pairs of feature (input) vectors and their desired label (output), and the function is the output of the ANN, governed by its control parameters . This is a simple weighted MSE, and the idea mentioned above is encompassed in the choice of , the scalar weights assigned to the training examples. In our work we have chosen to be zero for examples having a very low variance in the input image, which correspond to air regions. Specifically, the threshold is set to times the maximal variance of . A zero weight is also assigned to all the examples where the accumulated gradient over the input patch (in the idea image) is above of its maximal value. The later pruning is introduced in order to avoid the bias of the very strong bone-flesh, flesh-air edges in the training process. As for the remaining examples, we assign their weight to be proportional to the accumulated gradient of the patch (again, in the ideal image). This way, the remaining informative edges get a more pronounced effect in the learning procedure.

C.2 Quality Assessment

The quality measures of CT images used in this study, are the following:

  • Signal-to-Noise Ratio (SNR), defined for the ideal signal and a deteriorated version by SNR. In practice, we consider the signal up to a multiplicative constant and compute


    To make the error measurement more meaningful, the SNR is only computed in the image region where the screened object resides, ignoring the background area. We have used an active contour technique to find the object region in the image; specifically we have used the Chan-Vese method [37].

  • Windowed Signal-to Noise Ratio. The dynamic range of the HU values in a CT image is very large, from for air to for bones. Often, the diagnostic interest lies in the soft tissues, the HU values of which are near zero (HU of water). For axial sections of thighs, we chose (by a criterion of best visibility ) the window of HU; our algorithms are tuned for best reconstruction in this HU range. Therefore, an appropriate SNR measurement considers only the regions in the image that fall in this range. Technically, the reference image and the noisy image are pre-processed before the standard SNR is computed by projecting values lower or higher than and respectively to these values.

  • Structured Similarity (SSIM) measure [38]. This measure of similarity between two images comes to replace the standard Mean Squared Error (the expression appearing in the SNR formula), which is known for its problamatic correlation with the human visual perception system (see [38]

    and the references 1-9 therein). SSIM compares small corresponding patches in the two images, after a normalization of the intensity and contrast. The explicit formula involves first and second moments of the local image statistics and the correlation between the two compared images. In our numerical experiments, we use the Matlab code provided by the authors of

    [38], which is available at /research/ssim.

  • Spatial resolution measure: the spatial resolution properties of a non-shift-invariant reconstruction method should be characterized using a local impulse response (LIR) function, which replaces the standard point-spread function [39]. We compute the LIRs by placing sharp impulses (single pixel) in many random locations in the reference image and by taking the difference between the reconstructed images, scanned with or without the spikes. For each LIR, the Full-Width Half-Maximum (FWHM) value is computed as follows: first, the 2-D image matrix of the response function is resized into an image larger by in each axis, in order to reduce the discretization effect. Then, the number of pixels with intensity higher than half-maximum is counted and divided by the refinement factor of . This is the FWHM resolution measure at the specific location.

V FBP Boost – Algorithm Design

V-a The Low-Pass FBP Filter Parameters

The method of local fusion, advocated in the previous section, is now applied to the standard Filtered Back-Projection (FBP) algorithm for CT reconstruction. The fusion is performed over the parameters of the low-pass sinogram filter, applied before the Back-Projection. This one-dimensional low-pass filter is realized as a multiplication with the Butterworth window in the Fourier domain, defined by


We sweep through the range of the parameter (expressing the cut-off frequency of the filter), thus changing the resolution-variance tradeoff of the FBP. We also change the parameter , which controls the steepness of the window roll-off. While controls the amount of blur introduced during the reconstruction, the parameter influences the texture of reconstructed image.

In Figure 5 we show the reconstruction for a fixed value of and an increasing cut-off frequency . Visually, the strong low-pass filter produces a cleaner image (which also have a higher SNR), but looses in the spatial resolution. The displayed sequence corresponds to values (the last corresponds to no filter).

Figure 5: FBP reconstruction with different cut-off frequency value. Upper to lower, Left to right: (the last image is compute without the low-pass filter).

After testing various combinations, we chose to use only three FBP images with cut-off frequencies and . Those were selected from eight images – three with the frequencies and , another three images with the same frequencies and , and the last two are obtained with and . The reason for the restriction to three images is the smaller ANN required.

V-B Design of The ANN Fusion and Training Setup

Let be a given set of versions of a CT image, reconstructed by FBP with different low-pass filters in the sinogram domain.222Note that all these images are produced from the very same raw sinogram, which means that the patient is exposed to radiation only once. We describe the fusion procedure used to compute the output image of the algorithm:

  • For each location in the image matrix, extract its disk-shaped neighborhood from each of the images , . The radius of the disk is set to pixels (containing pixels).

  • Compose a set of inputs for the ANN by stacking the pixel intensities from the neighborhoods into one vector. Normalize this vector in the training stage (discussed below).

  • Apply the ANN to produce a set of output values, which are the intensity values in the disk-shaped neighborhood of in the image . This disk has the same radius of pixels.

  • By this design, each pixel in the output image is covered by disk-shaped patches; its final value is computed by averaging all those contributions.

We detail now on the several of the steps in the list above. In the training stage, the neural network is tuned to minimize the discrepancy between true values in each output vector and those produced by the network from the set of noisy inputs. A vector of inputs is built, as described above, for a location in a reference image from a training set, using data from noisy reconstructions. The corresponding vector of outputs is the disk-shaped neighborhood of in the reference image. Thus, for each image we produce the set using pre-defined FBP filters and sample them to build the training dataset. The image is sampled on a cartesian grid, choosing every third pixel both in horizontal and vertical directions. The pair of input and output vectors for the neural networks is an example used in the training process. Examples from all the training images are put in one pool. A portion of this pool, having a very low variance in the inputs vector, is discarded (specifically, the threshold is set to times the maximal variance). Those examples correspond to regions of air, since no constant patch in any kind of tissue can be observed in the noisy FBP images. This step leads to an empirical improvement in the performance of the ANN.

It is generally acknowledged, that data normalization improves performance of neural networks [40]. Our data matrix , which columns are the individual example vectors, is normalized by


The two constants (the minimum value of the matrix ) and are stored along with the weights of the neural network, and the new data matrix in the test stage is transformed with those pre-computed constants.

Given intensity values in the neighborhood of a pixel in several noisy images, the network should predict a single value in this pixel for the fusion image. However, as a step of regularization, we design the ANN to produce a vector output which is interpreted as a small neighborhood of . the fusion image is then built from such disk-shaped overlapping patches, which are averaged to produce the final result. This is done to avoid possible artifacts, which can be produced by the network: in the training stage, if the ANN produces a single outlier intensity value, its penalty will be smaller than of a vector of such incorrect intensities. Such regularization reduces the performance the ANN can achieve on the training set, since more equations are imposed, but its performance on test images is expected to be more stable.

Vi FBP Boost – Empirical Study

Vi-a Evaluating the Algorithm Performance

In the experiments we have used sets of clinical CT images, axial body slices extracted from a 3D CT scan of a male head, abdomen and thighs. The images are courtesy of Visible Human Project. The intensity levels of those grayscale images correspond to Hounsfield Units. The training set comprises of male thighs sections. The image set for ANN training consists of images, from which examples are extracted. This number, in our experience, suffices to avoid an over-fitting for the chosen size of neural network ( neurons in the hidden layer, network inputs, overall weights). The vector of features for each example is built from the pixel neighborhoods of radius pixels, coming from the three corresponding FBP reconstructions. These images are a subset of the FBP reconstruction images mentioned before, seeking (manually) for the subgroup that would perform the best. The size of the input vector is entries.

In Figure 6 we present a reconstruction of a test image. This test image is taken cm away from the region where the training data was taken from. The middle upper image is the result of a fusion of the number of FBP versions, performed with the trained ANN. By the visual impression, the noise-resolution balance in the fused image is better than in any of the FBP versions forming it. The texture of tissues is closer to the original (observed in the reference image, upper left). The level of streaks and general noise are lower than in the central and right FBP images, and the image sharpness is higher than in the left and the central images. Thus, the fusion image enjoys the good properties offered by each of the FBP versions and is superior than any of them.

Recall that the training was done with a set of weights, corresponding to our penalty component from Equation IV-C. The quantitative error measures we compute for this comparison include plain SNR, SNR weighted by those weights, the training risk and the SSIM measures. These values are given in Table I. As observed from the table, the weighted SNR of the fusion image is dB higher than the highest attainable value in FBP images. For plain SNR this increment is dB. Values of the training risk measure behave expectedly: the weights of ANN training were designed to implicitly reduce this measure for the fusion image. Indeed, it is by lower than that of the optimal FBP image. Finally, the SSIM measure supports the claim the fusion image has the best visual appearance, since it admits the larger value for this measure.

Image FBP FBP FBP Fusion
SNR (uniform) 25.3059 22.1515 19.2833 26.8692
SNR (weighted) 24.3437 22.3835 19.4414 26.1060
Training-Risk 40.6049 25.3577 55.7256 20.5624
SSIM 0.8839 0.8939 0.6892 0.9298
Table I: Quantitative measures for the FBP reconstructions and the fusion result.
Figure 6: Upper left: reference image. Upper middle: the ANN fusion result. Other: FBP images participating in the fusion, produced with different low-pass filters.

Vi-B Size of Local Neighborhood

We study the algorithm performance with different amounts of local data provided for the ANN fusion. A sequence of test image reconstructions is produced, where the radius of the pixel neighborhood, extracted for the fusion, is increased from (single pixel) to ( pixels). The input vector for the ANN is built from three such neighborhoods, coming from FBP reconstructions corresponding to cut-off frequencies of the low-pass filter. We remark that in the special case of , the regression function learned by the network incorporates only the relations between the pixel values in the different image versions, while with larger neighborhood sizes there is also a possibility to perform some local filtering in each image.

In Figure 7 we display graphs of SNR values333Very similar effect was observed with SSIM. computed for the test image. Observably, the quality increment with the neighborhood radius is exhausted around . Our choice is to use , which requires a smaller number of variables (comparing to ) without almost no loss in quality. We also notice in these graphs that the fusion using only the central pixel has a performance very close to that of the best FBP version (but slightly higher, which testifies to the necessity to provide a larger neighborhood of each pixel for a successful fusion. We should note that large neighborhood allows the network to perform a kind of directionally anisotropic filtering matched to the direction of edges.

Figure 7: Graphs of the SNR values corresponding to reconstructions with ANN fusion using input pixel neighborhoods of radius (x-axis).

We also compare two cases of output vectors produced by the ANN. In the lower row of Figure 8, the image on the right is produced by the fusion process where a single pixel is recovered by the ANN for each input vector. The image on the left is produced by computing -pixel neighborhoods of each pixel and averaging the overlapping regions. The visual difference between the image is negligible, and the difference in SNR is dB in favor of the averaging approach. Judging from this (and other similar) tests, we conclude that forcing the neural networks to evaluate a number of pixels in the neighborhood of the one being recovered does not reduce its performance. We don’t have an empirical evidence that such a step is truly necessary, since no artifacts in single-pixel-estimation case were observed in this test.

Figure 8: Upper left: reference image. Upper right: best-SNR FBP reconstruction. Lower left: fusion result where ANN output size is pixel. Lower right: a fusion result where the ANN produces a single pixel value.

Vi-C Single-Image “Fusion”

A special case of the proposed method is to perform local processing with the ANN using only one FBP image. This, in fact, is a post-processing algorithm based on a regression function, which implements some non-linear local filter. In the following experiment we compare the performance of two fusion methods, one using three FBP images (sharp, normal and blurred) and another using only one FBP image produced with no low-pass filter. The results are displayed in Figure 9. Visually, in the single-image fusion case some artificial streaks are observed, which do not appear in the multi-image fusion (where also a lower MSE is achieved). On the other hand, the single-image fusion produces sharper images.

Figure 9: Two test images (corresponding to the upper and lower rows) of thighs sections. Left column: reference images. Middle column: ANN fusion of a single FBP image with no sinogram filter (). Quality of the upper middle image: SNR = 26.13 dB; lower middle: SNR = 27.02 dB. Right column: ANN fusion of three FBP versions, corresponding to filter cut-off frequency of . Quality of the upper right image: SNR = 27.53 dB; lower right: SNR = 27.51 dB.

Vii PWLS Boost - Algorithm Design and Empirical Study

Vii-a Algorithm Description

The iterative PWLS algorithm (see Section II-B) can be boosted by gathering intermediate versions of the image at different numbers of iterations. The idea is to capture the gradual transformation of the image from the initial to the final state. If the initial image is a blurred one, it gradually changes along the iterations towards a sharper version; the intermediate stages contain important information that can contribute to further improve the algorithm output.

The method is very similar to the one proposed in the previous section. At the training stage, a CT reconstruction is performed with a high-quality reference at hand. The examples for ANN training are produced in the following manner: the vector of inputs, corresponding to a location in the image, is assembled using neighborhoods of in the different versions of the image, gathered along the PWLS iterations. Specifically, we take a small neighborhood of pixels from each image in this sequence (see details below). The “correct answer”, corresponding to this vector of ANN inputs, is the value of the pixel in the reference image. As was done previously, the objective function for ANN training is augmented with weights which determine the importance of the individual examples.

Vii-B PWLS Boost - Empirical Study

We conducted numerical experiments to demonstrate the proposed method using the same setup as in the FBP experiment. Training data for the ANN was obtained using a data-set of axial male thighs section images. For each, an initial image is computed with the FBP algorithm using a sinogram filter with cut-off frequency value of (see Figure 5). The PWLS algorithm is implemented as described in Section II-B, with parameters , . We have performed iterations, saving an image version every iterations - overall we have a sequence of images. In practice, we use three images out of this sequence, namely those from iterations 20, 60 and 80. From the first and the third images, neighborhoods of radius ( pixels) were taken for the estimation of the pixel value, and the second image contributed a neighborhood of radius ( pixels). Overall, the ANN has inputs. It is set to be a network with neurons in the (single) hidden layer. It has a single output, set to produce only the central pixel of the provided neighborhood. These specific settings were obtained with a manual tuning of the design parameters.

In Figure 10 we display the fusion result along with individual PWLS reconstructions, used in the fusion process. The lower part of the figure contains the absolute-valued error images. The fusion result has a higher visual quality than any of the three underlying images. Comparing to those images, the noise level in the fusion image is the lowest, and the tissue texture is closer to the original. The sharpness is the same as in the lower middle PWLS image. The SNR values (stated in the Figure) also point to the improvement in quality. The SSIM of the fusion image is , while the sequence of PWLS results have the SSIM values of (corresponding to the lower row of Figure 10, left to right). A reconstruction of an additional test image is displayed in Figure 11. The effect of the fusion observed here is similar to the one in the previous reconstruction. We conclude that the ANN-based fusion can contribute also to the iterative reconstruction, without requiring any additional iterations; the computational cost of the fusion, exercised after the reconstruction, is lower by an order of magnitude than that of the iterative process.

Figure 10: Images of thighs section. Upper row, left to right: reference image, ANN-fused PWLS (SNR=28.18 dB). Middle rows: three PWLS versions (20 iterations, SNR=26.05, 60 iterations,SNR=26.86 dB, 80 iterations, SNR=24.77 dB). Lower row: absolute-valued error images for the fusion image(left) and best-SNR PWLS (right). Darker shade corresponds to a larger error.
Figure 11: A different test image. Left to right: reference image, ANN-fused PWLS (SNR=27.44 dB), three PWLS versions (20 iters., SNR=25.19, 60 iters.,SNR=26.02 dB, 80 iters., SNR=24.67 dB).

To summarize the fusion effect on the outcome of standard reconstruction algorithms, we display in Figure 12 images produced by both FBP and PWLS methods, before and after applying the proposed method of the ANN-based fusion; these images were previously given in Figures 6,10.

Figure 12: FBP and PWLS reconstruction before (middle column) and after (right column) the ANN fusion. Upper left: reference image. Upper row: FBP (central), FBP+ANN (right). Lower row: PWLS (central), PWLS+ANN (right).

In order to test the robustness of the training results, we apply the ANN trained with the thigh sections, for a reconstruction of images of other body parts – sections of the head and the abdomen. Reconstruction results are presented in Figure 13 in the same order as in the previous comparison: middle image in the upper row is the result of fusion, which components are presented in the lower row. The head reconstruction is improved substantially by the fusion process, as visual observation shows. However, the SNR values (given in Table II) point to the favor of the PWLS image corresponding to iterations (lower middle image). The highest SSIM value does belong to the fusion result, though. In the case of the abdomen section, the fusion image is similar to the -iterations version but contains less noise; its quantitative measures are somewhat better than those of the individual PWLS images.

Figure 13: Upper part: head reconstruction. Left to right, top to bottom: reference image, ANN-fused image, three PWLS versions (20, 60, 80 iterations). Lower part: abdomen reconstruction with the same arrangement.

iters. iters. iters. result
Head section
SNR (plain) 24.91 28.67 28.12 28.09
SSIM 0.878 0.873 0.858 0.881

Abdomen section
SNR (plain) 26.68 27.15 25.24 27.94
SSIM 0.813 0.800 0.761 0.821
Table II: Quantitative measures for the head and abdomen reconstructions.

As a last experiment, we consider the special case where the ANN only performs a local filtering of the single version of the image, without a reference to the other versions. A neighborhood of radius ( pixels) was extracted for each location in the PWLS image, corresponding to iteration number . The fusion result is visually compared in Figure 14 versus the image produced from PWLS versions, as before. It can be observed that the processing by ANN reduces the noise appearing in the PWLS image, but it is slightly inferior to the fusion image produced from several PWLS versions.

Figure 14: Left to right: PWLS image (60 iterations, SNR=26.02 dB), single-image fusion (SNR = 27.25 dB), multi-image fusion (SNR = 27.44 dB).

Viii Computational Complexity of The Method

We analyze the number of computations required for the proposed method in the cases of FBP and PWLS reconstruction. First, we consider the complexity of applying the ANN to perform the pixel-wise fusion from a number of image versions produced at the reconstruction stage. For an image, activations of the ANN is required. Typically, the dimension of the input vector of the ANN is of the order of samples, the output dimension has up to elements, and a single hidden layer of up to neurons is used. Thus, the network contains weights444This number can be reduced if a parallel implementation of the ANN is available, since each neuron output can be calculated separately

. Each neuron also implements a sigmoid function thus requiring

sigmoid calculations to produce the ANN output values. Therefore, the cost of performing a local fusion by the ANN is   operations.

When the method is used for the FBP reconstruction, a number of FBP versions must be produced; in our experiments, three reconstructions suffice. Each FBP reconstruction is of computational complexity of . Therefore, if no hardware changes in an existing scanner are made, producing the fusion image will require roughly four times the extent of a single reconstruction (three FBP processes and the fusion step). Of course, the regular FBP image will be available for the radiologist after the usual time of a single FBP reconstruction.

As for the iterative PWLS algorithm, no changes in the reconstruction process are needed, since we only sample images along the standard iterations. We do not have an accurate estimate for the time complexity of the PWLS, since it depends on the optimization method and its efficient implementation. However, the iterative process necessarily involves an application of the system matrix ( operations) in each iteration, and therefore it is by order of magnitude slower than the FBP. Adding the fusion step in the end of this process will only marginally increase the total reconstruction time.

Ix Summary

We have introduced a method for quality improvement for a general parametric signal estimator. The concept is to use a regression function for a local fusion of a number of estimator’s outputs, corresponding to different parameter settings. The regression proposed is realized with feed-forward artificial neural networks. The fusion process consists of two components: first, the behavior of the signal in its different versions is gathered; second, the ANN performs its own non-linear filtering of the signal versions in small neighborhoods of the estimated pixel.

The proposed method is very general and CT reconstruction is only one possible application for it. The local fusion can be used to solve any linear on non-linear inverse problem where an algorithm, producing a solution estimate, exists. The proposed method will enable to incorporate the algorithm outputs, produced with different values of a core parameter, to a single improved result, thus removing the need for tuning this parameter.

In this work this concept was illustrated for the case of CT reconstruction, done with two basic algorithms – the FBP and the PWLS. Empirical results suggest that the local fusion can improve on the resolution variance trade-off of the given reconstruction algorithm, thus adding to the visual quality of the CT images. The post-processing method is not very time-consuming, and the cost of the local fusion can be well below the extent of one FBP reconstruction.


  • [1] J. Bian P.J. La Riviere and P.A. Vargas, “Penalized-likelihood sinogram restoration for computed tomography,” IEEE Trans. Med. Imag., vol. 25, no. 8, pp. 1022–1036, Aug. 2006.
  • [2] L. A. Shepp and J. B. Kruskal, “Computerized tomography - the new medical x-ray technology,” The American Mathematical Monthly, vol. 85, no. 6, pp. 420–439, 1978.
  • [3] I.A. Elbakri and J.A. Fessler, “Statistical image reconstruction for polyenergetic x-ray computed tomography,” IEEE Trans. Med. Imag., vol. 21, no. 2, pp. 89–99, Feb. 2002.
  • [4] H. Lu Z. Liang J. Wang, T. Li, “Penalized weighted least-squares approach to sinogram noise reduction and image reconstruction for low-dose x-ray computed tomography,” IEEE Trans. Med. Imaging, vol. 25, no. 10, pp. 1272–1283, 2006.
  • [5] Simon Haykin, Neural Networks: A Comprehensive Foundation, Macmillan, 1994.
  • [6] K. D. Sauer J. Hsieh J.-B. Thibault, C. A. Bouman, “A recursive filter for noise reduction in statistical iterative tomographic imaging,” in SPIE/IS&T Conference on Computational Imaging IV. 2006, vol. 6065, pp. 60650X– 60650X–10, SPIE.
  • [7] Anja Borsdorf, “Adaptive filtering for noise reduction in x-ray computed tomography,” in Ph.D. Thesis, 2009.
  • [8] K. M. Hanson, “Noise and contrast discrimination in ct,” Radiology of the Skull and Brain, Vol. V: Technical Aspects of Computed Tomography, T. H. Newton and D. G. Potts, eds., pp. 3941–3955, 1981.
  • [9] A. Macovski, Medical Imaging Systems, Prentice-Hall, 1983.
  • [10] J.Wang J.Wen H. Lu J. Hsieh T. Li, X. Li and Z. Liang, “Nonlinear sinogram smoothing for low-dose x-ray ct,” IEEE Trans. Nucl. Sci.,, vol. 51, no. 5, pp. 2505 2513, 2004.
  • [11] G. N. Ramachandran and A. V. Lakshminarayanan,

    “Three-dimensional Reconstruction from Radiographs and Electron Micrographs: Application of Convolutions instead of Fourier Transforms,”

    Proceedings of the National Academy of Sciences of the United States of America, vol. 68, no. 9, pp. 2236–2240, 1971.
  • [12] F. Wubbeling F. Natterer, Mathematical methods in image reconstruction, SIAM, 2001.
  • [13] Sathish Ramani and Jeffrey A Fessler, “A splitting-based iterative algorithm for accelerated statistical x-ray ct reconstruction,” Medical Imaging, IEEE Transactions on, vol. 31, no. 3, pp. 677–688, 2012.
  • [14] J. Fessler, “Iterative methods for image reconstruction,” in ISBI Tutorial. 2006, fessler/papers/files/talk/08/isbi-notes.pdf.
  • [15] Stephen J. Nocedal, Jorge; Wright, Numerical Optimization, Springer-Verlag, 2-nd edition, 2006.
  • [16] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics Of Control, Signals, And Systems, vol. 2, no. 4, pp. 303–314, 1989.
  • [17] G. E. Hinton D. E. Rumelhart and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, pp. 533 – 536, 1986.
  • [18] M. Egmont-Petersen, D. de Ridder, and H. Handels, “Image processing with neural networks a review,” Pattern Recognition, vol. 35, no. 10, pp. 2279 – 2301, 2002.
  • [19] C. Nagar M. Cenci and A. Vecchione, “Papnet-assisted primary screening of conventional cervical smears,” Anticancer Res, vol. 20, no. 5C, pp. 3887– 3889, 2000.
  • [20] J.H.C.L. Hendriks G.M.T. Brake, N. Karssemeijer, “An automatic method to discriminate malignant masses from normal tissue in digital mammograms,” Phys. Med. Biol., vol. 45, no. 10, pp. 2843 –2857, 2000.
  • [21] J.S. Lin K.S. Cheng and C.W. Mao, “The application of competitive hopfield neural network to medical image segmentation,” IEEE Trans. Med. Imag., vol. 15, no. 4, pp. 560–567, 1996.
  • [22] V.J. MartI nez J.R. Hilera, M. Mazo, “Neural networks for ecg compression and classification,” in Proceedings of the 3rd International Conference on Fuzzy Logic, Neural Nets and Soft Computing. 1994, pp. 121–124, Iizuka, Japan.
  • [23] S. Osowski and T. H. Linh, “Ecg beat recognition using fuzzy hybrid neural network,” IEEE Trans. On Biomedical Engineering, vol. 48, no. 11, pp. 1265 – 1271, 2001.
  • [24] I. Aizenberg, N. Aizenberg, J. Hiltner, C. Moraga, and E. Meyer zu Bexten, “Cellular neural networks and computational intelligence in medical image processing,” Image and Vision Computing, vol. 19, no. 4, pp. 177 – 183, 2001.
  • [25] Alexander M Bronstein, Michael M Bronstein, Michael Zibulevsky, and Yehoshua Y Zeevi, “Optimal nonlinear line-of-flight estimation in positron emission tomography,” Nuclear Science, IEEE Transactions on, vol. 50, no. 3, pp. 421–426, 2003.
  • [26] R. Murugesan DC Durairaj, MC Krishna, “A neural network approach for image reconstruction in electron magnetic resonance tomography,” Comput. Biol. Med., vol. 37, no. 10, pp. 1492–501, 2007.
  • [27] Jr. C. E. Floyd, “An artificial neural network for spect image reconstruction,” IEEE Trans. Med. Imag., vol. 10, no. 3, pp. 485–487, 1991.
  • [28] C.E. Floyd Jr. and G.D. Tourassi, “An artificial neural network for lesion detection on single-photon emission computed tomographic images,” Investigative radiology, vol. 27, no. 9, pp. 667–672, 1992.
  • [29] J. P. Kerr and E. B. Bartlett, “Medical image processing utilizing neural networks trained on a massively parallel computer,” Computers in Biology and Medicine, vol. 25, no. 4, pp. 393–403, Jul 1995.
  • [30] E.B. Bartlett J.P. Kerr, “Neural network reconstruction of spect images,” J. Digital Imaging, vol. 8, no. 3, pp. 116– 126, 1995.
  • [31] R. Cierniak, “A 2d approach to tomographic image reconstruction using a hopfield-type neural network,” Artificial Intelligence in Medicine, vol. 43, pp. 113– 125, 2008.
  • [32] A. Adler and R. Guardo, “A neural network image reconstruction technique for electrical impedance tomography,” IEEE Trans. Med. Imag., vol. 13, no. 4, pp. 594–600, 1994.
  • [33] B.S. Hoyle A.Y. Nooralahiyan, “Three-component tomographic flow imaging using artificial neural network reconstruction,” Chemical Engineering Science, vol. 52, no. 13, pp. 2139–2148, 1997.
  • [34] A. Goldenshluger and A. Nemirovsky, “On spatial adaptive estimation of nonparametric regression,” Math. Meth. Statist., vol. 6, no. 2, pp. 135– 170, 1997.
  • [35] D.W. Marquardt, “An algorithm for least-squares estimation of nonlinear parameters,” Journal of the Society for Industrial and Applied Mathematics, vol. 11, no. 2, pp. 431–441, 1963.
  • [36] H. Gavin, “The levenberg-marquardt method for nonlinear least squares curve-fitting problems,” Department of Civil and Environmental Engineering, Duke University, 2011.
  • [37] L.Vese T.Chan, “Active contours without edges,” IEEE Trans. Im. Proc., vol. 10, no. 2, pp. 266–277, 2001.
  • [38] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr 2004.
  • [39] J.A. Fessler and W.L. Rogers, “Spatial resolution properties of penalized-likelihood image reconstruction: space-invariant tomographs,” Image Processing, IEEE Transactions on, vol. 5, no. 9, pp. 1346 –1358, 1996.
  • [40] Yann LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller, “Efficient backprop,” in Neural networks: Tricks of the trade, pp. 9–50. Springer, 1998.