
Radon cumulative distribution transform subspace modeling for image classification

We present a new supervised image classification method for problems where the data at hand conform to certain deformation models applied to unknown prototypes or templates. The method makes use of the previously described Radon Cumulative Distribution Transform (R-CDT) for image data, whose mathematical properties are exploited to express the image data in a form that is more suitable for machine learning. While certain operations such as translation, scaling, and higher-order transformations are challenging to model in native image space, we show that the R-CDT can capture some of these variations and thus render the associated image classification problems easier to solve. The method is simple to implement, non-iterative, has no hyper-parameters to tune, is computationally efficient, and provides accuracies competitive with state-of-the-art neural networks for many types of classification problems, especially in a learning-with-few-labels setting. Furthermore, we show improvements with respect to neural network-based methods in terms of computational efficiency (it can be implemented without the use of GPUs), the number of training samples needed, and out-of-distribution generalization. The Python code for reproducing our results is available at




I Introduction

Image classification refers to the process of automatic image class prediction based on the numerical content of the corresponding pixel values. Automated image classification methods have been used to detect cancer from microscopy images of tumor specimens [1][2], detect and quantify atrophy from magnetic resonance images of the human brain [3][4], identify and authenticate a person from cell phone camera images [5], and in numerous other applications in computer vision, medical imaging, automated driving, and others.

While many methods for automated image classification have been developed, those based on supervised learning have attracted the most attention, given that, generally speaking, they are more accurate than the alternatives. In supervised learning, a set of labeled example images (known as training data) is utilized to estimate the values of the parameters of a mathematical model to be used for classification. Given an unknown test image, the goal of the classification method is to automatically assign the label or class of that image.

An extensive set of supervised learning-based classification algorithms has been proposed in the past (see [6, 7, 8] for a few reviews on the subject). We categorize these algorithms into two broad categories: 1) learning of classifiers on hand-engineered features, and 2) end-to-end learning of features and classifiers, e.g., hierarchical neural networks. Certainly, many algorithms exist that may fit into more than one category, while other algorithms may not fit into either. However, for the purposes of our discussion we focus on these two broad categories.

Image classification methods based on hand-engineered features, perhaps the first to arise [9], generally work in a two-step process: step one being the extraction of numerical features from the pixel data, and step two being the application of statistical classification methods. A large number of numerical features have been engineered in the past to represent the information from a given image, including Haralick features, Gabor features, and shape features [10][11], among numerous others [6]. These are then combined with many different multivariate regression-based classification methods, including linear discriminant analysis [12][13], support vector machines [14][15], and random forests [16][17], as well as their kernel versions.

Fig. 1: System diagram outlining the proposed Radon cumulative distribution transform subspace modeling technique for image classification. (a) R-CDT - a nonlinear, invertible transformation: The R-CDT transform simplifies the data space; (b) Generative modeling - subspace learning: the simplified data spaces can be modeled as linear subspaces; (c) Classification pipeline: the classification method consists of the R-CDT transform followed by a nearest subspace search in the R-CDT space.

Methods based on hierarchical neural networks [18], such as convolutional neural networks (CNNs) [7][19], have been widely studied recently given that they have achieved top performance in certain classification tasks [18][20][21]. In contrast to hand-engineered features, CNNs typically combine both feature extraction and classification methods within one consistent framework, i.e., end-to-end learning. The unprecedented performance of deep neural networks on a wide variety of tasks has made them quintessential to modern supervised learning on images. These methods, however, are: 1) computationally expensive, often requiring graphics processing units (GPUs) to train and deploy, 2) data-hungry, requiring thousands of labeled images per class, and 3) often vulnerable to out-of-distribution samples, e.g., adversarial attacks.

Few methods fall outside these two broad categories. One such example is the classification method relying on the R-CDT, a recently developed image transform [22], for extracting information from pixel intensities. Unlike most numerical-feature-based methods described above, this operation is invertible: the R-CDT is an invertible image transform, and can thus be viewed as a mathematical image representation method. The R-CDT developed in [22] has connections to optimal transport theory [23][24]. In particular, the R-CDT can be interpreted as the application of the linear optimal transportation concept [25] to the sliced-Wasserstein distance [26]. It can also be interpreted as a nonlinear kernel [26]. The R-CDT and linear optimal transport models [25] have been applied to image classification before, in combination with linear classifiers such as Fisher discriminant analysis, support vector machines, and their respective kernel techniques [22][26][27]. While successful in certain applications [27], this approach of classification using the R-CDT has failed to produce state-of-the-art classification results in certain other applications (see Figure 3 from [6]).

Our contribution

The approach we describe here expands and improves upon the R-CDT-based image classification technique, and provides a simple non-iterative training method. We model each image class as being generated by the existence of a template form (e.g., a template for the digit ‘1’ in an MNIST [28] image classification problem) observed under different poses and other intensity re-arrangements, which are explained below. We then utilize the properties of the cumulative distribution transform (CDT) [29] and Radon cumulative distribution transform (R-CDT) [22] to propose that the re-arrangements responsible for generating each signal class be modeled as a convex subspace in the transform domain. Based on this model, we propose a non-iterative training and testing algorithm and demonstrate its strengths and weaknesses in comparison with existing deep learning alternatives. Figure 1 shows a system diagram outlining the main computational modeling steps in the proposed method. In comparison with deep learning-based methods, the proposed method is computationally cheap, label efficient, robust under out-of-distribution setups, and competitive on a broad class of image classification problems.

The paper is organized as follows: Section II presents a preliminary overview of a family of transport-based nonlinear transforms: the CDT for 1D signals, and the R-CDT for 2D images. The classification problem is stated in Section III with the proposed solution in Section IV. Descriptions of the experimental setup and the datasets used are available in Section V. Experimental results are presented in Section VI with the discussion of the results in Section VII. Finally, Section VIII offers concluding remarks.

Symbols: Description
s(x), s(x, y): Signal / image
Ω_s: Domain of s
s̃(t, θ): Radon transform of s
ŝ: CDT / R-CDT transform of s
R, R⁻¹: Forward / inverse Radon transform operation
g(x): Strictly increasing and differentiable function
g_θ(t): Strictly increasing and differentiable function, indexed by an angle θ
s ∘ g: composition of s(x) with g(x), i.e., s(g(x))
s̃ ∘ g_θ: composition of s̃(t, θ) with g_θ(t) along the t dimension of s̃, i.e., s̃(g_θ(t), θ)
…: Set of functions with … for some …
…: Set of functions with … and … for some interval …
𝒯: Set of all possible increasing diffeomorphisms between two domains Ω and Ω′ in ℝ
TABLE I: Description of symbols

II Preliminaries

II-A Notation

Throughout the manuscript, we deal with signals s assuming these to be square integrable in their respective domains. That is, we assume that s ∈ L²(Ω_s), where Ω_s ⊆ ℝ is the domain over which s is defined. In addition, we at times make use of the common notation ‖s‖² = ⟨s, s⟩, where ⟨s, r⟩ = ∫_{Ω_s} s(x) r(x) dx is the inner product. Signals are assumed to be real, so the complex conjugate does not play a role. We will apply the same notation for functions whose input argument is two dimensional, i.e., images. Let Ω_s ⊆ ℝ². A 2D continuous function representing the continuous image is denoted s(x), x ∈ Ω_s. Signals or images are denoted s^(k) when the class information is available, where the superscript (k) represents the class label.

Below we will also make use of one dimensional (1D) diffeomorphisms (one-to-one mapping functions), which are denoted g(x) for signals and g_θ(t) when they need to be parameterized by an angle θ. The set of all possible diffeomorphisms between two domains Ω and Ω′ will be denoted as 𝒯. Finally, at times we also utilize the ‘∘’ operator to denote composition. A summary of the symbols and notation used can be found in Table I.

II-B The Cumulative Distribution Transform (CDT)

The CDT [29] is an invertible nonlinear 1D signal transform from the space of smooth probability densities to the space of diffeomorphisms. The CDT morphs a given input signal, defined as a probability density function (PDF), into another PDF in such a way that the Wasserstein distance between them is minimized. More formally, let s(x), x ∈ Ω_s and r(x), x ∈ Ω_r define a given signal and a reference signal, respectively, which we consider to be appropriately normalized such that s > 0, r > 0, and ∫_{Ω_s} s(x) dx = ∫_{Ω_r} r(x) dx = 1. The forward CDT transform of s with respect to r is given by the strictly increasing function ŝ(x) that satisfies

∫_{−∞}^{ŝ(x)} s(u) du = ∫_{−∞}^{x} r(u) du.

As described in detail in [29], the CDT is a nonlinear and invertible operation, with the inverse being

s(x) = d(ŝ⁻¹(x))/dx · r(ŝ⁻¹(x)).

Moreover, like the Fourier transform [30] for example, the CDT has a number of properties which will help us render signal and image classification problems easier to solve.

Property II-B.1 (Composition): Let s denote a normalized signal and let ŝ be the CDT of s. The CDT of s_g = g′ · (s ∘ g) is given by

ŝ_g = g⁻¹ ∘ ŝ.

Here, g is an invertible and differentiable function (diffeomorphism), g′ = dg(x)/dx, and ‘∘’ denotes the composition operator with s ∘ g = s(g(x)). For a proof, see Appendix A in supplementary materials.

Fig. 2: The cumulative distribution transform (CDT) ŝ of a signal s (probability density function). Note that the CDT ŝ_g of an altered (transported) signal s_g (see text for definition) is related to the transform ŝ of s. In short, the CDT renders displacements into amplitude modulations in transform space.

The CDT composition property implies that variations in a signal caused by applying g(x) to the independent variable will change only the dependent variable in CDT space. This property is illustrated in Figure 2, where variations along both independent and dependent axis directions in original signal space become changes solely along the dependent axis in CDT space.
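To make the composition property concrete, the CDT can be sketched numerically in a few lines (a minimal NumPy implementation against a uniform reference; the function name and discretization choices here are ours, not the paper's):

```python
import numpy as np

def cdt(s, x):
    """Numerical CDT of a nonnegative 1D signal s sampled on the grid x,
    taken with respect to a uniform reference density on the same domain."""
    s = s + 1e-12                       # keep the CDF strictly increasing
    # cumulative distribution of s via the trapezoid rule, normalized to 1
    Fs = np.concatenate(([0.0], np.cumsum(0.5 * (s[1:] + s[:-1]) * np.diff(x))))
    Fs = Fs / Fs[-1]
    Fr = (x - x[0]) / (x[-1] - x[0])    # CDF of the uniform reference
    # s_hat = (F_s)^{-1} o F_r, computed by inverting F_s with interpolation
    return np.interp(Fr, Fs, x)
```

Consistent with Figure 2, a unit translation of a Gaussian template appears, away from the extreme quantiles, as a constant amplitude shift of +1 in CDT space.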

Property II-B.2 (Embedding): The CDT induces an isometric embedding between the space of 1D signals with the 2-Wasserstein metric and the space of their CDT transforms with a weighted-Euclidean metric [22][29], i.e.,

W₂²(s₁, s₂) = ∫_{Ω_r} (ŝ₁(x) − ŝ₂(x))² r(x) dx

for all signals s₁, s₂. That is to say, if we wish to use the Wasserstein distance as a measure of similarity between s₁ and s₂, we can compute it as simply a weighted Euclidean norm in CDT space. For a proof, see Appendix C in supplementary materials.

The property above naturally links the CDT and Wasserstein distances for PDFs. Wasserstein [23] distances are linked to optimal transport and have been used in a variety of applications in signal and image processing and machine learning (see [24] for a recent review).

II-C The Radon transform

The Radon transform of an image s(x), x ∈ Ω_s ⊆ ℝ², which we denote by s̃ = R(s), is defined as

s̃(t, θ) = ∫_{Ω_s} s(x) δ(t − x · ξ_θ) dx.

Here, t is the perpendicular distance of a line from the origin and ξ_θ = [cos θ, sin θ]ᵀ, where θ is the angle over which the projection is taken.

Furthermore, using the Fourier Slice Theorem [31][32], the inverse Radon transform s = R⁻¹(s̃) is defined as

s(x) = ∫₀^π ( s̃(·, θ) ∗ η(·) )(x · ξ_θ) dθ,

where η(t) is the ramp filter (i.e., (ℱη)(ω) = |ω|) and ℱ is the Fourier transform.

Property II-C.1 (Intensity equality): Note that

∫_{−∞}^{∞} s̃(t, θᵢ) dt = ∫_{−∞}^{∞} s̃(t, θⱼ) dt,

which implies that every projection carries the same total intensity, for any two choices θᵢ, θⱼ ∈ [0, π].
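The intensity equality is easy to check with a toy discrete Radon transform (a nearest-neighbour binning sketch in NumPy; this is illustrative only and not the discretization used in [22]):

```python
import numpy as np

def radon_discrete(img, thetas):
    """Toy discrete Radon transform: each pixel's intensity is accumulated
    into the bin of its signed distance t = x cos(theta) + y sin(theta)
    from the line through the image center."""
    ys, xs = np.indices(img.shape)
    xc = xs - (img.shape[1] - 1) / 2.0
    yc = ys - (img.shape[0] - 1) / 2.0
    tmax = np.hypot(xc, yc).max() + 1.0
    n_bins = 2 * max(img.shape)
    edges = np.linspace(-tmax, tmax, n_bins + 1)
    sino = np.empty((n_bins, len(thetas)))
    for j, th in enumerate(thetas):
        t = xc * np.cos(th) + yc * np.sin(th)
        sino[:, j], _ = np.histogram(t, bins=edges, weights=img)
    return sino
```

Because every pixel lands in exactly one bin at every angle, each column of the resulting sinogram carries the full image mass, which is precisely the intensity-equality property above.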

II-D Radon Cumulative Distribution Transform (R-CDT)

The CDT framework was extended for 2D patterns (images as normalized density functions) through the sliced-Wasserstein distance in [22], and was denoted R-CDT. The main idea behind the R-CDT is to first obtain a family of one dimensional representations of a two dimensional probability measure (e.g., an image) through the Radon transform, and then apply the CDT over the t dimension in Radon transform space. More formally, let s(x), x ∈ Ω_s and r(x), x ∈ Ω_r define a given image and a reference image, respectively, which we consider to be appropriately normalized. The forward R-CDT of s with respect to r is given by the measure preserving function ŝ(t, θ) that satisfies

∫_{−∞}^{ŝ(t,θ)} s̃(u, θ) du = ∫_{−∞}^{t} r̃(u, θ) du,  ∀θ ∈ [0, π].   (6)

Fig. 3: The process of calculating the Radon cumulative distribution transform (R-CDT) ŝ of an image s (defined as a 2-dimensional probability density function). The first step is to apply the Radon transform on s to obtain s̃. The R-CDT ŝ is then obtained by applying the CDT over the t dimension of s̃.

As in the case of the CDT, a transformed signal in R-CDT space can be recovered via the following inverse formula [22]:

s(x) = R⁻¹( ∂(ŝ⁻¹(t, θ))/∂t · r̃(ŝ⁻¹(t, θ), θ) ),

where ŝ⁻¹(·, θ) denotes the inverse of ŝ(·, θ) for each fixed θ.

The process of calculating the R-CDT transform is shown in Figure 3. As with the CDT, the R-CDT has a couple of properties outlined below which will be of interest when classifying images.

Property II-D.1 (Composition): Let s denote an appropriately normalized image and let s̃ and ŝ be the Radon transform and the R-CDT transform of s, respectively. The R-CDT transform of s_g = R⁻¹( g_θ′ · (s̃ ∘ g_θ) ) is given by

ŝ_g = g_θ⁻¹ ∘ ŝ,

where g_θ′ = dg_θ(t)/dt, s̃ ∘ g_θ = s̃(g_θ(t), θ), and g_θ⁻¹ ∘ ŝ = g_θ⁻¹(ŝ(t, θ)). Here, for a fixed θ, g_θ(t) can be thought of as an increasing and differentiable function with respect to t. The above equation hence follows from the composition property for the 1D CDT. For a proof, see Appendix B in supplementary materials.

The R-CDT composition property implies that variations along both independent and dependent axis directions in an image, caused by applying g_θ to the independent variable of its Radon transform, become changes solely along the dependent variable in R-CDT space.

Property II-D.2 (Embedding): The R-CDT induces an isometric embedding between the space of images with the sliced-Wasserstein metric and the space of their R-CDT transforms with a weighted-Euclidean metric, i.e.,

SW₂²(s₁, s₂) = ∫₀^π ∫_{Ω_r} (ŝ₁(t, θ) − ŝ₂(t, θ))² r̃(t, θ) dt dθ

for all images s₁ and s₂. For a proof, see Appendix D in supplementary materials.

As in the case of the 1D CDT shown above, the property above naturally links the R-CDT and sliced-Wasserstein distances for PDFs and affords us a simple means of computing similarity among images [22]. We remark that throughout this manuscript we use the notation ŝ for both the CDT and R-CDT transforms of a signal or image s, taken with respect to a fixed reference signal or image r when a reference is not specified.
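A minimal end-to-end numerical sketch of the R-CDT pipeline of Figure 3 (a Radon projection, then a CDT along t at each angle) might look as follows; the helper names and the nearest-neighbour projection are illustrative assumptions, not the implementation of [22]:

```python
import numpy as np

def cdt_1d(s, t):
    # CDT of a 1D density with respect to a uniform reference on the grid t
    s = s + 1e-12
    Fs = np.concatenate(([0.0], np.cumsum(0.5 * (s[1:] + s[:-1]) * np.diff(t))))
    Fs = Fs / Fs[-1]
    Fr = (t - t[0]) / (t[-1] - t[0])
    return np.interp(Fr, Fs, t)

def rcdt(img, thetas):
    """R-CDT of a nonnegative image: project at each angle theta, then take
    the CDT of each projection along the offset variable t."""
    ys, xs = np.indices(img.shape)
    xc = xs - (img.shape[1] - 1) / 2.0
    yc = ys - (img.shape[0] - 1) / 2.0
    tmax = np.hypot(xc, yc).max() + 1.0
    n_bins = 2 * max(img.shape)
    edges = np.linspace(-tmax, tmax, n_bins + 1)
    t_grid = 0.5 * (edges[:-1] + edges[1:])
    out = np.empty((n_bins, len(thetas)))
    for j, th in enumerate(thetas):
        t = xc * np.cos(th) + yc * np.sin(th)
        proj, _ = np.histogram(t, bins=edges, weights=img)
        out[:, j] = cdt_1d(proj, t_grid)
    return out
```

Each column of the output is a CDT and is therefore nondecreasing in t; the resulting (t, θ) array is what gets flattened into a feature vector in Section IV.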

III Generative Model and Problem Statement

Using the notation established above we are ready to discuss a generative model-based problem statement for the type of classification problems we discuss in this paper. We begin by noting that in many applications we are concerned with classifying image or signal patterns that are instances of a certain prototype (or template) observed under some often unknown deformation pattern. Consider the problem of classifying handwritten digits (e.g. the MNIST dataset [28]). A good model for each class in such a data set is to assume that each observed digit image can be thought of as being an instance of a template (or templates) observed under some (unknown) deformation or similar variation or confound. For example, a generative model for the set of images of the digit 1 could be a fixed pattern for the digit 1, but observed under different translations – the digit can be positioned randomly within the field of view of the image. Alternatively, the digit could also be observed with different sizes, or slight deformations. The generative models stated below for 1D and 2D formalize these statements.

Example 1 (1D generative model with translation).

Consider a 1D signal pattern denoted φ^(k)(x) (the superscript (k) here denotes the class in a classification problem), observed under a random translation parameter μ. In this case, we can mathematically represent the situation by defining the set of all possible functions g_μ(x) = x − μ, with μ being a random variable whose distribution is typically unknown. A random observation (randomly translated pattern) can be written mathematically as s(x) = g_μ′(x) φ^(k)(g_μ(x)) = φ^(k)(x − μ). Note that in this case g_μ′ = 1, and thus the generative model simply amounts to random translation of a template pattern. Figure 4 depicts this situation.

The example above (summarized in Figure 4) can be expressed in more general form. Let 𝒢 denote a set of 1D spatial transformations of a specific kind (e.g., the set of affine transformations). We then use these transformations to provide a more general definition for a mass (signal intensity) preserving generative data model.

Fig. 4: Generative model example. A signal generative model can be constructed by applying randomly drawn confounding spatial transformations, in this case translations g_j(x) = x − μ_j, to a template pattern from class (k), denoted here as φ^(k). The notation s_j^(k) here is meant to denote the j-th signal from the k-th class.
Definition III.1 (1D generative model).

Let 𝒢 ⊂ 𝒯. The 1D mass (signal intensity) preserving generative model for class (k) is defined to be the set

𝕊^(k) = { s_j^(k) : s_j^(k) = g_j′ · (φ^(k) ∘ g_j), g_j ∈ 𝒢 }.   (9)

The notation s_j^(k) here is meant to denote the j-th signal from the k-th class. The derivative term g_j′ preserves the normalization of signals. This extension allows us to define and discuss problems where the confound goes beyond a simple translation model.

With the definition of the two-dimensional Radon transform from Section II-C, we are now ready to state the 2D version of the generative data model we use throughout the paper:

Definition III.2 (2D generative model).

Let 𝒢 be our set of confounds, with each g ∈ 𝒢 a family of functions g_θ(t) as in Property II-D.1. The 2D mass (image intensity) preserving generative model for class (k) is defined to be the set

𝕊^(k) = { s_j^(k) : s_j^(k) = R⁻¹( g_j,θ′ · (φ̃^(k) ∘ g_j,θ) ), g_j ∈ 𝒢 }.   (10)

Fig. 5: Generative model for signal classes in signal (top panel) and transform (bottom panel) spaces. Four classes are depicted on the left, each with three example signals shown. Top panel: the signal classes in their corresponding native signal spaces; for each class, three example signals are shown under different translations, and the right portion shows the geometry of these four classes forming nonlinear spaces. Bottom panel: the situation in transform (CDT, or R-CDT) space; the left portion shows the corresponding signals in the transform domain, while the right portion shows the geometry of the signal classes forming convex spaces.

We note that the generative model above can yield a nonconvex set, depending on the choice of template function φ^(k) and confound category 𝒢. Note that we use the same notation 𝕊^(k) for both the 1D and 2D versions of the set; the meaning each time will be clear from the context.

We are now ready to define a mathematical description for a generative model-based problem statement using the definitions above:

Definition III.3 (Classification problem).

Let 𝒢 define our set of confounds, and let 𝕊^(k) be defined as in equation (9) (for signals) or equation (10) (for images). Given training samples {s_1^(1), s_2^(1), …} (class 1), {s_1^(2), s_2^(2), …} (class 2), …, as training data, determine the class (k) of an unknown signal or image s.

It is important to note that the generative model discussed yields nonconvex (and hence nonlinear) signal classes (see Figure 5, top panel). We express this fact mathematically as follows: for arbitrary s_i^(k), s_j^(k) ∈ 𝕊^(k), we have that α s_i^(k) + (1 − α) s_j^(k), for 0 ≤ α ≤ 1, may not necessarily be in 𝕊^(k). The situation is similar for images (the 2D case). Convexity, on the other hand, means the weighted sum of samples does remain in the set; this property greatly simplifies the classification problem, as will be shown in the next section.
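For the translation model of Example 1, the nonconvexity is easy to see numerically: a convex combination of two translates of a unimodal template is bimodal, and hence not itself a translate of the template (an illustrative NumPy sketch):

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 2001)
template = lambda mu: np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

# two members of the class (translates of the same unimodal template)
s1, s2 = template(-3.0), template(3.0)
mix = 0.5 * s1 + 0.5 * s2       # convex combination taken in signal space

# the mixture dips at the center and peaks near +/- 3, so it cannot be
# written as template(mu) for any mu: the class is nonconvex in signal space
center = mix[np.argmin(np.abs(x))]
peak = mix.max()
```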

IV Proposed Solution

We postulate that the CDT and R-CDT introduced earlier can be used to drastically simplify the solution to the classification problem posed in definition III.3. While the generative model discussed above generates nonconvex (hence nonlinear) signal and image classes, the situation can change by transforming the data using the CDT (for 1D signals) or the R-CDT (for 2D images). We start by analyzing the one dimensional generative model from definition III.1.

Employing the composition property of the CDT (see Section II-B) to the 1D generative model stated in equation (9), we have that

ŝ_j^(k) = g_j⁻¹ ∘ φ̂^(k),  g_j ∈ 𝒢,

and thus

Ŝ^(k) = { ŝ_j^(k) : ŝ_j^(k) = h ∘ φ̂^(k), h ∈ 𝒢⁻¹ },

where Ŝ^(k) denotes the set of CDTs of the elements of 𝕊^(k) and 𝒢⁻¹ = { g⁻¹ : g ∈ 𝒢 }.
Thus we have the following lemma:

Lemma IV.1.

If 𝒢⁻¹ is convex, the set Ŝ^(k) is convex.

Proof: Let φ^(k) be a template signal defined as a PDF. For s_j^(k) ∈ 𝕊^(k), let s_j^(k) = g_j′ · (φ^(k) ∘ g_j) for some g_j ∈ 𝒢. Then, using the composition property of the CDT, we have that ŝ_j^(k) = g_j⁻¹ ∘ φ̂^(k). Hence Ŝ^(k) = { h ∘ φ̂^(k) : h ∈ 𝒢⁻¹ }, and since a convex combination of elements of 𝒢⁻¹ remains in 𝒢⁻¹, it follows that Ŝ^(k) is convex.

Fig. 6: The training and testing process of the proposed classification model. Training: first, obtain the transform space representations of the given training samples of a particular class (k); then, enrich the space by adding the deformation spanning set U_T (see text for definition); finally, orthogonalize to obtain the basis vectors which span the enriched space. Testing: first, obtain the transform space representation ŝ of a test sample s; then, the class of s is estimated to be the class corresponding to the subspace which has the minimum distance from ŝ (see text for definitions).

Remark IV.2.

Let 𝕊^(p) and 𝕊^(q) represent two generative models. If 𝕊^(p) ∩ 𝕊^(q) = ∅, then Ŝ^(p) ∩ Ŝ^(q) = ∅.

This follows from the fact that the CDT is a one-to-one map between the space of probability measures and the space of 1D diffeomorphisms. As such, the CDT operation is one-to-one, and therefore there exists no ŝ ∈ Ŝ^(p) ∩ Ŝ^(q).

Lemma IV.1 above states that if the set of spatial transformations formed by taking elements of 𝒢 and inverting them (denoted 𝒢⁻¹) is convex, then the generative model will be convex in signal transform space. The situation is depicted in Figure 5. The top part shows a four-class generative model that is nonlinear/nonconvex. When examined in transform space, however, the data geometry simplifies in a way that signals can be added together to generate other signals in the same class – the classes become convex in transform space.
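This convexification can be checked numerically for the translation model: in CDT space, a convex combination of the transforms of two class members coincides with the transform of another class member, namely the translate by the convex combination of the shifts (a sketch with a uniform reference; discretization choices are ours):

```python
import numpy as np

def cdt(s, x):
    # CDT with respect to a uniform reference on the grid x
    s = s + 1e-12
    Fs = np.concatenate(([0.0], np.cumsum(0.5 * (s[1:] + s[:-1]) * np.diff(x))))
    Fs = Fs / Fs[-1]
    Fr = (x - x[0]) / (x[-1] - x[0])
    return np.interp(Fr, Fs, x)

x = np.linspace(-10.0, 10.0, 4001)
template = lambda mu: np.exp(-0.5 * (x - mu) ** 2)  # the class: all translates

# convex combination of two class members, taken in transform space
h = 0.75 * cdt(template(-2.0), x) + 0.25 * cdt(template(2.0), x)

# it equals the transform of the translate by 0.75*(-2) + 0.25*2 = -1,
# i.e., the combination stays inside the (transformed) class
mu_mix = 0.75 * (-2.0) + 0.25 * 2.0
```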

The analysis above can be extended to the case of the 2D generative model (Definition III.2) through the R-CDT. Employing the composition property of the R-CDT (see Section II-D) to the 2D generative model stated in equation (10), we have that

ŝ_j^(k) = g_j,θ⁻¹ ∘ φ̂^(k),  g_j ∈ 𝒢.
Lemma IV.1 and Remark IV.2 hold true in the 2-dimensional R-CDT case as well. Thus, if 𝒢⁻¹ (in 1D or in 2D) is convex, the CDT/R-CDT transform simplifies the data geometry in a way that signal and image classes become convex in the transform space. Figure 1(a) depicts the situation.

We use this information to propose a simple non-iterative training algorithm (described in more detail in Section IV-A) by estimating a projection matrix that projects each (transform space) sample onto V^(k), for all classes (k), where V^(k) denotes the subspace generated by the convex set Ŝ^(k) as follows:

V^(k) = span( Ŝ^(k) ) = { Σ_j α_j ŝ_j^(k) : α_j ∈ ℝ, ŝ_j^(k) ∈ Ŝ^(k) }.

Figure 1(b) provides a pictorial representation of V^(k). We assume that the subspaces generated by different classes do not overlap other than at the origin, i.e., V^(p) ∩ V^(q) = {0} for p ≠ q.

As far as a test procedure for determining the class of some unknown signal or image s, under the assumption that s was generated by one of the class models, it then suffices to measure the distance between ŝ and the nearest point in each subspace V^(k) corresponding to the generative model 𝕊^(k). If the test sample was generated according to the generative model for one of the classes, then there will exist exactly one class (k) for which d(ŝ, V^(k)) = 0, where d(ŝ, V^(k)) is the Euclidean distance between ŝ and the nearest point in V^(k). Therefore, under the assumption that the testing sample at hand was generated according to one of the (unknown) classes as described in Definition III.3, the class of the unknown sample can be decoded by solving

arg min_k d²(ŝ, V^(k)).

Finally, note that due to Property II-D.2 we also have that

arg min_k d²(ŝ, V^(k)) = arg min_k min_{b ∈ ℬ^(k)} SW₂²(s, b),

with ℬ^(k) = { b : b̂ ∈ V^(k) }. In words, the R-CDT nearest subspace method proposed above can be considered to be approximately equivalent to a nearest (in the sense of the sliced-Wasserstein distance) subset method in image space, with the subset given by the generative model stated in Definition III.2.

IV-A Training algorithm

Using the principles and assumptions laid out above, the algorithm we propose estimates the subspace V^(k) corresponding to the transform space given sample data {s_1^(k), s_2^(k), …}. Naturally, the first step is to transform the training data to obtain {ŝ_1^(k), ŝ_2^(k), …}. We then approximate V^(k) as V̂^(k) = span{ ŝ_1^(k), ŝ_2^(k), … }.

Given the composition properties of the CDT and R-CDT (see Properties II-B.1 and II-D.1), it is also possible to enrich V̂^(k) in such a way that it will automatically include samples undergoing some specific deformations, without explicitly training with those samples under said deformations. The spanning sets corresponding to two such deformations, image domain translation and isotropic scaling, are derived below:

  • Translation: let g(x) = x − x₀ be the translation by x₀ ∈ ℝ², so that s_g(x) = |J_g| s(g(x)) = s(x − x₀), where J_g denotes the Jacobian matrix of g (here |J_g| = 1). Following [22] we have that ŝ_g(t, θ) = ŝ(t, θ) + x₀ · ξ_θ, where ξ_θ = [cos θ, sin θ]ᵀ. We define the spanning set for translation in the transform domain as U_T = {u₁, u₂}, where u₁(t, θ) = cos θ and u₂(t, θ) = sin θ.

  • Isotropic scaling: let g(x) = αx and s_g(x) = α² s(αx), which is the normalized dilatation of s by α, where α ∈ ℝ₊. Then according to [22], ŝ_g = ŝ/α, i.e., a scalar multiplication. Therefore, an additional spanning set is not required here, and thereby the spanning set for isotropic scaling is given by the training samples themselves.

Note that the spanning sets are not limited to translation and isotropic scaling only. Other spanning sets might be defined as before for other deformations as well. However, deformation spanning sets other than translation and isotropic scaling are not used here and left for future exploration.

In light of the above discussion, we define the enriched space V̂^(k) as follows:

V̂^(k) = span( { ŝ_1^(k), ŝ_2^(k), … } ∪ U_T ),   (15)

where U_T = {u₁, u₂}, with u₁(t, θ) = cos θ and u₂(t, θ) = sin θ as defined above. Figure 1(b) depicts this situation.
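In practice, the enrichment amounts to appending the flattened spanning vectors u₁ = cos θ and u₂ = sin θ to the matrix of transform-space samples before orthogonalization. A small synthetic sketch (the sample matrix below is random and purely illustrative):

```python
import numpy as np

n_t, n_theta = 40, 30
thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)

# stand-in transform-space training samples, one column per sample
rng = np.random.default_rng(1)
S_hat = rng.standard_normal((n_t * n_theta, 5))

# translation spanning set: u1(t, theta) = cos(theta), u2(t, theta) = sin(theta),
# constant in t, flattened on the same (t, theta) grid as the samples
u1 = np.tile(np.cos(thetas), (n_t, 1)).ravel()
u2 = np.tile(np.sin(thetas), (n_t, 1)).ravel()
A = np.column_stack([S_hat, u1, u2])

# orthogonalize the enriched set to get the basis matrix B
U, sv, _ = np.linalg.svd(A, full_matrices=False)
B = U[:, sv > 1e-10 * sv[0]]

# any translation-induced perturbation a*u1 + b*u2 now lies in span(B)
v = 3.0 * u1 - 2.0 * u2
residual = v - B @ (B.T @ v)
```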

We remark that although the R-CDT transform (6) is introduced in a continuous setting, numerical approximations for both the Radon and CDT transforms are available for discrete data, i.e., images in our applications [22]. Here we utilize the computational algorithm described in [29] to estimate the CDT from observed, discrete data. Using this algorithm, and given an image s, ŝ is computed on a chosen grid and reshaped as a vector; the same grid is chosen for all images. The elements of U_T are computed on the same grid and reshaped to obtain a set of vectors of matching dimension.

Finally, the proposed training algorithm includes the following steps for each class (k):

  1. Transform the training samples to obtain {ŝ_1^(k), ŝ_2^(k), …}.

  2. Orthogonalize { ŝ_1^(k), ŝ_2^(k), … } ∪ U_T to obtain the set of basis vectors { b_1^(k), b_2^(k), … }, which spans the space V̂^(k) (see equation (15)). Collect the output of the orthogonalization procedure in a matrix B^(k) that contains the basis vectors in its columns: B^(k) = [ b_1^(k), b_2^(k), … ].

The training algorithm described above is summarized in Figure 6.

IV-B Testing algorithm

The testing procedure consists of applying the R-CDT transform followed by a nearest subspace search in R-CDT space (see Figure 1(c)). Let us consider a testing image s whose class is to be predicted by the classification model described above. As a first step, we apply the R-CDT on s to obtain the transform space representation ŝ. We then estimate the distance between ŝ and the subspace model for each class by d²(ŝ, V̂^(k)) = ‖ŝ − B^(k) (B^(k))ᵀ ŝ‖². Note that B^(k) (B^(k))ᵀ is an orthogonal projection matrix onto the space generated by the span of the columns of B^(k) (which form an orthogonal basis). To obtain this distance, we must first obtain the projection of ŝ onto the nearest point in the subspace V̂^(k), which can be easily computed by utilizing the orthogonal basis obtained in the training algorithm. Although the pseudo-inverse formula could be used, it is advantageous in testing to utilize an orthogonal basis for the subspace instead. The class of s is then estimated to be

arg min_k d²(ŝ, V̂^(k)).

Figure 6 shows a system diagram outlining these steps.
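The training and testing steps can be sketched end to end on synthetic transform-space data (the 99% variance cutoff mirrors the experimental setup in Section V; the class subspaces and sample counts below are illustrative, not from the paper):

```python
import numpy as np

def train_subspace(X, var_frac=0.99):
    """Columns of X are transform-space training samples of one class.
    Returns an orthonormal basis capturing var_frac of the total variance."""
    U, sv, _ = np.linalg.svd(X, full_matrices=False)
    energy = np.cumsum(sv ** 2) / np.sum(sv ** 2)
    k = int(np.searchsorted(energy, var_frac)) + 1
    return U[:, :k]

def classify(x, bases):
    """Nearest-subspace rule: pick the class whose subspace leaves the
    smallest residual after orthogonal projection of x."""
    dists = [np.linalg.norm(x - B @ (B.T @ x)) for B in bases]
    return int(np.argmin(dists))

# synthetic demo: two classes, each an exact 3-dimensional subspace of R^50
rng = np.random.default_rng(0)
Q0, _ = np.linalg.qr(rng.standard_normal((50, 3)))
Q1, _ = np.linalg.qr(rng.standard_normal((50, 3)))
V, _ = np.linalg.qr(rng.standard_normal((20, 3)))   # shared coefficient factor
X0 = Q0 @ np.diag([5.0, 3.0, 2.0]) @ V.T            # 20 samples, sing. vals 5,3,2
X1 = Q1 @ np.diag([5.0, 3.0, 2.0]) @ V.T
bases = [train_subspace(X0), train_subspace(X1)]
```

Test samples drawn from either subspace are then assigned to the class with the smaller projection residual.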

V Computational Experiments

V-A Experimental setup

Our goal is to study the classification performance of the method outlined above with respect to state-of-the-art techniques (deep CNNs), in terms of metrics such as classification accuracy, computational complexity, and amount of training data needed. Specifically, for each dataset we study, we generate train-test splits of different sizes from the original training set, train the models on these splits, and report the performances on the original test set. For a train split of a particular size, its samples are randomly drawn (without replacement) from the original training set, and the experiments for this particular size are repeated 10 times. All algorithms see the same train-test data samples for each split. Apart from predictive performance, we also measure each model's computational complexity in terms of the total number of floating point operations (FLOPs).

A particularly compelling property of the proposed approach is that the R-CDT subspace model can capture different sizes of deformations (e.g. small translations vs. large translations) without requiring that all such small and large deformations be present in the training set. In other words, our model generalizes to data distributions that were previously unobserved. This is a highly desirable property particularly for applications such as the optical communication under turbulence problem described below, where training data encompassing the full range of possible deformations are limited. This property will be explored in section VI.

Given their excellent performance in many classification tasks, we utilize different kinds of neural network methods as baselines for assessing the relative performance of the method outlined above. Specifically, we test three neural network models: 1) a shallow CNN model consisting of two convolutional layers and two fully connected layers (based on PyTorch's official MNIST demonstration example), 2) the standard VGG11 model [33], and 3) the standard ResNet18 model [34]. All these models were trained for 50 epochs, using the Adam optimizer [35] with a learning rate of 0.0005. When the training set size was less than or equal to 8, a validation set was not used, and the test performance was measured using the model after the last epoch. When the training set had more than 8 samples, we used 10% of the training samples for validation, and reported the test performance based on the model that had the best validation performance. To make a fair comparison, we do not use data augmentation in the training phase of the neural network models nor of the proposed method.

The proposed method was trained and tested using the procedures explained in section IV. The orthogonalization step was performed using the singular value decomposition (SVD): the matrix of basis vectors for each class was constructed from the left singular vectors obtained by the SVD of the matrix of transformed training samples of that class. The number of basis vectors was chosen so that the selected basis vectors capture 99% of the total variance explained by all the training samples in that class. A 2D uniform probability density function was used as the reference image for the R-CDT computation (see equation (6)).
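The basis construction step can be sketched as follows, assuming each class's transformed (R-CDT) training samples are stacked as columns of a matrix. Note that this sketch measures "variance captured" as the fraction of squared singular values, which may differ in detail from the paper's exact criterion.

```python
import numpy as np

def class_basis(X, energy=0.99):
    """Orthonormal basis for one class: left singular vectors of X
    (features x samples), retaining enough vectors to capture `energy`
    of the total squared singular-value energy."""
    U, sv, _ = np.linalg.svd(X, full_matrices=False)
    frac = np.cumsum(sv**2) / np.sum(sv**2)
    k = int(np.searchsorted(frac, energy)) + 1   # smallest k reaching `energy`
    return U[:, :k]

# Example: 10 transformed training samples of one class, 50 features each.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
B = class_basis(X)
```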

V-B Datasets

Dataset                     Image size   No. of classes   No. of training images   No. of test images
Chinese printed character   64 × 64      1000             100000                   100000
MNIST                       28 × 28      10               60000                    10000
Affine-MNIST                84 × 84      10               60000                    10000
Optical OAM                 151 × 151    32               22400                    9600
Sign language               128 × 128    3                3280                     1073
OASIS brain MRI             208 × 208    2                100                      100
CIFAR10                     32 × 32      10               50000                    10000
TABLE II: Datasets used in the experiment.

To demonstrate the comparative performance of the proposed method, we identified seven image classification datasets: Chinese printed characters, MNIST, Affine-MNIST, optical OAM, sign language, OASIS brain MRI, and CIFAR10. The Chinese printed character dataset with 1000 classes was created by adding random translations and scalings to the images of 1000 printed Chinese characters. The MNIST dataset, collected from [28], contains images of ten classes of handwritten digits. The Affine-MNIST dataset was created by adding random translations and scalings to the images of the MNIST dataset. The optical orbital angular momentum (OAM) communication dataset was collected from [27]; it contains images of 32 classes of multiplexed orbital angular momentum beam patterns for optical communication, corrupted by atmospheric turbulence. The sign language dataset, collected from [36], contains images of hand gestures; normalized HOGgles images [37] of the first three classes of the original RGB hand gesture images were used. The OASIS brain MRI image dataset was collected from [38]; the 2D images from the middle slices of the original 3D MRI data were used in this paper. Besides these six datasets, we also report results on the natural images of the gray-scale CIFAR10 dataset [39]. Details of the seven datasets are given in Table II.

VI Results

VI-A Test accuracy

Fig. 7: Percentage test accuracy of different methods as a function of the number of training images per class.
Fig. 8: The total number of floating point operations (FLOPs) required by the methods to attain a particular test accuracy in the MNIST dataset (left) and the sign language dataset (right).

The average test accuracy values of the methods tested on the Chinese printed character, MNIST, Affine-MNIST, optical OAM, sign language, and OASIS brain MRI image datasets, for different numbers of training samples per class, are shown in Figure 7. Note that we did not use VGG11 on the MNIST dataset because the dimensions of MNIST images (28 × 28, see Table II) are too small for VGG11.

Overall, the proposed method outperforms the other methods when the number of training images per class is low (see Figure 7). For some datasets, the improvements are strikingly significant. For example, on the optical OAM dataset, when learning from only one sample per class, our method provides a substantial absolute improvement in test accuracy over the CNN-based techniques. The proposed method also offers performance comparable to its deep learning counterparts as the number of training samples increases.

Furthermore, in most cases, the accuracy vs. training size curves of the proposed method are smoother than those of the CNN-based methods. The accuracy vs. training size curves of the neural network architectures also vary significantly across datasets. For example, Shallow-CNN outperforms Resnet on the MNIST dataset but underperforms it on the Affine-MNIST dataset in terms of test accuracy. Similarly, the Resnet architecture outperforms VGG11 on the sign language dataset but underperforms it on the Affine-MNIST dataset.

VI-B Computational efficiency

Figure 8 presents the number of floating point operations (FLOPs) required in the training phase of the classification models in order to achieve a particular test accuracy value. We used the Affine-MNIST and the sign language datasets in this experiment.

Fig. 9: Computational experiments under the out-of-distribution setup. The out-of-distribution setup consists of disjoint training (‘in distribution’) and test (‘out distribution’) sets containing different sets of magnitudes of the confounding factors (see the left panel). Percentage test accuracy of different methods are measured as a function of the number of training images per class under the out-of-distribution setup (see the middle and the right panel).

The proposed method obtains test accuracy results similar to those of the CNN-based methods with substantial savings in computational complexity, as measured by the number of FLOPs (see Figure 8). The reduction in computational complexity is generally larger when comparing against a deeper neural network: the number of FLOPs required by VGG11 exceeds that of the proposed method by a larger factor than does that of Shallow-CNN. Note that only the training FLOPs are included in Figure 8. We also calculated the number of FLOPs required in the testing phase: for all methods, the number of test FLOPs per image is several orders of magnitude lower than the number of training FLOPs. The test FLOPs of the proposed method depend on the number of training samples, whereas the test FLOPs of the CNN-based methods are fixed; the CNN-based methods require more test FLOPs than the maximum required by the proposed method. These plots are omitted for brevity.

VI-C Out-of-distribution testing

In this experiment, we varied the magnitude of the confounding factors (e.g., translation) to generate a gap between the training and testing distributions, allowing us to test the out-of-distribution performance of the methods. Formally, let C denote the set of confounding factors, and consider two disjoint subsets C_in and C_out of C, i.e., C_in ∪ C_out ⊆ C and C_in ∩ C_out = ∅. Using the generative model in equation (10), the 'in distribution' image subset and the 'out distribution' image subset are defined from the two disjoint confound subsets C_in and C_out, respectively.

We defined the ‘in distribution’ image subset as the generative model for the training set and the ‘out distribution’ image subset as the generative model for the test set in this modified experimental setup (see the left panel of Figure 9).

We measured the accuracy of the methods on the Affine-MNIST and the optical OAM datasets under this modified experimental setup. The Affine-MNIST dataset for the modified setup was generated by applying random translations and scalings to the original MNIST images in a controlled way, so that the two confound subsets do not overlap: the 'in distribution' subset consisted of images with small translations and near-unity scale factors, while images with larger translations and more extreme scale factors were used to generate the 'out distribution' subset. For the optical OAM dataset, the images at turbulence level 5 (low turbulence) [27] were included in the 'in distribution' subset, and those at turbulence levels 10 and 15 (medium and high turbulence) were included in the 'out distribution' subset. The average test accuracy results for different training set sizes under the out-of-distribution setup are shown in Figure 9.
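A minimal sketch of how disjoint confound subsets might be sampled; the translation and scale ranges below are illustrative placeholders, not the paper's exact bounds (which are not reproduced in this text).

```python
import numpy as np

def sample_confound(rng, in_distribution):
    """Sample a (translation, scale) confound pair from two disjoint
    ranges: small deformations for the 'in distribution' subset, larger
    non-overlapping deformations for the 'out distribution' subset.
    The specific bounds here are illustrative only."""
    if in_distribution:
        t = rng.uniform(-2.0, 2.0, size=2)        # small translation (pixels)
        s = rng.uniform(0.9, 1.1)                 # near-unity scale factor
    else:
        # larger translations, with random sign, magnitude in [4, 8]
        t = rng.choice([-1.0, 1.0], size=2) * rng.uniform(4.0, 8.0, size=2)
        # more extreme scale factors, away from 1
        s = rng.choice([rng.uniform(0.70, 0.85), rng.uniform(1.15, 1.30)])
    return t, s
```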

The proposed method outperforms the other methods by a greater margin than before under this modified experimental scenario (see Figure 9). For both the Affine-MNIST and the optical OAM datasets, the test accuracy values of the proposed method are substantially higher than those of the CNN-based methods.

Fig. 10: Comparison of the percentage test accuracy results obtained in the two ablation studies conducted (using the MLP-based classifier in R-CDT space and the nearest subspace classifier in image space) with that of the proposed method.

Compared with the general experimental setup (Figure 7), the test accuracy of all methods mostly decreases under this more challenging modified setup (Figure 9). However, the average reduction in test accuracy of the proposed method under the modified setup is significantly lower than that of the CNN-based methods, on both the Affine-MNIST and the optical OAM datasets.

VI-D Ablation study

To observe the relative impact of different components of our proposed method, we conducted two ablation studies using the Affine-MNIST and the optical OAM datasets. In the first study, we replaced the nearest subspace classifier used in our proposed method with a multilayer perceptron (MLP) [40] and measured the test accuracy of this modified model. In the second study, we replaced the R-CDT representations with the raw images and measured the test accuracy of the nearest subspace classifier applied to the raw image data. The percentage test accuracy results obtained in these two modified experiments are illustrated in Figure 10, along with the results of the proposed method for comparison. The proposed method outperforms both modified models in terms of test accuracy (see Figure 10).

Fig. 11: Percentage test accuracy results in the CIFAR10 dataset. The natural images in the CIFAR10 dataset might not conform to the underlying generative model, and therefore the proposed method does not perform well on the CIFAR10 dataset.

VI-E An example where the proposed method fails

There are image classification problems (e.g., natural image datasets) where the proposed method does not perform well; the CIFAR10 dataset is one such example. To demonstrate this point, we measured the test accuracies of different methods on the gray-scale CIFAR10 dataset (see Figure 11). All of the CNN-based methods outperform the proposed method on the gray-scale CIFAR10 dataset in terms of maximum test accuracy.

VII Discussion

Test accuracy

Results on the six example datasets suggest that the proposed method obtains competitive accuracy figures compared with state-of-the-art techniques such as CNNs, as long as the data at hand conform to the generative model in equation (10). Moreover, in these examples, the nearest R-CDT subspace method was shown to be more data efficient: generally speaking, it can achieve higher accuracy with fewer training samples.

Computational efficiency

The proposed method obtains accuracy figures similar to those of the CNN-based methods with a substantial reduction in computational complexity. Such a drastic reduction in computation can be achieved due to the simplicity and non-iterative nature of the proposed solution. As opposed to neural networks, for which GPU implementations are practically imperative, the proposed method can be implemented efficiently on a CPU, greatly simplifying the process of obtaining an accurate classification model for the set of problems that are well modeled by our problem statement in definition III.3.

Out-of-distribution testing

The accuracy of the CNN-based methods falls drastically under the out-of-distribution setting, whereas the proposed method largely maintains its test accuracy. Based on these findings, we infer that the proposed method can be suitable both for interpolation (predicting the classes of data samples within the known distribution) and for extrapolation (predicting the classes of data samples outside the known distribution) when the data conform to the generative model in equation (10).


The out-of-distribution setting for image classification also bears practical significance. For example, consider the problem of classifying the OAM beam patterns for optical communications (see the optical OAM dataset in Figure 7). As these optical patterns traverse air with often unknown air flow patterns, temperature, humidity, etc., exact knowledge of the turbulence level that generated a test image may not always be at hand. Therefore, it is practically infeasible to train the classification model with images at the same turbulence level as the test data. The out-of-distribution setup is more practical under such circumstances.

Ablation study

Based on the ablation study results, we conclude that the proposed combination, a nearest subspace classifier in the R-CDT domain, is more appropriate for the category of classification problems we are considering. Data classes in the original image domain do not generally form convex sets, and in these instances the subspace model is not appropriate in the image domain. The subspace model is appropriate in the R-CDT domain because the R-CDT transform provides a linear data geometry. Considering the subspace model in R-CDT space also enhances the generative nature of the proposed classification method by implicitly including the data points formed by convex combinations of the given training data points. Using a discriminative model (e.g., an MLP or SVM) with the R-CDT domain representations of images does not have that advantage.
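The nearest subspace classifier discussed above can be sketched as follows, assuming each class is represented by a matrix with orthonormal columns (e.g., left singular vectors of its training samples in R-CDT space); `nearest_subspace_label` is an illustrative name for this sketch.

```python
import numpy as np

def nearest_subspace_label(x, bases):
    """Assign feature vector x (e.g., a flattened R-CDT representation)
    to the class whose subspace span(B_k) minimizes the residual norm
    ||x - B_k B_k^T x||, where each B_k has orthonormal columns.
    `bases` maps class label -> basis matrix B_k."""
    best_label, best_dist = None, np.inf
    for label, B in bases.items():
        residual = x - B @ (B.T @ x)   # component of x outside span(B)
        dist = np.linalg.norm(residual)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label
```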

When are the class sets convex

Definitions III.1 and III.2 specify the generative model for the data classes used in our classification problem statement III.3. As part of the solution to the classification problem, it was proved in Lemma IV.1 that so long as the set of transport maps (or the set of their inverses) is convex, the corresponding class set is convex, which is a precondition for the proposed classification algorithm summarized in Figure 6 to solve the classification problem stated in III.3. A natural question to ask is when, or for what types of transportation functions, this condition is met. Certain simple examples are easy to describe. When the transportation set consists of translations in 1D or 2D, the set of inverses can be shown to be convex. Likewise, when it consists of scalings of a function, the set of inverses can be shown to be convex. When the transportation set shares a set of fixed points, convexity can likewise be shown. Our hypothesis is that the six problems on which we tested the method conform to the generative model specifications at least in part, given that classification accuracies significantly higher than chance are obtained with the method. A careful mathematical analysis of these and related questions is the subject of present and future work.
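The translation case can be verified directly; the following short derivation uses generic notation ($g_1$, $g_2$ for two translations), since the paper's symbols are not reproduced in this text:

```latex
% Convex combination of two 1D translations g_1(x) = x - a and g_2(x) = x - b:
\alpha g_1(x) + (1-\alpha)\, g_2(x)
    = \alpha (x - a) + (1-\alpha)(x - b)
    = x - \bigl(\alpha a + (1-\alpha) b\bigr),
    \qquad \alpha \in [0, 1],
```

which is again a translation. The same computation applies to the inverses $g_i^{-1}(x) = x + a_i$, so both the set of translations and the set of their inverses are convex.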

Limitation: An example where the proposed method fails

The fundamental assumption of the proposed method is that the data at hand conform to an underlying generative model (equation 10). If the dataset does not conform to the generative model, the proposed method may not perform well. The CIFAR10 dataset (Figure 11) is an example where the data classes might not follow the generative model. The proposed method underperforms the CNN-based methods in the case of the CIFAR10 dataset.

VIII Conclusions

We introduced a new algorithm for supervised image classification. The algorithm builds on prior work on the Radon Cumulative Distribution Transform (R-CDT) [22] and classifies a given image by measuring the distance between the R-CDT of that image and the linear subspaces estimated from linear combinations of the transformed training data of each class. Because distances between two images in R-CDT space equate to the sliced Wasserstein distance between the inverse R-CDTs of the same points, the classification method can be interpreted as a 'nearest' sliced Wasserstein distance method between the input image and the images in the generative model of each class.
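The sliced-Wasserstein interpretation can be illustrated numerically. The sketch below uses only two projection angles (row and column sums) as a crude stand-in for the full Radon transform, and computes the 1D 2-Wasserstein distance by quantile matching; it is a toy illustration, not the paper's implementation.

```python
import numpy as np

def w2_1d(p, q, grid):
    """Squared 2-Wasserstein distance between two 1D histograms p, q on
    `grid`, via quantile-function (inverse-CDF) matching."""
    p = p / p.sum()
    q = q / q.sum()
    qs = np.linspace(0.005, 0.995, 200)     # quantile levels
    Fp, Fq = np.cumsum(p), np.cumsum(q)
    xp = np.interp(qs, Fp, grid)            # inverse CDF of p
    xq = np.interp(qs, Fq, grid)            # inverse CDF of q
    return np.mean((xp - xq) ** 2)

def sliced_w2(img_a, img_b):
    """Sliced W2^2 using only the 0- and 90-degree projections
    (row/column sums) as a two-angle stand-in for the Radon transform."""
    n = img_a.shape[0]
    grid = np.arange(n, dtype=float)
    d = 0.0
    for axis in (0, 1):
        d += w2_1d(img_a.sum(axis=axis), img_b.sum(axis=axis), grid)
    return d / 2.0
```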

The model was demonstrated to solve a variety of real-world classification problems with accuracy figures similar to those of state-of-the-art neural networks, including a shallow CNN, VGG11 [33], and Resnet18 [34]. The proposed model was also shown to outperform these neural networks by a large margin in some specific practical scenarios, e.g., training with very few training samples and testing with 'out of distribution' test sets. The method is also extremely simple to implement, non-iterative, and it does not require tuning of hyperparameters. Finally, as far as training is concerned, the method was demonstrated to be significantly less demanding in terms of floating point operations than the neural network methods.

We note, however, that the method above is best suited for problems that are well modeled by the generative model definition provided in Section III. The definition is naturally tailored towards modeling images which are segmented (foreground extracted). Examples shown here include classifying written Chinese characters, MNIST numerical digits, optical communication patterns, sign language hand shapes, and brain MRIs. We also note that the model does not account for many other variations present in many important image classification problems. Specifically, the proposed model does not account for occlusions, introduction of other objects in the scene, or variations which cannot be modeled as a mass (intensity) preserving transformation on a set of templates. Computational examples using the CIFAR10 dataset demonstrate that indeed the proposed model lags far behind, in terms of classification accuracy, the standard deep learning classification methods to which it was compared.

Finally, numerous adaptations of the method are possible. The linear subspace method (in R-CDT space) described above can be modified to utilize other assumptions regarding the subspace that best models each class. While certain classes or problems may benefit from a simple linear subspace model as described above, where all linear combinations are allowed, other classes may be better modeled by a union of non-orthogonal subspaces. The exploration of these and other modifications and extensions of the method is left for future work.


This work was supported in part by NIH grants GM130825, GM090033.


  • [1] Olcay Sertel, Jun Kong, Hiroyuki Shimada, Umit V Catalyurek, Joel H Saltz, and Metin N Gurcan. Computer-aided prognosis of neuroblastoma on whole-slide images: Classification of stromal development. Pattern recognition, 42(6):1093–1103, 2009.
  • [2] Saurav Basu, Soheil Kolouri, and Gustavo K Rohde. Detecting and visualizing cell phenotype differences from microscopy images using transport-based morphometry. Proceedings of the National Academy of Sciences, 111(9):3448–3453, 2014.
  • [3] Shinjini Kundu, Soheil Kolouri, Kirk I Erickson, Arthur F Kramer, Edward McAuley, and Gustavo K Rohde. Discovery and visualization of structural biomarkers from mri using transport-based morphometry. NeuroImage, 167:256–275, 2018.
  • [4] Jörg B Schulz, Johannes Borkert, Stefanie Wolf, Tanja Schmitz-Hübsch, Maryla Rakowicz, Caterina Mariotti, Ludger Schoels, Dagmar Timmann, Bart van de Warrenburg, Alexandra Dürr, et al. Visualization, quantification and correlation of brain atrophy with clinical symptoms in spinocerebellar ataxia types 1, 3 and 6. Neuroimage, 49(1):158–168, 2010.
  • [5] Abdenour Hadid, JY Heikkila, Olli Silvén, and M Pietikainen. Face and eye detection for person authentication in mobile phones. In 2007 First ACM/IEEE International Conference on Distributed Smart Cameras, pages 101–108. IEEE, 2007.
  • [6] Mohammad Shifat-E-Rabbi, Xuwang Yin, Cailey E Fitzgerald, and Gustavo K Rohde. Cell image classification: A comparative overview. Cytometry Part A, 97A(4):347-362, 2020.
  • [7] Waseem Rawat and Zenghui Wang. Deep convolutional neural networks for image classification: A comprehensive review. Neural computation, 29(9):2352–2449, 2017.
  • [8] Dengsheng Lu and Qihao Weng. A survey of image classification methods and techniques for improving classification performance. International journal of Remote sensing, 28(5):823–870, 2007.
  • [9] Judith MS Prewitt and Mortimer L Mendelsohn. The analysis of cell images. Annals of the New York Academy of Sciences, 128(3):1035–1053, 1966.
  • [10] Nikita Orlov, Lior Shamir, Tomasz Macura, Josiah Johnston, D Mark Eckley, and Ilya G Goldberg. WND-CHARM: Multi-purpose image classification using compound image transforms. Pattern recognition letters, 29(11):1684–1693, 2008.
  • [11] Gennady V Ponomarev, Vladimir L Arlazarov, Mikhail S Gelfand, and Marat D Kazanov. Ana hep-2 cells image classification using number, size, shape and localization of targeted cell regions. Pattern Recognition, 47(7):2360–2366, 2014.
  • [12] Tatyana V Bandos, Lorenzo Bruzzone, and Gustavo Camps-Valls. Classification of hyperspectral images with regularized linear discriminant analysis. IEEE Transactions on Geoscience and Remote Sensing, 47(3):862–873, 2009.
  • [13] Timothy J Muldoon, Nadhi Thekkek, Darren M Roblyer, Dipen Maru, Noam Harpaz, Jonathan Potack, Sharmila Anandasabapathy, and Rebecca R Richards-Kortum. Evaluation of quantitative image analysis criteria for the high-resolution microendoscopic detection of neoplasia in barrett’s esophagus. Journal of biomedical optics, 15(2):026027, 2010.
  • [14] Jianguo Zhang, Marcin Marszałek, Svetlana Lazebnik, and Cordelia Schmid. Local features and kernels for classification of texture and object categories: A comprehensive study. International journal of computer vision, 73(2):213–238, 2007.
  • [15] Florent Perronnin, Jorge Sánchez, and Thomas Mensink. Improving the fisher kernel for large-scale image classification. In European conference on computer vision, pages 143–156. Springer, 2010.
  • [16] Anna Bosch, Andrew Zisserman, and Xavier Munoz. Image classification using random forests and ferns. In 2007 IEEE 11th international conference on computer vision, pages 1–8. Ieee, 2007.
  • [17] Peijun Du, Alim Samat, Björn Waske, Sicong Liu, and Zhenhong Li. Random forest and rotation forest for fully polarized sar image classification using polarimetric and spatial features. ISPRS Journal of Photogrammetry and Remote Sensing, 105:38–53, 2015.
  • [18] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436–444, 2015.
  • [19] Hoo-Chang Shin, Holger R Roth, Mingchen Gao, Le Lu, Ziyue Xu, Isabella Nogues, Jianhua Yao, Daniel Mollura, and Ronald M Summers. Deep convolutional neural networks for computer-aided detection: Cnn architectures, dataset characteristics and transfer learning. IEEE transactions on medical imaging, 35(5):1285–1298, 2016.
  • [20] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
  • [21] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
  • [22] Soheil Kolouri, Se Rim Park, and Gustavo K Rohde. The radon cumulative distribution transform and its application to image classification. IEEE transactions on image processing, 25(2):920–934, 2016.
  • [23] Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
  • [24] Soheil Kolouri, Se Rim Park, Matthew Thorpe, Dejan Slepcev, and Gustavo K Rohde. Optimal mass transport: Signal processing and machine-learning applications. IEEE signal processing magazine, 34(4):43–59, 2017.
  • [25] Wei Wang, Dejan Slepčev, Saurav Basu, John A Ozolek, and Gustavo K Rohde. A linear optimal transportation framework for quantifying and visualizing variations in sets of images. International journal of computer vision, 101(2):254–269, 2013.
  • [26] Soheil Kolouri, Yang Zou, and Gustavo K Rohde. Sliced wasserstein kernels for probability distributions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5258–5267, 2016.
  • [27] Se Rim Park, Liam Cattell, Jonathan M Nichols, Abbie Watnik, Timothy Doster, and Gustavo K Rohde. De-multiplexing vortex modes in optical communications using transport-based pattern recognition. Optics express, 26(4):4004–4022, 2018.
  • [28] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [29] Se Rim Park, Soheil Kolouri, Shinjini Kundu, and Gustavo K Rohde. The cumulative distribution transform and linear pattern classification. Applied and Computational Harmonic Analysis, 45(3):616–641, 2018.
  • [30] Ronald Newbold Bracewell and Ronald N Bracewell. The Fourier transform and its applications, volume 31999. McGraw-Hill New York, 1986.
  • [31] Eric Todd Quinto. An introduction to x-ray tomography and radon transforms. In Proceedings of symposia in Applied Mathematics, volume 63, page 1, 2006.
  • [32] Frank Natterer. The mathematics of computerized tomography. SIAM, 2001.
  • [33] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [34] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [35] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [36] Kaggle: Sign Language MNIST. Accessed: 2020-03-10.
  • [37] Carl Vondrick, Aditya Khosla, Tomasz Malisiewicz, and Antonio Torralba. Hoggles: Visualizing object detection features. In Proceedings of the IEEE International Conference on Computer Vision, pages 1–8, 2013.
  • [38] Daniel S Marcus, Tracy H Wang, Jamie Parker, John G Csernansky, John C Morris, and Randy L Buckner. Open access series of imaging studies (oasis): cross-sectional mri data in young, middle aged, nondemented, and demented older adults. Journal of cognitive neuroscience, 19(9):1498–1507, 2007.
  • [39] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • [40] Matt W Gardner and SR Dorling. Artificial neural networks (the multilayer perceptron)—a review of applications in the atmospheric sciences. Atmospheric environment, 32(14-15):2627–2636, 1998.

Appendix A Proof of property II-B.1.

The composition property of the CDT:

Let $s$ denote a normalized signal and let $\hat{s}$ be the CDT of $s$. The CDT of $s_g = g' (s \circ g)$ is given by $\hat{s}_g = g^{-1} \circ \hat{s}$.

Let $s_0$ denote a reference signal. If $\hat{s}$ and $\hat{s}_g$ denote the CDTs of $s$ and $s_g$, respectively, with respect to the reference $s_0$, we have by definition

$\int_{-\infty}^{\hat{s}(x)} s(u)\,du = \int_{-\infty}^{x} s_0(u)\,du$ and $\int_{-\infty}^{\hat{s}_g(x)} s_g(u)\,du = \int_{-\infty}^{x} s_0(u)\,du$.

By substituting $s_g = g' (s \circ g)$ we have

$\int_{-\infty}^{\hat{s}_g(x)} g'(u)\, s(g(u))\,du = \int_{-\infty}^{x} s_0(u)\,du$.  (A.1)

By the change of variables theorem, we can replace $v = g(u)$, $dv = g'(u)\,du$ in equation (A.1):

$\int_{-\infty}^{g(\hat{s}_g(x))} s(v)\,dv = \int_{-\infty}^{x} s_0(u)\,du$.

Comparing this with the defining equation for $\hat{s}$, we have that $g(\hat{s}_g(x)) = \hat{s}(x)$, and hence $\hat{s}_g = g^{-1} \circ \hat{s}$.
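The composition property can be checked numerically for a translation $g(x) = x - a$, for which the property predicts $\hat{s}_g = g^{-1} \circ \hat{s} = \hat{s} + a$ (taking a uniform reference on $(0,1)$, so that the CDT is the quantile function). The discretization below is a simple sketch, not the paper's code.

```python
import numpy as np

def cdt(signal, grid, n_q=500):
    """CDT of a nonnegative signal w.r.t. a uniform reference on (0, 1):
    the quantile function rho -> F^{-1}(rho) of the normalized signal,
    evaluated at n_q quantile levels."""
    dx = np.diff(grid)
    mass = np.sum(0.5 * (signal[1:] + signal[:-1]) * dx)   # trapezoid rule
    pdf = signal / mass
    cdf = np.concatenate(([0.0],
                          np.cumsum(0.5 * (pdf[1:] + pdf[:-1]) * dx)))
    rho = np.linspace(1e-3, 1 - 1e-3, n_q)
    return rho, np.interp(rho, cdf, grid)

# For a translation g(x) = x - a: s_g(x) = g'(x) s(g(x)) = s(x - a),
# and the composition property predicts s_hat_g = s_hat + a.
a = 0.7
x = np.linspace(-8.0, 8.0, 4001)
s = np.exp(-x**2)                      # template signal (un-normalized)
s_g = np.exp(-(x - a)**2)              # its translation by a
rho, s_hat = cdt(s, x)
_, sg_hat = cdt(s_g, x)
err = np.max(np.abs(sg_hat - (s_hat + a)))   # should be small
```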

Appendix B Proof of property II-D.1.

The composition property of the R-CDT:

Let $s$ denote a normalized image and let $\tilde{s}$ and $\hat{s}$ denote the Radon transform and the R-CDT of $s$, respectively. For $s_g$ whose Radon transform is $\tilde{s}_g(t,\theta) = g'_\theta(t)\, \tilde{s}(g_\theta(t), \theta)$, the R-CDT of $s_g$ is given by $\hat{s}_g(t,\theta) = g_\theta^{-1}(\hat{s}(t,\theta))$.

Let $s_0$ denote a reference image. Let $\tilde{s}$ and $\tilde{s}_0$ denote the Radon transforms of $s$ and $s_0$, respectively, and let $\hat{s}$ and $\hat{s}_g$ denote the CDTs of $\tilde{s}$ and $\tilde{s}_g$, respectively, with respect to the reference $\tilde{s}_0$. Then, for every fixed angle $\theta$, we have that

$\int_{-\infty}^{\hat{s}_g(t,\theta)} \tilde{s}_g(u,\theta)\,du = \int_{-\infty}^{t} \tilde{s}_0(u,\theta)\,du$.

If we substitute $\tilde{s}_g(u,\theta) = g'_\theta(u)\,\tilde{s}(g_\theta(u),\theta)$, then for every $\theta$ we have

$\int_{-\infty}^{\hat{s}_g(t,\theta)} g'_\theta(u)\,\tilde{s}(g_\theta(u),\theta)\,du = \int_{-\infty}^{t} \tilde{s}_0(u,\theta)\,du$.  (B.1)

By the change of variables theorem, we can replace $v = g_\theta(u)$, $dv = g'_\theta(u)\,du$ in equation (B.1):

$\int_{-\infty}^{g_\theta(\hat{s}_g(t,\theta))} \tilde{s}(v,\theta)\,dv = \int_{-\infty}^{t} \tilde{s}_0(u,\theta)\,du$.  (B.2)

From equation (B.2) and the defining equation for $\hat{s}$, we have that $g_\theta(\hat{s}_g(t,\theta)) = \hat{s}(t,\theta)$, and hence $\hat{s}_g(t,\theta) = g_\theta^{-1}(\hat{s}(t,\theta))$.

Appendix C Proof of Property II-B.2

Recall that given two signals $s_1$ and $s_2$ with cumulative distribution functions $F_1$ and $F_2$, the Wasserstein metric between them is defined in the following way:

$W_2^2(s_1, s_2) = \int_0^1 \left(F_1^{-1}(t) - F_2^{-1}(t)\right)^2 dt$,

and, with respect to a fixed reference $s_0$, the CDT of $s_i$ satisfies $\hat{s}_i = F_i^{-1} \circ F_0$, where $F_0$ is the cumulative distribution function of $s_0$.

Recall that an isometric embedding between two metric spaces is an injective mapping that preserves distances. Define the embedding $\phi$ by the correspondence $s \mapsto \hat{s}\sqrt{s_0}$; it is left to show that

$W_2(s_1, s_2) = \|\phi(s_1) - \phi(s_2)\|_{L^2}$

for all signals $s_1, s_2$. Let $t = F_0(x)$, so that $dt = s_0(x)\,dx$. Using the change of variables formula,

$W_2^2(s_1, s_2) = \int_0^1 \left(F_1^{-1}(t) - F_2^{-1}(t)\right)^2 dt = \int \left(F_1^{-1}(F_0(x)) - F_2^{-1}(F_0(x))\right)^2 s_0(x)\,dx = \int \left(\hat{s}_1(x) - \hat{s}_2(x)\right)^2 s_0(x)\,dx = \|\phi(s_1) - \phi(s_2)\|_{L^2}^2$.

Appendix D Proof of Property II-D.2

Recall that given two images , using the correspondence in (6) the Sliced Wasserstein metric