I Introduction
Image classification refers to the process of automatically predicting the class of an image based on the numerical content of its pixel values. Automated image classification methods have been used to detect cancer from microscopy images of tumor specimens [1][2], detect and quantify atrophy from magnetic resonance images of the human brain [3][4], identify and authenticate a person from cell phone camera images [5], and in numerous other applications in computer vision, medical imaging, automated driving, and beyond.
While many methods for automated image classification have been developed, those based on supervised learning have attracted the most attention, past and present, because, generally speaking, they are more accurate than the alternatives. In supervised learning, a set of labeled example images (known as training data) is utilized to estimate the values of the parameters of a mathematical model to be used for classification. Given an unknown test image, the goal of the classification method is to automatically assign the label or class of that image.
An extensive set of supervised learning-based classification algorithms has been proposed in the past (see [6, 7, 8] for a few reviews on the subject). We divide these algorithms into two broad categories: 1) learning of classifiers on hand-engineered features and 2) end-to-end learning of features and classifiers, e.g., hierarchical neural networks. Certainly, many algorithms exist that may fit into more than one category, while other algorithms may not fit into either. However, for the purposes of our discussion we focus on these two broad categories.
Image classification methods based on hand-engineered features, perhaps the first to arise [9], generally work in a two-step process: step one is the extraction of numerical features from the pixel data, and step two is the application of statistical classification methods. A large number of numerical features have been engineered in the past to represent the information in a given image, including Haralick features, Gabor features, shape features [10][11], and numerous others [6]. These are then combined with many different multivariate regression-based classification methods, including linear discriminant analysis [12][13][14][15][16][17], as well as their kernel versions.

Methods based on hierarchical neural networks [18], such as convolutional neural networks (CNNs) [7][19], have been widely studied recently given that they have achieved top performance in certain classification tasks [18][20][21]
. In contrast to methods based on hand-engineered features, CNNs typically combine both feature extraction and classification within one consistent framework, i.e., end-to-end learning. The unprecedented performance of deep neural networks on a wide variety of tasks has made them quintessential to modern supervised learning on images. These methods, however, are: 1) computationally expensive, often requiring graphics processing units (GPUs) to train and deploy, 2) data-hungry, requiring thousands of labeled images per class, and 3) often vulnerable to out-of-distribution samples, e.g., adversarial attacks.
Specific methods that fall outside these two broad categories have been few. One such example is the classification method relying on the RCDT, a recently developed image transform [22], for extracting information from pixel intensities. Unlike most of the numerical feature-based methods described above, this operation is invertible, as the RCDT is an invertible image transform, and thus the RCDT can be viewed as a mathematical image representation method. The RCDT developed in [22] has connections to optimal transport theory [23][24]. In particular, the RCDT can be interpreted as the application of the linear optimal transportation concept [25] to the sliced Wasserstein distance [26]. It can also be interpreted as a nonlinear kernel [26]. The RCDT and linear optimal transport models [25] have been applied to image classification before, in combination with linear classifiers such as Fisher discriminant analysis and support vector machines, and their respective kernel techniques [22][26][27]. While successful in certain applications [27], this approach of classification using the RCDT has failed to produce state-of-the-art classification results in certain other applications (see Figure 3 from [6]).
Our contribution
The approach we describe here expands and improves upon the RCDT-based image classification technique, and provides a simple non-iterative training method. We model each image class as being generated by the existence of a template form (e.g., a template for the digit '1' in an MNIST [28] image classification problem) observed under different poses and other intensity rearrangements, which are explained below. We then utilize the properties of the cumulative distribution transform (CDT) [29] and the Radon cumulative distribution transform (RCDT) [22] to propose that the rearrangements responsible for generating each signal class be modeled as a convex subspace in the transform domain. Based on this model, we propose a non-iterative training and testing algorithm and demonstrate its strengths and weaknesses in comparison with existing deep learning alternatives. Figure 1 shows a system diagram outlining the main computational modeling steps in the proposed method. In comparison with deep learning-based methods, the proposed method is computationally cheap, it is label efficient, it has shown robustness under out-of-distribution setups, and it offers competitive performance on a broad class of image classification problems.

The paper is organized as follows: Section II presents a preliminary overview of a family of transport-based nonlinear transforms: the CDT for 1D signals and the RCDT for 2D images. The classification problem is stated in Section III, with the proposed solution in Section IV. Descriptions of the experimental setup and the datasets used are available in Section V. Experimental results are presented in Section VI, with a discussion of the results in Section VII. Finally, Section VIII offers concluding remarks.
TABLE I: Description of symbols

Symbols | Description
$s(x)$, $s^{(k)}(x)$ | Signal / image (the superscript $(k)$ denotes the class label)
$\Omega_s$ | Domain of $s$
$\widetilde{s}(t,\theta)$ | Radon transform of $s$
$\widehat{s}$ | CDT / RCDT transform of $s$
$\mathscr{R}$ / $\mathscr{R}^{-1}$ | Forward / inverse Radon transform operation
$g(x)$ | Strictly increasing and differentiable function
$g_\theta(t)$ | Strictly increasing and differentiable function, indexed by an angle $\theta$
$s \circ g$ | $s(g(x))$: composition of $s(x)$ with $g(x)$
$\widetilde{s} \circ g_\theta$ | $\widetilde{s}(g_\theta(t),\theta)$: composition of $\widetilde{s}(t,\theta)$ with $g_\theta(t)$ along the $t$ dimension of $\widetilde{s}$
$\{s : \int_{\Omega_s} s(x)\,dx = 1\}$ | Set of functions with unit integral over some interval $\Omega_s$
$\{s : s > 0,\ \int_{\Omega_s} s(x)\,dx = 1\}$ | Set of positive functions with unit integral over some interval $\Omega_s$
$\mathcal{T}$ | Set of all possible increasing diffeomorphisms between two domains $\Omega_s$ and $\Omega_z$ in $\mathbb{R}$
II Preliminaries
II-A Notation
Throughout the manuscript we deal with signals assumed to be square integrable on their respective domains. That is, we assume that $s \in L^2(\Omega_s)$, where $\Omega_s$ is the domain over which the signal $s$ is defined. In addition, we at times make use of the common notation $\|s\|^2 = \langle s, s \rangle$, where $\langle \cdot, \cdot \rangle$ is the inner product. Signals are assumed to be real, so the complex conjugate does not play a role. We apply the same notation for functions whose input argument is two-dimensional, i.e., images. Let $\Omega_s \subseteq \mathbb{R}^2$. A 2D continuous function representing a continuous image is denoted $s(\mathbf{x})$, $\mathbf{x} \in \Omega_s$. Signals or images are denoted $s^{(k)}$ when the class information is available, where the superscript $(k)$ represents the class label.
Below we also make use of one-dimensional (1D) diffeomorphisms (one-to-one mapping functions), which are denoted as $g(x)$ for signals and $g_\theta(t)$ when they need to be parameterized by an angle $\theta$. The set of all possible diffeomorphisms between two domains $\Omega_s$ and $\Omega_z$ is denoted as $\mathcal{T}$. Finally, at times we also utilize the '$\circ$' operator to denote composition. A summary of the symbols and notation used can be found in Table I.
II-B The Cumulative Distribution Transform (CDT)
The CDT [29] is an invertible nonlinear 1D signal transform from the space of smooth probability densities to the space of diffeomorphisms. The CDT morphs a given input signal, defined as a probability density function (PDF), into another PDF in such a way that the Wasserstein distance between them is minimized. More formally, let $s(x), x \in \Omega_s$ and $s_0(x), x \in \Omega_{s_0}$ define a given signal and a reference signal, respectively, which we consider to be appropriately normalized such that $s > 0$, $s_0 > 0$, and $\int_{\Omega_s} s(x)\,dx = \int_{\Omega_{s_0}} s_0(x)\,dx = 1$. The forward CDT transform of $s$ with respect to $s_0$ is given by the strictly increasing function $\widehat{s}(x)$ that satisfies

$$\int_{-\infty}^{\widehat{s}(x)} s(u)\,du = \int_{-\infty}^{x} s_0(u)\,du.$$

As described in detail in [29], the CDT is a nonlinear and invertible operation, with the inverse being

$$s(x) = \frac{d\widehat{s}^{-1}(x)}{dx}\, s_0\!\left(\widehat{s}^{-1}(x)\right), \quad x \in \Omega_s.$$
Moreover, like the Fourier transform [30], for example, the CDT has a number of properties that help render signal and image classification problems easier to solve.

Property II-B.1 (Composition): Let $s(x)$ denote a normalized signal and let $\widehat{s}(x)$ be the CDT of $s(x)$. The CDT of $s_g = g'\,(s \circ g)$ is given by

$$\widehat{s_g} = g^{-1} \circ \widehat{s}. \qquad (1)$$

Here, $g \in \mathcal{T}$ is an invertible and differentiable function (diffeomorphism), $g' = dg(x)/dx$, and '$\circ$' denotes the composition operator, with $s \circ g = s(g(x))$. For a proof, see Appendix A in the supplementary materials.

The CDT composition property implies that variations in a signal caused by applying $g(x)$ to the independent variable will change only the dependent variable in CDT space. This property is illustrated in Figure 2, where variations along both the independent and dependent axis directions in original signal space become changes solely along the dependent axis in CDT space.
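To make the composition property concrete, the following sketch approximates the CDT of a discrete signal numerically and checks the translation case. This is our own illustrative code under simplifying assumptions (a uniform reference density and a CDF-interpolation implementation of the CDT), not the authors' implementation; the function name `cdt` is hypothetical.

```python
import numpy as np

def cdt(s, s0, x):
    """Approximate the forward CDT of density s w.r.t. reference s0 on grid x.

    The CDT s_hat satisfies F_s(s_hat(x)) = F_{s0}(x), i.e.
    s_hat = F_s^{-1} o F_{s0}, computed here by interpolating the CDFs.
    """
    s = s / np.trapz(s, x)            # normalize to unit area
    s0 = s0 / np.trapz(s0, x)
    Fs = np.cumsum(s);  Fs = Fs / Fs[-1]
    F0 = np.cumsum(s0); F0 = F0 / F0[-1]
    return np.interp(F0, Fs, x)       # invert F_s at the reference CDF values

x = np.linspace(-10, 10, 2001)
s0 = np.ones_like(x)                          # uniform reference
s = np.exp(-0.5 * x**2)                       # template: a Gaussian bump
tau = 2.0
s_g = np.exp(-0.5 * (x - tau)**2)             # s_g = s o g with g(x) = x - tau

s_hat, sg_hat = cdt(s, s0, x), cdt(s_g, s0, x)
# Composition property (1): with g(x) = x - tau, g^{-1}(y) = y + tau, so a
# translation adds a constant in CDT space, i.e. sg_hat ≈ s_hat + tau.
mid = slice(500, 1500)                        # interior, away from flat CDF tails
print(np.max(np.abs(sg_hat[mid] - s_hat[mid] - tau)))
```

With $g(x) = x - \tau$, property (1) predicts $\widehat{s_g} = \widehat{s} + \tau$; the final check confirms this up to discretization error away from the nearly flat tails of the CDF.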
Property II-B.2 (Embedding):
The CDT induces an isometric embedding between the space of 1D signals with the 2-Wasserstein metric and the space of their CDT transforms with a weighted-Euclidean metric [22][29], i.e.,

$$W_2^2(s_1, s_2) = \int_{\Omega_{s_0}} \left( \widehat{s_1}(x) - \widehat{s_2}(x) \right)^2 s_0(x)\,dx, \qquad (2)$$

for all signals $s_1, s_2$. That is to say, if we wish to use the Wasserstein distance as a measure of similarity between $s_1$ and $s_2$, we can compute it simply as a weighted Euclidean norm in CDT space. For a proof, see Appendix C in the supplementary materials.
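The embedding property can also be verified numerically: for the Gaussians $N(0,1)$ and $N(3,2)$ the 2-Wasserstein distance has the closed form $W_2^2 = (0-3)^2 + (1-2)^2 = 10$, and the weighted Euclidean norm of the CDT difference should reproduce it. The sketch below is illustrative only, under the same simplifying assumptions as before (uniform reference; CDF interpolation; the helper name `cdt` is ours).

```python
import numpy as np

def cdt(s, s0, x):
    """CDT via interpolation of CDFs (uniform-reference sketch)."""
    s = s / np.trapz(s, x); s0 = s0 / np.trapz(s0, x)
    F = np.cumsum(s);  F = F / F[-1]
    F0 = np.cumsum(s0); F0 = F0 / F0[-1]
    return np.interp(F0, F, x)

x = np.linspace(-15.0, 15.0, 4001)
s0 = np.ones_like(x) / 30.0                                       # uniform reference
s1 = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)                     # N(0, 1)
s2 = np.exp(-0.5 * ((x - 3) / 2)**2) / (2 * np.sqrt(2 * np.pi))   # N(3, 2)

# Property (2): W2^2(s1, s2) = integral of (s1_hat - s2_hat)^2 weighted by s0
w2_sq = np.trapz((cdt(s1, s0, x) - cdt(s2, s0, x))**2 * s0, x)
print(w2_sq)   # analytically, W2^2 = 3^2 + 1^2 = 10
```

The printed value approaches 10 as the grid is refined, illustrating that the Wasserstein distance reduces to a weighted Euclidean norm in CDT space.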
II-C The Radon transform
The Radon transform of an image $s(\mathbf{x})$, $\mathbf{x} \in \Omega_s \subseteq \mathbb{R}^2$, which we denote by $\widetilde{s} = \mathscr{R}(s)$, is defined as

$$\widetilde{s}(t,\theta) = \int_{\Omega_s} s(\mathbf{x})\,\delta\!\left(t - \mathbf{x} \cdot \boldsymbol{\xi}_\theta\right)\,d\mathbf{x}. \qquad (3)$$

Here, $t$ is the perpendicular distance of a line from the origin and $\boldsymbol{\xi}_\theta = [\cos\theta, \sin\theta]^T$, where $\theta$ is the angle over which the projection is taken.
II-D Radon Cumulative Distribution Transform (RCDT)
The CDT framework was extended to 2D patterns (images as normalized density functions) through the sliced-Wasserstein distance in [22], where it was termed the RCDT. The main idea behind the RCDT is to first obtain a family of one-dimensional representations of a two-dimensional probability measure (e.g., an image) through the Radon transform, and then apply the CDT over the $t$ dimension in Radon transform space. More formally, let $s(\mathbf{x})$ and $s_0(\mathbf{x})$ define a given image and a reference image, respectively, which we consider to be appropriately normalized. The forward RCDT of $s$ with respect to $s_0$ is given by the measure preserving function $\widehat{s}(t,\theta)$ that satisfies

$$\int_{-\infty}^{\widehat{s}(t,\theta)} \widetilde{s}(u,\theta)\,du = \int_{-\infty}^{t} \widetilde{s_0}(u,\theta)\,du, \quad \forall \theta \in [0,\pi]. \qquad (6)$$

As in the case of the CDT, a transformed signal in RCDT space can be recovered via the following inverse formula [22],

$$s(\mathbf{x}) = \mathscr{R}^{-1}\!\left( \frac{\partial \widehat{s}^{-1}(t,\theta)}{\partial t}\, \widetilde{s_0}\!\left( \widehat{s}^{-1}(t,\theta), \theta \right) \right).$$

The process of calculating the RCDT is shown in Figure 3. As with the CDT, the RCDT has a couple of properties, outlined below, which will be of interest when classifying images.
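Numerically, the RCDT can be sketched by computing projections of the image per angle and applying a 1D CDT to each projection. The code below is purely illustrative, not the authors' implementation: the function names are ours, and `scipy.ndimage.rotate` followed by a column sum is used as a simple rotate-and-sum stand-in for the Radon transform (which clips the corners of the reference image, an approximation acceptable for this sketch).

```python
import numpy as np
from scipy.ndimage import rotate

def cdt_1d(p, p0, t):
    """1D CDT of density p w.r.t. reference p0 on grid t (CDF interpolation)."""
    p = p / np.trapz(p, t); p0 = p0 / np.trapz(p0, t)
    F = np.cumsum(p);  F = F / F[-1]
    F0 = np.cumsum(p0); F0 = F0 / F0[-1]
    return np.interp(F0, F, t)

def rcdt(img, ref_img, angles_deg, eps=1e-6):
    """Sketch of the RCDT: a projection per angle, then a CDT along t."""
    t = np.arange(img.shape[0], dtype=float)
    cols = []
    for a in angles_deg:
        # rotate-and-sum as an approximate Radon projection at angle a
        proj = rotate(img, a, reshape=False, order=1).sum(axis=0) + eps
        proj0 = rotate(ref_img, a, reshape=False, order=1).sum(axis=0) + eps
        cols.append(cdt_1d(proj, proj0, t))
    return np.stack(cols, axis=1)        # shape: (len(t), number of angles)

angles = np.linspace(0.0, 180.0, 45, endpoint=False)
img = np.zeros((64, 64)); img[20:40, 24:44] = 1.0   # a simple test image
ref = np.ones((64, 64))                             # uniform reference image
s_hat = rcdt(img, ref, angles)
print(s_hat.shape)   # one CDT column per projection angle
```

Each column of `s_hat` is an increasing function of $t$, as the CDT of a (strictly positive) projection must be.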
Property II-D.1 (Composition): Let $s(\mathbf{x})$ denote an appropriately normalized image, and let $\widetilde{s}(t,\theta)$ and $\widehat{s}(t,\theta)$ be the Radon transform and the RCDT transform of $s$, respectively. The RCDT transform of $s_g = \mathscr{R}^{-1}\!\left( g_\theta'\,(\widetilde{s} \circ g_\theta) \right)$ is given by

$$\widehat{s_g}(t,\theta) = g_\theta^{-1}\!\left( \widehat{s}(t,\theta) \right), \qquad (7)$$

where $g_\theta' = \partial g_\theta(t)/\partial t$ and $(\widetilde{s} \circ g_\theta)(t,\theta) = \widetilde{s}(g_\theta(t),\theta)$. Here, for a fixed $\theta$, $g_\theta(t)$ can be thought of as an increasing and differentiable function with respect to $t$. The above equation hence follows from the composition property for the 1D CDT. For a proof, see Appendix B in the supplementary materials.

The RCDT composition property implies that variations along both the independent and dependent axis directions in an image, caused by applying $g_\theta(t)$ to the independent variable of its Radon transform, become changes solely along the dependent variable in RCDT space.
Property II-D.2 (Embedding):
The RCDT induces an isometric embedding between the space of images with the sliced-Wasserstein metric and the space of their RCDT transforms with a weighted-Euclidean metric, i.e.,

$$SW_2^2(s_1, s_2) = \int_0^\pi \int_{\Omega_{s_0}} \left( \widehat{s_1}(t,\theta) - \widehat{s_2}(t,\theta) \right)^2 \widetilde{s_0}(t,\theta)\,dt\,d\theta, \qquad (8)$$

for all images $s_1$ and $s_2$. For a proof, see Appendix D in the supplementary materials.
As in the case of the 1D CDT shown above, the property above naturally links the RCDT and sliced-Wasserstein distances for PDFs and affords us a simple means of computing similarity among images [22]. We remark that throughout this manuscript we use the notation $\widehat{s}$ for both the CDT and RCDT transforms of a signal or image $s$, taken with respect to a fixed reference signal or image $s_0$ when a reference is not specified.
III Generative Model and Problem Statement
Using the notation established above, we are ready to discuss a generative model-based problem statement for the type of classification problems considered in this paper. We begin by noting that in many applications we are concerned with classifying image or signal patterns that are instances of a certain prototype (or template) observed under some, often unknown, deformation. Consider the problem of classifying handwritten digits (e.g., the MNIST dataset [28]). A good model for each class in such a dataset is to assume that each observed digit image is an instance of a template (or templates) observed under some (unknown) deformation or similar variation or confound. For example, a generative model for the set of images of the digit 1 could be a fixed pattern for the digit 1, but observed under different translations – the digit can be positioned randomly within the field of view of the image. Alternatively, the digit could also be observed with different sizes, or slight deformations. The generative models stated below for 1D and 2D formalize these statements.
Example 1 (1D generative model with translation).
Consider a 1D signal pattern denoted as $\varphi^{(k)}$ (the superscript $(k)$ here denotes the class in a classification problem), observed under a random translation parameter $\mu$. In this case, we can mathematically represent the situation by defining the set of all possible functions $g_j(x) = x - \mu_j$, with $\mu_j$ being a random variable whose distribution is typically unknown. A random observation (randomly translated pattern) can then be written mathematically as $s_j(x) = g_j'(x)\,\varphi^{(k)}(g_j(x)) = \varphi^{(k)}(x - \mu_j)$. Note that in this case $g_j' = 1$, and thus the generative model simply amounts to random translation of a template pattern. Figure 4 depicts this situation.

The example above (summarized in Figure 4) can be expressed in a more general form. Let $\mathcal{G} \subset \mathcal{T}$ denote a set of 1D spatial transformations of a specific kind (e.g., the set of affine transformations). We then use these transformations to provide a more general definition of a mass (signal intensity) preserving generative data model.
Definition III.1 (1D generative model).
Let $\mathcal{G} \subset \mathcal{T}$. The 1D mass (signal intensity) preserving generative model for class $k$ is defined to be the set

$$\mathbb{S}^{(k)} = \left\{ s_j \,\middle|\, s_j = g_j'\,\left( \varphi^{(k)} \circ g_j \right),\ g_j \in \mathcal{G} \right\}. \qquad (9)$$

The notation $s_j$ here is meant to denote the $j$-th signal from the $k$-th class. The derivative term $g_j'$ preserves the normalization of the signals. This extension allows us to define and discuss problems where the confound goes beyond a simple translation model.
With the definition of the 2-dimensional Radon transform from Section II-C, we are now ready to state the 2-dimensional version of the generative data model used throughout the paper:
Definition III.2 (2D generative model).
Let $\mathcal{G} \subset \mathcal{T}$ be our set of confounds. The 2D mass (image intensity) preserving generative model for class $k$ is defined to be the set

$$\mathbb{S}^{(k)} = \left\{ s_j \,\middle|\, s_j = \mathscr{R}^{-1}\!\left( (g_\theta^j)'\,\left( \widetilde{\varphi}^{(k)} \circ g_\theta^j \right) \right),\ g_\theta^j \in \mathcal{G} \right\}. \qquad (10)$$

We note that the generative model above can yield a nonconvex set, depending on the choice of the template function $\varphi^{(k)}$ and the confound category $\mathcal{G}$. Note that we use the same notation $\mathbb{S}^{(k)}$ for both the 1D and 2D versions of the set; the intended meaning will each time be clear from the context.
We are now ready to give a mathematical description of a generative model-based problem statement using the definitions above:
Definition III.3 (Classification problem).
Let $\mathcal{G} \subset \mathcal{T}$ be a set of confounds, and let $\mathbb{S}^{(k)}$ denote the generative model for class $k$ as given in Definition III.1 (1D) or Definition III.2 (2D). Given training samples $\{s_1^{(k)}, s_2^{(k)}, \ldots\}$ from each class $k$, determine the class label of an unknown signal or image $s$.

It is important to note that the generative model discussed yields nonconvex (and hence nonlinear) signal classes (see Figure 5, top panel). We express this fact mathematically as: for arbitrary $s_i, s_j \in \mathbb{S}^{(k)}$, the combination $\alpha s_i + \beta s_j$, for $\alpha, \beta \geq 0$, may not necessarily be in $\mathbb{S}^{(k)}$. The situation is similar for images (the 2D case). Convexity, on the other hand, means that the weighted sum of samples does remain in the set; this property greatly simplifies the classification problem, as will be shown in the next section.
IV Proposed Solution
We postulate that the CDT and RCDT introduced earlier can be used to drastically simplify the solution of the classification problem posed in Definition III.3. While the generative model discussed above generates nonconvex (hence nonlinear) signal and image classes, the situation can change by transforming the data using the CDT (for 1D signals) or the RCDT (for 2D images). We start by analyzing the one-dimensional generative model from Definition III.1.
Employing the composition property of the CDT (see Section II-B) on the 1D generative model stated in equation (9), we have that

$$\widehat{s_j} = g_j^{-1} \circ \widehat{\varphi}^{(k)}, \qquad (11)$$

and thus

$$\widehat{\mathbb{S}}^{(k)} = \left\{ \widehat{s_j} \,\middle|\, \widehat{s_j} = g_j^{-1} \circ \widehat{\varphi}^{(k)},\ g_j \in \mathcal{G} \right\}.$$
Thus we have the following lemma:
Lemma IV.1.
If $\mathcal{G}^{-1} = \{g^{-1} \mid g \in \mathcal{G}\}$ is convex, the set $\widehat{\mathbb{S}}^{(k)}$ is convex.
Proof.
Let $\varphi^{(k)}$ be a template signal defined as a PDF. For $\widehat{s_i}, \widehat{s_j} \in \widehat{\mathbb{S}}^{(k)}$, let $\widehat{s_i} = g_i^{-1} \circ \widehat{\varphi}^{(k)}$ and $\widehat{s_j} = g_j^{-1} \circ \widehat{\varphi}^{(k)}$, with $g_i, g_j \in \mathcal{G}$. Then, using the composition property of the CDT, we have that $\alpha \widehat{s_i} + (1-\alpha)\widehat{s_j} = \left( \alpha g_i^{-1} + (1-\alpha) g_j^{-1} \right) \circ \widehat{\varphi}^{(k)}$ for $\alpha \in [0,1]$. Since $\mathcal{G}^{-1}$ is convex, $\alpha g_i^{-1} + (1-\alpha) g_j^{-1} \in \mathcal{G}^{-1}$, and hence $\alpha \widehat{s_i} + (1-\alpha)\widehat{s_j} \in \widehat{\mathbb{S}}^{(k)}$. It follows that $\widehat{\mathbb{S}}^{(k)}$ is convex.
∎
Remark IV.2.
Let $\mathbb{S}^{(p)}$ and $\mathbb{S}^{(q)}$ represent two generative models. If $\mathbb{S}^{(p)} \cap \mathbb{S}^{(q)} = \emptyset$, then $\widehat{\mathbb{S}}^{(p)} \cap \widehat{\mathbb{S}}^{(q)} = \emptyset$.
This follows from the fact that the CDT is a one-to-one map between the space of probability measures and the space of 1D diffeomorphisms. As such, the CDT operation is one-to-one, and therefore there exists no $\widehat{s} \in \widehat{\mathbb{S}}^{(p)} \cap \widehat{\mathbb{S}}^{(q)}$.
Lemma IV.1 above states that if the set of spatial transformations formed by taking the elements of $\mathcal{G}$ and inverting them (denoted $\mathcal{G}^{-1}$) is convex, then the generative model will be convex in signal transform space. The situation is depicted in Figure 5. The top part shows a four-class generative model that is nonlinear/nonconvex. When examined in transform space, however, the data geometry simplifies in such a way that signals can be added together to generate other signals in the same class – the classes become convex in transform space.
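This convexification can be observed numerically: the pixel-space average of two translated copies of a template is bimodal (it leaves the class), while the average of their CDTs is, up to discretization error, the CDT of another translated copy. Below is a small sketch under the same uniform-reference assumptions as before; the helper `cdt` is our own illustrative code, not the authors' implementation.

```python
import numpy as np

def cdt(s, s0, x):
    """CDT via interpolation of CDFs (uniform-reference sketch)."""
    s = s / np.trapz(s, x); s0 = s0 / np.trapz(s0, x)
    F = np.cumsum(s);  F = F / F[-1]
    F0 = np.cumsum(s0); F0 = F0 / F0[-1]
    return np.interp(F0, F, x)

x = np.linspace(-10, 10, 2001)
s0 = np.ones_like(x)
bump = lambda mu: np.exp(-0.5 * (x - mu)**2)    # translated template instances

s1, s2 = bump(-3.0), bump(3.0)
pixel_avg = 0.5 * (s1 + s2)                     # bimodal: leaves the class
cdt_avg = 0.5 * (cdt(s1, s0, x) + cdt(s2, s0, x))

# In transform space the average matches the CDT of the bump at mu = 0:
mid = slice(500, 1500)                          # interior of the grid
err = np.max(np.abs(cdt_avg[mid] - cdt(bump(0.0), s0, x)[mid]))
print(err)      # small: the class is convex in CDT space
```

The pixel-space average has a dip between the two modes, whereas the CDT-space average corresponds to the template translated by the mean of the two translations, illustrating the geometry in Figure 5.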
The analysis above can be extended to the case of the 2D generative model (Definition III.2) through the RCDT. Employing the composition property of the RCDT (see Section II-D) on the 2D generative model stated in equation (10), we have that

$$\widehat{\mathbb{S}}^{(k)} = \left\{ \widehat{s_j} \,\middle|\, \widehat{s_j}(t,\theta) = (g_\theta^j)^{-1}\!\left( \widehat{\varphi}^{(k)}(t,\theta) \right),\ g_\theta^j \in \mathcal{G} \right\}. \qquad (12)$$

Lemma IV.1 and Remark IV.2 hold true in the 2-dimensional RCDT case as well. Thus, if the set $\mathcal{G}^{-1}$ of inverted confounds (in 1D or in 2D) is convex, the CDT/RCDT transform simplifies the data geometry in such a way that signal or image classes become convex in transform space. Figure 1(a) depicts the situation.
We use this information to propose a simple non-iterative training algorithm (described in more detail in Section IV-A) by estimating a projection matrix that projects each (transform space) sample onto $\widehat{\mathbb{V}}^{(k)}$, for all classes $k$, where $\widehat{\mathbb{V}}^{(k)}$ denotes the subspace generated by the convex set $\widehat{\mathbb{S}}^{(k)}$ as follows:

$$\widehat{\mathbb{V}}^{(k)} = \text{span}\left( \widehat{\mathbb{S}}^{(k)} \right) = \left\{ \sum_{j \in J} \alpha_j \widehat{s_j} \,\middle|\, \alpha_j \in \mathbb{R},\ \widehat{s_j} \in \widehat{\mathbb{S}}^{(k)},\ J \text{ finite} \right\}. \qquad (13)$$

Figure 1(b) provides a pictorial representation of $\widehat{\mathbb{V}}^{(k)}$. We assume that the subspaces generated by different classes do not overlap other than at the origin, i.e., $\widehat{\mathbb{V}}^{(p)} \cap \widehat{\mathbb{V}}^{(q)} = \{0\}$ for $p \neq q$.
As for a test procedure for determining the class of some unknown signal or image $s$, under the assumption that $s \in \mathbb{S}^{(k)}$ for some (unknown) class $k$, it suffices to measure the distance between $\widehat{s}$ and the nearest point in each subspace $\widehat{\mathbb{V}}^{(k)}$ corresponding to the generative model $\mathbb{S}^{(k)}$. If the test sample was generated according to the generative model for one of the classes, then there will exist exactly one class for which $d^2(\widehat{s}, \widehat{\mathbb{V}}^{(k)}) = 0$, where $d^2(\widehat{s}, \widehat{\mathbb{V}}^{(k)})$ is the squared Euclidean distance between $\widehat{s}$ and the nearest point in $\widehat{\mathbb{V}}^{(k)}$. Therefore, under the assumption that the testing sample at hand was generated according to one of the (unknown) classes as described in Definition III.3, the class of the unknown sample can be decoded by solving

$$\arg\min_k\, d^2\!\left( \widehat{s}, \widehat{\mathbb{V}}^{(k)} \right). \qquad (14)$$
Finally, note that due to Property II-D.2 we also have that

$$d^2\!\left( \widehat{s}, \widehat{\mathbb{V}}^{(k)} \right) \approx \min_{v \in \mathbb{V}^{(k)}} SW_2^2(s, v),$$

with $\mathbb{V}^{(k)}$ the set of images whose RCDT transforms lie in $\widehat{\mathbb{V}}^{(k)}$. In words, the RCDT nearest subspace method proposed above can be considered approximately equivalent to a nearest (in the sense of the sliced-Wasserstein distance) subset method in image space, with the subset given by the generative model stated in Definition III.2.
IV-A Training algorithm
Using the principles and assumptions laid out above, the algorithm we propose estimates the subspace $\widehat{\mathbb{V}}^{(k)}$ corresponding to the transform space, given sample data $\{s_1^{(k)}, s_2^{(k)}, \ldots\}$. Naturally, the first step is to transform the training data to obtain $\{\widehat{s}_1^{(k)}, \widehat{s}_2^{(k)}, \ldots\}$. We then approximate $\widehat{\mathbb{V}}^{(k)}$ as

$$\widehat{\mathbb{V}}^{(k)} \approx \text{span}\left\{ \widehat{s}_1^{(k)}, \widehat{s}_2^{(k)}, \ldots \right\}.$$

Given the composition properties of the CDT and RCDT (see Properties II-B.1 and II-D.1), it is also possible to enrich $\widehat{\mathbb{V}}^{(k)}$ in such a way that it automatically includes samples undergoing certain specific deformations, without explicitly training with samples under said deformations. The spanning sets corresponding to two such deformations, image domain translation and isotropic scaling, are derived below:

• Translation: let $g(\mathbf{x}) = \mathbf{x} - \mathbf{x}_0$ be the translation by $\mathbf{x}_0 \in \mathbb{R}^2$ and $s_g(\mathbf{x}) = \det(J_g)\,s(g(\mathbf{x}))$, where $J_g$ denotes the Jacobian matrix of $g$. Following [22], we have that $\widehat{s_g}(t,\theta) = \widehat{s}(t,\theta) + \mathbf{x}_0 \cdot \boldsymbol{\xi}_\theta$, where $\boldsymbol{\xi}_\theta = [\cos\theta, \sin\theta]^T$. We define the spanning set for translation in the transform domain as $\mathbb{U}_T = \{u_1, u_2\}$, where $u_1(t,\theta) = \cos\theta$ and $u_2(t,\theta) = \sin\theta$.

• Isotropic scaling: let $g(\mathbf{x}) = a\mathbf{x}$ and $s_g(\mathbf{x}) = a^2 s(a\mathbf{x})$, which is the normalized dilatation of $s$ by $a$, where $a > 0$. Then, according to [22], $\widehat{s_g} = \widehat{s}/a$, i.e., a scalar multiplication. Therefore, an additional spanning set is not required here, as scalar multiples of $\widehat{s}$ are already contained in the span of the training samples.
Note that the spanning sets are not limited to translation and isotropic scaling. Spanning sets for other deformations may be defined in a similar manner. However, deformation spanning sets other than those for translation and isotropic scaling are not used here and are left for future exploration.
In light of the above discussion, we define the enriched space $\widehat{\mathbb{V}}^{(k)}$ as follows:

$$\widehat{\mathbb{V}}^{(k)} = \text{span}\left( \left\{ \widehat{s}_1^{(k)}, \widehat{s}_2^{(k)}, \ldots \right\} \cup \mathbb{U}_T \right), \qquad (15)$$

where $\mathbb{U}_T = \{u_1, u_2\}$, with $u_1(t,\theta) = \cos\theta$ and $u_2(t,\theta) = \sin\theta$. Figure 1(b) depicts this situation.
We remark that although the RCDT transform (6) is introduced in a continuous setting, numerical approximations for both the Radon and CDT transforms are available for discrete data, i.e., images in our applications [22]. Here we utilize the computational algorithm described in [29] to estimate the CDT from observed, discrete data. Using this algorithm, and given an image $s$, $\widehat{s}$ is computed on a chosen grid and reshaped as a vector.¹ The elements of $\mathbb{U}_T$ were likewise computed on the above grid and reshaped to obtain a set of vectors.

¹The same grid is chosen for all images.
Finally, the proposed training algorithm comprises the following steps for each class $k$:

1) Transform the training samples to obtain $\{\widehat{s}_1^{(k)}, \widehat{s}_2^{(k)}, \ldots\}$.

2) Orthogonalize $\{\widehat{s}_1^{(k)}, \widehat{s}_2^{(k)}, \ldots\} \cup \mathbb{U}_T$ to obtain the set of basis vectors $\{b_1^{(k)}, b_2^{(k)}, \ldots\}$, which spans the space $\widehat{\mathbb{V}}^{(k)}$ (see equation (15)). Use the output of the orthogonalization procedure to form the matrix $B^{(k)}$ that contains the basis vectors in its columns:

$$B^{(k)} = \left[ b_1^{(k)}, b_2^{(k)}, \ldots \right].$$
The training algorithm described above is summarized in Figure 6.
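The orthogonalization step above can be sketched in code. The following is an illustrative implementation (our own; the function name `fit_class_basis` is hypothetical) of the SVD-based orthogonalization with the 99%-variance truncation used in the experiments (Section V-A), applied to a synthetic rank-2 toy "class":

```python
import numpy as np

def fit_class_basis(X_hat, var_ratio=0.99):
    """Orthogonalize transformed training samples of one class via SVD.

    X_hat: (d, n) matrix whose columns are flattened transform-space samples.
    Returns B with orthonormal columns spanning the estimated class subspace,
    keeping enough left singular vectors to explain `var_ratio` of the variance.
    """
    U, sigma, _ = np.linalg.svd(X_hat, full_matrices=False)
    energy = np.cumsum(sigma**2) / np.sum(sigma**2)
    r = int(np.searchsorted(energy, var_ratio)) + 1
    return U[:, :r]

# toy data: 20 "samples" in R^50 lying exactly in a 2D subspace
d, n = 50, 20
coeff = np.vstack([np.cos(np.linspace(0, 3, n)), np.sin(np.linspace(0, 3, n))])
E = np.zeros((d, 2)); E[0, 0] = 1.0; E[1, 1] = 1.0
X = E @ coeff                   # columns mix two fixed directions
B = fit_class_basis(X)
print(B.shape)   # → (50, 2): the 2D subspace is recovered
```

In the actual method, `X_hat` would hold the flattened RCDT-space training samples of one class (together with the spanning set $\mathbb{U}_T$), and one basis matrix $B^{(k)}$ is stored per class.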
IV-B Testing algorithm
The testing procedure consists of applying the RCDT transform followed by a nearest subspace search in RCDT space (see Figure 1(c)). Let us consider a testing image $s$ whose class is to be predicted by the classification model described above. As a first step, we apply the RCDT to $s$ to obtain the transform space representation $\widehat{s}$. We then estimate the distance between $\widehat{s}$ and the subspace model for each class by $d^2(\widehat{s}, \widehat{\mathbb{V}}^{(k)}) = \| \widehat{s} - B^{(k)} (B^{(k)})^T \widehat{s} \|^2$. Note that $B^{(k)} (B^{(k)})^T$ is an orthogonal projection matrix onto the space generated by the span of the columns of $B^{(k)}$ (which form an orthonormal basis). To obtain this distance, we must first obtain the projection of $\widehat{s}$ onto the nearest point in the subspace $\widehat{\mathbb{V}}^{(k)}$, which can be easily computed by utilizing the orthonormal basis obtained in the training algorithm. Although the pseudo-inverse formula could be used, it is advantageous in testing to utilize an orthonormal basis for the subspace instead. The class of $s$ is then estimated to be

$$\arg\min_k\, d^2\!\left( \widehat{s}, \widehat{\mathbb{V}}^{(k)} \right).$$

Figure 6 shows a system diagram outlining these steps.
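The nearest subspace decision rule itself reduces to a few lines once the orthonormal bases are available. Below is an illustrative sketch (our own code; `predict` is a hypothetical name) using two toy one-dimensional subspaces in place of the learned RCDT-space subspaces:

```python
import numpy as np

def predict(s_hat, bases):
    """Nearest-subspace prediction: pick the class whose orthonormal basis
    B reconstructs the (flattened) transform-space sample best."""
    dists = []
    for B in bases:
        proj = B @ (B.T @ s_hat)          # orthogonal projection onto span(B)
        dists.append(np.linalg.norm(s_hat - proj) ** 2)
    return int(np.argmin(dists))

# toy example with two 1D subspaces of R^3 standing in for class subspaces
B0 = np.array([[1.0], [0.0], [0.0]])      # class 0: span{e1}
B1 = np.array([[0.0], [1.0], [0.0]])      # class 1: span{e2}
print(predict(np.array([0.9, 0.1, 0.0]), [B0, B1]))   # closer to span{e1} → 0
print(predict(np.array([0.2, 2.0, 0.1]), [B0, B1]))   # closer to span{e2} → 1
```

Because the bases are orthonormal, the projection is a pair of matrix-vector products per class, with no matrix inversion needed at test time.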
V Computational Experiments
V-A Experimental setup
Our goal is to study the classification performance of the method outlined above with respect to state-of-the-art techniques (deep CNNs), in terms of metrics such as classification accuracy, computational complexity, and amount of training data needed. Specifically, for each dataset we study, we generate train-test splits of different sizes from the original training set, train the models on these splits, and report the performances on the original test set. For a train split of a particular size, its samples are randomly drawn (without replacement) from the original training set, and the experiments for this particular size are repeated 10 times. All algorithms see the same train-test data samples for each split. Apart from predictive performance, we also measure each model's computational complexity in terms of the total number of floating point operations (FLOPs).
A particularly compelling property of the proposed approach is that the RCDT subspace model can capture deformations of different magnitudes (e.g., small translations vs. large translations) without requiring that all such small and large deformations be present in the training set. In other words, our model generalizes to data distributions that were previously unobserved. This is a highly desirable property, particularly for applications such as the optical communication under turbulence problem described below, where training data encompassing the full range of possible deformations is limited. This property is explored in Section VI.
Given their excellent performance in many classification tasks, we utilize different kinds of neural network methods as baselines for assessing the relative performance of the method outlined above. Specifically, we test three neural network models: 1) a shallow CNN model consisting of two convolutional layers and two fully connected layers (based on PyTorch's official MNIST demonstration example), 2) the standard VGG11 model [33], and 3) the standard Resnet18 model [34]. All these models were trained for 50 epochs using the Adam [35] optimizer with a learning rate of 0.0005. When the training set size was less than or equal to 8, a validation set was not used, and the test performance was measured using the model after the last epoch. When the training set had more than 8 samples, we used 10% of the training samples for validation and reported the test performance based on the model that had the best validation performance. To make a fair comparison, we did not use data augmentation in the training phase of either the neural network models or the proposed method.

The proposed method was trained and tested using the methods explained in Section IV. The orthogonalization of the set of transformed training samples (together with the deformation spanning set) was performed using singular value decomposition (SVD). The matrix of basis vectors $B^{(k)}$ was constructed using the left singular vectors obtained by the SVD. The number of basis vectors was chosen in such a way that the sum of the variances explained by the selected basis vectors in the $k$-th class captures 99% of the total variance explained by all the training samples in the $k$-th class. A 2D uniform probability density function was used as the reference image for the RCDT computation (see equation (6)).

V-B Datasets
TABLE II: Details of the datasets used

Dataset | Image size | No. of classes | No. of training images | No. of test images
Chinese printed character | 64 × 64 | 1000 | 100000 | 100000
MNIST | 28 × 28 | 10 | 60000 | 10000
AffineMNIST | 84 × 84 | 10 | 60000 | 10000
Optical OAM | 151 × 151 | 32 | 22400 | 9600
Sign language | 128 × 128 | 3 | 3280 | 1073
OASIS brain MRI | 208 × 208 | 2 | 100 | 100
CIFAR10 | 32 × 32 | 10 | 50000 | 10000
To demonstrate the comparative performance of the proposed method, we identified seven image classification datasets: the Chinese printed character, MNIST, AffineMNIST, optical OAM, sign language, OASIS brain MRI, and CIFAR10 image datasets. The Chinese printed character dataset with 1000 classes was created by adding random translations and scalings to the images of 1000 printed Chinese characters. The MNIST dataset contains images of ten classes of handwritten digits and was collected from [28]. The AffineMNIST dataset was created by adding random translations and scalings to the images of the MNIST dataset. The optical orbital angular momentum (OAM) communication dataset was collected from [27]; it contains images of 32 classes of multiplexed orbital angular momentum beam patterns for optical communication, corrupted by atmospheric turbulence. The sign language dataset, which contains images of hand gestures, was collected from [36]. Normalized HOGgles images [37] of the first three classes of the original RGB hand gesture images were used. Finally, the OASIS brain MRI image dataset was collected from [38]; the 2D images from the middle slices of the original 3D MRI data were used in this paper. Besides these six datasets, we also demonstrate results on the natural images of the grayscale CIFAR10 dataset [39]. The details of the seven datasets used are available in Table II.
VI Results
VI-A Test accuracy
The average test accuracy values of the methods tested on the Chinese printed character, MNIST, AffineMNIST, optical OAM, sign language, and OASIS brain MRI image datasets, for different numbers of training samples per class, are shown in Figure 7. Note that we did not use VGG11 on the MNIST dataset because the dimensions of MNIST images (28 × 28, see Table II) are too small for VGG11.
Overall, the proposed method outperforms the other methods when the number of training images per class is low (see Figure 7). For some datasets, the improvements are strikingly significant. For example, on the optical OAM dataset, and when learning from only one sample per class, our method provides a substantial absolute improvement in test accuracy over the CNN-based techniques. Moreover, the proposed method offers performance comparable to its deep learning counterparts as the number of training samples increases.
Furthermore, in most cases, the accuracy vs. training size curves have a smoother trend for the proposed method than for CNN-based learning. The accuracy vs. training size curves of the neural network architectures also vary significantly across datasets. For example, ShallowCNN outperforms Resnet on the MNIST dataset while it underperforms Resnet on the AffineMNIST dataset in terms of test accuracy. Similarly, while outperforming VGG11 on the sign language dataset, the Resnet architecture underperforms VGG11 on the AffineMNIST dataset.
VI-B Computational efficiency
Figure 8 presents the number of floating point operations (FLOPs) required in the training phase of the classification models in order to achieve a particular test accuracy value. We used the AffineMNIST and the sign language datasets in this experiment.
The proposed method obtains test accuracy results similar to those of the CNN-based methods with substantial savings in computational complexity, as measured by the number of FLOPs (see Figure 8). The reduction in computational complexity is generally larger when compared with a deeper neural network, e.g., VGG11: the number of training FLOPs required by VGG11 exceeds that of the proposed method by a larger factor than does that of ShallowCNN. Note that we have included the training FLOPs only in Figure 8. We also calculated the number of FLOPs required in the testing phase. For all the methods, the number of test FLOPs per image is several orders of magnitude lower than the number of training FLOPs. The testing FLOPs of the proposed method depend on the number of training samples, whereas the test FLOPs for the CNN-based methods are fixed; the CNN-based methods require more test FLOPs than the maximum number of test FLOPs required by the proposed method. These plots are not shown for brevity.
VI-C Out-of-distribution testing
In this experiment, we varied the magnitude of the confounding factors (e.g., translation) to generate a gap between the training and testing distributions, which allows us to test the out-of-distribution performance of the methods. Formally, let $\mathcal{G}$ define the set of confounding factors, and consider two disjoint subsets of $\mathcal{G}$, denoted $\mathcal{G}_{\text{in}}$ and $\mathcal{G}_{\text{out}}$, such that $\mathcal{G}_{\text{in}} \cap \mathcal{G}_{\text{out}} = \emptyset$ and $\mathcal{G}_{\text{in}} \cup \mathcal{G}_{\text{out}} \subseteq \mathcal{G}$. Using the generative model in equation (10), the 'in distribution' image subset $\mathbb{S}_{\text{in}}^{(k)}$ and the 'out distribution' image subset $\mathbb{S}_{\text{out}}^{(k)}$ are defined using the two disjoint confound subsets as follows:

$$\mathbb{S}_{\text{in}}^{(k)} = \left\{ s_j \,\middle|\, s_j = \mathscr{R}^{-1}\!\left( (g_\theta^j)'\,\left( \widetilde{\varphi}^{(k)} \circ g_\theta^j \right) \right),\ g_\theta^j \in \mathcal{G}_{\text{in}} \right\}, \qquad \mathbb{S}_{\text{out}}^{(k)} = \left\{ s_j \,\middle|\, s_j = \mathscr{R}^{-1}\!\left( (g_\theta^j)'\,\left( \widetilde{\varphi}^{(k)} \circ g_\theta^j \right) \right),\ g_\theta^j \in \mathcal{G}_{\text{out}} \right\}.$$

We used the 'in distribution' image subset as the generative model for the training set and the 'out distribution' image subset as the generative model for the test set in this modified experimental setup (see the left panel of Figure 9).
We measured the accuracy of the methods on the AffineMNIST and the optical OAM datasets under the modified experimental setup. The AffineMNIST dataset for the modified setup was generated by applying random translations and scalings to the original MNIST images in a controlled way so that the confound subsets and do not overlap. The ‘in distribution’ image subset consisted of images with translations by not more than pixels and scale factors varying between . On the other hand, images with translations by more than pixels and scale factors varying between were used to generate the ‘out distribution’ image subset . For the optical OAM dataset, the images at turbulence level 5 (low turbulence) [27] were included in the ‘in distribution’ subset and those at turbulence level 10 and 15 (medium and high turbulence) were included in the ‘out distribution’ subset . The average test accuracy results for different training set sizes under the outofdistribution setup are shown in Figure 9.
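The disjoint-confound construction above can be sketched in a few lines. The shift and scale ranges below are hypothetical stand-ins for the values used in the experiment (which are not reproduced here); the only essential point is that the 'in' and 'out' parameter ranges do not overlap, so the two confound subsets are disjoint by construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ranges standing in for the experiment's actual values:
# 'in distribution' uses small translations/scales, 'out distribution'
# uses larger ones, so the confound subsets are disjoint.
IN_SHIFT, OUT_SHIFT = (0, 4), (5, 10)         # |translation| in pixels
IN_SCALE, OUT_SCALE = (0.9, 1.1), (1.2, 1.5)  # scale factor

def sample_confound(split, rng):
    """Draw a (shift, scale) confound pair from the chosen split."""
    lo_s, hi_s = IN_SHIFT if split == "in" else OUT_SHIFT
    lo_c, hi_c = IN_SCALE if split == "in" else OUT_SCALE
    shift = rng.integers(lo_s, hi_s + 1)
    scale = rng.uniform(lo_c, hi_c)
    return shift, scale

def translate(img, shift):
    """Integer horizontal translation with zero padding (a simplified
    confound; applying the scale factor is omitted in this sketch)."""
    out = np.zeros_like(img)
    if shift == 0:
        return img.copy()
    out[:, shift:] = img[:, :-shift]
    return out

img = np.zeros((8, 8)); img[3:5, 1:3] = 1.0
shift, scale = sample_confound("out", rng)
moved = translate(img, shift)
print(shift, scale, moved.sum())
```

Training images would then be generated only with `sample_confound("in", ...)` and test images only with `sample_confound("out", ...)`, realizing the gap between the two distributions.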
The proposed method outperforms the other methods by a greater margin than before under this modified experimental scenario (see Figure 9). For the AffineMNIST dataset, the test accuracy values of the proposed method are to higher than those of the CNN-based methods. For the optical OAM dataset, the accuracy values of the proposed method are to higher than those of the CNN-based methods (see Figure 9).
As compared with the general experimental setup (Figure 7), the test accuracy results of all the methods generally decrease under this more challenging modified setup (Figure 9). However, the average reduction in test accuracy of the proposed method is significantly smaller than that of the CNN-based methods. For the AffineMNIST dataset, the average reduction in test accuracy for the proposed method is , whereas the reductions for the CNN-based methods are . Similarly, for the optical OAM dataset, the average reductions in accuracy are and for the proposed method and the CNN-based methods, respectively.
VI-D Ablation study
To observe the relative impact of the different components of our proposed method, we conducted two ablation studies using the AffineMNIST and the optical OAM datasets. In the first study, we replaced the nearest subspace classifier used in our proposed method with a multilayer perceptron (MLP) [40] and measured the test accuracy of this modified model. In the second study, we replaced the RCDT representations with the raw images and measured the test accuracy of the nearest subspace classifier applied to the raw image data. The percentage test accuracy results obtained in these two modified experiments are illustrated in Figure 10, along with the results of the proposed method for comparison. The proposed method outperforms both modified models in terms of test accuracy (see Figure 10).
VI-E An example where the proposed method fails
There are image classification problems (e.g., natural image datasets) where the proposed method does not perform well. One such example is the CIFAR-10 dataset. To demonstrate this point, we measured the test accuracies of the different methods on the grayscale CIFAR-10 dataset (see Figure 11). The highest accuracy of the proposed method is lower than that of the CNN-based methods: all of the CNN-based methods used outperform the proposed method on the grayscale CIFAR-10 dataset in terms of maximum test accuracy.
VII Discussion
Test accuracy
Results shown with the 6 example datasets suggest that the proposed method obtains accuracy figures competitive with state-of-the-art techniques such as CNNs, as long as the data at hand conform to the generative model in equation (10). Moreover, in these examples, the nearest RCDT subspace method was shown to be more data efficient: generally speaking, it can achieve higher accuracy with fewer training samples.
Computational efficiency
The proposed method obtains accuracy figures similar to those of the CNN-based methods with to times reduction in computational complexity. Such a drastic reduction in computation can be achieved due to the simplicity and non-iterative nature of the proposed solution. As opposed to neural networks, for which GPU implementations are all but imperative, the proposed method can be implemented efficiently on a CPU, greatly simplifying the process of obtaining an accurate classification model for the set of problems that are well modeled by the problem statement given in definition III.3.
Out-of-distribution testing
The accuracy results of the CNN-based methods fall drastically under the out-of-distribution setting, whereas the proposed method maintains its test accuracy performance. Based on these findings, we infer that the proposed method can be suitable for both interpolation (predicting the classes of data samples within the known distribution) and extrapolation (predicting the classes of data samples outside the known distribution) when the data conform to the generative model expressed in definition III.2.
The out-of-distribution setting for image classification also bears practical significance. For example, consider the problem of classifying OAM beam patterns for optical communications (see the optical OAM dataset in Figure 7). As these optical patterns traverse air with often unknown air flow patterns, temperature, humidity, etc., exact knowledge of the turbulence level that generated a test image may not always be at hand. It is therefore often infeasible to train the classification model with images at the same turbulence level as the test data, and the out-of-distribution setup is the more practical one under such circumstances.
Ablation study
Based on the ablation study results, we conclude that using the nearest subspace classifier in the RCDT domain is more appropriate for the category of classification problems we are considering. Data classes in the original image domain do not generally form convex sets, and in these instances the subspace model is not appropriate in the image domain. The subspace model is appropriate in the RCDT domain because the RCDT renders the data geometry linear. Considering the subspace model in RCDT space also enhances the generative nature of the proposed classification method by implicitly including the data points generated by convex combinations of the given training data points. Using a discriminative model for classification (e.g., an MLP or SVM) on the RCDT-domain representations of images does not have that advantage.
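A minimal sketch of the nearest subspace classifier itself (independent of the RCDT step) may help make the ablation concrete. This is an illustrative implementation, not the authors' code: it fits one basis per class with a truncated SVD of the (transform-space) training samples and classifies a test point by the smallest projection residual.

```python
import numpy as np

def fit_subspaces(X_by_class, n_basis=5):
    """One thin SVD per class: columns of X are training samples; keep
    the leading left singular vectors as an orthonormal basis."""
    bases = []
    for X in X_by_class:
        U, _, _ = np.linalg.svd(X, full_matrices=False)
        bases.append(U[:, :n_basis])
    return bases

def predict(x, bases):
    """Assign the class whose subspace is nearest, i.e., the smallest
    residual || x - B B^T x ||."""
    residuals = [np.linalg.norm(x - B @ (B.T @ x)) for B in bases]
    return int(np.argmin(residuals))

# Toy data: two classes living in disjoint coordinate subspaces.
rng = np.random.default_rng(1)
d = 20
X0 = np.zeros((d, 30)); X0[:5] = rng.normal(size=(5, 30))
X1 = np.zeros((d, 30)); X1[10:15] = rng.normal(size=(5, 30))
bases = fit_subspaces([X0, X1], n_basis=5)
x_test = np.zeros(d); x_test[11] = 1.0
print(predict(x_test, bases))  # prints 1: x_test lies in class 1's span
```

Replacing `predict` with an MLP, or feeding raw pixel vectors instead of transform-space samples into `fit_subspaces`, reproduces the two ablations in spirit.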
When are and convex
The definitions in III.1 and III.2 specify the generative model for the data classes used in our classification problem statement III.3. As part of the solution to the classification problem, it was proved in Lemma IV.1 that so long as (the inverse of the transportation subset of functions) is convex, is convex, and this is a precondition for the proposed classification algorithm summarized in Figure 6 to solve the classification problem stated in III.3. A natural question to ask is when, or for what types of transportation functions, is this condition met? Certain simple examples are easy to describe. For example, when denotes the set of translations in 1D or 2D, then can be shown to be convex. Furthermore, when refers to the set of scalings of a function, then can be shown to be convex. When contains a set of fixed points, i.e., when , then can be shown to be convex. Our hypothesis is that the 6 problems we tested the method on conform to the generative model specifications at least in part, given that the method obtains classification accuracies significantly higher than chance. A careful mathematical analysis of these and related questions is the subject of present and future work.
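The translation case is easy to verify numerically. The sketch below uses the quantile-function view of the transform with a uniform reference, which is a simplification of the convention in Section II: translating a template by mu shifts its transform by the constant mu, so the transformed class traces out an affine line, which is convex.

```python
import numpy as np

def quantile_transform(s, x, n_q=200):
    """Quantile function of a normalized 1-D signal s on grid x; this is
    the CDT with respect to a uniform reference, up to convention."""
    s = s / s.sum()
    F = np.cumsum(s)
    q = np.linspace(1e-6, 1 - 1e-6, n_q)
    return np.interp(q, F, x)

x = np.linspace(-10, 10, 2001)
s = np.exp(-0.5 * x ** 2)               # template signal
mu = 3.0
s_mu = np.exp(-0.5 * (x - mu) ** 2)     # translated copy

q0 = quantile_transform(s, x)
q1 = quantile_transform(s_mu, x)
# Translation acts as a constant offset in transform space, so the class
# {q0 + mu : mu real} is an affine line -- a convex set.
print(np.allclose(q1 - q0, mu, atol=0.05))
```

An analogous check for scalings shows the transform being multiplied by the scale factor, again yielding a convex (conical) set.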
Limitation: An example where the proposed method fails
The fundamental assumption of the proposed method is that the data at hand conform to an underlying generative model (equation 10). If the dataset does not conform to the generative model, the proposed method may not perform well. The CIFAR10 dataset (Figure 11) is an example where the data classes might not follow the generative model. The proposed method underperforms the CNNbased methods in the case of the CIFAR10 dataset.
VIII Conclusions
We introduced a new algorithm for supervised image classification. The algorithm builds on prior work related to the Radon cumulative distribution transform (RCDT) [22] and classifies a given image by measuring the distance between the RCDT of that image and the linear subspaces estimated from linear combinations of the transformed training data. As distances between two images in RCDT space equate to the sliced Wasserstein distances between the inverse RCDTs of the same points, the classification method can be interpreted as a 'nearest' sliced Wasserstein distance method between the input image and the other images in the generative model for each class.
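The sliced Wasserstein interpretation can be illustrated with a deliberately crude sketch that uses only the two axis-aligned projections in place of a full set of Radon angles; the 1-D Wasserstein distances between projections are computed through quantile functions, mirroring the transform-space view above. This is an illustrative approximation, not the method's actual distance computation.

```python
import numpy as np

def wasserstein_1d(p, q, x):
    """W2 distance between two 1-D densities on grid x via quantiles."""
    t = np.linspace(1e-6, 1 - 1e-6, 500)
    Fp, Fq = np.cumsum(p / p.sum()), np.cumsum(q / q.sum())
    qp, qq = np.interp(t, Fp, x), np.interp(t, Fq, x)
    return np.sqrt(np.mean((qp - qq) ** 2))

def sliced_w2(img_a, img_b):
    """Crude sliced-Wasserstein estimate using only the two axis-aligned
    projections (a full implementation would use many Radon angles)."""
    n = img_a.shape[0]
    x = np.arange(n, dtype=float)
    d2 = 0.0
    for axis in (0, 1):
        pa, pb = img_a.sum(axis=axis), img_b.sum(axis=axis)
        d2 += wasserstein_1d(pa, pb, x) ** 2
    return np.sqrt(d2 / 2)

# Two blobs of identical shape, translated by 3 pixels along one axis.
a = np.zeros((32, 32)); a[10:14, 10:14] = 1.0
b = np.zeros((32, 32)); b[10:14, 13:17] = 1.0
print(round(sliced_w2(a, a), 3), round(sliced_w2(a, b), 3))
```

The distance is zero for identical images and grows with the translation magnitude, which is exactly the mass-transport behavior the classifier exploits.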
The model was demonstrated to solve a variety of real-world classification problems with accuracy figures similar to those of state-of-the-art neural networks, including a shallow CNN, VGG11 [33], and ResNet18 [34]. The proposed model was also shown to outperform the neural networks by a large margin in certain practical scenarios, e.g., training with very few training samples and testing with 'out of distribution' test sets. The method is also extremely simple to implement, non-iterative, and it does not require tuning of hyperparameters. Finally, as far as training is concerned, the method was also demonstrated to be significantly less demanding in terms of floating point operations relative to the different neural network methods.
We note, however, that the method above is best suited for problems that are well modeled by the generative model definition provided in Section III. The definition is naturally tailored towards modeling images which are segmented (foreground extracted). Examples shown here include classifying written Chinese characters, MNIST numerical digits, optical communication patterns, sign language hand shapes, and brain MRIs. We also note that the model does not account for many other variations present in important image classification problems. Specifically, the proposed model does not account for occlusions, the introduction of other objects into the scene, or variations which cannot be modeled as a mass- (intensity-) preserving transformation of a set of templates. Computational examples using the CIFAR-10 dataset demonstrate that the proposed model indeed lags far behind, in terms of classification accuracy, the standard deep learning classification methods to which it was compared.
Finally, we note that numerous adaptations of the method are possible. The linear subspace method (in RCDT space) described above can be modified to utilize other assumptions regarding the subspace that best models each class. While certain classes or problems may benefit from the simple linear subspace method described above, in which all linear combinations are allowed, other classes may be better modeled as unions of non-orthogonal subspaces. The exploration of this and other modifications and extensions of the method is left for future work.
Acknowledgments
This work was supported in part by NIH grants GM130825, GM090033.
References
 [1] Olcay Sertel, Jun Kong, Hiroyuki Shimada, Umit V Catalyurek, Joel H Saltz, and Metin N Gurcan. Computer-aided prognosis of neuroblastoma on whole-slide images: Classification of stromal development. Pattern Recognition, 42(6):1093–1103, 2009.
 [2] Saurav Basu, Soheil Kolouri, and Gustavo K Rohde. Detecting and visualizing cell phenotype differences from microscopy images using transport-based morphometry. Proceedings of the National Academy of Sciences, 111(9):3448–3453, 2014.
 [3] Shinjini Kundu, Soheil Kolouri, Kirk I Erickson, Arthur F Kramer, Edward McAuley, and Gustavo K Rohde. Discovery and visualization of structural biomarkers from MRI using transport-based morphometry. NeuroImage, 167:256–275, 2018.
 [4] Jörg B Schulz, Johannes Borkert, Stefanie Wolf, Tanja Schmitz-Hübsch, Maryla Rakowicz, Caterina Mariotti, Ludger Schoels, Dagmar Timmann, Bart van de Warrenburg, Alexandra Dürr, et al. Visualization, quantification and correlation of brain atrophy with clinical symptoms in spinocerebellar ataxia types 1, 3 and 6. NeuroImage, 49(1):158–168, 2010.
 [5] Abdenour Hadid, JY Heikkila, Olli Silvén, and M Pietikainen. Face and eye detection for person authentication in mobile phones. In 2007 First ACM/IEEE International Conference on Distributed Smart Cameras, pages 101–108. IEEE, 2007.
 [6] Mohammad Shifat-E-Rabbi, Xuwang Yin, Cailey E Fitzgerald, and Gustavo K Rohde. Cell image classification: A comparative overview. Cytometry Part A, 97A(4):347–362, 2020.
 [7] Waseem Rawat and Zenghui Wang. Deep convolutional neural networks for image classification: A comprehensive review. Neural computation, 29(9):2352–2449, 2017.
 [8] Dengsheng Lu and Qihao Weng. A survey of image classification methods and techniques for improving classification performance. International journal of Remote sensing, 28(5):823–870, 2007.
 [9] Judith MS Prewitt and Mortimer L Mendelsohn. The analysis of cell images. Annals of the New York Academy of Sciences, 128(3):1035–1053, 1966.
 [10] Nikita Orlov, Lior Shamir, Tomasz Macura, Josiah Johnston, D Mark Eckley, and Ilya G Goldberg. WND-CHARM: Multi-purpose image classification using compound image transforms. Pattern Recognition Letters, 29(11):1684–1693, 2008.
 [11] Gennady V Ponomarev, Vladimir L Arlazarov, Mikhail S Gelfand, and Marat D Kazanov. ANA HEp-2 cells image classification using number, size, shape and localization of targeted cell regions. Pattern Recognition, 47(7):2360–2366, 2014.
 [12] Tatyana V Bandos, Lorenzo Bruzzone, and Gustavo Camps-Valls. Classification of hyperspectral images with regularized linear discriminant analysis. IEEE Transactions on Geoscience and Remote Sensing, 47(3):862–873, 2009.
 [13] Timothy J Muldoon, Nadhi Thekkek, Darren M Roblyer, Dipen Maru, Noam Harpaz, Jonathan Potack, Sharmila Anandasabapathy, and Rebecca R Richards-Kortum. Evaluation of quantitative image analysis criteria for the high-resolution microendoscopic detection of neoplasia in Barrett's esophagus. Journal of Biomedical Optics, 15(2):026027, 2010.
 [14] Jianguo Zhang, Marcin Marszałek, Svetlana Lazebnik, and Cordelia Schmid. Local features and kernels for classification of texture and object categories: A comprehensive study. International journal of computer vision, 73(2):213–238, 2007.
 [15] Florent Perronnin, Jorge Sánchez, and Thomas Mensink. Improving the Fisher kernel for large-scale image classification. In European Conference on Computer Vision, pages 143–156. Springer, 2010.
 [16] Anna Bosch, Andrew Zisserman, and Xavier Munoz. Image classification using random forests and ferns. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8. IEEE, 2007.
 [17] Peijun Du, Alim Samat, Björn Waske, Sicong Liu, and Zhenhong Li. Random forest and rotation forest for fully polarized SAR image classification using polarimetric and spatial features. ISPRS Journal of Photogrammetry and Remote Sensing, 105:38–53, 2015.
 [18] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
 [19] Hoo-Chang Shin, Holger R Roth, Mingchen Gao, Le Lu, Ziyue Xu, Isabella Nogues, Jianhua Yao, Daniel Mollura, and Ronald M Summers. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Transactions on Medical Imaging, 35(5):1285–1298, 2016.
 [20] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
 [21] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
 [22] Soheil Kolouri, Se Rim Park, and Gustavo K Rohde. The Radon cumulative distribution transform and its application to image classification. IEEE Transactions on Image Processing, 25(2):920–934, 2016.
 [23] Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
 [24] Soheil Kolouri, Se Rim Park, Matthew Thorpe, Dejan Slepcev, and Gustavo K Rohde. Optimal mass transport: Signal processing and machine-learning applications. IEEE Signal Processing Magazine, 34(4):43–59, 2017.
 [25] Wei Wang, Dejan Slepčev, Saurav Basu, John A Ozolek, and Gustavo K Rohde. A linear optimal transportation framework for quantifying and visualizing variations in sets of images. International journal of computer vision, 101(2):254–269, 2013.
 [26] Soheil Kolouri, Yang Zou, and Gustavo K Rohde. Sliced Wasserstein kernels for probability distributions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5258–5267, 2016.
 [27] Se Rim Park, Liam Cattell, Jonathan M Nichols, Abbie Watnik, Timothy Doster, and Gustavo K Rohde. Demultiplexing vortex modes in optical communications using transport-based pattern recognition. Optics Express, 26(4):4004–4022, 2018.
 [28] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [29] Se Rim Park, Soheil Kolouri, Shinjini Kundu, and Gustavo K Rohde. The cumulative distribution transform and linear pattern classification. Applied and Computational Harmonic Analysis, 45(3):616–641, 2018.
 [30] Ronald Newbold Bracewell and Ronald N Bracewell. The Fourier transform and its applications, volume 31999. McGraw-Hill, New York, 1986.
 [31] Eric Todd Quinto. An introduction to x-ray tomography and Radon transforms. In Proceedings of Symposia in Applied Mathematics, volume 63, page 1, 2006.
 [32] Frank Natterer. The mathematics of computerized tomography. SIAM, 2001.
 [33] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [34] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [35] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [36] Kaggle: Sign Language MNIST. https://www.kaggle.com/datamunge/signlanguagemnist. Accessed: 2020-03-10.
 [37] Carl Vondrick, Aditya Khosla, Tomasz Malisiewicz, and Antonio Torralba. Hoggles: Visualizing object detection features. In Proceedings of the IEEE International Conference on Computer Vision, pages 1–8, 2013.
 [38] Daniel S Marcus, Tracy H Wang, Jamie Parker, John G Csernansky, John C Morris, and Randy L Buckner. Open access series of imaging studies (OASIS): cross-sectional MRI data in young, middle aged, nondemented, and demented older adults. Journal of Cognitive Neuroscience, 19(9):1498–1507, 2007.
 [39] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
 [40] Matt W Gardner and SR Dorling. Artificial neural networks (the multilayer perceptron)—a review of applications in the atmospheric sciences. Atmospheric Environment, 32(14-15):2627–2636, 1998.
Appendix A Proof of Property II-B.1
The composition property of the CDT:
Let denote a normalized signal and let be the CDT of . The CDT of is given by
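For reference, the composition property can be restated as follows. This is a restatement in the notation of [29] under the convention that the deformation is increasing; see the statement of Property II-B.1 above for the exact conventions used in this paper.

```latex
\noindent\textit{Composition property of the CDT (cf.\ [29]):} let $s$ be a
normalized signal with CDT $\widehat{s}$, and let $g$ be an invertible,
differentiable, increasing spatial deformation. Then the deformed signal
\[
  s_g(x) = g'(x)\, s\big(g(x)\big)
\]
has CDT
\[
  \widehat{s_g} = g^{-1} \circ \widehat{s},
\]
i.e., spatial deformations of $s$ act by composition in transform space.
```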
Appendix B Proof of Property II-D.1
The composition property of the RCDT:
Let denote a normalized image, and let and denote the Radon transform and the RCDT of , respectively. The RCDT of is given by
Proof.
Let denote a reference image. Let and denote the Radon transforms of and , respectively, and let and denote the CDTs of and , respectively, with respect to the reference . Then , we have that
If we substitute or, . Then , we have
(B.1) 
By the change of variables theorem, we can replace , in equation (B.1):
(B.2) 
From equation (B.2), we have that
∎
Appendix C Proof of Property II-B.2
Recall that given two signals and , the Wasserstein metric between them is defined in the following way:
(C.1) 
where is the CDT of with respect to .
Proof.
Recall that an isometric embedding between two metric spaces is an injective mapping that preserves distances. Define the embedding by the correspondence ; it remains to show that
for all signals . Let be the CDT of with respect to , then
By the definition of the CDT, and . Then, by the composition property, . Here again the CDTs are taken with respect to a fixed reference . Let . Using the change of variables formula,
∎
Appendix D Proof of Property II-D.2
Recall that given two images , using the correspondence in (6) the Sliced Wasserstein metric