Transport-Based Pattern Theory: A Signal Transformation Approach

02/20/2018
by Liam Cattell, et al.

In many scientific fields, imaging is used to relate a certain physical quantity to other dependent variables. Images can therefore be considered as maps from a real-world coordinate system to the non-negative measurements being acquired. In this work we describe an approach for simultaneous modeling and inference of such data, using the mathematics of optimal transport. To achieve this, we describe a numerical implementation of the linear optimal transport transform, based on the solution of the Monge-Ampère equation, which uses Brenier's theorem to characterize the solution of the Monge functional as the derivative of a convex potential function. We use our implementation of the transform to compute a curl-free mapping between two images, and show that it is able to match images with lower error than existing methods. Moreover, we provide theoretical justification for properties of the linear optimal transport framework observed in the literature, including a theorem for the linear separation of data classes. Finally, we use our optimal transport method to empirically demonstrate that the linear separability theorem holds, by rendering non-linearly separable data linearly separable following transformation to transport space.


1 Introduction

Many imaging problems relate to analyzing spatial patterns of a certain physical quantity, for the purpose of understanding the spatiotemporal distribution of that quantity in relation to other dependent variables. Historically, astronomers, for example, used images from telescopes to understand the mass distribution of different galaxies [Schwarzschild (1954), Bertola (1966)]. Biologists and pathologists make use of digital images of cells in order to understand their structure as a function of different experimental conditions or disease [Yeung et al. (2005), Wang et al. (2010)]. Scientists and radiologists utilize magnetic resonance imaging (MRI), computed tomography (CT), and other imaging modalities in order to understand the structure of different organs as it relates to dynamic, physiological, and other functional characteristics [Gorbunova et al. (2012), Schmitter et al. (2015)].

In the majority of these cases, imaging refers to the measurement of a certain physical quantity which is positive in nature. In X-ray CT, detectors measure the number of photons arriving at a certain region of the detector in a certain window of time. In magnitude-reconstructed MRI, the quantity being reconstructed is the magnitude of the magnetization density. In light microscopy, detectors also measure the number of photons arriving within a certain time window. In transmission microscopy, light is attenuated, and a darker pixel corresponds to an increased density at that location. In fluorescence microscopy, the amount of signal being received is proportional to the concentration of a certain fluorophore at that location. In these, and other examples, an image can be considered as a map from the domain $\Omega \subseteq \mathbb{R}^d$, representing the coordinate system in which measurements are being made, to the non-negative real numbers: $s : \Omega \to \mathbb{R}_+$.

Here we describe an approach for modeling and inference on such data by considering each image $s$ in a database as an instance of a given template $s_0$ 'morphed' by a given function $f$. That is, $s = \det(Df)\, s_0 \circ f$, where $\det(Df)$ is the determinant of the Jacobian of $f$. The association between $s$ and $f$ is made unique by the mathematics of optimal transport, as explained in [Wang et al. (2013), Kolouri, Tosun et al. (2016)]. The embedding provided by the so-called linear optimal transport (LOT) approach can be viewed as a mathematical image transform with forward and inverse operations, thus enabling inference and modeling simultaneously. Here we extend the work presented in [Wang et al. (2013), Kolouri, Tosun et al. (2016)] to provide theoretical justification for certain properties observed empirically in the analysis of cell and brain image databases [Ozolek et al. (2014), Kolouri, Tosun et al. (2016), Tosun et al. (2015), Kundu et al. (2018)], including a linear separation theorem in transform space similar to the work shown in [Park et al. (2017), Kolouri, Park and Rohde (2016)].

We note that the framework discussed above, whereby images are modeled as the actions of spatial transformations on a template, has similarities to the pattern theory approach of Grenander, Mumford, and others [Grenander (1994), Grenander (1996), Grenander and Miller (1998), Mumford (1997)]. In particular, the pattern theory-based computational anatomy approach originally proposed by Grenander and Miller [Grenander and Miller (1998)] involves viewing the orbit formed by the action of a diffeomorphism as a generative model for imagery. As such, it attempts to model each observed image as the result of a diffeomorphism applied to a template, based on representations, probabilities, and inference methods that can be developed within a Bayesian perspective. The framework has been useful for motivating a number of methods used in pattern recognition [Grenander and Miller (2007)]. The LOT approach introduced in [Wang et al. (2013), Kolouri, Tosun et al. (2016), Park et al. (2017)] shares a similar point of view, and can be seen as a particular version of the more general pattern theory approach [Grenander (1994), Grenander (1996), Grenander and Miller (1998), Mumford (1997)], with the main innovation being that the transport formulation can encompass both displacement (geometric transformation) as well as photometric (i.e. intensity difference) information.

1.1 Overview

We begin in Section 2, Definitions and Preliminaries, by introducing the basics of the mathematics of optimal mass transport, including the Monge functional and Brenier's theorem, which characterizes the solution of the Monge functional. We also review basic notions related to the geometry of optimal transport, as well as the LOT metric described in [Wang et al. (2013), Kolouri, Tosun et al. (2016)]. In Section 3, we define the basics of the LOT transform, including the forward and inverse operations. In Section 4 we describe several mathematical properties of the LOT transform. These include translation, scaling, and composition claims, as well as linear separability properties of the transform. Section 5 describes the numerical discretization and implementation of the LOT transform, based on the solution of the Monge-Ampère equation within a multi-resolution framework. Section 6 describes our computational experiments, Section 7 presents the results, and concluding remarks are given in Section 8.

2 Definitions and preliminaries

2.1 Optimal Mass Transport

Consider two Borel probability measures $\mu$ and $\nu$ with finite $p$th moments, defined on measure spaces $\Omega_\mu \subseteq \mathbb{R}^d$ and $\Omega_\nu \subseteq \mathbb{R}^d$, respectively. Consider the case when the measures are smooth and are associated with positive density functions $I_0$ and $I_1$, such that $d\mu(x) = I_0(x)\,dx$ and $d\nu(y) = I_1(y)\,dy$. The original optimal transport problem posed by Monge [Monge (1781)] is to find a mass-preserving (MP) map $f : \Omega_\mu \to \Omega_\nu$ that pushes $\mu$ onto $\nu$ and minimizes the objective function

$$M(f) = \int_{\Omega_\mu} c\big(x, f(x)\big)\, d\mu(x), \qquad (1)$$

where $c : \Omega_\mu \times \Omega_\nu \to \mathbb{R}_+$ is the cost function, and $f_\#\mu = \nu$, where $f_\#\mu$ denotes the push-forward of measure $\mu$. This is characterized as:

$$\int_{f^{-1}(A)} d\mu(x) = \int_{A} d\nu(y) \quad \text{for any Borel set } A \subseteq \Omega_\nu. \qquad (2)$$

If $f$ is a smooth, one-to-one mapping, Eq. (2) can be written in a differential form:

$$\det\big(Df(x)\big)\, I_1\big(f(x)\big) = I_0(x), \qquad (3)$$

where $\det(Df)$ is the determinant of the Jacobian of $f$.

In his original paper, Monge considered the Euclidean distance $c(x, y) = |x - y|$ as the cost function [Monge (1781)]. However, under the smoothness assumption required for Eq. (3), the cost function is often chosen to be the squared $L_2$-norm, $c(x, y) = |x - y|^2$, such that Eq. (1) becomes the (squared) 2-Wasserstein metric [Haker et al. (2004), Trigila and Tabak (2016)]:

$$W_2^2(\mu, \nu) = \min_{f :\ f_\#\mu = \nu} \int_{\Omega_\mu} |x - f(x)|^2\, d\mu(x). \qquad (4)$$
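For intuition, consider the one-dimensional case, where the optimal map under the squared cost is available in closed form as $f = S_1^{-1} \circ S_0$, with $S_0$ and $S_1$ the cumulative distribution functions of $I_0$ and $I_1$. The following minimal numerical sketch of Eq. (4) is our own illustration (the grid and density choices are arbitrary), not the implementation used in this paper:

```python
import numpy as np

# Two 1D densities on a common grid, normalized to unit mass.
x = np.linspace(0.0, 1.0, 512)
dx = x[1] - x[0]
I0 = np.exp(-0.5 * ((x - 0.35) / 0.05) ** 2)
I1 = np.exp(-0.5 * ((x - 0.60) / 0.08) ** 2)
I0 /= I0.sum() * dx
I1 /= I1.sum() * dx

# Cumulative distribution functions.
S0 = np.cumsum(I0) * dx
S1 = np.cumsum(I1) * dx

# In 1D the optimal (Monge) map is f = S1^{-1}(S0(x)); invert S1 by interpolation.
f = np.interp(S0, S1, x)

# Squared 2-Wasserstein distance: integral of |x - f(x)|^2 dmu(x).
W2_squared = np.sum((x - f) ** 2 * I0) * dx
print(W2_squared)
```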

It should be noted that both the objective function and the constraint in Eq. (4) are nonlinear with respect to $f$. Moreover, an important result in the field is Brenier's theorem, which characterizes the unique solution to the Monge problem [Brenier (1991)].

Theorem 1 (Brenier’s Theorem)

Let $I_0$ and $I_1$ be non-negative functions of the same total mass and with bounded support:

$$\int_{\Omega_\mu} I_0(x)\, dx = \int_{\Omega_\nu} I_1(y)\, dy.$$

When $c(x, y) = h(x - y)$ for some strictly convex function $h$, then there exists a unique optimal transport map $f$ that minimizes Eq. (1). Furthermore, if $c(x, y) = |x - y|^2$, then there exists a (unique up to adding a constant) convex function $\phi$ such that $f = \nabla\phi$. A proof is available in [Villani (2008), Gangbo and McCann (1996)].

Finally, it should be mentioned that, for certain measures, the Monge formulation of the optimal transport problem is ill-posed, in the sense that there is no transport map that rearranges one measure into the other. In such cases the Kantorovich formulation of the problem, using transport plans, is preferred. For brevity, we omit a detailed discussion of the Kantorovich mass transport formulation and instead refer the reader to [Kolouri et al. (2017)] for more details.

2.2 Geometry of Optimal Transport

Brenier's theorem states that the OT map between $\mu$ and $\nu$ is unique and, in the continuous setting presented above, the mass from each point $x$ is moved to a single location, given as the value of the function $f(x)$. Formally, the set of measures with finite second moments is not just a metric space, but is also a Riemannian manifold equipped with an inner product on the tangent space at any point [do Carmo (1992)]. Specifically, the tangent space at the measure $\mu$ with density $I_0$ (i.e. $d\mu = I_0\, dx$) is the set of vector fields of the form

$$v = \nabla\phi,$$

and the inner product is the weighted $L_2$ inner product [Wang et al. (2013)]:

$$\langle v_1, v_2 \rangle_\mu = \int_{\Omega_\mu} v_1(x) \cdot v_2(x)\, d\mu(x).$$

Therefore, the OT distance is the length of the geodesic connecting two measures [Benamou and Brenier (2000)]. Fortunately, in the context of optimal transport, this has a straightforward interpretation: if $\rho_t$, $t \in [0, 1]$, is the geodesic connecting $\mu$ to $\nu$, then $\rho_t$ is the measure obtained when mass from $\mu$ is transported by the transportation map $f_t(x) = x + t\,(f(x) - x)$. Consequently, $W_2(\mu, \rho_t) = t\, W_2(\mu, \nu)$.
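The geodesic has a direct numerical analogue: push the reference density along the interpolated map $f_t(x) = x + t\,(f(x) - x)$ and apply the Jacobian equation (Eq. (3)). The sketch below reuses the grid `x`, density `I0`, and map `f` from the previous example; the function name is ours:

```python
import numpy as np

def displacement_interpolation(x, I0, f, t):
    """Density rho_t along the OT geodesic, obtained by pushing I0
    through the McCann interpolation map f_t(x) = x + t (f(x) - x)."""
    ft = x + t * (f - x)              # interpolated (geodesic) map
    dft = np.gradient(ft, x)          # 1D Jacobian of f_t
    rho_on_ft = I0 / dft              # Jacobian equation: rho_t(f_t) f_t' = I0
    # Resample onto the fixed grid for comparison with I0 and I1.
    return np.interp(x, ft, rho_on_ft, left=0.0, right=0.0)

# At t = 0 this recovers I0; at t = 1 it approximates I1.
rho_half = displacement_interpolation(x, I0, f, 0.5)
```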

2.3 The Linear Optimal Transport Distance

Computational complexity can hinder the application of OT-related metrics to learning and pattern analysis problems that involve large numbers of images. For recognition problems, the application of classification techniques such as nearest neighbors [Altman (1992)], linear discriminant analysis [McLachlan (2004), Wang et al. (2011)], support vector machines [Cortes and Vapnik (1995)], and their respective kernel versions [Cristianini and Shawe-Taylor (2000)], would require $O(N^2)$ OT computations, with $N$ representing the number of images in the dataset. To ease this burden, Wang et al. [Wang et al. (2013)] proposed the linear optimal transport (LOT) metric, which borrows the statistical atlas notion often seen in brain morphology studies [Ashburner and Friston (2000)], and seeks to compare images to one another based on their relation to a reference.

Consider a given signal $I_1$ associated with measure $\nu$ via $d\nu = I_1\, dx$, as before, and a reference function $I_0$ associated with measure $\mu$ such that $d\mu = I_0\, dx$, with $\int I_0\, dx = \int I_1\, dx$. We can then identify $I_1$ (and $\nu$) with a map $f$ via the optimization problem stated in Eq. (1). Brenier's theorem tells us that this identification is unique and that it is characterized by the map being the gradient of a potential function $\phi$, that is $f = \nabla\phi$, such that $f_\#\mu = \nu$. This identification can be viewed as the projection of $\nu$ onto the tangent plane described above. More precisely, let the projection of $\nu$ be:

$$P(\nu) = f - \mathrm{id}.$$

Then $\|P(\nu)\|_\mu = W_2(\mu, \nu)$:

$$\|P(\nu)\|_\mu^2 = \int_{\Omega_\mu} |f(x) - x|^2\, d\mu(x),$$

where $\|v\|_\mu^2$ is defined to be $\int_{\Omega_\mu} |v(x)|^2\, d\mu(x)$. As explained in [Wang et al. (2013)], this is known as an azimuthal equidistant projection in cartography, while in differential geometry it would be the inverse of the exponential map. Finally, given two measures $\nu_1$ and $\nu_2$, and a fixed reference $\mu$, the LOT distance between $\nu_1$ and $\nu_2$ is defined as [Wang et al. (2013)]:

$$d_{\mathrm{LOT}}(\nu_1, \nu_2) = \left( \int_{\Omega_\mu} |f_1(x) - f_2(x)|^2\, d\mu(x) \right)^{1/2}. \qquad (5)$$

3 LOT Transform

Consider two measures $\nu_1$ and $\nu_2$, with corresponding densities $I_1$ and $I_2$, and let $\mu$ be a fixed reference measure with corresponding density $I_0$. Furthermore, let $f_1$ and $f_2$ correspond to the OT maps linking $\mu$ to $\nu_1$ and $\mu$ to $\nu_2$. That is, $(f_i)_\#\mu = \nu_i$, such that each $f_i$ is uniquely defined through the solution to the Monge minimization problem in Eq. (1). From Eq. (5) we have:

$$d_{\mathrm{LOT}}(\nu_1, \nu_2) = \|\hat{I}_1 - \hat{I}_2\|_{L_2},$$

where $\hat{I}_i$ is defined to be $(f_i - \mathrm{id})\sqrt{I_0}$, and $\|\cdot\|_{L_2}$ denotes the standard $L_2$ norm.

The result above motivated Kolouri et al. [Kolouri, Park, Thorpe et al. (2016)] to define the continuous LOT embedding (transform) for signals and images. Here, a signal is interpreted to be a continuous function $s : \mathbb{R}^d \to \mathbb{R}_+$, where $d$ refers to the dimension of the signal (i.e. $d = 1$ for 1D signals, $d = 2$ for 2D images, etc.). The forward (analysis) LOT transform of $s$, with respect to a reference density $s_0$, can then be defined as:

$$\hat{s} = (f - \mathrm{id})\sqrt{s_0}, \qquad (6)$$

where $f$ is the gradient of a potential function $\phi$, such that $f = \nabla\phi$, and $\det(Df(x))\, s(f(x)) = s_0(x)$. The inverse (synthesis) LOT transform of $\hat{s}$ is then:

$$s = \det\big(Df^{-1}\big)\, s_0 \circ f^{-1}, \qquad (7)$$

where $f = \hat{s}/\sqrt{s_0} + \mathrm{id}$. Thus, we can consider the LOT transform as an operation taking signals from one domain (signal space) to another (displacements). Just as in engineering we consider the time/space domain versus the frequency domain, the LOT transform allows us to consider the signal space domain versus the displacement domain (with respect to a fixed reference).

The idea was extended by Park et al. [Park et al. (2017)], who exploited the fact that, in 1D, the unique map between any two positive smooth densities can be computed in closed form. Consequently, the authors defined the Cumulative Distribution Transform (CDT) for 1D signals [Park et al. (2017)]. This allowed for the definition of several interesting transform-pair properties, some of which have relevance to learning problems that are popular at the time of writing. Kolouri et al. [Kolouri, Tosun et al. (2016)] expanded the CDT to higher-dimensional signals (images) by using the Radon transform formalism to obtain a set of 1D projections from an image, and applying the CDT along these projections. This is equivalent to obtaining an embedding for the sliced Wasserstein metric [Kolouri, Park and Rohde (2016)]. Similarly to the LOT transform above, the CDT and Radon-CDT are invertible operations with forward (analysis) and inverse (synthesis) operations.
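Since the 1D map is available in closed form, the CDT forward and inverse operations can be sketched in a few lines. This is a hedged illustration (the function names, uniform grid, and assumption of strictly positive densities are ours):

```python
import numpy as np

def cdt_forward(x, s, s0):
    """Forward CDT: s_hat = (f - id) sqrt(s0), with f = S^{-1}(S0(x))."""
    dx = x[1] - x[0]
    S = np.cumsum(s) * dx
    S0 = np.cumsum(s0) * dx
    f = np.interp(S0, S, x)           # closed-form 1D Monge map
    return (f - x) * np.sqrt(s0)

def cdt_inverse(x, s_hat, s0):
    """Inverse CDT: recover s from s_hat via s(f(x)) f'(x) = s0(x)."""
    f = s_hat / np.sqrt(s0) + x
    df = np.gradient(f, x)
    s_on_f = s0 / df                  # density values at the points f(x)
    return np.interp(x, f, s_on_f, left=0.0, right=0.0)
```

Up to discretization error, `cdt_inverse(x, cdt_forward(x, s, s0), s0)` recovers `s`, mirroring the analysis/synthesis pair in Eqs. (6) and (7).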

4 LOT Transform Properties

In this section, we present three properties of the LOT transform pertaining to changes in density coordinates: the translation theorem, the scaling theorem, and the composition theorem. In addition, we demonstrate that the linear separation property of the CDT and Radon-CDT also holds for the LOT transform in $d$ dimensions.

4.1 Translation

Let $s_\tau$ be the translation of the density $s$ by $\tau \in \mathbb{R}^d$, such that $s_\tau(x) = s(x - \tau)$. The transform of $s_\tau$ with respect to the reference density $s_0$ is given by:

$$\hat{s}_\tau = \hat{s} + \tau\sqrt{s_0}, \qquad (8)$$

where $\hat{s}_\tau$ is defined on measure space $\Omega_\mu$. For a proof, see Appendix A.
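In one dimension, the translation property can be verified empirically with the CDT sketch from Section 3: translating a density by $\tau$ should add $\tau\sqrt{s_0}$ to its transform. The following check (grid and bump parameters are illustrative only) assumes `cdt_forward` from that sketch:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 2048)
dx = x[1] - x[0]
s0 = np.ones_like(x)                      # uniform reference density

s = np.exp(-0.5 * ((x - 0.4) / 0.03) ** 2)
s /= s.sum() * dx

tau = 0.2
s_tau = np.exp(-0.5 * ((x - 0.4 - tau) / 0.03) ** 2)
s_tau /= s_tau.sum() * dx

lhs = cdt_forward(x, s_tau, s0)           # transform of the translated density
rhs = cdt_forward(x, s, s0) + tau * np.sqrt(s0)
print(np.max(np.abs(lhs - rhs)))          # small, up to discretization error
```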

4.2 Scaling

Let $s_a$ be the scaling of the density $s$ by $a > 0$, such that $s_a(x) = a^d\, s(a x)$. The transform of $s_a$ with respect to the reference density $s_0$ is given by:

$$\hat{s}_a = \frac{\hat{s} - (a - 1)\,\mathrm{id}\,\sqrt{s_0}}{a}, \qquad (9)$$

where $\hat{s}_a$ is defined on measure space $\Omega_\mu$. For a proof, see Appendix B.

4.3 Composition

Let $s_g$ represent the composition of the density $s$ with an invertible function $g$, such that $s_g = \det(Dg)\, s \circ g$, where $\det(Dg)$ is the determinant of the Jacobian of $g$, and $g$ is the gradient of a convex potential function (i.e. $g = \nabla\varphi$). The transform of $s_g$ with respect to the reference density $s_0$ is given by:

$$\hat{s}_g = \big(g^{-1} \circ f - \mathrm{id}\big)\sqrt{s_0}, \qquad (10)$$

where $\hat{s}_g$ is defined on measure space $\Omega_\mu$. For a proof, see Appendix C.

4.4 Linear Separability in LOT Space

It is well known that if two sets are convex and disjoint, there always exists a hyperplane that separates them (see [Park et al. (2017)], for example). As a consequence, the sets are said to be linearly separable. In this section we demonstrate that, under certain conditions, the LOT transform can enhance the linear separability of image classes.

Consider two non-overlapping, non-linearly-separable subsets $\mathbb{P}$ and $\mathbb{Q}$ of a vector space. All elements of $\mathbb{P}$ are generated by transforming a “mother” density $p_0$ with a differentiable function $h$:

$$\mathbb{P} = \big\{ p \ \big|\ p = \det(Dh)\, p_0 \circ h,\ h \in \mathcal{H} \big\},$$

where $h$ is the gradient of a convex potential $\eta$ (i.e. $h = \nabla\eta$). Moreover, $\det(Dh)$ denotes the determinant of the Jacobian of $h$, and $\mathcal{H}$ is a set of diffeomorphisms (e.g. the set of all translations). Similarly, the elements of $\mathbb{Q}$ are generated by transforming $q_0$ with a differentiable function $h \in \mathcal{H}$:

$$\mathbb{Q} = \big\{ q \ \big|\ q = \det(Dh)\, q_0 \circ h,\ h \in \mathcal{H} \big\}.$$

It should also be noted that, since sets $\mathbb{P}$ and $\mathbb{Q}$ do not overlap:

$$\mathbb{P} \cap \mathbb{Q} = \emptyset.$$

The definitions above provide a framework that can be used to construct or interpret image classes. For example, let $\mathcal{H}$ be the set of all translations $h(x) = x - \tau$, where $\tau$ is a random variable. The elements of sets $\mathbb{P}$ and $\mathbb{Q}$ are then given by $p(x) = p_0(x - \tau)$ and $q(x) = q_0(x - \tau)$, respectively, and are translations of the original “mother” densities $p_0$ and $q_0$. Therefore, the translation $\tau$ becomes the confound parameter that a classifier must decode in order to separate the classes.

Theorem 2

(Linear separability in LOT space) Let $\mathbb{P}$, $\mathbb{Q}$, and $\mathcal{H}$ follow the definitions given above. If $\hat{\mathbb{P}}$ and $\hat{\mathbb{Q}}$ are the LOT transforms of $\mathbb{P}$ and $\mathbb{Q}$, then $\hat{\mathbb{P}}$ and $\hat{\mathbb{Q}}$ will be linearly separable if $\mathcal{H}$ satisfies the following conditions:

  1. $h \in \mathcal{H} \iff h^{-1} \in \mathcal{H}$;
  2. $h_1, h_2 \in \mathcal{H} \implies \alpha h_1 + (1 - \alpha) h_2 \in \mathcal{H}$, for $\alpha \in [0, 1]$;
  3. $h_1, h_2 \in \mathcal{H} \implies h_1 \circ h_2 \in \mathcal{H}$;
  4. $\det(Dh)\, p_0 \circ h \neq \det(Dh')\, q_0 \circ h'$ for all $h, h' \in \mathcal{H}$.

For a proof, see Appendix D.
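A toy 1D check of the theorem can be run with the CDT sketch from Section 3, taking $\mathcal{H}$ to be translations: class $\mathbb{P}$ contains translated single bumps and class $\mathbb{Q}$ translated pairs of bumps. In our experience, a plain least-squares linear classifier (a stand-in for a linear SVM; all names and parameters here are illustrative) scores markedly higher on held-out transforms than on the raw signals:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 256)
s0 = np.ones_like(x)                       # uniform reference density

def bump(c, w=0.02):
    b = np.exp(-0.5 * ((x - c) / w) ** 2) + 1e-9   # small floor for stability
    return b / (b.sum() * (x[1] - x[0]))

# Class P: one translated bump; class Q: two translated bumps.
P = [bump(0.35 + t) for t in rng.uniform(-0.15, 0.15, 100)]
Q = [0.5 * (bump(0.30 + t) + bump(0.55 + t)) for t in rng.uniform(-0.15, 0.15, 100)]

def holdout_accuracy(X, y, n_train=140):
    """Least-squares linear classifier with a bias term."""
    A = np.hstack([X, np.ones((len(X), 1))])
    idx = rng.permutation(len(y))
    tr, te = idx[:n_train], idx[n_train:]
    w, *_ = np.linalg.lstsq(A[tr], y[tr], rcond=None)
    return np.mean(np.sign(A[te] @ w) == y[te])

y = np.array([1.0] * len(P) + [-1.0] * len(Q))
X_img = np.array(P + Q)
X_cdt = np.array([cdt_forward(x, s, s0) for s in P + Q])
print(holdout_accuracy(X_img, y), holdout_accuracy(X_cdt, y))
```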

We note that the definition of the embedding (transform) for a certain image given in Eq. (6) is in a way redundant, and that the convex potential $\phi$ could be used instead. In fact, doing so would be preferred in many practical situations, such as performing machine learning-type operations on digital images. This is because the array used to store the transform defined in Eq. (6) would be twice the size of the array used to store the potential $\phi$, since there would be derivatives in both coordinate directions. More than just a savings in computer memory, using the potential directly translates to performing operations on objects of lower dimension. Ultimately, the mathematical meaning is the same, as the two formulations are related by a gradient operation. However, gradient operations are also known to be noise-augmenting, and thus introduce additional difficulties when dealing with real data. Thus, in the computational examples we utilize the potential $\phi$, rather than the LOT transform defined in Eq. (6).

5 Numerical Implementation

Given an image $I_1$ and a reference $I_0$, with domain $\Omega \subset \mathbb{R}^2$ (in the results shown below we use 2D images), we compute an approximate solution for the LOT transform by solving the associated Monge-Ampère partial differential equation. Since Brenier's theorem asserts that the optimal map between $I_0$ and $I_1$ is characterized by $f = \nabla\phi$, we can write Eq. (3) as:

$$\det\big(H\phi(x)\big)\, I_1\big(\nabla\phi(x)\big) = I_0(x), \qquad (11)$$

where $\det(H\phi)$ represents the determinant of the Hessian matrix of second partial derivatives of $\phi$. In our implementation we use a parametric formulation for the potential function $\phi$, based on a linear combination of basis functions $\psi$:

$$\phi(x) = \sum_{k} \alpha_k\, \psi(x - x_k), \qquad (12)$$

where the $\alpha_k$ are the coefficients (to be determined via optimization), and the centers $x_k$ lie on the pixel grid of the image. For simplicity, we chose the basis function $\psi$ to be a radially symmetric Gaussian of standard deviation $\sigma$, where $\sigma$ is a user-defined parameter:

$$\psi(x) = \exp\left( -\frac{\|x\|^2}{2\sigma^2} \right). \qquad (13)$$

By combining Brenier's theorem with Eq. (12), the model for the spatial transformation is then given by:

$$f(x) = \nabla\phi(x) = \sum_{k} \alpha_k\, \nabla\psi(x - x_k). \qquad (14)$$

Rather than solving the Monge-Ampère PDE directly, we propose to approximate the solution by iterative optimization of the related cost function (presented here in discretized form):

$$J(\alpha) = \sum_{x \in \Omega} \Big( \det\big(H\phi(x)\big)\, I_1\big(\nabla\phi(x)\big) - I_0(x) \Big)^2. \qquad (15)$$

Thus, the solution to the OT problem can be found by solving a registration-like image processing problem, under the model for the spatial transformation specified in Eqs. (13) and (14). Note that for the spatial transformation to remain invertible, the Hessian $H\phi$ must be positive definite throughout the domain of the image, and thus input images must be such that $I_0, I_1 > 0$. Finally, we note the approach has similarities to other parametric image registration methods based on basis functions [Rohde et al. (2003), Kybic and Unser (2003), Rueckert et al. (1999)].
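The sketch below illustrates the parametric construction of Eqs. (12)-(15) in 2D. It is our own loose reading rather than the paper's code: in particular, the quadratic term $\|x\|^2/2$ added to the potential (so that $\alpha = 0$ yields the identity map) and the basis-center stride are assumptions:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def make_grid(n):
    return np.stack(np.meshgrid(np.arange(n), np.arange(n), indexing="ij"), 0)

class SpotModel:
    """phi(x) = |x|^2/2 + sum_k alpha_k psi(x - x_k), with f = grad phi."""

    def __init__(self, n, sigma, stride=4):
        self.grid = make_grid(n).astype(float)          # shape (2, n, n)
        ii, jj = np.meshgrid(np.arange(0, n, stride),
                             np.arange(0, n, stride), indexing="ij")
        self.centers = np.stack([ii.ravel(), jj.ravel()], 1).astype(float)
        self.sigma = sigma
        self.alpha = np.zeros(len(self.centers))        # coefficients

    def map_and_hessian(self):
        """Return f = grad phi and the Hessian H phi on the pixel grid."""
        gx, gy = self.grid
        s2 = self.sigma ** 2
        f = self.grid.copy()                            # identity part
        H = np.zeros((2, 2) + gx.shape)
        H[0, 0] += 1.0
        H[1, 1] += 1.0
        for a, (cx, cy) in zip(self.alpha, self.centers):
            dx, dy = gx - cx, gy - cy
            psi = np.exp(-(dx ** 2 + dy ** 2) / (2 * s2))
            f[0] += a * (-dx / s2) * psi                # gradient of basis
            f[1] += a * (-dy / s2) * psi
            H[0, 0] += a * ((dx ** 2 / s2 - 1) / s2) * psi
            H[1, 1] += a * ((dy ** 2 / s2 - 1) / s2) * psi
            H[0, 1] += a * (dx * dy / s2 ** 2) * psi
        H[1, 0] = H[0, 1]
        return f, H

def spot_cost(model, I0, I1):
    """Discretized Eq. (15): sum of (det(H phi) I1(grad phi) - I0)^2."""
    f, H = model.map_and_hessian()
    detH = H[0, 0] * H[1, 1] - H[0, 1] * H[1, 0]
    I1_warped = map_coordinates(I1, f, order=1, mode="nearest")
    return np.sum((detH * I1_warped - I0) ** 2)
```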

5.1 Numerical Optimization

The objective function $J(\alpha)$ can be minimized using gradient descent, which can be described by the following iterative update rule:

$$\alpha^{(t+1)} = \alpha^{(t)} - \lambda\, \nabla_\alpha J\big(\alpha^{(t)}\big), \qquad (16)$$

where $\lambda$ is the learning rate and $\nabla_\alpha J$ is the derivative of $J$ with respect to the coefficients $\alpha$. We refer the reader to Appendix E for the full derivation of $\nabla_\alpha J$ for the case when $I_0$ and $I_1$ are 2D images.

It is well known that standard gradient descent can be very slow to converge to the optimum, and therefore alternative methods have been developed that can dramatically increase the rate of convergence [Nesterov (1983), Duchi et al. (2011), Kingma and Ba (2015)]. In this work, we use the method known as Adam [Kingma and Ba (2015)], which typically requires little tuning and has been shown to outperform other first-order gradient-based methods in neural networks. The Adam method estimates the first moment $m$ and second moment $v$ of the gradient:

$$m^{(t)} = \beta_1\, m^{(t-1)} + (1 - \beta_1)\, \nabla_\alpha J, \qquad v^{(t)} = \beta_2\, v^{(t-1)} + (1 - \beta_2)\, \big(\nabla_\alpha J\big)^2, \qquad (17)$$

where $\beta_1, \beta_2 \in [0, 1)$ are user-defined exponential decay rates.

Since $m$ and $v$ are initialized as zero, the moment estimates are biased towards zero. Kingma and Ba [Kingma and Ba (2015)] counteract this problem by computing bias-corrected estimates $\hat{m}$ and $\hat{v}$:

$$\hat{m}^{(t)} = \frac{m^{(t)}}{1 - \beta_1^t}, \qquad \hat{v}^{(t)} = \frac{v^{(t)}}{1 - \beta_2^t}. \qquad (18)$$

These estimates are then used to compute an adaptive learning rate. As a result, the coefficients $\alpha$ are updated as follows:

$$\alpha^{(t+1)} = \alpha^{(t)} - \frac{\lambda\, \hat{m}^{(t)}}{\sqrt{\hat{v}^{(t)}} + \epsilon}. \qquad (19)$$

The authors suggest default values of $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$ [Kingma and Ba (2015)]. We adopt these values throughout this work.
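In code, the update rules in Eqs. (16)-(19) amount to a handful of lines. This generic sketch (ours) accepts any gradient function, e.g. the gradient of Eq. (15) derived in Appendix E, or a finite-difference approximation of it:

```python
import numpy as np

def adam_minimize(grad_fn, alpha0, lr=0.01, beta1=0.9, beta2=0.999,
                  eps=1e-8, n_iter=500):
    """Adam [Kingma and Ba (2015)]: grad_fn(alpha) returns dJ/dalpha."""
    alpha = alpha0.copy()
    m = np.zeros_like(alpha)                 # first-moment estimate
    v = np.zeros_like(alpha)                 # second-moment estimate
    for t in range(1, n_iter + 1):
        g = grad_fn(alpha)
        m = beta1 * m + (1 - beta1) * g               # Eq. (17)
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)                  # Eq. (18), bias correction
        v_hat = v / (1 - beta2 ** t)
        alpha -= lr * m_hat / (np.sqrt(v_hat) + eps)  # Eq. (19)
    return alpha
```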

5.2 Multi-Scale Approach

The efficiency of the minimization relies not only on the choice of gradient descent algorithm, but also on the position from which the optimization is started. Generally, we can assume that an image comprises a hierarchy of scales. Low-frequency features, such as the general shapes within the image, are found at the coarsest scale, and high-frequency features, such as noise, are found at the finest scale [Paquin et al. (2006)]. We can guide our method towards the globally optimum potential $\phi^*$ by computing the optimum potential at scale $j$, and using the result as the initial position for the gradient descent at the next, finer scale $j - 1$. For $J$ scales, the optimization therefore proceeds from the coarsest scale $j = J$ down to the finest scale $j = 1$, with the optimum potential at each scale initializing the search at the next, and the globally optimum potential taken as the final, finest-scale solution (Eq. (20)). By extension, the optimal transport map is given by the gradient of that potential, $f^* = \nabla\phi^*$ (Eq. (21)).

In this work, we generated a hierarchy of image scales using a Gaussian pyramid. The Gaussian pyramid is constructed by smoothing the images $I_0$ and $I_1$ with a Gaussian filter, and subsampling the smoothed images by a factor of two in each direction [Adelson et al. (1984)]. An illustration of the result is depicted in Fig. 1. During the multi-scale gradient descent, moving from a coarse scale to a finer scale requires upsampling the potential $\phi$. Unlike the multi-scale approach developed by Kundu et al. [Kundu et al. (2018)], which requires the transport maps to be interpolated from one scale to the next, the potential $\phi$ in our method can be upsampled using Eq. (12) with appropriate values of $x_k$ and $\sigma$.
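A minimal sketch of the pyramid construction follows; the filter width is our choice, not prescribed by the paper:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_pyramid(img, n_scales):
    """Smooth-and-subsample pyramid; levels[0] is the original image and
    levels[-1] the coarsest scale [Adelson et al. (1984)]."""
    levels = [np.asarray(img, dtype=float)]
    for _ in range(n_scales - 1):
        smoothed = gaussian_filter(levels[-1], sigma=1.0)
        levels.append(smoothed[::2, ::2])   # subsample by 2 in each direction
    return levels

# Coarse-to-fine use: optimize the potential on levels[-1] first, then
# upsample it (via Eq. (12)) to initialize the next finer level.
```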

Figure 1: An illustration of a Gaussian pyramid used in the multi-scale version of our single potential optimal transport method.

6 Experiments

6.1 Image Warping

6.1.1 Data Preparation

We tested our proposed single potential optimal transport method on three separate datasets: the extended Yale Face Database B [Lee et al. (2005)], the LONI Probabilistic Brain Atlas dataset [Shattuck et al. (2008)], and a random sample of images from ImageNet [Deng et al. (2009)]. In this section, we briefly describe the datasets and any pre-processing that was conducted prior to our experiments.

YaleB

The extended Yale Face Database B [Lee et al. (2005)] (abbreviated to YaleB) comprises 1710 grayscale images of 38 individuals under 45 different lighting conditions. The authors of the dataset aligned the images to one another using the location of the eyes as landmarks. The images were also cropped to exclude hair and ears [Lee et al. (2005)]. In order to limit the number of image matches, we used a single cropped image of each subject under direct illumination, resulting in a dataset of 38 images. Exemplar images from our reduced YaleB dataset are shown in Fig. 2(a). Prior to performing the warpings, the images were normalized so that the sum of intensities was the same in every image. This ensured that the total mass constraint in Brenier's theorem was satisfied. Identically to Kundu et al. [Kundu et al. (2018)], we normalized the images to a large total intensity in order to avoid numerical precision errors due to small numbers. We also added a small constant (0.1) to the normalized images so that they were strictly positive.

(a) YaleB
(b) LPBA40
(c) ImageNet
Figure 2: Sample images after pre-processing from (a) the YaleB dataset, (b) the LPBA40 dataset, and (c) the ImageNet dataset.
LPBA40

Brain image data from 40 subjects were used by the Laboratory of Neuro Imaging (LONI) at UCLA to construct the LONI Probabilistic Brain Atlas (LPBA) [Shattuck et al. (2008)]. The three-dimensional magnetic resonance images were skull-stripped and aligned to a standard space using rigid-body transformations [Klein et al. (2009)]. In this work, we used the central axial slice from the 40 pre-processed volumes, and normalized these 2D slices in an identical manner to the YaleB data. Exemplar LPBA40 images are shown in Fig. 2(b).

ImageNet

The ImageNet database consists of 3.2 million color images spread over 5247 different categories [Deng et al. (2009)]. This large dataset is typically used to train and test machine learning methods for object detection and recognition. In our work, however, we used a randomly selected subset of 40 images from the database. Prior to use in our experiments, we converted the subset of ImageNet images to grayscale and resized them to a common size in pixels. The images were normalized using the same procedure as the YaleB and LPBA40 data. Sample ImageNet images after pre-processing are shown in Fig. 2(c).

6.1.2 Warping Experiments

As indicated in Section 5, our single potential optimal transport (SPOT) method has three user-defined parameters: the standard deviation $\sigma$ of the Gaussian basis function, the learning rate $\lambda$ for the gradient descent, and the number of scales $J$. In order to assess the effect of computing the optimal transport maps using a multi-scale approach, we compared SPOT at a single image scale (i.e. $J = 1$) with SPOT at $J = 3$ scales (henceforth called multi-SPOT).

To determine the optimum values for $\sigma$ and $\lambda$, we split each dataset into a training and a test set. The three training sets each comprised five images randomly selected from the corresponding dataset. For each of the training sets, we computed the optimal transport map from every image to every other image in the set for a range of parameter values. This produced 20 ($= 5 \times 4$) transport maps for each set of parameters. The parameters that resulted in the lowest mean value of the objective function (Eq. (15)) across the 20 mappings for a given dataset were used to test (multi-)SPOT on that dataset.

The SPOT and multi-SPOT methods were tested using the data excluded from the parameter optimization. For each of the three test datasets, we computed the optimal transport map from every image to every other image in the set. This resulted in 1056 ($= 33 \times 32$) transport maps for the YaleB dataset, and 1190 ($= 35 \times 34$) transport maps for each of the LPBA40 and ImageNet datasets.

6.1.3 Evaluation Measures

Following the computation of the transport maps, we evaluated the SPOT and multi-SPOT methods by measuring the relative mean squared error (MSE) between the warped image $\tilde{I}_1 = \det(Df)\, I_1 \circ f$ and the target $I_0$, the proportion of mass transported, and the mean absolute curl of the optimal transport map. The measures are defined as follows:

$$\mathrm{MSE}_{\mathrm{rel}} = \frac{\sum_{x \in \Omega} \big( \tilde{I}_1(x) - I_0(x) \big)^2}{\sum_{x \in \Omega} I_0(x)^2}, \qquad (22)$$

$$\mathrm{mass} = \frac{\sum_{x \in \Omega} I_0(x)\, \|f(x) - x\|_2}{\sum_{x \in \Omega} I_0(x)}, \qquad (23)$$

$$\mathrm{curl} = \frac{1}{|\Omega|} \sum_{x \in \Omega} \big| \nabla \times f(x) \big|. \qquad (24)$$

For all three measures, the smaller the value, the closer the computed map is to the true OT map.
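The measures can be computed as below. Since the exact normalizations in Eqs. (22)-(24) are reconstructed here, these formulas should be read as assumptions rather than the paper's definitive definitions:

```python
import numpy as np

def relative_mse(I_warped, I0):
    """Eq. (22): squared error normalized by the energy of the target."""
    return np.sum((I_warped - I0) ** 2) / np.sum(I0 ** 2)

def mass_transported(f, grid, I0):
    """Eq. (23) (assumed form): mass-weighted mean displacement of the map."""
    disp = np.linalg.norm(f - grid, axis=0)
    return np.sum(I0 * disp) / np.sum(I0)

def mean_absolute_curl(f):
    """Eq. (24): mean |curl f| for a 2D map f of shape (2, n, n)."""
    dfy_dx = np.gradient(f[1], axis=0)   # d f_y / d x
    dfx_dy = np.gradient(f[0], axis=1)   # d f_x / d y
    return np.mean(np.abs(dfy_dx - dfx_dy))
```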

In order to assess whether the relative MSEs of the SPOT and multi-SPOT methods were significantly different from one another, we performed a paired $t$-test on the relative MSE results. We also performed paired $t$-tests for the mass transported and the mean absolute curl.

6.1.4 Comparison to Other Optimal Transport Methods

We compared our single-scale SPOT method to the three methods outlined in the following sections: the flow minimization (FM) method by Haker et al. [Haker et al. (2004)], the dual problem minimization (DPM) method by Chartrand et al. [Chartrand et al. (2009)], and a single-scale version of the variational optimal transport (VOT) method by Kundu et al. [Kundu et al. (2018)]. We also compared our multi-SPOT method to the multi-scale variational optimal transport method (multi-VOT) [Kundu et al. (2018)]. The optimum parameters for the FM, DPM, VOT, and multi-VOT methods were determined using the approach described in Section 6.1.2. We then repeated the image warping experiments for all of the optimal transport methods, and evaluated the resulting transport maps using the measures defined in Section 6.1.3.

Flow Minimization (FM)

Haker et al. [Haker et al. (2004)] proposed a flow minimization method to compute the optimal transport map from the Monge formulation of the optimal transport problem. The method first obtains an initial transport map $f$ using the Knothe-Rosenblatt rearrangement [Rosenblatt (1952), Knothe (1957)]. If this initial map is considered to be a smooth vector field, Helmholtz's theorem can be used to resolve it into the sum of a curl-free and a divergence-free vector field. The method by Haker et al. [Haker et al. (2004)] uses the Helmholtz decomposition of $f$, and removes the curl by minimizing the objective function in Eq. (1) with cost function $c(x, y) = |x - y|^2$. The minimization is achieved using a gradient descent scheme involving the inverse Laplacian $\Delta^{-1}$, where $\Delta$ is the Laplace operator, and the divergence operator $\nabla\cdot$.

Minimizing the Dual Problem (DPM)

To compute the optimal transport map between two images, Chartrand et al. [Chartrand et al. (2009)] use Kantorovich's dual form of the problem [Kantorovich (1948)]: maximize

$$\int_{\Omega_\mu} u(x)\, d\mu(x) + \int_{\Omega_\nu} v(y)\, d\nu(y)$$

over all integrable functions $u$ and $v$ satisfying

$$u(x) + v(y) \leq c(x, y).$$

By considering the cost as a strictly convex function, and suitably redefining $u$ and $v$, the resulting problem is to minimize a functional of $u$ and $v$, where $u$ and $v$ satisfy a conjugacy constraint. Chartrand et al. [Chartrand et al. (2009)] apply the proposition that this functional has a unique minimizing pair of functions $u$ and $v$ that are convex conjugates of one another (i.e. $v = u^*$ and $u = v^*$). As a result, the derivative of the functional can be found, and thus the authors develop a gradient descent approach, with learning rate $\lambda$, to update $u$ and compute the optimal transport map. For a full derivation, we refer the reader to [Chartrand et al. (2009)].

Variational Optimal Transport (VOT)

Similarly to the work of Haker et al. [Haker et al. (2004)], Kundu et al. [Kundu et al. (2018)] compute the optimal transport map from the Monge formulation of the optimal transport problem. However, the authors modify the objective function in Eq. (1) to help guide the solution towards a curl-free mapping. The optimization problem is relaxed further to incorporate the mass-preserving constraint in Eq. (3), resulting in an objective function that combines the transport cost, a penalty on violations of the mass-preserving constraint, and a penalty on the curl of the map, where $\gamma_1$ and $\gamma_2$ are regularization coefficients and $\nabla\times$ is the curl operator. The objective function is non-convex, so Kundu et al. [Kundu et al. (2018)] use a variational optimization technique to solve it. The authors use the Euler-Lagrange equations for the objective function to derive its gradient, and use Nesterov's accelerated gradient descent method [Nesterov (1983)] to compute the optimal transport map. Moreover, a multi-scale technique is employed to guide the method towards the globally optimum solution.

6.2 Image Classification

In order to demonstrate the linear separability property of the LOT transform, we tested the SPOT method on five classification tasks: determining the number of Gaussian peaks in an image, distinguishing between animal faces, identifying the shape of galaxies, classifying liver nuclei as cancerous or benign, and recognizing handwritten digits. It should be noted that our aim is not to determine the optimal classification method for each task, but rather to experimentally validate the linear separability theorem proposed in Section 4.4.

6.2.1 Synthetic Gaussian Data

Using the same terminology as Section 4.4, the “mother” densities for the image classes in this dataset are a single 2D Gaussian, a mixture of two 2D Gaussians, and a mixture of three 2D Gaussians. We generated 1000 images of size 128×128 pixels for each class by randomly translating the mother patterns. To ensure that the sets of images were disjoint, only translations that resulted in non-overlapping Gaussian peaks were selected. Example images from each class are shown in Fig. 3.

Figure 3: Sample images from each class of the Gaussian dataset. The task is to classify the number of Gaussian peaks in an image.

After the Gaussian dataset had been generated, we used our SPOT method to compute the transport map – and by extension, the potential $\phi$ – between each image and a uniform reference density. As the linear separability theorem does not prescribe an optimal classifier, we assessed the classification accuracy of the potential $\phi$ using multiple classifiers: a linear support vector machine (SVM), logistic regression (LR), and penalized Fisher's linear discriminant analysis (PLDA) [Wang et al. (2011)]. Moreover, we employed a ten-fold stratified cross-validation scheme. The dataset was randomly divided into ten subsets, each containing the same proportion of images from each class. Nine subsets were used to train the linear classifier, and the remaining subset was used as the test data. This process was repeated until all ten subsets had been used as the test set. We then computed the mean classification accuracy and standard deviation. For comparison, the classification experiments were conducted on the original images, as well as on the potential.

Although classifiers such as SVM and logistic regression are more commonly used than PLDA in the literature, PLDA computes a low-dimensional projection of the data. Therefore, in addition to a quantitative evaluation of the linear separability, PLDA can provide a qualitative visual assessment of the linear separability by projecting the high-dimensional data into two dimensions. Computing PLDA directly on the images or potential involves inverting a matrix of size $128^2 \times 128^2$. This is computationally expensive, and therefore, prior to classification using PLDA, the dimensionality of the input data was reduced using principal component analysis (PCA). In all of our classification experiments, we used the maximum number of principal components (equal to the number of images, in this case) to transform the data. Mathematically, this should have no effect on the classification results, since PCA is a linear transformation.
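A sketch of this evaluation pipeline using scikit-learn is given below; note that the library's LinearDiscriminantAnalysis stands in for the penalized LDA (PLDA) of [Wang et al. (2011)], which it does not implement:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def evaluate(X, y):
    """Ten-fold stratified CV of the three linear classifiers, with PCA
    (maximum number of components) applied before classification."""
    results = {}
    for name, clf in [("SVM", LinearSVC()),
                      ("LR", LogisticRegression(max_iter=1000)),
                      ("LDA", LinearDiscriminantAnalysis())]:
        pipe = make_pipeline(PCA(), clf)  # PCA keeps min(n_samples, n_features)
        scores = cross_val_score(pipe, X, y, cv=10)
        results[name] = (scores.mean(), scores.std())
    return results
```

Here `X` holds the flattened images or potentials (one row per sample) and `y` the class labels.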

6.2.2 Animal Faces

The LHI-Animal-Faces dataset comprises face images from 20 different classes of animals [Si and Zhu (2012)]. To complicate the classification of the animals, the categories exhibit a large range of within-class variation, including rotation, posture variation, and different animal sub-types (e.g. deer with and without antlers). In this work, we used a subset of three classes, namely cats, deer, and pandas.

In order to better fit the data requirements of the linear separability theorem, the images were pre-processed using the same method as [Kolouri, Park and Rohde (2016)]. Firstly, an edge map was generated by applying Canny edge detection to the images. Secondly, the edge maps were smoothed with a Gaussian filter. Example images before and after pre-processing are shown in Fig. 4.

Figure 4: Sample images from the animal faces dataset (left), and the corresponding smoothed edge maps used in the classification experiment described in Section 6.2.2 (right).

Following pre-processing, we computed the optimal transport map between each image and a uniform reference image using our SPOT algorithm. Identically to the Gaussian data experiment, we then assessed the classification accuracy of the potential using SVM, logistic regression, and PLDA classifiers with stratified ten-fold cross-validation. For comparison, we also computed the classification accuracy using the pre-processed images.

6.2.3 Galaxy Shapes

In [Shamir (2009)], Shamir proposed a method that can automatically classify images of elliptical, spiral, and edge-on galaxies. To test the method, the author used a random subset of images from the Galaxy Zoo project [Lintott et al. (2008)]: 225 elliptical galaxies, 224 spiral galaxies, and 75 edge-on galaxies. We used the same subset of Galaxy Zoo data in this work; however, prior to computing the transport maps between the images and a uniform reference, we converted the color images to grayscale, centered the galaxy within each image, and removed background noise by segmenting the galaxy using a thresholding technique. Example pre-processed images are shown in Fig. 5. As before, we compared the classification accuracy of the potential and the original images using three different linear classifiers.

Figure 5: Sample images from each class of the galaxy dataset after pre-processing.

6.2.4 Liver Nuclei

Another classification task is to label liver cell nuclei as either fetal-type hepatoblastoma (a type of cancer) or benign. The dataset consists of nuclei images from 17 cancer patients and 26 healthy controls. Each patient has between 60 and 187 nuclei images (mean: 96, standard deviation: 16), resulting in 4145 images in total. Although there is no guarantee that all of the cells from the cancer patients are cancerous, we used the subject label (cancer or benign) as the ground truth for each image. Example images from each class are shown in Fig. 6.

Figure 6: Sample images from both classes of the liver nuclei dataset. Since the patient classes (cancer or healthy) were used as the ground truth labels for the cells, there is no guarantee that all of the nuclei in the “cancer” class are cancerous.

Unlike the other classification experiments, in which the transport maps were computed using a uniform reference image, for this task we used a wide Gaussian distribution as the reference. The Gaussian had a similar diameter to the nuclei, resulting in a more task-specific reference image. The remainder of the classification experiment was identical to those described in the previous sections.

MNIST

The final classification task assesses the linear separation theorem on the MNIST database of handwritten digits. MNIST has become a benchmark dataset for testing machine learning methods, and by using deep neural networks, classification errors of less than 0.25% have been reported in the literature [Wan et al. (2013), Cireşan et al. (2012)]. The aim of this work is not to present state-of-the-art classification results on MNIST, but rather to assess the ability of linear classifiers to distinguish between the classes in potential space compared to image space.

The dataset comprises 60000 training images and 10000 test images; however, to be consistent with our other classification experiments, we used ten-fold cross-validation on the entire set of 70000 images when evaluating the three linear classifiers in image space and potential space. The transport maps (and therefore, the potential) were computed using a uniform reference image.

7 Results

7.1 Image Warping Results

The optimum SPOT and multi-SPOT parameters for each dataset are presented in Table 1. The table shows that a standard deviation of $\sigma = 1$ universally achieved the lowest mean relative MSE on the training data. Similarly, a gradient descent learning rate of $\lambda = 0.01$ was selected at the finest resolution across all datasets. In contrast to the LPBA40 dataset, the optimum values of $\sigma$ and $\lambda$ are larger at the coarser scales for the YaleB and ImageNet datasets.

Dataset  | SPOT ($\sigma$, $\lambda$) | Multi-SPOT ($\lambda$ per scale)
YaleB    | 1, 0.01 | {1, 0.1, 0.01}
LPBA40   | 1, 0.01 | {0.01, 0.01, 0.01}
ImageNet | 1, 0.01 | {1, 0.1, 0.01}
Table 1: The optimum parameters for the single-scale and multi-scale versions of our SPOT method, across all three datasets. The multi-SPOT parameters are ordered from the coarsest scale to the finest scale, with $\sigma$ given in units of pixels at the original (finest) image resolution.

Figure 7 compares the mean and standard deviation of the relative MSE and mass transported for the SPOT and multi-SPOT methods. For all of the datasets tested, multi-SPOT resulted in a statistically significantly lower mean relative MSE than SPOT, and had a smaller standard deviation. Conversely, the SPOT method achieved a significantly lower mean mass transported than multi-SPOT on the YaleB and ImageNet datasets. Although the mean mass transported for multi-SPOT was only slightly lower than the mean mass transported for SPOT on the LPBA40 dataset (0.15 and 0.18, respectively), the two means are significantly different. The final evaluation measure, mean absolute curl, has been omitted from Fig. 7 because both methods achieved zero curl for all three datasets. This is because the curl of the gradient of a potential is identically zero.

(a) Relative MSE
(b) Mass transported
Figure 7: Mean and standard deviation of the (a) relative MSE and (b) mass transported for the single- and multi-scale SPOT methods on all three datasets. The results for the mean absolute curl have been omitted, since both methods achieved zero curl on all of the datasets.

Figure 8 shows example results for the single-scale optimal transport methods (FM, DPM, VOT, and SPOT) on all three datasets. The central columns show the warped input image for each method, after attempting to map $I_1$ to $I_0$. The relative MSE of each result is displayed below the warped image. The top and bottom rows of each subfigure show the mappings that resulted in the lowest and highest relative MSE, respectively, for the SPOT method. In all of the examples in Fig. 8, the relative MSE for the SPOT method was lower than the corresponding relative MSEs for the other single-scale optimal transport methods. Moreover, the lowest relative MSE for the SPOT method was two orders of magnitude smaller than the corresponding relative MSEs of the other methods.

(a) YaleB
(b) LPBA40
(c) ImageNet
Figure 8: Example results for the single-scale optimal transport methods on (a) the YaleB dataset, (b) the LPBA40 dataset, and (c) the ImageNet dataset. Columns 2-5 show the warped input image for each method after attempting to map $I_1$ to $I_0$. The relative MSE of each result is displayed below the image. The top and bottom rows of each subfigure show the mappings that resulted in the lowest and highest relative MSE, respectively, for the SPOT method.

Analogously to Fig. 8, Fig. 9 shows example results for the multi-scale optimal transport methods (multi-VOT and multi-SPOT) on all three datasets. The second and third columns show the warped input image for each method, after attempting to map $I_1$ to $I_0$. The relative MSE of each result is displayed below the warped image. The top and bottom rows of each subfigure show the mappings that resulted in the lowest and highest relative MSE, respectively, for the multi-SPOT method. Similarly to the single-scale optimal transport results, in all three datasets the lowest relative MSE for the multi-SPOT method was an order of magnitude lower than the corresponding relative MSE for the multi-VOT method. However, the cases that produced the highest relative MSE for the multi-SPOT method resulted in a lower relative MSE with the multi-VOT method.

(a) YaleB
(b) LPBA40
(c) ImageNet
Figure 9: Example results for the multi-scale optimal transport methods on (a) the YaleB dataset, (b) the LPBA40 dataset, and (c) the ImageNet dataset. Columns 2-3 show the warped input image for each method after attempting to map $I_1$ to $I_0$. The relative MSE of each result is displayed below the image. The top and bottom rows of each subfigure show the mappings that resulted in the lowest and highest relative MSE, respectively, for the multi-SPOT method.

The mean and standard deviation of the error measures for the single-scale optimal transport methods (FM, DPM, VOT, and SPOT) are presented in Fig. 10. For clarity, large bars have been truncated, and the true mean value of the error measure is displayed above the bar. Across all three datasets, SPOT resulted in a significantly lower mean relative MSE than the other single-scale optimal transport methods (corrected for multiple comparisons). The standard deviation of the relative MSE was also smaller for SPOT compared to the other methods. A similar trend is observed for the mass transported in Fig. 10(b), except that the VOT method resulted in a significantly lower mean mass transported than SPOT for the ImageNet dataset. Since the SPOT and DPM methods are based on Brenier's theorem and derive the transport maps from a potential, the mean and standard deviation of the mean absolute curl were universally zero for those two methods.

(a) Relative MSE
(b) Mass transported
(c) Mean absolute curl
Figure 10: Mean and standard deviation of the (a) relative MSE, (b) mass transported, and (c) mean absolute curl, for the single-scale optimal transport methods on all three datasets. For clarity, large bars have been truncated, and the true mean value of the error measure is displayed above the bar.

Figure 11 shows the mean and standard deviation of the error measures for the multi-scale optimal transport methods (multi-VOT and multi-SPOT). Identically to Fig. 10, large bars have been truncated, and the true mean value of the error measure is displayed above the bar. For ease of comparison, the mass transported and mean absolute curl are shown at the same scale as Fig. 10; the relative MSE, however, is shown at a smaller scale to reflect the lower relative MSEs of the multi-scale optimal transport methods. For all error measures and datasets, multi-SPOT resulted in a significantly lower mean value than multi-VOT. Moreover, multi-SPOT achieved a smaller standard deviation than multi-VOT for all error measures and datasets, except for the relative MSE on the LPBA40 dataset.

(a) Relative MSE
(b) Mass transported
(c) Mean absolute curl
Figure 11: Mean and standard deviation of the (a) relative MSE, (b) mass transported, and (c) mean absolute curl, for the multi-scale optimal transport methods on all three datasets. For clarity, large bars have been truncated, and the true mean value is displayed above the bar. It should also be noted that the relative MSE in (a) is shown at a smaller scale than in Figure 10(a).

7.2 Classification Results

The mean and standard deviation of the classification accuracy for the five classification tasks are presented in Table 2. For the task in which the classifiers have to determine the number of Gaussian peaks in an image, the mean classification accuracy on the original image data was only marginally higher than chance for all three classifiers. In contrast, when classifying the data using the potential $\phi$, the mean accuracy was around 90%. This result is illustrated in Fig. 12, where the classes are indistinguishable in the two-dimensional PLDA projection of the image data, but are almost perfectly linearly separable in the potential space.

(a) Image space
(b) Potential space
Figure 12: Two-dimensional PLDA projection of (a) the original Gaussian image data, and (b) its potential $\phi$. In image space, the classes are indistinguishable, whereas in potential space they are almost perfectly linearly separable. The results shown here correspond to a single set of test data from the ten-fold cross-validation.

Although the increase in classification accuracy between image and potential space is smaller for the animal faces dataset than for the Gaussian dataset, a higher accuracy was achieved across all three classifiers when using the potential. Interestingly, the accuracy of logistic regression on the animal faces image data was considerably lower than the accuracy obtained using an SVM or PLDA on the same data. In contrast, all three linear classifiers resulted in a very similar classification accuracy (approximately 85%) on the potential derived from the animal faces data.

For the galaxy and liver nuclei datasets, classification on the potential resulted in a higher mean accuracy than image space for only two out of the three classifiers. However, the accuracies in image space and potential space were very close in the cases where classification on image space achieved the higher accuracy. Finally, for the MNIST dataset, classification in potential space resulted in a lower accuracy than image space using the PLDA classifier. The results for the SVM and logistic regression classifiers were almost identical in both image space and potential space, with image space achieving a marginally smaller standard deviation.

Dataset      | SVM (Image)  | SVM (Potential) | LR (Image)   | LR (Potential) | PLDA (Image) | PLDA (Potential)
Gaussians    | 0.348±0.011 | 0.882±0.030     | 0.347±0.020 | 0.895±0.013    | 0.357±0.024 | 0.930±0.012
Animal faces | 0.714±0.078 | 0.851±0.058     | 0.423±0.007 | 0.851±0.064    | 0.784±0.071 | 0.830±0.039
Galaxies     | 0.636±0.048 | 0.773±0.034     | 0.698±0.054 | 0.810±0.034    | 0.805±0.057 | 0.788±0.058
Liver nuclei | 0.762±0.019 | 0.759±0.020     | 0.769±0.019 | 0.771±0.017    | 0.732±0.029 | 0.763±0.030
MNIST        | 0.917±0.003 | 0.917±0.004     | 0.917±0.003 | 0.917±0.004    | 0.810±0.003 | 0.737±0.005
Table 2: Mean ± standard deviation of the classification accuracy on the images and on the potential $\phi$, using a linear SVM, logistic regression, and PLDA.

8 Discussion

In the first set of experiments, we compared our numerical implementation of the LOT transform, based on the solution of the Monge-Ampère equation, to existing optimal transport methods from the literature. We showed that our SPOT method is able to provide a curl-free mapping between two images, which results in a lower error than the other methods. This is highlighted by the high-quality image matches in Figs. 8 and 9, as well as the plots of mean relative MSE in Figs. 10 and 11. For all four methods tested in Section 7.1, the lowest mean relative MSE was obtained on the LPBA40 dataset. This could be due to the relatively small deformations and intensity changes required to match the brain images, compared to the face images and ImageNet data. This is also reflected in the optimum multi-SPOT parameters presented in Table 1. Wider Gaussian basis functions are able to capture larger deformations, and larger values of $\sigma$ were selected for the coarsest scales of the YaleB and ImageNet datasets compared to the LPBA40 dataset.

Despite achieving a lower mean mass transported than the multi-VOT method, multi-SPOT resulted in a larger mean mass transported than its single-scale equivalent. This is possibly because our cost function in Eq. (15) does not directly minimize the mass transported, but rather the MSE. As a result, the multi-scale approach to SPOT achieves a significantly lower mean MSE than the single-scale approach, at the expense of a larger mean mass transported. However, both the single- and multi-scale versions of our algorithm resulted in a considerably lower mass transported than the other methods, suggesting that the map obtained by our numerical implementation is closer to the true optimal transport map.

The aim of the classification experiments described in Section 6.2 was to demonstrate the LOT linear separability theorem stated in Section 4.4. Using our synthetic Gaussian dataset, specifically designed to match the mathematical criteria of the linear separability theorem, we showed that the data are linearly inseparable in image space, and almost perfectly separable in transport space. A visual inspection of the few misclassified results in Fig. 12(b) reveals that these images contain Gaussian peaks close to the edge of the image. Consequently, slight errors in accurately computing the potential at the boundary could have affected the classification result.

Other than for the PLDA classifier on the MNIST dataset, classification on the potential achieved a similar, if not higher, accuracy than classification in image space using the same linear classifier. This suggests that classification in LOT space can be beneficial on real data as well as on synthetic examples. Moreover, the relatively lower classification accuracy of the PLDA classifier on the MNIST dataset could be due to the handwritten digits not fulfilling the mathematical requirements of the linear separability theorem. For example, the confusion matrix of classification results indicates that images containing the digit “4” were more commonly misclassified as “9” in potential space than in image space. Example images of 4's and 9's that were correctly classified by the PLDA classifier in both image space and potential space are shown in Figs. 13(a) and 13(b). Images that were correctly identified as 4's in image space, but incorrectly classified as 9's in potential space, are shown in Fig. 13(c). In comparison to the correctly classified 4's, the misclassified 4's appear to be less angular, with the top part of the digit forming a loop similar to several of the correctly identified 9's. This suggests that certain classes in the MNIST dataset are not completely disjoint, and thus do not meet the requirements of the linear separability theorem outlined in Section 4.4.

(a)
(b)
(c)
Figure 13: Images from the MNIST dataset that were (a) correctly identified as the digit “4” by the PLDA classifier in both spaces, (b) correctly identified as the digit “9” by the PLDA classifier in both spaces, and (c) correctly classified as 4's by the PLDA classifier in image space, but misclassified as 9's in potential space.

8.1 Summary and Conclusions

In this work we have presented a pattern-theoretic approach for modeling and inference from image data based on the mathematics of optimal transport. Inspired by the work described in [Park et al. (2017), Kolouri, Park and Rohde (2016)], we have framed the linear optimal transport approach developed in [Wang et al. (2013), Basu et al. (2014), Kolouri, Tosun et al. (2016)] as a nonlinear signal transform, and described some of its mathematical properties. The transform involves computing a transport map between a given image and a chosen reference. Hence – and in contrast to linear signal transforms such as the Fourier and wavelet transforms – the LOT transform allows for the encoding of pixel displacements, and thus allows for the comparison of pixel intensities at non-fixed image coordinates.

It should be noted that the computational results shown here relied on a numerical implementation of the underlying (Monge) OT problem, where the transport map was modeled directly as the gradient of a convex potential function $\phi$. Furthermore, the potential function was itself modeled as a linear sum of smooth basis functions. We showed numerical comparisons to other numerical (Monge) OT approaches in order to verify the accuracy of the computed solutions. We clarify, however, that the signal LOT transformation described here could make use of any numerical solver for actual implementation.

We also note that the nature of the numerical implementation we described here is, in a way, incompatible with the signal transformation properties (e.g. translation) shown earlier. This is because we have used a numerical PDE approach which requires both source and target images to be defined on fixed grids over the same square domain. This could explain in part some loss in the ability of the numerical transformation method to make signal classes more linearly separable, though the numerical classification results suggest that at times the signal transformation framework can be valuable in making the pattern recognition problem simpler to solve.

We believe the contributions presented here could enhance the theoretical foundations of the emerging transport-based and other Lagrangian signal transformation methods that have recently been employed in a variety of applications, including modeling brain anatomy [Kundu et al. (2018)] and cell morphology [Basu et al. (2014)], cancer detection [Ozolek et al. (2014), Tosun et al. (2015), Hanna et al. (2017)], communications [Park et al. (2018)], modeling turbulence in imaging experiments [Nichols et al. (2017)], inverse problems [Kolouri and Rohde (2015), Cattell et al. (2017)], and other applications in machine learning and applied statistics [Kolouri et al. (2017)].

9 Acknowledgements

The authors gratefully acknowledge funding from the National Science Foundation (CCF 1421502) and the National Institutes of Health (R01 GM090033) in contributing to this work.

Appendix A Proof of the Translation Property

Consider two positive density functions $s$ and $s_0$ defined on measure spaces $\Omega_\nu$ and $\Omega_\mu$, respectively. We assume that the total mass associated with $s$ is equal to the total mass associated with $s_0$, such that:

$$\int_{\Omega_\nu} s(y)\, dy = \int_{\Omega_\mu} s_0(x)\, dx. \qquad (25)$$

The optimal transport problem is to find the least-cost transformation $f$ that is also a mass-preserving map taking one density to the other:

$$\int_{A} s_0(x)\, dx = \int_{f(A)} s(y)\, dy \quad \text{for all measurable } A \subseteq \Omega_\mu. \qquad (26)$$

If we consider $f$ to be the gradient of a convex potential $\phi$, such that $f = \nabla\phi$, then by Brenier's theorem there exists a unique (up to a constant) transport map that satisfies the equation above. Now let $s_\tau$ represent a translation of $s$ by $\tau$, such that $s_\tau(y) = s(y - \tau)$. In the same manner as above, we can solve for the map $f_\tau$ taking $s_0$ to $s_\tau$:

$$\int_{A} s_0(x)\, dx = \int_{f_\tau(A)} s(y - \tau)\, dy. \qquad (27)$$

By substituting $y \mapsto y + \tau$ into Eq. (27) via the change of variables theorem, and equating Eqs. (26) and (27), we get:

$$\int_{f(A)} s(y)\, dy = \int_{f_\tau(A) - \tau} s(y)\, dy. \qquad (28)$$

Since the integrals are equal, we have $f_\tau = f + \tau$ (note that $f + \tau = \nabla(\phi(x) + \tau \cdot x)$ remains the gradient of a convex potential, so by uniqueness it is the optimal map). If we define the transform of density $s$ with respect to the reference density $s_0$ as $\hat{s} = (f - \mathrm{id})\sqrt{s_0}$, then:

$$\hat{s}_\tau = (f_\tau - \mathrm{id})\sqrt{s_0} = (f + \tau - \mathrm{id})\sqrt{s_0} = \hat{s} + \tau\sqrt{s_0}. \qquad (29)$$

Appendix B Proof of the Scaling Property

Consider two positive density functions $s$ and $s_0$ defined on measure spaces $\Omega_\nu$ and $\Omega_\mu$, respectively. We assume that the total mass associated with $s$ is equal to the total mass associated with $s_0$, such that:

$$\int_{\Omega_\nu} s(y)\, dy = \int_{\Omega_\mu} s_0(x)\, dx. \qquad (30)$$

The optimal transport problem is to find a mass-preserving map $f$ that takes one density to the other:

$$\int_{A} s_0(x)\, dx = \int_{f(A)} s(y)\, dy \quad \text{for all measurable } A \subseteq \Omega_\mu. \qquad (31)$$

If we consider $f$ to be the gradient of a convex potential $\phi$, such that $f = \nabla\phi$, then by Brenier's theorem there exists a unique transport map that satisfies the equation above. Now let $s_a$ represent a scaling of $s$ by $a > 0$, such that $s_a(y) = a^d\, s(a y)$. In the same manner as above, we can solve for $f_a$:

$$\int_{A} s_0(x)\, dx = \int_{f_a(A)} a^d\, s(a y)\, dy. \qquad (32)$$

By substituting $y \mapsto y / a$ into Eq. (32) via the change of variables theorem, and equating Eqs. (31) and (32), we get:

$$\int_{f(A)} s(y)\, dy = \int_{a f_a(A)} s(y)\, dy. \qquad (33)$$

Since the integrals are equal, we have $f_a = f / a$ (which remains the gradient of the convex potential $\phi / a$). If we define the transform of density $s$ with respect to the reference density $s_0$ as $\hat{s} = (f - \mathrm{id})\sqrt{s_0}$, then:

$$\hat{s}_a = \left( \frac{f}{a} - \mathrm{id} \right)\sqrt{s_0} = \frac{\hat{s} - (a - 1)\,\mathrm{id}\,\sqrt{s_0}}{a}. \qquad (34)$$

Appendix C Proof of the Composition Property

Consider two positive density functions $s$ and $s_0$ defined on measure spaces $\Omega_\nu$ and $\Omega_\mu$, respectively. We assume that the total mass associated with $s$ is equal to the total mass associated with $s_0$, such that:

$$\int_{\Omega_\nu} s(y)\, dy = \int_{\Omega_\mu} s_0(x)\, dx. \qquad (35)$$

The optimal transport problem is to find a mass-preserving map $f$ that takes one density to the other:

$$\int_{A} s_0(x)\, dx = \int_{f(A)} s(y)\, dy \quad \text{for all measurable } A \subseteq \Omega_\mu. \qquad (36)$$

If we consider $f$ to be the gradient of a convex potential $\phi$, such that $f = \nabla\phi$, then by Brenier's theorem there exists a unique (up to a constant) transport map $f$ that satisfies the equation above. Now let $s_g$ represent the composition of $s$ with an invertible function $g$, such that $s_g = \det(Dg)\, s \circ g$, where $\det(Dg)$ is the determinant of the Jacobian of $g$. Similarly to $f$, the functions $g$ and $f_g$ are also gradients of potentials (i.e. $g = \nabla\varphi$). In the same manner as before, we can solve for $f_g$:

$$\int_{A} s_0(x)\, dx = \int_{f_g(A)} \det\big(Dg(y)\big)\, s\big(g(y)\big)\, dy. \qquad (37)$$

By substituting $y \mapsto g^{-1}(y)$ into Eq. (37) via the change of variables theorem, and equating Eqs. (36) and (37), we get:

$$\int_{f(A)} s(y)\, dy = \int_{g(f_g(A))} s(y)\, dy. \qquad (38)$$

Since the integrals are equal, we have $g \circ f_g = f$, i.e. $f_g = g^{-1} \circ f$. If we define the transform of density $s$ with respect to the reference density $s_0$ as $\hat{s} = (f - \mathrm{id})\sqrt{s_0}$, then:

$$\hat{s}_g = \big( g^{-1} \circ f - \mathrm{id} \big)\sqrt{s_0}. \qquad (39)$$

Appendix D Proof of the Linear Separability Property

Two non-empty subsets $\hat{\mathbb{P}}$ and $\hat{\mathbb{Q}}$ of a vector space are linearly separable if, and only if, their convex hulls are disjoint, i.e.:

$$\sum_{i} \alpha_i \hat{p}_i \neq \sum_{j} \beta_j \hat{q}_j \qquad (40)$$

for any finite subsets $\{\hat{p}_i\} \subseteq \hat{\mathbb{P}}$ and $\{\hat{q}_j\} \subseteq \hat{\mathbb{Q}}$, where $\alpha_i, \beta_j \geq 0$ and $\sum_i \alpha_i = \sum_j \beta_j = 1$. For a proof, we refer the reader to [Park et al. (2017)].

In order to show that the LOT transforms $\hat{\mathbb{P}}$ and $\hat{\mathbb{Q}}$ (as defined in Eq. (6)) of sets $\mathbb{P}$ and $\mathbb{Q}$ are linearly separable, we first assume that $\hat{\mathbb{P}}$ and $\hat{\mathbb{Q}}$ are not linearly separable (proof by contradiction). This implies that the convex hulls of the LOT transforms are not disjoint, and therefore, from Eq. (40), we can write:

$$\sum_{i} \alpha_i \hat{p}_i = \sum_{j} \beta_j \hat{q}_j. \qquad (41)$$

Since the LOT transforms are defined as $\hat{p}_i = (f_i - \mathrm{id})\sqrt{s_0}$ and $\hat{q}_j = (g_j - \mathrm{id})\sqrt{s_0}$, the equation above can be rewritten in terms of the transport maps $f_i$ and $g_j$:

$$\sum_{i} \alpha_i f_i = \sum_{j} \beta_j g_j, \qquad (42)$$

where, by Brenier's theorem, $f_i = \nabla\phi_i$ and $g_j = \nabla\varphi_j$, with $\phi_i$ and $\varphi_j$ convex potentials.

Using the Jacobian equation from the Monge formulation of the optimal transport problem (Eq. (3)), and the fact that the LOT transform uniquely identifies an $\hat{s}$ with some density $s$, we can write $\det(Df_i)\, p_i \circ f_i = s_0$. Moreover, using the generative model for image classes described in Section 4.4, the elements of $\mathbb{P}$ are generated via $p_i = \det(Dh_i)\, p_0 \circ h_i$ with $h_i \in \mathcal{H}$. By combining these two results via the composition theorem for LOT transforms, one can show that