Neural TMDlayer: Modeling Instantaneous Flow of Features via SDE Generators

08/19/2021 ∙ by Zihang Meng, et al. ∙ University of Wisconsin-Madison ∙ University of Illinois at Chicago

We study how stochastic differential equation (SDE) based ideas can inspire new modifications to existing algorithms for a set of problems in computer vision. Loosely speaking, our formulation is related to both explicit and implicit strategies for data augmentation and group equivariance, but is derived from new results in the SDE literature on estimating infinitesimal generators of a class of stochastic processes. If and when there is nominal agreement between the needs of an application/task and the inherent properties and behavior of the types of processes that we can efficiently handle, we obtain a very simple and efficient plug-in layer that can be incorporated within any existing network architecture, with minimal modification and only a few additional parameters. We show promising experiments on a number of vision tasks including few-shot learning, point cloud transformers and deep variational segmentation, obtaining efficiency or performance improvements.




1 Introduction

Consider a deep neural network model with parameters $\theta$, which we train using the following update rule,

$$\theta_{k+1} = \theta_k - \eta\, \nabla_\theta\, \mathbb{E}_x\!\left[\ell(\theta_k; x)\right], \qquad (1)$$

where $x$ is a random variable representing data and $\ell$ represents the loss function. Now, consider a slightly more general form of the same update formula,

$$\theta_{k+1} = \theta_k - \eta\, \nabla_\theta\, \mathbb{E}_x\!\left[\ell(\theta_k; W x)\right]. \qquad (2)$$

The only change here is the introduction of $W$, which can be assumed to be some data transformation matrix. If $W = I$, we see that Stochastic Gradient Descent (SGD) is a special case of (2), under the assumption that we approximate the expectation in (2) with finitely many iid samples (or a mini-batch).
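To make update (2) concrete, here is a minimal NumPy sketch (our own illustration, not from the paper) of one generalized SGD step on a least-squares loss; the function name and loss choice are assumptions for exposition, and setting the transformation `W` to the identity recovers plain SGD:

```python
import numpy as np

def sgd_step(theta, X, y, W, eta=0.1):
    """One step of the generalized update (2) on a least-squares loss.

    X: (n, d) mini-batch, y: (n,) targets, W: (d, d) data transformation.
    With W = np.eye(d) this reduces to the plain SGD update (1).
    """
    Xw = X @ W.T                        # transformed samples W x
    residual = Xw @ theta - y           # per-sample prediction error
    grad = Xw.T @ residual / len(X)     # mini-batch gradient of 0.5 * mean(residual**2)
    return theta - eta * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))
y = rng.normal(size=32)
theta = sgd_step(np.zeros(3), X, y, np.eye(3))   # W = I: an ordinary SGD step
```

Choosing `W` as a rotation or blur matrix instead of the identity turns the same step into training on (implicitly) augmented samples.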

Let us unpack the data transformation notation a bit to check what it offers. If a set of transformations is chosen beforehand and applied to the data samples before training commences, $Wx$ simply represents data samples derived via data augmentation. On the other hand, $W$ may not necessarily be explicitly instantiated as above. For example, spherical CNN [16] shows that when point cloud type data are embedded on the sphere with spherical convolutional operators, it is possible to learn representations of data that are equivariant to the group action of rotations with no explicit data augmentation procedure. In particular, these approaches register each data point on a standard template (like the sphere) on which efficient convolutions can be defined based on differential geometric constructions. In other words, by utilizing the properties of the transformations of interest and how they relate the data points, such a treatment enables the updates to implicitly take into account the loss on the transformed samples $Wx$. Conceptually, many results [16, 48, 42] on equivariance show that by considering the entire orbit of each sample (a 3D point cloud) during training, for special types of $W$, it is possible to avoid explicit data augmentation.

We can take a more expanded view of the above idea. Repeated application of a transformation to a data point $x_0$ produces a discrete sequence $x_0, x_1, x_2, \dots$ where $x_{k+1} = W_k x_k$. In general, the transformation matrix at the $k$-th step, denoted by $W_k$, need not even be generated from a fixed matrix. Indeed, in practice $W_k$ is selected from a set of appropriate transformations such as rotation, blur and so on, with some ordering, which could even be stochastic. At a high level, approaches such as [16, 12] can be seen as a special case of (2). Making this argument precise requires adding an appropriate number of auxiliary variables and averaging over all possible realizable $W_k$'s; the specific steps are not particularly relevant since, apart from helping set up the intuition we just described, algorithms for equivariance to specific group actions do not directly inform our development. For the sake of convenience, we will primarily focus on the continuous time system since, under the same initial conditions, the trajectories of both (continuous and discrete) systems coincide at all integer times $t$.

Figure 1: Overview of TMDlayer use in few-shot recognition, point cloud learning and segmentation. “EGNN” refers to the edge-labeling graph neural network [25]; “FF” refers to the feed-forward layer (10) and “CV” refers to our proposed deep Chan-Vese model (12). The manifold (top) describes the meaning of $\mathcal{L}$ and $L$: $\mathcal{L}$ captures the structure of the manifold, and $L$ is an approximation of $\mathcal{L}$ constructed from samples.

What does $W$ actually represent? There are two interpretations of $W$: (i) it formalizes on-the-fly or instantaneous (smooth) data augmentation, which is often used to accelerate training by exploiting symmetries in the landscape of the loss $\ell$, and (ii) a data dependent $W$ can be designed for invariance-like requirements, useful for downstream applications. In fact, learning data dependent transformations has also been explored by [14]. The starting point of this work is to exploit the view that the data sample provided to us is merely a snapshot of an underlying stochastic process, which we will discuss shortly. Nonetheless, the key hypothesis is that specifying this process to our deep neural network model will be beneficial and provide a fresh perspective on some strategies that are already in use in the literature.

Main ideas.

The foregoing use of “process” to describe the data sample hints at the potential use of ordinary differential equations (ODEs). While ODE type constructions can be used to characterize simple processes, they are insufficient to model the more complex processes that better reflect practical considerations; this motivates stochastic differential equations (SDEs). The key challenge in directly instantiating the “$Wx$” idea for SDEs is that it is clearly infeasible: there are infinitely many possible trajectories for the same initial conditions. Our main insight is that recent results in the SDE literature show that (under some technical conditions) the dynamics can be completely characterized by (functions of) the infinitesimal generator of the process, which can be efficiently estimated using finite data. We exploit this result via a simple modification to the estimation procedure, one that can be directly used within any backpropagation based training scheme. Specifically, we exploit the result from [2], where the authors call the estimated generator the Target Measure Diffusion map (TMDmap). This leads to our TMDlayer, which can be conveniently dropped into a network and used as a plug-and-play module with just a few additional parameters. When utilized within standard deep learning pipelines, our layer allows incorporating much richer domain information if available, or serving as a regularizer, an augmentation scheme, or a substitute for an existing layer. We find this is beneficial to the overall performance of the model.

Our contributions. Models such as Neural ODE [10] and Neural SDE [34] usually parameterize the dynamical system as a stand-alone module, and show how gradients can be efficiently backpropagated through it. We take a different approach: we propose a stochastic process inspired layer which, in its most rudimentary form, can be thought of as an augmentation scheme that works with existing layers in deep neural networks. But unlike explicit data augmentation (rotation, flipping), which happens in the input image space, our layer can be utilized in the feature space and is fully adaptive to the input. It is also more than another augmentation scheme: our layer models the time-varying/stochastic properties of the data/features, and controls them via a parameterization that is highly parameter efficient. We show that this stochasticity is not only mathematically interesting, but can be exploited in applications including point cloud transformers, object segmentation and few-shot recognition.

1.1 Related Work.

Early work in vision has made extensive use of differential equations [7, 36, 45, 6], especially for segmentation. In machine learning, differential equations are useful for manifold learning and semi-supervised learning [3, 38], among others. Recently, a number of strategies combine differential equations with deep neural networks (DNNs) for solving vision problems. For example, [9] utilizes a conditional random field after the CNN encoder to refine semantic segmentation results, whose update rules can be viewed as a differential equation, and [37, 22] use a CNN to extract visual features before feeding them to an active contour model which iteratively refines the contour according to a differential equation. Separately, the literature includes strategies for solving differential equations with DNNs [24, 39, 32]. Over the last few years, a number of formulations including neural ODE [10], neural SDE [34] and augmented neural ODE [15] have been proposed, motivated by the need to solve differential equation modules within DNNs. Note that [34] proposes to stabilize the neural ODE network with stochastic noise, which leads to a neural SDE, a setting quite different from the one studied here. Finally, we note that SDEs have also been used as a tool for stochastic analysis of DNNs [8].

2 Preliminaries

Background. A partial differential equation (PDE) is a functional equation whose solution satisfies given relations between its various partial derivatives, interpreted as multivariable functions. Consider a commonly used PDE model for segmentation, the heat equation, whose solution $u(x, t)$ depends on both space $x$ and time $t$. By the celebrated Feynman-Kac formula, we know that the solution can be equivalently written as a conditional expectation with respect to a continuous time stochastic process $X_t$. This means that the solution (segmentation) can be obtained by averaging a sequence of stochastic integration problems. For prediction, we need an algebraic concept called the “generator” of a function (like a neural network) since we are more interested in the pushforward mappings $f(X_t)$.

Given a time invariant stochastic process $X_t$, the (infinitesimal) generator $\mathcal{L}$ of a function $f$ is defined as,

$$\mathcal{L} f(x) = \lim_{t \downarrow 0} \frac{\mathbb{E}\left[f(X_t) \mid X_0 = x\right] - f(x)}{t}. \qquad (3)$$

If the process is deterministic, the expectation operator becomes the identity, and so the generator simply measures the instantaneous rate of change in $f$ with respect to $t$. In addition, say that $X_t$ can also be expressed as an (Itô) Stochastic Differential Equation (SDE), i.e., satisfies:

$$dX_t = b(X_t)\, dt + \sigma(X_t)\, dW_t, \qquad (4)$$

where $W_t$ is a (multidimensional) Brownian motion, and $b$ and $\sigma$ represent drift and diffusion functions. Then, it turns out that $\mathcal{L}$ can be written in closed form (without the limit) as,

$$\mathcal{L} f = b \cdot \nabla f + \tfrac{1}{2} \operatorname{Tr}\!\left(\sigma \sigma^\top \nabla^2 f\right), \qquad (5)$$

where $\mathcal{L}$ acts as a linear operator on functions $f$, see [29]. We will shortly explain how to estimate and utilize $\mathcal{L}$.
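As a quick numerical sanity check on the equivalence of (3) and (5) (our own illustration, not from the paper), the following NumPy sketch estimates the generator of a one-dimensional Ornstein-Uhlenbeck process by finite-difference Monte Carlo over one Euler-Maruyama step, and compares it against the scalar closed form $b\,f' + \tfrac{1}{2}\sigma^2 f''$; the process, test function, and parameter values are all assumptions for the demonstration:

```python
import numpy as np

# Ornstein-Uhlenbeck process: dX_t = -theta * X_t dt + sigma dW_t
theta, sigma = 1.0, 0.5
b = lambda x: -theta * x          # drift
f = lambda x: x ** 2              # test function

def generator_mc(x, delta=1e-3, n=500_000, seed=0):
    """Finite-difference Monte Carlo estimate of (3):
    (E[f(X_delta) | X_0 = x] - f(x)) / delta, via one Euler-Maruyama step."""
    z = np.random.default_rng(seed).standard_normal(n)
    x_delta = x + b(x) * delta + sigma * np.sqrt(delta) * z
    return (f(x_delta).mean() - f(x)) / delta

def generator_closed_form(x):
    """The closed form (5) for scalar f(x) = x^2:
    L f = b(x) f'(x) + 0.5 * sigma^2 f''(x)."""
    return b(x) * 2 * x + 0.5 * sigma ** 2 * 2

x0 = 0.7
print(generator_mc(x0), generator_closed_form(x0))  # both close to -0.73
```

The Monte Carlo estimate matches the closed form up to the discretization bias (of order `delta`) and sampling noise, which is the practical difficulty that the finite-sample estimators discussed next sidestep.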

Setup. Consider the setting where $x$ represents our input features (say, an image as a 3D array for the RGB channels) and $f$ is a network with $T$ layers. Let the data be in the form of $n$ points $x_1, \dots, x_n$ with $x_i \in \mathbb{R}^D$, which lie on a compact $d$-dimensional differentiable submanifold $\mathcal{M}$ which is assumed to be unknown. We assume that in our case $\mathcal{M}$ is defined implicitly using the samples $x_1, \dots, x_n$, and so it is impossible to obtain closed form expressions for the operators in (5). In such cases, recall that Diffusion maps [13] uncovers the geometric structure by using the samples to construct an $n \times n$ matrix $L$ as an approximation of the linear operator $\mathcal{L}$.

Interpreting the SDE. Recall that when (4) is used on the input space, it can model stochastic transformations of the input image (rotation and clipping are special cases). When (4) is used on the feature space (e.g., in an intermediate layer of a DNN), it can model stochastic transformations of the features, where it is hard to hand design augmentation methods. Moreover, it enables us to parameterize and learn the underlying stochastic changes/SDE of the features.

Roadmap. In the next section, we describe the estimation of the differential operator $\mathcal{L}$ within deep network training pipelines. Based on this estimate, we define the TMDlayer as an approximation to $f(X_t)$ over a small time interval using Taylor’s theorem. In §4, we discuss four different applications of the TMDlayer, where the pushforward measure under the flow of features (interpreted as a vector field) may be a reasonable choice.

3 Approximating $\mathcal{L}$ in Feedforward Networks

We now discuss a recently proposed nonparametric procedure to estimate $\mathcal{L}$ given finite samples. This is an important ingredient because, in our setup, we often do not have a meaningful generative model of mini-batch samples, especially in the high dimensional setting (e.g., images).

Constructing $L$ in DNN training. The definition in (3), while intuitive, is not immediately useful for computational purposes. Under some technical conditions, such as smoothness of the coefficients and a rank condition on the diffusion $\sigma$, [2] recently showed that for processes that satisfy (4), it is indeed possible to construct finite sample estimators of $\mathcal{L}$. In [2], the approach is called Target Measure Diffusion map (TMDmap), so we call our proposed layer a TMDlayer.

To construct the differential operator, we first need to compute a kernel matrix from the data. For problems involving a graph or a set of points as input, we can simply use the given data points ($n$ would be the number of nodes in the graph, or the number of points in the set), while for problems with a single input (e.g., standard image classification), we may not have access to multiple data points directly. In this case, we can construct the kernel matrix by sampling a batch from the dataset and processing the samples together, because we can often assume that the entire dataset is, in fact, sampled from some underlying distribution.

1 Input: function $f$, a batch of $n$ data samples $\{x_i\}_{i=1}^{n}$, coefficient $\epsilon$, parameterized time interval $t$
2 Construct the kernel matrix $K$ from pairwise distances by (6)
3 Compute the kernel density estimate: $q_\epsilon(x_i) = \sum_j K_{ij}$
4 Parameterize the target distribution $\mu$ by (7)
5 Form the diagonal matrix $\tilde{D}$ with components $\tilde{D}_{ii} = \mu(x_i)^{1/2} / q_\epsilon(x_i)$
6 Use $\tilde{D}$ to right-normalize $K$: $\tilde{K} = K \tilde{D}$
7 Construct $L$ by (8) using $\tilde{K}$
Algorithm 1 Operational steps in TMDlayer

After getting the set of data samples, we first project the data into a latent space of a suitable dimension $h$ using a learnable linear layer, before evaluating the projections with an appropriate kernel function such as,

$$K_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{\epsilon}\right). \qquad (6)$$
We then follow [2] to construct the differential operator as follows: we compute the kernel density estimate $q_\epsilon(x_i) = \sum_j K_{ij}$. Then, we form the diagonal matrix $\tilde{D}$ with components $\tilde{D}_{ii} = \mu(x_i)^{1/2} / q_\epsilon(x_i)$, where $\mu$ is the target distribution. Here, we allow the network to learn $\mu$ by

$$\mu(x_i) = g_\phi(x_i), \qquad (7)$$

where $g_\phi$ can be a linear layer or an MLP depending on the specific application. Next, we use $\tilde{D}$ to right-normalize the kernel matrix, $\tilde{K} = K \tilde{D}$, and use $D$, the diagonal matrix of row sums of $\tilde{K}$, to left-normalize $\tilde{K}$. Then we can build the TMDmap operator as

$$L = \frac{1}{\epsilon} \left( D^{-1} \tilde{K} - I \right). \qquad (8)$$

We will use (8) to form our TMDlayer as described next.
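The construction above can be sketched in a few lines of NumPy (a minimal illustration under our own naming; the Gaussian kernel and the learned density values `mu` stand in for the learnable components described in the text):

```python
import numpy as np

def tmd_operator(x, mu, eps=0.5):
    """Build a TMDmap-style estimate L of the generator, following steps (6)-(8).

    x:  (n, h) batch of (projected) features
    mu: (n,) positive values of the target density at each sample
    """
    sq_dists = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq_dists / eps)                # kernel matrix, as in (6)
    q = K.sum(axis=1)                          # kernel density estimate
    D_tilde = np.diag(np.sqrt(mu) / q)         # target-measure reweighting
    K_tilde = K @ D_tilde                      # right-normalize
    D = np.diag(K_tilde.sum(axis=1))           # diagonal of row sums
    return (np.linalg.inv(D) @ K_tilde - np.eye(len(x))) / eps   # eq. (8)

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 4))
mu = np.ones(16)                               # e.g., a uniform target density
L = tmd_operator(x, mu)
# D^{-1} K_tilde is row-stochastic, so L annihilates constant functions,
# just as a generator should:
assert np.allclose(L @ np.ones(16), 0.0)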

3.1 TMDlayer: A Transductive Correction via $L$

Observe that (4) is very general and can represent many computer vision tasks, where the density could be defined using a problem specific energy function, and the Brownian motion is the source of noise. In other words, we aim to capture the underlying structure of the so-called image manifold [61] by using its corresponding differential operator (5). Intuitively, this means that if we are provided a network $f$ with parameters $\theta$, then by Taylor’s theorem, the infinitesimal generator estimate $L$ can be used to approximate the change of $f$ over a small time interval $t$ as follows:

$$f(x_i, t) \approx f(x_i, 0) + t \left( L F \right)_i, \qquad (9)$$

where $F$ is the matrix such that the $i$-th row is $f(x_i, 0)$, and $t$ is interpreted as a hyperparameter in our use cases; see Algorithm 1.
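Approximation (9) translates into a very small wrapper around an arbitrary layer function. The sketch below is our own illustration (the linear map `f`, the toy row-stochastic operator `L`, and the scalar `t` are stand-ins for the learnable quantities in the text):

```python
import numpy as np

def tmdlayer(f, x, L, t=0.1):
    """TMDlayer forward pass, eq. (9): F + t * (L @ F),
    where row i of F is the wrapped layer f applied to sample x_i."""
    F = f(x)                  # (n, d_out) features from the wrapped layer
    return F + t * (L @ F)

# A toy generator estimate L from a Gaussian kernel over the batch
# (row-stochastic normalization, so L annihilates constant features).
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
sq = ((x[:, None] - x[None, :]) ** 2).sum(-1)
P = np.exp(-sq / 0.5)
P /= P.sum(axis=1, keepdims=True)
L = (P - np.eye(8)) / 0.5

W = rng.normal(size=(3, 5))
f = lambda z: z @ W           # stand-in for any existing layer
out = tmdlayer(f, x, L)
assert out.shape == (8, 5)
```

Note that `t = 0` recovers the wrapped layer exactly, which is why the module can be dropped into an existing network without changing its expressiveness.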


Inference using $L$. In the ERM framework, typically, each test sample is used independently and identically, i.e., the network (at optimal parameters) is used in a sequential manner for predictive purposes. Our framework allows us to further use relationships between the test samples for prediction. In particular, we can design custom choices of $L$ tailored for downstream applications. For example, in applications that require robustness to small and structured perturbations, it may be natural to consider low bias diffusion processes, i.e., we can prescribe the magnitude of the drift almost everywhere using a small constant bound (akin to a radius of perturbation) and the structure using diffusion functions $\sigma$. Inference can then be performed using generators derived from the corresponding process.

Layerwise $L$ for improved estimation. While (9) allows us to use $L$ for any network with no modifications, using it naively can be unsatisfactory in practice. For example, we often find that features from input layers might not be too informative for the task and may hinder training, especially in the early stages. We suggest a simple adjustment: instead of applying the approximation in (9) to the entire network, we do it layerwise; this could be every intermediate layer or several layers of interest. It means that $f$ in (9) can in principle be any layer (e.g., a layer in a graph neural network or a layer in a Resnet), as shown in Fig. 1.


Recall that most feed-forward neural networks can be completely defined by a finite sequence of linear transformations followed by activation functions (along with intermediate normalization layers). One option is to estimate the network output by directly applying the Taylor series-like expansion in (9) to the full composition of all $T$ layers. However, from (9) we can see that the variance of such an estimate will be high due to the well-known propagation of uncertainty phenomenon (across layers). To avoid this, we can estimate it in a sequential manner, i.e., use the corrected output of each layer as the input when applying (9) to the next layer. We will show in §4 that this parameterization can be useful in various applications.
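A hypothetical sketch of this sequential, layerwise scheme is given below (our own illustration; `L` is held fixed across layers here for simplicity, whereas in practice it would be rebuilt from the current features at each layer):

```python
import numpy as np

def tmd_forward(layers, x, L, t=0.1):
    """Apply the correction of eq. (9) layer by layer:
    h <- f_l(h) + t * L @ f_l(h), for l = 1..T,
    so each corrected output feeds the next layer."""
    h = x
    for f_l in layers:
        F = f_l(h)
        h = F + t * (L @ F)
    return h

rng = np.random.default_rng(1)
n, dims = 8, [3, 6, 4]
Ws = [rng.normal(size=(dims[i], dims[i + 1])) for i in range(2)]
layers = [lambda z, W=W: np.tanh(z @ W) for W in Ws]

# A simple row-stochastic generator estimate from the batch.
x = rng.normal(size=(n, 3))
sq = ((x[:, None] - x[None, :]) ** 2).sum(-1)
P = np.exp(-sq / 0.5)
P /= P.sum(axis=1, keepdims=True)
L = (P - np.eye(n)) / 0.5

out = tmd_forward(layers, x, L)
assert out.shape == (n, 4)
```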

Synopsis. We briefly summarize the benefits of our TMDlayer. It can parameterize the underlying stochastic transformations of features, providing a way to augment features at any layer. The stochasticity/randomness in the TMDlayer is a stability inducing operation for robust predictive purposes [20]. The TMDlayer is parameter efficient: all we need is a projection linear layer, a linear layer parameterizing the density, and a scalar parameter $t$. In practice, we can work with a small latent dimension (e.g., $h = 16$) when constructing $L$, so the total number of parameters in a TMDlayer is very small compared with the layer function $f$ in most deep learning applications. The reader will see that a mild limitation of the SDE perspective in practice is that, in principle, the dynamics may eventually get stuck in a meta-stable state; in this case, the estimate $L$ will not be very informative in the forward pass, and so the gradient estimates might be biased. In such cases, it may be useful to add points by sampling on the orbit. We will now describe four different vision settings where our TMDlayer can be instantiated in a plug-and-play manner.

4 Applications

In this section, we evaluate our TMDlayer in the context of different applications. As a warm-up, in §4.1, we demonstrate the use of the TMDlayer on a simple image classification task. We study its properties in both inductive and transductive settings. Then, in §4.2, we move to learning with point cloud datasets. Here, we see that the data type naturally offers a suitable object for leveraging the features of the TMDlayer. In this case, we conduct experiments in an inductive setting. Next, in §4.3, we explore the use of the TMDlayer on a segmentation task (also in an inductive setting). We propose a novel deep active contour model which can be viewed as a dynamical process within a neural network, and we demonstrate the use of our TMDlayer on top of such a dynamical process. Finally, in §4.4, we investigate few-shot learning. Here, the problem setup natively provides the graph needed for computing our $L$ and allows transductive inference.

4.1 A Simple Sanity Check on Resnet

We start with a simple example of image classification on CIFAR10 [28] using Resnet [23], to demonstrate applicability of our TMDlayer and evaluate its behavior.

4.1.1 Role of TMDlayer: Finetuning/Robustifying Resnet

We choose Resnet-18 as the backbone network and simply treat each of its three residual blocks Res as $f$ (see [23] for details of a residual block) in the TMDlayer as follows,

$$x_{l+1} = \mathrm{Res}(x_l) + t\, L\, \mathrm{Res}(x_l),$$

where $x_l$ is the feature at the $l$-th layer and $L$ is constructed from a mini-batch of samples.

4.1.2 Experimental results

During training, we first sample $b$ data points in a batch and use them as the input so that we can construct $L$. During test time, an input batch also contains $b$ samples (similar to training time), where $b$ increases from 1 to 200. We can see from Table 1 that $b$ does have an influence on the test accuracy: a larger $b$ performs better than a smaller $b$. A key reason is that a larger $b$ can better capture the geometric structure of the data.

We also test whether our TMDlayer can help improve the robustness of the network. We assess this property by adding random noise to the input image and evaluating the test accuracy (see Table 2). With our TMDlayer, the network is more noise resilient. This can be partly attributed to the use of our parameterized target density $\mu$, which allows the network to control the stochastic process in the TMDlayer adaptively, dependent on the input. In summary, the performance profile is similar (Tab. 1), with small improvements in robustness (Tab. 2).

$b$ Inference w/ TMDlayer Accuracy (%)
1 No 75.15
1 Yes 87.35
10 Yes 87.65
50 Yes 88.14
100 Yes 88.52
150 Yes 88.55
200 Yes 88.25
Table 1: Accuracy on the test set of CIFAR10 after adding the TMDlayer to Resnet-18. Here, $b$ is the batch size used to construct $L$ during test/inference time. The accuracy of Resnet-18 (trained/tested without the TMDlayer) is comparable.
Noise std 0.01 0.02 0.03 0.05 0.1
Resnet-18 87.54 83.90 75.85 53.87 17.27
Ours 87.79 84.37 77.96 56.18 19.18
Table 2: Accuracy on CIFAR10 when adding random noise (with the std shown in the header) to the input. “Ours” refers to Resnet-18 plus the TMDlayer.

4.2 Point cloud transformer

Tasks involving learning with point cloud data are important within 3D vision. The input here is usually a 3D point cloud represented by a set of points, each associated with its own feature descriptor. These points can naturally be thought of as samples from an underlying distribution which captures the geometric structure of the object. The problem provides an ideal sandbox to study the effect of our TMDlayer, but before we do so, we provide some context for where and how the TMDlayer will be instantiated. Recently, [19] proposed a transformer based model for point cloud learning which achieves state-of-the-art performance on this task, and corresponds to an effective and creative use of transformers in this setting. Nonetheless, Transformer models are known to be parameter costly (e.g., see [5, 56, 59] for cheaper approximations effective in NLP settings), and it is sensible to check to what extent our TMDlayer, operating on a simple linear layer, can be competitive with the transformer layer proposed in [19]. Our goal will be to check whether significant parameter efficiency is possible.

4.2.1 Problem formulation

Denote an input point cloud with $n$ points, each with a $d$-dimensional feature descriptor. The classification task is to predict a class label for the entire point cloud.

4.2.2 Role of TMDlayer: Replace transformer layer

The point cloud transformer layer in [19] is constructed as,

$$x_{l+1} = \mathrm{FF}(\mathrm{SA}(x_l)), \qquad (10)$$

where FF refers to their feed-forward layer (a combination of Linear, BatchNorm and ReLU layers), and $\mathrm{SA}(x_l)$ is the output of the self-attention module, which takes $x_l$ as input (we refer the reader to [19] for more details of their network design, also included in our appendix).

A Transformer layer is effective for point clouds because it simultaneously captures the relations between the features of all points. Since our TMDlayer can be viewed as a diffusion operator which captures the structure of the underlying data manifold from the data, we can check to what extent this ability suffices. We use the TMDlayer on a single feed-forward layer to replace the Transformer layer in (10),

$$x_{l+1} = \mathrm{FF}(x_l) + t\, L\, \mathrm{FF}(x_l). \qquad (11)$$

Surprisingly, it turns out that this simple layer can perform comparably with the carefully designed Transformer layer in (10) while offering a much more favorable parameter efficiency profile. Here, $L$ is constructed using the points of the same point cloud (the setting is identical to the baselines).

4.2.3 Experimental results

Dataset. We follow [19] to conduct a point cloud classification experiment on ModelNet40 [54]. The dataset contains 12,311 CAD models in 40 object categories and is widely used in benchmarking point cloud shape classification methods. We use the official splits for training/evaluation.

Network architecture and training details. We use the same network as [19] except that we replace each point cloud transformer layer with a TMDlayer built on a single feed-forward layer. We follow [19] in using the same strategy to uniformly sample points from each object and the same data augmentation strategy during training; the mini-batch size, number of epochs, and SGD hyperparameters (momentum, initial learning rate, cosine annealing schedule) also follow [19]. The hidden dimension of the whole network matches [19], and a small latent dimension is used for constructing $L$ (in the TMDlayer).

Results. We see from Table 3 that our approach achieves performance comparable to [19]. In terms of the number of parameters, for the hidden dimension used in this experiment, one self-attention layer contains substantially more parameters than one linear layer, and the TMDlayer module adds only a small number of parameters on top of the linear layer it wraps.

Method Input #Points Accuracy(%)
PointNet [41] P 1k 89.2
A-SCN [55] P 1k 89.8
SO-Net [31] P, N 2k 90.9
Kd-Net [26] P 32k 91.8
PointNet++ [41] P 1k 90.7
PointNet++ [41] P, N 5k 91.9
PointGrid [30] P 1k 92.0
PCNN [1] P 1k 92.3
PointConv [53] P, N 1k 92.5
A-CNN [27] P, N 1k 92.6
DGCNN [52] P 1k 92.9
PCT [19] P 1k 93.2
Ours P 1k 93.0
Table 3: Classification results on ModelNet40. Accuracy means overall accuracy. P = points, N = normals. “Ours” means replacing transformer layers in PCT with TMDlayer.

4.3 Object segmentation

Here, we show that our TMDlayer (a dynamical system) can also be built on top of another dynamical system. We do so by demonstrating experiments on object segmentation.

Recall that active contour models are a family of effective segmentation models which evolve the contour iteratively until a final result is obtained. Among many options available in the literature (e.g., [44, 49, 57]), the widely used Chan-Vese [7] model evolves the contour based on a variational functional. Here, we propose to combine the Chan-Vese functional with a deep network by parameterizing the iterative evolution steps, and we build our TMDlayer on top of it. We see that this simple idea leads to improved results. The appendix includes more details of our model.

4.3.1 Problem formulation

Let $\Omega$ be a bounded open subset of $\mathbb{R}^2$, and let $\partial\Omega$ be its boundary. Given an image $I$, object segmentation involves predicting a dense map in $\{0, 1\}$, where $1$ (and $0$) indicates the object (and the background). In our formulation, we parameterize the object contour by a level set function $\phi$ and evolve it within the DNN. We note that hybrid approaches using level sets together with DNNs are not unique to our work, see [37, 58].

4.3.2 Role of TMDlayer: in deep active contour model

Our proposed deep active contour model evolves the contour in the form of a level set function within the network, and the update scheme is,

$$\phi^{l+1} = g(\phi^l), \qquad (12)$$

where $\phi^l$ is the level set function at layer $l$ and the update function $g$ is derived from our proposed deep variational functional. The appendix includes more details of our model, the variational functional, and the derivation of the update equation.

Denote the update function in (12) as $g$. Then, our TMDlayer forward pass can be written as,

$$\phi^{l+1} = g(\phi^l) + t\, L\, g(\phi^l). \qquad (13)$$
Remark 1

Note that $g$ in (12) and $L$ in (13) correspond to two different dynamical systems. The first pertains to the update function of the deep active contour model and the second refers to the TMDlayer. $L$ in (13) is constructed using samples from the same mini-batch.

Remark 2

Note that our proposed segmentation model is different from [58], which uses the variational energy function directly as the final loss; we instead parameterize the update steps within our network so that the final output already satisfies low variational energy.

4.3.3 Experimental results

Dataset. The Vaihingen buildings dataset consists of 168 building images extracted from the training set of the ISPRS “2D semantic labeling contest”, with a resolution of 9 cm. We use 100 images to train the model and the remaining 68 serve as the test set.

Network Architecture and Experiment Setup. We use an encoder CNN with an architecture similar to [21] and [37]. The input is the original image, and the network is trained for 300 epochs. We set up our baseline using the same CNN architecture to predict the segmentation mask without our Chan-Vese update module. Previous works combining active contour models and deep learning [37, 33] can often only be used to segment a single building based on a manual initialization or another initialization (from a separate algorithm), whereas our model can segment multiple buildings in the image without any initialization; so, the results cannot be meaningfully compared. See our appendix for more details about the setup.

Figure 2: Qualitative results on Vaihingen dataset. Our model performs well despite the small sample size.

Results and Discussion. We use the average Intersection over Union (IoU) to evaluate the performance on Vaihingen dataset: the baseline yields 68.9 while our model without TMDlayer achieves 73.5 and our complete model with TMDlayer achieves 74.6, which is a significant improvement in terms of IoU. This experiment shows that our TMDlayer can be built on top of another dynamical system and can provide additional benefits. Qualitative results of the baseline and our model are shown in Fig. 2. Our method tends to predict a more precise shape/boundary, and also fixes some flaws/errors relative to the baseline results.

4.4 Few-shot learning

In $N$-way $K$-shot few-shot learning, the input is a set of samples which naturally forms a fully connected graph. This graph serves to construct the differential operator $L$. To provide context for where and how our TMDlayer will be instantiated, we note that [25] proposed a GNN approach (EGNN) for few-shot learning which achieves state-of-the-art performance. We show that by adding our TMDlayer, the performance increases by a clear margin.

4.4.1 Problem formulation

Few-shot classification seeks to learn a classifier given only a few training samples for every class. Each few-shot classification task contains a support set $S$, a set of labeled input-label pairs, and a query set $Q$, an unlabeled set on which the learned classifier is evaluated. Given $K$ labeled samples for each of $N$ classes in the support set $S$, the problem is an $N$-way $K$-shot classification problem.

4.4.2 Role of TMDlayer: Use in graph neural network

Let $G$ be the graph formed by the samples from the task, with nodes denoted as $v_i$. The node feature update equation is designed as (we refer readers to [25] or our appendix for more details about the network)

$$v_i^{l+1} = g\!\left(\{v_j^{l}\}_j, \{e_{ij}^{l}\}_j; \theta^l\right), \qquad (14)$$

where $v_i^{l}$ is the feature of node $i$ at the $l$-th layer, $e_{ij}^{l}$ is the edge feature between node $i$ and node $j$, and $\theta^l$ refers to the parameters in the update function. We abstract (14) as $v^{l+1} = g(v^{l})$ and use our TMDlayer as,

$$v^{l+1} = g(v^{l}) + t\, L\, g(v^{l}). \qquad (15)$$
Remark 3

In (15), $L$ is constructed using samples from the same episode, and $g$ is a GNN module updating the node features using all node features and edge features.

4.4.3 Experimental results

Dataset. We follow [25] to conduct experiments on miniImageNet, proposed by [51] and derived from the ILSVRC-12 dataset [46]. Images are sampled from 100 different classes with 600 samples per class (size 84×84 pixels). We use the same splits as in [43, 25]: 64, 16 and 20 classes for training, validation and testing respectively.

Network architecture and training details. We use the same graph neural network architecture and training strategy as in [25], utilizing the code provided by the authors. We add our TMDlayer, as shown in (15), to each node update layer in the graph neural network, with a fixed latent dimension for constructing the kernel matrix. Following [25], we conduct experiments for 5-way 5-shot learning in both transductive and non-transductive settings, as well as in both supervised and semi-supervised settings. The network is trained with the Adam optimizer with weight decay, and the learning rate is cut in half at regular intervals during training. For evaluation, each test episode is formed by randomly sampling queries for each of the 5 classes, and the performance is averaged over 600 randomly generated episodes from the test set. Note that the feature embedding module is a four-block convolutional neural network (following [25]) used in most few-shot learning models, without any skip connections; ResNet-based models are therefore excluded from the table for a fair comparison. We refer the reader to [25] or the appendix for more training and evaluation details.
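The halving learning-rate schedule described above can be sketched with a standard step scheduler. The initial rate, weight decay, and halving interval here are placeholders (the actual values follow [25]), and the training step is a dummy:

```python
import torch

model = torch.nn.Linear(4, 2)                   # stand-in for the full network
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-6)
# cut the learning rate in half every `step_size` episodes (placeholder interval)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=100, gamma=0.5)

for episode in range(301):
    opt.step()          # one (dummy) training step per episode
    sched.step()        # lr is halved at episodes 100, 200, 300
```

After three halvings the learning rate is 1e-3 × 0.5³ = 1.25e-4; in practice the interval is chosen so the rate decays a few times over the full training run.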

Results. The performance of supervised and semi-supervised 5-way 5-shot learning is given in Tables 4 and 5, respectively. Our TMDlayer leads to consistent and clear improvements in both the supervised and semi-supervised settings (and in both the transductive and non-transductive settings).

Model Trans. Accuracy(%)
Matching Networks [51] No 55.30
Reptile [40] No 62.74
Prototypical Net [47] No 65.77
GNN [18] No 66.41
EGNN [25] No 66.85
Ours No 68.35
MAML [17] BN 63.11
Reptile + BN [40] BN 65.99
Relation Net [50] BN 67.07
MAML + Transduction [17] Yes 66.19
TPN [35] Yes 69.43
TPN (Higher K) [35] Yes 69.86
EGNN+Transduction [25] Yes 76.37
Ours+Transduction Yes 77.78
Table 4: Results of 5-way 5-shot learning on miniImageNet, averaged over 600 test episodes. “Ours” means EGNN plus our TMDlayer. “BN” means that the query batch statistics are used instead of global batch normalization parameters.
Labeled Ratio (5-way 5-shot)
Training method 20% 40% 60% 100%
GNN-semi [18] 52.45 58.76 - 66.41
EGNN-semi [25] 61.88 62.52 63.53 66.85
Ours 63.14 64.32 64.83 68.35
EGNN-semi(T) [25] 63.62 64.32 66.37 76.37
Ours(T) 64.84 66.43 68.62 77.78
Table 5: Accuracy of semi-supervised few-shot classification. “Ours” means EGNN plus our TMDlayer.

4.5 Runtime overhead/Relation with augmentation

Runtime overhead. Our construction does involve some training-time overhead from computing the kernel matrix, and the amount varies depending on the use case (§4.2, §4.3 and §4.4).
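The dominant extra cost is the n×n kernel matrix over the batch, which scales quadratically in batch size n (and linearly in feature dimension d). A quick way to measure that per-batch overhead, as an illustrative sketch rather than the paper's benchmark, is:

```python
import time
import numpy as np

def kernel_matrix(x, epsilon=1.0):
    """Gaussian kernel over a batch of n feature vectors: O(n^2 d) cost."""
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    return np.exp(-d2 / epsilon)

x = np.random.default_rng(0).standard_normal((256, 64))   # placeholder batch
t0 = time.perf_counter()
k = kernel_matrix(x)
overhead = time.perf_counter() - t0    # extra per-batch time added by the TMDlayer
```

For the moderate batch/episode sizes used in our experiments, this quadratic term is small relative to the forward and backward passes of the base network.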

Relationship with data augmentation. Data augmentation and our TMDlayer are complementary, not mutually exclusive. In all our experiments, the baselines already use data augmentations (e.g., random rotation or cropping), and our TMDlayer offers benefits above and beyond augmentation.

5 Discussion and Conclusions

We proposed an SDE-based framework that offers a unified view of several different learning tasks in vision. Our framework is beneficial where the data generation process (or the data itself) can be described using stochastic processes, or more specifically, diffusion operators. This is particularly useful in settings where obtaining a deterministic model of the image manifold or learning density functions is challenging or impossible due to high sample complexity requirements. Our TMDlayer does not require explicit generation of diffused samples, especially during training, making it computationally efficient. The “process” of which the provided data sample is a snapshot, and whose characterization is enabled by our TMDlayer, also appears to have implications for robust learning. Indeed, if the parameters that define the process are explicitly optimized, we should be able to establish an analogy between the resultant model and a stochastic/simpler version of recent results on certified margin radius maximization [60], which often require access to Monte Carlo sampling oracles [11]. We believe that periodicity in SDEs for data augmentation is an important missing ingredient; for instance, it may help automatically model seasonal patterns in disease progression studies for prediction. For this purpose, tools from Floquet theory may allow us to consider transformed versions of the process, potentially with simplified generators. Our code is available at .


Acknowledgments. This work was supported by NIH grants RF1 AG059312 and RF1 AG062336. SNR was supported by UIC start-up funds. We thank Baba Vemuri for providing many important suggestions on formulating the Chan-Vese model within deep networks.


  • [1] M. Atzmon, H. Maron, and Y. Lipman (2018) Point convolutional neural networks by extension operators. arXiv preprint arXiv:1803.10091. Cited by: Table 3.
  • [2] R. Banisch, Z. Trstanova, A. Bittracher, S. Klus, and P. Koltai (2020) Diffusion maps tailored to arbitrary non-degenerate Itô processes. Applied and Computational Harmonic Analysis 48 (1), pp. 242–265. Cited by: §1, §3, §3.
  • [3] M. Belkin, P. Niyogi, and V. Sindhwani (2006) Manifold regularization: a geometric framework for learning from labeled and unlabeled examples.. Journal of machine learning research 7 (11). Cited by: §1.1.
  • [4] M. Belkin and P. Niyogi (2003) Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation 15 (6), pp. 1373–1396. Cited by: §1.1.
  • [5] I. Beltagy, M. E. Peters, and A. Cohan (2020) Longformer: the long-document transformer. arXiv preprint arXiv:2004.05150. Cited by: §4.2.
  • [6] V. Caselles, R. Kimmel, and G. Sapiro (1997) Geodesic active contours. International journal of computer vision 22 (1), pp. 61–79. Cited by: §1.1.
  • [7] T. F. Chan and L. A. Vese (2001) Active contours without edges. IEEE Transactions on image processing 10 (2), pp. 266–277. Cited by: §1.1, §4.3.
  • [8] P. Chaudhari and S. Soatto (2018) Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. In 2018 Information Theory and Applications Workshop (ITA), pp. 1–10. Cited by: §1.1.
  • [9] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §1.1.
  • [10] R. T. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud (2018) Neural ordinary differential equations. arXiv preprint arXiv:1806.07366. Cited by: §1.1, §1.
  • [11] J. Cohen, E. Rosenfeld, and Z. Kolter (2019) Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pp. 1310–1320. Cited by: §5.
  • [12] T. S. Cohen, M. Geiger, J. Köhler, and M. Welling (2018) Spherical cnns. arXiv preprint arXiv:1801.10130. Cited by: §1.
  • [13] R. R. Coifman and S. Lafon (2006) Diffusion maps. Applied and computational harmonic analysis 21 (1), pp. 5–30. Cited by: §2.
  • [14] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le (2019) Autoaugment: learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 113–123. Cited by: item ii.
  • [15] E. Dupont, A. Doucet, and Y. W. Teh (2019) Augmented neural odes. arXiv preprint arXiv:1904.01681. Cited by: §1.1.
  • [16] C. Esteves, C. Allen-Blanchette, A. Makadia, and K. Daniilidis (2017) Learning so(3) equivariant representations with spherical cnns. CoRR. External Links: Link, 1711.06721 Cited by: §1, §1.
  • [17] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pp. 1126–1135. Cited by: Table 4.
  • [18] V. Garcia and J. Bruna (2017) Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043. Cited by: Table 4, Table 5.
  • [19] M. Guo, J. Cai, Z. Liu, T. Mu, R. R. Martin, and S. Hu (2020) PCT: point cloud transformer. arXiv preprint arXiv:2012.09688. Cited by: §4.2.2, §4.2.3, §4.2.3, §4.2.3, §4.2, Table 3.
  • [20] M. Hardt, B. Recht, and Y. Singer (2016) Train faster, generalize better: stability of stochastic gradient descent. In International Conference on Machine Learning, pp. 1225–1234. Cited by: item 2.
  • [21] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik (2015) Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 447–456. Cited by: §4.3.3.
  • [22] A. Hatamizadeh, D. Sengupta, and D. Terzopoulos (2019) End-to-end deep convolutional active contours for image segmentation. arXiv preprint arXiv:1909.13359. Cited by: §1.1.
  • [23] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.1.1, §4.1.
  • [24] E. Kharazmi, Z. Zhang, and G. E. Karniadakis (2019) Variational physics-informed neural networks for solving partial differential equations. arXiv preprint arXiv:1912.00873. Cited by: §1.1.
  • [25] J. Kim, T. Kim, S. Kim, and C. D. Yoo (2019) Edge-labeling graph neural network for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11–20. Cited by: Figure 1, §4.4.2, §4.4.3, §4.4.3, §4.4, Table 4, Table 5.
  • [26] R. Klokov and V. Lempitsky (2017) Escape from cells: deep kd-networks for the recognition of 3d point cloud models. In Proceedings of the IEEE International Conference on Computer Vision, pp. 863–872. Cited by: Table 3.
  • [27] A. Komarichev, Z. Zhong, and J. Hua (2019) A-cnn: annularly convolutional neural networks on point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7421–7430. Cited by: Table 3.
  • [28] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §4.1.
  • [29] H. Kunita (1997) Stochastic flows and stochastic differential equations. Vol. 24, Cambridge university press. Cited by: §2.
  • [30] T. Le and Y. Duan (2018) Pointgrid: a deep network for 3d shape understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9204–9214. Cited by: Table 3.
  • [31] J. Li, B. M. Chen, and G. H. Lee (2018) So-net: self-organizing network for point cloud analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9397–9406. Cited by: Table 3.
  • [32] Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anandkumar (2020) Fourier neural operator for parametric partial differential equations. arXiv preprint arXiv:2010.08895. Cited by: §1.1.
  • [33] H. Ling, J. Gao, A. Kar, W. Chen, and S. Fidler (2019) Fast interactive object annotation with curve-gcn. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5257–5266. Cited by: §4.3.3.
  • [34] X. Liu, S. Si, Q. Cao, S. Kumar, and C. Hsieh (2019) Neural sde: stabilizing neural ode networks with stochastic noise. arXiv preprint arXiv:1906.02355. Cited by: §1.1, §1.
  • [35] Y. Liu, J. Lee, M. Park, S. Kim, E. Yang, S. Hwang, and Y. Yang (2019) Learning to propagate labels: transductive propagation network for few-shot learning. In 7th International Conference on Learning Representations, ICLR 2019, Cited by: Table 4.
  • [36] R. Malladi, J. A. Sethian, and B. C. Vemuri (1995) Shape modeling with front propagation: a level set approach. IEEE transactions on pattern analysis and machine intelligence 17 (2), pp. 158–175. Cited by: §1.1.
  • [37] D. Marcos, D. Tuia, B. Kellenberger, L. Zhang, M. Bai, R. Liao, and R. Urtasun (2018) Learning deep structured active contours end-to-end. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8877–8885. Cited by: §1.1, §4.3.1, §4.3.3.
  • [38] L. Melas-Kyriazi (2020) The geometry of semi-supervised learning. Ph.D. Thesis, Harvard University, Cambridge, Massachusetts. Cited by: §1.1.
  • [39] C. Michoski, M. Milosavljević, T. Oliver, and D. R. Hatch (2020) Solving differential equations using deep neural networks. Neurocomputing 399, pp. 193–212. Cited by: §1.1.
  • [40] A. Nichol, J. Achiam, and J. Schulman (2018) On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999. Cited by: Table 4.
  • [41] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 652–660. Cited by: Table 3.
  • [42] G. Qi, L. Zhang, F. Lin, and X. Wang (2020) Learning generalized transformation equivariant representations via autoencoding transformations. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1.
  • [43] S. Ravi and H. Larochelle (2017) Optimization as a model for few-shot learning. ICLR 2017. Cited by: §4.4.3.
  • [44] R. Ronfard (1994) Region-based strategies for active contour models. International journal of computer vision 13 (2), pp. 229–251. Cited by: §4.3.
  • [45] L. I. Rudin and S. Osher (1994) Total variation based image restoration with free local constraints. In Proceedings of 1st International Conference on Image Processing, Vol. 1, pp. 31–35. Cited by: §1.1.
  • [46] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: §4.4.3.
  • [47] J. Snell, K. Swersky, and R. S. Zemel (2017) Prototypical networks for few-shot learning. arXiv preprint arXiv:1703.05175. Cited by: Table 4.
  • [48] R. Spezialetti, S. Salti, and L. D. Stefano (2019) Learning an effective equivariant 3d descriptor without supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6401–6410. Cited by: §1.
  • [49] Ö. N. Subakan and B. C. Vemuri (2011) A quaternion framework for color image smoothing and segmentation. International Journal of Computer Vision 91 (3), pp. 233–250. Cited by: §4.3.
  • [50] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales (2018) Learning to compare: relation network for few-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1199–1208. Cited by: Table 4.
  • [51] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra (2016) Matching networks for one shot learning. arXiv preprint arXiv:1606.04080. Cited by: §4.4.3, Table 4.
  • [52] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon (2019) Dynamic graph cnn for learning on point clouds. Acm Transactions On Graphics (tog) 38 (5), pp. 1–12. Cited by: Table 3.
  • [53] W. Wu, Z. Qi, and L. Fuxin (2019) Pointconv: deep convolutional networks on 3d point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9621–9630. Cited by: Table 3.
  • [54] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao (2015) 3d shapenets: a deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1912–1920. Cited by: §4.2.3.
  • [55] S. Xie, S. Liu, Z. Chen, and Z. Tu (2018) Attentional shapecontextnet for point cloud recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4606–4615. Cited by: Table 3.
  • [56] Y. Xiong, Z. Zeng, R. Chakraborty, M. Tan, G. Fung, Y. Li, and V. Singh (2021) Nyströmformer: a nyström-based algorithm for approximating self-attention. arXiv preprint arXiv:2102.03902. Cited by: §4.2.
  • [57] C. Xu, D. L. Pham, and J. L. Prince (2000) Image segmentation using deformable models. Handbook of medical imaging 2 (20), pp. 0. Cited by: §4.3.
  • [58] J. Yuan, C. Chen, and L. Fuxin (2020) Deep variational instance segmentation. NeurIPS 2020. Cited by: §4.3.1, Remark 2.
  • [59] Z. Zeng, Y. Xiong, S. Ravi, S. Acharya, G. M. Fung, and V. Singh (2021) You only sample (almost) once: linear cost self-attention via bernoulli sampling. In International Conference on Machine Learning, pp. 12321–12332. Cited by: §4.2.
  • [60] X. Zhen, R. Chakraborty, and V. Singh (2021) Simpler certified radius maximization by propagating covariances. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7292–7301. Cited by: §5.
  • [61] J. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros (2016) Generative visual manipulation on the natural image manifold. In European conference on computer vision, pp. 597–613. Cited by: §3.1.