
Advancing GraphSAGE with A Data-Driven Node Sampling
As an efficient and scalable graph neural network, GraphSAGE has enabled an inductive capability for inferring unseen nodes or graphs by aggregating subsampled local neighborhoods and by learning in a mini-batch gradient descent fashion. The neighborhood sampling used in GraphSAGE is effective in improving computing and memory efficiency when inferring a batch of target nodes with diverse degrees in parallel. Despite this advantage, the default uniform sampling suffers from high variance in training and inference, leading to sub-optimal accuracy. We propose a new data-driven sampling approach to reason about the real-valued importance of a neighborhood by a non-linear regressor, and to use the value as a criterion for subsampling neighborhoods. The regressor is learned using value-based reinforcement learning. The implied importance for each combination of vertex and neighborhood is inductively extracted from the negative classification loss output of GraphSAGE. As a result, in an inductive node classification benchmark using three datasets, our method enhanced the baseline using uniform sampling, outperforming recent variants of a graph neural network in accuracy.
04/29/2019 ∙ by Jihun Oh, et al.
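The abstract leaves the sampler's interface unspecified, so below is a minimal sketch of the stated idea: a small non-linear regressor scores each (vertex, neighbor) pair, and a softmax over the scores replaces GraphSAGE's uniform subsampling distribution. All names, shapes, and the MLP architecture are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class NeighborScorer(nn.Module):
    """Non-linear regressor estimating the real-valued importance of a
    (vertex, neighbor) pair. The architecture is an assumption, not the paper's."""
    def __init__(self, feat_dim, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, node_feat, neigh_feats):
        pairs = torch.cat(
            [node_feat.expand(len(neigh_feats), -1), neigh_feats], dim=1)
        return self.mlp(pairs).squeeze(-1)      # one importance score per neighbor

def sample_neighbors(scorer, node_feat, neigh_feats, k):
    """Subsample k neighbors with probabilities given by the learned scores,
    replacing GraphSAGE's uniform distribution."""
    with torch.no_grad():
        probs = torch.softmax(scorer(node_feat, neigh_feats), dim=0)
    return torch.multinomial(probs, min(k, len(neigh_feats)), replacement=False)
```

In the paper the regressor is trained with value-based reinforcement learning, with the negative classification loss of GraphSAGE serving as the reward signal; that training loop is omitted here.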

Stability Properties of Graph Neural Networks
Data stemming from networks exhibit an irregular support, whereby each data element is related to the others by arbitrary pairwise relationships determined by the network. Graph neural networks (GNNs) have emerged as information processing architectures that exploit the particularities of this underlying support. The use of nonlinearities in GNNs, coupled with the fact that filters are learned from data, raises mathematical challenges that have precluded the development of theoretical results that would give insight into the reasons for the remarkable performance of GNNs. In this work, we prove the property of stability, which states that a small change in the support of the data leads to a small (bounded) change in the output of the GNN. More specifically, we prove that the bound on the difference between the outputs of the GNN computed on two different graphs is proportional to the difference between the graphs and to the design parameters of the GNN, as long as the trained filters are integral Lipschitz. We exploit this result to provide insight into the crucial effect that nonlinearities have in obtaining an architecture that is both stable and selective, a feat that is impossible to achieve using only linear filters.
05/11/2019 ∙ by Fernando Gama, et al.
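In schematic form (constants absorbed into K, higher-order terms suppressed; the paper's exact bound also tracks the number of layers and features), the two ingredients of the result can be written as follows, for graph shift operators S and Ŝ at distance at most ε:

```latex
% Integral Lipschitz condition on each learned filter's frequency response h:
|\lambda \, h'(\lambda)| \le C \quad \text{for all } \lambda .

% Schematic stability bound for an L-layer GNN \Phi on input x,
% when d(S, \hat{S}) \le \varepsilon:
\| \Phi(x; S) - \Phi(x; \hat{S}) \| \le K \, C \, L \, \varepsilon \, \| x \| + O(\varepsilon^2)
```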

Kymatio: Scattering Transforms in Python
The wavelet scattering transform is an invariant signal representation suitable for many signal processing and machine learning applications. We present the Kymatio software package, an easy-to-use, high-performance Python implementation of the scattering transform in 1D, 2D, and 3D that is compatible with modern deep learning frameworks. All transforms may be executed on a GPU (in addition to CPU), offering a considerable speed-up over CPU implementations. The package also has a small memory footprint, resulting in efficient memory usage. The source code, documentation, and examples are available under a BSD license at https://www.kymat.io/
12/28/2018 ∙ by Mathieu Andreux, et al.
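As a usage illustration (following Kymatio's documented PyTorch frontend; module paths may differ across versions), a GPU-capable 2D scattering transform looks roughly like this:

```python
import torch
from kymatio.torch import Scattering2D  # 1D and 3D transforms are available analogously

# Scattering transform with 2 dyadic scales on 32x32 inputs.
scattering = Scattering2D(J=2, shape=(32, 32))
x = torch.randn(8, 1, 32, 32)  # a batch of single-channel images

# Move both the transform and the data to the GPU when available.
if torch.cuda.is_available():
    scattering, x = scattering.cuda(), x.cuda()

Sx = scattering(x)  # scattering coefficients, spatially downsampled by 2**J
print(Sx.shape)
```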

Planning with Arithmetic and Geometric Attributes
A desirable property of an intelligent agent is its ability to understand its environment so as to quickly generalize to novel tasks and to compose simpler tasks into more complex ones. If the environment has geometric or arithmetic structure, the agent should exploit this structure for faster generalization. Building on recent work that augments the environment with user-specified attributes, we show that further equipping these attributes with the appropriate geometric and arithmetic structure brings substantial gains in sample complexity.
09/06/2018 ∙ by David Folqué, et al.
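The abstract does not spell out what the arithmetic structure looks like, so here is a purely hypothetical illustration of the flavor of the idea: if states and goals are summarized by attribute counts, then the remaining work for a task is literally a vector difference, a structure an agent can exploit to compose tasks.

```python
import numpy as np

# Hypothetical attribute vocabulary; the paper's attributes are user-specified.
ATTRS = ["red_block", "blue_block", "tower"]

def as_vector(counts):
    """Encode a state or goal as a vector of attribute counts."""
    return np.array([counts.get(a, 0) for a in ATTRS])

state = as_vector({"red_block": 3, "blue_block": 1})
goal = as_vector({"red_block": 1, "blue_block": 1, "tower": 1})
todo = goal - state  # arithmetic structure: the remaining work is a vector difference
print(dict(zip(ATTRS, todo)))  # negative entries: remove; positive entries: create
```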

Deep Geometric Prior for Surface Reconstruction
The reconstruction of a discrete surface from a point cloud is a fundamental geometry processing problem that has been studied for decades, with many methods developed. We propose the use of a deep neural network as a geometric prior for surface reconstruction. Specifically, we overfit a neural network representing a local chart parameterization to part of an input point cloud, using the Wasserstein distance as a measure of approximation. By jointly fitting many such networks to overlapping parts of the point cloud, while enforcing a consistency condition, we compute a manifold atlas. By sampling this atlas, we can produce a dense reconstruction of the surface approximating the input cloud. The entire procedure does not require any training data or explicit regularization, yet we show that it performs remarkably well: it does not introduce typical overfitting artifacts and, at the same time, closely approximates sharp features. We experimentally show that this geometric prior produces good results for both man-made objects containing sharp features and smoother organic objects, as well as for noisy inputs. We compare our method with a number of well-known reconstruction methods on a standard surface reconstruction benchmark.
11/27/2018 ∙ by Francis Williams, et al.
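A minimal sketch of the core step: overfit one local chart network to one patch of the cloud under an entropic approximation of the Wasserstein distance. The plain Sinkhorn iteration, the network size, and the assumption that the patch is normalized to the unit box are choices of this sketch, not the paper's exact setup.

```python
import torch
import torch.nn as nn

def sinkhorn_loss(x, y, eps=0.1, iters=50):
    """Entropic approximation of the Wasserstein distance between two
    point sets with uniform weights (plain Sinkhorn iteration)."""
    C = torch.cdist(x, y) ** 2                    # pairwise squared distances
    K = torch.exp(-C / eps)
    a = torch.full((x.shape[0],), 1.0 / x.shape[0])
    b = torch.full((y.shape[0],), 1.0 / y.shape[0])
    u = torch.ones_like(a)
    for _ in range(iters):
        v = b / (K.t() @ u)
        u = a / (K @ v)
    P = torch.diag(u) @ K @ torch.diag(v)         # approximate transport plan
    return (P * C).sum()

# One local chart phi: [0,1]^2 -> R^3, overfit to one patch of the cloud.
chart = nn.Sequential(nn.Linear(2, 128), nn.ReLU(),
                      nn.Linear(128, 128), nn.ReLU(),
                      nn.Linear(128, 3))
patch = torch.rand(500, 3)                        # stand-in for a normalized input patch
opt = torch.optim.Adam(chart.parameters(), lr=1e-3)
for step in range(2000):
    uv = torch.rand(500, 2)                       # samples in the parameter domain
    loss = sinkhorn_loss(chart(uv), patch)
    opt.zero_grad(); loss.backward(); opt.step()
```

Jointly fitting many overlapping charts under the paper's consistency condition, and the final atlas sampling, are omitted here.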

On the Expected Dynamics of Nonlinear TD Learning
While there are convergence guarantees for temporal difference (TD) learning when using linear function approximators, the situation for nonlinear models is far less understood, and divergent examples are known. Here we take a first step towards extending theoretical convergence guarantees to TD learning with nonlinear function approximation. More precisely, we consider the expected dynamics of the TD(0) algorithm. We prove that this ODE is attracted to a compact set for smooth homogeneous functions, including some ReLU networks. For overparametrized and well-conditioned functions in sufficiently reversible environments, we prove convergence to the global optimum. This result improves when using k-step or λ-returns. Finally, we generalize a divergent counterexample to a family of divergent problems to motivate the assumptions needed to prove convergence.
05/29/2019 ∙ by David Brandfonbrener, et al.
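To make "expected dynamics" concrete: on a known MDP one can integrate the semi-gradient TD(0) vector field, θ̇ = Σ_s d(s) δ(s) ∇_θ V_θ(s), with a forward-Euler step. The MDP below, the small network, and the uniform state distribution are assumptions of this sketch.

```python
import torch

n, gamma, h = 5, 0.9, 0.01
P = torch.softmax(torch.randn(n, n), dim=1)     # transition matrix (rows sum to 1)
r = torch.randn(n)                              # expected reward per state
phi = torch.eye(n)                              # one-hot state features
V_net = torch.nn.Sequential(torch.nn.Linear(n, 16), torch.nn.ReLU(),
                            torch.nn.Linear(16, 1))
d = torch.full((n,), 1.0 / n)                   # state distribution (uniform here)

for step in range(5000):
    V = V_net(phi).squeeze(-1)                  # V(s) for all states at once
    delta = (r + gamma * (P @ V) - V).detach()  # expected TD errors (semi-gradient)
    surrogate = -(d * delta * V).sum()          # gradient is -sum_s d(s) delta(s) grad V(s)
    V_net.zero_grad(); surrogate.backward()
    with torch.no_grad():
        for p in V_net.parameters():
            p -= h * p.grad                     # forward-Euler step on the expected ODE
```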

On the Expressive Power of Deep Polynomial Neural Networks
We study deep neural networks with polynomial activations, particularly their expressive power. For a fixed architecture and activation degree, a polynomial neural network defines an algebraic map from weights to polynomials. The image of this map is the functional space associated with the network, and it is an irreducible algebraic variety upon taking closure. This paper proposes the dimension of this variety as a precise measure of the expressive power of polynomial neural networks. We obtain several theoretical results regarding this dimension as a function of architecture, including an exact formula for high activation degrees, as well as upper and lower bounds on layer widths in order for deep polynomial networks to fill the ambient functional space. We also present computational evidence that it is profitable in terms of expressiveness for layer widths to increase monotonically and then decrease monotonically. Finally, we link our study to favorable optimization properties when training weights, and we draw intriguing connections with tensor and polynomial decompositions.
05/29/2019 ∙ by Joe Kileel, et al.
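The proposed measure admits a simple numerical estimate: at generic weights, the rank of the Jacobian of the map from weights to the network's values at sufficiently many generic inputs equals the dimension of the functional variety. A sketch for a tiny one-hidden-layer network with squaring activations (the architecture here is chosen for illustration):

```python
import torch

d_in, width, n_pts = 3, 4, 200
X = torch.randn(n_pts, d_in)                # generic evaluation points

def net(params):
    # f(x) = w2 . (W1 x)^2: a polynomial network of activation degree 2
    W1 = params[: width * d_in].reshape(width, d_in)
    w2 = params[width * d_in:]
    return ((X @ W1.t()) ** 2) @ w2         # values of f at the points in X

params = torch.randn(width * d_in + width)  # generic weights
J = torch.autograd.functional.jacobian(net, params)
s = torch.linalg.svdvals(J)
dim = int((s > 1e-6 * s[0]).sum())          # numerical rank = estimated dimension
print("estimated dimension of the functional variety:", dim)
```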

Global convergence of neuron birth-death dynamics
Neural networks with a large number of parameters admit a mean-field description, which has recently served as a theoretical explanation for the favorable training properties of "overparameterized" models. In this regime, gradient descent obeys a deterministic partial differential equation (PDE) that converges to a globally optimal solution for networks with a single hidden layer under appropriate assumptions. In this work, we propose a nonlocal mass transport dynamics that leads to a modified PDE with the same minimizer. We implement this nonlocal dynamics as a stochastic neuronal birth-death process and we prove that it accelerates the rate of convergence in the mean-field limit. We subsequently realize this PDE with two classes of numerical schemes that converge to the mean-field equation, each of which can easily be implemented for neural networks with finite numbers of parameters. We illustrate our algorithms with two models to provide intuition for the mechanism through which convergence is accelerated.
02/05/2019 ∙ by Grant Rotskoff, et al.
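The abstract leaves the jump process unspecified, so the following is only a schematic of what a neuronal birth-death step could look like in code; the fitness signal and the population-preserving bookkeeping are both assumptions of this sketch, not the paper's construction.

```python
import torch

def birth_death_step(w, c, fitness, dt=0.1):
    """One schematic birth-death resampling step for a mean-field network
    f(x) = mean_i c[i] * relu(w[i] @ x). Neurons with positive fitness are
    occasionally killed and overwritten by clones of random survivors;
    neurons with negative fitness are occasionally cloned onto random slots.
    The fitness signal (e.g. a per-neuron estimate of its effect on the
    loss) is an assumption of this sketch."""
    jump = torch.rand_like(fitness) < (fitness.abs() * dt).clamp(max=1.0)
    for i in torch.nonzero(jump).flatten().tolist():
        j = torch.randint(len(c), (1,)).item()    # uniformly chosen partner slot
        if fitness[i] > 0:                        # death: replace neuron i by a clone of j
            w[i], c[i] = w[j].clone(), c[j].clone()
        else:                                     # birth: clone neuron i onto slot j
            w[j], c[j] = w[i].clone(), c[i].clone()
    return w, c
```

Interleaving such steps with ordinary gradient descent on (w, c) gives the flavor of the accelerated dynamics the paper analyzes.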

Stability of Graph Scattering Transforms
Scattering transforms are non-trainable deep convolutional architectures that exploit the multiscale resolution of a wavelet filter bank to obtain an appropriate representation of data. More importantly, they are provably invariant to translations, and stable to perturbations that are close to translations. This stability property endows the scattering transform with robustness to small changes in the metric domain of the data. When considering network data, regular convolutions no longer apply, since the data domain presents an irregular structure given by the network topology. In this work, we extend scattering transforms to network data by using multiresolution graph wavelets, whose computation can be obtained by means of graph convolutions. Furthermore, we prove that the resulting graph scattering transforms are stable to metric perturbations of the underlying network. This renders graph scattering transforms robust to changes in the network topology, making them particularly useful for cases of transfer learning, topology estimation, or time-varying graphs.
06/11/2019 ∙ by Fernando Gama, et al.
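A compact numerical sketch of a graph scattering transform built from diffusion wavelets; this particular wavelet construction (dyadic differences of powers of a lazy diffusion operator) is one standard choice, assumed here rather than taken from the paper.

```python
import numpy as np

def graph_scattering(A, x, J=3, L=2):
    """Scattering coefficients of signal x on a graph with adjacency A.
    Assumes no zero-degree nodes."""
    n = len(A)
    deg = A.sum(axis=1)
    T = 0.5 * (np.eye(n) + A / np.sqrt(np.outer(deg, deg)))     # lazy diffusion operator
    P = [np.linalg.matrix_power(T, 2 ** j) for j in range(J)]   # dyadic powers T^(2^j)
    wavelets = [np.eye(n) - P[0]] + [P[j - 1] - P[j] for j in range(1, J)]  # band-pass filters
    feats, layer = [x.mean()], [x]
    for _ in range(L):                                          # cascade of depth L
        layer = [np.abs(Psi @ u) for u in layer for Psi in wavelets]
        feats += [u.mean() for u in layer]                      # low-pass (average) readout
    return np.array(feats)

A = (np.random.rand(20, 20) < 0.5).astype(float)
A = np.triu(A, 1); A = A + A.T                                  # random symmetric adjacency
print(graph_scattering(A, np.random.randn(20)).shape)           # (1 + 3 + 9,) = (13,)
```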

Gradient Dynamics of Shallow Univariate ReLU Networks
We present a theoretical and empirical study of the gradient dynamics of overparameterized shallow ReLU networks with one-dimensional input, solving least-squares interpolation. We show that the gradient dynamics of such networks are determined by the gradient flow in a non-redundant parameterization of the network function. We examine the principal qualitative features of this gradient flow. In particular, we determine conditions for two learning regimes: kernel and adaptive, which depend both on the relative magnitude of initialization of weights in different layers and on the asymptotic behavior of initialization coefficients in the limit of large network widths. We show that learning in the kernel regime yields smooth interpolants, minimizing curvature, and reduces to cubic splines for uniform initializations. Learning in the adaptive regime favors instead linear splines, where knots cluster adaptively at the sample points.
06/18/2019 ∙ by Francis Williams, et al.
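A small experiment makes the two regimes tangible: train the same overparameterized 1D ReLU network with a large and a small output-layer initialization scale and compare the learned interpolants. The parameterization below is an illustration, not the paper's non-redundant one, and the mapping of scales to regimes is an assumption of this sketch.

```python
import torch

def fit(out_scale, width=512, steps=20000, lr=1e-2):
    """Fit y = sin(3x) at 8 points with a shallow ReLU net; out_scale sets
    the output-layer initialization magnitude."""
    x = torch.linspace(-1, 1, 8).unsqueeze(1)
    y = torch.sin(3 * x).squeeze(1)
    a = torch.randn(width, 1).requires_grad_()             # input weights
    b = torch.randn(width).requires_grad_()                # biases
    c = (out_scale * torch.randn(width)).requires_grad_()  # output weights
    opt = torch.optim.SGD([a, b, c], lr=lr)
    for _ in range(steps):
        pred = torch.relu(x @ a.t() + b) @ c / width
        loss = ((pred - y) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return a, b, c

# Large vs. small output-layer scale probes the two regimes: in this sketch's
# parameterization, a large scale tends toward lazy/kernel behavior (smooth,
# spline-like interpolants), a small one toward the adaptive regime
# (piecewise-linear fits whose kinks cluster at the sample points).
kernel_like = fit(out_scale=10.0)
adaptive_like = fit(out_scale=0.01)
```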

Backplay: "Man muss immer umkehren"
A longstanding problem in model-free reinforcement learning (RL) is that it requires a large number of trials to learn a good policy, especially in environments with sparse rewards. We explore a method to increase the sample efficiency of RL when we have access to demonstrations. Our approach, which we call Backplay, uses a single demonstration to construct a curriculum for a given task. Rather than starting each training episode in the environment's fixed initial state, we start the agent near the end of the demonstration and move the starting point backwards during the course of training until we reach the initial state. We perform experiments in a competitive four-player game (Pommerman) and a pathfinding maze game. We find that this weak form of guidance provides significant gains in sample complexity, with a stark advantage in sparse-reward environments. In some cases, standard RL did not yield any improvement, while Backplay reached success rates greater than 50% and generalized to unseen initial conditions in the same amount of training time. Additionally, we see that agents trained via Backplay can learn policies superior to those of the original demonstration.
07/18/2018 ∙ by Cinjon Resnick, et al.
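The core of the method fits in a few lines: maintain a curriculum over start states taken from the single demonstration, beginning near its end and annealing the start point backwards over training. The linear schedule and the sampling window below are assumptions of this sketch; the paper tunes its own schedule.

```python
import random

def backplay_start(demo_states, step, total_steps, window=10):
    """Pick a start state for the next episode from a single demonstration.
    Early in training we start near the demo's end; the start point is
    annealed linearly backwards until episodes begin at the true initial
    state."""
    frac = min(step / total_steps, 1.0)             # training progress in [0, 1]
    pivot = int((1.0 - frac) * (len(demo_states) - 1))
    lo = max(0, pivot - window)
    return demo_states[random.randint(lo, pivot)]

# Usage: reset the environment to this state instead of the fixed initial one.
demo = list(range(100))                             # stand-in for recorded demo states
print([backplay_start(demo, s, 1000) for s in (0, 500, 1000)])
```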