# Empirical Risk Minimization and Stochastic Gradient Descent for Relational Data

Empirical risk minimization is the principal tool for prediction problems, but its extension to relational data remains unsolved. We solve this problem using recent advances in graph sampling theory. We (i) define an empirical risk for relational data and (ii) obtain stochastic gradients for this risk that are automatically unbiased. The key ingredient is to consider the method by which data is sampled from a graph as an explicit component of model design. Theoretical results establish that the choice of sampling scheme is critical. By integrating fast implementations of graph sampling schemes with standard automatic differentiation tools, we are able to solve the risk minimization in a plug-and-play fashion even on large datasets. We demonstrate empirically that relational ERM models achieve state-of-the-art results on semi-supervised node classification tasks. The experiments also confirm the importance of the choice of sampling scheme.


## 1 Introduction

Relational data is data that can be represented as a graph, possibly annotated with additional information. An example is the link graph of a social network, annotated by user profiles. We consider prediction problems for such data. For example, how can we predict the preferences of a user of a social network using both the preferences and profiles of other users and the network itself? In the classical case of i.i.d. sequence data—where the observed data does not include link structure—the data decomposes into individual examples. Prediction methods for such data typically rely on this decomposition, e.g., predicting a user's preferences from only the profile of that user, ignoring the network structure. Relational data, however, does not decompose: because of the link structure, a social network cannot be decomposed into individual users. Accordingly, classical methods do not generally apply to relational data, and new methods cannot be developed with the same ease as for i.i.d. sequence data.

With i.i.d. sequence data, prediction problems are typically solved with models fit by empirical risk minimization (ERM) [Vapnik:1992, Vapnik:1995, Shalev-Shwartz:Ben-David:2014]. We give an (unusual) presentation of ERM that anticipates the relational case. The observed data is a set $\bar{S}_n = \{\bar{x}_1, \dots, \bar{x}_n\}$ that decomposes into examples $\bar{x}_i$. The task is to choose a predictor $\pi_\theta$ that completes each example by estimating missing information, e.g., a class label. An ERM model is defined by two parts: (i) a hypothesis class $\{\pi_\theta\}_{\theta \in \Theta}$ from which $\pi_\theta$ is chosen, and (ii) a loss function $L$, where $L(\bar{x}; \theta)$ measures the reconstruction error of predictor $\pi_\theta$ on example $\bar{x}$. The empirical risk is the expected loss on an example randomly selected from the dataset:

$$\hat{R}(\theta, \bar{S}_n) \coloneqq \mathbb{E}_{\bar{X} \sim \mathbb{F}(\bar{S}_n)}\big[ L(\bar{X}; \theta) \mid \bar{S}_n \big], \tag{1}$$

where $\mathbb{F}(\bar{S}_n)$ is the empirical distribution. (The empirical risk is more often equivalently written as $\frac{1}{n} \sum_{i=1}^n L(\bar{x}_i; \theta)$.) The ERM dogma is to select the predictor given by $\hat{\theta}_n \coloneqq \operatorname{argmin}_\theta \hat{R}(\theta, \bar{S}_n)$. That is, the objective function that defines learning is the empirical risk.

ERM has two useful properties. (1) It provides a principled framework for defining new machine learning methods. In particular, when examples are generated i.i.d., model-agnostic results guarantee that ERM models cohere as more data is collected (e.g., in the sense of statistical convergence) [Shalev-Shwartz:Ben-David:2014]. (2) For differentiable models, mini-batch stochastic gradient descent (SGD) can efficiently solve the minimization problem (albeit, approximately). The ease of SGD comes from the definition of the empirical risk as the expectation over a randomly subsampled example: the gradient of the loss on a randomly subsampled example is an unbiased estimate of the gradient of the empirical risk. Combined with automatic differentiation, this provides a turnkey approach to fitting machine-learning models.
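To make property (2) concrete, here is a minimal numpy sketch (the toy least-squares setup and all names are our own illustration, not from the paper): the gradient of the loss on a uniformly sampled example is an unbiased estimate of the empirical-risk gradient, so averaging many stochastic gradients recovers the full gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares dataset: n i.i.d. examples.
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def example_grad(theta, i):
    """Gradient of the squared loss L(x_i; theta) = (x_i @ theta - y_i)^2 / 2."""
    return (X[i] @ theta - y[i]) * X[i]

def risk_grad(theta):
    """Gradient of the empirical risk: the average of per-example gradients."""
    return np.mean([example_grad(theta, i) for i in range(len(y))], axis=0)

theta = np.zeros(3)
# Average many stochastic gradients, each from one uniformly sampled example.
avg_stochastic = np.mean(
    [example_grad(theta, rng.integers(len(y))) for _ in range(20000)], axis=0
)
full = risk_grad(theta)
```

The two quantities agree up to Monte Carlo noise, which is exactly why SGD needs no per-model correction in the i.i.d. setting.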

Returning to relational data, the observed data is now a graph $\bar{G}_n$ of size $n$ (e.g., the number of vertices or edges). The graph is possibly annotated, e.g., by vertex labels. We further consider $\bar{G}_n$ as an incomplete version of a graph $G_n$. For example, $\bar{G}_n$ may censor labels of the vertices or some of the edges from $G_n$. In relational learning, the task is to find a predictor $\pi_\theta$ that completes $\bar{G}_n$ by estimating the missing information. Typically, $\pi_\theta$ is chosen from a parameterized family to minimize an objective function $\mathcal{L}(\theta; \bar{G}_n)$. Unlike the empirical risk, the objective is not built from a loss on individual examples; $\mathcal{L}$ must be specified for the entire observed graph.

In relational learning, there is not yet a framework that has properties (1) and (2) of ERM. The challenge is that relational data does not decompose into individual examples. Regarding (1), theory is elusive because the i.i.d. sequence assumption is meaningless for relational data. This makes it difficult to reason about what happens as more data is collected. Regarding (2), mini-batch SGD is not generally applicable even for differentiable models. SGD requires unbiased estimates of the full gradient. For a random subgraph of $\bar{G}_n$, the stochastic gradient is not generally unbiased. In particular, the bias depends on the choice of random sampling scheme used to select the subgraph. Circumventing these two issues requires either careful design of the objective function used for learning [e.g., Perozzi:Al-Rfou:Skiena:2014, Grover:Leskovec:2016, Chamberlain:Clough:Deisenroth:2017, Yang:Cohen:Salakhudinov:2016, Hamilton:Ying:Leskovec:2017:inductive], or model-specific derivation and analysis. For example, graph convolutional networks [Kipf:Welling:2016, Kipf:Welling:2017, Schlichtkrull:Kipf:Bloem:vandenBerg:Titov:Welling:2018, vandenBerg:Kipf:Welling:2017] use full-batch gradients, and scaling training requires custom derivation of stochastic gradients [Chen:Zhu:Song:2018].

This paper introduces relational ERM, a generalization of ERM to relational data. Relational ERM provides a recipe for machine learning with relational data that preserves the two important properties of ERM:

1. It provides a simple way to define (task-specific) relational learning methods, and

2. For differentiable models, relational ERM minimization can be efficiently solved in a turnkey fashion by mini-batch stochastic gradient descent.

Relational ERM mitigates the need for model-specific analysis and fitting procedures.

Extending turnkey mini-batch SGD to relational data allows the easy use of autodiff-based machine-learning frameworks for relational learning. To facilitate this, we provide fast implementations of a number of graph subsampling algorithms, and integration with TensorFlow.

In Section 2 we define relational ERM models and show how to automatically calculate unbiased mini-batch stochastic gradients. In Section 3 we explain connections to previous work on machine learning for graph data, and we illustrate how to develop task-specific relational ERM models. In Section 4 we review several randomized algorithms for subsampling graphs; relational ERM models require the specification of such an algorithm. In Section 5 we establish theory for relational ERM models. The main insights are: (i) the i.i.d. assumption can be replaced by an assumption on how the data is collected [Orbanz:2017, Veitch:Roy:2016, Borgs:Chayes:Cohn:Veitch:2017, Crane:Dempsey:2016:snm], and (ii) the choice of randomized sampling algorithm must be viewed as a model component. In Section 6, we study relational ERM empirically by implementing the models of Section 3. We observe that the turnkey mini-batch SGD procedure succeeds in efficiently fitting the models, and that the choice of graph subsampling algorithm has a large effect in practice.

## 2 Relational ERM and SGD

Our aim is to define relational ERM in analogy with classical ERM. The fundamental challenge is that relational data does not decompose into individual examples. Classical ERM uses the empirical distribution to define the objective function Eq. 1. There is no canonical analogue of the empirical distribution for relational data.

The first insight is that the empirical distribution may be viewed as a randomized algorithm for subsampling the dataset. The required analogue is then a randomized algorithm for subsampling a graph. In the i.i.d. setting, uniform subsampling is almost always used. However, there are many possible ways to sample from a graph. We review a number of possibilities in Section 4. For example, the sampling algorithm might draw the subgraph induced by $k$ vertices chosen at random, or the subgraph induced by a random walk of length $k$. The challenge is that there is no a priori criterion for deciding which sampling algorithm is "best."

Our approach is to give up and declare victory: we define the required analogue as a component of model design. We require the analyst to choose a randomized sampling algorithm $\mathrm{Sample}(\bar{G}_n, k)$, where $\bar{G}_k = \mathrm{Sample}(\bar{G}_n, k)$ is a random subgraph of size $k$. The choice of $\mathrm{Sample}$ defines a notion of "example." This allows us to complete the analogy to classical ERM.

A relational ERM model is defined by three ingredients:

1. A sampling routine $\mathrm{Sample}(\cdot, k)$.

2. A predictor class $\{\pi_\theta\}$ with parameter $\theta \in \Theta$.

3. A loss function $L$, where $L(\bar{G}_k; \theta)$ measures the reconstruction quality of $\pi_\theta$ on example $\bar{G}_k$.

The objective function is defined in analogy with the empirical risk Eq. 1. The relational empirical risk is:

$$\hat{R}_k(\theta, \bar{G}_n) \coloneqq \mathbb{E}_{\bar{G}_k = \mathrm{Sample}(\bar{G}_n, k)}\big[ L(\bar{G}_k; \theta) \mid \bar{G}_n \big]. \tag{2}$$

Relational empirical risk minimization selects a predictor that minimizes the relational empirical risk,

$$\hat{\pi} \coloneqq \pi_{\hat{\theta}_n} \quad \text{where} \quad \hat{\theta}_n \coloneqq \operatorname*{argmin}_\theta \hat{R}_k(\theta, \bar{G}_n). \tag{3}$$

### Stochastic gradient descent

A crucial property of relational ERM is that SGD can be applied to solve the minimization problem Eq. 3 without any model-specific analysis. Define a stochastic gradient as $\nabla_\theta L(\mathrm{Sample}(\bar{G}_n, k); \theta)$, the gradient of the loss computed on a sample of size $k$ drawn with $\mathrm{Sample}$. Observe that

$$\nabla_\theta \hat{R}_k(\theta, \bar{G}_n) = \nabla_\theta \mathbb{E}\big[ L(\mathrm{Sample}(\bar{G}_n, k); \theta) \mid \bar{G}_n \big] = \mathbb{E}\big[ \nabla_\theta L(\mathrm{Sample}(\bar{G}_n, k); \theta) \mid \bar{G}_n \big].$$

That is, the random gradient is an unbiased estimator of the gradient of the full relational empirical risk. If $\mathrm{Sample}$ is computationally efficient, then SGD with this stochastic estimator can solve the relational ERM problem.
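Because the stochastic gradient is unbiased by construction, the training loop itself is generic. A minimal Python sketch (the toy "graph", loss, and all names are hypothetical illustrations, not the paper's API):

```python
import random

def sgd(graph, sample, loss_grad, theta, steps, lr):
    """Mini-batch SGD for relational ERM: each step draws one 'example'
    with the user-chosen Sample routine; the gradient of the loss on that
    sample is an unbiased estimate of the relational empirical risk gradient."""
    for _ in range(steps):
        subgraph = sample(graph, 2)          # Sample(G_n, k) with k = 2
        theta = theta - lr * loss_grad(subgraph, theta)
    return theta

# Toy instance: the "graph" is a list of weighted edges, Sample draws k
# edges uniformly, and the per-edge loss (theta - w)^2 is minimized at
# the mean edge weight.
def uniform_edge_sample(edges, k):
    return random.sample(edges, k)

def loss_grad(subgraph, theta):
    return sum(2.0 * (theta - w) for (_, _, w) in subgraph)

random.seed(0)
edges = [(i, i + 1, float(i % 3)) for i in range(30)]   # mean weight is 1.0
theta_hat = sgd(edges, uniform_edge_sample, loss_grad,
                theta=0.0, steps=3000, lr=0.005)
```

The point of the sketch is the separation of concerns: `sample` is swappable, and the loop never changes.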

To specify a relational ERM model in practice, the practitioner implements the three ingredients in code. Machine-learning frameworks provide tools that make it easy to specify a class of predictors and a per-example loss function, the ingredients of classical ERM. Relational ERM additionally requires implementing $\mathrm{Sample}$ and integrating it with a machine-learning framework. In practice, $\mathrm{Sample}$ can be chosen from a standard library of sampling routines. To that end, we provide efficient implementations of a number of routines and integration with an automatic differentiation framework (TensorFlow). This gives an effective "plug-and-play" approach for defining and fitting models.

## 3 Example Models

We consider several examples of relational ERM models. We split the parameter into a pair $\theta = (\gamma, \lambda)$: the global parameters $\gamma$ are shared across the entire graph, and the embedding parameters $\lambda = \{\lambda_v\}$ provide a low-dimensional embedding $\lambda_v$ for each vertex $v$. Informally, global parameters encode population properties—"people with different political affiliation are less likely to be friends"—and the embeddings encode per-vertex information—"Bob is a radical vegan."

#### Graph representation learning

Methods for learning embeddings of vertices are widely studied; see [Hamilton:Ying:Leskovec:2017:review] for a review. Many such methods rely on decomposing the graph into neighborhoods determined by (random) walks of fixed size. One example is Node2Vec [Grover:Leskovec:2016] (an extension of DeepWalk [Perozzi:Al-Rfou:Skiena:2014]). The basic approach is to draw a large collection of simple random walks, view each of these walks as a "sentence" in which each vertex is a "word", and learn vertex embeddings by applying a standard word embedding method [Mikolov:Chen:Corrado:Dean:2013, Mikolov:Sutskever:Chen:Corrado:2013]. To use mini-batch SGD, the objective function is restricted to a uniform sum over all walks. Unbiased stochastic gradients can then be computed by uniformly sampling walks.

Relational ERM models include graph representation models of this kind. For example, Node2Vec [Grover:Leskovec:2016] is equivalent to a relational ERM model that (i) predicts graph structure using a predictor parameterized only by embedding vectors, (ii) uses a cross-entropy loss on graph structure, and (iii) takes $\mathrm{Sample}$ as a random walk of fixed length (augmented with randomly sampled negative examples).

A number of other relational learning methods also enable SGD by restricting the objective function to a uniform sum over fixed-size subgraphs [e.g., Grover:Leskovec:2016, Chamberlain:Clough:Deisenroth:2017, Yang:Cohen:Salakhudinov:2016, Hamilton:Ying:Leskovec:2017:inductive]. Any such model is equivalent to a relational ERM model that takes $\mathrm{Sample}$ as the uniform distribution over fixed-size subgraphs. But, in general, relational ERM does not require restricting to sampling schemes of this kind. Note that "negative-sampling" algorithms—which are critical in practice—do not uniformly sample fixed-size subgraphs.

The next examples illustrate relational ERM for problems that are difficult with existing approaches to graph representation learning.

#### Semi-supervised node classification

Consider a network where each node is labeled by binary features—for example, hyperlinked documents labeled by subjects, or interacting proteins labeled by function. The task is to predict the labels of a subset of these nodes using the graph structure and the labels of the remaining nodes.

The model has the following form: Each vertex $i$ is assigned a $d$-dimensional embedding vector $\lambda_i$. Labels are predicted using a parameterized function $f(\cdot; \gamma)$ that maps the node embeddings to the probability of each label. The presence or absence of edge $(i, j)$ is predicted based on $\lambda_i^T \lambda_j$. This enables learning embeddings for unlabeled vertices. Let $\sigma$ denote the sigmoid function; let $l_{ij} \in \{0, 1\}$ denote whether vertex $i$ has label $j$; and let $q \in (0, 1)$ weight the two loss terms. The loss on subgraphs is:

$$\begin{aligned} L(\bar{G}_k; \lambda, \gamma, l) = \; & -q \Big( \sum_{i \in v(\bar{G}_k)} \sum_{j=1}^{L} l_{ij} \log f(\lambda_i; \gamma)_j + (1 - l_{ij}) \log\big(1 - f(\lambda_i; \gamma)_j\big) \Big) \\ & + (1 - q) \Big( -\sum_{(i,j) \in e(\bar{G}_k)} \log \sigma(\lambda_j^T \lambda_i) - \sum_{(i,j) \in \bar{e}(\bar{G}_k)} \log\big(1 - \sigma(\lambda_j^T \lambda_i)\big) \Big). \end{aligned} \tag{4}$$

Here, $v(\bar{G}_k)$, $e(\bar{G}_k)$, and $\bar{e}(\bar{G}_k)$ denote the vertices, edges, and non-edges of the subgraph respectively. The loss on the edge terms is cross-entropy, a standard choice in embedding models [Hamilton:Ying:Leskovec:2017:review]. Intuitively, the predictor uses the embeddings to predict both the vertex labels and the subgraph structure.

The model is completed by choosing a sampling scheme $\mathrm{Sample}$. Relational ERM then fits the parameters as

$$(\hat{\lambda}_n, \hat{\gamma}_n) = \operatorname*{argmin}_{\lambda, \gamma} \mathbb{E}\big[ L(\mathrm{Sample}(\bar{G}_n, k); \lambda, \gamma, l) \mid \bar{G}_n \big].$$

We can vary the choice of $\mathrm{Sample}$ independently of optimization concerns; in Section 6 we observe that this leads to improved predictive performance.
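For concreteness, the loss in Eq. 4 can be sketched in plain numpy as follows (dense arrays and hypothetical names; a real implementation would express this in an autodiff framework such as TensorFlow so gradients come for free):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def semi_supervised_loss(lam, labels, probs, edges, non_edges, q, eps=1e-7):
    """Sketch of the loss in Eq. 4 on one sampled subgraph.

    lam:       (num_vertices, d) embeddings of the subgraph's vertices
    labels:    (num_vertices, L) binary label indicators l_ij
    probs:     (num_vertices, L) label probabilities f(lam_i; gamma)_j
    edges, non_edges: lists of (i, j) index pairs in the subgraph
    q:         weight trading off label loss against structure loss
    """
    p = np.clip(probs, eps, 1 - eps)   # bound away from 0/1 (cf. Section 5)
    label_loss = -np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p))

    def structure_loss(pairs, present):
        scores = sigmoid(np.array([lam[i] @ lam[j] for i, j in pairs]))
        scores = np.clip(scores, eps, 1 - eps)
        return -np.sum(np.log(scores if present else 1 - scores))

    return q * label_loss + (1 - q) * (
        structure_loss(edges, True) + structure_loss(non_edges, False)
    )
```

Setting `q = 0` recovers the purely structural loss used for the two-stage pre-training in Section 6.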

Older embedding approaches use a two-stage procedure: node embeddings are first pre-trained using the graph structure, and then used as inputs to a logistic regression that predicts the labels [e.g., Perozzi:Al-Rfou:Skiena:2014, Grover:Leskovec:2016]. [Yang:Cohen:Salakhudinov:2016] adapt a random-walk based method to allow simultaneous training; their approach requires extensive development, including a custom (two-stage) variant of SGD. Relational ERM allows simultaneous learning with no need for model-specific derivation.

#### Wikipedia category embeddings

We consider Wikipedia articles joined by hyperlinks. Each article is tagged as a member of one or more categories—for example, “Muscles_of_the_head_and_neck”, “Japanese_rock_music_groups”, or “People_from_Worcester.” The task is to learn embeddings that encode semantic relationships between the categories.

Let $\bar{G}_n$ denote the hyperlink graph and let $C_i$ denote the categories of article $i$. Each category $c$ is assigned an embedding $\gamma_c$, and the embedding of each article (vertex) is taken to be the sum of the embeddings of its categories, $\lambda_i = \sum_{c \in C_i} \gamma_c$. The loss is

$$L(\bar{G}_k, C; \lambda) = -\sum_{(i,j) \in e(\bar{G}_k)} \log \sigma(\lambda_j^T \lambda_i) - \sum_{(i,j) \in \bar{e}(\bar{G}_k)} \log\big(1 - \sigma(\lambda_j^T \lambda_i)\big), \tag{5}$$

where $e(\bar{G}_k)$ and $\bar{e}(\bar{G}_k)$ denote, respectively, the presence and absence of hyperlinks between articles. Intuitively, the predictor uses the category embeddings to predict the hyperlink structure of subgraphs. Relational ERM chooses the embeddings as

$$\hat{\gamma}_n = \operatorname*{argmin}_{\gamma} \mathbb{E}\big[ L(\mathrm{Sample}(\bar{G}_n, k), C; \lambda(\gamma)) \mid \bar{G}_n \big].$$

We write $\lambda(\gamma)$ to emphasize that the article embeddings are a function of the category embeddings. Category embeddings obtained with this model are illustrated in Fig. 1; see Section 6 for details on the experiment.

The point of this example is: relational ERM makes it easy to implement this non-standard relational learning model and fit it with mini-batch SGD. The use of mini-batch SGD is important because the data graph is large.
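The only nonstandard piece is the parameter tying, which is trivial to express in code. A hypothetical sketch (the category names and dimensions are our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Category embeddings are the trainable parameters; article (vertex)
# embeddings are derived by summing over each article's categories, so a
# gradient on an edge term flows into every category of both endpoints.
gamma = {c: rng.normal(size=4) for c in ["muscles", "rock_bands", "worcester"]}
article_categories = {
    "biceps_article": ["muscles"],
    "band_article": ["rock_bands", "worcester"],
}

def article_embedding(article):
    """lambda_i = sum of gamma_c over the article's categories."""
    return sum(gamma[c] for c in article_categories[article])

lam = article_embedding("band_article")
```

In an autodiff framework this tying is one gather-and-sum, after which the generic SGD recipe of Section 2 applies unchanged.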

#### Statistical relational learning

Statistical relational learning takes the graph to encode the dependency structure between the units [e.g., Neville:Jensen:2007, Getoor:Taskar:2007]. The idea is to infer a joint probability distribution over the entire dataset, respecting the dependency structure. The distribution can then be used to make graph-aware predictions. There is also work on adapting SGD to this setting [Yang:Ribeiro:Neville:2017]. Despite the similar goals, relational ERM does not attempt to infer a distribution; the precise relationship with statistical relational learning is not clear.

## 4 Subsampling algorithms

In classical ERM, sampling uniformly (with or without replacement) is typically the only choice. In contrast, there are many ways to sample from a graph. Each such sampling algorithm leads to a different notion of empirical risk in (2).

As described above, random walks underlie graph representation methods built in analogy with language models. A simple random walk of length $k$ on a graph $\bar{G}_n$ selects vertices $v_1, \dots, v_k$ by starting at a given vertex $v_1$, and drawing each vertex $v_{i+1}$ uniformly from the neighbors of $v_i$. Typically, random-walk based methods augment the sample by hallucinating additional edges using a strategy borrowed from the Skipgram model [Mikolov:Chen:Corrado:Dean:2013]:

###### Algorithm 1 (Random walk: Skipgram [Perozzi:Al-Rfou:Skiena:2014]).
1. Sample a random walk $(v_1, \dots, v_k)$ starting at a uniformly selected vertex of $\bar{G}_n$.

2. Report $\bar{G}_k = \{(v_i, v_j) : |i - j| \le w\}$. The window $w$ is a sampler parameter, and $|i - j|$ is the number of steps between $v_i$ and $v_j$.
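A sketch of Algorithm 1 in Python (the adjacency-dict representation and the function name are our own; the paper's library ships optimized implementations):

```python
import random

def skipgram_walk_sample(adj, k, w, rng=random):
    """Algorithm 1 sketch: simple random walk plus Skipgram augmentation.

    adj: dict mapping each vertex to the list of its neighbors.
    Returns the pairs (v_i, v_j) of walk positions within w steps of each
    other; co-visited vertices are reported as (possibly hallucinated) edges.
    """
    walk = [rng.choice(sorted(adj))]             # uniform starting vertex
    while len(walk) < k:
        walk.append(rng.choice(adj[walk[-1]]))   # uniform neighbor step
    return [(walk[i], walk[j])
            for i in range(k) for j in range(i + 1, min(i + w + 1, k))]
```

Note the reported pairs need not be edges of $\bar{G}_n$; that is the point of the Skipgram augmentation.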

Since relational ERM is indifferent to the connection with language models, a natural alternative augmentation strategy is:

###### Algorithm 2 (Random walk: Induced).
1. Sample a random walk $(v_1, \dots, v_k)$ starting at a uniformly selected vertex of $\bar{G}_n$.

2. Report $\bar{G}_k$ as the edge list of the vertex-induced subgraph of the walk.

A simple choice is to sample $k$ vertices uniformly at random and report $\bar{G}_k$ as the induced subgraph. Such an algorithm will not work well in practice because it is not suitable for sparse graphs. We are typically interested in the case $k \ll n$. If $\bar{G}_n$ is sparse, then such a sample typically includes few or no edges, and thus carries little information about $\bar{G}_n$. The next algorithm modifies uniform vertex sampling to fix this pathology. The idea is to over-sample vertices and retain only those vertices that participate in at least one edge in the induced subgraph.

###### Algorithm 3 (p-sampling [Veitch:Roy:2016]).
1. Select each vertex in $\bar{G}_n$ independently, with a fixed probability $p$.

2. Extract the induced subgraph $\bar{G}_k$ of $\bar{G}_n$ on the selected vertices.

3. Delete all isolated vertices from $\bar{G}_k$, and report the resulting graph.
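A sketch of Algorithm 3 in Python (the edge-list representation and function name are our own illustration):

```python
import random

def p_sample(vertices, edges, p, rng=random):
    """Algorithm 3 sketch: keep each vertex independently with probability p,
    take the induced subgraph, then delete isolated vertices."""
    kept = {v for v in vertices if rng.random() < p}
    induced = [(u, v) for (u, v) in edges if u in kept and v in kept]
    non_isolated = {u for edge in induced for u in edge}
    return non_isolated, induced
```

Discarding the isolated vertices is what keeps the reported sample informative on sparse graphs.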

Another natural sampling scheme is:

###### Algorithm 4 (Uniform edge sampling).
1. Select $k$ edges of $\bar{G}_n$ uniformly and independently from the edge set.

2. Report the graph consisting of these edges, and all vertices incident to these edges.

Many other sampling schemes are possible; see [Leskovec:Faloutsos:2006] for a discussion of possible options in a related context.

### 4.1 Negative sampling

For a pair of vertices $(i, j)$ in an input graph $\bar{G}_n$, a sampling algorithm can report three types of edge information: the edge may be observed as present, observed as absent (a non-edge), or not observed at all. The algorithms above do not treat edge and non-edge information equally: Algorithms 1, 2 and 4 cannot report non-edges, and the deletion step in Algorithm 3 biases it towards edges over non-edges. However, the locations of non-edges can carry significant information.

Negative sampling schemes are "add-on" algorithms that are applied to the output of a graph sampling algorithm and augment it with non-edge information. Let $\bar{G}_k$ denote a sample generated by one of the algorithms above from an input graph $\bar{G}_n$.

###### Algorithm A (Negative sampling: Induced).
1. Report the subgraph induced by the vertices of $\bar{G}_k$ in the input graph $\bar{G}_n$ from which $\bar{G}_k$ was drawn.

Another method, originating in language modeling [Mikolov:Sutskever:Chen:Corrado:2013, Goldberg:Levy:2014], is based on the unigram distribution: Define a probability distribution on the vertex set of $\bar{G}_n$ by letting $\mathrm{Uni}(v)$ be the probability that vertex $v$ would occur in a separate, independent sample generated from $\bar{G}_n$ by the same algorithm as $\bar{G}_k$. For $\alpha > 0$, we define a distribution $\mathrm{Uni}^\alpha(v) \propto \mathrm{Uni}(v)^\alpha$, normalized appropriately.

###### Algorithm B (Negative sampling: Unigram).

For each vertex $i$ in $\bar{G}_k$:

1. Select $m$ vertices $j_1, \dots, j_m \overset{\text{iid}}{\sim} \mathrm{Uni}^\alpha$.

2. If $(i, j_l)$ is a non-edge in $\bar{G}_n$, add it to $\bar{G}_k$.

The canonical choice in the embeddings literature is $\alpha = 3/4$ [Mikolov:Sutskever:Chen:Corrado:2013].
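A sketch of Algorithm B in Python (the dictionary-based representation and the way the unigram distribution is supplied are our own simplifications):

```python
import random

def unigram_negative_sample(sample_vertices, uni, alpha, m, graph_edges, rng=random):
    """Algorithm B sketch: for each vertex i in the sample, draw m partners
    from the unigram distribution raised to the power alpha, and keep each
    pair (i, j) that is a non-edge in the input graph."""
    vertices = sorted(uni)
    weights = [uni[v] ** alpha for v in vertices]
    negatives = []
    for i in sample_vertices:
        for j in rng.choices(vertices, weights=weights, k=m):
            if i != j and (i, j) not in graph_edges and (j, i) not in graph_edges:
                negatives.append((i, j))
    return negatives
```

Raising the unigram distribution to a power $\alpha < 1$ flattens it, so low-frequency vertices are drawn more often than their raw occurrence rate would suggest.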

## 5 Theory

We now turn to formalizing and establishing theoretical properties of relational ERM. In particular, we show that (i) relational ERM satisfies basic theoretical desiderata, and (ii) $\mathrm{Sample}$ should be viewed as a model component. We first give the results, and then discuss their interpretation and significance.

When the data is unstructured (i.e., has no link structure), theoretical analysis of ERM relies on the assumption that the data is generated i.i.d. The i.i.d. assumption is ill-defined for relational data, so any analysis requires some analogous assumption about how the data is generated. Following recent work emphasizing the role of sampling theory in modeling graph data [Orbanz:2017, Veitch:Roy:2016, Borgs:Chayes:Cohn:Veitch:2017, Crane:Dempsey:2016:snm], we model $\bar{G}_n$ as a random sample drawn from some large population network. Specifically, we consider a population graph $G_N$ with $N$ edges, and assume that the observed sample $\bar{G}_n$ of size $n$ is generated by $p$-sampling from $G_N$ with an appropriate sampling probability $p$. We assume the population graph is "very large," in the sense that the sampling fraction goes to zero. The distribution of $\bar{G}_n$ in the "infinite population" case is well-defined [Borgs:Chayes:Cohn:Veitch:2017].

The analogy with i.i.d. data generation is two-fold: Foundationally, the i.i.d. assumption is equivalent to assuming the data is collected by uniform sampling from some population [Politis:Romano:Wolf:1999], and $p$-sampling is a direct analogue [Veitch:Roy:2016, Borgs:Chayes:Cohn:Veitch:2017, Orbanz:2017]. Pragmatically, both assumptions strike a balance between being flexible enough to capture real-world data [Caron:Fox:2017, Veitch:Roy:2015] and simple enough to allow precise theoretical statements.

We establish results for several choices of $\mathrm{Sample}$. Edges may be selected either by $p$-sampling with $p$ chosen so that the expected sample size is $k$ (note that the size of $\bar{G}_k$ is then free of $n$), or by using a simple random walk of length $k$ (Algorithm 1 or Algorithm 2). Negative examples may be chosen by Algorithm A or Algorithm B.

The main result guarantees that the limiting risk of the parameter we learn depends only on the population and the model, and not on idiosyncrasies of the training data.

###### Theorem 5.1.

Suppose that $\bar{G}_n$ is collected by $p$-sampling as described above, that the parameter value $\bar{\theta}$ is fixed, and that $\mathrm{Sample}$ is fixed to a sampling algorithm based on either $p$-sampling or random walk sampling as described above. Suppose further that the loss is bounded and the parameter setting $\bar{\theta}$ satisfies mild technical conditions given in the appendix. Then there is some constant $c_{\bar{\theta}}(\mathrm{Sample}, k)$ such that

$$\hat{R}_k(\bar{\theta}; \bar{G}_n) \to c_{\bar{\theta}}(\mathrm{Sample}, k) \tag{6}$$

both in probability and in $L^2$ as $n \to \infty$. Moreover, there is some constant $c^*(\mathrm{Sample}, k)$ such that

$$\min_\theta \hat{R}_k(\theta; \bar{G}_n) \to c^*(\mathrm{Sample}, k) \tag{7}$$

both in probability and in $L^2$, as $n \to \infty$.

The limits depend on the choice of $\mathrm{Sample}$ (and $k$), and usually do not agree between different sampling schemes.

The result is proved for $\mathrm{Sample}$ based on $p$-sampling in Appendix C and for random-walk based sampling in Appendix D.

Classical ERM guarantees usually apply even to the parameter itself, not just its risk. In the relational setting, the possibly complicated interplay of the learned embeddings makes such results more difficult. The next two results build on Theorem 5.1 to establish (partial) guarantees for the parameter itself.

We establish a convergence result for the global parameters output by a two-stage procedure where the embedding vectors are learned first. Such a result is applicable, for example, when predicting vertex attributes from embedding vectors that are pre-trained to explain graph structure. The proof is given in Appendix E.

###### Theorem 5.2.

Suppose the conditions of Theorem 5.1 hold, and also that the loss function satisfies a certain strong convexity property in the global parameters $\gamma$, given explicitly in the appendix. Let $\hat{\gamma}_n$ denote the global parameters learned with the embeddings held fixed at their pre-trained values. Then $\hat{\gamma}_n \to \gamma^*$ in probability for some constant $\gamma^*$.

We next establish a stability result showing that collecting additional data does not dramatically change learned embeddings. The proof is given in Appendix F.

###### Theorem 5.3.

Suppose the conditions of Theorem 5.1 hold, and also that the loss function is twice differentiable and the Hessian of the empirical risk is bounded. Let $\hat{\lambda}^{(n+1)}\vert_n$ denote the restriction of the embeddings learned from $\bar{G}_{n+1}$ to the vertices present in $\bar{G}_n$. Then $\hat{\lambda}^{(n+1)}\vert_n - \hat{\lambda}^{(n)} \to 0$ in probability, as $n \to \infty$.

The examples of Section 3 do not satisfy the conditions of the theorems because the cross-entropy loss is unbounded. However, the models can be trivially modified to bound the output probabilities away from 0 and 1, in which case the loss is bounded. Further, for the logistic regression model used in the experiments, the convexity and Hessian conditions also hold, by direct computation.

#### Interpretation and Significance

The properties we establish are minimal desiderata that one might demand of any sensible learning procedure. Nevertheless, such results have not been previously established for relational learning methods. The obstruction is the need for a suitable analogue of the i.i.d. assumption. The demonstration that population sampling can fill this role is itself a main contribution of the paper. Indeed, the results we establish are weaker than the analogous guarantees for classical ERM, and their main significance is perhaps the demonstration that such results can be established at all. This is important both as a foundational step towards a full theoretical analysis of relational learning, and because it strengthens the analogy with classical ERM.

A strength of our arguments is that they are largely agnostic to the particular choice of model, mitigating the need for model-specific analysis and justification. For example, our results include random-walk based graph representation methods as a special case, providing some post-hoc support for the use of such methods.

The limits in Theorems 5.1 and 5.2 depend on the choice of $\mathrm{Sample}$. Accordingly, the limiting risk and learned parameters depend on $\mathrm{Sample}$ in the same sense that they depend on the choice of predictor class and loss function; i.e., $\mathrm{Sample}$ is a model component. This underscores the need to consider the choice in model design, either through heuristics—e.g., random-walk sampling upweights the importance of high-degree vertices relative to $p$-sampling—or by trying several choices experimentally.

## 6 Experiments

The practical advantages of using relational ERM to define new, task-specific models are: (i) mini-batch SGD can be used in a plug-and-play fashion to solve the optimization problem, which allows inference to scale to large data; and (ii) varying $\mathrm{Sample}$ may improve model quality. We have used relational ERM to define novel models in Section 3. The models are determined by (4) and (5) up to the choice of $\mathrm{Sample}$. We now study these example models empirically. The main observations are: (i) SGD succeeds in quickly fitting the models in all cases, and (ii) the choice of $\mathrm{Sample}$ has a dramatic effect in practice. Additionally, we observe that the best model for the semi-supervised node classification task uses $p$-sampling. $p$-sampling has not previously been used in the embedding literature, and is very different from the random-walk based schemes that are commonly used.

### Node classification problems

We begin with the semi-supervised node classification task described in Section 3, using the model Eq. 4 with different choices of $\mathrm{Sample}$.

We study the blog catalog and protein-protein interaction data reported in [Grover:Leskovec:2016], summarized by the table to the right. We pre-process the data to remove self-edges, and restrict each network to its largest connected component. Each vertex in the graph is labeled, and a portion of the labels are censored at training time. The task is to predict these labels at test time.

##### Two-stage training.

We first train the model (4) using no label information (that is, with $q = 0$) to learn the embeddings. We then fit a logistic regression to predict vertex labels from the trained embeddings. This two-stage approach is a standard testing procedure in the graph embedding literature, e.g., [Perozzi:Al-Rfou:Skiena:2014, Grover:Leskovec:2016]. We use the same scoring procedure as Node2Vec [Grover:Leskovec:2016] (average macro F1 scores), and, where applicable, the same hyperparameters.

Table 1 shows the effect of varying the sampling scheme used to train the embeddings. As expected, we observe that the choice of sampling scheme affects the embeddings produced via the learning procedure, and thus also the outcome of the experiment. We further observe that sampling non-edges by unigram negative sampling gives better predictive performance relative to selecting non-edges from the vertex induced subgraph.

##### Simultaneous training.

Next, we fit the model of Section 3 with $q > 0$, training the embeddings and global variables simultaneously. Recall that simultaneous training is enabled by the use of relational ERM. We choose the label predictor $f$ to be logistic regression, and adapt the label prediction loss to measure the loss only on vertices in the positive sample.

There is not a unique procedure for creating a test set for relational data. We report test scores for test-sets drawn according to several different sampling schemes. Results are summarized by Table 2. We observe:

• Simultaneous training improves performance.

• $p$-sampling outperforms the standard rw/skipgram procedure.

• This persists irrespective of how the test set is selected (i.e., it is not an artifact of the data splitting procedure).

Note that the average computed with uniform vertex sampling is the standard scoring procedure used in the previous table. The last observation is somewhat surprising: we might have expected a mismatch between the training and testing objectives to degrade performance. One possible explanation is that the random-walk based sampler excessively downweights low-connectivity vertices, and thus fails to fully exploit their label information.

### Wikipedia Category Embeddings

We consider the task of discovering semantic relations between Wikipedia categories, as described in Section 3. This task is not standard; a wholly new model is required.

We define a relational ERM model by choosing the category embedding dimension, the loss function in (5), and, as the sampling scheme, the skipgram random-walk sampler with unigram negative sampling. The data is the Wikipedia hyperlink network from [Klymko:Gleich:Kolda:2014], consisting of Wikipedia articles from 2011-09-01 restricted to articles in categories containing at least 100 articles.

The challenge for this task is that the dataset is relatively large—about 1.8M nodes and 28M edges—and the model is unusual—embeddings are assigned to vertex attributes instead of the vertices themselves. SGD converges in about 90 minutes on a desktop computer equipped with a Nvidia Titan Xp GPU. Fig. 1 on page 1 visualizes example trained embeddings, which clearly succeed in capturing latent semantic structure.

## 7 Conclusion

Relational ERM is a generalization of ERM from i.i.d. data to relational data. The key ideas are introducing the sampling scheme as a component of model design, which defines an analogue of the empirical distribution, and using the assumption that the data is sampled from a population network as an analogue of the i.i.d. assumption. Relational ERM models can be fit automatically using SGD. Accordingly, relational ERM provides an easy method to specify and fit relational data models.

The results presented here suggest a number of directions for future inquiry. Foremost: what is the relational analogue of statistical learning theory? The theory derived in Section 5 establishes initial results. A more complete treatment may provide statistical guidelines for model development. Our results hinge critically on the assumption that the data is collected by p-sampling; it is natural to ask how other data-generating mechanisms can be accommodated. Similarly, it is natural to ask for guidelines for the choice of sampling scheme.

#### Acknowledgments

VV and PO were supported in part by grant FA9550-15-1-0074 of AFOSR. DB is supported by ONR N00014-15-1-2209, ONR 133691-5102004, NIH 5100481-5500001084, NSF CCF-1740833, the Alfred P. Sloan Foundation, the John Simon Guggenheim Foundation, Facebook, Amazon, and IBM. The Titan Xp used for this research was donated by the NVIDIA Corporation.

## Appendix A Overview of Proofs

The appendix is devoted to proving the theoretical results of the paper. These results are obtained subject to the assumption that the data is collected by p-sampling. This assumption is natural in the sense that it provides a reasonable middle ground between a realistic data collection assumption—p-sampling can result in complex models capturing many important graph phenomena [Caron:Fox:2017, Veitch:Roy:2015, Borgs:Chayes:Cohn:Holden:2016]—and mathematical tractability—we are able to establish precise guarantees.

The appendix is organized as follows. We begin by recalling the connection between p-sampling and graphex processes in Section B.1; this affords a useful explicit representation of the data generating process. In Section B.2, we recall the method of exchangeable pairs, a technical tool required for our convergence proofs. Next, in Section B.3, we collect the necessary notation and definitions. Empirical risk convergence results for p-sampling are then proved in Appendix C, and results for the random-walk sampler in Appendix D. Convergence results for the global parameters are established in Appendix E. Finally, in Appendix F, we show that learned embeddings are stable in the sense that they are not changed much by collecting a small amount of additional data.

## Appendix B Preliminaries

### b.1 Graphex processes

Recall the setup for the theoretical results: we consider a very large population network, and we study the graph-valued stochastic process given by taking each sample to be a p-sample from the population and requiring these samples to cohere in the obvious way. We idealize the population size as infinite by taking a limit. The limiting stochastic process is well defined, and is called a graphex process [Borgs:Chayes:Cohn:Veitch:2017].

Graphex processes have a convenient explicit representation in terms of (generalized) graphons [Veitch:Roy:2015, Borgs:Chayes:Cohn:Holden:2016, Caron:Fox:2017].

###### Definition B.1.

A graphon is an integrable function $W: \mathbb{R}_+^2 \to [0,1]$.

###### Remark B.2.

This notion of graphon is somewhat more restricted than graphons (or graphexes) considered in full generality, but it suffices for our purposes and avoids some technical details.

We now describe the generative model for a graphex process with graphon $W$. Informally, a graph is generated by (i) sampling a collection of vertices, each with a latent feature, and (ii) randomly connecting each pair of vertices with probability dependent on the latent features. Let $\Pi = \{\eta_i\} = \{(\nu_i, x_i)\}$ be a Poisson (point) process on $\mathbb{R}_+^2$ with intensity $\Lambda \otimes \Lambda$, where $\Lambda$ is the Lebesgue measure. Each atom of the point process is a candidate vertex of the sampled graph; the $\nu_i$ are interpreted as (real-valued) labels of the vertices, and the $x_i$ as latent features that explain the graph structure. Each pair of points $\eta_i, \eta_j$ with $i \neq j$ is then connected independently according to

$$\mathbb{1}\big[(\eta_i,\eta_j) \text{ connected}\big] \overset{\text{ind}}{\sim} \mathrm{Bern}\big(W(x_i,x_j)\big).$$

This procedure generates an infinite graph. To produce a finite sample of size $n$, we restrict to the collection of edges between vertices with labels $\nu_i \le n$. That is, we report the subgraph induced by restricting to vertices with label at most $n$, and removing all vertices that do not connect to any edge in the subgraph. This last step is critical; in general there are an infinite number of points of the Poisson process with $\nu_i \le n$, but only a finite number of them will connect to any edge in the induced subgraph.
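The generative procedure above can be sketched numerically. This is an illustrative approximation, not the paper's code: the latent space $\mathbb{R}_+$ is truncated to $[0, x_{\max}]$ so the Poisson process has finitely many atoms, and a simple separable graphon $W(x,y) = e^{-x-y}$ is assumed for concreteness.

```python
import numpy as np

def sample_graphex(n, W=lambda x, y: np.exp(-x - y), x_max=10.0, seed=0):
    """Sample the size-n restriction of a graphex process with graphon W.

    Candidate vertices come from a Poisson process on [0, n] x [0, x_max];
    truncating at x_max approximates the true process on [0, n] x R+.
    """
    rng = np.random.default_rng(seed)
    n_pts = rng.poisson(n * x_max)            # number of candidate vertices
    labels = rng.uniform(0, n, size=n_pts)    # nu_i: real-valued vertex labels
    feats = rng.uniform(0, x_max, size=n_pts)  # x_i: latent features

    # Connect each pair independently with probability W(x_i, x_j).
    edges = []
    for i in range(n_pts):
        for j in range(i + 1, n_pts):
            if rng.uniform() < W(feats[i], feats[j]):
                edges.append((i, j))

    # Critical step: drop candidate vertices that touch no edge.
    connected = {v for e in edges for v in e}
    return edges, connected

edges, verts = sample_graphex(3)
```

The `labels` array is unused for edge formation; it matters only when coherently restricting to smaller sizes, as in p-sampling.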

Modeling the data as collected by p-sampling is essentially equivalent to positing that it is the graph structure of a graphex process generated by some graphon $W$. Strictly speaking, the p-sampling model induces a slightly more general generative model that allows both for isolated edges that never interact with the main graph structure and for infinite star structures; see [Borgs:Chayes:Cohn:Veitch:2017]. Throughout the appendix, we ignore this complication and assume that the dataset graph is generated by some graphon. It is straightforward but notationally cumbersome to extend our results to p-sampling in full generality.

### b.2 Technical Background: Exchangeable Pairs

We will need to bound the deviation of the (normalized) degree of a vertex from its expectation. To that end, we briefly recall the method of exchangeable pairs; see [Chaterjee:2005] for details.

###### Definition B.3.

A pair of real random variables $(X, X')$ is said to be exchangeable if

$$(X,X') \overset{d}{=} (X',X).$$

Let $f$ and $F$ be measurable functions such that:

$$\mathbb{E}\big(F(X,X')\,\big|\,X\big) \overset{a.s.}{=} f(X), \quad \text{and} \quad F(X,X')=-F(X',X).$$

Let

$$v(X)\triangleq \tfrac{1}{2}\,\mathbb{E}\big((f(X)-f(X'))\,F(X,X')\,\big|\,X\big),$$

and suppose that $v(X) \le C$ almost surely for some constant $C$. Then

$$\forall x>0,\quad \mathbb{P}\big(|f(X)-\mathbb{E}(f(X))|\ge x\big)\le 2e^{-\frac{x^2}{2C}}.$$

Further, for all $p\ge 1$ and $x>0$ it holds that:

$$\mathbb{P}\big(|f(X)-\mathbb{E}(f(X))|> x\big)\le \frac{(2p-1)^p\,\|v(X)\|_p^p}{x^{2p}}.$$
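As a sanity check on the Gaussian tail bound from the exchangeable-pair method, consider the classical pair in which $X = (U_1,\dots,U_n)$ consists of i.i.d. Uniform$[0,1]$ variables and $X'$ resamples one uniformly chosen coordinate. With $f(X)=\sum_i (U_i - 1/2)$ and $F(X,X') = n(f(X)-f(X'))$, one can check $\mathbb{E}(F \mid X) = f(X)$ and $v(X)\le n/2$, so the bound reads $\mathbb{P}(|f(X)|\ge x)\le 2e^{-x^2/n}$. The Monte Carlo check below (our illustration, not part of the paper) confirms the inequality empirically.

```python
import numpy as np

rng = np.random.default_rng(0)
n, x, trials = 100, 10.0, 20000

# f(X) = sum_i (U_i - 1/2) for each of `trials` independent draws of X.
f = rng.uniform(size=(trials, n)).sum(axis=1) - n / 2

# Empirical tail probability vs. the exchangeable-pair bound 2 exp(-x^2/n).
empirical = np.mean(np.abs(f) >= x)
bound = 2 * np.exp(-x**2 / n)
assert empirical <= bound
```

Here the bound is loose (the empirical tail is far smaller), which is typical: $C$ was bounded crudely by $(U_J - U'_J)^2 \le 1$.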

### b.3 Notation

For convenient reference, we include a glossary of important notation.

First, notation to refer to important graph properties:

• $\Pi = \{(\nu_i, x_i)\}_i$ is the latent Poisson process that defines the graphex process in Section B.1. The labels are $\nu_i$ and the latent variables are $x_i$.

• $\Pi_n$ is the restriction of the Poisson process to atoms with labels in $[0,n]$.

• To build the graph from the point process we need to introduce a process of independent uniform variables. Let

$$U_\Pi \triangleq (U_{\eta_i,\eta_j})_{\eta_i,\eta_j\in\Pi}$$

be such that $U_\Pi$ is an independent process where $U_{\eta_i,\eta_j} = U_{\eta_j,\eta_i} \sim \mathrm{Unif}[0,1]$.

• $\Gamma_n$ is the (random) edge set of the graphex process at size $n$.

• $V(\Gamma_n)$ is the set of vertices of $\Gamma_n$.

• The non-edge set of $\Gamma_n$ is the set of all pairs of points in $V(\Gamma_n)$ that are not connected by an edge.

• $E_n$ denotes the number of edges in the graph.

• The neighbors of $\eta$ in $\Gamma_n$ are

$$N_n(\eta)\triangleq \{\eta' : (\eta,\eta')\in P_1(\Gamma_n)\}.$$
• For all $k$, the set of paths of length $k$ in $\Gamma_n$ is

$$P_k(\Gamma_n)\triangleq \big\{(\eta_i)_{i\le k+1}\in V(\Gamma_n)^{k+1} :\ (\eta_i,\eta_{i+1})\in\Gamma_n\ \forall i\le k\big\}.$$
• The degree of $\eta$ in $\Gamma_n$ is $\deg_n(\eta)\triangleq |N_n(\eta)|$.

• Asymptotically, the number of edges of a graphex process scales as $n^2$ [Borgs:Chayes:Cohn:Holden:2016]. Let $\mathcal{E}$ be the proportionality constant

$$\mathcal{E}\triangleq \lim_{n\to\infty}\frac{E_n}{n^2}.$$

Next, we introduce notation relating to model parameters. Treating the embedding parameters requires some care. The collection of vertices of the graph is a random quantity, and so the embedding parameters must also be modeled as random. For graphex processes, this means the embedding parameters depend on the latent Poisson process used in the generative model. To phrase a theoretical result, it is necessary to assume something about the structure of this dependence. The choice we make here is: the embedding parameters are taken to be markings of the Poisson process $\Pi$. In words, the embedding parameter of a vertex may depend on the (possibly latent) properties of that vertex, but the embeddings are independent of everything else.

• The collection of all possible parameters is:

$$\Omega^\Pi_\theta \triangleq \big\{(\lambda_\eta,\gamma)_{\eta\in\Pi} :\ \lambda_\eta\in\Omega_\theta\ \forall\eta\in\Pi \text{ and } \gamma\in\Omega_\gamma\big\}.$$

Note that we attach a copy of the global parameter to each vertex for mathematical convenience.

• For all $\eta$, we write the corresponding projections onto the embedding coordinate $\lambda_\eta$ and onto the global coordinate $\gamma$.

• The following concepts and notation are needed to build a marking of the Poisson process: given a distributional kernel, we generate the marks $(\lambda_\eta)_{\eta\in\Pi}$ conditional on $\Pi$ such that:

• $(\lambda_\eta)_{\eta\in\Pi}$ is an independent process

• each $\lambda_\eta$ is drawn from the kernel evaluated at $\eta$

• Let $\bar\Pi_n(\bar\theta)$ denote the augmented object that carries information about both the graph structure ($\Pi_n$) and the model parameters $\bar\theta$.

## Appendix C Basic asymptotics for p-sampling

We begin by establishing the result for p-sampling, with the non-edges chosen by taking the induced subgraph. This is the simplest case, and is useful for introducing ideas and notation. We consider more general approaches to negative sampling in the next section, where they are treated in tandem with random-walk sampling. The same arguments can be used to extend p-sampling to allow for, e.g., the unigram negative sampling used in our experiments.
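For concreteness, p-sampling with induced non-edges can be sketched as follows. This is a minimal illustration of the sampling scheme described above (include each vertex independently with probability p, keep the induced subgraph, drop isolated vertices), not an excerpt of the paper's implementation.

```python
import random

def p_sample(edges, p, seed=0):
    """p-sampling: keep each vertex independently with probability p and
    return the edge set of the induced subgraph.  Vertices that end up
    with no edges simply do not appear in the returned edge set."""
    rng = random.Random(seed)
    vertices = {v for e in edges for v in e}
    kept = {v for v in vertices if rng.random() < p}
    return [(u, v) for (u, v) in edges if u in kept and v in kept]

g = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
sub = p_sample(g, 0.8)
```

Non-edges of the induced subgraph (pairs of kept vertices without an edge) serve as the negative examples in this simplest scheme.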

For all $k \le n$ and all $\bar\theta$, let $L(\Gamma_k,\bar\theta)$ denote the loss on $\Gamma_k$, where $\bar\theta$ is restricted to the embeddings (and global parameters) associated with $\Gamma_k$.

###### Theorem C.1.

Let $\bar\theta$ be a random variable taking values in $\Omega^\Pi_\theta$, distributed as a marking of $\Pi$ by a certain kernel. Then there is some constant $c^{ps}_m$ such that

$$\hat R_k(\Gamma_n,\bar\theta)\to c^{ps}_m$$

both a.s. and in $L^1$, as $n\to\infty$.

Moreover there is some constant $c^{ps}_*$ such that

$$\min_\theta \hat R_k(\Gamma_n,\theta)\to c^{ps}_*$$

both a.s. and in $L^1$, as $n\to\infty$.

###### Proof.

We will first prove the first statement. Let $k \le n$, and let $\mathcal{F}_n(\bar\theta)$ denote the $\sigma$-field generated by the partially labeled graph obtained from the data by forgetting all labels in $[0,n]$ (but keeping larger labels and the embeddings). The critical observation is

$$\hat R_k(\Gamma_n,\bar\theta)=\mathbb{E}\big[L(\Gamma_k,\bar\theta)\,\big|\,\mathcal{F}_n(\bar\theta)\big]. \tag{8}$$

The reason is that choosing a graph by p-sampling is equivalent to uniformly relabeling the vertices and restricting to small labels; averaging over this random relabeling operation is precisely the expectation on the right-hand side.

By the reverse martingale convergence theorem we get that:

$$\hat R_k(\Gamma_n,\bar\theta)\ \xrightarrow{\ a.s.,\,L^1\ }\ \mathbb{E}\big[L(\Gamma_k,\bar\theta)\,\big|\,\mathcal{F}_\infty(\bar\theta)\big],$$

but as $\mathcal{F}_\infty(\bar\theta)$ is a trivial $\sigma$-algebra we get the desired result.

We will now prove the second statement. Let $\mathcal{F}_m$ be the $\sigma$-field generated by the partially labeled graph obtained by forgetting all labels in $[0,m]$. Further, we denote the set of embeddings of the graph $\Gamma_m$ by:

$$\Omega^{\Gamma_m}_\theta \triangleq \big\{(\lambda_V,\gamma)_{V\in\Gamma_m} :\ \forall V\in V(\Gamma_m)\ \lambda_V\in\Omega_\lambda,\ \gamma\in\Omega_\gamma\big\}.$$

We are now ready to state the proof. Let $m \le n$, and observe that:

$$\mathbb{E}\Big[\min_{\theta\in\Omega^{\Gamma_n}_\theta}\hat R_k(\Gamma_n,\theta)\,\Big|\,\mathcal{F}_m\Big]\ \le\ \min_{\theta\in\Omega^{\Gamma_m}_\theta}\mathbb{E}\big[L(\Gamma_k,\theta)\,\big|\,\mathcal{F}_m\big] \tag{9}$$

$$=\ \min_{\theta\in\Omega^{\Gamma_m}_\theta}\hat R_k(\Gamma_m,\theta). \tag{10}$$

Thus, $\min_\theta \hat R_k(\Gamma_n,\theta)$ is a supermartingale with respect to the filtration $(\mathcal{F}_m)_m$. Moreover, by assumption, the loss is bounded, and thus so also is the empirical risk. Supermartingale convergence then establishes that $\min_\theta \hat R_k(\Gamma_n,\theta)$ converges almost surely and in $L^1$ to some random variable that is measurable with respect to $\mathcal{F}_\infty$. The proof is completed by the fact that $\mathcal{F}_\infty$ is trivial.∎

## Appendix D Basic asymptotics for random-walk sampling

In this section we establish the convergence of the relational empirical risk defined by the random walk. The argument proceeds as follows: we first recast the subsampling algorithm as a random probability measure, measurable with respect to the dataset graph. Producing a graph according to the sampling algorithm is the same as drawing a graph according to this random measure. Establishing that the relational empirical risk converges then amounts to establishing that expectations with respect to this random measure converge; this is the content of Theorem D.8. To establish this result, we show in Lemma D.6 that sampling from the random-walk random measure is asymptotically equivalent to a simpler sampling procedure that depends only on the properties of the graphex process and not on the details of the dataset. We allow for very general negative-sampling distributions in this result; we show how to specialize to the important case of (a power of) the unigram distribution in Lemma D.7.

### d.1 Random-walk Notation

We begin with a formal description of the subsampling procedure that defines the relational empirical risk. We will work with random subsets of the Poisson process $\Pi$; these translate to random subgraphs of the data in the obvious way. Namely, if the sampler selects an atom of the Poisson process, then it selects the corresponding vertex of the graph.

Sampling follows a two-stage procedure: we choose a random walk, and then augment this random walk with additional vertices—this is the negative-sampling step. The following introduces much of the additional notation we require for this section.

###### Definition D.1 (Random-walk sampler).

Let $Q$ be a (random) probability measure over $V(\Gamma_n)$. Let $(V_i)_{i\le k}$ be a sequence of vertices sampled according to:

1. (random-walk) $V_1$ is drawn from an initial distribution on $V(\Gamma_n)$, and $V_{i+1}$ is drawn uniformly from the neighbors $N_n(V_i)$, for $i \ge 1$.

2. (augmentation) Let $(V'_j)$ be a sequence of additional vertices sampled from $Q$, independently from each other and also from $(V_i)_{i\le k}$.

Let $G_H$ be the vertex-induced subgraph of the selected vertices. Let $P_n$ be the random probability distribution over subgraphs induced by this sampling scheme.
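The two-stage sampler of Definition D.1 can be sketched as follows. This is a hypothetical illustration: the walk's initial distribution is taken uniform over vertices and the augmentation distribution is taken uniform as a stand-in for a general choice; neither specific choice is asserted by the definition.

```python
import random
from collections import defaultdict

def random_walk_sample(edges, k, n_neg, seed=0):
    """Stage 1: a k-step simple random walk on the graph.
    Stage 2: n_neg augmentation vertices drawn independently
    (here uniformly over vertices, as a stand-in distribution)."""
    rng = random.Random(seed)
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    verts = list(adj)

    walk = [rng.choice(verts)]          # initial vertex (uniform stand-in)
    for _ in range(k - 1):
        walk.append(rng.choice(adj[walk[-1]]))  # uniform neighbor step

    negatives = [rng.choice(verts) for _ in range(n_neg)]
    return walk, negatives

walk, negs = random_walk_sample([(0, 1), (1, 2), (2, 0), (2, 3)], k=5, n_neg=3)
```

The subsample fed to the loss is the vertex-induced subgraph on `walk + negs`, so both edges and non-edges between the selected vertices are available, as Remark D.3 notes.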

With this notation in hand, we rewrite the loss function and the risk in a mathematically convenient form.

###### Definition D.2 (Loss and risk).

The loss on a subsample is

$$L(G_H,\bar\theta)\in[0,1],$$

where we implicitly restrict $\bar\theta$ to the embeddings (and global parameters) associated with vertices in $G_H$. The empirical risk is

$$\mathbb{E}_{P_n}\big[L(G_H,\bar\theta)\,\big|\,\bar\Pi_n(\bar\theta)\big].$$
###### Remark D.3.

Note that the subgraphs produced by the sampling algorithm explicitly include all edges and non-edges of the graph. However, the loss may (and generally will) depend on only a subset of the pairs. In this fashion, we allow for the practically necessary division between negative and positive examples. Skipgram augmentation can be handled with the same strategy.

We impose a technical condition on the distribution that the additional vertices are drawn from. Intuitively, the condition is that the distribution is not too sensitive to details of the dataset in the large data limit.

###### Definition D.4 (Augmentation distribution).

We say the augmentation distribution is asymptotically exchangeable if there is a limit such that

• There is a deterministic function such that

• where

Lemma D.7 establishes that the unigram distribution respects these conditions.
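The unigram negative-sampling distribution referenced here can be sketched concretely. This is an illustration of the standard construction (sampling vertices with probability proportional to a power of their degree; the exponent 0.75 is the common skipgram-literature choice, not a value asserted by this paper).

```python
import random
from collections import Counter

def unigram_sampler(edges, power=0.75, seed=0):
    """Return a draw() closure sampling vertices with probability
    proportional to degree**power (the unigram distribution)."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    verts = list(deg)
    weights = [deg[v] ** power for v in verts]
    rng = random.Random(seed)
    return lambda: rng.choices(verts, weights=weights, k=1)[0]

draw = unigram_sampler([(0, 1), (1, 2), (1, 3)])
samples = [draw() for _ in range(100)]
```

Because the unigram weights depend on the graph only through (powers of) degrees, which concentrate in the large-graph limit, the distribution is asymptotically insensitive to the dataset's details in the sense required by Definition D.4.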

### d.2 Technical lemmas

We begin with some technical inequalities controlling sums over the latent Poisson process. To interpret the lemma, note that the degree of a vertex with latent property $y$ is given by $f_n(y,\Pi)$ in the statement.

###### Lemma D.5.

Let $(U_{x(\eta)})_{\eta\in\Pi}$ be distributed as a process of independent uniforms in $[0,1]$ and let

$$f_n(y,\Pi)\triangleq \sum_{\eta\in\Pi_n}\mathbb{1}\big(U_{x(\eta)}\le W(y,x(\eta))\big),$$

for all $y$. Then the following hold:

1. For all $p\ge 1$, there is $K$ such that for all $\beta>0$,

$$\mathbb{P}\Big(\Big|\frac{f_n(y,\Pi)}{nW(y,\cdot)}-1\Big|\ge\beta\Big)\le \frac{K}{n^3\beta^p}.$$
2. There is $K_p$ such that

$$\mathbb{P}\Big(\Big|\frac{f_n(y,\Pi)}{n}-W(y,\cdot)\Big|\ge\beta\Big)\le \frac{K_p}{n^p\beta^{2p}}$$

and

$$\mathbb{P}\Big(\Big|\frac{E_n}{n^2\mathcal{E}}-1\Big|\ge\beta\Big)\le \frac{K_p}{n^p\beta^{2p}}.$$
3. such that such that then

###### Proof.

We will first write the proof of the first statement, which is harder. We then highlight the differences in the other cases. We use the Stein exchangeable pair method, recalled in Section B.2.

Let be such that

 ∀x,y F(x,y)=[x−y].

Let $\bar J$ be drawn uniformly from $\{0,\dots,n-1\}$ and let

$$\Pi' = T_{[\bar J,\bar J+1],[n,n+1]}\cdot\Pi_\nu\times\Pi_x,$$

where $T_{[\bar J,\bar J+1],[n,n+1]}$ is the permutation of $\mathbb{R}_+$ exchanging the intervals $[\bar J,\bar J+1]$ and $[n,n+1]$, and

$$T_{[\bar J,\bar J+1],[n,n+1]}\cdot\Pi_\nu\times\Pi_x \triangleq \big\{\big(T_{[\bar J,\bar J+1],[n,n+1]}(\nu),x\big),\ \forall (\nu,x)\in\Pi\big\}.$$

Then we can check the following:

• As $(\Pi,\Pi')$ forms an exchangeable pair, we obtain that

$$\mathbb{E}\Big(\frac{f_n(y,\Pi)}{W(y,\cdot)}-\frac{f_n(y,\Pi')}{W(y,\cdot)}\,\Big|\,\Pi_n\Big)\overset{(a)}{=}\frac{1}{nW(y,\cdot)}\Big[\sum_{j=0}^{n-1}\sum_{\eta\in\Pi_{j+1}\setminus\Pi_j}\mathbb{1}\big(U_{x(\eta)}\le W(y,x)\big)-\mathbb{E}\Big(\sum_{\eta\in\Pi_{j+1}\setminus\Pi_j}\mathbb{1}\big(U_{x(\eta)}\le W(y,x)\big)\Big)\Big]\overset{(b)}{=}\frac{f_n(y,\Pi)}{nW(y,\cdot)}-1$$

where (a) is obtained by complete independence of the Poisson process, and where to get (b) we use the fact that (see [Veitch:Roy:2015])

$$\sum_{(\nu,x)\in\Pi_{j+1}\setminus\Pi_j}\mathbb{1}\big(U_{x}\le W(y,x)\big)\sim \mathrm{Poi}\big(W(y,\cdot)\big).$$
• Moreover, we can very similarly bound the associated quantity $v$, where the bound involves a constant that does not depend on $n$ or $y$.

Therefore, using the exchangeable pair method presented earlier, we get that there is a constant such that for all

 P(|∑(ν,x)∈ΠnI(U