Supervised Coarse-Graining of Composite Objects

01/01/2019 ∙ by Shaobo Han, et al.

We consider supervised dimension reduction for regression with composite object-valued predictors, which are composed of primitive variables located on an underlying relevant space. For example, this type of predictor can refer to a number of time-stamped events or a collection of interactions between units indexed by spatial coordinates. To facilitate regression analysis on such data, it is desirable to reduce the complex object into a malleable vector form with a tradeoff between homogeneity and heterogeneity. In this article, we introduce a tree-guided supervised dimension reduction method, called multiscale coarse-graining, that uses the similarity among primitive variables for deriving a part-subpart hierarchical representation of the composite object, and chooses non-uniform resolutions according to the relevance to the response. We propose a multiscale extension of the generalized double Pareto prior based on parameter expansion, which induces simultaneous deletion and fusion of the primitive variables through adaptive shrinkage estimation. We demonstrate the practical utility of our approach with an application in soccer analytics, in which we obtain coarse-grained visualizations of spatial passing networks under the supervision of team performance.


1 Introduction

In modern applications, we frequently encounter complex object-type data, such as functions (Ramsay and Silverman, 2006), trees (Wang and Marron, 2007), shapes (Srivastava et al., 2011), and networks (Durante et al., 2017). In many instances, such data are collected repeatedly under different conditions, with an additional response variable of interest available for each replicate. This has motivated an increasingly rich literature on generalizing regression on vector predictors to settings involving more elaborate object-type predictors with special characteristics, such as functions (James, 2002), manifolds (Nilsson et al., 2007), tensors (Zhou et al., 2013), and undirected networks (Guha and Rodriguez, 2018).

Complex objects are often built recursively from simpler parts. In this article, we introduce a new class of object data, denoted composite objects (CO), which are structured data composed of primitive objects (POs). Many common data types can be seen as instances of the CO family, such as a collection of time-stamped events, connections between regions of the brain, or basketball shots on the court. The component POs in CO-type data can be enormous in number and largely distinct from one another across replicates, presenting new challenges for data exploration, analysis, and visualization.

We are interested in identifying the association between the patterns of coordinated interactions among individual units in a group and the performance of the group. In this article, we focus on analyzing the FIFA World Cup 2018 data collected by StatsBomb. In this dataset, we observe SPatial Interaction Network (SPIN) data for each replicate (a team-game combination), consisting of a collection of completed passes. As illustrated in Figure 1, each replicate is in the CO form, with every constituent pass viewed as a PO logged with the spatial locations of the passer and receiver. The associated response may refer to a team performance metric such as goals scored or conceded, the number of shots on target, or other situational game factors.

We consider regression modeling with observations of a scalar response and a CO-valued predictor for each replicate. A primary challenge is to represent the CO data in a malleable form that facilitates multivariate analysis. In our motivating application, the dataset contains a large number of completed passes, with unique origin-destination locations, across the matches. One common practice is to divide the complete set of POs into non-overlapping (disjoint) subsets through a predefined partitioning scheme. For example, Miller et al. (2014) discretizes the basketball court uniformly into tiles and counts the number of shots located in each tile. Durante et al. (2017) parcellates the brain into regions and investigates the network connectivity between pairs of regions. Under such a partition, a CO can be represented as a count vector, whose entries count the occurrences of POs appearing in the replicate that belong to each subset.
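To make this representation concrete, the following sketch builds the count vector for one replicate from raw pass coordinates; the 120 by 80 pitch dimensions and the 6 by 4 uniform grid are illustrative assumptions, not settings taken from the paper.

```python
import numpy as np

def count_vector(passes, pitch=(120.0, 80.0), grid=(6, 4)):
    """Represent one replicate (a set of passes) as a count vector.

    passes : array of shape (n, 4), rows (x_origin, y_origin, x_dest, y_dest).
    Each pass (a PO) is assigned to the tile pair (origin tile, destination
    tile) of a uniform grid, and the CO is summarized by tile-pair counts."""
    passes = np.asarray(passes, dtype=float)
    nx, ny = grid

    def tile(xy):
        # Tile index for a set of (x, y) locations on the pitch.
        ix = np.clip((xy[:, 0] / pitch[0] * nx).astype(int), 0, nx - 1)
        iy = np.clip((xy[:, 1] / pitch[1] * ny).astype(int), 0, ny - 1)
        return ix * ny + iy

    origin = tile(passes[:, :2])
    dest = tile(passes[:, 2:])
    # One group per (origin tile, destination tile) pair.
    group = origin * (nx * ny) + dest
    return np.bincount(group, minlength=(nx * ny) ** 2)
```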

Figure 1: Spatial interaction networks in the 2018 FIFA World Cup Final (France 4-2 Croatia). The arrowed segments denote passes from the location of the passer to the location of the receiver. Team's direction of attack: from left to right.

This partition scheme inherently assumes the equivalence of the POs falling within the same group, and focuses on the variabilities in abundance across groups. At one extreme, the partition into singletons treats all the unique POs as distinct, and thus induces a bag-of-words representation (Blei et al., 2003; Taddy, 2013). At the other extreme, the trivial one-group partition treats all the POs as equivalent, and thus keeps only the information about the sample size. As one can see, the choice of partitioning scheme and its scale has a critical influence on inference.

Ideally, a reductive representation should promote the interpretability of the CO-type predictor and preserve the relevance to the response. However, such approaches are underdeveloped in the current literature. In this article, we propose spinlets—an adaptive multiscale representation for spatial interaction networks. The spinlets method uses the similarity among primitive objects for deriving a tree representation of the data.

Spinlets serves as a tool for visualization and statistical inference of SPIN data. In Figure 2, we plot three different types of reduced representations for the instances shown in Figure 1, based on (i) a predefined uniform parcellation of the pitch, (ii) a graph partitioning algorithm without supervision, and (iii) spinlets under the supervision of the number of goals scored. Each replicate is a unique team-game combination. Spinlets is data-driven and can select multiple informative scales. As illustrated in Figure 2, the details of passing patterns in the backfield are coarsened.

Figure 2: SPIN representation of France (first row) and Croatia (second row) in the 2018 World Cup Final. The colored arrows represent the grouped POs with width proportional to the count of occurrences in the group. For the sake of visualization, the origins (dashed) and destinations (solid) of the grouped POs are indicated by the convex hulls (shaded region) with the grouped POs located at the centroids of the polygons.

There is a rich literature on supervised dimension reduction, covering the LASSO (Tibshirani, 1996), supervised PCA (Bair et al., 2006; Barshan et al., 2011), and sufficient dimension reduction (SDR) methods; see Cook (2007), Adragni and Cook (2009), and the references therein. The reduction of complexity is typically achieved through variable selection or combination. Such approaches can accommodate vector predictors and perform well in high-dimensional settings; however, our application involves predictors with complex structures. SDR methods have been generalized to handle functional predictors (Ferré and Yao, 2003, 2005), matrix- or array-valued predictors (Li et al., 2010), and irregularly measured longitudinal predictors (Jiang et al., 2014). In this article, we focus on spatial networks as an instance of a composite data object.

There is a separate literature on multiscale geometric data representation, including diffusion maps (Lafon and Lee, 2006) and GMRA (Allard et al., 2012; Petralia et al., 2013). These approaches seek a reductive representation that reflects the intrinsic geometry of the high-dimensional data by partitioning the similarity graph of data observations. In contrast, our spinlets method partitions the similarity graph of variables, with a different goal of identifying predictive groups of variables. Spinlets is similar in spirit to the treelets method (Lee et al., 2008), which organizes variables on a hierarchical cluster tree with multiscale bases; however, treelets is an unsupervised approach utilizing the sample correlation to construct the tree with a single cutoff height. Our spinlets approach departs from treelets by incorporating external proximity to construct the tree, and by determining non-uniform heights (Meinshausen and Bühlmann, 2008) with reference to the response.

In regression with CO-type predictors, the total number of unique POs is massive, while only a limited number of them are sparsely observed within each replicate. It is advantageous to form groups of POs that are spatially contiguous, such that meaningful analysis can be conducted at a lower level of resolution. In many other applications, predictors are highly correlated, or collectively associated with the response, or domain knowledge exists suggesting functional similarity among a group of variables. This has motivated a line of research on supervised clustering of predictors in forward regression settings; examples include Hastie et al. (2001), Jörnsten and Yu (2003), and Dettling and Bühlmann (2004). Averaging grouped predictors often leads to lower variance (Park et al., 2006).

Regularization methods such as the elastic net (Zou and Hastie, 2005) or OSCAR (Bondell and Reich, 2008) can mitigate the multicollinearity issue and encourage grouping effects. Along this thread, Wang and Zhao (2017) recently proposed two tree-guided fused lasso (TFL) penalties, which effectively encode the topology of a phylogenetic tree in determining the taxonomic levels of microbes associated with a given phenotype. However, this approach does not model the variability in the predictors, whereas we model the conditional distributions of the predictors given the response through inverse regression, with the possibility of alleviating the effects of collinearity (Cook and Li, 2009). Moreover, lasso-based penalties tend to over-shrink signals that are not close to zero (Armagan et al., 2011). We introduce a new multiscale prior that induces a locally adaptive shrinkage rule on each scale. This prior is used within our proposed spinlets method for supervised dimension reduction with SPIN predictors.

In Section 2, we introduce the data structure and notations. In Section 3, we describe our model-based SDR framework. Section 4 presents a tree-structured PX scheme and our induced multiscale shrinkage prior. A variational expectation-maximization (EM) algorithm for estimation is outlined in Section 5. In Section 6, we evaluate the performance of our approach with simulated data and demonstrate the practical utility through applications to soccer analytics. The implementation of spinlets will be made available on Github.

2 Data Structure and Notations

2.1 Hierarchical Organization of Primitive Objects

To concisely represent the proximity information of POs (e.g., soccer passes), we define a tree structure in three steps: (i) choose the Euclidean distance metric between pairs of POs and compute a distance matrix; (ii) construct a sparse similarity graph of POs by connecting nearest neighbors; (iii) build a partition tree by recursively applying the METIS partitioning algorithm (Karypis and Kumar, 1998) to the graph. We construct a full binary tree via recursive METIS (rMETIS): at each step, a set of POs is split into two disjoint subsets. Figure 3 illustrates a sub-branch of the tree with all POs assigned to the groups plotted.
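As a rough illustration of this three-step construction, the sketch below builds a k-nearest-neighbor similarity graph over the POs and bisects it recursively; spectral bisection from scikit-learn stands in for METIS, and the min_size stopping rule is an illustrative choice rather than the paper's.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import SpectralClustering

def build_partition_tree(locations, k=10, min_size=20):
    """Recursively bisect the kNN similarity graph of the POs.

    locations : (p, d) array with one row per unique PO (e.g., a pass encoded
    by its origin and destination coordinates).  Returns a nested dict
    {"items": indices, "children": [left, right]} representing a binary tree."""
    def split(indices):
        node = {"items": indices, "children": []}
        if len(indices) <= min_size:
            return node                                   # leaf group, finest scale
        # Sparse symmetric kNN graph among the POs assigned to this node.
        knn = kneighbors_graph(locations[indices], min(k, len(indices) - 1),
                               mode="connectivity", include_self=False)
        affinity = np.asarray((0.5 * (knn + knn.T)).todense())
        # Two-way graph partition; spectral bisection stands in for METIS here.
        labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                                    assign_labels="discretize").fit_predict(affinity)
        left, right = indices[labels == 0], indices[labels == 1]
        if len(left) == 0 or len(right) == 0:
            return node                                   # failed split, keep as leaf
        node["children"] = [split(left), split(right)]
        return node
    return split(np.arange(len(locations)))
```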

Figure 3: A sub-branch of the partition tree with leaf nodes (indicated by solid dots) produced by the rMETIS algorithm. Every internal node (indicated by hollow dots) on the coarse scale is connected with two child nodes on the finer scale.

2.2 Notations

The internal nodes in the partition tree are indexed so that each parent node has two child nodes. For each node, we consider the set of its child nodes and the set of all nodes on the sub-branch rooted at it. The tree originates from the root node, with the leaf nodes indexed separately. We define the partition induced by the leaves as the primary partition, which partitions the complete set of POs into groups on the finest scale and yields a primary vectorial representation of COs, aligned across replicates. Our main goal is to determine a reductive rule by pruning the tree, such that the resulting representation retains the relevance to the response variable.

3 Tree-Guided Supervised Dimension Reduction

The above representation has not taken into account any information from the response. Hence, we further refine it along the tree in a supervised manner.

3.1 Poisson Inverse Regression Model

According to Cook (2007), a sufficient reduction $R(X)$ of the random vector $X$ satisfies one of three equivalent statements: (i) $X \mid (Y, R(X)) \sim X \mid R(X)$; (ii) $Y \mid X \sim Y \mid R(X)$; (iii) $Y$ and $X$ are independent given $R(X)$, where $\sim$ indicates equivalence in distribution. For each replicate $i$ with response variable $y_i$, we attach a random variable $x_{ij}$ to each leaf node $j$, counting the number of POs appearing in the CO that fall in leaf group $j$. In order to explicitly model the variabilities in occurrences, we adopt an inverse regression formulation. This approach is largely motivated by the poor performance we observed when implementing the usual forward regression, due to extreme multicollinearity issues. Sufficiency is guaranteed within our proposed Poisson inverse regression (PIR) model for the leaf counts conditionally on the response,

(1)   $x_{ij} \mid y_i \sim \mathrm{Poisson}(\lambda_{ij}), \qquad \log \lambda_{ij} = \alpha_j + \mu_i + \beta_j y_i, \qquad j = 1, \ldots, p,$

where $\alpha_j$ is the intercept for predictor $j$, $\mu_i$ is the baseline effect for replicate $i$, and $\beta_j$ is the regression coefficient for predictor $j$. The linear sufficient reduction parameterized by $\beta = (\beta_1, \ldots, \beta_p)^{\top}$ is derived as follows (the replicate index is omitted for clarity).

Proposition 1.

Letting $R(x) = \beta^{\top} x = \sum_{j=1}^{p} \beta_j x_j$, under the Poisson inverse regression model (1), the distribution of $Y \mid X = x$ is the same as the distribution of $Y \mid R(X) = R(x)$ for all values of $x$.

Cook (2007) proves the sufficiency of such linear reductions for one-parameter exponential family models. Within this family, the PIR model (1) can be written in exponential family form, so the joint probability mass function of the counts factorizes into one term that involves the data alone and another term that involves the data only through $R(x)$. Thus, the sufficiency of the reduction holds according to the Fisher-Neyman factorization theorem for sufficient statistics (Bickel and Doksum, 2015, Theorem 1.5.1, p. 43). The PIR model has a close connection with the multinomial inverse regression (MNIR) model (Taddy, 2013), although the vector Poisson likelihood departs from the multinomial likelihood by accounting for the variability of the total number of POs in each replicate.
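For concreteness, a sketch of this factorization in the notation assumed for (1), suppressing the replicate index and writing $\mu$ for the replicate baseline, is:

```latex
% Sketch: the vector Poisson pmf factorizes into h(x) and g(R(x), y).
\begin{align*}
p(x \mid y)
  &= \prod_{j=1}^{p} \frac{\lambda_j^{x_j} e^{-\lambda_j}}{x_j!},
     \qquad \log \lambda_j = \alpha_j + \mu + \beta_j y, \\
  &= \underbrace{\exp\Big\{\textstyle\sum_{j=1}^{p}\big[x_j(\alpha_j + \mu) - \log x_j!\big]\Big\}}_{h(x)}
     \cdot
     \underbrace{\exp\Big\{y\,\beta^{\top}x - \textstyle\sum_{j=1}^{p} e^{\alpha_j + \mu + \beta_j y}\Big\}}_{g(R(x),\,y)},
\end{align*}
% so that, by Bayes' rule, p(y | x) \propto g(R(x), y) p(y) depends on x
% only through R(x) = \beta^\top x.
```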

3.2 Reductive Operators: Deletion and Fusion

One reductive operator enabled by the parameterization of the PIR model (1) is the deletion of irrelevant leaf groups. One can see that $\beta_j = 0$ implies that $Y$ is independent of $x_j$, that is, the number of POs in leaf group $j$ carries no information about the value of $y$. Another reductive operator on the tree is fusion. We observe that if the coefficients $\beta_j$ are equal for all leaves $j$ in a set $S$, then the sufficient reduction $R(x)$ depends on these predictors only through $\sum_{j \in S} x_j$, the total number of POs falling into the set $S$. Therefore, the practitioner can construct a lower-resolution vectorial representation by merging the involved leaf sets into one set without loss of relevant information. The relevant signals are then captured on coarser scales. To ensure spatial contiguity, we require all leaves contained in $S$ to share at least one common ancestor node.
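A toy illustration of how the two operators prune an estimated coefficient vector over the tree; the nested-pair tree encoding and the numerical tolerance are our own simplifications, not the paper's estimation rule.

```python
import numpy as np

def collect_leaves(node):
    """Leaves of a tree encoded as nested pairs of leaf indices."""
    if isinstance(node, int):
        return np.array([node])
    return np.concatenate([collect_leaves(node[0]), collect_leaves(node[1])])

def prune(node, beta, tol=1e-6):
    """Deletion and fusion on a binary partition tree.

    node : a leaf index (int) or a pair (left, right) of sub-trees;
    beta : array of estimated leaf coefficients.
    Returns a list of (leaf_indices, coefficient) groups: branches whose leaf
    coefficients are all (near) zero are deleted, and a branch whose leaf
    coefficients agree within tol is fused into a single group."""
    leaves = collect_leaves(node)
    coefs = beta[leaves]
    if np.all(np.abs(coefs) < tol):                  # deletion: irrelevant branch
        return []
    if np.ptp(coefs) < tol:                          # fusion: one shared coefficient
        return [(leaves.tolist(), float(coefs.mean()))]
    left, right = node                               # otherwise recurse on children
    return prune(left, beta, tol) + prune(right, beta, tol)

# Leaves 0-3; with the tree ((0, 1), (2, 3)) and the coefficients below,
# leaves 0-1 are deleted and leaves 2-3 are fused into one coarse group.
print(prune(((0, 1), (2, 3)), np.array([0.0, 0.0, 0.4, 0.4])))
```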

The grouping of highly correlated predictors in high-dimensional regression can be incorporated via fusion penalties (Tibshirani et al., 2005; Bondell and Reich, 2008), which encourage sparsity in the differences of coefficients. There exist several generalized fusion schemes that can take into account graph structure (She, 2010; Tibshirani and Taylor, 2011). However, these methods do not support multiscale nested grouping of predictors in accordance with the tree structure. For example, applying the pairwise fused lasso (She, 2010) to all pairs of variables tends to incorrectly encourage merging all the variables together with equal strengths. The TFL penalties (Wang and Zhao, 2017) require careful tuning of the regularization parameters across multiple scales.

4 Multiscale Shrinkage with Parameter Expansion

Parameter expansion (PX) (Liu et al., 1998) has been found useful not only for accelerating computations (Liu and Wu, 1999), but also for inducing new families of prior distributions (Gelman, 2004). In this section, we propose a new tree-structured PX scheme, which induces a multiscale shrinkage prior.

4.1 Tree Structured Parameter Expansion

For each leaf group $j$, we denote by $\mathcal{P}(j)$ the "vertical" root-to-leaf path set in the tree, which includes all the nodes visited from the root node down to leaf node $j$. Meanwhile, for each node $s$, we denote by $\mathcal{D}(s)$ the "horizontal" descendant set, the set of leaf nodes that are descendants of $s$; for a leaf node $s$, we set $\mathcal{D}(s) = \{s\}$ for ease of notation. Figure 4 illustrates an example of a path set and an example of a descendant set together with its common ancestor node.

Figure 4: Illustration of the partition tree with its leaf groups (solid dots). The green shaded region denotes a path set from the root node to a leaf node, passing through three intermediate nodes. The descendant set contains the leaf nodes (located in the dashed circle) that share a most recent common ancestor (indicated by the red octagon).

There exists a dual relationship between random variables and coefficients based on these two notions. We can attach a random variable $z_s = \sum_{j \in \mathcal{D}(s)} x_j$ to each node $s$, which counts the appearances of POs in the descendant set $\mathcal{D}(s)$. Letting $\gamma_s$ denote the expanded coefficient for node $s$, each leaf coefficient is recovered by summing along the root-to-leaf path, $\beta_j = \sum_{s \in \mathcal{P}(j)} \gamma_s$. Therefore, the sufficient reduction score introduced in Section 3 can be re-expressed under the expanded parameterization on the partition tree as $R(x) = \sum_{j} \beta_j x_j = \sum_{s} \gamma_s z_s$. Clearly, this is also a linear sufficient reduction. The reparameterization changes neither the data likelihood of the PIR model (1) nor the sufficient reduction score.

4.2 Fused Generalized Double Pareto Prior

In Section 3.2, we introduced two reductive operations on the leaf partitions along the tree: deletion (if a leaf coefficient is zero) and fusion (if a group of leaf coefficients are equal). However, exhaustive search over all possible schemes is prohibitive; even for a binary tree, these two operations in combination result in a combinatorially large number of different schemes. Alternatively, effective execution of the following two operations can be induced by regularization in the expanded parameterization:

  1. Deletion: if $\gamma_s = 0$ for every node $s \in \mathcal{P}(j)$, then $\beta_j = 0$, which implies that the contributions of leaf predictor $j$ across all scales are pruned out.

  2. Fusion: if, for every internal node in the sub-branch rooted at a node $s$, the two child coefficients are equal, then $\beta_j = \beta_{j'}$ for all $j, j' \in \mathcal{D}(s)$, and the leaf variables within $\mathcal{D}(s)$ can be condensed into one variable.

Note that the parameterization is redundant; both conditions above are sufficient but not necessary. Based on the above observations, we impose generalized double Pareto (GDP) priors (Armagan et al., 2013) on the expanded coefficients and on the pairwise differences between sibling nodes,

(2)   $\gamma_s \sim \mathrm{GDP}(\eta_1, \alpha_1)$ for each node $s$, and $\gamma_{s_1} - \gamma_{s_2} \sim \mathrm{GDP}(\eta_2, \alpha_2)$ for each sibling pair $(s_1, s_2)$ with a common parent,

where $\mathrm{GDP}(\eta, \alpha)$ denotes the generalized double Pareto distribution with density proportional to $(1 + |\cdot|/\eta)^{-(\alpha+1)}$. The first prior encourages sparsity on the individual coefficients and the second prior promotes sparsity on the differences between pairs of siblings with a common parent node. These priors lead to a generalized fused lasso-type penalty (She, 2010; Tibshirani and Taylor, 2011), but the GDP prior corresponds to a reweighted $\ell_1$ penalty rather than a plain $\ell_1$ penalty (as will be seen in (6)), which better approximates an $\ell_0$-like criterion (Candes et al., 2008).

For a full binary tree with $p$ leaf groups, the number of expanded parameters in $\gamma$ is $2p - 1$. A natural question is whether there exists a multivariate prior on $\gamma$ that could justify the compatibility of these two GDP priors. To see this, we use the latent variable representation of the GDP prior introduced in Armagan et al. (2013). The first level of the hierarchy is written as,

(3)
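The GDP prior admits a standard Gaussian scale mixture construction (Armagan et al., 2013); a sketch of the hierarchy presumably intended in (3), written for one generic hyperparameter pair $(\eta, \alpha)$, is:

```latex
% Gaussian scale mixture hierarchy whose marginal is GDP(\eta, \alpha):
\begin{align*}
\gamma_s \mid \tau_s \sim \mathcal{N}(0, \tau_s), \qquad
\tau_s \mid \lambda_s \sim \mathrm{Exp}(\lambda_s^2 / 2), \qquad
\lambda_s \sim \mathrm{Gamma}(\alpha, \eta),
\end{align*}
% with an analogous hierarchy on each sibling difference \gamma_{s_1} - \gamma_{s_2};
% marginalizing \tau_s and \lambda_s gives f(\gamma_s) \propto (1 + |\gamma_s|/\eta)^{-(\alpha+1)}.
```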

We postulate a multivariate normal prior for the vector of expanded parameters $\gamma$, conditional on the latent scale variables, whose log density differs from that of the priors in (3) only up to a constant. The entries of its precision matrix, as a function of the latent scales, can be found by completing the square. It takes a block-diagonal form, with (up to the root node) one $2 \times 2$ block per sibling pair,

(4)   $\Omega_s = \begin{pmatrix} \tau_{s_1}^{-1} + \omega_s^{-1} & -\omega_s^{-1} \\ -\omega_s^{-1} & \tau_{s_2}^{-1} + \omega_s^{-1} \end{pmatrix},$

where $\tau_{s_1}$ and $\tau_{s_2}$ are the latent scales attached to the sibling coefficients $\gamma_{s_1}$ and $\gamma_{s_2}$, and $\omega_s$ is the latent scale attached to their difference.

To complete the hierarchy of the multivariate GDP prior, we place exponential priors on the latent scale variables and gamma priors on the corresponding rate variables, as in the GSM representation above. We thus obtain a fused generalized double Pareto (fGDP) prior, which promotes the desired form of structured sparsity in $\gamma$, and enjoys a latent variable representation that makes parameter estimation straightforward. Integrating out the latent variables, we obtain the marginal density of the fGDP prior, whose logarithm takes the following form,

(5)   $\log p_{\mathrm{fGDP}}(\gamma) = c \, - \, (\alpha_1 + 1) \sum_{s} \log\Big(1 + \frac{|\gamma_s|}{\eta_1}\Big)$
(6)   $\phantom{\log p_{\mathrm{fGDP}}(\gamma) = c} \, - \, (\alpha_2 + 1) \sum_{(s_1, s_2)} \log\Big(1 + \frac{|\gamma_{s_1} - \gamma_{s_2}|}{\eta_2}\Big),$

where $c$ is a normalizing constant and the second sum ranges over sibling pairs with a common parent.

4.3 Multiscale Shrinkage in the Original Parameter Space

The tree-structured PX scheme introduced in Section 4.1 offers us a reparameterization, $\beta_j = \sum_{s \in \mathcal{P}(j)} \gamma_s$, which can be represented in matrix form as $\beta = A\gamma$, where $A$ is a design matrix with binary entries, $A_{js} = 1$ if and only if node $s$ lies on the path $\mathcal{P}(j)$. Each column of $A$ can be interpreted as a basis function that encodes piecewise smoothness at a different location and scale. For example, for a small binary tree, $A$ takes the form illustrated in the sketch below.
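A minimal sketch of the design matrix for a full binary tree with four leaves; the node ordering is an assumption made for illustration, not the paper's indexing.

```python
import numpy as np

def path_design_matrix(parent):
    """Binary design matrix A with A[j, s] = 1 iff node s lies on the
    root-to-leaf path of leaf j.  `parent[s]` is the parent of node s
    (-1 for the root); leaves are the nodes that never appear as parents."""
    nodes = range(len(parent))
    leaves = [s for s in nodes if s not in set(parent)]
    A = np.zeros((len(leaves), len(parent)), dtype=int)
    for j, leaf in enumerate(leaves):
        s = leaf
        while s != -1:                 # walk up to the root
            A[j, s] = 1
            s = parent[s]
    return A

# Full binary tree with 4 leaves: node 0 is the root, nodes 1-2 are internal,
# and nodes 3-6 are the leaves (an assumed ordering for illustration only).
parent = [-1, 0, 0, 1, 1, 2, 2]
A = path_design_matrix(parent)         # shape (4 leaves, 7 nodes)
gamma = np.array([0.5, 0.0, -0.3, 0.0, 0.0, 0.2, 0.2])
beta = A @ gamma                       # leaf coefficients via path sums
```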

Importantly, through PX, we have transformed the problem of multiscale shrinkage on the regression coefficients $\beta$ across multiple scales of the tree into a structured shrinkage problem on the expanded parameters $\gamma$, which can be conveniently addressed via the proposed fGDP prior. The hierarchical-Bayes representation of the multiscale shrinkage prior on $\beta$ can be obtained by integrating out $\gamma$: the conditional prior on $\beta$ given the latent variables remains Gaussian, and the priors on the latent variables do not change. However, the implied precision matrix no longer exhibits a sparse block-diagonal structure as in (4), and the resulting EM procedure involves intractable expectations.

4.4 Incorporating Random Effects

We further assume each replicate is collected within a time window of known length (the exposure). To accommodate potential overdispersion and dependencies, we incorporate random effects in the model. The vector Poisson log-linear mixed regression model is written as follows,

(7)

where the fixed-effect slope parameters play the same role as in the simple Poisson mixed regression model (Hall et al., 2011b). The fixed effects measure the common association between the predictors and the response, while the random effects allow replicates or primary groups to have their own baseline rates; we include total, column, and row random effects. Constraints are needed for the identifiability of the row and column scores, so we use the corner constraint (Yee and Hastie, 2003) in this article. Gaussian priors with unknown variance parameters are placed on the total, column, and row random effects.

Note that the sufficiency of the reduction established in Section 3.1 still holds conditional on the random effect terms (Taddy, 2013). In the next section, we introduce a penalized likelihood estimator under the conditional likelihood, with the proposed fGDP prior guided by the tree.
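For intuition, a toy generator for data of roughly this form; this is a simplified stand-in for model (7), with a single exposure per replicate and independent replicate- and predictor-level random effects, and it does not reproduce the paper's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_replicate(y, alpha, beta, exposure=90.0, sd_rep=0.3, sd_col=0.2):
    """Draw leaf-group counts for one replicate with response y.
    log-rate = log(exposure) + alpha_j + beta_j * y + u (replicate effect)
               + v_j (predictor effect)."""
    u = rng.normal(0.0, sd_rep)                    # replicate-level random effect
    v = rng.normal(0.0, sd_col, size=len(beta))    # predictor-level random effects
    log_rate = np.log(exposure) + alpha + beta * y + u + v
    return rng.poisson(np.exp(log_rate))

# Example: 4 leaf groups; the responses are goals scored in each replicate.
alpha = np.array([-3.0, -2.5, -2.8, -3.2])
beta = np.array([0.0, 0.0, 0.15, 0.15])            # the last two leaves could be fused
X = np.stack([simulate_replicate(y, alpha, beta) for y in (0, 1, 2, 3)])
```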

5 Parameter Estimation

The hierarchical-Bayes representation of the fGDP prior facilitates an iterative EM-type algorithm for penalized estimation with (6). We adopt the type-I estimation framework (Figueiredo, 2003), which treats the latent scale variables as missing data and the coefficients as parameters to optimize. The conditional likelihood of the model in (7) involves a high-dimensional integral over the random effects, which is nonanalytic. Alternatively, we take a Gaussian variational approximation (GVA) of the posteriors of the random effect variables, which provides a lower bound of the conditional log-likelihood. Statistical properties of GVA for generalized linear mixed models are studied in Hall et al. (2011a, b) and Ormerod and Wand (2012) from a likelihood-based perspective. The resulting GVA estimator differs from the MLE but is asymptotically valid.
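The computational convenience of the GVA rests on the fact that, for a Poisson log-linear term with a Gaussian variational posterior on the linear predictor, the expected log-likelihood is available in closed form; a small numerical check of this identity (our illustration, not the authors' code):

```python
import numpy as np
from scipy.special import gammaln

def gva_expected_loglik(x, m, s2):
    """E_q[ x*theta - exp(theta) - log(x!) ] under theta ~ N(m, s2);
    the only non-trivial expectation is E[exp(theta)] = exp(m + s2/2)."""
    return x * m - np.exp(m + s2 / 2.0) - gammaln(x + 1.0)

# Monte Carlo check of the closed form on a single count.
x, m, s2 = 3, 0.4, 0.25
theta = np.random.default_rng(1).normal(m, np.sqrt(s2), size=200_000)
mc = np.mean(x * theta - np.exp(theta)) - gammaln(x + 1.0)
print(gva_expected_loglik(x, m, s2), mc)   # the two values should agree closely
```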

In our setting, the alternating steps are guaranteed to increase a single common objective function, the variational lower bound of the penalized log-likelihood (Neal and Hinton, 1998). The variational distribution naturally decouples into a factorized form, in which the factor over the latent scale variables of the prior is left in free form and the factor over the random effects is parameterized as Gaussian with diagonal covariance. The overall algorithm contains the following alternating steps:

  • E-step: optimize the objective with respect to the free-form distribution of the latent scale variables in the shrinkage prior; the required expectations are available in closed form (Section 5.1).

  • Variational E-step: update the Gaussian variational parameters of the random-effect posteriors so that the lower bound increases (Section 5.2).

  • M-step: update the model parameters so that the penalized objective increases.

In this algorithm, the variational parameters and the model parameters are updated with gradient updates instead of exact maximization (Lange, 1995a, b). Therefore, it is a generalized EM algorithm (Dempster et al., 1977; Neal and Hinton, 1998), as both the E-step and the M-step are taken partially. Note that the latent variables only appear in the prior, and the random effect terms only appear in the likelihood, so we can discuss them separately.
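A skeleton of this alternating scheme is sketched below; the three callbacks stand for the steps listed above and are left abstract, so the driver is generic rather than the authors' implementation.

```python
def generalized_em(init, e_step, variational_e_step, m_step, n_iter=200):
    """Generalized EM driver (Dempster et al., 1977; Neal and Hinton, 1998):
    every step only needs to increase the common objective, so partial
    (gradient) updates suffice in place of exact maximization.

    e_step(state)                  -> expectations of the latent scale variables
    variational_e_step(state, exp) -> updated Gaussian variational parameters
    m_step(state, exp)             -> updated model parameters
    `state` is a dict holding the current values of all parameters."""
    state = dict(init)
    for _ in range(n_iter):
        expectations = e_step(state)                        # closed form (Section 5.1)
        state["variational"] = variational_e_step(state, expectations)
        state["params"] = m_step(state, expectations)       # partial M-step
    return state
```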

5.1 Closed-Form Expectations in the Shrinkage Prior

We compute the expected value with respect to the latent scale variables in the complete log-posterior, given the current parameter estimates and the observed data. The entropy term does not depend on the model parameters, so the relevant term in the E-step is the expected log-prior of the expanded coefficients under the Gaussian scale mixture (GSM) representation of the GDP prior.

Under the GSM representation, each coefficient $\gamma_s$, and each pairwise difference between siblings, is conditionally Gaussian given its own latent scale. Given the estimates from the previous iteration, the conditional posterior of the latent variables factorizes over nodes and sibling pairs. Integrating out the intermediate rate variables, the conditional prior of each coefficient given its latent rate is a Laplace distribution, and the conditional posterior of each latent scale given the current coefficient estimate is a Generalized Inverse Gaussian (GIG) distribution, whose normalizing constant involves the modified Bessel function of the second kind.

Similarly, given the current estimates of the sibling differences, the conditional posterior of the latent variables associated with the differences also factorizes, and each again follows a GIG distribution. Therefore,

(8)

We only need to find the conditional expectations of the inverse latent scales. According to the change of variables formula, we have

(9)
(10)

and, similarly, denoting the pairwise differences between sibling coefficients, we obtain

(11)

The GSM representation of the GDP priors determines a reweighting rule, as shown in (10) and (11), in which the weights depend only on the current estimate of the parameter (or its differences) and on the hyperparameters.
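For intuition, the standard EM reweighting rule for a GDP prior (Armagan et al., 2013) produces weights of the form below; the exact expressions in (9)-(11) may differ, so treat this as an assumed sketch rather than the paper's formula.

```python
import numpy as np

def gdp_weights(gamma, alpha=1.0, eta=1.0):
    """Reweighted-l1 weights for a GDP prior: w = (alpha + 1) / (eta + |gamma|).
    Small |gamma| receives a large weight (strong shrinkage toward zero),
    large |gamma| receives a weight near zero (little shrinkage)."""
    return (alpha + 1.0) / (eta + np.abs(gamma))

gamma = np.array([0.0, 0.05, 1.0, 5.0])
diffs = np.diff(gamma)                 # stand-in for sibling differences
print(gdp_weights(gamma))              # weights on individual coefficients
print(gdp_weights(diffs))              # weights on the differences
```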

The penalty terms in (8) can be organized into a quadratic form in $\gamma$, where the block-diagonal weight matrix can again be found by completing the square: each block combines the individual weights of a sibling pair with the weight on their difference. The resulting quadratic penalty is the structured penalty that favors models with simpler structures conforming to the tree. Its contribution to the objective is concave and differentiable, which makes gradient-based optimization methods suitable.
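A sketch of how one such $2 \times 2$ block can be assembled for a sibling pair by completing the square over the two individual terms and the difference term (our illustration of the algebra, not the paper's exact matrix):

```python
import numpy as np

def sibling_block(w_left, w_right, w_diff):
    """2x2 block B such that
    [gl, gr] @ B @ [gl, gr]^T = w_left*gl**2 + w_right*gr**2 + w_diff*(gl - gr)**2."""
    return np.array([[w_left + w_diff, -w_diff],
                     [-w_diff, w_right + w_diff]])

gl, gr = 0.7, -0.2
B = sibling_block(1.0, 2.0, 0.5)
lhs = np.array([gl, gr]) @ B @ np.array([gl, gr])
rhs = 1.0 * gl**2 + 2.0 * gr**2 + 0.5 * (gl - gr) ** 2
print(np.isclose(lhs, rhs))            # True: the block reproduces the penalty
```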

5.2 Gaussian Variational Approximation of the Likelihood

In our case, the log-likelihood for the Poisson mixed model in (7) is,

The log-priors for the random effect terms are

Let the Gaussian variational proposals for the random effects be indexed by mean parameters and strictly positive variance parameters. (For the random effects fixed to zero by the corner constraint, the corresponding terms are dropped from the likelihood for notational convenience.) Assuming the variational proposals are independent, the lower bound of the conditional log-likelihood is