1 Introduction
Many real-world applications are filled with interaction events among multiple entities or objects, e.g., purchases involving customers, products and shopping pages at Amazon.com, and tweets exchanged between Twitter users and messages. Embedding these events, namely, learning a representation of the participating objects that encodes their complex relationships, is of great importance and interest for discovering hidden patterns in data, e.g., clusters and outliers, and for performing downstream tasks, such as recommendation and online advertising.
While Poisson tensor factorization is a popular framework for representation learning of such events, current methods, e.g., (Chi and Kolda, 2012; Hansen et al., 2015; Hu et al., 2015b; Schein et al., 2015, 2016, 2019), are mostly based on a multilinear factorization form, e.g., (Tucker, 1966; Harshman, 1970), and might therefore be inadequate for estimating complex, nonlinear temporal relationships in data. More importantly, existing methods overlook the structural sparsity underlying these events. That is, the observed interactions are far fewer than all the possible interactions among the participants (e.g., 0.01%). Many factorization models rely on tensor algebra and demand that all the tensor entries (i.e., interactions) be observed (Kolda and Bader, 2009; Kang et al., 2012; Choi and Vishwanathan, 2014). Even the entrywise factorization models (Rai et al., 2014; Zhao et al., 2015; Du et al., 2018) are, from the Bayesian viewpoint, equivalent to first generating the entire tensor and then marginalizing out the unobserved entries. In practice, however, the observed interactions are often very sparse, and their proportion can shrink even further as the number of objects increases. For example, in online shopping, as users and items grow, the number of actual purchases (while growing) takes an ever smaller percentage of all possible purchases, i.e., all (user, item, shopping page) combinations, because the latter grow much faster.

In this paper, we propose NESH, a novel nonparametric Poisson factorization approach for embedding high-order interaction events. Not only does NESH flexibly estimate various nonlinear temporal relationships among the participants, it can also capture the structural sparsity of the observed interactions, absorbing both the structural traits and the hidden relationships into the embeddings. Our major contributions are the following:

Model. We hybridize the recent sparse tensor (hypergraph) processes (STP) (Tillinghast and Zhe, 2021) and matrix Gaussian processes (MGP) to develop a sparse event model, in which the embeddings are responsible for both generating the interactions and modulating the event rates, and hence can jointly encode the temporal relationships and the sparse structural knowledge.

Theory. We use a Poisson tail estimate, Bernstein’s inequality and L’Hôpital’s rule to prove strong asymptotic bounds on the sparsity ratio, including both a lower and an upper bound. The prior work of Tillinghast and Zhe (2021) only shows that the sparsity ratio of the sampled tensors asymptotically converges to zero, without giving an estimate of the convergence rate. Our new result reveals more theoretical insight into how STP produces sparse structures, and can also characterize the classical sparse graph generation models (Caron and Fox, 2014; Williamson, 2016).

Algorithm. We use the stick-breaking construction of the normalized hypergraph process to compute the embedding prior, and then use batch normalization and the variational sparse GP framework to develop an efficient and scalable model estimation algorithm.
For evaluation, we conducted simulations to demonstrate that our theoretical bounds indeed match the actual sparsity ratio and capture the asymptotic trend; hence they provide a reasonable convergence rate estimate and characterize the behavior of the prior. We then tested our approach NESH on three real-world datasets. NESH achieves much better predictive performance than existing methods that use Poisson tensor factorization, additional time steps, local time-dependency windows and triggering kernels. NESH also outperforms the same model with the sparse hypergraph prior removed, which demonstrates the importance of accounting for structural sparsity. Finally, we examined the embeddings estimated by NESH and found interesting patterns, including clusters of users, sellers and item categories in online shopping, and groups of states where fatal car crashes occurred.
2 Background
We assume that we observed way interactions among types of objects or entities (e.g., customers, products and sellers). We denote by the number of objects of type , and index each object by (). We then index a particular interaction by a tuple . We may observe multiple occurrences of a particular interaction; we denote the sequence of these events by , where is the timestamp of the th event () and is the total number of occurrences of . Suppose we have observed the events of a collection of interactions, ; we aim to learn an embedding for each participating object. Note that one object may participate in multiple, distinct interactions. We denote by the embedding of object of type , which is an dimensional vector. We stack the embeddings of the objects of type into an embedding matrix , and denote by all the embedding matrices.

To estimate the embeddings from , a popular approach is tensor factorization. We can accordingly introduce a mode tensor, where each mode includes objects, and each entry corresponds to an event sequence
. For event modeling, we can use the popular (homogeneous) Poisson process, and the probability of is given by (1)
where is the total time span of all the event sequences, and is the rate (or intensity) of the interaction . Since the probability is determined only by the event count, we can place the count value in the corresponding entry of and perform count tensor factorization. Classical tensor factorization approaches include Tucker decomposition (Tucker, 1966), CANDECOMP/PARAFAC (CP) decomposition (Harshman, 1970), etc. Tucker decomposition assumes , where is a parametric core tensor, are embedding matrices, and is the tensor-matrix product at mode (Kolda, 2006), which is very similar to the matrix-matrix product. If we set all and constrain to be diagonal, Tucker decomposition reduces to CP decomposition. While numerous tensor factorization algorithms have been developed, e.g., (Chu and Ghahramani, 2009; Kang et al., 2012; Choi and Vishwanathan, 2014), most of them inherit the CP or Tucker form. To perform count tensor factorization, we can use the Poisson process likelihood (1) for each entry, and apply the Tucker/CP decomposition to the rates or log rates (Chi and Kolda, 2012; Hu et al., 2015b). A more refined strategy is to further partition the events into a series of time steps, e.g., by weeks or months, augment the count tensor with a time-step mode (Xiong et al., 2010; Schein et al., 2015, 2016, 2019), and jointly estimate the embeddings of these steps, . We can also model the dependencies between the time steps with some dynamics, e.g., (Xiong et al., 2010).
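To make the setup above concrete, here is a minimal sketch (not the paper's implementation) of evaluating the homogeneous Poisson count likelihood (1) under a CP factorization of the rate tensor. The tensor sizes, CP rank, and observation window are assumed toy values.

```python
import numpy as np
from math import lgamma

rng = np.random.default_rng(0)

# Toy 3-mode count tensor: 4 users x 3 items x 2 pages (hypothetical sizes).
Y = rng.poisson(2.0, size=(4, 3, 2))

R = 5  # CP rank, i.e., embedding dimension (assumed)
U = [rng.gamma(1.0, 1.0, size=(n, R)) for n in Y.shape]  # non-negative embeddings

# CP reconstruction of the rate tensor: lam[i,j,k] = sum_r U0[i,r]*U1[j,r]*U2[k,r].
lam = np.einsum('ir,jr,kr->ijk', U[0], U[1], U[2])

# Homogeneous Poisson log-likelihood of the counts over a window of length T,
# cf. Eq. (1): log p(Y) = sum [ Y*log(lam*T) - lam*T - log(Y!) ].
T = 1.0
loglik = float(np.sum(Y * np.log(lam * T) - lam * T)
               - sum(lgamma(int(y) + 1) for y in Y.flat))
print(loglik)
```

In an actual fitting procedure one would maximize this log-likelihood (or a Bayesian analogue) with respect to the embedding matrices.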
3 Model
Despite the success of existing Poisson tensor factorization approaches, they might be restricted in that (1) the commonly used CP/Tucker factorization of the rates is multilinear in the embeddings and therefore cannot capture more complex, nonlinear relationships between the interaction participants; (2) the homogeneity assumption, i.e., a constant event rate, might be oversimplified, overlooking temporal variations of the rates and hence missing critical temporal patterns. More importantly, (3) in many real-world applications, the present interactions are very sparse when contrasted with all possible interactions. For example, despite the massive number of online transactions, the ratio between the number of actual transactions and all possible transactions (i.e., all (customer, product, seller) combinations) is tiny (see the Amazon data samples at http://jmcauley.ucsd.edu/data/amazon/ and the dataset information in our experiments in Sec. 6), slightly above zero (e.g., ). This proportion can get even smaller with the growth of customers, products and sellers, because their combinations grow much faster. Similar observations can be made for clicks in online advertising, message tweeting, etc. Existing methods, however, are not aware of this data sparsity, and lack an effective modeling framework to embed the underlying sparse structures. To overcome these limitations, we propose NESH, a novel nonparametric embedding model for sparse high-order interaction events, presented as follows.
3.1 Nonparametric Sparse Event Modeling for High-Order Interactions
First, to highlight the sparse structure within the observed events, we view the participants as nodes, and their interactions as way hyperedges in a hypergraph. Each edge connects participants (nodes), corresponding to a particular interaction . Attached to is a sequence of events — the occurrence history of . Our goal is to learn an embedding for each node that can not only estimate the complex temporal relationships between the nodes, but also capture the traits of the sparse hypergraph structure. To this end, we follow (Tillinghast and Zhe, 2021; Caron and Fox, 2014) and construct a stochastic process to sample the hypergraph, with a guarantee of sparsity in the asymptotic sense. Specifically, for each node type , we sample a set of Gamma processes (ΓPs) (Hougaard, 1986; Brix, 1999) to represent an infinite number of nodes and their weights,
(2) 
where is a Lebesgue base measure confined to . Next, we use these ΓPs to construct a product-measure sum, which serves as the mean measure of a Poisson point process (PPP) (Kingman, 1992) that represents the sampled edges of the hypergraph,
(3) 
Accordingly, has the following form,
where each point represents a hyperedge (interaction), is the set of all the sampled points, is the count of point , represents the location of that point and comes from the ΓPs, and is the Dirac measure. In essence, the ΓPs sample infinitely many nodes for each type , and the PPP then picks nodes from each type to sample the hyperedges, i.e., multi-way interactions.
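The generative recipe just described can be sketched with a finite truncation of the ΓPs. The truncation level, concentration value, and number of modes below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

K = 3          # number of modes/types (assumed)
N = 50         # truncation level: atoms per Gamma process (assumed)
alpha = 2.0    # concentration of each Gamma process (assumed)

# Truncated Gamma processes: N atoms with Gamma(alpha/N, 1) weights per mode.
W = [rng.gamma(alpha / N, 1.0, size=N) for _ in range(K)]

# The mean measure of the PPP is the product measure; its total mass is the
# product of the per-mode total weights.
total_mass = np.prod([w.sum() for w in W])

# Sample the PPP: a Poisson number of points, each placed independently by
# picking one atom per mode with probability proportional to its weight.
n_points = rng.poisson(total_mass)
edges = set()
for _ in range(n_points):
    edge = tuple(rng.choice(N, p=w / w.sum()) for w in W)
    edges.add(edge)

print(len(edges), n_points)  # distinct hyperedges vs. sampled points
```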
To examine the sparsity, we look into the nodes in the sampled hyperedges , which we refer to as “active” nodes; they are the participants of the interactions. For example, if an edge is sampled, then nodes 2, 3 and 1 (of types 1, 2 and 3, respectively) are active. Denote by the number of distinct active nodes of type . If we connect all the active nodes across the types, we have hyperedges (interactions) in total, i.e., the volume. Sparsity means that the proportion of sampled edges among all possible edges is very small, and that the former grows more slowly than the latter as the number of active nodes increases. Denote by the number of sampled edges. The sparsity is guaranteed by
Lemma 3.1 (Corollary 3.1.1 (Tillinghast and Zhe, 2021)).
almost surely as , i.e.,
Note that this is an asymptotic notion of structural sparsity — as the hypergraph volume grows (i.e., increasing ), the proportion of sampled edges tends to zero. It differs from other notions (Choi and Vishwanathan, 2014; Hu et al., 2015a) in which sparsity means the data are dominated by zero values.
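This asymptotic notion can be checked empirically under the same truncated simulation: count the distinct sampled edges and compare against the volume spanned by the active nodes. Everything here (truncation level, concentration values) is an assumed toy setup, not the paper's experiment.

```python
import numpy as np

def sparsity_ratio(alpha, N, K=3, seed=0):
    """Sample a truncated hypergraph process and return |edges| / volume.

    alpha: Gamma-process concentration (assumed); N: truncation level.
    """
    rng = np.random.default_rng(seed)
    W = [rng.gamma(alpha / N, 1.0, size=N) for _ in range(K)]
    n_points = rng.poisson(np.prod([w.sum() for w in W]))
    points = [tuple(rng.choice(N, p=w / w.sum()) for w in W)
              for _ in range(n_points)]
    edges = set(points)
    if not edges:
        return 1.0
    # Active nodes per mode, and the resulting volume (product of the counts).
    active = [len({e[k] for e in edges}) for k in range(K)]
    volume = np.prod(active)
    return len(edges) / volume

# A small and a larger configuration; the ratio is typically smaller for the
# larger one, reflecting the asymptotic sparsity.
print(sparsity_ratio(2.0, 50), sparsity_ratio(10.0, 200))
```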
Given the sparse hypergraph prior, we then sample the observed edges (interactions) and the associated events, . Since these are always finite, we can use the standard PPP construction (Kingman, 1992) to sample them, which is computationally much more convenient and efficient. Specifically, we normalize the mean measure in (3) to obtain a probability measure, and use it to sample the points (i.e., edges/interactions) independently. To normalize , we first need to normalize each ΓP , which gives a Dirichlet process (DP) (Ferguson, 1973), with strength and base measure the normalized base measure of , i.e., a uniform distribution on , (4)
where and . The normalized is . To capture rich structural information, we follow “Model-II” in (Tillinghast and Zhe, 2021) and sample multiple DP weights for each node. Specifically, we drop the locations and only sample the weights, which follow the GEM distribution (Griffiths, 1980; Engen, 1975; McCloskey, 1965), and obtain
(5) 
Accordingly, we construct a probability measure over all possible edges (interactions),
(6) 
where . We then sample each observed interaction , and the probability is
(7) 
Now, it can be seen from (4) and (5) that, for each node of type , we have sampled a set of weights from DPs. From (6), we can see that these weights reflect how actively the node interacts with other nodes (of different types). Each weight naturally represents the sociability in one community/group, and these communities are overlapping. We use these sociabilities to construct the embeddings of the nodes. Therefore, the embeddings encode the sparse structural information underlying the observed interactions. (Model-II in (Tillinghast and Zhe, 2021) actually makes an additional adjustment on top of (3). According to the superposition theorem, . This means (3) essentially samples hypergraphs independently and places them together. Model-II further performs a probabilistic merge of these hypergraphs. To see this, note from (5) that the nodes of each type in each hypergraph are indexed by the same set of integers (), and so all the possible edges in each hypergraph are indexed by the same set of index tuples. From (6) and (7), the probability of sampling a particular edge indexed by is . This can be explained by the following merging procedure. We randomly select one hypergraph (with probability ) and check whether edge has been sampled in that hypergraph. If it has, we add edge to the new graph; otherwise, we do not. Since in each hypergraph the probability of edge being sampled is , the overall probability of sampling the edge in the new graph is the average, . It is easy to see that the merged hypergraph is still asymptotically sparse: since these hypergraphs can be viewed as independently sampled over the same set of nodes, each of which is asymptotically sparse, their superposition is also asymptotically sparse. The benefit is that, since we align these hypergraphs via the integer indices of the nodes, we can assign multiple sociabilities to each node and thus better capture the rich structural information.)
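The Model-II construction can be illustrated numerically: sample several truncated DP weight vectors per mode, average the per-copy edge probabilities, and verify that the result is a valid distribution over edges. The sizes and the Gamma-based DP truncation are assumptions for illustration only.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)

K, N, S = 3, 10, 4   # modes, nodes per mode (truncated), DP copies (assumed)
alpha = 1.0

# S independent (truncated) DP weight vectors per mode, via normalized Gammas.
G = rng.gamma(alpha / N, 1.0, size=(K, S, N))
P = G / G.sum(axis=2, keepdims=True)   # P[k, s] is a probability vector over nodes

def edge_prob(edge):
    """Probability of edge (i1,...,iK): the average over the S copies of the
    per-copy product of DP weights, as in the merging procedure above."""
    return np.mean([np.prod([P[k, s, edge[k]] for k in range(K)])
                    for s in range(S)])

# Sanity check: the probabilities over all possible edges sum to one.
total = sum(edge_prob(e) for e in product(range(N), repeat=K))
print(total)  # ~1.0
```

This works because, for each copy s, the product of per-mode probability vectors sums to one over all edges, so the average over copies does too.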
Given the sampled interactions , we then sample their occurring events . To flexibly capture the temporal patterns, we use non-homogeneous Poisson processes. For each observed interaction , we consider a raw rate function , and then link it to a positive rate function by taking the square, . Note that we could also use , which, however, performed worse in our experiments. To capture the complex relationships among the rate functions and their temporal variations, we jointly sample the collection of rate functions, , from a matrix Gaussian process (MGP) (Rasmussen and Williams, 2006),
(8) 
where is the (row) covariance function across different interactions (hyperedges), whose inputs are the embeddings of the participant nodes, , and is the (column) covariance function over time. We can choose nonlinear kernels for and to capture the complex relationships and temporal dependencies within . Given the rate functions, we then sample the observed event sequences from
(9) 
where is the total time span across all the event sequences.
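A small sketch of this likelihood: draw a raw rate function from a GP on a time grid, square it to get a valid rate, and evaluate the non-homogeneous Poisson log-likelihood of a few toy events, approximating the integral numerically. The kernel parameters and event times are made-up values.

```python
import numpy as np

rng = np.random.default_rng(3)

# Time grid over the observation window [0, T].
T = 10.0
t = np.linspace(0.0, T, 200)

# Draw a raw rate function f from a GP with an SE kernel (lengthscale and
# variance are assumed values), then square it: the rate is lam = f**2 >= 0.
ls, var = 2.0, 1.0
K = var * np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2 / ls ** 2)
f = rng.multivariate_normal(np.zeros_like(t), K + 1e-8 * np.eye(len(t)))
lam = f ** 2  # the square link guarantees a non-negative rate

# Non-homogeneous Poisson log-likelihood, cf. Eq. (9):
#   sum_j log lam(t_j) - int_0^T lam(t) dt   (trapezoidal approximation).
events = np.array([1.3, 4.7, 8.2])  # toy event timestamps
lam_at_events = np.interp(events, t, lam)
integral = np.sum(0.5 * (lam[1:] + lam[:-1]) * np.diff(t))
loglik = np.sum(np.log(lam_at_events + 1e-12)) - integral
print(loglik)
```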
3.2 Theoretical Analysis of Sparsity
Although Tillinghast and Zhe (2021) proved the asymptotic sparsity guarantee of the hypergraph process in (3) (referred to as the sparse tensor process in their paper), i.e., Lemma 3.1, the conclusion is coarse: we have no idea how the sparsity of the sampled hypergraph varies as more and more nodes become active. We only know that, in the limit, the sparsity ratio becomes zero. While Caron and Fox (2014) gave a convergence rate estimate for their binary graph generative models under a similar modeling framework, the estimate is only available when using generalized ΓPs (GΓPs) (Hougaard, 1986) within a particular parameter range (see Theorem 10 of their paper). The estimate is not available for the popular ordinary ΓPs used in our model. Moreover, GΓPs cannot be normalized into DPs and are much less convenient for computation and inference. To extract more theoretical insight, we prove asymptotic bounds on the sparsity ratio of our hypergraph process, which not only deepen our understanding of the properties of the sampled structures, but also fill this gap in prior work.
Lemma 3.2.
For a sparse hypergraph process defined as in (3), for all sufficiently large , there exists an absolute constant such that, with probability at least ,
Proof sketch. We first use concentration inequalities, including a Poisson tail estimate (Vershynin, 2018) and Bernstein’s inequality, together with L’Hôpital’s rule, to bound the measure of each ΓP on , and then take a union bound over to obtain an upper bound on . The lower bound is more technical, and requires a careful estimate of the support of the intensity measure that appears in the sampled entries with high probability, for which we apply a novel probabilistic argument. We mainly combine Poisson tail estimates, union bounds, the Bernoulli distribution and L’Hôpital’s rule to bound each , and then derive the lower bound on . We leave the details to the Appendix.

4 Algorithm
The matrix GP in our model, coupled with the DPs, is computationally costly. When the number of present interactions and/or the number of their events is large, we must compute a huge row and/or column covariance matrix, along with its inverse and determinant, which is very expensive or even infeasible. To address these issues, we use the stick-breaking construction (Sethuraman, 1994), the sparse variational GP approximation (Titsias, 2009; Hensman et al., 2013) and batch normalization (Ioffe and Szegedy, 2015) to develop an efficient, scalable variational inference algorithm.
Specifically, we use the stick-breaking construction to sample the DP weights (the GEM distribution),
(10) 
where . Therefore, we only need to estimate the stick-breaking variables , from which we can directly calculate the weights (i.e., sociabilities). Since these weights can be very small and close to zero, we use their logarithms to construct the embedding of each node of type , .
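A minimal sketch of this step, assuming a finite truncation and Beta-distributed sticks (the Beta parameters are illustrative): compute the DP weights from the stick-breaking variables, then take logarithms to form the embedding.

```python
import numpy as np

rng = np.random.default_rng(4)

def stick_breaking_weights(v):
    """Map stick-breaking variables v_1..v_R in (0,1) to DP weights,
    w_r = v_r * prod_{s<r} (1 - v_s), cf. Eq. (10)."""
    v = np.asarray(v)
    cum = np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])
    return v * cum

R = 8                                 # truncated embedding dimension (assumed)
v = rng.beta(1.0, 3.0, size=R)        # Beta(1, alpha) sticks (alpha assumed)
w = stick_breaking_weights(v)
embedding = np.log(w)                 # log sociabilities, as in the text

print(w.sum(), embedding)
```

With a finite truncation the weights sum to strictly less than one; the remaining mass belongs to the truncated tail.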
Next, to conveniently handle the matrix GP prior in (8), we unify all the raw rate functions into a single function of the embeddings and time, , over which we assign a GP prior with a product covariance (kernel) function, . This is computationally equivalent to (8), because they share the same covariance function, but we only need to deal with one function. Accordingly, the function values at the event timestamps (across all the interactions),
follow a multivariate Gaussian distribution,
(11) 
where each row of consists of an embedding and a timestamp . Combining with (7), (9) and (10), the joint probability of our model is
(12)  
where . The stick-breaking variables associated with inactive nodes, i.e., nodes that do not participate in any interaction, have been marginalized out.
Next, to dispense with the huge covariance matrix in (12) for , we use the sparse variational GP approximation (Hensman et al., 2013) to develop a variational inference algorithm. Specifically, we introduce a small set of pseudo inputs for , where is far smaller than the dimension of . We then define the pseudo outputs . We augment our model by jointly sampling . Due to the GP prior over , follows a multivariate Gaussian distribution that can be decomposed as , where , is a conditional Gaussian distribution, and . The probability of the augmented model has the following form,
(13) 
where
is the likelihood of the events. Compared with (12), we just replace the Gaussian prior over with the joint Gaussian prior over . If we marginalize out , we recover the original distribution (12). Now, we construct a variational evidence lower bound (ELBO) to avoid computing the covariance matrix . To this end, we introduce a variational posterior for , , where and is a lower triangular matrix. Note that is essentially a Cholesky factor, which we use to ensure the positive definiteness of the posterior covariance matrix. We then derive the ELBO
Now we can see that the full conditional Gaussian distribution is canceled. We only need to calculate the covariance matrix of , which is very small; hence, the cost is greatly reduced. The detailed ELBO is given by
where , and
is the Kullback-Leibler divergence. We maximize
to estimate the variational posterior and the other parameters, including the stick-breaking variables , kernel parameters, etc. Due to the additive structure over both the interactions and their events, it is straightforward to combine this with the reparameterization trick (Kingma and Welling, 2013) to perform efficient stochastic mini-batch optimization.

However, since our embeddings are constructed from the logarithms of the sociabilities (in ), , and these sociabilities are often small and close to zero, their log scale can be quite large in magnitude, e.g., in the hundreds. As a result, when we feed the input to the GP kernel (e.g., we used the SE kernel in the experiments), it is easy to incur numerical issues or cause the kernel matrix to collapse to a nearly diagonal form. To address this issue, we use the batch normalization method (Ioffe and Szegedy, 2015)
. That is, we jointly estimate an (empirical) mean and standard deviation for each embedding element during the stochastic mini-batch optimization; denote them by and . Each time, we first normalize each by , and then feed the result to the kernel; and are jointly updated with all the other parameters using stochastic gradients. We empirically found that the numerical problem disappears and the learning is effective (see Sec. 6).
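The normalization step can be sketched as follows. This is a simplified NumPy stand-in, not the paper's PyTorch implementation; the momentum constant and the running-statistics update rule are assumptions in the spirit of batch normalization.

```python
import numpy as np

class EmbeddingBatchNorm:
    """Running-statistics normalization for log-sociability embeddings,
    in the spirit of batch normalization (Ioffe and Szegedy, 2015)."""

    def __init__(self, dim, momentum=0.9, eps=1e-5):
        self.mean = np.zeros(dim)   # running mean, updated each mini-batch
        self.var = np.ones(dim)     # running variance
        self.momentum, self.eps = momentum, eps

    def __call__(self, x):
        # x: (batch, dim) mini-batch of embeddings (e.g., log sociabilities).
        m, v = x.mean(axis=0), x.var(axis=0)
        self.mean = self.momentum * self.mean + (1 - self.momentum) * m
        self.var = self.momentum * self.var + (1 - self.momentum) * v
        return (x - m) / np.sqrt(v + self.eps)

bn = EmbeddingBatchNorm(dim=3)
# Log sociabilities are large-magnitude negatives; normalization rescales them.
x = np.log(np.random.default_rng(5).dirichlet(np.ones(3), size=16))
z = bn(x)
print(z.mean(axis=0), z.std(axis=0))  # ~0 mean, ~1 std per dimension
```

The normalized values then go into the GP kernel, which keeps the kernel matrix well-conditioned regardless of the raw scale of the log sociabilities.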
Algorithm Complexity. The time complexity of our inference algorithm is , where is the total number of events. Since , the computational cost is linear in . The space complexity is , including the storage of the prior and posterior covariance matrices of the pseudo outputs and the embeddings .
5 Related Work
It is natural to represent high-order interactions by multidimensional arrays, i.e., tensors. Tensor factorization is the fundamental framework for tensor analysis. Classical tensor factorization approaches include CP (Harshman, 1970) and Tucker (Tucker, 1966) decomposition, based on which numerous other methods have been proposed, e.g., (Chu and Ghahramani, 2009; Kang et al., 2012; Yang and Dunson, 2013; Choi and Vishwanathan, 2014; Du et al., 2018; Fang et al., 2021a), to name a few. Recently, nonparametric and/or neural network factorization models
(Zhe et al., 2015, 2016b, 2016a; Liu et al., 2018; Pan et al., 2020b; Tillinghast et al., 2020; Fang et al., 2021b; Tillinghast and Zhe, 2021) were developed to estimate nonlinear relationships in data, and have shown advantages in prediction accuracy over popular multilinear methods. When dealing with temporal information, existing methods mainly use homogeneous Poisson processes and decompose the event counts (Chi and Kolda, 2012; Hansen et al., 2015; Hu et al., 2015b). More advanced approaches further partition the timestamps into different steps, and perform count factorization across the time steps (Xiong et al., 2010; Schein et al., 2015, 2016, 2019). Recently, Zhe and Du (2018) used Hawkes processes to estimate the local triggering effects between events, and modeled the triggering strength with a kernel of the embeddings of the interactions. Pan et al. (2020a) modeled the time decay as another kernel of the embeddings, and developed scalable inference for long event sequences. Wang et al. (2020) proposed a non-Hawkes, non-Poisson process to estimate both the triggering and inhibition effects between events. All of these are temporal point processes that focus on rate modeling, and differ from the PPPs (with mean measure) in NESH, which sample sparse interaction structures. Lloyd et al. (2015) proposed GP-modulated Poisson processes and also used the square link to ensure a positive rate function. However, that work is purely about event modeling and does not learn any embeddings. With the SE kernel, it derives an analytical form of the ELBO. In contrast, since our model includes the embeddings (log sociabilities) in the GP prior, the ELBO is analytically intractable, and we use the reparameterization trick to conduct stochastic optimization. Recently, Pan et al. (2021) proposed a self-adaptable point process for event modeling, which can estimate both the triggering and inhibition effects within the events.
More importantly, they construct a GP-based component to enable a nonparametric estimate of the time decays of these effects. Their point process is no longer a Poisson process.

Our hypergraph prior is inherited from the sparse tensor process in (Tillinghast and Zhe, 2021), which can be viewed as a multidimensional extension of the pioneering work of Caron and Fox (2014, 2017), who first used completely random measures (CRMs) (Kingman, 1967, 1992; Lijoi et al., 2010), such as Gamma processes (ΓPs) (Hougaard, 1986), to generate sparse random graphs. However, these prior works only show the asymptotic sparsity guarantee (i.e., the sparsity ratio converges to zero in the limit), without giving any convergence rate estimate for the popular ΓPs, which are convenient for inference and computation. Our work fills this gap by giving strong asymptotic bounds on the sparsity, including both a lower and an upper bound, which reveal more refined insight into these sparse priors. Furthermore, we couple the sparse prior with matrix GPs to jointly sample the interactions (i.e., hyperedges) and their event sequences. In this way, the embeddings can assimilate both the sparse structural information underlying the present interactions and the hidden temporal relationships of the participants.
6 Experiment
6.1 Sparsity Ratio Investigation
We first examined whether our theoretical bounds in Lemma 3.2 match the actual sparsity ratio of the sampled hypergraphs. To this end, we followed (Tillinghast and Zhe, 2021) to sample a series of hypergraphs with three-way edges (interactions), namely . We set and varied in . For each particular , we independently sampled hypergraphs and computed the average ratio of sampled edges. We calculated the bounds accordingly. We show the results in a log-log plot in Fig. 1. As we can see, the bounds bracket the actual sparsity ratio and match the trend well; the upper bound is tighter. Hence, these bounds provide a reasonable estimate of the convergence rate and characterize the asymptotic behavior of the structural sparsity.
6.2 Predictive Performance
Datasets. We then examined the predictive performance of NESH on the following real-world datasets. (1) Taobao (https://tianchi.aliyun.com/dataset/dataDetail?dataId=53), shopping events from the largest online retail platform in China, from 07/01/2015 to 11/30/2015, which are interactions among users, sellers, items, categories and options. There are in total distinct interactions and events. (2) Crash (https://www.kaggle.com/usdot/nhtsatrafficfatalities), fatal traffic crashes in the US in 2015, involving states, counties, cities and land-use types. There are distinct interactions and events in total. (3) Retail (https://tianchi.aliyun.com/dataset/dataDetail?dataId=37260), online retail records from tmall.com, including interaction events between stock items and customers. We have and unique items and customers, among which there are distinct interactions and events in total. All these datasets are very sparse: the observed interactions account for , and of all possible interactions on Taobao, Crash and Retail, respectively.
Competing Methods. We compared with the following popular and/or state-of-the-art tensor decomposition methods that handle interaction events. (1) CP-PTF, similar to (Chi and Kolda, 2012): homogeneous Poisson process (PP) tensor decomposition, which uses CP to factorize the event rate of each interaction and the square link to ensure positivity (consistent with NESH). (2) CP-TPTF, similar to (Schein et al., 2015), which extends CP-PTF by introducing time steps into the tensor; the embeddings of the time steps are assigned a conditional Gaussian prior (Xiong et al., 2010) to model their dynamics. (3) GP-PTF, which uses GPs to estimate the square root of the rate of each interaction as a nonlinear function of the associated embeddings. (4) CP-NPTF, non-homogeneous Poisson process tensor factorization, where the event rate is modeled in a parametric form, ; here is the CP decomposition of the entry (interaction) . (5) HP-Local (Zhe and Du, 2018), Hawkes process based decomposition that uses a local time window to model the rate and to estimate the local excitation effects among nearby events. (6) HP-TF (Pan et al., 2020a), another Hawkes process based factorization method, which models both the triggering strength and the decay as kernel functions of the embeddings. Both HP-Local and HP-TF use a GP to model the base rate as a function of the embeddings. In addition, we compared with (7) MGP-EF, matrix GP based event factorization. It is the same as our method in applying a matrix GP prior over the rates of distinct interactions; however, MGP-EF places a standard Gaussian prior over the embeddings, and so does not model the structural sparsity.
Settings.
We implemented NESH, HP-Local, HP-TF and MGP-EF with PyTorch
(Paszke et al., 2019), and the other methods with MATLAB. For all the approaches employing GPs, we used the same variational approximation as in NESH (our method), and set the number of pseudo inputs to . We used the squared exponential (SE) kernel and initialized the kernel parameters with . For HP-Local, the local window size was set to . For our method, we chose from . We conducted stochastic mini-batch optimization for all the methods, with the batch size set to . We used the ADAM (Kingma and Ba, 2014) algorithm, with the learning rate tuned from . We ran each method for epochs, which is enough for convergence. We randomly split each dataset into sequences for training, with the remainder for testing. We varied , the dimension of the embeddings, over . For CP-TPTF, we tested numbers of time steps from . We repeated the experiments five times, and report the average test log-likelihood and its standard deviation in Fig. 2.

Results. As we can see from Fig. 2, NESH consistently outperforms all the competing methods by a large margin. Since the test log-likelihood of NESH is much larger than that of the other methods, we report the result of NESH separately (i.e., in the top figures) so that the differences among the competing methods remain visible. It can be seen that MGP-EF is much better than GP-PTF, implying that introducing a time kernel to model non-homogeneous Poisson process rates is advantageous. In addition, MGP-EF is much better than or comparable to CP-NPTF (see Fig. 2a). Since both methods use non-homogeneous Poisson processes, this result demonstrates the advantage of nonparametric rate modeling (the former) over parametric modeling (the latter). In most cases, HP-Local and HP-TF show better or comparable prediction accuracy relative to MGP-EF. This is reasonable, because the two methods use Hawkes processes, which can capture more refined temporal dependencies, i.e., excitation effects between events.
However, neither HP-Local nor HP-TF can capture the sparse structure within the present interactions, and their performance is still much worse than NESH's. Finally, GP-PTF, HP-Local and HP-TF broke down at on the Retail dataset; we found their learning was unstable. In some splits, their predictive likelihoods were very small, leading to much worse average likelihood than the other methods. Note that they do not use batch normalization as NESH and MGP-EF do, which might cause learning instability under some settings.
6.3 Pattern Discovery
Next, we examined whether NESH can discover hidden patterns from data. To this end, we set the embedding dimension to , and ran NESH on Taobao and Crash. Then we applied kernel PCA (Schölkopf et al., 1998) with the SE kernel to project the embeddings onto a plane. We then ran clustering algorithms to find potential structures. As we can see from Fig. 3a and 3b, the embeddings of users and items from the Taobao dataset exhibit interesting and clear cluster structures, which might correspond to separate interests/shopping habits. Note that we ran DBSCAN (Ester et al., 1996) rather than k-means to obtain the clusters. In addition, the embeddings of item categories on Taobao also show a clear structure, which was discovered by k-means (see Fig. 3c). Although the Taobao dataset has been completely anonymized and we cannot investigate the meaning of these clusters, they can potentially be useful for tasks such as marketing (Zhang et al., 2017), recommendation (Liu et al., 2015; Tran et al., 2018) and click-through-rate prediction (Pan et al., 2019). In addition, the embeddings of the states from the Crash dataset also exhibit clear structures (see Fig. 3d). We checked the geolocation of these states and found that the states grouped together are often neighbors. This is reasonable in that neighboring states might bear a resemblance to each other in traffic regulations (e.g., speed limits), road conditions, driving customs, weather patterns, etc. All of these might lead to similar or closely related patterns of traffic accident rates.

7 Conclusion
We have presented NESH, a novel nonparametric embedding method for sparse high-order interaction events. Not only can our method estimate the complex temporal relationships between the participants, it is also able to capture the structural information underlying the observed sparse interactions. Our theoretical bounds enable convergence-rate estimates and reveal insights about the asymptotic behavior of the sparse prior over hypergraphs and tensors. In the future, we will extend our model to more expressive point processes, such as Hawkes processes, to discover more refined temporal patterns.
Acknowledgments
This work has been supported by NSF IIS-1910983, NSF DMS-1848508 and NSF CAREER Award IIS-2046295.
References
Generalized gamma measures and shot-noise Cox processes. Advances in Applied Probability, pp. 929–953. Cited by: §3.1.
 Sparse graphs using exchangeable random measures. arXiv preprint arXiv:1401.1137. Cited by: 2nd item, §3.1, §3.2, §5.
 Sparse graphs using exchangeable random measures. Journal of the Royal Statistical Society. Series B, Statistical Methodology 79 (5), pp. 1295. Cited by: §5.
 On tensors, sparsity, and nonnegative factorizations. SIAM Journal on Matrix Analysis and Applications 33 (4), pp. 1272–1299. Cited by: §1, §2, §5, §6.2.
 DFacTo: distributed factorization of tensors. In Advances in Neural Information Processing Systems, pp. 1296–1304. Cited by: §1, §2, §3.1, §5.
 Probabilistic models for incomplete multidimensional arrays. AISTATS. Cited by: §2, §5.
 Probabilistic streaming tensor decomposition. In 2018 IEEE International Conference on Data Mining (ICDM), pp. 99–108. Cited by: §1, §5.
 A note on the geometric series as a species frequency model. Biometrika 62 (3), pp. 697–699. Cited by: §3.1.
A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, Vol. 96, pp. 226–231. Cited by: §6.3.

Bayesian streaming sparse Tucker decomposition. In Uncertainty in Artificial Intelligence, pp. 558–567. Cited by: §5.
Streaming Bayesian deep tensor factorization. In International Conference on Machine Learning, pp. 3133–3142. Cited by: §5.
A Bayesian analysis of some nonparametric problems. The Annals of Statistics, pp. 209–230. Cited by: §3.1.
Lines of descent in the diffusion approximation of neutral Wright–Fisher models. Theoretical Population Biology 17 (1), pp. 37–50. Cited by: §3.1.
Newton-based optimization for Kullback–Leibler nonnegative tensor factorizations. Optimization Methods and Software 30 (5), pp. 1002–1029. Cited by: §1, §5.
Foundations of the PARAFAC procedure: models and conditions for an "explanatory" multimodal factor analysis. UCLA Working Papers in Phonetics 16, pp. 1–84. Cited by: §1, §2, §5.
Gaussian processes for big data. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, pp. 282–290. Cited by: §4.
Survival models for heterogeneous populations derived from stable distributions. Biometrika 73 (2), pp. 387–396. Cited by: §3.1, §3.2, §5.
Zero-truncated Poisson tensor factorization for massive binary tensors. In UAI. Cited by: §3.1.
Scalable Bayesian nonnegative tensor factorization for massive count data. In Proceedings, Part II, of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2015), Vol. 9285, New York, NY, USA, pp. 53–70. Cited by: §1, §2, §5.
 Batch normalization: accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pp. 448–456. Cited by: §4, §4.
GigaTensor: scaling tensor analysis up by 100 times – algorithms and discoveries. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 316–324. Cited by: §1, §2, §5.
Lévy processes and infinitely divisible distributions. Cambridge University Press. Cited by: Appendix.
 Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §6.2.
Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114. Cited by: §4.
 Poisson processes. Vol. 3, Clarendon Press. Cited by: §3.1, §3.1, §5.
 Completely random measures. Pacific Journal of Mathematics 21 (1), pp. 59–78. Cited by: §5.
 Tensor decompositions and applications. SIAM Review 51 (3), pp. 455–500. Cited by: §1.
Multilinear operators for higher-order decompositions. Vol. 2, United States Department of Energy. Cited by: §2.
 Models beyond the Dirichlet process. Bayesian nonparametrics 28 (80), pp. 342. Cited by: §5.
NeuralCP: Bayesian multiway data analysis with neural tensor decomposition. Cognitive Computation 10 (6), pp. 1051–1061. Cited by: §5.
Nonlinear cross-domain collaborative filtering via hyper-structure transfer. In International Conference on Machine Learning, pp. 1190–1198. Cited by: §6.3.
Variational inference for Gaussian process modulated Poisson processes. In International Conference on Machine Learning, pp. 1814–1822. Cited by: §5.
 A model for the distribution of individuals by species in an environment. Michigan State University. Department of Statistics. Cited by: §3.1.
Warm up cold-start advertisements: improving CTR predictions via learning to learn ID embeddings. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 695–704. Cited by: §6.3.
Self-adaptable point processes with nonparametric time decays. Advances in Neural Information Processing Systems 34. Cited by: §5.
Scalable nonparametric factorization for high-order interaction events. In International Conference on Artificial Intelligence and Statistics, pp. 4325–4335. Cited by: §5, §6.2.
 Streaming nonlinear Bayesian tensor decomposition. In Conference on Uncertainty in Artificial Intelligence, pp. 490–499. Cited by: §5.

PyTorch: an imperative style, high-performance deep learning library. In NeurIPS. Cited by: §6.2.
Scalable Bayesian low-rank decomposition of incomplete multiway tensors. In Proceedings of the 31st International Conference on Machine Learning (ICML). Cited by: §1.
 Gaussian processes for machine learning. MIT Press. Cited by: §3.1.
Poisson-randomized gamma dynamical systems. In Advances in Neural Information Processing Systems, pp. 781–792. Cited by: §1, §2, §5.
Bayesian Poisson tensor factorization for inferring multilateral relations from sparse dyadic event counts. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1045–1054. Cited by: §1, §2, §5, §6.2.
Bayesian Poisson Tucker decomposition for learning the structure of international relations. In Proceedings of the 33rd International Conference on Machine Learning (ICML'16), pp. 2810–2819. Cited by: §1, §2, §5.

Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10 (5), pp. 1299–1319. Cited by: §6.3.
A constructive definition of Dirichlet priors. Statistica Sinica, pp. 639–650. Cited by: §4.
Probabilistic neural-kernel tensor decomposition. In 2020 IEEE International Conference on Data Mining (ICDM), pp. 531–540. Cited by: §5.
 Nonparametric decomposition of sparse tensors. In International Conference on Machine Learning, pp. 10301–10311. Cited by: Appendix A, Appendix C, Appendix C, 1st item, 2nd item, §3.1, §3.1, §3.2, Lemma 3.1, §5, §5, §6.1, footnote 2.
Variational learning of inducing variables in sparse Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pp. 567–574. Cited by: §4.
 Regularizing matrix factorization with user and item embeddings for recommendation. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 687–696. Cited by: §6.3.
Some mathematical notes on three-mode factor analysis. Psychometrika 31, pp. 279–311. Cited by: §1, §2, §5.

High-dimensional probability: an introduction with applications in data science. Vol. 47, Cambridge University Press. Cited by: Appendix C, §3.2.
Self-modulating nonparametric event-tensor factorization. In International Conference on Machine Learning, pp. 9857–9867. Cited by: §5.
 Nonparametric network models for link prediction. The Journal of Machine Learning Research 17 (1), pp. 7102–7121. Cited by: 2nd item.
Temporal collaborative filtering with Bayesian probabilistic tensor factorization. In Proceedings of the 2010 SIAM International Conference on Data Mining, pp. 211–222. Cited by: §2, §5, §6.2.
 Bayesian conditional tensor factorizations for highdimensional classification. Journal of the Royal Statistical Society B, revision submitted. Cited by: §5.
 Nonlinear effects of social connections and interactions on individual goal attainment and spending: evidences from online gaming markets. Journal of Marketing 81 (6), pp. 132–155. Cited by: §6.3.
Bayesian CP factorization of incomplete tensors with automatic rank determination. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (9), pp. 1751–1763. Cited by: §1.
Stochastic nonparametric event-tensor decomposition. In Advances in Neural Information Processing Systems, pp. 6856–6866. Cited by: §5, §6.2.
DinTucker: scaling up Gaussian process models on large multidimensional arrays. In Thirtieth AAAI Conference on Artificial Intelligence. Cited by: §5.
 Scalable nonparametric multiway data analysis. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pp. 1125–1134. Cited by: §5.
 Distributed flexible nonlinear tensor factorization. In Advances in Neural Information Processing Systems, pp. 928–936. Cited by: §5.
Appendix
In this section, we provide a detailed proof of Lemma 3.2. To make the ideas accessible to a broad audience, we also give a brief introduction to the Lévy–Khintchine formula as well as the intuition behind it. Our introduction covers Gamma processes, which are used to sample the intensity measures for a sequence of Poisson point processes (PPPs) in the sparse tensor/hypergraph construction, as a special case. A comprehensive treatment of the related topics can be found in, for instance, (Ken-Iti, 1999).
Appendix A The Lévy–Khintchine formula
The sparse tensor model introduced in (Tillinghast and Zhe, 2021) first generates a discrete measure using Gamma processes, which are a special type of Lévy process. To better understand this construction, we take a brief detour through Lévy processes.
A Lévy process is an $\mathbb{R}$-valued random process $X = (X_t)_{t \ge 0}$ such that

$X_0 = 0$ a.s.;

$X$ has stationary and independent increments;

For every $t \ge 0$, $X_t$ is right-continuous and has a well-defined left limit.

$X$ is uniquely determined by $X_1$, which is an infinitely divisible random variable. Indeed, if the characteristic function (CF) of $X_1$ is $e^{\eta(\theta)}$, then the CF of $X_t$ is $e^{t\eta(\theta)}$ for $t \ge 0$. A complete understanding of $X_1$ is sufficient to characterize $X$, and this can be done via the Lévy–Khintchine formula:

Theorem 1 (Lévy–Khintchine).
Let $Y$ be an $\mathbb{R}$-valued random variable. $Y$ is infinitely divisible if and only if the CF of $Y$, $\varphi_Y(\theta) = \mathbb{E}[e^{i\theta Y}]$, takes the form
$$\varphi_Y(\theta) = e^{\eta(\theta)}, \qquad (14)$$
where
$$\eta(\theta) = i b \theta - \frac{\sigma^2 \theta^2}{2} + \int_{\mathbb{R}} \left(e^{i\theta x} - 1 - i\theta x \mathbf{1}_{\{|x| < 1\}}\right) \nu(dx), \qquad (15)$$
for some $b \in \mathbb{R}$ and $\sigma^2 \ge 0$, where $\nu$ is a Borel measure on $\mathbb{R}$ satisfying $\nu(\{0\}) = 0$ and $\int_{\mathbb{R}} \min(1, x^2)\,\nu(dx) < \infty$. Here $\eta$ is called the Lévy exponent of $Y$.
As a consequence of Theorem 1, we conclude that the CF of every Lévy process $X$ can be written as
$$\mathbb{E}[e^{i\theta X_t}] = e^{t\eta(\theta)},$$
where $\eta$ is the Lévy exponent of $X_1$.
To comprehend the path structure of $X$ using (15), we appeal to the following facts:

Addition of CF exponents corresponds to addition of independent random variables (processes);

The CF of a drifted Brownian motion is $\exp\{t(i b \theta - \sigma^2 \theta^2 / 2)\}$;

The CF of a compound Poisson process with jump parameter $\mu$ (i.e., a distribution) and rate parameter $\lambda$ (i.e., a Lévy process) is $\exp\{t \lambda \int_{\mathbb{R}} (e^{i\theta x} - 1)\, \mu(dx)\}$;

The CF of a compensated compound Poisson process with jump parameter $\mu$ (i.e., a distribution) and rate parameter $\lambda$ (i.e., a Lévy process and a martingale) is $\exp\{t \lambda \int_{\mathbb{R}} (e^{i\theta x} - 1 - i\theta x)\, \mu(dx)\}$.
Let $\Lambda_0 = \{x : |x| \ge 1\}$ and $\Lambda_n = \{x : \tfrac{1}{n+1} \le |x| < \tfrac{1}{n}\}$ for $n \ge 1$. Rewrite (15) as
$$\eta(\theta) = \left(i b \theta - \frac{\sigma^2 \theta^2}{2}\right) + \int_{\Lambda_0} (e^{i\theta x} - 1)\,\nu(dx) + \sum_{n \ge 1} \int_{\Lambda_n} (e^{i\theta x} - 1 - i\theta x)\,\nu(dx), \qquad (16)$$
where the right-hand side corresponds to three independent processes: a drifted Brownian motion, a compound Poisson process, and a series of independent compensated compound Poisson processes. Under the integrability condition on $\nu$, the last part can be shown to converge using martingale theory. Moving the drift term (the cumulation of the compensating terms; deterministic) in the last part of (16) into the Brownian motion, we decompose $X$ into two independent processes, a drifted Brownian motion and a pure-jump process (with countably many jumps). This construction is the celebrated Lévy–Itô construction.
To end this section, we give a useful interpretation of compound Poisson processes. A compound Poisson process $C = (C_t)_{t \ge 0}$ with jump parameter $\mu$ and rate parameter $\lambda$ is defined as
$$C_t = \sum_{i=1}^{N_t} Y_i,$$
where $Y_1, Y_2, \ldots \overset{\text{i.i.d.}}{\sim} \mu$ are independent of the Poisson process $(N_t)_{t \ge 0}$ with rate $\lambda$. Note that the jump times and sizes $\{(T_i, Y_i)\}$ follow a PPP with intensity measure $\lambda\, dt \otimes \mu(dy)$; for fixed $t$, we may also consider $C_t$ as the integration of $(s, y) \mapsto y$ against the Poisson random measure on $[0, t] \times \mathbb{R}$.
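As a quick numerical illustration of this interpretation, the sketch below simulates $C_t$ and checks its empirical CF against the compound Poisson formula above. The parameter choices (rate $\lambda = 2$, standard normal jump distribution $\mu$, $t = 1$) are ours, purely for illustration.

```python
import numpy as np

# Monte Carlo check of the compound Poisson CF:
#   E[exp(i*theta*C_t)] = exp(t*lam*(phi_mu(theta) - 1)),
# with illustrative choices lam = 2.0 and standard normal jumps mu = N(0,1).
rng = np.random.default_rng(42)
lam, t, theta = 2.0, 1.0, 1.0
n_paths = 200_000

N = rng.poisson(lam * t, size=n_paths)      # number of jumps in [0, t]
# C_t = sum of N i.i.d. jumps; for N(0,1) jumps, C_t | N ~ N(0, N)
C = rng.standard_normal(n_paths) * np.sqrt(N)

empirical_cf = np.mean(np.exp(1j * theta * C))
# phi_mu(theta) = exp(-theta^2/2) for the standard normal jump distribution
theoretical_cf = np.exp(t * lam * (np.exp(-theta**2 / 2) - 1.0))
assert abs(empirical_cf - theoretical_cf) < 0.01
```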
Appendix B Gamma processes
A Gamma process (with shape parameter $\alpha > 0$ and rate parameter $\lambda > 0$) is a Lévy process $X$ with
$$\mathbb{E}[e^{i\theta X_t}] = \left(1 - \frac{i\theta}{\lambda}\right)^{-\alpha t}.$$
It can be checked that
$$-\alpha \log\left(1 - \frac{i\theta}{\lambda}\right) = \int_0^\infty (e^{i\theta x} - 1)\, \alpha x^{-1} e^{-\lambda x}\, dx,$$
i.e., the Lévy measure is
$$\nu(dx) = \alpha x^{-1} e^{-\lambda x}\, dx, \quad x > 0.$$
For convenience, we set . From $\nu$ one can deduce the following properties of $X$:

By the Lévy–Itô construction, $X$ is a pure-jump process (i.e., no Brownian part), i.e., $X$ has countably infinitely many jumps in $[0, t]$;

We can associate a sample path of $X$ (up to time $t$) with the following measure on $[0, t]$:
$$\mu_t = \sum_{0 < s \le t} \Delta X_s\, \delta_s, \qquad \Delta X_s = X_s - X_{s^-}, \qquad (17)$$
where the summation is well-defined since there are at most countably many $s$ such that $\Delta X_s \ne 0$; $\mu_t([0, t])$ is finite thanks to $\mu_t([0, t]) = X_t < \infty$;

The jump sizes of $X$ in $[0, t]$, $\{\Delta X_s : \Delta X_s \ne 0\}$, follow a PPP with intensity measure $t\,\nu$; see the last paragraph in Section A.
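The Gamma process facts above (Gamma marginals of the increments, and the Lévy exponent as an integral against the Lévy measure) can be sanity-checked numerically. The sketch below uses hypothetical parameters $\alpha = 2$, $\lambda = 3$ and a crude Riemann sum; it is ours, not part of the paper's experiments.

```python
import numpy as np

# Numerical sanity checks for the Gamma process (hypothetical parameters).
alpha, lam, t = 2.0, 3.0, 1.5
rng = np.random.default_rng(0)

# (1) Marginals: X_t ~ Gamma(shape=alpha*t, rate=lam), so
#     E[X_t] = alpha*t/lam and Var[X_t] = alpha*t/lam^2.
samples = rng.gamma(shape=alpha * t, scale=1.0 / lam, size=200_000)
assert abs(samples.mean() - alpha * t / lam) < 1e-2
assert abs(samples.var() - alpha * t / lam**2) < 1e-2

# (2) Levy exponent: -alpha*log(1 - i*theta/lam) should equal the integral
#     of (e^{i*theta*x} - 1) against the Levy measure alpha*x^{-1}e^{-lam*x}dx.
theta = 1.0
x = np.linspace(1e-8, 40.0, 400_000)
integrand = (np.exp(1j * theta * x) - 1.0) * alpha * np.exp(-lam * x) / x
numeric = np.sum(integrand) * (x[1] - x[0])  # simple Riemann sum
closed_form = -alpha * np.log(1.0 - 1j * theta / lam)
assert abs(numeric - closed_form) < 1e-3
```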
Appendix C Sparse tensor/hypergraph processes
The sparse hypergraph model considered in this paper is a superposition of independent sparse tensor processes introduced in (Tillinghast and Zhe, 2021). Without loss of generality, we assume ; the general case can be analyzed similarly. In this case, the sampled entries in the sparse tensor model are obtained as follows: given and time , we

Use i.i.d. Gamma processes, , to generate discrete measures $\Lambda_1, \Lambda_2$ as in (17). Here we change the notation to be consistent with the manuscript, with the index omitted;

Construct a product measure $\Lambda = \Lambda_1 \otimes \Lambda_2$ on ;

Take $\Lambda$ as the intensity measure to construct a Poisson random measure $\mathcal{M}$. In particular, one can take
$$\mathcal{M} = \sum_{i=1}^{N} \delta_{(u_i, v_i)}, \qquad N \perp \{(u_i, v_i)\}_i, \qquad (18)$$
where $\perp$ means that $N$ and $\{(u_i, v_i)\}_i$ are independent. The support of $\mathcal{M}$ corresponds to the sampled entries in the sparse tensor, and the marginals of the support of $\mathcal{M}$ correspond to the size of the sparse tensor in the respective dimensions.
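The steps above can be sketched in a few lines. This is our own illustrative mock-up of the two-mode case (hypothetical parameters, and a crude grid discretization standing in for exact Gamma-process jump sampling), not the authors' implementation.

```python
import numpy as np

# Illustrative sketch of the two-mode construction: two Gamma-process paths
# give discrete measures (atoms = jump locations, weights = jump sizes);
# tensor entries are then drawn from a Poisson random measure whose
# intensity is the product of the two measures.
rng = np.random.default_rng(1)
t, alpha, lam = 1.0, 50.0, 1.0   # hypothetical parameters

def gamma_process_atoms(n_steps=1000):
    # crude discretization: i.i.d. Gamma increments over a grid on [0, t];
    # keep only the non-negligible increments as "jumps"
    locs = np.linspace(0, t, n_steps, endpoint=False)
    jumps = rng.gamma(alpha * (t / n_steps), 1.0 / lam, size=n_steps)
    keep = jumps > 1e-4
    return locs[keep], jumps[keep]

u, w1 = gamma_process_atoms()
v, w2 = gamma_process_atoms()

# Poisson random measure with finite intensity: draw N ~ Poisson(total mass),
# then sample atom pairs with probability proportional to w1_i * w2_j
total_mass = w1.sum() * w2.sum()
N = rng.poisson(total_mass)
rows = rng.choice(len(u), size=N, p=w1 / w1.sum())
cols = rng.choice(len(v), size=N, p=w2 / w2.sum())
entries = set(zip(rows.tolist(), cols.tolist()))   # distinct observed entries
print(f"{N} events over {len(entries)} distinct entries "
      f"out of {len(u) * len(v)} possible")
```

The distinct sampled pairs play the role of the support of $\mathcal{M}$, and the distinct row and column indices play the role of its marginals.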
To get an explicit rate of convergence of sparsity as , we need to estimate the cardinality of the support of , , as well as of the corresponding marginals, which are denoted by .
Lemma 2.
Fix . For a sparse tensor process defined as above, for all sufficiently large , there exists an absolute constant such that, with probability at least ,
Proof.
Analysis of . We begin by deriving an upper bound for . Note , where the is defined in (18). Conditioned on ,
By a Poisson tail estimate (Vershynin, 2018, Exercise 2.3.6), we have
(19) 
where is an absolute constant. For each ,
Without loss of generality, we assume (otherwise consider and ). Then, we can write as a sum of i.i.d. exponentials with unit rate:
where is the exponential random variable with unit rate. An application of Bernstein’s inequality yields
where is an absolute constant (depending only on and both of which are equal to ). Taking a union bound over yields
(20) 
Combining (19) and (20) via a union bound yields that, with probability at least ,
(21) 
A lower bound on requires more refined analysis. Let
(22) 
For all sufficiently large , , and
(23)  