1 Introduction
Systems of interacting particles or agents are ubiquitous in science. They may be employed to model and interpret intriguing natural phenomena, such as superconductivity, body problems, and selforganization in biological systems. Collective dynamical systems demonstrate emergent behaviors such as clustering (Krause (2000); Mostch and Tadmor,E. (2014)), flocking (Vicsek et al. (1995); Cucker and Smale (2007)), milling (Carrillo et al. (2009); Albi et al. (2014)), synchronization (Strogatz (2000); O’Keeffe et al. (2017)), and many more (see Degond et al. (2013)).
We consider interactingagent systems of agents governed by the equations in the form
(1) 
for , ;
is a state vector, for example describing position, speed, emotion, or opinions of the
agent, and is the interaction kernel, governing how the state of agent influences the state of agent .Note that while the state space of the system is dimensional, the interaction kernel is a function of dimensions. In fact, in many important models, depends on an even smaller number of natural variables , with , which are functions of pairs . In other words, the interaction kernel can be factorized (as composition of functions) as a reduced interaction kernel , and reduced variables , i.e.
(2) 
An important example is , i.e. is in fact just a function of pairwise distances, a onedimensional variable. Given a set of discrete observations of trajectory data
with and initial conditions sampled i.i.d. from
(some unknown probability distribution on
) for, we are interested in estimating the interaction kernel
. Note that this is not a regression problem, but an inverse problem, since values of the function are not observed (only averages of such values, as per the righthand side of (1), are). In the works that considered this problem (Bongini et al. (2017); Lu et al. (2019); Miller et al. (2020)), the reduced variables were given, and was estimated as a function of . One of the main takeaways of those works is that the estimation of is not cursed by the dimension of the state space of the system, nor by the dimension of pairs of agent states, and, under suitable stable identifiability conditions, it is not harder than regression in dimensions, with the minmax learning rate of (up to log factors) being achievable, with efficient algorithms.In this work, we consider the case where the reduced variables are not known: both and need to be estimated from observations. We wish to accomplish this while exploiting the factorization (2), in order to obtain rates of estimations that are independent of both the dimension of the state space and of the dimension of of pairs of states, and only dependent on . We relate this problem to the classical multiindex model: one considers functions , where , is a linear map, , and given i.i.d. samples , one constructs estimators of and . We defer to Martin (2021) for discussions, references, and a novel technique for estimating that avoids the curse of dimensionality and is computationally efficient.
In our setting, we cannot directly apply the techniques of multiindex regression, as we are facing an inverse problem. We therefore consider a multitask learning setting, where we are allowed to learn on systems with a varying number of agents, driven by interaction kernels sharing the same variables , and then transfer to other systems with different or , but sharing the same . Crucially, when , it is possible to recast the estimation problem for and as a multiindex regression problem, opening the way to efficient estimation. Transferring to other systems with different is straightforward, and to systems with different ’s but the same variables may be achieved by the techniques of Lu et al. (2019); Miller et al. (2020).
Related Methods. There are many related approaches in learning the governing structures from observation of dynamical trajectory data, i.e. Ballerini et al. (2008); HerbertRead et al. (2011); Brunton et al. (2016); see also Miller et al. (2020) for further references. However, most of them do not take into consideration the special structure of selforganized dynamics. Our learning method follows in the line of research of Bongini et al. (2017); Lu et al. (2019), which exploits both the structure of the equations, and the assumption that depends on a small number of reduced variables (e.g. pairwise distance). The learning framework developed in Bongini et al. (2017) focused on the convergence case when goes to infinity, while Lu et al. (2019) consider the convergence when goes to infinity; extensions include Zhong et al. (2020); Miller et al. (2020); Maggioni et al. (2021). In these works, the reduced variables are known and the focus was on learning the interaction kernels; here we generalize the framework to the case when the variables are unknown.
2 Learning Method
We assume the discrete observations of trajectory data, with , is equispaced in time: for all ’s. We seek to infer the variables and the reduced interaction kernel in the factorization in (2).
We proceed by further assuming that can be factorized as , where is a known
feature map to highdimensional feature vectors, and the feature reduction map
is an unknown linear map to be estimated. What really matters is the range of , and without loss of generality we assume orthogonal. The factorization (2) becomes: ; we estimate and in two steps.Step . Estimating and . The techniques of Martin (2021), for estimation of the multiindex model, are only applicable in the regression setting. However, we note that if we allow the learning procedure to conduct “experiments” and have access to observations of (1) for different values of , and in particular for , then from those observations it can extract equations that reveal the values taken by the interaction kernels, taking us back to a regression setting. We then apply the approach of Martin (2021) to obtain , estimating . This technique is not cursed by the dimension , making it possible to increase the range of the feature map with relatively small additional sampling requirements. While the choice of the feature map is typically applicationdependent, and can incorporate symmetries, or physical constraints on the system, a rather canonical choice is the map to polynomials in the states, and here we restrict ourselves to secondorder polynomials, and therefore assume that:
(3)  
with . Of course higher order terms, beyond the quadratic ones, could be added to the feature map .
Step . Estimating . Once an estimate for the feature reduction map has been constructed, we proceed to estimate . We project the pairs of states, using , to , and use the nonparametric estimation techniques of Lu et al. (2019): we minimize, over functions in a suitable hypothesis space , the error functional
where for , and for , . is a convex and compact (in the norm) set of functions of the estimated variables , e.g. spanned by splines with knots on a grid, or piecewise polynomials. As in classical nonparametric estimation, the dimension of will be chosen as a suitably increasing function of the number of training trajectories .
2.1 Details on estimating the reduced variables and kernel
Step described above requires two key ideas: first of all, mapping the inverse problem of estimating the interaction kernel to a regression problem, by considering the problem for (from which the solution is immediately transferred to any ). Secondly, given inputoutput pairs , with and unknown and known, we estimate while trying to avoid the curse of dimensionality – meaning both (the dimension of , the domain of ), and (the dimension of the domain of , which is the range of ).
Firstorder system with agents. We consider the controlled experiments where we observe a firstorder system of only agents (we drop the dependence on for now to ease the notation), i.e.,
We can transform the equations to
Let as in (3); we obtain a regression problem:
Similar definitions are used for and . We reindex the observations as and the corresponding , with . We are now ready to estimate . Multiplicatively Perturbed Least Squares (MPLS) is a computationally efficient algorithm introduced in Martin (2021) for estimating multiindex models, which we apply here to the estimation of . It decomposes the regression function into linear and nonlinear components: , where and is orthogonal to linear polynomials. The intrinsic domain of , the rowspace of , is spanned by and the rows of , estimated as follows:
Algorithm [MPLS] Inputs: training data , partitioned into subsets and with sizes ; parameters and , with and (choosing and is discussed in Martin (2021)).

Let be the residual data of this approximation on , paired with s projected on .

Pick in (e.g., a random subset of ). For each , center the residuals to their weighted mean: with ,
and compute the slope perturbations via least squares:

Let have rows , , and compute the ranksingular value decomposition of . Return and .
The consistency of the estimators and convergence rates are discussed in Martin (2021); a key outcome of that analysis is that the sampling requirements for accurately estimating and are not exponential in , but a lowdegree polynomial in , therefore avoiding the curse of dimensionality. Those results hold when the training data is sampled i.i.d. from some distribution over , while here the samples are clearly not independent along each trajectory, but they are on the different trajectories, as the initial conditions are sampled independently. It remains to convert the computed and (from linear components) into an estimate of the lowdimensional space . This is done by computing the regression error of a 1dimensional constant spline approximation to for ; comparing this error for and the th singular vector of . If the former’s error is less than the latter’s, which suggests that the linear component is a more significant feature, then we take to be the row vector stacked on the first rows of , otherwise we let . Note that because the s used to compute are projected away from , will have orthonormal rows in either case. This procedure could be done at the beginning to identify useful vectors on the unit ball in dimensions (as in Juditsky et al. (2009)), however this would involve checking a number of vectors exponential in , and is not computationally feasible here. MPLS allows us to the reduce the problem to one comparison, regardless of .
2.2 Performance Measures
To measure the accuracy in approximating the feature reduction map , we use . To measure how well approximates we use
(4) 
where ; its relative variant is . The dynamical system, together with its random initial conditions sampled from on
, induces the probability measure
on , given byWe measure the trajectory accuracy between the observed trajectory, , and the trajectory obtained by the system driven by (over matching initial conditions) as
(5) 
with the expectation taken over initial conditions sampled i.i.d. from . A relative version , so that trajectory errors from different dynamics are more comparable, is obtained by dividing by
For a set of trajectories with different initial conditions, we report the trajectory error as .
2.3 Convergence Characteristics and Limitations
We expect our approach to preserve the strong convergence properties of MPLS and of the estimators of interaction kernels in Miller et al. (2020), in particular avoiding the curse of dimensionality in and . Limitations of our approach include the need of available training data for , and the knowledge of the feature map , such that is a linear function of .
3 Numerical Examples
We consider three kinds of dynamical systems exhibiting different emergent patterns: opinion dynamics, where clustering of agents emerges and the interaction kernels are compactly supported and discontinuous; powerlaw dynamics, where ringlike patterns emerge and the interaction kernels are powers of the pairwise distance variables; a modified powerlaw dynamics where a directional influence is added to steer the agents into a preferred direction. The shared parameters of these experiments are given in Table 1.
Here is used to generate the initial conditions for , and it is significantly smaller than , used for to learn
, showcasing an excellent transfer learning capability. For the
case, we have the mapping given by (3). Fig. 1 compares the reduced features and . Notes that estimates up to orthogonal transformations; this effect can be seen in Fig. 1 and Fig. 5. However, when we visually compare the interaction kernels and the distributions of and , we will coordinate the plots on the same axis, for the purpose for easier visual comparison.After we obtain (or, in the Oracle case, when is given), we estimate , , from the data (or , for the data ) using the following steps: estimate the support from (or ); build a basis on a uniform partition of the support with the optimal number of basis functions using either piecewise polynomials or clamped Bsplines as basis functions; solve for or as a linear combination of the basis functions (see Lu et al. (2019)). We run learning trials over
training sets in order to demonstrate the effectiveness of the learning results as the training data is randomly generated, and report the mean and standard deviation of the performance estimators over these trials.
Opinion Dynamics (OD). We consider the system of Mostch and Tadmor,E. (2014), where , and
where is the indicator function of set , is uniform on , and . Table 2 shows the estimation errors for this system, where (Oracle) indicates that the error is calculated when the true is known. In particular, the proposed method has a relative accuracy in estimating comparable to the case when is known (Oracle case). We can obtain a faithful approximation of (in terms of ) and of the interaction kernel using the pair , with digit relative accuracy.
(Oracle)  

(Transfer) 
We show the comparison of the two estimators (one using the feature reduction map and the other using ) to the true interaction kernel in Fig. 2. A comparison of selected trajectories is shown in Fig. 3.
Power Law Dynamics (PL). We consider the dynamics in Kolokolnikov et al. (2011) where with and . is uniform on and . Table 3 shows the estimation errors, which are about digit relative accuracy in estimating the feature map and , and lower for the trajectory errors. We compare the two estimators (using and respectively) to as well as the corresponding dynamics in Fig. 4.
(Oracle)  

(Transfer) 
For this continuous kernel, we are able to provide a learned pair with higher accuracy (when compared to the case of learning from opinion dynamics).
Power Law Dynamics with Directional Correction (PLwDC). We consider a directioncorrected powerlaw dynamics, generalizing that of Kolokolnikov et al. (2011), where with , and . is as in the previous example. Here is a unit vector which defines a preferred direction for the agents to follow. As shown in Table 4, our learning method can keep about digit relative accuracy in estimating the feature map and the interaction kernel – a loss of accuracy compared to the powerlaw case due to the increase in the dimension of the features. The estimators of the interaction kernel are compared to the true kernel in Fig. 5. We compare the distribution of and in Fig. 5, to the two estimated pairs and (Oracle estimator) as well as the trajectories in Fig. 6.
(Oracle)  

(Transfer) 
: Marginal and joint distributions of
(first row) is wellmatched by that of (second row), here on the same , axis, respectively, for ease of comparison.Although our estimate has large principal angle with , we are able to provide an estimated pair with reasonably small error (even smaller than knowing ), which also yields accurate trajectories. The reason lies in the unindentifiability of , and the estimator produces variables which are in onetoone correspondence with the ones in the model, see Fig. 5.
4 Conclusion
We have shown that we can estimate factorizable high dimensional interaction kernel from high dimensional observation data by first learning the feature reduction map via MPLS, then inferring the reduced interaction kernel from a nonparametric regression with the reduction variable . Our method can accurately estimate the feature reduction map, i.e. the unknown reduced variables in the interaction kernel, as well as the reduced interaction kernel, and provide accurate predictions for the dynamics of the system. Work on systems with different interaction structures, higher order systems, and transfer across multiple systems sharing reduced variables, is ongoing, as is the development of theoretical guarantees on the performance of the estimators, expecting to show that their consistency, under suitable identifiability assumptions, and convergence rate, not cursed by the dimension of the state space of the system nor by the rank of the feature map.
Acknolwedgments. MM thanks Fei Lu for insightful discussions. Prisma Analytics Inc. provided computing and storage facilities at no cost.
References
 Stability analysis of flock and mill rings for second order models in swarming. SIAM Journal on Applied Mathematics 74 (3), pp. 794–818. External Links: arXiv:1304.5459v1, ISSN 00361399 Cited by: §1.
 Interaction ruling animal collective behavior depends on topological rather than metric distance: evidence from a field study. Proc Natl Acad Sci USA 105 (4), pp. 1232–1237. External Links: http://www.pnas.org/content/105/4/1232.full.pdf Cited by: §1.
 Inferring interaction rules from observations of evolutive systems I: the variational approach. Math Mod Methods Appl Sci 27 (05), pp. 909–951. Cited by: §1, §1.
 Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proc Natl Acad Sci USA 113 (15), pp. 3932–3937. External Links: http://www.pnas.org/content/113/15/3932.full.pdf Cited by: §1.
 Double milling in selfpropelled swarms from kinetic theory. Kinetic and Related Models 2 (2), pp. 363–378. External Links: ISSN 19375093 Cited by: §1.
 Emergent behavior in flocks. IEEE Trans Automat Contr 52 (5), pp. 852. Cited by: §1.
 Hydrodynamic models of selforganized dynamics: Derivation and existence theory. Methods and Applications of Analysis 20 (2), pp. 89–114. External Links: arXiv:1108.3160v1, ISSN 10732772 Cited by: §1.
 Inferring the rules of interaction of shoaling fish. Proc Natl Acad Sci USA 108 (46), pp. 18726–18731. Cited by: §1.
 Nonparametric estimation of composite functions. Ann. Statist. 37 (3), pp. 1360–1404. External Links: Document Cited by: §2.1.
 A theory of complex patterns arising from d particle interactions. Phys Rev E, Rapid Communications 84, pp. 015203(R). Cited by: §3, §3.
 A discrete nonlinear and nonautonomous model of consensus formation. Commun Part Diff Eq 2000, pp. 227–236. Cited by: §1.
 Nonparametric inference of interaction laws in systems of agents from trajectory data. Proceedings of the National Academy of Sciences 116 (29), pp. 14424–14433. Cited by: §1, §1, §1, §2, §3.
 Learning interaction kernels for agent systems on riemannian manifolds. In Proc. 38th I.C.M.L., M. Meila and T. Zhang (Eds.), Proceedings of Machine Learning Research, Vol. 139, pp. 7290–7300. Cited by: §1.
 Multiplicatively perturbed least squares for dimension reduction. Ph.D. Thesis, Johns Hopkins University. Cited by: §1, §2.1, §2.1, §2.
 Learning theory for inferring interaction kernels in secondorder interacting agent systems. arXiv:2010.03729. Cited by: §1, §1, §1, §2.3.
 Heterophilious Dynamics Enhances Consensus. SIAM Rev 56 (4), pp. 577 – 621. Cited by: §1, §3.
 Oscillators that sync and swarm. Nature Communications 8 (1), pp. 1–12. Cited by: §1.
 From Kuramoto to Crawford: exploring the onset of synchronization in populations of coupled oscillators. Physica D (143), pp. 1 – 20. External Links: ISSN , Link Cited by: §1.

Novel Type of Phase Transition in a System of SelfDriven Particles
. Physical Review Letters 75, pp. 1226–1229. External Links: condmat/0611743 Cited by: §1.  Datadriven discovery of emergent behaviors in collective dynamics. Physica D: Nonlinear Phenomena, pp. 132542. External Links: ISSN 01672789 Cited by: §1.