1 Introduction
1.1 Inferring the sources, sinks and hubs of transmission flows to aid the design of HIV prevention interventions
HIV remains one of the largest public health threats, especially in subSaharan Africa where approximately 61% of all new cases worldwide occur (UNAIDS, 2019). In recent years, rates of incident cases have overall dropped considerably with the widespread adoption of prevention interventions such as voluntary medical male circumcision (VMMC) to reduce the risk of HIV acquisition in men, or immediate provision of antiretroviral therapy (ART) to suppress the virus in infected individuals and thereby stop onward transmission (Cohen et al., 2011; Grabowski et al., 2017; Hayes et al., 2019), although they remain well above UNAIDS thresholds for elimination (UNAIDS, 2018).
Within Africa, there is increasing focus on identifying groups of individuals that are at high risk of acquiring HIV and at high risk of spreading the virus with the goal of targeted control interventions to these groups (AbelerDörner et al., 2019). Conceptually, the first step in this strategy is to break down the epidemic into source, sink and hub populations, according to the transmission flows that occur between them (Figure 1). Sources are population groups that disproportionately pass on infection, sinks are groups that disproportionately acquire infection, and hubs are both sources and sinks. The population groups can be defined in various ways.
For example DwyerLindgren et al. (2019) provided subnational estimates of HIV prevalence across Africa, adding to data showing that the epidemic is highly heterogeneous across Africa, with small areas of very high prevalence (i.e., hotspots) that are surrounded by neighbouring areas with substantially lower prevalence. Although often assumed, it is unclear if hotspots are also sources of epidemic spread to neighbouring lowerprevalence communities (Ratmann et al., 2020). Here, study populations are divided into individuals living in highprevalence areas () and lowprevalence areas (), and then transmission flows are estimated within and between them,
(1) 
where is the proportion of transmission flows from group to group subject to .
Another prominent application concerns the interruption of infection cycles between men and women of different ages. De Oliveira et al. (2017) proposed the scenario that young women aged 25 years are predominantly infected by older men aged 2540 years, and later spread the virus to similarly aged men in their late twenties and early thirties. Here, study populations are divided into sexspecific age groups (we consider 1year age groups between 15 to 49 years), and then transmission flows between and within age groups are estimated,
(2) 
where is the proportion of transmissions from men in age band to women in age band , and similarly for . We consider here only malefemale transmission flows because we found no evidence of malemale transmission in previous analyses (Ratmann et al., 2019) and sexual transmission between women is extremely rare. The flow matrix (2) has nonzero entries to estimate, which for 1year age bands amounts to
variables. Important summary statistics are the vector of sources of infection in group
individuals (), for example in young women aged 20 years; the vector of recipients of infection from group individuals (), for example from men aged 25 years; and flow ratios from to (), for example the ratio of transmissions from highprevalence to lowprevalence areas compared to transmissions from lowprevalence to highprevalence areas. Respectively these quantities are defined by(3a)  
(3b)  
(3c) 
Flow matrices of the form (12) are also of central interest to characterise human migration flows between countries (Raymer et al., 2013), transport flows between locations (Tebaldi and West, 1998), bacterial migration in humans (Ailloud et al., 2019), or human contact intensities (van de Kassteele et al., 2017), and are alternatively referred to as origindestination matrices (Hazelton, 2001; Miller et al., 2019).
1.2 Inference from pathogen sequence data
Transmission flow matrices have been estimated from contact tracing or survey data on partner characteristics, though these data are often subject to reporting and/or social desirability biases, especially for sexual diseases that are associated with stigma or remain criminalised in many countries (BarréSinoussi et al., 2018). Here, we are concerned in estimating the quantities (13) from pathogen sequences, which are considered an objective marker of disease flow. For fastevolving pathogens like HIV, mutations accrue quickly and the phylogenetic relationship of pathogen sequences can be used to evaluate many aspects of transmission dynamics such as the origins of HIV (Faria et al., 2014), the contribution of different disease phases to onward spread (Volz et al., 2013; Ratmann et al., 2016), or outbreak detection (Poon et al., 2016).
Traditionally, HIV sequences are obtained through Sanger sequencing, which returns for each sample one consensus nucleotide sequence that captures the entire viral diversity in the sample from one individual. The genetic distance between two consensus sequences can be used to estimate if the corresponding two individuals are epidemiologically closely related, however the data are insufficient to estimate the direction of transmission between any two sampled individuals (Leitner and RomeroSeverson, 2018). For this reason most methods infer transmission flows indirectly from statistics of the entire phylogeny, usually the coalescent times (i.e. the times when two lineages coalesce into one, backwards in time) and the disease states of infected individuals at time of sampling (such as location or age in the two applications discussed above). In the mugration model (Lemey et al., 2009)
, the states of viral lineages at any time are described with a continuoustime Markov chain (CTMC) that is independent of the evolutionary process. Flow estimates between groups can be obtained from the posterior distribution of the transition rates of the CTMC model via MCMC sampling, as well as posterior estimates of the phylogeny and the states of its lineages, which are latent variables in the model
(Lemey et al., 2009). The MultiTypeTree model (Vaughan et al., 2014) removes the independence assumption that the evolutionary history of the genealogy is independent of population structure, however sampling correlated latent phylogenies and state histories is often computationally infeasible. This limitation is addressed with the structured coalescent of Volz et al. (2009), which integrates over the state histories of the phylogeny and describes the marginal probabilities of each viral lineage to be in a particular state at a particular time. The changes in the state probabilities along lineages and through coalescent events are derived under ordinary differential equations (ODE) models of disease spread. The flow parameters are obtained as byproducts of the estimated latent states and parameters of the compartmental model, and in general vary in time over the phylogenetic history. Adopting the marginalisation approach, a greater range of flow models and data sets can be analysed, though computational runtimes often remain on the order of several weeks for data from hundreds of individuals
(Vaughan et al., 2014; Volz et al., 2013).An emerging strategy for estimating transmission flows involves phylogenetic analysis of multiple distinct pathogen sequences per infected host, because such data make possible to attribute sets of viral lineages to individuals and infer the ancestral relationships between them, which can provide direct evidence into the direction of transmission between two individuals (Leitner and RomeroSeverson, 2018). Such analyses are becoming broadly applicable, because deep sequencing technology now allows generating thousands to millions of distinct pathogen sequence fragments per sample (Gall et al., 2012; Zhang et al., 2020). Prior work focused on software development (Wymant et al., 2017; Skums et al., 2018), validation of the bioinformatics protocol for inferring the direction of transmission (Ratmann et al., 2019; Zhang et al., 2020), and reconstruction of partially observed transmission networks at the population level (Ratmann et al., 2019).
1.3 Semiparametric Poisson flow models
The starting point of this paper is the output of a typical deepsequence phylogenetic analysis (Wymant et al., 2017)
, which includes, for each ordered pair of sampled individuals, a viral phylogenetic measure in
giving a score that transmission occurred from the first to the second individual, possibly via unsampled intermediate individuals (phylogenetic direction scores). This is advantageous, because first, in this canonical form the data enable us to present the estimation problem in terms of a class of Bayesian Poisson models for nonGaussian flow data that can flexibly describe a range of epidemiological questions including transmission in space (1), or by age and sex (2), similar to the Poisson models used for estimating transport or migration flows (Tebaldi and West, 1998; Raymer et al., 2013). Second, the models can account for multilevel sampling heterogeneity, which is typically present in populationbased disease occurrence data but was not emphasised in phylogenetic analysis. We leverage Bayesian data augmentation to adjust for sampling heterogeneity (Givens et al., 1997), and exploit the fact that the additional latent variables can be integrated out in our framework, so that computational inference remains inexpensive. Third, while typical phylodynamic approaches are limited to estimating transmission flows between coarse population strata, for example by age brackets years and years (De Oliveira et al., 2017; Le Vu et al., 2019), we can employ Gaussianprocessbased regularisation techniques to capture fine detail in transmission flows by annual age increments. Specifically, we propose using recently developed Hilbert Space Gaussion Process (HGSP) approximations to ensure the regularisation priors remain computationally tractable (Solin and Särkkä, 2020). This brings our approach into the form of semiparametric Bayesian Poisson models, which enable inference of highresolution transmission flows similar to the Integrated Nested Laplace Approximations used for inferring highresolution human contact matrices (van de Kassteele et al., 2017).In Section 2, we introduce our notation and develop the semiparametric Bayesian Poisson flow model. In Sections 3.13.2, we assess the performance of the Poisson model in estimating transmission flows from samplingbiased data, and identify suitable HGSP approximations. Sections 3.33.6 illustrate our approach on HIV deep sequence data from the Rakai Community Cohort Study (RCCS) of the Rakai Health Sciences program, situated in southeastern Uganda (Grabowski et al., 2017). Between August 10 2011 to January 30 2015, virus from 2652 HIVinfected individuals could be deepsequenced, and 293 pairs of individuals with phylogenetically strong support for the direction of transmission were identified. We demonstrate that the new type of phylogenetic data and our statistical model enable estimation of age and genderspecific transmission flows at finer detail than previously possible while remaining computationally scalable. Particular attention is given to potential sampling biases, and we propose a hierarchical model of the sequence sampling cascade for analysis of transmission flows. Section 4 closes with a discussion.
2 Methodology
2.1 Notation and Definitions
In this section we present the notation that is used to estimate transmission flows in a population of size , during a study period . We define by the identifier of infected individuals in during . The transmission events during the study period can thus be described in a binary matrix , where denotes transmission from person to person , and denotes no transmission. The transmission matrix is not symmetric, and diagonal entries are zero.
We estimate transmission flows between population strata, and denote the strata by and the set of strata by , which is of dimension . The number of transmission events in from group to group are , and the primary object of interest is the flow matrix with entries where . The flow matrix is in general not symmetric, and is subject to . The matrix may contain structural zeros, for example in the case of HIV femaletofemale transmission is extremely unlikely. We denote the number of structurally nonzero entries by , which satisfies .
In general the flow matrix is timedependent due to changes in population composition and varying transmission rates (Anderson and May, 1992). For instance in a compartment model of susceptible (), infected () and treated () men and women of high () and low risk () of onward transmission, the ODE equations pertaining to the male () high risk population are
(4) 
where the force of infection is , the birth/death rate is constant, and the viral suppression rate and transmission rates , are timedependent. The actual, unobserved number of transmissions from high risk women to high risk men in are
(5) 
and the corresponding proportion of transmissions is
(6) 
where is the sum of transmission events in . Here, we focus on estimating transmission flows in a given study period, and for ease of notation drop the dependence of our data and estimates on .
Pathogen deep sequence data are available from sampled, infected individuals. We denote the sampling status vector for all individuals in by , where denotes that person is sampled, and that person is not sampled. The number of sampled individuals is , which corresponds here to the individuals for whom a viral deep sequence is available for analysis. We will characterise population sampling in terms of individuallevel characteristics, such as age or location of residence, that are described with covariates, which we denote with the matrix . The output of the phylogenetic deep sequence analysis can be summarised in an direction score matrix that describes the evidence for transmission from to with the weight . The direction score matrix is not symmetric, diagonal entries are zero, and entries involving unsampled individuals are missing. To estimate the flow matrix , consider the observed flow counts
(7) 
where is a threshold that can be used to select phylogenetically highly supported sourcerecipient pairs, and is taken as a perfect predictor of among sampled individuals. The counts can be arranged into the count matrix , and sum to . In previous studies, was set to or , and was between to (Hall et al., 2019; Ratmann et al., 2020).
The naïve flow estimator is defined by If we suppose that each population group is independently sampled at random with probability , and the actual flows from group to group are , then Consequently, the naïve flow estimator is only unbiased when the population groups were homogeneously sampled, i.e. is the same for all , which is rarely the case (Ratmann et al., 2020).
2.2 Inferring flows from heterogeneously sampled data
The data inputs for estimating transmission flows from pathogen deepsequence data (7) are of the same form as for estimating origindestination matrices with unobserved transport routes (Hazelton, 2001), migration flows (Raymer et al., 2013), or contact intensities (van de Kassteele et al., 2017), prompting us to formulate the statistical model in general terms. Considering the actual, unobserved number of flow events (i.e., transmissions) between all population groups, , the complete data likelihood that arises under mathematical models of the form (4) in a fixed study period is the multinomial
(8) 
where for ease of notation we denote by the vector of nonzero elements of the flow matrices (12), and similarly for and . Model (8
) ignores potential secondorder correlations between flows, for example that a female infected by an older male may be more likely to transmit to men of older age. Since the total number of transmissions is in itself a random variable, we consider the related Poisson model
(9) 
where can be interpreted as the flow intensities from group to group , , and are recovered via .
The actual flows are not observed. We assume that individuals are sampled at random within strata (SARWS). SAWRS implies that sampling is independent of being a source or not, and the likelihood of the observed counts conditional on the complete data is , where is the sampling probability in group . In this class of models the latent flow counts can be conveniently integrated out, yielding for the observed flow counts the Poisson model
(10) 
An important point in this construction is that we are free to choose the stratification in order to accommodate the SARWS assumption. We will show that flow estimates on different, coarser population stratifications that are of primary interest are easily obtained through the aggregation property of the Poisson system (9).
The samplingadjusted maximumlikelihood estimates of and under (10) can be derived under the SARWS assumption. The number of sampled individuals in , , is a Binomial sample of the number of all individuals in , , which leads to
(11) 
where (Supplementary Material, section LABEL:supp:sec:mle).
2.3 Inferring highresolution flows with regularising priors
Bayesian regularisation techniques play a central role in obtaining robust and suitably smoothed flow estimates. Considering population sampling, we exploit additional information on the sampling vector . We assume that flows are independent of sampling, allowing us to decompose the joint posterior distribution into
(12) 
A possible limitation of (11) is that the counts ,
can be small when the population is finely stratified. It is thus often advantageous to model individuallevel sampling probabilities in terms of a linear combination of predictors. Using, for example, a logistic regression approach, we obtain
(13) 
where is the row vector of population characteristics that specify group , are the regression coefficients, and is the posterior density of the regression coefficients, estimated from the sampling status vector of all individuals in the study population.
Considering the prior density on the transmission intensities , in some applications the population strata are unordered such as in application (1). In this case we propose using
(14) 
where is the length of the flow vector , which is equivalent to the number of structurally nonzero entries in the flow matrices (12), and is the number of expected transmission events, . This choice is motivated by the fact that (14) induces on an objective Dirichlet prior density with parameters (Berger et al., 2015), such that the likelihood (10) dominates the prior (14) regardless of the number of flows to estimate. However when the population groups can be ordered, such as the 1year age bands in (2), the structure of the flow model (12) enables using regularising prior densities that penalise against large deviations in flow intensities between similar source and recipient populations. For (2), we opted for (stacked) twodimensional Gaussianprocess priors on the entries of ,
(15) 
where the first entries of correspond to flows in the malefemale direction and the remaining entries correspond to flows in the femalemale direction, is the baseline log transmission intensity, is a scalar on the elements of in the malefemale direction, and ,
are genderspecific squared exponential kernels with variance parameters
and length scales , , , (Rasmussen and Williams, 2006).2.4 Scalable numerical inference
The semiparametric Poisson model can be efficiently fitted with the dynamic Hamiltonian Monte Carlo sampler of the Stan probabilistic programming framework (Carpenter et al., 2017). The implementation uses the Hilbert Space Gaussian process approximation (HSGP) to the GP prior (15) developed by Solin and Särkkä (2020), in which the squared exponential covariance kernel in (15
) is approximated through a series expansion of eigenvalues and eigenfunctions of the Laplacian differential operator on a compact domain
of the input space. In our twodimensional case, we consider , and the boundary points play an important role in the approximation, and need to be specified appropriately.Briefly, the HSGP approximation involves the spectral density associated with the stationary kernel of the GP prior. For the squared exponential kernel with twodimensional inputs, it is given by
(16) 
where denote the frequencies, and are the kernel parameters (Rasmussen and Williams, 2006). The approximation further involves the eigenvalues and eigenfunctions of the Laplacian differential operator. On the compact domain , the th univariate eigenvalues and eigenfunctions in dimension can be computed (Solin and Särkkä, 2020), and are , . The HSGP approximation involves the first and such terms of both dimensions. There are possible combinations of such terms, which we index through . For example, if and , then
For 2D inputs, the eigenvalues and eigenfunctions are the combinations of the univariate eigenvalues and eigenfunctions, , and . This fully specifies the HSGP approximation to Gaussian processes with 2D inputs,
(17) 
which depends on the choice of , and , . In our work, we applied the approximation (17) to each of the stacked Gaussian process prior components in (15). To our knowledge, this is one of the first applications of the HSGP approximation. Note that the structure of the data inputs and the form of the Poisson model is the same in many flow applications (Raymer et al., 2013; van de Kassteele et al., 2017), and so the HSGP approximation could make estimation of highresolution flows also numerically scalable in these settings.
Full details on the algorithm are reported in Supplementary Text SLABEL:sec:algorithms and a tutorial is provided in https://github.com/BDIpathogens/phyloscanner/blob/master/phyloflows/vignettes/08_practical_example.md.
3 Applications
3.1 Accuracy with and without adjustments for sampling heterogeneity
Standard phylodynamic methods ignore sampling differences between population strata (Volz et al., 2009; Le Vu et al., 2019; Scire et al., 2020). We first assessed the impact of sampling heterogeneity on estimating transmission flows in simulation experiments, that are fully reported in Supplementary Material, section LABEL:sec:supp:simulation. The first experiment is a minimal example involving flows between two population groups, which for simplicity we refer to as individuals in rural areas (group ) and individuals in large communities (group ). Transmission chains were simulated under the ODE model (4), i.e. not from our simpler likelihood model (10), and the simulated flow matrix was recorded in replicate simulations (Figure 3A). Observations were drawn under heterogeneous sampling, with fixed sampling probabilities in group , and decreasing sampling probabilities in groub , . First, we estimated flows from (12) assuming no sampling differences between population strata, and which we implemented by setting and to the Beta density with parameters 25.5, 25.5. Second, we estimated flows with information on sampling differences included, by setting to the Beta density with shape parameters , , where are the number of sampled inviduals in group ,
is the population denominator, and the hyperparameters
, were both set to under Jeffrey’s prior on the Binomial sampling probabilities. The density was specified analogously. Figure 3B compares the accuracy in the posterior median flow estimates in terms of the worst case error (WCE) , and illustrates that the Poisson model (12) can minimise the impact of sampling heterogeneity on flow estimates. More complex simulation experiments yielded similar results (Supplementary Material, section LABEL:sec:supp:simulation).3.2 Accuracy with different smoothing priors
Next, we assessed on simulations the impact of different regularising prior densities on estimating flows across orderable population strata, which is typically not considered in standard phylodynamic inference approaches (Volz et al., 2009; De Oliveira et al., 2017; Le Vu et al., 2019; Scire et al., 2020; Bbosa et al., 2020). We focused on age and genderspecific transmission flows by year age inputs between  years, resulting in flow combinations, and simulated transmission pairs from the GP model (15) using groundtruth parameters that were motivated by analyses of the Rakai data set (see Supplementary Material, section LABEL:sec:supp:simulation). Figure 4A illustrates the simulated transmission pairs from men to women, and the corresponding underlying flows in of simulations generated. Transmission flows were then inferred using the HSGP approximation (17) to (15) in our semiparametric Poisson flow model. We varied the number of basis functions from to by setting , and chose as HSGP domain an expanded version of the input domain, , with boundary factor . To have a benchmark for the performance of the HSGP approximations, inferences were also performed using the GP prior (15), from which the data were simulated. Priors for all parameters are described in the Supplementary Material, section LABEL:sec:supp:simulation. Figure 4B shows that relative to using GP priors, average HMC runtimes across the simulated data sets improved by more than tenfold with HSGP priors when the number of basis functions was less than . Figure 4B also summarises the average mean absolute error in posterior median flow estimates. Similar to RiutortMayol et al. (2020), our results indicate that the HSGP approximations were least accurate for small , and for larger boundary factors when the number of basis functions was not increased simultaneously. On our simulations, HSGP approximations performed almost as well as GPs for the tuning parameters , or . As computational cost increases with , we chose .
3.3 Application to populationbased deepsequence data from Rakai, Uganda
We illustrate application of the semiparametric Poisson flow model (12) on a populationbased sample of HIV deep sequences from the RCCS in southeastern Uganda at the shores of Lake Victoria (Ratmann et al., 2019, 2020). Between 2011/08/10 to 2015/01/30, two survey rounds were conducted in 36 inland communities, and three survey rounds in 4 fishing communities (Figure 2A). Preceeding each survey, a household census was conducted to identify individuals aged 1549 years who lived in the communities for at least one month and with intention to stay, who were eligible to participate. In brief, there were 37645 censuseligible individuals, of whom 25882 (68.8%) participated in the RCCS. Participation was higher among women than men, increased with age for both men and women, and was similar in fishing and inland communities (Supplementary Material, section LABEL:sec:supp:regression). 11404 (96.9%) of nonparticipants were absent for school or work. Infected individuals who did not report ART use were selected for sequencing, and deepsequencing rates were higher among men than women, similar by age for men and women, and were higher in fishing communities (Supplementary Material, section LABEL:sec:supp:regression). There were heterosexual pairs with phylogenetic support for linkage and direction of transmission (sourcerecipient pairs) when using the threshold in (7). The estimated infection times of the recipients were between 2009/10/01 and 2015/01/30, which defined the study period during which we estimated transmission flows. Figure 2B illustrates the reconstructed sourcerecipient pairs by age of both individuals at the midpoint of the study period.
To interpret these observations, we defined as denominator transmission events to censuseligible individuals in RCCS communities who were infected during the study period , and formalised the individual steps in the sampling cascade of transmission events (Figure 5A). The sources and recipients had to participate in at least one survey round between 2011/08/10 and 2015/01/30, report no ART use, and have virus sequenced successfully. In Rakai, each survey was preceeded by a household census, and with this denominator we numerically estimated age, gender, and locationspecific conditional sampling probabilities at each step of the sampling cascade using Bayesian logisticBinomial regression models, and then multiplied Monte Carlo draws from these distributions to numerically approximate the posterior distribution of the overall sampling probabilities , , (see Supplementary Material, section LABEL:sec:supp:regression). Figure 5B illustrates the resulting, overall sampling probabilities of femaletomale transmissions in fishing communities. The estimated sampling probabilities indicate that the observed data overrepresent transmissions between older individuals.
3.4 Transmission flows between areas with high and low disease prevalence
We then used the sourcerecipient data of Figure 2B to address problem (1) and estimate transmission HIV flows within and between high and lowprevalence RCCS communities. The highprevalence communities comprised the four fishing communities, and the lowprevalence communities comprised the remaining 36 inland communities. Detailed analyses have been reported in Ratmann et al. (2020); here we focus on illustrating how known sampling heterogeneities can be accounted for, and how they affect inferences of transmission flow.
The participation, ART use, and sequence sampling probabilities differed by gender, age, and location, and we stratified the population accordingly to meet the SARWS assumption that underlies the Poisson flow model (Section 2.2). Specifically, we stratified populations by gender, 1year age bands (between 15 and 49 years), and resident location (low or high prevalence), which resulted in 140 sampling groups. Following our sampling cascade model (Figure 5), we then sought to estimate transmission flows between the age and genderspecific transmission flows for each of the combinations of geographic source and recipient locations, through the joint posterior distribution (12). We further accounted for geographic inmigration, resulting in flow variables. On this highresolution flow space, we were able to directly apply the estimated, structured sampling probabilities that are illustrated in Figure 5B. This shows that accounting for the observed heterogeneities in how the census population was sampled resulted in a more complex inferential problem than (12) suggest.
To regularise inferences, we used the HSGP approximation (17) to the stacked Gaussian process prior (15). We further sought to allow for differences in transmission dynamics across locations, and for this reason specified independent HSGP priors on the parts of the flow space that correspond to transmissions to low prevalence areas, and on the parts of the flow space that correspond to transmissions to high prevalence areas. Numerical inference of the joint posterior density (12) took hours on four
Ghz processors with Stan version 2.19. There were no convergence, mixing, or divergence warnings, as long as informative prior densities on the length scale hyperparameters were chosen, which we set by matching their 99% credible ranges to the empirical 99% quantiles in Figure
2. Thus flow inferences remained computationally manageable even in the highresolution space considered here.Figure 6B shows the marginal posterior estimates of the aggregated flow vector , Equation (1), when sampling heterogeneity was ignored by setting all to the average sampling probability (red), and when gender, age, and locationspecific participation and sequence sampling probabilities of sources and recipients were accounted for as described above (turquoise). The average sampling difference between individuals in fishing and inland communities was 7.16%, suggesting based on our results in Figure 3B that after accounting for sampling differences, the samplingadjusted estimates could differ by up to 5% from the unadjusted estimates. Figure 6B shows that our results are in line with this expectation. The estimated flow ratio (inlandfishing / fishinginland) was 2.18 (1.064.71) when sampling heterogeneity was accounted for, and 2.58 (1.235.86) when sampling heterogeneity was not accounted for. We thus see that sampling heterogeneity can have an impact on flow estimates, and that the finding that highprevalence fishing communities were net sinks, and not sources, of local infection flows is robust to sampling heterogeneity.
3.5 Transmission flows between age groups
We next turned to problem (2) and estimated transmission flows by age and gender from the sourcerecipient data shown in Figure 2. Here, we focus on illustrating how the HSGP prior in the Poisson model allows us to borrow information across data points, and thereby go beyond existing phylodynamic methods (De Oliveira et al., 2017; Le Vu et al., 2019; Bbosa et al., 2020) and make flow inferences by 1year age bands.
To do this, we compared estimates of the source vectors , defined in (3a), when we used the independent Gamma prior density (14) (no regularisation) to those when we used the HSGP prior density (17) (with regularisation). We focused the comparison on the age and genderspecific sources of infection regardless of location, i.e. the source vector that corresponds to Equation (2), which was obtained by aggregating over the high and lowprevalence locations of the source and recipient population groups. Figures 7AB show the posterior median source estimates for each recipient group respectively without regularisation when using year age bands and with regularisation using year age bands, and Figures 7CD show the corresponding posterior coefficients of variation. The estimated coefficients of variation were similar with and without regularisation, and well below with regularisation, except from sources associated with little contribution to onward transmission and for very young or very old recipients.These findings suggest that the year flow estimates are statistically meaningful, and at high resolution provide better insights. More detailed analyses are reported in Supplementary Text S5. First, we document the obvious, that it is not possible to estimate flow variables from data points without regularisation. Second, we show that, as age bands are widened, estimates increasingly depend on the particular start and end points of the chosen age strata, and so we caution against inferences by year age bands or wider.
3.6 Sources of transmission to women aged <25 years
Across subSaharan Africa, HIV prevalence rises rapidly among young women aged <25 years (UNAIDS, 2018), which has prompted efforts to prevent infection among adolescent girls and young women, most notably the DREAMS partnership (Saul et al., 2018). Infections among young women aged <25 years are commonly attributed to older men, with a recent phylogenetic study from South Africa finding that of 60 identified transmission pairs involving women aged <25 years, 42 (70.0%) had a probable male partner aged >25 years (De Oliveira et al., 2017). Our larger data set with directional phylogenetic information allowed us to revisit these estimates.
Our data contained 96 sourcerecipient pairs involving women aged <25 years, of whom 59 (61.5%) originated from men and 37 (38.5%) from women. However, under the regularising HSGP prior density (17), our gender and agespecific flow estimates also borrow information from the other sourcerecipient pairs that involved older women. We report in Figure 8 the estimated sources of HIV infection in women of increasing age. The facets show the contribution of each source, men younger or the same age, men up to years older, and men more than years older, and sum to % for each age of infected women on the xaxis. At age 15, an estimated 91.8% (77.6%  97.7%) of women were infected by men more than 5 years older, while at age 20, this was 71.8% (60.6%  80.7%), and at age 25 this was 47.8% (38.2%  57.4%). These estimates document the overwhelming impact that men more than years older have on driving infection in very young women in our observation period , and they show that the contribution of these men on infection in women declines rapidly with the age of the women. We provide exact estimates in Table S9.
4 Discussion
In this study we introduce a class of semiparametric Bayesian Poisson models for estimating highresolution flows between population strata, and apply the model to estimate the sources of HIV infections from pathogen deepsequence data. The modelling framework is flexible, and enables addressing a range of epidemiological questions on pathogen spread between geographic areas, by year age bands, gender, or indeed other discretelyvalued sociodemographic characteristics. We templated the model with and without Hilbert space approximations for scalable inference of highresolution flows in generic Stan model files, and hope that given the canonical structure of the semiparametric Poisson flow model, these will be helpful in other movement, origindestination or flow applications as well (Tebaldi and West, 1998; Hazelton, 2001; Raymer et al., 2013; Lindström et al., 2013; Faye et al., 2015; van de Kassteele et al., 2017; Sun et al., 2021).
Existing phylodynamic estimation approaches are tailored for pathogen consensus sequences (Lemey et al., 2009; Vaughan et al., 2014; Volz et al., 2009; Scire et al., 2020). The approach described here is tailored for pathogen deepsequence data, in that an observed, timehomogeneous flow matrix is required as input, which can be derived through aggregation from individual sourcerecipient relationships in deepsequence phylogenies. The main advantages are first, that populationlevel spread can be directly estimated from individual sourcerecipient relationships, and modelled in terms of associated individual covariates. Second, relatively little computational effort is needed to fit the Poisson flow model (1012) to deepsequence data, because it falls within the class of Bayesian hierarchical models for binary data, for which efficient fitting and regularisation techniques exist (Carpenter et al., 2017; Rasmussen and Williams, 2006; Solin and Särkkä, 2020). This makes it computationally feasible to investigate complex aspects of disease spread such as populationlevel HIV transmission by 1year age bands. Third, differences in how the phylogenetic data were sampled for each stratum can be explicitly accounted for in the model, which is particularly important for characterising HIV transmission, which tends to concentrate in marginalised, vulnerable, and hard to reach populations (UNAIDS, 2019). We found relatively small differences in the estimated sources of infection in the study communities by location and age when inferences were performed with and without sampling adjustments, in line with the relatively limited differences in sampling inclusion probabilities in the RCCS communities and the expected impact on source attribution on simulated data (Figure 3B). However in other use cases, the impact of sampling heterogeneity on source attribution can be substantially larger. For example, the RCCS included in surveyed communities an estimated 75.7% of the lakeside population within 3km of the shoreline of Lake Victoria along the Rakai region, and an estimated 16.2% of the inland population of the Rakai region (Ratmann et al., 2020). We can use the proposed framework to extrapolate our inferences from the RCCS communities to the underlying population, and given the larger sampling differences we estimated that 88.7% (84.5–91.9%) of transmissions in the Rakai region occurred in the inland population (Ratmann et al., 2020). Specifying and characterising the denominator population is thus crucial for interpreting phylogenetic source attribution estimates, and we believe the proposed Bayesian semiparametric flow model provides a useful tool in this endeavour.
The method we propose has limitations. First, the method requires pathogen deepsequence data instead of consensus sequences, which at present are uncommon, though increasingly generated in routine clinical care (Houlihan et al., 2018). Second, current deepsequencing protocols typically generate short sequence fragments, usually of to base pairs in length after trimming adaptors and low quality ends, and merging paired end fragments. This implies that pathogens need to evolve at a fast rate, because otherwise reconstructed deepsequence phylogenies do not contain the pattern of ancestral subgraphs that is characteristic of pathogen spread in one direction. Such high evolutionary rates are typical for viral pathogens that infect and evolve in humans over long periods of time, such as HIV or hepatitis C. We expect that the methods developed here will become applicable to a broad range of viral and bacterial infectious diseases as existing deepsequencing methods that generate substantially longer pathogen sequence fragments become cheaper (Rhoads and Au, 2015). Third, our inferences are based on sourcerecipient pairs with strong evidence for the direction of transmission, which is a subset of all the data available, and we cannot exclude that this selection step introduces bias into flow estimates.Fourth, the model was not designed to estimate time changes in transmission flows. While in principle it is possible to add time as a covariate to the linear predictor of the log transmission intensities (15), the resulting flow estimates will in general not be consistent with the constraints imposed by standard assumptions on disease spread, as in Equations (46), and other techniques such as the structured coalescent may be better suited (Volz et al., 2009).
Reducing HIV incidence among adolescent and young women is a key priority for public health programs across subSaharan Africa to achieve epidemic control milestones (UNAIDS, 2018). The DREAMS intervention aims to promote determined, resilient, empowered, AIDSfree, mentored, and safe adolescent girls and young women, and includes educational programs that aim to address the sociobehavioral factors that underlie vulnerability and infection risk (Saul et al., 2018). Our analysis of a large crosssectionally sampled HIV deepsequence data set from Rakai, Uganda, supports previous analyses (De Oliveira et al., 2017; Probert et al., 2019; Bbosa et al., 2020) and indicates that 68.9% (60.2%76.9%) of adolescent and young women aged <25 years acquired HIV in agedisparate relationships with men at least 5 years older. The estimated proportion of infections attributable to agedisparate relationships was approximately 90% among adolescent girls, and decreased to approximately 50% among women aged 25 years. Taken together, the data from this study and other phylogenetic studies from Uganda, Zambia, and South Africa suggest that rapid increases in HIV prevalence among adolescent and young women may be driven by the same source populations across subSaharan Africa, and support DREAMS interventions that include clear prevention messages about agedisparate sexual relationships.
Acknowledgements
This study was supported by the Bill & Melinda Gates Foundation (OPP1175094, OPP1084362), the National Institute of Allergy and Infectious Diseases (R01AI110324, U01AI100031, U01AI075115, R01AI102939, K01AI12508601), National Institute of Mental Health (R01MH107275), the National Institute of Child Health and Development (RO1HD070769, R01HD050180), the Division of Intramural Research of the National Institute for Allergy and Infectious Diseases, the World Bank, the Doris Duke Charitable Foundation, the Johns Hopkins University Center for AIDS Research (P30AI094189), and the Presidents Emergency Plan for AIDS Relief through the Centers for Disease Control and Prevention (NU2GGH000817). We acknowledge data management support provided in part by the Office of Cyberinfrastructure and Computational Biology at the National Institute for Allergy and Infectious Diseases, computational support through the Imperial College Research Computing Service, doi: 10.14469/hpc/2232. We thank the participants of the Rakai Community Cohort Study and the many staff and investigators who made this study possible, as well as the PANGEAHIV steering committee, the RCCS leadership, and two anonymous reviewers for their helpful comments on this manuscript.
List of Supplementary Material
S1. Maximum likelihood flow estimates under heterogeneous sampling
S2. Numerical inference algorithms
S3. Simulation experiments
S4. Modelling and estimation of the sampling cascade
S5. Analyses using different age bands
S6. Supplementary Figures and Tables
Stan files were provided as text documents.
prior_gamma.txt
prior_gp.txt
prior_gp_approx.txt
References
 AbelerDörner et al. (2019) AbelerDörner, L., Grabowski, M. K., Rambaut, A., Pillay, D., Fraser, C. et al. (2019) PANGEAHIV 2: Phylogenetics and networks for generalised epidemics in africa. Current Opinion in HIV and AIDS, 14, 173–180.
 Ailloud et al. (2019) Ailloud, F., Didelot, X., Woltemate, S., Pfaffinger, G., Overmann, J., Bader, R. C., Schulz, C., Malfertheiner, P. and Suerbaum, S. (2019) Withinhost evolution of helicobacter pylori shaped by nichespecific adaptation, intragastric migrations and selective sweeps. Nature communications, 10, 1–13.
 Anderson and May (1992) Anderson, R. M. and May, R. M. (1992) Infectious diseases of humans: dynamics and control. Oxford University Press.
 BarréSinoussi et al. (2018) BarréSinoussi, F., Abdool Karim, S. S., Albert, J., Bekker, L.G., Beyrer, C., Cahn, P., Calmy, A., Grinsztejn, B., Grulich, A., Kamarulzaman, A. et al. (2018) Expert consensus statement on the science of hiv in the context of criminal law. Journal of the International AIDS Society, 21, e25161.
 Bbosa et al. (2020) Bbosa, N., Ssemwanga, D., Ssekagiri, A., Xi, X., Mayanja, Y., Bahemuka, U., Seeley, J., Pillay, D., AbelerDörner, L., Golubchik, T. et al. (2020) Phylogenetic and demographic characterization of directed HIV1 transmission using deep sequences from highrisk and general population cohorts/groups in uganda. Viruses, 12, 331.
 Berger et al. (2015) Berger, J. O., Bernardo, J. M. and Sun, D. (2015) Overall objective priors. Bayesian Analysis, 10, 189–221.
 Carpenter et al. (2017) Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P. and Riddell, A. (2017) Stan: A probabilistic programming language. Journal of Statistical Software, 76.
 Cohen et al. (2011) Cohen, M. S., Chen, Y. Q., McCauley, M., Gamble, T., Hosseinipour, M. C., Kumarasamy, N., Hakim, J. G., Kumwenda, J., Grinsztejn, B., Pilotto, J. H. et al. (2011) Prevention of hiv1 infection with early antiretroviral therapy. New England journal of medicine, 365, 493–505.
 De Oliveira et al. (2017) De Oliveira, T., Kharsany, A. B., Gräf, T., Cawood, C., Khanyile, D., Grobler, A., Puren, A., Madurai, S., Baxter, C., Karim, Q. A. et al. (2017) Transmission networks and risk of hiv infection in KwaZuluNatal, South Africa: a communitywide phylogenetic study. The Lancet HIV, 4, e41–e50.
 DwyerLindgren et al. (2019) DwyerLindgren, L., Cork, M. A., Sligar, A., Steuben, K. M., Wilson, K. F., Provost, N. R., Mayala, B. K., VanderHeide, J. D., Collison, M. L., Hall, J. B. et al. (2019) Mapping HIV prevalence in subSaharan Africa between 2000 and 2017. Nature, 570, 189.
 Faria et al. (2014) Faria, N. R., Rambaut, A., Suchard, M. A., Baele, G., Bedford, T., Ward, M. J., Tatem, A. J., Sousa, J. D., Arinaminpathy, N., Pépin, J. et al. (2014) The early spread and epidemic ignition of HIV1 in human populations. Science, 346, 56–61.
 Faye et al. (2015) Faye, O., Boëlle, P.Y., Heleze, E., Faye, O., Loucoubar, C., Magassouba, N., Soropogui, B., Keita, S., Gakou, T., Koivogui, L. et al. (2015) Chains of transmission and control of ebola virus disease in conakry, guinea, in 2014: an observational study. The Lancet Infectious Diseases, 15, 320–326.
 Gall et al. (2012) Gall, A., Ferns, B., Morris, C., Watson, S., Cotten, M., Robinson, M., Berry, N., Pillay, D. and Kellam, P. (2012) Universal amplification, nextgeneration sequencing, and assembly of HIV1 genomes. Journal of Clinical Microbiology, 50, 3838–3844.
 Givens et al. (1997) Givens, G. H., Smith, D. and Tweedie, R. (1997) Publication bias in metaanalysis: a Bayesian dataaugmentation approach to account for issues exemplified in the passive smoking debate. Statistical Science, 221–240.
 Grabowski et al. (2017) Grabowski, M. K., Serwadda, D. M., Gray, R. H., Nakigozi, G., Kigozi, G., Kagaayi, J., Ssekubugu, R., Nalugoda, F., Lessler, J., Lutalo, T. et al. (2017) HIV prevention efforts and incidence of HIV in uganda. New England Journal of Medicine, 377, 2154–2166.
 Hall et al. (2019) Hall, M. D., Holden, M. T., Srisomang, P., Mahavanakul, W., Wuthiekanun, V., Limmathurotsakul, D., Fountain, K., Parkhill, J., Nickerson, E. K., Peacock, S. J. et al. (2019) Improved characterisation of MRSA transmission using withinhost bacterial sequence diversity. eLife, 8, e46402.
 Hayes et al. (2019) Hayes, R. J., Donnell, D., Floyd, S., Mandla, N., Bwalya, J., Sabapathy, K., Yang, B., Phiri, M., Schaap, A., Eshleman, S. H., PiwowarManning, E., Kosloff, B., James, A., Skalland, T., Wilson, E., Emel, L., Macleod, D., Dunbar, R., Simwinga, M., Makola, N., Bond, V., Hoddinott, G., Moore, A., Griffith, S., Deshmane Sista, N., Vermund, S. H., ElSadr, W., Burns, D. N., Hargreaves, J. R., Hauck, K., Fraser, C., Shanaube, K., Bock, P., Beyers, N., Ayles, H. and Fidler, S. (2019) Effect of Universal Testing and Treatment on HIV Incidence — HPTN 071 (PopART). New England Journal of Medicine, 381, 207–218. PMID: 31314965.
 Hazelton (2001) Hazelton, M. L. (2001) Inference for origin–destination matrices: estimation, prediction and reconstruction. Transportation Research Part B: Methodological, 35, 667–676.
 Houlihan et al. (2018) Houlihan, C. F., Frampton, D., Ferns, R. B., Raffle, J., Grant, P., Reidy, M., Hail, L., Thomson, K., Mattes, F., Kozlakidis, Z. et al. (2018) Use of wholegenome sequencing in the investigation of a nosocomial influenza virus outbreak. The Journal of Infectious Diseases, 218, 1485–1489.
 van de Kassteele et al. (2017) van de Kassteele, J., van Eijkeren, J., Wallinga, J. et al. (2017) Efficient estimation of agespecific social contact rates between men and women. The Annals of Applied Statistics, 11, 320–339.
 Le Vu et al. (2019) Le Vu, S., Ratmann, O., Delpech, V., Brown, A. E., Gill, O. N., Tostevin, A., Dunn, D., Fraser, C., Volz, E. M. and Database, U. H. D. R. (2019) HIV1 transmission patterns in Men Who Have Sex with Men: Insights from genetic source attribution analysis. AIDS Research and Human Retroviruses, 35, 805–813.
 Leitner and RomeroSeverson (2018) Leitner, T. and RomeroSeverson, E. (2018) Phylogenetic patterns recover known HIV epidemiological relationships and reveal common transmission of multiple variants. Nature Microbiology, 3, 983–988.
 Lemey et al. (2009) Lemey, P., Rambaut, A., Drummond, A. J. and Suchard, M. A. (2009) Bayesian phylogeography finds its roots. PLoS Computational Biology, 5, e1000520.
 Lindström et al. (2013) Lindström, T., Grear, D. A., Buhnerkempe, M., Webb, C. T., Miller, R. S., Portacci, K. and Wennergren, U. (2013) A bayesian approach for modeling cattle movements in the united states: scaling up a partially observed network. PLoS One, 8, e53432.
 Miller et al. (2019) Miller, H. J., Dodge, S., Miller, J. and Bohrer, G. (2019) Towards an integrated science of movement: converging research on animal movement ecology and human mobility science. International Journal of Geographical Information Science, 33, 855–876.
 Poon et al. (2016) Poon, A. F., Gustafson, R., Daly, P., Zerr, L., Demlow, S. E., Wong, J., Woods, C. K., Hogg, R. S., Krajden, M., Moore, D. et al. (2016) Near realtime monitoring of HIV transmission hotspots from routine HIV genotyping: an implementation case study. The Lancet HIV, 3, e231–e238.
 Probert et al. (2019) Probert, W., Hall, M., Xi, X., Sauter, R., Golubchik, T., Bonsall, D., Abeler‐Dörner, L., Pickles, M., Cori, A., Bwalya, J., Floyd, S., Mandla, N., Shanaube, K., Yang, B., Ayles, H., Bock, P., Donnell, D., Grabowski, K., Pillay, D., Rambaut, A., Ratmann, O., Fidler, S., Hayes, R., Fraser, C., consortium, P. and the HPTN 071 (PopART) study team (2019) Quantifying the contribution of different aged men and women to onwards transmission of HIV‐1 in generalised epidemics in sub‐Saharan Africa: A modelling and phylogenetics approach from the HPTN071 (PopART) trial.

Rasmussen and Williams (2006)
Rasmussen, C. E. and Williams, C. (2006)
Gaussian processes for Machine Learning
. MIT Press.  Ratmann et al. (2019) Ratmann, O., Grabowski, M. K., Hall, M., Golubchik, T., Wymant, C., AbelerDörner, L., Bonsall, D., Hoppe, A., Brown, A. L., de Oliveira, T. et al. (2019) Inferring HIV1 transmission networks and sources of epidemic spread in africa with deepsequence phylogenetic analysis. Nature Communications, 10, 1–13.
 Ratmann et al. (2020) Ratmann, O., Kagaayi, J., Hall, M., Golubchick, T., Kigozi, G., Xi, X., Wymant, C., Nakigozi, G., AbelerDörner, L., Bonsall, D. et al. (2020) Quantifying HIV transmission flow between highprevalence hotspots and surrounding communities: a populationbased study in Rakai, Uganda. The Lancet HIV, 7, PE173–E183.
 Ratmann et al. (2016) Ratmann, O., Van Sighem, A., Bezemer, D., Gavryushkina, A., Jurriaans, S., Wensing, A., De Wolf, F., Reiss, P., Fraser, C. et al. (2016) Sources of HIV infection among men having sex with men and implications for prevention. Science Translational Medicine, 8, 320ra2.
 Raymer et al. (2013) Raymer, J., Wiśniowski, A., Forster, J. J., Smith, P. W. and Bijak, J. (2013) Integrated modeling of european migration. Journal of the American Statistical Association, 108, 801–819.
 Rhoads and Au (2015) Rhoads, A. and Au, K. F. (2015) PacBio sequencing and its applications. Genomics, Proteomics & Bioinformatics, 13, 278–289.
 RiutortMayol et al. (2020) RiutortMayol, G., Bürkner, P.C., Andersen, M. R., Solin, A. and Vehtari, A. (2020) Practical hilbert space approximate bayesian gaussian processes for probabilistic programming. arXiv preprint arXiv:2004.11408.
 Saul et al. (2018) Saul, J., Bachman, G., Allen, S., Toiv, N., Cooney, C. and Beamon, T. (2018) Determined resilient empowered AIDSfree mentored and safe (DREAMS): What is the core package and why now. PLOS One, 13, e0208167.
 Scire et al. (2020) Scire, J., BaridoSottani, J., Kühnert, D., Vaughan, T. G. and Stadler, T. (2020) Improved multitype birthdeath phylodynamic inference in BEAST 2. bioRxiv, 2020.01.06.895532.
 Skums et al. (2018) Skums, P., Zelikovsky, A., Singh, R., Gussler, W., Dimitrova, Z., Knyazev, S., Mandric, I., Ramachandran, S., Campo, D., Jha, D. et al. (2018) QUENTIN: reconstruction of disease transmissions from viral quasispecies genomic data. Bioinformatics, 34, 163–170.
 Solin and Särkkä (2020) Solin, A. and Särkkä, S. (2020) Hilbert space methods for reducedrank Gaussian process regression. Statistics and Computing, 30, 419–446.
 Sun et al. (2021) Sun, K., Wang, W., Gao, L., Wang, Y., Luo, K., Ren, L., Zhan, Z., Chen, X., Zhao, S., Huang, Y. et al. (2021) Transmission heterogeneities, kinetics, and controllability of sarscov2. Science, 371.

Tebaldi and West (1998)
Tebaldi, C. and West, M. (1998) Bayesian inference on network traffic using link count data.
Journal of the American Statistical Association, 93, 557–573.  UNAIDS (2018) UNAIDS (2018) Miles to go: closing gaps, breaking barriers, righting justice, document jc2924. URL: https://www.unaids.org/sites/default/files/media_asset/milestogo_en.pdf.
 UNAIDS (2019) — (2019) UNAIDS Data 2019, document jc2959e. URL: https://www.unaids.org/en/resources/documents/2019/2019UNAIDSdata.
 Vaughan et al. (2014) Vaughan, T. G., Kühnert, D., Popinga, A., Welch, D. and Drummond, A. J. (2014) Efficient Bayesian inference under the structured coalescent. Bioinformatics, 30, 2272–2279.
 Volz et al. (2013) Volz, E. M., Ionides, E., RomeroSeverson, E. O., Brandt, M.G., Mokotoff, E. and Koopman, J. S. (2013) HIV1 transmission during early infection in Men who have Sex with Men: a phylodynamic analysis. PLoS Medicine, 10, e1001568.
 Volz et al. (2009) Volz, E. M., Pond, S. L. K., Ward, M. J., Brown, A. J. L. and Frost, S. D. (2009) Phylodynamics of infectious disease epidemics. Genetics, 183, 1421–1430.
 Wymant et al. (2017) Wymant, C., Hall, M., Ratmann, O., Bonsall, D., Golubchik, T., de Cesare, M., Gall, A., Cornelissen, M., Fraser, C., STOPHCV Consortium, The Maela Pneumococcal Collaboration and The BEEHIVE Collaboration (2017) PHYLOSCANNER: inferring transmission from withinand betweenhost pathogen genetic diversity. Molecular Biology and Evolution, 35, 719–733.
 Zhang et al. (2020) Zhang, Y., Wymant, C., Laeyendecker, O., Grabowski, M. K., Hall, M., Hudelson, S., PiwowarManning, E., McCauley, M., Gamble, T., Hosseinipour, M. C. et al. (2020) Evaluation of phylogenetic methods for inferring the direction of HIV transmission: HPTN 052. Clinical Infectious Diseases, Epub ahead of print, ciz1247.
Comments
There are no comments yet.