Generate Country-Scale Networks of Interaction from Scattered Statistics

It is common to define the structure of interactions among a population of agents by a network. Most agent-based models have been shown to be highly sensitive to that network, so the relevance of simulation results directly depends on its descriptive power. When studying social dynamics in large populations, the network cannot be collected; it is instead generated by algorithms that aim to fit general properties of social networks. However, more precise data is available at a country scale in the form of socio-demographic studies, censuses or sociological studies. These "scattered statistics" provide rich information, especially on agents' attributes, on similar properties of tied agents and on affiliations. In this paper, we propose a generic methodology to bring these scattered statistics together with Bayesian networks. We explain how to generate a population of heterogeneous agents, and how to create links by using both scattered statistics and knowledge on social selection processes. The methodology is illustrated by generating an interaction network for rural Kenya which includes familial structure, colleagues and friendship, constrained by field studies and statistics.




1 Context & problematic

1.1 Problematic

The principle of agent-based models is to reproduce collective dynamics from local interactions. Any agent-based simulation therefore requires a descriptive model of interactions in a population. As these relationships are relatively stable [1], it has become common to represent them using the social network metaphor: the structure of interactions is represented by a graph whose nodes are the agents of the population and whose edges are the links between these agents. That structure was shown to have a dramatic influence on the dynamics of various agent-based models. The direct consequence is that the descriptive power of the structure determines the relevance of simulation results. To ensure that the generated relationships network is descriptive, its construction should be treated as a modeling problem: the structure of relationships should be a simplification of social interactions, and comply with knowledge on the modeled population.

While the interaction network can be collected through interviews when the population is small, such data collection becomes intractable for larger populations. Yet many models deal with country-scale populations, including models of opinion dynamics, virus propagation or information dynamics. We are ourselves interested in modeling the diffusion of innovations [2] and the propagation of information about these innovations [3]. In that field, the lack of a descriptive model of interactions was pointed out as one fundamental limitation of models [4]. It is common to use network generators to describe such large populations. A network generator is an algorithm which, given several parameters, generates networks compliant with one or more properties observed in real networks.

Ideally, a network generator should satisfy the following requirements (noted R). (R1) It should generate models of large populations, in order to improve the descriptive power of agent-based models for large-scale simulations. (R2) The kinds of relationships linking two agents should be represented, because interactions do not occur in the same way across different relationships. For instance, finding a job was shown to be more efficient when activating so-called “weak ties” (for instance distant relatives) [5]. (R3) Attributes of agents should be detailed in the network of interactions. This is justified by three main reasons. First, attributes of individuals influence individual judgment and decision-making (e.g. in diffusion of innovations [4]), so they should be made available in the model. Second, attributes influence the frequency or nature of interaction: spatial distance reduces the frequency of exchanges, differences in ethnicity and interests lower the normative influence, etc. Third, it was shown (as explained in Section 1.2) that individual characteristics determine an agent’s choice of acquaintances, so agents’ attributes should be taken into account during network generation.

1.2 Key findings for social networks

Decades of research on social networks have highlighted several key findings which are widely accepted today. A stream of research explored social selection processes [6] to understand how agents create ties. It appeared that individuals exhibit a strong tendency to create relationships with people sharing similar characteristics (homophily) [7]. Two individuals sharing a common affiliation (event, project or workplace) are also more likely to tie and interact frequently [1]. Transitivity states that two individuals sharing a common acquaintance are more likely to be connected: they have more chances to meet and to create a tie because of their common friend. These observations are no longer questioned, so (R4) a relevant network of interactions should comply with these processes of social selection.
The stream of social network analysis also described statistical properties shared by social networks, including a surprisingly short average distance between individuals, a high clustering rate (that is, there exist groups or communities in which individuals are strongly interconnected), and a low density [1]. A power-law distribution of degrees was also observed in various datasets (most individuals have few acquaintances, while a few have a high degree). It was explained by the so-called preferential attachment principle, which states that new individuals in a network are more likely to connect with nodes that already have a high degree.

Beyond these general properties of social networks, each national institute of statistics publishes detailed data for its country. Statistics describe who the individuals are by quantifying characteristics such as gender, age, ethnicity, socioeconomic class, income, marital status, etc. They also study what people do: working or not, kind of activity, participation in associative life, sport, etc. These activities can often be interpreted as affiliations, with detailed information on the agents which are part of the institution (common characteristics of workers like educational level or socioeconomic class, as well as geographical location) and on the affiliations themselves (size, location). More qualitative knowledge also exists on the structure of families, as well as statistics on the number of children or household composition. When that kind of data was not collected at a large scale, it is still available from field studies focused on more precise phenomena. As an illustration for this paper, we choose to model social relationships in rural Kenya, for which we have no data of our own. Demographic statistics [8], sociological studies on the structure of families (e.g. [9]) and field studies on the diffusion of contraceptive use [10, 11] constitute as many sources of information. Surprisingly, no network generator uses these scattered statistics, although they constitute an appreciable part of the knowledge on the structure of interactions. (R5) We claim that these scattered statistics should be taken into account while generating a network of interactions.

1.3 Existing models

The most widely used generators for agent-based models are small-world networks and scale-free graphs [12]. The first generates highly clustered networks with a short average path length, while the second implements the preferential attachment principle. These models, proposed by physicists, generate highly stylized networks in which social relationships between heterogeneous agents become links between nodes. They neither comply with knowledge on social selection processes nor rely on statistics available for a given population. To summarize, they do not satisfy our requirements R2, R3, R4 or R5.

In the frame of social network analysis [1], many models were proposed (see [6] for a synthetic picture). The existence of a link between two agents $i$ and $j$ is considered to be a random variable $X_{ij}$ which takes value $1$ if a link exists. Random graphs with attributes generate links given a vector of agents’ attributes $Y$: $P(X_{ij} = 1 \mid Y_i, Y_j)$. If one uses only that constraint, a tie between two agents is independent of any other tie with other agents, in contradiction with the evidence on transitivity. That assumption was removed by Markov random graphs [13], which allow two links to be dependent if they share a node. In that case links are noted as the conditional probability $P(X_{ij} = 1 \mid X_{kl})$, with $\{i, j\} \cap \{k, l\} \neq \emptyset$. Recent extensions of these models [6] take into account both links created given agents’ attributes and transitive links; to date, they remain limited to one or two attributes [6].
That formalism is powerful enough to describe homophily and transitivity. Its relevance was demonstrated by fitting data collected from small groups. It was also shown [6] that the affiliations or the degree of an agent may be considered to be attributes, so the formalism also enables the generation of graphs with power-law degree distribution and affiliations. In short, these models fulfill R3 and R4. However, they include special parameters which require ad hoc data collection, so their application remains limited to small groups (contrary to R1). Moreover, in these small groups, it was never necessary to distinguish different kinds of relationships, contrary to R2.

1.4 Approach

We propose to use scattered statistics available for the modeled population (R5) to generate more representative networks of interaction at a country scale (R1). The generated relationships network will detail agents’ attributes in the population (R3), which are taken into account during network generation. The network will include several kinds of relationships (R2), so the user of that network may infer the interaction network given the kind of relationship (Section 4.1). As the model is intended to bring together several sources of information on a population to parameterize the generator, we describe a methodology (Section 2) to formalize intuitively the knowledge on agents’ attributes and links using Bayesian networks. The formalism, inspired by Markov random graphs, enables the representation of the key processes of social selection (R4). Then (Section 3), we explain how a population of heterogeneous agents can be generated and how the relationships network is created. Insights on the minimum size of the population, on the detection of discrepancies in statistics, and on the statistical properties of generated networks are provided in Section 4.

2 Methodology

2.1 Choice of agents’ attributes and link types

Step 1

The modeler should first define the types of social links to represent in the relationships network. This set of link types enumerates the links that potentially lead to different interactions in the model, or that are created by different processes. As proposed previously in Markov random graphs [13], some kinds of relationships can be generated given agents’ attributes, while others are created by transitivity. In the example of Kenya, we choose to represent links leading to interaction about contraceptive use. Field studies indicate that spouses discuss that topic, that the advice of parents has a normative influence, and that women retrieve information from friends, siblings and colleagues [11]. We therefore define four link types created given agents’ attributes: spouses, motherOf, colleagues and friends. Other links are created by transitivity, because they involve more than two agents and can be derived from already created links: fatherOf and siblings.

Step 2

Next, the modeler has to select the agents’ attributes which are known, or supposed, to influence the probability of a link being created. Of course, that selection depends on available data and on the purpose of the agent-based model. Typically, agents’ attributes will contain socio-demographic characteristics (age, gender, socioeconomic class, ethnicity, etc.) and places where the agents have frequent interactions given these characteristics (going to school, frequenting a workplace, etc.). We also treat the number of links to create for an agent, for each kind of relationship, as an attribute. Including the number of links as an attribute could seem counter-intuitive, because it was often considered to be an independent density parameter [6]. That choice is justified by the following reasons: (i) the number of links per agent is available from statistics, and varies across kinds of relationships; (ii) the number of links is strongly correlated with other agents’ attributes; for instance, the number of children of a wife depends on her age; (iii) the number of links is often considered to be an explanatory variable for the individual decision-making process, so it should be made available as an attribute. For example, contraceptive adoption increases with the number of children of a mother [10].
In the example of Kenya, the attributes married, age and gender are required for nearly all kinds of links. As field studies indicate that most discussions take place during daily activities [10], we added the variable work. Spatial location is required because spouses always live in the same place, as do young children with their mother.

2.2 Formalization based on bayesian networks

Step 3: represent agents’ attributes using bayesian networks.

Figure 1: Attributes bayesian network used to describe interdependencies between Kenyan socio-demographic attributes. Nodes in bold are the number of links to create for each link type.

Attributes of individuals in a real population are strongly interdependent: marital status depends on age and gender, socioeconomic class is highly correlated with location and education level, etc. Generating a population of agents whose attributes comply with the statistical interdependencies of individual characteristics requires a relevant model of these dependencies, generic enough to be used with any kind of data. In practice, the data available for a population is often presented as statistics linking one attribute with another. For instance, the number of children per woman is provided given marital status and age [8, p. 57]. That kind of statistic can be translated, without loss of generality, into conditional probabilities, such as the probability of having a given number of children conditional on marital status and age. From that viewpoint, attributes of agents are considered to be random variables. We propose to use a Bayesian network [14] (BN), named agent BN in this methodology, to formalize these interdependencies. Each agent attribute is represented by a variable in the BN. The domain of a variable defines the values the attribute can take. For instance, in Fig. 1, the variable gender has the domain {male, female}. Root variables define initial probabilities. In Fig. 1, the initial probabilities for variable ageDetail define the probability for an individual picked randomly in the population to have a given age; that probability is available from the age pyramid of the target population. A directed link between two variables means that the probabilities of the child variable can be computed from its parents, and only from its parents. Each non-root variable embodies a conditional probability table representing the probability of taking each value given all the possible combinations of values in the domains of its parents. No link means that variables are assumed independent. That does not mean that the variables are independent in reality, but rather represents our lack of knowledge (or our willingness to simplify that knowledge) of that dependence.

In our application to Kenya, probabilities in the agent BN depicted in Fig. 1 come from the US Census Bureau, from the Kenya demographic and health survey [8], and from field studies (e.g. [10, 11]). Note that we used convenience variables to simplify the formalization of data: in the agent BN of Fig. 1, variable ageSlices simplifies the detailed age to 5-year slices, which are often used in published statistics. Another benefit of BN is to highlight evident discrepancies in data. For instance, a social scientist will immediately note in Fig. 1 the absence of a link between gender and age, while the age pyramid in most countries shows significant differences between genders (indeed, Kenya is a particular case of a symmetrical age pyramid).
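To make the structure of an agent BN more concrete, here is a minimal sketch of how one could be stored as conditional probability tables in plain Python. The variable names loosely follow Fig. 1, but the reduced structure, the three age slices and every probability are illustrative placeholders chosen for this example, not the Kenyan statistics used in the paper.

```python
from itertools import product

# Hypothetical agent BN. Each variable lists its parents and a CPT mapping
# a tuple of parent values to a distribution over the variable's domain.
# All numbers are placeholders, not statistics from the paper's sources.
AGENT_BN = {
    "gender": {"parents": (), "cpt": {(): {"male": 0.5, "female": 0.5}}},
    "ageSlice": {"parents": (),
                 "cpt": {(): {"0-14": 0.40, "15-49": 0.45, "50+": 0.15}}},
    "married": {
        "parents": ("gender", "ageSlice"),
        "cpt": {
            ("male", "0-14"): {"yes": 0.0, "no": 1.0},
            ("male", "15-49"): {"yes": 0.5, "no": 0.5},
            ("male", "50+"): {"yes": 0.8, "no": 0.2},
            ("female", "0-14"): {"yes": 0.0, "no": 1.0},
            ("female", "15-49"): {"yes": 0.6, "no": 0.4},
            ("female", "50+"): {"yes": 0.7, "no": 0.3},
        },
    },
}


def rows_sum_to_one(bn, tol=1e-9):
    """Check that every CPT row is a probability distribution."""
    return all(abs(sum(row.values()) - 1.0) < tol
               for node in bn.values() for row in node["cpt"].values())


def joint_probability(bn, assignment):
    """Probability of a full assignment: product of each variable's CPT entry."""
    p = 1.0
    for var, node in bn.items():
        parent_values = tuple(assignment[parent] for parent in node["parents"])
        p *= node["cpt"][parent_values][assignment[var]]
    return p


def marginal(bn, var, value):
    """Marginal probability by brute-force enumeration (small networks only)."""
    names = list(bn)
    domains = [list(next(iter(bn[name]["cpt"].values()))) for name in names]
    total = 0.0
    for values in product(*domains):
        assignment = dict(zip(names, values))
        if assignment[var] == value:
            total += joint_probability(bn, assignment)
    return total


p_married = marginal(AGENT_BN, "married", "yes")
```

Enumeration is only practical for a handful of variables; a real agent BN of the size shown in Fig. 1 would rely on a proper inference engine.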

Figure 2: Matching bayesian network for link type spouses. On the left, the agent BN for agents 1 and 2.

Step 4: represent links probability using bayesian networks.

Links created given attributes are defined by the probability that a link exists conditional on the attributes of both agents. They can be used to represent a large range of phenomena, including homophily, affiliation, preferential attachment, or spatialization. As that probability is conditional, it can also be represented by a matching BN, an example of which is depicted in Fig. 2 for the relationship spouses in Kenya. In that matching BN, one can recognize two instances of the agent BN (on the left) representing the attributes of two different agents of the population. On the right, a special node with domain {yes, no} defines whether a link can be created between these agents. Nodes in bold define constraints on linking. In the BN in Fig. 2, we define arbitrarily that agent 1 is male and agent 2 female (marriage in Kenya is heterosexual). Node ageWife projects the probable age of the first wife of the man described on top (on average 10 years younger), and variable rightAge ensures through an identity probability table that agent 2 complies with that age. The node sameLocation takes value yes only if both agents live in the same location. The final variable linkSpouses, which determines whether two agents can be linked together, takes value yes only if all of its parents are themselves set to yes. Note the nodes a1_created_spouses and a1_remaining_spouses, which ensure that we only create as many spouses links as required by the number-of-links attributes of both agents, and no more, so a wife will be tied to exactly one husband. Other links created given attributes are defined in the same way: friends probably have the same age and probably live in the same town, mothers are linked to children whose age is compatible with their own (and who always live in the same location if they are young), and colleagues are defined as agents sharing the same activity in the same location.
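The way the final node combines its parents can be sketched as a plain conjunction of constraints. The sketch below is a deterministic simplification: in the matching BN the age constraint is probabilistic, whereas here it is replaced by a hard tolerance. The dictionary keys, the `tolerance` parameter and the example agents are hypothetical; only the 10-year average gap and the constraint names come from the description above.

```python
def link_spouses(agent1, agent2, age_gap=10, tolerance=5):
    """Return (linkable, constraints) for a candidate husband/wife pair.

    Deterministic stand-in for the matching BN: each entry mimics one parent
    of the final link node, which is 'yes' only if all parents are 'yes'.
    """
    expected_wife_age = agent1["age"] - age_gap  # role of the ageWife node
    constraints = {
        "genders": agent1["gender"] == "male" and agent2["gender"] == "female",
        "bothMarried": agent1["married"] and agent2["married"],
        "sameLocation": agent1["location"] == agent2["location"],
        # Hard tolerance (assumption) instead of a probability table.
        "rightAge": abs(expected_wife_age - agent2["age"]) <= tolerance,
        "a1_remaining_spouses": agent1["remaining_spouses"] > 0,
        "a2_remaining_spouses": agent2["remaining_spouses"] > 0,
    }
    return all(constraints.values()), constraints


# Hypothetical agents used only to exercise the function.
husband = {"gender": "male", "age": 40, "married": True,
           "location": "village_A", "remaining_spouses": 1}
wife = {"gender": "female", "age": 31, "married": True,
        "location": "village_A", "remaining_spouses": 1}
elsewhere = dict(wife, location="village_B")

ok, _ = link_spouses(husband, wife)
blocked, reasons = link_spouses(husband, elsewhere)
```

Returning the individual constraints alongside the verdict makes it easy to see which condition prevented a link, which helps when checking the statistics fed to the generator.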

Links created by transitivity are also random variables: the probability of creating a link of a given type between two agents is conditional on the existence of already created links which connect them through a third agent. That formalism is quite intuitive and will not be detailed further here. In our example, we define by transitivity the link fatherOf from the links spouses and motherOf (only transitivity enables the description of father-children links; they could not be described by a matching BN, because the children of a man have to be the same as the children of his wives). In the same way, siblings are created by transitivity across mother and father links. With a lower probability, friendship links are created by transitivity between friends.
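As an illustration of transitive link creation, the following sketch derives fatherOf and siblings links from existing spouses and motherOf links. It assumes a transitivity probability of 1 for both link types, which is a simplification made for this example. The husband M54, his wives F28 and F42 and the daughter F24 echo the example of Fig. 4; the assignment of children to mothers and the other child identifiers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def father_links(spouses, mother_of):
    """fatherOf(h, c) whenever spouses(h, w) and motherOf(w, c) both exist."""
    children = defaultdict(list)
    for mother, child in mother_of:
        children[mother].append(child)
    return {(husband, child)
            for husband, wife in spouses
            for child in children[wife]}


def sibling_links(mother_of, father_of):
    """siblings(a, b) whenever a and b share a mother or a father."""
    by_parent = defaultdict(set)
    for parent, child in list(mother_of) + list(father_of):
        by_parent[parent].add(child)
    links = set()
    for kids in by_parent.values():
        # Sorting gives each undirected pair a single canonical orientation.
        links.update(combinations(sorted(kids), 2))
    return links


spouses = [("M54", "F28"), ("M54", "F42")]
mother_of = [("F42", "F24"), ("F42", "M20"), ("F28", "M03")]

father_of = father_links(spouses, mother_of)
siblings = sibling_links(mother_of, father_of)
```

Half-siblings born to different wives become siblings through the shared father, which is why the father links have to be created before the sibling links.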

3 Generation of the graph

3.1 Generation of a heterogeneous population

All the variables in the agent BN will become agents’ attributes with the same domain. For each agent to create, we generate a prototype agent. The process to generate a prototype simply consists in using the agent BN in a generative way: for each variable of the agent BN (in topological order, so that root variables are processed first), a value is selected randomly in the domain of the variable, given the probabilities defined in the BN. When a value has been chosen, a corresponding piece of evidence is put in the BN. Evidence, in the theory of BN, represents known information. Putting evidence in the BN makes it possible to compute the probabilities of child variables given the values of already selected attributes, so the consistency of agent attributes is ensured. For instance, in Fig. 3, before any piece of evidence is set (top), the probability for someone randomly picked in the population to be married is 29.69%. Once the values of two other attributes have been drawn at random and used as evidence (bottom), the posterior probability for the current agent to be married falls to 1.90%. When all the agents are generated that way, the statistical distribution of their attributes complies with the distribution described by the BN.
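The generative use of the agent BN can be sketched as ancestral sampling: because values are drawn from roots to leaves, conditioning each draw on the parents already drawn plays the role of the evidence described above. This is a minimal sketch under stated assumptions: the three-variable network and all probabilities are placeholders, and the dictionary is assumed to list variables in topological order.

```python
import random

# Hypothetical agent BN: variable -> (parents, CPT). Variables are listed so
# that parents always come before their children. Numbers are placeholders.
BN = {
    "gender": ((), {(): {"male": 0.5, "female": 0.5}}),
    "ageSlice": ((), {(): {"0-14": 0.40, "15-49": 0.45, "50+": 0.15}}),
    "married": (("gender", "ageSlice"), {
        ("male", "0-14"): {"yes": 0.0, "no": 1.0},
        ("male", "15-49"): {"yes": 0.5, "no": 0.5},
        ("male", "50+"): {"yes": 0.8, "no": 0.2},
        ("female", "0-14"): {"yes": 0.0, "no": 1.0},
        ("female", "15-49"): {"yes": 0.6, "no": 0.4},
        ("female", "50+"): {"yes": 0.7, "no": 0.3},
    }),
}


def sample_prototype(bn, rng):
    """Draw one prototype agent, one variable at a time, roots first."""
    agent = {}
    for var, (parents, cpt) in bn.items():
        # Values already drawn for the parents select the CPT row to use.
        distribution = cpt[tuple(agent[parent] for parent in parents)]
        values = list(distribution)
        weights = [distribution[value] for value in values]
        agent[var] = rng.choices(values, weights=weights, k=1)[0]
    return agent


rng = random.Random(42)
population = [sample_prototype(BN, rng) for _ in range(5000)]
share_married = sum(a["married"] == "yes" for a in population) / len(population)
```

Since married children have probability zero in the placeholder table, no sampled agent in the youngest slice is ever married, which illustrates how the consistency of attributes is preserved.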

Figure 3: Example of evidence propagation when the bayesian network is used to generate agents’ attributes. Here monitors (boxes in the figure) display the probabilities for each variable to take every value (note that some of these monitors are truncated). (top) probabilities with no evidence (bottom) probabilities when evidence is set.

3.2 Creation of links

Now that all the agents of the population have been created, each agent having its attributes defined, we have to link them using the matching BN. For each kind of relationship created given attributes, we constrain the corresponding matching BN by providing evidence for link creation: as our aim is to link agents together with that kind of link, we set evidence on the link variable. Given that evidence, the probabilities for the attributes of agents 1 and 2 are updated, and some probabilities in the attributes’ domains fall to zero. For instance, in the case of the spouses link, the probability of agents 1 and 2 being younger than 15 years falls to zero; they also cannot have married=no. In other words, the probabilities in the matching BN given evidence of link creation designate two sets of candidates for linking, one for the role of agent 1 and one for the role of agent 2. The matching process remains limited to these sets. Then, we iterate across the candidates for agent 1 and, for each of them, randomly select as many acquaintances among the candidates for agent 2 as required by its number-of-links attribute. For each candidate for agent 1, we load its attributes and use them as pieces of evidence in the BN. After a run of the inference engine, the probabilities for agent 2 define a restricted set of candidates for linking given the attributes of agent 1. In our application to Kenya, for link type spouses, the first set is the set of husbands and the second the set of wives. When one chooses a husband, given the constraints on matching, the restricted set is limited to wives who live in the same location as him. A candidate is selected by generating a prototype agent as explained before; if no agent matching that prototype can be found, a fallback solution consists in randomly picking one agent in the restricted set (note that the fallback solution can bias the statistical distribution in the population; in our example it is possible to link a husband with an older wife). When no fallback solution can be found, that is when the restricted set is empty, the agent remains orphan, but will never be tied to an incompatible agent (in our example, no man will be said to be married to an unmarried or too young wife). These errors are studied in Section 4.2.

After all the link types defined by matching BN have been processed, transitive links are created using the probabilities formalized in Section 2.2.
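The matching loop for one link type can be sketched as follows, here for spouses. This is a simplified stand-in for the BN-driven process: the prototype of the wife is reduced to a target age (the husband's age minus the 10-year average gap), candidates within a hypothetical `tolerance` of that age are preferred, and any remaining compatible candidate serves as the fallback. Field names and the example agents are assumptions made for this sketch.

```python
import random


def match_spouses(husbands, wives, rng, age_gap=10, tolerance=5):
    """Link each husband to as many wives as his n_spouses attribute requires."""
    links, orphans = [], []
    remaining = {w["id"]: 1 for w in wives}  # a wife has exactly one husband
    for man in husbands:
        for _ in range(man["n_spouses"]):
            # Restricted candidate set: same location and still available.
            pool = [w for w in wives
                    if remaining[w["id"]] > 0
                    and w["location"] == man["location"]]
            if not pool:
                orphans.append(man["id"])  # required link cannot be created
                continue
            target_age = man["age"] - age_gap  # stand-in for the prototype
            preferred = [w for w in pool
                         if abs(w["age"] - target_age) <= tolerance]
            # Fallback: any compatible candidate when no prototype match.
            wife = rng.choice(preferred or pool)
            remaining[wife["id"]] -= 1
            links.append((man["id"], wife["id"]))
    return links, orphans


husbands = [
    {"id": "M40", "age": 40, "location": "A", "n_spouses": 2},
    {"id": "M35", "age": 35, "location": "B", "n_spouses": 1},
    {"id": "M50", "age": 50, "location": "C", "n_spouses": 1},
]
wives = [
    {"id": "F30", "age": 30, "location": "A"},
    {"id": "F45", "age": 45, "location": "A"},
    {"id": "F26", "age": 26, "location": "B"},
]
links, orphans = match_spouses(husbands, wives, random.Random(0))
```

In this toy population the second wife of M40 is obtained through the fallback and is older than him, and M50 stays orphan because no candidate lives in his location, which mirrors the two situations described above.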

4 Generated network

4.1 Usage for social simulation

Figure 4: (left) generated relationships network (right) zoom in one agent

The resulting graph includes links of different kinds, and provides the values of agent attributes for any agent in the population. The structure of relationships described by the generated network obviously depends on the agent BN and matching BN provided by the modeler as parameters. In our application to Kenya, the population covers the whole age pyramid, and describes the attributes depicted in Fig. 1. Moreover, as shown in Fig. 4, each agent is positioned in its familial environment; agent M54 (for Male, 54 years) is married to two wives, F28 and F42, and has 7 children, including one daughter, F24, who is herself married and a mother. He is also tied to his own mother F71 and brother M42, but not to his father, probably because the latter is not in the age pyramid (no longer alive). He is also tied to colleagues and friends (not represented in that figure to improve readability). That structure is described at the scale of the 50,000 agents depicted in Fig. 4 (left).
To use that network for simulation, the modeler may simply define a probability of interaction for each kind of relationship. He may also choose a finer granularity by defining the probability of interaction given attributes, for instance to represent the fact that spatial distance decreases the probability of interaction. In that illustration, we focus on interactions about contraceptive use [11]. In our case, no interaction occurs across links between young children and their parents. As the topic of contraceptive use is sensitive in Kenya, probabilities of discussion between spouses are low, as between a mother and her own parents. In fact, women who are still fertile and are concerned by the topic discuss mainly with their female friends, and often with their brothers-in-law (link sibling). The resulting network of interactions is a network in which ties are weighted by probabilities; it is far sparser than the network of relationships.
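Deriving the interaction network from the relationships network can be sketched as a simple weighting step. The probabilities below are hypothetical placeholders, chosen only to reflect the qualitative ordering described above (friends high, spouses low); the agent identifiers are likewise invented for the example.

```python
# Hypothetical interaction probabilities per link type (placeholders only).
P_INTERACT = {
    "friends": 0.6,
    "siblings": 0.4,
    "colleagues": 0.3,
    "spouses": 0.1,
    "motherOf": 0.1,
    "fatherOf": 0.0,
}


def interaction_network(relationships, p_by_type):
    """Keep only ties with a non-zero interaction probability, as weights."""
    weighted = []
    for a, b, link_type in relationships:
        p = p_by_type.get(link_type, 0.0)  # unknown types never interact
        if p > 0.0:
            weighted.append((a, b, link_type, p))
    return weighted


relationships = [
    ("a1", "a2", "friends"),
    ("a3", "a1", "spouses"),
    ("a3", "a4", "fatherOf"),
    ("a1", "a4", "motherOf"),
]
interactions = interaction_network(relationships, P_INTERACT)
```

Ties whose probability is zero disappear, which is one reason the interaction network ends up sparser than the relationships network.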

4.2 Errors and statistical properties

Figure 5: (left) Error rate given population size. (right) Statistical properties of the generated relationship network

While BN describe a theoretical population using continuous probabilities, we generate a discrete population and link agents only when a suitable candidate exists. That limitation necessarily leads to a bias in the statistical properties of the population. Two kinds of errors may appear during generation. Errors on statistical distribution appear because the generated population is not large enough given the combinatorics of attributes’ values described by the agent BN. These errors are measured by learning the agent BN from the generated population, and quantified as the average difference between theoretical and measured probabilities. As shown in Fig. 5 (left), these errors (bottom curve) remain low and are negatively correlated with the population size. Errors on matching appear when no candidate was found to link several agents, and are quantified by the ratio between the total number of links required by the number-of-links attributes and the number of ties actually created. When that error rate remains low, and decreases when the population size increases, errors are only due to the discrete nature of agents: there will always exist some agents which could not be connected because their theoretical peer was not created. As shown in Fig. 5, that error rate drops quickly above a given population size. Given our parameters, a population of 5,000 agents is a minimum to reduce errors. Above 10,000 agents, no further significant improvement appears. When the matching error rate remains high as the population size increases, it means that the agent BN and/or matching BN are incompatible. In Fig. 5, the curve for link type married shows that the number of wives per man is not compatible with the proportion of married wives. In that case, the statistics (or assumptions) used to build the BN should be checked and corrected.
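The two error measures can be sketched for a single variable and a single link type. The distribution error follows the description above (average absolute difference between theoretical and measured probabilities), restricted here to one marginal distribution. For the matching error, this sketch assumes one possible reading, namely the share of required links that could not be created; the exact ratio used for Fig. 5 may differ.

```python
from collections import Counter


def distribution_error(theoretical, samples):
    """Average absolute gap between theoretical and observed probabilities."""
    counts = Counter(samples)
    n = len(samples)
    return sum(abs(p - counts[value] / n)
               for value, p in theoretical.items()) / len(theoretical)


def matching_error(required_links, created_links):
    """Share of required links that were not created (assumed definition)."""
    if required_links == 0:
        return 0.0
    return (required_links - created_links) / required_links


theoretical_gender = {"male": 0.5, "female": 0.5}
balanced = ["male", "female", "female", "male"]
skewed = ["male", "male", "male", "female"]

err_balanced = distribution_error(theoretical_gender, balanced)
err_skewed = distribution_error(theoretical_gender, skewed)
err_matching = matching_error(required_links=100, created_links=90)
```

Tracking both measures while increasing the population size is what distinguishes errors due to the discrete population (which vanish) from incompatible statistics (which persist).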

Figure 5 (right) depicts the evolution of the statistical properties of the relationships network. Density is low (under 0.01). Transitivity (sometimes called clustering rate) is high, and becomes stable above the 10,000 agents threshold. Links defined across spatial locations (for family, work, and with low probability friendship) play the role of shortcuts, so the average path length in the model grows very slowly (around 4.8), exhibiting the so-called “small-world” property. The average degree is in theory defined by the number-of-links attributes. In practice, it is only reached when all the required links are created (above 10,000 agents), and then sticks to its theoretical value.
Evidently, there exists a minimum population size needed to satisfy the constraints defined by the matching BN. The more constraining the matching BN are, the higher the threshold. Above that threshold, statistical properties remain remarkably stable.
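For readers who want to check such properties on a generated network, here is a minimal sketch computing density, average degree and transitivity for an undirected graph given as an edge list. Transitivity is taken as the fraction of connected triples that are closed, which is the usual global definition; the small example graph is hypothetical.

```python
from itertools import combinations


def network_statistics(n_agents, edges):
    """Density, average degree and global transitivity of an undirected graph."""
    neighbours = {i: set() for i in range(n_agents)}
    for a, b in edges:
        if a != b:  # ignore self-loops
            neighbours[a].add(b)
            neighbours[b].add(a)
    n_edges = sum(len(adj) for adj in neighbours.values()) / 2
    density = 2 * n_edges / (n_agents * (n_agents - 1)) if n_agents > 1 else 0.0
    average_degree = 2 * n_edges / n_agents if n_agents else 0.0
    # A triple is a pair of neighbours of a common centre; it is closed
    # when the two neighbours are themselves linked.
    triples = closed = 0
    for adj in neighbours.values():
        for u, v in combinations(adj, 2):
            triples += 1
            if v in neighbours[u]:
                closed += 1
    transitivity = closed / triples if triples else 0.0
    return {"density": density,
            "average_degree": average_degree,
            "transitivity": transitivity}


# Hypothetical graph: a triangle (0, 1, 2) with a pendant agent 3 tied to 2.
stats = network_statistics(4, [(0, 1), (1, 2), (0, 2), (2, 3)])
```

The triple enumeration is quadratic in the degree of each agent, which stays affordable when degrees are bounded by the number-of-links attributes.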

5 Discussion

In this paper, we proposed a methodology to formalize various statistics available for the population (R5) in order to generate a simplified network of relationships at a country scale (R1). The resulting network of relationships includes agents’ attributes (R3) and different kinds of relationships (R2), so the modeler can define with more precision whether interaction takes place. The network exhibits a high clustering rate, a low density and a low average path length. The formalism enables modelers to comply with evidence on social selection processes (R4) like affiliation, homophily and transitivity. We illustrated the methodology by generating a network of relationships for rural Kenya in which socio-demographic studies, sociological findings, and qualitative observations on affiliation are put together to reproduce familial, work and friendship relationships.
The choice of Bayesian networks to formalize scattered statistics makes the fusion of different statistical sources more intuitive, so any social scientist can use the generator. BN also facilitate the generation of the heterogeneous population of agents and the creation of links between these agents. We plan to publish soon the software which implements the generator.

The purpose of that methodology is to generate a model of relationships in a population. The generated graph is thus only a simplification of real relationships given available data, and does not target the same precision as models at a smaller scale. However, the generated network is rooted in reality through the use of available statistics and observations of that population. In some way, we hope it fills the gap between models from social scientists (highly descriptive, but limited to small populations) and generators from physicists (which generate large populations with a low descriptive power).
We decided to illustrate this paper with social interactions in rural Kenya because of the relative simplicity of its social structure. The next step is to model more complex populations like France (more affiliations, socioeconomic classes, attributes). Our research agenda also includes the investigation of the dynamics supported by generated networks, especially in the frame of information diffusion, and the formal analysis of the properties of generated networks.


  • [1] Wasserman, S., Faust, K.: Social network analysis, methods and applications. Cambridge: Cambridge University Press (1994)
  • [2] Thiriot, S., Kant, J.D.: Using associative networks to represent adopters’ beliefs in a multiagent model of innovation diffusion. Advances in Complex Systems 11(2) (2008) 261–272
  • [3] Thiriot, S., Kant, J.D.: Reproducing stylized facts of word-of-mouth with a naturalistic multi-agent model. In: Second World Congress on Social Simulation. (2008)
  • [4] Rogers, E.M.: Diffusion of Innovations. 5th edn. New York: Free Press (2003)
  • [5] Granovetter, M.S.: The Strength of Weak Ties. The American Journal of Sociology 78(6) (1973) 1360–1380
  • [6] Robins, G., Elliott, P., Pattison, P.: Network models for social selection processes. Social networks 23(1) (2001) 1–30
  • [7] McPherson, M., Smith-Lovin, L., Cook, J.M.: Birds of a Feather: Homophily in Social Networks. Annual Reviews in Sociology 27 (2001) 415–444
  • [8] KDHS: Kenya Demographic and Health Survey 2003. Central Bureau of Statistics (2003)
  • [9] Mburugu, E.K., Adams, B.N.: Handbook of World Families. SAGE (2004) 3–24
  • [10] Watkins, S.C., Rutenberg, N., Green, S.: Diffusion and debate: Controversy about reproductive change in Nyanza Province, Kenya. Annual Meeting of the Population Association of America (1995) 6–8
  • [11] Rutenberg, N., Watkins, S.C.: The buzz outside the clinics: conversations and contraception in Nyanza Province, Kenya. Studies in Family Planning 28(4) (1997) 290–307
  • [12] Phan, D., Amblard, F., eds.: Agent-based Modelling and Simulation in the Social and Human Sciences. Oxford, The Bardwell Press (2007)
  • [13] Frank, O., Strauss, D.: Markov graphs. Journal of the American Statistical Association 81 (1986) 832–842
  • [14] Jensen, F.V.: Introduction to Bayesian Networks. Springer-Verlag New York, Inc. Secaucus, NJ, USA (1996)