## 1 Introduction

When disseminating data relating to individuals, there are always two conflicting targets: maximising utility and minimising disclosure risk. To minimise risk, statistical disclosure control (SDC) methods, which typically involve either suppressing or perturbing certain values, are applied to a data set prior to its release. One such method is the generation of synthetic data sets (Rubin1993, Little1993), which involves fitting models to the original data and then simulating from them. These methods, while reducing risk, adversely impact the data's utility, resulting in a clear trade-off between risk and utility.

This paper focuses on the role of multiple data sets when synthesizing categorical data (that is, data consisting of only categorical variables) at the aggregated level using saturated count models (Jackson2021). Saturated synthesis models invite a bespoke approach to synthesis. They allow the synthesizer to generate synthetic data with certain pre-specified properties, thus allowing them to easily tailor the synthesis to suit the data environment (Elliot2018). For example, if the intention is to release open data, relatively more noise can be applied to the data than if the data are released in a secure environment. While the Poisson log-linear model is often used to model categorical data, for synthesis this is not necessarily an optimal choice, because the synthesizer - that is, the person(s) responsible for synthesizing the data - has no control over the variance and has, therefore, no way to add noise to at-risk records in the data. For this reason, the negative binomial (NBI), a two-parameter count distribution, is much more effective for synthesis. As the NBI distribution's variance is not completely determined by the mean - though the variance is always greater than the mean - the variance can be increased accordingly. Nevertheless, there are still restrictions, and these are discussed later on.

Specifically, this paper explores how flexibility can be incorporated into the mechanism through the use of multiple synthetic data sets, the number of which is denoted by $m$. In some cases (as explained in Section 3), multiple synthetic data sets must be generated; while in other cases, though it may be sufficient to generate just one synthetic data set, the optimal $m$ can still be considered in relation to the risk-utility trade-off: does the improvement in utility sufficiently outweigh the cost in terms of greater risk? This is because, since it reduces simulation error, increasing $m$ leads to greater utility but also, inevitably, greater risk (Reiter2005b, Reiter2009). More generally, considering $m$ introduces another tuning parameter for the synthesizer to set, thereby providing further flexibility.

This paper is structured as follows: Section 2 summarises the $(\sigma, \alpha)$-synthesis mechanism, on which the results in this paper are based; Section 3 extends the mechanism to incorporate $m$; Section 4 introduces the $\tau_3(k,a)$ and $\tau_4(k,a)$ metrics, developed to assess risk in multiple categorical synthetic data sets; Section 5 presents an illustrative example; and lastly the Discussion proposes some areas for future research.

## 2 Review of the use of saturated models for synthesis

The discrete nature of categorical data allows it to be expressed as a multi-dimensional contingency table (multi-way table). As a multi-way table, the data consist of a structured set of cell counts $n_1, \ldots, n_K$, which give the frequencies with which each combination of categories is observed.

Synthetic data sets can then be generated by replacing these observed counts (known henceforth as "original counts") with synthetic counts. This can be achieved through a count model, such as a Poisson log-linear model or a Poisson-Gamma model (Graham2007).

Data sets which include large categorical variables (those with a large number of categories), such as administrative databases, produce high-dimensional tables. It can be time-consuming to fit log-linear models to such large tables using standard statistical software, such as the glm function in R. The iterative proportional fitting (IPF) algorithm can alleviate this problem to a certain extent; however, IPF only returns fitted values - not standard errors - and there comes a point (for even larger tables) where IPF becomes infeasible.

The use of saturated models, therefore, is beneficial, because no model fitting is needed to obtain the fitted counts. There are two distinct modelling methods for contingency tables: multinomial models and count models. The multinomial ensures that the total number of individuals in the original data, $n$, is equal to the total number of individuals in the synthetic data, $n^{\mathrm{syn}}$. The syn.catall function in the R package synthpop (Nowok2016) can be used to generate synthetic data via a saturated multinomial model.

The $(\sigma, \alpha)$-synthesis mechanism (Jackson2021), on the other hand, uses saturated count models for synthesis; specifically, either a saturated negative binomial (NBI) model or a saturated Poisson-inverse Gaussian (PIG) (Rigby2019) model. In this paper, only the NBI has been considered, to emphasize the tuning capability of $\sigma$. Besides, the results from using either the NBI or PIG distributions are broadly similar, as they share the same variance function.

When $m = 1$, the $(\sigma, \alpha)$-synthesis mechanism has two parameters which are set by the synthesizer. The first, denoted by $\sigma$, originates from using a two-parameter count distribution (the NBI or the PIG), parameterised in such a way that $\sigma$ controls the variance. The value of $\sigma$ can be adjusted to inflate - or deflate - the variability in the synthetic cell counts, thus increasing - or decreasing - the expected divergence from the original counts. More noise is required for sensitive cells - usually small cell counts, which correspond to individuals who have a unique (or near-unique) set of observations - to mask that these are, indeed, unique.

The mechanism's second parameter, denoted by $\alpha$, relates to the size of the pseudocount - in practice, this is not actually a count but a very small positive number such as 0.01 - which is added to zero cell counts (zero cells) in the original data. This assigns a non-zero probability that a zero cell is synthesized to a non-zero count. The pseudocount $\alpha$ is only applied to non-structural zero cells (known as random or sampling zeros), which are zero cells for which a non-zero count could have been observed. Throughout this paper it has been assumed, for brevity, that $\alpha$ is held at a fixed small value. Given an original count $n_i$ ($i = 1, \ldots, K$), the corresponding synthetic count $n^{\mathrm{syn}}_i$ is modelled as:

$$n^{\mathrm{syn}}_i \mid n_i \sim \mathrm{NBI}(n_i, \sigma).$$

The shape parameter $\sigma$ is set by the synthesizer to adjust the variance of the model. The NBI distribution's variance function is $\mu + \sigma\mu^2$ (where $\mu$ is the mean), which demonstrates how the parameter $\sigma$ controls the variance: for a given mean, the variance increases linearly with $\sigma$.
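To make the mechanism concrete, here is a minimal Python sketch of drawing synthetic counts from a saturated NBI model (the paper's own code is in R; the function name, interface and the `alpha` default here are illustrative assumptions). NumPy's negative binomial sampler is reparameterised so that a cell with original count $n_i$ is synthesized with mean $n_i$ and variance $n_i + \sigma n_i^2$:

```python
import numpy as np

def synthesize_counts(counts, sigma, alpha=0.01, m=1, rng=None):
    """Draw m synthetic versions of a vector of original cell counts
    from a saturated NBI model with mean equal to the original count.

    Zero cells have their mean replaced by the pseudocount alpha,
    giving them a non-zero probability of being synthesized to a
    non-zero count.
    """
    rng = np.random.default_rng() if rng is None else rng
    mu = np.where(np.asarray(counts) == 0, alpha, counts).astype(float)
    mu = np.broadcast_to(mu, (m, mu.size))
    if sigma == 0:  # the NBI reduces to the Poisson when sigma = 0
        return rng.poisson(mu)
    # NumPy's negative_binomial(n, p) has mean n(1-p)/p; choosing
    # n = 1/sigma and p = 1/(1 + sigma*mu) gives mean mu and
    # variance mu + sigma*mu^2, the NBI parameterisation used here.
    return rng.negative_binomial(1.0 / sigma, 1.0 / (1.0 + sigma * mu))
```

For example, a cell with $n_i = 10$ and $\sigma = 0.5$ is synthesized with mean 10 and variance $10 + 0.5 \times 10^2 = 60$.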

Using a saturated count model has certain advantages for the unusual situation of modelling for the purpose of synthesis. Firstly, it guarantees the preservation of relationships between variables, as no assumptions are made as to which interactions exist. Secondly, the method scales equally well to large data sets, as no model fitting is required - the model's fitted counts are just the observed counts. Finally, as the fitted counts are just equal to the observed counts, it allows expected properties of the synthetic data to be determined a priori (that is, prior to synthesis). The uncertainty from modelling is, in effect, drained away, and instead uncertainty is injected where it is most needed: to add noise to sensitive cells in the original data.

### 2.1 The $\tau$ metrics

The following metrics (Jackson2021) give a basic quantification of risk (and utility):

$$\tau_1(k) = p(n_i = k), \qquad \tau_2(k) = p(n^{\mathrm{syn}}_i = k),$$

$$\tau_3(k) = p(n^{\mathrm{syn}}_i = k \mid n_i = k), \qquad \tau_4(k) = p(n_i = k \mid n^{\mathrm{syn}}_i = k),$$

where $n_i$ and $n^{\mathrm{syn}}_i$ are arbitrary cell counts in the original and synthetic data, respectively.

Firstly, $\tau_1(k)$ is the empirical proportion of cells in the original data with a count of $k$; the value of $\tau_1(k)$ is obtained by dividing the number of cells with a count of $k$ by the total number of cells $K$. Similarly, $\tau_2(k)$ is the proportion of cells in the synthetic data with a count of $k$.

The metric $\tau_3(k)$ is the probability that a cell of size $k$ in the original data is synthesized to $k$; and $\tau_4(k)$ is the probability that a cell of size $k$ in the synthetic data originated from a cell of size $k$. The metrics $\tau_3$ and $\tau_4$ are the most associated with risk; particularly $\tau_3(1)$ and $\tau_4(1)$, since these relate to uniques and there is not the same 'safety in numbers' compared with larger cell counts. When, for example, $\tau_4(1)$ is high (close to 1), then it is possible to identify, with near certainty, some uniques in the original data from the synthetic data.
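The empirical versions of these metrics are simple to compute from paired vectors of original and synthetic cell counts. A Python sketch (the function name and interface are illustrative, not from the paper):

```python
import numpy as np

def tau_metrics(orig, synth, k):
    """Empirical tau metrics for cell size k, given paired vectors of
    original and synthetic cell counts (one synthetic data set)."""
    orig, synth = np.asarray(orig), np.asarray(synth)
    tau1 = np.mean(orig == k)                     # proportion of original cells equal to k
    tau2 = np.mean(synth == k)                    # proportion of synthetic cells equal to k
    tau3 = np.mean(synth[orig == k] == k) if (orig == k).any() else 0.0
    tau4 = np.mean(orig[synth == k] == k) if (synth == k).any() else 0.0
    return tau1, tau2, tau3, tau4
```

For instance, with original counts `[1, 1, 2, 0]` and synthetic counts `[1, 2, 2, 0]`, the metrics for $k = 1$ are $\tau_1 = 0.5$, $\tau_2 = 0.25$, $\tau_3 = 0.5$ and $\tau_4 = 1$.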

When saturated models are used, the expected values of these metrics can be found analytically, as functions of the parameters available to the synthesizer ($\sigma$, $\alpha$ and, as later described, $m$). Hence the synthesizer is aware, a priori, of the noise required to achieve a specified $\tau_3$ or $\tau_4$ value.

## 3 The role of $m$ as a tuning parameter

When generating fully synthetic data in the traditional sense (Raghunathan2003), multiple synthetic data sets must be generated in order to obtain valid inferences, because the synthesis utilises the Bayesian framework developed for the multiple imputation of missing data (Rubin1987). Thus multiple data sets are needed to properly quantify the uncertainty arising from imputation when subsequently analysing the synthetic data sets. However, when values are just replaced rather than imputed, as is done in the $(\sigma, \alpha)$-synthesis mechanism, multiple data sets are unnecessary. So while $m > 1$ data sets are not fundamental to obtaining valid inferences, the quality of inferences, for example, the width of confidence intervals, can still be improved upon by increasing $m$. Thus, prior to releasing synthetic data, the synthesizer can tweak $m$ to find an optimal balance between risk and utility. The motivation for keeping $m$ small is two-fold: firstly, larger $m$ typically results in heightened disclosure risk (Reiter2005b, Reiter2009); and secondly, it relieves the pressure on storage demands, which is especially relevant when synthesizing large administrative databases.

Although there is obviously a greater computational burden when generating multiple data sets, the effect on central processing unit (CPU) time can be substantially reduced by making use of parallel processing. As the data sets are independently generated, this is a classic example of a task that can be undertaken in this way. Parallel processing utilises multi-core CPUs; for example, if the task is carried out on a CPU with four cores rather than one, then the CPU time would be roughly a quarter of what it would otherwise be. Besides, for the NBI, the CPU time is short anyway; the example synthesis presented in Section 5 took only approximately 0.3 seconds on a typical laptop running R.
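Because the $m$ data sets are generated independently, the work parallelises trivially. A Python sketch (illustrative names; written serially here, but each stream could be handed to a separate process without changing the results) using NumPy's `SeedSequence.spawn` to obtain independent, reproducible random streams:

```python
import numpy as np

def generate_m_datasets(counts, sigma, m, seed=0):
    """Generate m independent synthetic data sets from a saturated NBI
    model (sigma > 0 assumed). Each data set uses its own child RNG
    stream, so the loop could be farmed out to separate processes."""
    streams = [np.random.default_rng(s)
               for s in np.random.SeedSequence(seed).spawn(m)]
    mu = np.asarray(counts, dtype=float)
    p = 1.0 / (1.0 + sigma * mu)   # NBI via NumPy's negative binomial
    out = [rng.negative_binomial(1.0 / sigma, p) for rng in streams]
    return np.stack(out)           # shape (m, number_of_cells)
```

Note that a zero cell (with no pseudocount applied) has $p = 1$ and is always synthesized to zero under this sketch.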

### 3.1 Obtaining inferences from $m$ data sets

#### 3.1.1 Analysing the data sets separately before averaging the results

When analysing multiple synthetic data sets, traditionally the analyst considers each data set separately before later combining inferences. While point estimates are simply averaged, the way in which variance estimates are combined depends on the type of synthesis carried out: specifically, whether fully or partially synthetic data sets are generated and whether the Bayesian framework is used. The combining rules also depend on whether an analyst is using the synthetic data to estimate a population parameter $Q$, or an observed data estimate $\hat{Q}$: the former needs to account for the sampling uncertainty in the original data whereas the latter does not.

Suppose, then, that an analyst wishes to estimate a univariate population parameter $Q$ from $m$ synthetic data sets. A point estimate $q^{(l)}$, and its variance estimate $v^{(l)}$, is obtained from each synthetic data set, $l = 1, \ldots, m$. Before these estimates are substituted into a set of combining rules, it is common, as an intermediary step, to first calculate the following three quantities (Drechsler2011):

$$\bar{q}_m = \frac{1}{m}\sum_{l=1}^{m} q^{(l)}, \qquad b_m = \frac{1}{m-1}\sum_{l=1}^{m}\left(q^{(l)} - \bar{q}_m\right)^2, \qquad \bar{v}_m = \frac{1}{m}\sum_{l=1}^{m} v^{(l)},$$

where $\bar{q}_m$ is the mean estimate, $b_m$ is the 'between-synthesis variance', that is, the sample variance of the estimates, and $\bar{v}_m$ is the mean 'within-synthesis variance', the mean of the estimates' variance estimates.

The quantity $\bar{q}_m$ is an unbiased estimator for $Q$, and is so regardless of whether fully or partially synthetic data sets are generated. The variance estimator, however, does depend on the type of synthetic data. When using saturated models as described in Section 2, partially synthetic data sets are generated, because the Bayesian posterior predictive distribution is not used to impute missing information. Hence, the following estimator (Reiter2003) is valid when estimating the variance of $\bar{q}_m$:

$$T_p = \bar{v}_m + \frac{b_m}{m}.$$

The sampling distribution (if frequentist) or posterior distribution (if Bayesian) of $\bar{q}_m$ is a $t$-distribution with $\nu_p = (m-1)\left(1 + \bar{v}_m/(b_m/m)\right)^2$ degrees of freedom. Often, $\nu_p$ is large enough for the $t$-distribution to be approximated by a normal distribution. However, when the between-synthesis variability is much larger than the within-synthesis variability, that is, when $b_m/m$ is much larger than $\bar{v}_m$ - as may happen when large amounts of noise are applied to protect sensitive records - then the $t$-distribution is crucial to obtaining valid inferences.

As the data sets are "completely synthesized", the following estimator (Raab2016) is valid, too, in large samples:

$$T_s = \bar{v}_m\left(1 + \frac{1}{m}\right).$$

The large sample assumption facilitates the use of a normal distribution for the sampling distribution (or the posterior distribution) of $\bar{q}_m$ when $T_s$ is used to estimate the variance. The overriding advantage of $T_s$ is that, assuming its conditions hold, it allows valid variance estimates to be obtained from just $m = 1$ synthetic data set.

When using count models, as opposed to multinomial models, the synthetic data sample size $n^{\mathrm{syn}}$ is stochastic with mean equal to $n$ and a variance that depends on the synthesis model's tuning parameters, in addition to the distribution of cell sizes in the original data. The estimator $T_s$, as given above, assumes implicitly that $n^{\mathrm{syn}} = n$ in each of the $m$ synthetic data sets; and $T_p$ assumes that $n^{\mathrm{syn}}$ is constant across the $m$ synthetic data sets, though not necessarily equal to $n$. When saturated models are used, therefore, and $n^{\mathrm{syn}}$ is not constant, this assumption is violated which, in turn, may invalidate $T_p$ and $T_s$. However, in a simulation study unreported in this paper, it was found that $T_p$ and $T_s$ do still provide valid variance estimates, that is, confidence intervals achieve the nominal coverage. In some cases, such as when $K$ (the number of cells) is small and $\sigma$ is large, the difference between $n^{\mathrm{syn}}$ and $n$ may be too large to be ignorable, and then new estimators would be required. Such estimators may introduce weights that are related to the sample sizes of the $m$ synthetic data sets.
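The combining rules above can be sketched in a few lines of Python (illustrative naming; $T_s$ is given in its $n^{\mathrm{syn}} = n$ form, per the implicit assumption just discussed):

```python
import numpy as np

def combine_estimates(q, v):
    """Combine point estimates q[l] and variance estimates v[l] from m
    synthetic data sets. Returns the mean estimate, the between- and
    within-synthesis variances, the partial-synthesis estimator
    T_p = v_bar + b/m, and the large-sample estimator
    T_s = v_bar * (1 + 1/m)."""
    q, v = np.asarray(q, float), np.asarray(v, float)
    m = q.size
    q_bar = q.mean()          # mean point estimate
    b = q.var(ddof=1)         # between-synthesis variance
    v_bar = v.mean()          # mean within-synthesis variance
    T_p = v_bar + b / m
    T_s = v_bar * (1.0 + 1.0 / m)
    return q_bar, b, v_bar, T_p, T_s
```

For example, estimates `q = [1, 2, 3]` with `v = [0.5, 0.5, 0.5]` give $\bar{q}_3 = 2$, $b_3 = 1$, $\bar{v}_3 = 0.5$, so $T_p = 0.5 + 1/3$ and $T_s = 0.5 \times 4/3$.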

#### 3.1.2 Averaging the data sets before analysing them

When faced with multiple categorical data sets, analysts (and attackers) may either pool or average the data sets before analysing them. This is feasible only with contingency tables, as they have the same structure across the data sets. There are several advantages to doing so. Firstly, it means that analysts only have to undertake their analyses once rather than multiple times, thus leading to reduced computational time. Note, although averaging results in non-integer “counts”, standard software such as the glm function in R can typically cope with this and still allow models to be fit. Secondly, model-fitting in aggregated data is often hampered by the presence of zero counts, but either averaging or pooling reduces the proportion of zero counts, since it only takes one non-zero across the data sets to produce a non-zero when averaged or pooled.

When the NBI is used, given an original count $n_i$ ($i = 1, \ldots, K$), the corresponding mean synthetic cell count $\bar{n}^{\mathrm{syn}}_i = \frac{1}{m}\sum_{l=1}^{m} n^{\mathrm{syn},(l)}_i$ has mean and variance given as,

$$\mathrm{E}\left(\bar{n}^{\mathrm{syn}}_i\right) = n_i, \qquad \mathrm{Var}\left(\bar{n}^{\mathrm{syn}}_i\right) = \frac{n_i + \sigma n_i^2}{m}, \quad (1)$$

as the $m$ synthetic data sets are independent. Thus the effect of using $m > 1$ is that the variance of the synthetic cell counts is multiplied by $1/m$.
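A quick simulation check of (1), under the NBI parameterisation assumed in the earlier sketches (NumPy's `negative_binomial(1/sigma, 1/(1 + sigma*mu))` has mean $\mu$ and variance $\mu + \sigma\mu^2$):

```python
import numpy as np

# For an original count n_i, the mean of m synthetic counts should have
# mean n_i and variance (n_i + sigma * n_i**2) / m.
rng = np.random.default_rng(1)
n_i, sigma, m, reps = 6, 0.5, 4, 200_000
p = 1.0 / (1.0 + sigma * n_i)
draws = rng.negative_binomial(1.0 / sigma, p, size=(reps, m))
means = draws.mean(axis=1)       # one mean synthetic count per replication
print(round(means.mean(), 1))    # expect approximately n_i = 6
print(round(means.var(), 1))     # expect approximately (6 + 0.5 * 36) / 4 = 6
```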

## 4 Introducing the $\tau_3(k,a)$ and $\tau_4(k,a)$ metrics

When multiple synthetic data sets are generated and the mean synthetic count calculated - which is no longer always an integer - it becomes more suitable to consider the proportion of synthetic counts within a certain distance $a$ of original counts of $k$. To allow this, the metrics $\tau_3(k)$ and $\tau_4(k)$ can be extended to $\tau_3(k,a)$ and $\tau_4(k,a)$, respectively. Mathematically, for an arbitrary cell count $n_i$ in the original data and its corresponding mean cell count $\bar{n}^{\mathrm{syn}}_i$ in the synthetic data:

$$\tau_3(k,a) = p\left(\left|\bar{n}^{\mathrm{syn}}_i - k\right| \le a \mid n_i = k\right), \qquad \tau_4(k,a) = p\left(n_i = k \mid \left|\bar{n}^{\mathrm{syn}}_i - k\right| \le a\right).$$

The metric $\tau_3(k,a)$ is the probability that a cell count of size $k$ in the original data is synthesized to within $a$ of $k$; and the metric $\tau_4(k,a)$ is the probability that a cell count within $a$ of $k$ in the synthetic data originated from a cell of size $k$. Unlike $k$, $a$ does not need to be an integer.

By extending the $\tau_2$ metric in a similar way, such that $\tau_2(k,a)$ is the proportion of cell counts in the synthetic data within $a$ of $k$, it follows that $\tau_1(k)\,\tau_3(k,a) = \tau_2(k,a)\,\tau_4(k,a)$.

It is straightforward to see how $\tau_3(k)$ and $\tau_4(k)$ are, in effect, special cases of $\tau_3(k,a)$ and $\tau_4(k,a)$, respectively; the former are returned by setting $a = 0$. For small $a$, these metrics are intended primarily as risk metrics, because they are dealing with uniques or near-uniques. However, when $a$ is reasonably large, $\tau_3(k,a)$ and $\tau_4(k,a)$ are, perhaps, better viewed as utility metrics, because they are dealing with the proportion of uniques that are synthesized to much larger counts (which impacts utility).

For some distributions there are formulae which give exact distributions for sums of random variables; for example, the sum of two Poisson random variables with means $\mu_1$ and $\mu_2$ is Poisson distributed with mean $\mu_1 + \mu_2$. Alas, for most count distributions, including the NBI, such relationships are less clear. When $m$ is sufficiently large, though, tractable expressions for the $\tau_3(k,a)$ and $\tau_4(k,a)$ metrics can be obtained via the Central Limit Theorem (CLT), as the distributions of the mean synthetic cell counts can be approximated by a normal distribution, with mean and variance as given in (1). That is, given an original count $n_i$ ($i = 1, \ldots, K$), when $m$ is large, the distribution of the corresponding mean synthetic cell count is given as:

$$\bar{n}^{\mathrm{syn}}_i \mid n_i \;\dot\sim\; \mathrm{N}\left(n_i,\, \frac{n_i + \sigma n_i^2}{m}\right).$$

This can be used to approximate $\tau_3(k,a)$ and $\tau_4(k,a)$:

$$\tau_3(k,a) \approx \Phi\left(\frac{a\sqrt{m}}{\sqrt{k + \sigma k^2}}\right) - \Phi\left(\frac{-a\sqrt{m}}{\sqrt{k + \sigma k^2}}\right), \quad (2)$$

$$\tau_4(k,a) \approx \frac{\tau_1(k)\,\tau_3(k,a)}{\sum_{j} \tau_1(j)\left[\Phi\left(\frac{(k + a - j)\sqrt{m}}{\sqrt{j + \sigma j^2}}\right) - \Phi\left(\frac{(k - a - j)\sqrt{m}}{\sqrt{j + \sigma j^2}}\right)\right]}, \quad (3)$$

where $\Phi$ is used to denote the cumulative distribution function (CDF) of the standard normal distribution.
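Expression (2) is cheap to evaluate. A small Python helper (the function name is illustrative, and it implements only the normal approximation, not the exact NBI probability):

```python
import math

def tau3_normal_approx(k, a, sigma, m):
    """CLT approximation (2) to tau_3(k, a): the probability that the
    mean of m synthetic NBI counts falls within a of an original count
    k, using the N(k, (k + sigma*k**2) / m) approximation."""
    sd = math.sqrt((k + sigma * k * k) / m)
    Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return Phi(a / sd) - Phi(-a / sd)
```

For instance, with $\sigma = 0$ and $m = 100$, a unique ($k = 1$) has standard deviation $\sqrt{1/100} = 0.1$, so it is synthesized to within $a = 0.1$ of 1 with approximate probability $2\Phi(1) - 1 \approx 0.68$ (though note the normal approximation is least reliable for such small counts).

```python
print(round(tau3_normal_approx(1, 0.1, 0.0, 100), 2))  # prints 0.68
```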

## 5 Empirical Study

The data set synthesized here was constructed with the intention of being used as a substitute for the English School Census, an administrative database held by the Department for Education (DfE). It was generated using public 2011 census output tables involving local authority, sex, age, language and ethnicity; a more detailed description of its origin is available (Jackson2021). The data themselves, however, are of little importance here, as the objective is merely to demonstrate the risk and utility of synthetic data and, specifically, the role that $m$ plays in relation to the risk-utility trade-off; hence any categorical data set would suffice.

The data comprise 8.2 million individuals observed over five categorical variables. The local authority variable has the greatest number of categories with 326; while sex has the fewest with 4. When aggregated, the resulting contingency table has $K$ cells, just over 90% of which are unobserved, that is, have a count of zero. The proportions of cells in the original data of size $k$, denoted by $\tau_1(k)$, are given in Table 1.

| Cell size ($k$) | 0 | 1 | 2 | 3 | 4 | 5 | $\geq 6$ |
|---|---|---|---|---|---|---|---|
| $\tau_1(k)$ | 0.9038 | 0.0346 | 0.0148 | 0.0075 | 0.0056 | 0.0038 | 0.0300 |

The function rNBI from the R package gamlss.dist (Stasinopoulos2007) was used to generate multiple synthetic data sets using the $(\sigma, \alpha)$-synthesis mechanism described in Section 2. This was done for a range of $\sigma$ values (0, 0.1, 0.5, 2 and 10), and multiple synthetic data sets were generated for each. This allowed comparisons to be drawn for a range of $m$ values; for example, taking the first five data sets gives $m = 5$, taking the first ten gives $m = 10$, etc.

### 5.1 Measuring risk

Evaluating risk in synthetic data, particularly in synthetic categorical data, is not always straightforward. Attempting to estimate the risk of re-identification (Reiter2009) is not possible, because the ability to link records is lost when a microdata set is aggregated, synthesized and disaggregated back to microdata again. Moreover, the correct attribution probability (CAP) risk metric (Taub2018), which seeks to quantify the risk of attribute disclosure, relies on the presence of an obvious sensitive variable, for which there is no obvious candidate in this data set.

Instead, the $\tau_3(k, 0.1)$ and $\tau_4(k, 0.1)$ metrics (that is, setting $a = 0.1$), introduced in Section 4, were used as risk metrics. Figure 1 shows that either increasing $m$ or decreasing $\sigma$ increases $\tau_3$ and $\tau_4$ and hence risk. There is an initial fall in the curves as $m$ increases initially, suggesting lower not higher risk. However, this is just owing to the small $a$: for example, when $m = 5$, the only way to obtain a mean synthetic count within 0.1 of 1 when, say, $n_i = 1$, is by obtaining a one in each of the five synthetic data sets, compared to just once when $m = 1$.

When $m$ is large, the $\tau_3(k,a)$ and $\tau_4(k,a)$ metrics can be approximated analytically through (2) and (3), which rely on the CLT. There is uncertainty in both the empirical values (owing to simulation error) and the analytical values (owing to the normal approximation), though the divergences between the empirical and analytical values are small.

In general, then, increasing $m$ or decreasing $\sigma$ increases risk. This is also shown visually in Figure 2, which demonstrates how $\sigma$ and $m$ can be used in tandem to adjust risk. Here, $\tau_4(1, 0.1)$ is used as the risk axis, but other choices of $k$ and $a$ would give similar results.

### 5.2 Measuring utility

As saturated models are used, increasing $m$ (for a given $\sigma$) causes the mean synthetic counts to tend towards the original counts. This can be seen in the boxplots in Figure 3, which show that either increasing $m$ or reducing $\sigma$ reduces the percentage differences between the mean synthetic counts and the original counts. Similarly, the Hellinger and Euclidean distances shown in Figure 4 show an improvement in general utility when either increasing $m$ or reducing $\sigma$.

These measures are equally relevant to risk, too, hence Figures 3 and 4 reiterate that risk increases with $m$. It is fairly trivial, however, that reducing simulation error increases risk and utility. It is more useful to gain an insight into the rate at which risk and utility increase with $m$, that is, the shape of the curves. For example, Figure 4 shows that increasing $m$ has a greater effect when $m$ is small than when $m$ is large.

The utility of synthetic data can also be assessed for specific analyses by, for example, comparing regression coefficient estimates obtained from a model fit to both the observed and synthetic data. While such measures only assess the synthetic data's ability to support a particular analysis, they nevertheless can be a useful indicator of, for example, the $m$ required to attain a satisfactory level of utility.

Here, the estimand of interest is a regression coefficient $\beta_1$ from the logistic regression of age (dichotomised into two groups, coded 0 and 1) on language, that is,

$$\mathrm{logit}\left[p(\mathrm{age} = 1)\right] = \beta_0 + \beta_1 \cdot \mathrm{language}.$$

A subset of the data were used, as just two of the language variable's seven categories were considered, while the age variable was dichotomised. When estimated from the original data, $\beta_1$ - which is a log marginal odds ratio - was equal to -0.0075 with a 95% confidence interval of (-0.0151, -0.0001). Note that, in order to estimate this, it was assumed that the original data constituted a simple random sample drawn from a much larger population. It is hugely doubtful whether such an assumption would be reasonable in practice, but the purpose here was just to evaluate the ability of the synthetic data to produce similar conclusions to the original data.

The analysis was undertaken in the two ways described in Section 3. That is, firstly, the synthetic data sets were analysed separately and variance estimates were obtained through the estimator $T_p$. Secondly, the $m$ synthetic data sets were pooled into one data set prior to the analysis and variance estimates were obtained through the estimator $T_s$.

As can be seen in Figure 5, the variance estimates from $T_p$ were noticeably larger than those from $T_s$ for small $m$. This was worrying for the validity of $T_s$ - and the confidence intervals subsequently computed using $T_s$ - especially since the sampling distribution of $\bar{q}_m$ under $T_p$ was not approximated by a normal distribution, but by a $t$-distribution with $\nu_p$ degrees of freedom, thus widening confidence intervals further. This suggests that the large sample approximation that $T_s$ relies on was not reasonable in this case.

The confidence interval computed from the original data set was compared with the confidence intervals computed from the synthetic data sets via the confidence interval overlap metric (Karr2006, Snoke2018). This metric is a composite measure that takes into account both the length and the accuracy of the synthetic data confidence interval. Yet whether these factors are weighted appropriately is open to debate. Valid confidence intervals estimated from synthetic data, that is, confidence intervals that achieve the nominal coverage, are longer than the corresponding confidence intervals estimated from the original data, because synthetic data estimates are subject to the uncertainty present in the original data estimates, plus additional uncertainty from synthesis. However, a synthetic data confidence interval that is, say, narrower than the original data confidence interval by a given factor (hence clearly invalid) would yield roughly the same overlap as one that is wider by the same factor. Moreover, either an infinitely wide or an infinitely narrow synthetic data confidence interval would achieve an overlap of 0.5.
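For reference, the interval overlap measure can be sketched as follows (a Python rendering of the standard two-term average; negative values, which arise for disjoint intervals, are sometimes truncated at zero):

```python
def ci_overlap(orig_ci, synth_ci):
    """Interval-overlap measure (Karr et al., 2006): the average of the
    fraction of each interval covered by the intersection. Returns 1
    for identical intervals; can be negative for disjoint intervals."""
    L_o, U_o = orig_ci
    L_s, U_s = synth_ci
    L_i, U_i = max(L_o, L_s), min(U_o, U_s)   # the intersection
    return 0.5 * ((U_i - L_i) / (U_o - L_o) + (U_i - L_i) / (U_s - L_s))
```

For example, a synthetic interval (0, 2) against an original interval (0, 1) gives an overlap of $0.5 \times (1/1 + 1/2) = 0.75$, illustrating how a too-wide interval is only mildly penalised.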

The confidence interval overlap results are presented in Table 2. The top frame gives the overlap values from when the data sets are analysed separately, and the bottom frame gives the results from when the data sets are pooled. It can be seen that increasing $m$ broadly results in an increase in the overlap; and that the overlap tends towards 1 as the original and synthetic data confidence intervals converge. The confidence intervals computed using $T_s$ are less robust than those computed using $T_p$, which is evident in the zero overlap values in the bottom frame of Table 2. This is because, unlike the variance estimator $T_p$, $T_s$ only considers the within-synthesis variability $\bar{v}_m$, not the between-synthesis variability $b_m$. Therefore, if the point estimate deviates, this deviation is not captured by $T_s$.

The overlap when the data sets were analysed separately and $T_p$ used (columns correspond to increasing values of $m$, left to right):

| $\sigma$ | | | | | | | |
|---|---|---|---|---|---|---|---|
| 0 | 0.883 | 0.901 | 0.950 | 0.992 | 0.990 | 0.994 | 0.983 |
| 0.1 | 0.533 | 0.692 | 0.822 | 0.898 | 0.913 | 0.925 | 0.917 |
| 0.5 | 0.536 | 0.635 | 0.778 | 0.843 | 0.878 | 0.909 | 0.923 |
| 2 | 0.000 | 0.587 | 0.667 | 0.726 | 0.716 | 0.742 | 0.780 |
| 10 | 0.522 | 0.535 | 0.554 | 0.583 | 0.604 | 0.623 | 0.638 |

The overlap when the data sets were pooled and $T_s$ used:

| $\sigma$ | | | | | | | |
|---|---|---|---|---|---|---|---|
| 0 | 0.881 | 0.905 | 0.951 | 0.988 | 0.990 | 0.994 | 0.983 |
| 0.1 | 0.700 | 0.317 | 0.802 | 0.942 | 0.904 | 0.920 | 0.915 |
| 0.5 | 0.221 | 0.344 | 0.653 | 0.789 | 0.864 | 0.915 | 0.967 |
| 2 | 0.020 | 0.436 | 0.856 | 0.775 | 0.825 | 0.809 | 0.906 |
| 10 | 0.000 | 0.664 | 0.454 | 0.000 | 0.078 | 0.258 | 0.465 |

### 5.3 Tuning $\sigma$ and $m$ in relation to the risk-utility trade-off

The plots in Figure 6 show how $\sigma$ and $m$ can be tuned in tandem to produce synthetic data sets that sit favourably within the risk-utility trade-off. These trade-off plots, though, depend on the metrics used to measure risk and utility. Here, risk was measured by either $\tau_3(1, 0.1)$ or $\tau_4(1, 0.1)$, and utility by either confidence interval overlap (using $T_p$) or Hellinger distance. The Hellinger distances were standardised onto the interval [0,1] (by dividing by the largest Hellinger distance observed and then subtracting from 1, so that 1 represents maximum utility and 0 minimum utility).

Looking at the top-left plot, where risk is measured by $\tau_4(1, 0.1)$ and utility by confidence interval overlap, certain pairs of $(\sigma, m)$ values produce synthetic data sets with roughly the same level of risk, yet the overlap (utility) for one is roughly 0.1 more than for the other. It is also possible for some synthetic data sets to strictly dominate others, that is, to have both lower risk and greater utility. For example, again looking at the top-left plot, some $(\sigma, m)$ combinations have higher risk but lower utility than others. These visual trade-offs are plotted using the empirical results, so are subject to variation from simulation; the confidence interval overlap values, in particular, can be volatile, especially when $\sigma$ is large.

The intention is that the synthesizer produces such plots prior to releasing the data. Many metrics, including $\tau_3(k,a)$ and $\tau_4(k,a)$, can be expressed analytically when using saturated models, which means that the synthesizer does not actually have to generate the synthetic data to produce such plots.

## 6 Discussion

In addition to $m$, the synthesizer could also increase or decrease $n^{\mathrm{syn}}$, the expected sample size of each synthetic data set. A single synthetic data set ($m = 1$) with expected sample size $n$ contains roughly the same number of records as two synthetic data sets ($m = 2$) each with expected sample size $n/2$. To generate a synthetic data set with an expected sample size of $n/2$, the synthesizer simply takes draws from NBI distributions with means exactly half of what they were previously. Reducing $n^{\mathrm{syn}}$ should reduce risk, as fewer records are released, but inevitably reduces utility, too; once again, it calls for an evaluation with respect to the risk-utility trade-off.

Moreover, there are further tuning parameters that could be incorporated into this synthesis mechanism. One way would be to use a three-parameter distribution. When using a two-parameter count distribution, the synthesizer can increase the variance but cannot control how the variability manifests itself. The use of three-parameter count distributions would allow the synthesizer to alter the skewness, that is, they could change the shape of the distribution for a given mean and variance.

The estimator $T_s$ (Raab2016) relies on a large sample assumption, thus allowing the use of a normal distribution for the sampling distribution (or the posterior distribution) of $\bar{q}_m$. When the sample is not 'large', as with the estimator $T_p$, it may be possible to approximate the sampling distribution by a $t$-distribution. A greater understanding of the asymptotic properties would ensure the validity of resulting inferences. An alternative in this instance, that is, when the underlying sampling distribution is unknown, is to use bootstrap methods and calculate, for example, confidence intervals using the 'bootstrap-$t$ method' (Efron1993).

This issue with sample size seems to be linked to a wider issue with tabular data analysis that stretches beyond synthetic data. It is often unclear whether sample size in tabular data should refer to $n$ (the number of subjects) or $K$ (the number of cells). This has implications when using, for example, the Bayesian Information Criterion (BIC), and thus can affect model selection. Using $K$ appears to be more intuitive, given that it is the cell counts that are modelled and subsequently simulated from.

There are, of course, disadvantages to generating $m > 1$ synthetic data sets, with the most obvious being the increased risk. Nevertheless, the potential benefits warrant further exploration, especially in relation to the risk-utility trade-off: does the gain in utility outweigh the increase in risk?

Organisations are taking a greater interest in making data - such as administrative data - available to researchers by producing their own synthetic data. For this to be successful, organisations need to guarantee the protection of individuals' personal data - which, as more data becomes publicly available, becomes ever more challenging - while also producing data that are useful for analysts. Therefore, there needs to be scope to tune the risk and utility of synthetic data effectively, and considering $m$ as a tuning parameter helps to achieve this.