Many agencies release data to support statistical research and industrial work. Often, however, these data-sets carry information that is sensitive for the individuals concerned. Erasing the name or an identity number associated with an individual is not always sufficient to hide that individual's identity. For example, imagine that a data-set of variables corresponding to individuals is released, and that one of these variables is named “pin-code” (sometimes called zip-code). A “pin-code” is not supposed to be a sensitive variable, but an intruder who is trying to identify some individual in the data-set may have an idea of where the individual lives and can thus guess his “pin-code”. In this case, if no other individual in the data-set has the same “pin-code”, the intruder can directly infer which row in the data-set corresponds to the individual, and the identity is revealed. Hence, suppressing identity numbers or names is not always sufficient to prevent identity disclosure. When a few variables have low-frequency cells, it is usually easy for the intruder to identify the individual.
Various articles have discussed this problem, and various authors have proposed different risk measures to evaluate the security of released data. Here, however, we follow the framework of Nayak et al., in which the intruder knows the category of a key variable for his target unit. If the variable has several categories, we assume without loss of generality that the target unit belongs to the first category, and we work with the frequencies of the categories in the data-set.
If the frequency of the target's category is 1, i.e. only the target unit has that category, the intruder can identify the row of his target unit with certainty. If this frequency is small, the intruder knows that his target unit is one of only a few units, and by taking other information into consideration he may successfully identify the row of his target unit or make a correct guess. Thus, in this case, the variable information must be suppressed before the data are released.
One way to do that is to erase the variable completely, but that is undesirable for the statistician. The usual practice is to perturb the data in such a way that the new data can be treated like the original data when making statistical inferences.
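Let $X$ denote the variable in the original data-set, with categories $c_1, \dots, c_k$, and let $Z$ denote the same variable in the perturbed data. The transition matrix is then given by $P = ((p_{ij}))$, where

$$p_{ij} = P(Z = c_j \mid X = c_i), \qquad i, j = 1, \dots, k. \tag{1}$$

This matrix is not released and is unknown to the statistician. This method of obfuscation is known as the post-randomization method (PRAM). If we assume $(T_1, \dots, T_k) \sim \text{Multinomial}(n, \pi)$ for the original frequencies, then after the transformation of $X$ to $Z$, if $S_1, \dots, S_k$ are the frequencies of each class in the perturbed data, then $(S_1, \dots, S_k) \sim \text{Multinomial}(n, \lambda)$, where $\lambda = P^{\top}\pi$ ($\pi = (\pi_1, \dots, \pi_k)^{\top}$, $\lambda = (\lambda_1, \dots, \lambda_k)^{\top}$). If we want to treat $Z$ as the original data, we must have $\lambda = \pi$. But $\pi$ is generally unknown to the one who is masking the data. However, he can estimate $\pi$ from the original data with $T/n$, where $T = (T_1, \dots, T_k)^{\top}$ and $n$ is the total sample size. If we want $S/n$, with $S = (S_1, \dots, S_k)^{\top}$, to be an unbiased estimator of $\pi$, we must have

$$P^{\top} T = T. \tag{2}$$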
Gouweleeuw et al. (1998) defined a post-randomization method to be an invariant PRAM if the transition matrix satisfies Equation (2). The error due to estimation after post-randomization has been studied by various authors, including Nayak et al.
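The invariance requirement can be checked numerically. The following is a minimal sketch, assuming that Equation (2) reads "the transposed transition matrix applied to the original frequency vector returns the same vector". The function names, the frequencies and both matrices are hypothetical and chosen only for illustration; exact fractions are used so that the equality test is not affected by rounding.

```python
from fractions import Fraction as F

def expected_perturbed_counts(P, T):
    """Return E[S | T], whose j-th entry is sum_i T[i] * P[i][j]."""
    k = len(T)
    return [sum(T[i] * P[i][j] for i in range(k)) for j in range(k)]

def is_invariant(P, T):
    """Check that P is row-stochastic and leaves T unchanged in expectation."""
    rows_ok = all(sum(row) == 1 and all(p >= 0 for p in row) for row in P)
    return rows_ok and expected_perturbed_counts(P, T) == list(T)

# Hypothetical frequencies of three categories.
T = [2, 4, 6]

# A matrix that satisfies the assumed invariance condition for this T.
P_inv = [
    [F(1, 2), F(1, 4), F(1, 4)],
    [F(1, 8), F(3, 4), F(1, 8)],
    [F(1, 12), F(1, 12), F(5, 6)],
]

# A matrix that swaps categories 0 and 1 with a fixed probability;
# it is row-stochastic but not invariant for this T.
P_not = [
    [F(9, 10), F(1, 10), F(0)],
    [F(1, 10), F(9, 10), F(0)],
    [F(0), F(0), F(1)],
]

invariant_flags = (is_invariant(P_inv, T), is_invariant(P_not, T))
```

In this sketch the second matrix fails because the expected perturbed frequency of the rarest category drifts away from its original value, which is exactly what an invariant PRAM is meant to prevent.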
One of the common techniques for achieving an invariant PRAM is to use an Inverse Frequency Post Randomization (IFPR) block diagonal matrix, in which the categories are partitioned into a few groups and categories are interchanged only within each group. If it is not desirable to change some category, it can be made to form its own block. Thus, the transition probability between two categories is positive if they fall into the same group and zero if they fall into different groups. Within each group, the transition probability is given by,
where and is the block size of the group that and fall into. However, the parameter of the model should be chosen carefully to ensure that the perturbed data are secure from the intruder, at least to a certain extent. To measure the risk of disclosure, Nayak et al. suggested checking whether the probability of correctly identifying an individual, given any structure of and any value of , is bounded by some specified quantity . Moreover, they showed that there exists a , where , which gives the transition matrix , where is chosen according to Equation (3) with for each and is the block size of the group belongs to. Without loss of generality, we assume the block belongs to is the first block. This matrix, when used to post-randomize ,
for any , where CM denotes “Correct Match”. However, if we can extend the search range of from to and can find all categories in the first block that satisfy for all , then the level of security can be extended to any . Note that, under this definition, the probabilities remain valid, since they certainly lie between 0 and 1. However, the smaller the value of , the larger the required block size. Therefore, we can extend the security only as far as the frequency distribution permits.
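To make the block structure concrete, the sketch below builds an IFPR block diagonal matrix. It assumes the inverse-frequency form commonly given for IFPR: within a block of size m, a category with frequency T_i keeps its value with probability 1 - θ/T_i and moves to each other category of the block with probability θ/((m - 1) T_i). This form is an assumption here and should be checked against Equation (3); the function name, the frequencies and the value of θ are hypothetical.

```python
from fractions import Fraction

def ifpr_block_diagonal(T, blocks, theta):
    """Build a block diagonal transition matrix over categories 0..k-1.

    T      : observed frequency of each category (all positive)
    blocks : list of lists of category indices forming a partition
    theta  : perturbation parameter, must not exceed any frequency
             inside a block of size greater than 1
    ASSUMED form inside a block of size m > 1:
        p_ii = 1 - theta / T_i,  p_ij = theta / ((m - 1) * T_i) for j != i
    """
    k = len(T)
    P = [[Fraction(0)] * k for _ in range(k)]
    for block in blocks:
        m = len(block)
        for i in block:
            if m == 1:
                P[i][i] = Fraction(1)  # a singleton block is left unchanged
                continue
            if theta > T[i]:
                raise ValueError("theta must not exceed any frequency in a block")
            for j in block:
                if i == j:
                    P[i][j] = 1 - Fraction(theta) / T[i]
                else:
                    P[i][j] = Fraction(theta) / ((m - 1) * T[i])
    return P

# Hypothetical example: three rare categories share a block,
# the frequent category forms its own block and is never altered.
T = [2, 5, 9, 40]
P = ifpr_block_diagonal(T, blocks=[[0, 1, 2], [3]], theta=1)
column_totals = [sum(T[i] * P[i][j] for i in range(len(T))) for j in range(len(T))]
```

Under the assumed form, the column totals reproduce the original frequencies, so the matrix is invariant for any admissible θ; the choice of θ and of the block then only affects the level of protection.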
2 Our Approach
As mentioned earlier, our framework is similar to that of Nayak et al. From the intruder’s point of view, we assume that once he gains access to the released data, he checks the rows whose released category equals the known category of his target unit, and counts the units in the released data having that class. If there are none, the intruder stops searching for his target unit in the data-set. If there is at least one, he selects one unit at random among these individuals and concludes that it is his target unit. Under this assumption, we discuss how to choose the parameter of the IFPR block diagonal matrix (see Equation (3)), depending on , so that the probability of correctly identifying the target unit is less than some specified bound. Our method is described in the following paragraph.
Fix a . Note that, if , there is no need for obfuscation, as the intruder can only choose one unit at random and conclude that it is his target unit . Since, in the original data, the probability of correctly identifying is , if , the probability is less than . This is quite intuitive, since identification risk is a problem associated with low-frequency classes. If , then we find classes (where the function is discussed in Sec. 3) such that for each of these classes , for each . Such an event is usually feasible for moderate values of , as usually has small values. If such classes are available, we can have any desired level of security, i.e., for any fixed , there exists a corresponding such that if the data are perturbed with matrix , Equation (4) holds. If, however, such classes are not available, we can find the integer such that . Since classes are not available such that for each , we now set and try to find classes such that for each . If we fail, we next try and so on, until we succeed for some . Since for , there exist classes such that for each , and a , such that if the data are perturbed with , then Equation (4) is satisfied for any . According to Nayak et al., there is always a solution for , which implies can take a minimum value. However, can take higher values in many cases.
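The intruder's strategy assumed in this section can be simulated directly. The sketch below is a Monte Carlo illustration, not the analytical risk measure of the paper: it estimates the overall probability of a correct match (not the probability conditional on the released frequency) for a hypothetical data-set of 12 units and a hypothetical transition matrix. All names and numbers are assumptions made for the example.

```python
import random

def perturb(categories, P, rng):
    """Apply PRAM: replace each unit's category i by j drawn from row P[i]."""
    k = len(P)
    return [rng.choices(range(k), weights=P[c])[0] for c in categories]

def estimate_cm_probability(categories, P, target, trials=10000, seed=0):
    """Estimate the probability that the intruder picks the target unit.

    The intruder knows the target's original category, looks at all released
    units showing that category, stops if there are none, and otherwise picks
    one of them uniformly at random.
    """
    rng = random.Random(seed)
    known = categories[target]
    hits = 0
    for _ in range(trials):
        released = perturb(categories, P, rng)
        candidates = [u for u, z in enumerate(released) if z == known]
        if not candidates:
            continue  # the intruder stops searching
        if rng.choice(candidates) == target:
            hits += 1
    return hits / trials

# Hypothetical data: categories 0, 1, 2 with frequencies 2, 4, 6; unit 0 is the target.
categories = [0] * 2 + [1] * 4 + [2] * 6
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
P_example = [
    [1 / 2, 1 / 4, 1 / 4],
    [1 / 8, 3 / 4, 1 / 8],
    [1 / 12, 1 / 12, 5 / 6],
]

risk_unperturbed = estimate_cm_probability(categories, identity, target=0)
risk_perturbed = estimate_cm_probability(categories, P_example, target=0)
```

With the identity matrix the estimate should be close to one over the frequency of the target's category, while the perturbed data should give a visibly smaller value.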
3 Model, Assumptions and Results
As discussed earlier, the goal of the paper is to find a method by which a data-set can be perturbed while ensuring as much security as possible. Since security is an abstract term, we limit ourselves to ensuring that the measure given by Equation (4) holds for low values of . The smaller the value of , the better the security of the data. Let us denote by the probability of correctly identifying the individual from the released data given and the frequency distribution of given by . In other words,
If is bounded by for any , then note that
is bounded by for any , which signifies that the probability of correctly identifying an individual is less than , no matter how small or large the frequency of category is in the released data. is used instead of because it is hard to calculate the probability if is not known. Note that CM stands for “Correct Match” in Equations (5) and (6).
Recall that if we use an IFPR block diagonal matrix to perturb , the category may get changed to one of , with positive probability. Let us denote for . Observe that can be rewritten as
By our assumption, since the intruder searches for his target unit among the ones with category , . Again, since the intruder is assumed to choose one unit at random among units to be , for any . Thus,
Again, we have,
where , , and the sum is over all integer-valued such that and . We denote the sum by
From Equation (7), we finally have,
Nayak et al. observed that although it seems intuitive that for any , there are certain cases in which it does not hold. However, they proved that if , i.e., for all , then for any . Intuitively, if is highest, i.e., the odds that goes to any category other than , then the risk of disclosure should be maximum if . We verified that this is indeed the case, which leads to our first result, stated in the following theorem; the proof is given in the Appendix.
Theorem 3.1. If , i.e., for any , then for any , , where is given by Equation (10).
To proceed further, we also need the following lemma, the proof of which is deferred to the Appendix.
For any fixed , there exists a such that .
For Theorem 3.1 to hold in an IFPR block diagonal matrix, we must have , which leads to the condition , i.e., . Note that if and , . Hence, it is enough to find for Theorem 3.1 to hold. Again, is chosen by solving . Thus, for fixed and we have a and a corresponding , which is the largest integer contained in . is the minimum number of categories required to form the block containing . For some possible choices of and some possible values of , the value of is calculated and given in Table 1. While choosing the block size, one must note that the block size must be at least to ensure Equation (4).
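The conditional probability of a correct match given the released frequency of the target's category can also be computed exactly, which is useful for checking a candidate transition matrix against a desired bound. The sketch below follows the decomposition used in this section: the intruder succeeds with probability one over the released frequency when the target keeps its category, and never otherwise. It additionally assumes that PRAM perturbs units independently, so that the number of other units released in the target's category is a sum of independent Bernoulli variables. The function names and the example numbers are hypothetical, and the formula is not claimed to match Equation (10) term by term.

```python
def count_distribution(probs):
    """Distribution of a sum of independent Bernoulli(p) variables."""
    dist = [1.0]
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for r, mass in enumerate(dist):
            new[r] += mass * (1 - p)
            new[r + 1] += mass * p
        dist = new
    return dist

def cm_probability_given_count(P, T, a, target_cat=0):
    """P(correct match | a released units show the target's category).

    P : transition matrix (list of rows), T : original frequencies,
    target_cat : original category of the target unit.
    """
    if a < 1:
        return 0.0
    # Probability that each unit other than the target is released in target_cat.
    probs = []
    for i, t in enumerate(T):
        others = t - 1 if i == target_cat else t
        probs.extend([P[i][target_cat]] * others)
    dist = count_distribution(probs)

    def mass(r):
        return dist[r] if 0 <= r < len(dist) else 0.0

    p_stay = P[target_cat][target_cat]
    numerator = p_stay * mass(a - 1)            # target stays, a - 1 others join
    denominator = numerator + (1 - p_stay) * mass(a)
    if denominator == 0:
        return 0.0                              # the event has probability zero
    return numerator / (a * denominator)

# Hypothetical frequencies and transition matrix.
T = [2, 4, 6]
P_example = [
    [1 / 2, 1 / 4, 1 / 4],
    [1 / 8, 3 / 4, 1 / 8],
    [1 / 12, 1 / 12, 5 / 6],
]
risk_by_count = [cm_probability_given_count(P_example, T, a) for a in range(1, 13)]
worst_case_risk = max(risk_by_count)
```

Taking the maximum over all possible released frequencies gives a worst-case figure that can be compared with the chosen bound before the data are released.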
4 Simulation Results
To illustrate the process, we simulate a sample of size from categories such that the probability of falling into a category is given by the vector . The sample has the frequency distribution given in Table 2.
Two units in the data-set have Category 1, one of which is unit . Since , the probability of a Correct Match from the true data is 0.5, which is very high. We want this probability to be lower, say below . So, we transform the data to using the IPRAM method with a transition matrix . To choose an ideal , we apply the procedure of this paper. From Table 1, we find that the required block size is 6. So, we apply the transition to the first categories with the lowest probability of occurrence and leave the remaining 2 categories unaltered. To solve for , we have , which gives the transition matrix,
Using this transition matrix, we ran 1000 simulations to obtain 1000 different perturbed data-sets. The mean squared estimation error for each category is given by , which is quite low, and the average probability of correct match over the 1000 simulations is .
The process thus seems to work well for simulated data.
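A simulation of this kind can be sketched as follows. Since Table 2 and the fitted parameter are not reproduced here, the frequencies of the eight categories and the value of θ below are hypothetical, and the transition matrix uses the assumed inverse-frequency form (stay with probability 1 - θ/T_i, move to each other category of the block with probability θ/((m - 1) T_i)). The sketch applies the transition to the six rarest categories, leaves the other two unaltered, and reports the mean squared error of the released relative frequencies over 1000 runs.

```python
import random

def ifpr_matrix(T, block, theta):
    """Identity outside the block; ASSUMED inverse-frequency form inside it."""
    num_cat, m = len(T), len(block)
    P = [[1.0 if i == j else 0.0 for j in range(num_cat)] for i in range(num_cat)]
    for i in block:
        for j in block:
            if i == j:
                P[i][j] = 1 - theta / T[i]
            else:
                P[i][j] = theta / ((m - 1) * T[i])
    return P

def simulate_mse(T, P, runs=1000, seed=1):
    """Mean squared error of released relative frequencies, per category."""
    rng = random.Random(seed)
    num_cat, n = len(T), sum(T)
    squared_error = [0.0] * num_cat
    for _ in range(runs):
        S = [0] * num_cat
        for i, t in enumerate(T):
            # Perturb the t units of category i independently.
            for j in rng.choices(range(num_cat), weights=P[i], k=t):
                S[j] += 1
        for j in range(num_cat):
            squared_error[j] += ((S[j] - T[j]) / n) ** 2
    return [e / runs for e in squared_error]

# Hypothetical frequencies of eight categories (n = 100), sorted by frequency.
T = [2, 3, 5, 6, 8, 10, 30, 36]
P = ifpr_matrix(T, block=[0, 1, 2, 3, 4, 5], theta=1.5)
mse = simulate_mse(T, P)
```

In this sketch the two unaltered categories have zero error by construction, and the errors for the six perturbed categories stay small relative to the squared sample size, which is the qualitative behaviour reported above.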
The method works well in most practical cases because, since we want to obfuscate categories with low frequency, there will generally be a sufficient number of categories with higher frequencies. Accordingly, the security level can be increased.
However, the greatest drawback of this method of obfuscation is that we have assumed the intruder's strategy, i.e., that he selects one of the units with the desired categorical value at random after looking at the obfuscated data. This is not expected to happen in practice, since in most cases many other covariates will be associated with each unit and the selection will not, in general, be random. This problem was also discussed in .
However, if the model assumptions hold, the discussed method succeeds in providing better security.
- W. A. Fuller. Masking Procedures for Microdata Disclosure Limitation. Journal of Official Statistics, 1993, pp. 383-406.
- T. K. Nayak, C. Zhang and J. You. Measuring Identification Risk in Microdata Release and Its Control by Post-Randomization. Center for Disclosure Avoidance Research, U.S. Census Bureau, Washington DC 20233, 2016.
- T. K. Nayak, S. A. Adeshiyan and C. Zhang. A Concise Theory of Randomized Response Techniques for Privacy and Confidentiality Protection. Handbook of Statistics, Volume 34, 2016, pp. 273-286. DOI: https://doi.org/10.1016/bs.host.2016.01.015
- S. Trabelsi, V. Salzgeber, M. Bezzi and G. Montagnon. Data Disclosure Risk Evaluation. IEEE Xplore, 2009. DOI: 10.1109/CRISIS.2009.5411979
- J. G. Bethlehem, W. J. Keller and J. Pannekoek. Disclosure Control of Microdata. Journal of the American Statistical Association, 1990.
- J. M. Gouweleeuw, P. Kooiman and P. P. de Wolf. Post Randomisation for Statistical Disclosure Control: Theory and Implementation. Journal of Official Statistics, 1998.
- T. K. Nayak and S. A. Adeshiyan. On Invariant Post Randomization for Statistical Disclosure Control. International Statistical Review, 2015.
Proof of Theorem 3.1
To prove the result, we need to show , i.e., which leads us to check an equivalent statement,
We will prove this result by a two-dimensional induction. First, we show that the statement is true for for all ; then we show that if the statement is true for , it is true for for all .
Case: : Since, and
Writing similarly, we note that there are terms in the expansion of .
In the last expression, let us denote the first term by and the second term by . Note that since .
Thus, it can be clearly seen that,