
Journal of Privacy and Confidentiality

Abstract

In this article, multiplication of original data values by random noise is suggested as a disclosure control strategy when only the top part of the data is sensitive, as is often the case with income data. The proposed method can serve as an alternative to top coding, which is a standard method in this context. Because the log-normal distribution usually fits income data well, the present investigation focuses exclusively on the log-normal. It is assumed that the log-scale mean of the sensitive variable is described by a linear regression on a set of non-sensitive covariates, and we show how a data user can draw valid inference on the parameters of the regression. An appealing feature of noise multiplication is the presence of an explicit tuning mechanism, namely, the noise-generating distribution. By appropriately choosing this distribution, one can control the accuracy of inferences and the level of disclosure protection desired in the released data. Usually, more information is retained on the top part of the data under noise multiplication than under top coding. Likelihood-based analysis is developed when only the large values in the data set are noise multiplied, under the assumption that the original data form a sample from a log-normal distribution. In this scenario, data analysis methods are developed under two types of data releases: (I) each released value includes an indicator of whether or not it has been noise multiplied, and (II) no such indicator is provided. A simulation study is carried out to assess the accuracy of inference for some parameters of interest. Since top coding and synthetic data methods are already available as disclosure control strategies for extreme values, some comparisons with the proposed method are made through a simulation study. The results are illustrated with a data analysis example based on 2000 U.S. Current Population Survey data. Furthermore, a disclosure risk evaluation of the proposed methodology is presented in the context of the Current Population Survey data example, and the disclosure risk of the proposed noise multiplication method is compared with the disclosure risk of synthetic data.
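
To make the masking step concrete, the following is a minimal sketch of noise multiplication applied only to the large values of a log-normal sample, together with the two release types described above. It is an illustration under stated assumptions, not the procedure of the article: the uniform noise distribution on (0.9, 1.1), the threshold at roughly the 90th percentile, the log-normal parameters, and the function name `noise_multiply_top` are all hypothetical choices made here.

```python
import random


def noise_multiply_top(values, threshold, noise_low=0.9, noise_high=1.1, seed=0):
    """Multiply every value above `threshold` by independent random noise.

    The Uniform(noise_low, noise_high) noise distribution is an assumption
    of this sketch; the abstract only says that the noise-generating
    distribution acts as the tuning mechanism.
    Returns (released_values, noise_multiplied_flags).
    """
    rng = random.Random(seed)
    released, flags = [], []
    for y in values:
        if y > threshold:
            # Sensitive (large) value: release y * r instead of y.
            released.append(y * rng.uniform(noise_low, noise_high))
            flags.append(True)
        else:
            # Non-sensitive value: release unchanged.
            released.append(y)
            flags.append(False)
    return released, flags


# Hypothetical "income" sample drawn from a log-normal distribution.
data_rng = random.Random(42)
incomes = [data_rng.lognormvariate(10.0, 0.8) for _ in range(1000)]

# Hypothetical threshold: the order statistic at index 900 (about the 90th percentile).
threshold = sorted(incomes)[int(0.9 * len(incomes))]

released, flags = noise_multiply_top(incomes, threshold)

# Release type (I): each value is accompanied by its noise-multiplication indicator.
release_type_1 = list(zip(released, flags))
# Release type (II): only the values are released, with no indicator.
release_type_2 = list(released)
```

A narrower noise interval keeps the released top values closer to the originals (better inference, weaker protection), while a wider one does the opposite; this is the trade-off that the choice of noise distribution controls.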
