Spurred by practical considerations, a range of methods has been developed for this task. These methods, which go by a variety of names including indexing and blocking, are by now well developed. However, methods for inferring linkage structure that account for indexing, blocking, and additional filtering steps have not seen commensurate development. In this paper we review the implications of indexing, blocking, and filtering within the popular Fellegi-Sunter framework, and propose a new model to account for particular forms of indexing and filtering.
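
As a point of reference for the terminology, the following is a minimal sketch of one common form of blocking: only record pairs that agree on a blocking key are kept as candidates for comparison. The function name `candidate_pairs`, the example records, and the choice of key (first letter of the surname) are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict


def candidate_pairs(file_a, file_b, key):
    """Return index pairs (i, j) whose records share the same blocking key.

    Illustrative sketch only: `key` is a hypothetical function mapping a
    record to its blocking value.
    """
    # Group the indices of file_b records by their blocking key.
    blocks = defaultdict(list)
    for j, rec in enumerate(file_b):
        blocks[key(rec)].append(j)
    # Pair each file_a record only with file_b records in the same block.
    return [(i, j)
            for i, rec in enumerate(file_a)
            for j in blocks.get(key(rec), [])]


# Hypothetical example records.
file_a = [{"name": "Smith"}, {"name": "Jones"}]
file_b = [{"name": "Smyth"}, {"name": "Johnson"}, {"name": "Brown"}]


def first_letter(rec):
    """Assumed blocking key: first letter of the surname."""
    return rec["name"][0].upper()


pairs = candidate_pairs(file_a, file_b, first_letter)
print(len(pairs), "candidate pairs instead of", len(file_a) * len(file_b))
```

Pairs discarded by the blocking step are never compared, which is why an inference procedure applied afterwards may need to account for that step.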

]]>Privacy definitions are often analyzed using a highly targeted approach: a *specific* attack strategy is evaluated to determine if a *specific* type of information can be inferred. If the attack works, one can conclude that the privacy definition is too weak. If it doesn't work, one often gains little information about its security (perhaps a slightly different attack would have worked?). Furthermore, these strategies will not identify cases where a privacy definition protects unnecessary pieces of information.

On the other hand, technical results concerning generalizable and systematic analyses of privacy are few in number, but such results have significantly advanced our understanding of the design of privacy definitions. We add to this literature with a novel methodology for analyzing the Bayesian properties of a privacy definition. Its goal is to identify precisely the type of information being protected, hence making it easier to identify (and later remove) unnecessary data protections.

Using privacy building blocks (which we refer to as axioms), we turn questions about semantics into mathematical problems -- the construction of a *consistent normal form* and the subsequent construction of the *row cone* (which is a geometric object that encapsulates Bayesian guarantees provided by a privacy definition).

We apply these ideas to study randomized response, FRAPP/PRAM, and several algorithms that add integer-valued noise to their inputs; we show that their privacy properties can be stated in terms of the protection of various notions of parity of a dataset. Randomized response, in particular, provides unnecessarily strong protections for parity, and so we also show how our methodology can be used to relax privacy definitions.
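
For readers unfamiliar with the mechanism, the following is a minimal sketch of one simple form of randomized response on binary records, together with one simple notion of parity (the sum of the bits modulo 2). The names `randomized_response`, `parity`, and `p_keep` are illustrative assumptions; the notions of parity analyzed in the paper may differ from this one.

```python
import random


def randomized_response(bits, p_keep, rng):
    """Report each bit truthfully with probability p_keep, else flip it.

    Illustrative sketch only: `p_keep` is a hypothetical parameter name.
    """
    return [b if rng.random() < p_keep else 1 - b for b in bits]


def parity(bits):
    """One simple notion of dataset parity: sum of the bits modulo 2."""
    return sum(bits) % 2


rng = random.Random(0)      # seeded only to make the sketch reproducible
data = [1, 0, 1, 1, 0]      # hypothetical binary records
noisy = randomized_response(data, 0.75, rng)
print("true parity:", parity(data), "parity of noisy output:", parity(noisy))
```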

]]>In this paper we investigate the applicability of regression-tree-based methods for constructing synthetic business data. We give a detailed example comparing exploratory data analysis and linear regression results under two variants of a regression-tree-based synthetic data approach. We also evaluate these results against those obtained from the same analyses of the original data. We further investigate the impact of different stopping criteria on performance.

While it is certainly true that any method designed to protect confidentiality introduces error, and may indeed give misleading conclusions, our analysis of the results for synthesisers based on CART models has provided some evidence that this error is not random but is due to the particular characteristics of business data. We conclude that more careful analysis needs to be done in applying these methods, and end users certainly need to be aware of possible discrepancies.

]]>