Technical Report

Rights Management

All Rights Reserved

Abstract or Description

A long-standing challenge in data management is correctly relating information about the same entity when that information is distributed across databases. Traditional record linkage research has concentrated on string comparator metrics for records with common, or relatable, attributes. However, spatially distributed data are often devoid of such attributes, which are crucial for database schema integration. Rather than relating schemas directly, spatially distributed data can be related through location-based linkage algorithms, which link patterns in location-specific attributes (e.g., visits). In this paper we focus on two fundamental algorithms for location-based linkage and investigate how the distribution of entity visits across locations influences linkage performance. We begin by studying algorithm accuracy for linking real-world data. We then outline a theoretical framework rooted in information theory that provides insight into the observed phenomena. The framework also provides a useful basis for studying the performance of location-based linkage algorithms: we analyze two opposing cases in which location visit patterns arise from uniform and power distributions of entities to locations. We carry out our investigations under assumptions of both complete and incomplete information. Our findings suggest that low-skew distributions are more easily linked when complete information is known. In contrast, when information is incomplete, high-skew distributions lead to higher linkage rates.