Date of Original Version
© ACM, 2001. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution.
Abstract or Description
We focus on the problem of finding patterns across two large, multidimensional datasets. For example, given feature vectors of healthy and of non-healthy patients, we want to answer the following questions: “Are the two clouds of points separable?”, “What is the smallest/largest pair-wise distance across the two datasets?”, “Which of the two clouds does a new point (feature vector) come from?”. We propose a new tool, the ‘Cross-Cloud plot’, which helps us answer the above questions, and many more. We present an algorithm to compute the Cross-Cloud plot, which requires only a single pass over the datasets, thus scaling up to arbitrarily large databases. More importantly, it scales linearly with the dimensionality, while most other spatial data mining algorithms explode exponentially. We show how to use our tool for classification, when traditional methods (nearest neighbor, classification trees) may fail. We also provide a set of rules on how to interpret a Cross-Cloud plot, and we apply these rules to multiple synthetic and real datasets.
Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 184-193.