Title

Learning Semantically Robust Rules from Data

Date of Original Version

2004

Type

Technical Report

Abstract or Table of Contents

We introduce the problem of mining robust rules, which are expressive multi-dimensional generalized association rules. Consider a large relational table, where associated with each attribute is a hierarchy whose base values are those originally represented in the data, and values appearing at higher levels in the hierarchy represent increasingly more general concepts of base values. Attribute hierarchies provide meaningful levels of concept aggregation, such as the encoding of postal codes (ZIP) or dates, or the taxonomy of products. We find the least general rules formed by combining mixed levels of generalizations across attributes to convey the maximum expression of information supported by attribute hierarchies, parameter settings and data tuples. We term these "robust rules" and introduce a GenTree algorithm as a means to learn robust rules from a table. An example of a robust rule from a table having base values {5-digit ZIP, gender, registration date (year/month/day), party} might be "women living in Cambridge (021**) and registered in the 1970’s (197*/**/**) tend to be Democrats." Previous studies on mining generalized association rules have been limited dimensionally (e.g., transactional data), by data type (e.g., quantitative data), and/or to rules expressed from either fixed-level or non-semantic abstractions. Such approaches limit the kinds of rules that can be learned. Experiments using GenTree with two real-world datasets, containing 10,000 six-attributed tuples and over 4,000 eight-attributed tuples each, show that learned rules convey more comprehensive information than possible with traditional association rule mining algorithms, because traditional approaches limit the expressivity of the rules they generate.