Adding Semantics and Rigor to Association Rule Learning: the GenTree Approach
Date of Original Version
Abstract or Table of Contents
Learning semantically useful association rules across all attributes of a relational table requires: (1) more rigorous mining than afforded by traditional approaches; and, (2) the invention of knowledge ratings for learned rules, not just statistical ratings. Traditional algorithms began by learning rules over one attribute expressed in the domain values of that attribute (Srikant and Agrawal, 1995). “People who buy diapers tend to buy beer” is an example from grocery purchases.
In the second generation, Srikant and Agrawal, (1996) introduced a hierarchy whose base values are those originally represented in the data, and values appearing at higher levels in the hierarchy represent increasingly more general concepts of base values. Rules learned over the attribute using the hierarchy are termed generalized association rules (or cross-level rules). “People who buy baby products tend to buy controlled substances” is an example of a generalized association rule. Rules that mix the level of expressions for the same item were still not able to be learned. “People who buy beverages tend to buy beer” is an example of a rule that could not be properly learned if most beverage purchases were of beer.
The work reported herein continues the evolution in the expressiveness of association rules to its broadest application – semantically rated rules learned from a large relational table having many attributes. Associated with each attribute is a hierarchy. We find the rules that are formed by combining mixed levels of generalizations across all attributes and that convey the maximum expression of information supported by attribute hierarchies, parameter settings and data tuples. We term these robust rules. An example of a robust rule is “People who buy baby products tend to buy controlled substances, use credit cards and make purchases on Saturdays.” Multiple attributes from the transaction are included and cross-level and mix-level rules are expressed.
Because of the improved expressiveness, many more statistically relevant rules are learned. To identify semantically useful rules among them, we compare rule features appearing on the left side (before “tend to”) to those appearing on the right side (after “tend to”) to identify learning in breadth (more attributes appearing on the right-side than on the left) and learning in depth (values on the right side appearing at lower levels in the corresponding hierarchies than those on the left). For a set of learned rules surpassing statistical thresholds of relevance, the most semantically relevant rules are those expressing more depth and/or breadth in the terms of their expression. The other rules are merely subset expressions. We also introduce a GenTree algorithm as an efficient means to learn robust rules from a table. Experiments using GenTree with two real-world datasets, containing 10,000 six-attributed tuples and over 4,000 eight-attributed tuples each, show that learned rules convey more comprehensive information than possible with traditional association rule mining algorithms.