Date of Original Version
© 2012 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Abstract or Description
One motivation for property testing of boolean functions is the idea that testing can provide a fast preprocessing step before learning. However, in most machine learning applications, it is not possible to request labels of arbitrary examples constructed by an algorithm. Instead, the dominant query paradigm in applied machine learning, called active learning, is one where the algorithm may query for labels, but only on points in a given (polynomial-sized) unlabeled sample, drawn from some underlying distribution D. In this work, we bring this well-studied model to the domain of testing. We develop both general results for this active testing model and efficient testing algorithms for several properties that are important for learning, demonstrating that testing can still yield substantial benefits in this restricted setting. For example, we show that testing unions of d intervals can be done with O(1) label requests in our setting, whereas it is known to require Ω(d) labeled examples for learning (and Ω(√d) for passive testing, where the algorithm must pay for every example drawn from D). In fact, our results for testing unions of intervals also yield improvements on prior work in both the classic query model (where any point in the domain can be queried) and the passive testing model. For the problem of testing linear separators in R^n over the Gaussian distribution, we show that both active and passive testing can be done with O(√n) queries, substantially fewer than the Ω(n) needed for learning, with near-matching lower bounds. We also present a general combination result in this model for building testable properties out of others, which we then use to provide testers for a number of assumptions used in semi-supervised learning.
In addition to the above results, we also develop a general notion of the testing dimension of a given property with respect to a given distribution, which we show characterizes (up to constant factors) the intrinsic number of label requests needed to test that property. We develop such notions for both the active and passive testing models. We then use these dimensions to prove a number of lower bounds, including for linear separators and the class of dictator functions.
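As a rough illustration of the active testing query model described above, the following Python sketch draws an unlabeled pool from a distribution, then requests labels only on a small number of pool points. It is a toy under stated assumptions, not the paper's algorithm: the function name `active_disagreement_tester`, the parameters `radius` and `threshold`, and the pair-disagreement statistic are all hypothetical choices made here for illustration, and no constants are taken from the paper.

```python
import random


def active_disagreement_tester(pool, query_label, num_pairs, radius, threshold, rng):
    """Toy active tester (illustrative only, not the paper's algorithm).

    Labels are requested only for points that already appear in `pool`,
    which is the restriction that defines the active testing model.
    Returns (accept, number_of_label_requests).
    """
    points = sorted(pool)
    # Candidate pairs: neighbours in the sorted pool within `radius` of each other.
    close_pairs = [(a, b) for a, b in zip(points, points[1:]) if b - a <= radius]
    if not close_pairs:
        return True, 0  # nothing to check, so no labels are spent

    labels = {}  # cache so each pool point is queried at most once
    disagreements = 0
    for _ in range(num_pairs):
        a, b = rng.choice(close_pairs)
        for x in (a, b):
            if x not in labels:
                labels[x] = query_label(x)
        if labels[a] != labels[b]:
            disagreements += 1

    # Accept when nearby points rarely disagree (hypothetical acceptance rule).
    return disagreements / num_pairs <= threshold, len(labels)


rng = random.Random(0)
pool = [rng.random() for _ in range(2000)]  # unlabeled sample from uniform D on [0, 1]

# A single interval: nearby points almost always share a label.
accept_interval, used_interval = active_disagreement_tester(
    pool, lambda x: 0.25 <= x <= 0.75, num_pairs=40, radius=0.01, threshold=0.1, rng=rng)

# A rapidly alternating function: nearby points disagree about half the time.
accept_alternating, used_alternating = active_disagreement_tester(
    pool, lambda x: int(x * 10**6) % 2 == 0, num_pairs=40, radius=0.01, threshold=0.1, rng=rng)
```

The number of label requests is at most `2 * num_pairs` regardless of the pool size, which is the sense in which the label budget can stay small while the unlabeled sample is large.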
Proceedings of the IEEE Symposium on Foundations of Computer Science (FOCS), 2012, 21-30.