Proceedings of 21st Annual Conference on Learning Theory (COLT 2008)
Abstract
We continue the investigation of natural conditions for a similarity function to allow learning, without requiring the similarity function to be a valid kernel, or referring to an implicit high-dimensional space. We provide a new notion of a “good similarity function” that builds upon the previous definition of Balcan and Blum (2006) but improves on it in two important ways. First, as with the previous definition, any large-margin kernel is also a good similarity function in our sense, but the translation now results in a much milder increase in the labeled sample complexity. Second, we prove that for distribution-specific PAC learning, our new notion is strictly more powerful than the traditional notion of a large-margin kernel. In particular, we show that for any hypothesis class C there exists a similarity function under our definition allowing learning with O(log |C|) labeled examples. However, in a lower bound which may be of independent interest, we show that for any class C of pairwise uncorrelated functions, there is no kernel with margin γ ≥ 8/√|C| for all f ∈ C, even if one allows average hinge-loss as large as 0.5. Thus, the sample complexity for learning such classes with SVMs is Ω(|C|). This extends work of Ben-David et al. (2003) and Forster and Simon (2006), who give hardness results with comparable margin bounds, but at much lower error rates.
Our new notion of similarity relies upon L1-regularized learning, and our separation result is related to a separation between what is learnable with L1 versus L2 regularization.
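To make the L1 connection concrete, here is a minimal sketch of the general similarity-map recipe the abstract alludes to: map each example to its similarities with a set of landmark examples, then learn a sparse linear separator in that empirical space with L1 regularization. Everything below (the toy similarity function `K`, the synthetic data, the landmark count, the ISTA-style solver) is an illustrative assumption, not the paper's construction.

```python
# Sketch: similarity-based learning via a landmark map plus L1 regularization.
# K need not be a valid (positive semi-definite) kernel.
import numpy as np

rng = np.random.default_rng(0)

def K(x, y):
    # An arbitrary bounded similarity function; tanh(x.y) is not PSD in general.
    return np.tanh(x @ y)

# Toy data with a linear target (hypothetical setup for illustration).
n, dim = 200, 5
X = rng.normal(size=(n, dim))
w_true = np.array([1.0, -1.0, 0.0, 0.0, 0.0])
y = np.where(X @ w_true >= 0, 1.0, -1.0)

# Draw d landmarks from the (unlabeled) data; build the similarity map
# x -> (K(x, l_1), ..., K(x, l_d)).
d = 40
landmarks = X[rng.choice(n, size=d, replace=False)]
Phi = np.array([[K(x, l) for l in landmarks] for x in X])  # shape (n, d)

# L1-regularized logistic regression via proximal (ISTA-style) gradient steps.
lam, eta = 0.01, 0.1
w = np.zeros(d)
for _ in range(500):
    margins = y * (Phi @ w)
    # Gradient of mean logistic loss log(1 + exp(-margin)).
    grad = -(Phi.T @ (y / (1.0 + np.exp(margins)))) / n
    w = w - eta * grad
    # Soft-thresholding: the proximal operator of the L1 penalty.
    w = np.sign(w) * np.maximum(np.abs(w) - eta * lam, 0.0)

train_acc = np.mean(np.where(Phi @ w >= 0, 1.0, -1.0) == y)
```

The L1 penalty is what keeps the labeled-sample cost low: the learner only needs the separator to be sparse (or have small L1 norm) over the landmark coordinates, rather than large L2 margin in an implicit kernel space.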