Date of Original Version
Abstract or Description
We establish connections from optimizing Bellman Residual and Temporal Difference Loss to worstcase long-term predictive error. In the online learning framework, learning takes place over a sequence of trials with the goal of predicting a future discounted sum of rewards. Our first analysis shows that, together with a stability assumption, any no-regret online learning algorithm that minimizes Bellman error ensures small prediction error. Our second analysis shows that applying the family of online mirror descent algorithms on temporal difference loss also ensures small prediction error. No statistical assumptions are made on the sequence of observations, which could be nonMarkovian or even adversarial. Our approach thus establishes a broad new family of provably sound algorithms and provides a generalization of previous worst-case results for minimizing predictive error. We investigate the potential advantages of some of this family both theoretically and empirically on benchmark problems.
Proceedings of the International Joint Conference on Artificial Intelligence, 2016.