Off-Policy Temporal-Difference Learning With Function Approximation

Full Text: precup01offpolicy.pdf

We introduce the first algorithm for off-policy temporal-difference learning that is stable with linear function approximation. Off-policy learning is of interest because it forms the basis for popular reinforcement learning methods such as Q-learning, which has been known to diverge with linear function approximation, and because it is critical to the practical utility of multi-scale, multi-goal learning frameworks such as options, HAMs, and MAXQ. Our new algorithm combines TD(lambda) over state-action pairs with importance sampling ideas from our previous work. We prove that, given training under any epsilon-soft policy, the algorithm converges w.p.1 to a close approximation (as in Tsitsiklis and Van Roy, 1997; Tadic, 2001) to the action-value function for an arbitrary target policy. Variations of the algorithm designed to reduce variance introduce additional bias but are also guaranteed convergent. We also illustrate our method empirically on a small policy evaluation problem, showing reduced variance compared to the most obvious importance sampling algorithm for this problem. Our current results are limited to episodic tasks with episodes of bounded length.
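As an informal illustration of the idea (a minimal sketch, not the paper's exact algorithm), the Python code below runs one episode of importance-sampling-weighted TD(lambda) with linear action-value features. The function names (phi, pi, b), the transition format, and the placement of the importance-sampling ratio on the eligibility trace are illustrative assumptions; see the paper for the precise update and its convergence conditions.

    import numpy as np

    def offpolicy_td_lambda(transitions, phi, pi, b, theta,
                            alpha=0.01, gamma=1.0, lam=0.9):
        # Illustrative sketch: one episode of importance-sampling-weighted
        # TD(lambda) with linear action-value features.
        # transitions: list of (s, a, r, s_next, a_next, done) tuples
        # collected under the behavior policy b; pi(s, a) and b(s, a)
        # return the target and behavior action probabilities.
        e = np.zeros_like(theta)                      # eligibility trace
        for s, a, r, s_next, a_next, done in transitions:
            rho = pi(s, a) / b(s, a)                  # importance-sampling ratio
            x = phi(s, a)                             # feature vector for (s, a)
            q_next = 0.0 if done else theta @ phi(s_next, a_next)
            delta = r + gamma * q_next - theta @ x    # TD error
            e = rho * (gamma * lam * e + x)           # trace scaled by the ratio
            theta = theta + alpha * delta * e         # linear weight update
        return theta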

Citation

D. Precup, R. S. Sutton, and S. Dasgupta. "Off-Policy Temporal-Difference Learning With Function Approximation". International Conference on Machine Learning (ICML), Williams College, pp. 417-424, 2001.

Keywords: off-policy learning, temporal-difference learning, importance sampling, linear function approximation, reinforcement learning
Category: In Conference

BibTeX

@inproceedings{Precup+al:ICML01,
  author    = {Doina Precup and Richard S. Sutton and Sanjoy Dasgupta},
  title     = {Off-Policy Temporal-Difference Learning With Function Approximation},
  booktitle = {International Conference on Machine Learning (ICML)},
  pages     = {417--424},
  year      = {2001}
}

Last Updated: May 31, 2007
Submitted by Stuart H. Johnson
