
An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning

In this paper we introduce the idea of improving the performance of parametric temporal-difference (TD) learning algorithms by selectively emphasizing or de-emphasizing their updates on different time steps. In particular, we show that varying the emphasis of linear TD(λ)’s updates in a particular way causes its expected update to become stable under off-policy training. The only prior model-free TD methods to achieve this with per-step computation linear in the number of function approximation parameters are the gradient-TD family of methods including TDC, GTD(λ), and GQ(λ). Compared to these methods, our emphatic TD(λ) is simpler and easier to use; it has only one learned parameter vector and one step-size parameter. Our treatment includes general state-dependent discounting and bootstrapping functions, and a way of specifying varying degrees of interest in accurately valuing different states.
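For readers who want a concrete picture of the per-step updates the abstract describes, below is a minimal Python sketch of the emphatic TD(λ) recursions from the paper: the followon trace, the emphasis, and the emphasis-weighted eligibility trace driving the single parameter vector. The function name, the trajectory-replay framing, and the array layout are our own illustrative choices, not from the paper.

```python
import numpy as np

def emphatic_td_lambda(features, rewards, rho, gamma, lam, interest, alpha, n):
    """One pass of emphatic TD(lambda) over a recorded trajectory.

    features : list of feature vectors phi_t (length T+1)
    rewards  : list of rewards R_{t+1} (length T)
    rho      : importance-sampling ratios pi(A_t|S_t) / mu(A_t|S_t) (length T)
    gamma    : state-dependent discounts gamma_t (length T+1)
    lam      : state-dependent bootstrapping parameters lambda_t (length T+1)
    interest : user-specified interest I_t in each state (length T+1)
    alpha    : scalar step size
    n        : number of parameters
    """
    theta = np.zeros(n)   # the single learned parameter vector
    e = np.zeros(n)       # eligibility trace, e_{-1} = 0
    F = 0.0               # followon trace
    rho_prev = 1.0        # rho_{t-1}; irrelevant at t = 0 since F starts at 0
    for t in range(len(rewards)):
        phi, phi_next = features[t], features[t + 1]
        # Followon trace: discounted, importance-weighted accumulation of interest
        F = rho_prev * gamma[t] * F + interest[t]
        # Emphasis: blend of current interest and followon trace via lambda_t
        M = lam[t] * interest[t] + (1.0 - lam[t]) * F
        # Emphasis-weighted, importance-weighted accumulating trace
        e = rho[t] * (gamma[t] * lam[t] * e + M * phi)
        # Standard TD error with state-dependent discounting
        delta = rewards[t] + gamma[t + 1] * theta @ phi_next - theta @ phi
        theta += alpha * delta * e
        rho_prev = rho[t]
    return theta
```

Note how a single step size and a single parameter vector suffice: the stabilizing effect comes entirely from the emphasis weighting of the trace, not from a second learned vector as in the gradient-TD methods.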

Citation

R. S. Sutton, A. R. Mahmood, M. White. "An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning". Journal of Machine Learning Research (JMLR), (ed: Shie Mannor), 17(73), pp. 1-29, January 2016.

Keywords: Temporal-difference learning, Off-policy learning, Function approximation, Stability, Convergence
Category: In Journal
Web Links: JMLR

BibTeX

@article{Sutton+al:JMLR16,
  author  = {Richard S. Sutton and Ashique Rupam Mahmood and Martha White},
  title   = {An Emphatic Approach to the Problem of Off-policy
             Temporal-Difference Learning},
  editor  = {Shie Mannor},
  journal = {Journal of Machine Learning Research (JMLR)},
  volume  = {17},
  number  = {73},
  pages   = {1--29},
  year    = {2016}
}

