On Generalized Bellman Equations and Temporal-Difference Learning
- Huizhen Yu
- Ashique Rupam Mahmood
- Richard S. Sutton, Department of Computing Science, University of Alberta
We consider off-policy temporal-difference (TD) learning in discounted Markov decision processes, where the goal is to evaluate a policy in a model-free way by using observations of a state process generated without executing the policy. To curb the high variance issue in off-policy TD learning, we propose a new scheme of setting the λ-parameters of TD, based on generalized Bellman equations. Our scheme is to set λ according to the eligibility trace iterates calculated in TD, thereby easily keeping these traces in a desired bounded range. Compared with prior work, this scheme is more direct and flexible, and allows much larger λ values for off-policy TD learning with bounded traces. As to its soundness, using Markov chain theory, we prove the ergodicity of the joint state-trace process under nonrestrictive conditions, and we show that associated with our scheme is a generalized Bellman equation (for the policy to be evaluated) that depends on both the evolution of λ and the unique invariant probability measure of the state-trace process. These results not only lead immediately to a characterization of the convergence behavior of least-squares based implementation of our scheme, but also prepare the ground for further analysis of gradient-based implementations.
Citation
H. Yu, A. Mahmood, R. Sutton. "On Generalized Bellman Equations and Temporal-Difference Learning". Journal of Machine Learning Research (JMLR), 19(48), pp 1-49, January 2018.| Keywords: | |
| Category: | In Journal |
| Web Links: | JMLR |
BibTeX
@article{Yu+al:JMLR18,
author = {Huizhen Yu and Ashique Rupam Mahmood and Richard S. Sutton},
title = {On Generalized Bellman Equations and Temporal-Difference Learning},
Volume = "19",
Number = "48",
Pages = {1-49},
journal = {Journal of Machine Learning Research (JMLR)},
year = 2018,
}Last Updated: March 19, 2020Submitted by Sabina P