
Comparing Direct and Indirect Temporal-Difference Methods for Estimating the Variance of the Return

Full Text: 35.pdf
Other Attachments: Supplementary-Paper35.pdf

Temporal-difference (TD) learning methods are widely used in reinforcement learning to estimate the expected return for each state without a model, because of their significant advantages in computational and data efficiency. For many applications involving risk mitigation, it would also be useful to estimate the variance of the return by TD methods. In this paper, we describe a way of doing this that is substantially simpler than those proposed by Tamar, Di Castro, and Mannor in 2012, or those proposed by White and White in 2016. We show that two TD learners operating in series can learn expectation and variance estimates. The trick is to use the square of the TD error of the expectation learner as the reward of the variance learner, and the square of the expectation learner's discount rate as the discount rate of the variance learner. With these two modifications, the variance learning problem becomes a conventional TD learning problem to which standard theoretical results can be applied. Our formal results are limited to the table-lookup case, for which our method is still novel, but the extension to function approximation is immediate, and we provide some empirical results for the linear function approximation case. Our experimental results show that our direct method performs just as well as a comparable indirect method, but is generally more robust.
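The two learners in series can be sketched in a few lines of tabular Python. This is a minimal sketch under stated assumptions, not the authors' code: the two-state chain, the step size `alpha`, and the helper names `step` and `run` are hypothetical choices for illustration, and the expectation learner is plain TD(0). The direct learner follows the description above (reward δ², discount γ²). One way to see why this works: if the value estimate V equals the true value v, then G - v(s) = δ + γ(G' - v(s')), the cross term has zero conditional mean by the Markov property, and so Var[G | s] = E[δ² + γ² Var[G' | s'] | s], which is an ordinary Bellman equation. For comparison, the indirect learner estimates the second moment E[G² | s] with reward R² + 2γR·V(s') and discount γ², then subtracts V(s)². This second-moment form is an assumed representative of the indirect family, not necessarily the exact variant evaluated in the paper.

```python
import random

# Hypothetical two-state episodic chain (an illustrative assumption, not from the paper):
#   state 0 --(reward 0)--> state 1 --(reward 0 or 2, each w.p. 0.5)--> terminal
# With GAMMA = 0.9: v(1) = 1, Var[G | 1] = 1, v(0) = 0.9, Var[G | 0] = 0.81.
GAMMA = 0.9
TERMINAL = 2


def step(state, rng):
    """Return (reward, next_state) for the toy chain."""
    if state == 0:
        return 0.0, 1
    return (2.0 if rng.random() < 0.5 else 0.0), TERMINAL


def run(episodes=50_000, alpha=0.001, seed=0):
    rng = random.Random(seed)
    v = [0.0, 0.0, 0.0]   # expectation learner; index 2 is the terminal state (stays 0)
    m = [0.0, 0.0, 0.0]   # direct variance learner
    s2 = [0.0, 0.0, 0.0]  # second-moment learner used by the indirect estimate
    for _ in range(episodes):
        s = 0
        while s != TERMINAL:
            r, s_next = step(s, rng)
            # Ordinary TD(0) error of the expectation learner.
            delta = r + GAMMA * v[s_next] - v[s]
            # Direct: reward = delta**2, discount = GAMMA**2.
            delta_m = delta ** 2 + GAMMA ** 2 * m[s_next] - m[s]
            # Indirect (assumed second-moment form): reward = r**2 + 2*GAMMA*r*v(s'),
            # discount = GAMMA**2; the variance is recovered later as s2 - v**2.
            delta_s2 = r ** 2 + 2 * GAMMA * r * v[s_next] + GAMMA ** 2 * s2[s_next] - s2[s]
            # All three learners are updated with errors computed from the same pre-update tables.
            v[s] += alpha * delta
            m[s] += alpha * delta_m
            s2[s] += alpha * delta_s2
            s = s_next
    indirect = [s2[i] - v[i] ** 2 for i in range(2)]
    return v[:2], m[:2], indirect


if __name__ == "__main__":
    values, direct_var, indirect_var = run()
    print("values:", values)
    print("direct variance:", direct_var)
    print("indirect variance:", indirect_var)
```

On this toy chain both variance estimates should settle near the analytic values Var[G | 0] = 0.81 and Var[G | 1] = 1. The factor γ² = 0.81 is exactly what carries the state-1 variance back to state 0, which is why the variance learner needs the squared discount rather than γ itself.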

Citation

C. Sherstan, B. Bennett, K. Young, D. Ashley, A. White, M. White, R. Sutton. "Comparing Direct and Indirect Temporal-Difference Methods for Estimating the Variance of the Return". Conference on Uncertainty in Artificial Intelligence (UAI), (eds. Amir Globerson and Ricardo Silva), pp. 63-72, August 2018.

Category: In Conference

BibTeX

@incollection{Sherstan+al:UAI18,
  author = {Craig Sherstan and Brendan Bennett and Kenny Young and Dylan Ashley
    and Adam White and Martha White and Richard S. Sutton},
  title = {Comparing Direct and Indirect Temporal-Difference Methods for
    Estimating the Variance of the Return},
  editor = {Amir Globerson and Ricardo Silva},
  pages = {63--72},
  booktitle = {Conference on Uncertainty in Artificial Intelligence (UAI)},
  year = 2018,
}

Last Updated: February 24, 2020
Submitted by Sabina P
