Linear Off-Policy Actor-Critic
This paper presents the first actor-critic algorithm for off-policy reinforcement learning. Our algorithm is online and incremental, and its per-time-step complexity scales linearly with the number of learned weights. Previous work on actor-critic algorithms is limited to the on-policy setting and does not take advantage of the recent advances in off-policy gradient temporal-difference learning. Off-policy techniques, such as Greedy-GQ, enable a target policy to be learned while following and obtaining data from another (behavior) policy. For many problems, however, actor-critic methods are more practical than action value methods (like Greedy-GQ) because they explicitly represent the policy; consequently, the policy can be stochastic and utilize a large action space. In this paper, we illustrate how to practically combine the generality and learning potential of off-policy learning with the flexibility in action selection given by actor-critic methods. We derive an incremental, linear time and space complexity algorithm that includes eligibility traces, prove convergence under assumptions similar to previous off-policy algorithms, and empirically show better or comparable performance to existing algorithms on standard reinforcement-learning benchmark problems.
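To make the structure of such an algorithm concrete, here is a minimal sketch of an off-policy actor-critic update with linear function approximation: the critic learns a state-value estimate from importance-weighted TD errors, and the actor takes an importance-weighted policy-gradient step. This is a simplified illustration, not the paper's exact method: it uses a plain TD(0) critic in place of the gradient-TD critic with eligibility traces, and all names (`target_policy`, `update`, the toy dimensions) are invented for this example.

```python
import numpy as np

n_features, n_actions = 4, 2

v = np.zeros(n_features)               # critic weights (linear state-value)
u = np.zeros((n_actions, n_features))  # actor weights (softmax-linear policy)
alpha_v, alpha_u, gamma = 0.1, 0.01, 0.99

def target_policy(x):
    """Softmax target policy over actions, given feature vector x."""
    prefs = u @ x
    prefs -= prefs.max()  # numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def update(x, a, r, x_next, b_prob):
    """One off-policy actor-critic update (simplified, TD(0) critic).

    x, x_next : feature vectors for current and next state
    a         : action taken by the behavior policy
    r         : reward observed
    b_prob    : probability the behavior policy assigned to action a
    """
    global v, u
    pi = target_policy(x)
    rho = pi[a] / b_prob                     # importance-sampling ratio
    delta = r + gamma * (v @ x_next) - v @ x # TD error
    v += alpha_v * rho * delta * x           # critic: importance-weighted TD step
    # actor: grad log pi(a|x) for a softmax-linear policy is
    # (1[b == a] - pi[b]) * x for each action row b
    grad_log = -np.outer(pi, x)
    grad_log[a] += x
    u += alpha_u * rho * delta * grad_log
    return delta
```

Because both updates are outer-product-free per weight (each touches every weight once), the per-time-step cost is linear in the number of weights, matching the complexity claim in the abstract.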
Citation
T. Degris, M. White, R. Sutton. "Linear Off-Policy Actor-Critic". International Conference on Machine Learning (ICML), pp. n/a, June 2012.
Keywords:
Category: In Conference
Web Links: ICML
BibTeX
@inproceedings{Degris+al:ICML12,
  author = {Thomas Degris and Martha White and Richard S. Sutton},
  title = {Linear Off-Policy Actor-Critic},
  pages = {n/a},
  booktitle = {International Conference on Machine Learning (ICML)},
  year = {2012},
}
Last Updated: February 25, 2020
Submitted by Sabina P