
An Off-policy Policy Gradient Theorem Using Emphatic Weightings

Full Text: 7295-an-off-policy-policy-gradient-theorem-using-emphatic-weightings.pdf (PDF)

Policy gradient methods are widely used for control in reinforcement learning, particularly for the continuous action setting. There have been a host of theoretically sound algorithms proposed for the on-policy setting, due to the existence of the policy gradient theorem which provides a simplified form for the gradient. In off-policy learning, however, where the behaviour policy is not necessarily attempting to learn and follow the optimal policy for the given task, the existence of such a theorem has been elusive. In this work, we solve this open problem by providing the first off-policy policy gradient theorem. The key to the derivation is the use of emphatic weightings. We develop a new actor-critic algorithm—called Actor Critic with Emphatic weightings (ACE)—that approximates the simplified gradients provided by the theorem. We demonstrate in a simple counterexample that previous off-policy policy gradient methods—particularly OffPAC and DPG—converge to the wrong solution whereas ACE finds the optimal solution.
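The emphatic-weighting idea behind ACE can be sketched concretely. Below is a minimal Python sketch of an off-policy actor update weighted by an online estimate of the emphatic weighting, assuming a one-step actor-critic with importance-sampling ratios and a follow-on trace; the function name, parameters, and exact update form here are illustrative assumptions, not the paper's algorithm verbatim.

import numpy as np


def ace_style_actor_step(theta, grad_log_pi, rho, rho_prev, delta, F_prev,
                         alpha=0.01, gamma=0.99, lambda_a=1.0, interest=1.0):
    """One illustrative off-policy actor step weighted by an online estimate
    of the emphatic weighting (hypothetical names; a sketch, not the paper's
    exact algorithm).

    theta         actor parameters
    grad_log_pi   gradient of log pi_theta(A_t | S_t) with respect to theta
    rho, rho_prev importance-sampling ratios pi/mu at steps t and t-1
    delta         TD error from a learned critic
    F_prev        follow-on trace carried over from the previous step
    lambda_a      0 gives an unweighted (OffPAC-style) update; 1 uses the
                  full emphatic weighting
    interest      interest i(S_t) assigned to the current state
    """
    # Follow-on trace: discounted, importance-sampled accumulation of interest.
    F = gamma * rho_prev * F_prev + interest
    # Emphatic weighting blends the instantaneous interest with the trace.
    M = (1.0 - lambda_a) * interest + lambda_a * F
    # Actor update along the score function, scaled by rho and the weighting.
    theta = theta + alpha * rho * M * delta * grad_log_pi
    return theta, F


# Toy usage with made-up values, purely to show the call shape:
theta, F = ace_style_actor_step(theta=np.zeros(3), grad_log_pi=np.ones(3),
                                rho=1.2, rho_prev=0.8, delta=0.5, F_prev=1.0)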

Citation

E. Imani, E. Graves, M. White. "An Off-policy Policy Gradient Theorem Using Emphatic Weightings". Neural Information Processing Systems (NIPS), (ed: Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolo Cesa-Bianchi, Roman Garnett), pp 96-106, December 2018.

Category: In Conference
Web Links: NeurIPS

BibTeX

@incollection{Imani+al:NIPS18,
  author    = {Ehsan Imani and Eric Graves and Martha White},
  title     = {An Off-policy Policy Gradient Theorem Using Emphatic Weightings},
  editor    = {Samy Bengio and Hanna M. Wallach and Hugo Larochelle and
    Kristen Grauman and Nicolo Cesa-Bianchi and Roman Garnett},
  pages     = {96--106},
  booktitle = {Neural Information Processing Systems (NIPS)},
  year      = {2018}
}

Last Updated: February 25, 2020
Submitted by Sabina P
