
Off-Policy Learning With Recognizers

Full Text: NIPS2005_0775.pdf

We introduce a new algorithm for off-policy temporal-difference learning with function approximation that has much lower variance and requires less knowledge of the behavior policy than prior methods. We develop the notion of a recognizer, a filter on actions that distorts the behavior policy to produce a related target policy with low-variance importance-sampling corrections. We also consider target policies that are deviations from the state distribution of the behavior policy, such as potential temporally abstract options, which further reduces variance. This paper introduces recognizers and their potential advantages, then develops a full algorithm for MDPs and proves that its updates are in the same direction as on-policy TD updates, which implies asymptotic convergence. Our algorithm achieves this without knowledge of the behavior policy, and without even requiring that a behavior policy exist.
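To make the recognizer idea concrete, the sketch below shows off-policy TD(0) with a recognizer-style importance-sampling correction. It assumes a binary recognizer c(s, a) and the correction ρ = c(s, a)/μ(s), where μ(s) is the recognition probability estimated from data (so the behavior policy itself is never consulted). The tabular values, toy gridworld-like transitions, and all function names are illustrative assumptions, not the paper's algorithm, which is developed for function approximation.

```python
import numpy as np

# Minimal sketch: off-policy TD(0) with a recognizer-based
# importance-sampling correction. Tabular values and the toy MDP
# are illustrative only; the paper works with function approximation.

np.random.seed(0)

n_states, n_actions = 5, 4
gamma, alpha = 0.9, 0.1

def behavior_action(state):
    # Behavior policy (unknown to the learner): uniform over actions.
    return np.random.randint(n_actions)

def recognize(state, action):
    # Binary recognizer c(s, a): "recognize" only actions 0 and 1.
    return 1.0 if action < 2 else 0.0

def step(state, action):
    # Toy transition and reward model, for illustration only.
    next_state = (state + action) % n_states
    reward = 1.0 if next_state == 0 else 0.0
    return next_state, reward

V = np.zeros(n_states)           # value estimates for the target policy
recognized = np.zeros(n_states)  # count of recognized actions per state
visits = np.zeros(n_states)      # count of all action samples per state

state = 0
for t in range(50_000):
    action = behavior_action(state)
    c = recognize(state, action)

    # Empirical recognition probability mu(s), estimated from data,
    # so no explicit knowledge of the behavior policy is needed.
    visits[state] += 1
    recognized[state] += c
    mu = max(recognized[state] / visits[state], 1e-8)

    next_state, reward = step(state, action)

    # Importance-sampling correction rho = c(s, a) / mu(s):
    # zero for unrecognized actions, 1/mu(s) for recognized ones.
    rho = c / mu

    td_error = reward + gamma * V[next_state] - V[state]
    V[state] += alpha * rho * td_error

    state = next_state

print("Estimated values under the recognizer-induced target policy:", V)
```

Because the correction is 1/μ(s) for recognized actions and 0 otherwise, its range is bounded by how selective the recognizer is, which is the source of the variance reduction relative to general importance-sampling ratios.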

Citation

D. Precup, R. Sutton, C. Paduraru, A. Koop, S. Singh. "Off-Policy Learning With Recognizers". Neural Information Processing Systems (NIPS), Vancouver, British Columbia, Canada, December 2005.

Keywords: algorithm, temporal-difference, MDPs, machine learning
Category: In Conference

BibTeX

@incollection{Precup+al:NIPS05,
  author = {Doina Precup and Richard S. Sutton and Cosmin Paduraru and Anna
    Koop and Satinder Singh},
  title = {Off-Policy Learning With Recognizers},
  booktitle = {Neural Information Processing Systems (NIPS)},
  year = 2005,
}

