Trajectory Based Prioritized Double Experience Buffer for Sample-Efficient Policy Optimization

Bibliographic Details
Main Authors: Shengxiang Li, Ou Li, Guangyi Liu, Siyuan Ding, Yijie Bai
Format: Article
Language: English
Published: IEEE 2021-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/9486881/
Description
Summary: Reinforcement learning has recently made great progress in challenging domains such as the board game Go and the real-time strategy game StarCraft II. Policy gradient methods have become mainstream because they are effective and simple in both discrete and continuous settings. However, they commonly rely on function approximation and operate on-policy, which leads to high variance and low sample efficiency. This paper introduces a novel policy gradient method that improves sample efficiency with a pair of trajectory-based prioritized replay buffers and reduces training variance with a target network whose weights are updated in a “soft” manner. We evaluate the method on the OpenAI Gym suite of reinforcement learning tasks, and the results show that it learns more steadily and achieves higher performance than existing methods.
ISSN: 2169-3536
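
Illustration: the summary names two ingredients, a replay buffer that prioritizes whole trajectories and a target network updated in a "soft" manner. The Python sketch below shows one common reading of each idea and is not taken from the paper. The names TrajectoryBuffer and soft_update, the lowest-priority eviction rule, the priority-proportional sampling, and the value of tau are all assumptions made for illustration. This record does not say how the paper's two buffers divide their work, so only a single buffer is shown.

```python
import random
from dataclasses import dataclass, field


@dataclass
class TrajectoryBuffer:
    """Fixed-capacity store of whole trajectories, each with a positive priority.

    Hypothetical helper: the eviction and sampling rules are assumptions,
    not the paper's definitions.
    """
    capacity: int
    items: list = field(default_factory=list)  # (priority, trajectory) pairs

    def add(self, trajectory, priority):
        if priority <= 0:
            raise ValueError("priority must be positive")
        self.items.append((float(priority), trajectory))
        if len(self.items) > self.capacity:
            # Assumed rule: evict the trajectory with the lowest priority.
            worst = min(range(len(self.items)), key=lambda i: self.items[i][0])
            del self.items[worst]

    def sample(self, rng=random):
        # Draw one trajectory with probability proportional to its priority.
        weights = [p for p, _ in self.items]
        return rng.choices(self.items, weights=weights, k=1)[0][1]


def soft_update(target, online, tau=0.005):
    """Soft (Polyak) update: target <- tau * online + (1 - tau) * target.

    Weights are plain lists of floats here; tau is an assumed default.
    """
    return [tau * w + (1.0 - tau) * t for t, w in zip(target, online)]


if __name__ == "__main__":
    buf = TrajectoryBuffer(capacity=2)
    buf.add(["s0", "s1"], priority=1.0)
    buf.add(["s0", "s2"], priority=3.0)
    buf.add(["s0", "s3"], priority=2.0)  # evicts the priority-1.0 trajectory
    print(len(buf.items), buf.sample())
    print(soft_update([0.0, 1.0], [1.0, 1.0], tau=0.1))
```

A small tau moves the target weights only slightly toward the online weights at each step, which is the usual motivation for soft updates as a way to stabilize training.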