Selective imitation for efficient online reinforcement learning with pre-collected data



Bibliographic Details
Main Authors: Chanin Eom, Dongsu Lee, Minhae Kwon
Format: Article
Language: English
Published: Elsevier 2024-12-01
Series: ICT Express
Online Access: http://www.sciencedirect.com/science/article/pii/S2405959524001048
Description
Summary: Deep reinforcement learning (RL) has emerged as a promising solution for autonomous devices requiring sequential decision-making. In the online RL framework, the agent must interact with the environment to collect data, making sample efficiency the most challenging aspect. While the off-policy method in online RL partially addresses this issue by employing a replay buffer, learning speed remains slow, particularly at the beginning of training, due to the low quality of data collected with the initial policy. To overcome this challenge, we propose Reward-Adaptive Pre-collected Data RL (RAPD-RL), which leverages pre-collected data in addition to online RL. We employ two buffers: one for pre-collected data and another for online collected data. The policy is trained using both buffers to increase the Q objective and imitate the actions in the dataset. To maintain resistance to poor-quality (i.e., low-reward) data, our method selectively imitates data based on reward information, thereby enhancing sample efficiency and learning speed. Simulation results demonstrate that the proposed solution converges rapidly and achieves high performance across various dataset qualities.
ISSN: 2405-9595
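
Illustrative sketch: the summary describes an actor objective that maximizes Q on data from both buffers while imitating only high-reward pre-collected actions. The following minimal Python sketch shows one plausible form of such a reward-gated objective; the function names, the equal mixing of the two buffers, the squared-error imitation term, and the reward_threshold parameter are assumptions for illustration, not the authors' released implementation.

    # Minimal sketch of a reward-gated actor loss (assumed form, not the paper's code).
    import torch

    def actor_loss(policy, q_net, online_batch, offline_batch,
                   reward_threshold, bc_weight=1.0):
        # Q-maximization term on a mix of online and pre-collected states.
        states = torch.cat([online_batch["state"], offline_batch["state"]])
        q_term = -q_net(states, policy(states)).mean()

        # Selective imitation: clone only pre-collected actions whose
        # transition reward exceeds the threshold, ignoring low-reward data.
        mask = (offline_batch["reward"] > reward_threshold).float().unsqueeze(-1)
        bc_error = (policy(offline_batch["state"]) - offline_batch["action"]) ** 2
        bc_term = (mask * bc_error).sum() / mask.sum().clamp(min=1.0)

        return q_term + bc_weight * bc_term

Here policy and q_net are assumed to be callables (e.g., small neural networks) and the batches are dictionaries of tensors sampled from the online and pre-collected replay buffers, respectively; the threshold could also be set adaptively from the reward distribution of the dataset.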