Selective imitation for efficient online reinforcement learning with pre-collected data

Deep reinforcement learning (RL) has emerged as a promising solution for autonomous devices requiring sequential decision-making. In the online RL framework, the agent must interact with the environment to collect data, making sample efficiency the most challenging aspect. While the off-policy approach in online RL partially addresses this issue by employing a replay buffer, learning remains slow, particularly at the beginning of training, because the data collected with the initial policy is of low quality. To overcome this challenge, we propose Reward-Adaptive Pre-collected Data RL (RAPD-RL), which leverages pre-collected data in addition to online interaction. We employ two buffers: one for pre-collected data and another for data collected online. The policy is trained on both buffers to maximize the Q objective and to imitate the actions in the dataset. To remain robust to poor-quality (i.e., low-reward) data, our method selectively imitates data based on reward information, thereby improving sample efficiency and learning speed. Simulation results demonstrate that the proposed solution converges rapidly and achieves high performance across various dataset qualities.
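The abstract describes the core of RAPD-RL: the policy is updated to maximize the critic's Q value on online data while imitating pre-collected actions only when their logged rewards suggest the data is worth copying. The snippet below is a minimal sketch of such a selective-imitation objective, not the authors' implementation; the Actor architecture, the critic(state, action) interface, the dictionary batch layout, and the names reward_threshold and bc_weight are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Deterministic policy network mapping states to bounded actions."""

    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


def selective_imitation_actor_loss(actor, critic, online_batch, precollected_batch,
                                   reward_threshold=0.0, bc_weight=1.0):
    """Q maximization on online data plus reward-filtered behavioral cloning on
    pre-collected data. `critic(state, action)` is assumed to return Q values,
    and each batch is a dict of tensors with keys "state", "action", "reward"."""
    # Standard off-policy actor term: increase Q(s, pi(s)) on online samples.
    q_term = -critic(online_batch["state"], actor(online_batch["state"])).mean()

    # Selective imitation: only transitions whose logged reward exceeds the
    # threshold contribute to the behavioral-cloning term; low-reward data is masked.
    s = precollected_batch["state"]
    a = precollected_batch["action"]
    r = precollected_batch["reward"]
    mask = (r > reward_threshold).float().view(-1, 1)            # (batch, 1) mask
    bc_error = ((actor(s) - a) ** 2).mean(dim=-1, keepdim=True)   # per-sample MSE
    bc_term = (mask * bc_error).sum() / mask.sum().clamp(min=1.0)

    return q_term + bc_weight * bc_term
```

In practice the threshold need not be a fixed constant; a reward-adaptive rule (for example, a quantile of the rewards stored in the pre-collected buffer) matches the abstract's description more closely, but the paper itself should be consulted for the exact criterion.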

Bibliographic Details
Main Authors: Chanin Eom, Dongsu Lee, Minhae Kwon
Author Affiliations: Chanin Eom and Dongsu Lee: Department of Intelligent Semiconductors, Soongsil University, Seoul 06978, Republic of Korea; Minhae Kwon (corresponding author): Department of Intelligent Semiconductors and School of Electronic Engineering, Soongsil University, Seoul 06978, Republic of Korea
Format: Article
Language: English
Published: Elsevier, 2024-12-01
Series: ICT Express, vol. 10, no. 6, pp. 1308-1314
ISSN: 2405-9595
Subjects: Deep reinforcement learning; Off-policy reinforcement learning; Pre-collected data; Behavioral cloning; Imitation learning
Online Access: http://www.sciencedirect.com/science/article/pii/S2405959524001048