Selective imitation for efficient online reinforcement learning with pre-collected data
Deep reinforcement learning (RL) has emerged as a promising solution for autonomous devices requiring sequential decision-making. In the online RL framework, the agent must interact with the environment to collect data, making sample efficiency the most challenging aspect. While the off-policy metho...
| Main Authors: | Chanin Eom, Dongsu Lee, Minhae Kwon |
|---|---|
| Format: | Article | 
| Language: | English | 
| Published: | Elsevier, 2024-12-01 |
| Series: | ICT Express | 
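| Subjects: | Deep reinforcement learning; Off-policy reinforcement learning; Pre-collected data; Behavioral cloning; Imitation learning |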
| Online Access: | http://www.sciencedirect.com/science/article/pii/S2405959524001048 | 
| _version_ | 1846129567138316288 | 
|---|---|
| author | Chanin Eom, Dongsu Lee, Minhae Kwon |
| collection | DOAJ | 
| description | Deep reinforcement learning (RL) has emerged as a promising solution for autonomous devices requiring sequential decision-making. In the online RL framework, the agent must interact with the environment to collect data, making sample efficiency the most challenging aspect. While the off-policy method in online RL partially addresses this issue by employing a replay buffer, learning speed remains slow, particularly at the beginning of training, due to the low quality of data collected with the initial policy. To overcome this challenge, we propose Reward-Adaptive Pre-collected Data RL (RAPD-RL), which leverages pre-collected data in addition to online RL. We employ two buffers: one for pre-collected data and another for online collected data. The policy is trained using both buffers to increase the Q objective and imitate the actions in the dataset. To maintain resistance to poor-quality (i.e., low-reward) data, our method selectively imitates data based on reward information, thereby enhancing sample efficiency and learning speed. Simulation results demonstrate that the proposed solution converges rapidly and achieves high performance across various dataset qualities. | 
| format | Article | 
| id | doaj-art-c8f6adc0406b4d2b895da544a076c00c | 
| institution | Kabale University | 
| issn | 2405-9595 | 
| language | English | 
| publishDate | 2024-12-01 | 
| publisher | Elsevier | 
| record_format | Article | 
| series | ICT Express | 
| spelling | doaj-art-c8f6adc0406b4d2b895da544a076c00c; 2024-12-10T04:14:23Z; eng; Elsevier; ICT Express; 2405-9595; 2024-12-01; vol. 10, no. 6, pp. 1308-1314; Selective imitation for efficient online reinforcement learning with pre-collected data; Chanin Eom (Department of Intelligent Semiconductors, Soongsil University, Seoul 06978, Republic of Korea); Dongsu Lee (Department of Intelligent Semiconductors, Soongsil University, Seoul 06978, Republic of Korea); Minhae Kwon (Department of Intelligent Semiconductors, Soongsil University, Seoul 06978, Republic of Korea; School of Electronic Engineering, Soongsil University, Seoul 06978, Republic of Korea; corresponding author); http://www.sciencedirect.com/science/article/pii/S2405959524001048; Deep reinforcement learning; Off-policy reinforcement learning; Pre-collected data; Behavioral cloning; Imitation learning |
| title | Selective imitation for efficient online reinforcement learning with pre-collected data | 
| topic | Deep reinforcement learning; Off-policy reinforcement learning; Pre-collected data; Behavioral cloning; Imitation learning |
| url | http://www.sciencedirect.com/science/article/pii/S2405959524001048 | 
| work_keys_str_mv | AT chanineom selectiveimitationforefficientonlinereinforcementlearningwithprecollecteddata AT dongsulee selectiveimitationforefficientonlinereinforcementlearningwithprecollecteddata AT minhaekwon selectiveimitationforefficientonlinereinforcementlearningwithprecollecteddata | 
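
The description in this record outlines the core idea of RAPD-RL: draw training data from two buffers, push the policy toward higher Q-values, and imitate dataset actions only where the reward suggests they are worth imitating. The record does not give the actual loss or selection rule, so the following is only a minimal NumPy sketch under stated assumptions. The fixed reward threshold, the 50/50 batch split, the weight `bc_weight`, and every function name here are hypothetical and are not taken from the paper.

```python
import numpy as np


def sample_mixed_batch(pre_buffer, online_buffer, batch_size, rng):
    """Draw a batch from both buffers (the 50/50 split is an assumption)."""
    half = batch_size // 2
    pre_idx = rng.integers(0, len(pre_buffer["reward"]), size=half)
    on_idx = rng.integers(0, len(online_buffer["reward"]), size=batch_size - half)
    # Concatenate the pre-collected part and the online part key by key.
    return {
        key: np.concatenate([pre_buffer[key][pre_idx], online_buffer[key][on_idx]])
        for key in pre_buffer
    }


def selective_imitation_mask(rewards, threshold):
    """Return 1.0 where the reward reaches the (assumed) threshold, else 0.0."""
    return (rewards >= threshold).astype(np.float64)


def policy_loss(q_values, policy_actions, data_actions, rewards,
                threshold, bc_weight=1.0):
    """Hypothetical actor loss: raise Q, imitate only high-reward actions."""
    mask = selective_imitation_mask(rewards, threshold)
    # Per-sample squared error between the policy action and the stored action.
    bc_error = np.sum((policy_actions - data_actions) ** 2, axis=-1)
    # Low-reward samples are masked out of the imitation term entirely.
    bc_term = np.sum(mask * bc_error) / max(np.sum(mask), 1.0)
    # Minimizing -mean(Q) corresponds to increasing the Q objective.
    return -np.mean(q_values) + bc_weight * bc_term


# Usage with toy numbers: the second sample has a low reward, so its
# (large) imitation error is ignored.
toy_loss = policy_loss(
    q_values=np.array([1.0, 2.0]),
    policy_actions=np.array([[0.0], [0.0]]),
    data_actions=np.array([[1.0], [3.0]]),
    rewards=np.array([1.0, -1.0]),
    threshold=0.0,
)
```

In this sketch a hard threshold stands in for whatever reward-adaptive rule the authors use. A soft weighting of the imitation term by reward would be an equally plausible reading of the abstract.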
 
       