M-Learning: Heuristic Approach for Delayed Rewards in Reinforcement Learning

Bibliographic Details
Main Authors: Cesar Andrey Perdomo Charry, Marlon Sneider Mora Cortes, Oscar J. Perdomo
Format: Article
Language: English
Published: MDPI AG 2025-06-01
Series: Mathematics
Subjects: reinforcement learning; exploration–exploitation dilemma; Q-learning; frozen lake; heuristic approach
Online Access: https://www.mdpi.com/2227-7390/13/13/2108
author Cesar Andrey Perdomo Charry
Marlon Sneider Mora Cortes
Oscar J. Perdomo
collection DOAJ
description The current design of reinforcement learning methods requires extensive computational resources. Algorithms such as Deep Q-Network (DQN) have obtained outstanding results in advancing the field. However, the need to tune thousands of parameters and run millions of training episodes remains a significant challenge. This document proposes a comparative analysis between the Q-Learning algorithm, which laid the foundations for Deep Q-Learning, and our proposed method, termed M-Learning. The comparison is conducted using Markov Decision Processes with the delayed reward as a general test bench framework. Firstly, this document provides a full description of the main challenges related to implementing Q-Learning, particularly concerning its multiple parameters. Then, the foundations of our proposed heuristic are presented, including its formulation, and the algorithm is described in detail. The methodology used to compare both algorithms involved training them in the Frozen Lake environment. The experimental results, along with an analysis of the best solutions, demonstrate that our proposal requires fewer episodes and exhibits reduced variability in the outcomes. Specifically, M-Learning trains agents 30.7% faster in the deterministic environment and 61.66% faster in the stochastic environment. Additionally, it achieves greater consistency, reducing the standard deviation of scores by 58.37% and 49.75% in the deterministic and stochastic settings, respectively. The code will be made available in a GitHub repository upon this paper’s publication.
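note For context on the baseline the abstract compares against, the following is a minimal sketch of tabular Q-Learning on the FrozenLake environment (the delayed-reward test bench named above), written against the Gymnasium API. The hyperparameter values are illustrative assumptions, not those reported in the paper, and the sketch shows the standard Q-Learning baseline rather than the proposed M-Learning heuristic, whose update rule is defined only in the linked article.

import numpy as np
import gymnasium as gym

# Standard tabular Q-Learning on FrozenLake, the delayed-reward benchmark named in
# the abstract. Hyperparameters below are illustrative assumptions, not the paper's.
env = gym.make("FrozenLake-v1", is_slippery=True)   # stochastic variant; False gives the deterministic one
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1              # assumed learning rate, discount, exploration rate

for episode in range(20000):
    state, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # reward is delayed: it is 0 everywhere except on reaching the goal state
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state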
format Article
id doaj-art-c59857e92db645a0b9a8a0ffcc1fc06c
institution Kabale University
issn 2227-7390
language English
publishDate 2025-06-01
publisher MDPI AG
record_format Article
series Mathematics
doi 10.3390/math13132108
affiliation Faculty of Engineering, Universidad Distrital Francisco José de Caldas, Bogotá 111611, Colombia (Cesar Andrey Perdomo Charry; Marlon Sneider Mora Cortes)
affiliation Department of Electrical and Electronic Engineering, Universidad Nacional de Colombia, Bogotá 111321, Colombia (Oscar J. Perdomo)
title M-Learning: Heuristic Approach for Delayed Rewards in Reinforcement Learning
topic reinforcement learning
exploration–exploitation dilemma
Q-learning
frozen lake
heuristic approach
url https://www.mdpi.com/2227-7390/13/13/2108