Nonstationary Stochastic Bandits: UCB Policies and Minimax Regret

We study the nonstationary stochastic Multi-Armed Bandit (MAB) problem in which the distributions of rewards associated with arms are assumed to be time-varying and the total variation in the expected rewards is subject to a variation budget. The regret of a policy is defined by the difference in th...

Full description

Saved in:
Bibliographic Details
Main Authors: Lai Wei, Vaibhav Srivastava
Format: Article
Language:English
Published: IEEE 2024-01-01
Series:IEEE Open Journal of Control Systems
Subjects:
Online Access:https://ieeexplore.ieee.org/document/10460198/
Tags: Add Tag
No Tags, Be the first to tag this record!
_version_ 1841554049692860416
author Lai Wei
Vaibhav Srivastava
author_facet Lai Wei
Vaibhav Srivastava
author_sort Lai Wei
collection DOAJ
description We study the nonstationary stochastic Multi-Armed Bandit (MAB) problem in which the distributions of rewards associated with arms are assumed to be time-varying and the total variation in the expected rewards is subject to a variation budget. The regret of a policy is defined by the difference in the expected cumulative reward obtained using the policy and using an oracle that selects the arm with the maximum mean reward at each time. We characterize the performance of the proposed policies in terms of the worst-case regret, which is the supremum of the regret over the set of reward distribution sequences satisfying the variation budget. We design Upper-Confidence Bound (UCB)-based policies with three different approaches, namely, periodic resetting, sliding observation window, and discount factor, and show that they are order-optimal with respect to the minimax regret, i.e., the minimum worst-case regret achieved by any policy. We also relax the sub-Gaussian assumption on reward distributions and develop robust versions of the proposed policies that can handle heavy-tailed reward distributions and maintain their performance guarantees.
format Article
id doaj-art-948d0912f99e48b4bd64204b4de428e2
institution Kabale University
issn 2694-085X
language English
publishDate 2024-01-01
publisher IEEE
record_format Article
series IEEE Open Journal of Control Systems
spelling doaj-art-948d0912f99e48b4bd64204b4de428e22025-01-09T00:02:59ZengIEEEIEEE Open Journal of Control Systems2694-085X2024-01-01312814210.1109/OJCSYS.2024.337292910460198Nonstationary Stochastic Bandits: UCB Policies and Minimax RegretLai Wei0https://orcid.org/0000-0001-9684-2090Vaibhav Srivastava1https://orcid.org/0000-0002-9786-0159Life Sciences Institute, University of Michigan, Ann Arbor, MI, USADepartment of Electrical and Computer Engineering, Michigan State University, East Lansing, MI, USAWe study the nonstationary stochastic Multi-Armed Bandit (MAB) problem in which the distributions of rewards associated with arms are assumed to be time-varying and the total variation in the expected rewards is subject to a variation budget. The regret of a policy is defined by the difference in the expected cumulative reward obtained using the policy and using an oracle that selects the arm with the maximum mean reward at each time. We characterize the performance of the proposed policies in terms of the worst-case regret, which is the supremum of the regret over the set of reward distribution sequences satisfying the variation budget. We design Upper-Confidence Bound (UCB)-based policies with three different approaches, namely, periodic resetting, sliding observation window, and discount factor, and show that they are order-optimal with respect to the minimax regret, i.e., the minimum worst-case regret achieved by any policy. We also relax the sub-Gaussian assumption on reward distributions and develop robust versions of the proposed policies that can handle heavy-tailed reward distributions and maintain their performance guarantees.https://ieeexplore.ieee.org/document/10460198/Heavy-tailed distributionsminimax regretnonstationary multiarmed banditupper-confidence boundvariation budget
spellingShingle Lai Wei
Vaibhav Srivastava
Nonstationary Stochastic Bandits: UCB Policies and Minimax Regret
IEEE Open Journal of Control Systems
Heavy-tailed distributions
minimax regret
nonstationary multiarmed bandit
upper-confidence bound
variation budget
title Nonstationary Stochastic Bandits: UCB Policies and Minimax Regret
title_full Nonstationary Stochastic Bandits: UCB Policies and Minimax Regret
title_fullStr Nonstationary Stochastic Bandits: UCB Policies and Minimax Regret
title_full_unstemmed Nonstationary Stochastic Bandits: UCB Policies and Minimax Regret
title_short Nonstationary Stochastic Bandits: UCB Policies and Minimax Regret
title_sort nonstationary stochastic bandits ucb policies and minimax regret
topic Heavy-tailed distributions
minimax regret
nonstationary multiarmed bandit
upper-confidence bound
variation budget
url https://ieeexplore.ieee.org/document/10460198/
work_keys_str_mv AT laiwei nonstationarystochasticbanditsucbpoliciesandminimaxregret
AT vaibhavsrivastava nonstationarystochasticbanditsucbpoliciesandminimaxregret