The biobjective multiarmed bandit: learning approximate lexicographic optimal allocations

We consider a biobjective sequential decision-making problem where an allocation (arm) is called $\epsilon$ lexicographic optimal if its expected reward in the first objective is at most $\epsilon$ smaller than the highest expected reward, and its expected reward in the second objective is at least the expected reward of a lexicographic optimal arm. The goal of the learner is to select arms that are $\epsilon$ lexicographic optimal as much as possible without knowing the arm reward distributions beforehand. For this problem, we first show that the learner's goal is equivalent to minimizing the $\epsilon$ lexicographic regret, and then, propose a learning algorithm whose $\epsilon$ lexicographic gap-dependent regret is bounded and gap-independent regret is sublinear in the number of rounds with high probability. Then, we apply the proposed model and algorithm for dynamic rate and channel selection in a cognitive radio network with imperfect channel sensing. Our results show that the proposed algorithm is able to learn the approximate lexicographic optimal rate--channel pair that simultaneously minimizes the primary user interference and maximizes the secondary user throughput.