The biobjective multiarmed bandit: learning approximate lexicographic optimal allocations

We consider a biobjective sequential decision-making problem where an allocation (arm) is called ϵ lexicographic optimal if its expected reward in the first objective is at most ϵ smaller than the highest expected reward, and its expected reward in the second objective is at least the expected reward of a lexicographic optimal arm. The goal of the learner is to select arms that are ϵ lexicographic optimal as much as possible without knowing the arm reward distributions beforehand. For this problem, we first show that the learner's goal is equivalent to minimizing the ϵ lexicographic regret, and then propose a learning algorithm whose ϵ lexicographic gap-dependent regret is bounded and gap-independent regret is sublinear in the number of rounds with high probability. Then, we apply the proposed model and algorithm to dynamic rate and channel selection in a cognitive radio network with imperfect channel sensing. Our results show that the proposed algorithm is able to learn the approximate lexicographic optimal rate–channel pair that simultaneously minimizes the primary user interference and maximizes the secondary user throughput.
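
As an illustration of the selection rule this definition suggests, the sketch below is a minimal lexicographic index policy, not the paper's exact algorithm: the UCB1-style confidence radius, the forced initial pulls, and the name `lex_ucb_select` are assumptions made for the example. It restricts attention to arms whose objective-1 upper confidence bound is within ϵ of the best such bound, and among those picks the arm with the highest objective-2 upper confidence bound.

```python
import numpy as np

def lex_ucb_select(counts, means1, means2, t, eps):
    """Illustrative lexicographic arm selection (sketch only).

    counts  : number of pulls of each arm so far
    means1  : empirical mean rewards in the dominant (first) objective
    means2  : empirical mean rewards in the second objective
    t       : current round, used in the confidence radius
    eps     : tolerance in the dominant objective
    """
    counts = np.asarray(counts, dtype=float)
    # Play every arm once before using confidence bounds.
    untried = np.where(counts == 0)[0]
    if untried.size > 0:
        return untried[0]
    # Standard UCB1-style radius; the paper's confidence terms may differ.
    radius = np.sqrt(2.0 * np.log(t) / counts)
    ucb1 = np.asarray(means1) + radius
    ucb2 = np.asarray(means2) + radius
    # Arms near-optimal in the dominant objective (within eps of the best UCB).
    candidates = np.where(ucb1 >= ucb1.max() - eps)[0]
    # Among those, maximize the second-objective UCB.
    return candidates[np.argmax(ucb2[candidates])]

# Example: 3 arms at round t = 100 with eps = 0.1.
counts = np.array([30, 40, 29])
means1 = np.array([0.70, 0.68, 0.40])   # dominant objective
means2 = np.array([0.20, 0.55, 0.90])   # second objective
arm = lex_ucb_select(counts, means1, means2, t=100, eps=0.1)
```

In the cognitive radio application described above, the first objective would correspond to limiting primary user interference and the second to secondary user throughput, so the rule first screens rate–channel pairs by the interference criterion and only then optimizes throughput.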
