CSC 696H: Topics in Bandits and Reinforcement Learning Theory - Fall 2023

Tentative schedule

| Date | Topics | Notes | Additional readings | Homework |
|------|--------|-------|---------------------|----------|
| Aug 22 | Introduction, course mechanics | Slides | AK lec. 1/24 | |
| Aug 24 | Basic probability tools: concentration of measure; Hoeffding's inequality | | AK lec. 1/24; my 588 lecture note | |
| Aug 29 | Bernstein's inequality; generalization in supervised learning; analysis of ERM | Note 1 | AK lec. 1/24 | |
| Aug 31 | Online learning basics; online-to-batch conversion | | FR Sec. 1.6-1.6.1 | |
| Sep 5 | Azuma's inequality; online learning algorithms: exponential weights | | FR Sec. 1.6.2 | |
| Sep 7 | Finish exponential weights; begin multi-armed bandits | Note 2 | FR Sec. 2 | |
| Sep 12 | MAB algorithms: explore-then-commit, ε-greedy, UCB | | FR Sec. 2.1-2.2, 2.3 | HW1 |
| Sep 14 | Analysis of UCB; stochastic linear bandits | | FR Sec. 2.3, 3.2; LinUCB for news recommendation (Li et al., 2010) | |
| Sep 19 | OFUL: confidence set construction and validity | | | |
| Sep 21 | OFUL: upper confidence bound form and regret analysis | Note 3 | AJKS Sec. 6.2 | |
| Sep 26 | Elliptical potential lemma; bandits with general function approximation | | RecSys 2020 tutorial on Bandits in Recommender Systems; WWW 2023 tutorial: industry perspective on bandit feedback | |
| Sep 28 | OFU for bandits with general function approximation; Eluder dimension | Note 4 | FR Sec. 3.1; the original Eluder dimension paper | |
| Oct 3 | Eluder dimension-based analysis of OFU; contextual bandits beyond optimism | | FR Sec. 4.1 | |
| Oct 5 | Failure of optimism; the SquareCB reduction | | FR Sec. 3.3, 3.5; SquareCB paper; Spanner IGW for large action spaces | HW2 |
| Oct 10 | Finish SquareCB analysis; Decision-Estimation Coefficient | Note 5 | FR Sec. 4.2; ICML 2022 tutorial on Bridging Learning and Decision Making; "The Decision-Estimation Coefficient" by Paul Mineiro | |
| Oct 12 | Finish E2D analysis; begin posterior sampling for CB | | | |
| Oct 17 | Thompson sampling (TS); insufficiency of TS in structured bandits | | Sec. 3 of the Tutorial on TS; Prop. 1 of the FGTS paper; Chapelle and Li, empirical analysis of TS; Osband and Van Roy, Why is Posterior Sampling Better than Optimism for Reinforcement Learning? | |
| Oct 19 | Feel-good Thompson sampling (FGTS) and analysis; begin RL | | TZ Sec. 17.4; Tong Zhang's RL theory seminar talk | |
| Oct 24 | RL basics: the MDP model | Note 6 | FR Sec. 5.1, 5.3 | |
| Oct 26 | Planning in RL: visitation (occupancy) distributions and optimality of Markovian policies; value functions | | FR Sec. 5.2; Csaba Szepesvari's lecture note | |
| Oct 31 | Value functions cont'd; begin online learning in MDPs | | | |
| Nov 2 | Online learning in MDPs: simulation lemma (Bellman residual decomposition) and optimism lemma | Note 7 | FR Sec. 5.4, 5.5 | |
| Nov 7 | UCB-VI algorithm and analysis | | FR Sec. 5.6 | |
| Nov 9 | RL with general function approximation: Bellman rank | | FR Sec. 7.3; ICML talk on bilinear rank; Bellman rank paper; DQN for Atari games | |
| Nov 14 | Student presentation by Winston Zeng: A sharp analysis of model-based reinforcement learning with self-play | Slides | | |
| Nov 16 | Student presentation by Rupal Jain: Contextual Combinatorial Cascading Bandits | Slides | | |
| Nov 21 | Student presentation by Aniket Panda: High-dimensional Sparse Linear Bandits | Slides | | |
| Nov 23 | Thanksgiving recess | | | |
| Nov 28 | Student presentation by Yanan Wang: Oracle-Efficient Pessimism: Offline Policy Optimization in Contextual Bandits | Slides | | |
| Nov 30 | Student presentation by Arush Sharma | Slides | | |
| Dec 5 | Student presentation by Sopan Sarkar | Slides | | |
| Dec 7 | Student presentation by Sadaf Raoufi | Slides | | |