Aug 22
Introduction, course mechanics
Slides
AK lec. 1/24 |
|
Aug 24 |
Basic probability tools: concentration of measure; Hoeffding’s inequality |
|
AK lec. 1/24, My 588 lecture note |
|
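As a supplement, a quick Monte Carlo check of Hoeffding's inequality for i.i.d. [0, 1]-valued variables (a minimal sketch; the sample size, threshold, and trial count are arbitrary choices):

```python
import numpy as np

# Hoeffding: P(|mean - p| >= t) <= 2 exp(-2 n t^2) for n i.i.d. draws in [0, 1].
rng = np.random.default_rng(0)
n, p, t, trials = 100, 0.5, 0.1, 100_000

samples = rng.binomial(1, p, size=(trials, n))   # Bernoulli(p) draws
deviation = np.abs(samples.mean(axis=1) - p)     # |empirical mean - p| per trial
empirical = (deviation >= t).mean()              # observed tail frequency
bound = 2 * np.exp(-2 * n * t**2)                # Hoeffding upper bound

print(f"empirical tail: {empirical:.4f} <= bound: {bound:.4f}")
```
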
Aug 29 |
Bernstein’s inequality; Generalization in supervised learning; analysis of ERM |
Note 1 |
AK lec. 1/24 |
|
Aug 31 |
Online learning basics; online-to-batch conversion
|
FR Sec. 1.6-1.6.1 |
|
Sep 5 |
Azuma’s inequality; Online learning algorithms: exponential weights |
|
FR Sec. 1.6.2 |
|
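A minimal sketch of the exponential weights (Hedge) update over N experts with losses in [0, 1]; the tuning eta = sqrt(2 ln N / T) is the standard one, and the loss matrix here is synthetic:

```python
import numpy as np

def exponential_weights(losses, eta):
    """Run Hedge on a (T, N) array of expert losses; return the learner's total loss."""
    T, N = losses.shape
    log_w = np.zeros(N)                  # log-weights; uniform start
    total = 0.0
    for t in range(T):
        p = np.exp(log_w - log_w.max())  # softmax of log-weights (numerically stable)
        p /= p.sum()
        total += p @ losses[t]           # expected loss of sampling from p
        log_w -= eta * losses[t]         # multiplicative-weights update
    return total

rng = np.random.default_rng(1)
T, N = 1000, 10
losses = rng.uniform(size=(T, N))
eta = np.sqrt(2 * np.log(N) / T)
print(exponential_weights(losses, eta), losses.sum(axis=0).min())  # learner vs. best expert
```
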
Sep 7 |
Finish exponential weights; Begin multi-armed bandits |
Note 2 |
FR Sec. 2 |
|
Sep 12 |
MAB algorithms: explore-then-commit, eps-greedy, UCB |
|
FR Sec. 2.1-2.2, 2.3 |
HW1 |
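
A minimal sketch of UCB on Bernoulli arms (the bonus sqrt(2 ln t / n_a) is the classical UCB1 choice; the constant differs across texts):

```python
import numpy as np

def ucb(means, T, rng):
    """UCB1 on Bernoulli arms with true means `means`; returns cumulative regret."""
    K = len(means)
    counts = np.zeros(K)
    sums = np.zeros(K)
    regret = 0.0
    for t in range(T):
        if t < K:
            a = t                                    # pull each arm once
        else:
            bonus = np.sqrt(2 * np.log(t) / counts)  # exploration bonus
            a = int(np.argmax(sums / counts + bonus))
        r = rng.binomial(1, means[a])
        counts[a] += 1
        sums[a] += r
        regret += max(means) - means[a]
    return regret

print(ucb(np.array([0.3, 0.5, 0.7]), T=10_000, rng=np.random.default_rng(2)))
```
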
Sep 14 |
Analysis of UCB; Stochastic linear bandits |
|
FR Sec. 2.3, 3.2, LinUCB for news recommendation (Li et al., 2010)
|
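A minimal sketch of LinUCB-style action selection: a ridge-regression estimate plus the elliptical bonus beta * ||x||_{V^{-1}} (beta is a fixed constant here; OFUL's analysis lets it grow slowly with t):

```python
import numpy as np

def linucb_choose(features, V, b, beta):
    """Pick the action maximizing <x, theta_hat> + beta * ||x||_{V^{-1}}."""
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b                     # ridge-regression estimate
    widths = np.sqrt(np.einsum("ad,dk,ak->a", features, V_inv, features))
    return int(np.argmax(features @ theta_hat + beta * widths))

d, lam, beta = 5, 1.0, 1.0
V = lam * np.eye(d)             # regularized Gram matrix: sum_s x_s x_s^T + lam * I
b = np.zeros(d)                 # running sum of reward-weighted features
rng = np.random.default_rng(3)
arms = rng.normal(size=(10, d))       # one feature vector per action
a = linucb_choose(arms, V, b, beta)
x, r = arms[a], 0.7                   # play arm a, observe reward r (placeholder)
V += np.outer(x, x)
b += r * x
```
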
Sep 19 |
OFUL: confidence set construction and validity |
|
|
|
Sep 21 |
OFUL: upper confidence bound form and regret analysis |
Note 3 |
AJKS Sec. 6.2
|
Sep 26 |
Elliptical Potential Lemma; Bandits with general function approximation |
|
RecSys 2020 tutorial on Bandits in Recommender Systems; WWW 2023 tutorial, an industry perspective on Bandit Feedback
|
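A numerical check of the elliptical potential lemma, sum_t min(1, ||x_t||^2_{V_{t-1}^{-1}}) <= 2 d log(1 + T / (d * lam)) for features with ||x|| <= 1 (a minimal sketch using random unit directions):

```python
import numpy as np

rng = np.random.default_rng(4)
d, T, lam = 5, 2000, 1.0
V = lam * np.eye(d)
potential = 0.0
for _ in range(T):
    x = rng.normal(size=d)
    x /= np.linalg.norm(x)                            # enforce ||x|| <= 1
    potential += min(1.0, x @ np.linalg.solve(V, x))  # ||x||^2 in the V^{-1} norm
    V += np.outer(x, x)                               # V_t = V_{t-1} + x x^T

print(f"potential: {potential:.2f} <= bound: {2 * d * np.log(1 + T / (d * lam)):.2f}")
```
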
Sep 28 |
OFU for Bandits with general function approximation; Eluder dimension |
Note 4 |
FR Sec. 3.1; the original Eluder dimension paper
|
Oct 3 |
Eluder dimension-based analysis of OFU; Contextual bandits beyond optimism |
|
FR Sec. 4.1
|
Oct 5 |
Failure of optimism; the SquareCB reduction |
|
FR Sec. 3.3, 3.5; SquareCB paper; Spanner IGW for large action spaces
HW2 |
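
A minimal sketch of the inverse gap weighting (IGW) distribution that SquareCB plays given the regression oracle's predicted rewards (gamma is the algorithm's scale parameter; the values below are illustrative):

```python
import numpy as np

def igw(pred, gamma):
    """IGW: p[a] = 1 / (K + gamma * (max_pred - pred[a])) for a != greedy."""
    K = len(pred)
    best = int(np.argmax(pred))
    p = 1.0 / (K + gamma * (pred[best] - pred))
    p[best] = 0.0
    p[best] = 1.0 - p.sum()     # leftover mass goes to the greedy action
    return p

p = igw(np.array([0.2, 0.5, 0.9]), gamma=20.0)
print(p, p.sum())               # a valid distribution concentrated on the greedy arm
```
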
Oct 10 |
Finish SquareCB analysis; Decision-Estimation Coefficient
Note 5 |
FR Sec. 4.2; ICML 22 Tutorial on Bridging Learning and Decision Making; "The Decision-Estimation Coefficient" by Paul Mineiro
|
Oct 12 |
Finish E2D (Estimation-to-Decisions) analysis; begin posterior sampling for contextual bandits
|
|
|
Oct 17 |
Thompson Sampling (TS); Insufficiency of TS in structured bandits |
|
Sec. 3 of Tutorial on TS; Prop. 1 of FGTS paper; Chapelle and Li, empirical analysis of TS; Osband and Van Roy, Why is Posterior Sampling Better than Optimism for Reinforcement Learning?
|
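A minimal sketch of Thompson Sampling for Bernoulli arms with Beta(1, 1) priors:

```python
import numpy as np

def thompson(means, T, rng):
    """Beta-Bernoulli Thompson Sampling; returns cumulative regret."""
    K = len(means)
    a_post = np.ones(K)        # Beta posterior: successes + 1
    b_post = np.ones(K)        # Beta posterior: failures + 1
    regret = 0.0
    for _ in range(T):
        a = int(np.argmax(rng.beta(a_post, b_post)))  # one sample per arm posterior
        r = rng.binomial(1, means[a])
        a_post[a] += r
        b_post[a] += 1 - r
        regret += max(means) - means[a]
    return regret

print(thompson(np.array([0.3, 0.5, 0.7]), T=10_000, rng=np.random.default_rng(5)))
```
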
Oct 19 |
Feel-good Thompson Sampling (FGTS) and analysis; begin RL |
|
TZ Sec. 17.4, Tong Zhang’s RL theory seminar talk |
|
Oct 24 |
RL basics: the MDP model |
Note 6 |
FR Sec. 5.1, 5.3
|
Oct 26 |
Planning in RL: visitation (occupancy) distributions and optimality of Markovian policies; Value function |
|
FR Sec. 5.2, Csaba Szepesvari’s lecture note
|
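A minimal sketch of backward induction (finite-horizon value iteration) on a tabular MDP with known transitions P[h] of shape (S, A, S) and rewards R[h] of shape (S, A); the random MDP at the end only exercises the recursion:

```python
import numpy as np

def value_iteration(P, R):
    """Backward induction; returns optimal Q-values, one (S, A) array per step h."""
    H = len(P)
    S, A = R[0].shape
    V = np.zeros(S)                  # V_H = 0
    Q = [None] * H
    for h in reversed(range(H)):
        Q[h] = R[h] + P[h] @ V       # Bellman backup: r(s, a) + E_{s'}[V_{h+1}(s')]
        V = Q[h].max(axis=1)         # act greedily
    return Q

rng = np.random.default_rng(6)
H, S, A = 3, 4, 2
P = np.array([rng.dirichlet(np.ones(S), size=(S, A)) for _ in range(H)])
R = np.array([rng.uniform(size=(S, A)) for _ in range(H)])
print(value_iteration(P, R)[0].max(axis=1))   # V_0^*(s) for each state s
```
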
Oct 31 |
Value function cont’d; begin online learning in MDPs |
|
|
|
Nov 2 |
Online learning in MDPs: simulation lemma (Bellman residual decomposition) and optimism lemma |
Note 7 |
FR Sec. 5.4, 5.5
|
Nov 7 |
UCB-VI algorithm and analysis |
|
FR Sec. 5.6
|
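A minimal sketch of the UCB-VI backup: value iteration on the empirical model with a count-based bonus added to the reward and values truncated to [0, H] (the bonus constant here is schematic; the analysis pins down the exact form):

```python
import numpy as np

def ucbvi_backup(P_hat, R_hat, N, H, delta=0.05):
    """Optimistic backward induction on empirical estimates.

    P_hat: (H, S, A, S) empirical transitions; R_hat: (H, S, A) empirical
    rewards; N: (H, S, A) visit counts (clipped below at 1).
    """
    S, A = R_hat[0].shape
    V = np.zeros(S)
    Q = [None] * H
    for h in reversed(range(H)):
        bonus = H * np.sqrt(np.log(S * A * H / delta) / np.maximum(N[h], 1))
        Q[h] = np.clip(R_hat[h] + bonus + P_hat[h] @ V, 0.0, H)  # truncate to [0, H]
        V = Q[h].max(axis=1)
    return Q
```

In the full algorithm this backup is recomputed at the start of each episode from the updated counts, and the episode is then played greedily with respect to Q.
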
Nov 9 |
RL with general function approximation: Bellman rank |
|
FR Sec. 7.3; ICML talk on bilinear rank; Bellman rank paper; DQN for Atari Games
|
Nov 14 |
Student presentation by Winston Zeng: A Sharp Analysis of Model-based Reinforcement Learning with Self-Play
slides |
|
|
Nov 16 |
Student presentation by Rupal Jain: Contextual Combinatorial Cascading Bandits |
slides |
|
|
Nov 21 |
Student presentation by Aniket Panda: High-dimensional Sparse Linear Bandits |
slides |
|
|
Nov 23 |
Thanksgiving recess |
|
|
|
Nov 28 |
Student presentation by Yanan Wang: Oracle-Efficient Pessimism: Offline Policy Optimization in Contextual Bandits |
slides |
|
|
Nov 30 |
Student presentation by Arush Sharma
slides |
|
|
Dec 5 |
Student presentation by Sopan Sarkar
slides |
|
|
Dec 7 |
Student presentation by Sadaf Raoufi
slides |
|
|