8/24
Administrivia; Introduction to RL
slides |
AJKS Section 1.1-1.2 |
|
8/26 |
MDP basics; Markov reward processes (MRPs); Bellman consistency equation |
Scribe note 0826/ by Chicheng Zhang |
AJKS Section 1.1.2 |
|
8/31 |
Bellman consistency equation and its interpretation; Optimal value functions |
Scribe note 0831/ by Yinan Li |
AJKS Section 1.1.3 |
|
9/2 |
Bellman optimality equation; Contraction mapping; begin planning in MDPs |
Scribe note 0902/ by Brady Gales |
AJKS Section 1.1.3; 1.4.1 |
|
9/7 |
Planning: value iteration; begin policy iteration |
Scribe note 0907/ by Yangzi Lu |
AJKS Section 1.4.1; 1.4.2 |
|
9/9 |
Planning: finish policy iteration; linear programming |
Scribe note 0909/ by Yichen Li |
AJKS Section 1.4.2; 1.5 |
|
9/14 |
Finite horizon episodic MDPs and planning; begin RL with a generative model |
Scribe note 0914/ by Yao Zhao |
AJKS Section 1.2 |
|
9/16 |
Sample-based value iteration and analysis; simulation lemma |
Scribe note 0916/ by Zhiwu Guo |
AJKS Sections 2.1-2.2; Chi Jin’s lecture notes 5, 6; (Optional) AJKS Section 2.3, Proof of Hoeffding’s Inequality
HW1

9/21 |
Q-learning for RL with a generative model |
Scribe note 0921/ by Wenhan Zhang |
Chi Jin’s lecture notes 6, 7
|
9/23 |
Finish Q-learning; Begin online episodic RL; Multi-armed bandits (MAB); Explore-then-commit |
Scribe note 0923/ by Robert Ferrando |
AJKS Sections 7.1, 6.1 |
|
9/28 |
MAB algorithms: epsilon-greedy, UCB; Failure of epsilon-greedy in online episodic RL |
Scribe note 0928/ by Ruoyao Wang |
AJKS Section 6.1; Optimistic Q-learning paper, Appendix A
|
9/30 |
The UCB-VI algorithm and its analysis |
Scribe note 0930/ by Zhengguang Zhang |
AJKS Sections 7.2-7.4; Bernstein’s Inequality: My notes, Theorem 2.8.4 of Vershynin’s book; See also the proofs of AJKS Lemmas 7.3 and 7.8 for alternative proofs of model concentration using Azuma-Hoeffding / Azuma-Bernstein |
|
10/5 |
Finish UCB-VI analysis; Begin RL with function approximation |
Scribe note 1005/ by Hao Qin |
AJKS Sections 7.2-7.4; Chapter 3 |
|
10/7 |
The LSVI algorithm; Linear Bellman completeness; Experiment design / active learning |
Scribe note 1007/ by Bohan Li |
AJKS Sections 3.1-3.3.1 |
|
10/12 |
Statistical guarantees of realizable linear regression; LSVI with G-optimal design |
Scribe note 1012/ by Bao Do |
AJKS Lemma A.9, Theorem A.10, Sections 3.3.2-3.3.3 |
|
10/14 |
Finish LSVI with G-optimal design; Performance difference lemma; begin online RL in linear MDPs |
Scribe note 1014/ by Amir Yazdani |
AJKS Section 1.5 (discounted setting), Section 3.3.3, Sections 8.1-8.2 |
|
10/19 |
Online RL in linear MDPs: LSVI-UCB and its analysis |
Scribe note 1019/ by Yinan Li, Hao Qin, and Yichen Li |
AJKS Sections 8.3-8.7, my draft notes, proof of elliptic potential lemma (page 7 onwards), LSVI-UCB paper |
HW2

10/21 |
Wenhan Zhang: Deeply AggreVaTeD: Differentiable Imitation Learning for Sequential Prediction |
slides |
|
|
10/26 |
Yinan Li: Is Q-learning Provably Efficient? |
slides |
|
|
10/28 |
Bohan Li: Near-Optimal Reinforcement Learning with Self-Play
slides |
|
|
11/2 |
Yichen Li: Toward the Fundamental Limits of Imitation Learning |
slides |
|
|
11/4 |
Zhengguang Zhang: On Reinforcement Learning with Adversarial Corruption and Its Application to Block MDP |
slides |
|
|
11/9 |
Ruoyao Wang: Provably Efficient RL with Rich Observations via Latent State Decoding
slides |
|
|
11/11 |
Yao Zhao: Near-Optimal Representation Learning for Linear Bandits and Linear RL (pre-recorded) |
slides |
|
|
11/16 |
Yangzi Lu: Reward-Free Exploration for Reinforcement Learning |
slides |
|
|
11/18 |
Bao Do: Of Moments and Matching: A Game-Theoretic Framework for Closing the Imitation Gap |
slides |
|
|
11/23 |
Robert Ferrando: Adaptive Discretization for Model-Based Reinforcement Learning |
slides |
|
|
11/25 |
No class - Thanksgiving
|
|
|
11/30 |
Zhiwu Guo: Online Learning in Unknown Markov Games |
slides |
|
|
12/2 |
Hao Qin: Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes
slides |
|
|
12/7 |
Amir Yazdani: Reinforcement Learning in Reward-Mixing MDPs |
slides |
|
|