CSC 696H: Topics in Reinforcement Learning Theory - Fall 2021

Tentative schedule

| Date | Topics | Notes / slides | Additional readings | Homework |
|------|--------|----------------|---------------------|----------|
| 8/24 | Administrivia; introduction to RL | slides | AJKS Sections 1.1-1.2 | |
| 8/26 | MDP basics; Markov reward processes (MRPs); Bellman consistency equation | Scribe note by Chicheng Zhang | AJKS Section 1.1.2 | |
| 8/31 | Bellman consistency equation and its interpretation; optimal value functions | Scribe note by Yinan Li | AJKS Section 1.1.3 | |
| 9/2 | Bellman optimality equation; contraction mapping; begin planning in MDPs | Scribe note by Brady Gales | AJKS Sections 1.1.3, 1.4.1 | |
| 9/7 | Planning: value iteration; begin policy iteration | Scribe note by Yangzi Lu | AJKS Sections 1.4.1, 1.4.2 | |
| 9/9 | Planning: finish policy iteration; linear programming | Scribe note by Yichen Li | AJKS Sections 1.4.2, 1.5 | |
| 9/14 | Finite-horizon episodic MDPs and planning; begin RL with a generative model | Scribe note by Yao Zhao | AJKS Section 1.2 | |
| 9/16 | Sample-based value iteration and analysis; simulation lemma | Scribe note by Zhiwu Guo | AJKS Sections 2.1-2.2; Chi Jin's lecture notes 5-6; (optional) AJKS Section 2.3 and proof of Hoeffding's inequality | HW1 |
| 9/21 | Q-learning for RL with a generative model | Scribe note by Wenhan Zhang | Chi Jin's lecture notes 6-7 | |
| 9/23 | Finish Q-learning; begin online episodic RL; multi-armed bandits (MAB); explore-then-commit | Scribe note by Robert Ferrando | AJKS Sections 7.1, 6.1 | |
| 9/28 | MAB algorithms: epsilon-greedy, UCB; failure of epsilon-greedy in online episodic RL | Scribe note by Ruoyao Wang | AJKS Section 6.1; Optimistic Q-learning paper, Appendix A | |
| 9/30 | The UCB-VI algorithm and its analysis | Scribe note by Zhengguang Zhang | AJKS Sections 7.2-7.4; Bernstein's inequality: my notes and Theorem 2.8.4 of Vershynin's book; see also the proofs of AJKS Lemmas 7.3 and 7.8 for alternative proofs of model concentration using Azuma-Hoeffding / Azuma-Bernstein | |
| 10/5 | Finish UCB-VI analysis; begin RL with function approximation | Scribe note by Hao Qin | AJKS Sections 7.2-7.4; Chapter 3 | |
| 10/7 | The LSVI algorithm; linear Bellman completeness; experiment design / active learning | Scribe note by Bohan Li | AJKS Sections 3.1-3.3.1 | |
| 10/12 | Statistical guarantees of realizable linear regression; LSVI with G-optimal design | Scribe note by Bao Do | AJKS Lemma A.9, Theorem A.10, Sections 3.3.2-3.3.3 | |
| 10/14 | Finish LSVI with G-optimal design; performance difference lemma; begin online RL in linear MDPs | Scribe note by Amir Yazdani | AJKS Section 1.5 (discounted setting), Section 3.3.3, Sections 8.1-8.2 | |
| 10/19 | Online RL in linear MDPs: LSVI-UCB and its analysis | Scribe note by Yinan Li, Hao Qin, and Yichen Li | AJKS Sections 8.3-8.7; my draft notes; proof of the elliptic potential lemma (page 7 onwards); LSVI-UCB paper | HW2 |
| 10/21 | Wenhan Zhang: Deeply AggreVaTeD: Differentiable Imitation Learning for Sequential Prediction | slides | | |
| 10/26 | Yinan Li: Is Q-learning Provably Efficient? | slides | | |
| 10/28 | Bohan Li: Near-Optimal Reinforcement Learning with Self-Play | slides | | |
| 11/2 | Yichen Li: Toward the Fundamental Limits of Imitation Learning | slides | | |
| 11/4 | Zhengguang Zhang: On Reinforcement Learning with Adversarial Corruption and Its Application to Block MDP | slides | | |
| 11/9 | Ruoyao Wang: Provably Efficient RL with Rich Observations via Latent State Decoding | slides | | |
| 11/11 | Yao Zhao: Near-Optimal Representation Learning for Linear Bandits and Linear RL (pre-recorded) | slides | | |
| 11/16 | Yangzi Lu: Reward-Free Exploration for Reinforcement Learning | slides | | |
| 11/18 | Bao Do: Of Moments and Matching: A Game-Theoretic Framework for Closing the Imitation Gap | slides | | |
| 11/23 | Robert Ferrando: Adaptive Discretization for Model-Based Reinforcement Learning | slides | | |
| 11/25 | No class (Thanksgiving) | | | |
| 11/30 | Zhiwu Guo: Online Learning in Unknown Markov Games | slides | | |
| 12/2 | Hao Qin: Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes | slides | | |
| 12/7 | Amir Yazdani: Reinforcement Learning in Reward-Mixing MDPs | slides | | |