8/28 |
Overview of MDP models: Finite horizon problems & the dynamic programming algorithm.
- W. B. Powell, Approximate Dynamic Programming, 2nd Edition, 2011.
- D. P. Bertsekas, Dynamic Programming & Optimal Control, Vol. 2, 4th Edition, 2012.
- M. L. Puterman, Markov Decision Processes, 1994.
|
Notes (Jing Yang)
|
9/4 |
Overview of MDP models: Infinite horizon & value/policy iteration; Intro to ADP.
- W. B. Powell, Approximate Dynamic Programming, 2nd Edition, 2011.
- D. P. Bertsekas, Dynamic Programming & Optimal Control, Vol. 2, 4th Edition, 2012.
- M. L. Puterman, Markov Decision Processes, 1994.
|
Notes (Ziyue Sun)
|
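As a companion to the value-iteration material in this session, a minimal sketch of the Bellman optimality update on a toy two-state, two-action MDP; all transition probabilities and rewards below are made up purely for illustration, not taken from the readings:

```python
import numpy as np

# Toy MDP (hypothetical numbers): P[a, s, s'] = transition probability,
# R[a, s] = expected one-step reward for taking action a in state s.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # action 0
    [[0.5, 0.5], [0.7, 0.3]],   # action 1
])
R = np.array([[1.0, 0.0], [0.5, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    # Bellman optimality update: V(s) <- max_a [ R(a,s) + gamma * sum_s' P(a,s,s') V(s') ]
    Q = R + gamma * P @ V        # Q[a, s]; P @ V sums over s'
    V_new = Q.max(axis=0)
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new

policy = (R + gamma * P @ V).argmax(axis=0)  # greedy policy w.r.t. V
```

Because the update is a gamma-contraction in the sup-norm, the loop converges geometrically and the final `V` satisfies the Bellman equation to within the tolerance.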
9/11 |
Variants of the Value Iteration Algorithm.
- D. P. Bertsekas, Dynamic Programming & Optimal Control, Vol. 2, 4th Edition, 2012.
|
Notes (Ziyue Sun, Tarik Bilgic)
|
9/18 |
Asynchronous VI and Bellman Reformulations.
- D. P. Bertsekas, Dynamic Programming & Optimal Control, Vol. 2, 4th Edition, 2012.
- W. B. Powell, Approximate Dynamic Programming, 2nd Edition, 2011.
|
Notes (Mingyuan Xu)
|
9/25 |
Q-Learning and Stochastic Approximation.
- D. P. Bertsekas, J. N. Tsitsiklis, Neuro-Dynamic Programming, 1996.
- H. Robbins, S. Monro, A stochastic approximation method, The Annals of Mathematical Statistics, 1951.
|
Notes (Shaoning Han)
|
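The tabular Q-learning update covered in this session can be sketched on a simulated toy MDP; the MDP numbers, exploration rate, and step-size exponent are illustrative assumptions, with the per-state-action step sizes diminishing in the Robbins-Monro style:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP simulator (hypothetical numbers): P[a, s, s'], R[a, s].
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.7, 0.3]],
])
R = np.array([[1.0, 0.0], [0.5, 2.0]])
gamma, n_states, n_actions = 0.9, 2, 2

Q = np.zeros((n_states, n_actions))
N = np.zeros((n_states, n_actions))   # visit counts for per-(s,a) step sizes
s = 0
for _ in range(50_000):
    # Epsilon-greedy exploration (epsilon = 0.2, an arbitrary choice here).
    a = int(rng.integers(n_actions)) if rng.random() < 0.2 else int(Q[s].argmax())
    s_next = rng.choice(n_states, p=P[a, s])
    r = R[a, s]
    N[s, a] += 1
    alpha = 1.0 / N[s, a] ** 0.6      # satisfies sum alpha = inf, sum alpha^2 < inf
    # Q-learning update toward the one-step bootstrap target.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next
```

Under the step-size conditions in the comment (and provided every state-action pair is visited infinitely often), the iterates converge to the optimal Q-factors; the convergence analysis is the subject of the 10/2 session.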
10/2 |
Convergence of Q-Learning and DQN.
- D. P. Bertsekas, J. N. Tsitsiklis, Neuro-Dynamic Programming, 1996.
- C. J. Watkins, P. Dayan, Q-learning, Machine Learning, 1992.
- J. N. Tsitsiklis, Asynchronous stochastic approximation and Q-learning, Machine Learning, 1994.
- V. Mnih, et al., Human-level control through deep reinforcement learning, Nature, 2015.
|
Notes (Kamal Basulaiman)
|
10/9 |
Value Function Approximation, Fitted VI, and the LP Approach.
- D. P. Bertsekas, J. N. Tsitsiklis, Neuro-Dynamic Programming, 1996.
- D. P. De Farias, B. Van Roy, On the existence of fixed points for approximate value iteration and temporal-difference learning, Journal of Optimization Theory and Applications, 2000.
- D. P. De Farias, B. Van Roy, The LP approach to approximate dynamic programming, Operations Research, 2003.
- D. P. De Farias, B. Van Roy, On constraint sampling in the LP approach to approximate dynamic programming, Mathematics of Operations Research, 2004.
|
Notes (Shaoning Han, Mingyuan Xu)
|
10/23 |
Policy Evaluation and TD Learning.
- D. P. Bertsekas, Dynamic Programming & Optimal Control, Vol. 2, 4th Edition, 2012.
- R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd Edition, 2018.
- W. B. Powell and H. Topaloglu, Approximate dynamic programming for large-scale resource allocation problems, Tutorials in Operations Research, 2012.
|
Notes (Tarik Bilgic, Shaoning Han)
|
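A minimal TD(0) policy-evaluation sketch on a toy two-state chain, compared against the exact solution of the policy's Bellman equation; the transition matrix, rewards, and step-size schedule are hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed policy induces a Markov chain with transition matrix P_pi and
# expected one-step rewards r_pi (hypothetical numbers).
P_pi = np.array([[0.9, 0.1],
                 [0.7, 0.3]])
r_pi = np.array([1.0, 2.0])
gamma = 0.9

# Exact solution of V = r_pi + gamma * P_pi V, for comparison.
V_exact = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# TD(0): follow the chain and move V(s) toward the one-step bootstrap target.
V = np.zeros(2)
s = 0
for t in range(1, 200_001):
    s_next = rng.choice(2, p=P_pi[s])
    alpha = 1.0 / t ** 0.6            # diminishing step size
    V[s] += alpha * (r_pi[s] + gamma * V[s_next] - V[s])
    s = s_next
```

Unlike the value-iteration sweep, TD(0) never touches the model's transition probabilities directly; it only samples them, which is what makes the method applicable in the simulation-based setting of the readings.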
10/30 |
Q-Learning for Optimal Stopping.
- J. N. Tsitsiklis, B. Van Roy, Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to high-dimensional financial derivatives, IEEE Transactions on Automatic Control, 1999.
- D. P. Bertsekas, J. N. Tsitsiklis, Neuro-Dynamic Programming, 1996.
|
Notes (Kamal Basulaiman, Jing Yang)
|
11/13 |
Aggregation and Feature-based Value Iteration for Control, Natural Policy Gradient.
- J. N. Tsitsiklis, B. Van Roy, Feature-based methods for large scale dynamic programming. Machine Learning, 1996.
- S. M. Kakade, A natural policy gradient. Advances in Neural Information Processing Systems, 2002.
|
Notes (Mingyuan Xu, Tarik Bilgic)
|
11/20 |
Approximate Policy Iteration and Fitted VI.
- D. P. Bertsekas, Dynamic Programming & Optimal Control, Vol. 2, 4th Edition, 2012.
- D. P. Bertsekas, Approximate policy iteration: A survey and some new methods, Journal of Control Theory and Applications, 2011.
- R. Munos, C. Szepesvári, Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 2008.
|
Notes (Boyuan Lai, Ibrahim El Shar)
|
11/30 |
Policy Gradient Methods and Dual PI.
- R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour, Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 2000.
- S. Kunnumkal, H. Topaloglu, Using stochastic approximation methods to compute optimal base-stock levels in inventory control problems. Operations Research, 2008.
- W. Sun, G. J. Gordon, B. Boots, J. A. Bagnell, Dual policy iteration. Advances in Neural Information Processing Systems, 2018.
|
Notes (Jing Yang)
|
12/4 |
Benchmarking and Information Relaxation.
- D. B. Brown, J. E. Smith, P. Sun, Information relaxations and duality in stochastic dynamic programs. Operations Research, 2010.
|
Notes (Ibrahim El Shar, Boyuan Lai)
|