I am a Research Scientist at Meta and an Adjunct Professor at the University of Pittsburgh. I work on reinforcement learning, Bayesian optimization, and adaptive experimentation. More broadly, I am interested in sequential decision-making from the perspectives of both operations research and AI. My recent work has been applied to large-scale, adaptive internet experiments at Meta and the advertising systems behind Facebook and Instagram. I received my Ph.D. in Operations Research and Financial Engineering from Princeton University. I am always interested in academic collaborations. Please reach out!
Deployable RL Workshop @ RLC 2024. I am co-organizing the Deployable RL Workshop at the first Reinforcement Learning Conference (RLC) in Amherst, MA on August 9, 2024. We invite papers on the theory and practice of RL aimed at deployment to real-world problems. The submission deadline is May 8, 2024. Please see the call for papers for more details!
Exploiting Structure in Offline Multi-Agent RL: The Benefits of Low Interaction Rank
Submitted, 2024.
Brief Description: We study the problem of learning an approximate equilibrium in offline multi-agent reinforcement learning (MARL). We introduce a structural assumption, the interaction rank, and show that utilizing function classes with low interaction rank leads to decentralized, computationally and statistically efficient learning in offline MARL. Our experiments show the potential of critic architectures with low interaction rank when used in TD3-BC.
Mathematical Programming for Adaptive Experiments
Submitted, 2024.
Brief Description: We observe that real experiments are deliberately implemented with a few large batches. By invoking a central limit approximation at each batch, we obtain a tractable Bayesian MDP that can flexibly incorporate a wide range of problem specifications, including batched and delayed feedback, personalization, non-stationarity, multiple objectives, and constraints. We call this the "mathematical programming" view of adaptive experimentation.
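As a rough illustration of the central limit idea, here is a minimal sketch of a conjugate Gaussian update applied to a single arm after one large batch; the function name and the numbers in the usage example are purely illustrative and not taken from the paper.

```python
def batch_posterior_update(mu, var, batch_mean, batch_var, n_batch):
    """Conjugate Gaussian update of one arm's posterior after a single batch.

    The central limit approximation treats the batch's average reward as
    Gaussian with variance batch_var / n_batch, so the experiment's Bayesian
    state reduces to a posterior mean and variance per arm.
    """
    obs_var = batch_var / n_batch                 # variance of the batch average
    new_var = 1.0 / (1.0 / var + 1.0 / obs_var)   # combine precisions
    new_mu = new_var * (mu / var + batch_mean / obs_var)
    return new_mu, new_var

# Example (illustrative numbers): prior N(0, 1), a batch of 10,000 users
# with average reward 0.12 and per-user reward variance 4.0.
mu, var = batch_posterior_update(mu=0.0, var=1.0, batch_mean=0.12,
                                 batch_var=4.0, n_batch=10_000)
```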
AExGym: Benchmarks and Environments for Adaptive Experimentation
Submitted, 2024.
Brief Description: We present a benchmark for adaptive experimentation based on real-world datasets, highlighting prominent practical challenges to operationalizing adaptivity: non-stationarity, batched/delayed feedback, multiple outcomes and objectives, and external validity. We release an open-source library, AExGym, which is designed with modularity and extensibility in mind.
On the Linear Speedup of Personalized Federated RL with Shared Representations
Submitted, 2024.
Brief Description: Federated reinforcement learning (FedRL) enables multiple agents to collaboratively learn a policy without sharing their own local trajectories collected during agent-environment interactions. We develop a class of personalized FedRL algorithms that learns (1) a shared feature representation collaboratively among all agents and (2) an agent-specific weight vector personalized to its local environment.
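A minimal sketch of the kind of critic this architecture suggests, assuming a PyTorch-style implementation: a feature map shared across agents plus an agent-specific weight vector. Class and parameter names are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class PersonalizedQ(nn.Module):
    """Q-value with a shared representation and a per-agent linear head."""

    def __init__(self, state_dim, action_dim, feature_dim, num_agents):
        super().__init__()
        # Shared feature map phi(s, a), learned collaboratively across agents
        self.phi = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
            nn.Linear(64, feature_dim),
        )
        # Agent-specific weights, kept local and personalized to each environment
        self.w = nn.Parameter(torch.zeros(num_agents, feature_dim))

    def forward(self, agent_id, state, action):
        features = self.phi(torch.cat([state, action], dim=-1))
        return (features * self.w[agent_id]).sum(dim=-1)
```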
Pearl: A Production-Ready Reinforcement Learning Agent
Journal of Machine Learning Research, 2024.
Brief Description: We introduce Pearl, a new open-source library for reinforcement learning that aims to enable users to easily build versatile RL agents for real-world applications. Pearl is designed with modularity in mind, allowing researchers and practitioners to mix & match components for policy learning, exploration, safety, and history summarization when building practical RL agents.
Faster Approximate Dynamic Programming by Freezing Slow States
Major revision at Management Science, 2023.
Brief Description: We consider fast-slow MDPs, where some components of the state transition "fast" while others evolve more "slowly." This is common when decisions need to be made at high frequencies, yet information that varies at a slower timescale also influences the optimal policy. We propose several new algorithms, each based on the idea of periodically "freezing" and then "releasing" slow states, leading to dramatic computational benefits.
Weakly Coupled Deep Q-Networks
Advances in Neural Information Processing Systems, NeurIPS 2023.
Brief Description: We introduce weakly coupled deep Q-networks and weakly coupled Q-learning, reinforcement learning methods designed for weakly coupled MDPs. We employ multiple, simultaneous DQN or Q-learning agents, each running on a separate, easier subproblem; when combined, their values form an upper bound on the action values of the original problem. These dynamic bounds then guide the primary agent toward the optimal policy.
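One illustrative way such a bound could be used during learning, assuming each subproblem agent reports a greedy next-state value: sum them into an upper bound and clip the primary agent's TD target. This is a simplified sketch of the general idea, not the paper's exact construction (which also accounts for the Lagrangian terms from the linking constraints).

```python
import torch

def clipped_td_target(reward, next_q_main, sub_q_values, gamma=0.99):
    """Clip a DQN target using an upper bound assembled from subproblem values.

    sub_q_values: list of tensors, each the greedy next-state value produced by
    one subproblem agent; their sum acts as an upper bound on the full
    problem's value in this simplified sketch.
    """
    upper = torch.stack(sub_q_values, dim=0).sum(dim=0)    # combined upper bound
    target = reward + gamma * next_q_main                  # standard DQN target
    return torch.minimum(target, reward + gamma * upper)   # keep target below the bound
```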
Dynamic Subgoal-Based Exploration via Bayesian Optimization
Transactions on Machine Learning Research, 2023.
Brief Description: We consider problems where an agent faces an unknown task (drawn from a distribution of MDPs) in the future and is given prior opportunities to "practice" on related tasks where the interactions are still expensive. We propose a one-step Bayes-optimal algorithm for selecting subgoal designs, along with the number of episodes and the episode length during training, to efficiently maximize the expected performance of the agent at test time.
On Noisy Evaluation in Federated Hyperparameter Tuning
Conference on Machine Learning and Systems, MLSys 2023.
Brief Description: We perform the first systematic study on the effect of noisy evaluation in federated hyperparameter tuning. We identify and rigorously explore key sources of noise, including client subsampling, data and systems heterogeneity, and data privacy. Surprisingly, our results indicate that even small amounts of noise can significantly impact tuning methods—reducing the performance of state-of-the-art approaches to that of naive baselines.
Dynamic Inventory Repositioning in On-Demand Rental Networks
Management Science 68(11), pp. 7793-8514, 2022.
Brief Description: We consider a product rental network with a fixed number of rental units distributed across multiple locations. We show convexity of the value function and that the optimal policy can be described in terms of a well-specified region over the state space. We leverage these results in an infinite-horizon, cutting-plane-based ADP algorithm and prove its asymptotic optimality, improving upon previous convergence results in the literature.
Interpretable Personalized Experimentation
ACM International Conference on Knowledge Discovery and Data Mining, KDD 2022.
Brief Description: We present a scalable, interpretable personalized experimentation system, implemented and deployed in production at Meta. The system works in the multiple-treatment, multiple-outcome setting typical at Meta to (1) learn explanations for black-box heterogeneous treatment effect (HTE) models and (2) generate interpretable personalized policies.
Multi-Step Budgeted Bayesian Optimization with Unknown Evaluation Costs
Advances in Neural Information Processing Systems, NeurIPS 2021.
Brief Description: Most Bayesian optimization algorithms ignore how evaluation costs, which are often unknown, may change over the optimization domain. An unknown cost function with a budget constraint introduces a new dimension to the exploration-exploitation trade-off, where learning about the cost incurs the cost itself. We propose a new dynamic programming-based acquisition function for this problem setting.
Structured Actor-Critic for Managing Public Health Points-of-Dispensing
Under revision, 2022.
Brief Description: We consider the setting of public health medical inventory control/dispensing and propose a new actor-critic algorithm that tracks both policy and value function approximations. The algorithm utilizes structure in both the policy and value to improve the empirical convergence rate. We also provide a case study for the problem of dispensing naloxone (an overdose reversal drug) amidst the ongoing opioid crisis.
Efficient Nonmyopic Bayesian Optimization via One-Shot Multi-Step Trees
Advances in Neural Information Processing Systems, NeurIPS 2020.
Brief Description: Bayesian optimization is a sequential decision making framework for optimizing expensive-to-evaluate black-box functions. Computing a full lookahead policy amounts to solving a stochastic dynamic program, which is highly intractable. Instead, we propose a multi-step scenario tree formulation and a one-shot optimization approach that operates by differentiating through the entire decision tree. (* equal contribution).
Lookahead-Bounded Q-Learning
International Conference on Machine Learning, ICML 2020.
Brief Description: We introduce the lookahead-bounded Q-learning (LBQL) algorithm, a new, provably convergent variant of Q-learning that seeks to make better use of collected experience through noisy "lookahead" upper and lower bounds that constrain the Q-iterates. The algorithm operates via a "feedback loop": approximate Q-values are used to estimate bounds, and those bounds are subsequently used to improve the Q-values (and repeat).
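The projection idea is simple to illustrate in the tabular case. In the sketch below, L and U are assumed to be precomputed bound arrays of the same shape as Q; in the actual algorithm they are estimated from the collected experience rather than given.

```python
import numpy as np

def lbql_style_update(Q, L, U, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One Q-learning step followed by projection onto the [lower, upper] bounds."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    Q[s, a] = np.clip(Q[s, a], L[s, a], U[s, a])  # constrain the Q-iterate
    return Q
```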
BoTorch: A Framework for Efficient Monte-Carlo Bayesian Optimization
Advances in Neural Information Processing Systems, NeurIPS 2020.
Brief Description: Bayesian optimization provides sample-efficient global optimization for a broad range of applications, including automatic machine learning, molecular chemistry, and experimental design. We introduce BoTorch, a modern programming framework for Bayesian optimization, along with a new "one-shot" approach to optimizing the Knowledge Gradient acquisition function.
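For context, a minimal BoTorch usage example in the spirit of the one-shot Knowledge Gradient; the toy data is made up, and the exact import paths and defaults may differ across library versions.

```python
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from gpytorch.mlls import ExactMarginalLogLikelihood
from botorch.acquisition import qKnowledgeGradient
from botorch.optim import optimize_acqf

# Toy observations on the unit square
train_X = torch.rand(10, 2, dtype=torch.double)
train_Y = -(train_X - 0.5).pow(2).sum(dim=-1, keepdim=True)

model = SingleTaskGP(train_X, train_Y)
fit_gpytorch_mll(ExactMarginalLogLikelihood(model.likelihood, model))

# "One-shot" Knowledge Gradient: fantasy solutions are optimized jointly
# with the candidate point inside a single optimization problem.
qkg = qKnowledgeGradient(model, num_fantasies=32)
bounds = torch.tensor([[0.0, 0.0], [1.0, 1.0]], dtype=torch.double)
candidate, value = optimize_acqf(qkg, bounds=bounds, q=1,
                                 num_restarts=5, raw_samples=64)
```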
Optimistic Monte Carlo Tree Search with Sampled Information Relaxation Dual Bounds
Operations Research, 68(6), pp. 1678-1697, 2020.
Brief Description: MCTS is a well-known strategy for solving sequential decision problems, particularly in the area of game-play AI. We propose a new technique called Primal-Dual MCTS that utilizes sampled information relaxation bounds (Brown et al., 2010) on potential actions in order to make tree expansion decisions. The approach shows promise when used to optimize the behavior of a driver navigating a graph while operating on a ride-sharing platform.
Feedback-Based Tree Search for Reinforcement Learning
International Conference on Machine Learning, ICML 2018.
Brief Description: We describe a technique that iteratively applies MCTS on batches of small, finite-horizon versions of the original infinite-horizon MDP. We show that a deep neural network implementation of the technique can create a competitive AI agent for a popular multi-player online battle arena (MOBA) game.
Shape Constraints in Economics and Operations Research
Statistical Science, 33(4), pp. 527-546, 2018.
Brief Description: This paper reviews an illustrative set of research on shape constrained estimation in the economics and operations research literature. We highlight the methodological innovations and applications, with a particular emphasis on utility functions, production economics, and sequential decision making applications.
Risk-Averse Approximate Dynamic Programming with Quantile-Based Risk Measures
Mathematics of Operations Research, 43(2), pp. 554-579, 2018.
Brief Description: We propose a new Q-learning algorithm and a companion sampling procedure to solve risk-averse Markov decision processes under a class of dynamic quantile-based risk measures. Convergence results are proven and an application to energy storage is shown.
An Approximate Dynamic Programming Algorithm for Monotone Value Functions
Operations Research, 63(6), pp. 1489-1511, 2015.
Brief Description: We describe a provably convergent algorithm to exploit the structural property of monotonicity that arises in many applications in operations research, finance, and economics. We show via simulations that near optimal solutions can be obtained using the proposed method when the exact approach is computationally intractable.
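To make the structural idea concrete, here is a simple one-dimensional sketch of a monotone projection step applied after a value update; the paper's algorithm handles more general partial orders, and the names here are illustrative.

```python
import numpy as np

def monotone_projection(V, s, v_new):
    """Set V[s] = v_new and restore monotonicity (V nondecreasing in the index)."""
    V = V.copy()
    V[s] = v_new
    V[:s] = np.minimum(V[:s], v_new)          # states below s cannot exceed V[s]
    V[s + 1:] = np.maximum(V[s + 1:], v_new)  # states above s cannot fall below V[s]
    return V
```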
Optimal Hour-Ahead Bidding in the Real-Time Electricity Market with Battery Storage using Approximate Dynamic Programming
INFORMS Journal on Computing, 27(3), pp. 525-543, 2015.
Brief Description: We formulate a mathematical model for bidding in the real-time market with the goal of performing energy arbitrage (i.e., exploiting variations in spot prices to profit) in the presence of storage. We train and test an approximate dynamic programming policy on real spot price data from the NYISO and show its value over heuristic policies used in industry.
Approximate Dynamic Programming, Ph.D. Level
Instructor, Spring 2017, Fall 2018
Course Description: ADP refers to a broad set of computational methods used for finding approximately optimal policies of intractable sequential decision problems (MDPs). We'll begin with an overview of classical methods and transition to a survey of state-of-the-art developments. The lectures will focus on mathematical proofs and underlying theory, while the course project gives students practice with numerical implementations. All lecture notes, based on a number of papers and texts (primarily Bertsekas and Tsitsiklis), are available online.
Decision Models, Undergraduate/Master's Level
Instructor, Fall 2016-Fall 2021 (5 times)
Course Description: Decision making is key to understanding a variety of problems in industry, including inventory control, revenue management, pricing, energy, healthcare, logistics, and finance. In this course, we focus on stochastic decision models (i.e., "decision making under uncertainty") and discuss the fundamental methodology and models in conjunction with applications to real-world problems. Students should have a basic understanding of probability and optimization (linear programming).
Ride-sharing Analytics Game, Undergraduate/Master's Level
Instructor, Fall 2018
Description: This was an event designed for the Decision Models course, where teams of students (1) analyze a data set, (2) design a pricing, manufacturing, advertising, and repositioning strategy to operate a ride-sharing company, and (3) compete in a live competition where decisions are submitted periodically to a simulator designed specifically for the course. If you are an instructor and interested in running this event in your course, please let me know. See the description and data for more information.
Reinforcement Learning, Master's Level
Instructor, Summer 2018
Course Description: This is an introductory course on reinforcement learning (RL). The basics of MDPs necessary for RL will be covered, along with a wide range of methods (e.g., TD learning, Q-learning, policy gradients) for evaluation and control. The focus in this course will be on applications, implementation, intuition, and some theory. All lecture notes, based on the Sutton & Barto RL textbook, are available online.
Uncovering Missed Tackle Opportunities. With Matt Chang, Kat Dai, and Harvey Cheng, we propose the "Missed Tackle Opportunity" metric, which is based on tackle probability prediction. We won the Kaggle NFL Big Data Bowl competition after presenting at the 2024 NFL Combine in Indianapolis. The new metric will be part of the NFL's Next Gen Stats.
What Would I Say? Read the New Yorker, CNN, and Telegraph articles about the project we created at Hack Princeton 2013, which uses Markov chains to simulate a user's social media posts. The site has drawn over 17 million page views from 9 million unique users. Created with Pawel Przytycki, Ugne Klibaite, Vicky Yao, Edward Young, Harvey Cheng, and Alex Furger.
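The underlying idea is easy to sketch: build a word-level Markov chain from a user's past posts and sample from it. The toy version below is not the project's actual implementation.

```python
import random
from collections import defaultdict

def build_chain(posts, order=1):
    """Map each tuple of `order` consecutive words to the words that follow it."""
    chain = defaultdict(list)
    for post in posts:
        words = post.split()
        for i in range(len(words) - order):
            chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, length=20):
    """Sample a new 'post' by walking the chain from a random starting state."""
    key = random.choice(list(chain.keys()))
    out = list(key)
    for _ in range(length):
        followers = chain.get(tuple(out[-len(key):]))
        if not followers:
            break
        out.append(random.choice(followers))
    return " ".join(out)
```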
Simulating Fantasy Football Schedules. Use this app to quantify the role of luck in (Yahoo!) Fantasy Football by generating probability distributions of your record over randomized season schedules. Created with Alex Furger, Daniel Munro, and Pawel Przytycki.
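The simulation behind the app amounts to a small Monte Carlo loop; the weekly_scores structure below (team name mapped to a list of weekly point totals) is a hypothetical stand-in for the league data pulled from Yahoo!.

```python
import random
from collections import Counter

def record_distribution(weekly_scores, my_team, n_sims=10_000):
    """Estimate the win-total distribution over uniformly random schedules."""
    opponents = [t for t in weekly_scores if t != my_team]
    n_weeks = len(weekly_scores[my_team])
    wins = Counter()
    for _ in range(n_sims):
        w = sum(
            weekly_scores[my_team][week] > weekly_scores[random.choice(opponents)][week]
            for week in range(n_weeks)
        )
        wins[w] += 1
    return {w: count / n_sims for w, count in sorted(wins.items())}
```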