Daniel R. Jiang

I'm an AI Research Scientist at Meta, where I work on reinforcement learning and RLHF. I'm interested in building AI agents that can both plan over long horizons and acquire information to reduce epistemic uncertainty. Recent work:

An RLHF-trained LLM for generative AI ads that was deployed at scale at Meta, along with a paper studying the economic impact of RL post-training.
Conversational agents trained with online multi-turn RLHF from user signals to achieve multi-step outcomes.

Prior to Meta, I was an Assistant Professor at the University of Pittsburgh, and I received my Ph.D. in Operations Research and Financial Engineering from Princeton University.

News

Feb '26 Preprint (led by A. Samanta) on post-training via RL on debate outcomes.

Feb '26 Preprint (led by A. Samanta) on LLM error-localization under Thought-MDPs.

Nov '25 Posted a new technical note on Iterative PPO for multi-turn RLHF.

Sep '25 Our RLHF for generative ads work was in Jack Clark's Import AI newsletter.

Sep '25 Our RLHF for generative ads work was cited by the New York Times.

Sep '25 Our work on "RL by Freezing Slow States" to appear in Management Science.

Sep '25 Carbon-aware transformers work accepted to NeurIPS '25.

May '25 Aligned multi-objective optimization accepted to ICML '25.

Apr '25 Presented two works, offline MARL and personalized FedRL, at ICLR '25.

Aug '24 New preprint on large-batch adaptive experimentation using Bayesian MDPs.

Sep '24 Our NFL Big Data Bowl metric was productionized by NextGenStats and AWS.

Mar '24 Our team's project (led by Matt Chang) won the NFL Big Data Bowl.

Publications and Papers Under Review

RLHF RL

Improving Generative Ad Text on Facebook using Reinforcement Learning

Daniel R. Jiang*, Alex Nikulkov*, Yu-Chia Chen, Yang Bai, Zheqing Zhu

Submitted, 2026

We describe AdLlama, the first RL-trained LLM deployed within Facebook's advertising system. We use RL and historical ad click-through rates as a reward to fine-tune an LLM to generate high-performing ad text variations. A 10-week A/B test involving over 35,000 advertisers and 640,000 ad variations shows that this RL-trained model yields meaningful performance improvements for advertisers.

PDF arXiv Import AI Cited in NYT

RLHF RL

Aligning LLMs Toward Multi-Turn Conversational Outcomes Using Iterative RLHF

Daniel R. Jiang*, Ankur Samanta*, Jalaj Bhandari, Alexis Yang, Rémi Munos, Tyler Lu

Lifelong Agents Workshop at ICLR 2026

We study outcome-driven conversational RL, where an agent conducts multi-turn dialogues to achieve specific outcomes. We reduce the multi-turn problem to a single-turn RLHF problem by using a learned Q-function as the reward model. This approach conducts policy iteration in a sequence of offline batches. A primary advantage is that we can leverage widely available off-the-shelf single-turn RLHF tools to perform multi-turn RL.

PDF arXiv

RLHF RL

Self-Improvement of Language Models by Post-Training on Multi-Agent Debate

Ankur Samanta, Akshayaa Magesh, Youliang Yu, Runzhe Wu, Ayush Jain, Daniel R. Jiang, Boris Vidolov, Paul Sajda, Yonathan Efroni, Kaveh Hassani

Multi-Agent Learning for GenAI Workshop at ICLR 2026

We propose an RL approach that uses consensus signals arising from multi-agent debate to post-train LMs toward more reliable reasoning paths without any external supervision or rewards. Multi-agent consensus alignment (especially with DPO and KTO) delivers substantial improvements in self-consistency, reasoning, and generalization across unseen benchmarks.

PDF arXiv

RLHF

Structure Enables Effective Self-Localization of Errors in LLMs

Ankur Samanta, Akshayaa Magesh, Ayush Jain, Kavosh Asadi, Youliang Yu, Daniel R. Jiang, Boris Vidolov, Kaveh Hassani, Paul Sajda, Jalaj Bhandari, Yonathan Efroni

AI with Recursive Self-Improvement Workshop at ICLR 2026

We introduce Thought-ICS, a prompting framework that structures reasoning into discrete, semantically coherent thoughts to enable precise error localization. The model identifies the first erroneous step, backtracks from the last correct point, and achieves 20–40% self-correction lift with oracle verification while outperforming baselines in autonomous settings.

PDF arXiv

RL ADP

Frozen-State Value Iteration: Faster Reinforcement Learning by Freezing Slow States

Yijia Wang and Daniel R. Jiang

Management Science, 2025

We consider fast-slow MDPs, where certain states move "fast" while other parts of the state space transition more "slowly." This is common when decisions need to be made at high frequencies over long horizons, yet information that varies at a slower timescale also influences the optimal policy. We propose a new hierarchical value iteration algorithm based periodically "freezing" and then "releasing" slow states, leading to computational benefits.

PDF arXiv Slides

BO AE

Carbon Aware Transformers through Joint Model-Hardware Optimization

Irene Wang, Newsha Ardalani, Mostafa Elhoushi, Daniel R. Jiang, Samuel Hsia, Ekin Sumbul, Divya Mahajan, Carole-Jean Wu, Bilge Acun

NeurIPS 2025

We introduce CATransformers, a carbon-aware architecture search framework that that enables sustainability-driven co-optimization of Transformer-based models and hardware architecture that considers both operational and embodied carbon. The framework is based on multi-objective Bayesian optimization. Applied to CLIP, it yields CarbonCLIP variants that sustain accuracy and latency while lowering total footprint by up to 17%.

PDF arXiv

OPT

Aligned Multi-Objective Optimization

Yonathan Efroni, Ben Kretzu, Daniel R. Jiang, Jalaj Bhandari, Zheqing Zhu, Karen Ullrich

ICML 2025

Most multi-objective optimization research focuses on conflicting objectives and Pareto tradeoffs. However, recent work in multi-task learning, reinforcement learning, and LLMs suggests that related tasks can enhance performance across all objectives simultaneously. We introduce a scalable gradient-based method (AMOOO) with proven advantages in handling numerous compatible objectives.

PDF arXiv

Exploiting Structure in Offline Multi-Agent RL: The Benefits of Low Interaction Rank

Wenhao Zhan, Scott Fujimoto, Zheqing Zhu, Jason D. Lee, Daniel R. Jiang, Yonathan Efroni

ICLR 2025

We study the problem of learning an approximate equilibrium in offline multi-agent reinforcement learning (MARL). We introduce a structural assumption, the interaction rank, and show that utilizing function classes with low interaction rank leads to decentralized, computationally and statistically efficient learning in offline MARL. Our experiments show the potential of critic architectures with low interaction rank when used in TD3-BC.

PDF arXiv

On the Linear Speedup of Personalized Federated RL with Shared Representations

Guojun Xiong, Shufan Wang, Daniel R. Jiang, Jian Li

ICLR 2025

Federated reinforcement learning (FedRL) enables multiple agents to collaboratively learn a policy without sharing their own local trajectories collected during agent-environment interactions. We develop a class of personalized FedRL algorithms that learns (1) a shared feature representation collaboratively among all agents and (2) an agent-specific weight vector personalized to its local environment.

PDF arXiv

AE RL

Optimization-Driven Adaptive Experimentation

Ethan Che, Daniel R. Jiang, Hongseok Namkoong, Jimmy Wang

Submitted, 2024. Oral talks at ESIF Economics and AI+ML Meeting, CODE @ MIT.

We observe that real experiments are deliberately implemented with a few, large batches. By invoking a central limit approximation at each batch, we obtain a tractable Bayesian MDP that can flexibly incorporate a wide range of problem specifications, including batched and delayed feedback, personalization, non-stationarity, multiple objectives, and constraints. We call this the "mathematical programming" view of adaptive experimentation.

PDF arXiv Code

AExGym: Benchmarks and Environments for Adaptive Experimentation

Jimmy Wang, Ethan Che, Daniel R. Jiang, Hongseok Namkoong

Submitted, 2024

We present a benchmark for adaptive experimentation based on real-world datasets, highlighting prominent practical challenges to operationalizing adaptivity: non-stationarity, batched/delayed feedback, multiple outcomes and objectives, and external validity. We release an opensource library, AExGym, which is designed with modularity and extensibility in mind.

PDF arXiv Code

Pearl: A Production-Ready Reinforcement Learning Agent

Z. Zhu, R. de Salvo Braz, J. Bhandari, D. R. Jiang, Y. Wan, Y. Efroni, R. Xu, L. Wang, H. Guo, A. Nikulkov, D. Korenkevych, U. Dogan, F. Cheng, Z. Wu, W. Xu

Journal of Machine Learning Research, 2024

We introduce Pearl, a new open source library for reinforcement learning that aims to enable users to easily build versatile RL agent for real-world applications. Pearl is designed with modularity in mind, allowing researchers and practitioners to mix & match components for policy learning, exploration, safety, and history summarization for building practical RL agents.

PDF arXiv Code

RL ADP

Weakly Coupled Deep Q-Networks

Ibrahim El-Shar and Daniel R. Jiang

NeurIPS 2023

We introduce weakly coupled deep Q-networks and weakly coupled Q-learning, reinforcement learning methods designed for weakly coupled MDPs. We employ multiple, simultaneous DQN or Q-learning agents such that each runs on a separate easier subproblem and when combined, they form an upper bound on the action value of the original problem. These dynamic bounds then guide the primary agent toward the optimal policy.

PDF arXiv

BO RL

Dynamic Subgoal-Based Exploration via Bayesian Optimization

Yijia Wang, Matthias Poloczek, Daniel R. Jiang

Transactions on Machine Learning Research, 2023

We suppose that an agent faces an unknown RL task (drawn from a distribution of MDPs) in the future and is given prior opportunities to "practice" on related tasks where environment interactions are costly. We propose a one-step Bayes-optimal algorithm for selecting subgoal designs, along with the number of episodes and the episode length during training, to efficiently maximize the expected performance of the agent at test time.

PDF arXiv

On Noisy Evaluation in Federated Hyperparameter Tuning

Kevin Kuo, Pratiksha Thaker, Mikhail Khodak, John Nguyen, Daniel R. Jiang, Ameet Talwalkar, Virginia Smith

MLSys 2023

We perform a systematic study on the effect of noisy evaluation in federated hyperparameter tuning. We identify and rigorously explore key sources of noise, including client subsampling, data and systems heterogeneity, and data privacy. Surprisingly, our results indicate that even small amounts of noise can significantly impact tuning methods—reducing the performance of state-of-the-art approaches to that of naive baselines.

PDF arXiv

ADP RL

Dynamic Inventory Repositioning in On-Demand Rental Networks

Saif Benjafaar, Daniel R. Jiang, Xiang Li, and Xiaobo Li

Management Science, 2022

We consider a product rental network with a fixed number of rental units distributed across multiple locations. We show convexity of the value function and that the optimal policy can be described in terms of a well-specified region over the state space. We leverage these results in an infinite-horizon, cutting-plane-based ADP algorithm and prove its asymptotic optimality, improving upon previous convergence results in the literature.

PDF Supplement SSRN Slides Code

Interpretable Personalized Experimentation

Han Wu, Sarah Tan, Weiwei Li, Mia Garrard, Adam Obeng, Drew Dimmery, Shaun Singh, Hanson Wang, Daniel R. Jiang, Eytan Bakshy

KDD 2022

We present a scalable, interpretable personalized experimentation system, implemented and deployed in production at Meta. The system works in a multiple treatment, multiple outcome setting typical at Meta to (1) learn explanations for black-box HTE models; (2) generate interpretable personalized policies.

PDF arXiv

BO RL

Multi-Step Budgeted Bayesian Optimization with Unknown Evaluation Costs

Raul Astudillo, Daniel R. Jiang, Max Balandat, Eytan Bakshy, Peter Frazier

NeurIPS 2021

Most Bayesian optimization algorithms ignore how evaluation costs, which are often unknown, may change over the optimization domain. An unknown cost function with a budget constraint introduces a new dimension to the exploration-exploitation trade-off, where learning about the cost incurs the cost itself. We propose a new dynamic programming-based acquisition function for this problem setting.

PDF arXiv Slides Code

BO RL

Efficient Nonmyopic Bayesian Optimization via One-Shot Multi-Step Trees

Shali Jiang*, Daniel R. Jiang*, Max Balandat*, Brian Karrer, Jacob R. Gardner, Roman Garnett

NeurIPS 2020

Bayesian optimization is a sequential decision making framework for optimizing expensive-to-evaluate black-box functions. Computing a full lookahead policy amounts to solving a stochastic dynamic program, which is highly intractable. Instead, we propose a multi-step scenario tree formulation and a one-shot optimization approach that operates by differentiating through the entire decision tree.

PDF Supplement arXiv Code

Lookahead-Bounded Q-Learning

Ibrahim El-Shar and Daniel R. Jiang

ICML 2020

We introduce the lookahead-bounded Q-learning (LBQL) algorithm, a new, provably convergent variant of Q-learning that seeks to make better use of collected experience through the use of noisy "lookahead" upper and lower bounds that constrain the Q-iterates. The algorithm operates via a "feedback loop" by using approximate Q-values to estimate bounds and subsquently using those bounds to improve the Q-values (and repeat).

PDF Supplement arXiv Slides Code

BO AE

BoTorch: A Framework for Efficient Monte-Carlo Bayesian Optimization

Maximilian Balandat, Brian Karrer, Daniel R. Jiang, Samuel Daulton, Benjamin Letham, Andrew Gordon Wilson, Eytan Bakshy

NeurIPS 2020

We introduce BoTorch, a modern programming framework for Bayesian optimization in PyTorch, along with a new "one-shot" approach to optimizing the Knowledge Gradient acquisition function. Bayesian optimization provides sample-efficient global optimization for a broad range of applications, including automatic machine learning, molecular chemistry, and experimental design.

PDF Supplement arXiv Code

ADP RL

Optimistic Monte Carlo Tree Search with Sampled Information Relaxation Dual Bounds

Daniel R. Jiang, Lina Al-Kanj, and Warren B. Powell

Operations Research, 2020

MCTS is a well-known strategy for solving sequential decision problems, particularly in the area of game-play AI. We propose a new technique called Primal-Dual MCTS that utilizes sampled information relaxation (Brown et. al., 2010) bounds on potential actions in order to make tree expansion decisions. The approach shows promise when used to optimize the behavior of a driver navigating a graph while operating on a ride-sharing platform.

PDF arXiv Slides

RL ADP

Feedback-Based Tree Search for Reinforcement Learning

Daniel R. Jiang, Emmanuel Ekwedike, and Han Liu

ICML 2018

We describe a technique that iteratively applies MCTS on batches of small, finite-horizon versions of the original infinite-horizon MDP. We show that a deep neural network implementation of the technique can create a competitive AI agent for a popular multi-player online battle arena (MOBA) game.

PDF Supplement arXiv Slides Selected for Long Talk (~8.6%)

ADP

Shape Constraints in Economics and Operations Research

Andrew L. Johnson and Daniel R. Jiang

Statistical Science, 2018

This paper reviews an illustrative set of research on shape constrained estimation in the economics and operations research literature. We highlight the methodological innovations and applications, with a particular emphasis on utility functions, production economics, and sequential decision making applications.

PDF Special Issue Editorial

ADP RL

Risk-Averse Approximate Dynamic Programming with Quantile-Based Risk Measures

Daniel R. Jiang and Warren B. Powell

Mathematics of Operations Research, 2018

We propose a new Q-learning algorithm and a companion sampling procedure to solve risk-averse Markov decision processes under a class of dynamic quantile-based risk measures. Convergence results are proven and an application to energy storage is shown.

PDF Supplement arXiv Slides Editor's Pick, INFORMS ICYMI

ADP RL

An Approximate Dynamic Programming Algorithm for Monotone Value Functions

Daniel R. Jiang and Warren B. Powell

Operations Research, 2015

We describe a provably convergent algorithm to exploit the structural property of monotonicity that arises in many applications in operations research, finance, and economics. We show via simulations that near optimal solutions can be obtained using the proposed method when the exact approach is computationally intractable.

PDF Supplement arXiv Slides

ADP RL

Optimal Hour-Ahead Bidding in the Real-Time Electricity Market with Battery Storage using Approximate Dynamic Programming

Daniel R. Jiang and Warren B. Powell

INFORMS Journal on Computing, 2015

We formulate a mathematical model for bidding in the real-time market with the goal of performing energy arbitrage (i.e., exploiting variations in spot prices to profit) in the presence of storage. We train and test an approximate dynamic programming policy on real spot price data from the NYISO and show its value over heuristic policies used in industry.

PDF Supplement arXiv Slides

Teaching and Lecture Notes

IE 3186

Approximate Dynamic Programming/Reinforcement Learning, Ph.D. Level

Instructor, Spring 2017, Fall 2018

ADP refers to a broad set of computational methods used for finding approximately optimal policies of intractable sequential decision problems (MDPs). We'll begin with an overview of classical methods and transition to a survey of state-of-the-art developments. The lectures will focus on mathematical proofs and underlying theory while the course project will allow students practice with numerical implementations.

Lecture Notes

IE 2186

Reinforcement Learning, Master's Level

Instructor, Summer 2018

This is an introductory course on reinforcement learning (RL). The basics of MDPs necessary for RL will be covered, along with a wide range of methods (e.g., TD learning, Q-learning, policy gradients) that perform evaluation and control. The focus in this course will be on applications, implementation, intuition, and some theory.

Lecture Notes

IE 1086

Decision Models, Undergraduate/Master's Level

Instructor, Fall 2016-Fall 2021 (5 times)

Decision making is the key in understanding a variety of problems in industry, including inventory control, revenue management, pricing, energy, healthcare, logistics, and finance. In this course, we focus on stochastic decision models (i.e., "decision making under uncertainty") and discuss the fundamental methodology/models in conjunction with applications to real world problems.

IE 1086

Ride-sharing Analytics Game, Undergraduate/Master's Level

Instructor, Fall 2018

This was an event designed for the Decision Models course, where teams of students (1) analyze a data set, (2) design a pricing, manufacturing, advertising, and repositioning strategy to operate a ride-sharing company, and (3) compete in a live competition where decisions are submitted periodically to a simulator.

Project Description Data

Personal Projects

Uncovering Missed Tackle Opportunities

Kaggle NFL Big Data Bowl 2024 Winner

With Matthew Chang, Katherine Dai, and Harvey Cheng, we won the Kaggle NFL Big Data Bowl competition after presenting at the 2024 NFL Combine in Indianapolis. We proposed the "Missed Tackle Opportunity" metric based on a tackle probability prediction model. The metric was productionized and is now part of NFL's NextGenStats.

Paper Talk NFL Press Release NextGenStats AWS Blog

What Would I Say?

Best Facebook Integrated Hack, HackPrinceton 2013

At HackPrinceton 2013, we used Markov chains to simulate a user's social media posts, leading to funny new variations of existing content. The site went viral shortly after launch, generating over 17 million page views and 9 million unique users, and was covered by numerous media outlets.

New Yorker CNN

Simulating Fantasy Football Schedules

HackPrinceton 2014

At Hack Princeton 2014, we made an app to quantify the role of luck in Yahoo! Fantasy Football by generating probability distributions of your record over randomized season schedules.

Website