2024 Theses Doctoral

# Horizon-Free Reinforcement Learning in Dynamic Optimization

This thesis intersects the disciplines of operations research (OR) and reinforcement learning (RL), focusing on sequential decision-making under uncertainty. It explores the complex interplay between estimation, learning, and decision-making within structured operational contexts. This work revisits two pivotal OR challenges: optimal stopping and dynamic pricing. Surprisingly, even canonical models in these areas exhibit rich and unexpected behaviors that suggest a lower complexity of learning than traditionally anticipated.

The principal contributions of this thesis underscore three phenomena: a horizon-free characteristic that influences sample complexity for learning and mitigates the propagation of estimation errors; an intriguing revelation that decision-making with heavier-tailed distributions may not be more challenging, and in some cases, is ``easier'' than their lighter-tailed counterparts; and a notable asymmetry in the impact of estimation errors where overestimations are potentially more damaging than underestimations in some highly-structured problems.

Chapter 1 introduces the thesis by reviewing existing work in sample-efficient reinforcement learning (RL), highlighting gaps particularly relevant to structured problems in operations research. We discuss the inadequacy of traditional sample complexity bounds in addressing the unique features of structured decision problems and the inefficacy of complex algorithms in reflecting the intrinsic hardness of these problems. The chapter sets the stage for addressing these challenges, proposing a focused exploration of how structured problems may allow for more efficient learning strategies that significantly deviate from conventional RL approaches. This provides the motivation for the subsequent detailed investigations in the later chapters.

Chapter 2 considers a discounted infinite horizon optimal stopping problem with i.i.d. samples. If the underlying distribution is known a priori, the solution of this problem is obtained via dynamic programming (DP) and is given by a well known threshold rule. When information on this distribution is lacking, the challenge is to leverage this structural property with a suitable learning algorithm. A natural approach is “explore-then-exploit,” whereby the unknown distribution or its parameters are estimated over an initial exploration phase, and this estimate is then used in the DP to determine actions over the residual exploitation phase. While common wisdom suggests that solutions to the Bellman equation can be quite sensitive even to “small” estimation errors, we show that collecting an amount of data that scales only logarithmically in the “effective horizon” is sufficient to support near optimal solutions in the case of optimal stopping. Moreover, we provide a sharp threshold for this sample complexity, in the sense that any number of samples which is of lower order is catastrophic for the decision-maker, with guaranteed loss of at least half of the optimal value. Surprisingly, the length of the exploration horizon required to support near-optimal performance is smaller in problems where the underlying distribution have tails that decrease more slowly. This is especially pronounced when the tails are heavy where a single sample exploration phase suffices. Additionally, our research has revealed a notable asymmetry between the impacts of underestimation versus overestimation, with the former consistently resulting in less severe consequences than the latter.

Chapter 3 explores the problem of discrete-time, finite-horizon dynamic pricing for individual products, emphasizing the perturbation analysis of optimal pricing policies. This chapter examines how inaccuracies in estimating customer demand—specifically, their willingness to pay—affect revenue optimization, with a particular focus on how these effects scale with the horizon length. Our findings reveal a significant first-order relationship where the ratio of relative regret in revenue to the perturbation on the price is directly related to the hazard rate of the distribution governing willingness-to-pay. Impressively, this relationship demonstrates a logarithmic scale in relation to the length of the planning horizon. This finding becomes even more pertinent when considering heavy-tailed distributions, as a bounded hazard rate implies an extraordinary robustness in pricing strategies, which is remarkably independent of the planning horizon. Furthermore, it's particularly noteworthy that the relative regret associated with dynamic pricing decisions is not impacted by the scale of the total inventory, even in scenarios where inventory size is directly scaling with the horizon length. This can be emphasized as: Statistically, the problem of dynamic pricing over a long horizon with a large initial inventory is almost as straightforward as solving a single-period estimation problem for the same willingness-to-pay distribution.

This thesis elucidates the perturbation effects in optimal stopping and dynamic pricing while aiming to expand our understanding of the ``horizon-free'' phenomenon observed in these dynamic optimization studies. It prompts further exploration into the structural factors responsible for reduced sample complexity relative to the horizon length. Additionally, the work anticipates defining an ``intrinsic horizon length'' as a complexity measure that captures the inherent difficulty of multi-stage, long-horizon dynamic optimization problems.

## Subjects

## Files

*This item is currently under embargo. It will be available starting 2025-07-16.*

## More About This Work

- Academic Units
- Business
- Thesis Advisors
- Zeevi, Assaf J.
- Russo, Daniel J.
- Degree
- Ph.D., Columbia University
- Published Here
- September 25, 2024