2025 Theses Doctoral
Reinforcement Learning for Continuous-Time Linear-Quadratic Control and Mean-Variance Portfolio Selection: Regret Analysis and Empirical Study
This thesis explores continuous-time reinforcement learning (RL) for stochastic control through two intimately related problems: mean-variance (MV) portfolio selection and linear-quadratic (LQ) control. For the former, we investigate markets where stock prices are diffusion processes driven by observable factors that are themselves diffusion processes, while the coefficients of all these processes are unknown. Based on the recently developed RL theory for diffusion processes, we present data-driven algorithms that learn the pre-committed investment strategies directly, without attempting to learn or estimate the market coefficients. For multi-stock Black–Scholes markets without factors, we develop a baseline algorithm and prove its performance guarantee by deriving a sublinear regret bound in terms of the Sharpe ratio.
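For context, one standard continuous-time pre-committed MV formulation (the thesis's exact market model and regret definition may differ) is

    \min_{u}\ \mathrm{Var}(X_T^{u}) \quad \text{subject to} \quad \mathbb{E}[X_T^{u}] = z,

where X^{u} is the wealth process under strategy u, T is the investment horizon, and z is a prescribed target for the expected terminal wealth. In a Black–Scholes market the optimal value of this problem is determined by the market's Sharpe ratio, which is why the regret of a learning algorithm can naturally be measured through that quantity.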
To optimize performance and facilitate real-world application, we further adapt the baseline algorithm into four variants. These enhancements incorporate techniques such as real-time online learning, offline pre-training, and mechanisms for managing leverage constraints and trading frequency. We then conduct a comprehensive empirical study comparing our RL algorithms against fifteen established portfolio allocation strategies on S&P 500 constituent data. The study employs multiple performance metrics, including annualized returns, variants of the Sharpe ratio, maximum drawdown, and recovery time. The results demonstrate that our continuous-time RL strategies are consistently among the best performers, especially in a volatile bear market, and outperform their model-based continuous-time counterparts by significant margins.
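As an illustration of the metrics listed above, a minimal computation from a daily return series might look as follows; the exact conventions (compounding, risk-free rate, trading calendar, recovery-time definition) used in the empirical study may differ.

    # Illustrative performance metrics from a daily return series (not the
    # thesis's exact implementation).
    import numpy as np

    def performance_metrics(daily_returns, periods_per_year=252, risk_free=0.0):
        r = np.asarray(daily_returns, dtype=float)
        wealth = np.cumprod(1.0 + r)                    # cumulative wealth curve
        years = len(r) / periods_per_year
        ann_return = wealth[-1] ** (1.0 / years) - 1.0  # annualized (geometric) return
        excess = r - risk_free / periods_per_year
        sharpe = np.sqrt(periods_per_year) * excess.mean() / excess.std(ddof=1)
        running_max = np.maximum.accumulate(wealth)
        drawdown = wealth / running_max - 1.0
        max_drawdown = drawdown.min()                   # most negative peak-to-trough loss
        trough = int(np.argmin(drawdown))
        recovered = np.where(wealth[trough:] >= running_max[trough])[0]
        recovery_days = int(recovered[0]) if recovered.size else None
        return {"annualized_return": ann_return, "sharpe": sharpe,
                "max_drawdown": max_drawdown, "recovery_days": recovery_days}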
We next study RL for a class of continuous-time LQ control problems for diffusions, in which the states are scalar-valued, running control rewards are absent, and the volatilities of the state processes depend on both the state and control variables. We apply a model-free approach that relies neither on knowledge of the model parameters nor on their estimates, and devise an actor–critic algorithm to learn the optimal policy parameter directly. Our main contributions include the introduction of an exploration schedule and a regret analysis of the proposed algorithm. We provide the convergence rate of the policy parameter to the optimal one, and prove that the algorithm achieves a regret bound of 𝑂(𝑁³/⁴) up to a logarithmic factor, where 𝑁 is the number of learning episodes. We conduct a simulation study to validate the theoretical results and demonstrate the effectiveness and reliability of the proposed algorithm. We also perform numerical comparisons between our method and recent model-based stochastic LQ RL methods adapted to the state- and control-dependent volatility setting, demonstrating the better performance of the former in terms of regret.
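To fix ideas, a scalar LQ problem with state- and control-dependent volatility can be sketched as follows; the dynamics, objective, and regret definition below are generic placeholders rather than the thesis's exact formulation:

    dX_t = (A X_t + B u_t)\,dt + (C X_t + D u_t)\,dW_t, \qquad X_0 = x_0,
    \text{maximize} \quad J(u) = \mathbb{E}\Big[\int_0^T -\tfrac{Q}{2} X_t^2\,dt - \tfrac{H}{2} X_T^2\Big],

with no running cost on the control itself. A common notion of regret over 𝑁 learning episodes is \mathrm{Regret}(N) = \sum_{n=1}^{N} \big(J(u^{*}) - J(u^{(n)})\big), which the thesis shows grows at most like 𝑁³/⁴ up to a logarithmic factor.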
Along a different direction, we present a policy gradient-based actor–critic algorithm featuring adaptive exploration in both the actor and the critic. Specifically, both the variance of the stochastic policy (actor) and the temperature parameter (critic) decrease over time according to certain schedules. In particular, endogenizing the temperature parameter reduces the need for manual tuning. Despite this added flexibility, the algorithm maintains the same sublinear regret bound of 𝑂(𝑁³/⁴) as achieved with deterministic schedules. In the numerical experiments, we evaluate the convergence rate and regret bound of the proposed algorithm, with results aligning closely with our theoretical findings.
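A minimal sketch of the kind of decaying schedules meant here, with hypothetical constants and decay exponents (the thesis specifies the rates required for the 𝑂(𝑁³/⁴) bound):

    # Hypothetical polynomially decaying exploration schedules: the Gaussian
    # policy's standard deviation (actor) and the entropy-regularization
    # temperature (critic) both shrink as learning progresses.
    def exploration_schedules(episode, sigma0=1.0, temp0=1.0, a=0.25, b=0.25):
        n = max(episode, 1)
        sigma_n = sigma0 * n ** (-a)   # actor: stochastic-policy std. deviation
        temp_n = temp0 * n ** (-b)     # critic: temperature parameter
        return sigma_n, temp_n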
Files
This item is currently under embargo. It will be available starting 2029-12-05.
More About This Work
- Academic Units
- Industrial Engineering and Operations Research
- Thesis Advisors
- Zhou, Xunyu
- Degree
- Ph.D., Columbia University
- Published Here
- December 18, 2024