2024 Theses Doctoral

# Data-dependent Regret Bounds for Adversarial Multi-Armed Bandits and Online Portfolio Selection

This dissertation studies π·ππ‘π-π·ππππππππ‘ regret bounds for two online learning problems. As opposed to worst-case regret bounds, data-dependent bounds are able to adapt to the particular sequence of losses seen by the player. Thus, they offer a more fine grained performance guarantee compared to worst-case bounds.

We start off with the Adversarial π-Armed Bandit problem. In prior literature it was a standard practice to assume that the loss vector belonged to a known domain, typically [0,1]βΏ or [-1,1]βΏ. We make no such assumption on the loss vectors, they may be completely arbitrary. We term this problem the Scale-Free Adversarial Multi Armed Bandit. At the beginning of the game, the player only knows the number of arms π. It does not know the scale and magnitude of the losses chosen by the adversary or the number of rounds π. In each round, it sees bandit feedback about the loss vectors πβ, . . . , π_π β² ββΏ. Our goal is to bound its regret as a function of π and norms of πβ, . . . , π_π . We design a bandit Follow The Regularized Leader (FTRL) algorithm, that uses a log-barrier regularizer along with an adaptive learning rate tuned via the AdaFTRL technique. We give two different regret bounds, based on the exploration parameter used. With non-adaptive exploration, our algorithm has a regret of πΜ(βππΏβ + πΏ_ββππ) and with adaptive exploration, it has a regret of π(βππΏβ + πΏββππΏβ). Here πΏβ = sup_π‘ β₯π_π‘β₯_β, πΏβ = πΊα΅_π‘ββ β₯π_π‘β₯Β²β, πΏβ = πΊα΅_π‘ββ β₯π_π‘β₯β and the πΜ notation suppress logarithmic factors. These are the first MAB bounds that adapt to the β₯γ»β₯β, β₯γ»β₯β norms of the losses. The second bound is the first data-dependent scale-free MAB bound as π does not directly appear in the regret. We also develop a new technique for obtaining a rich class of local-norm lower-bounds for Bregman Divergences. This technique plays a crucial role in our analysis for controlling the regret when using importance weighted estimators of unbounded losses.

Next, we consider the Online Portfolio Selection (OPS) problem over π assets and π time periods. This problem was first studied by Cover [1], who proposed the Universal Portfolio (UP) algorithm. UP is a computationally expensive algorithm with minimax optimal regret of π(π log π). There has been renewed interest in OPS due to a recently posed open problem Van Erven ππ‘ ππ. [2] which asks for a computationally efficient algorithm that is also has minimax optimal regret. We study data-dependent regret bounds for OPS problem that adapt to the sequence of returns seen by the investor. Our proposed algorithm called AdaCurv ONS modifies the Online Newton Step(ONS) algorithm of [3] using a new adaptive curvature surrogate function for the log losses β log(π_π‘α΅π€). We show that the AdaCurv ONS algorithm has π(π
πππππ) regret where π
is the data-dependent quantity. For sequences where π
=π(1), the regret of AdaCurv ONS matches the optimal regret. However, for some sequences π
could be unbounded, making the regret bound vacuous. To overcome this issue, we propose the LB-AdaCurv ONS algorithm that adds a log-barrier regularizer along with an adaptive learning rate tuned via the AdaFTRL technique. LB-AdaCurv ONS has an adaptive regret of the form π(min(π
log π, βππ log π)). Thus, LB-AdaCurv ONS has a worst case regret of π(βππ log π) while also having a data-dependent regret of π(ππ
log π) when π
= π(1). Additionally, we show logarithmic First-Order and Second-Order regret bounds for AdaCurv ONS and LB-AdaCurv ONS.

Finally, we consider the problem of Online Portfolio Selection (OPS) with predicted returns. We are the first to extend the paradigm of online learning with predictions to the portfolio selection problem. In this setting, the investor has access to noisy predictions of returns for the π assets that can be incorporated into the portfolio selection process. We propose the Optimistic Expected Utility LB-FTRL (OUE-LB-FTRL) algorithm that incorporates the predictions using a utility function into the LB-FTRL algorithm. We explore the consistency-robustness properties for our algorithm. If the predictions are accurate, OUE-LB-FTRL's regret is π(π log π), providing a consistency guarantee. Even if the predictions are arbitrary, OUE-LB-FTRL's regret is always bounded by π(βππ log π) providing a providing a robustness guarantee. Our algorithm also recovers a Gradual-variation regret bound for OPS. In the presence of predictions, we argue that the benchmark of static-regret becomes less meaningful. So, we consider the regret with respect to an investor who only uses predictions to select their portfolio (i.e., an expected utility investor). We provide a meta-algorithm called Best-of-Both Worlds for OPS (BoB-OPS), that combines the portfolios of an expected utility investor and a purely regret minimizing investor using a higher level portfolio selection algorithm. By instantiating the meta-algorithm and the purely regret minimizing investor with Cover's Universal Portfolio, we show that the regret of BoB-OPS with respect to the expected utility investor is π(log π). Simultaneously, BoB-OPS's static regret is π(π log π). This achieves a stronger form of consistency-robustness guarantee for OPS with predicted returns.

## Subjects

## Files

- Putta_columbia_0054D_18858.pdf application/pdf 614 KB Download File

## More About This Work

- Academic Units
- Industrial Engineering and Operations Research
- Thesis Advisors
- Agrawal, Shipra
- Degree
- Ph.D., Columbia University
- Published Here
- October 16, 2024