2019 Theses Doctoral
Data-Driven Combinatorial Optimization and Efficient Machine Learning Frameworks
Contemporary research in building optimization models and designing algorithms has become more data-centric and application-specific. While addressing three problems in the fields of combinatorial optimization and machine learning (ML), this work highlights the value of making data an important driver for building frameworks and designing solutions.
In Chapter 2, we look into building frameworks for data-intensive applications, such as ML algorithms. We introduce a family of matrices, Structured Spinners, and use these to perform input data projections. This operation is commonly needed for a plethora of ML algorithms such as dimension reduction, cross-polytope LSH techniques or kernel approximation, yet comprises a major bottleneck in terms of time and space complexity. We design a generic framework that speeds up ML algorithms which perform such projections, with no or minor loss in accuracy. Our method applies for projections in both randomized and learned settings. We confirm our results via theoretical guarantees and numerical experiments. Moreover, we are the first to provide theoretical results for that type of work under adaptive settings.
In Chapter 3, we rely on empirical analysis to design algorithms for an online packet scheduling problem with weighted packets and agreeable deadlines. We adopt the model presented in Jez et al. and use their Modified Greedy MG algorithm, which was the best-performing one in the literature, as our benchmark. The technique used for proving the competitive ratio for this benchmark is worst case analysis. Via extensive simulations, we shed light on practical bottlenecks and observe that these are not captured by the competitive analysis. We design three new algorithms, two of which make deterministic online decisions, while the third learns an adaptive one. When contrasted against the MG benchmark, our algorithms have significantly worse competitive ratios, yet noticeably better empirical performance. Our methodology is particularly useful for online algorithms and underlines the significance of leveraging data or simulations to guide algorithm design and testing.
In Chapter 4, we study two different applications in cybersecurity: an adaptive ML problem and a game-theoretic model called PLADD. The common objective between both problems is to protect cybersystems against attacks by intelligent, adaptable and well-resourced adversaries while maintaining a cost budget. We introduce a novel combinatorial scheduling formulation to design useful defense strategies that meet this goal. Our work separates the formulation from the data-driven analysis and solution. As such, the scheduling formulation, which does not resemble any previously studied formulations from the scheduling literature, may be used as a new model by other researchers for different motivations. We keep the model generic enough for others to use, but design the algorithms that work best for our context. The formulation is inspired by stochastic programming and cast as a mixed integer program (MIP). We provide theoretical analysis, e.g. explore integrality gaps, exploit the combinatorial structure, prove NP-hardness, develop dynamic programming solutions for two-machine cases, then work towards data-driven heuristics and approximation algorithms using distribution assumptions and real data.
This item is currently under embargo. It will be available starting 2023-02-02.
More About This Work
- Academic Units
- Industrial Engineering and Operations Research
- Thesis Advisors
- Stein, Clifford S.
- Ph.D., Columbia University
- Published Here
- October 28, 2019