2019 Theses Doctoral
Machine Learning Based User Modeling for Enterprise Security and Privacy Risk Mitigation
Modern organizations face a host of security concerns despite advances in security research. The challenges are diverse, ranging from malicious parties to vulnerable hardware. One particularly strong pain point for enterprises is the insider threat detection problem, in which an internal employee, current or former, acts against the interest of the company. Approaches designed to discourage and prevent insiders are multifaceted, but efforts to detect malicious users typically involve a combination of an active monitoring infrastructure and a User Behavior Analytics (UBA) system, which applies Machine Learning (ML) algorithms to learn user behavior and identify abnormal behaviors indicative of a security violation. The principal problem with this approach is the uncertainty regarding how to measure the functionality of an insider threat detection system. The difficulty of research in UBA technology stems from sparse knowledge about the models used and insufficient data to study the problem effectively. Realistic ground truth data is next to impossible to acquire for open research. This dissertation tackles those challenges and asserts that predictive UBA models can be applied to simulate a wide range of user behaviors in situ and can be broadened to examine test regimes of deployed UBA technology (including evasive low-and-slow malicious behaviors) without disclosing private and sensitive information. Furthermore, the underlying technology presented in this thesis can increase data availability through a combination of generative adversarial networks, which create realistic yet fake data, and the system log files created by the technology itself.
Given the commercial viability of UBA technology, academic researchers are often unable to test on widely deployed, proprietary software and thus must rely on standard ML-based approaches such as Gaussian Mixture Models (GMMs), Support Vector Machines (SVMs) and Bayesian Networks (BNs) to emulate UBA systems. We begin the dissertation with the introduction and implementation of CovTrain, the first neuron-coverage-guided training algorithm that improves the robustness of Deep Learning (DL) systems. CovTrain is tested on a variety of massive, well-tested datasets and outperforms standard DL models in terms of both loss and accuracy. We then use it to create an enhanced DL-based UBA system for our formal experimental studies.
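As a point of reference, neuron coverage is commonly defined as the fraction of neurons in a network that activate above a threshold on at least one input from a test set. The sketch below computes that metric for a single dense ReLU layer. It is not the CovTrain algorithm, which this abstract does not specify; the layer weights, the threshold and all function names are hypothetical and chosen only for illustration.

```python
def relu(x):
    """Rectified linear unit."""
    return x if x > 0.0 else 0.0


def layer_activations(weights, biases, inputs):
    """Activations of one dense ReLU layer; weights[j] is the weight row of neuron j."""
    return [
        relu(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def neuron_coverage(weights, biases, dataset, threshold=0.0):
    """Fraction of neurons whose activation exceeds `threshold` on at least one input."""
    covered = set()
    for inputs in dataset:
        for j, activation in enumerate(layer_activations(weights, biases, inputs)):
            if activation > threshold:
                covered.add(j)
    return len(covered) / len(weights)


# Hypothetical 2-input, 4-neuron layer (values are illustrative only).
WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0], [1.0, 1.0]]
BIASES = [0.0, 0.0, 0.0, -3.0]

# A narrow dataset exercises few neurons; a more varied one exercises more.
narrow = [[1.0, 0.0]]
broader = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0], [2.0, 2.0]]

if __name__ == "__main__":
    print("narrow coverage:", neuron_coverage(WEIGHTS, BIASES, narrow))
    print("broader coverage:", neuron_coverage(WEIGHTS, BIASES, broader))
```

A coverage-guided procedure would presumably use a signal like this to steer training or input selection toward under-exercised neurons, but how CovTrain does so is described in the dissertation itself rather than here.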
However, the challenges of measuring and testing a UBA system remain open problems in both the academic and commercial communities. With those thoughts in mind, we next present the design, implementation and evaluation of the Bad User Behavior Analytics (BUBA) system, the first framework of its kind to test UBA systems through the iterative introduction of adversarial examples using simulated user bots. The framework's flexibility enables it to tackle an array of problems, including enterprise security at both the system and cloud storage levels. We test BUBA both in a synthetic environment, with UBA systems that employ state-of-the-art ML models including an enhanced DL model trained using CovTrain, and on the live Columbia University network. The results show the ability to generate synthetic users that can successfully fool UBA systems at the boundaries. In particular, we find that adjusting the time horizon of a given attack can help it escape UBA detection, and in live tests on the Columbia network we find that SSH attacks could be carried out without detection if the time parameter is carefully adjusted. This may be considered an example of Adversarial ML, in which temporal test data is modified to evade detection. We then consider a novel extension of BUBA to test cloud storage security, motivated by the observation that large enterprises are not actively monitoring their cloud storage; in recent surveys, security personnel report fearing that companies are moving to the cloud faster than they can secure it. We believe that there are opportunities to improve cloud storage security, especially given the increasing trend towards cloud utilization. BUBA is intended to reveal potential security violations and highlight what security mechanisms are needed to prevent significant data loss.
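The effect of the time horizon can be illustrated with a toy detector. The sketch below assumes a hypothetical rule that raises an alert when more than a fixed number of events fall inside a sliding time window; it is not the detection logic of BUBA or of any UBA system evaluated in the dissertation. The window length, threshold and function names are all assumptions made for illustration.

```python
from collections import deque


def max_events_in_window(timestamps, window_seconds):
    """Largest number of events that fall inside any sliding window of the given length."""
    window = deque()
    peak = 0
    for t in sorted(timestamps):
        window.append(t)
        # Drop events that are a full window or more older than the newest event.
        while t - window[0] >= window_seconds:
            window.popleft()
        peak = max(peak, len(window))
    return peak


def is_flagged(timestamps, window_seconds=60, threshold=5):
    """Hypothetical rule: alert when more than `threshold` events share one window."""
    return max_events_in_window(timestamps, window_seconds) > threshold


def schedule_attempts(count, spacing_seconds):
    """Evenly spaced event times, in seconds from the start of the activity."""
    return [i * spacing_seconds for i in range(count)]


# The same 20 attempts, first as a burst and then stretched over a longer horizon.
burst = schedule_attempts(20, 1)
slow = schedule_attempts(20, 30)

if __name__ == "__main__":
    print("burst flagged:", is_flagged(burst))
    print("slow flagged:", is_flagged(slow))
```

Under these assumptions the burst trips the rule while the stretched schedule never puts more than two events in one window, which is the kind of low-and-slow boundary behavior the paragraph above describes. Real UBA models are far richer than a single rate threshold, so this shows only the intuition.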
In spite of these advances, the development of BUBA underscores yet another difficulty for a researcher in big data analytics for security: a scarcity of data. Insider threat system development requires granular details about the behaviors of the individuals in its local ecosystem in order to discern anomalous patterns or behaviors. Deep Neural Networks (DNNs) have allowed researchers to discover patterns that were never seen before, but they require large datasets. Thus, systematic data generation through techniques such as Generative Adversarial Networks (GANs) has become ubiquitous in the face of increased data needs for scientific research, and was employed in part for BUBA. Through the first legal analysis of its kind, we test the legality of sharing synthetic data given privacy requirements. An analysis of statutes through different lenses helps us determine that synthetic data may be the next, best step for research advancement. We conclude that realistic yet artificially generated data offers a tangible path forward for academic and broader research endeavors, but policy must catch up with technological advances before general adoption can take place.
- Dutta_columbia_0054D_15188.pdf (application/pdf, 11.1 MB)
More About This Work
- Academic Units: Computer Science
- Thesis Advisors: Stolfo, Salvatore
- Degree: Ph.D., Columbia University
- Published Here: April 25, 2019