Academic Commons Search Results
http://academiccommons.columbia.edu/catalog.rss?f%5Bdepartment_facet%5D%5B%5D=Industrial+Engineering+and+Operations+Research&f%5Bsubject_facet%5D%5B%5D=Computer+science&q=&rows=500&sort=record_creation_date+desc
Academic Commons Search Resultsen-usRanking Algorithms on Directed Configuration Networks
http://academiccommons.columbia.edu/catalog/ac:189652
Chen, Ningyuanhttp://dx.doi.org/10.7916/D8J38RX8Mon, 28 Sep 2015 12:08:56 +0000In recent decades, complex real-world networks, such as social networks, the World Wide Web, financial networks, etc., have become a popular subject for both researchers and practitioners. This is largely due to the advances in computing power and big-data analytics. A key issue of analyzing these networks is the centrality of nodes. Ranking algorithms are designed to achieve the goal, e.g., Google's PageRank. We analyze the asymptotic distribution of the rank of a randomly chosen node, computed by a family of ranking algorithms on a random graph, including PageRank, when the size of the network grows to infinity.
We propose a configuration model generating the structure of a directed graph given in- and out-degree distributions of the nodes. The algorithm guarantees the generated graph to be simple (without self-loops and multiple edges in the same direction) for a broad spectrum of degree distributions, including power-law distributions. Power-law degree distribution is referred to as scale-free property and observed in many real-world networks. On the random graph G_n=(V_n,E_n) generated by the configuration model, we study the distribution of the ranks, which solves
R_i=∑ _{j: (j,i) ∈ E_n} (C_jR_j +Q_i)
for all node i, some weight C_i and personalization value Q_i.
We show that as the size of the graph n → ∞, the rank of a randomly chosen node converges weakly to the endogenous solution of the
R =^D ∑ _{i=1}^N (C_iR_i + Q),
where (Q, N, {C_i}) is a random vector and {R_i} are i.i.d. copies of R, independent of (Q, N,{C_i}). This main result is divided into three steps. First, we show that the rank of a randomly chosen node can be approximated by applying the ranking algorithm on the graph for finite iterations. Second, by coupling the graph to a branching tree that is governed by the empirical size-biased distribution, we approximate the finite iteration of the ranking algorithm by the root node of the branching tree. Finally, we prove that the rank of the root of the branching tree converges to that of a limiting weighted branching process, which is independent of n and solves the stochastic fixed-point equation. Our result formalizes the well-known heuristics, that a network often locally possesses a tree-like structure. We conduct a numerical example showing that the approximation is very accurate for English Wikipedia pages (over 5 million).
To draw a sample from the endogenous solution of the stochastic fixed-point equation, one can run linear branching recursions on a weighted branching process. We provide an iterative simulation algorithm based on bootstrap. Compared to the naive Monte Carlo, our algorithm reduces the complexity from exponential to linear in the number of recursions. We show that as the bootstrap sample size tends to infinity, the sample drawn according to our algorithm converges to the target distribution in the Kantorovich-Rubinstein distance and the estimator is consistent.Operations research, Computer sciencenc2462Industrial Engineering, Industrial Engineering and Operations ResearchDissertationsOptimization Algorithms for Structured Machine Learning and Image Processing Problems
http://academiccommons.columbia.edu/catalog/ac:158764
Qin, Zhiweihttp://hdl.handle.net/10022/AC:P:19648Fri, 05 Apr 2013 00:00:00 +0000Optimization algorithms are often the solution engine for machine learning and image processing techniques, but they can also become the bottleneck in applying these techniques if they are unable to cope with the size of the data. With the rapid advancement of modern technology, data of unprecedented size has become more and more available, and there is an increasing demand to process and interpret the data. Traditional optimization methods, such as the interior-point method, can solve a wide array of problems arising from the machine learning domain, but it is also this generality that often prevents them from dealing with large data efficiently. Hence, specialized algorithms that can readily take advantage of the problem structure are highly desirable and of immediate practical interest. This thesis focuses on developing efficient optimization algorithms for machine learning and image processing problems of diverse types, including supervised learning (e.g., the group lasso), unsupervised learning (e.g., robust tensor decompositions), and total-variation image denoising. These algorithms are of wide interest to the optimization, machine learning, and image processing communities. Specifically, (i) we present two algorithms to solve the Group Lasso problem. First, we propose a general version of the Block Coordinate Descent (BCD) algorithm for the Group Lasso that employs an efficient approach for optimizing each subproblem exactly. We show that it exhibits excellent performance when the groups are of moderate size. For groups of large size, we propose an extension of the proximal gradient algorithm based on variable step-lengths that can be viewed as a simplified version of BCD. By combining the two approaches we obtain an implementation that is very competitive and often outperforms other state-of-the-art approaches for this problem. We show how these methods fit into the globally convergent general block coordinate gradient descent framework in (Tseng and Yun, 2009). We also show that the proposed approach is more efficient in practice than the one implemented in (Tseng and Yun, 2009). In addition, we apply our algorithms to the Multiple Measurement Vector (MMV) recovery problem, which can be viewed as a special case of the Group Lasso problem, and compare their performance to other methods in this particular instance; (ii) we further investigate sparse linear models with two commonly adopted general sparsity-inducing regularization terms, the overlapping Group Lasso penalty l1/l2-norm and the l1/l_infty-norm. We propose a unified framework based on the augmented Lagrangian method, under which problems with both types of regularization and their variants can be efficiently solved. As one of the core building-blocks of this framework, we develop new algorithms using a partial-linearization/splitting technique and prove that the accelerated versions of these algorithms require $O(1 sqrt(epsilon) ) iterations to obtain an epsilon-optimal solution. We compare the performance of these algorithms against that of the alternating direction augmented Lagrangian and FISTA methods on a collection of data sets and apply them to two real-world problems to compare the relative merits of the two norms; (iii) we study the problem of robust low-rank tensor recovery in a convex optimization framework, drawing upon recent advances in robust Principal Component Analysis and tensor completion. We propose tailored optimization algorithms with global convergence guarantees for solving both the constrained and the Lagrangian formulations of the problem. These algorithms are based on the highly efficient alternating direction augmented Lagrangian and accelerated proximal gradient methods. We also propose a nonconvex model that can often improve the recovery results from the convex models. We investigate the empirical recoverability properties of the convex and nonconvex formulations and compare the computational performance of the algorithms on simulated data. We demonstrate through a number of real applications the practical effectiveness of this convex optimization framework for robust low-rank tensor recovery; (iv) we consider the image denoising problem using total variation regularization. This problem is computationally challenging to solve due to the non-differentiability and non-linearity of the regularization term. We propose a new alternating direction augmented Lagrangian method, involving subproblems that can be solved efficiently and exactly. The global convergence of the new algorithm is established for the anisotropic total variation model. We compare our method with the split Bregman method and demonstrate the superiority of our method in computational performance on a set of standard test images.Operations research, Computer science, Statisticszq2107Industrial Engineering and Operations ResearchDissertationsAdding Trust to P2P Distribution of Paid Content
http://academiccommons.columbia.edu/catalog/ac:138893
Sherman, Alex; Stavrou, Angelos; Nieh, Jason; Keromytis, Angelos D.; Stein, Clifford S.http://hdl.handle.net/10022/AC:P:11195Mon, 19 Sep 2011 00:00:00 +0000While peer-to-peer (P2P) file-sharing is a powerful and cost-effective content distribution model, most paid-for digital-content providers (CPs) use direct download to deliver their content. CPs are hesitant to rely on a P2P distribution model because it introduces a number of security concerns including content pollution by malicious peers, and lack of enforcement of authorized downloads. Furthermore, because users communicate directly with one another, the users can easily form illegal file-sharing clusters to exchange copyrighted content. Such exchange could hurt the content providers' profits. We present a P2P system TP2P, where we introduce a notion of trusted auditors (TAs). TAs are P2P peers that police the system by covertly monitoring and taking measures against misbehaving peers. This policing allows TP2P to enable a stronger security model making P2P a viable alternative for the distribution of paid digital content. Through analysis and simulation, we show the effectiveness of even a small number of TAs at policing the system. In a system with as many as 60% of misbehaving users, even a small number of TAs can detect 99% of illegal cluster formation. We develop a simple economic model to show that even with such a large presence of malicious nodes, TP2P can improve CP's profits (which could translate to user savings) by 62% to 122%, even while assuming conservative estimates of content and bandwidth costs. We implemented TP2P as a layer on top of BitTorrent and demonstrated experimentally using PlanetLab that our system provides trusted P2P file sharing with negligible performance overhead.Computer sciencejn234, ak2052, cs2035Industrial Engineering and Operations Research, Computer ScienceArticlesMitigating the Effect of Free-Riders in BitTorrent using Trusted Agents
http://academiccommons.columbia.edu/catalog/ac:110826
Sherman, Alex; Stavrou, Angelos; Nieh, Jason; Stein, Clifford S.http://hdl.handle.net/10022/AC:P:29544Wed, 27 Apr 2011 00:00:00 +0000Even though Peer-to-Peer (P2P) systems present a cost-effective and scalable solution to content distribution, most entertainment, media and software, content providers continue to rely on expensive, centralized solutions such as Content Delivery Networks. One of the main reasons is that the current P2P systems cannot guarantee reasonable performance as they depend on the willingness of users to contribute bandwidth. Moreover, even systems like BitTorrent, which employ a tit-for-tat protocol to encourage fair bandwidth exchange between users, are prone to free-riding (i.e. peers that do not upload). Our experiments on PlanetLab extend previous research (e.g. LargeViewExploit, BitTyrant) demonstrating that such selfish behavior can seriously degrade the performance of regular users in many more scenarios beyond simple free-riding: we observed an overhead of up to 430% for 80% of free-riding identities easily generated by a small set of selfish users. To mitigate the effects of selfish users, we propose a new P2P architecture that classifies peers with the help of a small number of {\em trusted nodes} that we call Trusted Auditors (TAs). TAs participate in P2P download like regular clients and detect free-riding identities by observing their neighbors' behavior. Using TAs, we can separate compliant users into a separate service pool resulting in better performance. Furthermore, we show that TAs are more effective ensuring the performance of the system than a mere increase in bandwidth capacity: for 80\% of free-riding identities a single-TA system has a 6\% download time overhead while without the TA and three times the bandwidth capacity we measure a 100\% overhead.Computer sciencejn234, cs2035Industrial Engineering and Operations Research, Computer ScienceTechnical reportsA Case for P2P Delivery of Paid Content
http://academiccommons.columbia.edu/catalog/ac:110614
Sherman, Alex; Stavrou, Angelos; Nieh, Jason; Stein, Clifford S.; Keromytis, Angelos D.http://hdl.handle.net/10022/AC:P:29479Wed, 27 Apr 2011 00:00:00 +0000P2P file sharing provides a powerful content distribution model by leveraging users' computing and bandwidth resources. However, companies have been reluctant to rely on P2P systems for paid content distribution due to their inability to limit the exploitation of these systems for free file sharing. We present TP2, a system that combines the more cost-effective and scalable distribution capabilities of P2P systems with a level of trust and control over content distribution similar to direct download content delivery networks. TP2 uses two key mechanisms that can be layered on top of existing P2P systems. First, it provides strong authentication to prevent free file sharing in the system. Second, it introduces a new notion of trusted auditors to detect and limit malicious attempts to gain information about participants in the system to facilitate additional out-of-band free file sharing. We analyze TP2 by modeling it as a novel game between malicious users who try to form free file sharing clusters and trusted auditors who curb the growth of such clusters. Our analysis shows that a small fraction of trusted auditors is sufficient to protect the P2P system against unauthorized file sharing. Using a simple economic model, we further show that TP2 provides a more cost-effective content distribution solution, resulting in higher profits for a content provider even in the presence of a large percentage of malicious users. Finally, we implemented TP2 on top of BitTorrent and use PlanetLab to show that our system can provide trusted P2P file sharing with negligible performance overhead.Computer sciencejn234, cs2035, ak2052Industrial Engineering and Operations Research, Computer ScienceTechnical reportsFairTorrent: Bringing Fairness to Peer-to-Peer Systems
http://academiccommons.columbia.edu/catalog/ac:110957
Sherman, Alex; Nieh, Jason; Stein, Clifford S.http://hdl.handle.net/10022/AC:P:29585Tue, 26 Apr 2011 00:00:00 +0000The lack of fair bandwidth allocation in Peer-to-Peer systems causes many performance problems, including users being disincentivized from contributing upload bandwidth, free riders taking as much from the system as possible while contributing as little as possible, and a lack of quality-of-service guarantees to support streaming applications. We present FairTorrent, a simple distributed scheduling algorithm for Peer-to-Peer systems that fosters fair bandwidth allocation among peers. For each peer, FairTorrent maintains a deficit counter which represents the number of bytes uploaded to a peer minus the number of bytes downloaded from it. It then uploads to the peer with the lowest deficit counter. FairTorrent automatically adjusts to variations in bandwidth among peers and is resilient to exploitation by free-riding peers. We have implemented FairTorrent inside a BitTorrent client without modifications to the BitTorrent protocol, and compared its performance on PlanetLab against other widely-used BitTorrent clients. Our results show that FairTorrent can provide up to two orders of magnitude better fairness and up to five times better download performance for high contributing peers. It thereby gives users an incentive to contribute more bandwidth, and improve overall system performance.Computer sciencejn234, cs2035Industrial Engineering and Operations Research, Computer ScienceTechnical reportsGroup Ratio Round-Robin: O(1) Proportional Share Scheduling for Uniprocessor and Multiprocessor Systems
http://academiccommons.columbia.edu/catalog/ac:109814
Caprita, Bogdan; Chan, Wong Chun; Nieh, Jason; Stein, Clifford S.; Zheng, Haoqianghttp://hdl.handle.net/10022/AC:P:29230Fri, 22 Apr 2011 00:00:00 +0000Proportional share resource management provides a flexible and useful abstraction for multiplexing time-shared resources. We present Group Ratio Round-Robin (GR3), the first proportional share scheduler that combines accurate proportional fairness scheduling behavior with O(1) scheduling overhead on both uniprocessor and multiprocessor systems. GR3 uses a novel client grouping strategy to organize clients into groups of similar processor allocations which can be more easily scheduled. Using this grouping strategy, GR3 combines the benefits of low overhead round-robin execution with a novel ratio-based scheduling algorithm. GR3 can provide fairness within a constant factor of the ideal generalized processor sharing model for client weights with a fixed upper bound and preserves its fairness properties on multiprocessor systems. We have implemented GR3 in Linux and measured its performance against other schedulers commonly used in research and practice, including the standard Linux scheduler, Weighted Fair Queueing, Virtual-Time Round-Robin, and Smoothed Round-Robin. Our experimental results demonstrate that GR3 can provide much lower scheduling overhead and much better scheduling accuracy in practice than these other approaches.Computer sciencejn234, cs2035Industrial Engineering and Operations Research, Computer ScienceTechnical reportsLearning mixtures of product distributions over discrete domains
http://academiccommons.columbia.edu/catalog/ac:110398
Feldman, Jon; O'Donnell, Ryan; Servedio, Rocco Anthonyhttp://hdl.handle.net/10022/AC:P:29411Thu, 21 Apr 2011 00:00:00 +0000We consider the problem of learning mixtures of product distributions over discrete domains in the distribution learning framework introduced by Kearns et al. We give a $\poly(n/\eps)$ time algorithm for learning a mixture of $k$ arbitrary product distributions over the $n$-dimensional Boolean cube $\{0,1\}^n$ to accuracy $\eps$, for any constant $k$. Previous polynomial time algorithms could only achieve this for $k = 2$ product distributions; our result answers an open question stated independently by Cryan and by Freund and Mansour. We further give evidence that no polynomial time algorithm can succeed when $k$ is superconstant, by reduction from a notorious open problem in PAC learning. Finally, we generalize our $\poly(n/\eps)$ time algorithm to learn any mixture of $k = O(1)$ product distributions over $\{0,1, \dots, b\}^n$, for any $b = O(1)$.Computer scienceras2105Industrial Engineering and Operations Research, Computer ScienceTechnical reportsGrouped Distributed Queues: Distributed Queue, Proportional Share Multiprocessor Scheduling
http://academiccommons.columbia.edu/catalog/ac:110491
Caprita, Bogdan; Nieh, Jason; Stein, Clifford S.http://hdl.handle.net/10022/AC:P:29440Thu, 21 Apr 2011 00:00:00 +0000We present Grouped Distributed Queues (GDQ), the first proportional share scheduler for multiprocessor systems that, by using a distributed queue architecture, scales well with a large number of processors and processes. GDQ achieves accurate proportional fairness scheduling with only O(1) scheduling overhead. GDQ takes a novel approach to distributed queuing: instead of creating per-processor queues that need to be constantly balanced to achieve any measure of proportional sharing fairness, GDQ uses a simple grouping strategy to organize processes into groups based on similar processor time allocation rights, and then assigns processors to groups based on aggregate group shares. Group membership of processes is static, and fairness is achieved by dynamically migrating processors among groups. The set of processors working on a group use simple, low-overhead round-robin queues, while processor reallocation among groups is achieved using a new multiprocessor adaptation of the well-known Weighted Fair Queuing algorithm. By commoditizing processors and decoupling their allocation from process scheduling, GDQ provides, with only constant scheduling cost, fairness within a constant of the ideal generalized processor sharing model for process weights with a fixed upper bound. We have implemented GDQ in Linux and measured its performance. Our experimental results show that GDQ has low overhead and scales well with the number of processors.Computer sciencejn234, cs2035Industrial Engineering and Operations Research, Computer ScienceTechnical reportsBehavior-Based Modeling and Its Application to Email Analysis
http://academiccommons.columbia.edu/catalog/ac:125674
Stolfo, Salvatore; Hershkop, Shlomo; Hu, Chia-wei; Li, Wei-Jen; Nimeskern, Olivier; Wang, Kehttp://hdl.handle.net/10022/AC:P:8686Wed, 28 Apr 2010 00:00:00 +0000The Email Mining Toolkit (EMT) is a data mining system that computes behavior profiles or models of user email accounts. These models may be used for a multitude of tasks including forensic analyses and detection tasks of value to law enforcement and intelligence agencies, as well for as other typical tasks such as virus and spam detection. To demonstrate the power of the methods, we focus on the application of these models to detect the early onset of a viral propagation without "content-base" (or signature-based) analysis in common use in virus scanners. We present several experiments using real email from 15 users with injected simulated viral emails and describe how the combination of different behavior models improves overall detection rates. The performance results vary depending upon parameter settings, approaching 99% true positive (TP) (percentage of viral emails caught) in general cases and with 0.38% false positive (FP) (percentage of emails with attachments that are mislabeled as viral). The models used for this study are based upon volume and velocity statistics of a user's email rate and an analysis of the user's (social) cliques revealed in the person's email behavior. We show by way of simulation that virus propagations are detectable since viruses may emit emails at rates different than human behavior suggests is normal, and email is directed to groups of recipients in ways that violate the users' typical communications with their social groups.Computer sciencesjs11, sh553, ch176Industrial Engineering and Operations Research, Computer ScienceArticles