Home

PARABLE: A PArallel RAndom-partition Based HierarchicaL ClustEring Algorithm for the MapReduce Framework

Shen Wang; Haimonti Dutta

Title:
PARABLE: A PArallel RAndom-partition Based HierarchicaL ClustEring Algorithm for the MapReduce Framework
Author(s):
Wang, Shen
Dutta, Haimonti
Date:
Type:
Technical reports
Department:
Center for Computational Learning Systems
Permanent URL:
Series:
CCLS Technical Report
Part Number:
CCLS-11-04
Publisher:
Center for Computational Learning Systems, Columbia University
Publisher Location:
New York
Abstract:
Large datasets, of the order of peta- and tera- bytes, are becoming prevalent in many scientific domains including astronomy, physical sciences, bioinformatics and medicine. To effectively store, query and analyze these gigantic repositories, parallel and distributed architectures have become popular. Apache Hadoop is one such framework for supporting data-intensive applications. It provides an open source implementation of the MapReduce programming paradigm which can be used to build scalable algorithms for pattern analysis and data mining. In this paper, we present a PArallel, RAndom-partition Based hierarchical clustEring algorithm (PARABLE) for the MapReduce framework. It proceeds in two main steps -- local hierarchical clustering on nodes using mappers and reducers and integration of results by a novel dendrogram alignment technique. Empirical results on two large data sets (High Energy Particle Physics and Intrusion Detection) from the KDDCup competition on a large cluster indicates that significant scalability benefits can be obtained by using the parallel hierarchical clustering algorithm in addition to maintaining good cluster quality.
Subject(s):
Computer science
Item views:
648
Metadata:
text | xml

In Partnership with the Center for Digital Research and Scholarship at Columbia University Libraries/Information Services | Terms of Use