High Quality, Efficient Hierarchical Document Clustering Using Closed Interesting Itemsets

Malik, Hassan H.; Kender, John R.

High dimensionality remains a significant challenge for document clustering. Recent approaches used frequent itemsets and closed frequent itemsets to reduce dimensionality, and to improve the efficiency of hierarchical document clustering. In this paper, we introduce the notion of "closed interesting" itemsets (i.e. closed itemsets with high interestingness). We provide heuristics such as "super item" to efficiently mine these itemsets and show that they provide significant dimensionality reduction over closed frequent itemsets. Using "closed interesting" itemsets, we propose a new hierarchical document clustering method that outperforms state of the art agglomerative, partitioning and frequent-itemset based methods both in terms of FScore and Entropy, without requiring dataset specific parameter tuning. We evaluate twenty interestingness measures on nine standard datasets and show that when used to generate "closed interesting" itemsets, and to select parent nodes, Mutual Information, Added Value, Yule's Q and Chi-Square offers best clustering performance, regardless of the characteristics of underlying dataset. We also show that our method is more scalable, and results in better run-time performance as compare to leading approaches. On a dual processor machine, our method scaled sub-linearly and was able to cluster 200K documents in about 40 seconds.



More About This Work

Academic Units
Computer Science
Department of Computer Science, Columbia University
Columbia University Computer Science Technical Reports, CUCS-046-06
Published Here
April 28, 2011