Toward Scalable and Parallel Inductive Learning: A Case Study in Splice Junction Prediction

Chan, Philip K.; Stolfo, Salvatore

Much of the research in inductive learning concentrates on problems with relatively small amounts of training data. With the steady progress of the Human Genome Project, it is likely that orders of magnitude more data in sequence databases will be available in the near future for various learning problems of biological importance. Thus, techniques for scaling machine learning algorithms require considerable attention. Meta-learning is proposed as a general technique for integrating a number of distinct learning processes, with the aim of providing a means of scaling to large problems. This paper details several meta-learning strategies for integrating classifiers that the same learner has learned independently on subsets of the training data in a parallel and distributed computing environment. Our strategies are particularly suited to massive amounts of data that main-memory-based learning algorithms cannot handle efficiently. The strategies are also independent of both the particular learning algorithm used and the underlying parallel and distributed platform. Preliminary experiments using different learning algorithms in a simulated parallel environment show encouraging results: parallel learning by meta-learning can achieve prediction accuracy comparable to that of serial learning in less space and time.
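To make the general idea concrete, the following is a minimal sketch of one possible way to integrate classifiers trained on disjoint subsets of the training data. It is not the report's implementation: the nearest-centroid base learner, the lookup-table meta-level learner, the synthetic one-dimensional data, and every function name (`train_centroid`, `partition`, `train_combiner`, `make_data`) are assumptions chosen for illustration. The subsets are processed in a plain loop here; in a parallel setting each base classifier could be trained on a separate processor.

```python
from collections import Counter, defaultdict
import random


def train_centroid(examples):
    """Hypothetical base learner: nearest class centroid on a 1-D feature."""
    sums, counts = defaultdict(float), Counter()
    for x, y in examples:
        sums[y] += x
        counts[y] += 1
    centroids = {y: sums[y] / counts[y] for y in counts}
    # The returned classifier picks the class whose centroid is closest to x.
    return lambda x: min(centroids, key=lambda y: abs(x - centroids[y]))


def partition(examples, k):
    """Split the training set into k disjoint subsets (round-robin)."""
    return [examples[i::k] for i in range(k)]


def train_combiner(base, validation):
    """Hypothetical meta-level learner: map each tuple of base predictions
    to the true label most often seen with it on the validation set."""
    table = defaultdict(Counter)
    for x, y in validation:
        table[tuple(c(x) for c in base)][y] += 1
    lookup = {preds: seen.most_common(1)[0][0] for preds, seen in table.items()}

    def predict(x):
        preds = tuple(c(x) for c in base)
        if preds in lookup:
            return lookup[preds]
        # Unseen combination of base predictions: fall back to simple voting.
        return Counter(preds).most_common(1)[0][0]

    return predict


def make_data(n, seed):
    """Synthetic two-class data (an assumption, not splice junction data)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rng.choice(["pos", "neg"])
        x = rng.gauss(2.0 if y == "pos" else -2.0, 1.0)
        data.append((x, y))
    return data


train = make_data(300, seed=0)
valid = make_data(100, seed=1)
test = make_data(200, seed=2)

# Each base classifier sees only one third of the training data.
base = [train_centroid(subset) for subset in partition(train, 3)]
meta = train_combiner(base, valid)
accuracy = sum(meta(x) == y for x, y in test) / len(test)
```

Because each base learner holds only its own subset in memory, the peak memory per learner shrinks with the number of subsets, which is the property the abstract emphasizes for main-memory-based algorithms. The meta-level step only needs the base classifiers' predictions, so any base learning algorithm could be substituted for the toy one used here.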



More About This Work

Academic Units
Computer Science
Department of Computer Science, Columbia University
Columbia University Computer Science Technical Reports, CUCS-032-94
Published Here
February 3, 2012