Reports

An Extensible Meta-Learning Approach for Scalable and Accurate Inductive Learning

Chan, Philip K.

Much of the research in inductive learning concentrates on problems with relatively small amounts of data. With the coming age of ubiquitous network computing, it is likely that orders of magnitude more data in databases will be available for various learning problems of real-world importance. Some learning algorithms assume that the entire data set fits into main memory, which is not feasible for massive amounts of data, especially for applications in data mining. One approach to handling a large data set is to partition the data set into subsets, run the learning algorithm on each of the subsets, and combine the results. Moreover, data can be inherently distributed across multiple sites on the network, and merging all the data in one location can be expensive or prohibitive. In this thesis we propose, investigate, and evaluate a meta-learning approach to integrating the results of multiple learning processes. Our approach utilizes machine learning to guide the integration. We identified two main meta-learning strategies: combiner and arbiter. Both strategies are independent of the learning algorithms used in generating the classifiers. The combiner strategy attempts to reveal relationships among the learned classifiers' prediction patterns. The arbiter strategy tries to determine the correct prediction when the classifiers have different opinions. Various schemes under these two strategies have been developed. Empirical results show that our schemes can obtain accurate classifiers from inaccurate classifiers trained on data subsets. We also implemented and analyzed the schemes in a parallel and distributed environment to demonstrate their scalability.
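
As an illustration only, the partition, learn, and combine idea described in the abstract might be sketched as follows. This is a minimal sketch under stated assumptions, not the schemes evaluated in the report: the nearest-centroid base learner, the lookup-table combiner, the rule that the arbiter decides disputed cases outright, and all function names and toy data are hypothetical choices made for this example.

```python
from collections import Counter, defaultdict


def train_centroid(examples):
    """Hypothetical base learner: nearest class centroid on one numeric feature."""
    sums, counts = defaultdict(float), Counter()
    for x, y in examples:
        sums[y] += x
        counts[y] += 1
    centroids = {y: sums[y] / counts[y] for y in counts}
    return lambda x: min(centroids, key=lambda y: abs(x - centroids[y]))


def partition(data, k):
    """Split the data into k disjoint subsets (round-robin)."""
    return [data[i::k] for i in range(k)]


def train_combiner(base, held_out):
    """Learn which true label tends to go with each pattern of base predictions."""
    table = defaultdict(Counter)
    for x, y in held_out:
        table[tuple(c(x) for c in base)][y] += 1
    fallback = Counter(y for _, y in held_out).most_common(1)[0][0]

    def predict(x):
        pattern = tuple(c(x) for c in base)
        if pattern in table:
            return table[pattern].most_common(1)[0][0]
        return fallback  # prediction pattern never seen at the meta level

    return predict


def train_arbiter(base, held_out, learner):
    """Train an extra classifier on the examples the base classifiers dispute."""
    disputed = [(x, y) for x, y in held_out if len({c(x) for c in base}) > 1]
    referee = learner(disputed) if disputed else None

    def predict(x):
        votes = Counter(c(x) for c in base)
        if len(votes) > 1 and referee is not None:
            return referee(x)  # simplification: the arbiter decides disputes outright
        return votes.most_common(1)[0][0]

    return predict


# Toy data (invented for this sketch): label "a" below 5, "b" otherwise.
train = [(float(x), "a" if x < 5 else "b") for x in range(10)]
held_out = [(0.5, "a"), (3.5, "a"), (6.5, "b"), (9.5, "b")]

# One base classifier per data subset, then two meta-level integrators.
base = [train_centroid(subset) for subset in partition(train, 2)]
combiner = train_combiner(base, held_out)
arbiter = train_arbiter(base, held_out, train_centroid)

print(combiner(1.0), combiner(8.0), arbiter(1.0), arbiter(8.0))
```

Both meta-level functions see the base classifiers only through their predictions, which is one way to read the abstract's statement that the strategies are independent of the learning algorithms used to generate the classifiers.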

More About This Work

Academic Units: Computer Science
Publisher: Department of Computer Science, Columbia University
Series: Columbia University Computer Science Technical Reports, CUCS-044-96
Published Here: April 22, 2011