Improving a Regression Tree Solution Using Error-based Clustering
by
Mahesh Kumar
Massachusetts Institute of Technology
Coauthors: Nitin R. Patel (Sloan School of Management, MIT), Charu C. Aggarwal (IBM T.J. Watson Research Center, NY), Philip S. Yu (IBM T.J. Watson Research Center, NY)
We consider the problem of clustering the points in a regression. Standard least squares regression, which assumes that the data come from a single linear function, fails when different parts of the data follow different linear relationships. Regression trees have been proposed to partition the data into subsets (leaf nodes) so that the data within each subset follow a single linear function. The regression function is then estimated by a least squares regression on each subset. The problem with this method is that each subset often has only a few data points; consequently, the regression coefficient estimates for each subset have large errors. We propose using error-based clustering on the regression tree leaves, which simultaneously determines clusters of subsets of data and estimates the regression coefficients for each cluster. The new estimates have smaller errors and therefore provide more accurate forecasts on unseen data points. We justify our approach theoretically and present empirical results on both simulated and real data.
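The idea of clustering leaves by their estimated coefficients and the errors of those estimates can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the function names (`fit_leaf`, `error_weighted_cluster`), the farthest-point initialization, and the k-means-style loop are all assumptions made for illustration. Each leaf is fitted by ordinary least squares, the matrix X^T X serves as the inverse covariance of the estimate (up to the noise variance), and leaves are assigned to the cluster whose center is closest in that error-weighted distance. Under these assumptions, a cluster center equals the pooled least squares estimate over all data in the cluster.

```python
import numpy as np

def fit_leaf(X, y):
    """OLS on one leaf: returns coefficients and X^T X
    (the inverse covariance of the estimate, up to sigma^2)."""
    A = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta, A.T @ A

def error_weighted_cluster(leaves, k, n_iter=20):
    """Hypothetical k-means-style clustering of leaves, where `leaves`
    is a list of (X, y) pairs and distances are weighted by each
    leaf's inverse error covariance."""
    fits = [fit_leaf(X, y) for X, y in leaves]
    thetas = np.array([t for t, _ in fits])
    precs = np.array([P for _, P in fits])

    def dist(i, c):
        d = thetas[i] - c
        return float(d @ precs[i] @ d)

    # Assumed initialization: start from leaf 0, then repeatedly add
    # the leaf farthest from the centers chosen so far.
    centers = [thetas[0]]
    while len(centers) < k:
        gaps = [min(dist(i, c) for c in centers) for i in range(len(leaves))]
        centers.append(thetas[int(np.argmax(gaps))])
    centers = np.array(centers)

    labels = np.full(len(leaves), -1)
    for _ in range(n_iter):
        new_labels = np.array(
            [int(np.argmin([dist(i, c) for c in centers]))
             for i in range(len(leaves))]
        )
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = labels == j
            if members.any():  # an empty cluster keeps its old center
                P_sum = precs[members].sum(axis=0)
                rhs = np.einsum("nij,nj->i", precs[members], thetas[members])
                # Precision-weighted mean = pooled OLS over the cluster.
                centers[j] = np.linalg.solve(P_sum, rhs)
    return labels, centers
```

Calling `error_weighted_cluster(leaves, k)` returns one cluster label per leaf and one coefficient vector per cluster. Because each center pools the data of all its member leaves, it rests on more points than any single leaf fit, which is the source of the smaller estimation error described above.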
Date received: October 15, 2002
Copyright © 2002 by the author(s). The author(s) of this document and the organizers of the conference have granted their consent to include this abstract in Atlas Mathematical Conference Abstracts. Document # cais-51.