“Tree Induction vs. Logistic Regression: A Learning-Curve Analysis”, 2003-06-01:
Tree induction and logistic regression are 2 standard, off-the-shelf methods for building models for classification.
We present a large-scale experimental comparison of logistic regression and tree induction (C4.5), assessing classification accuracy and the quality of rankings based on class-membership probabilities.
We use a learning-curve analysis to examine the relationship of these measures to the size of the training set.
The results of the study show several things:
Contrary to some prior observations, logistic regression does not generally outperform tree induction.
More specifically, and not surprisingly, logistic regression is better for smaller training sets and tree induction for larger data sets. Importantly, this often holds for training sets drawn from the same domain (that is, the learning curves cross), so conclusions about induction-algorithm superiority on a given domain must be based on an analysis of the learning curves.
Contrary to conventional wisdom, tree induction is effective at producing probability-based rankings, although apparently, for a given training-set size, it is comparatively less effective at ranking than at making classifications. Finally,
the domains on which tree induction and logistic regression are ultimately preferable can be characterized surprisingly well by a simple measure of the separability of signal from noise. [Keywords: decision trees, learning curves, logistic regression, ROC analysis, tree induction]
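The abstract's ranking quality is assessed with ROC analysis. As a minimal illustration (not the paper's evaluation code), the sketch below computes AUC via the rank-sum formulation, using invented labels and class-membership probabilities:

```python
# Sketch: scoring a probability-based ranking with AUC, the kind of ROC-based
# measure the paper uses. The labels and probabilities below are invented
# purely for illustration; no real classifier is involved.

def auc(y_true, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) formulation:
    the probability that a random positive is scored above a random negative,
    counting ties as 1/2."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A model can classify well at a 0.5 threshold yet rank imperfectly (and
# vice versa), which is why rankings are scored separately from accuracy.
labels = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]  # hypothetical class-1 probabilities
print(auc(labels, probs))               # one positive/negative pair is misordered
```

Here one negative (0.6) outscores one positive (0.4), so 8 of the 9 positive/negative pairs are ordered correctly and the AUC is 8/9.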
…The average data-set size is larger than is usual in machine-learning research, and we see behavioral characteristics that would be overlooked when comparing algorithms only on smaller data sets (such as most in the UCI repository; see 2000).
…Papers such as this seldom consider carefully the size of the data sets to which the algorithms are being applied. Does the relative performance of the different learning methods depend on the size of the data set?
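The learning-curve methodology used to answer this question can be sketched in a few lines: train on nested subsets of increasing size and score each fitted model on a fixed held-out set. The stand-in learner below (a nearest-centroid rule on synthetic 1-D Gaussian data) is chosen only to keep the sketch self-contained; it is not C4.5 or logistic regression.

```python
# Sketch of a learning-curve analysis: accuracy on a fixed test set as a
# function of training-set size. Learner and data are illustrative stand-ins.
import random

random.seed(0)

def draw(n):
    """n labeled points: class 0 ~ N(0, 1), class 1 ~ N(1.5, 1)."""
    return [(random.gauss(1.5 * y, 1.0), y)
            for y in (random.randint(0, 1) for _ in range(n))]

def fit_nearest_centroid(train):
    """'Train' by recording each class's mean feature value."""
    means = {}
    for c in (0, 1):
        xs = [x for x, y in train if y == c]
        means[c] = sum(xs) / len(xs) if xs else float(c)
    return means

def accuracy(means, test):
    """Classify each test point by its nearest class centroid."""
    hits = sum(1 for x, y in test
               if min((abs(x - m), c) for c, m in means.items())[1] == y)
    return hits / len(test)

test_set = draw(2000)
train_pool = draw(1000)
sizes = [10, 30, 100, 300, 1000]    # nested training subsets
curve = [(n, accuracy(fit_nearest_centroid(train_pool[:n]), test_set))
         for n in sizes]
for n, acc in curve:
    print(f"n={n:5d}  accuracy={acc:.3f}")
```

Comparing two such curves for two algorithms on the same domain, rather than two accuracies at one arbitrary size, is exactly what lets crossings be detected.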
More than a decade ago in machine-learning research, the examination of learning curves was commonplace (see, for example, 1988), but usually on single data sets (notable exceptions being the study by et al 1991, and the work of 1991 [“Megainduction: machine learning on very large databases”]). Now learning curves are presented only rarely in comparisons of learning algorithms. Learning curves are also found in the statistical literature (1994) and in the neural-network literature (et al 1994). They have been analyzed theoretically, using statistical mechanics (et al 1993; et al 1996).
The few cases that exist draw conflicting conclusions with respect to our goals. 1997 compare classification-accuracy learning curves of naive Bayes and the C4.5RULES rule learner (1993). On synthetic data, they show that naive Bayes performs better for smaller training sets and C4.5RULES performs better for larger training sets (the learning curves cross). They argue that this can be explained by the different bias/variance profiles of the algorithms for classification (zero/one loss). Roughly speaking, variance plays a more critical role than estimation bias when considering classification accuracy. For smaller data sets, naive Bayes has a substantial advantage over tree or rule induction in terms of variance. They show that this is the case even when (by their construction) the rule-learning algorithm has no bias. As expected, as larger training sets reduce variance, C4.5RULES approaches perfect classification. 1999 perform a similar bias/variance analysis of C4.5 and naive Bayes. They do not examine whether the curves cross, but do show on 4 UCI data sets that variance is reduced consistently with more data, but bias is not. These results do not directly examine logistic regression, but the bias/variance arguments do apply: logistic regression, a linear model, should have higher bias but lower variance than tree induction. Therefore, one would expect their learning curves to cross.
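The variance half of this argument is easy to see in a Monte-Carlo sketch: refit the same simple learner on many independent training samples and measure how much its fitted decision boundary varies. The learner below (a midpoint-of-class-means threshold on 1-D Gaussian data) is a stand-in chosen for brevity, not one of the algorithms above; the point is only that variance falls as the training set grows, which is what allows a lower-bias/higher-variance method to overtake a higher-bias/lower-variance one on larger data.

```python
# Monte-Carlo sketch of variance reduction with training-set size.
# The learner and data are illustrative stand-ins, not the paper's setup.
import random
from statistics import pvariance

random.seed(1)

def fitted_threshold(n):
    """Fit on a fresh sample of n points per class; return the midpoint of
    the two class means, i.e. the learner's fitted decision boundary."""
    m0 = sum(random.gauss(0.0, 1.0) for _ in range(n)) / n
    m1 = sum(random.gauss(2.0, 1.0) for _ in range(n)) / n
    return (m0 + m1) / 2

reps = 500
var_small = pvariance([fitted_threshold(10) for _ in range(reps)])
var_large = pvariance([fitted_threshold(250) for _ in range(reps)])
print(f"threshold variance, n=10:  {var_small:.4f}")
print(f"threshold variance, n=250: {var_large:.4f}")
```

Analytically the variance of this threshold is 1/(2n), so the n=250 estimate should come out roughly 25x smaller than the n=10 one; bias, by contrast, would not shrink with n, matching the 1999 finding quoted above.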
However, the results of 1997 were generated from synthetic data on which the rule learner had no bias. Would we see such behavior on real-world domains? Kohavi (1996) shows classification-accuracy learning curves of tree induction (using C4.5) and of naive Bayes for 9 UCI data sets. With only one exception, either naive Bayes or tree induction dominates (that is, the performance of one or the other is consistently superior for all training-set sizes). Furthermore, by examining the curves, Kohavi concludes that “In most cases, it is clear that even with much more data, the learning curves will not cross” (pp. 203–204).
We are aware of only one learning-curve analysis that compares logistic regression and tree induction. Harris-1997 [“Sample size and misclassification: is more always better?”] compare them on 2 business data sets, one real and one synthetic. For these data the learning curves cross, suggesting (as they observe) that logistic regression is preferable for smaller data sets and tree induction for larger data sets. Our results generally support this conclusion.
…These results concur with recent results (2001) comparing discriminative and generative versions of the same model (viz., logistic regression and naive Bayes), which show that learning curves often cross…A corollary observation is that even for very large data-set sizes, the slope of the learning curves remains distinguishable from zero. Catlett (1991) concluded that learning curves continue to grow on several large-at-the-time data sets (the largest with fewer than 100,000 training examples). 1999 suggest that this conclusion should be revisited as the size of data sets that can be feasibly processed by learning algorithms increases. Our results provide a contemporary reiteration of Catlett’s. On the other hand, our results seemingly contradict conclusions or assumptions made in some prior work. For example, 1997 conclude that classification-tree learning curves level off, and et al 1999 replicate this finding and use it as an assumption of their sampling strategy. Technically, the criterion in these studies for a curve to have reached a plateau is that the increase in accuracy up to the largest data-set size be less than a certain threshold (<1%); however, the conclusion is often taken to mean that increases in accuracy cease. Our results show clearly that this latter interpretation is not appropriate even at our largest data-set sizes.
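The gap between the technical plateau criterion and its common misreading can be made concrete with a small predicate. The formulation below (every probed accuracy within a relative threshold of the final one) is one plausible reading of the criterion, not the cited studies' exact definition, and the curve values are invented:

```python
# Sketch of the "<1% increase" plateau criterion discussed above. The exact
# formulation in the cited studies may differ; this is an illustrative reading.

def has_plateaued(accuracies, rel_threshold=0.01):
    """True if no earlier accuracy is more than rel_threshold (relative)
    below the accuracy at the largest data-set size -- the weak criterion
    that is often misread as 'accuracy has stopped increasing'."""
    final = accuracies[-1]
    return all(final - a <= rel_threshold * final for a in accuracies)

# A curve can satisfy the criterion while still rising steadily:
slow_but_rising = [0.850, 0.852, 0.854, 0.856, 0.858]
print(has_plateaued(slow_but_rising))  # True, yet the slope is nonzero
```

This is the distinction the passage draws: satisfying such a threshold test does not mean the curve's slope has reached zero, and on large data sets a persistent small slope can still translate into many additional correct classifications.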