“The Effects of Training Set Size on Decision Tree Complexity”, Tim Oates & David Jensen 1997:

This paper presents experiments with 19 datasets and 5 decision tree pruning algorithms that show that increasing training set size often results in a linear increase in tree size, even when that additional complexity results in no substantial increase in classification accuracy. Said differently, removing randomly selected training instances often results in trees that are substantially smaller and just as accurate as those built on all available training instances.

This implies that decreases in tree size obtained by more sophisticated data reduction techniques should be decomposed into 2 parts: that which is due to reduction of training set size, and the remainder, which is due to how the method selects instances to discard.

We perform this decomposition for one recent data reduction technique, John’s ROBUST-C4.5 (John 1995), and show that a large percentage of its effect on tree size is attributable to the fact that it simply reduces the size of the training set.

We conclude that random data reduction is a baseline against which more sophisticated data reduction techniques should be compared. Finally, we examine one possible cause of the pathological relationship between tree size and training set size.
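The random-data-reduction baseline described above can be sketched as follows. The synthetic dataset and scikit-learn's CART learner are stand-ins for illustration (the paper itself used C4.5 with 5 pruning algorithms on 19 real datasets): trees are grown on randomly drawn subsets of increasing size, and node count is compared against held-out accuracy.

```python
# Sketch (assumptions: synthetic data + scikit-learn CART, not the paper's
# C4.5 setup) of random data reduction as a baseline: discard randomly
# selected training instances and observe tree size vs. accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=6000, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

rng = np.random.RandomState(0)
sizes, accs = {}, {}
for frac in (0.1, 0.25, 0.5, 1.0):
    n = int(frac * len(X_tr))
    # random data reduction: keep only n randomly chosen training instances
    idx = rng.choice(len(X_tr), size=n, replace=False)
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx])
    sizes[n] = clf.tree_.node_count          # tree complexity
    accs[n] = clf.score(X_te, y_te)          # held-out accuracy
    print(f"n={n:5d}  tree size={sizes[n]:4d}  accuracy={accs[n]:.3f}")
```

A more sophisticated data reduction technique would then be credited only with the tree-size reduction beyond what a random subset of the same size already achieves.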