“This Week’s Citation Classic: Nearest Neighbor Pattern Classification”, Thomas M. Cover, 1982-03-05:

[retrospective about Cover & Hart 1967] Early in 1966 when I first began teaching at Stanford, a student, Peter E. Hart, walked into my office with an interesting problem. He said that Charles Cole [who?] and he were using a pattern classification scheme which, for lack of a better word, they described as the nearest neighbor procedure. This scheme assigned to an as yet unclassified observation the classification of the nearest neighbor. Were there any good theoretical properties of this procedure? Of course the motivation for such a classification rule must go back to prehistoric times. The idea is that ‘things that look alike must be alike’.
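[The procedure Hart described is simple enough to sketch in a few lines. A minimal, hypothetical illustration of the 1-NN rule, with toy data and function names of my own invention:]

```python
import math

def nearest_neighbor_classify(train, query):
    """Assign to `query` the label of its nearest training point
    (Euclidean distance): the nearest neighbor (1-NN) rule."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # Find the labeled observation closest to the query and copy its label.
    nearest_point, nearest_label = min(train, key=lambda pl: dist(pl[0], query))
    return nearest_label

# Toy example: two clusters, labeled 'A' and 'B'.
train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"),
         ((1.0, 1.0), "B"), ((0.9, 1.1), "B")]
print(nearest_neighbor_classify(train, (0.2, 0.1)))  # → A
```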

The problem seemed extremely inviting from a theoretical point of view. We began meeting for two or three hours every afternoon in an attempt to find some distribution-free properties of this classification rule… Soon thereafter we found what we were looking for. The nearest neighbor risk is less than twice the Bayes risk for all reasonable distributions and for any number of categories.
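[The bound in question, in the notation of the 1967 paper (with $R^{*}$ the Bayes risk, $R$ the large-sample nearest neighbor risk, and $M$ the number of categories), is:]

```latex
R^{*} \le R \le R^{*}\left(2 - \frac{M}{M-1}\,R^{*}\right) \le 2R^{*}
```

[So no decision rule, however sophisticated, can do better than halve the asymptotic error of this naive rule.]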

…The attendant measure-theoretic difficulties in eliminating the so-called singular distributions almost delayed the publication of our paper. It was wise that we did not hold up publication, because the theorem was not proved in total generality until 10 years later.

…The simplicity of the bound and the sweeping generality of the statement, combined with the obvious simplicity of the nearest neighbor rule itself, have caused this result to be used by others, thus accounting for the high number of citations. Since the properties of the nearest neighbor rule can be easily remembered, the bound yields a benchmark for other more sophisticated data analysis procedures, which sometimes actually perform worse than the nearest neighbor rule. This is probably due to the fact that more ambitious rules have too many parameters for the data set. [but also because they build in a lot of bias, compared to data-hungry nonparametric statistics like k-NN… cf. Bitter Lesson on how ‘the scissor blades cross’, and Highleyman]