“Building a Large Annotated Corpus of English: The Penn Treebank”, 1993-10-01:
In this paper, we review our experience with constructing one such large annotated corpus—the Penn Treebank, a corpus consisting of over 4.5 million words of American English.
During the first three-year phase of the Penn Treebank Project (1989–1992), this corpus has been annotated for part-of-speech (POS) information. In addition, over half of it has been annotated for skeletal syntactic structure.