“Building a Large Annotated Corpus of English: The Penn Treebank”, Mitchell Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, 1993-10-01:

In this paper, we review our experience with constructing one such large annotated corpus—the Penn Treebank, a corpus consisting of over 4.5 million words of American English.

During the first three-year phase of the Penn Treebank Project (1989–1992), this corpus has been annotated for part-of-speech (POS) information. In addition, over half of it has been annotated for skeletal syntactic structure.