“Best Practices for the Human Evaluation of Automatically Generated Text”, Chris van der Lee, Albert Gatt, Emiel van Miltenburg, Sander Wubben, Emiel Krahmer (2019-10):

Currently, there is little agreement on how Natural Language Generation (NLG) systems should be evaluated. While there is some consensus regarding automatic metrics, there is a high degree of variation in the way that human evaluation is carried out.

This paper provides an overview of how human evaluation is currently conducted and presents a set of best practices, grounded in the literature.

With this paper, we hope to contribute to the quality and consistency of human evaluations in NLG.