"HALIE: Evaluating Human-Language Model Interaction", 2022-12-19:
Many real-world applications of language models (LMs), such as writing assistance and code autocomplete, involve human-LM interaction. However, most benchmarks are non-interactive in that a model produces output without human involvement.
To evaluate human-LM interaction, we develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive systems and the dimensions to consider when designing evaluation metrics. Compared to standard, non-interactive evaluation, HALIE captures (1) the interactive process, not only the final output; (2) the first-person subjective experience, not just a third-party assessment; and (3) notions of preference beyond quality (e.g., enjoyment and ownership). We then design five tasks to cover different forms of interaction: social dialogue, question answering, crossword puzzles, summarization, and metaphor generation.
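To make these three dimensions concrete, here is a minimal sketch of what logging an interactive session and computing a process-level metric might look like. This is illustrative only and not HALIE's actual API: the names (InteractionEvent, InteractionTrace, num_queries) and the choice of fields are assumptions, standing in for whatever trace format and survey instruments an interactive evaluation actually uses.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class InteractionEvent:
    """One step of a human-LM interaction: the user's action and a
    snapshot of the resulting system state (e.g., the shared text so far)."""
    timestamp: float
    user_action: str   # e.g., "query_lm", "accept_suggestion", "edit"
    system_state: str  # current text buffer / dialogue history

@dataclass
class InteractionTrace:
    """A full session plus first-person survey responses collected afterward
    (hypothetical 5-point Likert ratings for preference beyond quality)."""
    events: List[InteractionEvent] = field(default_factory=list)
    enjoyment: int = 0
    ownership: int = 0

def num_queries(trace: InteractionTrace) -> int:
    """Process metric: how often the user queried the LM during the session,
    information that a final-output-only evaluation cannot observe."""
    return sum(1 for e in trace.events if e.user_action == "query_lm")

# Example session: one LM query followed by a manual edit.
trace = InteractionTrace(
    events=[
        InteractionEvent(0.0, "query_lm", "draft v1"),
        InteractionEvent(4.2, "edit", "draft v2"),
    ],
    enjoyment=4,
    ownership=3,
)
print(num_queries(trace))  # -> 1
```

The point of the sketch is the separation it forces: the event log supports process metrics, while the survey fields capture first-person experience and preference, none of which are recoverable from the final output alone.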
With four state-of-the-art LMs (three variants of OpenAI's GPT-3 and AI21 Labs' Jurassic-1), we find that better non-interactive performance does not always translate to better human-LM interaction. In particular, we highlight three cases where the results from non-interactive and interactive metrics diverge, and underscore the importance of human-LM interaction for LM evaluation.