“How Close Is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection”, 2023-01-18:
The introduction of ChatGPT has garnered widespread attention in both academic and industrial communities. ChatGPT responds effectively to a wide range of human questions, providing fluent and comprehensive answers that surpass previous public chatbots in safety and usefulness. On one hand, people are curious about how ChatGPT achieves such strong performance and how far it remains from human experts. On the other hand, people are starting to worry about the potential negative impacts that large language models (LLMs) like ChatGPT could have on society, such as fake news, plagiarism, and social security issues.
In this work, we collected tens of thousands of comparison responses from both human experts and ChatGPT, with questions covering open-domain, financial, medical, legal, and psychological areas. We call the collected dataset the Human ChatGPT Comparison Corpus (HC3).
Based on the HC3 dataset, we study the characteristics of ChatGPT’s responses, the differences and gaps from human experts, and future directions for LLMs. We conducted comprehensive human evaluations and linguistic analyses of ChatGPT-generated content compared with that of humans, which reveal many interesting results.
After that, we conduct extensive experiments on how to effectively detect whether a given text was generated by ChatGPT or written by a human.
We build 3 different detection systems, explore several key factors that influence their effectiveness, and evaluate them in different scenarios.
The dataset, code, and models are all publicly available on GitHub.
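As a minimal sketch of the supervised-detection setting described above, the following toy bag-of-words naive-Bayes classifier distinguishes "chatgpt" from "human" answers. The model choice and the training snippets are illustrative assumptions for self-containedness, not the paper's actual detection systems:

```python
import math
from collections import Counter

class NaiveBayesDetector:
    """Toy bag-of-words naive Bayes classifier with add-alpha smoothing."""
    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.word_counts = {}      # label -> Counter of token counts
        self.doc_counts = Counter()  # label -> number of training texts
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)

    def predict(self, text):
        tokens = text.lower().split()
        n_docs = sum(self.doc_counts.values())
        best_label, best_lp = None, float("-inf")
        for label, counts in self.word_counts.items():
            # log prior + sum of smoothed log likelihoods
            lp = math.log(self.doc_counts[label] / n_docs)
            total, V = sum(counts.values()), len(self.vocab)
            for tok in tokens:
                lp += math.log((counts[tok] + self.alpha) / (total + self.alpha * V))
            if lp > best_lp:
                best_label, best_lp = label, lp
        return best_label

# Hypothetical two-example training set, purely for illustration:
det = NaiveBayesDetector()
det.fit(
    ["as an ai language model i provide a detailed and organized answer",
     "honestly idk lol but maybe try asking your doctor"],
    ["chatgpt", "human"],
)
```

A real detector in this setting would be trained on the tens of thousands of HC3 answer pairs rather than two toy snippets, but the classification principle is the same: score a text under each class's token distribution and pick the higher-likelihood class.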
…Results: Several conclusions can be drawn from the results shown in Table 2: Comparing the results of pair-expert and single-expert, we find that it is easier to distinguish ChatGPT-generated content when a comparison pair is provided than when only a single answer is given. Comparing the results of single-expert and single-amateur, we find that the accuracy of experts is much higher than that of amateurs. The helpfulness test gives the proportion of questions for which volunteers think the ChatGPT answer is more helpful to them. Surprisingly, the results show that ChatGPT’s answers are generally considered more helpful than humans’ for more than half of the questions, especially in the finance and psychology areas. By checking the specific answers in these domains, we find that ChatGPT can usually provide more concrete and specific suggestions. However, ChatGPT performs poorly in terms of helpfulness in the medical domain in both English and Chinese. ChatGPT often gives lengthy answers to medical consultations in our collected dataset, whereas human experts may directly give straightforward answers or suggestions, which may partly explain why volunteers consider human answers more helpful in the medical domain. …Eventually, we received more than 200 pieces of feedback, and we summarize the findings as follows:
ChatGPT writes in an organized manner, with clear logic.
ChatGPT tends to offer long and detailed answers.
ChatGPT shows less bias and harmful information.
ChatGPT refuses to answer questions outside its knowledge.
ChatGPT may fabricate facts.
ChatGPT’s responses generally stay strictly focused on the given question, whereas humans’ are divergent and easily shift to other topics.
ChatGPT provides objective answers, while humans prefer subjective expressions.
ChatGPT’s answers are typically formal, while humans’ are more colloquial.
ChatGPT expresses less emotion in its responses, while humans use punctuation and grammatical features in context to convey their feelings.
Compared to ChatGPT, human answers are relatively shorter, but use a larger vocabulary…humans use a more diverse vocabulary in their expressions.
…It is clearly observed that, regardless of whether it is at the text level or the sentence level, the content generated by ChatGPT has relatively lower perplexities (PPLs) than text written by humans. ChatGPT captured common patterns and structures in the text it was trained on, and is very good at reproducing them. As a result, text generated by ChatGPT has relatively concentrated, low PPLs.
Humans have the ability to express themselves in a wide variety of ways, depending on the context, audience, and purpose of the text they are writing. This can include using creative or imaginative elements, such as metaphors, similes, and unique word choices, which can make the text more difficult for GPT-2 to predict. Therefore, human-written texts have more high-PPL values and show a long-tailed distribution, as demonstrated in Figure 4.
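The perplexity contrast can be illustrated with a toy add-alpha-smoothed unigram model standing in for GPT-2 (an assumption made so the example runs without any ML dependencies; the training and test snippets are invented for illustration). Text that reuses the training distribution's common patterns scores a lower PPL than text with rare, creative word choices:

```python
import math
from collections import Counter

def unigram_perplexity(train_tokens, test_tokens, alpha=1.0):
    """Perplexity of test_tokens under an add-alpha-smoothed unigram model.

    PPL = exp(-(1/N) * sum_i log p(token_i)); lower means more predictable.
    """
    counts = Counter(train_tokens)
    vocab = set(train_tokens) | set(test_tokens)
    total, V = sum(counts.values()), len(vocab)
    log_prob = 0.0
    for tok in test_tokens:
        log_prob += math.log((counts[tok] + alpha) / (total + alpha * V))
    return math.exp(-log_prob / len(test_tokens))

train = "the model answers the question in a clear and organized way".split()
# Reuses common patterns from the training text, like ChatGPT-style output:
formulaic = "the model answers the question".split()
# Rare, "creative" word choices, like the long tail of human writing:
creative = "a kaleidoscope of unruly metaphors erupts".split()
```

Under this toy model, `unigram_perplexity(train, formulaic)` comes out lower than `unigram_perplexity(train, creative)`, mirroring the paper's observation that a language model assigns concentrated low PPLs to pattern-reusing text and high PPLs to unpredictable human phrasing.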