Spurred by advancements in scale, large language models (LLMs) have demonstrated the ability to perform a variety of natural language processing (NLP) tasks zero-shot, i.e., without adaptation on downstream data. Recently, the debut of ChatGPT has drawn a great deal of attention from the NLP community because it can generate high-quality responses to human input and self-correct previous mistakes based on subsequent conversations. However, it is not yet known whether ChatGPT can serve as a generalist model that can perform many NLP tasks zero-shot. In this work, we empirically analyze the zero-shot learning ability of ChatGPT by evaluating it on 20 popular NLP datasets covering 7 representative task categories. With extensive empirical studies, we demonstrate both the effectiveness and limitations of the current version of ChatGPT. We find that ChatGPT performs well on many tasks favoring reasoning capabilities (e.g., arithmetic reasoning), while it still faces challenges on specific tasks such as sequence tagging. We additionally provide in-depth analysis through qualitative case studies.
Figure 1: Performance of ChatGPT, GPT-3.5, and models fine-tuned with task-specific data on 20 different datasets. For each reasoning dataset, the better result between zero-shot and zero-shot chain-of-thought is shown. Metrics are the ROUGE-1/2/L average for SAMSum, F1 for CoNLL03, and accuracy for the remaining datasets.
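The zero-shot and zero-shot chain-of-thought settings mentioned in the caption can be illustrated with prompt templates. The sketch below follows the standard two-stage zero-shot chain-of-thought recipe of Kojima et al. (2022) ("Let's think step by step"); the exact prompt wording is an illustrative assumption, not the paper's verbatim templates.

```python
def zero_shot_prompt(question: str) -> str:
    """Direct zero-shot prompt: ask for the answer with no demonstrations."""
    return f"Q: {question}\nA: The answer is"

def zero_shot_cot_prompt(question: str) -> str:
    """Stage 1 of zero-shot chain-of-thought: elicit step-by-step reasoning."""
    return f"Q: {question}\nA: Let's think step by step."

def answer_extraction_prompt(question: str, reasoning: str) -> str:
    """Stage 2: append the model's reasoning, then ask for the final answer."""
    return f"{zero_shot_cot_prompt(question)} {reasoning}\nTherefore, the answer is"

# Hypothetical arithmetic-reasoning query for illustration.
q = "A class has 4 rows of 7 desks. How many desks are there?"
print(zero_shot_prompt(q))
print(answer_extraction_prompt(q, "4 rows times 7 desks is 28 desks."))
```

In the two-stage variant, the model is queried once with the stage-1 prompt, and its generated reasoning is fed back via the stage-2 prompt to extract a concise final answer.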
Although ChatGPT shows some capability as a generalist model that can perform multiple tasks [Zhang et al., 2021], it often performs worse than models fine-tuned on a given task (§4.3 & Figure 1).
The superior reasoning capability of ChatGPT is empirically substantiated in arithmetic reasoning tasks (§4.2.1). However, ChatGPT often underperforms GPT-3.5 on commonsense, symbolic, and logical reasoning tasks, e.g., by generating uncertain responses (§4.2.2).
ChatGPT outperforms GPT-3.5 on natural language inference tasks (§4.2.3) and question answering (reading comprehension) tasks (§4.2.4) that favor reasoning capabilities, such as determining logical relationships within text pairs. Specifically, ChatGPT is better at handling factually consistent text (i.e., better at classifying entailment than non-entailment).
ChatGPT is superior to GPT-3.5 for dialogue tasks (§4.2.5).
ChatGPT generates longer summaries and performs worse than GPT-3.5 on summarization tasks. However, explicitly limiting the summary length in the zero-shot instruction harms summarization quality, leading to even worse performance (§4.2.6).
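The contrast between an unconstrained and a length-limited zero-shot summarization instruction can be sketched as below. The template wording is a hypothetical assumption for illustration; the paper's exact instructions are not reproduced here.

```python
from typing import Optional

def summarization_prompt(article: str, max_words: Optional[int] = None) -> str:
    """Build a zero-shot summarization prompt (illustrative template only)."""
    if max_words is None:
        instruction = "Summarize the following article."
    else:
        # Length-constrained variant, which the evaluation found to further
        # hurt summary quality rather than help (see §4.2.6).
        instruction = (
            f"Summarize the following article in no more than {max_words} words."
        )
    return f"{instruction}\n\nArticle: {article}\n\nSummary:"

print(summarization_prompt("Example article text."))
print(summarization_prompt("Example article text.", max_words=30))
```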
Despite showing promise as generalist models, both ChatGPT and GPT-3.5 face challenges on certain tasks such as sequence tagging (§4.2.7).
ChatGPT's sentiment analysis ability comes close to that of GPT-3.5 (§4.2.8).