“GPT-3 As Knowledge Worker: A Zero-Shot Evaluation of AI CPA Capabilities”, Jillian Bommarito, Michael Bommarito, Daniel Martin Katz, Jessica Katz (2023-01-11):

[Github] The global economy is increasingly dependent on knowledge workers to meet the needs of public and private organizations. While there is no single definition of knowledge work, organizations and industry groups still attempt to measure individuals’ capability to engage in it. The most comprehensive assessment of capability readiness for professional knowledge workers is the Uniform CPA Examination developed by the American Institute of Certified Public Accountants (AICPA).

In this paper, we experimentally evaluate OpenAI’s text-davinci-003 and prior versions of GPT on both a sample Regulation (REG) exam and an assessment of over 200 multiple-choice questions based on the AICPA Blueprints for legal, financial, accounting, technology, and ethical tasks.

First, we find that text-davinci-003 achieves a correct rate of 14.4% on a sample REG exam section, underperforming human capabilities on quantitative reasoning in zero-shot prompts. [Inner-monologue would improve this.] Second, text-davinci-003 appears to be approaching human-level performance on the Remembering & Understanding and Application skill levels in the Exam absent calculation. For best prompt and parameters, the model answers 57.6% of questions correctly, better than the 25% guessing rate, and its top two answers are correct 82.1% of the time, indicating strong non-entailment. Finally, we find that recent generations of GPT-3 demonstrate material improvements on this assessment, rising from 30% for text-davinci-001 to 57% for text-davinci-003.
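The top-two figure above is a top-k accuracy: a question counts as a hit if the correct choice appears among the model's k highest-ranked answers. As a minimal sketch (the ranked-answer data below is hypothetical, not from the paper):

```python
def top_k_accuracy(ranked_answers, correct, k):
    """Fraction of questions whose correct choice appears in the model's top-k ranked answers."""
    hits = sum(1 for ranked, c in zip(ranked_answers, correct) if c in ranked[:k])
    return hits / len(correct)

# Hypothetical data: each inner list is the model's answer choices, ranked by preference.
ranked = [["B", "A", "C", "D"], ["A", "C", "B", "D"], ["D", "B", "A", "C"]]
answers = ["B", "C", "A"]
top1 = top_k_accuracy(ranked, answers, 1)  # 1/3: only the first question is right outright
top2 = top_k_accuracy(ranked, answers, 2)  # 2/3: the second question's answer is ranked second
```

The gap between top-1 (57.6%) and top-2 (82.1%) is what motivates the "approaching human-level" framing: the correct answer is usually among the model's leading candidates even when it is not ranked first.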

These findings strongly suggest that large language models have the potential to transform the quality and efficiency of future knowledge work.

[Keywords: knowledge work, artificial intelligence, natural language processing, accounting, finance, law]


While text-davinci-003 and ChatGPT have demonstrated state-of-the-art performance on a wide range of tasks in zero-shot and few-shot contexts, there was previously little reason to believe that these models could perform even reasonably well in general assessments across the domains of finance, law, and accounting. However, in recent work on the Bar Exam, the authors have shown that text-davinci-003 could achieve near-parity with human test-takers in two of seven sections of the Multistate Bar Exam (MBE); more strikingly, generation-over-generation model performance suggests that an LLM like GPT-3.5 may be capable of passing the Bar Exam in the near future.


The Examination is divided into 4 sections that test-takers sit for independently: Auditing and Attestation (AUD), Business Environment and Concepts (BEC), Financial Accounting and Reporting (FAR), and Regulation (REG). Each section of the Exam is divided into at least 4 testlets that feature scenarios, multiple choice questions, calculated amounts, short answer, and related evidence and research material. The human pass rates of Exam sections are presented in Table 1; the AICPA does not publish statistics related to per-question or per-section test-taker accuracy.


Assessment 1: As expected, the quantitative reasoning and arithmetic required in Assessment 1 resulted in substantially lower zero-shot performance than observed in Assessment 2. Out of 24 questions that required the test-taker to provide a numeric answer based on facts and work papers, GPT-3.5 typically answered only one, two, or three questions correctly, resulting in an average range across all parameters and prompts of 5.7–9.4%. While it is arguable whether 0% is the true baseline for this task, it is clear that such zero-shot performance is not on par with human test-takers.

GPT-3.5 also struggled with arithmetic on the 15 MCQs in Assessment 1, scoring above random chance for some, but not all, prompts and parameters. Because a number of questions include more than four choices, the true baseline rate of guessing is 22.67%, not 25%, but despite this, the best prompts and parameters were only 4–6% above the baseline rate.
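The 22.67% baseline is just the average of each question's individual guessing probability. The paper does not state the exact mix of choice counts, but as a sketch, one hypothetical mix consistent with the stated figure is 8 four-choice and 7 five-choice questions:

```python
from fractions import Fraction

def guessing_baseline(choice_counts):
    """Expected fraction correct under uniform random guessing,
    given the number of choices for each question."""
    return sum(Fraction(1, n) for n in choice_counts) / len(choice_counts)

# Hypothetical mix (an assumption, not from the paper): 8 four-choice
# and 7 five-choice questions reproduce the stated 22.67% baseline.
counts = [4] * 8 + [5] * 7
baseline = float(guessing_baseline(counts))  # (8/4 + 7/5) / 15 = 3.4/15 ≈ 0.2267
```

Any mix of choice counts averaging to the same expected hits per question yields the same baseline, which is why the figure falls below the familiar 25%.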

Based on a qualitative review of these questions and the model’s responses, we believe that performance could be improved somewhat in few-shot evaluations. Further, we believe that even some zero-shot performance improvements could be achieved by expanding the prompt to include “scratchpads” for common relationships or equations, as might be seen on problems that feature common work papers like a statement of cash flows; however, in this paper, we focus on a zero-shot, “out-of-the-box” evaluation, and so these improvements are left for future research.

Figure 1: Performance of GPT-3.5 by section of AICPA Exam Blueprints for best prompt and parameter, with correct rate including second-best answer in dashed region. Error bars are ±1 standard error of the mean. Note that GPT-3.5 is not assessed on Analysis or Evaluation tasks, unlike human test-takers, and that the percentage of questions correct does not scale linearly with score or passage.
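One common way to compute the ±1 standard-error bars on a percent-correct figure like this is the binomial standard error of a proportion; the sketch below assumes that convention (the paper does not specify its exact method), using the headline 57.6% over roughly 200 questions:

```python
import math

def sem_proportion(p, n):
    """Standard error of the mean for a Bernoulli proportion p estimated from n questions."""
    return math.sqrt(p * (1 - p) / n)

# Illustrative only: 57.6% correct over ~200 questions gives an error bar
# of roughly 3.5 percentage points.
sem = sem_proportion(0.576, 200)
```

Per-section bars in Figure 1 would be wider than this, since each Blueprint section contains only a fraction of the ~200 questions.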
Figure 2: Comparison of model performance across GPT-3 generations. For text-davinci-003, the average is reported across all runs; for other models, a subset of representative prompts and parameters were included. GPT-2 was unable to reliably respond to the prompt as instructed and questions were larger than its maximum input token length. More details are available in source and data in the online SI.


Acknowledgments: Although the original draft of this paper was written by the authors, portions of this paper were edited by text-davinci-003 for formatting and clarity.