“GPT-3 Takes the Bar Exam”, Michael Bommarito II, Daniel Martin Katz, 2022-12-29:

[Github; other attempt] Nearly all jurisdictions in the United States require a professional license exam, commonly referred to as “the Bar Exam”, as a precondition for law practice. To even sit for the exam, most jurisdictions require that an applicant complete at least 7 years of post-secondary education, including 3 years at an accredited law school. In addition, most test-takers also undergo weeks to months of further, exam-specific preparation. Despite this investment of time and capital, ~1⁄5 of test-takers still score under the rate required to pass the exam on their first try. In the face of a complex task that requires such depth of knowledge, what, then, should we expect of the state-of-the-art in “AI”?

In this research, we document our experimental evaluation of the performance of OpenAI’s text-davinci-003 model, often referred to as GPT-3.5, on the multiple-choice Multistate Bar Examination (MBE) section of the exam.

While we find no benefit in fine-tuning over GPT-3.5’s zero-shot performance at the scale of our training data, we do find that hyperparameter optimization and prompt engineering positively impact GPT-3.5’s zero-shot performance.
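Prompt engineering here means systematically varying how a question is presented to the model. As a minimal sketch (the template wording and function name below are illustrative assumptions, not the paper’s actual prompt), a zero-shot multiple-choice prompt for an MBE-style question might be assembled like this:

```python
def build_prompt(question: str, choices: list[str]) -> str:
    """Assemble a zero-shot multiple-choice prompt for an MBE-style
    question. The template wording is hypothetical; the paper tested
    several variants and does not publish one canonical form."""
    lines = ["Please answer the following multiple-choice question.", ""]
    lines.append(question)
    lines.append("")
    # Label the four answer choices (A)-(D), as on the real MBE.
    for letter, text in zip("ABCD", choices):
        lines.append(f"({letter}) {text}")
    lines.append("")
    lines.append("Answer:")
    return "\n".join(lines)
```

Hyperparameter optimization then sweeps decoding settings (e.g. temperature, top-p) for a fixed template and scores each combination on a held-out question set.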

For the best prompt and parameters, GPT-3.5 achieves a headline correct rate of 50.3% on a complete NCBE MBE practice exam, far in excess of the 25% baseline guessing rate, and performs at a passing rate for both Evidence and Torts. GPT-3.5’s ranking of responses is also highly correlated with correctness; its top two and top three choices are correct 71% and 88% of the time, respectively, indicating very strong non-entailment performance.
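The top-two and top-three figures are top-k accuracies over the model’s ranked answer choices. A minimal sketch of that metric (the data below is a made-up toy example, not the paper’s results):

```python
def top_k_accuracy(rankings: list[list[str]], answers: list[str], k: int) -> float:
    """Fraction of questions whose correct answer appears among the
    model's k highest-ranked choices (each ranking is best-first)."""
    hits = sum(1 for ranked, correct in zip(rankings, answers)
               if correct in ranked[:k])
    return hits / len(answers)

# Toy illustration: three questions, each with the model's full
# ranking of the four MBE choices and the correct letter.
rankings = [["B", "A", "D", "C"],
            ["C", "B", "A", "D"],
            ["A", "D", "B", "C"]]
answers = ["A", "C", "D"]
```

With this toy data, top-1 accuracy is 1/3 while top-2 accuracy is 3/3, mirroring how the paper’s 50.3% headline rate rises to 71% and 88% as k grows.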

While our ability to interpret these results is limited by nascent scientific understanding of LLMs and the proprietary nature of GPT, we believe that these results strongly suggest that an LLM will pass the MBE component of the Bar Exam in the near future.

Figure 1: text-davinci-003 Performance by Question Category: Summary of performance by question category for GPT-3.5 and NCBE-Reported Students.
Figure 2: Progression of models over time.

Fine-tuning: LLMs like GPT-3.5 have received so much interest in part because their zero-shot or few-shot performance is so good. Despite this, in some circumstances, subsequent supervised or unsupervised re-training of some or all layers of an LLM may improve performance.26, 27 OpenAI does make some retraining or “fine-tuning” capabilities available through its API, and these API endpoints do allow for some control of the training process, such as learning rates or batch sizes. We attempted to fine-tune text-davinci-003 by providing it with 200 unseen, simulated MBE bar exam questions with correct and incorrect explanations. We provided the training samples both with and without explanatory text from the answer guide. In total, we trained 6 fine-tuned models, altering training prompts, training responses, batch size, learning rate, and prompt weighting. However, in all cases, the fine-tuned model underperformed text-davinci-003 itself. Due to the scarcity of high-quality data for training and assessment, we did not pursue fine-tuning of GPT models further; these results possibly confirm the LLM fine-tuning risks observed by others.