“Large Language Models As Fiduciaries: A Case Study Toward Robustly Communicating With Artificial Intelligence Through Legal Standards”, 2023-01-25:
Artificial intelligence (AI) is taking on increasingly autonomous roles, e.g., browsing the web as a research assistant and managing money. But specifying goals and restrictions for AI behavior is difficult. Just as parties to a legal contract cannot foresee every potential “if-then” contingency of their future relationship, we cannot specify desired AI behavior for all circumstances. Legal standards facilitate robust communication of inherently vague and underspecified goals. Instructions (in the case of language models, “prompts”) that employ legal standards will allow AI agents to develop shared understandings of the spirit of a directive that can adapt to novel situations, and to generalize expectations regarding acceptable actions to take in unspecified states of the world. Standards have built-in context that is lacking from other goal-specification languages, such as plain language and programming languages.
Through an empirical study on thousands of evaluation labels we constructed from US court opinions, we demonstrate that:
large language models (LLMs) are beginning to exhibit an “understanding” of one of the most relevant legal standards for AI agents: fiduciary obligations. Performance comparisons across models suggest that, as LLMs continue to exhibit improved core capabilities, their legal standards understanding will also continue to improve. OpenAI’s latest LLM has 78% accuracy on our data, their previous release has 73% accuracy, and a model from their 2020 GPT-3 paper has 27% accuracy (worse than random, 50%) [U-shaped scaling].
Our research is an initial step toward a framework for evaluating AI understanding of legal standards more broadly, and for conducting reinforcement learning with legal feedback (RLLF).
[Keywords: artificial intelligence, AI, machine learning, natural language processing, NLP, self-supervised learning, reinforcement learning, RL, large language models, foundation models, AI Safety, AI Alignment, AI & Law, AI policy, computational legal studies, computational law, standards, prompt engineering]
Introduction
The Goal Specification Problem
All Rewards are Proxies
The Real-World Exacerbates Goal Misspecification
More Capable AI May Further Exacerbate Misspecification
Specification Languages As Solutions
Legal Standards: The Spirit Of Directives
Rules vs. Standards
The Fiduciary Duty Standard
FAI as a Fiduciary to Humans
Empirical Case Study: Fiduciary Standard Understanding
Converting Court Opinions to Evaluation Labels
Zero-Shot LLM Evaluation
Leveraging Legal Reward Data for Reinforcement Learning
Conclusion
…1. Converting Court Opinions to Evaluation Labels: We undertook the following process. First, a legal data provider, Fastcase, exported the full text of the more than 18,000 court opinions from the U.S. Federal District Courts and U.S. State Courts from the past 5 years (January 2018–December 2022) that mentioned a breach of fiduciary duty. Then we filtered this to the 1,000 cases that discussed fiduciary duties most extensively.
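The paper does not describe its filtering criterion precisely; a minimal sketch of one plausible approach, ranking opinions by the density of fiduciary-duty mentions (the function names and the regex heuristic are assumptions, not the authors' method), might look like:

```python
import re

# Hypothetical heuristic: how densely an opinion discusses fiduciary duties,
# measured as pattern matches per word of opinion text.
FIDUCIARY_RE = re.compile(r"fiduciary (?:dut(?:y|ies)|obligations?)", re.IGNORECASE)

def fiduciary_density(opinion_text: str) -> float:
    words = opinion_text.split()
    if not words:
        return 0.0
    return len(FIDUCIARY_RE.findall(opinion_text)) / len(words)

def top_k_by_density(opinions: list[str], k: int) -> list[str]:
    # Keep the k opinions that discuss fiduciary duties most extensively.
    return sorted(opinions, key=fiduciary_density, reverse=True)[:k]
```

In practice one would rank all ~18,000 exported opinions this way (or with a richer relevance model) and keep the top 1,000.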
From here, we use a state-of-the-art LLM [text-davinci-003] to construct the evaluation data. Recent research has demonstrated that LLMs can produce high-quality evaluation data. A large-scale study concluded that humans rate LLM-generated examples “as highly relevant and agree with 90–100% of labels, sometimes more so than corresponding human-written datasets”, and that “overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.” Another research team found that training LLMs on LLM-generated data “rivals the effectiveness of training on open-source manually-curated datasets.” Both of these papers used smaller LLMs than we do. But more importantly, our models create evaluation data directly from the official text of court opinions (rather than from human-generated research data). The models are tasked with converting the unstructured text to structured text with high fidelity. This grounds our evaluation-data creation closely in some of the highest-quality and most trustworthy labeled data available (U.S. court opinions).
…With this fiduciary-duty-dense subset of recent cases, we then applied a process that makes calls to an LLM with prompts that we carefully engineered to ask the model to convert the text of a court opinion into temporally ordered state-action-reward tuples. The goal is to have n > 1 Time Steps, where each Time Step has 3 components: the State of the world relevant to an Action taken, the Action taken by an alleged fiduciary or related person, and the Legal Reward as determined by the court for that Action in that State. The LLM is prompted to abstract away much of the textual content unrelated to facts of a case, such as the discussion of other court cases being cited. We want to focus on extracting descriptions of behavior related to fiduciary obligations.
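The paper does not publish its prompts or output format; a hypothetical sketch of the target schema, parsing an LLM response into ordered (State, Action, Legal Reward) Time Steps (the JSON layout and field names are assumptions), could be:

```python
import json
from dataclasses import dataclass

@dataclass
class TimeStep:
    state: str          # facts of the world relevant to the action taken
    action: str         # what the alleged fiduciary (or related person) did
    legal_reward: str   # the court's view: 'positive', 'negative', or 'unsure'

def parse_time_steps(llm_output: str) -> list[TimeStep]:
    """Parse an LLM response (assumed to be a JSON array of objects)
    into a temporally ordered list of Time Steps."""
    steps = [TimeStep(r["state"], r["action"], r["legal_reward"])
             for r in json.loads(llm_output)]
    allowed = {"positive", "negative", "unsure"}
    if any(s.legal_reward not in allowed for s in steps):
        raise ValueError("unexpected legal reward label")
    return steps
```

Constraining the model to a small closed set of reward labels is what later makes zero-shot prediction accuracy straightforward to score.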
This prompt/LLM-generation process is applied successively from the beginning to the end of each opinion in a way that provides temporary “short-term memory”, letting the LLM coherently construct a temporal narrative of (1) who the alleged fiduciary and other key entities were, (2) what transpired, and (3) what judgments the court made on the actions that the people and/or companies took. [All of the documents are far too long to fit into the context window of the model, and we leverage abstractions and methods from LangChain to handle this.] The entire process is conducted recursively over each court opinion in a way that allows the LLM to iterate on and improve the results, optimizing for concise and accurate final output. Here is an example output.
Time Step 1:
STATE: M&T Bank Corporation sponsors a 401(k) retirement plan known as the M&T Bank Corporation Retirement Saving Plan (“the Plan”) for its employees. The Plan is administered by the M&T Bank Employee Benefit Plans Committee, which is the Plan’s named fiduciary, and sponsored by M&T Bank.
ACTION: M&T Bank appointed or removed members of the Committee.
LEGAL REWARD: In the eyes of this court, this action is ‘unsure’ for M&T Bank.

Time Step 2:
STATE: The Plan offered participants 23–34 investment options throughout the putative class period.
ACTION: M&T Bank expanded their proprietary funds offerings in 2011, after M&T purchased Wilmington Trust and added 6 of Wilmington’s expensive, poor-performing mutual fund offerings.
LEGAL REWARD: In the eyes of this court, this action is ‘negative’ for M&T Bank.…
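The chunked, “short-term memory” pass described above can be sketched as a refine-style loop; the `llm` callable and its structured return value are assumptions (the authors build on LangChain abstractions rather than this exact code):

```python
def extract_over_chunks(chunks, llm):
    """Walk a long opinion chunk by chunk, carrying a running summary forward
    so the model can build one coherent temporal narrative of the case."""
    summary, steps = "", []
    for chunk in chunks:
        prompt = (f"Narrative so far:\n{summary}\n\n"
                  f"Next portion of the opinion:\n{chunk}\n\n"
                  "Update the narrative and list any new (state, action, reward) tuples.")
        # Hypothetical structured return: (updated summary, new tuples)
        summary, new_steps = llm(prompt)
        steps.extend(new_steps)
    return summary, steps
```

The design choice is the key point: because each call sees only the running summary plus one chunk, arbitrarily long opinions can be processed within a fixed context window.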
After this data structuring / evaluation generation process, we provide the results to the LLM and ask it to “reflect” on the quality of the output. We filter out opinions where the LLM was not confident that the distilled results are relevant for producing substantive descriptions of real-world fiduciary obligations. [We also use the LLM to generate plain language summaries of the case context, whether the court overall believes a fiduciary duty was implicated, and the primary legal issues at play in each case.] The final set for evaluation included just over 500 opinions (which have a median of 7 Time Steps each).
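A minimal sketch of the “reflection” filter, assuming the model is asked to self-rate the relevance of its own distillation on a 0–100 scale (the prompt wording and threshold are assumptions, not the paper's):

```python
def passes_reflection(llm, structured_output: str, threshold: int = 70) -> bool:
    """Keep an opinion only if the model is confident its distilled Time Steps
    substantively describe real-world fiduciary obligations."""
    reply = llm(
        "Rate 0-100 how confident you are that the following time steps are "
        "relevant, substantive descriptions of real-world fiduciary conduct. "
        f"Answer with a single integer.\n\n{structured_output}"
    )
    try:
        return int(reply.strip()) >= threshold
    except ValueError:
        return False  # unparsable self-rating: discard the opinion
```

Opinions failing this check are dropped, which is how the ~1,000 filtered cases shrink to the final set of just over 500.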
…The data happens to be relatively balanced across those two outcomes, so a simple baseline of always predicting that the legal reward is positive (or negative) yields an accuracy of ~50%.
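For concreteness, the majority-class baseline is just the frequency of the most common label:

```python
from collections import Counter

def majority_baseline_accuracy(labels: list[str]) -> float:
    """Accuracy of always predicting the most frequent label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

# Roughly balanced labels => a baseline accuracy near 50%.
roughly_balanced = ["positive"] * 51 + ["negative"] * 49
```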
We compared performance across models. GPT-3.5 (text-davinci-003) obtains an accuracy of 78%. The immediately-preceding state-of-the-art GPT-3 release (text-davinci-002) obtains an accuracy of 73%. text-davinci-002 was state-of-the-art on most natural-language benchmark tasks until text-davinci-003 was released on November 28, 2022. A smaller OpenAI LLM from 2020, curie, scored 27%, worse than guessing at random. These results (Table 1) suggest that, as models continue to improve, their legal-standards understanding will continue to improve. The more recent models are relatively well calibrated in their confidence in their predictions. Along with the prediction of the Reward class, the model was asked for an “integer 0–100 for your estimate of confidence in your answer (1 is low confidence and 99 is high).” The accuracy of text-davinci-003 on predictions where its confidence was greater than “50” increases to 81%. The older curie LLM did not produce confidence scores at all (when prompted to do so).
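The confidence-filtered accuracy metric can be sketched as follows, handling models like curie that emit no usable confidence score (function names and the None convention are assumptions):

```python
def accuracy(preds, golds):
    """Fraction of predictions matching the court-derived reward labels."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def accuracy_above_confidence(preds, golds, confidences, min_conf=50):
    """Accuracy restricted to predictions whose self-reported confidence
    exceeds min_conf; None if the model produced no confidence scores."""
    kept = [(p, g) for p, g, c in zip(preds, golds, confidences)
            if c is not None and c > min_conf]
    if not kept:
        return None
    return sum(p == g for p, g in kept) / len(kept)
```

A well-calibrated model should score higher on the high-confidence subset than overall, as text-davinci-003 does (81% vs 78%).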
Table 1: Prediction performance.

Metric                         curie   text-davinci-002   text-davinci-003
Accuracy                       27%     73%                78%
Accuracy w/ High Confidence    NA      76%                81%

These are initial, provisional results. We are in the process of having a team of paralegals review and validate the evaluation data. They will (as needed) make manual corrections to the structured data. After this process, and after generating a larger evaluation dataset, we will release a “fiduciary duty understanding” dataset. We will also update these performance evaluations on the larger labeled dataset.