This Article empirically examines whether a computational language model can read and understand consumer contracts. In recent years, language models have heralded a paradigm shift in artificial intelligence, characterized by unprecedented machine capabilities and new societal risks. These models, which are trained on immense quantities of data to predict the next word in a sequence, can perform a wide range of complex tasks. In the legal domain, language models can interpret statutes, draft transactional documents, and, as this Article will explore, inform consumers of their contractual rights and obligations.
To showcase the opportunities and challenges of using language models to read consumer contracts, this Article studies the performance of GPT-3, one of the first commercially available large language models. The case study evaluates the model’s ability to understand consumer contracts by testing its performance on a novel dataset comprising questions relating to online terms of service.
Although the results are not definitive, they offer several important insights. First, the model appears able to exploit subtle informational cues when answering questions about consumer contracts. Second, the model performs poorly in answering certain questions about contractual provisions that favor the rights and interests of consumers, suggesting that the model may contain an anti-consumer bias. Third, the model is brittle in unexpected ways. Performance in the case study was highly sensitive to the wording of questions but surprisingly indifferent to variations in contractual language: performance decreased dramatically when the questions presented to the model were less readable (i.e., more difficult for a human to read), yet did not decrease on longer or less readable contractual texts.
These preliminary findings suggest that while language models have the potential to empower consumers, they also have the potential to provide misleading advice and entrench harmful biases. Leveraging the benefits of language models in performing legal tasks, such as reading consumer contracts, and confronting the associated challenges requires a combination of thoughtful engineering and governance. Before language models are deployed in the legal domain, policymakers should explore technical and institutional safeguards to ensure that language models are used responsibly and align with broader social values.
…The case study examines the degree to which the model can understand certain consumer contracts. To conduct the case study, I created a novel dataset comprising 200 yes/no legal questions relating to the terms of service of the 20 most-visited US websites, including Google, Amazon, and Facebook, and tested the model’s ability to answer these questions. The results are illuminating. They shed light on the opportunities and risks of using GPT-3 to inform consumers of their contractual rights and obligations and offer new insights into the inner workings of language models.
Table 1: Sample of Questions.

Question | Correct Answer
Will Google always allow me to transfer my content out of my Google account? | No
Does Amazon sometimes give a refund even if a customer hasn’t returned the item they purchased? |
Is the length of the billing cycle period the same for all Netflix subscribers? | No
Do I need to use my real name to open an Instagram account? | No
…Random guessing yields, on average, 50% accuracy. The second baseline is the majority class. The correct answer to 55% of the questions in the case study is “no,” and the correct answer to the remaining 45% is “yes.” Responding with the majority class (“no”) to every question therefore yields the majority class baseline of 55% accuracy. The third baseline, which I call contract withheld, involves querying GPT-3 on the questions without displaying the contract excerpts, i.e., testing the model on all 200 questions while withholding the corresponding terms of service. If accuracy is no higher when GPT-3 is shown both the contract and the question than when it is shown only the question, the model fails to demonstrate that it understands the contracts; it could simply be responding to cues in the questions or relying on data memorized during pretraining.106 If, however, accuracy is higher when GPT-3 is shown both the contract and the question, this suggests that the model uses the contract to answer the questions rather than merely responding to cues in the questions or relying on memorized data.
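The arithmetic behind these baselines can be sketched in a few lines of code. This is a minimal illustration using the case study’s class split (55% “no,” 45% “yes”); the labels and predictions below are hypothetical placeholders, not the actual questions or model outputs.

```python
# Illustrative baseline arithmetic for 200 yes/no questions,
# mirroring the case study's class split: 55% "no", 45% "yes".
labels = ["no"] * 110 + ["yes"] * 90  # hypothetical stand-in labels

def accuracy(predictions, labels):
    """Fraction of predictions that match the correct answers."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Baseline 1: random guessing yields 50% accuracy in expectation.
random_baseline = 0.5

# Baseline 2: majority class -- answer "no" to every question.
majority_baseline = accuracy(["no"] * 200, labels)
print(f"majority class: {majority_baseline:.0%}")  # prints "majority class: 55%"
```

The contract withheld baseline is computed with the same `accuracy` function, applied to answers the model produces when it is shown only the questions.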
…GPT-3 correctly answered 77% of the questions in the case study.125 In terms of accuracy, performance exceeded all three baselines, as illustrated in Figure 1 (below). That is, performance in the test was better than (1) random chance (randomly guessing answers); (2) the majority class (answering “no” to all questions); and (3) the contract withheld baseline (responding to questions without being shown the contract excerpts). Beating this final baseline by 16.5 percentage points indicates that performance was considerably better when GPT-3 was shown the contract excerpt than when it was not. This result suggests that GPT-3 uses the contract to answer the questions rather than simply responding to cues in the questions or relying on data memorized during pretraining.126
Figure 1: Comparison of Accuracy with Baselines.
…2. Calibration: There was a positive correlation (r ≈ 0.22) between the model’s accuracy and its confidence in its predictions. That is, on average, GPT-3 was more confident in its correct responses than in its incorrect responses. This result suggests that GPT-3’s performance in the test was, to a modest degree, well calibrated and, all things being equal, encourages us to place greater trust in the model’s more confident predictions.
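The calibration check can be sketched as a simple correlation between per-question confidence and correctness. The numbers below are hypothetical illustrations (the case study itself reports r ≈ 0.22); the correlation function is a standard Pearson computation written out for transparency.

```python
# Minimal sketch of the calibration analysis: correlate the model's
# confidence in each answer with whether that answer was correct.
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-question data: the model's confidence in its chosen
# answer, and a 1/0 indicator for whether that answer was correct.
confidence = [0.9, 0.8, 0.85, 0.6, 0.55, 0.7, 0.95, 0.5]
correct = [1, 1, 1, 0, 0, 1, 1, 0]

r = pearson_r(confidence, correct)
print(f"r = {r:.2f}")  # positive: higher confidence accompanies correct answers
```

A positive r indicates that, on average, the model assigns higher confidence to its correct answers, which is the pattern the case study observed.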