“OpenAI Cribbed Our Tax Example, But Can GPT-4 Really Do Tax?”, Andrew Blair-Stanek, Nils Holzenberger, Benjamin Van Durme, 2023-08-14:

[tl;dr: yes] In this article, the authors explain where OpenAI got the tax law example in its livestream demonstration of GPT-4, why GPT-4 got the wrong answer, and how it fails to reliably calculate taxes.

When OpenAI debuted its GPT-4 AI language model in a 2023-03-14 livestream, it used a tax law example to demonstrate the model’s power. The presenter pasted in what he called “about 16 pages’ worth of tax code” and then 7 sentences of facts about married couple Alice and Bob, who have a son Charlie and $36,991 and $41,990 of income, respectively.

These 7 sentences about Alice, Bob, and Charlie come word-for-word from a handcrafted data set (SARA) we developed at Johns Hopkins University and published in 2020 for training and measuring AI models for reasoning over statutory language. …In the livestream introducing GPT-4, OpenAI used one of our SARA tax cases verbatim, describing it as a real tax example, even though SARA is a simplified academic data set. In the demo, OpenAI also used our heavily edited SARA version of the Internal Revenue Code (IRC). OpenAI incorrectly thought GPT-4 had correctly calculated the tax liability because its answer matched the SARA answer, although our IRC edits change the result from the actual IRC. [In other words, GPT-4 gave the right answer: if it had instead given the ‘real’ answer, despite being explicitly told otherwise by 16 pages of overriding prompt, that would have been the error!]

We tested GPT-4 on the entire SARA data set. It gets tax liabilities exactly right around 1⁄3rd of the time, and miscalculates tax liabilities by over 10% nearly a quarter of the time. GPT-4 often misreads even our simplified version of the IRC. [So, it does about 1⁄3rd better than all previous AI systems…?]

In the livestream, the presenter warned, “You should always check with your tax adviser.” Wise advice.