“Evaluating the Text-To-SQL Capabilities of Large Language Models”, Nitarshan Rajkumar, Raymond Li, Dzmitry Bahdanau, 2022-03-15:

We perform an empirical evaluation of Text-to-SQL capabilities of the Codex language model. We find that, without any finetuning, Codex is a strong baseline on the Spider benchmark; we also analyze the failure modes of Codex in this setting. Furthermore, we demonstrate on the GeoQuery and Scholar benchmarks that a small number of in-domain examples provided in the prompt enables Codex to perform better than state-of-the-art models finetuned on such few-shot examples.

…Prompt design is critical for performance: As seen in Table 2, providing the question alone results in a low 8.3% execution accuracy. There is a progressive improvement to 56.8% as schema information is introduced in ‘API Docs’, to 59.9% when valid SQL and foreign key information is used in ‘Create Table’, and to 67.0% when database content is introduced with ‘Create Table + Select 3’.
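
To make the ‘Create Table + Select 3’ idea concrete, here is a minimal sketch of how such a prompt could be assembled from a SQLite database: the `CREATE TABLE` statements (which carry foreign-key information) followed by a few sample rows per table, then the question. The exact prompt layout used in the paper is not reproduced in this excerpt, so the comment format, the function names `build_prompt` and `make_demo_db`, and the toy `singer`/`concert` schema are all illustrative assumptions.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory database (hypothetical schema, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE singer (
            singer_id INTEGER PRIMARY KEY,
            name TEXT,
            country TEXT
        );
        CREATE TABLE concert (
            concert_id INTEGER PRIMARY KEY,
            singer_id INTEGER,
            year INTEGER,
            FOREIGN KEY (singer_id) REFERENCES singer(singer_id)
        );
        INSERT INTO singer VALUES (1, 'Ana', 'Portugal'), (2, 'Bo', 'Norway'),
                                  (3, 'Cy', 'Kenya'), (4, 'Di', 'Chile');
        INSERT INTO concert VALUES (10, 1, 2019), (11, 2, 2020);
        """
    )
    return conn


def build_prompt(conn: sqlite3.Connection, question: str, n_rows: int = 3) -> str:
    """Assemble schema + sample rows + question into one text prompt.

    The layout is an assumption, not the paper's exact template.
    """
    parts = []
    # sqlite_master stores the original CREATE TABLE text for each table.
    tables = conn.execute(
        "SELECT name, sql FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    for name, create_sql in tables:
        parts.append(create_sql.strip() + ";")
        # Table names come from the catalog itself; n_rows is coerced to int.
        cur = conn.execute(f'SELECT * FROM "{name}" LIMIT {int(n_rows)}')
        columns = [d[0] for d in cur.description]
        rows = cur.fetchall()
        parts.append("/*")
        parts.append(f"{len(rows)} example rows from table {name}:")
        parts.append("\t".join(columns))
        for row in rows:
            parts.append("\t".join(str(value) for value in row))
        parts.append("*/")
    parts.append(f"-- Question: {question}")
    parts.append("SELECT")  # leave the query open for the model to complete
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_prompt(make_demo_db(), "How many singers are from Norway?"))
```

Dropping the sample-row block from this sketch would give something closer to the ‘Create Table’ setting, and dropping the schema as well would leave only the question, the weakest setting reported above.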