‘inner monologue (AI)’ directory

Gwern

Inner Monologue (by analogy to human inner-monologue) is a family of prompt engineering tricks for large language models which make them solve problems in a ‘step by step’ verbalized way; it is particularly effective on multi-step tasks with ‘one right answer’ such as math word & programming problems.

It can be induced by few-shot examples of several solved problems, finetuning on a corpus (eg. InstructGPT), or with a carefully-chosen prompt inducing a ‘dialogue’ (original discovery) or instructions (eg. “let’s think step by step”). It can be combined with better sampling strategies like best-of ranking or majority voting or a critic, self-distillation on its monologue outputs (possibly repeatedly), additional data like unit tests or retrieval results, & access to oracles like REPLs or humans.
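As an illustration of the majority-voting idea, here is a minimal sketch: sample several monologues for the same problem, extract each final answer, and keep the most common one. The function name `majority_vote` and the sample answers are hypothetical; how monologues are sampled and how final answers are extracted is left out and would depend on the model and prompt.

```python
from collections import Counter


def majority_vote(final_answers):
    """Return the most common final answer and its share of the votes."""
    # Normalize whitespace so "42" and "42 " count as the same answer.
    counts = Counter(answer.strip() for answer in final_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(final_answers)


# Hypothetical final answers extracted from 5 sampled monologues for one problem.
samples = ["42", "42", "41", "42 ", "24"]
print(majority_vote(samples))  # ('42', 0.6)
```

The vote share doubles as a rough confidence signal: a narrow plurality suggests the monologues disagree and the answer deserves less trust.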

It was discovered in July 2020 by early OA API & AI Dungeon 2 users who found that GPT-3/‘Dragon’ would fail to solve most simple arithmetic problems like multiplication (as found by the GPT-3 paper), but could be coaxed into solving them by setting up a fictional dialogue in which the player gets a ‘character’ to solve the problem step by step. This discovery was widely discussed among GPT-3 enthusiasts, and highlighted on my GPT-3 page as a remarkable emergent capability of GPT-3 unlike GPT-2 or earlier models. It has been ‘rediscovered’ repeatedly since (by EleutherAI, and then multiple academic groups, eg. as “scratchpad” or “chain-of-thought”).

Inner-monologue is interesting because it: is a simple prompting technique which dramatically improves benchmark performance (“sampling can show the presence of knowledge but not the absence”), was not predicted but discovered empirically after model release, appears to emerge only in large language models (>80b dense parameters), can have increasing returns to scale, can scale performance even when naive prompting has flat scaling (“hidden scaling”), adds an RNN-esque flavor to feedforward language models, and involves planning (cf. Socratic models/SayCan). As of 2023, training on inner-monologue-generated datasets has become standard, and is responsible for large capability gains; the limits of self-training & exploration are unknown.

A toy-model for how inner-monologue works is that such problems are sequential: when calculating out an arithmetic problem, an error in any step causes all following steps to be wrong. Such a process is a multiplicative pipeline, where per-step success rates multiply: ie. a P success rate on each of n steps multiplies to a correctness rate of Pⁿ, which rapidly shrinks as P falls or n grows. So inner-monologue makes the meta-learning task easier by being more specific and reducing it to easier sub-tasks, potentially increasing the success rate far more than alternatives like scaling a model a few times (eg. a 5-step problem with P = 90% vs P = 99% is ~59% vs ~95%; getting that improvement via pure scaling of naive prompts might require >10× scaling). Small models, then, aren’t smart enough to ‘get it’ from the instructions, and their baseline error rate is too high to execute steps reliably enough to see much gain.
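The arithmetic of the toy-model can be checked directly. This is a minimal sketch which assumes every step succeeds independently with the same probability P; the function name `pipeline_success` is illustrative.

```python
def pipeline_success(p_step: float, n_steps: int) -> float:
    """Probability that all n sequential steps succeed (independent, equal P)."""
    return p_step ** n_steps


# The 5-step example from the text: P = 90% vs P = 99% per step.
for p in (0.90, 0.99):
    print(f"P = {p:.2f}, n = 5: {pipeline_success(p, 5):.1%}")
# P = 0.90, n = 5: 59.0%
# P = 0.99, n = 5: 95.1%
```

Raising `n_steps` with P held fixed shows the other half of the claim: the correctness rate falls geometrically with problem length.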

I speculate the reason inner-monologue is not the model default, when it predicts the answer so much more accurately, may be the lack of an implicit memory mechanism, one where a model could adaptively execute computations for predicting the next token. Because models like GPT-3 or PaLM have no recurrent state, they must fake it by reusing their predicted output as a working memory. However, such a ‘show-your-work’ writing style is highly unusual in the original natural language distribution they are trained to imitate, so they will not do so by default without a prompt steering them towards it; they instead try to emit the answer immediately, which is impossible given their feedforward limitation, and so they guess incorrectly.
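The ‘output as working memory’ idea can be caricatured in a few lines. This is only a toy sketch, not a description of how a Transformer computes: `next_line` is a hypothetical stand-in for a stateless model call which can do one small step per invocation, and the only thing carried between calls is the text written so far.

```python
def next_line(numbers, scratchpad):
    """Stateless step: read progress from the scratchpad text, emit one more line."""
    lines = scratchpad.splitlines()
    done = len(lines)  # one line has been written per number already added
    if done == len(numbers):
        return None  # every number has been added; nothing left to write
    # The running total is recovered from the last written line, not from hidden state.
    running = int(lines[-1].split("=")[-1]) if lines else 0
    total = running + numbers[done]
    return f"{running} + {numbers[done]} = {total}"


def solve_step_by_step(numbers):
    """Call the stateless step repeatedly, feeding its own output back in."""
    scratchpad = ""
    while (line := next_line(numbers, scratchpad)) is not None:
        scratchpad += line + "\n"
    return scratchpad


print(solve_step_by_step([17, 25, 8]), end="")
# 0 + 17 = 17
# 17 + 25 = 42
# 42 + 8 = 50
```

Forcing `next_line` to return the final sum in a single call would require it to do all the additions at once, which is the analogue of emitting the answer immediately.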

Gwern

“Free-Play Periods for RL Agents”, Gwern 2023

“It Looks Like You’re Trying To Take Over The World”, Gwern 2022

Links

“Coaxing USAMO Proofs From `o3-mini-high`”, Burnham 2025

“Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn’t”, Dang & Ngo 2025

“Thinking Slow, Fast: Scaling Inference Compute With Distilled Reasoners”, Paliotta et al 2025

“Rank1: Test-Time Compute for Reranking in Information Retrieval”, Weller et al 2025

“Scaling up Test-Time Compute With Latent Reasoning: A Recurrent Depth Approach”, Geiping et al 2025

“Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning”, Su et al 2025

“Competitive Programming With Large Reasoning Models”, El-Kishky et al 2025

“Introducing Deep Research: An Agent That Uses Reasoning to Synthesize Large Amounts of Online Information and Complete Multi-Step Research Tasks for You. Available to Pro Users Today, Plus and Team Next”, OpenAI 2025

“s1: Simple Test-Time Scaling”, Muennighoff et al 2025

“Large Language Models Think Too Fast To Explore Effectively”, Pan et al 2025

“DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning”, Guo et al 2025

“Are DeepSeek R1 And Other Reasoning Models More Faithful?”, Chua & Evans 2025

“Aviary: Training Language Agents on Challenging Scientific Tasks”, Narayanan et al 2024

“o1 Turns Pro”

“Training Large Language Models to Reason in a Continuous Latent Space”, Hao et al 2024

“Introducing ChatGPT Pro: Broadening Usage of Frontier AI”, OpenAI 2024

“Free Process Rewards without Process Labels”, Yuan et al 2024

“Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models”, Ruis et al 2024

“Mind Your Step (by Step): Chain-of-Thought Can Reduce Performance on Tasks Where Thinking Makes Humans Worse”, Liu et al 2024

“Thinking LLMs: General Instruction Following With Thought Generation”, Wu et al 2024

“When a Language Model Is Optimized for Reasoning, Does It Still Show Embers of Autoregression? An Analysis of OpenAI o1”, McCoy et al 2024

“Evaluation of OpenAI o1: Opportunities and Challenges of AGI”, Zhong et al 2024

“LLMs Still Can’t Plan; Can LRMs? A Preliminary Evaluation of OpenAI’s o1 on PlanBench”, Valmeekam et al 2024

“Training Language Models to Self-Correct via Reinforcement Learning”, Kumar et al 2024

“To CoT or Not to CoT? Chain-of-Thought Helps Mainly on Math and Symbolic Reasoning”, Sprague et al 2024

“Critique-out-Loud Reward Models”, Ankner et al 2024

“Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process”, Ye et al 2024

“Connecting the Dots: LLMs Can Infer and Verbalize Latent Structure from Disparate Training Data”, Treutlein et al 2024

“Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?”, Lee et al 2024

“OlympicArena: Benchmarking Multi-Discipline Cognitive Reasoning for Superintelligent AI”, Huang et al 2024

“How Far Can Transformers Reason? The Locality Barrier and Inductive Scratchpad”, Abbe et al 2024

“OmegaPRM: Improve Mathematical Reasoning in Language Models by Automated Process Supervision”, Luo et al 2024

“MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark”, Wang et al 2024

“A Theoretical Understanding of Self-Correction through In-Context Alignment”, Wang et al 2024

“Intelligent Go-Explore (IGE): Standing on the Shoulders of Giant Foundation Models”, Lu et al 2024

“From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step”, Deng et al 2024

“Observational Scaling Laws and the Predictability of Language Model Performance”, Ruan et al 2024

“Retrieval Head Mechanistically Explains Long-Context Factuality”, Wu et al 2024

“Let’s Think Dot by Dot: Hidden Computation in Transformer Language Models”, Pfau et al 2024

“Autonomous LLM-Driven Research from Data to Human-Verifiable Research Papers”, Ifargan et al 2024

“Missed Connections: Lateral Thinking Puzzles for Large Language Models”, Todd et al 2024

“ChatGPT Can Predict the Future When It Tells Stories Set in the Future About the Past”, Pham & Cunningham 2024

“Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models”, Wu et al 2024

“Do Language Models Plan Ahead for Future Tokens?”, Wu et al 2024

“FABLES: Evaluating Faithfulness and Content Selection in Book-Length Summarization”, Kim et al 2024

“Re-Evaluating GPT-4’s Bar Exam Performance”, Martínez 2024

“Long-Form Factuality in Large Language Models”, Wei et al 2024

“Don’t Trust: Verify—Grounding LLM Quantitative Reasoning With Autoformalization”, Zhou et al 2024

“Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking”, Zelikman et al 2024

“RNNs Are Not Transformers (Yet): The Key Bottleneck on In-Context Retrieval”, Wen et al 2024

“Tokenization Counts: the Impact of Tokenization on Arithmetic in Frontier LLMs”, Singh & Strouse 2024

“Chain-of-Thought Empowers Transformers to Solve Inherently Serial Problems”, Li et al 2024

“Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models”, Levy et al 2024

“Why Are Sensitive Functions Hard for Transformers?”, Hahn & Rofin 2024

“Chain-of-Thought Reasoning Without Prompting”, Wang & Zhou 2024

“V-STaR: Training Verifiers for Self-Taught Reasoners”, Hosseini et al 2024

“More Agents Is All You Need”, Li et al 2024

“The Impact of Reasoning Step Length on Large Language Models”, Jin et al 2024

“Large Language Models Play StarCraft II: Benchmarks and A Chain of Summarization Approach”, Ma et al 2023

“Math-Shepherd: Verify and Reinforce LLMs Step-by-Step without Human Annotations”, Wang et al 2023

“Beyond Human Data: Scaling Self-Training for Problem-Solving With Language Models (ReST^EM)”, Singh et al 2023

“Tree of Attacks (TAP): Jailbreaking Black-Box LLMs Automatically”, Mehrotra et al 2023

“Universal Self-Consistency for Large Language Model Generation”, Chen et al 2023

“Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine”, Nori et al 2023

“Training Chain-of-Thought via Latent-Variable Inference”, Phan et al 2023

“Compositional Capabilities of Autoregressive Transformers: A Study on Synthetic, Interpretable Tasks”, Ramesh et al 2023

“On Measuring Faithfulness or Self-Consistency of Natural Language Explanations”, Parcalabescu & Frank 2023

“Zero-Shot Goal-Directed Dialogue via RL on Imagined Conversations”, Hong et al 2023

“Large Language Models Can Strategically Deceive Their Users When Put Under Pressure”, Scheurer et al 2023

“Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves”, Deng et al 2023

“Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation”, Ding et al 2023

“Implicit Chain-of-Thought Reasoning via Knowledge Distillation”, Deng et al 2023

“Preventing Language Models From Hiding Their Reasoning”, Roger & Greenblatt 2023

“Branch-Solve-Merge Improves Large Language Model Evaluation and Generation”, Saha et al 2023

“Can GPT Models Be Financial Analysts? An Evaluation of ChatGPT and GPT-4 on Mock CFA Exams”, Callanan et al 2023

“The Expressive Power of Transformers With Chain-of-Thought”, Merrill & Sabharwal 2023

“Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models”, Zhou et al 2023

“Large Language Models Cannot Self-Correct Reasoning Yet”, Huang et al 2023

“Think Before You Speak: Training Language Models With Pause Tokens”, Goyal et al 2023

“Embers of Autoregression: Understanding Large Language Models Through the Problem They Are Trained to Solve”, McCoy et al 2023

“Contrastive Decoding Improves Reasoning in Large Language Models”, O’Brien & Lewis 2023

“Re-Reading Improves Reasoning in Large Language Models”, Xu et al 2023

“From Sparse to Dense: GPT-4 Summarization With Chain of Density (CoD) Prompting”, Adams et al 2023

“Graph of Thoughts: Solving Elaborate Problems With Large Language Models”, Besta et al 2023

“Solving Challenging Math Word Problems Using GPT-4 Code Interpreter With Code-Based Self-Verification”, Zhou et al 2023

“Scaling Relationship on Learning Mathematical Reasoning With Large Language Models”, Yuan et al 2023

“Android in the Wild: A Large-Scale Dataset for Android Device Control”, Rawles et al 2023

“LLMs As Workers in Human-Computational Algorithms? Replicating Crowdsourcing Pipelines With LLMs”, Wu et al 2023

“TableGPT: Towards Unifying Tables, Nature Language and Commands into One GPT”, Zha et al 2023

“Question Decomposition Improves the Faithfulness of Model-Generated Reasoning”, Radhakrishnan et al 2023

“Measuring Faithfulness in Chain-of-Thought Reasoning”, Lanham et al 2023

“Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration”, Wang et al 2023

“Explaining Competitive-Level Programming Solutions Using LLMs”, Li et al 2023

“Teaching Arithmetic to Small Transformers”, Lee et al 2023

“Language Models Are Weak Learners”, Manikandan et al 2023

“Let’s Do a Thought Experiment: Using Counterfactuals to Improve Moral Reasoning”, Ma et al 2023

“GKD: Generalized Knowledge Distillation for Auto-Regressive Sequence Models”, Agarwal et al 2023

“From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought”, Wong et al 2023

“Large Language Models As Tax Attorneys: A Case Study in Legal Capabilities Emergence”, Nay et al 2023

“Iterative Translation Refinement With Large Language Models”, Chen et al 2023

“Thought Cloning: Learning to Think While Acting by Imitating Human Thinking”, Hu & Clune 2023

“Let’s Verify Step by Step”, Lightman et al 2023

“Towards Revealing the Mystery behind Chain-of-Thought: A Theoretical Perspective”, Feng et al 2023

“Improving Factuality and Reasoning in Language Models through Multiagent Debate”, Du et al 2023

“How Language Model Hallucinations Can Snowball”, Zhang et al 2023

“Tree of Thoughts (ToT): Deliberate Problem Solving With Large Language Models”, Yao et al 2023

“Large Language Model Programs”, Schlag et al 2023

“Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting”, Turpin et al 2023

“Distilling Step-by-Step! Outperforming Larger Language Models With Less Training Data and Smaller Model Sizes”, Hsieh et al 2023

“Decomposition Enhances Reasoning via Self-Evaluation Guided Decoding”, Xie et al 2023

“LLM+P: Empowering Large Language Models With Optimal Planning Proficiency”, Liu et al 2023

“Boosting Theory-of-Mind Performance in Large Language Models via Prompting”, Moghaddam & Honey 2023

“Think Before You Act: Unified Policy for Interleaving Language Reasoning With Actions”, Mezghani et al 2023

“Language Models Can Solve Computer Tasks”, Kim et al 2023

“Reflexion: Language Agents With Verbal Reinforcement Learning”, Shinn et al 2023

“How Well Do Large Language Models Perform in Arithmetic Tasks?”, Yuan et al 2023

“SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models”, Manakul et al 2023

“Language Is Not All You Need: Aligning Perception With Language Models (Kosmos-1)”, Huang et al 2023

“Multimodal Chain-of-Thought Reasoning in Language Models”, Zhang et al 2023

“Faithful Chain-of-Thought Reasoning”, Lyu et al 2023

“Large Language Models Are Versatile Decomposers: Decompose Evidence and Questions for Table-Based Reasoning”, Ye et al 2023

“ChatGPT Goes to Law School”, Choi et al 2023

“Large Language Models As Fiduciaries: A Case Study Toward Robustly Communicating With Artificial Intelligence Through Legal Standards”, Nay 2023

“Interactive-Chain-Prompting (INTERCPT): Ambiguity Resolution for Crosslingual Conditional Generation With Interaction”, Pilault et al 2023

“Iterated Decomposition: Improving Science Q&A by Supervising Reasoning Processes”, Reppert et al 2023

“Solving Math Word Problems With Process & Outcome-Based Feedback”, Uesato et al 2022

“PAL: Program-Aided Language Models”, Gao et al 2022

“Measuring Progress on Scalable Oversight for Large Language Models”, Bowman et al 2022

“U-PaLM: Transcending Scaling Laws With 0.1% Extra Compute”, Tay et al 2022

“Large Language Models Can Self-Improve”, Huang et al 2022

“Challenging BIG-Bench Tasks (BBH) and Whether Chain-of-Thought Can Solve Them”, Suzgun et al 2022

“Self-Ask: Measuring and Narrowing the Compositionality Gap in Language Models (Bamboogle)”, Press et al 2022

“Language Models Are Multilingual Chain-of-Thought Reasoners”, Shi et al 2022

“ReAct: Synergizing Reasoning and Acting in Language Models”, Yao et al 2022

“Dynamic Prompt Learning via Policy Gradient for Semi-Structured Mathematical Reasoning”, Lu et al 2022

“FOLIO: Natural Language Reasoning With First-Order Logic”, Han et al 2022

“Faithful Reasoning Using Large Language Models”, Creswell & Shanahan 2022

“Limitations of Language Models in Arithmetic and Symbolic Induction”, Qian et al 2022

“Language Models Can Teach Themselves to Program Better”, Haluptzok et al 2022

“Language Model Cascades”, Dohan et al 2022

“CodeT: Code Generation With Generated Tests”, Chen et al 2022

“Can Large Language Models Reason about Medical Questions?”, Liévin et al 2022

“Inner Monologue: Embodied Reasoning through Planning With Language Models”, Huang et al 2022

“Exploring Length Generalization in Large Language Models”, Anil et al 2022

“Language Models (Mostly) Know What They Know”, Kadavath et al 2022

“Solving Quantitative Reasoning Problems With Language Models”, Lewkowycz et al 2022

“Maieutic Prompting: Logically Consistent Reasoning With Recursive Explanations”, Jung et al 2022

“Large Language Models Are Zero-Shot Reasoners”, Kojima et al 2022

“Instruction Induction: From Few Examples to Natural Language Task Descriptions”, Honovich et al 2022

“Least-to-Most Prompting Enables Complex Reasoning in Large Language Models”, Zhou et al 2022

“Dialog Inpainting: Turning Documents into Dialogues”, Dai et al 2022

“UL2: Unifying Language Learning Paradigms”, Tay et al 2022

“Can Language Models Learn from Explanations in Context?”, Lampinen et al 2022

“Socratic Models: Composing Zero-Shot Multimodal Reasoning With Language”, Zeng et al 2022

“STaR: Bootstrapping Reasoning With Reasoning”, Zelikman et al 2022

“A Conversational Paradigm for Program Synthesis”, Nijkamp et al 2022

“Self-Consistency Improves Chain-of-Thought Reasoning in Language Models”, Wang et al 2022

“Learning-by-Narrating: Narrative Pre-Training for Zero-Shot Dialogue Comprehension”, Zhao et al 2022

“PromptChainer: Chaining Large Language Model Prompts through Visual Programming”, Wu et al 2022

“Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”, Wei et al 2022

“Reasoning Like Program Executors”, Pi et al 2022

“A Neural Network Solves and Generates Mathematics Problems by Program Synthesis: Calculus, Differential Equations, Linear Algebra, and More”, Drori et al 2021

“DREAM: Uncovering Mental Models behind Language Models”, Gu et al 2021

“Reframing Human-AI Collaboration for Generating Free-Text Explanations”, Wiegreffe et al 2021

“NeuroLogic A^✱esque Decoding: Constrained Text Generation With Lookahead Heuristics”, Lu et al 2021

“WebGPT: Improving the Factual Accuracy of Language Models through Web Browsing”, Hilton et al 2021

“NN Inner Monologue”, Gwern 2021

“Few-Shot Self-Rationalization With Natural Language Prompts”, Marasović et al 2021

“Training Verifiers to Solve Math Word Problems”, Cobbe et al 2021

“Unsupervised Neural Machine Translation With Generative Language Models Only”, Han et al 2021

“Show Your Work: Scratchpads for Intermediate Computation With Language Models”, Nye et al 2021

“AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts”, Wu et al 2021

“Teaching Autoregressive Language Models Complex Tasks By Demonstration”, Recchia 2021

“Program Synthesis With Large Language Models”, Austin et al 2021

“Decision Transformer: Reinforcement Learning via Sequence Modeling”, Chen et al 2021

“Explainable Multi-Hop Verbal Reasoning Through Internal Monologue”, Liang et al 2021

“A Simple Method to Keep GPT-3 Focused in a Conversation”, Mayne 2021

“Measuring Mathematical Problem Solving With the MATH Dataset”, Hendrycks et al 2021

“Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm”, Reynolds & McDonell 2021

“How We Accidentally Gave Our Bots Their Personalities”, Latitude 2021

“Word in Context: Agent and Agent Clarification (69% Dev)”, Brockman 2020

“I Found That Getting GPT-3 to Add Its Own "Internal Monologue" in Parentheses to Be a Helpful Strategy… ”, blixt 2020

I found that getting GPT-3 to add its own "internal monologue" in parentheses to be a helpful strategy…⁠

kleptid @ "2020-07-17"

Seems to work⁠

kleptid @ "2020-07-17"

Teaching GPT-3 to do a brute force 'for loop' checking answers also seems to work⁠

“Inducing Self-Explanation: a Meta-Analysis ”, Bisra et al 2018

Inducing Self-Explanation: a Meta-Analysis⁠

“Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems ”, Ling et al 2017

Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems⁠

“Why Do Humans Reason? Arguments for an Argumentative Theory ”, Mercier & Sperber 2011

⁠Why do humans reason? Arguments for an argumentative theory :

View PDF:

⁠/doc/www/hal.science/ba03e8d7db678948a7585a947ea8a4eac13d6abf.pdf⁠

“How to Dramatically Improve the Reasoning Ability of GPT-3 ”

⁠How to dramatically improve the reasoning ability of GPT-3 :

View HTML:

⁠/doc/www/blog.andrewcantino.com/642b641a22ab789da5eba95379dfeb1e7c7596e9.html⁠

“A Preliminary Exploration into Factored Cognition With Language Models ”

⁠A Preliminary Exploration into Factored Cognition with Language Models⁠ :

View External Link:

⁠https://blog.eleuther.ai/factored-cognition/⁠

“ChatGPT-4 o1-pro: Poetry Reflection and Analysis ”

⁠ChatGPT-4 o1-pro: Poetry Reflection and Analysis⁠

“WiC_SelfContextStuffingImproved_Last10_stuft_examplesNV.ipynb ”

WiC_SelfContextStuffingImproved_Last10_stuft_examplesNV.ipynb⁠

“TinyZero ”, Pan 2025

⁠TinyZero⁠ :

View HTML:

⁠/doc/www/github.com/65c21c2602a16ce0a4613a6c1f14b6f708200614.html⁠

“vincent-163/transformer-arithmetic ”

vincent-163/transformer-arithmetic⁠

“Magic ToDo List Creator ”

⁠Magic ToDo List Creator :

View HTML:

⁠/doc/www/goblin.tools/99da0d8d421922b8768e9f6e35207d59db5bb214.html⁠

“Short Story on AI: ‘Forward Pass’ ”, Karpathy 2021

⁠Short Story on AI: ‘Forward Pass’ :

View External Link:

⁠https://karpathy.github.io/2021/03/27/forward-pass/

“AI Dungeon Players Can Now Translate Their Stories into Emojis by Just Clicking a Button. ”

⁠AI Dungeon players can now translate their stories into emojis by just clicking a button.⁠ :

View HTML:

⁠/doc/www/latitude.io/d4a1d75abef33a907533b26f731c3ebb3ac090a1.html⁠

“Sky-T1: Train Your Own `o1-preview` Model With $450 ”

⁠Sky-T1: Train your own o1-preview model with $450 :

View HTML:

⁠/doc/www/novasky-ai.github.io/65d3c8432019e99079cf93efdc234841bc85ca3c.html⁠

“Solving Math Word Problems ”

Solving Math Word Problems: We’ve trained a system that solves grade school math problems with nearly twice the accuracy of a fine-tuned GPT-3 model. It solves about 90% as many problems as real kids: a small sample of 9-12 year olds scored 60% on a test from our dataset, while our system scored 55% on those same problems. This is important because today’s AI is still quite weak at commonsense multistep reasoning, which is easy even for grade school kids. We achieved these results by training our model to recognize its mistakes, so that it can try repeatedly until it finds a solution that works⁠

“Prompting Diverse Ideas: Increasing AI Idea Variance ”

⁠Prompting Diverse Ideas: Increasing AI Idea Variance⁠

“`o3-mini` ”, OpenAI 2025

o3-mini⁠

“Teaching a Neural Network to Use a Calculator ”

⁠Teaching a neural network to use a calculator :

View HTML:

⁠/doc/www/reiinakano.com/3a2a05d3c82f879dc7abd75a01d242a08925804f.html⁠

“GPT-4 o1 Isn’t a Chat Model (And That’s the Point) ”

⁠GPT-4 o1 isn’t a chat model (and that’s the point)

“Connecting the Dots: LLMs Can Infer & Verbalize Latent Structure from Training Data ”

⁠Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data⁠

“Preventing Language Models from Hiding Their Reasoning ”

⁠Preventing Language Models from hiding their reasoning⁠

“A High Level Closed-Door Session Discussing DeepSeek: Vision Trumps Technology ”

⁠A High Level Closed-Door Session Discussing DeepSeek: Vision Trumps Technology⁠

“Steganography in Chain-Of-Thought Reasoning ”

⁠Steganography in Chain-of-Thought Reasoning⁠

“Visible Thoughts Project and Bounty Announcement ”

⁠Visible Thoughts Project and Bounty Announcement⁠ :

View External Link:

⁠https://www.lesswrong.com/posts/zRn6cLtxyNodudzhw/visible-thoughts-project-and-bounty-announcement⁠

Malcolm_Ocean

Inspired by an AI Dungeon example where math is discussed in simple language, I seem to be having decent results here. I had to… not just say what parity IS but HOW to calculate it (‘count the number of 1s’) and then it sort of walks itself through decently. Tho kinda confused⁠ :

/doc/www/localhost/d503eac82e2ea68eb23f0a1362ee513ce0176ec2.html⁠

bucketofkets

I think ‘GPT-3 can’t do parity checking’ isn’t quite right. It can clearly pattern match the algorithm, almost perfectly. It’s just a little mistake prone. Here, I invented a syntax for having it evaluate parity on each pair of digits. It…almost gets it right.⁠ :

/doc/www/localhost/8ecd37ee160fdb9c3f0aa3260978809d5afc743c.html⁠

sama

⁠[o3-full & o4-mini to launch earlier, GPT-5 delayed for capability improvement, integration polishing, & hardware availability]⁠

teortaxesTex

[DeepSeek-r1 solving Russian pun]⁠

Bibliography

https://arxiv.org/abs/2502.20339: “Thinking Slow, Fast: Scaling Inference Compute With Distilled Reasoners ”⁠, Daniele Paliotta, ⁠Junxiong Wang, Matteo Pagliardini …, Kevin Y. Li, Aviv Bick, J. Zico Kolter, Albert Gu⁠, François Fleuret, ⁠Tri Dao
https://arxiv.org/abs/2502.06807#openai: “Competitive Programming With Large Reasoning Models ”⁠, Ahmed El-Kishky, Alexander Wei, Andre Saraiva …, Borys Minaev, Daniel Selsam⁠, David Dohan, Francis Song, Hunter Lightman, Ignasi Clavera, Jakub Pachocki, Jerry Tworek, Lorenz Kuhn, Łukasz Kaiser⁠, ⁠Mark Chen, Max Schwarzer, Mostafa Rohaninejad, Nat McAleese⁠, o3 contributors, Oleg Mürk, Rhythm Garg, Rui Shu, Szymon Sidor, Vineet Kosaraju, Wenda Zhou
https://arxiv.org/abs/2501.08156: “Are DeepSeek R1 And Other Reasoning Models More Faithful? ”⁠, James Chua, ⁠Owain Evans
https://arxiv.org/abs/2412.01981: “Free Process Rewards without Process Labels ”⁠, Lifan Yuan⁠, Wendi Li, Huayu Chen …, Ganqu Cui, Ning Ding⁠, Kaiyan Zhang, Bowen Zhou, ⁠Zhiyuan Liu, Hao Peng
https://arxiv.org/abs/2410.21333: “Mind Your Step (By Step): Chain-Of-Thought Can Reduce Performance on Tasks Where Thinking Makes Humans Worse ”⁠, Ryan Liu, Jiayi Geng, Addison J. Wu …, Ilia Sucholutsky, Tania Lombrozo, Thomas L. Griffiths⁠
https://arxiv.org/abs/2406.13121#google: “Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? ”⁠, Jinhyuk Lee, Anthony Chen⁠, Zhuyun Dai …, Dheeru Dua, Devendra Singh Sachan, Michael Boratko, Yi Luan, Sébastien M. R. Arnold, Vincent Perot, Siddharth Dalmia, Hexiang Hu, Xudong Lin, Panupong Pasupat, Aida Amini, Jeremy R. Cole, Sebastian Riedel, Iftekhar Naim, Ming-Wei Chang, Kelvin Guu
https://arxiv.org/abs/2405.15143: “Intelligent Go-Explore (IGE): Standing on the Shoulders of Giant Foundation Models ”⁠, Cong Lu, Shengran Hu, ⁠Jeff Clune
https://arxiv.org/abs/2405.14838: “From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step ”⁠, Yuntian Deng, Yejin Choi⁠, Stuart Shieber
https://arxiv.org/abs/2405.10938: “Observational Scaling Laws and the Predictability of Language Model Performance ”⁠, Yangjun Ruan, Chris J. Maddison, Tatsunori Hashimoto
https://arxiv.org/abs/2404.15574: “Retrieval Head Mechanistically Explains Long-Context Factuality ”⁠, Wenhao Wu, ⁠Yizhong Wang, Guangxuan Xiao …, Hao Peng, Yao Fu
https://arxiv.org/abs/2404.15758: “Let’s Think Dot by Dot: Hidden Computation in Transformer Language Models ”⁠, Jacob Pfau, William Merrill, ⁠Samuel R. Bowman
https://link.springer.com/article/10.1007/s10506-024-09396-9: “Re-Evaluating GPT-4’s Bar Exam Performance ”⁠, Eric Martínez
https://arxiv.org/abs/2403.18802#deepmind: “Long-Form Factuality in Large Language Models ”⁠, Jerry Wei, Chengrun Yang, Xinying Song …, Yifeng Lu, Nathan Hu, Jie Huang, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, Quoc V. Le⁠
https://arxiv.org/abs/2403.18120#google: “Don’t Trust: Verify—Grounding LLM Quantitative Reasoning With Autoformalization ”⁠, Jin Peng Zhou, Charles Staats, Wenda Li …, Christian Szegedy, ⁠Kilian Q. Weinberger, ⁠Yuhuai Wu
https://arxiv.org/abs/2403.09629: “Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking ”⁠, Eric Zelikman, Georges Harik, Yijia Shao …, Varuna Jayasiri, Nick Haber, Noah D. Goodman
https://arxiv.org/abs/2402.14903: “Tokenization Counts: the Impact of Tokenization on Arithmetic in Frontier LLMs ”⁠, Aaditya K. Singh, D. J. Strouse
https://arxiv.org/abs/2402.09963: “Why Are Sensitive Functions Hard for Transformers? ”⁠, Michael Hahn⁠, Mark Rofin
https://arxiv.org/abs/2402.05120#tencent: “More Agents Is All You Need ”⁠, Junyou Li, Qin Zhang⁠, Yangbin Yu …, Qiang Fu, Deheng Ye
https://arxiv.org/abs/2312.08935: “Math-Shepherd: Verify and Reinforce LLMs Step-By-Step without Human Annotations ”⁠, Peiyi Wang, Lei Li, Zhihong Shao …, R. X. Xu, Damai Dai, Yifei Li, Deli Chen, Y. Wu, Zhifang Sui
https://arxiv.org/abs/2312.06585#deepmind: “Beyond Human Data: Scaling Self-Training for Problem-Solving With Language Models (ReST^EM) ”⁠, Avi Singh, John D. Co-Reyes, Rishabh Agarwal …, Ankesh Anand, Piyush Patil, Peter J. Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron Parisi, Abhishek Kumar⁠, Alex Alemi, Alex Rizkowsky, Azade Nova, Ben Adlam, Bernd Bohnet, Hanie Sedghi, Igor Mordatch⁠, Isabelle Simpson, Izzeddin Gur, Jasper Snoek, Jeffrey Pennington⁠, Jiri Hron, Kathleen Kenealy⁠, Kevin Swersky, Kshiteej Mahajan, Laura Culp, Lechao Xiao, Maxwell L. Bileschi, Noah Constant⁠, Roman Novak, Rosanne Liu, Tris Warkentin, Yundi Qian, Ethan Dyer⁠, Behnam Neyshabur, Jascha Sohl-Dickstein⁠, Noah Fiedel
https://arxiv.org/abs/2311.16452#microsoft: “Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine ”⁠, Harsha Nori, Yin Tat Lee, Sheng Zhang …, Dean Carignan, Richard Edgar, Nicolo Fusi, Nicholas King, Jonathan Larson, Yuanzhi Li, Weishung Liu, Renqian Luo, Scott Mayer McKinney, Robert Osazuwa Ness, Hoifung Poon, Tao Qin⁠, Naoto Usuyama, Chris White, Eric Horvitz⁠
https://arxiv.org/abs/2312.02179: “Training Chain-Of-Thought via Latent-Variable Inference ”⁠, Du Phan, Matthew D. Hoffman, David Dohan …, Sholto Douglas⁠, Tuan Anh Le, Aaron Parisi, Pavel Sountsov, Charles Sutton, Sharad Vikram, Rif A. Saurous
https://arxiv.org/abs/2310.08678: “Can GPT Models Be Financial Analysts? An Evaluation of ChatGPT and GPT-4 on Mock CFA Exams ”⁠, Ethan Callanan, Amarachi Mbakwe, Antony Papadimitriou …, Yulong Pei, Mathieu Sibue, Xiaodan Zhu, Zhiqiang Ma, Xiaomo Liu, Sameena Shah
https://arxiv.org/abs/2310.04406: “Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models ”⁠, Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman …, Haohan Wang, Yu-Xiong Wang
https://arxiv.org/abs/2310.02226: “Think Before You Speak: Training Language Models With Pause Tokens ”⁠, Sachin Goyal, Ziwei Ji, Ankit Singh Rawat …, Aditya Krishna Menon, Sanjiv Kumar⁠, Vaishnavh Nagarajan
https://arxiv.org/abs/2309.09117#facebook: “Contrastive Decoding Improves Reasoning in Large Language Models ”⁠, Sean O’Brien, Mike Lewis⁠
https://arxiv.org/abs/2309.06275: “Re-Reading Improves Reasoning in Large Language Models ”⁠, Xiaohan Xu, Chongyang Tao, Tao Shen …, Can Xu⁠, Hongbo Xu, Guodong Long, Jian-guang Lou, Shuai Ma
https://arxiv.org/abs/2309.04269: “From Sparse to Dense: GPT-4 Summarization With Chain of Density (CoD) Prompting ”⁠, Griffin Adams, Alexander Fabbri, Faisal Ladhak …, Eric Lehman, Noémie Elhadad⁠
https://arxiv.org/abs/2308.07921: “Solving Challenging Math Word Problems Using GPT-4 Code Interpreter With Code-Based Self-Verification ”⁠, Aojun Zhou, Ke Wang⁠, Zimu Lu …, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya Jia, Linqi Song, Mingjie Zhan, Hongsheng Li
https://arxiv.org/abs/2307.05300#microsoft: “Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration ”⁠, Zhenhailong Wang, Shaoguang Mao, Wenshan Wu …, Tao Ge, Furu Wei⁠, Heng Ji⁠
https://arxiv.org/abs/2307.03381: “Teaching Arithmetic to Small Transformers ”⁠, Nayoung Lee, Kartik Sreenivasan, Jason D. Lee …, Kangwook Lee, Dimitris Papailiopoulos
https://arxiv.org/abs/2306.14308#google: “Let’s Do a Thought Experiment: Using Counterfactuals to Improve Moral Reasoning ”⁠, Xiao Ma, ⁠Swaroop Mishra, Ahmad Beirami …, Alex Beutel, Jilin Chen
https://arxiv.org/abs/2306.00323: “Thought Cloning: Learning to Think While Acting by Imitating Human Thinking ”⁠, Shengran Hu, ⁠Jeff Clune
https://arxiv.org/abs/2305.20050#openai: “Let’s Verify Step by Step ”⁠, Hunter Lightman, Vineet Kosaraju, Yura Burda …, Harri Edwards, Bowen Baker, Teddy Lee, ⁠Jan Leike, ⁠John Schulman, Ilya Sutskever⁠, Karl Cobbe
https://arxiv.org/abs/2305.13534: “How Language Model Hallucinations Can Snowball ”⁠, Muru Zhang, Ofir Press, William Merrill …, Alisa Liu, Noah Smith⁠
https://arxiv.org/abs/2305.10601#deepmind: “Tree of Thoughts (ToT): Deliberate Problem Solving With Large Language Models ”⁠, Shunyu Yao, Dian Yu, Jeffrey Zhao …, Izhak Shafran, Thomas L. Griffiths⁠, Yuan Cao⁠, Karthik Narasimhan
https://arxiv.org/abs/2305.04388: “Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-Of-Thought Prompting ”⁠, Miles Turpin, ⁠Julian Michael, ⁠Ethan Perez, ⁠Samuel R. Bowman
https://arxiv.org/abs/2305.02301#google: “Distilling Step-By-Step! Outperforming Larger Language Models With Less Training Data and Smaller Model Sizes ”⁠, Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh …, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, Tomas Pfister
https://arxiv.org/abs/2304.11490: “Boosting Theory-Of-Mind Performance in Large Language Models via Prompting ”⁠, Shima Rahimi Moghaddam, Christopher J. Honey
https://arxiv.org/abs/2304.02015#alibaba: “How Well Do Large Language Models Perform in Arithmetic Tasks? ”⁠, Zheng Yuan, Hongyi Yuan, Chuanqi Tan …, Wei Wang, Songfang Huang
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4335905: “ChatGPT Goes to Law School ”⁠, Jonathan H. Choi, Kristin E. Hickman, Amy Monahan, Daniel Schwarcz
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4335945: “Large Language Models As Fiduciaries: A Case Study Toward Robustly Communicating With Artificial Intelligence Through Legal Standards ”⁠, John Nay
https://arxiv.org/abs/2301.01751#elicit: “Iterated Decomposition: Improving Science Q&A by Supervising Reasoning Processes ”⁠, Justin Reppert, Ben Rachbach, Charlie George …, Luke Stebbing, Jungwon Byun, Maggie Appleton, Andreas Stuhlmüller
https://arxiv.org/abs/2210.11399#google: “U-PaLM: Transcending Scaling Laws With 0.1% Extra Compute ”⁠, ⁠Yi Tay, Jason Wei, Hyung Won Chung …, Vinh Q. Tran, David R. So, Siamak Shakeri, Xavier Garcia, Huaixiu Steven Zheng, Jinfeng Rao, Aakanksha Chowdhery, ⁠Denny Zhou, Donald Metzler, Slav Petrov, ⁠Neil Houlsby, Quoc V. Le⁠, Mostafa Dehghani
https://arxiv.org/abs/2210.11610#google: “Large Language Models Can Self-Improve ”⁠, Jiaxin Huang, Shixiang Shane Gu⁠, Le Hou …, Yuexin Wu, Xuezhi Wang, Hongkun Yu, Jiawei Han⁠
https://arxiv.org/abs/2210.09261#google: “Challenging BIG-Bench Tasks (BBH) and Whether Chain-Of-Thought Can Solve Them ”⁠, Mirac Suzgun, Nathan Scales, Nathanael Schärli …, Sebastian Gehrmann, ⁠Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le⁠, Ed H. Chi⁠, ⁠Denny Zhou, Jason Wei
https://arxiv.org/abs/2210.03350#allen: “Self-Ask: Measuring and Narrowing the Compositionality Gap in Language Models (Bamboogle) ”⁠, Ofir Press, Muru Zhang, Sewon Min …, Ludwig Schmidt⁠, ⁠Noah A. Smith, Mike Lewis⁠
https://arxiv.org/abs/2210.03057#google: “Language Models Are Multilingual Chain-Of-Thought Reasoners ”⁠, Freda Shi, Mirac Suzgun, Markus Freitag …, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, ⁠Yi Tay, Sebastian Ruder, ⁠Denny Zhou, Dipanjan Das, Jason Wei
https://arxiv.org/abs/2209.00840: “FOLIO: Natural Language Reasoning With First-Order Logic ”⁠, Simeng Han, Hailey Schoelkopf, Yilun Zhao …, Zhenting Qi, Martin Riddell, Luke Benson, Lucy Sun, Ekaterina Zubova, Yujie Qiao, Matthew Burtell, David Peng, Jonathan Fan, Yixin Liu, Brian Wong⁠, Malcolm Sailor, Ansong Ni, Linyong Nan, Jungo Kasai, Tao Yu, Rui Zhang, Shafiq Joty, Alexander R. Fabbri, Wojciech Kryscinski, Xi Victoria Lin, ⁠Caiming Xiong, Dragomir Radev⁠
https://arxiv.org/abs/2207.08143: “Can Large Language Models Reason about Medical Questions? ”⁠, Valentin Liévin, Christoffer Egeberg Hother, Ole Winther
https://arxiv.org/abs/2207.05608#google: “Inner Monologue: Embodied Reasoning through Planning With Language Models ”⁠, Wenlong Huang, Fei Xia⁠, Ted Xiao …, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch⁠, Yevgen Chebotar, Pierre Sermanet, Noah Brown, Tomas Jackson, Linda Luu, Sergey Levine⁠, Karol Hausman, Brian Ichter
https://arxiv.org/abs/2207.05221#anthropic: “Language Models (Mostly) Know What They Know ”⁠, Saurav Kadavath⁠, Tom Conerly, ⁠Amanda Askell …, Tom Henighan, Dawn Drain, ⁠Ethan Perez, Nicholas Schiefer, Zac Hatfield Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston⁠, Sheer El-Showk, ⁠Andy L. Jones, ⁠Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai⁠, ⁠Samuel R. Bowman, Stanislav Fort, ⁠Deep Ganguli, Danny Hernandez⁠, Josh Jacobson, ⁠Jackson Kernion, Shauna Kravec, Liane Lovitt, Kamal Ndousse, Catherine Olsson, Sam Ringer, Dario Amodei⁠, Tom B. Brown⁠, ⁠Jack Clark⁠, Nicholas Joseph, Ben Mann, Sam McCandlish⁠, Chris Olah, Jared Kaplan
https://arxiv.org/abs/2205.10625#google: “Least-To-Most Prompting Enables Complex Reasoning in Large Language Models ”⁠, ⁠Denny Zhou, Nathanael Schärli, Le Hou …, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Olivier Bousquet, Quoc V. Le⁠, Ed Chi⁠
https://arxiv.org/abs/2205.09073#google: “Dialog Inpainting: Turning Documents into Dialogues ”⁠, Zhuyun Dai, Arun Tejasvi Chaganty, Vincent Zhao …, Aida Amini, Qazi Mamunur Rashid, Mike Green, Kelvin Guu
https://arxiv.org/abs/2205.05131#google: “UL2: Unifying Language Learning Paradigms ”⁠, ⁠Yi Tay, Mostafa Dehghani, Vinh Q. Tran …, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, ⁠Neil Houlsby, Donald Metzler
https://arxiv.org/abs/2204.00598#google: “Socratic Models: Composing Zero-Shot Multimodal Reasoning With Language ”⁠, Andy Zeng, Adrian Wong, Stefan Welker …, Krzysztof Choromanski, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, Pete Florence
https://arxiv.org/abs/2203.11171#google: “Self-Consistency Improves Chain-Of-Thought Reasoning in Language Models ”⁠, Xuezhi Wang, Jason Wei, Dale Schuurmans …, Quoc V. Le⁠, Ed Chi⁠, ⁠Denny Zhou
https://arxiv.org/abs/2201.11903#google: “Chain-Of-Thought Prompting Elicits Reasoning in Large Language Models ”⁠, Jason Wei, Xuezhi Wang, Dale Schuurmans …, Maarten Bosma, Ed Chi⁠, Quoc V. Le⁠, ⁠Denny Zhou
https://arxiv.org/abs/2201.11473#microsoft: “Reasoning Like Program Executors ”⁠, Xinyu Pi, Qian Liu⁠, Bei Chen …, Morteza Ziyadi, Zeqi Lin, Yan Gao, Qiang Fu, Jian-Guang Lou, Weizhu Chen
https://arxiv.org/abs/2112.15594: “A Neural Network Solves and Generates Mathematics Problems by Program Synthesis: Calculus, Differential Equations, Linear Algebra, and More ”⁠, Iddo Drori, Sunny Tran, Roman Wang …, Newman Cheng, Kevin Liu, Leonard Tang, Elizabeth Ke, Nikhil Singh, Taylor L. Patti, Jayson Lynch, Avi Shporer, Nakul Verma⁠, Eugene Wu⁠, Gilbert Strang⁠
https://openai.com/research/webgpt: “WebGPT: Improving the Factual Accuracy of Language Models through Web Browsing ”⁠, ⁠Jacob Hilton, Suchir Balaji, Reiichiro Nakano, ⁠John Schulman
https://arxiv.org/abs/2110.14168#openai: “Training Verifiers to Solve Math Word Problems ”⁠, Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian …, ⁠Jacob Hilton, Reiichiro Nakano, Christopher Hesse, ⁠John Schulman
https://sites.google.com/berkeley.edu/decision-transformer: “Decision Transformer: Reinforcement Learning via Sequence Modeling ”⁠, Lili Chen, ⁠Kevin Lu, ⁠Aravind Rajeswaran …, Kimin Lee⁠, Aditya Grover⁠, Michael Laskin⁠, Pieter Abbeel⁠, Aravind Srinivas⁠, Igor Mordatch⁠
https://gptprompts.wikidot.com/linguistics:word-in-context#toc3: “Word in Context: Agent and Agent Clarification (69% Dev) ”⁠, Matt Brockman
https://news.ycombinator.com/item?id=23990902: “I Found That Getting GPT-3 to Add Its Own "Internal Monologue" in Parentheses to Be a Helpful Strategy… ”⁠, blixt
https://x.com/kleptid/status/1284069270603866113: “Seems to Work ”⁠, KaryoKleptid
https://x.com/kleptid/status/1284098635689611264: “Teaching GPT-3 to Do a Brute Force 'For Loop' Checking Answers Also Seems to Work ”⁠, KaryoKleptid
2018-bisra.pdf: “Inducing Self-Explanation: a Meta-Analysis ”⁠, Kiran Bisra, Qing Liu, John C. Nesbit …, Farimah Salimi, Philip H. Winne