Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models
Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse
Thinking LLMs: General Instruction Following with Thought Generation
When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1
Evaluation of OpenAI o1: Opportunities and Challenges of AGI
LLMs Still Can’t Plan; Can LRMs? A Preliminary Evaluation of OpenAI’s o1 on PlanBench
Training Language Models to Self-Correct via Reinforcement Learning
To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning
Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process
Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data
Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?
OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI
How Far Can Transformers Reason? The Locality Barrier and Inductive Scratchpad
OmegaPRM: Improve Mathematical Reasoning in Language Models by Automated Process Supervision
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
A Theoretical Understanding of Self-Correction through In-context Alignment
Intelligent Go-Explore (IGE): Standing on the Shoulders of Giant Foundation Models
From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step
Retrieval Head Mechanistically Explains Long-Context Factuality
Let’s Think Dot by Dot: Hidden Computation in Transformer Language Models
Autonomous LLM-driven research from data to human-verifiable research papers
Missed Connections: Lateral Thinking Puzzles for Large Language Models
ChatGPT Can Predict the Future when it Tells Stories Set in the Future About the Past
Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models
FABLES: Evaluating faithfulness and content selection in book-length summarization
Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
RNNs are not Transformers (Yet): The Key Bottleneck on In-context Retrieval
Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs
Chain-of-Thought Empowers Transformers to Solve Inherently Serial Problems
Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models
The Impact of Reasoning Step Length on Large Language Models
Large Language Models Play StarCraft II: Benchmarks and A Chain of Summarization Approach
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models (ReSTEM)
Tree of Attacks (TAP): Jailbreaking Black-Box LLMs Automatically
Universal Self-Consistency for Large Language Model Generation
Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine
Compositional Capabilities of Autoregressive Transformers: A Study on Synthetic, Interpretable Tasks
On Measuring Faithfulness or Self-consistency of Natural Language Explanations
Zero-Shot Goal-Directed Dialogue via RL on Imagined Conversations
Large Language Models can Strategically Deceive their Users when Put Under Pressure
Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves
Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation
Implicit Chain-of-Thought Reasoning via Knowledge Distillation
Branch-Solve-Merge Improves Large Language Model Evaluation and Generation
Can GPT models be Financial Analysts? An Evaluation of ChatGPT and GPT-4 on mock CFA Exams
The Expressive Power of Transformers with Chain-of-Thought
Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models
Think before you speak: Training Language Models With Pause Tokens
Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve
Contrastive Decoding Improves Reasoning in Large Language Models
From Sparse to Dense: GPT-4 Summarization with Chain of Density (CoD) Prompting
Graph of Thoughts: Solving Elaborate Problems with Large Language Models
Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification
Android in the Wild: A Large-Scale Dataset for Android Device Control
LLMs as Workers in Human-Computational Algorithms? Replicating Crowdsourcing Pipelines with LLMs
TableGPT: Towards Unifying Tables, Nature Language and Commands into One GPT
Question Decomposition Improves the Faithfulness of Model-Generated Reasoning
Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration
Explaining Competitive-Level Programming Solutions using LLMs
Let’s Do a Thought Experiment: Using Counterfactuals to Improve Moral Reasoning
GKD: Generalized Knowledge Distillation for Auto-regressive Sequence Models
Large Language Models as Tax Attorneys: A Case Study in Legal Capabilities Emergence
Iterative Translation Refinement with Large Language Models
Thought Cloning: Learning to Think while Acting by Imitating Human Thinking
Towards Revealing the Mystery behind Chain-of-Thought: A Theoretical Perspective
Improving Factuality and Reasoning in Language Models through Multiagent Debate
Tree of Thoughts (ToT): Deliberate Problem Solving with Large Language Models
Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
Decomposition Enhances Reasoning via Self-Evaluation Guided Decoding
LLM+P: Empowering Large Language Models with Optimal Planning Proficiency
Boosting Theory-of-Mind Performance in Large Language Models via Prompting
Think Before You Act: Unified Policy for Interleaving Language Reasoning with Actions
Reflexion: Language Agents with Verbal Reinforcement Learning
How well do Large Language Models perform in Arithmetic tasks?
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
Language Is Not All You Need: Aligning Perception with Language Models (Kosmos-1)
Large Language Models are Versatile Decomposers: Decompose Evidence and Questions for Table-based Reasoning
Large Language Models as Fiduciaries: A Case Study Toward Robustly Communicating With Artificial Intelligence Through Legal Standards
Interactive-Chain-Prompting (INTERCPT): Ambiguity Resolution for Crosslingual Conditional Generation with Interaction
Iterated Decomposition: Improving Science Q&A by Supervising Reasoning Processes
Solving math word problems with process- and outcome-based feedback
Measuring Progress on Scalable Oversight for Large Language Models
Challenging BIG-Bench Tasks (BBH) and Whether Chain-of-Thought Can Solve Them
Self-Ask: Measuring and Narrowing the Compositionality Gap in Language Models (Bamboogle)
Language Models are Multilingual Chain-of-Thought Reasoners
ReAct: Synergizing Reasoning and Acting in Language Models
Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning
Limitations of Language Models in Arithmetic and Symbolic Induction
Inner Monologue: Embodied Reasoning through Planning with Language Models
Solving Quantitative Reasoning Problems with Language Models
Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations
Instruction Induction: From Few Examples to Natural Language Task Descriptions
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
Self-Consistency Improves Chain-of-Thought Reasoning in Language Models
Learning-by-Narrating: Narrative Pre-Training for Zero-Shot Dialogue Comprehension
PromptChainer: Chaining Large Language Model Prompts through Visual Programming
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
A Neural Network Solves and Generates Mathematics Problems by Program Synthesis: Calculus, Differential Equations, Linear Algebra, and More
Reframing Human-AI Collaboration for Generating Free-Text Explanations
NeuroLogic A✱esque Decoding: Constrained Text Generation with Lookahead Heuristics
WebGPT: Improving the factual accuracy of language models through web browsing
Few-Shot Self-Rationalization with Natural Language Prompts
Unsupervised Neural Machine Translation with Generative Language Models Only
Show Your Work: Scratchpads for Intermediate Computation with Language Models
AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts
Teaching Autoregressive Language Models Complex Tasks By Demonstration
Decision Transformer: Reinforcement Learning via Sequence Modeling
Explainable Multi-hop Verbal Reasoning Through Internal Monologue
Measuring Mathematical Problem Solving With the MATH Dataset
Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm
I found that getting GPT-3 to add its own "internal monologue" in parentheses to be a helpful strategy…
Teaching GPT-3 to do a brute force 'for loop' checking answers also seems to work
Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems
Why Do Humans Reason? Arguments for an Argumentative Theory
How to Dramatically Improve the Reasoning Ability of GPT-3
A Preliminary Exploration into Factored Cognition With Language Models
WiC_SelfContextStuffingImproved_Last10_stuft_examplesNV.ipynb
AI Dungeon Players Can Now Translate Their Stories into Emojis by Just Clicking a Button.
Solving Math Word Problems: We’ve trained a system that solves grade school math problems with nearly twice the accuracy of a fine-tuned GPT-3 model. It solves about 90% as many problems as real kids: a small sample of 9–12 year olds scored 60% on a test from our dataset, while our system scored 55% on those same problems. This is important because today’s AI is still quite weak at commonsense multistep reasoning, which is easy even for grade school kids. We achieved these results by training our model to recognize its mistakes, so that it can try repeatedly until it finds a solution that works
I think ‘GPT-3 can’t do parity checking’ isn’t quite right. It can clearly pattern match the algorithm, almost perfectly. It’s just a little mistake prone. Here, I invented a syntax for having it evaluate parity on each pair of digits. It…almost gets it right.
2023-chen-table1-gpt35promptsusedtorepeatedlyrefinenaturallanguagetranslationsinnermonologuestyle.png
2023-lee-figure2-thefourinputformattingoptionsforgptinnermonologue.png
2023-lee-figure3-performanceofgpton3digitarithmeticdependsondatadistribution.png
2023-lee-figure9-arithmeticcanbelearnedevenwithnoiseintheinnermonologuetranscripts.jpg
2023-moghaddam-figure1-examplesofzerovstwoshottheoryofmindprompting.png
2023-moghaddam-figure3-gpt3andgpt4performanceontheoryofmindwithinnermonologues.jpg
2023-pilaut-figure2-exampleambiguitiesintranslatingfrenchtoenglish.jpg
2023-pilaut-figure3-interceptinnermonologuequestionaskingonlyemergesatscalefrompalm62bto540b.png
2023-lee-figure6-sampleefficiencyofvariousinnermonologueformatsshowingmoredetailedisbetterforimitationlearning.png
2022-10-24-raldi-gpt3doesanastonishinglygoodjobcreatingbothsidesofaninteractivefictiontranscript.html
2022-dai-figure4-qreccretrevialperformancelogscalinginwikidialogdatasetsize.jpg
2022-huang-figure2-3kindsofnaturallanguagefeedbackforcontrollingsaycaninnermonologue.png
2022-huang-figure3-testinginnermonologuein3roboticdomains.png
2022-huang-figure5a-emergentcapabilities-continuedadaptationtonewinstructions.png
2022-huang-figure5b-emergentcapabilities-selfproposingnewgoalsunderinfeasibilityofoldgoals.png
2022-huang-figure5c-emergentcapabilities-multilingualinteractioninchinese.png
2022-huang-figure5d-emergentcapabilities-interactivesceneunderstandinglikeshrdlu.png
2022-lampinen-figure2-gopherperformanceimprovementsfromexplanationofproblems.jpg
2022-lampinen-figure4-largermodelsbenefitmorefromexplanationofproblems.png
2022-press-figure3-gpt3selfaskinnermonologuedemonstration.png
2022-press-figure4-selfaskinnermonologueperformsequallywellon1hopand2hopquestionanswering.png
2022-press-figure5-selfaskplusgooglesearchengine-innermonologueforsearchingtheinternettoanswermultihopquestions.png
2022-press-table1-selfaskplusgooglesearchengine-innermonologueforsearchingtheinternettoanswermultihopquestions-benchmarkperformance.jpg
2022-shi-figure4-multilingualinnermonologuescalingbyparametercountingpt3andpalm.png
2022-shi-figure5-multiglinalfewshotscalinginpalm540bbynumberofexamples.png
2022-wang-figure2-selfconsistencycompletiongreatlyimprovesanswercorrectness.jpg
2022-wei-figure2-lamdamathwordproblemscalinginmodelparametersize.jpg
2022-wei-figure3-lamdamathwordproblemscalingwithmodelparametersizewhenusinginnermonologueprompts.jpg
2022-wei-figure5-lamdamatsymbolicreasoningproblemscalingwithmodelparametersizewhenusinginnermonologueprompts.png
2022-wei-figure6-lamdacommonsensereasoningproblemscalingwithmodelparametersizewhenusinginnermonologueprompts.png
https://builtin.com/job/customer-success/expert-ai-teacher-contract/1267315
https://generative.ink/posts/methods-of-prompt-programming/#serializing-reasoning
https://github.com/openai/openai-cookbook/blob/main/techniques_to_improve_reliability.md
https://jxnl.github.io/instructor/blog/2023/11/05/chain-of-density/
https://model-checking.github.io/kani-verifier-blog/2023/05/01/writing-code-with-chatgpt-improve-it-with-kani.html
https://platform.openai.com/docs/guides/reasoning/how-reasoning-works
https://research.google/blog/google-research-2022-beyond-language-vision-and-generative-models/
https://research.google/blog/minerva-solving-quantitative-reasoning-problems-with-language-models/
https://statmodeling.stat.columbia.edu/2023/08/30/chatgpt-4-can-do-3-digit-multiplication/
https://towardsdatascience.com/1-1-3-wait-no-1-1-2-how-to-have-gpt-sanity-check-itself-136e846987bf
https://www.fhi.ox.ac.uk/wp-content/uploads/2021/08/QNRs_FHI-TR-2021-3.0.pdf
https://www.lesswrong.com/posts/XaKLjyDejtXDoRAzL/a-quick-experiment-on-lms-inductive-biases-in-performing
https://www.lesswrong.com/posts/bwyKCQD7PFWKhELMr/by-default-gpts-think-in-plain-sight#zfzHshctWZYo8JkLe
https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-i/
https://www.patterns.app/blog/2023/01/18/crunchbot-sql-analyst-gpt/
https://www.reddit.com/r/ChatGPT/comments/10zavbv/extending_chatgpt_with_some_additional_internal/
https://www.reddit.com/r/ChatGPT/comments/11anct1/its_easy_to_give_chatgpt_a_bonafide_consciousness/
https://www.reddit.com/r/LocalLLaMA/comments/1fuxw8d/just_for_kicks_i_looked_at_the_newly_released/
https://www.reddit.com/r/OpenAI/comments/1fxa6d6/two_purported_instances_of_o1preview_and_o1mini/
https://www.reddit.com/r/OpenAI/comments/1gjj430/o1_preview_got_weird_today/
https://www.reddit.com/r/PromptEngineering/comments/1fj6h13/hallucinations_in_o1preview_reasoning/
https://www.reddit.com/r/slatestarcodex/comments/1201v68/10word_quote_a_short_and_simple_failure_mode_of/jdigzkh/?context=3
https://www.waluigipurple.com/post/revising-poetry-with-gpt-4
https://yaofu.notion.site/A-Closer-Look-at-Large-Language-Models-Emergent-Abilities-493876b55df5479d80686f68a1abd72f