Bibliography:

  1. NN Inner Monologue

  2. ‘GPT’ tag

  3. ‘NN sampling’ tag

  4. ‘GPT-4’ tag

  5. ‘instruct-tuning LLMs’ tag

  6. ‘PaLM’ tag

  7. ‘inner-monologue (psych)’ tag

  8. ‘meta-learning’ tag

  9. Free-Play Periods for RL Agents

  10. It Looks Like You’re Trying To Take Over The World

  11. O1 Turns Pro

  12. Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models

  13. Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse

  14. Thinking LLMs: General Instruction Following with Thought Generation

  15. When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1

  16. Evaluation of OpenAI o1: Opportunities and Challenges of AGI

  17. LLMs Still Can’t Plan; Can LRMs? A Preliminary Evaluation of OpenAI’s o1 on PlanBench

  18. Training Language Models to Self-Correct via Reinforcement Learning

  19. To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning

  20. Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process

  21. Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data

  22. Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?

  23. OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI

  24. How Far Can Transformers Reason? The Locality Barrier and Inductive Scratchpad

  25. OmegaPRM: Improve Mathematical Reasoning in Language Models by Automated Process Supervision

  26. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

  27. A Theoretical Understanding of Self-Correction through In-context Alignment

  28. Intelligent Go-Explore (IGE): Standing on the Shoulders of Giant Foundation Models

  29. From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step

  30. Retrieval Head Mechanistically Explains Long-Context Factuality

  31. Let’s Think Dot by Dot: Hidden Computation in Transformer Language Models

  32. Autonomous LLM-driven research from data to human-verifiable research papers

  33. Missed Connections: Lateral Thinking Puzzles for Large Language Models

  34. ChatGPT Can Predict the Future when it Tells Stories Set in the Future About the Past

  35. Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models

  36. Do language models plan ahead for future tokens?

  37. FABLES: Evaluating faithfulness and content selection in book-length summarization

  38. Re-evaluating GPT-4’s bar exam performance

  39. Long-form factuality in large language models

  40. Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking

  41. RNNs are not Transformers (Yet): The Key Bottleneck on In-context Retrieval

  42. Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs

  43. Chain-of-Thought Empowers Transformers to Solve Inherently Serial Problems

  44. Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models

  45. Why are Sensitive Functions Hard for Transformers?

  46. Chain-of-Thought Reasoning Without Prompting

  47. V-STaR: Training Verifiers for Self-Taught Reasoners

  48. More Agents Is All You Need

  49. The Impact of Reasoning Step Length on Large Language Models

  50. Large Language Models Play StarCraft II: Benchmarks and A Chain of Summarization Approach

  51. Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models (ReSTEM)

  52. Tree of Attacks (TAP): Jailbreaking Black-Box LLMs Automatically

  53. Universal Self-Consistency for Large Language Model Generation

  54. Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine

  55. Training Chain-of-Thought via Latent-Variable Inference

  56. Compositional Capabilities of Autoregressive Transformers: A Study on Synthetic, Interpretable Tasks

  57. On Measuring Faithfulness or Self-consistency of Natural Language Explanations

  58. Zero-Shot Goal-Directed Dialogue via RL on Imagined Conversations

  59. Large Language Models can Strategically Deceive their Users when Put Under Pressure

  60. Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves

  61. Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation

  62. Implicit Chain-of-Thought Reasoning via Knowledge Distillation

  63. Preventing Language Models From Hiding Their Reasoning

  64. Branch-Solve-Merge Improves Large Language Model Evaluation and Generation

  65. Can GPT models be Financial Analysts? An Evaluation of ChatGPT and GPT-4 on mock CFA Exams

  66. The Expressive Power of Transformers with Chain-of-Thought

  67. Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models

  68. Large Language Models Cannot Self-Correct Reasoning Yet

  69. Think before you speak: Training Language Models With Pause Tokens

  70. Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve

  71. Contrastive Decoding Improves Reasoning in Large Language Models

  72. Re-Reading Improves Reasoning in Large Language Models

  73. From Sparse to Dense: GPT-4 Summarization with Chain of Density (CoD) Prompting

  74. Graph of Thoughts: Solving Elaborate Problems with Large Language Models

  75. Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification

  76. Android in the Wild: A Large-Scale Dataset for Android Device Control

  77. LLMs as Workers in Human-Computational Algorithms? Replicating Crowdsourcing Pipelines with LLMs

  78. TableGPT: Towards Unifying Tables, Nature Language and Commands into One GPT

  79. Question Decomposition Improves the Faithfulness of Model-Generated Reasoning

  80. Measuring Faithfulness in Chain-of-Thought Reasoning

  81. Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration

  82. Explaining Competitive-Level Programming Solutions using LLMs

  83. Teaching Arithmetic to Small Transformers

  84. Language models are weak learners

  85. Let’s Do a Thought Experiment: Using Counterfactuals to Improve Moral Reasoning

  86. GKD: Generalized Knowledge Distillation for Auto-regressive Sequence Models

  87. Large Language Models as Tax Attorneys: A Case Study in Legal Capabilities Emergence

  88. Iterative Translation Refinement with Large Language Models

  89. Thought Cloning: Learning to Think while Acting by Imitating Human Thinking

  90. Let’s Verify Step by Step

  91. Towards Revealing the Mystery behind Chain-of-Thought: A Theoretical Perspective

  92. Improving Factuality and Reasoning in Language Models through Multiagent Debate

  93. How Language Model Hallucinations Can Snowball

  94. Tree of Thoughts (ToT): Deliberate Problem Solving with Large Language Models

  95. Large Language Model Programs

  96. Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting

  97. Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

  98. Decomposition Enhances Reasoning via Self-Evaluation Guided Decoding

  99. LLM+P: Empowering Large Language Models with Optimal Planning Proficiency

  100. Boosting Theory-of-Mind Performance in Large Language Models via Prompting

  101. Think Before You Act: Unified Policy for Interleaving Language Reasoning with Actions

  102. Language Models can Solve Computer Tasks

  103. Reflexion: Language Agents with Verbal Reinforcement Learning

  104. How well do Large Language Models perform in Arithmetic tasks?

  105. SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models

  106. Language Is Not All You Need: Aligning Perception with Language Models (Kosmos-1)

  107. Multimodal Chain-of-Thought Reasoning in Language Models

  108. Faithful Chain-of-Thought Reasoning

  109. Large Language Models are Versatile Decomposers: Decompose Evidence and Questions for Table-based Reasoning

  110. ChatGPT Goes to Law School

  111. Large Language Models as Fiduciaries: A Case Study Toward Robustly Communicating With Artificial Intelligence Through Legal Standards

  112. Interactive-Chain-Prompting (INTERCPT): Ambiguity Resolution for Crosslingual Conditional Generation with Interaction

  113. Iterated Decomposition: Improving Science Q&A by Supervising Reasoning Processes

  114. Solving math word problems with process- and outcome-based feedback

  115. PAL: Program-aided Language Models

  116. Measuring Progress on Scalable Oversight for Large Language Models

  117. U-PaLM: Transcending Scaling Laws with 0.1% Extra Compute

  118. Large Language Models Can Self-Improve

  119. Challenging BIG-Bench Tasks (BBH) and Whether Chain-of-Thought Can Solve Them

  120. Self-Ask: Measuring and Narrowing the Compositionality Gap in Language Models (Bamboogle)

  121. Language Models are Multilingual Chain-of-Thought Reasoners

  122. ReAct: Synergizing Reasoning and Acting in Language Models

  123. Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning

  124. FOLIO: Natural Language Reasoning with First-Order Logic

  125. Faithful Reasoning Using Large Language Models

  126. Limitations of Language Models in Arithmetic and Symbolic Induction

  127. Language Models Can Teach Themselves to Program Better

  128. Language Model Cascades

  129. CodeT: Code Generation with Generated Tests

  130. Can large language models reason about medical questions?

  131. Inner Monologue: Embodied Reasoning through Planning with Language Models

  132. Exploring Length Generalization in Large Language Models

  133. Language Models (Mostly) Know What They Know

  134. Solving Quantitative Reasoning Problems with Language Models

  135. Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations

  136. Large Language Models are Zero-Shot Reasoners

  137. Instruction Induction: From Few Examples to Natural Language Task Descriptions

  138. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

  139. Dialog Inpainting: Turning Documents into Dialogues

  140. Unifying Language Learning Paradigms

  141. Can language models learn from explanations in context?

  142. Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

  143. STaR: Bootstrapping Reasoning With Reasoning

  144. A Conversational Paradigm for Program Synthesis

  145. Self-Consistency Improves Chain-of-Thought Reasoning in Language Models

  146. Learning-by-Narrating: Narrative Pre-Training for Zero-Shot Dialogue Comprehension

  147. PromptChainer: Chaining Large Language Model Prompts through Visual Programming

  148. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

  149. Reasoning Like Program Executors

  150. A Neural Network Solves and Generates Mathematics Problems by Program Synthesis: Calculus, Differential Equations, Linear Algebra, and More

  151. DREAM: Uncovering Mental Models behind Language Models

  152. Reframing Human-AI Collaboration for Generating Free-Text Explanations

  153. NeuroLogic A*esque Decoding: Constrained Text Generation with Lookahead Heuristics

  154. WebGPT: Improving the factual accuracy of language models through web browsing

  155. Few-Shot Self-Rationalization with Natural Language Prompts

  156. Training Verifiers to Solve Math Word Problems

  157. Unsupervised Neural Machine Translation with Generative Language Models Only

  158. Show Your Work: Scratchpads for Intermediate Computation with Language Models

  159. AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts

  160. Teaching Autoregressive Language Models Complex Tasks By Demonstration

  161. Program Synthesis with Large Language Models

  162. Decision Transformer: Reinforcement Learning via Sequence Modeling

  163. Explainable Multi-hop Verbal Reasoning Through Internal Monologue

  164. A simple method to keep GPT-3 focused in a conversation

  165. Measuring Mathematical Problem Solving With the MATH Dataset

  166. Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm

  167. How We Accidentally Gave our Bots Their Personalities

  168. Word in Context: Agent and Agent Clarification (69% Dev)

  169. I found that getting GPT-3 to add its own "internal monologue" in parentheses to be a helpful strategy…

  170. Seems to work

  171. Teaching GPT-3 to do a brute force 'for loop' checking answers also seems to work

  172. Inducing Self-Explanation: a Meta-Analysis

  173. Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems

  174. Why Do Humans Reason? Arguments for an Argumentative Theory

  176. How to Dramatically Improve the Reasoning Ability of GPT-3

  178. A Preliminary Exploration into Factored Cognition With Language Models

  179. WiC_SelfContextStuffingImproved_Last10_stuft_examplesNV.ipynb

  180. Vincent-163/transformer-Arithmetic

  181. Magic ToDo List Creator

  183. Short Story on AI: ‘Forward Pass’

  184. AI Dungeon Players Can Now Translate Their Stories into Emojis by Just Clicking a Button.

  186. Solving Math Word Problems: We’ve Trained a System That Solves Grade School Math Problems With Nearly Twice the Accuracy of a Fine-Tuned GPT-3 Model. It Solves about 90% As Many Problems As Real Kids: a Small Sample of 9-12 Year Olds Scored 60% on a Test from Our Dataset, While Our System Scored 55% on Those Same Problems. This Is Important Because Today’s AI Is Still Quite Weak at Commonsense Multistep Reasoning, Which Is Easy Even for Grade School Kids. We Achieved These Results by Training Our Model to Recognize Its Mistakes, so That It Can Try Repeatedly Until It Finds a Solution That Works

  187. Prompting Diverse Ideas: Increasing AI Idea Variance

  188. Teaching a Neural Network to Use a Calculator

  190. Connecting the Dots: LLMs Can Infer & Verbalize Latent Structure from Training Data

  191. Preventing Language Models from Hiding Their Reasoning

  192. Steganography in Chain-Of-Thought Reasoning

  193. Visible Thoughts Project and Bounty Announcement

  194. I Think ‘GPT-3 Can’t Do Parity Checking’ Isn’t Quite Right. It Can Clearly Pattern Match the Algorithm, Almost Perfectly. It’s Just a Little Mistake Prone. Here, I Invented a Syntax for Having It Evaluate Parity on Each Pair of Digits. It...almost Gets It Right.

  196. 2023-chen-table1-gpt35promptsusedtorepeatedlyrefinenaturallanguagetranslationsinnermonologuestyle.png

  197. 2023-lee-figure1-numberformattingforgpt2arithmetic.jpg

  198. 2023-lee-figure2-thefourinputformattingoptionsforgptinnermonologue.png

  199. 2023-lee-figure3-performanceofgpton3digitarithmeticdependsondatadistribution.png

  200. 2023-lee-figure9-arithmeticcanbelearnedevenwithnoiseintheinnermonologuetranscripts.jpg

  201. 2023-moghaddam-figure1-examplesofzerovstwoshottheoryofmindprompting.png

  202. 2023-moghaddam-figure3-gpt3andgpt4performanceontheoryofmindwithinnermonologues.jpg

  203. 2023-pilaut-figure2-exampleambiguitiesintranslatingfrenchtoenglish.jpg

  204. 2023-pilaut-figure3-interceptinnermonologuequestionaskingonlyemergesatscalefrompalm62bto540b.png

  205. 2023-lee-figure6-sampleefficiencyofvariousinnermonologueformatsshowingmoredetailedisbetterforimitationlearning.png

  206. 2022-10-24-raldi-gpt3doesanastonishinglygoodjobcreatingbothsidesofaninteractivefictiontranscript.html

  207. 2022-05-28-gpt3user-thinkingisallyouneed.html

  208. 2022-dai-figure4-qreccretrevialperformancelogscalinginwikidialogdatasetsize.jpg

  209. 2022-huang-figure2-3kindsofnaturallanguagefeedbackforcontrollingsaycaninnermonologue.png

  210. 2022-huang-figure3-testinginnermonologuein3roboticdomains.png

  211. 2022-huang-figure5a-emergentcapabilities-continuedadaptationtonewinstructions.png

  212. 2022-huang-figure5b-emergentcapabilities-selfproposingnewgoalsunderinfeasibilityofoldgoals.png

  213. 2022-huang-figure5c-emergentcapabilities-multilingualinteractioninchinese.png

  214. 2022-huang-figure5d-emergentcapabilities-interactivesceneunderstandinglikeshrdlu.png

  215. 2022-lampinen-figure2-gopherperformanceimprovementsfromexplanationofproblems.jpg

  216. 2022-lampinen-figure4-largermodelsbenefitmorefromexplanationofproblems.png

  217. 2022-press-figure3-gpt3selfaskinnermonologuedemonstration.png

  218. 2022-press-figure4-selfaskinnermonologueperformsequallywellon1hopand2hopquestionanswering.png

  219. 2022-press-figure5-selfaskplusgooglesearchengine-innermonologueforsearchingtheinternettoanswermultihopquestions.png

  220. 2022-press-table1-selfaskplusgooglesearchengine-innermonologueforsearchingtheinternettoanswermultihopquestions-benchmarkperformance.jpg

  221. 2022-shi-figure4-multilingualinnermonologuescalingbyparametercountingpt3andpalm.png

  222. 2022-shi-figure5-multiglinalfewshotscalinginpalm540bbynumberofexamples.png

  223. 2022-tay-ul2-innermonologueresults.png

  224. 2022-wang-figure2-selfconsistencycompletiongreatlyimprovesanswercorrectness.jpg

  225. 2022-wei-figure2-lamdamathwordproblemscalinginmodelparametersize.jpg

  226. 2022-wei-figure3-lamdamathwordproblemscalingwithmodelparametersizewhenusinginnermonologueprompts.jpg

  227. 2022-wei-figure5-lamdamatsymbolicreasoningproblemscalingwithmodelparametersizewhenusinginnermonologueprompts.png

  228. 2022-wei-figure6-lamdacommonsensereasoningproblemscalingwithmodelparametersizewhenusinginnermonologueprompts.png

  229. 2022-wei-figure8-lamdavsgpt3.png

  230. 2022-zeng-figure2-socraticmodelsworkflowoverview.png

  231. https://applied-llms.org/

  232. https://blog.valentin.sh/chatgpt5/

  234. https://builtin.com/job/customer-success/expert-ai-teacher-contract/1267315

  235. https://generative.ink/posts/methods-of-prompt-programming/#serializing-reasoning

  237. https://github.com/OpenBioLink/ThoughtSource

  238. https://github.com/desik1998/MathWithLLMs

  239. https://github.com/ggerganov/llama.cpp/pull/1773

  240. https://github.com/openai/openai-cookbook/blob/main/techniques_to_improve_reliability.md

  241. https://jxnl.github.io/instructor/blog/2023/11/05/chain-of-density/

  243. https://lingo.csail.mit.edu/blog/arithmetic_gpt3/

  245. https://model-checking.github.io/kani-verifier-blog/2023/05/01/writing-code-with-chatgpt-improve-it-with-kani.html

  246. https://niplav.site/decompose.html#Small_Experiment

  247. https://openai.com/index/introducing-openai-o1-preview/

  248. https://platform.openai.com/docs/guides/reasoning/how-reasoning-works

  249. https://reasoning-tokens.ghost.io/reasoning-tokens/

  250. https://research.google/blog/google-research-2022-beyond-language-vision-and-generative-models/

  251. https://research.google/blog/minerva-solving-quantitative-reasoning-problems-with-language-models/

  252. https://statmodeling.stat.columbia.edu/2023/08/30/chatgpt-4-can-do-3-digit-multiplication/

  253. https://towardsdatascience.com/1-1-3-wait-no-1-1-2-how-to-have-gpt-sanity-check-itself-136e846987bf

  255. https://www.fhi.ox.ac.uk/wp-content/uploads/2021/08/QNRs_FHI-TR-2021-3.0.pdf

  256. https://www.lesswrong.com/posts/XaKLjyDejtXDoRAzL/a-quick-experiment-on-lms-inductive-biases-in-performing

  258. https://www.lesswrong.com/posts/bwyKCQD7PFWKhELMr/by-default-gpts-think-in-plain-sight#zfzHshctWZYo8JkLe

  259. https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-i/

  260. https://www.patterns.app/blog/2023/01/18/crunchbot-sql-analyst-gpt/

  262. https://www.pnas.org/doi/full/10.1073/pnas.2317967121

  263. https://www.reddit.com/r/ChatGPT/comments/10zavbv/extending_chatgpt_with_some_additional_internal/

  265. https://www.reddit.com/r/ChatGPT/comments/11anct1/its_easy_to_give_chatgpt_a_bonafide_consciousness/

  267. https://www.reddit.com/r/LocalLLaMA/comments/1fuxw8d/just_for_kicks_i_looked_at_the_newly_released/

  268. https://www.reddit.com/r/OpenAI/comments/1fxa6d6/two_purported_instances_of_o1preview_and_o1mini/

  269. https://www.reddit.com/r/OpenAI/comments/1gjj430/o1_preview_got_weird_today/

  270. https://www.reddit.com/r/PromptEngineering/comments/1fj6h13/hallucinations_in_o1preview_reasoning/

  271. https://www.reddit.com/r/slatestarcodex/comments/1201v68/10word_quote_a_short_and_simple_failure_mode_of/jdigzkh/?context=3

  272. https://www.waluigipurple.com/post/revising-poetry-with-gpt-4

  274. https://www.youtube.com/watch?v=g7YJIpkk7KM&t=38

  275. https://x.com/AISafetyMemes/status/1841891795782775221

  276. https://x.com/BlinkDL_AI/status/1677593798531223552

  277. https://x.com/D_Rod_Tweets/status/1628449917898264576

  278. https://x.com/DaveMonlander/status/1612802240582135809

  279. https://x.com/KevinAFischer/status/1646018246225846272

  280. https://x.com/KevinAFischer/status/1646677902833102849

  281. https://x.com/KevinAFischer/status/1646690838981005312

  282. https://x.com/Kyrannio/status/1793874431179460911

  283. https://x.com/MParakhin/status/1632087709060825088

  284. https://x.com/MikePFrank/status/1622202768743096320

  285. https://x.com/MikePFrank/status/1622495004810784768

  286. https://x.com/StudentInfosec/status/1640360234882310145

  287. https://x.com/adiwyner/status/1629980541716922369

  288. https://x.com/amasad/status/1628546489843863555

  289. https://x.com/andrewwhite01/status/1616933106786738176

  290. https://x.com/deepfates/status/1682110624271319040

  291. https://x.com/denny_zhou/status/1547662872511070212

  292. https://x.com/denny_zhou/status/1587115933293678592

  293. https://x.com/emollick/status/1705422957856604503

  294. https://x.com/finereli/status/1782611247709786145

  295. https://x.com/gfodor/status/1626270272314839041

  296. https://x.com/goodside/status/1563191853587271681

  297. https://x.com/goodside/status/1568375796904886274

  298. https://x.com/goodside/status/1568375802903015425

  299. https://x.com/goodside/status/1568416130133368835

  300. https://x.com/goodside/status/1568448128495534081

  301. https://x.com/goodside/status/1581868987952300032

  302. https://x.com/goodside/status/1612017392518840320

  303. https://x.com/goodside/status/1635711013566795776

  304. https://x.com/jd_pressman/status/1646766004637401088

  305. https://x.com/jeremyphoward/status/1801037736968913128

  306. https://x.com/jmilldotdev/status/1592288240861839360

  307. https://x.com/lemonodor/status/1628270074074398720

  308. https://x.com/littmath/status/1598128056874721283

  309. https://x.com/mbusigin/status/1789334007047455178

  310. https://x.com/md_rumpf/status/1647911393796956162

  311. https://x.com/peterwildeford/status/1522633978305560576

  312. https://x.com/shinboson/status/1805459742518595585

  313. https://x.com/wgussml/status/1834712489822765295

  314. https://x.com/yoheinakajima/status/1670557048743010305

  315. https://yaofu.notion.site/A-Closer-Look-at-Large-Language-Models-Emergent-Abilities-493876b55df5479d80686f68a1abd72f
