Bibliography:

  1. ‘RL’ tag

  2. ‘AlphaGo’ tag

  3. ‘Decision Transformer’ tag

  4. ‘MuZero’ tag

  5. ‘autoencoder NN’ tag

  6. ‘video generation’ tag

  7. ‘AI chess’ tag

  8. ‘preference learning’ tag

  9. Resorting Media Ratings

  10. Centaur: a foundation model of human cognition

  11. Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making

  12. Interpretable Contrastive Monte Carlo Tree Search Reasoning

  13. OpenAI co-founder Sutskever’s new safety-focused AI startup SSI raises $1 billion

  14. The brain simulates actions and their consequences during REM sleep

  15. Solving Path of Exile Item Crafting With Value Iteration

  17. Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

  18. DT-VIN: Scaling Value Iteration Networks to 5000 Layers for Extreme Long-Term Planning

  19. MCTSr: Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMA-3-8B

  20. Safety Alignment Should Be Made More Than Just a Few Tokens Deep

  21. Can Language Models Serve as Text-Based World Simulators?

  22. Evaluating the World Model Implicit in a Generative Model

  23. OmegaPRM: Improve Mathematical Reasoning in Language Models by Automated Process Supervision

  24. Diffusion On Syntax Trees For Program Synthesis

  25. DeTikZify: Synthesizing Graphics Programs for Scientific Figures and Sketches with TikZ

  26. DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data

  27. Amit’s A Pages

  28. From r to Q*: Your Language Model is Secretly a Q-Function

  29. Algorithmic Collusion by Large Language Models

  30. Identifying general reaction conditions by bandit optimization

  31. Gradient-based Planning with World Models

  32. ReCoRe: Regularized Contrastive Representation Learning of World Model

  33. Can a Transformer Represent a Kalman Filter?

  34. Self-Supervised Behavior Cloned Transformers are Path Crawlers for Text Games

  35. Why Won’t OpenAI Say What the Q* Algorithm Is? Supposed AI breakthroughs are frequently veiled in secrecy, hindering scientific consensus

  36. Zero-Shot Goal-Directed Dialogue via RL on Imagined Conversations

  37. The neural basis of mental navigation in rats: A brain–machine interface demonstrates volitional control of hippocampal activity

  38. Volitional activation of remote place representations with a hippocampal brain–machine interface

  39. Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion

  40. Self-AIXI: Self-Predictive Universal AI

  41. Othello is Solved

  42. Course Correcting Koopman Representations

  43. Predictive auxiliary objectives in deep RL mimic learning in the brain

  44. Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models

  45. Comparative study of model-based and model-free reinforcement learning control performance in HVAC systems

  46. Learning to Model the World with Language

  47. Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations

  48. Fighting Uncertainty with Gradients: Offline Reinforcement Learning via Diffusion Score Matching

  49. Improving Long-Horizon Imitation Through Instruction Prediction

  50. When to Show a Suggestion? Integrating Human Feedback in AI-Assisted Programming (CDHF)

  51. Long-Term Value of Exploration: Measurements, Findings and Algorithms

  52. Emergence of belief-like representations through reinforcement learning

  53. Six Experiments in Action Minimization

  54. Finding Paths of Least Action With Gradient Descent

  55. MimicPlay: Long-Horizon Imitation Learning by Watching Human Play

  56. Graph schemas as abstractions for transfer learning, inference, and planning

  57. John Carmack’s ‘Different Path’ to Artificial General Intelligence

  58. DreamerV3: Mastering Diverse Domains through World Models

  59. Merging enzymatic and synthetic chemistry with computational synthesis planning

  60. PALMER: Perception-Action Loop with Memory for Long-Horizon Planning

  61. Space is a latent [CSCG] sequence: Structured sequence learning as a unified theory of representation in the hippocampus

  62. CICERO: Human-level play in the game of Diplomacy by combining language models with strategic reasoning

  63. Online Learning and Bandits with Queried Hints

  64. E3B: Exploration via Elliptical Episodic Bonuses

  65. Creating a Dynamic Quadrupedal Robotic Goalkeeper with Reinforcement Learning

  66. Top-down design of protein nanomaterials with reinforcement learning

  67. Simplifying Model-based RL: Learning Representations, Latent-space Models, and Policies with One Objective (ALM)

  68. IRIS: Transformers are Sample-Efficient World Models

  69. LGE: Cell-Free Latent Go-Explore

  70. LaTTe: Language Trajectory TransformEr

  71. PI-ARS: Accelerating Evolution-Learned Visual-Locomotion with Predictive Information Representations

  72. Learning with Combinatorial Optimization Layers: a Probabilistic Approach

  73. Spatial representation by ramping activity of neurons in the retrohippocampal cortex

  74. Inner Monologue: Embodied Reasoning through Planning with Language Models

  75. LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action

  76. DayDreamer: World Models for Physical Robot Learning

  77. Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos

  78. GODEL: Large-Scale Pre-Training for Goal-Directed Dialog

  79. BYOL-Explore: Exploration by Bootstrapped Prediction

  80. Director: Deep Hierarchical Planning from Pixels

  81. Flexible Diffusion Modeling of Long Videos

  82. Housekeep: Tidying Virtual Households using Commonsense Reasoning

  83. Semantic Exploration from Language Abstractions and Pretrained Representations

  84. Demonstrate Once, Imitate Immediately (DOME): Learning Visual Servoing for One-Shot Imitation Learning

  85. Do As I Can, Not As I Say (SayCan): Grounding Language in Robotic Affordances

  86. Reinforcement Learning with Action-Free Pre-Training from Videos

  87. On-the-fly Strategy Adaptation for ad-hoc Agent Coordination

  88. VAPO: Affordance Learning from Play for Sample-Efficient Policy Learning

  89. Learning Synthetic Environments and Reward Networks for Reinforcement Learning

  90. How to build a cognitive map: insights from models of the hippocampal formation

  91. LID: Pre-Trained Language Models for Interactive Decision-Making

  92. Rotting Infinitely Many-armed Bandits

  93. Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents

  94. What is the point of computers? A question for pure mathematicians

  95. An Experimental Design Perspective on Model-Based Reinforcement Learning

  96. Reinforcement Learning on Human Decision Models for Uniquely Collaborative AI Teammates

  97. Learning Representations for Pixel-based Control: What Matters and Why?

  98. Learning Behaviors through Physics-driven Latent Imagination

  99. Is Bang-Bang Control All You Need? Solving Continuous Control with Bernoulli Policies

  100. Skill Induction and Planning with Latent Language

  101. Example-Driven Model-Based Reinforcement Learning for Solving Long-Horizon Visuomotor Tasks

  102. TrufLL: Learning Natural Language Generation from Scratch

  103. Dropout’s Dream Land: Generalization from Learned Simulators to Reality

  104. FitVid: Overfitting in Pixel-Level Video Prediction

  105. Brax—A Differentiable Physics Engine for Large Scale Rigid Body Simulation

  106. A Graph Placement Methodology for Fast Chip Design

  107. Planning for Novelty: Width-Based Algorithms for Common Problems in Control, Planning and Reinforcement Learning

  108. The whole prefrontal cortex is premotor cortex

  109. PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World

  110. Constructions in combinatorics via neural networks

  111. Machine Translation Decoding beyond Beam Search

  112. Learning What To Do by Simulating the Past

  113. Waymo Simulated Driving Behavior in Reconstructed Fatal Crashes within an Autonomous Vehicle Operating Domain

  114. Latent Imagination Facilitates Zero-Shot Transfer in Autonomous Racing

  115. Replaying real life: how the Waymo Driver avoids fatal human crashes

  116. Learning Chess Blindfolded: Evaluating Language Models on State Tracking

  117. COMBO: Conservative Offline Model-Based Policy Optimization

  118. A* Search Without Expansions: Learning Heuristic Functions with Deep Q-Networks

  119. ViNG: Learning Open-World Navigation with Visual Goals

  120. Inductive Biases for Deep Learning of Higher-Level Cognition

  121. Multimodal dynamics modeling for off-road autonomous vehicles

  122. Targeting for long-term outcomes

  123. What are the Statistical Limits of Offline RL with Linear Function Approximation?

  124. A Time Leap Challenge for SAT Solving

  125. The Overfitted Brain: Dreams evolved to assist generalization

  126. RL Unplugged: A Suite of Benchmarks for Offline Reinforcement Learning

  127. Mathematical Reasoning via Self-supervised Skip-tree Training

  128. MOPO: Model-based Offline Policy Optimization

  129. Learning to Simulate Dynamic Environments with GameGAN

  130. Planning to Explore via Self-Supervised World Models

  131. Learning to Simulate Dynamic Environments with GameGAN [homepage]

  132. Reinforcement Learning with Augmented Data

  133. Learning to Fly via Deep Model-Based Reinforcement Learning

  134. Introducing Dreamer: Scalable Reinforcement Learning Using World Models

  135. Reinforcement Learning for Combinatorial Optimization: A Survey

  136. Learning to Prove Theorems by Learning to Generate Theorems

  137. The Gambler’s Problem and Beyond

  138. Combining Q-Learning and Search with Amortized Value Estimates

  139. Dream to Control: Learning Behaviors by Latent Imagination

  140. Approximate Inference in Discrete Distributions with Monte Carlo Tree Search and Value Functions

  141. Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning?

  142. Designing agent incentives to avoid reward tampering

  143. An Application of Reinforcement Learning to Aerobatic Helicopter Flight

  144. When to Trust Your Model: Model-Based Policy Optimization (MBPO)

  145. VISR: Fast Task Inference with Variational Intrinsic Successor Features

  146. Learning to Reason in Large Theories without Imitation

  147. Biasing MCTS with Features for General Games

  148. Bayesian Layers: A Module for Neural Network Uncertainty

  149. PlaNet: Learning Latent Dynamics for Planning from Pixels

  150. Bayesian Action Decoder for Deep Multi-Agent Reinforcement Learning

  151. Human-Like Playtesting with Deep Learning

  152. General Value Function Networks

  153. Towards Automated Deep Learning: Efficient Joint Neural Architecture and Hyperparameter Search

  154. The Alignment Problem for Bayesian History-Based Reinforcement Learners

  155. Neural scene representation and rendering

  156. Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models

  157. Mining gold from implicit models to improve likelihood-free inference

  158. Learning to Optimize Tensor Programs

  159. Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review

  160. Estimate and Replace: A Novel Approach to Integrating Deep Neural Networks with Existing Applications

  161. World Models

  162. Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling

  163. Differentiable Dynamic Programming for Structured Prediction and Attention

  164. How to Explore Chemical Space Using Algorithms and Automation

  165. Planning Chemical Syntheses With Deep Neural Networks and Symbolic AI

  166. Generalization Guides Human Exploration in Vast Decision Spaces

  167. Safe Policy Search with Gaussian Process Models

  168. Using Parameterized Black-Box Priors to Scale Up Model-Based Policy Search for Robotics

  169. Analogical-based Bayesian Optimization

  170. A Game-Theoretic Analysis of the Off-Switch Game

  171. Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning

  172. Learning Transferable Architectures for Scalable Image Recognition

  173. Learning model-based planning from scratch

  174. Value Prediction Network

  175. Path Integral Networks: End-to-End Differentiable Optimal Control

  176. Visual Semantic Planning using Deep Successor Representations

  177. AIXIjs: A Software Demo for General Reinforcement Learning

  178. Metacontrol for Adaptive Imagination-Based Optimization

  179. DeepArchitect: Automatically Designing and Training Deep Architectures

  180. Stochastic Constraint Programming as Reinforcement Learning

  181. Recurrent Environment Simulators

  182. Prediction and Control with Temporal Segment Models

  183. Rotting Bandits

  184. The Kelly Coin-Flipping Game: Exact Solutions

  185. The Hippocampus As a Predictive Map

  186. The Predictron: End-To-End Learning and Planning

  187. Model-based Adversarial Imitation Learning

  188. DeepMath: Deep Sequence Models for Premise Selection

  189. Value Iteration Networks

  190. On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models

  191. Classical Planning Algorithms on the Atari Video Games

  192. Optimal Regret Analysis of Thompson Sampling in Stochastic Multi-armed Bandit Problem with Multiple Plays

  193. Compress and Control

  194. Learning to Win by Reading Manuals in a Monte-Carlo Framework

  195. Whatever next? Predictive brains, situated agents, and the future of cognitive science

  196. Model-Based Bayesian Exploration

  197. PUCT: Continuous Upper Confidence Trees with Polynomial Exploration-Consistency

  198. Planning as satisfiability: Heuristics

  199. Width and Serialization of Classical Planning Problems

  200. An Empirical Evaluation of Thompson Sampling

  201. Monte-Carlo Planning in Large POMDPs

  202. A Monte Carlo AIXI Approximation

  203. Evolution and Episodic Memory: An Analysis and Demonstration of a Social Function of Episodic Recollection

  204. Resilient Machines Through Continuous Self-Modeling

  205. Policy Mining: Learning Decision Policies from Fixed Sets of Data

  206. The Speed Prior: A New Simplicity Measure Yielding Near-Optimal Computable Predictions

  207. Iterative widening

  208. Abstract Proof Search

  209. A critique of pure reason

  210. Human Window on the World

  211. Why the Law of Effect Will Not Go Away

  212. Getting the World Record in HATETRIS

  214. Solving Probabilistic Tic-Tac-Toe

  215. Approximate Bayes Optimal Policy Search Using Neural Networks

  217. Embodying Addiction: A Predictive Processing Account

  218. Introducing ‘Computer Use’, a New Claude 3.5 Sonnet, and Claude 3.5 Haiku

  219. Developing a Computer Use Model

  220. Best-Of-n With Misaligned Reward Models for Math Reasoning

  221. Video: Waymo world-simulation reconstruction of fatal-crash case AZ1796255 (Scanlon et al 2021)

  222. Video: Dreamer learning animation (Hafner et al 2020)

  223. Figure: Dreamer model predictions (Hafner et al 2020)

  224. Figure: Dreamer 3-phase architecture (Hafner et al 2020)

  225. Figure 1 from Silver & Veness 2010: illustration of a POMCP MCTS search over a POMDP

  226. http://www.alpha60.de/research/programming_enter/DavidLink_ProgrammingEnter_ComputerResurrection60_2012.pdf

  228. https://blog.evjang.com/2018/08/dijkstras.html

  229. https://github.com/KeeyanGhoreshi/PokemonFireredSingleSequence

  230. https://github.com/Significant-Gravitas/AutoGPT

  231. https://github.com/sanjeevanahilan/nanoChatGPT

  232. https://iagoleal.com/posts/value-iteration-haskell/

  233. https://if50.substack.com/p/christopher-strachey-and-the-dawn

  234. https://journals.sagepub.com/doi/10.1177/17456916231204811

  235. https://madebyoll.in/posts/game_emulation_via_dnn/

  236. https://netflixtechblog.com/artwork-personalization-c589f074ad76

  237. https://openai.com/research/vpt

  238. https://www.aboutwayfair.com/careers/tech-blog/contextual-bandit-for-marketing-treatment-optimization

  239. https://www.bkgm.com/articles/Berliner/ComputerBackgammon/index.html

  241. https://www.dwarkeshpatel.com/p/demis-hassabis#%C2%A7timestamps

  242. https://www.everything2.net/index.pl?node_id=1190642

  243. https://www.freepatentsonline.com/y2024/0104353.html#deepmind

  245. https://www.instacart.com/company/how-its-made/using-contextual-bandit-models-in-large-action-spaces-at-instacart/

  246. https://www.lesswrong.com/posts/S54HKhxQyttNLATKu/deconfusing-direct-vs-amortised-optimization

  247. https://www.lesswrong.com/posts/ZwshvqiqCvXPsZEct/the-learning-theoretic-agenda-status-2023

  248. https://www.lesswrong.com/posts/nmxzr2zsjNtjaHh7x/actually-othello-gpt-has-a-linear-emergent-world

  249. https://www.quantamagazine.org/electric-ripples-in-the-resting-brain-tag-memories-for-storage-20240521/

  251. https://www.reddit.com/gallery/1d6w6b4

  252. https://www.youtube.com/watch?v=g3lc8BxTPiU

  253. https://x.com/jeremyphoward/status/1801037736968913128

  254. https://x.com/moyix/status/1795284112791703735
