Bibliography:

  1. Machine Learning Scaling

  2. ‘RL’ tag

  3. ‘inner monologue (AI)’ tag

  4. ‘instruct-tuning LLMs’ tag

  5. ‘continual learning’ tag

  6. ‘meta-learning’ tag

  7. ‘AlphaStar’ tag

  8. ‘OA5’ tag

  9. ‘AlphaGo’ tag

  10. ‘Decision Transformer’ tag

  11. ‘MuZero’ tag

  12. ‘MARL’ tag

  13. ‘preference learning’ tag

  14. ‘robotics’ tag

  15. Research Ideas

  16. It Looks Like You’re Trying To Take Over The World

  17. The Scaling Hypothesis

  18. Why Tool AIs Want to Be Agent AIs

  19. Data Scaling Laws in Imitation Learning for Robotic Manipulation

  20. AI Alignment via Slow Substrates: Early Empirical Results With StarCraft II

  22. MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

  23. NAVIX: Scaling MiniGrid Environments with JAX

  24. JEST: Data curation via joint example selection further accelerates multimodal learning

  25. AI Search: The Bitter-er Lesson

  26. Foundational Challenges in Assuring Alignment and Safety of Large Language Models

  27. Simple and Scalable Strategies to Continually Pre-train Large Language Models

  28. Robust agents learn causal world models

  29. Grandmaster-Level Chess Without Search

  30. Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

  31. Vision-Language Models as a Source of Rewards

  32. JaxMARL: Multi-Agent RL Environments in JAX

  33. Diversifying AI: Towards Creative Chess with AlphaZero (AZdb)

  34. Deep RL at Scale: Sorting Waste in Office Buildings with a Fleet of Mobile Manipulators

  35. Emergence of belief-like representations through reinforcement learning

  36. Scaling laws for single-agent reinforcement learning

  37. DreamerV3: Mastering Diverse Domains through World Models

  38. Offline Q-Learning on Diverse Multi-Task Data Both Scales And Generalizes

  39. VeLO: Training Versatile Learned Optimizers by Scaling Up

  40. Broken Neural Scaling Laws

  41. Scaling Laws for Reward Model Overoptimization

  42. SAP: Bidirectional Language Models Are Also Few-shot Learners

  43. g.pt: Learning to Learn with Generative Models of Neural Network Checkpoints

  44. Human-level Atari 200× faster

  45. Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

  46. AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model

  47. TextWorldExpress: Simulating Text Games at One Million Steps Per Second

  48. Demis Hassabis: DeepMind—AI, Superintelligence & the Future of Humanity § Turing Test

  49. Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos

  50. Multi-Game Decision Transformers

  51. Task-Agnostic Continual Reinforcement Learning: In Praise of a Simple Baseline (3RL)

  52. CT0: Fine-tuned Language Models are Continual Learners

  53. Flexible Diffusion Modeling of Long Videos

  54. Instruction Induction: From Few Examples to Natural Language Task Descriptions

  55. Gato: A Generalist Agent

  56. Data Distributional Properties Drive Emergent Few-Shot Learning in Transformers

  57. Habitat-Web: Learning Embodied Object-Search Strategies from Human Demonstrations at Scale

  58. Do As I Can, Not As I Say (SayCan): Grounding Language in Robotic Affordances

  59. Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

  60. InstructGPT: Training language models to follow instructions with human feedback

  61. A data-driven approach for learning to control computers

  62. EvoJAX: Hardware-Accelerated Neuroevolution

  63. Accelerated Quality-Diversity for Robotics through Massive Parallelism

  64. Don’t Change the Algorithm, Change the Data: Exploratory Data for Offline Reinforcement Learning (ExORL)

  65. Can Wikipedia Help Offline Reinforcement Learning?

  66. In Defense of the Unitary Scalarization for Deep Multi-Task Learning

  67. The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models

  68. WebGPT: Browser-assisted question-answering with human feedback

  69. WebGPT: Improving the factual accuracy of language models through web browsing

  70. Acquisition of Chess Knowledge in AlphaZero

  71. AW-Opt: Learning Robotic Skills with Imitation and Reinforcement at Scale

  72. An Explanation of In-context Learning as Implicit Bayesian Inference

  73. Procedural Generalization by Planning with Self-Supervised World Models

  74. MetaICL: Learning to Learn In Context

  75. Collaborating with Humans without Human Data

  76. T0: Multitask Prompted Training Enables Zero-Shot Task Generalization

  77. Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets

  78. Learning to Walk in Minutes Using Massively Parallel Deep Reinforcement Learning

  79. Recursively Summarizing Books with Human Feedback

  80. FLAN: Finetuned Language Models Are Zero-Shot Learners

  81. Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation

  82. WarpDrive: Extremely Fast End-to-End Deep Multi-Agent Reinforcement Learning on a GPU

  83. Multi-Task Self-Training for Learning General Representations

  84. Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning

  85. Open-Ended Learning Leads to Generally Capable Agents

  86. Megaverse: Simulating Embodied Agents at One Million Experiences per Second

  87. Evaluating Large Language Models Trained on Code

  88. PES: Unbiased Gradient Estimation in Unrolled Computation Graphs with Persistent Evolution Strategies

  89. Multimodal Few-Shot Learning with Frozen Language Models

  90. Brax—A Differentiable Physics Engine for Large Scale Rigid Body Simulation

  91. PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World

  92. From Motor Control to Team Play in Simulated Humanoid Football

  93. Reward is enough

  94. Podracer architectures for scalable Reinforcement Learning

  95. MuZero Unplugged: Online and Offline Reinforcement Learning by Planning with a Learned Model

  96. Scaling Scaling Laws with Board Games

  97. Large Batch Simulation for Deep Reinforcement Learning

  98. Stockfish and Lc0, tested at different numbers of nodes

  99. Training Larger Networks for Deep Reinforcement Learning

  100. Investment vs. reward in a competitive knapsack problem

  101. NNUE: The neural network of the Stockfish chess engine

  102. Imitating Interactive Intelligence

  103. Scaling down Deep Learning

  104. Understanding RL Vision: With diverse environments, we can analyze, diagnose and edit deep reinforcement learning models using attribution

  105. Meta-trained agents implement Bayes-optimal agents

  106. Measuring Progress in Deep Reinforcement Learning Sample Efficiency

  107. Learning to summarize from human feedback

  108. Measuring hardware overhang

  109. Sample Factory: Egocentric 3D Control from Pixels at 100,000 FPS with Asynchronous Reinforcement Learning

  110. Real World Games Look Like Spinning Tops

  111. Agent57: Outperforming the human Atari benchmark

  112. Deep neuroethology of a virtual rodent

  113. Near-perfect point-goal navigation from 2.5 billion frames of experience

  114. Procgen Benchmark: We’re releasing Procgen Benchmark, 16 simple-to-use procedurally-generated environments which provide a direct measure of how quickly a reinforcement learning agent learns generalizable skills

  115. DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames

  116. Grandmaster level in StarCraft II using multi-agent reinforcement learning

  117. Solving Rubik’s Cube with a Robot Hand

  118. Fine-Tuning Language Models from Human Preferences

  119. Emergent Tool Use from Multi-Agent Interaction § Surprising behavior

  120. Meta Reinforcement Learning

  121. Human-level performance in 3D multiplayer games with population-based reinforcement learning

  122. AI-GAs: AI-generating algorithms, an alternate paradigm for producing general artificial intelligence

  123. Meta-learning of Sequential Strategies

  124. Habitat: A Platform for Embodied AI Research

  125. The Bitter Lesson

  126. Benchmarking Classic and Learned Navigation in Complex 3D Environments

  127. Dota 2 With Large Scale Deep Reinforcement Learning: §4.3: Batch Size

  129. An Empirical Model of Large-Batch Training

  130. How AI Training Scales

  131. Bayesian Layers: A Module for Neural Network Uncertainty

  132. Quantifying Generalization in Reinforcement Learning

  133. One-Shot High-Fidelity Imitation: Training Large-Scale Deep Nets with RL

  134. Robot Learning in Homes: Improving Generalization and Reducing Dataset Bias

  135. Human-level performance in first-person multiplayer games with population-based deep reinforcement learning

  136. QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation

  137. Playing Atari with Six Neurons

  138. AI and Compute

  139. Accelerated Methods for Deep Reinforcement Learning

  140. One Big Net For Everything

  141. Interactive Grounded Language Acquisition and Generalization in a 2D World

  142. Emergence of Locomotion behaviors in Rich Environments

  143. Deep reinforcement learning from human preferences

  144. Evolution Strategies as a Scalable Alternative to Reinforcement Learning

  145. On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models

  146. Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning

  147. Gorila: Massively Parallel Methods for Deep Reinforcement Learning

  148. Algorithmic Progress in Six Domains

  149. Robot Predictions Evolution

  150. When will computer hardware match the human brain?

  151. Human Window on the World

  152. Time for AI to Cross the Human Performance Range in Chess

  153. Eric Jang

  155. Trading Off Compute in Training and Inference

  156. Trading Off Compute in Training and Inference § MCTS Scaling

  157. Submission #6347: Chef Stef’s NES Arkanoid warpless in 11:11.18

  158. The Addictiveness & Adversarialness of Playing against LeelaQueenOdds

  160. Training a CUDA TDS Ant using a C++ ARS linear policy: the video is real-time; after a few minutes (around 30 million steps) the training curve is flat (trained out to a billion steps). Note that this Ant uses PD control and is not identical to either the MuJoCo or PyBullet Ant, so the training curves are not yet comparable.

  161. Ilya Sutskever: Deep Learning | AI Podcast #94 With Lex Fridman

  162. Target-Driven Visual Navigation in Indoor Scenes Using Deep Reinforcement Learning [Video]

  163. “If you want to solve a hard problem in reinforcement learning, you just scale. It’s just gonna work, just like supervised learning. It’s the same story exactly. It was kind of hard to believe that supervised learning can do all those things, but it’s not just vision, it’s everything, and the same thing seems to hold for reinforcement learning, provided you have a lot of experience.”

  164. Baumli et al 2023, Figure 4: reward scaling with CLIP model size [image]

  165. Jones 2021, Figure 9: train-time vs. tree-search amortization [image]

  166. Liu et al 2021, Figure 5: soccer performance scaling [image]

  167. Jaderberg et al 2019, Figure 1: CTF task and training [image]

  168. McCandlish et al 2018 (OpenAI), “How AI Training Scales”: gradient noise scale summary, scale vs. batch size [image]

  169. http://www.incompleteideas.net/IncIdeas/WrongWithAI.html

  170. https://andyljones.com/megastep/

  171. https://clemenswinter.com/2021/03/24/mastering-real-time-strategy-games-with-deep-reinforcement-learning-mere-mortal-edition/

  173. https://community.arm.com/arm-community-blogs/b/high-performance-computing-blog/posts/deep-learning-episode-4-supercomputer-vs-pong-ii

  174. https://jdlm.info/articles/2018/03/18/markov-decision-process-2048.html

  175. https://openai.com/index/introducing-openai-o1-preview/

  176. https://openai.com/index/mle-bench/

  177. https://openai.com/research/summarizing-books

  178. https://outlast.me/robot-hiveminds-with-network-effects/

  179. https://research.google/blog/google-research-2022-beyond-language-vision-and-generative-models/

  180. https://spectrum.ieee.org/global-robotic-brain

  182. https://www.anthropic.com/index/anthropics-responsible-scaling-policy

  183. https://www.lesswrong.com/posts/65qmEJHDw3vw69tKm/proposal-scaling-laws-for-rl-generalization

  184. https://www.lesswrong.com/posts/65qmEJHDw3vw69tKm/proposal-scaling-laws-for-rl-generalization?commentId=wMerfGZfPHerdzDAi

  185. https://www.lesswrong.com/posts/bwyKCQD7PFWKhELMr/by-default-gpts-think-in-plain-sight#zfzHshctWZYo8JkLe

  186. https://x.com/hausman_k/status/1612509549889744899
