Bibliography:

  1. ‘RL’ tag

  2. ‘AI mode collapse’ tag

  3. ‘Midjourney’ tag

  4. ‘NN sampling’ tag

  5. ‘Sydney (AI)’ tag

  6. ‘instruct-tuning LLMs’ tag

  7. ‘Decision Transformer’ tag

  8. ‘offline RL’ tag

  9. ‘statistical comparison’ tag

  10. GPT-3 Semantic Derealization

  11. Midjourney v6 Personalized vs Default Samples

  12. Revisiting Your Memory: Reconstruction of Affect-Contextualized Memory via EEG-guided Audiovisual Generation

  13. AI-generated poetry is indistinguishable from human-written poetry and is rated more favorably

  14. Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse RL

  15. Thinking LLMs: General Instruction Following with Thought Generation

  16. Language Models Learn to Mislead Humans via RLHF

  17. Does Style Matter? Disentangling Style and Substance in Chatbot Arena

  18. f378decdc51f1ed985c69386f92511c2898363c7.html

  19. LLM Applications I Want To See

  20. 994c2f94d62a984842ed3fa41412926dccca6241.html

  21. SEAL: Systematic Error Analysis for Value ALignment

  22. Hermes 3 Technical Report

  23. Does Refusal Training in LLMs Generalize to the Past Tense?

  24. Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization

  25. Nemotron-4 340B Technical Report

  26. Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback

  27. Discovering Preference Optimization Algorithms with and for Large Language Models

  28. Beyond Model Collapse: Scaling Up with Synthesized Data Requires Reinforcement

  29. Safety Alignment Should Be Made More Than Just a Few Tokens Deep

  30. AlignEZ: Is Free Self-Alignment Possible?

  31. Aligning LLM Agents by Learning Latent Preference from User Edits

  32. Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data

  33. From r to Q*: Your Language Model is Secretly a Q-Function

  34. Dataset Reset Policy Optimization for RLHF

  35. ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback

  36. Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

  37. TextCraftor: Your Text Encoder Can be Image Quality Controller

  38. RewardBench: Evaluating Reward Models for Language Modeling

  39. Evaluating Text to Image Synthesis: Survey and Taxonomy of Image Quality Metrics

  40. When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback

  41. V-STaR: Training Verifiers for Self-Taught Reasoners

  42. I Think, Therefore I am: Benchmarking Awareness of Large Language Models Using AwareBench

  43. Can AI Assistants Know What They Don’t Know?

  44. Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

  45. Blending Is All You Need: Cheaper, Better Alternative to Trillion-Parameters LLM

  46. A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity

  47. Reasons to Reject? Aligning Language Models with Judgments

  48. Rich Human Feedback for Text-to-Image Generation

  49. Language Model Alignment with Elastic Reset

  50. The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning

  51. Universal Jailbreak Backdoors from Poisoned Human Feedback

  52. Diffusion Model Alignment Using Direct Preference Optimization

  53. Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild

  54. Specific versus General Principles for Constitutional AI

  55. Eureka: Human-Level Reward Design via Coding Large Language Models

  56. A General Theoretical Paradigm to Understand Learning from Human Preferences

  57. Interpreting Learned Feedback Patterns in Large Language Models

  58. UltraFeedback: Boosting Language Models with High-quality Feedback

  59. Motif: Intrinsic Motivation from Artificial Intelligence Feedback

  60. Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack

  61. STARC: A General Framework For Quantifying Differences Between Reward Functions

  62. AceGPT, Localizing Large Language Models in Arabic

  63. RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

  64. Activation Addition: Steering Language Models Without Optimization

  65. ReST: Reinforced Self-Training for Language Modeling

  66. FABRIC: Personalizing Diffusion Models with Iterative Feedback

  67. LLaMA-2: Open Foundation and Fine-Tuned Chat Models

  68. Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations

  69. Introducing Superalignment

  70. Are aligned neural networks adversarially aligned?

  71. AI Is a Lot of Work: As the technology becomes ubiquitous, a vast tasker underclass is emerging—and not going anywhere

  72. Large Language Models Sometimes Generate Purely Negatively-Reinforced Text

  73. Microsoft and OpenAI Forge Awkward Partnership as Tech’s New Power Couple: As the companies lead the AI boom, their unconventional arrangement sometimes causes conflict

  74. Direct Preference Optimization (DPO): Your Language Model is Secretly a Reward Model

  75. Improving Language Models with Advantage-based Offline Policy Gradients

  76. LIMA: Less Is More for Alignment

  77. A Radical Plan to Make AI Good, Not Evil

  78. SELF-ALIGN: Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision

  79. Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation

  80. Fantastic Rewards and How to Tame Them: A Case Study on Reward Learning for Task-oriented Dialogue Systems

  81. Use GPT-3 incorrectly: reduce costs 40× and increase speed by 5×

  82. OpenAI’s Sam Altman Talks ChatGPT And How Artificial General Intelligence Can ‘Break Capitalism’

  83. Big Tech was moving cautiously on AI. Then came ChatGPT. Google, Facebook and Microsoft helped build the scaffolding of AI. Smaller companies are taking it to the masses, forcing Big Tech to react

  84. The inside story of ChatGPT: How OpenAI founder Sam Altman built the world’s hottest technology with billions from Microsoft

  85. Self-Instruct: Aligning Language Models with Self-Generated Instructions

  86. HALIE: Evaluating Human-Language Model Interaction

  87. Constitutional AI: Harmlessness from AI Feedback

  88. Solving math word problems with process- and outcome-based feedback

  89. Mysteries of mode collapse § Inescapable wedding parties

  90. When Life Gives You Lemons, Make Cherryade: Converting Feedback from Bad Responses into Good Labels

  91. Scaling Laws for Reward Model Overoptimization

  92. Teacher Forcing Recovers Reward Functions for Text Generation

  93. CARP: Robust Preference Learning for Storytelling via Contrastive Reinforcement Learning

  94. Is Reinforcement Learning (Not) for Natural Language Processing: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization

  95. Sparrow: Improving alignment of dialogue agents via targeted human judgements

  96. Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

  97. Basis for Intentions (BASIS): Efficient Inverse Reinforcement Learning using Past Experience

  98. Improved Policy Optimization for Online Imitation Learning

  99. Quark: Controllable Text Generation with Reinforced Unlearning

  100. Housekeep: Tidying Virtual Households using Commonsense Reasoning

  101. Imitating, Fast and Slow: Robust learning from demonstrations via decision-time planning

  102. Inferring Rewards from Language in Context

  103. SURF: Semi-supervised Reward Learning with Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning

  104. InstructGPT: Training language models to follow instructions with human feedback

  105. Safe Deep RL in 3D Environments using Human Feedback

  106. A Survey of Controllable Text Generation using Transformer-based Pre-trained Language Models

  107. WebGPT: Browser-assisted question-answering with human feedback

  108. WebGPT: Improving the factual accuracy of language models through web browsing

  109. Modeling Strong and Human-Like Gameplay with KL-Regularized Search

  110. A General Language Assistant as a Laboratory for Alignment

  111. Cut the CARP: Fishing for zero-shot story evaluation

  112. Recursively Summarizing Books with Human Feedback

  113. B-Pref: Benchmarking Preference-Based Reinforcement Learning

  114. Trajectory Transformer: Reinforcement Learning as One Big Sequence Modeling Problem

  115. Embracing New Techniques in Deep Learning for Estimating Image Memorability

  116. A Survey of Preference-Based Reinforcement Learning Methods

  117. Learning What To Do by Simulating the Past

  118. Language Models have a Moral Dimension

  119. Brain-computer interface for generating personally attractive images

  120. Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets

  121. Human-centric Dialog Training via Offline Reinforcement Learning

  122. Learning to summarize from human feedback

  123. Learning Personalized Models of Human Behavior in Chess

  124. Aligning Superhuman AI with Human Behavior: Chess as a Model System

  125. Active Preference-Based Gaussian Process Regression for Reward Learning

  126. Bayesian REX: Safe Imitation Learning via Fast Bayesian Reward Inference from Preferences

  127. RL agents Implicitly Learning Human Preferences

  128. Reward-rational (implicit) choice: A unifying formalism for reward learning

  129. What does BERT dream of? A visual investigation of nightmares in Sesame Street

  130. Deep Bayesian Reward Learning from Preferences

  131. Learning Norms from Stories: A Prior for Value Aligned Agents

  132. Reinforcement Learning Upside Down: Don’t Predict Rewards—Just Map Them to Actions

  133. Learning Human Objectives by Evaluating Hypothetical Behavior

  134. Preference-Based Learning for Exoskeleton Gait Optimization

  135. Do Massively Pretrained Language Models Make Better Storytellers?

  136. Fine-Tuning GPT-2 from Human Preferences § Bugs can optimize for bad behavior

  137. Fine-Tuning GPT-2 from Human Preferences

  138. Fine-Tuning Language Models from Human Preferences

  139. lm-human-preferences

  140. Better Rewards Yield Better Summaries: Learning to Summarise Without References

  141. Dueling Posterior Sampling for Preference-Based Reinforcement Learning

  142. Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog

  143. Reward learning from human preferences and demonstrations in Atari

  144. StreetNet: Preference Learning with Convolutional Neural Network on Urban Crime Perception

  145. Toward Diverse Text Generation with Inverse Reinforcement Learning

  146. Ordered Preference Elicitation Strategies for Supporting Multi-Objective Decision Making

  147. Convergence of Value Aggregation for Imitation Learning

  148. A Low-Cost Ethics Shaping Approach for Designing Reinforcement Learning Agents

  149. Deep TAMER: Interactive Agent Shaping in High-Dimensional State Spaces

  150. DropoutDAgger: A Bayesian Approach to Safe Imitation Learning

  151. NIMA: Neural Image Assessment

  152. Towards personalized human AI interaction—adapting the behavior of AI agents using neural signatures of subjective interest

  153. A deep architecture for unified esthetic prediction

  154. Learning human behaviors from motion capture by adversarial imitation

  155. Learning from Human Preferences

  156. Deep reinforcement learning from human preferences

  157. Learning through human feedback [blog]

  158. Adversarial Ranking for Language Generation

  159. An Invitation to Imitation

  160. Just Sort It! A Simple and Effective Approach to Active Preference Learning

  161. Algorithmic and Human Teaching of Sequential Decision Tasks

  162. Bayesian Active Learning for Classification and Preference Learning

  163. DAgger: A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning

  164. John Schulman’s Homepage

  165. An Analysis of AI Political Preferences from a European Perspective

  166. Something Weird Is Happening With LLMs and Chess

  167. Transformers As Variational Autoencoders

  168. The Taming of the AI

  169. Copilot Stops Working on `gender` Related Subjects · Community · Discussion #72603

  170. 240b757ca122975adc355feffb57df79223bfa90.html

  171. Transformer-VAE for Program Synthesis

  172. Claude’s Character

  173. a9f33831747615fc9d619b346ca263844b243b61.html

  174. How Did You Do On The AI Art Turing Test?

  175. Tülu 3: The Next Era in Open Post-Training

  176. Interpreting Preference Models With Sparse Autoencoders

  177. 704ba4488bcfca509f4f8c8bb3627ef5fb21f53b.html

  178. When Your AIs Deceive You: Challenges With Partial Observability in RLHF

  179. Learning and Manipulating Learning

  180. Model Mis-Specification and Inverse Reinforcement Learning

  181. Full Toy Model for Preference Learning

  182. 2023-kirstain-figure6-inversecorrelationbetweenmscocofidqualityandhumanexpertrankingofimagequality.jpg

  183. 2023-kirstain-figure7-comparisonofhighervslowerclassifierfreeguidanceillustratesworsefidbutbetterhumanpreferenceofimagesamples.png

  184. 2023-pullen-buildt-knowledgedistillationofkshotdavinci003tofinetunedbabbagegpt3modeltosavemoneyandlatency.png

  185. 2017-amodei-openai-learningfromhumanpreferences-architecture2x-2x.png

  186. 2012-cakmak-figure5-algorithmicteachingvsrandomsampleselectionsampleefficiencygains.jpg

  187. https://ai.facebook.com/blog/harmful-content-can-evolve-quickly-our-new-ai-system-adapts-to-tackle-it

  188. https://blog.eleuther.ai/trlx-exploratory-analysis/

  189. https://carper.ai/instruct-gpt-announcement/

  190. 766cb8b990cab4b23efdc653265df90bc0acb688.html

  191. https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=12d941c445ec477501f78b15dcf84f98173121cf

  192. https://github.com/curiousjp/toy_sd_genetics?tab=readme-ov-file#toy_sd_genetics

  193. https://github.com/sanjeevanahilan/nanoChatGPT

  194. https://hal.science/hal-01972948/document#pdf

  195. 2504fb4f5dc4fdbf91b513e3cf62623e34f71cc0.pdf

  196. https://huggingface.co/blog/rlhf

  197. https://koenvangilst.nl/blog/keeping-code-complexity-in-check

  198. https://openai.com/research/summarizing-books

  199. https://samiramly.com/chess

  200. https://searchengineland.com/how-google-search-ranking-works-pandu-nayak-435395#h-navboost-system-a-k-a-glue

  201. https://www.frontiersin.org/articles/10.3389/frobt.2017.00071/full

  202. https://www.lesswrong.com/posts/3eqHYxfWb5x4Qfz8C/unrlhf-efficiently-undoing-llm-safeguards

  203. https://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post

  204. https://www.lesswrong.com/posts/cqGEQeLNbcptYsifz/this-week-in-fashion

  205. https://www.lesswrong.com/posts/jtoPawEhLNXNxvgTT/bing-chat-is-blatantly-aggressively-misaligned#AAC8jKeDp6xqsZK2K

  206. https://www.lesswrong.com/posts/qmQFHCgCyEEjuy5a7/lora-fine-tuning-efficiently-undoes-safety-training-from

  207. https://www.lesswrong.com/posts/vwu4kegAEZTBtpT6p/thoughts-on-the-impact-of-rlhf-research

  208. https://www.reddit.com/r/ChatGPTNSFW/comments/17wk2g3/a_failed_ai_girlfriend_product_and_my_lessons/k9hs22a/

  209. https://www.reddit.com/r/StableDiffusion/comments/1gdkpqp/the_gory_details_of_finetuning_sdxl_for_40m/

  210. https://www.youtube.com/watch?v=hhiLw5Q_UFg&t=1098s

  211. https://x.com/agishibaa/status/1770206746960601583

  212. https://x.com/corbtt/status/1814056457626862035

  213. https://x.com/davis_yoshida/status/1780733741457088759

  214. https://x.com/edleonklinger/status/1665802712875769860

  215. https://x.com/emmons_scott/status/1762886003046629586

  216. https://x.com/fluffykittnmeow/status/1729072654420680908

  217. https://x.com/garrynewman/status/1755851884047303012

  218. https://x.com/hwchase17/status/1600162023589163008

  219. https://x.com/labenz/status/1611750398712332292

  220. https://x.com/lefthanddraft/status/1851154437752188932

  221. https://x.com/lefthanddraft/status/1853482491124109725

  222. https://x.com/liminal_warmth/status/1852354598817693937#m
