See Also
Gwern
“Midjourneyv6 Personalized vs Default Samples”, Gwern 2024
Links
“Language Models Learn to Mislead Humans via RLHF”, Wen et al 2024
“Does Style Matter? Disentangling Style and Substance in Chatbot Arena”
“LLM Applications I Want To See”, Constantin 2024
“SEAL: Systematic Error Analysis for Value ALignment”, Revel et al 2024
“Does Refusal Training in LLMs Generalize to the Past Tense?”, Andriushchenko & Flammarion 2024
“Discovering Preference Optimization Algorithms With and for Large Language Models”, Lu et al 2024
“Beyond Model Collapse: Scaling Up With Synthesized Data Requires Reinforcement”, Feng et al 2024
“Safety Alignment Should Be Made More Than Just a Few Tokens Deep”, Qi et al 2024
“AlignEZ: Is Free Self-Alignment Possible?”, Adila et al 2024
“Aligning LLM Agents by Learning Latent Preference from User Edits”, Gao et al 2024
“Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data”, Tajwar et al 2024
“From r to Q✱: Your Language Model Is Secretly a Q-Function”, Rafailov et al 2024
“Dataset Reset Policy Optimization for RLHF”, Chang et al 2024
“ControlNet++: Improving Conditional Controls With Efficient Consistency Feedback”, Li et al 2024
“Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators”, Dubois et al 2024
“TextCraftor: Your Text Encoder Can Be Image Quality Controller”, Li et al 2024
“RewardBench: Evaluating Reward Models for Language Modeling”, Lambert et al 2024
“Evaluating Text to Image Synthesis: Survey and Taxonomy of Image Quality Metrics”, Hartwig et al 2024
“V-STaR: Training Verifiers for Self-Taught Reasoners”, Hosseini et al 2024
“I Think, Therefore I Am: Benchmarking Awareness of Large Language Models Using AwareBench”, Li et al 2024
“Can AI Assistants Know What They Don’t Know?”, Cheng et al 2024
“Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training”, Hubinger et al 2024
“Blending Is All You Need: Cheaper, Better Alternative to Trillion-Parameters LLM”, Lu et al 2024
“A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity”, Lee et al 2024
“Reasons to Reject? Aligning Language Models With Judgments”, Xu et al 2023
“Rich Human Feedback for Text-To-Image Generation”, Liang et al 2023
“The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning”, Lin et al 2023
“Universal Jailbreak Backdoors from Poisoned Human Feedback”, Rando & Tramèr 2023
“Diffusion Model Alignment Using Direct Preference Optimization”, Wallace et al 2023
“Summon a Demon and Bind It: A Grounded Theory of LLM Red Teaming in the Wild”, Inie et al 2023
“Specific versus General Principles for Constitutional AI”, Kundu et al 2023
“Eureka: Human-Level Reward Design via Coding Large Language Models”, Ma et al 2023
“A General Theoretical Paradigm to Understand Learning from Human Preferences”, Azar et al 2023
“Interpreting Learned Feedback Patterns in Large Language Models”, Marks et al 2023
“UltraFeedback: Boosting Language Models With High-Quality Feedback”, Cui et al 2023
“Motif: Intrinsic Motivation from Artificial Intelligence Feedback”, Klissarov et al 2023
“Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack”, Dai et al 2023
“STARC: A General Framework For Quantifying Differences Between Reward Functions”, Skalse et al 2023
“AceGPT, Localizing Large Language Models in Arabic”, Huang et al 2023
“RLAIF: Scaling Reinforcement Learning from Human Feedback With AI Feedback”, Lee et al 2023
“Activation Addition: Steering Language Models Without Optimization”, Turner et al 2023
“ReST: Reinforced Self-Training (ReST) for Language Modeling”, Gulcehre et al 2023
“FABRIC: Personalizing Diffusion Models With Iterative Feedback”, Rütte et al 2023
“LLaMA-2: Open Foundation and Fine-Tuned Chat Models”, Touvron et al 2023
“Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations”, Chen et al 2023
“Introducing Superalignment”, Leike & Sutskever 2023
“Are Aligned Neural Networks Adversarially Aligned?”, Carlini et al 2023
“AI Is a Lot of Work: As the Technology Becomes Ubiquitous, a Vast Tasker Underclass Is Emerging—And Not Going Anywhere”, Dzieza 2023
“Large Language Models Sometimes Generate Purely Negatively-Reinforced Text”, Roger 2023
“Microsoft and OpenAI Forge Awkward Partnership As Tech’s New Power Couple: As the Companies Lead the AI Boom, Their Unconventional Arrangement Sometimes Causes Conflict”, Dotan & Seetharaman 2023
“Direct Preference Optimization (DPO): Your Language Model Is Secretly a Reward Model”, Rafailov et al 2023
“Improving Language Models With Advantage-Based Offline Policy Gradients”, Baheti et al 2023
“LIMA: Less Is More for Alignment”, Zhou et al 2023
“A Radical Plan to Make AI Good, Not Evil”, Knight 2023
“SELF-ALIGN: Principle-Driven Self-Alignment of Language Models from Scratch With Minimal Human Supervision”, Sun et al 2023
“Pick-A-Pic: An Open Dataset of User Preferences for Text-To-Image Generation”, Kirstain et al 2023
“Fantastic Rewards and How to Tame Them: A Case Study on Reward Learning for Task-Oriented Dialogue Systems”, Feng et al 2023
“Use GPT-3 Incorrectly: Reduce Costs 40× and Increase Speed by 5×”, Pullen 2023
“OpenAI’s Sam Altman Talks ChatGPT And How Artificial General Intelligence Can ‘Break Capitalism’”, Konrad & Cai 2023
“Big Tech Was Moving Cautiously on AI. Then Came ChatGPT. Google, Facebook and Microsoft Helped Build the Scaffolding of AI. Smaller Companies Are Taking It to the Masses, Forcing Big Tech to React”, Tiku et al 2023
“The Inside Story of ChatGPT: How OpenAI Founder Sam Altman Built the World’s Hottest Technology With Billions from Microsoft”, Kahn 2023
“Self-Instruct: Aligning Language Models With Self-Generated Instructions”, Wang et al 2022
“HALIE: Evaluating Human-Language Model Interaction”, Lee et al 2022
“Constitutional AI: Harmlessness from AI Feedback”, Bai et al 2022
“Solving Math Word Problems With Process & Outcome-Based Feedback”, Uesato et al 2022
“Mysteries of Mode Collapse § Inescapable Wedding Parties”, Janus 2022
“When Life Gives You Lemons, Make Cherryade: Converting Feedback from Bad Responses into Good Labels”, Shi et al 2022
“Scaling Laws for Reward Model Overoptimization”, Gao et al 2022
“Teacher Forcing Recovers Reward Functions for Text Generation”, Hao et al 2022
“CARP: Robust Preference Learning for Storytelling via Contrastive Reinforcement Learning”, Castricato et al 2022
“Is Reinforcement Learning (Not) for Natural Language Processing: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization”, Ramamurthy et al 2022
“Sparrow: Improving Alignment of Dialogue Agents via Targeted Human Judgements”, Glaese et al 2022
“Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned”, Ganguli et al 2022
“Basis for Intentions (BASIS): Efficient Inverse Reinforcement Learning Using Past Experience”, Abdulhai et al 2022
“Improved Policy Optimization for Online Imitation Learning”, Lavington et al 2022
“Quark: Controllable Text Generation With Reinforced Unlearning”, Lu et al 2022
“Housekeep: Tidying Virtual Households Using Commonsense Reasoning”, Kant et al 2022
“Imitating, Fast and Slow: Robust Learning from Demonstrations via Decision-Time Planning”, Qi et al 2022
“Inferring Rewards from Language in Context”, Lin et al 2022
“SURF: Semi-Supervised Reward Learning With Data Augmentation for Feedback-Efficient Preference-Based Reinforcement Learning”, Park et al 2022
“InstructGPT: Training Language Models to Follow Instructions With Human Feedback”, Ouyang et al 2022
“Safe Deep RL in 3D Environments Using Human Feedback”, Rahtz et al 2022
“A Survey of Controllable Text Generation Using Transformer-Based Pre-Trained Language Models”, Zhang et al 2022
“WebGPT: Browser-Assisted Question-Answering With Human Feedback”, Nakano et al 2021
“WebGPT: Improving the Factual Accuracy of Language Models through Web Browsing”, Hilton et al 2021
“Modeling Strong and Human-Like Gameplay With KL-Regularized Search”, Jacob et al 2021
“A General Language Assistant As a Laboratory for Alignment”, Askell et al 2021
“Cut the CARP: Fishing for Zero-Shot Story Evaluation”, Shahbul et al 2021
“Recursively Summarizing Books With Human Feedback”, Wu et al 2021
“B-Pref: Benchmarking Preference-Based Reinforcement Learning”, Lee et al 2021
“Trajectory Transformer: Reinforcement Learning As One Big Sequence Modeling Problem”, Janner et al 2021
“Embracing New Techniques in Deep Learning for Estimating Image Memorability”, Needell & Bainbridge 2021
“A Survey of Preference-Based Reinforcement Learning Methods”, Wirth et al 2021
“Learning What To Do by Simulating the Past”, Lindner et al 2021
“Language Models Have a Moral Dimension”, Schramowski et al 2021
“Brain-Computer Interface for Generating Personally Attractive Images”, Spape et al 2021
“Process for Adapting Language Models to Society (PALMS) With Values-Targeted Datasets”, Solaiman & Dennison 2021
“Human-Centric Dialog Training via Offline Reinforcement Learning”, Jaques et al 2020
“Learning to Summarize from Human Feedback”, Stiennon et al 2020
“Learning Personalized Models of Human Behavior in Chess”, McIlroy-Young et al 2020
“Aligning Superhuman AI With Human Behavior: Chess As a Model System”, McIlroy-Young et al 2020
“Active Preference-Based Gaussian Process Regression for Reward Learning”, Bıyık et al 2020
“Bayesian REX: Safe Imitation Learning via Fast Bayesian Reward Inference from Preferences”, Brown et al 2020
“RL Agents Implicitly Learning Human Preferences”, Wichers 2020
“Reward-Rational (implicit) Choice: A Unifying Formalism for Reward Learning”, Jeon et al 2020
“What Does BERT Dream Of? A Visual Investigation of Nightmares in Sesame Street”, Bäuerle & Wexler 2020
“Deep Bayesian Reward Learning from Preferences”, Brown & Niekum 2019
“Learning Norms from Stories: A Prior for Value Aligned Agents”, Frazier et al 2019
“Reinforcement Learning Upside Down: Don’t Predict Rewards—Just Map Them to Actions”, Schmidhuber 2019
“Learning Human Objectives by Evaluating Hypothetical Behavior”, Reddy et al 2019
“Preference-Based Learning for Exoskeleton Gait Optimization”, Tucker et al 2019
“Do Massively Pretrained Language Models Make Better Storytellers?”, See et al 2019
“Fine-Tuning GPT-2 from Human Preferences § Bugs Can Optimize for Bad Behavior”, Ziegler et al 2019
“Fine-Tuning GPT-2 from Human Preferences”, Ziegler et al 2019
“Fine-Tuning Language Models from Human Preferences”, Ziegler et al 2019
“lm-human-preferences”, Ziegler et al 2019
“Better Rewards Yield Better Summaries: Learning to Summarise Without References”, Böhm et al 2019
“Dueling Posterior Sampling for Preference-Based Reinforcement Learning”, Novoseller et al 2019
“Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog”, Jaques et al 2019
“Reward Learning from Human Preferences and Demonstrations in Atari”, Ibarz et al 2018
“StreetNet: Preference Learning With Convolutional Neural Network on Urban Crime Perception”, Fu et al 2018
“Toward Diverse Text Generation With Inverse Reinforcement Learning”, Shi et al 2018
“Ordered Preference Elicitation Strategies for Supporting Multi-Objective Decision Making”, Zintgraf et al 2018
“Convergence of Value Aggregation for Imitation Learning”, Cheng & Boots 2018
“A Low-Cost Ethics Shaping Approach for Designing Reinforcement Learning Agents”, Wu & Lin 2017
“Deep TAMER: Interactive Agent Shaping in High-Dimensional State Spaces”, Warnell et al 2017
“DropoutDAgger: A Bayesian Approach to Safe Imitation Learning”, Menda et al 2017
“NIMA: Neural Image Assessment”, Talebi & Milanfar 2017
“Towards Personalized Human AI Interaction—Adapting the Behavior of AI Agents Using Neural Signatures of Subjective Interest”, Shih et al 2017
“A Deep Architecture for Unified Esthetic Prediction”, Murray & Gordo 2017
“Learning Human Behaviors from Motion Capture by Adversarial Imitation”, Merel et al 2017
“Learning from Human Preferences”, Amodei et al 2017
“Deep Reinforcement Learning from Human Preferences”, Christiano et al 2017
“Learning through Human Feedback [Blog]”, Leike et al 2017
“Adversarial Ranking for Language Generation”, Lin et al 2017
“An Invitation to Imitation”, Bagnell 2015
“Just Sort It! A Simple and Effective Approach to Active Preference Learning”, Maystre & Grossglauser 2015
“Algorithmic and Human Teaching of Sequential Decision Tasks”, Cakmak & Lopes 2012
“Bayesian Active Learning for Classification and Preference Learning”, Houlsby et al 2011
“DAgger: A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning”, Ross et al 2010
“John Schulman’s Homepage”, Schulman 2024
“Transformers As Variational Autoencoders”
“The Taming of the AI”
“Copilot Stops Working on `gender` Related Subjects · Community · Discussion #72603”
“Transformer-VAE for Program Synthesis”
“Claude’s Character”, Anthropic 2024
“Interpreting Preference Models W/Sparse Autoencoders”
“Learning and Manipulating Learning”
“Model Mis-Specification and Inverse Reinforcement Learning”
“Full Toy Model for Preference Learning”
Wikipedia
Miscellaneous
- https://blog.eleuther.ai/trlx-exploratory-analysis/
- https://github.com/curiousjp/toy_sd_genetics?tab=readme-ov-file#toy_sd_genetics
- https://koenvangilst.nl/blog/keeping-code-complexity-in-check
- https://www.frontiersin.org/articles/10.3389/frobt.2017.00071/full
- https://www.lesswrong.com/posts/3eqHYxfWb5x4Qfz8C/unrlhf-efficiently-undoing-llm-safeguards
- https://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post
- https://www.lesswrong.com/posts/cqGEQeLNbcptYsifz/this-week-in-fashion
- https://www.lesswrong.com/posts/vwu4kegAEZTBtpT6p/thoughts-on-the-impact-of-rlhf-research
Bibliography
- https://arxiv.org/abs/2407.11969: “Does Refusal Training in LLMs Generalize to the Past Tense?”
- https://arxiv.org/abs/2404.12358: “From r to Q✱: Your Language Model Is Secretly a Q-Function”
- https://arxiv.org/abs/2404.08495: “Dataset Reset Policy Optimization for RLHF”
- https://arxiv.org/abs/2401.05566#anthropic: “Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training”
- https://arxiv.org/abs/2310.01377: “UltraFeedback: Boosting Language Models With High-Quality Feedback”
- https://arxiv.org/abs/2309.15807#facebook: “Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack”
- https://arxiv.org/abs/2309.12053: “AceGPT, Localizing Large Language Models in Arabic”
- https://arxiv.org/abs/2308.10248: “Activation Addition: Steering Language Models Without Optimization”
- https://openai.com/index/introducing-superalignment/: “Introducing Superalignment”
- https://www.theverge.com/features/23764584/ai-artificial-intelligence-data-notation-labor-scale-surge-remotasks-openai-chatbots: “AI Is a Lot of Work: As the Technology Becomes Ubiquitous, a Vast Tasker Underclass Is Emerging—And Not Going Anywhere”
- https://arxiv.org/abs/2306.07567: “Large Language Models Sometimes Generate Purely Negatively-Reinforced Text”
- https://www.wsj.com/articles/microsoft-and-openai-forge-awkward-partnership-as-techs-new-power-couple-3092de51: “Microsoft and OpenAI Forge Awkward Partnership As Tech’s New Power Couple: As the Companies Lead the AI Boom, Their Unconventional Arrangement Sometimes Causes Conflict”
- https://www.wired.com/story/anthropic-ai-chatbots-ethics/: “A Radical Plan to Make AI Good, Not Evil”
- https://arxiv.org/abs/2305.03047#ibm: “SELF-ALIGN: Principle-Driven Self-Alignment of Language Models from Scratch With Minimal Human Supervision”
- https://arxiv.org/abs/2305.01569: “Pick-A-Pic: An Open Dataset of User Preferences for Text-To-Image Generation”
- https://www.forbes.com/sites/alexkonrad/2023/02/03/exclusive-openai-sam-altman-chatgpt-agi-google-search/: “OpenAI’s Sam Altman Talks ChatGPT And How Artificial General Intelligence Can ‘Break Capitalism’”
- https://arxiv.org/abs/2212.10560: “Self-Instruct: Aligning Language Models With Self-Generated Instructions”
- https://arxiv.org/abs/2210.10760#openai: “Scaling Laws for Reward Model Overoptimization”
- https://arxiv.org/abs/2210.07792#eleutherai: “CARP: Robust Preference Learning for Storytelling via Contrastive Reinforcement Learning”
- https://arxiv.org/abs/2210.01241: “Is Reinforcement Learning (Not) for Natural Language Processing: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization”
- https://www.anthropic.com/red_teaming.pdf: “Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned”
- https://arxiv.org/abs/2112.09332#openai: “WebGPT: Browser-Assisted Question-Answering With Human Feedback”
- https://openai.com/research/webgpt: “WebGPT: Improving the Factual Accuracy of Language Models through Web Browsing”
- https://arxiv.org/abs/2112.00861#anthropic: “A General Language Assistant As a Laboratory for Alignment”
- https://arxiv.org/abs/2109.10862#openai: “Recursively Summarizing Books With Human Feedback”
- https://trajectory-transformer.github.io/: “Trajectory Transformer: Reinforcement Learning As One Big Sequence Modeling Problem”
- https://openai.com/research/fine-tuning-gpt-2: “Fine-Tuning GPT-2 from Human Preferences”
- 2018-fu.pdf: “StreetNet: Preference Learning With Convolutional Neural Network on Urban Crime Perception”