Revisiting Your Memory: Reconstruction of Affect-Contextualized Memory via EEG-guided Audiovisual Generation
AI-generated poetry is indistinguishable from human-written poetry and is rated more favorably
Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse RL
Thinking LLMs: General Instruction Following with Thought Generation
Does Style Matter? Disentangling Style and Substance in Chatbot Arena
Does Refusal Training in LLMs Generalize to the Past Tense?
Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization
Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback
Discovering Preference Optimization Algorithms with and for Large Language Models
Beyond Model Collapse: Scaling Up with Synthesized Data Requires Reinforcement
Safety Alignment Should Be Made More Than Just a Few Tokens Deep
Aligning LLM Agents by Learning Latent Preference from User Edits
Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data
From r to Q✱: Your Language Model is Secretly a Q-Function
ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback
Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators
TextCraftor: Your Text Encoder Can be Image Quality Controller
RewardBench: Evaluating Reward Models for Language Modeling
Evaluating Text to Image Synthesis: Survey and Taxonomy of Image Quality Metrics
When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback
I Think, Therefore I am: Benchmarking Awareness of Large Language Models Using AwareBench
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Blending Is All You Need: Cheaper, Better Alternative to Trillion-Parameters LLM
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
Reasons to Reject? Aligning Language Models with Judgments
The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning
Universal Jailbreak Backdoors from Poisoned Human Feedback
Diffusion Model Alignment Using Direct Preference Optimization
Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild
Eureka: Human-Level Reward Design via Coding Large Language Models
A General Theoretical Paradigm to Understand Learning from Human Preferences
Interpreting Learned Feedback Patterns in Large Language Models
UltraFeedback: Boosting Language Models with High-quality Feedback
Motif: Intrinsic Motivation from Artificial Intelligence Feedback
Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack
STARC: A General Framework For Quantifying Differences Between Reward Functions
RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
Activation Addition: Steering Language Models Without Optimization
Reinforced Self-Training (ReST) for Language Modeling
FABRIC: Personalizing Diffusion Models with Iterative Feedback
Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations
AI Is a Lot of Work: As the technology becomes ubiquitous, a vast tasker underclass is emerging—and not going anywhere
Large Language Models Sometimes Generate Purely Negatively-Reinforced Text
Microsoft and OpenAI Forge Awkward Partnership as Tech’s New Power Couple: As the companies lead the AI boom, their unconventional arrangement sometimes causes conflict
Direct Preference Optimization (DPO): Your Language Model is Secretly a Reward Model
Improving Language Models with Advantage-based Offline Policy Gradients
SELF-ALIGN: Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision
Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation
Fantastic Rewards and How to Tame Them: A Case Study on Reward Learning for Task-oriented Dialogue Systems
Use GPT-3 incorrectly: reduce costs 40× and increase speed by 5×
OpenAI’s Sam Altman Talks ChatGPT And How Artificial General Intelligence Can ‘Break Capitalism’
Big Tech was moving cautiously on AI. Then came ChatGPT. Google, Facebook and Microsoft helped build the scaffolding of AI. Smaller companies are taking it to the masses, forcing Big Tech to react
The inside story of ChatGPT: How OpenAI founder Sam Altman built the world’s hottest technology with billions from Microsoft
Self-Instruct: Aligning Language Models with Self-Generated Instructions
Solving math word problems with process- and outcome-based feedback
When Life Gives You Lemons, Make Cherryade: Converting Feedback from Bad Responses into Good Labels
Teacher Forcing Recovers Reward Functions for Text Generation
CARP: Robust Preference Learning for Storytelling via Contrastive Reinforcement Learning
Is Reinforcement Learning (Not) for Natural Language Processing: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization
Sparrow: Improving alignment of dialogue agents via targeted human judgements
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Basis for Intentions (BASIS): Efficient Inverse Reinforcement Learning using Past Experience
Improved Policy Optimization for Online Imitation Learning
Quark: Controllable Text Generation with Reinforced Unlearning
Housekeep: Tidying Virtual Households using Commonsense Reasoning
Imitating, Fast and Slow: Robust learning from demonstrations via decision-time planning
SURF: Semi-supervised Reward Learning with Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning
InstructGPT: Training language models to follow instructions with human feedback
A Survey of Controllable Text Generation using Transformer-based Pre-trained Language Models
WebGPT: Browser-assisted question-answering with human feedback
WebGPT: Improving the factual accuracy of language models through web browsing
Modeling Strong and Human-Like Gameplay with KL-Regularized Search
A General Language Assistant as a Laboratory for Alignment
B-Pref: Benchmarking Preference-Based Reinforcement Learning
Trajectory Transformer: Reinforcement Learning as One Big Sequence Modeling Problem
Embracing New Techniques in Deep Learning for Estimating Image Memorability
A Survey of Preference-Based Reinforcement Learning Methods
Brain-computer interface for generating personally attractive images
Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets
Human-centric Dialog Training via Offline Reinforcement Learning
Aligning Superhuman AI with Human Behavior: Chess as a Model System
Active Preference-Based Gaussian Process Regression for Reward Learning
Bayesian REX: Safe Imitation Learning via Fast Bayesian Reward Inference from Preferences
Reward-rational (implicit) choice: A unifying formalism for reward learning
What does BERT dream of? A visual investigation of nightmares in Sesame Street
Learning Norms from Stories: A Prior for Value Aligned Agents
Reinforcement Learning Upside Down: Don’t Predict Rewards—Just Map Them to Actions
Learning Human Objectives by Evaluating Hypothetical Behavior
Preference-Based Learning for Exoskeleton Gait Optimization
Do Massively Pretrained Language Models Make Better Storytellers?
Fine-Tuning GPT-2 from Human Preferences § Bugs can optimize for bad behavior
Better Rewards Yield Better Summaries: Learning to Summarise Without References
Dueling Posterior Sampling for Preference-Based Reinforcement Learning
Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog
Reward learning from human preferences and demonstrations in Atari
StreetNet: Preference Learning with Convolutional Neural Network on Urban Crime Perception
Toward Diverse Text Generation with Inverse Reinforcement Learning
Ordered Preference Elicitation Strategies for Supporting Multi-Objective Decision Making
A Low-Cost Ethics Shaping Approach for Designing Reinforcement Learning Agents
Deep TAMER: Interactive Agent Shaping in High-Dimensional State Spaces
DropoutDAgger: A Bayesian Approach to Safe Imitation Learning
Towards personalized human-AI interaction—adapting the behavior of AI agents using neural signatures of subjective interest
Learning human behaviors from motion capture by adversarial imitation
Just Sort It! A Simple and Effective Approach to Active Preference Learning
Algorithmic and Human Teaching of Sequential Decision Tasks
Bayesian Active Learning for Classification and Preference Learning
DAgger: A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning
An Analysis of AI Political Preferences from a European Perspective
Copilot Stops Working on `gender`-Related Subjects (GitHub Discussion #72603)
When Your AIs Deceive You: Challenges With Partial Observability in RLHF
Model Mis-Specification and Inverse Reinforcement Learning
Kirstain 2023, Figure 6: inverse correlation between MS-COCO FID quality and human expert ranking of image quality
Kirstain 2023, Figure 7: comparison of higher vs. lower classifier-free guidance illustrates worse FID but better human preference of image samples
Pullen 2023 (Buildt): knowledge distillation of k-shot davinci-003 to a fine-tuned babbage GPT-3 model to save money and latency
Amodei 2017 (OpenAI), Learning from Human Preferences: architecture diagram
Cakmak 2012, Figure 5: algorithmic teaching vs. random sample selection sample-efficiency gains
https://ai.facebook.com/blog/harmful-content-can-evolve-quickly-our-new-ai-system-adapts-to-tackle-it
https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=12d941c445ec477501f78b15dcf84f98173121cf
https://github.com/curiousjp/toy_sd_genetics?tab=readme-ov-file#toy_sd_genetics
https://koenvangilst.nl/blog/keeping-code-complexity-in-check
https://searchengineland.com/how-google-search-ranking-works-pandu-nayak-435395#h-navboost-system-a-k-a-glue
https://www.frontiersin.org/articles/10.3389/frobt.2017.00071/full
https://www.lesswrong.com/posts/3eqHYxfWb5x4Qfz8C/unrlhf-efficiently-undoing-llm-safeguards
https://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post
https://www.lesswrong.com/posts/cqGEQeLNbcptYsifz/this-week-in-fashion
https://www.lesswrong.com/posts/jtoPawEhLNXNxvgTT/bing-chat-is-blatantly-aggressively-misaligned#AAC8jKeDp6xqsZK2K
https://www.lesswrong.com/posts/qmQFHCgCyEEjuy5a7/lora-fine-tuning-efficiently-undoes-safety-training-from
https://www.lesswrong.com/posts/vwu4kegAEZTBtpT6p/thoughts-on-the-impact-of-rlhf-research
https://www.reddit.com/r/ChatGPTNSFW/comments/17wk2g3/a_failed_ai_girlfriend_product_and_my_lessons/k9hs22a/
https://www.reddit.com/r/StableDiffusion/comments/1gdkpqp/the_gory_details_of_finetuning_sdxl_for_40m/