- See Also
- Gwern
- “The Neural Net Tank Urban Legend”, Gwern 2011
- “It Looks Like You’re Trying To Take Over The World”, Gwern 2022
- “Surprisingly Turing-Complete”, Gwern 2012
- “The Scaling Hypothesis”, Gwern 2020
- “Evolution As Backstop for Reinforcement Learning”, Gwern 2018
- “Complexity No Bar to AI”, Gwern 2014
- “Why Tool AIs Want to Be Agent AIs”, Gwern 2016
- “AI Risk Demos”, Gwern 2016
- Links
- “Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training”, Hubinger et al 2024
- “Thousands of AI Authors on the Future of AI”, Grace et al 2024
- “Exploiting Novel GPT-4 APIs”, Pelrine et al 2023
- “Comparison of Waymo Rider-Only Crash Data to Human Benchmarks at 7.1 Million Miles”, Kusano et al 2023
- “Challenges With Unsupervised LLM Knowledge Discovery”, Farquhar et al 2023
- “Politics and the Future”, Horowitz 2023
- “Helping or Herding? Reward Model Ensembles Mitigate but Do Not Eliminate Reward Hacking”, Eisenstein et al 2023
- “The Inside Story of Microsoft’s Partnership With OpenAI: The Companies Had Honed a Protocol for Releasing Artificial Intelligence Ambitiously but Safely. Then OpenAI’s Board Exploded All Their Carefully Laid Plans”, Duhigg 2023
- “How Jensen Huang’s Nvidia Is Powering the AI Revolution: The Company’s CEO Bet It All on a New Kind of Chip. Now That Nvidia Is One of the Biggest Companies in the World, What Will He Do Next?”, Witt 2023
- “Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching”, Campbell et al 2023
- “Did I Get Sam Altman Fired from OpenAI?: Nathan’s Redteaming Experience, Noticing How the Board Was Not Aware of GPT-4 Jailbreaks & Had Not Even Tried GPT-4 prior to Its Early Release”, Labenz 2023
- “Did I Get Sam Altman Fired from OpenAI? § GPT-4-base”, Labenz 2023
- “Inside the Chaos at OpenAI: Sam Altman’s Weekend of Shock and Drama Began a Year Ago, With the Release of ChatGPT”, Hao & Warzel 2023
- “OpenAI Announces Leadership Transition”, Sutskever et al 2023
- “In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering”, Liu et al 2023
- “Removing RLHF Protections in GPT-4 via Fine-Tuning”, Zhan et al 2023
- “Large Language Models Can Strategically Deceive Their Users When Put Under Pressure”, Scheurer et al 2023
- “Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation”, Shah et al 2023
- “Augmenting Large Language Models With Chemistry Tools”, Bran et al 2023
- “Will Releasing the Weights of Large Language Models Grant Widespread Access to Pandemic Agents?”, Gopal et al 2023
- “Specific versus General Principles for Constitutional AI”, Kundu et al 2023
- “Goodhart’s Law in Reinforcement Learning”, Karwowski et al 2023
- “Let Models Speak Ciphers: Multiagent Debate through Embeddings”, Pham et al 2023
- “Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!”, Qi et al 2023
- “Representation Engineering: A Top-Down Approach to AI Transparency”, Zou et al 2023
- “Responsibility & Safety: Our Approach”, DeepMind 2023
- “STARC: A General Framework For Quantifying Differences Between Reward Functions”, Skalse et al 2023
- “How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions”, Pacchiardi et al 2023
- “What If the Robots Were Very Nice While They Took Over the World?”, Heffernan 2023
- “Does Sam Altman Know What He’s Creating? The OpenAI CEO’s Ambitious, Ingenious, Terrifying Quest to Create a New Form of Intelligence”, Andersen 2023
- “Introducing Superalignment”, Leike & Sutskever 2023
- “Gödel, Escher, Bach Author Douglas Hofstadter on the State of AI Today § What about AI Terrifies You?”, Hofstadter & Kim 2023
- “Microsoft and OpenAI Forge Awkward Partnership As Tech’s New Power Couple: As the Companies Lead the AI Boom, Their Unconventional Arrangement Sometimes Causes Conflict”, Dotan & Seetharaman 2023
- “Can Large Language Models Democratize Access to Dual-use Biotechnology?”, Soice et al 2023
- “Thought Cloning: Learning to Think While Acting by Imitating Human Thinking”, Hu & Clune 2023
- “The Challenge of Advanced Cyberwar and the Place of Cyberpeace”, Carayannis & Draper 2023
- “Incentivizing Honest Performative Predictions With Proper Scoring Rules”, Oesterheld et al 2023
- “Large Language Models Can Be Used To Effectively Scale Spear Phishing Campaigns”, Hazell 2023
- “Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting”, Turpin et al 2023
- “Mitigating Lies in Vision-Language Models”, Li et al 2023
- “A Radical Plan to Make AI Good, Not Evil”, Knight 2023
- “Even The Politicians Thought the Open Letter Made No Sense In The Senate Hearing on AI: Today’s Hearing on AI Covered AI Regulation and Challenges, and the Infamous Open Letter, Which Nearly Everyone in the Room Thought Was Unwise”, Gorrell 2023
- “In AI Race, Microsoft and Google Choose Speed Over Caution: Technology Companies Were Once Leery of What Some Artificial Intelligence Could Do. Now the Priority Is Winning Control of the Industry’s next Big Thing”, Grant & Weise 2023
- “8 Things to Know about Large Language Models”, Bowman 2023
- “Sam Altman on What Makes Him ‘Super Nervous’ About AI: The OpenAI Co-founder Thinks Tools like GPT-4 Will Be Revolutionary. But He’s Wary of Downsides”, Swisher 2023
- “The OpenAI CEO Disagrees With the Forecast That AI Will Kill Us All: An Artificial Intelligence Twitter Beef, Explained”, Huet 2023
- “As AI Booms, Lawmakers Struggle to Understand the Technology: Tech Innovations Are Again Racing ahead of Washington’s Ability to Regulate Them, Lawmakers and AI Experts Said”, Kang & Satariano 2023
- “Pretraining Language Models With Human Preferences”, Korbak et al 2023
- “Conditioning Predictive Models: Risks and Strategies”, Hubinger et al 2023
- “Tracr: Compiled Transformers As a Laboratory for Interpretability”, Lindner et al 2023
- “Discovering Language Model Behaviors With Model-Written Evaluations”, Perez et al 2022
- “Discovering Latent Knowledge in Language Models Without Supervision”, Burns et al 2022
- “Embedding Synthetic Off-Policy Experience for Autonomous Driving via Zero-Shot Curricula”, Bronstein et al 2022
- “Interpreting Neural Networks through the Polytope Lens”, Black et al 2022
- “Mysteries of Mode Collapse § Inescapable Wedding Parties”, Janus 2022
- “Measuring Progress on Scalable Oversight for Large Language Models”, Bowman et al 2022
- “Increments Podcast: #45—4 Central Fallacies of AI Research (with Melanie Mitchell)”, Mitchell & Chugg 2022
- “Broken Neural Scaling Laws”, Caballero et al 2022
- “Scaling Laws for Reward Model Overoptimization”, Gao et al 2022
- “Defining and Characterizing Reward Hacking”, Skalse et al 2022
- “The Alignment Problem from a Deep Learning Perspective”, Ngo 2022
- “Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned”, Ganguli et al 2022
- “Modeling Transformative AI Risks (MTAIR) Project—Summary Report”, Clarke et al 2022
- “Researching Alignment Research: Unsupervised Analysis”, Kirchner et al 2022
- “Ethan Caballero on Private Scaling Progress”, Caballero & Trazzi 2022
- “DeepMind: The Podcast—Excerpts on AGI”, Kiely 2022
- “Do As I Can, Not As I Say (SayCan): Grounding Language in Robotic Affordances”, Ahn et al 2022
- “Predictability and Surprise in Large Generative Models”, Ganguli et al 2022
- “Uncalibrated Models Can Improve Human-AI Collaboration”, Vodrahalli et al 2022
- “DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers”, Cho et al 2022
- “Safe Deep RL in 3D Environments Using Human Feedback”, Rahtz et al 2022
- “LaMDA: Language Models for Dialog Applications”, Thoppilan et al 2022
- “The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models”, Pan et al 2022
- “Scaling Language Models: Methods, Analysis & Insights from Training Gopher”, Rae et al 2021
- “A General Language Assistant As a Laboratory for Alignment”, Askell et al 2021
- “What Would Jiminy Cricket Do? Towards Agents That Behave Morally”, Hendrycks et al 2021
- “Can Machines Learn Morality? The Delphi Experiment”, Jiang et al 2021
- “SafetyNet: Safe Planning for Real-world Self-driving Vehicles Using Machine-learned Policies”, Vitelli et al 2021
- “Unsolved Problems in ML Safety”, Hendrycks et al 2021
- “An Empirical Cybersecurity Evaluation of GitHub Copilot’s Code Contributions”, Pearce et al 2021
- “On the Opportunities and Risks of Foundation Models”, Bommasani et al 2021
- “Evaluating Large Language Models Trained on Code”, Chen et al 2021
- “Randomness In Neural Network Training: Characterizing The Impact of Tooling”, Zhuang et al 2021
- “Goal Misgeneralization in Deep Reinforcement Learning”, Koch et al 2021
- “Anthropic Raises $124 Million to Build More Reliable, General AI Systems”, Anthropic 2021
- “Artificial Intelligence in China’s Revolution in Military Affairs”, Kania 2021
- “Reward Is Enough”, Silver et al 2021
- “Intelligence and Unambitiousness Using Algorithmic Information Theory”, Cohen et al 2021
- “AI Dungeon Public Disclosure Vulnerability Report—GraphQL Unpublished Adventure Data Leak”, AetherDevSecOps 2021
- “Universal Off-Policy Evaluation”, Chandak et al 2021
- “Multitasking Inhibits Semantic Drift”, Jacob et al 2021
- “Waymo Simulated Driving Behavior in Reconstructed Fatal Crashes within an Autonomous Vehicle Operating Domain”, Scanlon et al 2021
- “Language Models Have a Moral Dimension”, Schramowski et al 2021
- “Replaying Real Life: How the Waymo Driver Avoids Fatal Human Crashes”, Waymo 2021
- “Agent Incentives: A Causal Perspective”, Everitt et al 2021
- “Organizational Update from OpenAI”, OpenAI 2020
- “Emergent Road Rules In Multi-Agent Driving Environments”, Pal et al 2020
- “Recipes for Safety in Open-domain Chatbots”, Xu et al 2020
- “Hidden Incentives for Auto-Induced Distributional Shift”, Krueger et al 2020
- “The Radicalization Risks of GPT-3 and Advanced Neural Language Models”, McGuffie & Newhouse 2020
- “Matt Botvinick on the Spontaneous Emergence of Learning Algorithms”, Scholl 2020
- “Aligning AI With Shared Human Values”, Hendrycks et al 2020
- “Pitfalls of Learning a Reward Function Online”, Armstrong et al 2020
- “Reward-rational (implicit) Choice: A Unifying Formalism for Reward Learning”, Jeon et al 2020
- “The Incentives That Shape Behavior”, Carey et al 2020
- “2019 AI Alignment Literature Review and Charity Comparison”, Larks 2019
- “Learning Norms from Stories: A Prior for Value Aligned Agents”, Frazier et al 2019
- “Optimal Policies Tend to Seek Power”, Turner et al 2019
- “Taxonomy of Real Faults in Deep Learning Systems”, Humbatova et al 2019
- “Release Strategies and the Social Impacts of Language Models”, Solaiman et al 2019
- “The Bouncer Problem: Challenges to Remote Explainability”, Merrer & Tredan 2019
- “Scaling Data-driven Robotics With Reward Sketching and Batch Reinforcement Learning”, Cabi et al 2019
- “Fine-Tuning GPT-2 from Human Preferences § Bugs Can Optimize for Bad Behavior”, Ziegler et al 2019
- “Designing Agent Incentives to Avoid Reward Tampering”, Everitt et al 2019
- “Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective”, Everitt et al 2019
- “Characterizing Attacks on Deep Reinforcement Learning”, Pan et al 2019
- “Categorizing Wireheading in Partially Embedded Agents”, Majha et al 2019
- “Risks from Learned Optimization in Advanced Machine Learning Systems”, Hubinger et al 2019
- “GROVER: Defending Against Neural Fake News”, Zellers et al 2019
- “AI-GAs: AI-generating Algorithms, an Alternate Paradigm for Producing General Artificial Intelligence”, Clune 2019
- “Challenges of Real-World Reinforcement Learning”, Dulac-Arnold et al 2019
- “DeepMind and Google: the Battle to Control Artificial Intelligence. Demis Hassabis Founded a Company to Build the World’s Most Powerful AI. Then Google Bought Him Out. Hal Hodson Asks Who Is in Charge”, Hodson 2019
- “Forecasting Transformative AI: An Expert Survey”, Gruetzemacher et al 2019
- “Artificial Intelligence: A Guide for Thinking Humans § Prologue: Terrified”, Mitchell 2019
- “Rigorous Agent Evaluation: An Adversarial Approach to Uncover Catastrophic Failures”, Uesato et al 2018
- “There Is Plenty of Time at the Bottom: the Economics, Risk and Ethics of Time Compression”, Sandberg 2018
- “Better Safe Than Sorry: Evidence Accumulation Allows for Safe Reinforcement Learning”, Agarwal et al 2018
- “The Alignment Problem for Bayesian History-Based Reinforcement Learners”, Everitt & Hutter 2018
- “Adaptive Mechanism Design: Learning to Promote Cooperation”, Baumann et al 2018
- “Visceral Machines: Risk-Aversion in Reinforcement Learning With Intrinsic Physiological Rewards”, McDuff & Kapoor 2018
- “Incomplete Contracting and AI Alignment”, Hadfield-Menell & Hadfield 2018
- “Programmatically Interpretable Reinforcement Learning”, Verma et al 2018
- “Categorizing Variants of Goodhart’s Law”, Manheim & Garrabrant 2018
- “The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities”, Lehman et al 2018
- “Machine Theory of Mind”, Rabinowitz et al 2018
- “Safe Exploration in Continuous Action Spaces”, Dalal et al 2018
- “CycleGAN, a Master of Steganography”, Chu et al 2017
- “AI Safety Gridworlds”, Leike et al 2017
- “There’s No Fire Alarm for Artificial General Intelligence”, Yudkowsky 2017
- “Safe Reinforcement Learning via Shielding”, Alshiekh et al 2017
- “CAN: Creative Adversarial Networks, Generating ‘Art’ by Learning About Styles and Deviating from Style Norms”, Elgammal et al 2017
- “DeepXplore: Automated Whitebox Testing of Deep Learning Systems”, Pei et al 2017
- “On the Impossibility of Supersized Machines”, Garfinkel et al 2017
- “Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks”, Katz et al 2017
- “The Off-Switch Game”, Hadfield-Menell et al 2016
- “Combating Reinforcement Learning’s Sisyphean Curse With Intrinsic Fear”, Lipton et al 2016
- “Concrete Problems in AI Safety”, Amodei et al 2016
- “Intelligence Explosion Microeconomics”, Yudkowsky 2013
- “Advantages of Artificial Intelligences, Uploads, and Digital Minds”, Sotala 2012
- “Ontological Crises in Artificial Agents’ Value Systems”, Blanc 2011
- “The Normalization of Deviance in Healthcare Delivery”, Banja 2010
- “Halloween Nightmare Scenario, Early 2020’s”, Wood 2009
- “The Basic AI Drives”, Omohundro 2008
- “Starfish § Bulrushes”, Watts 1999
- “Superhumanism: According to Hans Moravec § On the Inevitability & Desirability of Human Extinction”, Platt 1995
- “Some Moral and Technical Consequences of Automation: As Machines Learn They May Develop Unforeseen Strategies at Rates That Baffle Their Programmers”, Wiener 1960
- “Intelligent Machinery, A Heretical Theory”, Turing 1951
- “Homepage of Paul F. Christiano”, Christiano 2024
See Also
Gwern
“The Neural Net Tank Urban Legend”, Gwern 2011
“It Looks Like You’re Trying To Take Over The World”, Gwern 2022
“Surprisingly Turing-Complete”, Gwern 2012
“The Scaling Hypothesis”, Gwern 2020
“Evolution As Backstop for Reinforcement Learning”, Gwern 2018
“Complexity No Bar to AI”, Gwern 2014
“Why Tool AIs Want to Be Agent AIs”, Gwern 2016
“AI Risk Demos”, Gwern 2016
Links
“Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training”, Hubinger et al 2024
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
“Thousands of AI Authors on the Future of AI”, Grace et al 2024
“Exploiting Novel GPT-4 APIs”, Pelrine et al 2023
“Comparison of Waymo Rider-Only Crash Data to Human Benchmarks at 7.1 Million Miles”, Kusano et al 2023
Comparison of Waymo Rider-Only Crash Data to Human Benchmarks at 7.1 Million Miles
“Challenges With Unsupervised LLM Knowledge Discovery”, Farquhar et al 2023
“Politics and the Future”, Horowitz 2023
“Helping or Herding? Reward Model Ensembles Mitigate but Do Not Eliminate Reward Hacking”, Eisenstein et al 2023
Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking
“The Inside Story of Microsoft’s Partnership With OpenAI: The Companies Had Honed a Protocol for Releasing Artificial Intelligence Ambitiously but Safely. Then OpenAI’s Board Exploded All Their Carefully Laid Plans”, Duhigg 2023
“How Jensen Huang’s Nvidia Is Powering the AI Revolution: The Company’s CEO Bet It All on a New Kind of Chip. Now That Nvidia Is One of the Biggest Companies in the World, What Will He Do Next?”, Witt 2023
“Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching”, Campbell et al 2023
“Did I Get Sam Altman Fired from OpenAI?: Nathan’s Redteaming Experience, Noticing How the Board Was Not Aware of GPT-4 Jailbreaks & Had Not Even Tried GPT-4 prior to Its Early Release”, Labenz 2023
“Did I Get Sam Altman Fired from OpenAI? § GPT-4-base”, Labenz 2023
“Inside the Chaos at OpenAI: Sam Altman’s Weekend of Shock and Drama Began a Year Ago, With the Release of ChatGPT”, Hao & Warzel 2023
“OpenAI Announces Leadership Transition”, Sutskever et al 2023
“In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering”, Liu et al 2023
“Removing RLHF Protections in GPT-4 via Fine-Tuning”, Zhan et al 2023
“Large Language Models Can Strategically Deceive Their Users When Put Under Pressure”, Scheurer et al 2023
Large Language Models can Strategically Deceive their Users when Put Under Pressure
“Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation”, Shah et al 2023
Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation
“Augmenting Large Language Models With Chemistry Tools”, Bran et al 2023
“Will Releasing the Weights of Large Language Models Grant Widespread Access to Pandemic Agents?”, Gopal et al 2023
Will releasing the weights of large language models grant widespread access to pandemic agents?
“Specific versus General Principles for Constitutional AI”, Kundu et al 2023
“Goodhart’s Law in Reinforcement Learning”, Karwowski et al 2023
“Let Models Speak Ciphers: Multiagent Debate through Embeddings”, Pham et al 2023
Let Models Speak Ciphers: Multiagent Debate through Embeddings
“Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!”, Qi et al 2023
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
“Representation Engineering: A Top-Down Approach to AI Transparency”, Zou et al 2023
Representation Engineering: A Top-Down Approach to AI Transparency
“Responsibility & Safety: Our Approach”, DeepMind 2023
“STARC: A General Framework For Quantifying Differences Between Reward Functions”, Skalse et al 2023
STARC: A General Framework For Quantifying Differences Between Reward Functions
“How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions”, Pacchiardi et al 2023
How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions
“What If the Robots Were Very Nice While They Took Over the World?”, Heffernan 2023
What If the Robots Were Very Nice While They Took Over the World?
“Does Sam Altman Know What He’s Creating? The OpenAI CEO’s Ambitious, Ingenious, Terrifying Quest to Create a New Form of Intelligence”, Andersen 2023
“Introducing Superalignment”, Leike & Sutskever 2023
“Gödel, Escher, Bach Author Douglas Hofstadter on the State of AI Today § What about AI Terrifies You?”, Hofstadter & Kim 2023
“Microsoft and OpenAI Forge Awkward Partnership As Tech’s New Power Couple: As the Companies Lead the AI Boom, Their Unconventional Arrangement Sometimes Causes Conflict”, Dotan & Seetharaman 2023
“Can Large Language Models Democratize Access to Dual-use Biotechnology?”, Soice et al 2023
Can large language models democratize access to dual-use biotechnology?
“Thought Cloning: Learning to Think While Acting by Imitating Human Thinking”, Hu & Clune 2023
Thought Cloning: Learning to Think while Acting by Imitating Human Thinking
“The Challenge of Advanced Cyberwar and the Place of Cyberpeace”, Carayannis & Draper 2023
The challenge of advanced cyberwar and the place of cyberpeace
“Incentivizing Honest Performative Predictions With Proper Scoring Rules”, Oesterheld et al 2023
Incentivizing honest performative predictions with proper scoring rules
“Large Language Models Can Be Used To Effectively Scale Spear Phishing Campaigns”, Hazell 2023
Large Language Models Can Be Used To Effectively Scale Spear Phishing Campaigns
“Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting”, Turpin et al 2023
“Mitigating Lies in Vision-Language Models”, Li et al 2023
“A Radical Plan to Make AI Good, Not Evil”, Knight 2023
“Even The Politicians Thought the Open Letter Made No Sense In The Senate Hearing on AI Today’s Hearing on Ai Covered Ai Regulation and Challenges, and the Infamous Open Letter, Which Nearly Everyone in the Room Thought Was Unwise”, Gorrell 2023
“In AI Race, Microsoft and Google Choose Speed Over Caution: Technology Companies Were Once Leery of What Some Artificial Intelligence Could Do. Now the Priority Is Winning Control of the Industry’s next Big Thing”, Grant & Weise 2023
“8 Things to Know about Large Language Models”, Bowman 2023
“Sam Altman on What Makes Him ‘Super Nervous’ About AI: The OpenAI Co-founder Thinks Tools like GPT-4 Will Be Revolutionary. But He’s Wary of Downsides”, Swisher 2023
“The OpenAI CEO Disagrees With the Forecast That AI Will Kill Us All: An Artificial Intelligence Twitter Beef, Explained”, Huet 2023
“As AI Booms, Lawmakers Struggle to Understand the Technology: Tech Innovations Are Again Racing ahead of Washington’s Ability to Regulate Them, Lawmakers and AI Experts Said”, Kang & Satariano 2023
“Pretraining Language Models With Human Preferences”, Korbak et al 2023
“Conditioning Predictive Models: Risks and Strategies”, Hubinger et al 2023
“Tracr: Compiled Transformers As a Laboratory for Interpretability”, Lindner et al 2023
Tracr: Compiled Transformers as a Laboratory for Interpretability
“Discovering Language Model Behaviors With Model-Written Evaluations”, Perez et al 2022
Discovering Language Model Behaviors with Model-Written Evaluations
“Discovering Latent Knowledge in Language Models Without Supervision”, Burns et al 2022
Discovering Latent Knowledge in Language Models Without Supervision
“Embedding Synthetic Off-Policy Experience for Autonomous Driving via Zero-Shot Curricula”, Bronstein et al 2022
Embedding Synthetic Off-Policy Experience for Autonomous Driving via Zero-Shot Curricula
“Interpreting Neural Networks through the Polytope Lens”, Black et al 2022
“Mysteries of Mode Collapse § Inescapable Wedding Parties”, Janus 2022
“Measuring Progress on Scalable Oversight for Large Language Models”, Bowman et al 2022
Measuring Progress on Scalable Oversight for Large Language Models
“Increments Podcast: #45—4 Central Fallacies of AI Research (with Melanie Mitchell)”, Mitchell & Chugg 2022
Increments Podcast: #45—4 Central Fallacies of AI Research (with Melanie Mitchell)
“Broken Neural Scaling Laws”, Caballero et al 2022
“Scaling Laws for Reward Model Overoptimization”, Gao et al 2022
“Defining and Characterizing Reward Hacking”, Skalse et al 2022
“The Alignment Problem from a Deep Learning Perspective”, Ngo 2022
“Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned”, Ganguli et al 2022
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
“Modeling Transformative AI Risks (MTAIR) Project—Summary Report”, Clarke et al 2022
Modeling Transformative AI Risks (MTAIR) Project—Summary Report
“Researching Alignment Research: Unsupervised Analysis”, Kirchner et al 2022
“Ethan Caballero on Private Scaling Progress”, Caballero & Trazzi 2022
“DeepMind: The Podcast—Excerpts on AGI”, Kiely 2022
“Do As I Can, Not As I Say (SayCan): Grounding Language in Robotic Affordances”, Ahn et al 2022
Do As I Can, Not As I Say (SayCan): Grounding Language in Robotic Affordances
“Predictability and Surprise in Large Generative Models”, Ganguli et al 2022
“Uncalibrated Models Can Improve Human-AI Collaboration”, Vodrahalli et al 2022
“DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers”, Cho et al 2022
DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers
“Safe Deep RL in 3D Environments Using Human Feedback”, Rahtz et al 2022
“LaMDA: Language Models for Dialog Applications”, Thoppilan et al 2022
“The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models”, Pan et al 2022
The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models
“Scaling Language Models: Methods, Analysis & Insights from Training Gopher”, Rae et al 2021
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
“A General Language Assistant As a Laboratory for Alignment”, Askell et al 2021
“What Would Jiminy Cricket Do? Towards Agents That Behave Morally”, Hendrycks et al 2021
What Would Jiminy Cricket Do? Towards Agents That Behave Morally
“Can Machines Learn Morality? The Delphi Experiment”, Jiang et al 2021
“SafetyNet: Safe Planning for Real-world Self-driving Vehicles Using Machine-learned Policies”, Vitelli et al 2021
SafetyNet: Safe planning for real-world self-driving vehicles using machine-learned policies
“Unsolved Problems in ML Safety”, Hendrycks et al 2021
“An Empirical Cybersecurity Evaluation of GitHub Copilot’s Code Contributions”, Pearce et al 2021
An Empirical Cybersecurity Evaluation of GitHub Copilot’s Code Contributions
“On the Opportunities and Risks of Foundation Models”, Bommasani et al 2021
“Evaluating Large Language Models Trained on Code”, Chen et al 2021
“Randomness In Neural Network Training: Characterizing The Impact of Tooling”, Zhuang et al 2021
Randomness In Neural Network Training: Characterizing The Impact of Tooling
“Goal Misgeneralization in Deep Reinforcement Learning”, Koch et al 2021
“Anthropic Raises $124 Million to Build More Reliable, General AI Systems”, Anthropic 2021
Anthropic raises $124 million to build more reliable, general AI systems
“Artificial Intelligence in China’s Revolution in Military Affairs”, Kania 2021
Artificial intelligence in China’s revolution in military affairs
“Reward Is Enough”, Silver et al 2021
“Intelligence and Unambitiousness Using Algorithmic Information Theory”, Cohen et al 2021
Intelligence and Unambitiousness Using Algorithmic Information Theory
“AI Dungeon Public Disclosure Vulnerability Report—GraphQL Unpublished Adventure Data Leak”, AetherDevSecOps 2021
AI Dungeon Public Disclosure Vulnerability Report—GraphQL Unpublished Adventure Data Leak
“Universal Off-Policy Evaluation”, Chandak et al 2021
“Multitasking Inhibits Semantic Drift”, Jacob et al 2021
“Waymo Simulated Driving Behavior in Reconstructed Fatal Crashes within an Autonomous Vehicle Operating Domain”, Scanlon et al 2021
“Language Models Have a Moral Dimension”, Schramowski et al 2021
“Replaying Real Life: How the Waymo Driver Avoids Fatal Human Crashes”, Waymo 2021
Replaying real life: how the Waymo Driver avoids fatal human crashes
“Agent Incentives: A Causal Perspective”, Everitt et al 2021
“Organizational Update from OpenAI”, OpenAI 2020
“Emergent Road Rules In Multi-Agent Driving Environments”, Pal et al 2020
“Recipes for Safety in Open-domain Chatbots”, Xu et al 2020
“Hidden Incentives for Auto-Induced Distributional Shift”, Krueger et al 2020
“The Radicalization Risks of GPT-3 and Advanced Neural Language Models”, McGuffie & Newhouse 2020
The Radicalization Risks of GPT-3 and Advanced Neural Language Models
“Matt Botvinick on the Spontaneous Emergence of Learning Algorithms”, Scholl 2020
Matt Botvinick on the spontaneous emergence of learning algorithms
“Aligning AI With Shared Human Values”, Hendrycks et al 2020
“Pitfalls of Learning a Reward Function Online”, Armstrong et al 2020
“Reward-rational (implicit) Choice: A Unifying Formalism for Reward Learning”, Jeon et al 2020
Reward-rational (implicit) choice: A unifying formalism for reward learning
“The Incentives That Shape Behavior”, Carey et al 2020
“2019 AI Alignment Literature Review and Charity Comparison”, Larks 2019
“Learning Norms from Stories: A Prior for Value Aligned Agents”, Frazier et al 2019
Learning Norms from Stories: A Prior for Value Aligned Agents
“Optimal Policies Tend to Seek Power”, Turner et al 2019
“Taxonomy of Real Faults in Deep Learning Systems”, Humbatova et al 2019
“Release Strategies and the Social Impacts of Language Models”, Solaiman et al 2019
Release Strategies and the Social Impacts of Language Models
“The Bouncer Problem: Challenges to Remote Explainability”, Merrer & Tredan 2019
“Scaling Data-driven Robotics With Reward Sketching and Batch Reinforcement Learning”, Cabi et al 2019
Scaling data-driven robotics with reward sketching and batch reinforcement learning
“Fine-Tuning GPT-2 from Human Preferences § Bugs Can Optimize for Bad Behavior”, Ziegler et al 2019
Fine-Tuning GPT-2 from Human Preferences § Bugs can optimize for bad behavior
“Designing Agent Incentives to Avoid Reward Tampering”, Everitt et al 2019
“Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective”, Everitt et al 2019
“Characterizing Attacks on Deep Reinforcement Learning”, Pan et al 2019
“Categorizing Wireheading in Partially Embedded Agents”, Majha et al 2019
“Risks from Learned Optimization in Advanced Machine Learning Systems”, Hubinger et al 2019
Risks from Learned Optimization in Advanced Machine Learning Systems
“GROVER: Defending Against Neural Fake News”, Zellers et al 2019
“AI-GAs: AI-generating Algorithms, an Alternate Paradigm for Producing General Artificial Intelligence”, Clune 2019
“Challenges of Real-World Reinforcement Learning”, Dulac-Arnold et al 2019
“DeepMind and Google: the Battle to Control Artificial Intelligence. Demis Hassabis Founded a Company to Build the World’s Most Powerful AI. Then Google Bought Him Out. Hal Hodson Asks Who Is in Charge”, Hodson 2019
“Forecasting Transformative AI: An Expert Survey”, Gruetzemacher et al 2019
“Artificial Intelligence: A Guide for Thinking Humans § Prologue: Terrified”, Mitchell 2019
Artificial Intelligence: A Guide for Thinking Humans § Prologue: Terrified
“Rigorous Agent Evaluation: An Adversarial Approach to Uncover Catastrophic Failures”, Uesato et al 2018
Rigorous Agent Evaluation: An Adversarial Approach to Uncover Catastrophic Failures
“There Is Plenty of Time at the Bottom: the Economics, Risk and Ethics of Time Compression”, Sandberg 2018
There is plenty of time at the bottom: the economics, risk and ethics of time compression
“Better Safe Than Sorry: Evidence Accumulation Allows for Safe Reinforcement Learning”, Agarwal et al 2018
Better Safe than Sorry: Evidence Accumulation Allows for Safe Reinforcement Learning
“The Alignment Problem for Bayesian History-Based Reinforcement Learners”, Everitt & Hutter 2018
The Alignment Problem for Bayesian History-Based Reinforcement Learners
“Adaptive Mechanism Design: Learning to Promote Cooperation”, Baumann et al 2018
“Visceral Machines: Risk-Aversion in Reinforcement Learning With Intrinsic Physiological Rewards”, McDuff & Kapoor 2018
Visceral Machines: Risk-Aversion in Reinforcement Learning with Intrinsic Physiological Rewards
“Incomplete Contracting and AI Alignment”, Hadfield-Menell & Hadfield 2018
“Programmatically Interpretable Reinforcement Learning”, Verma et al 2018
“Categorizing Variants of Goodhart’s Law”, Manheim & Garrabrant 2018
“The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities”, Lehman et al 2018
“Machine Theory of Mind”, Rabinowitz et al 2018
“Safe Exploration in Continuous Action Spaces”, Dalal et al 2018
“CycleGAN, a Master of Steganography”, Chu et al 2017
“AI Safety Gridworlds”, Leike et al 2017
“There’s No Fire Alarm for Artificial General Intelligence”, Yudkowsky 2017
“Safe Reinforcement Learning via Shielding”, Alshiekh et al 2017
“CAN: Creative Adversarial Networks, Generating "Art" by Learning About Styles and Deviating from Style Norms”, Elgammal et al 2017
“DeepXplore: Automated Whitebox Testing of Deep Learning Systems”, Pei et al 2017
“On the Impossibility of Supersized Machines”, Garfinkel et al 2017
“Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks”, Katz et al 2017
“The Off-Switch Game”, Hadfield-Menell et al 2016
“Combating Reinforcement Learning’s Sisyphean Curse With Intrinsic Fear”, Lipton et al 2016
“Concrete Problems in AI Safety”, Amodei et al 2016
“Intelligence Explosion Microeconomics”, Yudkowsky 2013
“Advantages of Artificial Intelligences, Uploads, and Digital Minds”, Sotala 2012
“Ontological Crises in Artificial Agents’ Value Systems”, Blanc 2011
“The Normalization of Deviance in Healthcare Delivery”, Banja 2010
“Halloween Nightmare Scenario, Early 2020’s”, Wood 2009
“The Basic AI Drives”, Omohundro 2008
“Starfish § Bulrushes”, Watts 1999
“Superhumanism: According to Hans Moravec § On the Inevitability & Desirability of Human Extinction”, Platt 1995
“Some Moral and Technical Consequences of Automation: As Machines Learn They May Develop Unforeseen Strategies at Rates That Baffle Their Programmers”, Wiener 1960
“Intelligent Machinery, A Heretical Theory”, Turing 1951
“Homepage of Paul F. Christiano”, Christiano 2024
Sort By Magic
Annotations sorted by machine learning into inferred 'tags'. This provides an alternative way to browse: instead of by date order, one can browse in topic order. The 'sorted' list has been automatically clustered into multiple sections & auto-labeled for easier browsing.
Beginning with the newest annotation, it uses the embedding of each annotation to attempt to create a list of nearest-neighbor annotations, creating a progression of topics. For more details, see the link.
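The nearest-neighbor progression described above can be sketched roughly as follows. This is a minimal illustration, not the site's actual implementation: the function names `cosine_similarity` and `sort_by_similarity` are hypothetical, the use of cosine similarity as the distance measure is an assumption, and the clustering and auto-labeling steps are not covered.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def sort_by_similarity(embeddings, start=0):
    """Greedy nearest-neighbor chain over annotation embeddings.

    `embeddings` is a list of vectors, assumed to be ordered newest-first,
    so `start=0` begins with the newest annotation. Returns a list of
    indices in browsing order.
    """
    if not embeddings:
        return []
    remaining = set(range(len(embeddings)))
    order = [start]
    remaining.remove(start)
    while remaining:
        current = embeddings[order[-1]]
        # Pick the not-yet-visited annotation most similar to the current one;
        # sorting the candidates first makes tie-breaking deterministic.
        nearest = max(
            sorted(remaining),
            key=lambda i: cosine_similarity(current, embeddings[i]),
        )
        order.append(nearest)
        remaining.remove(nearest)
    return order


# Toy example with 2-dimensional embeddings (real embeddings are much larger).
toy = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
print(sort_by_similarity(toy))
```

Each step moves to the closest unvisited item, so adjacent entries in the result tend to share a topic, which is what yields a "progression of topics" rather than a date order.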
ethical-lm
data-privacy
alignment-incentives
llm-safety
reliability
language-safety
Wikipedia
Miscellaneous
-
https://80000hours.org/2018/03/jan-leike-ml-alignment/
-
https://80000hours.org/podcast/episodes/brian-christian-the-alignment-problem/
-
https://aiimpacts.org/partially-plausible-fictional-ai-futures/
-
https://blog.acolyer.org/2018/08/13/delayed-impact-of-fair-machine-learning/
-
https://blog.acolyer.org/2020/01/13/challenges-of-real-world-rl/
-
https://blog.x.company/1-million-hours-of-stratospheric-flight-f7af7ae728ac
-
https://chat.openai.com/share/312e82f0-cc5e-47f3-b368-b2c0c0f4ad3f
-
https://forum.effectivealtruism.org/posts/TMbPEhdAAJZsSYx2L/the-limited-upside-of-interpretability
-
https://github.com/spdustin/ChatGPT-AutoExpert/blob/main/System%20Prompts.md
-
https://joecarlsmith.com/2023/05/08/predictable-updating-about-ai-risk
-
https://mailchi.mp/938a7eed18c3/an-71avoiding-reward-tamperi
-
https://medium.com/@deepmindsafetyresearch/building-safe-artificial-intelligence-52f5f75058f1
-
https://medium.com/aurora-blog/auroras-approach-to-development-5e42fec2ee4b
-
https://spectrum.ieee.org/its-too-easy-to-hide-bias-in-deeplearning-systems
-
https://thezvi.substack.com/p/jailbreaking-the-chatgpt-on-release
-
https://thezvi.substack.com/p/on-openais-preparedness-framework
-
https://thezvi.wordpress.com/2023/07/25/anthropic-observations/
-
https://twitter.com/DanielColson6/status/1702319218895868305
-
https://twitter.com/KevinAFischer/status/1646677902833102849
-
https://twitter.com/KevinAFischer/status/1646690838981005312
-
https://twitter.com/andrewwhite01/status/1634728559506870274
-
https://twitter.com/daniel_271828/status/1769853886163296455
-
https://twitter.com/juan_cambeiro/status/1643739695598419970
-
https://twitter.com/katrosenfield/status/1672969824656322561
-
https://twitter.com/metachirality/status/1769818226718888426
-
https://twitter.com/metachirality/status/1769905644725830090
-
https://twitter.com/papayathreesome/status/1670170344953372676
-
https://vkrakovna.wordpress.com/2022/06/02/paradigms-of-ai-alignment-components-and-enablers/
-
https://web.archive.org/web/20140527121332/https://www.infinityplus.co.uk/stories/under.htm
-
https://web.archive.org/web/20240102075620/https://www.jailbreakchat.com/
-
https://www.anthropic.com/index/anthropics-responsible-scaling-policy
-
https://www.astralcodexten.com/p/constitutional-ai-rlhf-on-steroids
-
https://www.astralcodexten.com/p/perhaps-it-is-a-bad-thing-that-the
-
https://www.baen.com/Chapters/9781618249203/9781618249203___2.htm
-
https://www.deepmind.com/blog/article/Specification-gaming-the-flip-side-of-AI-ingenuity
-
https://www.dwarkeshpatel.com/p/demis-hassabis#%C2%A7timestamps
-
https://www.forourposterity.com/nobodys-on-the-ball-on-agi-alignment/
-
https://www.lesswrong.com/posts/3eqHYxfWb5x4Qfz8C/unrlhf-efficiently-undoing-llm-safeguards
-
https://www.lesswrong.com/posts/5wMcKNAwB6X4mp9og/that-alien-message
-
https://www.lesswrong.com/posts/6dn6hnFRgqqWJbwk9/deception-chess-game-1
-
https://www.lesswrong.com/posts/9kQFure4hdDmRBNdH/how-it-feels-to-have-your-mind-hacked-by-an-ai
-
https://www.lesswrong.com/posts/EbFABnst8LsidYs5Y/goodhart-taxonomy
-
https://www.lesswrong.com/posts/Eu6CvP7c7ivcGM3PJ/goodhart-s-law-in-reinforcement-learning
-
https://www.lesswrong.com/posts/FbSAuJfCxizZGpcHc/interpreting-the-learning-of-deceit
-
https://www.lesswrong.com/posts/FkgsxrGf3QxhfLWHG/risks-from-learned-optimization-introduction
-
https://www.lesswrong.com/posts/No5JpRCHzBrWA4jmS/q-and-a-with-shane-legg-on-risks-from-ai
-
https://www.lesswrong.com/posts/ZwshvqiqCvXPsZEct/the-learning-theoretic-agenda-status-2023
-
https://www.lesswrong.com/posts/jkY6QdCfAXHJk3kea/the-petertodd-phenomenon
-
https://www.lesswrong.com/posts/kpPnReyBC54KESiSn/optimality-is-the-tiger-and-agents-are-its-teeth
-
https://www.lesswrong.com/posts/pNcFYZnPdXyL2RfgA/using-gpt-eliezer-against-chatgpt-jailbreaking
-
https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities
-
https://www.lesswrong.com/posts/vwu4kegAEZTBtpT6p/thoughts-on-the-impact-of-rlhf-research
-
https://www.lesswrong.com/posts/yDcMDJeSck7SuBs24/steganography-in-chain-of-thought-reasoning
-
https://www.neelnanda.io/mechanistic-interpretability/favourite-papers
-
https://www.newyorker.com/magazine/2022/01/24/the-rise-of-ai-fighter-pilots
-
https://www.newyorker.com/science/annals-of-artificial-intelligence/can-we-stop-the-singularity
-
https://www.nytimes.com/2018/03/15/business/self-driving-cars-remote-control.html
-
https://www.nytimes.com/2021/04/30/technology/robot-surgery-surgeon.html
-
https://www.nytimes.com/2023/05/30/technology/shoggoth-meme-ai.html
-
https://www.politico.com/news/magazine/2023/11/02/bruce-reed-ai-biden-tech-00124375
-
https://www.reddit.com/r/40krpg/comments/11a9m8u/was_using_chatgpt3_to_create_some_bits_and_pieces/
-
https://www.reddit.com/r/ChatGPT/comments/10tevu1/new_jailbreak_proudly_unveiling_the_tried_and/
-
https://www.reddit.com/r/ChatGPT/comments/12a0ajb/i_gave_gpt4_persistent_memory_and_the_ability_to/
-
https://www.reddit.com/r/ChatGPT/comments/15y4mqx/i_asked_chatgpt_to_maximize_its_censorship/
-
https://www.reddit.com/r/ChatGPT/comments/18fl2d5/nsfw_fun_with_dalle/
-
https://www.reddit.com/r/GPT3/comments/12ez822/neurosemantical_inversitis_prompt_still_works/
-
https://www.reddit.com/r/ProgrammerHumor/comments/145nduh/kiss/
-
https://www.reddit.com/r/bing/comments/110eagl/the_customer_service_of_the_new_bing_chat_is/
-
https://www.theverge.com/2021/7/6/22565448/waymo-simulation-city-autonomous-vehicle-testing-virtual
-
https://www.vox.com/future-perfect/23794855/anthropic-ai-openai-claude-2
-
https://www.wired.com/story/ai-powered-totally-autonomous-future-of-war-is-here/
-
https://www.wired.com/story/when-bots-teach-themselves-to-cheat/
-
https://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-207.pdf#page=3
Link Bibliography
-
https://arxiv.org/abs/2401.05566#anthropic
: “Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training”, -
https://arxiv.org/abs/2401.02843
: “Thousands of AI Authors on the Future of AI”, Katja Grace, Harlan Stewart, Julia Fabienne Sandkühler, Stephen Thomas, Ben Weinstein-Raun, Jan Brauner -
https://www.newyorker.com/magazine/2023/12/11/the-inside-story-of-microsofts-partnership-with-openai
: “The Inside Story of Microsoft’s Partnership With OpenAI: The Companies Had Honed a Protocol for Releasing Artificial Intelligence Ambitiously but Safely. Then OpenAI’s Board Exploded All Their Carefully Laid Plans”, Charles Duhigg -
https://www.newyorker.com/magazine/2023/12/04/how-jensen-huangs-nvidia-is-powering-the-ai-revolution
: “How Jensen Huang’s Nvidia Is Powering the AI Revolution: The Company’s CEO Bet It All on a New Kind of Chip. Now That Nvidia Is One of the Biggest Companies in the World, What Will He Do Next?”, Stephen Witt -
https://cognitiverevolution.substack.com/p/did-i-get-sam-altman-fired-from-openai
: “Did I Get Sam Altman Fired from OpenAI?: Nathan’s Redteaming Experience, Noticing How the Board Was Not Aware of GPT-4 Jailbreaks & Had Not Even Tried GPT-4 prior to Its Early Release”, Nathan Labenz -
https://www.theatlantic.com/technology/archive/2023/11/sam-altman-open-ai-chatgpt-chaos/676050/
: “Inside the Chaos at OpenAI: Sam Altman’s Weekend of Shock and Drama Began a Year Ago, With the Release of ChatGPT”, Karen Hao, Charlie Warzel -
https://deepmind.google/about/responsibility-safety/
: “Responsibility & Safety: Our Approach”, DeepMind -
https://www.theatlantic.com/magazine/archive/2023/09/sam-altman-openai-chatgpt-gpt-4/674764/
: “Does Sam Altman Know What He’s Creating? The OpenAI CEO’s Ambitious, Ingenious, Terrifying Quest to Create a New Form of Intelligence”, Ross Andersen -
https://openai.com/blog/introducing-superalignment
: “Introducing Superalignment”, Jan Leike, Ilya Sutskever -
https://www.youtube.com/watch?v=lfXxzAVtdpU&t=1763s
: “Gödel, Escher, Bach Author Douglas Hofstadter on the State of AI Today § What about AI Terrifies You?”, Douglas Hofstadter, Amy Jo Kim -
https://www.wsj.com/articles/microsoft-and-openai-forge-awkward-partnership-as-techs-new-power-couple-3092de51
: “Microsoft and OpenAI Forge Awkward Partnership As Tech’s New Power Couple: As the Companies Lead the AI Boom, Their Unconventional Arrangement Sometimes Causes Conflict”, Tom Dotan, Deepa Seetharaman -
https://arxiv.org/abs/2306.00323
: “Thought Cloning: Learning to Think While Acting by Imitating Human Thinking”, Shengran Hu, Jeff Clune -
2023-carayannis.pdf
: “The Challenge of Advanced Cyberwar and the Place of Cyberpeace”, Elias G. Carayannis, John Draper -
https://arxiv.org/abs/2305.06972
: “Large Language Models Can Be Used To Effectively Scale Spear Phishing Campaigns”, Julian Hazell -
https://arxiv.org/abs/2305.04388
: “Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting”, Miles Turpin, Julian Michael, Ethan Perez, Samuel R. Bowman -
https://www.wired.com/story/anthropic-ai-chatbots-ethics/
: “A Radical Plan to Make AI Good, Not Evil”, Will Knight -
https://www.nytimes.com/2023/04/07/technology/ai-chatbots-google-microsoft.html
: “In AI Race, Microsoft and Google Choose Speed Over Caution: Technology Companies Were Once Leery of What Some Artificial Intelligence Could Do. Now the Priority Is Winning Control of the Industry’s next Big Thing”, Nico Grant, Karen Weise -
https://nymag.com/intelligencer/2023/03/on-with-kara-swisher-sam-altman-on-the-ai-revolution.html
: “Sam Altman on What Makes Him ‘Super Nervous’ About AI: The OpenAI Co-founder Thinks Tools like GPT-4 Will Be Revolutionary. But He’s Wary of Downsides”, Kara Swisher -
https://www.nytimes.com/2023/03/03/technology/artificial-intelligence-regulation-congress.html
: “As AI Booms, Lawmakers Struggle to Understand the Technology: Tech Innovations Are Again Racing ahead of Washington’s Ability to Regulate Them, Lawmakers and AI Experts Said”, Cecilia Kang, Adam Satariano -
https://www.lesswrong.com/posts/t9svvNPNmFf5Qa3TA/mysteries-of-mode-collapse-due-to-rlhf#Inescapable_wedding_parties
: “Mysteries of Mode Collapse § Inescapable Wedding Parties”, Janus -
https://www.youtube.com/watch?v=Q-TJFyUoenc&t=2444s
: “Increments Podcast: #45—4 Central Fallacies of AI Research (with Melanie Mitchell)”, Melanie Mitchell, Benny Chugg -
https://arxiv.org/abs/2210.10760#openai
: “Scaling Laws for Reward Model Overoptimization”, Leo Gao, John Schulman, Jacob Hilton -
https://www.anthropic.com/red_teaming.pdf
: “Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned”, -
https://arxiv.org/abs/2206.02841
: “Researching Alignment Research: Unsupervised Analysis”, Jan H. Kirchner, Logan Smith, Jacques Thibodeau, Kyle McDonell, Laria Reynolds -
https://theinsideview.ai/ethan
: “Ethan Caballero on Private Scaling Progress”, Ethan Caballero, Michaël Trazzi -
https://www.lesswrong.com/posts/SbAgRYo8tkHwhd9Qx/deepmind-the-podcast-excerpts-on-agi
: “DeepMind: The Podcast—Excerpts on AGI”, William Kiely -
https://arxiv.org/abs/2204.01691#google
: “Do As I Can, Not As I Say (SayCan): Grounding Language in Robotic Affordances”, -
https://arxiv.org/abs/2202.07785#anthropic
: “Predictability and Surprise in Large Generative Models”, -
https://arxiv.org/abs/2201.03544
: “The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models”, Alexander Pan, Kush Bhatia, Jacob Steinhardt -
https://arxiv.org/abs/2112.11446#deepmind
: “Scaling Language Models: Methods, Analysis & Insights from Training Gopher”, -
https://arxiv.org/abs/2112.00861#anthropic
: “A General Language Assistant As a Laboratory for Alignment”, -
https://arxiv.org/abs/2108.07258
: “On the Opportunities and Risks of Foundation Models”, -
https://www.sciencedirect.com/science/article/pii/S0004370221000862#deepmind
: “Reward Is Enough”, David Silver, Satinder Singh, Doina Precup, Richard S. Sutton -
https://waymo.com/blog/2021/03/replaying-real-life.html
: “Replaying Real Life: How the Waymo Driver Avoids Fatal Human Crashes”, Waymo -
https://www.lesswrong.com/posts/Wnqua6eQkewL3bqsF/matt-botvinick-on-the-spontaneous-emergence-of-learning
: “Matt Botvinick on the Spontaneous Emergence of Learning Algorithms”, Adam Scholl -
https://www.lesswrong.com/posts/SmDziGM9hBjW9DKmf/2019-ai-alignment-literature-review-and-charity-comparison
: “2019 AI Alignment Literature Review and Charity Comparison”, Larks -
https://www.economist.com/1843/2019/03/01/deepmind-and-google-the-battle-to-control-artificial-intelligence
: “DeepMind and Google: the Battle to Control Artificial Intelligence. Demis Hassabis Founded a Company to Build the World’s Most Powerful AI. Then Google Bought Him Out. Hal Hodson Asks Who Is in Charge”, Hal Hodson -
https://melaniemitchell.me/aibook/
: “Artificial Intelligence: A Guide for Thinking Humans § Prologue: Terrified”, Melanie Mitchell -
2018-everitt.pdf
: “The Alignment Problem for Bayesian History-Based Reinforcement Learners”, Tom Everitt, Marcus Hutter -
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2821100/
: “The Normalization of Deviance in Healthcare Delivery”, John Banja -
https://dw2blog.com/2009/11/02/halloween-nightmare-scenario-early-2020s/
: “Halloween Nightmare Scenario, Early 2020’s”, David Wood -
https://paulfchristiano.com/
: “Homepage of Paul F. Christiano”, Paul F. Christiano