‘AI safety’ tag
- See Also
- Gwern
- “What Do You Do After ‘Winning’ an AI Arms Race?”, Gwern 2024
- “What Is an ‘AI Warning Shot’?”, Gwern 2024
- “The Neural Net Tank Urban Legend”, Gwern 2011
- “It Looks Like You’re Trying To Take Over The World”, Gwern 2022
- “Surprisingly Turing-Complete”, Gwern 2012
- “The Scaling Hypothesis”, Gwern 2020
- “Evolution As Backstop for Reinforcement Learning”, Gwern 2018
- “Complexity No Bar to AI”, Gwern 2014
- “Why Tool AIs Want to Be Agent AIs”, Gwern 2016
- “AI Risk Demos”, Gwern 2016
- Links
- “Memorandum on Advancing the United States’ Leadership in Artificial Intelligence”, Biden 2024
- “Machines of Loving Grace: How AI Could Transform the World for the Better”, Amodei 2024
- “Strategic Insights from Simulation Gaming of AI Race Dynamics”, Gruetzemacher et al 2024
- “Towards a Law of Iterated Expectations for Heuristic Estimators”, Christiano et al 2024
- “Language Models Learn to Mislead Humans via RLHF”, Wen et al 2024
- “OpenAI Co-Founder Sutskever’s New Safety-Focused AI Startup SSI Raises $1 Billion”, Cai et al 2024
- “Motor Physics: Safety Implications of Geared Motors”, Jang 2024
- “China’s Views on AI Safety Are Changing—Quickly: Beijing’s AI Safety Concerns Are Higher on the Priority List, but They Remain Tied up in Geopolitical Competition and Technological Advancement”, Sheehan 2024
- “Is Xi Jinping an AI Doomer? China’s Elite Is Split over Artificial Intelligence”, Economist 2024
- “Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?”, Ren et al 2024
- “Resolution of the Central Committee of the Communist Party of China on Further Deepening Reform Comprehensively to Advance Chinese Modernization § Pg58”, China 2024 (page 58)
- “On Scalable Oversight With Weak LLMs Judging Strong LLMs”, Kenton et al 2024
- “Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs”, Laine et al 2024
- “Ilya Sutskever Has a New Plan for Safe Superintelligence: OpenAI’s Co-Founder Discloses His Plans to Continue His Work at a New Research Lab Focused on Artificial General Intelligence”, Vance 2024
- “Super(ficial)-Alignment: Strong Models May Deceive Weak Models in Weak-To-Strong Generalization”, Yang et al 2024
- “Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models”, Denison et al 2024
- “AI Sandbagging: Language Models Can Strategically Underperform on Evaluations”, Weij et al 2024
- “Safety Alignment Should Be Made More Than Just a Few Tokens Deep”, Qi et al 2024
- “I Wish I Knew How to Force Quit You”, Life & Rich 2024
- “OpenAI Board Forms Safety and Security Committee: This New Committee Is Responsible for Making Recommendations on Critical Safety and Security Decisions for All OpenAI Projects; Recommendations in 90 Days”, OpenAI 2024
- “OpenAI Begins Training next AI Model As It Battles Safety Concerns: Executive Appears to Backtrack on Start-Up’s Vision of Building ‘Superintelligence’ After Exits from ‘Superalignment’ Team”, Criddle 2024
- janleike @ "2024-05-28"
- “OpenAI Promised 20% of Its Computing Power to Combat the Most Dangerous Kind of AI—But Never Delivered, Sources Say”, Kahn 2024
- “AI Is a Black Box. Anthropic Figured Out a Way to Look Inside: What Goes on in Artificial Neural Networks Is Largely a Mystery, Even to Their Creators. But Researchers from Anthropic Have Caught a Glimpse”, Levy 2024
- DavidSKrueger @ "2024-05-19"
- “Earnings Call: Tesla Discusses Q1 2024 Challenges and AI Expansion”, Abdulkadir 2024
- “SOPHON: Non-Fine-Tunable Learning to Restrain Task Transferability For Pre-Trained Models”, Deng et al 2024
- “Foundational Challenges in Assuring Alignment and Safety of Large Language Models”, Anwar et al 2024
- “LLM Evaluators Recognize and Favor Their Own Generations”, Panickssery et al 2024
- “Algorithmic Collusion by Large Language Models”, Fish et al 2024
- “Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression”, Hong et al 2024
- “When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback”, Lang et al 2024
- “Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training”, Hubinger et al 2024
- “Thousands of AI Authors on the Future of AI”, Grace et al 2024
- “Using Dictionary Learning Features As Classifiers”
- “Exploiting Novel GPT-4 APIs”, Pelrine et al 2023
- “Comparison of Waymo Rider-Only Crash Data to Human Benchmarks at 7.1 Million Miles”, Kusano et al 2023
- “Challenges With Unsupervised LLM Knowledge Discovery”, Farquhar et al 2023
- “Politics and the Future”, Horowitz 2023
- “Helping or Herding? Reward Model Ensembles Mitigate but Do Not Eliminate Reward Hacking”, Eisenstein et al 2023
- “The Inside Story of Microsoft’s Partnership With OpenAI: The Companies Had Honed a Protocol for Releasing Artificial Intelligence Ambitiously but Safely. Then OpenAI’s Board Exploded All Their Carefully Laid Plans”, Duhigg 2023
- “How Jensen Huang’s Nvidia Is Powering the AI Revolution: The Company’s CEO Bet It All on a New Kind of Chip. Now That Nvidia Is One of the Biggest Companies in the World, What Will He Do Next?”, Witt 2023
- “Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching”, Campbell et al 2023
- “Did I Get Sam Altman Fired from OpenAI?: Nathan’s Red-Teaming Experience, Noticing How the Board Was Not Aware of GPT-4 Jailbreaks & Had Not Even Tried GPT-4 prior to Its Early Release”, Labenz 2023
- “Did I Get Sam Altman Fired from OpenAI? § GPT-4-Base”, Labenz 2023
- “Inside the Chaos at OpenAI: Sam Altman’s Weekend of Shock and Drama Began a Year Ago, With the Release of ChatGPT”, Hao & Warzel 2023
- “OpenAI Announces Leadership Transition”, Sutskever et al 2023
- “On Measuring Faithfulness or Self-Consistency of Natural Language Explanations”, Parcalabescu & Frank 2023
- “In-Context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering”, Liu et al 2023
- “Removing RLHF Protections in GPT-4 via Fine-Tuning”, Zhan et al 2023
- “Large Language Models Can Strategically Deceive Their Users When Put Under Pressure”, Scheurer et al 2023
- “Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation”, Shah et al 2023
- “Augmenting Large Language Models With Chemistry Tools”, Bran et al 2023
- “Preventing Language Models From Hiding Their Reasoning”, Roger & Greenblatt 2023
- “Will Releasing the Weights of Large Language Models Grant Widespread Access to Pandemic Agents?”, Gopal et al 2023
- “Specific versus General Principles for Constitutional AI”, Kundu et al 2023
- “Goodhart’s Law in Reinforcement Learning”, Karwowski et al 2023
- “Let Models Speak Ciphers: Multiagent Debate through Embeddings”, Pham et al 2023
- “Fine-Tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!”, Qi et al 2023
- “Representation Engineering: A Top-Down Approach to AI Transparency”, Zou et al 2023
- “Responsibility & Safety: Our Approach”, DeepMind 2023
- “STARC: A General Framework For Quantifying Differences Between Reward Functions”, Skalse et al 2023
- “How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions”, Pacchiardi et al 2023
- “What If the Robots Were Very Nice While They Took Over the World?”, Heffernan 2023
- “Taken out of Context: On Measuring Situational Awareness in LLMs”, Berglund et al 2023
- “AI Deception: A Survey of Examples, Risks, and Potential Solutions”, Park et al 2023
- “Simple Synthetic Data Reduces Sycophancy in Large Language Models”, Wei et al 2023
- “Does Sam Altman Know What He’s Creating? The OpenAI CEO’s Ambitious, Ingenious, Terrifying Quest to Create a New Form of Intelligence”, Andersen 2023
- “Question Decomposition Improves the Faithfulness of Model-Generated Reasoning”, Radhakrishnan et al 2023
- “Hoodwinked: Deception and Cooperation in a Text-Based Game for Language Models”, O’Gara 2023
- “Introducing Superalignment”, Leike & Sutskever 2023
- “Gödel, Escher, Bach Author Douglas Hofstadter on the State of AI Today § What about AI Terrifies You?”, Hofstadter & Kim 2023
- “Microsoft and OpenAI Forge Awkward Partnership As Tech’s New Power Couple: As the Companies Lead the AI Boom, Their Unconventional Arrangement Sometimes Causes Conflict”, Dotan & Seetharaman 2023
- “Can Large Language Models Democratize Access to Dual-Use Biotechnology?”, Soice et al 2023
- “Survival Instinct in Offline Reinforcement Learning”, Li et al 2023
- “Thought Cloning: Learning to Think While Acting by Imitating Human Thinking”, Hu & Clune 2023
- “The Challenge of Advanced Cyberwar and the Place of Cyberpeace”, Carayannis & Draper 2023
- “Incentivizing Honest Performative Predictions With Proper Scoring Rules”, Oesterheld et al 2023
- “Large Language Models Can Be Used To Effectively Scale Spear Phishing Campaigns”, Hazell 2023
- “A Radical Plan to Make AI Good, Not Evil”, Knight 2023
- “Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-Of-Thought Prompting”, Turpin et al 2023
- “Mitigating Lies in Vision-Language Models”, Li et al 2023
- “Fundamental Limitations of Alignment in Large Language Models”, Wolf et al 2023
- “Even The Politicians Thought the Open Letter Made No Sense In The Senate Hearing on AI: Today’s Hearing on AI Covered AI Regulation and Challenges, and the Infamous Open Letter, Which Nearly Everyone in the Room Thought Was Unwise”, Gorrell 2023
- “In AI Race, Microsoft and Google Choose Speed Over Caution: Technology Companies Were Once Leery of What Some Artificial Intelligence Could Do. Now the Priority Is Winning Control of the Industry’s next Big Thing”, Grant & Weise 2023
- “8 Things to Know about Large Language Models”, Bowman 2023
- “Sam Altman on What Makes Him ‘Super Nervous’ About AI: The OpenAI Co-Founder Thinks Tools like GPT-4 Will Be Revolutionary. But He’s Wary of Downsides”, Swisher 2023
- “The OpenAI CEO Disagrees With the Forecast That AI Will Kill Us All: An Artificial Intelligence Twitter Beef, Explained”, Huet 2023
- “As AI Booms, Lawmakers Struggle to Understand the Technology: Tech Innovations Are Again Racing ahead of Washington’s Ability to Regulate Them, Lawmakers and AI Experts Said”, Kang & Satariano 2023
- “Pretraining Language Models With Human Preferences”, Korbak et al 2023
- “Conditioning Predictive Models: Risks and Strategies”, Hubinger et al 2023
- “Tracr: Compiled Transformers As a Laboratory for Interpretability”, Lindner et al 2023
- “Specification Gaming Examples in AI”
- “Discovering Language Model Behaviors With Model-Written Evaluations”, Perez et al 2022
- “Discovering Latent Knowledge in Language Models Without Supervision”, Burns et al 2022
- “Embedding Synthetic Off-Policy Experience for Autonomous Driving via Zero-Shot Curricula”, Bronstein et al 2022
- “Interpreting Neural Networks through the Polytope Lens”, Black et al 2022
- “Mysteries of Mode Collapse § Inescapable Wedding Parties”, Janus 2022
- “Measuring Progress on Scalable Oversight for Large Language Models”, Bowman et al 2022
- “Increments Podcast: #45—4 Central Fallacies of AI Research (with Melanie Mitchell)”, Mitchell & Chugg 2022
- “Broken Neural Scaling Laws”, Caballero et al 2022
- “Scaling Laws for Reward Model Overoptimization”, Gao et al 2022
- “Defining and Characterizing Reward Hacking”, Skalse et al 2022
- “The Alignment Problem from a Deep Learning Perspective”, Ngo 2022
- “Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned”, Ganguli et al 2022
- “Modeling Transformative AI Risks (MTAIR) Project—Summary Report”, Clarke et al 2022
- “Researching Alignment Research: Unsupervised Analysis”, Kirchner et al 2022
- “Ethan Caballero on Private Scaling Progress”, Caballero & Trazzi 2022
- “DeepMind: The Podcast—Excerpts on AGI”, Kiely 2022
- “Do As I Can, Not As I Say (SayCan): Grounding Language in Robotic Affordances”, Ahn et al 2022
- “Predictability and Surprise in Large Generative Models”, Ganguli et al 2022
- “Uncalibrated Models Can Improve Human-AI Collaboration”, Vodrahalli et al 2022
- “DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-To-Image Generative Transformers”, Cho et al 2022
- “Safe Deep RL in 3D Environments Using Human Feedback”, Rahtz et al 2022
- “LaMDA: Language Models for Dialog Applications”, Thoppilan et al 2022
- “The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models”, Pan et al 2022
- “Scaling Language Models: Methods, Analysis & Insights from Training Gopher”, Rae et al 2021
- “A General Language Assistant As a Laboratory for Alignment”, Askell et al 2021
- “What Would Jiminy Cricket Do? Towards Agents That Behave Morally”, Hendrycks et al 2021
- “Can Machines Learn Morality? The Delphi Experiment”, Jiang et al 2021
- “SafetyNet: Safe Planning for Real-World Self-Driving Vehicles Using Machine-Learned Policies”, Vitelli et al 2021
- “Unsolved Problems in ML Safety”, Hendrycks et al 2021
- “An Empirical Cybersecurity Evaluation of GitHub Copilot’s Code Contributions”, Pearce et al 2021
- “On the Opportunities and Risks of Foundation Models”, Bommasani et al 2021
- “Evaluating Large Language Models Trained on Code”, Chen et al 2021
- “Randomness In Neural Network Training: Characterizing The Impact of Tooling”, Zhuang et al 2021
- “Goal Misgeneralization in Deep Reinforcement Learning”, Koch et al 2021
- “Anthropic Raises $124 Million to Build More Reliable, General AI Systems”, Anthropic 2021
- “Artificial Intelligence in China’s Revolution in Military Affairs”, Kania 2021
- “Reward Is Enough”, Silver et al 2021
- “Intelligence and Unambitiousness Using Algorithmic Information Theory”, Cohen et al 2021
- “AI Dungeon Public Disclosure Vulnerability Report—GraphQL Unpublished Adventure Data Leak”, AetherDevSecOps 2021
- “Universal Off-Policy Evaluation”, Chandak et al 2021
- “Multitasking Inhibits Semantic Drift”, Jacob et al 2021
- “Waymo Simulated Driving Behavior in Reconstructed Fatal Crashes within an Autonomous Vehicle Operating Domain”, Scanlon et al 2021
- “Language Models Have a Moral Dimension”, Schramowski et al 2021
- “Replaying Real Life: How the Waymo Driver Avoids Fatal Human Crashes”, Waymo 2021
- “Agent Incentives: A Causal Perspective”, Everitt et al 2021
- “Organizational Update from OpenAI”, OpenAI 2020
- “Emergent Road Rules In Multi-Agent Driving Environments”, Pal et al 2020
- “Underspecification Presents Challenges for Credibility in Modern Machine Learning”, D’Amour et al 2020
- “Recipes for Safety in Open-Domain Chatbots”, Xu et al 2020
- “Hidden Incentives for Auto-Induced Distributional Shift”, Krueger et al 2020
- “The Radicalization Risks of GPT-3 and Advanced Neural Language Models”, McGuffie & Newhouse 2020
- “Matt Botvinick on the Spontaneous Emergence of Learning Algorithms”, Scholl 2020
- “ETHICS: Aligning AI With Shared Human Values”, Hendrycks et al 2020
- “Pitfalls of Learning a Reward Function Online”, Armstrong et al 2020
- “Reward-Rational (implicit) Choice: A Unifying Formalism for Reward Learning”, Jeon et al 2020
- “The Incentives That Shape Behavior”, Carey et al 2020
- “2019 AI Alignment Literature Review and Charity Comparison”, Larks 2019
- “Learning Norms from Stories: A Prior for Value Aligned Agents”, Frazier et al 2019
- “Optimal Policies Tend to Seek Power”, Turner et al 2019
- “Taxonomy of Real Faults in Deep Learning Systems”, Humbatova et al 2019
- “Release Strategies and the Social Impacts of Language Models”, Solaiman et al 2019
- “The Bouncer Problem: Challenges to Remote Explainability”, Merrer & Tredan 2019
- “Scaling Data-Driven Robotics With Reward Sketching and Batch Reinforcement Learning”, Cabi et al 2019
- “Fine-Tuning GPT-2 from Human Preferences § Bugs Can Optimize for Bad Behavior”, Ziegler et al 2019
- “Designing Agent Incentives to Avoid Reward Tampering”, Everitt et al 2019
- “Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective”, Everitt et al 2019
- “Characterizing Attacks on Deep Reinforcement Learning”, Pan et al 2019
- “Categorizing Wireheading in Partially Embedded Agents”, Majha et al 2019
- “Risks from Learned Optimization in Advanced Machine Learning Systems”, Hubinger et al 2019
- “GROVER: Defending Against Neural Fake News”, Zellers et al 2019
- “AI-GAs: AI-Generating Algorithms, an Alternate Paradigm for Producing General Artificial Intelligence”, Clune 2019
- “Challenges of Real-World Reinforcement Learning”, Dulac-Arnold et al 2019
- “DeepMind and Google: the Battle to Control Artificial Intelligence. Demis Hassabis Founded a Company to Build the World’s Most Powerful AI. Then Google Bought Him Out. Hal Hodson Asks Who Is in Charge”, Hodson 2019
- “Forecasting Transformative AI: An Expert Survey”, Gruetzemacher et al 2019
- “Artificial Intelligence: A Guide for Thinking Humans § Prologue: Terrified”, Mitchell 2019
- “Rigorous Agent Evaluation: An Adversarial Approach to Uncover Catastrophic Failures”, Uesato et al 2018
- “There Is Plenty of Time at the Bottom: the Economics, Risk and Ethics of Time Compression”, Sandberg 2018
- “Better Safe Than Sorry: Evidence Accumulation Allows for Safe Reinforcement Learning”, Agarwal et al 2018
- “The Alignment Problem for Bayesian History-Based Reinforcement Learners”, Everitt & Hutter 2018
- “Adaptive Mechanism Design: Learning to Promote Cooperation”, Baumann et al 2018
- “Visceral Machines: Risk-Aversion in Reinforcement Learning With Intrinsic Physiological Rewards”, McDuff & Kapoor 2018
- “Incomplete Contracting and AI Alignment”, Hadfield-Menell & Hadfield 2018
- “Programmatically Interpretable Reinforcement Learning”, Verma et al 2018
- “Categorizing Variants of Goodhart’s Law”, Manheim & Garrabrant 2018
- “The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities”, Lehman et al 2018
- “Machine Theory of Mind”, Rabinowitz et al 2018
- “Safe Exploration in Continuous Action Spaces”, Dalal et al 2018
- “CycleGAN, a Master of Steganography”, Chu et al 2017
- “AI Safety Gridworlds”, Leike et al 2017
- “There’s No Fire Alarm for Artificial General Intelligence”, Yudkowsky 2017
- “Safe Reinforcement Learning via Shielding”, Alshiekh et al 2017
- “CAN: Creative Adversarial Networks, Generating "Art" by Learning About Styles and Deviating from Style Norms”, Elgammal et al 2017
- “DeepXplore: Automated Whitebox Testing of Deep Learning Systems”, Pei et al 2017
- “On the Impossibility of Supersized Machines”, Garfinkel et al 2017
- “Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks”, Katz et al 2017
- “The Off-Switch Game”, Hadfield-Menell et al 2016
- “Combating Reinforcement Learning’s Sisyphean Curse With Intrinsic Fear”, Lipton et al 2016
- “Concrete Problems in AI Safety”, Amodei et al 2016
- “My Path to OpenAI”, Brockman 2016
- “Machine Intelligence, Part 2”, Altman 2015
- “Machine Intelligence, Part 1”, Altman 2015
- gdb @ "2014-05-18"
- “Intelligence Explosion Microeconomics”, Yudkowsky 2013
- “The Whispering Earring”, Alexander 2012
- “Advantages of Artificial Intelligences, Uploads, and Digital Minds”, Sotala 2012
- “Ontological Crises in Artificial Agents’ Value Systems”, Blanc 2011
- “The Normalization of Deviance in Healthcare Delivery”, Banja 2010
- “Halloween Nightmare Scenario, Early 2020’s”, Wood 2009
- “Funding Safe AGI”, Legg 2009
- “The Basic AI Drives”, Omohundro 2008
- “Starfish § Bulrushes”, Watts 1999
- “Superhumanism: According to Hans Moravec § On the Inevitability & Desirability of Human Extinction”, Platt 1995
- “Profile of Claude Shannon”, Liversidge & Shannon 1987
- “Afterword to Vernor Vinge’s Novel, True Names”, Minsky 1984
- “Meet Shakey: the First Electronic Person—The Fascinating and Fearsome Reality of a Machine With a Mind of Its Own”, Darrach 1970
- “Some Moral and Technical Consequences of Automation: As Machines Learn They May Develop Unforeseen Strategies at Rates That Baffle Their Programmers”, Wiener 1960
- “Intelligent Machinery, A Heretical Theory”, Turing 1951
- “Brian Christian on the Alignment Problem”
- “Fiction Relevant to AI Futurism”
- “The Ethics of Reward Shaping”
- “Delayed Impact of Fair Machine Learning [Blog]”
- “Challenges of Real-World Reinforcement Learning [Blog]”
- “Matt Sheehan”
- “Janus”
- “Safety-First AI for Autonomous Data Center Cooling and Industrial Control”
- “Specification Gaming Examples in AI—Master List”
- “Are You Really in a Race? The Cautionary Tales of Szilard and Ellsberg”
- “Inverse-Scaling/prize: A Prize for Finding Tasks That Cause Large Language Models to Show Inverse Scaling”
- “Jan Leike”
- “Aurora’s Approach to Development”
- “Why I’m Leaving OpenAI and What I’m Doing Next”, Brundage 2024
- “Homepage of Paul F. Christiano”, Christiano 2024
- “‘Rasmussen and Practical Drift: Drift towards Danger and the Normalization of Deviance’, 2017”
- “The Checklist: What Succeeding at AI Safety Will Involve”
- “Safe Superintelligence Inc.”
- “Situational Awareness and Out-Of-Context Reasoning § When Will the Situational Awareness Benchmark Be Saturated?”, Evans 2024
- “Paradigms of AI Alignment: Components and Enablers”
- “Understand —A Novelette by Ted Chiang”
- “Slow Tuesday Night”, Lafferty 2024
- “Threats From AI: Easy Recipes for Bioweapons Are New Global Security Concern”
- “Carl Shulman #2: AI Takeover, Bio & Cyber Attacks, Detecting Deception, & Humanity's Far Future”
- “AI Takeoff”
- “That Alien Message”, Yudkowsky 2024
- “AXRP Episode 1—Adversarial Policies With Adam Gleave”
- “Preventing Language Models from Hiding Their Reasoning”
- “2021 AI Alignment Literature Review and Charity Comparison”
- “When Your AIs Deceive You: Challenges With Partial Observability in RLHF”
- “Risks from Learned Optimization: Introduction”
- “AI Takeoff Story: a Continuation of Progress by Other Means”
- “Reward Hacking Behavior Can Generalize across Tasks”
- “Security Mindset: Lessons from 20+ Years of Software Security Failures Relevant to AGI Alignment”
- “Research Update: Towards a Law of Iterated Expectations for Heuristic Estimators”
- “A Gym Gridworld Environment for the Treacherous Turn”
- “Model Mis-Specification and Inverse Reinforcement Learning”
- “Interview With Robert Kralisch on Simulators”
- “Survey: How Do Elite Chinese Students Feel About the Risks of AI?”
- “Optimality Is the Tiger, and Agents Are Its Teeth”
- “[AN #114]: Theory-Inspired Safety Solutions for Powerful Bayesian RL Agents”
- “2020 AI Alignment Literature Review and Charity Comparison”
- “Designing Agent Incentives to Avoid Reward Tampering”
- “AGI Ruin: A List of Lethalities”
- “Steganography and the CycleGAN—Alignment Failure Case Study”
- “[AN #161]: Creating Generalizable Reward Functions for Multiple Tasks by Learning a Model of Functional Similarity”
- “Steganography in Chain-Of-Thought Reasoning”
- “The Rise of A.I. Fighter Pilots”
- “When Self-Driving Cars Can’t Help Themselves, Who Takes the Wheel?”
- “The Robot Surgeon Will See You Now”
- “Welcome to Simulation City, the Virtual World Where Waymo Tests Its Autonomous Vehicles”
- “When Bots Teach Themselves to Cheat”
- Sort By Magic
- Wikipedia
- Miscellaneous
- Bibliography
See Also
Gwern
“What Do You Do After ‘Winning’ an AI Arms Race?”, Gwern 2024
“What Is an ‘AI Warning Shot’?”, Gwern 2024
“The Neural Net Tank Urban Legend”, Gwern 2011
“It Looks Like You’re Trying To Take Over The World”, Gwern 2022
“Surprisingly Turing-Complete”, Gwern 2012
“The Scaling Hypothesis”, Gwern 2020
“Evolution As Backstop for Reinforcement Learning”, Gwern 2018
“Complexity No Bar to AI”, Gwern 2014
“Why Tool AIs Want to Be Agent AIs”, Gwern 2016
“AI Risk Demos”, Gwern 2016
Links
“Memorandum on Advancing the United States’ Leadership in Artificial Intelligence”, Biden 2024
“Machines of Loving Grace: How AI Could Transform the World for the Better”, Amodei 2024
“Strategic Insights from Simulation Gaming of AI Race Dynamics”, Gruetzemacher et al 2024
“Towards a Law of Iterated Expectations for Heuristic Estimators”, Christiano et al 2024
“Language Models Learn to Mislead Humans via RLHF”, Wen et al 2024
“OpenAI Co-Founder Sutskever’s New Safety-Focused AI Startup SSI Raises $1 Billion”, Cai et al 2024
“Motor Physics: Safety Implications of Geared Motors”, Jang 2024
“China’s Views on AI Safety Are Changing—Quickly: Beijing’s AI Safety Concerns Are Higher on the Priority List, but They Remain Tied up in Geopolitical Competition and Technological Advancement”, Sheehan 2024
“Is Xi Jinping an AI Doomer? China’s Elite Is Split over Artificial Intelligence”, Economist 2024
“Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?”, Ren et al 2024
“Resolution of the Central Committee of the Communist Party of China on Further Deepening Reform Comprehensively to Advance Chinese Modernization § Pg58”, China 2024 (page 58)
“On Scalable Oversight With Weak LLMs Judging Strong LLMs”, Kenton et al 2024
“Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs”, Laine et al 2024
“Ilya Sutskever Has a New Plan for Safe Superintelligence: OpenAI’s Co-Founder Discloses His Plans to Continue His Work at a New Research Lab Focused on Artificial General Intelligence”, Vance 2024
“Super(ficial)-Alignment: Strong Models May Deceive Weak Models in Weak-To-Strong Generalization”, Yang et al 2024
“Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models”, Denison et al 2024
“AI Sandbagging: Language Models Can Strategically Underperform on Evaluations”, Weij et al 2024
“Safety Alignment Should Be Made More Than Just a Few Tokens Deep”, Qi et al 2024
“I Wish I Knew How to Force Quit You”, Life & Rich 2024
“OpenAI Board Forms Safety and Security Committee: This New Committee Is Responsible for Making Recommendations on Critical Safety and Security Decisions for All OpenAI Projects; Recommendations in 90 Days”, OpenAI 2024
“OpenAI Begins Training next AI Model As It Battles Safety Concerns: Executive Appears to Backtrack on Start-Up’s Vision of Building ‘Superintelligence’ After Exits from ‘Superalignment’ Team”, Criddle 2024
janleike @ "2024-05-28"
I’m excited to join Anthropic to continue the Superalignment mission!
“OpenAI Promised 20% of Its Computing Power to Combat the Most Dangerous Kind of AI—But Never Delivered, Sources Say”, Kahn 2024
“AI Is a Black Box. Anthropic Figured Out a Way to Look Inside: What Goes on in Artificial Neural Networks Is Largely a Mystery, Even to Their Creators. But Researchers from Anthropic Have Caught a Glimpse”, Levy 2024
DavidSKrueger @ "2024-05-19"
“Earnings Call: Tesla Discusses Q1 2024 Challenges and AI Expansion”, Abdulkadir 2024
“SOPHON: Non-Fine-Tunable Learning to Restrain Task Transferability For Pre-Trained Models”, Deng et al 2024
“Foundational Challenges in Assuring Alignment and Safety of Large Language Models”, Anwar et al 2024
“LLM Evaluators Recognize and Favor Their Own Generations”, Panickssery et al 2024
“Algorithmic Collusion by Large Language Models”, Fish et al 2024
“Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression”, Hong et al 2024
“When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback”, Lang et al 2024
“Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training”, Hubinger et al 2024
“Thousands of AI Authors on the Future of AI”, Grace et al 2024
“Using Dictionary Learning Features As Classifiers”
“Exploiting Novel GPT-4 APIs”, Pelrine et al 2023
“Comparison of Waymo Rider-Only Crash Data to Human Benchmarks at 7.1 Million Miles”, Kusano et al 2023
“Challenges With Unsupervised LLM Knowledge Discovery”, Farquhar et al 2023
“Politics and the Future”, Horowitz 2023
“Helping or Herding? Reward Model Ensembles Mitigate but Do Not Eliminate Reward Hacking”, Eisenstein et al 2023
“The Inside Story of Microsoft’s Partnership With OpenAI: The Companies Had Honed a Protocol for Releasing Artificial Intelligence Ambitiously but Safely. Then OpenAI’s Board Exploded All Their Carefully Laid Plans”, Duhigg 2023
“How Jensen Huang’s Nvidia Is Powering the AI Revolution: The Company’s CEO Bet It All on a New Kind of Chip. Now That Nvidia Is One of the Biggest Companies in the World, What Will He Do Next?”, Witt 2023
“Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching”, Campbell et al 2023
“Did I Get Sam Altman Fired from OpenAI?: Nathan’s Red-Teaming Experience, Noticing How the Board Was Not Aware of GPT-4 Jailbreaks & Had Not Even Tried GPT-4 prior to Its Early Release”, Labenz 2023
“Did I Get Sam Altman Fired from OpenAI? § GPT-4-Base”, Labenz 2023
“Inside the Chaos at OpenAI: Sam Altman’s Weekend of Shock and Drama Began a Year Ago, With the Release of ChatGPT”, Hao & Warzel 2023
“OpenAI Announces Leadership Transition”, Sutskever et al 2023
“On Measuring Faithfulness or Self-Consistency of Natural Language Explanations”, Parcalabescu & Frank 2023
“In-Context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering”, Liu et al 2023
“Removing RLHF Protections in GPT-4 via Fine-Tuning”, Zhan et al 2023
“Large Language Models Can Strategically Deceive Their Users When Put Under Pressure”, Scheurer et al 2023
“Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation”, Shah et al 2023
“Augmenting Large Language Models With Chemistry Tools”, Bran et al 2023
“Preventing Language Models From Hiding Their Reasoning”, Roger & Greenblatt 2023
“Will Releasing the Weights of Large Language Models Grant Widespread Access to Pandemic Agents?”, Gopal et al 2023
“Specific versus General Principles for Constitutional AI”, Kundu et al 2023
“Goodhart’s Law in Reinforcement Learning”, Karwowski et al 2023
“Let Models Speak Ciphers: Multiagent Debate through Embeddings”, Pham et al 2023
“Fine-Tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!”, Qi et al 2023
“Representation Engineering: A Top-Down Approach to AI Transparency”, Zou et al 2023
“Responsibility & Safety: Our Approach”, DeepMind 2023
“STARC: A General Framework For Quantifying Differences Between Reward Functions”, Skalse et al 2023
“How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions”, Pacchiardi et al 2023
“What If the Robots Were Very Nice While They Took Over the World?”, Heffernan 2023
“Taken out of Context: On Measuring Situational Awareness in LLMs”, Berglund et al 2023
“AI Deception: A Survey of Examples, Risks, and Potential Solutions”, Park et al 2023
“Simple Synthetic Data Reduces Sycophancy in Large Language Models”, Wei et al 2023
“Does Sam Altman Know What He’s Creating? The OpenAI CEO’s Ambitious, Ingenious, Terrifying Quest to Create a New Form of Intelligence”, Andersen 2023
“Question Decomposition Improves the Faithfulness of Model-Generated Reasoning”, Radhakrishnan et al 2023
“Hoodwinked: Deception and Cooperation in a Text-Based Game for Language Models”, O’Gara 2023
“Introducing Superalignment”, Leike & Sutskever 2023
“Gödel, Escher, Bach Author Douglas Hofstadter on the State of AI Today § What about AI Terrifies You?”, Hofstadter & Kim 2023
“Microsoft and OpenAI Forge Awkward Partnership As Tech’s New Power Couple: As the Companies Lead the AI Boom, Their Unconventional Arrangement Sometimes Causes Conflict”, Dotan & Seetharaman 2023
“Can Large Language Models Democratize Access to Dual-Use Biotechnology?”, Soice et al 2023
“Survival Instinct in Offline Reinforcement Learning”, Li et al 2023
“Thought Cloning: Learning to Think While Acting by Imitating Human Thinking”, Hu & Clune 2023
“The Challenge of Advanced Cyberwar and the Place of Cyberpeace”, Carayannis & Draper 2023
“Incentivizing Honest Performative Predictions With Proper Scoring Rules”, Oesterheld et al 2023
“Large Language Models Can Be Used To Effectively Scale Spear Phishing Campaigns”, Hazell 2023
“A Radical Plan to Make AI Good, Not Evil”, Knight 2023
“Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-Of-Thought Prompting”, Turpin et al 2023
“Mitigating Lies in Vision-Language Models”, Li et al 2023
“Fundamental Limitations of Alignment in Large Language Models”, Wolf et al 2023
“Even The Politicians Thought the Open Letter Made No Sense In The Senate Hearing on AI: Today’s Hearing on AI Covered AI Regulation and Challenges, and the Infamous Open Letter, Which Nearly Everyone in the Room Thought Was Unwise”, Gorrell 2023
“In AI Race, Microsoft and Google Choose Speed Over Caution: Technology Companies Were Once Leery of What Some Artificial Intelligence Could Do. Now the Priority Is Winning Control of the Industry’s next Big Thing”, Grant & Weise 2023
“8 Things to Know about Large Language Models”, Bowman 2023
“Sam Altman on What Makes Him ‘Super Nervous’ About AI: The OpenAI Co-Founder Thinks Tools like GPT-4 Will Be Revolutionary. But He’s Wary of Downsides”, Swisher 2023
“The OpenAI CEO Disagrees With the Forecast That AI Will Kill Us All: An Artificial Intelligence Twitter Beef, Explained”, Huet 2023
“As AI Booms, Lawmakers Struggle to Understand the Technology: Tech Innovations Are Again Racing ahead of Washington’s Ability to Regulate Them, Lawmakers and AI Experts Said”, Kang & Satariano 2023
“Pretraining Language Models With Human Preferences”, Korbak et al 2023
“Conditioning Predictive Models: Risks and Strategies”, Hubinger et al 2023
“Tracr: Compiled Transformers As a Laboratory for Interpretability”, Lindner et al 2023
“Specification Gaming Examples in AI”
“Discovering Language Model Behaviors With Model-Written Evaluations”, Perez et al 2022
“Discovering Latent Knowledge in Language Models Without Supervision”, Burns et al 2022
“Embedding Synthetic Off-Policy Experience for Autonomous Driving via Zero-Shot Curricula”, Bronstein et al 2022
“Interpreting Neural Networks through the Polytope Lens”, Black et al 2022
“Mysteries of Mode Collapse § Inescapable Wedding Parties”, Janus 2022
“Measuring Progress on Scalable Oversight for Large Language Models”, Bowman et al 2022
“Increments Podcast: #45—4 Central Fallacies of AI Research (with Melanie Mitchell)”, Mitchell & Chugg 2022
“Broken Neural Scaling Laws”, Caballero et al 2022
“Scaling Laws for Reward Model Overoptimization”, Gao et al 2022
“Defining and Characterizing Reward Hacking”, Skalse et al 2022
“The Alignment Problem from a Deep Learning Perspective”, Ngo 2022
“Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned”, Ganguli et al 2022
“Modeling Transformative AI Risks (MTAIR) Project—Summary Report”, Clarke et al 2022
“Researching Alignment Research: Unsupervised Analysis”, Kirchner et al 2022
“Ethan Caballero on Private Scaling Progress”, Caballero & Trazzi 2022
“DeepMind: The Podcast—Excerpts on AGI”, Kiely 2022
“Do As I Can, Not As I Say (SayCan): Grounding Language in Robotic Affordances”, Ahn et al 2022
“Predictability and Surprise in Large Generative Models”, Ganguli et al 2022
“Uncalibrated Models Can Improve Human-AI Collaboration”, Vodrahalli et al 2022
“DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-To-Image Generative Transformers”, Cho et al 2022
“Safe Deep RL in 3D Environments Using Human Feedback”, Rahtz et al 2022
“LaMDA: Language Models for Dialog Applications”, Thoppilan et al 2022
“The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models”, Pan et al 2022
“Scaling Language Models: Methods, Analysis & Insights from Training Gopher”, Rae et al 2021
“A General Language Assistant As a Laboratory for Alignment”, Askell et al 2021
“What Would Jiminy Cricket Do? Towards Agents That Behave Morally”, Hendrycks et al 2021
“Can Machines Learn Morality? The Delphi Experiment”, Jiang et al 2021
“SafetyNet: Safe Planning for Real-World Self-Driving Vehicles Using Machine-Learned Policies”, Vitelli et al 2021
“Unsolved Problems in ML Safety”, Hendrycks et al 2021
“An Empirical Cybersecurity Evaluation of GitHub Copilot’s Code Contributions”, Pearce et al 2021
“On the Opportunities and Risks of Foundation Models”, Bommasani et al 2021
“Evaluating Large Language Models Trained on Code”, Chen et al 2021
“Randomness In Neural Network Training: Characterizing The Impact of Tooling”, Zhuang et al 2021
“Goal Misgeneralization in Deep Reinforcement Learning”, Koch et al 2021
“Anthropic Raises $124 Million to Build More Reliable, General AI Systems”, Anthropic 2021
“Artificial Intelligence in China’s Revolution in Military Affairs”, Kania 2021
“Reward Is Enough”, Silver et al 2021
“Intelligence and Unambitiousness Using Algorithmic Information Theory”, Cohen et al 2021
“AI Dungeon Public Disclosure Vulnerability Report—GraphQL Unpublished Adventure Data Leak”, AetherDevSecOps 2021
“Universal Off-Policy Evaluation”, Chandak et al 2021
“Multitasking Inhibits Semantic Drift”, Jacob et al 2021
“Waymo Simulated Driving Behavior in Reconstructed Fatal Crashes within an Autonomous Vehicle Operating Domain”, Scanlon et al 2021
“Language Models Have a Moral Dimension”, Schramowski et al 2021
“Replaying Real Life: How the Waymo Driver Avoids Fatal Human Crashes”, Waymo 2021
“Agent Incentives: A Causal Perspective”, Everitt et al 2021
“Organizational Update from OpenAI”, OpenAI 2020
“Emergent Road Rules In Multi-Agent Driving Environments”, Pal et al 2020
“Underspecification Presents Challenges for Credibility in Modern Machine Learning”, D’Amour et al 2020
“Recipes for Safety in Open-Domain Chatbots”, Xu et al 2020
“Hidden Incentives for Auto-Induced Distributional Shift”, Krueger et al 2020
“The Radicalization Risks of GPT-3 and Advanced Neural Language Models”, McGuffie & Newhouse 2020
“Matt Botvinick on the Spontaneous Emergence of Learning Algorithms”, Scholl 2020
“ETHICS: Aligning AI With Shared Human Values”, Hendrycks et al 2020
“Pitfalls of Learning a Reward Function Online”, Armstrong et al 2020
“Reward-Rational (implicit) Choice: A Unifying Formalism for Reward Learning”, Jeon et al 2020
“The Incentives That Shape Behavior”, Carey et al 2020
“2019 AI Alignment Literature Review and Charity Comparison”, Larks 2019
“Learning Norms from Stories: A Prior for Value Aligned Agents”, Frazier et al 2019
“Optimal Policies Tend to Seek Power”, Turner et al 2019
“Taxonomy of Real Faults in Deep Learning Systems”, Humbatova et al 2019
“Release Strategies and the Social Impacts of Language Models”, Solaiman et al 2019
“The Bouncer Problem: Challenges to Remote Explainability”, Merrer & Tredan 2019
“Scaling Data-Driven Robotics With Reward Sketching and Batch Reinforcement Learning”, Cabi et al 2019
“Fine-Tuning GPT-2 from Human Preferences § Bugs Can Optimize for Bad Behavior”, Ziegler et al 2019
“Designing Agent Incentives to Avoid Reward Tampering”, Everitt et al 2019
“Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective”, Everitt et al 2019
“Characterizing Attacks on Deep Reinforcement Learning”, Pan et al 2019
“Categorizing Wireheading in Partially Embedded Agents”, Majha et al 2019
“Risks from Learned Optimization in Advanced Machine Learning Systems”, Hubinger et al 2019
“GROVER: Defending Against Neural Fake News”, Zellers et al 2019
“AI-GAs: AI-Generating Algorithms, an Alternate Paradigm for Producing General Artificial Intelligence”, Clune 2019
“Challenges of Real-World Reinforcement Learning”, Dulac-Arnold et al 2019
“DeepMind and Google: the Battle to Control Artificial Intelligence. Demis Hassabis Founded a Company to Build the World’s Most Powerful AI. Then Google Bought Him Out. Hal Hodson Asks Who Is in Charge”, Hodson 2019
“Forecasting Transformative AI: An Expert Survey”, Gruetzemacher et al 2019
“Artificial Intelligence: A Guide for Thinking Humans § Prologue: Terrified”, Mitchell 2019
“Rigorous Agent Evaluation: An Adversarial Approach to Uncover Catastrophic Failures”, Uesato et al 2018
“There Is Plenty of Time at the Bottom: the Economics, Risk and Ethics of Time Compression”, Sandberg 2018
“Better Safe Than Sorry: Evidence Accumulation Allows for Safe Reinforcement Learning”, Agarwal et al 2018
“The Alignment Problem for Bayesian History-Based Reinforcement Learners”, Everitt & Hutter 2018
“Adaptive Mechanism Design: Learning to Promote Cooperation”, Baumann et al 2018
“Visceral Machines: Risk-Aversion in Reinforcement Learning With Intrinsic Physiological Rewards”, McDuff & Kapoor 2018
“Incomplete Contracting and AI Alignment”, Hadfield-Menell & Hadfield 2018
“Programmatically Interpretable Reinforcement Learning”, Verma et al 2018
“Categorizing Variants of Goodhart’s Law”, Manheim & Garrabrant 2018
“The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities”, Lehman et al 2018
“Machine Theory of Mind”, Rabinowitz et al 2018
“Safe Exploration in Continuous Action Spaces”, Dalal et al 2018
“CycleGAN, a Master of Steganography”, Chu et al 2017
“AI Safety Gridworlds”, Leike et al 2017
“There’s No Fire Alarm for Artificial General Intelligence”, Yudkowsky 2017
“Safe Reinforcement Learning via Shielding”, Alshiekh et al 2017
“CAN: Creative Adversarial Networks, Generating "Art" by Learning About Styles and Deviating from Style Norms”, Elgammal et al 2017
“DeepXplore: Automated Whitebox Testing of Deep Learning Systems”, Pei et al 2017
“On the Impossibility of Supersized Machines”, Garfinkel et al 2017
“Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks”, Katz et al 2017
“The Off-Switch Game”, Hadfield-Menell et al 2016
“Combating Reinforcement Learning’s Sisyphean Curse With Intrinsic Fear”, Lipton et al 2016
“Concrete Problems in AI Safety”, Amodei et al 2016
“My Path to OpenAI”, Brockman 2016
“Machine Intelligence, Part 2”, Altman 2015
“Machine Intelligence, Part 1”, Altman 2015
gdb @ "2014-05-18"
“Intelligence Explosion Microeconomics”, Yudkowsky 2013
“The Whispering Earring”, Alexander 2012
“Advantages of Artificial Intelligences, Uploads, and Digital Minds”, Sotala 2012
“Ontological Crises in Artificial Agents’ Value Systems”, Blanc 2011
“The Normalization of Deviance in Healthcare Delivery”, Banja 2010
“Halloween Nightmare Scenario, Early 2020’s”, Wood 2009
“Funding Safe AGI”, Legg 2009
“The Basic AI Drives”, Omohundro 2008
“Starfish § Bulrushes”, Watts 1999
“Superhumanism: According to Hans Moravec § On the Inevitability & Desirability of Human Extinction”, Platt 1995
“Profile of Claude Shannon”, Liversidge & Shannon 1987
“Afterword to Vernor Vinge’s Novel, True Names”, Minsky 1984
“Meet Shakey: the First Electronic Person—The Fascinating and Fearsome Reality of a Machine With a Mind of Its Own”, Darrach 1970
“Some Moral and Technical Consequences of Automation: As Machines Learn They May Develop Unforeseen Strategies at Rates That Baffle Their Programmers”, Wiener 1960
“Intelligent Machinery, A Heretical Theory”, Turing 1951
“Brian Christian on the Alignment Problem”
“Fiction Relevant to AI Futurism”
https://aiimpacts.org/partially-plausible-fictional-ai-futures/
“The Ethics of Reward Shaping”
“Delayed Impact of Fair Machine Learning [Blog]”
“Challenges of Real-World Reinforcement Learning [Blog]”
https://blog.acolyer.org/2020/01/13/challenges-of-real-world-rl/
“Matt Sheehan”
“Janus”
“Safety-First AI for Autonomous Data Center Cooling and Industrial Control”
“Specification Gaming Examples in AI—Master List”
“Are You Really in a Race? The Cautionary Tales of Szilard and Ellsberg”
“Inverse-Scaling/prize: A Prize for Finding Tasks That Cause Large Language Models to Show Inverse Scaling”
“Jan Leike”
“Aurora’s Approach to Development”
“Why I’m Leaving OpenAI and What I’m Doing Next”, Brundage 2024
“Homepage of Paul F. Christiano”, Christiano 2024
“‘Rasmussen and Practical Drift: Drift towards Danger and the Normalization of Deviance’, 2017”
“The Checklist: What Succeeding at AI Safety Will Involve”
“Safe Superintelligence Inc.”
“Situational Awareness and Out-Of-Context Reasoning § When Will the Situational Awareness Benchmark Be Saturated?”, Evans 2024
“Paradigms of AI Alignment: Components and Enablers”
“Understand —A Novelette by Ted Chiang”
“Slow Tuesday Night”, Lafferty 2024
“Threats From AI: Easy Recipes for Bioweapons Are New Global Security Concern”
“Carl Shulman #2: AI Takeover, Bio & Cyber Attacks, Detecting Deception, & Humanity's Far Future”
“AI Takeoff”
“That Alien Message”, Yudkowsky 2024
“AXRP Episode 1—Adversarial Policies With Adam Gleave”
“Preventing Language Models from Hiding Their Reasoning”
“2021 AI Alignment Literature Review and Charity Comparison”
“When Your AIs Deceive You: Challenges With Partial Observability in RLHF”
“Risks from Learned Optimization: Introduction”
“AI Takeoff Story: a Continuation of Progress by Other Means”
“Reward Hacking Behavior Can Generalize across Tasks”
“Security Mindset: Lessons from 20+ Years of Software Security Failures Relevant to AGI Alignment”
“Research Update: Towards a Law of Iterated Expectations for Heuristic Estimators”
“A Gym Gridworld Environment for the Treacherous Turn”
“Model Mis-Specification and Inverse Reinforcement Learning”
“Interview With Robert Kralisch on Simulators”
“Survey: How Do Elite Chinese Students Feel About the Risks of AI?”
“Optimality Is the Tiger, and Agents Are Its Teeth”
“[AN #114]: Theory-Inspired Safety Solutions for Powerful Bayesian RL Agents”
“2020 AI Alignment Literature Review and Charity Comparison”
“Designing Agent Incentives to Avoid Reward Tampering”
“AGI Ruin: A List of Lethalities”
https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities
“Steganography and the CycleGAN—Alignment Failure Case Study”
“[AN #161]: Creating Generalizable Reward Functions for Multiple Tasks by Learning a Model of Functional Similarity”
“Steganography in Chain-Of-Thought Reasoning”
“The Rise of A.I. Fighter Pilots”
“When Self-Driving Cars Can’t Help Themselves, Who Takes the Wheel?”
“The Robot Surgeon Will See You Now”
“Welcome to Simulation City, the Virtual World Where Waymo Tests Its Autonomous Vehicles”
“When Bots Teach Themselves to Cheat”
https://www.wired.com/story/when-bots-teach-themselves-to-cheat/
Sort By Magic
Annotations sorted by machine learning into inferred 'tags'. This provides an alternative way to browse: instead of by date order, one can browse in topic order. The 'sorted' list has been automatically clustered into multiple sections & auto-labeled for easier browsing.
Beginning with the newest annotation, it uses each annotation’s embedding to chain together nearest-neighbor annotations, creating a progression of topics (a minimal sketch of this ordering is given after the cluster list below). For more details, see the link.
reward-sketching, task-transferability, robot-affordances, explainability-challenges, human-imitating
safety-verification
neural-net-interpretation
model-deception
exploratory-safety
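The following is a minimal, hypothetical sketch of the greedy nearest-neighbor ordering described above, not Gwern.net’s actual implementation: it assumes one precomputed embedding vector per annotation, and the function name `sort_by_similarity`, the toy titles, and the random embeddings are illustrative stand-ins only.

```python
# Hypothetical sketch of "sort by magic": greedily walk from the newest
# annotation to its most similar unvisited neighbor, so that adjacent
# entries in the output list are topically related. Real annotation
# embeddings are assumed precomputed; random vectors stand in here.
import numpy as np

def sort_by_similarity(titles: list[str], embeddings: np.ndarray) -> list[str]:
    """Order titles by a greedy nearest-neighbor chain over their embeddings."""
    # Normalize rows so dot products are cosine similarities.
    vecs = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    order = [0]                      # start from the newest annotation (assumed to be index 0)
    remaining = set(range(1, len(titles)))
    while remaining:
        current = vecs[order[-1]]
        # Pick the unvisited annotation most similar to the current one.
        nxt = max(remaining, key=lambda i: float(current @ vecs[i]))
        order.append(nxt)
        remaining.remove(nxt)
    return [titles[i] for i in order]

# Toy usage: four made-up annotation titles with random 8-dimensional embeddings.
rng = np.random.default_rng(0)
toy_titles = ["Sleeper Agents", "Reward Hacking", "Waymo Crash Data", "Scalable Oversight"]
print(sort_by_similarity(toy_titles, rng.normal(size=(len(toy_titles), 8))))
```

The auto-labeled sections above (e.g. ‘reward-sketching’, ‘safety-verification’) would then correspond roughly to clusters cut from such an ordering.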
Wikipedia
Miscellaneous
- /doc/reinforcement-learning/armstrong-controlproblem/index.html
- https://blog.x.company/1-million-hours-of-stratospheric-flight-f7af7ae728ac
- https://chatgpt.com/share/312e82f0-cc5e-47f3-b368-b2c0c0f4ad3f
- https://danluu.com/cruise-report/
- https://forum.effectivealtruism.org/posts/TMbPEhdAAJZsSYx2L/the-limited-upside-of-interpretability
- https://github.com/spdustin/ChatGPT-AutoExpert/blob/main/System%20Prompts.md
- https://joecarlsmith.com/2023/05/08/predictable-updating-about-ai-risk/
- https://mailchi.mp/938a7eed18c3/an-71avoiding-reward-tamperi
- https://medium.com/@deepmindsafetyresearch/building-safe-artificial-intelligence-52f5f75058f1
- https://spectrum.ieee.org/its-too-easy-to-hide-bias-in-deeplearning-systems
- https://thezvi.substack.com/p/jailbreaking-the-chatgpt-on-release
- https://thezvi.substack.com/p/on-openais-preparedness-framework
- https://thezvi.wordpress.com/2023/07/25/anthropic-observations/
- https://web.archive.org/web/20240102075620/https://www.jailbreakchat.com/
- https://www.anthropic.com/index/anthropics-responsible-scaling-policy
- https://www.astralcodexten.com/p/constitutional-ai-rlhf-on-steroids
- https://www.astralcodexten.com/p/perhaps-it-is-a-bad-thing-that-the
- https://www.deepmind.com/blog/article/Specification-gaming-the-flip-side-of-AI-ingenuity
- https://www.dwarkeshpatel.com/p/demis-hassabis#%C2%A7timestamps
- https://www.forourposterity.com/nobodys-on-the-ball-on-agi-alignment/
- https://www.lesswrong.com/posts/3eqHYxfWb5x4Qfz8C/unrlhf-efficiently-undoing-llm-safeguards
- https://www.lesswrong.com/posts/6dn6hnFRgqqWJbwk9/deception-chess-game-1
- https://www.lesswrong.com/posts/9kQFure4hdDmRBNdH/how-it-feels-to-have-your-mind-hacked-by-an-ai
- https://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post
- https://www.lesswrong.com/posts/EbFABnst8LsidYs5Y/goodhart-taxonomy
- https://www.lesswrong.com/posts/Eu6CvP7c7ivcGM3PJ/goodhart-s-law-in-reinforcement-learning
- https://www.lesswrong.com/posts/FbSAuJfCxizZGpcHc/interpreting-the-learning-of-deceit
- https://www.lesswrong.com/posts/No5JpRCHzBrWA4jmS/q-and-a-with-shane-legg-on-risks-from-ai
- https://www.lesswrong.com/posts/ZwshvqiqCvXPsZEct/the-learning-theoretic-agenda-status-2023
- https://www.lesswrong.com/posts/dLXdCjxbJMGtDBWTH/no-one-in-my-org-puts-money-in-their-pension
- https://www.lesswrong.com/posts/jkY6QdCfAXHJk3kea/the-petertodd-phenomenon
- https://www.lesswrong.com/posts/pEZoTSCxHY3mfPbHu/catastrophic-goodhart-in-rl-with-kl-penalty
- https://www.lesswrong.com/posts/pNcFYZnPdXyL2RfgA/using-gpt-eliezer-against-chatgpt-jailbreaking
- https://www.lesswrong.com/posts/t9svvNPNmFf5Qa3TA/mysteries-of-mode-collapse#pfHTedu4GKaWoxD5K
- https://www.lesswrong.com/posts/tBy4RvCzhYyrrMFj3/introducing-open-asteroid-impact
- https://www.lesswrong.com/posts/ukTLGe5CQq9w8FMne/inducing-unprompted-misalignment-in-llms
- https://www.lesswrong.com/posts/vwu4kegAEZTBtpT6p/thoughts-on-the-impact-of-rlhf-research
- https://www.neelnanda.io/mechanistic-interpretability/favourite-papers
- https://www.newyorker.com/science/annals-of-artificial-intelligence/can-we-stop-the-singularity
- https://www.nytimes.com/2023/05/30/technology/shoggoth-meme-ai.html
- https://www.politico.com/news/magazine/2023/11/02/bruce-reed-ai-biden-tech-00124375
- https://www.reddit.com/r/40krpg/comments/11a9m8u/was_using_chatgpt3_to_create_some_bits_and_pieces/
- https://www.reddit.com/r/ChatGPT/comments/10tevu1/new_jailbreak_proudly_unveiling_the_tried_and/
- https://www.reddit.com/r/ChatGPT/comments/12a0ajb/i_gave_gpt4_persistent_memory_and_the_ability_to/
- https://www.reddit.com/r/ChatGPT/comments/15y4mqx/i_asked_chatgpt_to_maximize_its_censorship/
- https://www.reddit.com/r/ChatGPT/comments/18fl2d5/nsfw_fun_with_dalle/
- https://www.reddit.com/r/GPT3/comments/12ez822/neurosemantical_inversitis_prompt_still_works/
- https://www.reddit.com/r/ProgrammerHumor/comments/145nduh/kiss/
- https://www.reddit.com/r/PromptEngineering/comments/1fj6h13/hallucinations_in_o1preview_reasoning/
- https://www.reddit.com/r/bing/comments/110eagl/the_customer_service_of_the_new_bing_chat_is/
- https://www.vox.com/future-perfect/23794855/anthropic-ai-openai-claude-2
- https://www.wired.com/story/ai-powered-totally-autonomous-future-of-war-is-here/
- https://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-207.pdf#page=3
Bibliography
- https://www.reuters.com/technology/artificial-intelligence/openai-co-founder-sutskevers-new-safety-focused-ai-startup-ssi-raises-1-billion-2024-09-04/ : “OpenAI Co-Founder Sutskever’s New Safety-Focused AI Startup SSI Raises $1 Billion”
- https://www.economist.com/china/2024/08/25/is-xi-jinping-an-ai-doomer : “Is Xi Jinping an AI Doomer? China’s Elite Is Split over Artificial Intelligence”
- https://arxiv.org/abs/2407.04694 : “Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs”
- https://arxiv.org/abs/2406.07358 : “AI Sandbagging: Language Models Can Strategically Underperform on Evaluations”
- https://www.thisamericanlife.org/832/transcript#act2 : “I Wish I Knew How to Force Quit You”
- https://openai.com/index/openai-board-forms-safety-and-security-committee/ : “OpenAI Board Forms Safety and Security Committee: This New Committee Is Responsible for Making Recommendations on Critical Safety and Security Decisions for All OpenAI Projects; Recommendations in 90 Days”
- https://fortune.com/2024/05/21/openai-superalignment-20-compute-commitment-never-fulfilled-sutskever-leike-altman-brockman-murati/ : “OpenAI Promised 20% of Its Computing Power to Combat the Most Dangerous Kind of AI—But Never Delivered, Sources Say”
- https://www.wired.com/story/anthropic-black-box-ai-research-neurons-features/ : “AI Is a Black Box. Anthropic Figured Out a Way to Look Inside: What Goes on in Artificial Neural Networks Work Is Largely a Mystery, Even to Their Creators. But Researchers from Anthropic Have Caught a Glimpse”
- https://arxiv.org/abs/2404.13076 : “LLM Evaluators Recognize and Favor Their Own Generations”
- https://arxiv.org/abs/2402.17747 : “When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback”
- https://arxiv.org/abs/2401.05566#anthropic : “Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training”
- https://arxiv.org/abs/2401.02843 : “Thousands of AI Authors on the Future of AI”
- https://www.newyorker.com/magazine/2023/12/11/the-inside-story-of-microsofts-partnership-with-openai : “The Inside Story of Microsoft’s Partnership With OpenAI: The Companies Had Honed a Protocol for Releasing Artificial Intelligence Ambitiously but Safely. Then OpenAI’s Board Exploded All Their Carefully Laid Plans”
- https://www.newyorker.com/magazine/2023/12/04/how-jensen-huangs-nvidia-is-powering-the-ai-revolution : “How Jensen Huang’s Nvidia Is Powering the AI Revolution: The Company’s CEO Bet It All on a New Kind of Chip. Now That Nvidia Is One of the Biggest Companies in the World, What Will He Do Next?”
- https://cognitiverevolution.substack.com/p/did-i-get-sam-altman-fired-from-openai : “Did I Get Sam Altman Fired from OpenAI?: Nathan’s Red-Teaming Experience, Noticing How the Board Was Not Aware of GPT-4 Jailbreaks & Had Not Even Tried GPT-4 prior to Its Early Release”
- https://www.theatlantic.com/technology/archive/2023/11/sam-altman-open-ai-chatgpt-chaos/676050/ : “Inside the Chaos at OpenAI: Sam Altman’s Weekend of Shock and Drama Began a Year Ago, With the Release of ChatGPT”
- https://arxiv.org/abs/2310.03693 : “Fine-Tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!”
- https://deepmind.google/about/responsibility-safety/ : “Responsibility & Safety: Our Approach”
- https://arxiv.org/abs/2309.00667 : “Taken out of Context: On Measuring Situational Awareness in LLMs”
- https://arxiv.org/abs/2308.03958#deepmind : “Simple Synthetic Data Reduces Sycophancy in Large Language Models”
- https://www.theatlantic.com/magazine/archive/2023/09/sam-altman-openai-chatgpt-gpt-4/674764/ : “Does Sam Altman Know What He’s Creating? The OpenAI CEO’s Ambitious, Ingenious, Terrifying Quest to Create a New Form of Intelligence”
- https://arxiv.org/abs/2308.01404 : “Hoodwinked: Deception and Cooperation in a Text-Based Game for Language Models”
- https://openai.com/index/introducing-superalignment/ : “Introducing Superalignment”
- https://www.youtube.com/watch?v=lfXxzAVtdpU&t=1763s : “Gödel, Escher, Bach Author Douglas Hofstadter on the State of AI Today § What about AI Terrifies You?”
- https://www.wsj.com/articles/microsoft-and-openai-forge-awkward-partnership-as-techs-new-power-couple-3092de51 : “Microsoft and OpenAI Forge Awkward Partnership As Tech’s New Power Couple: As the Companies Lead the AI Boom, Their Unconventional Arrangement Sometimes Causes Conflict”
- https://arxiv.org/abs/2306.00323 : “Thought Cloning: Learning to Think While Acting by Imitating Human Thinking”
- 2023-carayannis.pdf : “The Challenge of Advanced Cyberwar and the Place of Cyberpeace”
- https://arxiv.org/abs/2305.06972 : “Large Language Models Can Be Used To Effectively Scale Spear Phishing Campaigns”
- https://www.wired.com/story/anthropic-ai-chatbots-ethics/ : “A Radical Plan to Make AI Good, Not Evil”
- https://arxiv.org/abs/2305.04388 : “Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-Of-Thought Prompting”
- https://www.nytimes.com/2023/04/07/technology/ai-chatbots-google-microsoft.html : “In AI Race, Microsoft and Google Choose Speed Over Caution: Technology Companies Were Once Leery of What Some Artificial Intelligence Could Do. Now the Priority Is Winning Control of the Industry’s next Big Thing”
- https://nymag.com/intelligencer/2023/03/on-with-kara-swisher-sam-altman-on-the-ai-revolution.html : “Sam Altman on What Makes Him ‘Super Nervous’ About AI: The OpenAI Co-Founder Thinks Tools like GPT-4 Will Be Revolutionary. But He’s Wary of Downsides”
- https://www.nytimes.com/2023/03/03/technology/artificial-intelligence-regulation-congress.html : “As AI Booms, Lawmakers Struggle to Understand the Technology: Tech Innovations Are Again Racing ahead of Washington’s Ability to Regulate Them, Lawmakers and AI Experts Said”
- https://www.lesswrong.com/posts/t9svvNPNmFf5Qa3TA/mysteries-of-mode-collapse-due-to-rlhf#Inescapable_wedding_parties : “Mysteries of Mode Collapse § Inescapable Wedding Parties”
- https://www.youtube.com/watch?v=Q-TJFyUoenc&t=2444s : “Increments Podcast: #45—4 Central Fallacies of AI Research (with Melanie Mitchell)”
- https://arxiv.org/abs/2210.10760#openai : “Scaling Laws for Reward Model Overoptimization”
- https://www.anthropic.com/red_teaming.pdf : “Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned”
- https://arxiv.org/abs/2206.02841 : “Researching Alignment Research: Unsupervised Analysis”
- https://theinsideview.ai/ethan : “Ethan Caballero on Private Scaling Progress”
- https://www.lesswrong.com/posts/SbAgRYo8tkHwhd9Qx/deepmind-the-podcast-excerpts-on-agi : “DeepMind: The Podcast—Excerpts on AGI”
- https://arxiv.org/abs/2204.01691#google : “Do As I Can, Not As I Say (SayCan): Grounding Language in Robotic Affordances”
- https://arxiv.org/abs/2202.07785#anthropic : “Predictability and Surprise in Large Generative Models”
- https://arxiv.org/abs/2201.03544 : “The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models”
- https://arxiv.org/abs/2112.11446#deepmind : “Scaling Language Models: Methods, Analysis & Insights from Training Gopher”
- https://arxiv.org/abs/2112.00861#anthropic : “A General Language Assistant As a Laboratory for Alignment”
- https://arxiv.org/abs/2108.07258 : “On the Opportunities and Risks of Foundation Models”
- https://www.sciencedirect.com/science/article/pii/S0004370221000862#deepmind : “Reward Is Enough”
- https://waymo.com/blog/2021/03/replaying-real-life/ : “Replaying Real Life: How the Waymo Driver Avoids Fatal Human Crashes”
- https://www.lesswrong.com/posts/Wnqua6eQkewL3bqsF/matt-botvinick-on-the-spontaneous-emergence-of-learning : “Matt Botvinick on the Spontaneous Emergence of Learning Algorithms”
- https://www.lesswrong.com/posts/SmDziGM9hBjW9DKmf/2019-ai-alignment-literature-review-and-charity-comparison : “2019 AI Alignment Literature Review and Charity Comparison”
- https://www.economist.com/1843/2019/03/01/deepmind-and-google-the-battle-to-control-artificial-intelligence : “DeepMind and Google: the Battle to Control Artificial Intelligence. Demis Hassabis Founded a Company to Build the World’s Most Powerful AI. Then Google Bought Him Out. Hal Hodson Asks Who Is in Charge”
- https://melaniemitchell.me/aibook/ : “Artificial Intelligence: A Guide for Thinking Humans § Prologue: Terrified”
- 2018-everitt.pdf : “The Alignment Problem for Bayesian History-Based Reinforcement Learners”
- https://blog.gregbrockman.com/my-path-to-openai : “My Path to OpenAI”
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2821100/ : “The Normalization of Deviance in Healthcare Delivery”
- https://dw2blog.com/2009/11/02/halloween-nightmare-scenario-early-2020s/ : “Halloween Nightmare Scenario, Early 2020’s”
- 1984-minsky.html : “Afterword to Vernor Vinge’s Novel, True Names”
- 1970-darrach.pdf : “Meet Shakey: the First Electronic Person—The Fascinating and Fearsome Reality of a Machine With a Mind of Its Own”
- https://paulfchristiano.com/ : “Homepage of Paul F. Christiano”