See Also

Links
- “A Careful Examination of Large Language Model Performance on Grade School Arithmetic”, Zhang et al 2024
- “From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples”, Vacareanu et al 2024
- “VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?”, Liu et al 2024
- “FABLES: Evaluating Faithfulness and Content Selection in Book-Length Summarization”, Kim et al 2024
- “ArtPrompt: ASCII Art-Based Jailbreak Attacks against Aligned LLMs”, Jiang et al 2024
- “Using Hallucinations to Bypass GPT-4’s Filter”, Lemkin 2024
- “Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training”, Hubinger et al 2024
- “Summon a Demon and Bind It: A Grounded Theory of LLM Red Teaming in the Wild”, Inie et al 2023
- “Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation”, Shah et al 2023
- “Specific versus General Principles for Constitutional AI”, Kundu et al 2023
- “PAIR: Jailbreaking Black Box Large Language Models in 20 Queries”, Chao et al 2023
- “Beyond Memorization: Violating Privacy Via Inference With Large Language Models”, Staab et al 2023
- “SWE-Bench: Can Language Models Resolve Real-World GitHub Issues?”, Jimenez et al 2023
- “Lost in the Middle: How Language Models Use Long Contexts”, Liu et al 2023
- “Opportunities and Risks of LLMs for Scalable Deliberation With Polis”, Small et al 2023
- “Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-Of-Thought Prompting”, Turpin et al 2023
- “A Radical Plan to Make AI Good, Not Evil”, Knight 2023
- “Constitutional AI: Harmlessness from AI Feedback”, Bai et al 2022
- “Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned”, Ganguli et al 2022
- “A General Language Assistant As a Laboratory for Alignment”, Askell et al 2021
- “The Perception of Rhythm in Language”, Cutler 1994
Miscellaneous
- https://docs.parea.ai/blog/benchmarking-anthropic-beta-tool-use
- https://marginalrevolution.com/marginalrevolution/2023/01/ai-passes-law-and-economics-exam.html
- https://nostalgebraist.tumblr.com/post/728556535745232896/claude-is-insufferable
- https://simonwillison.net/2024/Apr/17/ai-for-data-journalism/
- https://thezvi.wordpress.com/2023/07/25/anthropic-observations/
- https://twitter.com/AIPanicLive/status/1678942781174161409
- https://twitter.com/AnthonyLeeZhang/status/1768639726557209082
- https://twitter.com/IntuitMachine/status/1678870325600108545
- https://twitter.com/IntuitMachine/status/1766205754304827407
- https://twitter.com/LouisKnightWebb/status/1724510794514157668
- https://twitter.com/OwainEvans_UK/status/1636580251676585986
- https://twitter.com/OwainEvans_UK/status/1636581594642403328
- https://twitter.com/OwainEvans_UK/status/1636605571637055488
- https://twitter.com/OwainEvans_UK/status/1636762386085605376
- https://twitter.com/VictorTaelin/status/1768070973515800931
- https://twitter.com/alexalbert__/status/1780707227130863674
- https://twitter.com/amandaaskell/status/1765207842993434880
- https://twitter.com/anton_bakhtin/status/1764701559844147359
- https://twitter.com/daniel_271828/status/1769853886163296455
- https://twitter.com/elder_plinius/status/1774220858711490909
- https://twitter.com/futuristfrog/status/1777063159553040700
- https://twitter.com/fxturevescent/status/1776456827741323323
- https://twitter.com/jeremyphoward/status/1765529891343339804
- https://twitter.com/jeremyphoward/status/1779311134656671872
- https://twitter.com/kindgracekind/status/1770671231190127090
- https://twitter.com/mattshumer_/status/1766157714411942055
- https://twitter.com/metachirality/status/1769818226718888426
- https://twitter.com/metachirality/status/1769905644725830090
- https://twitter.com/peligrietzer/status/1678912319743459328
- https://verse.systems/blog/post/2024-03-09-using-llms-to-generate-fuzz-generators/
- https://www.lesswrong.com/posts/R3eDrDoX8LisKgGZe/sum-threshold-attacks?commentId=yqCkCQLkkaCnZCukJ
- https://www.maximumtruth.org/p/ais-ranked-by-iq-ai-passes-100-iq
- https://www.reddit.com/r/OpenAI/comments/1bm305k/what_the_hell_claud_3_opus_is_a_straight/
- https://www.vox.com/future-perfect/23794855/anthropic-ai-openai-claude-2
- https://xmarquez.github.io/GPTDemocracyIndex/GPTDemocracyIndex.html
Link Bibliography
- https://arxiv.org/abs/2405.00332#scale : “A Careful Examination of Large Language Model Performance on Grade School Arithmetic”, Zhang et al 2024
- https://arxiv.org/abs/2402.11753 : “ArtPrompt: ASCII Art-Based Jailbreak Attacks against Aligned LLMs”, Jiang et al 2024
- https://arxiv.org/abs/2401.05566#anthropic : “Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training”, Hubinger et al 2024
- https://arxiv.org/abs/2310.08419 : “PAIR: Jailbreaking Black Box Large Language Models in 20 Queries”, Chao et al 2023
- https://arxiv.org/abs/2305.04388 : “Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-Of-Thought Prompting”, Turpin et al 2023
- https://www.wired.com/story/anthropic-ai-chatbots-ethics/ : “A Radical Plan to Make AI Good, Not Evil”, Knight 2023
- https://www.anthropic.com/red_teaming.pdf : “Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned”, Ganguli et al 2022
- https://arxiv.org/abs/2112.00861#anthropic : “A General Language Assistant As a Laboratory for Alignment”, Askell et al 2021