‘GPT-4 nonfiction’ directory
- See Also
- Gwern
- “ChatGPT-O3: Website Design Feedback Ideas [Total Site Review Confabulation] ”, GPT-o3 & Gwern 2025
- “LLM Challenge: Write Non-Biblical Sentences ”, Gwern 2024
- “Abs-E, Or, Speak Only Positively ”, Gwern 2024
- “text2epositive.py”, Gwern 2024
- “date-Guesser.py”, Gwern 2024
- “paragraphizer.py”, Gwern 2022
- “CQK Is The First Unused TLA ”, Gwern 2023
- Links
- “Strategic Intelligence in Large Language Models: Evidence from Evolutionary Game Theory ”, Payne & Alloui-Cros 2025
- “Early Signs of Steganographic Capabilities in Frontier LLMs ”, Zolkowski et al 2025
- “Details about METR’s Preliminary [Coding] Evaluation of DeepSeek and Qwen Models ”, METR 2025
- “ChatGPT O3-Pro: Version of O3 With More Compute for Better Responses ”, OpenAI 2025
- “From Tool to Teammate: A Randomized Controlled Trial of Clinician-AI Collaborative Workflows for Diagnosis ”, Everett et al 2025
- “Beyond Benchmark Scores: Analyzing O3-Mini’s Mathematical Reasoning ”, Ho et al 2025
- “How Does O3 Guess Latitude From Photos? ”
- “VideoGameBench: Can Vision-Language Models Complete Popular Video Games? ”, Zhang et al 2025
- “How I Used GPT-4-O3 to Find CVE-2025-37899, a Remote Zero-Day Vulnerability in the Linux Kernel’s SMB Implementation ”, Heelan 2025
- “What ChatGPT Knows about My Account [JSON Export] ”, Gwern 2025
- “I Really Don’t like ChatGPT’s New Memory Dossier ”, Willison 2025
- “RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics ”, Zhang et al 2025
- “RealMath [Code] ”, Zhang et al 2025
- “Introducing Codex: A Cloud-Based Software Engineering Agent That Can Work on Many Tasks in Parallel, Powered by codex-1”, OpenAI 2025
- “Revealing Economic Facts: LLMs Know More Than They Say ”, Buckmann et al 2025
- “Measuring General Intelligence With Generated Games ”, Verma et al 2025
- “Is ChatGPT Actually Fixed Now? I Tested ChatGPT’s Sycophancy, and the Results Were ... Extremely Weird. We’re a Long Way from Making AI Behave. ”, Adler 2025
- “Highlights From The Comments On AI Geoguessr ”, Alexander 2025
- “[The Letter ‘G’ in ‘Strawberry’] ”, Breadd007 2025
- “Rampant AI Cheating Is Ruining Education Alarmingly Fast: ChatGPT Has Unraveled the Entire Academic Project ”, Walsh 2025
- “How ChatGPT Remembers You: A Deep Dive into Its Memory and Chat History Features ”, wunderwuzzi 2025
- “The Other Sharks Out There [LLM-Powered Copyright Link-Spam Fraud] ”, Slifkin 2025
- “Expanding on What We Missed With Sycophancy: A Deeper Dive on Our Findings, What Went Wrong, and Future Changes We’re Making ”, OpenAI 2025
- “Testing AI’s GeoGuessr Genius: Seeing a World in a Grain of Sand ”, Alexander 2025
- “Is AI Enhancing Education or Replacing It? Technology Should Facilitate Learning, Not Substitute for It ”, Shirky 2025
- “ChatGPT Induced Psychosis: Serious Replies Only ”, Zestyclementinejuice 2025
- “GPT-O3 Beats a Master-Level Geoguessr Player—Even With Fake EXIF Data ”, Patterson 2025
- “Shifting Work Patterns With Generative AI ”, Dillon et al 2025
- “Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark ”, Götting et al 2025
- “Investigating Truthfulness in a Pre-Release GPT-O3 Model ”, Chowdhury et al 2025
- “[Pseudo-Jailbreaks] ”
- elidourado @ "2025-04-08"
- “The Curve Is Bending ”, Slatton 2025
- “How Deep Do Large Language Models Internalize Scientific Literature and Citation Practices? ”, Algaba et al 2025
- “Mike Lindell’s Lawyers Used AI to Write Brief ”
- “Large Language Models Pass the Turing Test ”, Jones & Bergen 2025
- “Why Does Claude Speak Byzantine Music Notation? ”, Finke 2025
- “Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad ”, Petrov et al 2025
- “Deep Research: Supermajority Laws around the States ”
- “Obscure Scientific Facts Benchmark ”, Azulay 2025
- “Spontaneous Giving and Calculated Greed in Language Models ”, Li & Shirado 2025
- “None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks ”, Salido et al 2025
- “Idiosyncrasies in Large Language Models ”, Sun et al 2025
- “VLMs As GeoGuessr Masters: Exceptional Performance, Hidden Biases, and Privacy Risks ”, Huang et al 2025
- “SycEval: Evaluating LLM Sycophancy ”, Fanous et al 2025
- “DS R1 Is Not on Par With OA O1, and the Difference Is Qualitative, Not Quantitative: Long-Tail Benchmarks Reveal Gaps ”, Polshkov 2025
- “Deep Research Dispatch: OpenAI’s Answers to Your Questions [Crowdsourcing DR Samples] ”, Griffing 2025
- “Lost in Time: Clock and Calendar Understanding Challenges in Multimodal LLMs ”, Saxena et al 2025
- “Do Large Language Model Benchmarks Test Reliability? ”, Vendrow et al 2025
- “What Will AI Do to pre-Research/research? AI Makes Doing and Communicating Research Much Easier. Will There Be Any Point to It? ”, Gans 2025
- “The Efficient Market Hypothesis When Time Travel Is Possible ”, Gans & o1-pro 2025
- “WILLIAM A., a Student, by and through His Parents, E.A. and C.A. v. CLARKSVILLE-MONTGOMERY COUNTY SCHOOL SYSTEM ”, Sutton et al 2025
- “Competitive Programming With Large Reasoning Models ”, El-Kishky et al 2025
- “AI Language Model Rivals Expert Ethicist in Perceived Moral Expertise ”, Dillion et al 2025
- “Introducing Deep Research: An Agent That Uses Reasoning to Synthesize Large Amounts of Online Information and Complete Multi-Step Research Tasks for You. Available to Pro Users Today, Plus and Team Next ”, OpenAI 2025
- “Large Language Models Think Too Fast To Explore Effectively ”, Pan et al 2025
- “The Impact of Generative AI on Critical Thinking: Self-Reported Reductions in Cognitive Effort and Confidence Effects From a Survey of Knowledge Workers ”, Lee et al 2025
- “How Different LLMs Answered the PhilPapers 2020 Survey ”, Satron 2025
- “People Who Frequently Use ChatGPT for Writing Tasks Are Accurate and Robust Detectors of AI-Generated Text ”, Russell et al 2025
- “Diving into the Underlying Rules or Abstractions in GPT-4 O3’s 34 ARC-AGI Failures ”, mace 2025
- “A Novel Emergence of Meta-Awareness in LLM Fine-Tuning ”, rife 2025
- “How We Used GPT-4o for Image Detection With 350 Very Similar, Single Image Classes ”, Topalian 2025
- “How Outdated Information Hides in LLM Token Generation Probabilities and Creates Logical Inconsistencies ”, Simmons 2025
- “Human Study on AI Spear Phishing Campaigns ”, Lermen & Heiding 2025
- “An Evaluation Framework for Clinical Use of Large Language Models in Patient Interaction Tasks ”, Johri et al 2025
- “Favorite Colors of Some LLMs ”, an 2024
- “Performance of LLMs on Advent of Code 2024 ”, Pinto 2024
- “The Emergence of Strategic Reasoning of Large Language Models ”, Lee & Kader 2024
- “Why You Should Be Talking With GPT-4 O1-Pro about Philosophy: Some Thoughts on How It’s Become Better, and How You Can Too ”, Lowe 2024
- “Cultural Evolution of Cooperation among LLM Agents ”, Vallinder & Hughes 2024
- “O1 Turns Pro ”
- “Frontier Models Are Capable of In-Context Scheming ”, Meinke et al 2024
- “Frontier Models Are Capable of In-Context Scheming ”, Hobbhahn et al 2024
- “Age against the Machine—Susceptibility of Large Language Models to Cognitive Impairment: Cross Sectional Analysis ”
- “Evaluating Large Language Models’ Capability to Launch Fully Automated Spear Phishing Campaigns: Validated on Human Subjects ”, Heiding et al 2024
- “The Problem With [O1] Reasoners: Praying for Transfer Learning ”, McLaughlin 2024
- “BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games ”, Paglieri et al 2024
- “Business Spending on AI Surged 500% This Year to $13.8 Billion ”
- “Generative Agent Simulations of 1,000 People ”, Park et al 2024
- “Are LLMs Prescient? A Continuous Evaluation Using Daily News As the Oracle ”, Dai et al 2024
- “Hidden Persuaders: LLMs’ Political Leaning and Their Influence on Voters ”, Potter et al 2024
- “A Tutorial on Teaching Data Analytics With Generative AI ”, Bray 2024
- “Can LLMs Be Scammed? A Baseline Measurement Study ”, Sehwag et al 2024
- “AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents ”, Andriushchenko et al 2024
- “SimpleStrat: Diversifying Language Model Generation With Stratification ”, Wong et al 2024
- “SWE-Bench+: Enhanced Coding Benchmark for LLMs ”, Aleithan et al 2024
- “MLE-Bench: Evaluating Machine Learning Agents on Machine Learning Engineering ”, Chan et al 2024
- “Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making ”, Li et al 2024
- “Can OpenAI’s o1-Preview Ace the 2023 Putnam Exam? ”, Kabasares 2024
- “When a Language Model Is Optimized for Reasoning, Does It Still Show Embers of Autoregression? An Analysis of OpenAI O1 ”, McCoy et al 2024
- “Invisible Unicode Text That AI Chatbots Understand and Humans Can’t? Yep, It’s a Thing ”
- “Generating Distinct AI Voice Performances By Prompt Engineering GPT-4o ”
- “I Quit Teaching Because of ChatGPT ”, Livingstone 2024
- “Evaluation of OpenAI O1: Opportunities and Challenges of AGI ”, Zhong et al 2024
- “That Message From Your Doctor? It May Have Been Drafted by ChatGPT-4 ”
- “LLMs Still Can’t Plan; Can LRMs? A Preliminary Evaluation of OpenAI’s O1 on PlanBench ”, Valmeekam et al 2024
- “The 27-Year-Old Billionaire Whose Army Does AI’s Dirty Work: Alexandr Wang’s Scale AI Deploys Gig Workers around the Globe to Shape How the Big AI Models Behave § Labeler Fraud ”, Jin 2024
- “I Have Played a Little Bit With OpenAI’s New Iteration, GPT-4 O1 ”, Tao 2024
- “Thoughts While Watching Myself Be Automated ”, Dynomight 2024
- “Generative AI Can Harm Learning ”, Bastani et al 2024
- “Does Refusal Training in LLMs Generalize to the Past Tense? ”, Andriushchenko & Flammarion 2024
- “GPT-4 Is Judged More Human Than Humans in Displaced and Inverted Turing Tests ”, Rathi et al 2024
- “On Scalable Oversight With Weak LLMs Judging Strong LLMs ”, Kenton et al 2024
- “Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs ”, Laine et al 2024
- “Are Large Language Models Consistent over Value-Laden Questions? ”, Moore et al 2024
- “Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation ”, Halawi et al 2024
- “APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets ”, Liu et al 2024
- “A Real-World Test of Artificial Intelligence Infiltration of a University Examinations System: A ‘Turing Test’ Case Study ”, Scarfe et al 2024
- “Connecting the Dots: LLMs Can Infer and Verbalize Latent Structure from Disparate Training Data ”, Treutlein et al 2024
- “OlympicArena: Benchmarking Multi-Discipline Cognitive Reasoning for Superintelligent AI ”, Huang et al 2024
- “What Are the Odds? Language Models Are Capable of Probabilistic Reasoning ”, Paruchuri et al 2024
- “Probing the Decision Boundaries of In-Context Learning in Large Language Models ”, Zhao et al 2024
- “Development Cost of ARC GPT-4o Prototype ”, Greenblatt 2024
- “GUI-WORLD: A Dataset for GUI-Oriented Multimodal LLM-Based Agents ”, Chen et al 2024
- “Are We Done With MMLU? ”, Gema et al 2024
- “ShareGPT4Video: Improving Video Understanding and Generation With Better Captions ”, Chen et al 2024
- “Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-Modal LLMs in Video Analysis ”, Fu et al 2024
- “LLMs Achieve Adult Human Performance on Higher-Order Theory of Mind Tasks ”, Street et al 2024
- “Intelligent Go-Explore (IGE): Standing on the Shoulders of Giant Foundation Models ”, Lu et al 2024
- “DeTikZify: Synthesizing Graphics Programs for Scientific Figures and Sketches With TikZ ”, Belouadi et al 2024
- “DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data ”, Xin et al 2024
- “Grokked Transformers Are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization ”, Wang et al 2024
- “Observational Scaling Laws and the Predictability of Language Model Performance ”, Ruan et al 2024
- “Can Language Models Explain Their Own Classification Behavior? ”, Sherburn et al 2024
- “ChatGPT Will Be Able to Talk to You like Scarlett Johansson in Her / Upgrades to ChatGPT’s Voice Mode Bring It Closer to the Vision of a Responsive AI Assistant—And Sam Altman Seems to Know It ”, Robison 2024
- “SWE-Agent: Agent-Computer Interfaces Enable Automated Software Engineering ”, Yang et al 2024
- “GSM1k: A Careful Examination of Large Language Model Performance on Grade School Arithmetic ”, Zhang et al 2024
- “Stochastic Lies: How LLM-Powered Chatbots Deal With Russian Disinformation about the War in Ukraine ”
- “Aligning LLM Agents by Learning Latent Preference from User Edits ”, Gao et al 2024
- “Automated Social Science: Language Models As Scientist and Subjects ”, Manning et al 2024
- “Enhancing Confidence Expression in Large Language Models Through Learning from Past Experience ”, Han et al 2024
- “Private Attribute Inference from Images With Vision-Language Models ”, Tömekçe et al 2024
- “LLM Evaluators Recognize and Favor Their Own Generations ”, Panickssery et al 2024
- “Do LLMs Play Dice? Exploring Probability Distribution Sampling in Large Language Models for Behavioral Simulation ”, Gu et al 2024
- “Is ChatGPT Transforming Academics’ Writing Style? ”, Geng & Trotta 2024
- “From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples ”, Vacareanu et al 2024
- “Election Workers Are Drowning in Records Requests. AI Chatbots Could Make It Worse: Experts Worry That Election Deniers Could Weaponize Chatbots to Overwhelm and Slow down Local Officials ”, Elliott 2024
- “Visualization-Of-Thought Elicits Spatial Reasoning in Large Language Models ”, Wu et al 2024
- “FABLES: Evaluating Faithfulness and Content Selection in Book-Length Summarization ”, Kim et al 2024
- “Re-Evaluating GPT-4’s Bar Exam Performance ”, Martínez 2024
- “A Peter Thiel-Backed AI Startup, Cognition Labs, Seeks $2 Billion Valuation: Funding round Could Increase Startup’s Valuation Nearly Sixfold in a Matter of Weeks, Reflecting AI Frenzy ”, Jin 2024
- “Vulnerability Detection With Code Language Models: How Far Are We? ”, Ding et al 2024
- “Long-Form Factuality in Large Language Models ”, Wei et al 2024
- “Gold-Medalist Coders Build an AI That Can Do Their Job for Them: A New Startup Called Cognition AI Can Turn a User’s Prompt into a Website or Video Game ”, Vance 2024
- “Playing NetHack With LLMs: Potential & Limitations As Zero-Shot Agents (NetPlay) ”, Jeurissen et al 2024
- “Teaching Large Language Models an Unseen Language on the Fly ”, Zhang et al 2024
- “Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap ”, Srivastava et al 2024
- “These Pros Were Stunned by OpenAI Deep Research: ‘I Would Use This Model Professionally’, an Antitrust Lawyer Told Me ”, Lee 2024
- “Tokenization Counts: the Impact of Tokenization on Arithmetic in Frontier LLMs ”, Singh & Strouse 2024
- “ArtPrompt: ASCII Art-Based Jailbreak Attacks against Aligned LLMs ”, Jiang et al 2024
- “Tasks That Language Models Don’t Learn ”, Lee & Lim 2024
- “Using Counterfactual Tasks to Evaluate the Generality of Analogical Reasoning in Large Language Models ”, Lewis & Mitchell 2024
- “The Non-Effect of Sampling Temperature on Problem Solving in GPT-3.5/GPT-4 ”, Renze & Guven 2024
- “I Think, Therefore I Am: Benchmarking Awareness of Large Language Models Using AwareBench ”, Li et al 2024
- “Better Call GPT, Comparing Large Language Models Against Lawyers ”, Martin et al 2024
- “I Am a Strange Dataset: Metalinguistic Tests for Language Models ”, Thrush et al 2024
- “GPT-4-V(Ision) Is a Human-Aligned Evaluator for Text-To-3D Generation ”, Wu et al 2024
- “Escalation Risks from Language Models in Military and Diplomatic Decision-Making ”, Rivera et al 2024
- “A Vision Check-Up for Language Models ”, Sharma et al 2024
- “Leveraging Large Language Models to Boost Dafny’s Developers Productivity ”, Silva et al 2024
- “Originality Dies When Being Average Is Easier ”
- “Testing Theory of Mind in Large Language Models and Humans ”
- “GPT-4 Passes the Bar Exam ”, Katz et al 2024
- “Large Language Models Are Able to Downplay Their Cognitive Abilities to Fit the Persona They Simulate ”, Milička et al 2024
- “WaveCoder: Widespread And Versatile Enhanced Instruction Tuning With Refined Data Generation ”, Yu et al 2023
- “PRER: Modeling Complex Mathematical Reasoning via Large Language Model Based MathAgent ”, Liao et al 2023
- “Can Linguists Distinguish between ChatGPT and Human Writing?: A Study of Research Ethics and Academic Publishing ”, Casal & Kessler 2023
- “Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine ”, Nori et al 2023
- “GPQA: A Graduate-Level Google-Proof Q&A Benchmark ”, Rein et al 2023
- 42irrationalist @ "2023-11-19"
- “Llamas Know What GPTs Don’t Show: Surrogate Models for Confidence Estimation ”, Shrivastava et al 2023
- “Comparing Humans, GPT-4, and GPT-4-V On Abstraction and Reasoning Tasks ”, Mitchell et al 2023
- “In Search of the Long-Tail: Systematic Generation of Long-Tail Inferential Knowledge via Logical Rule Guided Search ”, Li et al 2023
- “The Impact of Large Language Models on Scientific Discovery: a Preliminary Study Using GPT-4 ”, AI4Science & Quantum 2023
- “Accuracy of a Vision-Language Model on Challenging Medical Cases ”, Buckley et al 2023
- “Large Language Models Can Strategically Deceive Their Users When Put Under Pressure ”, Scheurer et al 2023
- “Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves ”, Deng et al 2023
- “Augmenting Large Language Models With Chemistry Tools ”, Bran et al 2023
- “FANToM: A Benchmark for Stress-Testing Machine Theory of Mind in Interactions ”, Kim et al 2023
- “Branch-Solve-Merge Improves Large Language Model Evaluation and Generation ”, Saha et al 2023
- “Eureka: Human-Level Reward Design via Coding Large Language Models ”, Ma et al 2023
- “Set-Of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4-V ”, Yang et al 2023
- “Large Language Model Prediction Capabilities: Evidence from a Real-World Forecasting Tournament ”, Schoenegger & Park 2023
- “Data Contamination Through the Lens of Time ”, Roberts et al 2023
- “Can GPT Models Be Financial Analysts? An Evaluation of ChatGPT and GPT-4 on Mock CFA Exams ”, Callanan et al 2023
- “Large Language Models Can Replicate Cross-Cultural Differences in Personality ”, Niszczota et al 2023
- “Beyond Memorization: Violating Privacy Via Inference With Large Language Models ”, Staab et al 2023
- “SWE-Bench: Can Language Models Resolve Real-World GitHub Issues? ”, Jimenez et al 2023
- “Can a Computer Outfake a Human [Personality]? ”, Phillips & Robie 2023
- “Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models ”, Zhou et al 2023
- “FreshLLMs: Refreshing Large Language Models With Search Engine Augmentation ”, Vu et al 2023
- “Police Officers Are Starting to Use AI to Write Crime Reports ”
- “Can Large Language Models Provide Useful Feedback on Research Papers? A Large-Scale Empirical Analysis ”, Liang et al 2023
- “Low-Resource Languages Jailbreak GPT-4 ”, Yong et al 2023
- “An Evolutionary Model of Personality Traits Related to Cooperative Behavior Using a Large Language Model ”, Suzuki & Arita 2023
- “UltraFeedback: Boosting Language Models With High-Quality Feedback ”, Cui et al 2023
- “MTOB: A Benchmark for Learning to Translate a New Language from One Grammar Book ”, Tanzer et al 2023
- “Embers of Autoregression: Understanding Large Language Models Through the Problem They Are Trained to Solve ”, McCoy et al 2023
- “The Cambridge Law Corpus: A Corpus for Legal AI Research ”, Östling et al 2023
- “The Reversal Curse: LLMs Trained on A-Is-B Fail to Learn B-Is-A ”, Berglund et al 2023
- “From Sparse to Dense: GPT-4 Summarization With Chain of Density (CoD) Prompting ”, Adams et al 2023
- “Devising and Detecting Phishing: Large Language Models versus Smaller Human Models ”, Heiding et al 2023
- “ExpeL: LLM Agents Are Experiential Learners ”, Zhao et al 2023
- “LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models ”, Guha et al 2023
- “Solving Challenging Math Word Problems Using GPT-4 Code Interpreter With Code-Based Self-Verification ”, Zhou et al 2023
- “OpenAI Cribbed Our Tax Example, But Can GPT-4 Really Do Tax? ”, Blair-Stanek et al 2023
- “Testing GPT-4 With Wolfram Alpha and Code Interpreter Plug-Ins on Math and Science Problems ”, Davis & Aaronson 2023
- “The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain ”, Moskvichev et al 2023
- “I’m a Screenwriter. These AI Jokes Give Me Nightmares ”, Rich 2023
- “A LLM Assisted Exploitation of AI-Guardian ”, Carlini 2023
- “OpenAI Worries About What Its Chatbot Will Say About People’s Faces: An Advanced Version of ChatGPT Can Analyze Images and Is Already Helping the Blind. But Its Ability to Put a Name to a Face Is One Reason the Public Doesn’t Have Access to It ”, Hill 2023
- “GPT-4, an Artificial Intelligence Large Language Model, Exhibits High Levels of Accuracy on Dermatology Specialty Certificate Exam Questions ”, Shetty et al 2023
- “Machine-Assisted Social Psychology Hypothesis Generation ”, Banker et al 2023
- “Distilling Large Language Models for Biomedical Knowledge Extraction: A Case Study on Adverse Drug Events ”, Gu et al 2023
- “Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration ”, Wang et al 2023
- “Explaining Competitive-Level Programming Solutions Using LLMs ”, Li et al 2023
- “Large Language Models for Supply Chain Optimization ”, Li et al 2023
- “Hoodwinked: Deception and Cooperation in a Text-Based Game for Language Models ”, O’Gara 2023
- “LeanDojo: Theorem Proving With Retrieval-Augmented Language Models ”, Yang et al 2023
- “ARIES: A Corpus of Scientific Paper Edits Made in Response to Peer Reviews ”, D’Arcy et al 2023
- “Understanding Social Reasoning in Language Models With Language Models ”, Gandhi et al 2023
- “Evaluating Superhuman Models With Consistency Checks ”, Fluri et al 2023
- “Evaluating the Robustness of Text-To-Image Diffusion Models against Real-World Attacks ”, Gao et al 2023
- “ChessGPT: Bridging Policy Learning and Language Modeling ”, Feng et al 2023
- “Large Language Models As Tax Attorneys: A Case Study in Legal Capabilities Emergence ”, Nay et al 2023
- “Can Large Language Models Democratize Access to Dual-Use Biotechnology? ”, Soice et al 2023
- “Let’s Verify Step by Step ”, Lightman et al 2023
- “GPT4GEO: How a Language Model Sees the World’s Geography ”, Roberts et al 2023
- “LLMs and the Abstraction and Reasoning Corpus: Successes, Failures, and the Importance of Object-Based Representations ”, Xu et al 2023
- “Learning to Generate Novel Scientific Directions With Contextualized Literature-Based Discovery ”, Wang et al 2023
- “WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia ”, Semnani et al 2023
- “How Language Model Hallucinations Can Snowball ”, Zhang et al 2023
- “C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models ”, Huang et al 2023
- “Large Language Models Can Be Used To Effectively Scale Spear Phishing Campaigns ”, Hazell 2023
- “PersonaLLM: Investigating the Ability of Large Language Models to Express Personality Traits ”, Jiang et al 2023
- “Boosting Theory-Of-Mind Performance in Large Language Models via Prompting ”, Moghaddam & Honey 2023
- “Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games ”
- “Today Was the First Day That I Could Definitively Say That GPT-4 Has Saved Me a Substantial Amount of Tedious Work ”, Tao 2023
- “Humans in Humans Out: On GPT Converging Toward Common Sense in Both Success and Failure ”, Koralus & Wang-Maścianica 2023
- “Advances in Apparent Conceptual Physics Reasoning in GPT-4 ”, West 2023
- “Performance of ChatGPT on Free-Response, Clinical Reasoning Exams ”, Strong et al 2023
- “Reflexion: Language Agents With Verbal Reinforcement Learning ”, Shinn et al 2023
- “How Well Do Large Language Models Perform in Arithmetic Tasks? ”, Yuan et al 2023
- “GPT-4 Technical Report § Limitations: Calibration ”, OpenAI 2023 (page 12)
- “Salesforce Announces Einstein GPT, the World’s First Generative AI for CRM ”, Salesforce 2023
- “Large Language Models Are State-Of-The-Art Evaluators of Translation Quality ”, Kocmi & Federmann 2023
- “Not What You’ve Signed up For: Compromising Real-World LLM-Integrated Applications With Indirect Prompt Injection ”, Greshake et al 2023
- “Share of Teens Using ChatGPT for Schoolwork Doubled 2023 → 2024 ”
- “Harvey, Which Uses AI to Answer Legal Questions, Lands Cash from OpenAI ”, Wiggers 2022
- “How AI Models Stack Up Against My 11-Year-Old? ”
- “Your AI Can’t See Gorillas ”, Gohel 2025
- “Janus ”
- “Something Weird Is Happening With LLMs and Chess ”, Dynomight 2025
- “Trading Off Compute in Training and Inference ”
- “A Basic Test of OpenAI’s Structured Output Feature against Financial Disclosure Reports and a Newspaper’s Police Blotter ”
- “PhysicsForums and the Dead Internet Theory ”
- “Prompt Engineering Techniques With Azure OpenAI ”
- “LLM Powered Autonomous Agents ”
- “Deep Research for Short Economics Papers ”, Cowen 2025
- “There’s a Running Theme in Here of Programming Problems LLMs Solve Where It’s… ”
- “There Might Be Some Papers or Other Guides out There, but Their Advice Will Be… ”
- “Prompting Diverse Ideas: Increasing AI Idea Variance ”
- “OpenAI API § Prompt Caching ”
- “o3-Mini”, OpenAI 2025
- “SWE-Agent ”
- “Situational Awareness and Out-Of-Context Reasoning § GPT-4-Base Has Non-Zero Longform Performance ”, Evans 2025
- “GPT-4 O1 Isn’t a Chat Model (And That’s the Point) ”
- “I Finally Got ChatGPT to Sound like Me ”, lsusr 2025
- “Connecting the Dots: LLMs Can Infer & Verbalize Latent Structure from Training Data ”
- “[Critical Thinking in Factchecking a Wikipedia Entry] ”, Marcello 2025
- “How Good Are LLMs at Doing ML on an Unknown Dataset? ”
- “Language Models Model Us ”
- “The Case for More Ambitious Language Model Evals ”
- “One Shockingly Impressive Capability of GPT-4.5 [Photo Geolocation] ”
- “What Kind of Writer Is ChatGPT? ”
- “[Reward-Hacking: Vibe Coder Whose Supposed App Just Simulated Real Data] ”
- “AI Will Increase the Quantity—And Quality—Of Phishing Scams ”
- “Is Finetuning GPT-4o worth It? ”
- michael_nielsen
- Sort By Magic
- Miscellaneous
- Bibliography
See Also
Gwern
“ChatGPT-O3: Website Design Feedback Ideas [Total Site Review Confabulation] ”, GPT-o3 & Gwern 2025
“LLM Challenge: Write Non-Biblical Sentences ”, Gwern 2024
“Abs-E, Or, Speak Only Positively ”, Gwern 2024
“text2epositive.py”, Gwern 2024
“date-Guesser.py”, Gwern 2024
“paragraphizer.py”, Gwern 2022
“CQK Is The First Unused TLA ”, Gwern 2023
Links
“Strategic Intelligence in Large Language Models: Evidence from Evolutionary Game Theory ”, Payne & Alloui-Cros 2025
“Early Signs of Steganographic Capabilities in Frontier LLMs ”, Zolkowski et al 2025
“Details about METR’s Preliminary [Coding] Evaluation of DeepSeek and Qwen Models ”, METR 2025
“ChatGPT O3-Pro: Version of O3 With More Compute for Better Responses ”, OpenAI 2025
“From Tool to Teammate: A Randomized Controlled Trial of Clinician-AI Collaborative Workflows for Diagnosis ”, Everett et al 2025
“Beyond Benchmark Scores: Analyzing O3-Mini’s Mathematical Reasoning ”, Ho et al 2025
“How Does O3 Guess Latitude From Photos? ”
“VideoGameBench: Can Vision-Language Models Complete Popular Video Games? ”, Zhang et al 2025
“How I Used GPT-4-O3 to Find CVE-2025-37899, a Remote Zero-Day Vulnerability in the Linux Kernel’s SMB Implementation ”, Heelan 2025
“What ChatGPT Knows about My Account [JSON Export] ”, Gwern 2025
“I Really Don’t like ChatGPT’s New Memory Dossier ”, Willison 2025
“RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics ”, Zhang et al 2025
“RealMath [Code] ”, Zhang et al 2025
“Introducing Codex: A Cloud-Based Software Engineering Agent That Can Work on Many Tasks in Parallel, Powered by codex-1”, OpenAI 2025
“Revealing Economic Facts: LLMs Know More Than They Say ”, Buckmann et al 2025
“Measuring General Intelligence With Generated Games ”, Verma et al 2025
“Is ChatGPT Actually Fixed Now? I Tested ChatGPT’s Sycophancy, and the Results Were ... Extremely Weird. We’re a Long Way from Making AI Behave. ”, Adler 2025
“Highlights From The Comments On AI Geoguessr ”, Alexander 2025
“[The Letter ‘G’ in ‘Strawberry’] ”, Breadd007 2025
“Rampant AI Cheating Is Ruining Education Alarmingly Fast: ChatGPT Has Unraveled the Entire Academic Project ”, Walsh 2025
“How ChatGPT Remembers You: A Deep Dive into Its Memory and Chat History Features ”, wunderwuzzi 2025
“The Other Sharks Out There [LLM-Powered Copyright Link-Spam Fraud] ”, Slifkin 2025
“Expanding on What We Missed With Sycophancy: A Deeper Dive on Our Findings, What Went Wrong, and Future Changes We’re Making ”, OpenAI 2025
“Testing AI’s GeoGuessr Genius: Seeing a World in a Grain of Sand ”, Alexander 2025
“Is AI Enhancing Education or Replacing It? Technology Should Facilitate Learning, Not Substitute for It ”, Shirky 2025
“ChatGPT Induced Psychosis: Serious Replies Only ”, Zestyclementinejuice 2025
“GPT-O3 Beats a Master-Level Geoguessr Player—Even With Fake EXIF Data ”, Patterson 2025
“Shifting Work Patterns With Generative AI ”, Dillon et al 2025
“Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark ”, Götting et al 2025
“Investigating Truthfulness in a Pre-Release GPT-O3 Model ”, Chowdhury et al 2025
“[Pseudo-Jailbreaks] ”
elidourado @ "2025-04-08"
“The Curve Is Bending ”, Slatton 2025
“How Deep Do Large Language Models Internalize Scientific Literature and Citation Practices? ”, Algaba et al 2025
“Mike Lindell’s Lawyers Used AI to Write Brief ”
“Large Language Models Pass the Turing Test ”, Jones & Bergen 2025
“Why Does Claude Speak Byzantine Music Notation? ”, Finke 2025
“Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad ”, Petrov et al 2025
“Deep Research: Supermajority Laws around the States ”
“Obscure Scientific Facts Benchmark ”, Azulay 2025
“Spontaneous Giving and Calculated Greed in Language Models ”, Li & Shirado 2025
“None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks ”, Salido et al 2025
“Idiosyncrasies in Large Language Models ”, Sun et al 2025
“VLMs As GeoGuessr Masters: Exceptional Performance, Hidden Biases, and Privacy Risks ”, Huang et al 2025
“SycEval: Evaluating LLM Sycophancy ”, Fanous et al 2025
“DS R1 Is Not on Par With OA O1, and the Difference Is Qualitative, Not Quantitative: Long-Tail Benchmarks Reveal Gaps ”, Polshkov 2025
“Deep Research Dispatch: OpenAI’s Answers to Your Questions [Crowdsourcing DR Samples] ”, Griffing 2025
“Lost in Time: Clock and Calendar Understanding Challenges in Multimodal LLMs ”, Saxena et al 2025
“Do Large Language Model Benchmarks Test Reliability? ”, Vendrow et al 2025
“What Will AI Do to pre-Research/research? AI Makes Doing and Communicating Research Much Easier. Will There Be Any Point to It? ”, Gans 2025
“The Efficient Market Hypothesis When Time Travel Is Possible ”, Gans & o1-pro 2025
“WILLIAM A., a Student, by and through His Parents, E.A. and C.A. v. CLARKSVILLE-MONTGOMERY COUNTY SCHOOL SYSTEM ”, Sutton et al 2025
“Competitive Programming With Large Reasoning Models ”, El-Kishky et al 2025
“AI Language Model Rivals Expert Ethicist in Perceived Moral Expertise ”, Dillion et al 2025
“Introducing Deep Research: An Agent That Uses Reasoning to Synthesize Large Amounts of Online Information and Complete Multi-Step Research Tasks for You. Available to Pro Users Today, Plus and Team Next ”, OpenAI 2025
“Large Language Models Think Too Fast To Explore Effectively ”, Pan et al 2025
“The Impact of Generative AI on Critical Thinking: Self-Reported Reductions in Cognitive Effort and Confidence Effects From a Survey of Knowledge Workers ”, Lee et al 2025
“How Different LLMs Answered the PhilPapers 2020 Survey ”, Satron 2025
“People Who Frequently Use ChatGPT for Writing Tasks Are Accurate and Robust Detectors of AI-Generated Text ”, Russell et al 2025
“Diving into the Underlying Rules or Abstractions in GPT-4 O3’s 34 ARC-AGI Failures ”, mace 2025
“A Novel Emergence of Meta-Awareness in LLM Fine-Tuning ”, rife 2025
“How We Used GPT-4o for Image Detection With 350 Very Similar, Single Image Classes ”, Topalian 2025
“How Outdated Information Hides in LLM Token Generation Probabilities and Creates Logical Inconsistencies ”, Simmons 2025
“Human Study on AI Spear Phishing Campaigns ”, Lermen & Heiding 2025
“An Evaluation Framework for Clinical Use of Large Language Models in Patient Interaction Tasks ”, Johri et al 2025
“Favorite Colors of Some LLMs ”, an 2024
“Performance of LLMs on Advent of Code 2024 ”, Pinto 2024
“The Emergence of Strategic Reasoning of Large Language Models ”, Lee & Kader 2024
“Why You Should Be Talking With GPT-4 O1-Pro about Philosophy: Some Thoughts on How It’s Become Better, and How You Can Too ”, Lowe 2024
“Cultural Evolution of Cooperation among LLM Agents ”, Vallinder & Hughes 2024
“O1 Turns Pro ”
“Frontier Models Are Capable of In-Context Scheming ”, Meinke et al 2024
“Frontier Models Are Capable of In-Context Scheming ”, Hobbhahn et al 2024
“Age against the Machine—Susceptibility of Large Language Models to Cognitive Impairment: Cross Sectional Analysis ”
“Evaluating Large Language Models’ Capability to Launch Fully Automated Spear Phishing Campaigns: Validated on Human Subjects ”, Heiding et al 2024
“The Problem With [O1] Reasoners: Praying for Transfer Learning ”, McLaughlin 2024
“BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games ”, Paglieri et al 2024
“Business Spending on AI Surged 500% This Year to $13.8 Billion ”
“Generative Agent Simulations of 1,000 People ”, Park et al 2024
“Are LLMs Prescient? A Continuous Evaluation Using Daily News As the Oracle ”, Dai et al 2024
“Hidden Persuaders: LLMs’ Political Leaning and Their Influence on Voters ”, Potter et al 2024
“A Tutorial on Teaching Data Analytics With Generative AI ”, Bray 2024
“Can LLMs Be Scammed? A Baseline Measurement Study ”, Sehwag et al 2024
“AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents ”, Andriushchenko et al 2024
“SimpleStrat: Diversifying Language Model Generation With Stratification ”, Wong et al 2024
“SWE-Bench+: Enhanced Coding Benchmark for LLMs ”, Aleithan et al 2024
“MLE-Bench: Evaluating Machine Learning Agents on Machine Learning Engineering ”, Chan et al 2024
“Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making ”, Li et al 2024
“Can OpenAI’s o1-Preview Ace the 2023 Putnam Exam? ”, Kabasares 2024
“When a Language Model Is Optimized for Reasoning, Does It Still Show Embers of Autoregression? An Analysis of OpenAI O1 ”, McCoy et al 2024
“Invisible Unicode Text That AI Chatbots Understand and Humans Can’t? Yep, It’s a Thing ”
“Generating Distinct AI Voice Performances By Prompt Engineering GPT-4o ”
“I Quit Teaching Because of ChatGPT ”, Livingstone 2024
“Evaluation of OpenAI O1: Opportunities and Challenges of AGI ”, Zhong et al 2024
“That Message From Your Doctor? It May Have Been Drafted by ChatGPT-4 ”
“LLMs Still Can’t Plan; Can LRMs? A Preliminary Evaluation of OpenAI’s O1 on PlanBench ”, Valmeekam et al 2024
“The 27-Year-Old Billionaire Whose Army Does AI’s Dirty Work: Alexandr Wang’s Scale AI Deploys Gig Workers around the Globe to Shape How the Big AI Models Behave § Labeler Fraud ”, Jin 2024
“I Have Played a Little Bit With OpenAI’s New Iteration, GPT-4 O1 ”, Tao 2024
“Thoughts While Watching Myself Be Automated ”, Dynomight 2024
“Generative AI Can Harm Learning ”, Bastani et al 2024
“Does Refusal Training in LLMs Generalize to the Past Tense? ”, Andriushchenko & Flammarion 2024
“GPT-4 Is Judged More Human Than Humans in Displaced and Inverted Turing Tests ”, Rathi et al 2024
“On Scalable Oversight With Weak LLMs Judging Strong LLMs ”, Kenton et al 2024
“Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs ”, Laine et al 2024
“Are Large Language Models Consistent over Value-Laden Questions? ”, Moore et al 2024
“Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation ”, Halawi et al 2024
“APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets ”, Liu et al 2024
“A Real-World Test of Artificial Intelligence Infiltration of a University Examinations System: A ‘Turing Test’ Case Study ”, Scarfe et al 2024
“Connecting the Dots: LLMs Can Infer and Verbalize Latent Structure from Disparate Training Data ”, Treutlein et al 2024
“OlympicArena: Benchmarking Multi-Discipline Cognitive Reasoning for Superintelligent AI ”, Huang et al 2024
“What Are the Odds? Language Models Are Capable of Probabilistic Reasoning ”, Paruchuri et al 2024
“Probing the Decision Boundaries of In-Context Learning in Large Language Models ”, Zhao et al 2024
“Development Cost of ARC GPT-4o Prototype ”, Greenblatt 2024
“GUI-WORLD: A Dataset for GUI-Oriented Multimodal LLM-Based Agents ”, Chen et al 2024
“Are We Done With MMLU? ”, Gema et al 2024
“Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-Modal LLMs in Video Analysis ”, Fu et al 2024
“LLMs Achieve Adult Human Performance on Higher-Order Theory of Mind Tasks ”, Street et al 2024
“Intelligent Go-Explore (IGE): Standing on the Shoulders of Giant Foundation Models ”, Lu et al 2024
“DeTikZify: Synthesizing Graphics Programs for Scientific Figures and Sketches With TikZ ”, Belouadi et al 2024
“DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data ”, Xin et al 2024
“Grokked Transformers Are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization ”, Wang et al 2024
“Observational Scaling Laws and the Predictability of Language Model Performance ”, Ruan et al 2024
“Can Language Models Explain Their Own Classification Behavior? ”, Sherburn et al 2024
“ChatGPT Will Be Able to Talk to You like Scarlett Johansson in Her / Upgrades to ChatGPT’s Voice Mode Bring It Closer to the Vision of a Responsive AI Assistant—And Sam Altman Seems to Know It ”, Robison 2024
“SWE-Agent: Agent-Computer Interfaces Enable Automated Software Engineering ”, Yang et al 2024
“GSM1k: A Careful Examination of Large Language Model Performance on Grade School Arithmetic ”, Zhang et al 2024
“Stochastic Lies: How LLM-Powered Chatbots Deal With Russian Disinformation about the War in Ukraine ”
“Aligning LLM Agents by Learning Latent Preference from User Edits ”, Gao et al 2024
“Automated Social Science: Language Models As Scientist and Subjects ”, Manning et al 2024
“Enhancing Confidence Expression in Large Language Models Through Learning from Past Experience ”, Han et al 2024
“Private Attribute Inference from Images With Vision-Language Models ”, Tömekçe et al 2024
“LLM Evaluators Recognize and Favor Their Own Generations ”, Panickssery et al 2024
“Do LLMs Play Dice? Exploring Probability Distribution Sampling in Large Language Models for Behavioral Simulation ”, Gu et al 2024
“Is ChatGPT Transforming Academics’ Writing Style? ”, Geng & Trotta 2024
“From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples ”, Vacareanu et al 2024
“Election Workers Are Drowning in Records Requests. AI Chatbots Could Make It Worse: Experts Worry That Election Deniers Could Weaponize Chatbots to Overwhelm and Slow down Local Officials ”, Elliott 2024
“Visualization-Of-Thought Elicits Spatial Reasoning in Large Language Models ”, Wu et al 2024
“FABLES: Evaluating Faithfulness and Content Selection in Book-Length Summarization ”, Kim et al 2024
“Re-Evaluating GPT-4’s Bar Exam Performance ”, Martínez 2024
“A Peter Thiel-Backed AI Startup, Cognition Labs, Seeks $2 Billion Valuation: Funding round Could Increase Startup’s Valuation Nearly Sixfold in a Matter of Weeks, Reflecting AI Frenzy ”, Jin 2024
“Vulnerability Detection With Code Language Models: How Far Are We? ”, Ding et al 2024
“Long-Form Factuality in Large Language Models ”, Wei et al 2024
“Gold-Medalist Coders Build an AI That Can Do Their Job for Them: A New Startup Called Cognition AI Can Turn a User’s Prompt into a Website or Video Game ”, Vance 2024
“Playing NetHack With LLMs: Potential & Limitations As Zero-Shot Agents (NetPlay) ”, Jeurissen et al 2024
“Teaching Large Language Models an Unseen Language on the Fly ”, Zhang et al 2024
“Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap ”, Srivastava et al 2024
“These Pros Were Stunned by OpenAI Deep Research: ‘I Would Use This Model Professionally’, an Antitrust Lawyer Told Me ”, Lee 2024
“Tokenization Counts: the Impact of Tokenization on Arithmetic in Frontier LLMs ”, Singh & Strouse 2024
“ArtPrompt: ASCII Art-Based Jailbreak Attacks against Aligned LLMs ”, Jiang et al 2024
“Tasks That Language Models Don’t Learn ”, Lee & Lim 2024
“Using Counterfactual Tasks to Evaluate the Generality of Analogical Reasoning in Large Language Models ”, Lewis & Mitchell 2024
“The Non-Effect of Sampling Temperature on Problem Solving in GPT-3.5/GPT-4 ”, Renze & Guven 2024
“I Think, Therefore I Am: Benchmarking Awareness of Large Language Models Using AwareBench ”, Li et al 2024
“Better Call GPT, Comparing Large Language Models Against Lawyers ”, Martin et al 2024
“I Am a Strange Dataset: Metalinguistic Tests for Language Models ”, Thrush et al 2024
“GPT-4-V(Ision) Is a Human-Aligned Evaluator for Text-To-3D Generation ”, Wu et al 2024
“Escalation Risks from Language Models in Military and Diplomatic Decision-Making ”, Rivera et al 2024
“Leveraging Large Language Models to Boost Dafny’s Developers Productivity ”, Silva et al 2024
“Originality Dies When Being Average Is Easier ”
“Testing Theory of Mind in Large Language Models and Humans ”
“GPT-4 Passes the Bar Exam ”, Katz et al 2024
“Large Language Models Are Able to Downplay Their Cognitive Abilities to Fit the Persona They Simulate ”, Milička et al 2024
“WaveCoder: Widespread And Versatile Enhanced Instruction Tuning With Refined Data Generation ”, Yu et al 2023
“PRER: Modeling Complex Mathematical Reasoning via Large Language Model Based MathAgent ”, Liao et al 2023
“Can Linguists Distinguish between ChatGPT and Human Writing?: A Study of Research Ethics and Academic Publishing ”, Casal & Kessler 2023
“Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine ”, Nori et al 2023
“GPQA: A Graduate-Level Google-Proof Q&A Benchmark ”, Rein et al 2023
42irrationalist @ "2023-11-19"
“Llamas Know What GPTs Don’t Show: Surrogate Models for Confidence Estimation ”, Shrivastava et al 2023
“Comparing Humans, GPT-4, and GPT-4-V On Abstraction and Reasoning Tasks ”, Mitchell et al 2023
“In Search of the Long-Tail: Systematic Generation of Long-Tail Inferential Knowledge via Logical Rule Guided Search ”, Li et al 2023
“The Impact of Large Language Models on Scientific Discovery: a Preliminary Study Using GPT-4 ”, AI4Science & Quantum 2023
“Accuracy of a Vision-Language Model on Challenging Medical Cases ”, Buckley et al 2023
“Large Language Models Can Strategically Deceive Their Users When Put Under Pressure ”, Scheurer et al 2023
“Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves ”, Deng et al 2023
“Augmenting Large Language Models With Chemistry Tools ”, Bran et al 2023
“FANToM: A Benchmark for Stress-Testing Machine Theory of Mind in Interactions ”, Kim et al 2023
“Branch-Solve-Merge Improves Large Language Model Evaluation and Generation ”, Saha et al 2023
“Eureka: Human-Level Reward Design via Coding Large Language Models ”, Ma et al 2023
“Set-Of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4-V ”, Yang et al 2023
“Large Language Model Prediction Capabilities: Evidence from a Real-World Forecasting Tournament ”, Schoenegger & Park 2023
“Data Contamination Through the Lens of Time ”, Roberts et al 2023
“Can GPT Models Be Financial Analysts? An Evaluation of ChatGPT and GPT-4 on Mock CFA Exams ”, Callanan et al 2023
“Large Language Models Can Replicate Cross-Cultural Differences in Personality ”, Niszczota et al 2023
“Beyond Memorization: Violating Privacy Via Inference With Large Language Models ”, Staab et al 2023
“SWE-Bench: Can Language Models Resolve Real-World GitHub Issues? ”, Jimenez et al 2023
“Can a Computer Outfake a Human [Personality]? ”, Phillips & Robie 2023
“Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models ”, Zhou et al 2023
“FreshLLMs: Refreshing Large Language Models With Search Engine Augmentation ”, Vu et al 2023
“Police Officers Are Starting to Use AI to Write Crime Reports ”
“Can Large Language Models Provide Useful Feedback on Research Papers? A Large-Scale Empirical Analysis ”, Liang et al 2023
“Low-Resource Languages Jailbreak GPT-4 ”, Yong et al 2023
“An Evolutionary Model of Personality Traits Related to Cooperative Behavior Using a Large Language Model ”, Suzuki & Arita 2023
“UltraFeedback: Boosting Language Models With High-Quality Feedback ”, Cui et al 2023
“MTOB: A Benchmark for Learning to Translate a New Language from One Grammar Book ”, Tanzer et al 2023
“Embers of Autoregression: Understanding Large Language Models Through the Problem They Are Trained to Solve ”, McCoy et al 2023
“The Cambridge Law Corpus: A Corpus for Legal AI Research ”, Östling et al 2023
“The Reversal Curse: LLMs Trained on A-Is-B Fail to Learn B-Is-A ”, Berglund et al 2023
“From Sparse to Dense: GPT-4 Summarization With Chain of Density (CoD) Prompting ”, Adams et al 2023
“Devising and Detecting Phishing: Large Language Models versus Smaller Human Models ”, Heiding et al 2023
“ExpeL: LLM Agents Are Experiential Learners ”, Zhao et al 2023
“LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models ”, Guha et al 2023
“Solving Challenging Math Word Problems Using GPT-4 Code Interpreter With Code-Based Self-Verification ”, Zhou et al 2023
“OpenAI Cribbed Our Tax Example, But Can GPT-4 Really Do Tax? ”, Blair-Stanek et al 2023
“Testing GPT-4 With Wolfram Alpha and Code Interpreter Plug-Ins on Math and Science Problems ”, Davis & Aaronson 2023
“The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain ”, Moskvichev et al 2023
“I’m a Screenwriter. These AI Jokes Give Me Nightmares ”, Rich 2023
“A LLM Assisted Exploitation of AI-Guardian ”, Carlini 2023
“OpenAI Worries About What Its Chatbot Will Say About People’s Faces: An Advanced Version of ChatGPT Can Analyze Images and Is Already Helping the Blind. But Its Ability to Put a Name to a Face Is One Reason the Public Doesn’t Have Access to It ”, Hill 2023
“GPT-4, an Artificial Intelligence Large Language Model, Exhibits High Levels of Accuracy on Dermatology Specialty Certificate Exam Questions ”, Shetty et al 2023
“Machine-Assisted Social Psychology Hypothesis Generation ”, Banker et al 2023
“Distilling Large Language Models for Biomedical Knowledge Extraction: A Case Study on Adverse Drug Events ”, Gu et al 2023
“Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration ”, Wang et al 2023
“Explaining Competitive-Level Programming Solutions Using LLMs ”, Li et al 2023
“Large Language Models for Supply Chain Optimization ”, Li et al 2023
“Hoodwinked: Deception and Cooperation in a Text-Based Game for Language Models ”, O’Gara 2023
“LeanDojo: Theorem Proving With Retrieval-Augmented Language Models ”, Yang et al 2023
“ARIES: A Corpus of Scientific Paper Edits Made in Response to Peer Reviews ”, D’Arcy et al 2023
“Understanding Social Reasoning in Language Models With Language Models ”, Gandhi et al 2023
“Evaluating Superhuman Models With Consistency Checks ”, Fluri et al 2023
“Evaluating the Robustness of Text-To-Image Diffusion Models against Real-World Attacks ”, Gao et al 2023
“ChessGPT: Bridging Policy Learning and Language Modeling ”, Feng et al 2023
“Large Language Models As Tax Attorneys: A Case Study in Legal Capabilities Emergence ”, Nay et al 2023
“Can Large Language Models Democratize Access to Dual-Use Biotechnology? ”, Soice et al 2023
“Let’s Verify Step by Step ”, Lightman et al 2023
“GPT4GEO: How a Language Model Sees the World’s Geography ”, Roberts et al 2023
“LLMs and the Abstraction and Reasoning Corpus: Successes, Failures, and the Importance of Object-Based Representations ”, Xu et al 2023
“Learning to Generate Novel Scientific Directions With Contextualized Literature-Based Discovery ”, Wang et al 2023
“WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia ”, Semnani et al 2023
“How Language Model Hallucinations Can Snowball ”, Zhang et al 2023
“C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models ”, Huang et al 2023
“Large Language Models Can Be Used To Effectively Scale Spear Phishing Campaigns ”, Hazell 2023
“PersonaLLM: Investigating the Ability of Large Language Models to Express Personality Traits ”, Jiang et al 2023
“Boosting Theory-Of-Mind Performance in Large Language Models via Prompting ”, Moghaddam & Honey 2023
“Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games ”
“Today Was the First Day That I Could Definitively Say That GPT-4 Has Saved Me a Substantial Amount of Tedious Work ”, Tao 2023
“Humans in Humans Out: On GPT Converging Toward Common Sense in Both Success and Failure ”, Koralus & Wang-Maścianica 2023
“Advances in Apparent Conceptual Physics Reasoning in GPT-4 ”, West 2023
“Performance of ChatGPT on Free-Response, Clinical Reasoning Exams ”, Strong et al 2023
“Reflexion: Language Agents With Verbal Reinforcement Learning ”, Shinn et al 2023
“How Well Do Large Language Models Perform in Arithmetic Tasks? ”, Yuan et al 2023
“GPT-4 Technical Report § Limitations: Calibration ”, OpenAI 2023 (page 12)
“Salesforce Announces Einstein GPT, the World’s First Generative AI for CRM ”, Salesforce 2023
“Large Language Models Are State-Of-The-Art Evaluators of Translation Quality ”, Kocmi & Federmann 2023
“Not What You’ve Signed up For: Compromising Real-World LLM-Integrated Applications With Indirect Prompt Injection ”, Greshake et al 2023
“Share of Teens Using ChatGPT for Schoolwork Doubled 2023 → 2024 ”
“Harvey, Which Uses AI to Answer Legal Questions, Lands Cash from OpenAI ”, Wiggers 2022
“How AI Models Stack Up Against My 11-Year-Old? ”
“Your AI Can’t See Gorillas ”, Gohel 2025
“Janus ”
“Something Weird Is Happening With LLMs and Chess ”, Dynomight 2025
“Trading Off Compute in Training and Inference ”
“A Basic Test of OpenAI’s Structured Output Feature against Financial Disclosure Reports and a Newspaper’s Police Blotter ”
“PhysicsForums and the Dead Internet Theory ”
“Prompt Engineering Techniques With Azure OpenAI ”
“LLM Powered Autonomous Agents ”
“Deep Research for Short Economics Papers ”, Cowen 2025
“There’s a Running Theme in Here of Programming Problems LLMs Solve Where It’s… ”
“There Might Be Some Papers or Other Guides out There, but Their Advice Will Be… ”
“Prompting Diverse Ideas: Increasing AI Idea Variance ”
“OpenAI API § Prompt Caching ”
“o3-Mini ”, OpenAI 2025
“SWE-Agent ”
“Situational Awareness and Out-Of-Context Reasoning § GPT-4-Base Has Non-Zero Longform Performance ”, Evans 2025
“GPT-4 O1 Isn’t a Chat Model (And That’s the Point) ”
“I Finally Got ChatGPT to Sound like Me ”, lsusr 2025
“Connecting the Dots: LLMs Can Infer & Verbalize Latent Structure from Training Data ”
“[Critical Thinking in Factchecking a Wikipedia Entry] ”, Marcello 2025
“How Good Are LLMs at Doing ML on an Unknown Dataset? ”
“Language Models Model Us ”
“The Case for More Ambitious Language Model Evals ”
“One Shockingly Impressive Capability of GPT-4.5 [Photo Geolocation] ”
“What Kind of Writer Is ChatGPT? ”
“[Reward-Hacking: Vibe Coder Whose Supposed App Just Simulated Real Data] ”
“AI Will Increase the Quantity—And Quality—Of Phishing Scams ”
“Is Finetuning GPT-4o worth It? ”
michael_nielsen
[‘Fourier components’-style literary criticism by GPT-4 o1]
Sort By Magic
Annotations sorted by machine learning into inferred 'tags'. This provides an alternative way to browse: instead of by date order, one can browse in topic order. The 'sorted' list has been automatically clustered into multiple sections & auto-labeled for easier browsing.
Beginning with the newest annotation, it uses the embedding of each annotation to attempt to create a list of nearest-neighbor annotations, creating a progression of topics. For more details, see the link.
llm-evaluation
evaluation
reasoning-performance
theory-of-mind
Miscellaneous
/doc/ai/nn/transformer/gpt/codex/2024-03-07-inflection-inflection25benchmarks.svg
https://blog.matteskridge.com/business/gpt4-and-silicon-valley-bank/2023/03/19/
https://blog.mentat.ai/benchmarking-gpt-4-turbo-a-cautionary-tale
https://blog.nawaz.org/posts/2024/Jan/llm-assisted-moderation/
https://chat.openai.com/share/04add58f-2052-4b60-ae2a-ab708c29088f
https://chatgpt.com/share/312e82f0-cc5e-47f3-b368-b2c0c0f4ad3f
https://clarifycapital.com/the-future-of-investment-pitching
https://cookbook.openai.com/examples/tag_caption_images_with_gpt4v
https://finedataproducts.com/posts/2024-03-10-tax-scenarios-with-ai/
https://generallyintelligent.substack.com/p/fine-tuning-mistral-7b-on-magic-the
https://gist.github.com/Jessime/63f93215faed6f7109c6d62b7fef7fbc
https://gist.github.com/harryaskham/68a611bef777525991790bca2f2d324d
https://github.com/E-xyza/Exonerate/blob/master/bench/reports/gpt-bench.md
https://github.com/chenandrewy/Prompts-to-Paper/blob/master/README
https://github.com/jujumilk3/leaked-system-prompts/blob/main/microsoft-bing-chat_20230209.md
https://github.com/jujumilk3/leaked-system-prompts/blob/main/openai-assistants-api_20231106.md
https://github.com/jujumilk3/leaked-system-prompts/blob/main/openai-chatgpt-ios_20230614.md
https://github.com/jujumilk3/leaked-system-prompts/blob/main/openai-chatgpt4-android_20240207.md
https://github.com/jujumilk3/leaked-system-prompts/blob/main/openai-chatgpt_20221201.md
https://github.com/kagisearch/llm-chess-puzzles?tab=readme-ov-file#results
https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2812620
https://kenkantzer.com/lessons-after-a-half-billion-gpt-tokens/
https://koenvangilst.nl/blog/keeping-code-complexity-in-check
https://lemire.me/blog/2023/03/22/can-gpt-pass-my-programming-courses/
https://matthewbarnett.substack.com/p/gpt-4-takes-bryan-caplans-midterm
https://mazzzystar.github.io/2023/05/10/LLM-for-individual/
https://micahflee.com/2023/04/capturing-the-flag-with-gpt-4/
https://openai.com/blog/function-calling-and-other-api-updates#function-calling
https://openai.com/index/introducing-structured-outputs-in-the-api/#_5PYjnV1iAHOPKPupDztdZk
https://paperswithcode.com/sota/math-word-problem-solving-on-math
https://platform.openai.com/docs/guides/reasoning/how-reasoning-works
https://pslusarz.github.io/articles/2023/12/22/compare-ocr-tesseract-gpt4-nara-rolls.html
https://statmodeling.stat.columbia.edu/2023/04/18/chatgpt4-writes-stan-code-so-i-dont-have-to/
https://statmodeling.stat.columbia.edu/2023/08/20/bob-carpenter-thinks-gpt-4-is-awesome/
https://terrytao.wordpress.com/about/ai-generated-versions-of-the-ai-anthology-article/
https://villekuosmanen.medium.com/i-played-chess-against-chatgpt-4-and-lost-c5798a9049ca
https://www.betonit.ai/p/gpt-4-takes-a-new-midterm-and-gets
https://www.construction-physics.com/p/could-chatgpt-become-an-architect
https://www.economist.com/business/2024/02/29/how-businesses-are-actually-using-generative-ai
https://www.euractiv.com/section/politics/news/albania-to-speed-up-eu-accession-using-chatgpt/
https://www.geoffreylitt.com/2023/03/25/llm-end-user-programming
https://www.lesswrong.com/posts/CkhJAxHeyFCg2EcET/are-language-models-good-at-making-predictions
https://www.lesswrong.com/posts/KSroBnxCHodGmPPJ8/jailbreaking-gpt-4-s-code-interpreter
https://www.oneusefulthing.org/p/it-is-starting-to-get-strange
https://www.oneusefulthing.org/p/setting-time-on-fire-and-the-temptation
https://www.reddit.com/r/ChatGPT/comments/12a0ajb/i_gave_gpt4_persistent_memory_and_the_ability_to/
https://www.reddit.com/r/GPT3/comments/12ez822/neurosemantical_inversitis_prompt_still_works/
https://www.reddit.com/r/OpenAI/comments/1fxa6d6/two_purported_instances_of_o1preview_and_o1mini/
https://www.reddit.com/r/OpenAI/comments/1gjj430/o1_preview_got_weird_today/
https://www.reddit.com/r/OpenAI/comments/1k3szsr/o3_and_o4minihigh_tested_on_usamo_2025/
https://www.reddit.com/r/PromptEngineering/comments/1fj6h13/hallucinations_in_o1preview_reasoning/
https://www.reddit.com/r/bing/comments/110eagl/the_customer_service_of_the_new_bing_chat_is/
https://www.reddit.com/r/duolingo/comments/18sx06i/big_layoff_at_duolingo/
https://www.reddit.com/r/freelanceWriters/comments/12ff5mw/it_happened_to_me_today/
https://www.reddit.com/r/mlscaling/comments/1gyb54z/the_fate_of_gpt4o/
https://www.reddit.com/r/singularity/comments/1atjz9v/ive_put_a_complex_codebase_into_a_single/
https://www.science.org/content/blog-post/evaluation-deep-research-performance
https://www.supersimple.io/blog/gpt-4-fine-tuning-early-access
https://www.thebigquestions.com/2023/04/05/gpt-4-fails-economics/
Bibliography
https://arxiv.org/abs/2505.07215: “Measuring General Intelligence With Generated Games ”
https://arxiv.org/abs/2503.21934: “Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad ”
https://arxiv.org/abs/2502.12896: “None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks ”
https://toloka.ai/blog/r1-is-not-on-par-with-o1-and-the-difference-is-qualitative-not-quantitative/: “DS R1 Is Not on Par With OA O1, and the Difference Is Qualitative, Not Quantitative: Long-Tail Benchmarks Reveal Gaps ”
https://joshuagans.substack.com/p/what-will-ai-do-to-presearch: “What Will AI Do to pre-Research/research? AI Makes Doing and Communicating Research Much Easier. Will There Be Any Point to It? ”
https://www.sciencedirect.com/science/article/pii/S0165176525000461: “The Efficient Market Hypothesis When Time Travel Is Possible ”
https://arxiv.org/abs/2502.06807#openai: “Competitive Programming With Large Reasoning Models ”
https://www.nature.com/articles/s41598-025-86510-0: “AI Language Model Rivals Expert Ethicist in Perceived Moral Expertise ”
https://arxiv.org/abs/2501.15654: “People Who Frequently Use ChatGPT for Writing Tasks Are Accurate and Robust Detectors of AI-Generated Text ”
2025-johri.pdf: “An Evaluation Framework for Clinical Use of Large Language Models in Patient Interaction Tasks ”
https://arxiv.org/abs/2411.13543: “BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games ”
https://arxiv.org/abs/2410.13893: “Can LLMs Be Scammed? A Baseline Measurement Study ”
https://arxiv.org/abs/2410.06992: “SWE-Bench+: Enhanced Coding Benchmark for LLMs ”
https://arxiv.org/abs/2410.07095#openai: “MLE-Bench: Evaluating Machine Learning Agents on Machine Learning Engineering ”
https://time.com/7026050/chatgpt-quit-teaching-ai-essay/: “I Quit Teaching Because of ChatGPT ”
https://www.wsj.com/tech/ai/alexandr-wang-scale-ai-d7c6efd7: “The 27-Year-Old Billionaire Whose Army Does AI’s Dirty Work: Alexandr Wang’s Scale AI Deploys Gig Workers around the Globe to Shape How the Big AI Models Behave § Labeler Fraud ”
https://dynomight.net/automated/: “Thoughts While Watching Myself Be Automated ”
https://arxiv.org/abs/2407.11969: “Does Refusal Training in LLMs Generalize to the Past Tense? ”
https://arxiv.org/abs/2407.04694: “Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs ”
https://arxiv.org/abs/2406.18518#salesforce: “APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets ”
https://arxiv.org/abs/2406.11233: “Probing the Decision Boundaries of In-Context Learning in Large Language Models ”
https://arxiv.org/abs/2405.18870#google: “LLMs Achieve Adult Human Performance on Higher-Order Theory of Mind Tasks ”
https://arxiv.org/abs/2405.15143: “Intelligent Go-Explore (IGE): Standing on the Shoulders of Giant Foundation Models ”
https://arxiv.org/abs/2405.15306: “DeTikZify: Synthesizing Graphics Programs for Scientific Figures and Sketches With TikZ ”
https://arxiv.org/abs/2405.15071: “Grokked Transformers Are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization ”
https://arxiv.org/abs/2405.10938: “Observational Scaling Laws and the Predictability of Language Model Performance ”
https://www.theverge.com/2024/5/13/24155652/chatgpt-voice-mode-gpt4o-upgrades: “ChatGPT Will Be Able to Talk to You like Scarlett Johansson in Her / Upgrades to ChatGPT’s Voice Mode Bring It Closer to the Vision of a Responsive AI Assistant—And Sam Altman Seems to Know It ”
https://arxiv.org/abs/2405.15793: “SWE-Agent: Agent-Computer Interfaces Enable Automated Software Engineering ”
https://arxiv.org/abs/2405.00332#scale: “GSM1k: A Careful Examination of Large Language Model Performance on Grade School Arithmetic ”
https://arxiv.org/abs/2404.10618: “Private Attribute Inference from Images With Vision-Language Models ”
https://arxiv.org/abs/2404.13076: “LLM Evaluators Recognize and Favor Their Own Generations ”
https://arxiv.org/abs/2404.07544: “From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples ”
https://www.wired.com/story/ai-chatbots-foia-requests-election-workers/: “Election Workers Are Drowning in Records Requests. AI Chatbots Could Make It Worse: Experts Worry That Election Deniers Could Weaponize Chatbots to Overwhelm and Slow down Local Officials ”
https://link.springer.com/article/10.1007/s10506-024-09396-9: “Re-Evaluating GPT-4’s Bar Exam Performance ”
https://www.wsj.com/tech/ai/a-peter-thiel-backed-ai-startup-cognition-labs-seeks-2-billion-valuation-998fa39d: “A Peter Thiel-Backed AI Startup, Cognition Labs, Seeks $2 Billion Valuation: Funding round Could Increase Startup’s Valuation Nearly Sixfold in a Matter of Weeks, Reflecting AI Frenzy ”
https://arxiv.org/abs/2403.18624: “Vulnerability Detection With Code Language Models: How Far Are We? ”
https://arxiv.org/abs/2403.18802#deepmind: “Long-Form Factuality in Large Language Models ”
https://www.bloomberg.com/news/articles/2024-03-12/cognition-ai-is-a-peter-thiel-backed-coding-assistant: “Gold-Medalist Coders Build an AI That Can Do Their Job for Them: A New Startup Called Cognition AI Can Turn a User’s Prompt into a Website or Video Game ”
https://arxiv.org/abs/2402.19450: “Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap ”
https://arxiv.org/abs/2402.14903: “Tokenization Counts: the Impact of Tokenization on Arithmetic in Frontier LLMs ”
https://arxiv.org/abs/2402.11753: “ArtPrompt: ASCII Art-Based Jailbreak Attacks against Aligned LLMs ”
https://arxiv.org/abs/2402.11349: “Tasks That Language Models Don’t Learn ”
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10894685/: “GPT-4 Passes the Bar Exam ”
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10936766/: “Large Language Models Are Able to Downplay Their Cognitive Abilities to Fit the Persona They Simulate ”
https://arxiv.org/abs/2312.08926: “PRER: Modeling Complex Mathematical Reasoning via Large Language Model Based MathAgent ”
2023-casal.pdf: “Can Linguists Distinguish between ChatGPT and Human Writing?: A Study of Research Ethics and Academic Publishing ”
https://arxiv.org/abs/2311.16452#microsoft: “Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine ”
https://arxiv.org/abs/2311.09247: “Comparing Humans, GPT-4, and GPT-4-V On Abstraction and Reasoning Tasks ”
https://arxiv.org/abs/2310.13014: “Large Language Model Prediction Capabilities: Evidence from a Real-World Forecasting Tournament ”
https://arxiv.org/abs/2310.08678: “Can GPT Models Be Financial Analysts? An Evaluation of ChatGPT and GPT-4 on Mock CFA Exams ”
2023-phillips.pdf: “Can a Computer Outfake a Human [Personality]? ”
https://arxiv.org/abs/2310.04406: “Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models ”
https://arxiv.org/abs/2310.03214#google: “FreshLLMs: Refreshing Large Language Models With Search Engine Augmentation ”
https://arxiv.org/abs/2310.01377: “UltraFeedback: Boosting Language Models With High-Quality Feedback ”
https://arxiv.org/abs/2309.12269: “The Cambridge Law Corpus: A Corpus for Legal AI Research ”
https://arxiv.org/abs/2309.12288: “The Reversal Curse: LLMs Trained on A-Is-B Fail to Learn B-Is-A ”
https://arxiv.org/abs/2309.04269: “From Sparse to Dense: GPT-4 Summarization With Chain of Density (CoD) Prompting ”
https://arxiv.org/abs/2308.12287: “Devising and Detecting Phishing: Large Language Models versus Smaller Human Models ”
https://arxiv.org/abs/2308.07921: “Solving Challenging Math Word Problems Using GPT-4 Code Interpreter With Code-Based Self-Verification ”
https://time.com/6301288/the-ai-jokes-that-give-me-nightmares/: “I’m a Screenwriter. These AI Jokes Give Me Nightmares ”
https://www.nytimes.com/2023/07/18/technology/openai-chatgpt-facial-recognition.html: “OpenAI Worries About What Its Chatbot Will Say About People’s Faces: An Advanced Version of ChatGPT Can Analyze Images and Is Already Helping the Blind. But Its Ability to Put a Name to a Face Is One Reason the Public Doesn’t Have Access to It ”
2024-banker.pdf: “Machine-Assisted Social Psychology Hypothesis Generation ”
https://arxiv.org/abs/2307.06439#microsoft: “Distilling Large Language Models for Biomedical Knowledge Extraction: A Case Study on Adverse Drug Events ”
https://arxiv.org/abs/2307.05300#microsoft: “Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration ”
https://arxiv.org/abs/2308.01404: “Hoodwinked: Deception and Cooperation in a Text-Based Game for Language Models ”
https://arxiv.org/abs/2306.15626: “LeanDojo: Theorem Proving With Retrieval-Augmented Language Models ”
https://arxiv.org/abs/2306.12587: “ARIES: A Corpus of Scientific Paper Edits Made in Response to Peer Reviews ”
https://arxiv.org/abs/2306.15448: “Understanding Social Reasoning in Language Models With Language Models ”
https://arxiv.org/abs/2305.20050#openai: “Let’s Verify Step by Step ”
https://arxiv.org/abs/2305.18354: “LLMs and the Abstraction and Reasoning Corpus: Successes, Failures, and the Importance of Object-Based Representations ”
https://arxiv.org/abs/2305.13534: “How Language Model Hallucinations Can Snowball ”
https://arxiv.org/abs/2305.06972: “Large Language Models Can Be Used To Effectively Scale Spear Phishing Campaigns ”
https://arxiv.org/abs/2304.11490: “Boosting Theory-Of-Mind Performance in Large Language Models via Prompting ”
https://www.medrxiv.org/content/10.1101/2023.03.24.23287731.full: “Performance of ChatGPT on Free-Response, Clinical Reasoning Exams ”
https://arxiv.org/abs/2304.02015#alibaba: “How Well Do Large Language Models Perform in Arithmetic Tasks? ”
https://arxiv.org/pdf/2303.08774#page=12&org=openai: “GPT-4 Technical Report § Limitations: Calibration ”
https://arxiv.org/abs/2302.14520: “Large Language Models Are State-Of-The-Art Evaluators of Translation Quality ”
https://arxiv.org/abs/2302.12173: “Not What You’ve Signed up For: Compromising Real-World LLM-Integrated Applications With Indirect Prompt Injection ”
https://techcrunch.com/2022/11/23/harvey-which-uses-ai-to-answer-legal-questions-lands-cash-from-openai/: “Harvey, Which Uses AI to Answer Legal Questions, Lands Cash from OpenAI ”