‘PaLM 2’ directory
- See Also
- Gwern
- Links
- “VideoGameBench: Can Vision-Language Models Complete Popular Video Games?”, Zhang et al 2025
- “Google Announces AI Ultra Subscription Plan: $250⧸month [Deep Think in Gemini-2.5-Pro Etc]”, Ben-Yair 2025
- “Gemini Diffusion LLM”, Google 2025
- “Google I/O 2025 Keynote”, Pichai 2025
- “RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics”, Zhang et al 2025
- “RealMath [Code]”, Zhang et al 2025
- “Advancing the Frontier of Video Understanding With Gemini 2.5”, Baddepudi et al 2025
- “Gemini 2.5 Pro Preview: Even Better Coding Performance”, Kilpatrick 2025
- “Evaluating Frontier Models for Stealth and Situational Awareness”, Phuong et al 2025
- “Gemini-2.5-Pro System Prompt”, Liberator 2025
- “Is Google Gemini-2.5-Pro Now Better Than Claude at Pokémon? [Probably]”, Bradshaw 2025
- “AI-Slop to AI-Polish? Aligning Language Models through Edit-Based Writing Rewards and Test-Time Computation”, Chakrabarty et al 2025
- “Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad”, Petrov et al 2025
- “Putting Gemini 2.5 Pro through Its Paces”, Willison 2025
- “Gemini 2.5: Our Newest Gemini Model With Thinking”, Google 2025
- “Fiction.live: LiveBench Results, 25 February 2025: Real-World Long Context Benchmark for Writers”
- “Spontaneous Giving and Calculated Greed in Language Models”, Li & Shirado 2025
- “Idiosyncrasies in Large Language Models”, Sun et al 2025
- “VLMs As GeoGuessr Masters: Exceptional Performance, Hidden Biases, and Privacy Risks”, Huang et al 2025
- “SycEval: Evaluating LLM Sycophancy”, Fanous et al 2025
- “Lost in Time: Clock and Calendar Understanding Challenges in Multimodal LLMs”, Saxena et al 2025
- “How Different LLMs Answered the PhilPapers 2020 Survey”, Satron 2025
- “Ingesting Millions of PDFs and Why Gemini 2.0 Changes Everything”, Filimonov 2025
- “Proactive Agents for Multi-Turn Text-To-Image Generation Under Uncertainty”, Hahn et al 2024
- “Frontier Models Are Capable of In-Context Scheming”, Meinke et al 2024
- “Frontier Models Are Capable of In-Context Scheming”, Hobbhahn et al 2024
- “Alphabet Q3 Earnings Call: CEO Sundar Pichai’s Remarks”
- “AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents”, Andriushchenko et al 2024
- “Scalable Watermarking for Identifying Large Language Model Outputs”
- “Inference Scaling for Long-Context Retrieval Augmented Generation”, Yue et al 2024
- “Project Zero: From Naptime to Big Sleep: Using Large Language Models To Catch Vulnerabilities In Real-World Code”
- “Training Language Models to Self-Correct via Reinforcement Learning”, Kumar et al 2024
- “On Scalable Oversight With Weak LLMs Judging Strong LLMs”, Kenton et al 2024
- “Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?”, Lee et al 2024
- “What Are the Odds? Language Models Are Capable of Probabilistic Reasoning”, Paruchuri et al 2024
- “Can Language Models Use Forecasting Strategies?”, Pratt et al 2024
- “Grokked Transformers Are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization”, Wang et al 2024
- “Analyzing Poems With LLMs”, Toper 2024
- “Stochastic Lies: How LLM-Powered Chatbots Deal With Russian Disinformation about the War in Ukraine”
- “Many-Shot In-Context Learning”, Agarwal et al 2024
- “PhyloLM: Inferring the Phylogeny of Large Language Models and Predicting Their Performances in Benchmarks”, Yax et al 2024
- “Few-Shot Recalibration of Language Models”, Li et al 2024
- “Long-Form Factuality in Large Language Models”, Wei et al 2024
- “Don’t Trust: Verify—Grounding LLM Quantitative Reasoning With Autoformalization”, Zhou et al 2024
- “When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method”, Zhang et al 2024
- “Gemini: A Family of Highly Capable Multimodal Models”, Team et al 2023
- “ReST Meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent”, Aksitov et al 2023
- “Rich Human Feedback for Text-To-Image Generation”, Liang et al 2023
- “Beyond Human Data: Scaling Self-Training for Problem-Solving With Language Models (ReSTEM)”, Singh et al 2023
- “Universal Self-Consistency for Large Language Model Generation”, Chen et al 2023
- “Instruction-Following Evaluation for Large Language Models”, Zhou et al 2023
- “A Systematic Comparison of Syllogistic Reasoning in Humans and Language Models”, Eisape et al 2023
- “PAIR: Jailbreaking Black Box Large Language Models in 20 Queries”, Chao et al 2023
- “RLAIF: Scaling Reinforcement Learning from Human Feedback With AI Feedback”, Lee et al 2023
- “Android in the Wild: A Large-Scale Dataset for Android Device Control”, Rawles et al 2023
- “PaLM 2 Technical Report”, Anil et al 2023
- “Google’s Newest AI Model Uses Nearly 5× More Text Data for Training Than Its Predecessor”, Elias 2023
- “Pretraining Language Models With Human Preferences”, Korbak et al 2023
- “Working With AI (Part 2): Code Conversion”
- “Adversarial Misuse of Generative AI”
- “George Tucker Homepage”
- “How Good Are LLMs at Doing ML on an Unknown Dataset?”
- “What Happened to BERT & T5? On Transformer Encoders, PrefixLM and Denoising Objectives”, Tay 2025
- Sort By Magic
- Wikipedia (1)
- Miscellaneous
- Bibliography
See Also
Gwern
“Bell, Crow, Moon: 11 Variations”, Gwern et al 2025
Links
“VideoGameBench: Can Vision-Language Models Complete Popular Video Games?”, Zhang et al 2025
“Google Announces AI Ultra Subscription Plan: $250⧸month [Deep Think in Gemini-2.5-Pro Etc]”, Ben-Yair 2025
“Gemini Diffusion LLM”, Google 2025
“Google I/O 2025 Keynote”, Pichai 2025
“RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics”, Zhang et al 2025
“RealMath [Code]”, Zhang et al 2025
“Advancing the Frontier of Video Understanding With Gemini 2.5”, Baddepudi et al 2025
“Gemini 2.5 Pro Preview: Even Better Coding Performance”, Kilpatrick 2025
“Evaluating Frontier Models for Stealth and Situational Awareness”, Phuong et al 2025
“Gemini-2.5-Pro System Prompt”, Liberator 2025
“Is Google Gemini-2.5-Pro Now Better Than Claude at Pokémon? [Probably]”, Bradshaw 2025
“AI-Slop to AI-Polish? Aligning Language Models through Edit-Based Writing Rewards and Test-Time Computation”, Chakrabarty et al 2025
“Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad”, Petrov et al 2025
“Putting Gemini 2.5 Pro through Its Paces”, Willison 2025
“Gemini 2.5: Our Newest Gemini Model With Thinking”, Google 2025
“Fiction.live: LiveBench Results, 25 February 2025: Real-World Long Context Benchmark for Writers”
“Spontaneous Giving and Calculated Greed in Language Models”, Li & Shirado 2025
“Idiosyncrasies in Large Language Models”, Sun et al 2025
“VLMs As GeoGuessr Masters: Exceptional Performance, Hidden Biases, and Privacy Risks”, Huang et al 2025
“SycEval: Evaluating LLM Sycophancy”, Fanous et al 2025
“Lost in Time: Clock and Calendar Understanding Challenges in Multimodal LLMs”, Saxena et al 2025
“How Different LLMs Answered the PhilPapers 2020 Survey”, Satron 2025
“Ingesting Millions of PDFs and Why Gemini 2.0 Changes Everything”, Filimonov 2025
“Proactive Agents for Multi-Turn Text-To-Image Generation Under Uncertainty”, Hahn et al 2024
“Frontier Models Are Capable of In-Context Scheming”, Meinke et al 2024
“Frontier Models Are Capable of In-Context Scheming”, Hobbhahn et al 2024
“Alphabet Q3 Earnings Call: CEO Sundar Pichai’s Remarks”
“AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents”, Andriushchenko et al 2024
“Scalable Watermarking for Identifying Large Language Model Outputs”
“Inference Scaling for Long-Context Retrieval Augmented Generation”, Yue et al 2024
“Project Zero: From Naptime to Big Sleep: Using Large Language Models To Catch Vulnerabilities In Real-World Code”
“Training Language Models to Self-Correct via Reinforcement Learning”, Kumar et al 2024
“On Scalable Oversight With Weak LLMs Judging Strong LLMs”, Kenton et al 2024
“Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?”, Lee et al 2024
“What Are the Odds? Language Models Are Capable of Probabilistic Reasoning”, Paruchuri et al 2024
“Can Language Models Use Forecasting Strategies?”, Pratt et al 2024
“Grokked Transformers Are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization”, Wang et al 2024
“Analyzing Poems With LLMs”, Toper 2024
“Stochastic Lies: How LLM-Powered Chatbots Deal With Russian Disinformation about the War in Ukraine”
“Many-Shot In-Context Learning”, Agarwal et al 2024
“PhyloLM: Inferring the Phylogeny of Large Language Models and Predicting Their Performances in Benchmarks”, Yax et al 2024
“Few-Shot Recalibration of Language Models”, Li et al 2024
“Long-Form Factuality in Large Language Models”, Wei et al 2024
“Don’t Trust: Verify—Grounding LLM Quantitative Reasoning With Autoformalization”, Zhou et al 2024
“When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method”, Zhang et al 2024
“Gemini: A Family of Highly Capable Multimodal Models”, Team et al 2023
“ReST Meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent”, Aksitov et al 2023
“Rich Human Feedback for Text-To-Image Generation”, Liang et al 2023
“Beyond Human Data: Scaling Self-Training for Problem-Solving With Language Models (ReSTEM)”, Singh et al 2023
“Universal Self-Consistency for Large Language Model Generation”, Chen et al 2023
“Instruction-Following Evaluation for Large Language Models”, Zhou et al 2023
“A Systematic Comparison of Syllogistic Reasoning in Humans and Language Models”, Eisape et al 2023
“PAIR: Jailbreaking Black Box Large Language Models in 20 Queries”, Chao et al 2023
“RLAIF: Scaling Reinforcement Learning from Human Feedback With AI Feedback”, Lee et al 2023
“Android in the Wild: A Large-Scale Dataset for Android Device Control”, Rawles et al 2023
“PaLM 2 Technical Report”, Anil et al 2023
“Google’s Newest AI Model Uses Nearly 5× More Text Data for Training Than Its Predecessor”, Elias 2023
“Pretraining Language Models With Human Preferences”, Korbak et al 2023
“Working With AI (Part 2): Code Conversion”
“Adversarial Misuse of Generative AI”
“George Tucker Homepage”
“How Good Are LLMs at Doing ML on an Unknown Dataset?”
“What Happened to BERT & T5? On Transformer Encoders, PrefixLM and Denoising Objectives”, Tay 2025
Sort By Magic
Annotations sorted by machine learning into inferred 'tags'. This provides an alternative way to browse: instead of by date order, one can browse in topic order. The 'sorted' list has been automatically clustered into multiple sections & auto-labeled for easier browsing.
Beginning with the newest annotation, it uses each annotation’s embedding to find its nearest-neighbor annotations, creating a progression of topics. For more details, see the link.
reasoning-benchmark
harmful-benchmark
feedback-scaling
Wikipedia (1)
Miscellaneous
Bibliography
https://arxiv.org/abs/2503.21934: “Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad”
https://arxiv.org/abs/2412.06771#deepmind: “Proactive Agents for Multi-Turn Text-To-Image Generation Under Uncertainty”
https://arxiv.org/abs/2406.13121#google: “Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?”
https://arxiv.org/abs/2405.15071: “Grokked Transformers Are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization”
https://arxiv.org/abs/2403.18802#deepmind: “Long-Form Factuality in Large Language Models”
https://arxiv.org/abs/2403.18120#google: “Don’t Trust: Verify—Grounding LLM Quantitative Reasoning With Autoformalization”
https://arxiv.org/abs/2312.06585#deepmind: “Beyond Human Data: Scaling Self-Training for Problem-Solving With Language Models (ReSTEM)”
https://arxiv.org/abs/2310.08419: “PAIR: Jailbreaking Black Box Large Language Models in 20 Queries”
https://arxiv.org/abs/2305.10403#google: “PaLM 2 Technical Report”
https://www.cnbc.com/2023/05/16/googles-palm-2-uses-nearly-five-times-more-text-data-than-predecessor.html: “Google’s Newest AI Model Uses Nearly 5× More Text Data for Training Than Its Predecessor”