Abs-E (or, speak only in the positive) § text2epositive.py experiment
BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
Business Spending on AI Surged 500% This Year to $13.8 Billion
Hidden Persuaders: LLMs’ Political Leaning and Their Influence on Voters
SimpleStrat: Diversifying Language Model Generation with Stratification
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making
When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1
Invisible Unicode Text That AI Chatbots Understand and Humans Can’t? Yep, It’s a Thing
Evaluation of OpenAI o1: Opportunities and Challenges of AGI
That Message From Your Doctor? It May Have Been Drafted by ChatGPT-4
LLMs Still Can’t Plan; Can LRMs? A Preliminary Evaluation of OpenAI’s o1 on PlanBench
I Have Played a Little Bit With OpenAI’s New Iteration, GPT-4 O1
Does Refusal Training in LLMs Generalize to the Past Tense?
GPT-4 is judged more human than humans in displaced and inverted Turing tests
Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs
Are Large Language Models Consistent over Value-laden Questions?
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets
A real-world test of artificial intelligence infiltration of a university examinations system: A ‘Turing Test’ case study
Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data
OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI
What Are the Odds? Language Models Are Capable of Probabilistic Reasoning
Probing the Decision Boundaries of In-context Learning in Large Language Models
GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
LLMs achieve adult human performance on higher-order theory of mind tasks
Intelligent Go-Explore (IGE): Standing on the Shoulders of Giant Foundation Models
DeTikZify: Synthesizing Graphics Programs for Scientific Figures and Sketches with TikZ
DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data
Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
Can Language Models Explain Their Own Classification Behavior?
ChatGPT will be able to talk to you like Scarlett Johansson in Her / Upgrades to ChatGPT’s voice mode bring it closer to the vision of a responsive AI assistant—and Sam Altman seems to know it
GSM1k: A Careful Examination of Large Language Model Performance on Grade School Arithmetic
Aligning LLM Agents by Learning Latent Preference from User Edits
Automated Social Science: Language Models as Scientist and Subjects
Enhancing Confidence Expression in Large Language Models Through Learning from Past Experience
Do LLMs Play Dice? Exploring Probability Distribution Sampling in Large Language Models for Behavioral Simulation
From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples
Election Workers Are Drowning in Records Requests. AI Chatbots Could Make It Worse: Experts worry that election deniers could weaponize chatbots to overwhelm and slow down local officials
Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models
FABLES: Evaluating faithfulness and content selection in book-length summarization
A Peter Thiel-Backed AI Startup, Cognition Labs, Seeks $2 Billion Valuation: Funding round could increase startup’s valuation nearly sixfold in a matter of weeks, reflecting AI frenzy
Vulnerability Detection with Code Language Models: How Far Are We?
Gold-Medalist Coders Build an AI That Can Do Their Job for Them: A new startup called Cognition AI can turn a user’s prompt into a website or video game
Playing NetHack with LLMs: Potential & Limitations as Zero-Shot Agents (NetPlay)
Teaching Large Language Models an Unseen Language on the Fly
Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap
Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs
ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
Using Counterfactual Tasks to Evaluate the Generality of Analogical Reasoning in Large Language Models
The Non-Effect of Sampling Temperature on Problem Solving in GPT-3.5/GPT-4
I Think, Therefore I am: Benchmarking Awareness of Large Language Models Using AwareBench
Better Call GPT, Comparing Large Language Models Against Lawyers
I am a Strange Dataset: Metalinguistic Tests for Language Models
GPT-4-V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation
Leveraging Large Language Models to Boost Dafny’s Developers Productivity
Testing Theory of Mind in Large Language Models and Humans
Large language models are able to downplay their cognitive abilities to fit the persona they simulate
WaveCoder: Widespread And Versatile Enhanced Instruction Tuning with Refined Data Generation
PRER: Modeling Complex Mathematical Reasoning via Large Language Model based MathAgent
Can linguists distinguish between ChatGPT and human writing?: A study of research ethics and academic publishing
Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine
Llamas Know What GPTs Don’t Show: Surrogate Models for Confidence Estimation
Comparing Humans, GPT-4, and GPT-4-V On Abstraction and Reasoning Tasks
In Search of the Long-Tail: Systematic Generation of Long-Tail Inferential Knowledge via Logical Rule Guided Search
The Impact of Large Language Models on Scientific Discovery: a Preliminary Study using GPT-4
Accuracy of a Vision-Language Model on Challenging Medical Cases
Large Language Models can Strategically Deceive their Users when Put Under Pressure
Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves
FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions
Branch-Solve-Merge Improves Large Language Model Evaluation and Generation
Eureka: Human-Level Reward Design via Coding Large Language Models
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4-V
Large Language Model Prediction Capabilities: Evidence from a Real-World Forecasting Tournament
Can GPT models be Financial Analysts? An Evaluation of ChatGPT and GPT-4 on mock CFA Exams
Large language models can replicate cross-cultural differences in personality
Beyond Memorization: Violating Privacy Via Inference with Large Language Models
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models
FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation
Police Officers Are Starting to Use AI to Write Crime Reports
Can large language models provide useful feedback on research papers? A large-scale empirical analysis
An evolutionary model of personality traits related to cooperative behavior using a large language model
UltraFeedback: Boosting Language Models with High-quality Feedback
MTOB: A Benchmark for Learning to Translate a New Language from One Grammar Book
Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve
The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
From Sparse to Dense: GPT-4 Summarization with Chain of Density (CoD) Prompting
Devising and Detecting Phishing: Large Language Models vs. Smaller Human Models
LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models
Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification
OpenAI Cribbed Our Tax Example, But Can GPT-4 Really Do Tax?
Testing GPT-4 with Wolfram Alpha and Code Interpreter plug-ins on math and science problems
The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain
OpenAI Worries About What Its Chatbot Will Say About People’s Faces: An advanced version of ChatGPT can analyze images and is already helping the blind. But its ability to put a name to a face is one reason the public doesn’t have access to it
GPT-4, an artificial intelligence large language model, exhibits high levels of accuracy on dermatology specialty certificate exam questions
Distilling Large Language Models for Biomedical Knowledge Extraction: A Case Study on Adverse Drug Events
Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration
Explaining Competitive-Level Programming Solutions using LLMs
Hoodwinked: Deception and Cooperation in a Text-Based Game for Language Models
LeanDojo: Theorem Proving with Retrieval-Augmented Language Models
ARIES: A Corpus of Scientific Paper Edits Made in Response to Peer Reviews
Understanding Social Reasoning in Language Models with Language Models
Evaluating the Robustness of Text-to-image Diffusion Models against Real-world Attacks
Large Language Models as Tax Attorneys: A Case Study in Legal Capabilities Emergence
Can large language models democratize access to dual-use biotechnology?
LLMs and the Abstraction and Reasoning Corpus: Successes, Failures, and the Importance of Object-based Representations
Learning to Generate Novel Scientific Directions with Contextualized Literature-based Discovery
WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia
C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models
Large Language Models Can Be Used To Effectively Scale Spear Phishing Campaigns
Boosting Theory-of-Mind Performance in Large Language Models via Prompting
Today was the first day that I could definitively say that GPT-4 has saved me a substantial amount of tedious work
Humans in Humans Out: On GPT Converging Toward Common Sense in both Success and Failure
Advances in apparent conceptual physics reasoning in GPT-4
Performance of ChatGPT on free-response, clinical reasoning exams
Reflexion: Language Agents with Verbal Reinforcement Learning
How well do Large Language Models perform in Arithmetic tasks?
Salesforce Announces Einstein GPT, the World’s First Generative AI for CRM
Large Language Models Are State-of-the-Art Evaluators of Translation Quality
Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
Harvey, which uses AI to answer legal questions, lands cash from OpenAI
A Basic Test of OpenAI’s Structured Output Feature against Financial Disclosure Reports and a Newspaper’s Police Blotter
There’s a Running Theme in Here of Programming Problems LLMs Solve Where It’s...
Situational Awareness and Out-Of-Context Reasoning § GPT-4-Base Has Non-Zero Longform Performance
Connecting the Dots: LLMs Can Infer & Verbalize Latent Structure from Training Data
AI Will Increase the Quantity—And Quality—Of Phishing Scams
[‘Fourier Components’-Style Literary Criticism by GPT-4 O1]
https://answers.microsoft.com/en-us/bing/forum/all/this-ai-chatbot-sidney-is-misbehaving/e3d6a29f-06c9-441c-bc7d-51a68e856761?page=1
https://betterprogramming.pub/the-dark-side-of-llms-we-need-to-rethink-large-language-models-now-6212aca0581a
https://blog.matteskridge.com/business/gpt4-and-silicon-valley-bank/2023/03/19/
https://blog.mentat.ai/benchmarking-gpt-4-turbo-a-cautionary-tale
https://blog.nawaz.org/posts/2024/Jan/llm-assisted-moderation/
https://chat.openai.com/share/04add58f-2052-4b60-ae2a-ab708c29088f
https://chatgpt.com/share/312e82f0-cc5e-47f3-b368-b2c0c0f4ad3f
https://clarifycapital.com/the-future-of-investment-pitching
https://cookbook.openai.com/examples/tag_caption_images_with_gpt4v
https://finedataproducts.com/posts/2024-03-10-tax-scenarios-with-ai/
https://generallyintelligent.substack.com/p/fine-tuning-mistral-7b-on-magic-the
https://gist.github.com/Jessime/63f93215faed6f7109c6d62b7fef7fbc
https://gist.github.com/harryaskham/68a611bef777525991790bca2f2d324d
https://github.blog/2023-11-08-universe-2023-copilot-transforms-github-into-the-ai-powered-developer-platform/
https://github.com/E-xyza/Exonerate/blob/master/bench/reports/gpt-bench.md
https://github.com/jujumilk3/leaked-system-prompts/blob/main/microsoft-bing-chat_20230209.md
https://github.com/jujumilk3/leaked-system-prompts/blob/main/openai-assistants-api_20231106.md
https://github.com/jujumilk3/leaked-system-prompts/blob/main/openai-chatgpt-ios_20230614.md
https://github.com/jujumilk3/leaked-system-prompts/blob/main/openai-chatgpt4-android_20240207.md
https://github.com/jujumilk3/leaked-system-prompts/blob/main/openai-chatgpt_20221201.md
https://github.com/kagisearch/llm-chess-puzzles?tab=readme-ov-file#results
https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2812620
https://kenkantzer.com/lessons-after-a-half-billion-gpt-tokens/
https://koenvangilst.nl/blog/keeping-code-complexity-in-check
https://lemire.me/blog/2023/03/22/can-gpt-pass-my-programming-courses/
https://marginalrevolution.com/marginalrevolution/2023/10/goat-who-is-the-greatest-economist-of-all-time-and-why-does-it-matter.html
https://matthewbarnett.substack.com/p/gpt-4-takes-bryan-caplans-midterm
https://mazzzystar.github.io/2023/05/10/LLM-for-individual/
https://micahflee.com/2023/04/capturing-the-flag-with-gpt-4/
https://openai.com/blog/function-calling-and-other-api-updates#function-calling
https://openai.com/index/introducing-structured-outputs-in-the-api/#_5PYjnV1iAHOPKPupDztdZk
https://paperswithcode.com/sota/math-word-problem-solving-on-math
https://platform.openai.com/docs/guides/reasoning/how-reasoning-works
https://pslusarz.github.io/articles/2023/12/22/compare-ocr-tesseract-gpt4-nara-rolls.html
https://statmodeling.stat.columbia.edu/2023/04/18/chatgpt4-writes-stan-code-so-i-dont-have-to/
https://statmodeling.stat.columbia.edu/2023/08/20/bob-carpenter-thinks-gpt-4-is-awesome/
https://terrytao.wordpress.com/about/ai-generated-versions-of-the-ai-anthology-article/
https://villekuosmanen.medium.com/i-played-chess-against-chatgpt-4-and-lost-c5798a9049ca
https://web.archive.org/web/20230529224700/https://chat.openai.com/share/eef34fe5-0c8e-4595-9c28-2e9f05f05393
https://www.betonit.ai/p/gpt-4-takes-a-new-midterm-and-gets
https://www.construction-physics.com/p/could-chatgpt-become-an-architect
https://www.economist.com/business/2024/02/29/how-businesses-are-actually-using-generative-ai
https://www.euractiv.com/section/politics/news/albania-to-speed-up-eu-accession-using-chatgpt/
https://www.geoffreylitt.com/2023/03/25/llm-end-user-programming
https://www.lesswrong.com/posts/75o8oja43LXGAqbAR/palm-2-and-gpt-4-in-extrapolating-gpt-n-performance
https://www.lesswrong.com/posts/ChtGdxk9mwZ2Rxogt/smartyheadercode-anomalous-tokens-for-gpt3-5-and-gpt-4-1
https://www.lesswrong.com/posts/CkhJAxHeyFCg2EcET/are-language-models-good-at-making-predictions
https://www.lesswrong.com/posts/F6vH6fr8ngo7csDdf/chess-as-a-case-study-in-hidden-capabilities-in-chatgpt
https://www.lesswrong.com/posts/KSroBnxCHodGmPPJ8/jailbreaking-gpt-4-s-code-interpreter
https://www.lesswrong.com/posts/Z4tBreNCxnppoPLtd/gpts-ability-to-keep-a-secret-is-weirdly-prompt-dependent
https://www.lesswrong.com/posts/bNCDexejSZpkuu3yz/you-can-use-gpt-4-to-create-prompt-injections-against-gpt-4
https://www.lesswrong.com/posts/zyPaqXgFzqHkQfccq/contra-lecun-on-autoregressive-llms-are-doomed?commentId=fXGn2E8RMdwhKqwrE
https://www.malwarebytes.com/blog/threat-intelligence/2023/09/malicious-ad-served-inside-bing-ai-chatbot
https://www.oneusefulthing.org/p/it-is-starting-to-get-strange
https://www.oneusefulthing.org/p/setting-time-on-fire-and-the-temptation
https://www.reddit.com/r/ApplyingToCollege/comments/1h0vhlq/in_the_past_three_days_ive_reviewed_over_100/
https://www.reddit.com/r/ChatGPT/comments/12a0ajb/i_gave_gpt4_persistent_memory_and_the_ability_to/
https://www.reddit.com/r/ExperiencedDevs/comments/11y8hys/chatgpt_resumes_accounted_for_30_of_the_ones_we/
https://www.reddit.com/r/GPT3/comments/12ez822/neurosemantical_inversitis_prompt_still_works/
https://www.reddit.com/r/MachineLearning/comments/18u31w8/r_large_language_models_world_chess_championship/
https://www.reddit.com/r/OpenAI/comments/1fxa6d6/two_purported_instances_of_o1preview_and_o1mini/
https://www.reddit.com/r/OpenAI/comments/1gjj430/o1_preview_got_weird_today/
https://www.reddit.com/r/PromptEngineering/comments/1fj6h13/hallucinations_in_o1preview_reasoning/
https://www.reddit.com/r/bing/comments/110eagl/the_customer_service_of_the_new_bing_chat_is/
https://www.reddit.com/r/duolingo/comments/18sx06i/big_layoff_at_duolingo/
https://www.reddit.com/r/freelanceWriters/comments/12ff5mw/it_happened_to_me_today/
https://www.reddit.com/r/mlscaling/comments/1gyb54z/the_fate_of_gpt4o/
https://www.reddit.com/r/singularity/comments/1atjz9v/ive_put_a_complex_codebase_into_a_single/
https://www.reddit.com/r/slatestarcodex/comments/1201v68/10word_quote_a_short_and_simple_failure_mode_of/jdigzkh/?context=3
https://www.supersimple.io/blog/gpt-4-fine-tuning-early-access
https://www.thebigquestions.com/2023/04/05/gpt-4-fails-economics/
https://www.thendobetter.com/investing/2023/6/9/tyler-cowen-hayek-lecture-on-economics-ai-and-large-langauge-models
https://www.theverge.com/2023/2/15/23599072/microsoft-ai-bing-personality-conversations-spy-employees-webcams
https://www.vice.com/en/article/v7begx/overemployed-hustlers-exploit-chatgpt-to-take-on-even-more-full-time-jobs
https://arxiv.org/abs/2410.07095#openai
https://time.com/7026050/chatgpt-quit-teaching-ai-essay/
https://arxiv.org/abs/2406.18518#salesforce
https://arxiv.org/abs/2405.18870#google
Jeff Clune—Professor—Computer Science—University of British Columbia
https://www.theverge.com/2024/5/13/24155652/chatgpt-voice-mode-gpt4o-upgrades
https://arxiv.org/abs/2405.00332#scale
https://www.wired.com/story/ai-chatbots-foia-requests-election-workers/
https://link.springer.com/article/10.1007/s10506-024-09396-9
https://www.wsj.com/tech/ai/a-peter-thiel-backed-ai-startup-cognition-labs-seeks-2-billion-valuation-998fa39d
https://arxiv.org/abs/2403.18802#deepmind
https://www.bloomberg.com/news/articles/2024-03-12/cognition-ai-is-a-peter-thiel-backed-coding-assistant
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10894685/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10936766/
/doc/ai/nn/transformer/gpt/4/nonfiction/2023-casal.pdf
https://arxiv.org/abs/2311.16452#microsoft
/doc/psychology/personality/2023-phillips.pdf
https://arxiv.org/abs/2310.03214#google
https://time.com/6301288/the-ai-jokes-that-give-me-nightmares/
https://www.nytimes.com/2023/07/18/technology/openai-chatgpt-facial-recognition.html
/doc/ai/nn/transformer/gpt/3/nonfiction/2024-banker.pdf
https://arxiv.org/abs/2307.06439#microsoft
https://arxiv.org/abs/2307.05300#microsoft
https://arxiv.org/abs/2305.20050#openai
https://www.medrxiv.org/content/10.1101/2023.03.24.23287731.full
https://arxiv.org/abs/2304.02015#alibaba
https://arxiv.org/pdf/2303.08774#page=12&org=openai
https://techcrunch.com/2022/11/23/harvey-which-uses-ai-to-answer-legal-questions-lands-cash-from-openai/
Wikipedia Bibliography: