Bibliography:

  1. ‘GPT-4’ tag

  2. Abs-E (or, speak only in the positive) § text2epositive.py experiment

  3. text2epositive.py

  4. date-Guesser.py

  5. paragraphizer.py

  6. CQK Is The First Unused TLA

  7. O1 Turns Pro

  8. BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games

  9. Business Spending on AI Surged 500% This Year to $13.8 Billion

  10. Generative Agent Simulations of 1,000 People

  11. Hidden Persuaders: LLMs’ Political Leaning and Their Influence on Voters

  12. Can LLMs be Scammed? A Baseline Measurement Study

  13. SimpleStrat: Diversifying Language Model Generation with Stratification

  14. MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

  15. Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making

  16. Can OpenAI’s o1-Preview Ace the 2023 Putnam Exam?

  17. When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1

  18. Invisible Unicode Text That AI Chatbots Understand and Humans Can’t? Yep, It’s a Thing

  19. I Quit Teaching Because of ChatGPT

  20. Evaluation of OpenAI o1: Opportunities and Challenges of AGI

  21. That Message From Your Doctor? It May Have Been Drafted by ChatGPT-4

  22. LLMs Still Can’t Plan; Can LRMs? A Preliminary Evaluation of OpenAI’s o1 on PlanBench

  23. I Have Played a Little Bit With OpenAI’s New Iteration, GPT-4 O1

  24. c662a08720743b3e7eef8a746ca31e4ca6eafc85.html

  25. Thoughts while watching myself be automated

  26. Generative AI Can Harm Learning

  27. Does Refusal Training in LLMs Generalize to the Past Tense?

  28. GPT-4 is judged more human than humans in displaced and inverted Turing tests

  29. On scalable oversight with weak LLMs judging strong LLMs

  30. Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs

  31. Are Large Language Models Consistent over Value-laden Questions?

  32. Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation

  33. APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets

  34. A real-world test of artificial intelligence infiltration of a university examinations system: A ‘Turing Test’ case study

  35. Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data

  36. OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI

  37. What Are the Odds? Language Models Are Capable of Probabilistic Reasoning

  38. Probing the Decision Boundaries of In-context Learning in Large Language Models

  39. Development cost of ARC GPT-4o prototype

  40. GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents

  41. Are We Done with MMLU?

  42. Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

  43. LLMs achieve adult human performance on higher-order theory of mind tasks

  44. Intelligent Go-Explore (IGE): Standing on the Shoulders of Giant Foundation Models

  45. DeTikZify: Synthesizing Graphics Programs for Scientific Figures and Sketches with TikZ

  46. DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data

  47. Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization

  48. Can Language Models Explain Their Own Classification Behavior?

  49. ChatGPT will be able to talk to you like Scarlett Johansson in Her / Upgrades to ChatGPT’s voice mode bring it closer to the vision of a responsive AI assistant—and Sam Altman seems to know it

  50. GSM1k: A Careful Examination of Large Language Model Performance on Grade School Arithmetic

  51. Aligning LLM Agents by Learning Latent Preference from User Edits

  52. Automated Social Science: Language Models as Scientist and Subjects

  53. Enhancing Confidence Expression in Large Language Models Through Learning from Past Experience

  54. LLM Evaluators Recognize and Favor Their Own Generations

  55. Do LLMs Play Dice? Exploring Probability Distribution Sampling in Large Language Models for Behavioral Simulation

  56. Is ChatGPT Transforming Academics’ Writing Style?

  57. From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples

  58. Election Workers Are Drowning in Records Requests. AI Chatbots Could Make It Worse: Experts worry that election deniers could weaponize chatbots to overwhelm and slow down local officials

  59. Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models

  60. FABLES: Evaluating faithfulness and content selection in book-length summarization

  61. Re-evaluating GPT-4’s bar exam performance

  62. A Peter Thiel-Backed AI Startup, Cognition Labs, Seeks $2 Billion Valuation: Funding round could increase startup’s valuation nearly sixfold in a matter of weeks, reflecting AI frenzy

  63. Vulnerability Detection with Code Language Models: How Far Are We?

  64. Long-form factuality in large language models

  65. Gold-Medalist Coders Build an AI That Can Do Their Job for Them: A new startup called Cognition AI can turn a user’s prompt into a website or video game

  66. Playing NetHack with LLMs: Potential & Limitations as Zero-Shot Agents (NetPlay)

  67. Teaching Large Language Models an Unseen Language on the Fly

  68. Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap

  69. Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs

  70. ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs

  71. Tasks That Language Models Don’t Learn

  72. Using Counterfactual Tasks to Evaluate the Generality of Analogical Reasoning in Large Language Models

  73. The Non-Effect of Sampling Temperature on Problem Solving in GPT-3.5/GPT-4

  74. I Think, Therefore I am: Benchmarking Awareness of Large Language Models Using AwareBench

  75. Better Call GPT, Comparing Large Language Models Against Lawyers

  76. I am a Strange Dataset: Metalinguistic Tests for Language Models

  77. GPT-4-V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation

  78. A Vision Check-up for Language Models

  79. Leveraging Large Language Models to Boost Dafny’s Developers Productivity

  80. Originality Dies When Being Average Is Easier

  81. Testing Theory of Mind in Large Language Models and Humans

  82. GPT-4 passes the bar exam

  83. Large language models are able to downplay their cognitive abilities to fit the persona they simulate

  84. WaveCoder: Widespread And Versatile Enhanced Instruction Tuning with Refined Data Generation

  85. PRER: Modeling Complex Mathematical Reasoning via Large Language Model based MathAgent

  86. Can linguists distinguish between ChatGPT and human writing?: A study of research ethics and academic publishing

  87. Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine

  88. GPQA: A Graduate-Level Google-Proof Q&A Benchmark

  89. GPT-4-V Optical Illusion

  90. Llamas Know What GPTs Don’t Show: Surrogate Models for Confidence Estimation

  91. Comparing Humans, GPT-4, and GPT-4-V On Abstraction and Reasoning Tasks

  92. In Search of the Long-Tail: Systematic Generation of Long-Tail Inferential Knowledge via Logical Rule Guided Search

  93. The Impact of Large Language Models on Scientific Discovery: a Preliminary Study using GPT-4

  94. Accuracy of a Vision-Language Model on Challenging Medical Cases

  95. Large Language Models can Strategically Deceive their Users when Put Under Pressure

  96. Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves

  97. Augmenting large language models with chemistry tools

  98. FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions

  99. Branch-Solve-Merge Improves Large Language Model Evaluation and Generation

  100. Eureka: Human-Level Reward Design via Coding Large Language Models

  101. Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4-V

  102. Large Language Model Prediction Capabilities: Evidence from a Real-World Forecasting Tournament

  103. Data Contamination Through the Lens of Time

  104. Can GPT models be Financial Analysts? An Evaluation of ChatGPT and GPT-4 on mock CFA Exams

  105. Large language models can replicate cross-cultural differences in personality

  106. Beyond Memorization: Violating Privacy Via Inference with Large Language Models

  107. SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

  108. Can a computer outfake a human [personality]?

  109. Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models

  110. FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation

  111. Police Officers Are Starting to Use AI to Write Crime Reports

  112. Can large language models provide useful feedback on research papers? A large-scale empirical analysis

  113. Low-Resource Languages Jailbreak GPT-4

  114. An evolutionary model of personality traits related to cooperative behavior using a large language model

  115. UltraFeedback: Boosting Language Models with High-quality Feedback

  116. MTOB: A Benchmark for Learning to Translate a New Language from One Grammar Book

  117. Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve

  118. The Cambridge Law Corpus: A Corpus for Legal AI Research

  119. The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"

  120. From Sparse to Dense: GPT-4 Summarization with Chain of Density (CoD) Prompting

  121. Devising and Detecting Phishing: Large Language Models vs. Smaller Human Models

  122. ExpeL: LLM Agents Are Experiential Learners

  123. LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models

  124. Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification

  125. OpenAI Cribbed Our Tax Example, But Can GPT-4 Really Do Tax?

  126. Testing GPT-4 with Wolfram Alpha and Code Interpreter plug-ins on math and science problems

  127. The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain

  128. I’m a Screenwriter. These AI Jokes Give Me Nightmares

  129. A LLM Assisted Exploitation of AI-Guardian

  130. OpenAI Worries About What Its Chatbot Will Say About People’s Faces: An advanced version of ChatGPT can analyze images and is already helping the blind. But its ability to put a name to a face is one reason the public doesn’t have access to it

  131. GPT-4, an artificial intelligence large language model, exhibits high levels of accuracy on dermatology specialty certificate exam questions

  132. Machine-Assisted Social Psychology Hypothesis Generation

  133. Distilling Large Language Models for Biomedical Knowledge Extraction: A Case Study on Adverse Drug Events

  134. Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration

  135. Explaining Competitive-Level Programming Solutions using LLMs

  136. Hoodwinked: Deception and Cooperation in a Text-Based Game for Language Models

  137. LeanDojo: Theorem Proving with Retrieval-Augmented Language Models

  138. ARIES: A Corpus of Scientific Paper Edits Made in Response to Peer Reviews

  139. Understanding Social Reasoning in Language Models with Language Models

  140. Evaluating Superhuman Models with Consistency Checks

  141. Evaluating the Robustness of Text-to-image Diffusion Models against Real-world Attacks

  142. ChessGPT: Bridging Policy Learning and Language Modeling

  143. Large Language Models as Tax Attorneys: A Case Study in Legal Capabilities Emergence

  144. Can large language models democratize access to dual-use biotechnology?

  145. Let’s Verify Step by Step

  146. GPT4GEO: How a Language Model Sees the World’s Geography

  147. LLMs and the Abstraction and Reasoning Corpus: Successes, Failures, and the Importance of Object-based Representations

  148. Learning to Generate Novel Scientific Directions with Contextualized Literature-based Discovery

  149. WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia

  150. How Language Model Hallucinations Can Snowball

  151. C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models

  152. Large Language Models Can Be Used To Effectively Scale Spear Phishing Campaigns

  153. Boosting Theory-of-Mind Performance in Large Language Models via Prompting

  154. Today was the first day that I could definitively say that GPT-4 has saved me a substantial amount of tedious work

  155. Humans in Humans Out: On GPT Converging Toward Common Sense in both Success and Failure

  156. Advances in apparent conceptual physics reasoning in GPT-4

  157. Performance of ChatGPT on free-response, clinical reasoning exams

  158. Reflexion: Language Agents with Verbal Reinforcement Learning

  159. How well do Large Language Models perform in Arithmetic tasks?

  160. GPT-4 Technical Report § Limitations: Calibration

  161. Salesforce Announces Einstein GPT, the World’s First Generative AI for CRM

  162. Large Language Models Are State-of-the-Art Evaluators of Translation Quality

  163. Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

  164. Harvey, which uses AI to answer legal questions, lands cash from OpenAI

  165. Janus

  166. Something Weird Is Happening With LLMs and Chess

  167. Trading Off Compute in Training and Inference

  168. A Basic Test of OpenAI’s Structured Output Feature against Financial Disclosure Reports and a Newspaper’s Police Blotter

  169. Prompt Engineering Techniques With Azure OpenAI

  170. 9f51fc0ccaefe29b85deb1574deab082e63799df.html

  171. LLM Powered Autonomous Agents

  172. There’s a Running Theme in Here of Programming Problems LLMs Solve Where It’s...

  173. 85525c9bb48f9c95680601ccae4284f2c576e93b.html

  174. Prompting Diverse Ideas: Increasing AI Idea Variance

  175. OpenAI API § Prompt Caching

  176. Situational Awareness and Out-Of-Context Reasoning § GPT-4-Base Has Non-Zero Longform Performance

  177. I Finally Got ChatGPT to Sound like Me

  178. Connecting the Dots: LLMs Can Infer & Verbalize Latent Structure from Training Data

  179. How Good Are LLMs at Doing ML on an Unknown Dataset?

  180. Language Models Model Us

  181. The Case for More Ambitious Language Model Evals

  182. a1db1647e9173aaacd1968b6f0fdd0b4eecc578a.html

  183. What Kind of Writer Is ChatGPT?

  184. AI Will Increase the Quantity—And Quality—Of Phishing Scams

  185. Is Finetuning GPT-4o worth It?

  186. [‘Fourier Components’-Style Literary Criticism by GPT-4 O1]

  187. design#future-tag-features

  188. 2024-03-07-inflection-inflection25benchmarks.svg

  189. http://antirez.com/news/141

  190. 597350770268e111f146aa9bb1c8e794a363869d.html

  191. https://ai.nejm.org/doi/pdf/10.1056/AIp2300031

  192. https://amistrongeryet.substack.com/p/can-ai-do-my-job

  193. https://answers.microsoft.com/en-us/bing/forum/all/this-ai-chatbot-sidney-is-misbehaving/e3d6a29f-06c9-441c-bc7d-51a68e856761?page=1

  194. https://applied-llms.org/

  195. https://betterprogramming.pub/the-dark-side-of-llms-we-need-to-rethink-large-language-models-now-6212aca0581a

  196. 484ebd86ccfcead62264cfdcfada2f355ad90804.html

  197. https://blog.langchain.dev/agents-round/

  198. cb23f6ad2472b55b4f7ca524d06c54d27c15b941.html

  199. https://blog.matteskridge.com/business/gpt4-and-silicon-valley-bank/2023/03/19/

  200. 0eab025028f4b52a9ada1fe25bf3010e8fc5669d.html

  201. https://blog.mentat.ai/benchmarking-gpt-4-turbo-a-cautionary-tale

  202. https://blog.nawaz.org/posts/2024/Jan/llm-assisted-moderation/

  203. 53de69c2b9bafee588794d108114e33c86956410.html

  204. https://blog.roboflow.com/gpt-4-vision/

  205. https://bloop.ai/blog/evaluating-llms-on-cobol

  206. 021c6dbebb60998705fba7c08886e69644a565b0.html

  207. https://chat.openai.com/share/04add58f-2052-4b60-ae2a-ab708c29088f

  208. 2687312bcc15ed6e94d5743992fa3defcfecf634.html

  209. https://chatgpt.com/share/312e82f0-cc5e-47f3-b368-b2c0c0f4ad3f

  210. https://clarifycapital.com/the-future-of-investment-pitching

  211. d40bac1e6b491c93a57fb86cefd3f8465ecec93f.html

  212. https://cookbook.openai.com/examples/tag_caption_images_with_gpt4v

  213. https://demian.ferrei.ro/blog/chatgpt-sucks-at-pangrams

  214. 1ccfb8e3ba4928af8143d7ecc5dbe7641d16676e.html

  215. https://dkb.blog/p/chatgpts-chess-elo-is-1400

  216. 7d0b67636708552f924e83bd0617f67f43d75fef.html

  217. https://dmicz.github.io/machine-learning/openai-changes/

  218. 7d11e59b234f27e96a4808d0b04365daf45263ad.html

  219. https://finedataproducts.com/posts/2024-03-10-tax-scenarios-with-ai/

  220. https://generallyintelligent.substack.com/p/fine-tuning-mistral-7b-on-magic-the

  221. https://gist.github.com/Jessime/63f93215faed6f7109c6d62b7fef7fbc

  222. 7a9d045c671b2039955b71b4eb95d362102d482c.html

  223. https://gist.github.com/harryaskham/68a611bef777525991790bca2f2d324d

  224. https://github.blog/2023-11-08-universe-2023-copilot-transforms-github-into-the-ai-powered-developer-platform/

  225. https://github.com/E-xyza/Exonerate/blob/master/bench/reports/gpt-bench.md

  226. https://github.com/Significant-Gravitas/AutoGPT

  227. https://github.com/TaxyAI/browser-extension

  228. https://github.com/jujumilk3/leaked-system-prompts/blob/main/microsoft-bing-chat_20230209.md

  229. https://github.com/jujumilk3/leaked-system-prompts/blob/main/openai-assistants-api_20231106.md

  230. https://github.com/jujumilk3/leaked-system-prompts/blob/main/openai-chatgpt-ios_20230614.md

  231. https://github.com/jujumilk3/leaked-system-prompts/blob/main/openai-chatgpt4-android_20240207.md

  232. https://github.com/jujumilk3/leaked-system-prompts/blob/main/openai-chatgpt_20221201.md

  233. https://github.com/kagisearch/llm-chess-puzzles?tab=readme-ov-file#results

  234. e11ab8f0af8b80e4be545b6c0767252c3f464a8d.html#results

  235. https://github.com/nomic-ai/gpt4all

  236. https://github.com/tldraw/make-real

  237. https://github.com/xenodium/chatgpt-shell/

  238. d5a2ac20a8da4c74dc7b9be0d3503dda9eb760f2.html

  239. https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2812620

  240. https://kagi.com/summarizer/api.html

  241. 69763f036df4b073e71007d3c4f3ccf35a7ba272.html

  242. https://kenkantzer.com/lessons-after-a-half-billion-gpt-tokens/

  243. 6b83bca645ec463296a7f20d72f6bacf6136361e.html

  244. https://koenvangilst.nl/blog/keeping-code-complexity-in-check

  245. https://lemire.me/blog/2023/03/22/can-gpt-pass-my-programming-courses/

  246. 72489a62790e55c446f0dfddd69a822773490f91.html

  247. https://marginalrevolution.com/marginalrevolution/2023/10/goat-who-is-the-greatest-economist-of-all-time-and-why-does-it-matter.html

  248. https://matthewbarnett.substack.com/p/gpt-4-takes-bryan-caplans-midterm

  249. https://mazzzystar.github.io/2023/05/10/LLM-for-individual/

  250. 1d6aaab3a845ac7ad77dc6669cfa47c4c44f7892.html

  251. https://micahflee.com/2023/04/capturing-the-flag-with-gpt-4/

  252. https://news.ycombinator.com/item?id=35236275

  253. 18c4321d1e7e2b7014fa88b8078725b217a347a9.html

  254. https://news.ycombinator.com/item?id=35604715

  255. https://news.ycombinator.com/item?id=36606573

  256. 9b7b97825c13deaa0bb86864f537ca916200de50.html

  257. https://news.ycombinator.com/item?id=38275945

  258. https://news.ycombinator.com/item?id=38850202#38852945

  259. https://news.ycombinator.com/item?id=39557213

  260. https://nian.llmonpy.ai/

  261. cce55ea58c3f273e7b5f31d05b8e818871bc8aba.html

  262. https://niplav.site/decompose.html#Small_Experiment

  263. https://openai.com/blog/function-calling-and-other-api-updates#function-calling

  264. https://openai.com/index/introducing-openai-o1-preview/

  265. https://openai.com/index/introducing-structured-outputs-in-the-api/#_5PYjnV1iAHOPKPupDztdZk

  266. https://openai.com/index/mle-bench/

  267. https://osf.io/preprints/psyarxiv/dc6tz/

  268. https://paperswithcode.com/sota/math-word-problem-solving-on-math

  269. https://platform.openai.com/docs/guides/reasoning/how-reasoning-works

  270. https://pslusarz.github.io/articles/2023/12/22/compare-ocr-tesseract-gpt4-nara-rolls.html

  271. 4ac66aeda51a0035dcf3fd55c66692c033de8d5a.html

  272. https://scale.com/leaderboard/coding

  273. https://scottaaronson.blog/?p=7209

  274. 7af64d8ca1802ed7e79ed7eeb6a64c64e73a55c7.html

  275. https://simulationlabs.ai/

  276. https://statmodeling.stat.columbia.edu/2023/04/18/chatgpt4-writes-stan-code-so-i-dont-have-to/

  277. https://statmodeling.stat.columbia.edu/2023/08/20/bob-carpenter-thinks-gpt-4-is-awesome/

  278. https://terrytao.wordpress.com/about/ai-generated-versions-of-the-ai-anthology-article/

  279. dc05a787c2ebefedafcd967c461d5a9b98669db4.html

  280. https://timconnors.co/posts/ai-scraper

  281. https://unlocked.microsoft.com/ai-anthology/terence-tao/

  282. 52e807e8085a98d9a9b84dd14bd57d1a4024cb7d.html

  283. https://villekuosmanen.medium.com/i-played-chess-against-chatgpt-4-and-lost-c5798a9049ca

  284. 86d9685bc99ba86aed723486e8086afc721a7d5f.html

  285. https://web.archive.org/web/20230529224700/https://chat.openai.com/share/eef34fe5-0c8e-4595-9c28-2e9f05f05393

  286. 6b21c7a1605dc773646708f965a7643d50c647eb.html

  287. https://www.betonit.ai/p/gpt-4-takes-a-new-midterm-and-gets

  288. 4175fe75cb70603ea474683b41859bf7baa56956.html

  289. https://www.caltech.edu/about/news/LLMs-in-the-classroom

  290. 23b6af8286e9e4a8d19c900de6d96097c81b50a5.html

  291. https://www.construction-physics.com/p/could-chatgpt-become-an-architect

  292. 40da8f543f70924b75741b4eb051c0dba0570c16.html

  293. https://www.economist.com/business/2024/02/29/how-businesses-are-actually-using-generative-ai

  294. https://www.euractiv.com/section/politics/news/albania-to-speed-up-eu-accession-using-chatgpt/

  295. https://www.geoffreylitt.com/2023/03/25/llm-end-user-programming

  296. https://www.lasso.security/blog/ai-package-hallucinations

  297. https://www.lesswrong.com/posts/75o8oja43LXGAqbAR/palm-2-and-gpt-4-in-extrapolating-gpt-n-performance

  298. https://www.lesswrong.com/posts/ChtGdxk9mwZ2Rxogt/smartyheadercode-anomalous-tokens-for-gpt3-5-and-gpt-4-1

  299. https://www.lesswrong.com/posts/CkhJAxHeyFCg2EcET/are-language-models-good-at-making-predictions

  300. https://www.lesswrong.com/posts/F6vH6fr8ngo7csDdf/chess-as-a-case-study-in-hidden-capabilities-in-chatgpt

  301. https://www.lesswrong.com/posts/KSroBnxCHodGmPPJ8/jailbreaking-gpt-4-s-code-interpreter

  302. https://www.lesswrong.com/posts/Z4tBreNCxnppoPLtd/gpts-ability-to-keep-a-secret-is-weirdly-prompt-dependent

  303. https://www.lesswrong.com/posts/bNCDexejSZpkuu3yz/you-can-use-gpt-4-to-create-prompt-injections-against-gpt-4

  304. https://www.lesswrong.com/posts/zyPaqXgFzqHkQfccq/contra-lecun-on-autoregressive-llms-are-doomed?commentId=fXGn2E8RMdwhKqwrE

  305. https://www.malwarebytes.com/blog/threat-intelligence/2023/09/malicious-ad-served-inside-bing-ai-chatbot

  306. https://www.nature.com/articles/s41586-023-06792-0

  307. https://www.oneusefulthing.org/p/it-is-starting-to-get-strange

  308. https://www.oneusefulthing.org/p/one-sentence

  309. f07ba7e4aebd02d54bd534f26b0fa8148485513e.html

  310. https://www.oneusefulthing.org/p/setting-time-on-fire-and-the-temptation

  311. 5f943a1d7a08a8bea668ce6db743355833ada43a.html

  312. https://www.pnas.org/doi/abs/10.1073/pnas.2405460121

  313. https://www.pnas.org/doi/full/10.1073/pnas.2317967121

  314. https://www.reddit.com/r/ApplyingToCollege/comments/1h0vhlq/in_the_past_three_days_ive_reviewed_over_100/

  315. https://www.reddit.com/r/ChatGPT/comments/12a0ajb/i_gave_gpt4_persistent_memory_and_the_ability_to/

  316. https://www.reddit.com/r/ExperiencedDevs/comments/11y8hys/chatgpt_resumes_accounted_for_30_of_the_ones_we/

  317. 2956cd4f13bb571f10018452fba7545082687500.html

  318. https://www.reddit.com/r/GPT3/comments/12ez822/neurosemantical_inversitis_prompt_still_works/

  319. https://www.reddit.com/r/MachineLearning/comments/18u31w8/r_large_language_models_world_chess_championship/

  320. https://www.reddit.com/r/OpenAI/comments/1fxa6d6/two_purported_instances_of_o1preview_and_o1mini/

  321. https://www.reddit.com/r/OpenAI/comments/1gjj430/o1_preview_got_weird_today/

  322. https://www.reddit.com/r/PromptEngineering/comments/1fj6h13/hallucinations_in_o1preview_reasoning/

  323. https://www.reddit.com/r/bing/comments/110eagl/the_customer_service_of_the_new_bing_chat_is/

  324. 5c06dbe779ad8b6cb3aa3517ae9b00ebdd87a930.html

  325. https://www.reddit.com/r/duolingo/comments/18sx06i/big_layoff_at_duolingo/

  326. d54e76c26d5f82708508c43bef4f600587c822dc.html

  327. https://www.reddit.com/r/freelanceWriters/comments/12ff5mw/it_happened_to_me_today/

  328. 9611efae8631240b60016adc8f55dd1b65b0d136.html

  329. https://www.reddit.com/r/mlscaling/comments/1gyb54z/the_fate_of_gpt4o/

  330. https://www.reddit.com/r/singularity/comments/1atjz9v/ive_put_a_complex_codebase_into_a_single/

  331. https://www.reddit.com/r/slatestarcodex/comments/1201v68/10word_quote_a_short_and_simple_failure_mode_of/jdigzkh/?context=3

  332. https://www.sabrina.dev/p/chatgpt4o-vs-math

  333. a9be480b63e725642e0e39f3574de3873742e7db.html

  334. https://www.slowboring.com/p/chatgpt-goes-to-harvard

  335. https://www.supersimple.io/blog/gpt-4-fine-tuning-early-access

  336. https://www.thebigquestions.com/2023/04/05/gpt-4-fails-economics/

  337. 756d60210372dbd37d90cfb01becaa0c63cbfe25.html

  338. https://www.thendobetter.com/investing/2023/6/9/tyler-cowen-hayek-lecture-on-economics-ai-and-large-langauge-models

  339. 2b90dd6a8b792a87cdee6041d5210e570ccf1301.html

  340. https://www.theverge.com/2023/2/15/23599072/microsoft-ai-bing-personality-conversations-spy-employees-webcams

  341. https://www.vice.com/en/article/v7begx/overemployed-hustlers-exploit-chatgpt-to-take-on-even-more-full-time-jobs

  342. https://www.youtube.com/watch?v=PgT8tPChbqc

  343. https://www.youtube.com/watch?v=g7YJIpkk7KM&t=38

  344. https://x.com/AISafetyMemes/status/1762320288862314659

  345. https://x.com/AISafetyMemes/status/1841891795782775221

  346. https://x.com/Academisfit/status/1868529612554420489

  347. https://x.com/AndreTI/status/1635801920223989760

  348. https://x.com/CFGeek/status/1768024040487453169

  349. https://x.com/ChatGPTapp/status/1732979491071549792

  350. https://x.com/DahnJahn/status/1669000659192930304

  351. https://x.com/DimitrisPapail/status/1804233021429813661

  352. https://x.com/GrantSlatton/status/1740039795659956359

  353. https://x.com/GregKamradt/status/1722386725635580292

  354. https://x.com/KevinAFischer/status/1646690838981005312

  355. https://x.com/LericDax/status/1635804659448152067

  356. https://x.com/LericDax/status/1635871504138133504

  357. https://x.com/MParakhin/status/1648199942421508096

  358. https://x.com/MarkoTervio/status/1835287416900321447

  359. https://x.com/MasterTimBlais/status/1635701745727700999

  360. https://x.com/MichaelTrazzi/status/1635743595989970945

  361. https://x.com/Naman_Bhalla/status/1637578019811340292

  362. https://x.com/ShayneRedford/status/1640702622557523969

  363. https://x.com/StudentInfosec/status/1640360234882310145

  364. https://x.com/TheStalwart/status/1720475482171253104

  365. https://x.com/VictorTaelin/status/1645553975419355136

  366. https://x.com/VivaLaPanda_/status/1677828821964439553

  367. https://x.com/YaBoyFathoM/status/1647608734175186944

  368. https://x.com/ZachWeiner/status/1694685022236610900

  369. https://x.com/_Borriss_/status/1645488757649416196

  370. https://x.com/_via_getty_/status/1635728855934836736

  371. https://x.com/_vztu/status/1712682819800224011

  372. https://x.com/abacaj/status/1635738595767058433

  373. https://x.com/alexalbert__/status/1636488551817965568

  374. https://x.com/amasad/status/1704323196944527624

  375. https://x.com/amuseddaman/status/1647367383022182400

  376. https://x.com/anthrupad/status/1639421396840316932

  377. https://x.com/apples_jimmy/status/1790158228359368894

  378. https://x.com/axpuig/status/1635771128986710016

  379. https://x.com/bindureddy/status/1724152343732859392

  380. https://x.com/biz84/status/1637793452879405064

  381. https://x.com/bryanhpchiang/status/1639830383616487426

  382. https://x.com/colin_fraser/status/1762351995296350592

  383. https://x.com/conitzer/status/1656478578857369600

  384. https://x.com/corbtt/status/1814056457626862035

  385. https://x.com/danshipper/status/1635712019549786113

  386. https://x.com/davidad/status/1636150606384582656

  387. https://x.com/davidad/status/1639215289677017099

  388. https://x.com/emollick/status/1639421740358193153

  389. https://x.com/emollick/status/1681650599933222912

  390. https://x.com/emollick/status/1736196921541140861

  391. https://x.com/emollick/status/1748492920607379682

  392. https://x.com/emollick/status/1864744770695815234

  393. https://x.com/erikphoel/status/1638936714533130245

  394. https://x.com/fabianstelzer/status/1717131243861520569

  395. https://x.com/felps_bra/status/1762494815256936932

  396. https://x.com/gdb/status/1707082027584106669

  397. https://x.com/geoffreylitt/status/1635757456377917440

  398. https://x.com/gillespi/status/1645594062773452801

  399. https://x.com/goodside/status/1635711013566795776

  400. https://x.com/goodside/status/1657396491676164096

  401. https://x.com/goodside/status/1790294534670176336

  402. https://x.com/harryaskham/status/1636376676329455617

  403. https://x.com/jbrowder1/status/1635720431091974157

  404. https://x.com/kenshinsamurai9/status/1662510532585291779

  405. https://x.com/krishnanrohit/status/1738617384276263356

  406. https://x.com/lacker/status/1655685341649719296

  407. https://x.com/mattshumer_/status/1636512490195501056

  408. https://x.com/mattshumer_/status/1651614739569541120

  409. https://x.com/mattshumer_/status/1653060363972124673

  410. https://x.com/mckaywrigley/status/1642948620604538880

  411. https://x.com/mckaywrigley/status/1708153813583204394

  412. https://x.com/mezaoptimizer/status/1725512396901433575

  413. https://x.com/michael_nielsen/status/1769404321739972859

  414. https://x.com/mplappert/status/1663892732652273664

  415. https://x.com/nickchk/status/1635731621801496577

  416. https://x.com/papayathreesome/status/1670170344953372676

  417. https://x.com/patio11/status/1677890745683025920

  418. https://x.com/patio11/status/1721722777705603432

  419. https://x.com/paulg/status/1777030573220933716

  420. https://x.com/peakcooper/status/1639716822680236032

  421. https://x.com/perrymetzger/status/1635811092654858240

  422. https://x.com/perrymetzger/status/1639968357607698433

  423. https://x.com/petergyang/status/1707169696049668472

  424. https://x.com/repligate/status/1640509177159114752

  425. https://x.com/repligate/status/1762499051571102188

  426. https://x.com/repligate/status/1783455386684555340

  427. https://x.com/repligate/status/1827900674325045375

  428. https://x.com/sangyh2/status/1636785191447564288

  429. https://x.com/shinboson/status/1769231110691500140

  430. https://x.com/shinboson/status/1794570054165729303

  431. https://x.com/shinboson/status/1805459742518595585

  432. https://x.com/skirano/status/1635736107949195278

  433. https://x.com/tamaybes/status/1639400013062348800

  434. https://x.com/tegmark/status/1635714985543204889

  435. https://x.com/vagabondjack/status/1637468848122396672

  436. BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games

  437. https://arxiv.org/abs/2411.13543

  438. Can LLMs be Scammed? A Baseline Measurement Study

  439. https://arxiv.org/abs/2410.13893

  440. MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

  441. Lil'Log

  442. Homepage: Aleksander Mądry

  443. https://arxiv.org/abs/2410.07095#openai

  444. I Quit Teaching Because of ChatGPT

  445. https://time.com/7026050/chatgpt-quit-teaching-ai-essay/

  446. Thoughts while watching myself be automated

  447. https://dynomight.net/automated/

  448. Does Refusal Training in LLMs Generalize to the Past Tense?

  449. https://arxiv.org/abs/2407.11969

  450. Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs

  451. Owain Evans, AI Alignment Researcher

  452. https://arxiv.org/abs/2407.04694

  453. APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets

  454. Caiming Xiong—Home Page

  455. https://arxiv.org/abs/2406.18518#salesforce

  456. Probing the Decision Boundaries of In-context Learning in Large Language Models

  457. Aditya Grover

  458. https://arxiv.org/abs/2406.11233

  459. LLMs achieve adult human performance on higher-order theory of mind tasks

  460. https://arxiv.org/abs/2405.18870#google

  461. Intelligent Go-Explore (IGE): Standing on the Shoulders of Giant Foundation Models

  462. Jeff Clune—Professor—Computer Science—University of British Columbia

  463. https://arxiv.org/abs/2405.15143

  464. DeTikZify: Synthesizing Graphics Programs for Scientific Figures and Sketches with TikZ

  465. https://arxiv.org/abs/2405.15306

  466. Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization

  467. https://arxiv.org/abs/2405.15071

  468. ChatGPT will be able to talk to you like Scarlett Johansson in Her / Upgrades to ChatGPT’s voice mode bring it closer to the vision of a responsive AI assistant—and Sam Altman seems to know it

  469. https://www.theverge.com/2024/5/13/24155652/chatgpt-voice-mode-gpt4o-upgrades

  470. GSM1k: A Careful Examination of Large Language Model Performance on Grade School Arithmetic

  471. https://arxiv.org/abs/2405.00332#scale

  472. LLM Evaluators Recognize and Favor Their Own Generations

  473. Sam Bowman

  474. Shi Feng

  475. https://arxiv.org/abs/2404.13076

  476. From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples

  477. https://arxiv.org/abs/2404.07544

  478. Election Workers Are Drowning in Records Requests. AI Chatbots Could Make It Worse: Experts worry that election deniers could weaponize chatbots to overwhelm and slow down local officials

  479. https://www.wired.com/story/ai-chatbots-foia-requests-election-workers/

  480. Re-evaluating GPT-4’s bar exam performance

  481. https://link.springer.com/article/10.1007/s10506-024-09396-9

  482. A Peter Thiel-Backed AI Startup, Cognition Labs, Seeks $2 Billion Valuation: Funding round could increase startup’s valuation nearly sixfold in a matter of weeks, reflecting AI frenzy

  483. https://www.wsj.com/tech/ai/a-peter-thiel-backed-ai-startup-cognition-labs-seeks-2-billion-valuation-998fa39d

  484. Vulnerability Detection with Code Language Models: How Far Are We?

  485. https://arxiv.org/abs/2403.18624

  486. Long-form factuality in large language models

  487. https://arxiv.org/abs/2403.18802#deepmind

  488. Gold-Medalist Coders Build an AI That Can Do Their Job for Them: A new startup called Cognition AI can turn a user’s prompt into a website or video game

  489. https://www.bloomberg.com/news/articles/2024-03-12/cognition-ai-is-a-peter-thiel-backed-coding-assistant

  490. Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap

  491. https://arxiv.org/abs/2402.19450

  492. Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs

  493. https://arxiv.org/abs/2402.14903

  494. ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs

  495. https://arxiv.org/abs/2402.11753

  496. Tasks That Language Models Don’t Learn

  497. https://arxiv.org/abs/2402.11349

  498. GPT-4 passes the bar exam

  499. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10894685/

  500. Large language models are able to downplay their cognitive abilities to fit the persona they simulate

  501. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10936766/

  502. PRER: Modeling Complex Mathematical Reasoning via Large Language Model based MathAgent

  503. https://arxiv.org/abs/2312.08926

  504. Can linguists distinguish between ChatGPT and human writing?: A study of research ethics and academic publishing

  505. /doc/ai/nn/transformer/gpt/4/nonfiction/2023-casal.pdf

  506. Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine

  507. https://arxiv.org/abs/2311.16452#microsoft

  508. Comparing Humans, GPT-4, and GPT-4-V On Abstraction and Reasoning Tasks

  509. https://arxiv.org/abs/2311.09247

  510. Large Language Model Prediction Capabilities: Evidence from a Real-World Forecasting Tournament

  511. https://arxiv.org/abs/2310.13014

  512. Can GPT models be Financial Analysts? An Evaluation of ChatGPT and GPT-4 on mock CFA Exams

  513. https://arxiv.org/abs/2310.08678

  514. Can a computer outfake a human [personality]?

  515. /doc/psychology/personality/2023-phillips.pdf

  516. Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models

  517. https://arxiv.org/abs/2310.04406

  518. FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation

  519. Jason Wei

  520. https://arxiv.org/abs/2310.03214#google

  521. UltraFeedback: Boosting Language Models with High-quality Feedback

  522. Ning Ding

  523. https://arxiv.org/abs/2310.01377

  524. The Cambridge Law Corpus: A Corpus for Legal AI Research

  525. https://arxiv.org/abs/2309.12269

  526. The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"

  527. Owain Evans, AI Alignment Researcher

  528. https://arxiv.org/abs/2309.12288

  529. From Sparse to Dense: GPT-4 Summarization with Chain of Density (CoD) Prompting

  530. https://arxiv.org/abs/2309.04269

  531. Devising and Detecting Phishing: Large Language Models vs. Smaller Human Models

  532. https://arxiv.org/abs/2308.12287

  533. Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification

  534. https://arxiv.org/abs/2308.07921

  535. I’m a Screenwriter. These AI Jokes Give Me Nightmares

  536. https://time.com/6301288/the-ai-jokes-that-give-me-nightmares/

  537. OpenAI Worries About What Its Chatbot Will Say About People’s Faces: An advanced version of ChatGPT can analyze images and is already helping the blind. But its ability to put a name to a face is one reason the public doesn’t have access to it

  538. https://www.nytimes.com/2023/07/18/technology/openai-chatgpt-facial-recognition.html

  539. Machine-Assisted Social Psychology Hypothesis Generation

  540. /doc/ai/nn/transformer/gpt/3/nonfiction/2024-banker.pdf

  541. Distilling Large Language Models for Biomedical Knowledge Extraction: A Case Study on Adverse Drug Events

  542. https://arxiv.org/abs/2307.06439#microsoft

  543. Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration

  544. Furu Wei

  545. https://arxiv.org/abs/2307.05300#microsoft

  546. Hoodwinked: Deception and Cooperation in a Text-Based Game for Language Models

  547. https://arxiv.org/abs/2308.01404

  548. LeanDojo: Theorem Proving with Retrieval-Augmented Language Models

  549. https://arxiv.org/abs/2306.15626

  550. ARIES: A Corpus of Scientific Paper Edits Made in Response to Peer Reviews

  551. https://arxiv.org/abs/2306.12587

  552. Understanding Social Reasoning in Language Models with Language Models

  553. https://arxiv.org/abs/2306.15448

  554. Let’s Verify Step by Step

  555. Jan Leike

  556. John Schulman’s Homepage

  557. https://arxiv.org/abs/2305.20050#openai

  558. LLMs and the Abstraction and Reasoning Corpus: Successes, Failures, and the Importance of Object-based Representations

  559. https://arxiv.org/abs/2305.18354

  560. How Language Model Hallucinations Can Snowball

  561. https://arxiv.org/abs/2305.13534

  562. Large Language Models Can Be Used To Effectively Scale Spear Phishing Campaigns

  563. https://arxiv.org/abs/2305.06972

  564. Boosting Theory-of-Mind Performance in Large Language Models via Prompting

  565. https://arxiv.org/abs/2304.11490

  566. Performance of ChatGPT on free-response, clinical reasoning exams

  567. https://www.medrxiv.org/content/10.1101/2023.03.24.23287731.full

  568. How well do Large Language Models perform in Arithmetic tasks?

  569. https://arxiv.org/abs/2304.02015#alibaba

  570. GPT-4 Technical Report § Limitations: Calibration

  571. https://arxiv.org/pdf/2303.08774#page=12&org=openai

  572. Large Language Models Are State-of-the-Art Evaluators of Translation Quality

  573. https://arxiv.org/abs/2302.14520

  574. Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

  575. https://arxiv.org/abs/2302.12173

  576. Harvey, which uses AI to answer legal questions, lands cash from OpenAI

  577. https://techcrunch.com/2022/11/23/harvey-which-uses-ai-to-answer-legal-questions-lands-cash-from-openai/