We are launching Inflection-2.5, our upgraded in-house model that is competitive with all the world’s leading LLMs like GPT-4 and Gemini.
…We’ve already rolled out Inflection-2.5 to our users, and they are really enjoying Pi! We’ve seen a very substantial impact on user sentiment, engagement, and retention, accelerating our organic user growth.
Our one million daily and six million monthly active users have now exchanged more than four billion messages with Pi.
An average conversation with Pi lasts 33 minutes, and one in ten lasts over an hour each day. About 60% of people who talk to Pi in any given week return the following week, and we see higher monthly stickiness than leading competitors.
- Messages exchanged: 4 billion
- Monthly active users: 6 million
- Daily active users: 1 million
- Week-over-week retention: 60%
- Average session length: 33 minutes
- Sessions longer than one hour: 10%
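For context on the "stickiness" comparison above, here is a minimal Python sketch relating the reported figures. It assumes the common industry definition of monthly stickiness as the DAU/MAU ratio; Inflection does not publish its exact formula, so treat this as illustrative only.

```python
# Minimal sketch, assuming monthly stickiness = DAU / MAU (a common industry
# definition; not necessarily Inflection's methodology).
daily_active_users = 1_000_000
monthly_active_users = 6_000_000

stickiness = daily_active_users / monthly_active_users
print(f"DAU/MAU stickiness: {stickiness:.1%}")  # ~16.7%

# Week-over-week retention is reported directly: about 60% of a given week's
# users return the following week.
week_over_week_retention = 0.60
```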
Inflection-2.5 benchmarks
…Below, we show a series of results on key industry benchmarks. For the sake of simplicity, we compare Inflection-2.5 to GPT-4. These results show how Pi now incorporates IQ capabilities comparable with acknowledged industry leaders. Due to differences in reporting format, we are careful to note the format used for evaluation. [Presumably these figures are from the original GPT-4 technical report, but GPT-4 has improved noticeably since then, so Inflection-2.5 likely compares somewhat less favorably than these results imply.]
Inflection-1 used ~4% of the training FLOPs of GPT-4 and, on average, performed at ~72% of GPT-4's level on a diverse range of IQ-oriented tasks. Inflection-2.5, now powering Pi, achieves more than 94% of the average performance of GPT-4 despite using only 40% of the training FLOPs. We see a substantial improvement in performance across the board, with the largest gains coming in STEM areas.
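To make the compute-efficiency claim concrete, here is a rough back-of-the-envelope sketch using only the relative figures quoted above, with GPT-4 normalized to 1.0 and "more than 94%" treated as 0.94. The benchmark mix behind these averages is not specified, so the ratios are illustrative rather than a reconstruction of Inflection's analysis.

```python
# Back-of-the-envelope sketch: performance achieved per unit of GPT-4-equivalent
# training compute, using the relative numbers quoted in the announcement.
models = {
    "GPT-4":          {"rel_flops": 1.00, "rel_performance": 1.00},
    "Inflection-1":   {"rel_flops": 0.04, "rel_performance": 0.72},
    "Inflection-2.5": {"rel_flops": 0.40, "rel_performance": 0.94},
}

for name, m in models.items():
    efficiency = m["rel_performance"] / m["rel_flops"]
    print(f"{name:15s} performance per unit of training FLOPs: {efficiency:.2f}x")
```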
Inflection-2.5 shows substantial gains over Inflection-1 on the MMLU benchmark, a diverse benchmark covering tasks ranging from high-school to professional-level difficulty. We also evaluate on the GPQA Diamond benchmark, an extremely difficult expert-level benchmark.
…All evaluations above are done with the model that now powers Pi; however, we note that the user experience may differ slightly due to the impact of web retrieval (no benchmarks above use web retrieval), the structure of few-shot prompting, and other production-side differences.
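As an illustration of what the few-shot, multiple-choice evaluation setup mentioned above typically involves (e.g., for MMLU), here is a minimal, hypothetical sketch. The prompt layout and the `query_model` callable are assumptions made for illustration, not Inflection's actual evaluation harness.

```python
# Hypothetical sketch of few-shot multiple-choice evaluation in the MMLU style.
from typing import Callable

CHOICES = ["A", "B", "C", "D"]

def format_example(question: str, options: list[str], answer: str | None = None) -> str:
    # Render one question with lettered options; leave "Answer:" open for the test item.
    lines = [f"Question: {question}"]
    lines += [f"{label}. {text}" for label, text in zip(CHOICES, options)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

def build_few_shot_prompt(shots: list[dict], test_item: dict) -> str:
    # Each dict has "question", "options", and (for the shots) "answer" keys.
    parts = [format_example(s["question"], s["options"], s["answer"]) for s in shots]
    parts.append(format_example(test_item["question"], test_item["options"]))
    return "\n\n".join(parts)

def score_item(query_model: Callable[[str], str], shots: list[dict], test_item: dict) -> bool:
    # `query_model` is an assumed callable returning the model's text completion.
    prompt = build_few_shot_prompt(shots, test_item)
    prediction = query_model(prompt).strip()[:1].upper()  # first letter, A-D
    return prediction == test_item["answer"]
```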
…We thank our partners at Microsoft Azure and CoreWeave for their support in bringing the state-of-the-art language models behind Pi to millions of users across the globe.