The Facts
This week, there were two similarly themed announcements in the LLM space:
Imbue (formerly Generally Intelligent, unaffiliated with what you’re currently reading!) announced a $200M fundraise that they plan to use to train LLMs that are specialized in reasoning.
Adept AI broke their long silence and released Persimmon-8B, an 8-billion-parameter LLM that appears to be a byproduct of their own work training reasoning-centric LLMs.
Why it matters
My current mental model is that we’re going to see two waves of innovation in the LLM space:
First wave (ongoing): low-hanging fruit, mostly improving existing products by augmenting them with LLMs. Think much better chatbots, better semantic search, easier copywriting, easier coding with Copilot, and so on.
Second wave: LLMs as a reasoning engine enabling net new products. Potential to automate large swaths of knowledge work. Much harder to build.
The first wave is already incredibly impactful — Copilot and ChatGPT alone represent two of the most successful net-new products in the last handful of years.
The second wave is an honest-to-god platform shift (if/when it materializes). Software with (1) intuitions about how the world works and (2) an ability to reason about contextual information fundamentally differs from any tool we’ve had in the past. The largest barriers standing between us and that second wave are:
Models are okay at reasoning today, but not yet great. They don’t use their context well enough (they often ignore information in the middle of it), so they make mistakes and have a tendency to go off the rails.
Engineering around these deficiencies takes time — I think the current generation of models can produce revolutionary automation products, but engineering those systems will take months or years.
Adept and Imbue are clearly betting on breaking down the first barrier: by focusing specifically on reasoning, they are looking to build general-purpose agents that can reason as well as an average knowledge worker.
My thoughts
So far, the strongest predictor of how well a model completes reasoning tasks is simple: model size. Larger models are almost always better at reasoning (MMLU and HellaSwag are the best benchmarks to track here). If one of these companies can break that trend and build smaller models specialized for reasoning, they may be able to bring on the second wave of innovation much more quickly.