The Facts
Anthropic recently announced Claude 2, their latest and most capable language model.
Claude 2 appears to be the second most capable model released to date on standard benchmarks, behind only GPT-4. It is roughly 4-5x cheaper per token than GPT-4 and supports a 100k-token context window.
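To make that price gap concrete, here's a minimal sketch of per-request cost math. The per-1k-token prices (`GPT4_PER_1K`, `CLAUDE2_PER_1K`) are hypothetical placeholders chosen only to reflect the ~4-5x ratio above, not official pricing:

```python
# Illustrative cost comparison. The prices below are assumed
# placeholders reflecting the ~4-5x gap, not real list prices.
GPT4_PER_1K = 0.05      # assumed blended $ per 1k tokens
CLAUDE2_PER_1K = 0.011  # assumed blended $ per 1k tokens (~4.5x cheaper)

def request_cost(total_tokens: int, price_per_1k: float) -> float:
    """Cost of a single request at a flat per-1k-token price."""
    return total_tokens / 1000 * price_per_1k

# One request using 8k tokens (prompt + completion combined).
tokens = 8_000
print(f"GPT-4:    ${request_cost(tokens, GPT4_PER_1K):.3f}")
print(f"Claude 2: ${request_cost(tokens, CLAUDE2_PER_1K):.3f}")
```

At high volume this ratio compounds: the same workload costs a few hundred dollars on one model and over a thousand on the other, which is often the difference between a viable and non-viable product.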
Why it matters
With Claude 2 potentially the strongest competitor to GPT-4 yet, it's worth poking at the competitive landscape of language models. Let's slice it a few ways:
The most obvious trade-off is simple: "smarter" models are essentially always larger, which makes them slower and more expensive. Which model makes sense for an application will likely be highly use-case dependent:
Fine-tuning can move points around on this curve in an interesting way:
There are also a handful of other criteria you might measure or care about:
Steerability — if you're creating AI companions, the alignment tuning baked into hosted models may interfere with the personas you can build. Open-source models are innately advantaged here.
Compliance / Security / Privacy — if you need to host a model inside your own VPC, hosted APIs are off the table, which again favors open-source models.
Most of the decision criteria so far sit on the cost/performance curve, though, and OpenAI has dominated that market.
My thoughts
Claude 2 sits at an interesting place on the curve — it won't win the use cases that need the smartest models (like reasoning agents), but it could carve out a niche among use cases that need a model that is both smart and low-latency.
There are a lot of other angles to attack to produce better models (with fine-tuning being the most exciting of them, IMO).
For now, I'm excited to have another competitive entrant!