Actually, maybe LLaMA is still the best?
LLM evaluation is hard: a real-world example -- 6/2/2023
The Facts
On Monday, I wrote about Falcon-40B “becoming the new best OSS model”. I may have spoken too soon.
Yao Fu is a primary contributor to Chain-of-Thought Hub, an open-source framework built to evaluate models' reasoning capabilities. He points out that the score for LLaMA-65B on HuggingFace's Open LLM Leaderboard is significantly lower than the score reported in the original LLaMA paper.
Why it matters
I should have known better! Evaluating LLMs is still too hard.
AlpacaEval, an evaluation framework that uses models to evaluate models, was released just yesterday, and it shows a (very different!) leaderboard:
The takeaway: don’t take any release too seriously too quickly right now. That’s not to say these models aren’t impressive and/or useful! We just don’t have the right benchmarks yet to assess a model’s capabilities quickly. Hopefully soon.
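To make the "models evaluating models" idea concrete, here is a minimal sketch of pairwise model-as-judge scoring, the general approach behind frameworks like AlpacaEval. The `JUDGE_TEMPLATE`, `judge` callable, and `pairwise_win_rate` helper below are illustrative assumptions on my part, not AlpacaEval's actual API or prompt.

```python
# Minimal sketch of "model-as-judge" pairwise evaluation (the general idea
# behind frameworks like AlpacaEval). The prompt format and function names
# here are illustrative assumptions, not AlpacaEval's actual implementation.
from typing import Callable, List, Tuple

JUDGE_TEMPLATE = """You are comparing two responses to the same instruction.

Instruction:
{instruction}

Response A:
{response_a}

Response B:
{response_b}

Which response is better? Answer with exactly "A" or "B"."""


def pairwise_win_rate(
    examples: List[Tuple[str, str, str]],  # (instruction, candidate, baseline)
    judge: Callable[[str], str],           # judge LLM: prompt text -> raw answer
) -> float:
    """Fraction of examples where the judge prefers the candidate over the baseline."""
    wins = 0
    for instruction, candidate, baseline in examples:
        prompt = JUDGE_TEMPLATE.format(
            instruction=instruction, response_a=candidate, response_b=baseline
        )
        verdict = judge(prompt).strip().upper()
        if verdict.startswith("A"):
            wins += 1
    return wins / len(examples)


if __name__ == "__main__":
    # Stand-in judge for demonstration; swap in a real API call to a strong model.
    fake_judge = lambda prompt: "A"
    data = [("Summarize photosynthesis.", "Plants convert light into chemical energy.", "idk")]
    print(pairwise_win_rate(data, fake_judge))  # -> 1.0
```

Judge-based evaluations have their own pitfalls: production frameworks typically also randomize the A/B order (and average over both orders) to reduce the judge model's position bias, which is part of why these leaderboards can disagree with benchmark-based ones.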
My thoughts
With all of the confusion, here are a few things I’m confident about:
Judging by the models developers actually choose to build with, the best closed-source models are significantly better than the best open-source models.
The current batch of open-source models is sufficiently worse at reasoning tasks that the most sophisticated use cases are stuck with closed-source models.
Open-source models are sufficient for a number of use cases and can offer much better price/performance where that is relevant. Advanced teams are fine-tuning smaller models when the use case calls for it.