Actually, maybe LLaMA is still the best?
LLM evaluation is hard: a real-world example -- 6/2/2023
The Facts
On Monday, I wrote about Falcon-40B “becoming the new best OSS model”. I may have spoken too soon.
Yao Fu is a primary contributor to Chain-of-Thought Hub, an open-source framework built to evaluate models' reasoning capabilities. He points out that the score for LLaMA-65B on HuggingFace's Open LLM Leaderboard is significantly lower than the score reported in the original LLaMA paper.
Why it matters
I should have known better! Evaluating LLMs is still too hard.
AlpacaEval, an evaluation framework that uses models to evaluate models, was released just yesterday, and it shows a (very different!) leaderboard:
The takeaway: don’t take any release too seriously too quickly right now. That’s not to say these models aren’t impressive and/or useful! We just don’t have the right benchmarks yet to assess a model’s capabilities quickly. Hopefully soon.
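To make the "models evaluating models" idea concrete, here is a minimal sketch of pairwise model-as-judge scoring, the general approach behind frameworks like AlpacaEval. The `JUDGE_TEMPLATE`, `judge` callable, and `pairwise_win_rate` helper below are illustrative assumptions on my part, not AlpacaEval's actual API or prompt.

```python
# Minimal sketch of "model-as-judge" pairwise evaluation (the general idea
# behind frameworks like AlpacaEval). The prompt format and function names
# here are illustrative assumptions, not AlpacaEval's actual implementation.
from typing import Callable, List, Tuple

JUDGE_TEMPLATE = """You are comparing two responses to the same instruction.

Instruction:
{instruction}

Response A:
{response_a}

Response B:
{response_b}

Which response is better? Answer with exactly "A" or "B"."""


def pairwise_win_rate(
    examples: List[Tuple[str, str, str]],  # (instruction, candidate, baseline)
    judge: Callable[[str], str],           # judge LLM: prompt text -> raw answer
) -> float:
    """Fraction of examples where the judge prefers the candidate over the baseline."""
    wins = 0
    for instruction, candidate, baseline in examples:
        prompt = JUDGE_TEMPLATE.format(
            instruction=instruction, response_a=candidate, response_b=baseline
        )
        verdict = judge(prompt).strip().upper()
        if verdict.startswith("A"):
            wins += 1
    return wins / len(examples)


if __name__ == "__main__":
    # Stand-in judge for demonstration; swap in a real API call to a strong model.
    fake_judge = lambda prompt: "A"
    data = [("Summarize photosynthesis.", "Plants convert light into chemical energy.", "idk")]
    print(pairwise_win_rate(data, fake_judge))  # -> 1.0
```

Judge-based evaluations have their own pitfalls: production frameworks typically also randomize the A/B order (and average over both orders) to reduce the judge model's position bias, which is part of why these leaderboards can disagree with benchmark-based ones.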
My thoughts
With all of the confusion, here are a few things I’m confident about:
Judging by the models developers actually choose to build with, the best closed-source models are significantly better than the best open-source models.
The current batch of open-source models is sufficiently worse at reasoning tasks that the most sophisticated use cases are stuck with closed-source models.
Open-source models are sufficient for a number of use cases and can offer much better price/performance where that is relevant. Advanced teams are fine-tuning smaller models when the use case calls for it.