Testing LLM Agents in Academia

My favorite paper of the year so far -- 4/12/23

Apr 12, 2023

The Facts

On April 7th, Percy Liang’s lab at Stanford released this pretty incredible paper about using LLMs as agents in a small simulation game.

If you’re the kind of person who likes to read papers, I cannot recommend this one highly enough; if not, here’s a quick summary:

The team at Stanford had LLMs play the NPCs in a simple simulation game. The LLMs planned and executed all of their actions and held conversations with other agents.
They gave the agents a sort of “cognitive framework” that includes short-term and long-term memory, planning capabilities, and the ability to interact with the game world.
They introduced a set of metrics to evaluate the “believability” of those agents (how much they seem like real characters to a human player).
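
To make the “cognitive framework” idea a bit more concrete, here is a toy sketch of the three pieces from the summary above: memory, planning, and acting on a world. This is not the paper's implementation. The `Agent` class, the `stub_llm` function, the keyword-based `recall`, and the character name are all hypothetical, and the "world" is just a list of log lines. A real agent would replace `stub_llm` with an actual model call.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; returns a canned action."""
    return "greet the neighbor" if "neighbor" in prompt else "wander around"


@dataclass
class Agent:
    name: str
    llm: Callable[[str], str] = stub_llm
    # Short-term memory: only the three most recent observations.
    short_term: Deque[str] = field(default_factory=lambda: deque(maxlen=3))
    # Long-term memory: everything the agent has ever observed.
    long_term: List[str] = field(default_factory=list)

    def observe(self, event: str) -> None:
        """Store a new observation in both memories."""
        self.short_term.append(event)
        self.long_term.append(event)

    def recall(self, keyword: str) -> List[str]:
        """Naive keyword lookup over long-term memory (an assumption, not the paper's method)."""
        return [memory for memory in self.long_term if keyword in memory]

    def plan(self) -> str:
        """Build a prompt from recent memory and ask the model for the next action."""
        prompt = f"{self.name} recently saw: " + "; ".join(self.short_term)
        return self.llm(prompt)

    def act(self, world: List[str]) -> str:
        """Pick an action and record it in the shared world log."""
        action = self.plan()
        world.append(f"{self.name}: {action}")
        return action
```

Using it looks roughly like this:

```python
world_log = []
agent = Agent("Ava")
agent.observe("a neighbor waves hello")
agent.act(world_log)
```

The point of splitting things this way is that each piece (what gets remembered, how memories are retrieved, how the plan prompt is built) can be swapped out independently, which is what makes different agent architectures comparable.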

Why it Matters (other than cool new video games)

Much of the LLM agent noise to date has been about projects like AutoGPT and BabyAGI, which also give agents a “cognitive framework.” Those two projects each perform real-world tasks (like writing reports or sending tweets), and both have taken off in the last few weeks because they can produce some impressive results!

This paper establishes a path forward to evaluate these agents. A small-scale simulated world is an exceptional lab environment to test different agent architectures' behavior, characteristics, and capabilities. Evaluating and understanding progress on building agents is hard without a controlled environment.

When I get a chance, I’m going to write something longer about the shortcomings of evaluation tools in the LLM space right now. The current benchmarks clearly don’t represent the objective quality of models or agents, making it hard to measure and understand progress. I’m hopeful this work represents a step forward.

My Thoughts

In my mind, this paper is a perfect example of the role academia can play in the current era of corporate-dominated research, by contributing:

Relatively inexpensive ways to augment LLMs (like novel memory mechanisms)
Methods to evaluate the performance of LLMs

I’m glad to see these contributions, and they make me really hopeful about the role academia can play in the coming years.

Also, pretty excited about the cool new video games 🚀
