Scaling an RNN to compete with transformers
Advancing work on one of the biggest limitations of transformer architectures
The Facts
A few weeks ago, a tweet about a pretty incredible project blew up a bit:


A single developer released a huge pretrained Recurrent Neural Network (RNN) that rivals modern transformer models like GPT-3 in scale. The RNN architecture has been all but absent from large-scale language modeling over the last five years, as most research groups have focused on scaling up the more popular (and, until now, more performant) transformer architecture.
Their model performs competitively with similarly sized transformers. Since that tweet, they have also released a version of the model (called Raven) that was fine-tuned following the Alpaca methodology.
Why it Matters
One of the primary limitations of modern Large Language Models is context size: how much text the model can process in a single forward pass. This size is measured in "tokens", the chunks of text (roughly words or word pieces) that a model actually operates on. GPT-3 has a maximum context size of ~3000 words, and GPT-4 has a maximum context size of ~24000 words. That may sound like a lot, but many things don't fit under this limit (web pages, books, code bases), so developers need to get creative to work with big inputs.
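To make the token/word distinction concrete, here's a minimal sketch that counts tokens against a context limit. It assumes the tiktoken package and the cl100k_base tokenizer; the limit shown is illustrative, not any particular model's.

```python
# Minimal sketch: count how many tokens a piece of text consumes
# against an (illustrative) context limit. Assumes `tiktoken` is installed.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a common GPT-style tokenizer

text = "Context windows are measured in tokens, not words."
tokens = enc.encode(text)

CONTEXT_LIMIT = 4096  # illustrative limit, in tokens
print(f"{len(text.split())} words -> {len(tokens)} tokens")
print("fits in context" if len(tokens) <= CONTEXT_LIMIT else "needs to be split up")
```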
The reason for these limits is the attention module at the heart of the transformer architecture. Attention compares every token in the context with every other token, so the compute cost of completing text with a transformer grows quadratically with the size of the context window.
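To see where the quadratic cost comes from, here is a rough numpy sketch of a single attention head (dimensions and names are illustrative, not taken from any particular model): the score matrix has one entry for every pair of tokens, so doubling the context quadruples the work.

```python
import numpy as np

def attention(q, k, v):
    # q, k, v: (seq_len, d) matrices for a single head.
    # The score matrix is (seq_len, seq_len), so compute and memory
    # grow quadratically with the context length.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

n, d = 1024, 64
q = k = v = np.random.randn(n, d)
out = attention(q, k, v)   # builds a 1024 x 1024 score matrix
print(out.shape)           # (1024, 64)
```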
Finding a way to make inference cost scale linearly with input size would dramatically expand what use cases are feasible with language models. I don't know whether RNNs will be the successful approach (there are other ideas out there, like Hungry Hungry Hippos), but this limitation is one of the more important barriers to overcome in advancing language models.
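For contrast, here is a sketch of a plain RNN cell (a generic vanilla RNN, not the released model's actual architecture): the hidden state has a fixed size and each new token costs the same amount of work, so total cost grows linearly with the context length.

```python
import numpy as np

def rnn_forward(x, w_x, w_h):
    # x: (seq_len, d) inputs. The hidden state has a fixed size,
    # so each step costs the same no matter how long the context
    # already is -- total cost is linear in seq_len.
    h = np.zeros(w_h.shape[0])
    for x_t in x:
        h = np.tanh(w_x @ x_t + w_h @ h)
    return h

d, hidden = 64, 128
w_x = np.random.randn(hidden, d) * 0.1
w_h = np.random.randn(hidden, hidden) * 0.1
x = np.random.randn(1024, d)
print(rnn_forward(x, w_x, w_h).shape)   # (128,)
```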
My Thoughts
This tweet got me unreasonably excited. I love fundamental model innovations, particularly ones that chip away at the computational limitations of current models. We need ways beyond better hardware to improve inference performance, and this seems like a great step forward.
Super interesting, and a strong case against throwing away older frameworks whenever the shiny new thing comes along!