The Facts
OpenAI announced some notable updates to their completion APIs yesterday:
Added function calling capability to the Chat API
New versions of gpt-4 and gpt-3.5-turbo
Added a 16k context-length version of gpt-3.5-turbo
75% cost reduction for their embedding model
25% cost reduction for gpt-3.5-turbo
Announced the future deprecation of the previous versions of gpt-4 and gpt-3.5-turbo
The most novel change was the first: they added the ability for gpt-4 and gpt-3.5-turbo to choose “functions” to call. In practice, this lets a developer describe a function (its name, parameters, and a natural-language description) and have the model output a valid, structured call to it:
curl https://api.openai.com/v1/chat/completions -u :$OPENAI_API_KEY -H 'Content-Type: application/json' -d '{
"model": "gpt-3.5-turbo-0613",
"messages": [
{"role": "user", "content": "What is the weather like in Boston?"}
],
"functions": [
{
"name": "get_current_weather",
"description": "Get the current weather in a given location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"]
}
},
"required": ["location"]
}
}
]
}'
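Instead of a normal text reply, the assistant message comes back with a function_call field holding the chosen function’s name and its arguments serialized as a JSON string. Abridged, and with formatting added, the response to the request above looks roughly like this:
{
  "choices": [
    {
      "index": 0,
      "finish_reason": "function_call",
      "message": {
        "role": "assistant",
        "content": null,
        "function_call": {
          "name": "get_current_weather",
          "arguments": "{ \"location\": \"Boston, MA\" }"
        }
      }
    }
  ]
}
The caller is expected to parse those arguments, run the real get_current_weather, and (optionally) send the result back in a follow-up request so the model can phrase a final answer.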
Why it matters
Let’s group the changes into buckets and break them down:
Function Calls
Related to the plugin ecosystem, OpenAI is releasing versions of their models tuned to generate calls to external tools. To date, developers have typically had to rely on prompt engineering to coax a model into emitting a known syntax for invoking an external API.
Notably, this also mixes in “tool routing,” where LLMs decide which tool to use. According to their statement, they fine-tuned the base models to improve their ability to select tools. This is reminiscent of Gorilla — a very promising approach.
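To make that routing loop concrete, here is a minimal sketch of the request/execute/respond cycle using the openai Python package’s pre-1.0 ChatCompletion interface (the one current at the time of this announcement). The local get_current_weather implementation and the error-free happy path are illustrative assumptions, not OpenAI reference code:
import json
import openai  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical local implementation the model can "call".
def get_current_weather(location, unit="fahrenheit"):
    return {"location": location, "unit": unit, "forecast": "sunny", "temperature": 72}

functions = [{
    "name": "get_current_weather",
    "description": "Get the current weather in a given location",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "The city and state, e.g. San Francisco, CA"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["location"],
    },
}]

messages = [{"role": "user", "content": "What is the weather like in Boston?"}]

# First call: the model decides whether to answer directly or call a function.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=messages,
    functions=functions,
    function_call="auto",
)
message = response["choices"][0]["message"]

if message.get("function_call"):
    # The model returns the function name plus its arguments as a JSON string.
    args = json.loads(message["function_call"]["arguments"])
    result = get_current_weather(**args)

    # Second call: hand the function result back so the model can answer in prose.
    messages.append(message)
    messages.append({
        "role": "function",
        "name": message["function_call"]["name"],
        "content": json.dumps(result),
    })
    final = openai.ChatCompletion.create(model="gpt-3.5-turbo-0613", messages=messages)
    print(final["choices"][0]["message"]["content"])
else:
    print(message["content"])
The interesting design choice is that the model never executes anything itself: it only proposes a call, and the application stays responsible for running the tool and deciding what to send back.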
16k Context GPT-3.5
About time. After Claude-100k was released, it seems like longer context windows are becoming table stakes for these APIs.
Model Version Update
Not much intel as to what was included in these model updates — simply that the models are “more steerable”.
These releases highlight a pain point of operating model APIs: each update is, in some sense, an undocumented breaking change. In theory, LLMs are just becoming “smarter” over time. In practice, it’s hard to guarantee the continued performance of existing prompts on new models. Layer in the challenges of evaluating these models, and these changes can actually be quite painful. (Or quite harmless; it’s hard to know!)
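One practical mitigation, which these dated releases enable, is to pin the snapshot you originally evaluated against and diff a new snapshot before adopting it. A rough sketch, with the model names taken from this release and the prompts and pass/fail check purely illustrative (real applications would use proper evals, not exact string comparison):
import openai  # assumes OPENAI_API_KEY is set in the environment

PINNED = "gpt-3.5-turbo-0301"     # snapshot the app was originally evaluated against
CANDIDATE = "gpt-3.5-turbo-0613"  # new snapshot from this release

def complete(model, prompt):
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduces, but does not eliminate, run-to-run variance
    )
    return response["choices"][0]["message"]["content"]

# Replace with your application's real prompts and a real scoring function.
regression_prompts = [
    "Reply with only the word OK.",
    "Extract the city from: 'Flight to Boston on Friday.' Reply with the city only.",
]
for prompt in regression_prompts:
    old, new = complete(PINNED, prompt), complete(CANDIDATE, prompt)
    if old.strip() != new.strip():
        print(f"Output changed:\n  prompt: {prompt}\n  {PINNED}: {old}\n  {CANDIDATE}: {new}")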
My thoughts
My two key takeaways:
Fine-tuning is key to performance, and most people aren’t doing enough of it. Lots for the rest of the world to learn here.
Always look closely at silent changes in APIs. I’m curious to see if any apps break from the change in a few weeks.