Last week, OpenAI launched a bunch of features (seems on-brand, I guess?). One of the interesting details was how the Realtime API works. While the websockets side is cool, what stood out most to me is how function calling plays into the picture for agent-to-human communication.
But first, let's take a step back.
In July I made this picture about "3rd-Gen AI Agents" and “The Outer Loop”.
I wanted to capture the idea that most current uses of GenAI fall into one of two categories:
Integrated into backend systems as very small, focused functional tasks as part of a larger, deterministic system
For example, “Classify this text into one of 5 categories” or “draft a response to this customer question”
Dynamic “Agents” that execute complex, multi-step tool-calling workflows in response to a human query, usually in a multi-modal chat interface.
For example, human asks “I want to buy a blender”, agent works through several steps, presents options, and ultimately completes the task and/or answers the question.
The Outer-Loop and Inversion of Control
For the first case, AI operations are initiated by deterministic software. For the second, they are initiated by human interactions. While these use cases may access information spanning long time periods (e.g. long context from prior conversations), the scope of execution for both is usually short. We’re used to a few seconds to maybe a minute or two of execution time between human interactions.
But there’s an emerging third case — Outer Loop agents.
These are AI applications that are launched once by software or a human but then execute some set of instructions/tasks for minutes, hours, days, weeks or even longer. To be actually useful, these agents will need:
A way to contact humans to deliver updates as things change
Human approvals on high-stakes operations (a rough sketch of this pattern follows the list)
Input and feedback from subject matter experts and other peers
Help when they get stuck or encounter a task that a human must execute (e.g. make me an OAuth App and give me the client keys)
Ways for humans to interrupt / re-steer actions-in-progress
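To make the approvals item concrete, here is a minimal sketch of the pattern. The ask_human_for_approval helper and the decorated function are placeholders I made up for illustration; a real implementation would route through Slack, email, or a web UI rather than input(), and nothing here is a specific library's API.

```python
import functools

def ask_human_for_approval(summary: str) -> bool:
    """Stand-in: a real implementation would message a reviewer and block,
    possibly for hours or days, until they approve or reject."""
    return input(f"approve? {summary} [y/N] ").strip().lower() == "y"

def requires_approval(fn):
    """Wrap a high-stakes tool so it only runs after a human signs off."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        if not ask_human_for_approval(f"{fn.__name__} args={args} kwargs={kwargs}"):
            return "denied by human reviewer"
        return fn(*args, **kwargs)
    return wrapper

@requires_approval
def perform_high_stakes_operation(target: str) -> str:
    # placeholder for whatever operation the agent should not run unsupervised
    return f"operation completed against {target}"
```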
While they can’t operate without human input, ideally humans can launch these agents without needing to babysit a chat window all day. For Outer Loop agents, rather than a human summoning an AI application to perform a task or deliver an answer, we have AI applications summoning humans as needed.
Let’s look at an example, and how GPT and Claude differ in how they handle this “inverted control” use case.
Function Calling Face-Off: GPT vs. Claude
I won’t go into great detail, but you can check out the code examples for GPT and Claude in the HumanLayer repo on GitHub. Essentially, we have an example where we kick off a simple agentic workflow with a few instructions and tool calls.1
The Prompt
Not word for word here, but we ask the agent something like
Please check my LinkedIn inbox and contact me on slack with a summary of the messages, offering to perform any followups.
The Tools
We give the agent three tools. For simplicity and clarity, the interactions with the LinkedIn API are stubbed out (but the agent doesn’t know that, it genuinely believes it’s fetching / sending messages).
fetch_linkedin_inbox - returns a mocked list of messages from the LinkedIn API
send_linkedin_message - sends a message to the LinkedIn API (doesn’t actually send anything)
contact_human_in_slack - sends a message to a human on Slack and blocks until a response is received
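Here is a rough sketch of what those three tools might look like as Python stubs. The names match the ones above, but the bodies (the canned inbox data, and input() standing in for a real Slack round trip) are my own assumptions, not the actual HumanLayer example code.

```python
import json

def fetch_linkedin_inbox() -> str:
    """Stubbed: returns a canned inbox instead of calling the LinkedIn API."""
    return json.dumps([
        {"from": "Sarah", "message": "I'd love to explore your product."},
        {"from": "Terri", "message": "(no reply yet to your last outreach)"},
    ])

def send_linkedin_message(to: str, message: str) -> str:
    """Stubbed: pretends to send a LinkedIn message; nothing is actually sent."""
    return f"message to {to} queued for delivery"

def contact_human_in_slack(message: str) -> str:
    """Post to Slack and block until the human replies. input() stands in here
    for a real Slack integration that might wait minutes or hours."""
    print(f"[slack] {message}")
    return input("human reply> ")
```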
The Results - GPT does the stuff…
Running this with gpt-4o results in the following (paraphrased) function calls:
fetch_linkedin_inbox()
contact_human("You have one new message from Sarah who wants to explore your product. Terri has still not responded. Do you want me to offer availability to Sarah?")
# human responds w/ "yes and follow up with terri"
send_linkedin_message({"to": "sarah", "message": "..."})
send_linkedin_message({"to": "terri", "message": "..."})
From here, gpt-4o lands in the “stop” state with a final message about how it sent the followups and is waiting for further instructions.
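For context, the loop driving those calls looks roughly like this. This is a simplified sketch against the OpenAI Python SDK, reusing the stub functions from the earlier sketch; the tool schemas and wiring are mine, not the repo's actual code.

```python
import json
from openai import OpenAI

client = OpenAI()

TOOLS = [
    {"type": "function", "function": {
        "name": "fetch_linkedin_inbox",
        "description": "Fetch recent messages from the LinkedIn inbox",
        "parameters": {"type": "object", "properties": {}}}},
    {"type": "function", "function": {
        "name": "send_linkedin_message",
        "description": "Send a message to a LinkedIn contact",
        "parameters": {"type": "object", "properties": {
            "to": {"type": "string"}, "message": {"type": "string"}},
            "required": ["to", "message"]}}},
    {"type": "function", "function": {
        "name": "contact_human_in_slack",
        "description": "Ask the human a question on Slack and wait for the reply",
        "parameters": {"type": "object", "properties": {
            "message": {"type": "string"}}, "required": ["message"]}}},
]

# the stub implementations from the earlier sketch
FUNCTIONS = {
    "fetch_linkedin_inbox": fetch_linkedin_inbox,
    "send_linkedin_message": send_linkedin_message,
    "contact_human_in_slack": contact_human_in_slack,
}

messages = [{"role": "user", "content": (
    "Please check my LinkedIn inbox and contact me on slack with a summary "
    "of the messages, offering to perform any followups."
)}]

while True:
    resp = client.chat.completions.create(model="gpt-4o", messages=messages, tools=TOOLS)
    choice = resp.choices[0]
    messages.append(choice.message)
    if choice.finish_reason != "tool_calls":
        break  # gpt-4o lands here ("stop") after sending the followups
    for call in choice.message.tool_calls:
        args = json.loads(call.function.arguments)
        result = FUNCTIONS[call.function.name](**args)
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})

print(choice.message.content)  # the final "I sent the followups..." text
```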
…but Claude figured out a cool new thing
But Claude does one last step that’s kind of magical. Here are the function calls run by claude-3-5-sonnet:
fetch_linkedin_inbox()
contact_human("You have one new message from Sarah who wants to explore your product. Terri has still not responded. Do you want me to offer availability to Sarah?")
# human responds w/ "yes and follow up with terri"
send_linkedin_message({"to": "sarah", "message": "..."})
send_linkedin_message({"to": "terri", "message": "..."})
contact_human("I've sent the followups as you've requested, is there anything else you need?")
Claude appears to have learned a preference for contacting humans via function calls. Rather than just dumping “I did the things” to the console, it goes so far as to make its followup communication *via a tool call*.
Now, of course, modern APIs let clients force a model to only use tool calls, but it is interesting to see how different models bias towards tools vs. the traditional user / assistant interaction tuning.
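For the curious, that knob looks roughly like the following today (parameter names as I understand the current APIs; this continues the sketch above rather than quoting the repo, and ANTHROPIC_TOOLS is a placeholder for the same three tools rewritten in Anthropic's schema format).

```python
# OpenAI: forbid plain-text replies; the model must call some tool.
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=TOOLS,
    tool_choice="required",  # default is "auto"
)

# Anthropic: the equivalent knob is tool_choice={"type": "any"}.
import anthropic

claude = anthropic.Anthropic()
resp = claude.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Please check my LinkedIn inbox..."}],
    tools=ANTHROPIC_TOOLS,  # placeholder: same tools in Anthropic's input_schema format
    tool_choice={"type": "any"},
)
```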
What does this have to do with the OpenAI Realtime API?
In addition to the Realtime API’s showcase of using function calling to contact humans other than the instructing user, there were some other hints at what OpenAI is thinking with respect to outer-loop agents. The fireside chat with Sam included several references to “agents that live out in the world”.
From where I’m standing, it looks like the cutting edge is finally moving beyond 1-on-1 user/assistant conversation tuning, and toward using function calls for communication with humans. Maybe someday it will be *only* function calls.
The folks at Latent Space had another fun insight on the Dev Day floor along similar lines:
But if you, like, cut out the audio outputs and make it so it always has to output a function, like you can end up with pretty pretty good, pretty reliable, command architecture.
Yeah, actually, that's the way I want to interact with a lot of these things as well. Like, one sided voice.
(paraphrased slightly from the transcript ~00:17:10)
This describes something that is technically still human-initiated, but I love the insight that the two-way human-AI chat-style interaction isn’t the only communication paradigm.
So where is all this going?
The Agent-Human Interface is Next
The SWE-Agent team talked about the “Agent-Computer Interface (ACI)”, a spin-off of the traditional Human-Computer Interface. We’ve had Human→Agent interfaces for a while now. I spend a lot of time thinking about the Agent→Human Interface (AHI)2.
If we’re doing this via function calling, then the ways in which we expose different human interaction options to LLMs become important. In my experience, models/agents that communicate with *multiple* humans via *multiple* natural language interfaces start to feel *so much closer* to actual human collaborators. Rather than being the primary interaction channel, “Respond to the human giving you instructions” becomes just another tool call. Imagine an agent that can (see the sketch after this list):
Request input on blog post structure/content from a product manager and a solutions engineer SME
Draft the post based on the input, scraping documentation, and asking followup questions to those human SMEs
Queue the drafted blog post for approval by a head of marketing or CEO before posting
Contact several other agents to do similar for promoting the post on Twitter, LinkedIn, and internal Slack channels
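As a hedged sketch of what that tool surface could look like: the role names, channels, and make_human_tool helper below are all illustrative, and contact_human_in_slack is the blocking stub from earlier.

```python
def make_human_tool(channel: str, role: str):
    """Wrap a specific human (or Slack channel) as a tool the agent can call."""
    def tool(message: str) -> str:
        return contact_human_in_slack(f"(to the {role} in #{channel}) {message}")
    return tool

ask_product_manager = make_human_tool("pm-reviews", "product manager")
ask_solutions_engineer = make_human_tool("se-team", "solutions engineer SME")
request_marketing_approval = make_human_tool("exec-approvals", "head of marketing")

# "Respond to the human giving you instructions" becomes just one more entry.
HUMAN_TOOLS = {
    "ask_product_manager": ask_product_manager,
    "ask_solutions_engineer": ask_solutions_engineer,
    "request_marketing_approval": request_marketing_approval,
    "respond_to_instructing_human": contact_human_in_slack,
}
```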
Parts of these steps are doable with today’s agents, but the “interact reliably with multiple humans to achieve a goal” piece is still early tech. It’s not clear if the answer involves focusing on tuning, tools, prompting, or something else. It’s probably at least some combination of all three.
I’m stoked to see the community make big steps towards these outer loop agents that collaborate proactively with humans. If you’re thinking about building Gen 3 agents, find me on LinkedIn or Twitter and let’s chat! There’s lots to be figured out, including agent-to-human, agent-to-agent, memory, safety, agent orchestration / runtime, and a whole bunch more. In the meantime, I’ll leave you with this one from @kwindla from daily.co -
We didn’t do hundreds of iterations because the goal here isn’t to deliver a fully-researched paper, it’s to demonstrate what the future might look like.
And maybe we could call Yohei’s latest thing the Agent-Self Interface?