Building a real-time voice agent is hard because of one thing: **latency**. If the user says "Hello" and the bot takes three seconds to run Speech-to-Text (STT), then an LLM, then Text-to-Speech (TTS), the conversation feels dead.
**Amazon Nova Sonic** acts as a unified multimodal model that handles audio-in/audio-out in a single stream, cutting latency dramatically.
## Architecture: The WebSocket Stream
Unlike a REST API, Nova Sonic requires a persistent, bidirectional connection: the browser holds a WebSocket to your server, and the server holds a bidirectional HTTP/2 stream to Bedrock.

For example, a minimal bot has three layers (a server-side sketch follows the list):
1. **Client (React):** Captures microphone audio.
2. **Server (FastAPI):** Proxies the stream to Bedrock via HTTP/2.
3. **Model (Nova Sonic):** Consumes audio chunks and streams back audio chunks.
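To make the proxy layer concrete, here is a minimal sketch of the FastAPI side. Only the FastAPI calls are real; `BedrockStream` and its `send_audio`/`events` methods are hypothetical stand-ins for whichever bidirectional-stream client you use against Bedrock.
```python
import asyncio
from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/voice")
async def voice_proxy(ws: WebSocket):
    await ws.accept()
    # Hypothetical wrapper around the bidirectional Bedrock stream.
    stream = await BedrockStream.connect(model_id="amazon.nova-sonic-v1:0")

    async def uplink():
        # Forward mic audio from the browser to Bedrock.
        while True:
            chunk = await ws.receive_bytes()
            await stream.send_audio(chunk)  # hypothetical method

    async def downlink():
        # Forward generated audio from Bedrock back to the browser.
        async for event in stream.events():  # hypothetical method
            if event.type == "audioOutput":
                await ws.send_bytes(event.audio)

    # Run both directions concurrently for the life of the call.
    await asyncio.gather(uplink(), downlink())
```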
## The Event Sequence (Speculative Transcripts)
To make it feel even faster, Nova Sonic sends text *before* the audio is fully ready.
1. **User Speaks:** "What is the weather?"
2. **Speculative Transcript:** Bedrock sends a text event tagged `generationStage: "SPECULATIVE"`. The UI shows this immediately.
3. **Audio Output:** The actual sound bytes arrive.
4. **Final Transcript:** The official text for the conversation log (see the handler sketch below).
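A minimal handler for the two transcript stages. The flat field names (`generationStage`, `content`) and the `ui` object are assumptions for illustration; in practice the stage marker may sit in a nested part of the event payload.
```python
def handle_text_event(event: dict, ui) -> None:
    # `ui` is a placeholder for whatever renders your transcript.
    stage = event.get("generationStage")
    text = event.get("content", "")
    if stage == "SPECULATIVE":
        ui.show_draft(text)   # render immediately; may still be revised
    elif stage == "FINAL":
        ui.commit(text)       # authoritative transcript for the log
```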
### Handling "Barge-In" (Interruptions)
If the user interrupts the bot ("No, wait—"), Nova Sonic detects this and sends `{"interrupted": true}`.
**Critical Implementation Detail:** Your client MUST clear its audio playback buffer the moment this flag arrives; otherwise the bot keeps talking over the user.
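A minimal sketch of that rule, assuming decoded events arrive as Python dicts; the queue-based buffer is illustrative, not part of any SDK.
```python
import queue

class PlaybackBuffer:
    """Holds audio chunks awaiting playback; flushed on barge-in."""

    def __init__(self) -> None:
        self._chunks = queue.Queue()

    def push(self, chunk: bytes) -> None:
        self._chunks.put(chunk)

    def clear(self) -> None:
        # Drop everything queued so stale bot speech never plays.
        try:
            while True:
                self._chunks.get_nowait()
        except queue.Empty:
            pass

def on_event(event: dict, buffer: PlaybackBuffer) -> None:
    if event.get("interrupted"):
        buffer.clear()  # must happen before any new chunks are queued
```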
## Defining Tools for Voice
Nova Sonic can call tools just like a text agent.
```json
"toolConfiguration": {
"tools": [
{
"toolSpec": {
"name": "get_weather",
"description": "Get weather for a location",
"inputSchema": {
"json": {
"type": "object",
"properties": {
"city": { "type": "string" }
},
"required": ["city"]
}
}
}
}
]
}
```
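When the model decides to call the tool, a `toolUse` event arrives over the stream. Here is a hypothetical dispatcher sketch; the event field names (`toolName`, `content`) are assumptions for illustration, not confirmed API.
```python
import json

# Registry mapping tool names to Python callables.
TOOLS = {}

def tool(fn):
    TOOLS[fn.__name__] = fn
    return fn

@tool
def get_weather(city: str) -> dict:
    # Stub: swap in a real weather lookup.
    return {"city": city, "forecast": "sunny", "temp_c": 22}

def handle_tool_use(event: dict) -> str:
    # Field names here are assumptions, not confirmed API.
    name = event["toolName"]
    args = json.loads(event["content"])  # assumed JSON-encoded input
    result = TOOLS[name](**args)
    return json.dumps(result)  # text to send back over the stream
```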
## RAG Integration (Knowledge Base)
You can wrap a Bedrock Knowledge Base query inside a Python function tool.
```python
import boto3

# Knowledge Base retrieval goes through the Bedrock *Agent* Runtime client.
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

def retrieve_kb(query: str) -> str:
    response = bedrock_agent_runtime.retrieve(
        knowledgeBaseId=KB_ID,
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": 1}
        },
    )
    results = response["retrievalResults"]
    if not results:
        return "No relevant documents were found."
    return results[0]["content"]["text"]
```
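Because the answer is spoken aloud, `numberOfResults: 1` keeps responses short; if you raise it, concatenate the chunks and let the model summarize. Register `retrieve_kb` in `toolConfiguration` exactly like `get_weather` above so Nova Sonic can invoke it by voice.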
## Strands Integration: The "Brain" Pattern

For complex reasoning (e.g., "Plan a travel itinerary and check budget"), Nova Sonic might struggle. The pattern is to use **Nova Sonic as the Router** and **Strands as the Brain**.
1. Nova Sonic hears the complex request.
2. It calls a "Meta-Tool" named `externalAgent`.
3. The Strands Agent (running Claude 3.5 Sonnet) performs the logic.
4. The text result is returned to Nova Sonic to speak (see the sketch below).
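A sketch of the meta-tool, assuming the Strands `Agent` API (constructed with a Bedrock model ID, callable with a prompt); `external_agent` would be registered in `toolConfiguration` like any other tool.
```python
from strands import Agent

# The "Brain": a Strands agent backed by Claude 3.5 Sonnet on Bedrock.
brain = Agent(
    model="anthropic.claude-3-5-sonnet-20240620-v1:0",
    system_prompt="Answer concisely; the reply will be spoken aloud.",
)

def external_agent(query: str) -> str:
    """Meta-tool body that Nova Sonic invokes for multi-step reasoning."""
    result = brain(query)  # Strands agents are callable with a prompt
    return str(result)     # plain text handed back for Nova Sonic to speak
```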
**Conclusion:** Use Nova Sonic for speed (simple Q&A). Offload to Strands for deep reasoning.
## References
* [Nova Sonic Documentation](https://docs.aws.amazon.com/nova/latest/userguide/speech.html#speech-architecture)