Voice AI Agent Workshop

20–21 June 2026 UC Berkeley Campus, Berkeley, CA Speaker & Sponsor (Berkeley AI Hackathon)

Voice AI is having a moment — and it's more accessible than you might think. In this workshop, we'll build a fully functional voice AI agent from scratch, using real-time speech-to-text, a large language model for reasoning, and text-to-speech to talk back. By the end, you'll have a working agent on your laptop that you can drop straight into a project.

Talk to your hackathon project in 40 minutes.

No prior voice AI experience needed. If you've worked with an API before, you're good to go.


How a Voice Agent Works

A voice agent is three pieces wired together in a loop. Understanding this loop is the whole conceptual foundation. Everything else is implementation detail.

Ear
Speech-to-Text
Brain
LLM
Mouth
Text-to-Speech

The ear captures your audio and transcribes it in real time using speech-to-text. Latency here is everything: if your STT is slow, the whole agent feels sluggish. Deepgram's STT runs at sub-300ms end-to-end, which is fast enough to feel like a real conversation.

The brain receives the transcript and decides what to say. This is your LLM: it reasons over the conversation history and your system prompt, generates a response, and can call any functions you've given it access to. It can look things up, run code, fetch data — anything you wire in.

The mouth takes the LLM's text response and streams it back as audio. Fast streaming matters here too: you want audio to start playing before the full response is generated, or the agent feels like it's thinking too hard.

With Deepgram, all three pieces run over a single WebSocket connection. You're not juggling three separate APIs — it's one socket, one loop, about 80 lines of code to start.

Design Decisions That Actually Matter

Most tutorials skip these. They're the difference between an agent that feels like a demo and one that feels like a tool.

Interruption Handling

In a real conversation, you don't wait for the other person to finish before you start talking. Your agent shouldn't either. When the user starts speaking mid-response, the agent needs to stop its output and listen. Getting this wrong is the fastest way to make an agent feel robotic.

Deepgram's Voice Agent API handles this for you. The WebSocket connection detects voice activity on both ends and manages the interrupt logic. You don't have to implement it yourself — just don't override it.

Conversation State

Every turn in the conversation needs to be threaded correctly. The LLM needs the full context of what's been said so far — by both sides — to give coherent responses. This means maintaining a message history and passing it in with every LLM call.

Keep your context window in mind. Very long conversations will eventually push older turns out of the context. A simple solution: keep the system prompt, the last N turns, and let the rest roll off. The agent won't remember everything, but the conversation will stay coherent.

Latency

The number that matters is end-to-end: from when the user stops speaking to when audio starts playing back. Under 500ms feels conversational. Over 1 second feels like a phone call with bad signal.

Three places latency hides: STT processing time, LLM first-token time, and TTS start time. Streaming helps with all three. Start playing audio as soon as the first TTS chunk arrives. Choose an LLM model that prioritises speed over capability for conversational use — a smaller, faster model is often better here than a smarter, slower one.


Getting Set Up

Before you start:

  • Node 20+ or Python 3.11+ installed
  • A code editor
  • A free Deepgram account — includes $200 credit on sign-up, no credit card needed

Five steps to a running agent:

  1. Sign up at console.deepgram.com. Your account comes with $200 in free credit — more than enough for this workshop and a weekend of hacking.
  2. Grab your API key from the console. It lives under API Keys in the left sidebar. Copy it somewhere safe.
  3. Clone the starter repo. We'll walk through this together in the workshop, but the pattern is:
    git clone <starter-repo-url>
    cd <repo-name>
  4. Add your API key to the environment. Create a .env file in the project root:
    DEEPGRAM_API_KEY=your_key_here
  5. Install dependencies and run it:
    # Node
    npm install && npm start
    
    # Python
    uv venv && uv pip install -r requirements.txt
    python main.py

If it's working, you'll see a connection message in the terminal and the agent will greet you when you speak. If it's not working, come find us at the booth.


Make It Yours — Three Modifications

The starter app works, but it's generic. These three modifications are where the session becomes yours. Each one takes about five to seven minutes. Do them in order, or skip to whichever interests you most.

Modification A: Change the Personality

The agent's system prompt is what defines who it is. Find the system_prompt variable in the starter code — it'll look something like this:

system_prompt = "You are a helpful assistant."

Replace it with something more specific. Give the agent a name, a role, a point of view. For example:

system_prompt = """You are Alf, a friendly AI assistant at a hackathon.
You give encouragement, suggest project ideas, and answer questions
about voice AI. You're enthusiastic but concise - people are busy building."""

Restart the agent and talk to it. The change is immediate. The system prompt is the entire personality — there's no training, no fine-tuning. You just wrote it.

Modification B: Swap the Voice

Deepgram's TTS has a full voice catalogue. Find the voice configuration in the starter code — it'll be a single string like "aura-asteria-en". Change it to any other voice from the catalogue.

A few to try:

  • aura-asteria-en — warm, conversational
  • aura-orion-en — deep, authoritative
  • aura-luna-en — clear, neutral
  • aura-zeus-en — bold, energetic

Restart and talk to your agent. It's a one-line change and the agent sounds entirely different. This is the modification that gets the strongest reaction.

Modification C: Add a Custom Function

This is the one that makes you realise you can hook it to anything. Function calling lets the agent invoke Python (or JS) functions you write, then incorporate the result into its response naturally.

Here's a simple example: a function that returns a random project idea when the agent is asked for inspiration.

import random

def get_project_idea():
    ideas = [
        "A voice-controlled to-do list that reads back your tasks",
        "An AI study buddy that quizzes you out loud",
        "A real-time translator that speaks back in the target language",
        "A voice journalling app that summarises your entries",
    ]
    return random.choice(ideas)

Register this function with the agent and tell the LLM when to use it via the system prompt: "When asked for a project idea, call get_project_idea() and share the result."

Now ask your agent for a project idea. Watch it call the function and weave the result into a natural spoken response. This is the "aha" moment: the agent can call your own code, your own APIs, your own data. The voice interface is just the front door.


What You Can Build From Here

Now that you have a working, personalised, function-capable voice agent, here's what that unlocks for your hackathon project:

Get Help

Building something? Stuck on something? Here's where to find us:

The Deepgram challenge prize this weekend goes to the team that builds the most creative voice-powered experience. Come say hi, show us what you're building, and let us know if you want feedback on your voice integration before judging.


Back to all talks