generated from nhcarrigan/template
523 lines
18 KiB
HTML
523 lines
18 KiB
HTML
<!DOCTYPE html>
|
|
<html lang="en">
|
|
<head>
|
|
<title>Berkeley AI Hackathon: Voice AI Agent Workshop</title>
|
|
<meta charset="utf-8" />
|
|
<meta name="viewport" content="width=device-width, initial-scale=1" />
|
|
<meta
|
|
name="description"
|
|
content="Build a fully functional voice AI agent from scratch. The full workshop guide: concept, architecture, setup, and three hands-on modifications."
|
|
/>
|
|
<script
|
|
src="https://cdn.nhcarrigan.com/headers/index.js"
|
|
async
|
|
defer
|
|
></script>
|
|
<style>
|
|
.talk-meta {
|
|
font-size: 0.9rem;
|
|
display: flex;
|
|
flex-wrap: wrap;
|
|
gap: 0.5em 1.5em;
|
|
margin-bottom: 1.5em;
|
|
}
|
|
|
|
.talk-meta span {
|
|
display: inline-flex;
|
|
align-items: center;
|
|
gap: 0.35em;
|
|
}
|
|
|
|
.talk-links {
|
|
margin-bottom: 1.5em;
|
|
}
|
|
|
|
.talk-links a {
|
|
margin-right: 1em;
|
|
}
|
|
|
|
hr {
|
|
border: 1px solid var(--witch-plum);
|
|
margin: 2em 0;
|
|
}
|
|
|
|
.is-dark hr {
|
|
border-color: var(--witch-rose);
|
|
}
|
|
|
|
section {
|
|
margin-bottom: 2em;
|
|
}
|
|
|
|
blockquote {
|
|
border-left: 3px solid var(--witch-rose);
|
|
margin: 1.5em 0;
|
|
padding: 0.5em 0 0.5em 1.25em;
|
|
font-style: italic;
|
|
}
|
|
|
|
.is-dark blockquote {
|
|
border-left-color: var(--witch-mauve);
|
|
}
|
|
|
|
.callout {
|
|
background: rgba(212, 165, 199, 0.08);
|
|
border: 1px solid var(--witch-plum);
|
|
border-radius: 10px;
|
|
padding: 1.25em 1.5em;
|
|
margin: 1.25em 0;
|
|
}
|
|
|
|
.is-dark .callout {
|
|
border-color: var(--witch-rose);
|
|
}
|
|
|
|
.callout p:last-child {
|
|
margin-bottom: 0;
|
|
}
|
|
|
|
.loop-diagram {
|
|
display: flex;
|
|
align-items: center;
|
|
justify-content: center;
|
|
flex-wrap: wrap;
|
|
gap: 0.5em;
|
|
margin: 1.5em 0;
|
|
font-size: 1rem;
|
|
}
|
|
|
|
.loop-step {
|
|
background: rgba(212, 165, 199, 0.12);
|
|
border: 1px solid var(--witch-plum);
|
|
border-radius: 8px;
|
|
padding: 0.6em 1em;
|
|
text-align: center;
|
|
}
|
|
|
|
.is-dark .loop-step {
|
|
border-color: var(--witch-rose);
|
|
}
|
|
|
|
.loop-arrow {
|
|
font-size: 1.2rem;
|
|
opacity: 0.6;
|
|
}
|
|
|
|
.step-list {
|
|
counter-reset: steps;
|
|
list-style: none;
|
|
padding: 0;
|
|
}
|
|
|
|
.step-list li {
|
|
counter-increment: steps;
|
|
display: flex;
|
|
gap: 1em;
|
|
align-items: flex-start;
|
|
margin-bottom: 1.25em;
|
|
}
|
|
|
|
.step-list li::before {
|
|
content: counter(steps);
|
|
background: var(--witch-plum);
|
|
color: var(--witch-moon);
|
|
border-radius: 50%;
|
|
min-width: 1.8em;
|
|
height: 1.8em;
|
|
display: flex;
|
|
align-items: center;
|
|
justify-content: center;
|
|
font-size: 0.9rem;
|
|
flex-shrink: 0;
|
|
}
|
|
|
|
.is-dark .step-list li::before {
|
|
background: var(--witch-rose);
|
|
color: var(--witch-black);
|
|
}
|
|
|
|
.mod-block {
|
|
background: rgba(212, 165, 199, 0.08);
|
|
border: 1px solid var(--witch-plum);
|
|
border-radius: 10px;
|
|
padding: 1.25em 1.5em;
|
|
margin: 1.25em 0;
|
|
}
|
|
|
|
.is-dark .mod-block {
|
|
border-color: var(--witch-rose);
|
|
}
|
|
|
|
.mod-block h3 {
|
|
margin-top: 0;
|
|
}
|
|
|
|
.mod-block p:last-child {
|
|
margin-bottom: 0;
|
|
}
|
|
|
|
code {
|
|
background: rgba(0, 0, 0, 0.08);
|
|
border-radius: 4px;
|
|
padding: 0.15em 0.4em;
|
|
font-family: monospace;
|
|
font-size: 0.9em;
|
|
}
|
|
|
|
.is-dark code {
|
|
background: rgba(255, 255, 255, 0.08);
|
|
}
|
|
|
|
pre {
|
|
background: rgba(0, 0, 0, 0.08);
|
|
border-radius: 8px;
|
|
padding: 1em 1.25em;
|
|
overflow-x: auto;
|
|
font-family: monospace;
|
|
font-size: 0.9em;
|
|
line-height: 1.6;
|
|
}
|
|
|
|
.is-dark pre {
|
|
background: rgba(255, 255, 255, 0.05);
|
|
}
|
|
|
|
.back-link {
|
|
font-size: 0.9rem;
|
|
display: block;
|
|
margin-top: 2em;
|
|
}
|
|
</style>
|
|
</head>
|
|
<body>
|
|
<main>
|
|
<h1>Voice AI Agent Workshop</h1>
|
|
<div class="talk-meta">
|
|
<span>
|
|
<i class="fas fa-calendar-alt" aria-hidden="true"></i>
|
|
20–21 June 2026
|
|
</span>
|
|
<span>
|
|
<i class="fas fa-map-marker-alt" aria-hidden="true"></i>
|
|
UC Berkeley Campus, Berkeley, CA
|
|
</span>
|
|
<span>
|
|
<i class="fas fa-user-tag" aria-hidden="true"></i>
|
|
Speaker & Sponsor (Berkeley AI Hackathon)
|
|
</span>
|
|
</div>
|
|
<div class="talk-links">
|
|
<a href="https://ai.hackberkeley.org" target="_blank" rel="noopener noreferrer">
|
|
<i class="fas fa-external-link-alt" aria-hidden="true"></i>
|
|
Berkeley AI Hackathon
|
|
</a>
|
|
<a href="https://console.deepgram.com" target="_blank" rel="noopener noreferrer">
|
|
<i class="fas fa-external-link-alt" aria-hidden="true"></i>
|
|
Deepgram Console ($200 free credit)
|
|
</a>
|
|
</div>
|
|
|
|
<p>
|
|
Voice AI is having a moment — and it's more accessible than you might think. In this
|
|
workshop, we'll build a fully functional voice AI agent from scratch, using real-time
|
|
speech-to-text, a large language model for reasoning, and text-to-speech to talk back. By
|
|
the end, you'll have a working agent on your laptop that you can drop straight into a project.
|
|
</p>
|
|
|
|
<blockquote>
|
|
Talk to your hackathon project in 40 minutes.
|
|
</blockquote>
|
|
|
|
<p>No prior voice AI experience needed. If you've worked with an API before, you're good to go.</p>
|
|
|
|
<hr />
|
|
|
|
<section>
|
|
<h2>How a Voice Agent Works</h2>
|
|
<p>
|
|
A voice agent is three pieces wired together in a loop. Understanding this loop is the whole
|
|
conceptual foundation. Everything else is implementation detail.
|
|
</p>
|
|
<div class="loop-diagram">
|
|
<div class="loop-step">
|
|
<strong>Ear</strong><br />
|
|
Speech-to-Text
|
|
</div>
|
|
<span class="loop-arrow">→</span>
|
|
<div class="loop-step">
|
|
<strong>Brain</strong><br />
|
|
LLM
|
|
</div>
|
|
<span class="loop-arrow">→</span>
|
|
<div class="loop-step">
|
|
<strong>Mouth</strong><br />
|
|
Text-to-Speech
|
|
</div>
|
|
<span class="loop-arrow" style="transform: rotate(90deg); display: block;">→</span>
|
|
</div>
|
|
<p>
|
|
<strong>The ear</strong> captures your audio and transcribes it in real time using
|
|
speech-to-text. Latency here is everything: if your STT is slow, the whole agent feels
|
|
sluggish. Deepgram's STT runs at sub-300ms end-to-end, which is fast enough to feel like
|
|
a real conversation.
|
|
</p>
|
|
<p>
|
|
<strong>The brain</strong> receives the transcript and decides what to say. This is your
|
|
LLM: it reasons over the conversation history and your system prompt, generates a response,
|
|
and can call any functions you've given it access to. It can look things up, run code,
|
|
fetch data — anything you wire in.
|
|
</p>
|
|
<p>
|
|
<strong>The mouth</strong> takes the LLM's text response and streams it back as audio. Fast
|
|
streaming matters here too: you want audio to start playing before the full response is
|
|
generated, or the agent feels like it's thinking too hard.
|
|
</p>
|
|
<p>
|
|
With Deepgram, all three pieces run over a single WebSocket connection. You're not juggling
|
|
three separate APIs — it's one socket, one loop, about 80 lines of code to start.
|
|
</p>
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Design Decisions That Actually Matter</h2>
|
|
<p>
|
|
Most tutorials skip these. They're the difference between an agent that feels like a demo
|
|
and one that feels like a tool.
|
|
</p>
|
|
|
|
<h3>Interruption Handling</h3>
|
|
<p>
|
|
In a real conversation, you don't wait for the other person to finish before you start
|
|
talking. Your agent shouldn't either. When the user starts speaking mid-response, the agent
|
|
needs to stop its output and listen. Getting this wrong is the fastest way to make an agent
|
|
feel robotic.
|
|
</p>
|
|
<p>
|
|
Deepgram's Voice Agent API handles this for you. The WebSocket connection detects voice
|
|
activity on both ends and manages the interrupt logic. You don't have to implement it
|
|
yourself — just don't override it.
|
|
</p>
|
|
|
|
<h3>Conversation State</h3>
|
|
<p>
|
|
Every turn in the conversation needs to be threaded correctly. The LLM needs the full
|
|
context of what's been said so far — by both sides — to give coherent responses. This
|
|
means maintaining a message history and passing it in with every LLM call.
|
|
</p>
|
|
<p>
|
|
Keep your context window in mind. Very long conversations will eventually push older turns
|
|
out of the context. A simple solution: keep the system prompt, the last N turns, and let
|
|
the rest roll off. The agent won't remember everything, but the conversation will stay
|
|
coherent.
|
|
</p>
|
|
|
|
<h3>Latency</h3>
|
|
<p>
|
|
The number that matters is end-to-end: from when the user stops speaking to when audio
|
|
starts playing back. Under 500ms feels conversational. Over 1 second feels like a phone
|
|
call with bad signal.
|
|
</p>
|
|
<p>
|
|
Three places latency hides: STT processing time, LLM first-token time, and TTS start time.
|
|
Streaming helps with all three. Start playing audio as soon as the first TTS chunk arrives.
|
|
Choose an LLM model that prioritises speed over capability for conversational use — a
|
|
smaller, faster model is often better here than a smarter, slower one.
|
|
</p>
|
|
</section>
|
|
|
|
<hr />
|
|
|
|
<section>
|
|
<h2>Getting Set Up</h2>
|
|
|
|
<div class="callout">
|
|
<p><strong>Before you start:</strong></p>
|
|
<ul>
|
|
<li>Node 20+ or Python 3.11+ installed</li>
|
|
<li>A code editor</li>
|
|
<li>A <a href="https://console.deepgram.com" target="_blank" rel="noopener noreferrer">free Deepgram account</a> — includes $200 credit on sign-up, no credit card needed</li>
|
|
</ul>
|
|
</div>
|
|
|
|
<p>Five steps to a running agent:</p>
|
|
<ol class="step-list">
|
|
<li>
|
|
<div>
|
|
Sign up at <a href="https://console.deepgram.com" target="_blank" rel="noopener noreferrer">console.deepgram.com</a>.
|
|
Your account comes with $200 in free credit — more than enough for this workshop and a weekend of hacking.
|
|
</div>
|
|
</li>
|
|
<li>
|
|
<div>
|
|
Grab your API key from the console. It lives under API Keys in the left sidebar. Copy it somewhere safe.
|
|
</div>
|
|
</li>
|
|
<li>
|
|
<div>
|
|
Clone the starter repo. We'll walk through this together in the workshop, but the pattern is:
|
|
<pre>git clone <starter-repo-url>
|
|
cd <repo-name></pre>
|
|
</div>
|
|
</li>
|
|
<li>
|
|
<div>
|
|
Add your API key to the environment. Create a <code>.env</code> file in the project root:
|
|
<pre>DEEPGRAM_API_KEY=your_key_here</pre>
|
|
</div>
|
|
</li>
|
|
<li>
|
|
<div>
|
|
Install dependencies and run it:
|
|
<pre># Node
|
|
npm install && npm start
|
|
|
|
# Python
|
|
uv venv && uv pip install -r requirements.txt
|
|
python main.py</pre>
|
|
</div>
|
|
</li>
|
|
</ol>
|
|
<p>
|
|
If it's working, you'll see a connection message in the terminal and the agent will greet
|
|
you when you speak. If it's not working, come find us at the booth.
|
|
</p>
|
|
</section>
|
|
|
|
<hr />
|
|
|
|
<section>
|
|
<h2>Make It Yours — Three Modifications</h2>
|
|
<p>
|
|
The starter app works, but it's generic. These three modifications are where the session
|
|
becomes yours. Each one takes about five to seven minutes. Do them in order, or skip to
|
|
whichever interests you most.
|
|
</p>
|
|
|
|
<div class="mod-block">
|
|
<h3>Modification A: Change the Personality</h3>
|
|
<p>
|
|
The agent's system prompt is what defines who it is. Find the <code>system_prompt</code>
|
|
variable in the starter code — it'll look something like this:
|
|
</p>
|
|
<pre>system_prompt = "You are a helpful assistant."</pre>
|
|
<p>
|
|
Replace it with something more specific. Give the agent a name, a role, a point of view.
|
|
For example:
|
|
</p>
|
|
<pre>system_prompt = """You are Alf, a friendly AI assistant at a hackathon.
|
|
You give encouragement, suggest project ideas, and answer questions
|
|
about voice AI. You're enthusiastic but concise - people are busy building."""</pre>
|
|
<p>
|
|
Restart the agent and talk to it. The change is immediate. The system prompt is the
|
|
entire personality — there's no training, no fine-tuning. You just wrote it.
|
|
</p>
|
|
</div>
|
|
|
|
<div class="mod-block">
|
|
<h3>Modification B: Swap the Voice</h3>
|
|
<p>
|
|
Deepgram's TTS has a full voice catalogue. Find the voice configuration in the starter
|
|
code — it'll be a single string like <code>"aura-asteria-en"</code>. Change it to any
|
|
other voice from the catalogue.
|
|
</p>
|
|
<p>
|
|
A few to try:
|
|
</p>
|
|
<ul>
|
|
<li><code>aura-asteria-en</code> — warm, conversational</li>
|
|
<li><code>aura-orion-en</code> — deep, authoritative</li>
|
|
<li><code>aura-luna-en</code> — clear, neutral</li>
|
|
<li><code>aura-zeus-en</code> — bold, energetic</li>
|
|
</ul>
|
|
<p>
|
|
Restart and talk to your agent. It's a one-line change and the agent sounds entirely
|
|
different. This is the modification that gets the strongest reaction.
|
|
</p>
|
|
</div>
|
|
|
|
<div class="mod-block">
|
|
<h3>Modification C: Add a Custom Function</h3>
|
|
<p>
|
|
This is the one that makes you realise you can hook it to anything. Function calling lets
|
|
the agent invoke Python (or JS) functions you write, then incorporate the result into its
|
|
response naturally.
|
|
</p>
|
|
<p>
|
|
Here's a simple example: a function that returns a random project idea when the agent
|
|
is asked for inspiration.
|
|
</p>
|
|
<pre>import random
|
|
|
|
def get_project_idea():
|
|
ideas = [
|
|
"A voice-controlled to-do list that reads back your tasks",
|
|
"An AI study buddy that quizzes you out loud",
|
|
"A real-time translator that speaks back in the target language",
|
|
"A voice journalling app that summarises your entries",
|
|
]
|
|
return random.choice(ideas)</pre>
|
|
<p>
|
|
Register this function with the agent and tell the LLM when to use it via the system
|
|
prompt: <em>"When asked for a project idea, call get_project_idea() and share the result."</em>
|
|
</p>
|
|
<p>
|
|
Now ask your agent for a project idea. Watch it call the function and weave the result
|
|
into a natural spoken response. This is the "aha" moment: the agent can call your own
|
|
code, your own APIs, your own data. The voice interface is just the front door.
|
|
</p>
|
|
</div>
|
|
</section>
|
|
|
|
<hr />
|
|
|
|
<section>
|
|
<h2>What You Can Build From Here</h2>
|
|
<p>
|
|
Now that you have a working, personalised, function-capable voice agent, here's what that
|
|
unlocks for your hackathon project:
|
|
</p>
|
|
<ul>
|
|
<li>
|
|
<strong>Accessibility layer</strong> — add voice input and output to any existing
|
|
interface. Users who can't type or read small text get a completely different experience.
|
|
</li>
|
|
<li>
|
|
<strong>In-game NPC</strong> — drop the agent into a game as a character that actually
|
|
talks back. Hook the function calling to your game state so it knows what's happening.
|
|
</li>
|
|
<li>
|
|
<strong>Voice-controlled developer tool</strong> — talk to your build process, your
|
|
deploy pipeline, your monitoring dashboard. Voice is an unusually good interface for
|
|
things you want to do hands-free.
|
|
</li>
|
|
<li>
|
|
<strong>Multilingual support</strong> — Deepgram's STT handles dozens of languages.
|
|
The LLM can respond in whatever language the user speaks. Global voice interface, almost for free.
|
|
</li>
|
|
</ul>
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Get Help</h2>
|
|
<p>Building something? Stuck on something? Here's where to find us:</p>
|
|
<ul>
|
|
<li><strong>At the event:</strong> the Deepgram booth — we're here all weekend</li>
|
|
<li><strong>Docs:</strong> <a href="https://developers.deepgram.com" target="_blank" rel="noopener noreferrer">developers.deepgram.com</a></li>
|
|
<li><strong>Community:</strong> <a href="https://deepgram.com/discord" target="_blank" rel="noopener noreferrer">Deepgram's Discord</a></li>
|
|
</ul>
|
|
<p>
|
|
The Deepgram challenge prize this weekend goes to the team that builds the most creative
|
|
voice-powered experience. Come say hi, show us what you're building, and let us know if you
|
|
want feedback on your voice integration before judging.
|
|
</p>
|
|
</section>
|
|
|
|
<hr />
|
|
<a class="back-link" href="/">
|
|
<i class="fas fa-arrow-left" aria-hidden="true"></i>
|
|
Back to all talks
|
|
</a>
|
|
</main>
|
|
</body>
|
|
</html>
|