static-pages/talks/voice-ai-agent-workshop/index.html

<!DOCTYPE html>
<html lang="en">
  <head>
    <title>Berkeley AI Hackathon: Voice AI Agent Workshop</title>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <meta
      name="description"
      content="Build a fully functional voice AI agent from scratch. The full workshop guide: concept, architecture, setup, and three hands-on modifications."
    />
    <script
      src="https://cdn.nhcarrigan.com/headers/index.js"
      async
      defer
    ></script>
    <style>
      .talk-meta {
        font-size: 0.9rem;
        display: flex;
        flex-wrap: wrap;
        gap: 0.5em 1.5em;
        margin-bottom: 1.5em;
      }

      .talk-meta span {
        display: inline-flex;
        align-items: center;
        gap: 0.35em;
      }

      .talk-links {
        margin-bottom: 1.5em;
      }

      .talk-links a {
        margin-right: 1em;
      }

      hr {
        border: 1px solid var(--witch-plum);
        margin: 2em 0;
      }

      .is-dark hr {
        border-color: var(--witch-rose);
      }

      section {
        margin-bottom: 2em;
      }

      blockquote {
        border-left: 3px solid var(--witch-rose);
        margin: 1.5em 0;
        padding: 0.5em 0 0.5em 1.25em;
        font-style: italic;
      }

      .is-dark blockquote {
        border-left-color: var(--witch-mauve);
      }

      .callout {
        background: rgba(212, 165, 199, 0.08);
        border: 1px solid var(--witch-plum);
        border-radius: 10px;
        padding: 1.25em 1.5em;
        margin: 1.25em 0;
      }

      .is-dark .callout {
        border-color: var(--witch-rose);
      }

      .callout p:last-child {
        margin-bottom: 0;
      }

      .loop-diagram {
        display: flex;
        align-items: center;
        justify-content: center;
        flex-wrap: wrap;
        gap: 0.5em;
        margin: 1.5em 0;
        font-size: 1rem;
      }

      .loop-step {
        background: rgba(212, 165, 199, 0.12);
        border: 1px solid var(--witch-plum);
        border-radius: 8px;
        padding: 0.6em 1em;
        text-align: center;
      }

      .is-dark .loop-step {
        border-color: var(--witch-rose);
      }

      .loop-arrow {
        font-size: 1.2rem;
        opacity: 0.6;
      }

      .step-list {
        counter-reset: steps;
        list-style: none;
        padding: 0;
      }

      .step-list li {
        counter-increment: steps;
        display: flex;
        gap: 1em;
        align-items: flex-start;
        margin-bottom: 1.25em;
      }

      .step-list li::before {
        content: counter(steps);
        background: var(--witch-plum);
        color: var(--witch-moon);
        border-radius: 50%;
        min-width: 1.8em;
        height: 1.8em;
        display: flex;
        align-items: center;
        justify-content: center;
        font-size: 0.9rem;
        flex-shrink: 0;
      }

      .is-dark .step-list li::before {
        background: var(--witch-rose);
        color: var(--witch-black);
      }

      .mod-block {
        background: rgba(212, 165, 199, 0.08);
        border: 1px solid var(--witch-plum);
        border-radius: 10px;
        padding: 1.25em 1.5em;
        margin: 1.25em 0;
      }

      .is-dark .mod-block {
        border-color: var(--witch-rose);
      }

      .mod-block h3 {
        margin-top: 0;
      }

      .mod-block p:last-child {
        margin-bottom: 0;
      }

      code {
        background: rgba(0, 0, 0, 0.08);
        border-radius: 4px;
        padding: 0.15em 0.4em;
        font-family: monospace;
        font-size: 0.9em;
      }

      .is-dark code {
        background: rgba(255, 255, 255, 0.08);
      }

      pre {
        background: rgba(0, 0, 0, 0.08);
        border-radius: 8px;
        padding: 1em 1.25em;
        overflow-x: auto;
        font-family: monospace;
        font-size: 0.9em;
        line-height: 1.6;
      }

      .is-dark pre {
        background: rgba(255, 255, 255, 0.05);
      }

      .back-link {
        font-size: 0.9rem;
        display: block;
        margin-top: 2em;
      }
    </style>
  </head>
  <body>
    <main>
      <h1>Voice AI Agent Workshop</h1>
      <div class="talk-meta">
        <span>
          <i class="fas fa-calendar-alt" aria-hidden="true"></i>
          20&ndash;21 June 2026
        </span>
        <span>
          <i class="fas fa-map-marker-alt" aria-hidden="true"></i>
          UC Berkeley Campus, Berkeley, CA
        </span>
        <span>
          <i class="fas fa-user-tag" aria-hidden="true"></i>
          Speaker &amp; Sponsor (Berkeley AI Hackathon)
        </span>
      </div>
      <div class="talk-links">
        <a href="https://ai.hackberkeley.org" target="_blank" rel="noopener noreferrer">
          <i class="fas fa-external-link-alt" aria-hidden="true"></i>
          Berkeley AI Hackathon
        </a>
        <a href="https://console.deepgram.com" target="_blank" rel="noopener noreferrer">
          <i class="fas fa-external-link-alt" aria-hidden="true"></i>
          Deepgram Console ($200 free credit)
        </a>
      </div>

      <p>
        Voice AI is having a moment &mdash; and it's more accessible than you might think. In this
        workshop, we'll build a fully functional voice AI agent from scratch, using real-time
        speech-to-text, a large language model for reasoning, and text-to-speech to talk back. By
        the end, you'll have a working agent on your laptop that you can drop straight into a project.
      </p>

      <blockquote>
        Talk to your hackathon project in 40 minutes.
      </blockquote>

      <p>No prior voice AI experience needed. If you've worked with an API before, you're good to go.</p>

      <hr />

      <section>
        <h2>How a Voice Agent Works</h2>
        <p>
          A voice agent is three pieces wired together in a loop. Understanding this loop is the whole
          conceptual foundation. Everything else is implementation detail.
        </p>
        <div class="loop-diagram">
          <div class="loop-step">
            <strong>Ear</strong><br />
            Speech-to-Text
          </div>
          <span class="loop-arrow">&rarr;</span>
          <div class="loop-step">
            <strong>Brain</strong><br />
            LLM
          </div>
          <span class="loop-arrow">&rarr;</span>
          <div class="loop-step">
            <strong>Mouth</strong><br />
            Text-to-Speech
          </div>
          <span class="loop-arrow" style="transform: rotate(90deg); display: block;">&rarr;</span>
        </div>
        <p>
          <strong>The ear</strong> captures your audio and transcribes it in real time using
          speech-to-text. Latency here is everything: if your STT is slow, the whole agent feels
          sluggish. Deepgram's STT runs at sub-300ms end-to-end, which is fast enough to feel like
          a real conversation.
        </p>
        <p>
          <strong>The brain</strong> receives the transcript and decides what to say. This is your
          LLM: it reasons over the conversation history and your system prompt, generates a response,
          and can call any functions you've given it access to. It can look things up, run code,
          fetch data &mdash; anything you wire in.
        </p>
        <p>
          <strong>The mouth</strong> takes the LLM's text response and streams it back as audio. Fast
          streaming matters here too: you want audio to start playing before the full response is
          generated, or the agent feels like it's thinking too hard.
        </p>
        <p>
          With Deepgram, all three pieces run over a single WebSocket connection. You're not juggling
          three separate APIs &mdash; it's one socket, one loop, about 80 lines of code to start.
        </p>
      </section>

      <section>
        <h2>Design Decisions That Actually Matter</h2>
        <p>
          Most tutorials skip these. They're the difference between an agent that feels like a demo
          and one that feels like a tool.
        </p>

        <h3>Interruption Handling</h3>
        <p>
          In a real conversation, you don't wait for the other person to finish before you start
          talking. Your agent shouldn't either. When the user starts speaking mid-response, the agent
          needs to stop its output and listen. Getting this wrong is the fastest way to make an agent
          feel robotic.
        </p>
        <p>
          Deepgram's Voice Agent API handles this for you. The WebSocket connection detects voice
          activity on both ends and manages the interrupt logic. You don't have to implement it
          yourself &mdash; just don't override it.
        </p>

        <h3>Conversation State</h3>
        <p>
          Every turn in the conversation needs to be threaded correctly. The LLM needs the full
          context of what's been said so far &mdash; by both sides &mdash; to give coherent responses. This
          means maintaining a message history and passing it in with every LLM call.
        </p>
        <p>
          Keep your context window in mind. Very long conversations will eventually push older turns
          out of the context. A simple solution: keep the system prompt, the last N turns, and let
          the rest roll off. The agent won't remember everything, but the conversation will stay
          coherent.
        </p>

        <h3>Latency</h3>
        <p>
          The number that matters is end-to-end: from when the user stops speaking to when audio
          starts playing back. Under 500ms feels conversational. Over 1 second feels like a phone
          call with bad signal.
        </p>
        <p>
          Three places latency hides: STT processing time, LLM first-token time, and TTS start time.
          Streaming helps with all three. Start playing audio as soon as the first TTS chunk arrives.
          Choose an LLM model that prioritises speed over capability for conversational use &mdash; a
          smaller, faster model is often better here than a smarter, slower one.
        </p>
      </section>

      <hr />

      <section>
        <h2>Getting Set Up</h2>

        <div class="callout">
          <p><strong>Before you start:</strong></p>
          <ul>
            <li>Node 20+ or Python 3.11+ installed</li>
            <li>A code editor</li>
            <li>A <a href="https://console.deepgram.com" target="_blank" rel="noopener noreferrer">free Deepgram account</a> &mdash; includes $200 credit on sign-up, no credit card needed</li>
          </ul>
        </div>

        <p>Five steps to a running agent:</p>
        <ol class="step-list">
          <li>
            <div>
              Sign up at <a href="https://console.deepgram.com" target="_blank" rel="noopener noreferrer">console.deepgram.com</a>.
              Your account comes with $200 in free credit &mdash; more than enough for this workshop and a weekend of hacking.
            </div>
          </li>
          <li>
            <div>
              Grab your API key from the console. It lives under API Keys in the left sidebar. Copy it somewhere safe.
            </div>
          </li>
          <li>
            <div>
              Clone the starter repo. We'll walk through this together in the workshop, but the pattern is:
              <pre>git clone &lt;starter-repo-url&gt;
cd &lt;repo-name&gt;</pre>
            </div>
          </li>
          <li>
            <div>
              Add your API key to the environment. Create a <code>.env</code> file in the project root:
              <pre>DEEPGRAM_API_KEY=your_key_here</pre>
            </div>
          </li>
          <li>
            <div>
              Install dependencies and run it:
              <pre># Node
npm install && npm start

# Python
uv venv && uv pip install -r requirements.txt
python main.py</pre>
            </div>
          </li>
        </ol>
        <p>
          If it's working, you'll see a connection message in the terminal and the agent will greet
          you when you speak. If it's not working, come find us at the booth.
        </p>
      </section>

      <hr />

      <section>
        <h2>Make It Yours &mdash; Three Modifications</h2>
        <p>
          The starter app works, but it's generic. These three modifications are where the session
          becomes yours. Each one takes about five to seven minutes. Do them in order, or skip to
          whichever interests you most.
        </p>

        <div class="mod-block">
          <h3>Modification A: Change the Personality</h3>
          <p>
            The agent's system prompt is what defines who it is. Find the <code>system_prompt</code>
            variable in the starter code &mdash; it'll look something like this:
          </p>
          <pre>system_prompt = "You are a helpful assistant."</pre>
          <p>
            Replace it with something more specific. Give the agent a name, a role, a point of view.
            For example:
          </p>
          <pre>system_prompt = """You are Alf, a friendly AI assistant at a hackathon.
You give encouragement, suggest project ideas, and answer questions
about voice AI. You're enthusiastic but concise - people are busy building."""</pre>
          <p>
            Restart the agent and talk to it. The change is immediate. The system prompt is the
            entire personality &mdash; there's no training, no fine-tuning. You just wrote it.
          </p>
        </div>

        <div class="mod-block">
          <h3>Modification B: Swap the Voice</h3>
          <p>
            Deepgram's TTS has a full voice catalogue. Find the voice configuration in the starter
            code &mdash; it'll be a single string like <code>"aura-asteria-en"</code>. Change it to any
            other voice from the catalogue.
          </p>
          <p>
            A few to try:
          </p>
          <ul>
            <li><code>aura-asteria-en</code> &mdash; warm, conversational</li>
            <li><code>aura-orion-en</code> &mdash; deep, authoritative</li>
            <li><code>aura-luna-en</code> &mdash; clear, neutral</li>
            <li><code>aura-zeus-en</code> &mdash; bold, energetic</li>
          </ul>
          <p>
            Restart and talk to your agent. It's a one-line change and the agent sounds entirely
            different. This is the modification that gets the strongest reaction.
          </p>
        </div>

        <div class="mod-block">
          <h3>Modification C: Add a Custom Function</h3>
          <p>
            This is the one that makes you realise you can hook it to anything. Function calling lets
            the agent invoke Python (or JS) functions you write, then incorporate the result into its
            response naturally.
          </p>
          <p>
            Here's a simple example: a function that returns a random project idea when the agent
            is asked for inspiration.
          </p>
          <pre>import random

def get_project_idea():
    ideas = [
        "A voice-controlled to-do list that reads back your tasks",
        "An AI study buddy that quizzes you out loud",
        "A real-time translator that speaks back in the target language",
        "A voice journalling app that summarises your entries",
    ]
    return random.choice(ideas)</pre>
          <p>
            Register this function with the agent and tell the LLM when to use it via the system
            prompt: <em>"When asked for a project idea, call get_project_idea() and share the result."</em>
          </p>
          <p>
            Now ask your agent for a project idea. Watch it call the function and weave the result
            into a natural spoken response. This is the "aha" moment: the agent can call your own
            code, your own APIs, your own data. The voice interface is just the front door.
          </p>
        </div>
      </section>

      <hr />

      <section>
        <h2>What You Can Build From Here</h2>
        <p>
          Now that you have a working, personalised, function-capable voice agent, here's what that
          unlocks for your hackathon project:
        </p>
        <ul>
          <li>
            <strong>Accessibility layer</strong> &mdash; add voice input and output to any existing
            interface. Users who can't type or read small text get a completely different experience.
          </li>
          <li>
            <strong>In-game NPC</strong> &mdash; drop the agent into a game as a character that actually
            talks back. Hook the function calling to your game state so it knows what's happening.
          </li>
          <li>
            <strong>Voice-controlled developer tool</strong> &mdash; talk to your build process, your
            deploy pipeline, your monitoring dashboard. Voice is an unusually good interface for
            things you want to do hands-free.
          </li>
          <li>
            <strong>Multilingual support</strong> &mdash; Deepgram's STT handles dozens of languages.
            The LLM can respond in whatever language the user speaks. Global voice interface, almost for free.
          </li>
        </ul>
      </section>

      <section>
        <h2>Get Help</h2>
        <p>Building something? Stuck on something? Here's where to find us:</p>
        <ul>
          <li><strong>At the event:</strong> the Deepgram booth &mdash; we're here all weekend</li>
          <li><strong>Docs:</strong> <a href="https://developers.deepgram.com" target="_blank" rel="noopener noreferrer">developers.deepgram.com</a></li>
          <li><strong>Community:</strong> <a href="https://deepgram.com/discord" target="_blank" rel="noopener noreferrer">Deepgram's Discord</a></li>
        </ul>
        <p>
          The Deepgram challenge prize this weekend goes to the team that builds the most creative
          voice-powered experience. Come say hi, show us what you're building, and let us know if you
          want feedback on your voice integration before judging.
        </p>
      </section>

      <hr />
      <a class="back-link" href="/">
        <i class="fas fa-arrow-left" aria-hidden="true"></i>
        Back to all talks
      </a>
    </main>
  </body>
</html>