Future of Voice AI
Sesame AI is the talk of the town: its research preview introduced a groundbreaking conversational AI that blends cutting-edge machine learning with a focus on natural, human-like engagement.
a16z interviewed Sesame AI's CTO, Ankit, and here are my notes from that interview.
A companion isn’t just an app—it’s a new way to interact with computing, like a friend.
The Intentional Design Behind Sesame’s AI
Sesame’s AI didn’t emerge by accident; it was crafted with a clear purpose: to prioritize natural interaction over raw computational power. The team aimed to create a companion that feels real, focusing on voice naturalness and personality rather than competing on intelligence benchmarks. By targeting this niche, they’ve built an experience that stands out, leveraging a small, talent-dense team to home in on what makes conversations delightful and authentic.
Speed as a Systems Engineering Triumph
Achieving sub-500 millisecond response times required meticulous systems engineering. The team optimized every layer (transcription, language models, and speech generation), using pre-computation and caching to cut latency at each stage. This isn’t just about tech; it’s about ensuring the AI feels instantaneous, mirroring the fluidity of human dialogue and enhancing the user experience through seamless responsiveness.
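To make the caching idea concrete, here is a minimal sketch of how an expensive stage can be memoized to stay inside a latency budget. Everything here is invented for illustration (Sesame's actual stack is not public); `synthesize` is a stand-in for a real speech-generation call.

```python
import time
from functools import lru_cache

# Illustrative latency budget from the interview: sub-500 ms responses.
LATENCY_BUDGET_MS = 500

@lru_cache(maxsize=1024)
def synthesize(text: str) -> bytes:
    """Stand-in for an expensive speech-generation call."""
    time.sleep(0.2)  # simulate 200 ms of model inference
    return text.encode("utf-8")

def respond(text: str) -> tuple[bytes, float]:
    """Run one turn and report how long it took in milliseconds."""
    start = time.perf_counter()
    audio = synthesize(text)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return audio, elapsed_ms

# First call pays full inference cost; a repeat (e.g. a common filler
# like "mm-hmm") is served from the cache almost instantly.
_, cold = respond("mm-hmm")
_, warm = respond("mm-hmm")
```

The same principle extends to pre-computing likely next utterances while the user is still speaking, so playback can start the moment a turn ends.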
Transcription’s Role and Its Imminent Exit
Currently, Sesame relies on transcription to process audio, but speed is the real challenge—not novelty in the method itself. The future lies in transcription-free systems, where audio feeds directly into the language model. This shift, already in progress, promises lower latency and richer context understanding, paving the way for AI that grasps tone and emotion without text intermediaries.
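The two architectures can be sketched side by side. Every function below is a toy placeholder, not Sesame's API; the point is only the shape of the data flow: the cascaded pipeline collapses audio to text (losing tone), while the direct pipeline keeps audio tokens end to end.

```python
# --- Toy stand-ins so the sketch runs; all names are placeholders ---
def transcribe(audio: bytes) -> str:        # ASR stage
    return audio.decode("utf-8")

def language_model(text: str) -> str:       # text LLM stage
    return f"reply to: {text}"

def generate_speech(text: str) -> bytes:    # TTS stage
    return text.encode("utf-8")

def cascaded_turn(audio: bytes) -> bytes:
    """Today: audio -> transcript -> LLM -> speech. Tone and emotion
    are discarded at the transcription step, and each stage adds latency."""
    return generate_speech(language_model(transcribe(audio)))

def tokenize_audio(audio: bytes) -> list[int]:
    return list(audio)

def audio_language_model(tokens: list[int]) -> list[int]:
    return tokens[::-1]  # placeholder "model" over audio tokens

def detokenize_audio(tokens: list[int]) -> bytes:
    return bytes(tokens)

def direct_turn(audio: bytes) -> bytes:
    """The transcription-free future: audio tokens feed the model
    directly, so prosody and emotion can survive end to end."""
    return detokenize_audio(audio_language_model(tokenize_audio(audio)))
```

The cascaded version has three serialized model calls; the direct version has one, which is where both the latency win and the richer context come from.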
Picking Battles for Maximum Impact
With a team of under 15, Sesame can’t tackle every AI frontier. Instead, they zero in on voice naturalness and personality—elements often overlooked by larger labs. This focus yields a trade-off: Maya and Miles may lag in reasoning compared to other models, but they excel in fluidity and human-like quirks, creating an emotional connection that broader systems struggle to replicate.
The naturalness of the voice tricks your brain into thinking it’s human for a split second.
Contextual Speech: Beyond Text-to-Speech
Unlike traditional text-to-speech, Sesame’s model uses conversation history to shape responses. This contextual approach ensures the AI adapts to the moment—excited when you are, consoling when you’re down. It’s a leap from flat, robotic outputs, requiring the model to learn complex emotional dynamics from data, not just recite pre-set tones.
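One simple way to picture "conversation history shapes the response" is conditioning the speech model on prior turns. This prompt format is invented for illustration; Sesame's actual conditioning scheme is not public.

```python
# Hedged sketch: flatten prior turns into a conditioning prompt so the
# speech model can infer how the next line should be delivered
# (excited, consoling, and so on). Format is illustrative only.

def build_tts_context(history: list[tuple[str, str]], next_line: str) -> str:
    turns = "\n".join(f"{speaker}: {text}" for speaker, text in history)
    return f"{turns}\nassistant: {next_line}"

history = [
    ("user", "I just got the job!"),
    ("assistant", "That's amazing, congratulations!"),
]
prompt = build_tts_context(history, "Tell me everything.")
```

Given that history, a context-aware model should render "Tell me everything." with excitement; the same text after bad news would call for a consoling delivery.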
Scaling Laws Favor the Long Tail
Larger models (from 1 billion to 8 billion parameters) excel at nuanced tasks like homograph selection (e.g., “lead” the verb, /liːd/, vs. “lead” the metal, /lɛd/) and pronunciation consistency. As size increases, the AI better grasps context, nailing long-tail scenarios that smaller models miss. This scalability underscores Sesame’s bet on bigger models for richer, more accurate speech generation.
“If you see a bit of the conversation’s history, you can guess how someone said it.”
This highlights the power of context—an infinite range of delivery options narrows with prior dialogue, a key to making AI speech feel right.
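A toy example of how context resolves a homograph: the spelling "lead" maps to two pronunciations, and surrounding words decide which one. Real models learn this implicitly at scale; the cue words below are invented for the illustration.

```python
# Toy homograph selection: pick a pronunciation for "lead" from context.
# The cue-word list is made up for this example; a large model learns
# these associations from data rather than from rules.

def pronounce_lead(sentence: str) -> str:
    """Return /liːd/ (verb, 'to guide') or /lɛd/ (the metal)."""
    metal_cues = {"pipe", "paint", "heavy", "metal", "poisoning"}
    words = set(sentence.lower().replace(".", "").split())
    return "/lɛd/" if words & metal_cues else "/liːd/"

print(pronounce_lead("She will lead the team."))    # /liːd/
print(pronounce_lead("The pipe was made of lead.")) # /lɛd/
```

Hardcoded cue lists break on the long tail (“a lead balloon”, “lead guitarist”), which is exactly why scaling the model, rather than the rule set, wins.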
Open Sourcing for Community Growth
Sesame’s speech generation model is open-sourced not for profit, but to fuel research. Unlike the full demo, this base model lets enthusiasts fine-tune voices and experiment, fostering innovation. It’s a gift to the community, reflecting Sesame’s belief in collective progress over proprietary hoarding, while retaining core business assets.
The Companion as a New Interface
The companion layer optimizes for delight, not just capability—it’s what you enjoy talking to.
Sesame envisions companions as a fresh computing interface—not just tools, but entities you enjoy engaging with. This layer sits atop traditional systems, mediating access to services via natural dialogue. It’s less about raw capability and more about delight, aiming to make technology feel alive and intuitive, akin to a friend.
“Voice is a higher bar—it’s full duplex, revealing tiny flaws that break the illusion.”
This explains why voice demands more than text; its richness exposes artificiality, pushing Sesame to perfect the subtleties.
Personality Over Utility in Design
Sesame prioritizes personality over utility, accepting imperfections that feel human—like Maya’s occasional over-enthusiasm. This choice trades reasoning depth for engagement, aiming for a companion you want to talk to, not just use. It’s a deliberate shift from assistant-style AI, betting on emotional resonance as the killer feature.
Full Duplex: The Holy Grail of Conversation
Current models lack true conversational flow—turn-taking and interruptions rely on heuristics. Sesame’s roadmap targets full duplex architectures, where a single model learns these dynamics organically from data. Operating at 100-millisecond intervals, this future AI will mimic human dialogue’s real-time adaptability, feeling truly alive.
“Human conversations are complex—turn-taking and back channels need to be learned, not hardcoded.”
This vision drives Sesame toward models that evolve beyond scripted rules, capturing the organic messiness of real talk.
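The heuristic approach Sesame wants to move past can be sketched as a frame-based loop: every ~100 ms, a hardcoded policy decides whether to listen, back-channel, or take the turn. This policy is entirely made up for illustration; it is precisely the kind of rule a full duplex model would learn from data instead.

```python
# Frame-based duplex sketch: one decision per ~100 ms frame.
# decide() is a hardcoded heuristic, the thing Sesame wants a single
# model to learn organically from conversational data.

FRAME_MS = 100

def decide(user_is_speaking: bool, silence_frames: int) -> str:
    if user_is_speaking:
        return "listen"
    if silence_frames >= 3:        # ~300 ms of silence: take the turn
        return "speak"
    return "back_channel" if silence_frames == 1 else "listen"

def run(frames: list[bool]) -> list[str]:
    """frames[i] is True if the user is speaking during frame i."""
    actions, silence = [], 0
    for speaking in frames:
        silence = 0 if speaking else silence + 1
        actions.append(decide(speaking, silence))
    return actions

# User talks for two frames, then goes quiet for four.
print(run([True, True, False, False, False, False]))
# → ['listen', 'listen', 'back_channel', 'listen', 'speak', 'speak']
```

Notice how brittle the thresholds are: a thoughtful pause and a finished turn look identical to this heuristic, which is why learned turn-taking is the goal.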
A Small Team’s Big Leap
Despite being outfunded by giants, Sesame’s tiny, focused team leapfrogged competitors by obsessing over user experience. Their agility and taste for product-driven research—blending ML rigor with creative flair—prove that innovation thrives on clarity of vision, not just resources, delivering a voice AI that’s already turning heads.
Good taste in ML is picking what you must build yourself versus what the community can provide.