
How I Built a Voice AI Interviewer with <800ms Latency

August 2025 · 9 min read

The idea was simple: what if instead of staring at a LeetCode problem alone, you had an AI interviewer that actually talked to you - asked follow-up questions, pushed back on your approach, gave you hints when you were stuck?

Simple idea. Genuinely hard to build.

The moment you add voice to an AI system, every architectural decision you made for a text-based app falls apart. Latency that was "acceptable" at 2 seconds becomes unusable. Buffering that worked fine in HTTP becomes a nightmare over audio streams. State management that was clean in request-response becomes tangled when both sides are talking at once.

This is the story of building Vetta.ai - and everything that broke along the way.

The Latency Problem Nobody Talks About

When people quote latency numbers for AI voice apps, they're usually measuring the wrong thing.

They measure: time from user stops speaking to when the AI starts speaking.

What actually matters: does it feel like a real conversation?

These are not the same. A system can hit 600ms on the first metric and still feel robotic if the audio delivery is choppy, if it cuts you off mid-sentence, if it can't handle interruptions.

Real conversational latency has four components:

  1. STT (Speech-to-Text): How fast can you transcribe what the user said?
  2. LLM inference: How fast can the model generate a response?
  3. TTS (Text-to-Speech): How fast can you convert that response to audio?
  4. Streaming delivery: How fast does the audio actually reach the user's ears?

Each of these compounds. A 200ms STT + 400ms LLM + 300ms TTS = 900ms before the user hears a single word. That's already over the threshold where conversations feel natural.

My target was under 800ms end-to-end. Here's how I hit it.

The Stack

I chose FastAPI for the backend - specifically for its first-class WebSocket support and async architecture. For STT, Deepgram Nova-2. For TTS, ElevenLabs. For the AI brain, Gemini 2.5.

The naive architecture looks like this:

User speaks → WebSocket → STT → LLM → TTS → WebSocket → user hears

This is sequential. Every component waits for the previous one to finish. At any reasonable quality level, this will never hit 800ms.

The real architecture looks nothing like this.

Parallelism Is Everything

The first insight: TTS doesn't need to wait for the full LLM response.

When Gemini starts streaming tokens, you don't wait for the complete sentence. The moment you have a natural sentence boundary - a period, a comma with enough context, a question mark - you fire that chunk to ElevenLabs immediately.

So while ElevenLabs is synthesizing the first sentence, Gemini is already generating the second. By the time the first audio chunk arrives at the client, the second is already in the TTS pipeline.

LLM token stream: [chunk1....] [chunk2....] [chunk3....]
TTS pipeline:        [audio1.......] [audio2.......] [audio3.......]
Client playback:        [audio1] [audio2] [audio3]
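The chunking logic can be sketched roughly like this. The function and its `min_chars` threshold are illustrative, not Vetta's actual code; the real splitter also has to handle commas-with-context, code blocks, and the other edge cases discussed later:

```python
import re
from typing import Iterable, Iterator

def chunk_for_tts(tokens: Iterable[str], min_chars: int = 40) -> Iterator[str]:
    """Accumulate streamed LLM tokens and yield sentence-sized chunks for TTS.

    A chunk is flushed as soon as the buffer ends at a sentence boundary
    (., ?, !) AND has enough characters that the synthesized audio won't
    be a choppy fragment. Anything left over is flushed at stream end.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        if len(buffer) >= min_chars and re.search(r"[.?!]\s*$", buffer):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()
```

Each yielded chunk is fired at the TTS API immediately, so synthesis of sentence N overlaps with generation of sentence N+1.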

This alone dropped perceived latency by ~35%.

The Interruption Problem

Here's something nobody tells you about voice AI: users will try to interrupt.

In a real interview, you cut in mid-sentence all the time. "Wait, can you clarify what you mean by-" The interviewer stops, listens, responds.

A naive implementation will keep playing its audio even as you're speaking over it. The result feels like a bad phone call, not a conversation.

My solution was a dual-channel WebSocket architecture. The client maintains two separate streams:

  • Outbound: User audio, sent continuously
  • Inbound: AI audio, received and played back

On the server side, I run continuous VAD (Voice Activity Detection) on the inbound user audio even while the AI is speaking. The moment user speech is detected during AI playback, I send an INTERRUPT signal to the client, which immediately stops playback and sends a CANCEL signal back to terminate the in-flight TTS stream.

The key was making this interruption detection fast enough that it doesn't feel like the AI is ignoring you for 300ms before stopping. Under 150ms interrupt response was the goal. I got it to around 120ms.
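In shape, the server-side playback loop looks something like the sketch below. `user_speaking` stands in for a flag set by the VAD loop watching the inbound stream, and `ws_send` for the WebSocket send call; both names are assumptions for illustration:

```python
import asyncio

async def stream_ai_audio(ws_send, tts_chunks, user_speaking: asyncio.Event) -> bool:
    """Stream TTS audio to the client, aborting the moment the user speaks.

    Checks the VAD flag between every chunk; on user speech, sends an
    INTERRUPT so the client stops playback, and returns False so the
    caller can cancel the in-flight TTS request.
    """
    for chunk in tts_chunks:
        if user_speaking.is_set():
            await ws_send({"type": "INTERRUPT"})
            return False
        await ws_send({"type": "AUDIO", "data": chunk})
    return True
```

Because the check happens per chunk, the interrupt response time is bounded by the audio chunk size, which is one reason small chunks matter as much as raw pipeline speed.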

Stateful Interviews with Redis

The voice layer was the hardest part of Vetta.ai to build, but it's not the most interesting part architecturally. That's the interview state machine.

A real technical interview has structure. The interviewer remembers what you said five minutes ago. They track which parts of the problem you've solved. They escalate hints based on how long you've been stuck.

This requires persistent state across WebSocket messages - which, in a serverless-friendly architecture, means Redis.

Every interview session gets a Redis key that stores: session_id, problem, conversation_history, hints_given, current_phase, code_snapshots, start_time.

The LLM prompt is dynamically constructed from this state on every turn. Gemini 2.5 never "knows" the full conversation - it gets a sliding window of context plus the current state object. This keeps latency predictable regardless of how long the interview runs.
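The per-turn prompt construction might look like this sketch. The field names mirror the Redis schema above, but the window size and prompt wording are assumptions, not Vetta's actual values:

```python
WINDOW = 8  # assumed sliding window: only the last N turns reach the model

def build_prompt(state: dict) -> str:
    """Construct the per-turn LLM prompt from the stored session state.

    Only the most recent WINDOW turns of conversation_history are
    included, so prompt length (and thus inference latency) stays
    bounded no matter how long the interview runs.
    """
    recent = state["conversation_history"][-WINDOW:]
    transcript = "\n".join(f"{t['role']}: {t['text']}" for t in recent)
    return (
        f"You are a technical interviewer. Problem: {state['problem']}\n"
        f"Phase: {state['current_phase']} | Hints given: {state['hints_given']}\n"
        f"Recent conversation:\n{transcript}\n"
        "Respond as the interviewer."
    )
```
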

The hint escalation logic was the most satisfying thing to tune. Three levels:

  1. Nudge: "Think about what data structure would give you O(1) lookup."
  2. Hint: "A hash map could help here. What would you store as the key?"
  3. Reveal: "Consider storing each number's complement as you iterate..."

The system tracks time-on-current-phase and hint-count to decide when to escalate. It also monitors the code sandbox output - if the user is submitting and failing tests repeatedly, that's a signal to intervene even if they haven't explicitly asked for help.

The Code Sandbox

Every real interview has a coding environment. I built a sandboxed execution environment that supports Python and JavaScript, with strict resource limits:

  • 5 second execution timeout
  • 128MB memory limit
  • No network access
  • No filesystem writes

The sandbox runs in an isolated container. User code is submitted, executed, and the output (stdout, stderr, return value) is fed back into the LLM context so the interviewer can comment on actual test results.
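The execution wrapper, stripped to its core, is roughly this. It's a sketch only: a subprocess timeout gives you the 5-second cap, but the memory limit, network block, and filesystem restrictions come from the container layer, which this snippet doesn't show:

```python
import subprocess
import sys

def run_sandboxed(code: str, timeout_s: int = 5) -> dict:
    """Run user Python code in a subprocess with a hard wall-clock timeout.

    Returns stdout/stderr/exit_code so the result can be fed back into
    the LLM context. -I runs the interpreter in isolated mode (no user
    site-packages, no environment-variable influence).
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return {"stdout": proc.stdout, "stderr": proc.stderr,
                "exit_code": proc.returncode, "timed_out": False}
    except subprocess.TimeoutExpired:
        return {"stdout": "", "stderr": "", "exit_code": None,
                "timed_out": True}
```
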

"Your solution passes 18/20 test cases. The failing cases involve empty arrays - does your code handle that edge case?"

This detail - the AI actually seeing your code output - is what makes Vetta feel different from a chatbot pretending to be an interviewer.

What I'd Do Differently

Streaming TTS chunking is brittle. My sentence-boundary detection is regex-based and breaks on certain LLM outputs (code blocks, lists, mid-sentence quotes). A proper prosody-aware splitter would be better.

Redis TTL management is annoying. Session cleanup requires careful TTL configuration. I had a bug early on where abandoned sessions were accumulating and I didn't notice until the Redis memory started climbing.
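The fix I landed on is the pattern of refreshing the TTL on every write rather than setting it once at session creation. A sketch, with an assumed 2-hour expiry:

```python
SESSION_TTL_S = 2 * 60 * 60  # assumed: sessions expire 2h after last activity

def touch_session(redis_client, session_id: str, state_json: str) -> None:
    """Write session state and refresh its TTL in a single command.

    Using SET with EX (instead of a separate EXPIRE) means every write
    pushes the expiry forward, so abandoned sessions die on their own
    instead of accumulating in memory.
    """
    redis_client.set(f"session:{session_id}", state_json, ex=SESSION_TTL_S)
```
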

The interruption UX needs more work. When the AI gets interrupted mid-thought, the resumed response sometimes sounds disconnected because the LLM doesn't have clean context on where it was. I want to explore "resumable generation" - giving the model its interrupted output as context when it resumes.

The Number That Mattered

After two weeks of optimization, Vetta.ai hit 740ms median latency on a warm server. Cold starts are around 1.1 seconds - still acceptable since those only happen on the first turn of a session.

More importantly: people who used it stopped noticing the AI. They just had the conversation.

That's the real benchmark.

Vetta.ai is live. If you want to try it, the link is here.

thanks for reading. feel free to reach out!