Speech-to-video is the most modular option for technical teams. You send audio from your voice agent and receive an avatar video stream, with a 250 ms response time from audio input to HD avatar video output.

When to Use This

Choose speech-to-video when you need:
  • Component-Level Control: Complete management of turn detection, STT, LLM, and TTS components
  • Complex Tool Calling: Flexible LLM integrations with external APIs and databases
  • Voice Infrastructure Migration: Seamless upgrade path for existing voice agent infrastructure
For zero-infrastructure deployment, use managed agents instead.

Pipeline Overview

1. Your Voice Agent Pipeline: You manage media transport, turn detection, STT, LLM, and TTS components.
2. Beyond Presence Speech-to-Video API: Receives audio input from your pipeline.
3. Avatar Video Output: Beyond Presence manages avatar generation and video streaming.
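
The hand-off between your pipeline and the speech-to-video service can be pictured as forwarding TTS audio in small frames. The sketch below is illustrative only and does not use the Beyond Presence SDK: `AvatarSink`, `RecordingSink`, `pcm_frames`, and `forward_tts_audio` are hypothetical names, and the audio format (16 kHz, 16-bit mono PCM in 20 ms frames) is an assumption, not a documented requirement.

```python
from typing import Iterator, List, Protocol


class AvatarSink(Protocol):
    """Hypothetical interface for anything that accepts audio frames."""

    def send_audio(self, frame: bytes) -> None: ...


def pcm_frames(
    pcm: bytes,
    sample_rate: int = 16000,  # assumed sample rate, in Hz
    frame_ms: int = 20,        # assumed frame duration, in milliseconds
    sample_width: int = 2,     # bytes per sample (16-bit mono PCM assumed)
) -> Iterator[bytes]:
    """Split raw PCM audio into fixed-duration frames; the last may be shorter."""
    frame_bytes = (sample_rate * frame_ms // 1000) * sample_width
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]


class RecordingSink:
    """Stand-in sink that stores frames instead of sending them anywhere."""

    def __init__(self) -> None:
        self.frames: List[bytes] = []

    def send_audio(self, frame: bytes) -> None:
        self.frames.append(frame)


def forward_tts_audio(pcm: bytes, sink: AvatarSink) -> int:
    """Forward TTS output to the sink frame by frame; return the frame count."""
    count = 0
    for frame in pcm_frames(pcm):
        sink.send_audio(frame)
        count += 1
    return count


if __name__ == "__main__":
    sink = RecordingSink()
    one_second_of_silence = bytes(32000)  # 16000 samples * 2 bytes each
    print(forward_tts_audio(one_second_of_silence, sink))
```

In a real integration the sink would be replaced by whatever transport your chosen framework provides. Sending small frames as soon as TTS produces them, rather than buffering whole utterances, is what keeps end-to-end latency low.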

Supported Frameworks

We support integration with popular voice agent frameworks including LiveKit and Pipecat, allowing you to add avatar video to your existing voice pipelines.

Next Steps