Voice Translation
The Krisp Voice Translation (VT) SDK enables real-time voice translation for applications that require live multilingual voice communication.
The SDK captures live audio, securely streams it to Krisp Voice Translation services (Krisp Cloud), and delivers translated speech back to the application as an audio stream in near real time. This allows applications to support natural spoken conversations between participants who do not share a common language.
Voice Translation is built on the same technology used in the Krisp application and leverages Krisp’s voice processing stack to maintain stable performance and high translation quality in real-world environments.
Session Model
Voice Translation operates within the context of a translation session.
Each session is single-directional, translating speech from one source language to one target language (for example, English → Spanish). Applications that require bi-directional translation are expected to create two independent sessions, one per direction.
The SDK supports all-to-all language translation, allowing any supported source language to be translated into any supported target language.
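Since each session handles one direction only, a two-way conversation is modeled as two independent sessions. The sketch below illustrates that pairing; the `SessionConfig` shape and field names are illustrative assumptions, not the actual Krisp VT SDK API.

```typescript
// Hypothetical sketch — the real Krisp VT SDK types may differ.
// Each session translates one direction (source → target), so a
// bidirectional conversation is built from two independent sessions.

interface SessionConfig {
  sourceLanguage: string; // e.g. "en"
  targetLanguage: string; // e.g. "es"
}

// Build the pair of single-direction configs for a two-way conversation.
function bidirectionalSessions(
  langA: string,
  langB: string
): [SessionConfig, SessionConfig] {
  return [
    { sourceLanguage: langA, targetLanguage: langB }, // e.g. English → Spanish
    { sourceLanguage: langB, targetLanguage: langA }, // e.g. Spanish → English
  ];
}
```

With all-to-all language support, the same pairing works for any two supported languages; the application simply creates both sessions at call setup.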
How It Works
- Audio is captured from the application’s audio source.
- Audio is encrypted and streamed to Krisp Voice Translation services.
- Speech is transcribed, translated, and synthesized in the target language.
- The translated audio stream is delivered back to the application.
- (Optional) Original and translated transcripts can be generated for downstream use.
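The steps above form a fixed pipeline. The following is a conceptual model of that flow, not the SDK interface; the stage names are illustrative only.

```typescript
// Conceptual model of the Voice Translation processing pipeline.
// Stage names are illustrative; the SDK does not expose these directly.
type Stage =
  | "capture"            // audio taken from the app's audio source
  | "encrypt-and-stream" // sent securely to Krisp Cloud
  | "transcribe"         // speech → source-language text
  | "translate"          // source text → target-language text
  | "synthesize"         // target text → target-language speech
  | "deliver";           // translated audio returned to the app

const PIPELINE: Stage[] = [
  "capture",
  "encrypt-and-stream",
  "transcribe",
  "translate",
  "synthesize",
  "deliver",
];

// Return the stage that follows `current`, or null at the end of the pipeline.
function nextStage(current: Stage): Stage | null {
  const i = PIPELINE.indexOf(current);
  return i >= 0 && i < PIPELINE.length - 1 ? PIPELINE[i + 1] : null;
}
```

Optional transcript generation sits alongside this flow: the transcription and translation stages can emit text for downstream use without changing the audio path.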
All voice data is processed in transit and in memory only; it is not stored by default, in line with enterprise security and privacy requirements.

Latency and Conversational Flow
Voice Translation involves a multi-stage processing pipeline, which introduces an inherent trade-off between latency and translation quality. To generate accurate and natural translations, especially for sensitive content such as numbers, dates, and identifiers, the system requires sufficient speech context.
In practice, the SDK begins producing translated audio after a small conversational delay, typically a few words into an utterance, and then maintains a steady-state latency in the 800–900 ms range. This design prioritizes translation accuracy and conversational clarity over ultra-low latency, resulting in a more natural dialogue experience.
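One way to reason about where that steady-state delay comes from is a simple latency budget. The 800–900 ms range is from the text above; the per-stage split in this sketch is a made-up example for illustration, not measured Krisp figures.

```typescript
// Illustrative latency budget. Only the 800–900 ms steady-state total
// is stated in the documentation; the per-stage numbers below are
// invented purely to show how a budget might decompose.
interface LatencyBudget {
  [stage: string]: number; // milliseconds per stage
}

// Sum the per-stage contributions into an end-to-end figure.
function totalLatencyMs(budget: LatencyBudget): number {
  return Object.values(budget).reduce((sum, ms) => sum + ms, 0);
}

const exampleBudget: LatencyBudget = {
  networkRoundTrip: 150, // hypothetical
  transcription: 250,    // hypothetical
  translation: 200,      // hypothetical
  synthesis: 250,        // hypothetical
};
```

A budget like this also makes the quality trade-off concrete: shrinking the transcription window reduces total latency but removes the speech context needed for accurate numbers, dates, and identifiers.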
Platform Support
The Voice Translation SDK supports the following platforms:
- Native C++ SDK for desktop applications on Windows and macOS
- JavaScript SDK for web-based applications
Platform-specific integration details are covered in the corresponding SDK guides.
