API Reference
Voice Translation WebSocket API
Overview
The Voice Translation WebSocket API provides real-time audio transcription and translation via a persistent WebSocket connection.
Client ↔︎ WSS Server
Endpoint
wss://streaming.krisp.ai/vt?authorization=Api-Key SESSION_KEY
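A raw space is not valid inside a URL, so the `Api-Key SESSION_KEY` value usually needs percent-encoding before a client can connect. The sketch below shows one way to build the endpoint. `buildStreamUrl` is a hypothetical helper, and it assumes the server accepts `%20` in place of the space.

```javascript
// Hypothetical helper: builds the streaming endpoint URL from a session key.
// Assumption: the server accepts the space in "Api-Key <key>" encoded as %20.
const STREAM_ENDPOINT = 'wss://streaming.krisp.ai/vt';

function buildStreamUrl(sessionKey) {
  if (typeof sessionKey !== 'string' || sessionKey.length === 0) {
    throw new TypeError('sessionKey must be a non-empty string');
  }
  // Encode the whole credential so the space and any reserved characters are safe.
  const credential = encodeURIComponent(`Api-Key ${sessionKey}`);
  return `${STREAM_ENDPOINT}?authorization=${credential}`;
}
```

The returned string can be passed to whichever WebSocket client you use.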
Session key generation
const axios = require('axios');
const config = {
  method: 'get',
  // expiration_ttl is in MINUTES. Min: 5, Max: 1440 (24h).
  url: 'https://api.developers.krisp.ai/v2/sdk/voice-translation/session/token?expiration_ttl=100',
  headers: {
    'Authorization': 'api-key API_KEY'
  }
};
axios.request(config)
  .then((response) => {
    // response.data.data = {
    //   session_key: "session_key_...",
    //   expires_at: "2026-05-04T12:00:00.000Z",
    //   key_id: 123,
    //   status: "active",
    //   type: "session"
    // }
    const SESSION_KEY = response.data.data.session_key;
    // Use SESSION_KEY in the WebSocket endpoint URL shown above.
  })
  .catch((error) => {
    console.error(error);
  });
Initial Client Message
Once the WebSocket connection is established, the client must send a single JSON configuration message as the first message, before sending any audio.
{
"config": {
"audio": {
"format": "pcm_s16le",
"sample_rate": 16000,
"channels": 1
},
"source_language": "en-US",
"target_language": "fr-FR",
"voice": "male" | "female",
"vocabulary": ["special", "domain", "terms"],
"translation_dictionary": [
{ "source": "referral", "target": "référence" },
{ "source": "copay", "target": "quote-part" },
{ "source": "email", "target": "courriel" }
],
"transcript": {
"interim": true,
"final": true,
"translate": true
},
"features": {
"background_voice_cancellation": true
},
"metadata": {
"reference_id": "your-reference-id"
}
}
}
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| audio.format | string | No | pcm_s16le | Audio encoding format |
| audio.sample_rate | integer | No | 16000 | Audio sample rate in Hz |
| audio.channels | integer | No | 1 | Number of audio channels |
| source_language | string | Yes | — | Source language locale in BCP 47 format (e.g. en-US) |
| target_language | string | Yes | — | Target language locale in BCP 47 format (e.g. fr-FR) |
| voice | string | Yes | — | Voice ID for translated audio output ("male" or "female" in the example above) |
| vocabulary | array of strings | No | — | Domain-specific or uncommon words for improved recognition accuracy |
| translation_dictionary | array of objects | No | — | Custom source→target term mappings for ambiguous or specialized vocabulary |
| transcript.interim | boolean | No | true | Whether to emit interim transcript events. The default of true applies only when the entire transcript object is omitted; if transcript is provided without interim, it defaults to false. |
| transcript.final | boolean | No | true | Whether to emit final transcript events |
| transcript.translate | boolean | No | true | Whether to emit translated transcript events |
| features | object | No | — | Additional features (e.g. background_voice_cancellation) |
| metadata | object | No | — | Client-supplied metadata (e.g. reference_id) |
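As an illustration of the parameters above, the following sketch assembles the initial message and rejects a call that omits a required field. `buildConfigMessage` and its camelCase option names are hypothetical; only the JSON keys come from this reference.

```javascript
// Hypothetical helper: assembles the first message sent on the socket.
// source_language, target_language and voice are the required fields.
const OPTIONAL_KEYS = [
  'audio', 'vocabulary', 'translation_dictionary',
  'transcript', 'features', 'metadata',
];

function buildConfigMessage(options) {
  const { sourceLanguage, targetLanguage, voice } = options;
  for (const [name, value] of Object.entries({ sourceLanguage, targetLanguage, voice })) {
    if (typeof value !== 'string' || value.length === 0) {
      throw new Error(`${name} is required`);
    }
  }
  const config = {
    source_language: sourceLanguage,
    target_language: targetLanguage,
    voice,
  };
  // Copy optional fields only when supplied, so server defaults apply otherwise.
  for (const key of OPTIONAL_KEYS) {
    if (options[key] !== undefined) config[key] = options[key];
  }
  return JSON.stringify({ config });
}

const firstMessage = buildConfigMessage({
  sourceLanguage: 'en-US',
  targetLanguage: 'fr-FR',
  voice: 'female',
  // Set interim explicitly: it defaults to false once a transcript object is present.
  transcript: { interim: true, final: true, translate: true },
});
```

`firstMessage` would be sent as a text frame before any audio.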
Bidirectional Binary Audio
After the initial JSON configuration message, audio is exchanged as raw binary WebSocket frames in both directions.
| Direction | Description |
|---|---|
| Client → Server | Raw PCM audio chunks in the format declared in audio.* fields |
| Server → Client | Synthesised translated speech as PCM audio, in the same encoding as the inbound stream |
Binary frames and JSON event frames coexist on the same WebSocket connection. The receiver distinguishes them by WebSocket frame opcode: 0x2 (binary) for audio, 0x1 (text) for JSON events.
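The opcode split can be handled with a small dispatcher. This is a sketch, not a prescribed client: `routeFrame` and the handler names are hypothetical, and it assumes your WebSocket library reports whether a frame was binary (the `ws` package for Node.js, for example, passes `(data, isBinary)` to its `message` listener).

```javascript
// Hypothetical dispatcher: binary frames carry audio, text frames carry JSON events.
function routeFrame(data, isBinary, handlers) {
  if (isBinary) {
    handlers.onAudio(data); // raw PCM bytes of translated speech
    return 'audio';
  }
  // Text frames may arrive as a string or as bytes, depending on the client library.
  const text = typeof data === 'string' ? data : Buffer.from(data).toString('utf8');
  handlers.onEvent(JSON.parse(text));
  return 'event';
}
```

With `ws`, this could be wired up as `socket.on('message', (data, isBinary) => routeFrame(data, isBinary, handlers))`.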
Server Events (Client ↔︎ WSS Server)
The server emits JSON text frames over the WebSocket. Each event corresponds to one of the types below.
Transcript
{
"transcript": {
"text": "Hello, how are you?",
"final": false,
"start": "2026-03-25T19:24:45.370+00:00",
"duration": 436,
"utterance_id": "daslkndlkans",
"reference_id": "askdnl"
}
}
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | Transcript text for this chunk |
| final | boolean | Yes | false for interim events |
| start | string (ISO 8601) | Yes | Chunk start timestamp (UTC) |
| duration | integer | Yes | Chunk duration in milliseconds |
| utterance_id | string | Yes | Links this event to the corresponding final transcript and translation |
| reference_id | string | No | Echoed client reference ID |
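Because `start` is an ISO 8601 timestamp and `duration` is in milliseconds, the end of a chunk can be derived by simple addition. The helper name `transcriptEnd` below is hypothetical.

```javascript
// Hypothetical helper: derives the end timestamp of a transcript chunk.
function transcriptEnd(transcript) {
  const startMs = Date.parse(transcript.start);
  if (Number.isNaN(startMs)) {
    throw new RangeError(`Invalid start timestamp: ${transcript.start}`);
  }
  // duration is in milliseconds, the same unit Date uses internally.
  return new Date(startMs + transcript.duration).toISOString();
}
```

For the example event above, `transcriptEnd` returns `2026-03-25T19:24:45.806Z` (45.370 s plus 436 ms).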
Translation
{
"translate": {
"text": "¿Hola, cómo estás?",
"final": true,
"utterance_id": "daslkndlkans",
"reference_id": "askdnl"
}
}
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | Translated text |
| final | boolean | Yes | false for interim events |
| utterance_id | string | Yes | Links this event to the corresponding transcript |
| reference_id | string | No | Echoed client reference ID |
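Since `utterance_id` links a translation to its transcript, a client can join the two streams of events. The sketch below pairs final transcripts with final translations. `UtterancePairer` is a hypothetical class, and it makes no assumption about which of the two events arrives first.

```javascript
// Hypothetical collector: pairs final transcripts with final translations.
class UtterancePairer {
  constructor() {
    this.pending = new Map(); // utterance_id -> { source?, target? }
  }

  // Accepts one parsed server event; returns a completed pair or null.
  accept(event) {
    const payload = event.transcript || event.translate;
    if (!payload || !payload.final) return null; // skip errors and interim events
    const slot = this.pending.get(payload.utterance_id) || {};
    if (event.transcript) slot.source = payload.text;
    else slot.target = payload.text;
    if (slot.source !== undefined && slot.target !== undefined) {
      this.pending.delete(payload.utterance_id);
      return { utterance_id: payload.utterance_id, ...slot };
    }
    this.pending.set(payload.utterance_id, slot);
    return null;
  }
}
```

Interim events are ignored here; a live-caption view would handle them separately.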
Error
Emitted when the server encounters an issue processing the request.
{
"error": {
"code": 400,
"reason": "Bad request",
"description": "Actual description of the error",
"reference_id": "askdnl"
}
}
| Field | Type | Required | Description |
|---|---|---|---|
| code | integer | Yes | HTTP-style status code |
| reason | string | Yes | Short reason string |
| description | string | Yes | Detailed error description |
| reference_id | string | No | Echoed client reference ID |
Error Codes
| Code | Reason | Causes |
|---|---|---|
| 400 | Bad Request | Malformed request · Invalid audio format · Incorrect target language · Invalid voice |
| 401 | Unauthorized | Missing API key · Invalid API key · Expired API key |
| 402 | Payment Required | Balance exhausted · Subscription expired |
| 429 | Too Many Requests | Rate limit exceeded · Max concurrent connections exceeded |
| 500 | Internal Server Error | Server failed to process the request |
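One way a client might act on these codes is to separate errors worth retrying from those that need a fix first. The split below is an assumption, not documented server behaviour: it treats 429 and 500 as transient and 400, 401 and 402 as requiring a change to the request, credentials or account. `classifyError` is a hypothetical helper.

```javascript
// Assumption: 429 and 500 are transient; 400, 401 and 402 need client-side action.
const RETRYABLE_CODES = new Set([429, 500]);

// Hypothetical helper: summarises an error event for logging and retry decisions.
function classifyError(event) {
  const { code, reason, description } = event.error;
  return {
    retryable: RETRYABLE_CODES.has(code),
    message: `${code} ${reason}: ${description}`,
  };
}
```

A reconnect loop could back off and retry only when `retryable` is true.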
List Supported Languages
Returns the languages currently available for voice translation. Use the returned language_code values as source_language / target_language when starting a session.
GET /voice-translation/languages
Authentication: Authorization: API-Key {api-key}
Response 200
{
"success": true,
"code": 0,
"data": [
{
"name": "English (United States)",
"language_code": "en-US"
},
{
"name": "French",
"language_code": "fr-FR"
},
{
"name": "German",
"language_code": "de-DE"
}
],
"req_id": "..."
}
Language codes are BCP-47 (e.g. en-US, not en). The list is dynamic, so fetch it at session start rather than hard-coding it.
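A session can validate its locales against this response before connecting. The sketch below works on the already-parsed response body, so it does not assume a base URL or HTTP client. `supportedLanguageCodes` is a hypothetical helper.

```javascript
// Hypothetical helper: extracts the supported locales from the parsed response body.
function supportedLanguageCodes(body) {
  if (!body || body.success !== true || !Array.isArray(body.data)) {
    throw new Error('Unexpected languages response');
  }
  return new Set(body.data.map((entry) => entry.language_code));
}

// Example using the response shape shown above.
const languageCodes = supportedLanguageCodes({
  success: true,
  code: 0,
  data: [
    { name: 'English (United States)', language_code: 'en-US' },
    { name: 'French', language_code: 'fr-FR' },
  ],
});
const canTranslate = languageCodes.has('en-US') && languageCodes.has('fr-FR');
```

A bare `en` would not match, since the set holds full BCP-47 codes such as `en-US`.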
Notes & Known Limitations
- Audio formats: Only pcm_s16le is supported.
- Sample rates & channels: Only 16000 Hz mono is confirmed.
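With pcm_s16le at 16000 Hz mono, each sample occupies 2 bytes, so one second of audio is 32000 bytes. The sketch below computes a chunk size and splits a buffer for sending. The 20 ms chunk duration is an assumption, since this reference does not specify a frame size, and both helper names are hypothetical.

```javascript
const BYTES_PER_SAMPLE = 2; // pcm_s16le: signed 16-bit little-endian

// Hypothetical helper: bytes needed for chunkMs of audio, rounded to whole samples.
function pcmChunkBytes(chunkMs, sampleRate = 16000, channels = 1) {
  const samples = Math.round((sampleRate * chunkMs) / 1000);
  return samples * BYTES_PER_SAMPLE * channels;
}

// Hypothetical helper: splits a PCM buffer into chunks; the last one may be shorter.
function splitPcm(buffer, chunkBytes) {
  const chunks = [];
  for (let offset = 0; offset < buffer.length; offset += chunkBytes) {
    chunks.push(buffer.subarray(offset, offset + chunkBytes));
  }
  return chunks;
}
```

For 20 ms chunks this gives 320 samples, or 640 bytes per binary frame.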
