API Reference
Voice Translation WebSocket API
Overview
The Voice Translation WebSocket API provides real-time audio transcription and translation via a persistent WebSocket connection. It operates as a two-leg proxy architecture: the Client ↔︎ WSS Server leg handles authentication, session validation, and configuration enrichment, while the WSS Server ↔︎ Worker (Pipecat) leg carries the merged configuration and bidirectional audio/event traffic to the translation worker.
┌──────────┐ WSS (leg 1) ┌────────────┐ WSS (leg 2) ┌──────────────┐
│ Client │ ◄─────────────► │ WSS Server │ ◄─────────────► │ Worker │
│ │ │ (proxy) │ │ (Pipecat) │
└──────────┘ └────────────┘ └──────────────┘
Part 1 — Client ↔︎ WSS Server Public API
Endpoint
wss://streaming.krisp.ai/vt?authorization=api-key SESSION_KEY
Note that the space between api-key and the session key must be percent-encoded (%20) when the URL is constructed.
Session key generation
const axios = require('axios');

const config = {
  method: 'get',
  url: 'https://sdkapi.krisp.ai/v2/sdk/voice-translation/session/token?expiration_ttl=100',
  headers: {
    'Authorization': 'api-key API_KEY'
  }
};

axios.request(config)
  .then((response) => {
    const SESSION_KEY = response.data.data.session_key; // use in the wss:// endpoint above
  })
  .catch((error) => {
    console.log(error);
  });

Initial Client Message
Once the WebSocket connection is established, the client must send a single JSON configuration message as the first message, before sending any audio.
{
"config": {
"audio": {
"format": "pcm_s16le",
"sample_rate": 16000,
"channels": 1
},
"source_language": "en-US",
"target_language": "fr-FR",
"voice": "male" | "female",
"vocabulary": ["special", "domain", "terms"],
"translation_dictionary": [
{ "source": "referral", "target": "référence" },
{ "source": "copay", "target": "quote-part" },
{ "source": "email", "target": "courriel" }
],
"transcript": {
"interim": true,
"final": true,
"translate": true
},
"features": {
"background_voice_cancellation": true
},
"metadata": {
"reference_id": "your-reference-id"
}
}
}
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| audio.format | string | No | pcm_s16le | Audio encoding format |
| audio.sample_rate | integer | No | 16000 | Audio sample rate in Hz |
| audio.channels | integer | No | 1 | Number of audio channels |
| source_language | string | Yes | — | Source language locale in BCP 47 format (e.g. en-US) |
| target_language | string | Yes | — | Target language locale in BCP 47 format (e.g. fr-FR) |
| voice | string | Yes | — | Voice for the translated audio output (male or female) |
| vocabulary | array of strings | No | — | Domain-specific or uncommon words for improved recognition accuracy |
| translation_dictionary | array of objects | No | — | Custom source→target term mappings for ambiguous or specialized vocabulary |
| transcript.interim | boolean | No | false | Whether to emit interim (in-progress) transcript events |
| transcript.final | boolean | No | true | Whether to emit final transcript events |
| transcript.translate | boolean | No | true | Whether to emit translated transcript events |
| features | object | No | — | Additional feature flags (e.g. background_voice_cancellation) |
| metadata | object | No | — | Client-supplied metadata; reference_id is echoed back in events |
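The required and defaulted parameters above can be composed programmatically before connecting. A minimal sketch (the buildConfig helper and its option names are illustrative, not part of the API):

```javascript
// Build the initial configuration message, filling in the documented
// defaults for any optional field the caller omits.
function buildConfig({ sourceLanguage, targetLanguage, voice, audio, transcript, ...rest }) {
  if (!sourceLanguage || !targetLanguage || !voice) {
    throw new Error('source_language, target_language and voice are required');
  }
  return {
    config: {
      audio: { format: 'pcm_s16le', sample_rate: 16000, channels: 1, ...audio },
      source_language: sourceLanguage,
      target_language: targetLanguage,
      voice,
      transcript: { interim: false, final: true, translate: true, ...transcript },
      ...rest, // optional: vocabulary, translation_dictionary, features, metadata
    },
  };
}
```

The resulting object would be serialized with JSON.stringify and sent as the first text frame after the connection opens.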
Bidirectional Binary Audio
After the initial JSON configuration message, audio is exchanged as raw binary WebSocket frames in both directions.
| Direction | Description |
|---|---|
| Client → Server | Raw PCM audio chunks in the format declared in audio.* fields |
| Server → Client | Synthesised translated speech PCM audio, same encoding as the inbound stream |
Binary frames and JSON event frames coexist on the same WebSocket connection. The receiver distinguishes them by WebSocket frame opcode: 0x2 (binary) for audio, 0x1 (text) for JSON events.
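With Node's `ws` package, for example, the opcode distinction surfaces as the isBinary flag passed to each message listener, so a receiver can route frames with a small dispatcher (routeFrame is an illustrative name, not part of the API):

```javascript
// Route an incoming WebSocket frame: binary frames carry PCM audio,
// text frames carry one of the JSON events defined below.
function routeFrame(data, isBinary) {
  if (isBinary) {
    return { kind: 'audio', pcm: data };             // opcode 0x2
  }
  const event = JSON.parse(data.toString());         // opcode 0x1
  if (event.transcript) return { kind: 'transcript', payload: event.transcript };
  if (event.translate) return { kind: 'translation', payload: event.translate };
  if (event.error) return { kind: 'error', payload: event.error };
  return { kind: 'unknown', payload: event };
}
```

With `ws` this would typically be attached as `socket.on('message', (data, isBinary) => handle(routeFrame(data, isBinary)))`.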
Server Events (Client ↔︎ WSS Server)
The server emits JSON text frames over the WebSocket. Each event corresponds to one of the types below.
Transcript
{
"transcript": {
"text": "Hello, how are you?",
"final": false,
"start": "2026-03-25T19:24:45.370+00:00",
"duration": 436,
"utterance_id": "daslkndlkans",
"reference_id": "askdnl"
}
}
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | Transcript text for this chunk |
| final | boolean | Yes | true for final transcripts, false for interim ones |
| start | string (ISO 8601) | Yes | Chunk start timestamp (UTC) |
| duration | integer | Yes | Chunk duration in milliseconds |
| utterance_id | string | Yes | Links this event to the corresponding final transcript and translation |
| reference_id | string | No | Echoed client reference ID |
Translation
{
"translate": {
"text": "¿Hola, cómo estás?",
"final": true,
"utterance_id": "daslkndlkans",
"reference_id": "askdnl"
}
}
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | Translated text |
| final | boolean | Yes | true for final translations, false for interim ones |
| utterance_id | string | Yes | Links this event to the corresponding transcript |
| reference_id | string | No | Echoed client reference ID |
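Because interim, final, and translated events for one utterance share an utterance_id, a client can stitch the two streams together. A sketch (the UtteranceTracker name is illustrative):

```javascript
// Collect transcript and translation events per utterance_id. Interim
// events simply overwrite earlier text until a final event arrives.
class UtteranceTracker {
  constructor() { this.utterances = new Map(); }

  // Accepts a parsed server event; returns the updated entry, or null
  // if the event is neither a transcript nor a translation.
  handle(event) {
    const body = event.transcript || event.translate;
    if (!body) return null;
    const entry = this.utterances.get(body.utterance_id) || {};
    if (event.transcript) entry.transcript = body.text;
    if (event.translate) entry.translation = body.text;
    this.utterances.set(body.utterance_id, entry);
    return entry;
  }
}
```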
Error
Emitted when the server encounters an issue processing the request.
{
"error": {
"code": 400,
"reason": "Bad request",
"description": "Actual description of the error",
"reference_id": "askdnl"
}
}
| Field | Type | Required | Description |
|---|---|---|---|
| code | integer | Yes | HTTP-style status code |
| reason | string | Yes | Short reason string |
| description | string | Yes | Detailed error description |
| reference_id | string | No | Echoed client reference ID |
Error Codes
| Code | Reason | Causes |
|---|---|---|
| 400 | Bad Request | Malformed request · Invalid audio format · Incorrect target language · Invalid voice |
| 401 | Unauthorized | Missing API key · Invalid API key · Expired API key |
| 402 | Payment Required | Balance exhausted · Subscription expired |
| 429 | Too Many Requests | Rate limit exceeded · Max concurrent connections exceeded |
| 500 | Internal Server Error | Server failed to process the request |
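One practical use of this table is deciding client-side whether a failed session is worth retrying. A sketch (the classification is an assumption, not prescribed by the API):

```javascript
// 4xx errors other than 429 indicate a request or account problem that a
// retry will not fix; 429 and 5xx are plausibly transient.
function isRetryable(code) {
  return code === 429 || code >= 500;
}
```

A client could combine this with exponential backoff before reconnecting.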
Part 2 — WSS Server Internal Processing
When a client connects, the WSS Server performs the following steps before establishing the second WebSocket leg to the Worker.
Step 1 — Authentication
The session key supplied in the authorization query parameter is validated. If the key is missing, invalid, or expired, the server sends a 401 error event and closes the WebSocket connection.
Step 2 — Session & Profile Check
Using the resolved API key, the server loads the associated account profile and verifies:
- Whether new sessions are permitted (e.g. concurrent connection limit)
- Whether the account has sufficient credit or an active subscription
- Whether the requested language pair and voice are available under the account’s tier
If any check fails, an appropriate error event is returned (402, 429, etc.) and the connection is closed.
Step 3 — Configuration Enrichment
The server fetches internal backend configuration for the API key and merges it with the client-supplied request body. This enriched configuration is not exposed to the client; it is forwarded only to the Worker.
Step 4 — Worker Connection (Leg 2)
The server opens a second WebSocket connection to the Pipecat Worker and sends the merged configuration (client body + enriched fields) as the initial message. Once the Worker acknowledges the connection, the WSS Server enters transparent proxy mode and routes all subsequent frames between the client and the Worker without modification.
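The bridging step can be sketched as follows, assuming ws-style socket objects on both legs (the bridge helper is illustrative; the real server also handles the Worker's acknowledgement and error paths):

```javascript
// Once the worker leg opens, send the merged configuration, then mirror
// frames in both directions and tie the connection lifetimes together.
function bridge(clientWs, workerWs, mergedConfig) {
  workerWs.on('open', () => {
    workerWs.send(JSON.stringify(mergedConfig)); // leg-2 initial message
    clientWs.on('message', (data, isBinary) => workerWs.send(data, { binary: isBinary }));
    workerWs.on('message', (data, isBinary) => clientWs.send(data, { binary: isBinary }));
  });
  clientWs.on('close', () => workerWs.close());
  workerWs.on('close', () => clientWs.close());
}
```

After the initial message, the function forwards frames verbatim, which is what "transparent proxy mode" means here.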
Part 3 — WSS Server ↔︎ Worker (Pipecat)
The Worker is a Pipecat-based bot that accepts a WebSocket connection and performs the actual STT → translation → TTS pipeline.
Endpoint
Internal — not publicly accessible. Assigned by the WSS Server’s service discovery or configuration.
Initial Worker Message
The WSS Server sends the fully merged configuration as the first JSON frame:
{
"request_id": "api-1775549100475-s09tq718yn8",
"source_lang": "en-US",
"target_lang": "ru-RU",
"audio": {
"format": "pcm_s16le",
"sample_rate": 16000,
"channels": 1
},
"tier": 2,
"translator": {
"provider": "soniox",
"settings": null,
"mode": "synchronous",
"vocabulary": [
"qqwqqwwq"
],
"dictionary": {
"referral": "référence",
"copay": "quote-part",
"email": "courriel"
}
},
"stt": {
"provider": "soniox",
"settings": null,
"interim": true,
"final": true,
"translate": true,
"vocabulary": [
"qqwqqwwq"
]
},
"tts": {
"provider": "google",
"settings": null,
"voice": {
"id": "ru-RU-Chirp3-HD-Aoede",
"speed": 1,
"settings": null
}
},
"features": {
"background_voice_cancellation": true
},
"metadata": {
"user_id": 123456,
"team_id": 654321,
"product_line": "call_center",
"reference_id": "client-reference-id"
}
}
Bidirectional Binary Audio
Same framing as the Client ↔︎ WSS Server leg. The WSS Server forwards binary audio frames from the client to the Worker unchanged, and forwards synthesised audio frames from the Worker to the client unchanged.
Worker Events
The Worker emits the same JSON event schema as defined in Part 1 (Interim Transcript, Final Transcript, Translation, Error). The WSS Server forwards these events to the client transparently.
Full Session Flow
Client WSS Server Worker (Pipecat)
│ │ │
│─── Connect (WSS) ────────────►│ │
│ ?authorization=api-key SESSION_KEY │ │
│ │ │
│─── Initial config (JSON) ────►│ │
│ ├─[1] Authenticate API key │
│ ├─[2] Check session & credits │
│ ├─[3] Fetch backend config │
│ ├─[4] Merge client + backend config │
│ │ │
│ │─── Connect (WSS, leg 2) ───────────►│
│ │─── Merged config (JSON) ───────────►│
│ │ │
│─── Audio chunk (binary) ─────►│──── Audio chunk (binary) ──────────►│
│◄── Interim transcript (JSON) ─│◄─── Interim transcript (JSON) ──────│
│─── Audio chunk (binary) ─────►│──── Audio chunk (binary) ──────────►│
│◄── Interim transcript (JSON) ─│◄─── Interim transcript (JSON) ──────│
│ │ │
│ [end of utterance detected by Worker] │
│◄── Final transcript (JSON) ───│◄─── Final transcript (JSON) ────────│
│◄── Translation (JSON) ────────│◄─── Translation (JSON) ─────────────│
│◄── Audio chunk (binary) ──────│◄─── Synthesised audio (binary) ─────│
│ │ │
│ [on any error] │
│◄── Error (JSON) ──────────────│◄─── Error (JSON) ───────────────────│
│ │ │
│─── Close ────────────────────►│─── Close ──────────────────────────►│
Notes & Known Limitations
- Audio formats — only pcm_s16le is supported.
- Sample rates & channels — only 16000 Hz mono has been confirmed.
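Given these limitations, it is worth validating audio settings client-side before opening a connection. A sketch (the helper name is illustrative):

```javascript
// Reject audio settings outside the confirmed envelope before connecting.
function validateAudio({ format = 'pcm_s16le', sample_rate = 16000, channels = 1 } = {}) {
  if (format !== 'pcm_s16le') throw new Error(`unsupported audio format: ${format}`);
  if (sample_rate !== 16000) throw new Error(`unsupported sample rate: ${sample_rate}`);
  if (channels !== 1) throw new Error('only mono (1 channel) audio is confirmed');
  return { format, sample_rate, channels };
}
```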