API Reference
Voice Translation WebSocket API
Overview
The Voice Translation WebSocket API provides real-time audio transcription and translation via a persistent WebSocket connection. It operates as a two-leg proxy architecture: the Client ↔︎ WSS Server leg handles authentication, session validation, and configuration enrichment, while the WSS Server ↔︎ Worker (Pipecat) leg carries the merged configuration and bidirectional audio/event traffic to the translation worker.
┌──────────┐ WSS (leg 1) ┌────────────┐ WSS (leg 2) ┌──────────────┐
│ Client │ ◄─────────────► │ WSS Server │ ◄─────────────► │ Worker │
│ │ │ (proxy) │ │ (Pipecat) │
└──────────┘ └────────────┘ └──────────────┘
Part 1 — Client ↔︎ WSS Server Public API
Endpoint
wss://streaming.krisp.ai/vt?authorization=api-key SESSION_KEY
Note that the space between api-key and the session key must be percent-encoded (%20) when the URL is constructed.
Session key generation
const axios = require('axios');

const config = {
  method: 'get',
  url: 'https://sdkapi.krisp.ai/v2/sdk/voice-translation/session/token?expiration_ttl=100',
  headers: {
    'Authorization': 'api-key API_KEY'
  }
};

axios.request(config)
  .then((response) => {
    const SESSION_KEY = response.data.data.session_key; // use in the wss:// endpoint above
  })
  .catch((error) => {
    console.log(error);
  });

Initial Client Message
Once the WebSocket connection is established, the client must send a single JSON configuration message as the first message, before sending any audio.
{
"config": {
"audio": {
"format": "pcm_s16le",
"sample_rate": 16000,
"channels": 1
},
"source_language": "en-US",
"target_language": "fr-FR",
"voice": "male" | "female",
"vocabulary": ["special", "domain", "terms"],
"translation_dictionary": [
{ "source": "referral", "target": "référence" },
{ "source": "copay", "target": "quote-part" },
{ "source": "email", "target": "courriel" }
],
"transcript": {
"interim": true,
"final": true,
"translate": true
},
"features": {
"background_voice_cancellation": true
},
"metadata": {
"reference_id": "your-reference-id"
}
}
}
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| audio.format | string | No | pcm_s16le | Audio encoding format |
| audio.sample_rate | integer | No | 16000 | Audio sample rate in Hz |
| audio.channels | integer | No | 1 | Number of audio channels |
| source_language | string | Yes | — | Source language locale in BCP 47 format (e.g. en-US) |
| target_language | string | Yes | — | Target language locale in BCP 47 format (e.g. fr-FR) |
| voice | string | Yes | — | Voice for the translated audio output (male or female) |
| vocabulary | array of strings | No | — | Domain-specific or uncommon words for improved recognition accuracy |
| translation_dictionary | array of objects | No | — | Custom source→target term mappings for ambiguous or specialized vocabulary |
| transcript.interim | boolean | No | false | Whether to emit interim (in-progress) transcript events |
| transcript.final | boolean | No | true | Whether to emit final transcript events |
| transcript.translate | boolean | No | true | Whether to emit translated transcript events |
| features | object | No | — | Additional feature flags (e.g. background_voice_cancellation) |
| metadata | object | No | — | Client-supplied metadata; reference_id is echoed back in events |
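The required and defaulted parameters above can be composed programmatically before connecting. A minimal sketch (the buildConfig helper and its option names are illustrative, not part of the API):

```javascript
// Build the initial configuration message, filling in the documented
// defaults for any optional field the caller omits.
function buildConfig({ sourceLanguage, targetLanguage, voice, audio, transcript, ...rest }) {
  if (!sourceLanguage || !targetLanguage || !voice) {
    throw new Error('source_language, target_language and voice are required');
  }
  return {
    config: {
      audio: { format: 'pcm_s16le', sample_rate: 16000, channels: 1, ...audio },
      source_language: sourceLanguage,
      target_language: targetLanguage,
      voice,
      transcript: { interim: false, final: true, translate: true, ...transcript },
      ...rest, // optional: vocabulary, translation_dictionary, features, metadata
    },
  };
}
```

The resulting object would be serialized with JSON.stringify and sent as the first text frame after the connection opens.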
Bidirectional Binary Audio
After the initial JSON configuration message, audio is exchanged as raw binary WebSocket frames in both directions.
| Direction | Description |
|---|---|
| Client → Server | Raw PCM audio chunks in the format declared in audio.* fields |
| Server → Client | Synthesised translated speech PCM audio, same encoding as the inbound stream |
Binary frames and JSON event frames coexist on the same WebSocket connection. The receiver distinguishes them by WebSocket frame opcode: 0x2 (binary) for audio, 0x1 (text) for JSON events.
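With Node's `ws` package, for example, the opcode distinction surfaces as the isBinary flag passed to each message listener, so a receiver can route frames with a small dispatcher (routeFrame is an illustrative name, not part of the API):

```javascript
// Route an incoming WebSocket frame: binary frames carry PCM audio,
// text frames carry one of the JSON events defined below.
function routeFrame(data, isBinary) {
  if (isBinary) {
    return { kind: 'audio', pcm: data };             // opcode 0x2
  }
  const event = JSON.parse(data.toString());         // opcode 0x1
  if (event.transcript) return { kind: 'transcript', payload: event.transcript };
  if (event.translate) return { kind: 'translation', payload: event.translate };
  if (event.error) return { kind: 'error', payload: event.error };
  return { kind: 'unknown', payload: event };
}
```

With `ws` this would typically be attached as `socket.on('message', (data, isBinary) => handle(routeFrame(data, isBinary)))`.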
Server Events (Client ↔︎ WSS Server)
The server emits JSON text frames over the WebSocket. Each event corresponds to one of the types below.
Transcript
{
"transcript": {
"text": "Hello, how are you?",
"final": false,
"start": "2026-03-25T19:24:45.370+00:00",
"duration": 436,
"utterance_id": "daslkndlkans",
"reference_id": "askdnl"
}
}
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | Transcript text for this chunk |
| final | boolean | Yes | true for final transcripts, false for interim ones |
| start | string (ISO 8601) | Yes | Chunk start timestamp (UTC) |
| duration | integer | Yes | Chunk duration in milliseconds |
| utterance_id | string | Yes | Links this event to the corresponding final transcript and translation |
| reference_id | string | No | Echoed client reference ID |
Translation
{
"translate": {
"text": "¿Hola, cómo estás?",
"final": true,
"utterance_id": "daslkndlkans",
"reference_id": "askdnl"
}
}
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | Translated text |
| final | boolean | Yes | true for final translations, false for interim ones |
| utterance_id | string | Yes | Links this event to the corresponding transcript |
| reference_id | string | No | Echoed client reference ID |
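Because interim, final, and translated events for one utterance share an utterance_id, a client can stitch the two streams together. A sketch (the UtteranceTracker name is illustrative):

```javascript
// Collect transcript and translation events per utterance_id. Interim
// events simply overwrite earlier text until a final event arrives.
class UtteranceTracker {
  constructor() { this.utterances = new Map(); }

  // Accepts a parsed server event; returns the updated entry, or null
  // if the event is neither a transcript nor a translation.
  handle(event) {
    const body = event.transcript || event.translate;
    if (!body) return null;
    const entry = this.utterances.get(body.utterance_id) || {};
    if (event.transcript) entry.transcript = body.text;
    if (event.translate) entry.translation = body.text;
    this.utterances.set(body.utterance_id, entry);
    return entry;
  }
}
```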
Error
Emitted when the server encounters an issue processing the request.
{
"error": {
"code": 400,
"reason": "Bad request",
"description": "Actual description of the error",
"reference_id": "askdnl"
}
}
| Field | Type | Required | Description |
|---|---|---|---|
| code | integer | Yes | HTTP-style status code |
| reason | string | Yes | Short reason string |
| description | string | Yes | Detailed error description |
| reference_id | string | No | Echoed client reference ID |
Error Codes
| Code | Reason | Causes |
|---|---|---|
| 400 | Bad Request | Malformed request · Invalid audio format · Incorrect target language · Invalid voice |
| 401 | Unauthorized | Missing API key · Invalid API key · Expired API key |
| 402 | Payment Required | Balance exhausted · Subscription expired |
| 429 | Too Many Requests | Rate limit exceeded · Max concurrent connections exceeded |
| 500 | Internal Server Error | Server failed to process the request |
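One practical use of this table is deciding client-side whether a failed session is worth retrying. A sketch (the classification is an assumption, not prescribed by the API):

```javascript
// 4xx errors other than 429 indicate a request or account problem that a
// retry will not fix; 429 and 5xx are plausibly transient.
function isRetryable(code) {
  return code === 429 || code >= 500;
}
```

A client could combine this with exponential backoff before reconnecting.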
Part 2 — WSS Server Internal Processing
When a client connects, the WSS Server performs the following steps before establishing the second WebSocket leg to the Worker.
Step 1 — Authentication
The session key supplied in the authorization query parameter is validated. If the key is missing, invalid, or expired, the server sends a 401 error event and closes the WebSocket connection.
Step 2 — Session & Profile Check
Using the resolved API key, the server loads the associated account profile and verifies:
- Whether new sessions are permitted (e.g. concurrent connection limit)
- Whether the account has sufficient credit or an active subscription
- Whether the requested language pair and voice are available under the account’s tier
If any check fails, an appropriate error event is returned (402, 429, etc.) and the connection is closed.
Step 3 — Configuration Enrichment
The server fetches internal backend configuration for the API key and merges it with the client-supplied request body. This enriched configuration is not exposed to the client; it is forwarded only to the Worker.
Step 4 — Worker Connection (Leg 2)
The server opens a second WebSocket connection to the Pipecat Worker and sends the merged configuration (client body + enriched fields) as the initial message. Once the Worker acknowledges the connection, the WSS Server enters transparent proxy mode and routes all subsequent frames between the client and the Worker without modification.
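The bridging step can be sketched as follows, assuming ws-style socket objects on both legs (the bridge helper is illustrative; the real server also handles the Worker's acknowledgement and error paths):

```javascript
// Once the worker leg opens, send the merged configuration, then mirror
// frames in both directions and tie the connection lifetimes together.
function bridge(clientWs, workerWs, mergedConfig) {
  workerWs.on('open', () => {
    workerWs.send(JSON.stringify(mergedConfig)); // leg-2 initial message
    clientWs.on('message', (data, isBinary) => workerWs.send(data, { binary: isBinary }));
    workerWs.on('message', (data, isBinary) => clientWs.send(data, { binary: isBinary }));
  });
  clientWs.on('close', () => workerWs.close());
  workerWs.on('close', () => clientWs.close());
}
```

After the initial message, the function forwards frames verbatim, which is what "transparent proxy mode" means here.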
Part 3 — WSS Server ↔︎ Worker (Pipecat)
The Worker is a Pipecat-based bot that accepts a WebSocket connection and performs the actual STT → translation → TTS pipeline.
Endpoint
Internal — not publicly accessible. Assigned by the WSS Server’s service discovery or configuration.
Initial Worker Message
The WSS Server sends the fully merged configuration as the first JSON frame:
{
"request_id": "api-1775549100475-s09tq718yn8",
"source_lang": "en-US",
"target_lang": "ru-RU",
"audio": {
"format": "pcm_s16le",
"sample_rate": 16000,
"channels": 1
},
"tier": 2,
"translator": {
"provider": "soniox",
"settings": null,
"mode": "synchronous",
"vocabulary": [
"qqwqqwwq"
],
"dictionary": {
"referral": "référence",
"copay": "quote-part",
"email": "courriel"
}
},
"stt": {
"provider": "soniox",
"settings": null,
"interim": true,
"final": true,
"translate": true,
"vocabulary": [
"qqwqqwwq"
]
},
"tts": {
"provider": "google",
"settings": null,
"voice": {
"id": "ru-RU-Chirp3-HD-Aoede",
"speed": 1,
"settings": null
}
},
"features": {
"background_voice_cancellation": true
},
"metadata": {
"user_id": 123456,
"team_id": 654321,
"product_line": "call_center",
"reference_id": "client-reference-id"
}
}
Bidirectional Binary Audio
Same framing as the Client ↔︎ WSS Server leg. The WSS Server forwards binary audio frames from the client to the Worker unchanged, and forwards synthesised audio frames from the Worker to the client unchanged.
Worker Events
The Worker emits the same JSON event schema as defined in Part 1 (Interim Transcript, Final Transcript, Translation, Error). The WSS Server forwards these events to the client transparently.
Full Session Flow
Client WSS Server Worker (Pipecat)
│ │ │
│─── Connect (WSS) ────────────►│ │
│ ?authorization=api-key SESSION_KEY │ │
│ │ │
│─── Initial config (JSON) ────►│ │
│ ├─[1] Authenticate API key │
│ ├─[2] Check session & credits │
│ ├─[3] Fetch backend config │
│ ├─[4] Merge client + backend config │
│ │ │
│ │─── Connect (WSS, leg 2) ───────────►│
│ │─── Merged config (JSON) ───────────►│
│ │ │
│─── Audio chunk (binary) ─────►│──── Audio chunk (binary) ──────────►│
│◄── Interim transcript (JSON) ─│◄─── Interim transcript (JSON) ──────│
│─── Audio chunk (binary) ─────►│──── Audio chunk (binary) ──────────►│
│◄── Interim transcript (JSON) ─│◄─── Interim transcript (JSON) ──────│
│ │ │
│ [end of utterance detected by Worker] │
│◄── Final transcript (JSON) ───│◄─── Final transcript (JSON) ────────│
│◄── Translation (JSON) ────────│◄─── Translation (JSON) ─────────────│
│◄── Audio chunk (binary) ──────│◄─── Synthesised audio (binary) ─────│
│ │ │
│ [on any error] │
│◄── Error (JSON) ──────────────│◄─── Error (JSON) ───────────────────│
│ │ │
│─── Close ────────────────────►│─── Close ──────────────────────────►│
Notes & Known Limitations
- Audio formats — only pcm_s16le is supported.
- Sample rates & channels — only 16000 Hz mono has been confirmed.
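Given these limitations, it is worth validating audio settings client-side before opening a connection. A sketch (the helper name is illustrative):

```javascript
// Reject audio settings outside the confirmed envelope before connecting.
function validateAudio({ format = 'pcm_s16le', sample_rate = 16000, channels = 1 } = {}) {
  if (format !== 'pcm_s16le') throw new Error(`unsupported audio format: ${format}`);
  if (sample_rate !== 16000) throw new Error(`unsupported sample rate: ${sample_rate}`);
  if (channels !== 1) throw new Error('only mono (1 channel) audio is confirmed');
  return { format, sample_rate, channels };
}
```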