API Reference

Voice Translation WebSocket API

Overview

The Voice Translation WebSocket API provides real-time audio transcription and translation via a persistent WebSocket connection.


Client ↔︎ WSS Server

Endpoint

wss://streaming.krisp.ai/vt?authorization=Api-Key SESSION_KEY
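The space between Api-Key and the session key cannot appear literally in a URL. A minimal sketch of building the connection URL follows. It assumes the server accepts the standard percent-encoded form (%20), which this reference does not state; buildStreamingUrl is a hypothetical helper name.

```javascript
// Hypothetical helper: builds the streaming endpoint URL for a session key.
// Assumption: the server decodes a percent-encoded space (%20) in the
// authorization query parameter.
function buildStreamingUrl(sessionKey) {
  if (typeof sessionKey !== 'string' || sessionKey.trim() === '') {
    throw new TypeError('sessionKey must be a non-empty string');
  }
  const authorization = encodeURIComponent(`Api-Key ${sessionKey}`);
  return `wss://streaming.krisp.ai/vt?authorization=${authorization}`;
}

console.log(buildStreamingUrl('session_key_example'));
```

The resulting string can be passed to whichever WebSocket client the application uses.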

Session key generation

const axios = require('axios');

// expiration_ttl is in MINUTES. Min: 5, Max: 1440 (24h).
const config = {
  method: 'get',
  url: 'https://api.developers.krisp.ai/v2/sdk/voice-translation/session/token?expiration_ttl=100',
  headers: {
    'Authorization': 'api-key API_KEY' // replace API_KEY with your API key
  }
};

axios.request(config)
  .then((response) => {
    // response.data.data = {
    //   session_key: "session_key_...",
    //   expires_at: "2026-05-04T12:00:00.000Z",
    //   key_id: 123,
    //   status: "active",
    //   type: "session"
    // }
    const SESSION_KEY = response.data.data.session_key;
    // Use SESSION_KEY in the WebSocket endpoint URL.
  })
  .catch((error) => {
    console.error(error);
  });
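Because session keys expire, a client may want to validate the TTL before requesting a key and check expires_at before reusing one. The sketch below is illustrative only: buildTokenUrl and isSessionKeyUsable are hypothetical helpers, the 30-second safety margin is arbitrary, and treating "active" as the only usable status and rejecting fractional TTLs are assumptions.

```javascript
// Hypothetical helpers around the session token request shown above.
const TOKEN_URL =
  'https://api.developers.krisp.ai/v2/sdk/voice-translation/session/token';

// expiration_ttl is in minutes and must stay within the documented 5..1440 range.
// Assumption: only whole minutes are accepted.
function buildTokenUrl(ttlMinutes) {
  if (!Number.isInteger(ttlMinutes) || ttlMinutes < 5 || ttlMinutes > 1440) {
    throw new RangeError('expiration_ttl must be an integer between 5 and 1440');
  }
  return `${TOKEN_URL}?expiration_ttl=${ttlMinutes}`;
}

// Returns true while the key is active and not within `marginMs` of expiry.
// The 30-second default margin is an arbitrary illustrative choice.
function isSessionKeyUsable(tokenData, now = new Date(), marginMs = 30000) {
  const expiresAt = Date.parse(tokenData.expires_at);
  if (Number.isNaN(expiresAt)) return false;
  return tokenData.status === 'active' && now.getTime() + marginMs < expiresAt;
}

const sampleToken = {
  session_key: 'session_key_...',
  expires_at: '2026-05-04T12:00:00.000Z',
  status: 'active',
};
console.log(buildTokenUrl(100));
console.log(isSessionKeyUsable(sampleToken, new Date('2026-05-04T11:00:00.000Z')));
```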

Initial Client Message

Once the WebSocket connection is established, the client must send a single JSON configuration message as the first message, before sending any audio.

{
  "config": {
    "audio": {
      "format": "pcm_s16le",
      "sample_rate": 16000,
      "channels": 1
    },
    "source_language": "en-US",
    "target_language": "fr-FR",
    "voice": "male" | "female",
    "vocabulary": ["special", "domain", "terms"],
    "translation_dictionary": [
      { "source": "referral", "target": "référence" },
      { "source": "copay", "target": "quote-part" },
      { "source": "email", "target": "courriel" }
    ],
    "transcript": {
      "interim": true,
      "final": true,
      "translate": true
    },
    "features": {
      "background_voice_cancellation": true
    },
    "metadata": {
      "reference_id": "your-reference-id"
    }
  }
}

Parameters

Parameter | Type | Required | Default | Description
--- | --- | --- | --- | ---
audio.format | string | No | pcm_s16le | Audio encoding format
audio.sample_rate | integer | No | 16000 | Audio sample rate in Hz
audio.channels | integer | No | 1 | Number of audio channels
source_language | string | Yes | | Source language locale in BCP 47 format (e.g. en-US)
target_language | string | Yes | | Target language locale in BCP 47 format (e.g. fr-FR)
voice | string | | | Voice ID for translated audio output
vocabulary | array of strings | No | | Domain-specific or uncommon words for improved recognition accuracy
translation_dictionary | array of objects | No | | Custom source→target term mappings for ambiguous or specialized vocabulary
transcript.interim | boolean | No | true | Whether to emit interim transcript events. The default of true applies only when the entire transcript object is omitted; if a transcript object is provided without interim, it defaults to false.
transcript.final | boolean | No | true | Whether to emit final transcript events (final is true for the final transcript of an utterance, false for interim transcripts)
transcript.translate | boolean | No | true | Whether to emit translated transcript events
features | json | No | - | Additional features
metadata | json | No | | Client-supplied metadata
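The first message can be assembled from application settings with the documented defaults applied. The sketch below is one possible shape: buildConfigMessage and its camelCase option names are hypothetical, and it sends all three transcript flags explicitly so the outcome does not depend on how a partially specified transcript object is defaulted.

```javascript
// Hypothetical helper: builds the first WebSocket message from caller options,
// applying the defaults listed in the Parameters table.
function buildConfigMessage(options) {
  const { sourceLanguage, targetLanguage } = options;
  if (!sourceLanguage || !targetLanguage) {
    throw new Error('source_language and target_language are required');
  }
  const config = {
    audio: { format: 'pcm_s16le', sample_rate: 16000, channels: 1 },
    source_language: sourceLanguage,
    target_language: targetLanguage,
    // All three flags are sent explicitly to avoid relying on server defaults.
    transcript: {
      interim: options.interim ?? true,
      final: options.final ?? true,
      translate: options.translate ?? true,
    },
  };
  // Optional fields are included only when the caller supplies them.
  if (options.voice) config.voice = options.voice;
  if (options.vocabulary?.length) config.vocabulary = options.vocabulary;
  if (options.translationDictionary?.length) {
    config.translation_dictionary = options.translationDictionary;
  }
  if (options.referenceId) config.metadata = { reference_id: options.referenceId };
  return JSON.stringify({ config });
}

const firstMessage = buildConfigMessage({
  sourceLanguage: 'en-US',
  targetLanguage: 'fr-FR',
  voice: 'female',
  translationDictionary: [{ source: 'email', target: 'courriel' }],
  referenceId: 'your-reference-id',
});
console.log(firstMessage);
```

The returned string is sent as a single text frame before any audio.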

Bidirectional Binary Audio

After the initial JSON configuration message, audio is exchanged as raw binary WebSocket frames in both directions.

Direction | Description
--- | ---
Client → Server | Raw PCM audio chunks in the format declared in the audio.* fields
Server → Client | Synthesised translated speech PCM audio, same encoding as the inbound stream

Binary frames and JSON event frames coexist on the same WebSocket connection. The receiver distinguishes them by WebSocket frame opcode: 0x2 (binary) for audio, 0x1 (text) for JSON events.
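A receiver can branch on the frame type before doing anything else. The sketch below assumes a client library that reports whether a message arrived as a binary frame; for example, the ws package for Node.js (version 8 or later) passes (data, isBinary) to its message listener. classifyFrame is a hypothetical helper, and its sample count assumes mono pcm_s16le audio.

```javascript
// Hypothetical helper: classifies an incoming WebSocket message.
// `isBinary` is true for opcode 0x2 frames and false for opcode 0x1 frames.
function classifyFrame(data, isBinary) {
  if (isBinary) {
    // Raw translated speech; pcm_s16le uses 2 bytes per sample (mono assumed).
    const pcm = Buffer.isBuffer(data) ? data : Buffer.from(data);
    return { kind: 'audio', pcm, sampleCount: Math.floor(pcm.length / 2) };
  }
  // Text frames carry one JSON event each.
  const text = typeof data === 'string' ? data : Buffer.from(data).toString('utf8');
  return { kind: 'event', event: JSON.parse(text) };
}

console.log(classifyFrame(Buffer.alloc(640), true).kind);
console.log(classifyFrame('{"transcript":{"text":"Hello","final":false}}', false).kind);
```

With ws, this would typically be called as socket.on('message', (data, isBinary) => classifyFrame(data, isBinary)).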


Server Events (Client ↔︎ WSS Server)

The server emits JSON text frames over the WebSocket. Each event corresponds to one of the types below.

Transcript

{
  "transcript": {
    "text": "Hello, how are you?",
    "final": false,
    "start": "2026-03-25T19:24:45.370+00:00",
    "duration": 436,
    "utterance_id": "daslkndlkans",
    "reference_id": "askdnl"
  }
}
Field | Type | Required | Description
--- | --- | --- | ---
text | string | Yes | Transcript text for this chunk
final | boolean | Yes | false for interim events
start | string (ISO 8601) | Yes | Chunk start timestamp (UTC)
duration | integer | Yes | Chunk duration in milliseconds
utterance_id | string | Yes | Links this event to the corresponding final transcript and translation
reference_id | string | No | Echoed client reference ID

Translation

{
  "translate": {
    "text": "¿Hola, cómo estás?",
    "final": true,
    "utterance_id": "daslkndlkans",
    "reference_id": "askdnl"
  }
}
Field | Type | Required | Description
--- | --- | --- | ---
text | string | Yes | Translated text
final | boolean | Yes | false for interim events
utterance_id | string | Yes | Links this event to the corresponding transcript
reference_id | string | No | Echoed client reference ID
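Since utterance_id links transcript and translation events, a client can keep one record per utterance and update it as events arrive. The following is a sketch, not a prescribed client design: UtteranceTracker is a hypothetical class, and it assumes that interim events arriving after a final event for the same utterance can be ignored.

```javascript
// Hypothetical in-memory tracker that pairs transcript and translation
// events by utterance_id.
class UtteranceTracker {
  constructor() {
    this.utterances = new Map();
  }

  // Returns the record for an utterance, creating it on first use.
  entry(utteranceId) {
    if (!this.utterances.has(utteranceId)) {
      this.utterances.set(utteranceId, {
        transcript: { text: null, final: false },
        translation: { text: null, final: false },
      });
    }
    return this.utterances.get(utteranceId);
  }

  // Accepts one parsed server event; returns the updated record, or null
  // for events that carry neither a transcript nor a translation.
  apply(event) {
    const kind = event.transcript ? 'transcript' : event.translate ? 'translation' : null;
    if (kind === null) return null;
    const payload = event.transcript ?? event.translate;
    const record = this.entry(payload.utterance_id);
    // Assumption: once a final event has been stored, later events for the
    // same utterance and kind are stale.
    if (!record[kind].final) {
      record[kind] = { text: payload.text, final: payload.final };
    }
    return record;
  }
}

const tracker = new UtteranceTracker();
tracker.apply({ transcript: { text: 'Hello', final: false, utterance_id: 'u1' } });
tracker.apply({ transcript: { text: 'Hello, how are you?', final: true, utterance_id: 'u1' } });
tracker.apply({ translate: { text: 'Bonjour, comment allez-vous ?', final: true, utterance_id: 'u1' } });
console.log(tracker.entry('u1'));
```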

Error

Emitted when the server encounters an issue processing the request.

{
  "error": {
    "code": 400,
    "reason": "Bad request",
    "description": "Actual description of the error",
    "reference_id": "askdnl"
  }
}
Field | Type | Required | Description
--- | --- | --- | ---
code | integer | Yes | HTTP-style status code
reason | string | Yes | Short reason string
description | string | Yes | Detailed error description
reference_id | string | No | Echoed client reference ID

Error Codes

Code | Reason | Causes
--- | --- | ---
400 | Bad Request | Malformed request · Invalid audio format · Incorrect target language · Invalid voice
401 | Unauthorized | Missing API key · Invalid API key · Expired API key
402 | Payment Required | Balance exhausted · Subscription expired
429 | Too Many Requests | Rate limit exceeded · Max concurrent connections exceeded
500 | Internal Server Error | Server failed to process the request
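A client usually needs to decide what to do for each error code. The mapping below is a sketch of one reasonable policy and is not documented server guidance: the action names, the choice to retry on 429 and 500, and the backoff parameters are all assumptions.

```javascript
// Hypothetical mapping from the documented error codes to a client action.
// The chosen actions are illustrative assumptions.
const ERROR_ACTIONS = {
  400: 'fix-request',          // malformed config, bad format, language, or voice
  401: 'refresh-session-key',  // missing, invalid, or expired key
  402: 'stop',                 // balance or subscription problem
  429: 'retry-with-backoff',   // rate or concurrency limit
  500: 'retry-with-backoff',   // server-side failure
};

// Returns the action for a parsed error event; unknown codes stop the client.
function actionForError(errorEvent) {
  const code = errorEvent?.error?.code;
  return ERROR_ACTIONS[code] ?? 'stop';
}

// Exponential backoff with a cap; the base and cap values are arbitrary.
function backoffDelayMs(attempt, baseMs = 500, capMs = 30000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

console.log(actionForError({ error: { code: 429, reason: 'Too Many Requests' } }));
console.log(backoffDelayMs(3));
```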


List Supported Languages

Returns the languages currently available for voice translation. Use the returned language_code values as source_language / target_language when starting a session.

GET /voice-translation/languages

Authentication: Authorization: API-Key {api-key}

Response 200

{
  "success": true,
  "code": 0,
  "data": [
    {
      "name": "English (United States)",
      "language_code": "en-US"
    },
    {
      "name": "French",
      "language_code": "fr-FR"
    },
    {
      "name": "German",
      "language_code": "de-DE"
    }
  ],
  "req_id": "..."
}

Language codes are in BCP 47 format (e.g. en-US, not en). The list is dynamic, so fetch it at session start rather than hard-coding it.
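Checking the requested locales against the fetched list before opening a session catches mistakes such as passing en instead of en-US. The sketch below works on an already parsed response body shaped like the example above; unsupportedLanguages is a hypothetical helper, and the HTTP call itself is left out because this section does not give the base URL.

```javascript
// Hypothetical helper: checks requested locales against a parsed response
// body from the languages endpoint and returns the ones that are not listed.
function unsupportedLanguages(responseBody, requestedCodes) {
  if (!responseBody?.success || !Array.isArray(responseBody.data)) {
    throw new Error('unexpected languages response');
  }
  const supported = new Set(responseBody.data.map((lang) => lang.language_code));
  return requestedCodes.filter((code) => !supported.has(code));
}

// Sample body copied from the documented response.
const sampleResponse = {
  success: true,
  code: 0,
  data: [
    { name: 'English (United States)', language_code: 'en-US' },
    { name: 'French', language_code: 'fr-FR' },
    { name: 'German', language_code: 'de-DE' },
  ],
};

console.log(unsupportedLanguages(sampleResponse, ['en-US', 'fr-FR'])); // []
console.log(unsupportedLanguages(sampleResponse, ['en', 'de-DE']));    // [ 'en' ]
```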




Notes & Known Limitations

  • Audio formats: Only pcm_s16le is supported.
  • Sample rates & channels: Only 16000 Hz mono has been confirmed.
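Many capture pipelines produce floating-point samples, which need converting to pcm_s16le before sending. The sketch below shows one common conversion and a simple chunker; both function names are hypothetical, and the 20 ms chunk duration is an assumption because this reference does not state a required chunk size.

```javascript
// Converts Float32 samples in [-1, 1] to 16-bit little-endian PCM bytes.
function floatToPcmS16le(samples) {
  const out = Buffer.alloc(samples.length * 2);
  for (let i = 0; i < samples.length; i++) {
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    // Negative values scale by 32768 and positive values by 32767 so both
    // ends of the range map onto the int16 limits.
    const value = clamped < 0 ? clamped * 32768 : clamped * 32767;
    out.writeInt16LE(Math.round(value), i * 2);
  }
  return out;
}

// Splits PCM bytes into fixed-duration chunks aligned to whole sample frames.
// Assumption: 20 ms chunks; at 16000 Hz mono that is 320 samples, or 640 bytes.
function chunkPcm(pcm, chunkMs = 20, sampleRate = 16000, channels = 1) {
  const bytesPerFrame = channels * 2;
  const bytesPerChunk = Math.floor((sampleRate * chunkMs) / 1000) * bytesPerFrame;
  if (bytesPerChunk <= 0) throw new RangeError('chunk duration is too small');
  const chunks = [];
  for (let offset = 0; offset < pcm.length; offset += bytesPerChunk) {
    chunks.push(pcm.subarray(offset, offset + bytesPerChunk));
  }
  return chunks;
}

const pcm = floatToPcmS16le(new Float32Array(800)); // 50 ms of silence
console.log(chunkPcm(pcm).map((chunk) => chunk.length));
```

Each chunk would then be sent as one binary WebSocket frame.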