API Reference


Voice Translation WebSocket API

Overview

The Voice Translation WebSocket API provides real-time audio transcription and translation via a persistent WebSocket connection. It operates as a two-leg proxy architecture: the Client ↔︎ WSS Server leg handles authentication, session validation, and configuration enrichment, while the WSS Server ↔︎ Worker (Pipecat) leg carries the merged configuration and bidirectional audio/event traffic to the translation worker.

┌──────────┐   WSS (leg 1)   ┌────────────┐   WSS (leg 2)   ┌──────────────┐
│  Client  │ ◄─────────────► │ WSS Server │ ◄─────────────► │    Worker    │
│          │                 │  (proxy)   │                 │  (Pipecat)   │
└──────────┘                 └────────────┘                 └──────────────┘

Part 1 — Client ↔︎ WSS Server Public API

Endpoint

wss://streaming.krisp.ai/vt?authorization=api-key SESSION_KEY

Note that the literal space between api-key and the session key must be percent-encoded (%20) when the URL is constructed.

Session key generation

// Request a short-lived session key from the Krisp SDK API.
// expiration_ttl controls how long the session key stays valid.
const axios = require('axios');

const config = {
  method: 'get',
  url: 'https://sdkapi.krisp.ai/v2/sdk/voice-translation/session/token?expiration_ttl=100',
  headers: {
    'Authorization': 'api-key API_KEY'
  }
};

axios.request(config)
  .then((response) => {
    // Pass SESSION_KEY in the WebSocket URL's authorization query parameter.
    const SESSION_KEY = response.data.data.session_key;
  })
  .catch((error) => {
    console.error(error);
  });

Initial Client Message

Once the WebSocket connection is established, the client must send a single JSON configuration message as the first message, before sending any audio.

{
  "config": {
    "audio": {
      "format": "pcm_s16le",
      "sample_rate": 16000,
      "channels": 1
    },
    "source_language": "en-US",
    "target_language": "fr-FR",
    "voice": "male" | "female",
    "vocabulary": ["special", "domain", "terms"],
    "translation_dictionary": [
      { "source": "referral", "target": "référence" },
      { "source": "copay", "target": "quote-part" },
      { "source": "email", "target": "courriel" }
    ],
    "transcript": {
      "interim": true,
      "final": true,
      "translate": true
    },
    "features": {
      "background_voice_cancellation": true
    },
    "metadata": {
      "reference_id": "your-reference-id"
    }
  }
}
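A minimal sketch of opening the connection and sending the configuration as the first frame, assuming the Node `ws` package (any WebSocket client works); `buildInitialConfig` is a hypothetical helper, not part of the API:

```javascript
// Hypothetical helper that assembles the initial configuration message above.
function buildInitialConfig(sourceLanguage, targetLanguage) {
  return {
    config: {
      audio: { format: 'pcm_s16le', sample_rate: 16000, channels: 1 },
      source_language: sourceLanguage,
      target_language: targetLanguage,
      transcript: { interim: true, final: true, translate: true },
    },
  };
}

// Sketch of the connection sequence (requires the 'ws' package and a valid SESSION_KEY):
// const WebSocket = require('ws');
// const ws = new WebSocket(`wss://streaming.krisp.ai/vt?authorization=api-key%20${SESSION_KEY}`);
// ws.on('open', () => {
//   // The JSON config must be the first frame, before any audio.
//   ws.send(JSON.stringify(buildInitialConfig('en-US', 'fr-FR')));
// });
```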

Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| audio.format | string | No | pcm_s16le | Audio encoding format |
| audio.sample_rate | integer | No | 16000 | Audio sample rate in Hz |
| audio.channels | integer | No | 1 | Number of audio channels |
| source_language | string | Yes | - | Source language locale in BCP 47 format (e.g. en-US) |
| target_language | string | Yes | - | Target language locale in BCP 47 format (e.g. fr-FR) |
| voice | string |  | - | Voice ID for translated audio output |
| vocabulary | array of strings | No | - | Domain-specific or uncommon words for improved recognition accuracy |
| translation_dictionary | array of objects | No | - | Custom source→target term mappings for ambiguous or specialized vocabulary |
| transcript.interim | boolean | No | false | Whether to emit interim (in-progress) transcript events |
| transcript.final | boolean | No | true | Whether to emit final transcript events |
| transcript.translate | boolean | No | true | Whether to emit translated transcript events |
| features | json | No | - | Additional features |
| metadata | json | No | - | Client-supplied metadata |

Bidirectional Binary Audio

After the initial JSON configuration message, audio is exchanged as raw binary WebSocket frames in both directions.

| Direction | Description |
|---|---|
| Client → Server | Raw PCM audio chunks in the format declared in the audio.* fields |
| Server → Client | Synthesised translated speech as PCM audio, same encoding as the inbound stream |

Binary frames and JSON event frames coexist on the same WebSocket connection. The receiver distinguishes them by WebSocket frame opcode: 0x2 (binary) for audio, 0x1 (text) for JSON events.
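With the Node `ws` package (an assumption; any client that exposes a binary flag per message works), the two frame kinds can be separated in the message handler:

```javascript
// Classify an incoming WebSocket frame as raw audio or a JSON event.
// With the 'ws' package, the message handler receives (data, isBinary),
// where isBinary reflects the WebSocket frame opcode (0x2 vs 0x1).
function routeFrame(data, isBinary) {
  if (isBinary) {
    return { kind: 'audio', bytes: data };                        // raw PCM chunk
  }
  return { kind: 'event', event: JSON.parse(data.toString()) };   // JSON text frame
}

// Usage sketch (playbackQueue and handleEvent are illustrative):
// ws.on('message', (data, isBinary) => {
//   const frame = routeFrame(data, isBinary);
//   if (frame.kind === 'audio') playbackQueue.push(frame.bytes);
//   else handleEvent(frame.event);
// });
```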


Server Events (Client ↔︎ WSS Server)

The server emits JSON text frames over the WebSocket. Each event corresponds to one of the types below.

Transcript

{
  "transcript": {
    "text": "Hello, how are you?",
    "final": false,
    "start": "2026-03-25T19:24:45.370+00:00",
    "duration": 436,
    "utterance_id": "daslkndlkans",
    "reference_id": "askdnl"
  }
}

| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | Transcript text for this chunk |
| final | boolean | Yes | false for interim events |
| start | string (ISO 8601) | Yes | Chunk start timestamp (UTC) |
| duration | integer | Yes | Chunk duration in milliseconds |
| utterance_id | string | Yes | Links this event to the corresponding final transcript and translation |
| reference_id | string | No | Echoed client reference ID |

Translation

{
  "translate": {
    "text": "Hola, ¿cómo estás?",
    "final": true,
    "utterance_id": "daslkndlkans",
    "reference_id": "askdnl"
  }
}

| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | Translated text |
| final | boolean | Yes | false for interim events |
| utterance_id | string | Yes | Links this event to the corresponding transcript |
| reference_id | string | No | Echoed client reference ID |

Error

Emitted when the server encounters an issue processing the request.

{
  "error": {
    "code": 400,
    "reason": "Bad request",
    "description": "Actual description of the error",
    "reference_id": "askdnl"
  }
}

| Field | Type | Required | Description |
|---|---|---|---|
| code | integer | Yes | HTTP-style status code |
| reason | string | Yes | Short reason string |
| description | string | Yes | Detailed error description |
| reference_id | string | No | Echoed client reference ID |
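The three event shapes above can be handled with a small dispatcher keyed on the top-level field. The handler names here are illustrative, not part of the API:

```javascript
// Dispatch a parsed server event to the matching handler based on its
// top-level key: "transcript", "translate", or "error".
function dispatchEvent(event, handlers) {
  if (event.transcript) return handlers.onTranscript(event.transcript);
  if (event.translate)  return handlers.onTranslation(event.translate);
  if (event.error)      return handlers.onError(event.error);
  throw new Error(`Unknown event type: ${Object.keys(event)}`);
}
```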

Error Codes

| Code | Reason | Causes |
|---|---|---|
| 400 | Bad Request | Malformed request · Invalid audio format · Incorrect target language · Invalid voice |
| 401 | Unauthorized | Missing API key · Invalid API key · Expired API key |
| 402 | Payment Required | Balance exhausted · Subscription expired |
| 429 | Too Many Requests | Rate limit exceeded · Max concurrent connections exceeded |
| 500 | Internal Server Error | Server failed to process the request |
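How a client should react to 429 is not specified by the API; one common client-side policy is capped exponential backoff before reconnecting, sketched here:

```javascript
// Capped exponential backoff: delay doubles per attempt up to capMs.
// The retry policy is a client-side choice, not mandated by the API.
function backoffMs(attempt, baseMs = 500, capMs = 30000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Usage sketch: wait backoffMs(attempt) milliseconds after a 429 error
// event before opening a new WebSocket connection.
```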

Part 2 — WSS Server Internal Processing

When a client connects, the WSS Server performs the following steps before establishing the second WebSocket leg to the Worker.

Step 1 — Authentication

The API key resolved from the authorization query parameter supplied at connection time is validated. If the key is missing, invalid, or expired, the server emits a 401 error event and closes the WebSocket.

Step 2 — Session & Profile Check

Using the resolved API key, the server loads the associated account profile and verifies:

  • Whether new sessions are permitted (e.g. concurrent connection limit)
  • Whether the account has sufficient credit or an active subscription
  • Whether the requested language pair and voice are available under the account’s tier

If any check fails, an appropriate error event is returned (402, 429, etc.) and the connection is closed.

Step 3 — Configuration Enrichment

The server fetches internal backend configuration for the API key and merges it with the client-supplied request body. This enriched configuration is not exposed to the client; it is forwarded only to the Worker.
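The exact merge semantics are internal and not exposed; one plausible sketch, in which backend-resolved fields (providers, tier, account metadata) are layered over the client-supplied body, is:

```javascript
// Illustrative enrichment merge (the real internal logic may differ):
// backend-resolved fields take precedence, while metadata from both
// sides is combined so the client's reference_id survives alongside
// backend identifiers such as user_id and team_id.
function enrichConfig(clientBody, backendConfig) {
  return {
    ...clientBody,
    ...backendConfig,
    metadata: { ...backendConfig.metadata, ...clientBody.metadata },
  };
}
```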

Step 4 — Worker Connection (Leg 2)

The server opens a second WebSocket connection to the Pipecat Worker and sends the merged configuration (client body + enriched fields) as the initial message. Once the Worker acknowledges the connection, the WSS Server enters transparent proxy mode and routes all subsequent frames between the client and the Worker without modification.
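Assuming `ws`-style socket objects on both legs (an assumption; the server internals are not specified), the transparent proxy step might be wired like this:

```javascript
// Transparent proxy sketch: once leg 2 is acknowledged, forward every
// frame unchanged in both directions and tie the connections' lifetimes
// together. clientWs and workerWs are assumed to expose a 'ws'-style API:
// on('message', (data, isBinary) => ...) and send(data, { binary }).
function wireProxy(clientWs, workerWs) {
  clientWs.on('message', (data, isBinary) => workerWs.send(data, { binary: isBinary }));
  workerWs.on('message', (data, isBinary) => clientWs.send(data, { binary: isBinary }));
  clientWs.on('close', () => workerWs.close());
  workerWs.on('close', () => clientWs.close());
}
```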


Part 3 — WSS Server ↔︎ Worker (Pipecat)

The Worker is a Pipecat-based bot that accepts a WebSocket connection and performs the actual STT → translation → TTS pipeline.

Endpoint

Internal — not publicly accessible. Assigned by the WSS Server’s service discovery or configuration.

Initial Worker Message

The WSS Server sends the fully merged configuration as the first JSON frame:

{
  "request_id": "api-1775549100475-s09tq718yn8",
  "source_lang": "en-US",
  "target_lang": "ru-RU",
  "audio": {
    "format": "pcm_s16le",
    "sample_rate": 16000,
    "channels": 1
  },
  "tier": 2,
  "translator": {
    "provider": "soniox",
    "settings": null,
    "mode": "synchronous",
    "vocabulary": [
      "qqwqqwwq"
    ],
    "dictionary": {
      "referral": "référence",
      "copay": "quote-part",
      "email": "courriel"
    }
  },
  "stt": {
    "provider": "soniox",
    "settings": null,
    "interim": true,
    "final": true,
    "translate": true,
    "vocabulary": [
      "qqwqqwwq"
    ]
  },
  "tts": {
    "provider": "google",
    "settings": null,
    "voice": {
      "id": "ru-RU-Chirp3-HD-Aoede",
      "speed": 1,
      "settings": null
    }
  },
  "features": {
    "background_voice_cancellation": true
  },
  "metadata": {
    "user_id": 123456,
    "team_id": 654321,
    "product_line": "call_center",
    "reference_id": "client-reference-id"
  }
}

Bidirectional Binary Audio

Same framing as the Client ↔︎ WSS Server leg. The WSS Server forwards binary audio frames from the client to the Worker unchanged, and forwards synthesised audio frames from the Worker to the client unchanged.

Worker Events

The Worker emits the same JSON event schema as defined in Part 1 (Transcript, Translation, Error). The WSS Server forwards these events to the client transparently.


Full Session Flow

Client                        WSS Server                         Worker (Pipecat)
  │                               │                                     │
  │─── Connect (WSS) ────────────►│                                     │
  │    authorization=api-key key  │                                     │
  │                               │                                     │
  │─── Initial config (JSON) ────►│                                     │
  │                               ├─[1] Authenticate API key            │
  │                               ├─[2] Check session & credits         │
  │                               ├─[3] Fetch backend config            │
  │                               ├─[4] Merge client + backend config   │
  │                               │                                     │
  │                               │─── Connect (WSS, leg 2) ───────────►│
  │                               │─── Merged config (JSON) ───────────►│
  │                               │                                     │
  │─── Audio chunk (binary) ─────►│──── Audio chunk (binary) ──────────►│
  │◄── Interim transcript (JSON) ─│◄─── Interim transcript (JSON) ──────│
  │─── Audio chunk (binary) ─────►│──── Audio chunk (binary) ──────────►│
  │◄── Interim transcript (JSON) ─│◄─── Interim transcript (JSON) ──────│
  │                               │                                     │
  │              [end of utterance detected by Worker]                  │
  │◄── Final transcript (JSON) ───│◄─── Final transcript (JSON) ────────│
  │◄── Translation (JSON) ────────│◄─── Translation (JSON) ─────────────│
  │◄── Audio chunk (binary) ──────│◄─── Synthesised audio (binary) ─────│
  │                               │                                     │
  │              [on any error]                                         │
  │◄── Error (JSON) ──────────────│◄─── Error (JSON) ───────────────────│
  │                               │                                     │
  │─── Close ────────────────────►│─── Close ──────────────────────────►│
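Since a final transcript and its translation share an utterance_id, a client can pair them as they arrive. A minimal sketch (class and callback names are illustrative, not part of the API):

```javascript
// Pairs each final transcript with its translation via utterance_id,
// invoking onPair once both halves of an utterance have arrived.
class UtteranceAssembler {
  constructor(onPair) {
    this.pending = new Map();
    this.onPair = onPair;
  }
  _slot(id) {
    if (!this.pending.has(id)) this.pending.set(id, {});
    return this.pending.get(id);
  }
  addTranscript(t) {
    if (!t.final) return;                       // interim chunks are ignored here
    const slot = this._slot(t.utterance_id);
    slot.transcript = t.text;
    this._flush(t.utterance_id, slot);
  }
  addTranslation(t) {
    if (!t.final) return;
    const slot = this._slot(t.utterance_id);
    slot.translation = t.text;
    this._flush(t.utterance_id, slot);
  }
  _flush(id, slot) {
    if (slot.transcript !== undefined && slot.translation !== undefined) {
      this.pending.delete(id);
      this.onPair({ utterance_id: id, ...slot });
    }
  }
}
```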

Notes & Known Limitations

  • Audio formats — Only pcm_s16le is supported.
  • Sample rates & channels — Only 16000 Hz mono has been confirmed.