Krisp Audio SDK Model Selection Guide

SDK v9.0.0

Introduction

Krisp Audio native SDKs consist of two primary components - the API and the AI models.

  • The API takes the AI model as an input. A single instance of the model can simultaneously process one or many audio streams.
  • Each model is a file with a “kef” extension.
  • Conceptually, the audio is processed in the AI model, and the API interprets the model for audio processing.

The following diagram demonstrates audio processing through the Krisp Audio SDK.

*The input and output 48kHz sampling rate of the stream is configurable.
*The 32kHz working sampling of the model depends on the selected model.

  1. The audio is resampled to the operational sampling rate of the model.
  2. The audio is processed using the AI model at the operational sampling rate.
  3. The audio is resampled back to the original sampling rate.

Microphone vs Speaker Stream Use Cases

Krisp provides dedicated AI models to clean background noise and voices from the microphone (outbound/uplink) and speaker (inbound/downlink) audio streams.

Krisp provides multiple models. Each model is unique, and optimized for specific use cases, however, all models are optimized for the inbound (speaker) or outbound (microphone) use case.

The outbound (microphone) models should not be used on the inbound (speaker) streams, and the inbound models should not be used on the outbound streams as Krisp’s AI model training is optimized for different audio streams..

Outbound Models:

  • Trained to handle single-speaker
  • Designed to work with raw unmodified audio directly from the mic
  • Provide the market-best voice quality for the microphone stream

Inbound Models:

  • Trained to handle the challenges of multi-speaker scenarios with various background noises and acoustic environments
  • Designed to tolerate voice quality loss caused by lossy audio codecs
  • Designed to tolerate voice degradations caused by the noise cancelation effects with the conferencing apps

Outbound models are optimized for the case of a single main speaker and raw microphone inputs, and using them for multi-voice scenarios or on streams degraded by lossy network codecs may lead to suboptimal performance, including voice suppression. Conversely, using inbound models for outbound audio may not achieve the same level of performance as outbound models specifically designed for raw microphone inputs.

Multivoice Limitations

Krisp inbound tech is not designed for multiple simultaneous voices on the same microphone. Inbound models are trained to handle multiple voices, each voice with its own mic, acoustic environment, and network-bound degradation.


Outbound (Microphone) Noise Canceling Models

Standard NC Models

Krisp outbound NC bundle includes standard 3 NC models:

NC QualitySampling RateModel NameSizeRecommendations
High32kHzc8.f.s.0263005.8MBOutstanding quality with balanced CPU usage
Medium32kHzc6.f.s.ced1255.7MBUse on mobile and power-limited systems
Low16kHzc5.w.xs.c9ac8f2.8MBUse only when low CPU usage is an absolute priority

Model Size and Architecture Matters

Bigger models usually consume more CPU. However, varying CPU consumption can also result from differences in the models' underlying architecture. For example, high and medium-quality NC models have almost the same size, but the high-quality model consumes significantly more CPU due to its different architecture.

Dedicated NC 8kHz Model

Krisp standard outbound NC models are designed to work with sampling rates above 8kHz.
For sampling rates 8kHz and below we have an on-demand available NC solution not included in the standard bundle.

NC QualitySampling RateModel NameSizeRecommendations
Medium8kHzc5.n.s.20949d2.8MBShould be used with sampling rates up to 8kHz inclusive

Large Models

Large and CPU-demanding models are available for evaluation upon request. Check with your Krisp application engineer to access larger models. Krisp’s standard models are used across more than 150M users globally. Some partners require additional voice processing performance due to challenging noise distractors and SNR conditions.

Large vs. Standard NC Models

  • Large models provide the best noise cancellation quality for extreme noise and SNR conditions but consume more CPU resources.
  • Standard models are optimized for low CPU utilization and power efficiency, providing market-best noise cancellation quality across most use cases.
  • Large models generally respond faster to noise condition changes, minimizing noise leakage.

Outbound (Microphone) BVC Models

BVC models

The standard BVC model is designed to remove background noises and voices. The BVC model has built-in NC. You should never use two Krisp NC/BVC models on the same audio stream.

Krisp BVC technology has two primary requirements:

  • The sampling rate of the voice in the audio should be above 8kHz.
  • A headset or compatible device from the list should be used

Please use regular NC models if one of these requirements is unmet.

QualitySampling RateModel NameSizeRecommendations
High32kHzhs.c6.f.s.de56df14.7MBPreferred choice for headset and earbuds.
Medium16kHzhs.c6.w.s.087e358.6MBUse only when low CPU usage is an absolute priority.
Must be used with USB headsets or headsets with jack connectors only.
Used on power-limited thin clients in the call centers using headset.

Large BVC Models

Are available on demand. Contact your Krisp application engineer to access them for evaluation.




1. Sampling rates above 32KHz will be downsampled and will not be noticeable to the human ear
2. Not recommended because of over-aggressive noise canceling
3. Audio resolution loss happens when the audio input is downsampled to match the model's sampling rate
  1. Sampling rates above 32kHz will be downsampled and will not be noticeable by the human ear.
  2. Not recommended because of over-aggressive noise canceling.
  3. Audio resolution loss happens when the audio input is downsampled to match the model's sampling rate.

Inbound (Speaker Stream) Noise Canceling Models

Standard Inbound Models For Telephony

The inbound models are well-trained to tolerate:

  • Possible voice degradations caused by the network audio codecs.
  • Voice degradations caused by the noise cancelation effects with the conferencing apps.
  • Other voice-processing effects.
  • Challenges of the multi-speaker scenarios with various background noises and acoustic effects.
NC QualitySampling RateModel NameSizeRecommendations
Medium8kHzc7.n.s.9f43894.1MBThe model is designed specifically for call operations inbound use cases.
Performs best for sampling rates up to 8kHz inclusive
High16kHzc8.w.m.d266be13MBThe standard inbound model is optimized for inbound conference calls. Works at the wide band, preserving voice clarity. Supports the narrow band