Voice Activity Detection (VAD)

The Voice Activity Detection (VAD) algorithm predicts whether an audio frame contains speech. It can detect speech even in high-noise conditions. A common application is conditional processing of the input audio: when no speech is detected, the application bypasses Krisp NC and writes silence to the output, which reduces CPU usage.

Model specs

Model size: 391 KB
Frame size support: 10 ms
Sampling rate: 8 kHz*
CPU consumption: 1-2%**

* Audio streams with a higher sampling rate are downsampled to 8 kHz for processing.

** CPU consumption may vary depending on the platform.
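Because the model works on 10 ms frames, the number of samples in one frame depends on the sampling rate of the stream. The helper below is a minimal sketch of that arithmetic. `samplesPerFrame` is a hypothetical name used for illustration and is not part of the Krisp SDK. It is also an assumption that a per-call buffer size such as `IN_BUF_SIZE` in the sample on this page would be derived this way.

```cpp
#include <cstddef>

// Hypothetical helper (not part of the Krisp SDK): number of samples in one
// frame of the given duration at the given sampling rate.
constexpr std::size_t samplesPerFrame(std::size_t sampleRateHz, std::size_t frameMs) {
    return sampleRateHz * frameMs / 1000;
}

// A 10 ms frame holds 80 samples at 8 kHz and 480 samples at 48 kHz.
static_assert(samplesPerFrame(8000, 10) == 80, "10 ms at 8 kHz");
static_assert(samplesPerFrame(48000, 10) == 480, "10 ms at 48 kHz");
```

Making the helper `constexpr` lets the frame length be checked at compile time and used to size fixed buffers.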


// 0.4 is the Krisp-recommended threshold
const float threshold = 0.4f;
float vad_res = krispAudioVadFrameInt16(vadSessionID, &wavShortDataIn[i*IN_BUF_SIZE], static_cast<unsigned int>(IN_BUF_SIZE));
if (vad_res > threshold) {
	// Speech detected: run noise cancellation on this frame
	if (0 > krispAudioNcCleanAmbientNoiseInt16(ncSessionID, &wavShortDataIn[i*IN_BUF_SIZE], static_cast<unsigned int>(IN_BUF_SIZE),
				 &wavShortDataOut[i*OUT_BUF_SIZE], static_cast<unsigned int>(OUT_BUF_SIZE))) {
		std::cerr << "Error in DeNoise processing!" << std::endl;
	}
} else {
	// No speech detected: bypass NC and write silence to the output
	std::memset(&wavShortDataOut[i*OUT_BUF_SIZE], 0, OUT_BUF_SIZE * sizeof(short));
}

The krispAudioVadFrameInt16 function returns a probability in the range [0, 1], where a higher value indicates a higher likelihood that the audio frame contains voice.

The aggressiveness of the algorithm can be adjusted by setting the threshold to a value within this range. The recommended threshold is 0.4, but it should be fine-tuned for the specific use case. The higher the threshold, the lower the aggressiveness of the VAD algorithm, because fewer frames are classified as speech.
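To get a feel for how the threshold changes the outcome, the sketch below counts how many frames would be treated as speech for a given threshold. It applies the same `probability > threshold` rule as the sample above. `countSpeechFrames` is a hypothetical helper for illustration and is not part of the Krisp SDK. The probabilities passed to it would be values previously returned by krispAudioVadFrameInt16.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the Krisp SDK): counts the frames whose
// VAD probability is strictly above the threshold, i.e. the frames that
// would be treated as speech.
std::size_t countSpeechFrames(const std::vector<float>& probabilities, float threshold) {
    std::size_t count = 0;
    for (float p : probabilities) {
        if (p > threshold) {
            ++count;
        }
    }
    return count;
}
```

For example, with the made-up probabilities {0.05, 0.35, 0.45, 0.90}, a threshold of 0.3 classifies three frames as speech, 0.4 classifies two, and 0.5 classifies one. Running such a count over recorded probabilities from representative audio is one possible way to choose a threshold for a given use case.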

Note: To process float data, use the krispAudioVadFrameFloat function instead of krispAudioVadFrameInt16.