Voice Activity Detection (VAD)
The Voice Activity Detection (VAD) algorithm predicts whether an audio frame contains speech, and it can detect speech even in high-noise conditions. A common application is conditional processing of the input audio: when no speech is detected, the application bypasses Krisp NC and writes silence to the output, which reduces CPU usage.
Model specs
Parameter | Value |
---|---|
Model size | 391 KB |
Frame size support | 10 ms |
Sampling rate | 8 kHz* |
CPU consumption | 1-2%** |
* Audio streams with a higher sampling rate are downsampled to 8 kHz for processing.
** CPU consumption may vary depending on the platform.
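Because the VAD works on 10 ms frames, the number of samples passed per call depends on the sampling rate of the stream. The arithmetic can be sketched as follows; `samplesPerFrame` is a hypothetical helper, not part of the Krisp SDK, and it assumes mono audio with the frame expressed in samples at the stream's own rate.

```cpp
#include <cstddef>

// Hypothetical helper (not a Krisp SDK function).
// Returns the number of samples in one frame of `frameMs` milliseconds
// at `sampleRateHz`, assuming a single (mono) channel.
constexpr std::size_t samplesPerFrame(unsigned sampleRateHz, unsigned frameMs) {
    return static_cast<std::size_t>(sampleRateHz) * frameMs / 1000;
}
```

For example, `samplesPerFrame(8000, 10)` evaluates to 80 and `samplesPerFrame(16000, 10)` to 160, which is the kind of value a per-frame buffer size constant would hold.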
Integration
// Runs inside the per-frame loop over i
float vad_res = krispAudioVadFrameInt16(vadSessionID, &wavShortDataIn[i*IN_BUF_SIZE], static_cast<unsigned int>(IN_BUF_SIZE));
float threshold = 0.5f; // 0.5 is the Krisp-recommended threshold
if (vad_res > threshold) {
    // Speech detected: run noise cancellation on this frame
    if (0 > krispAudioNcCleanAmbientNoiseInt16(ncSessionID, &wavShortDataIn[i*IN_BUF_SIZE], static_cast<unsigned int>(IN_BUF_SIZE),
            &wavShortDataOut[i*OUT_BUF_SIZE], static_cast<unsigned int>(OUT_BUF_SIZE))) {
        std::cerr << "Error in DeNoise processing!" << std::endl;
        break;
    }
} else {
    // No speech: skip NC and write silence to the output frame
    std::memset(&(wavShortDataOut[i*OUT_BUF_SIZE]), 0, OUT_BUF_SIZE * sizeof(short));
}
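The snippet above is the body of a per-frame loop. To make the control flow easy to follow and to test without the SDK, the same gating pattern can be sketched with the two SDK calls replaced by injected callables. Everything named here (`VadFn`, `NcFn`, `gateFrames`) is hypothetical and not part of the Krisp SDK, and the sketch assumes that input and output frames have the same size.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-ins for the VAD and NC calls; not Krisp SDK types.
using VadFn = std::function<float(const short* frame, std::size_t n)>;
using NcFn  = std::function<bool(const short* in, short* out, std::size_t n)>;

// Applies the VAD gate to every whole frame of `in` and writes the result to `out`.
// Returns the number of frames sent to noise cancellation, or -1 if NC fails.
long gateFrames(const std::vector<short>& in, std::vector<short>& out,
                std::size_t frameSize, float threshold,
                const VadFn& vad, const NcFn& nc) {
    if (frameSize == 0) return 0;
    const std::size_t frames = in.size() / frameSize;
    out.assign(frames * frameSize, 0);  // every output frame starts as silence
    long processed = 0;
    for (std::size_t i = 0; i < frames; ++i) {
        const short* src = &in[i * frameSize];
        short* dst = &out[i * frameSize];
        if (vad(src, frameSize) > threshold) {
            // Speech detected: run noise cancellation on this frame
            if (!nc(src, dst, frameSize)) return -1;
            ++processed;
        }
        // Otherwise the frame stays zeroed, so NC is skipped entirely
    }
    return processed;
}
```

In a real integration, the two callables would wrap the VAD and NC session calls; passing them in keeps the loop independent of how sessions are created.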
The function krispAudioVadFrameInt16 outputs a probability value in the range [0, 1], where a higher value indicates a higher likelihood that voice is present in the audio frame.
The aggressiveness of the algorithm can be adjusted by setting the threshold to a value within this range. The recommended threshold is 0.5, and it should be fine-tuned for the specific use case: the higher the threshold, the lower the aggressiveness of the VAD algorithm, because fewer frames exceed the threshold and are treated as speech.
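One way to get a feel for the threshold is to count how many frames would be treated as speech at different settings. The sketch below does this over a list of per-frame probabilities; `countSpeechFrames` is a hypothetical helper, not part of the Krisp SDK, and the probabilities would come from the VAD output in a real application.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not a Krisp SDK function).
// Counts the per-frame probabilities that exceed `threshold`,
// i.e. the frames a `vad_res > threshold` check would treat as speech.
std::size_t countSpeechFrames(const std::vector<float>& probabilities, float threshold) {
    std::size_t count = 0;
    for (float p : probabilities) {
        if (p > threshold) ++count;
    }
    return count;
}
```

With the made-up probabilities {0.1, 0.4, 0.6, 0.9}, thresholds of 0.3, 0.5, and 0.8 classify 3, 2, and 1 frames as speech respectively, which shows how raising the threshold narrows what counts as voice.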
❕Note: To process float data, use the krispAudioVadFrameFloat function instead of krispAudioVadFrameInt16.