Turn-Taking

The VIVA Turn-Taking (TT) model is integrated into the server audio pipeline to predict the end of a turn in a human-to-bot conversation. It is designed to replace traditional end-of-turn prediction based on voice activity detection (VAD) and to provide higher accuracy.

The TT model is very small and uses only audio input for end-of-turn prediction. Audio-based approaches rely on analyzing acoustic and prosodic features of speech, including changes in pitch, energy levels, intonation, pauses, and speaking rate. By detecting silence or overlapping speech, the system predicts when the user has finished speaking and when it is safe to respond. For example, a sudden drop in energy followed by a pause can be interpreted as a turn-ending cue. Such models are effective in real-time, low-latency scenarios where response timing is critical.
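To make the "drop in energy followed by a pause" cue concrete, here is a minimal sketch of a hand-written heuristic. It is not the TT model, which learns its cues from data; the function names, the silence threshold, and the minimum pause length are all illustrative assumptions.

```python
import math

def frame_rms(samples, frame_len):
    """Split samples into non-overlapping frames and return the RMS energy of each."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return energies

def find_pause_after_speech(energies, silence_threshold=0.01, min_silent_frames=3):
    """Return the index of the first frame of a silent run that follows speech.

    A run counts only if it lasts at least `min_silent_frames` frames.
    Both thresholds are assumed values chosen for illustration.
    """
    seen_speech = False
    silent_run = 0
    for i, energy in enumerate(energies):
        if energy >= silence_threshold:
            seen_speech = True
            silent_run = 0
        elif seen_speech:
            silent_run += 1
            if silent_run >= min_silent_frames:
                return i - min_silent_frames + 1
    return None

# Usage: five frames of a 200 Hz tone followed by four frames of silence,
# at an assumed 16 kHz sample rate with 100 ms frames.
sample_rate = 16000
frame_len = sample_rate // 10
tone = [0.5 * math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(5 * frame_len)]
audio = tone + [0.0] * (4 * frame_len)
pause_start = find_pause_after_speech(frame_rms(audio, frame_len))
```

A fixed energy threshold like this is fragile in noisy conditions, which is one reason a learned model that combines several prosodic cues can be more accurate.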

Check the following blog posts for some technical details:

Currently the model supports only English. Support for more languages will follow soon.

The model operates on 100 ms frames and outputs a prediction score for each frame.
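The sketch below illustrates one way a caller might consume per-frame scores. It is not the VIVA API: the helper names, the assumption that scores lie in [0, 1] with higher values meaning "end of turn", the 0.5 threshold, and the two-frame confirmation rule are all hypothetical.

```python
from typing import Iterable, Iterator, Optional, Sequence

FRAME_MS = 100  # frame duration stated in the text

def split_into_frames(samples: Sequence[float], sample_rate: int) -> Iterator[Sequence[float]]:
    """Yield consecutive, non-overlapping 100 ms frames; a trailing partial frame is dropped."""
    frame_len = sample_rate * FRAME_MS // 1000
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        yield samples[start:start + frame_len]

def first_end_of_turn(
    scores: Iterable[float],
    threshold: float = 0.5,     # assumed decision threshold
    min_consecutive: int = 2,   # assumed number of confirming frames
) -> Optional[int]:
    """Return the index of the frame at which end-of-turn is confirmed, or None."""
    run = 0
    for i, score in enumerate(scores):
        run = run + 1 if score >= threshold else 0
        if run >= min_consecutive:
            return i
    return None

# Usage with made-up scores, one per 100 ms frame.
scores = [0.1, 0.2, 0.7, 0.3, 0.8, 0.9, 0.95]
frame_index = first_end_of_turn(scores)
if frame_index is not None:
    # Time of the end of the confirming frame, in milliseconds.
    end_of_turn_ms = (frame_index + 1) * FRAME_MS
```

Requiring more than one consecutive frame above the threshold trades latency (an extra 100 ms per confirming frame) for fewer false triggers on an isolated high score.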

For best performance, run the TT model on the output of VIVA Voice Isolation.

Here is an example showing how to use TT in production.