Krisp Turn-Taking v2

Technology

This model uses only audio input to predict End-of-Turn in real-time conversational systems such as human-bot interaction. It improves on krisp-viva-tt-v1 across all test cases; it was trained on a better-structured dataset with richer data augmentations.

Testing Results

The main advantages of the new krisp-viva-tt-v2 Krisp Turn-Taking model over the previous version are:

  • Improved results with noisy audio
  • Better accuracy when paired with Krisp background voice and noise removal models

Testing: No Noise

The table below presents evaluation results on ~1800 audio samples extracted from real conversations. The dataset includes ~1000 hold cases and ~800 shift cases, with a mild level of background noise. Although the numerical results show only a small difference between the two versions on this dataset, the accompanying graph indicates that the mean shift prediction time has improved at the same false positive rate.

Model               Balanced accuracy   AUC     F1 Score
krisp-viva-tt-v1    0.82                0.89    0.804
krisp-viva-tt-v2    0.823               0.904   0.813
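For readers who want to reproduce the three metrics in the table on their own data, here is a minimal sketch in plain Python. It assumes that shift is the positive class (label 1), hold is the negative class (label 0), and that the model outputs a score thresholded at a hypothetical 0.5; the source states none of these, and the toy labels and scores are made up for illustration.

```python
def confusion_counts(labels, preds):
    """Return (tp, fp, tn, fn), treating label 1 (assumed: shift) as positive."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    return tp, fp, tn, fn


def balanced_accuracy(labels, preds):
    """Mean of the recall on shift samples and the recall on hold samples."""
    tp, fp, tn, fn = confusion_counts(labels, preds)
    return (tp / (tp + fn) + tn / (tn + fp)) / 2


def f1_score(labels, preds):
    """Harmonic mean of precision and recall for the positive class."""
    tp, fp, tn, fn = confusion_counts(labels, preds)
    return 2 * tp / (2 * tp + fp + fn)


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only (not Krisp test data).
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
THRESHOLD = 0.5  # hypothetical decision threshold
preds = [1 if s >= THRESHOLD else 0 for s in scores]

print(balanced_accuracy(labels, preds), roc_auc(labels, scores), f1_score(labels, preds))
```

Balanced accuracy is a natural fit here because the hold and shift classes are not the same size (~1000 versus ~800 samples); it weights both classes equally regardless of their counts.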


Another evaluation was conducted on the noisy mixes dataset at noise levels of 5 dB, 10 dB, and 15 dB. Two testing scenarios were considered:

  • Directly on the noisy mixes dataset
  • On the same dataset, after processing through the Krisp BVC Inbound model

In both cases, the results demonstrate the advantage of krisp-viva-tt-v2 compared to krisp-viva-tt-v1.
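As an illustration of how noisy mixes at fixed levels can be built, the sketch below scales a noise signal so that the mix has a target signal-to-noise ratio. It assumes that the 5 dB, 10 dB, and 15 dB levels are SNR values, which the source does not state explicitly; the function name `mix_at_snr` and the toy signals are hypothetical and do not describe Krisp's actual data pipeline.

```python
import math


def mean_power(samples):
    """Average squared amplitude of a sequence of samples."""
    return sum(v * v for v in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so that 10*log10(P_speech / P_noise) == snr_db."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[: len(speech)]
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("speech and noise must both have non-zero power")
    # Gain that brings the noise power to p_speech / 10^(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy signals, invented for illustration: a 200 Hz tone and a deterministic "noise".
SAMPLE_RATE = 16000  # assumed sample rate
speech = [math.sin(2 * math.pi * 200 * t / SAMPLE_RATE) for t in range(1600)]
noise = [((t * 7919) % 101) / 50.0 - 1.0 for t in range(1600)]

mixes = {snr: mix_at_snr(speech, noise, snr) for snr in (5, 10, 15)}
```

A lower SNR value means louder noise relative to speech, so under this assumption the 5 dB mixes are the hardest condition of the three.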

Testing: Using Noisy Dataset

Model               Balanced accuracy   AUC     F1 Score
krisp-viva-tt-v1    0.723               0.799   0.71
krisp-viva-tt-v2    0.768               0.842   0.757

Testing: Using Noisy Dataset After Removing Background Voices And Noises

Testing on the noisy mixes dataset, after applying the krisp-viva-tel-v2 background noise and voice removal model.

Model               Balanced accuracy   AUC     F1 Score
krisp-viva-tt-v1    0.787               0.854   0.775
krisp-viva-tt-v2    0.816               0.885   0.808
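To compare the three test conditions side by side, the short sketch below computes the absolute gain of krisp-viva-tt-v2 over krisp-viva-tt-v1 for each metric. The numbers are copied from the tables in this document; the dictionary layout and the condition names are only an illustrative choice.

```python
METRICS = ("balanced accuracy", "AUC", "F1")

# (balanced accuracy, AUC, F1) per model, copied from the tables above.
RESULTS = {
    "mild noise": {"v1": (0.82, 0.89, 0.804), "v2": (0.823, 0.904, 0.813)},
    "noisy mixes": {"v1": (0.723, 0.799, 0.71), "v2": (0.768, 0.842, 0.757)},
    "noisy mixes, after noise and voice removal": {
        "v1": (0.787, 0.854, 0.775),
        "v2": (0.816, 0.885, 0.808),
    },
}


def gains(results):
    """Absolute v2 - v1 difference per metric, rounded to three decimals."""
    return {
        condition: tuple(round(v2 - v1, 3) for v1, v2 in zip(r["v1"], r["v2"]))
        for condition, r in results.items()
    }


for condition, deltas in gains(RESULTS).items():
    summary = ", ".join(f"{m} +{d:.3f}" for m, d in zip(METRICS, deltas))
    print(f"{condition}: {summary}")
```

On these figures the gain is largest on the unprocessed noisy mixes (roughly 0.04 to 0.05 per metric) and smallest on the mildly noisy conversational set, which is consistent with the stated advantage on noisy audio.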