BEST-STD 2.0: Balanced and Efficient Speech Tokenizer for Spoken Term Detection
Abstract
Fast and accurate spoken content retrieval is vital for applications such as voice search. Query-by-Example Spoken Term Detection (STD) retrieves matching segments from an audio database given a spoken query. Token-based STD systems, which use discrete speech representations, enable efficient search but struggle with robustness to noise and reverberation and suffer from inefficient token utilization. We address these challenges with a noise- and reverberation-augmented training strategy that improves tokenizer robustness. In addition, we introduce optimal transport-based regularization to ensure balanced token usage and enhance token efficiency. To further speed up retrieval, we adopt a TF-IDF-based search mechanism. Empirical evaluations demonstrate that the proposed method outperforms STD baselines across various distortion levels while maintaining high search efficiency.
Summary
This paper introduces BEST-STD 2.0, an improved speech tokenizer for spoken term detection (STD). It addresses the limitations of existing token-based STD systems, which struggle with robustness to noise and reverberation and exhibit inefficient token utilization. BEST-STD 2.0 tackles these challenges by incorporating a noise- and reverberation-augmented training strategy to enhance tokenizer robustness and an optimal transport-based regularization to promote balanced token usage. A TF-IDF-based search mechanism is also adopted for faster retrieval.

The methodology trains a bidirectional Mamba encoder within a self-supervised learning framework. The framework uses DTW alignment between clean and noisy utterances to create anchor-positive pairs and combines a contrastive loss, a commitment loss, and a robust consistency loss. Optimal transport regularizes the codebook learning process, preventing codebook collapse and ensuring balanced token utilization.

Performance is evaluated on the LibriSpeech and TIMIT datasets under various noise and reverberation conditions, using Mean Term Weighted Value (MTWV) and Jaccard similarity as metrics. The key finding is that BEST-STD 2.0 outperforms existing STD baselines across acoustic conditions while maintaining high search efficiency. This matters to the field because it provides a more robust and efficient solution for spoken content retrieval, enabling text-like search over raw speech in noisy real-world environments.
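The paper does not spell out its optimal-transport formulation here, so the sketch below shows one plausible realization: a SwAV-style Sinkhorn-Knopp iteration that softly assigns frame embeddings to codewords under a uniform-usage constraint, followed by the normalized-entropy diagnostic used to detect codebook collapse. Function names and hyperparameters (`eps`, `n_iters`) are illustrative assumptions, not the authors' implementation.

```python
import math
import torch
import torch.nn.functional as F

def sinkhorn_assignments(logits, n_iters=3, eps=0.05):
    """Balanced soft assignment of B frame embeddings to K codewords via
    Sinkhorn-Knopp normalization of a Gibbs kernel (SwAV-style OT).
    logits: (B, K) similarities between frames and codebook entries."""
    Q = torch.exp(logits / eps)
    Q = Q / Q.sum()                                # joint transport plan
    B, K = Q.shape
    for _ in range(n_iters):
        Q = Q / (K * Q.sum(dim=0, keepdim=True))   # enforce uniform codeword usage
        Q = Q / (B * Q.sum(dim=1, keepdim=True))   # enforce one unit of mass per frame
    return Q * B                                   # each row is a soft assignment

def normalized_codebook_entropy(assignments):
    """Entropy of the marginal token-usage distribution divided by log(K);
    values near 1 indicate balanced, collapse-free codebook usage."""
    usage = assignments.mean(dim=0)
    usage = usage / usage.sum()
    entropy = -(usage * (usage + 1e-12).log()).sum()
    return (entropy / math.log(usage.numel())).item()

# Toy usage: 256 unit-norm frame embeddings, codebook of 1024 entries
frames = F.normalize(torch.randn(256, 64), dim=1)
codebook = F.normalize(torch.randn(1024, 64), dim=1)
Q = sinkhorn_assignments(frames @ codebook.T)
print(f"normalized entropy: {normalized_codebook_entropy(Q):.3f}")  # ~1.0
```

Embeddings are L2-normalized before computing similarities so the Gibbs kernel stays numerically stable at small `eps`; the uniform column constraint is what pushes the normalized entropy toward 1.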
Key Insights
- Novel noise- and reverberation-augmented training strategy significantly improves the robustness of speech tokens, as evidenced by higher Jaccard similarity scores than baselines under noisy and reverberant conditions (e.g., 0.78 vs. 0.64 for a Transformer-based baseline under combined noise and reverberation; see the Jaccard sketch after this list).
- Optimal transport-based regularization effectively prevents codebook collapse, achieving a normalized entropy close to 1 for codebook sizes from 1024 to 4096, indicating near-perfect balance in token usage.
- The TF-IDF-based retrieval strategy provides a roughly 3x speedup in retrieval latency over BEST-STD, reducing the average retrieval time for top-10 matches to ~1.2 seconds (see the TF-IDF sketch after this list).
- The bidirectional Mamba encoder outperforms a Transformer-based encoder in noisy and reverberant settings, attributed to its more effective temporal modeling.
- BEST-STD 2.0 demonstrates strong performance on out-of-vocabulary (OOV) terms, indicating that the generated tokens are compositional and generalize beyond the seen vocabulary.
- The system surpasses WavLM-based approaches in noisy environments, even though WavLM is explicitly trained for noise robustness, indicating a more effective tokenization strategy.
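As a reference for the Jaccard scores quoted in the first insight, here is a minimal sketch of Jaccard similarity between two tokenized utterances. The paper may compute the metric over aligned token sequences or n-grams rather than plain sets; the set formulation and the toy token IDs here are assumptions.

```python
def token_jaccard(tokens_a, tokens_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between the token sets of two
    utterances. A robust tokenizer should map clean and distorted versions
    of the same speech to near-identical token sets (score close to 1)."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if (a or b) else 1.0

# e.g., tokens of a clean utterance vs. its noisy counterpart
print(token_jaccard([5, 17, 17, 42, 9], [5, 17, 42, 42, 3]))  # 0.6
```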
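The retrieval speedup comes from treating tokenized utterances as text documents and searching them with TF-IDF. Below is a minimal sketch assuming unigram token "terms" and smoothed IDF, as in common TF-IDF implementations; the paper's actual index structure (e.g., n-gram terms or an inverted index) may differ, and all names are illustrative.

```python
from collections import Counter
import math

def build_tfidf_index(docs):
    """docs: dict mapping utterance ID -> list of discrete token IDs.
    Returns per-document TF-IDF vectors and the shared IDF table."""
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))                     # document frequency per token
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}  # smoothed IDF
    index = {
        doc_id: {t: (c / len(tokens)) * idf[t] for t, c in Counter(tokens).items()}
        for doc_id, tokens in docs.items()
    }
    return index, idf

def search(query_tokens, index, idf, top_k=10):
    """Rank documents by dot product between the query's TF-IDF vector and
    each indexed document's vector; return the top-k (id, score) pairs."""
    q_tf = Counter(query_tokens)
    q = {t: (c / len(query_tokens)) * idf.get(t, 0.0) for t, c in q_tf.items()}
    scores = {d: sum(w * vec.get(t, 0.0) for t, w in q.items())
              for d, vec in index.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]

# Toy corpus: two tokenized utterances
docs = {"utt1": [12, 7, 7, 30], "utt2": [7, 81, 81, 30]}
index, idf = build_tfidf_index(docs)
print(search([7, 30], index, idf))  # utt1 ranks first: more mass on token 7
```

Because scoring reduces to sparse vector products over a shared vocabulary, a query never has to be aligned frame-by-frame against the database, which is consistent with the reported ~3x latency reduction over DTW-style matching.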
Practical Implications
- The research has direct applications in voice search, audio archiving, and other spoken content retrieval systems, particularly in noisy environments.
- Practitioners and engineers can use BEST-STD 2.0 to build more robust and efficient spoken term detection systems, leveraging the provided noise-augmented training framework and optimal transport regularization.
- The findings suggest the potential for further research into alternative encoder architectures and loss functions to improve token robustness and discriminability.
- Future work could explore the application of BEST-STD 2.0 to other speech processing tasks, such as speech recognition and speaker identification.
- The code and models have been made available (https://github.com/anupsingh15/BEST-STD2.0), facilitating reproducibility and adoption by the research community.