MauBERT: Universal Phonetic Inductive Biases for Few-Shot Acoustic Units Discovery

Dec 22, 2025 · 8:12
Computation and Language · eess.AS

Abstract

This paper introduces MauBERT, a multilingual extension of HuBERT that leverages articulatory features for robust cross-lingual phonetic representation learning. We continue HuBERT pre-training with supervision based on a phonetic-to-articulatory feature mapping in 55 languages. Our models learn from multilingual data to predict articulatory features or phones, resulting in language-independent representations that capture multilingual phonetic properties. Through comprehensive ABX discriminability testing, we show MauBERT models produce more context-invariant representations than state-of-the-art multilingual self-supervised learning models. Additionally, the models effectively adapt to unseen languages and casual speech with minimal self-supervised fine-tuning (10 hours of speech). This establishes an effective approach for instilling linguistic inductive biases in self-supervised speech models.
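The abstract's central evaluation is ABX phone discriminability. As a rough illustration, the sketch below computes an ABX error rate over (A, B, X) triplets using mean-pooled embeddings and cosine distance; the paper's actual ABX setup (e.g., frame-wise distances and the ZeroSpeech tooling) may differ, and the function names and toy data are purely illustrative.

import numpy as np

def cosine_distance(u, v):
    # 1 - cosine similarity between two pooled representations
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)

def abx_error_rate(triplets):
    # Each triplet is (A, B, X): A and X share the phone category, B differs.
    # An error is counted when X is at least as close to B as to A
    # (ties counted as errors for simplicity).
    errors = sum(
        1 for a, b, x in triplets
        if cosine_distance(a, x) >= cosine_distance(b, x)
    )
    return errors / len(triplets)

# Toy usage: 2-D vectors standing in for mean-pooled frame representations.
rng = np.random.default_rng(0)
same = rng.normal(loc=[1.0, 0.0], scale=0.1, size=(10, 2))   # phone class of A and X
other = rng.normal(loc=[0.0, 1.0], scale=0.1, size=(10, 2))  # contrasting class of B
triplets = [(same[i], other[i], same[(i + 1) % 10]) for i in range(10)]
print(f"ABX error rate: {abx_error_rate(triplets):.3f}")     # close to 0.0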

Summary

The paper addresses the challenge of discovering acoustic units in low-resource languages, where usable data is scarce. The authors propose MauBERT, a multilingual extension of HuBERT that leverages articulatory features for robust cross-lingual phonetic representation learning. They continue pre-training HuBERT with supervision based on a phonetic-to-articulatory feature mapping across 55 languages, so the models learn to predict articulatory features or phones and produce language-independent representations. Evaluated with ABX discriminability testing, MauBERT yields more context-invariant representations than state-of-the-art multilingual self-supervised learning models, and it adapts to unseen languages and casual speech with minimal self-supervised fine-tuning (10 hours of speech), establishing an effective approach for instilling linguistic inductive biases in self-supervised speech models.

The key contributions are twofold. First, the work demonstrates that multilingual supervised fine-tuning of HuBERT for articulatory feature or phone prediction creates robust multilingual phonetic representations with strong zero-shot transfer capabilities. Second, the resulting models adapt effectively to unseen languages and casual speech with minimal self-supervised fine-tuning, achieving strong speaker and contextual invariance in new languages with only 10 hours of unlabelled data. As a byproduct, the method also yields candidate phoneme and feature sets for unseen languages, with potential applications in linguistic analyses of low-resource languages. This is significant because it offers a practical route to speech technologies for languages with limited resources, benefiting linguists and speech technologists working on low-resource or unwritten languages.
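To make the supervision signal concrete, here is a minimal sketch of the kind of phonetic-to-articulatory feature mapping described above: each phone in a frame-level alignment is converted into a binary feature vector that a model like MauBERT-FEAT could be trained to predict. The feature set, phone symbols, and helper names are illustrative assumptions, not the paper's actual inventory or mapping.

# Illustrative phone -> articulatory-feature table (hypothetical feature set;
# the paper's actual inventory and feature dimensions may differ).
FEATURES = ["voiced", "nasal", "labial", "coronal", "dorsal", "high", "low", "round"]

PHONE_TO_FEATS = {
    "p": {"labial"},
    "b": {"voiced", "labial"},
    "m": {"voiced", "nasal", "labial"},
    "t": {"coronal"},
    "d": {"voiced", "coronal"},
    "k": {"dorsal"},
    "i": {"voiced", "high"},
    "u": {"voiced", "high", "round"},
    "a": {"voiced", "low"},
}

def feature_vector(phone: str) -> list[int]:
    # Binary articulatory-feature vector for one phone.
    active = PHONE_TO_FEATS[phone]
    return [int(f in active) for f in FEATURES]

def frame_targets(aligned_phones: list[str]) -> list[list[int]]:
    # Turn a frame-level phone alignment into per-frame feature targets,
    # the kind of supervision the feature-prediction variant is trained on.
    return [feature_vector(p) for p in aligned_phones]

# Toy usage: a 6-frame alignment of the syllable /ba/.
print(frame_targets(["b", "b", "b", "a", "a", "a"]))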

Key Insights

  • MauBERT models trained to predict articulatory features (MauBERT-FEAT) and phones (MauBERT-PHONE) exhibit superior performance on training languages, particularly for phone-level metrics (Table 1).
  • When transitioning from training to development languages, phone accuracy drops by 15-21%, while feature accuracy is more resilient (3-4% degradation), suggesting articulatory features provide a more stable cross-lingual representation (Table 1).
  • Supervised fine-tuning with Masked Phone Recognition (MPR) significantly reduces ABX error rates compared to standard phone recognition (PR), cutting the development-language error rate from 5.22% (zero-shot baseline) to 3.07% (Table 3).
  • Self-supervised fine-tuning with phone frequency-based clustering demonstrates gains over standard K-means clustering, particularly in phoneme-level discrimination tasks and longer temporal contexts (Table 3).
  • In zero-shot mode, MauBERT models perform slightly better than multilingual baselines on read speech but fall behind them on casual speech; self-supervised fine-tuning recovers competitive performance on casual speech (Table 4).
  • The frequency distribution of articulatory feature vectors produced by MauBERT-FEAT can be used to discover the phonetic inventories of previously unseen languages, with an optimized frequency threshold achieving a precision of 0.778-0.872 and a recall of 0.532-0.810 (Table 7); a minimal sketch of this thresholding idea follows the list.
  • Results on the Zero Resource Speech Challenge 2017 show that the low-resource language Wolof achieves error rates comparable to high-resource languages after fine-tuning, indicating robust few-shot adaptation (Figure 2).
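As referenced in the inventory-discovery insight above, a minimal sketch of the thresholding idea: count how often each decoded feature vector occurs in an unseen language, keep vectors above a relative-frequency threshold as the candidate inventory, and score it against a reference. The data format, threshold value, and function names are illustrative assumptions rather than the paper's implementation.

from collections import Counter

def discover_inventory(frame_feature_vectors, threshold=0.01):
    # Keep only feature vectors whose relative frequency exceeds the threshold,
    # treating rare vectors as decoding noise rather than genuine units.
    counts = Counter(frame_feature_vectors)
    total = sum(counts.values())
    return {vec for vec, n in counts.items() if n / total >= threshold}

def precision_recall(candidates, reference):
    # Compare the proposed inventory against a reference inventory.
    true_pos = len(candidates & reference)
    precision = true_pos / len(candidates) if candidates else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    return precision, recall

# Toy usage with made-up 3-dimensional binary vectors standing in for decoded frames.
frames = [(1, 0, 0)] * 500 + [(0, 1, 0)] * 300 + [(0, 0, 1)] * 4  # last vector is rare noise
reference = {(1, 0, 0), (0, 1, 0), (1, 1, 0)}
candidates = discover_inventory(frames, threshold=0.01)
print(candidates)                               # {(1, 0, 0), (0, 1, 0)}
print(precision_recall(candidates, reference))  # (1.0, 0.666...)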

Practical Implications

  • The MauBERT framework can be used to develop speech recognition systems for low-resource languages, requiring minimal labeled data for fine-tuning.
  • Linguists can use the frequency-based methodology to generate initial phonetic hypotheses for endangered languages, guiding subsequent detailed analysis.
  • Speech technologists can leverage the cross-lingual transfer capabilities of MauBERT to create multilingual speech applications.
  • Future research can focus on optimizing data selection strategies for fine-tuning, potentially focusing on phonetically diverse datasets.
  • Future research could extend the self-supervised fine-tuning beyond the encoder to encompass the entire MauBERT architecture, enabling end-to-end adaptation and improved performance.

Links & Resources

Authors