Poster: Recognizing Hidden-in-the-Ear Private Key for Reliable Silent Speech Interface Using Multi-Task Learning
Abstract
Silent speech interfaces (SSIs) enable hands-free input without audible vocalization, but most SSI systems do not verify speaker identity. We present HEar-ID, which uses consumer active noise-canceling earbuds to capture low-frequency "whisper" audio and high-frequency ultrasonic reflections from the ear canal. Features from both streams pass through a shared encoder, producing embeddings that feed a contrastive branch for user authentication and an SSI head for silent spelling recognition. This design decodes a 50-word vocabulary while reliably rejecting impostors, all on commodity earbuds with a single model. Experiments with 11 participants show 90.25% Top-1 word recognition accuracy for 8 of them and a 3.2% false positive rate for authentication.
Summary
The paper addresses the problem of speaker identity verification in silent speech interfaces (SSI). Existing SSI systems often lack robust authentication mechanisms, leaving them vulnerable to impersonation. The authors introduce HEar-ID, a multi-task learning framework that leverages commodity active noise-canceling earbuds to perform silent spelling recognition and user authentication simultaneously. HEar-ID captures both low-frequency whisper audio and high-frequency ultrasonic reflections from the ear canal, extracting mel-spectrograms from the whisper audio and autoregressive (AR) coefficients from the ultrasonic reflections. These features are processed by a shared encoder and then branched into a contrastive learning module for authentication and an SSI head for spelling recognition. A contrastive learning objective (CLWUM) creates a "private-key" embedding space in which the genuine user's whisper and ultrasonic embeddings are aligned while impostor embeddings are repelled; an authentication head then computes cosine similarity to make the verification decision. The system is trained end-to-end with a multi-task objective that weights the contrastive, authentication, and CTC-based spelling losses. Experiments with 11 participants show that HEar-ID achieves 90.25% Top-1 word recognition accuracy for 8 of the participants, along with robust authentication performance at a 3.2% False Positive Rate (FPR). The work contributes a practical, secure SSI solution by integrating user authentication directly into silent speech recognition on readily available hardware.
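For readers who want the training setup in code, the sketch below mirrors the pipeline just described: a shared encoder over fused whisper/ultrasonic features, a CTC spelling head, an identity-embedding head scored by cosine similarity against an enrolled template, and a weighted multi-task loss. Module names, feature dimensions, the margin-style contrastive term, and the loss weights are illustrative assumptions, not the authors' implementation; the paper's CLWUM objective and exact architecture are not reproduced here.

```python
# Minimal sketch of the shared-encoder / two-head training setup (PyTorch).
# Dimensions, loss forms, and weights are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedEncoder(nn.Module):
    """Fuses whisper and ultrasonic feature streams into one embedding sequence."""

    def __init__(self, whisper_dim=64, ultra_dim=12, hidden=128):
        super().__init__()
        self.proj = nn.Linear(whisper_dim + ultra_dim, hidden)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, whisper_feats, ultra_feats):
        # Both streams assumed frame-aligned: (batch, time, dim)
        x = torch.cat([whisper_feats, ultra_feats], dim=-1)
        x = F.relu(self.proj(x))
        out, _ = self.gru(x)
        return out  # (batch, time, hidden)


class HEarIDModel(nn.Module):
    def __init__(self, hidden=128, num_chars=28, embed_dim=64):
        super().__init__()
        self.encoder = SharedEncoder(hidden=hidden)
        self.ssi_head = nn.Linear(hidden, num_chars)   # per-frame character logits (CTC)
        self.auth_head = nn.Linear(hidden, embed_dim)  # utterance-level identity embedding

    def forward(self, whisper_feats, ultra_feats):
        h = self.encoder(whisper_feats, ultra_feats)
        ctc_log_probs = self.ssi_head(h).log_softmax(-1)               # (B, T, C)
        identity = F.normalize(self.auth_head(h.mean(dim=1)), dim=-1)  # (B, E)
        return ctc_log_probs, identity


def multitask_loss(ctc_log_probs, identity, targets, input_lens, target_lens,
                   enrolled, is_genuine, w_ctc=1.0, w_con=1.0, w_auth=1.0):
    """Weighted sum of CTC spelling, contrastive, and authentication losses."""
    # CTC expects (T, B, C) log-probabilities.
    ctc = F.ctc_loss(ctc_log_probs.transpose(0, 1), targets, input_lens, target_lens)
    # Cosine similarity between each utterance embedding and the enrolled template.
    sim = (identity @ F.normalize(enrolled, dim=-1).t()).squeeze(-1)   # (B,)
    # Authentication: treat similarity as a logit for genuine-vs-impostor.
    auth = F.binary_cross_entropy_with_logits(sim, is_genuine.float())
    # Contrastive "private-key" term (simplified margin form): pull genuine
    # embeddings toward the template, push impostors below a margin.
    margin = 0.5
    con = torch.where(is_genuine.bool(), 1.0 - sim, F.relu(sim - margin)).mean()
    return w_ctc * ctc + w_con * con + w_auth * auth
```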
Key Insights
- Multi-Task Learning for SSI and Authentication: The core insight is that silent speech recognition and speaker authentication are correlated tasks, since both depend on the unique acoustic properties of the wearer's ear canal. Multi-task learning lets the system share representations across both tasks, improving overall performance and robustness.
- Whisper-Ultrasonic Fusion: Combining whisper audio and ultrasonic reflections provides complementary information, improving both spelling accuracy and authentication reliability. The system extracts autoregressive (AR) coefficients from the ultrasonic band (17.5-23 kHz) and mel-spectrograms from the whisper band (0-11 kHz); a minimal feature-extraction sketch follows this list.
- Contrastive Learning (CLWUM) for Identity Encoding: The contrastive module builds a "private-key" embedding space in which the genuine user's whisper-ultrasonic pairs are pulled together and impostor embeddings are pushed apart, a novel approach to user authentication in SSI (a loss sketch also follows this list).
- Performance Metrics: For 8 of the 11 participants, the system achieved 90.25% Top-1 word recognition accuracy over the 50-word lexicon. Average authentication performance was a TPR of 81.76% at an FPR of 3.2%.
- Hardware Simplicity: The system relies on commodity active noise-canceling earbuds, making it practical and accessible; the authors use Edifier W380NB earbuds in their experiments.
- Limitations: Performance varies across participants; some users (S5, S9, S10) show lower recognition accuracy and/or TPR, potentially due to unclear articulation or inconsistent sensor placement. The lexicon is currently limited to 50 words.
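As a concrete illustration of the whisper-ultrasonic fusion bullet above, the following sketch extracts a mel-spectrogram from the low-frequency whisper band (0-11 kHz) and per-frame AR (LPC-style) coefficients from the ultrasonic band (17.5-23 kHz). Only the band edges come from the paper; the sampling rate, filter design, frame sizes, and AR order are assumptions.

```python
# Illustrative two-stream feature extraction (band edges from the paper;
# sampling rate, filter design, frame sizes, and AR order are assumptions).
import numpy as np
import librosa
from scipy import signal

SR = 48000  # assumed sampling rate, high enough to cover the ultrasonic band


def whisper_features(audio, sr=SR):
    """Mel-spectrogram of the low-frequency whisper band (0-11 kHz)."""
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=1024, hop_length=256,
        n_mels=64, fmin=0, fmax=11000)
    return librosa.power_to_db(mel).T  # (frames, 64)


def ultrasonic_features(audio, sr=SR, order=12, frame=1024, hop=256):
    """Per-frame autoregressive (AR/LPC) coefficients of the 17.5-23 kHz band."""
    sos = signal.butter(8, [17500, 23000], btype="bandpass", fs=sr, output="sos")
    band = signal.sosfiltfilt(sos, audio)
    frames = librosa.util.frame(band, frame_length=frame, hop_length=hop).T
    # librosa.lpc returns order+1 coefficients (leading 1); drop the constant.
    return np.stack([librosa.lpc(f, order=order)[1:] for f in frames])  # (frames, order)
```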
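The "private-key" contrastive idea and the cosine-similarity verification step can be sketched as follows. A symmetric InfoNCE-style loss over paired whisper/ultrasonic embeddings is one plausible reading of CLWUM; the exact formulation, temperature, and decision threshold below are assumptions, not the published objective.

```python
# Sketch of a contrastive "private-key" objective over paired whisper/ultrasonic
# embeddings, plus cosine-similarity verification. Formulation, temperature, and
# threshold are assumptions; the paper's CLWUM objective may differ in detail.
import torch
import torch.nn.functional as F


def private_key_contrastive_loss(whisper_emb, ultra_emb, temperature=0.1):
    """Align each user's whisper embedding with their own ultrasonic embedding
    and repel the other users' embeddings in the batch (one sample per user)."""
    w = F.normalize(whisper_emb, dim=-1)                 # (B, E)
    u = F.normalize(ultra_emb, dim=-1)                   # (B, E)
    logits = (w @ u.t()) / temperature                   # (B, B) scaled cosine similarities
    targets = torch.arange(w.size(0), device=w.device)   # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def verify(query_emb, enrolled_emb, threshold=0.7):
    """Accept the wearer if cosine similarity to the enrolled template passes a threshold."""
    score = F.cosine_similarity(query_emb, enrolled_emb, dim=-1)
    return score >= threshold, score
```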
Practical Implications
- Secure Silent Speech Interfaces: HEar-ID provides a foundation for developing more secure SSI systems for applications where privacy and authentication are critical, such as dictating sensitive information in public or controlling secure devices.
- Hands-Free Authentication: The technology can be used for hands-free authentication in various scenarios, such as unlocking devices, accessing secure areas, or authorizing transactions, especially in environments where audible speech is undesirable or impossible.
- Earable Computing Applications: This research demonstrates the potential of earable devices for biometric authentication and silent communication, opening the door to integrating these functionalities into existing and future earbud designs.
- Future Research Directions: Future work should focus on expanding the lexicon, improving robustness to variations in articulation and sensor placement, and exploring continuous verification methods. The authors also mention leveraging generative models to create synthetic training data.
- Potential Beneficiaries: This research would benefit developers of SSI systems, manufacturers of earable devices, and users who require secure and private communication methods.