A Silent Speech Interface: Using Mouth-Based Sensors for Non-Vocal Communication

From Nikipedia

Originally written 2025. Substantially rewritten June 2026 to reflect a year in which the core algorithmic obstacles to silent speech were largely overcome — and to reposition this proposal around the gaps that remain.

Abstract:

This white paper proposes a direction in assistive and ambient human-computer interaction: a silent speech interface using intraoral sensors — interpreting silently mouthed words from inside the mouth, with no audible speech. The concept originated over 15 years ago from personal experience living in Japan, where speaking aloud in public spaces is socially discouraged, and was reawakened by Augmental's MouthPad.

The 2025 version of this paper treated silent speech decoding as an unsolved research problem. As of mid-2026, that framing is out of date in the best possible way: the hardest algorithmic walls it named — dataset scarcity, speaker-dependence, and latency/error — have been cleared in the literature. What remains unsolved is narrower and more ownable: the intraoral modality is still missing from every dataset, and the social-silence use case (as opposed to the medical one) is unclaimed. This revision reframes the proposal accordingly.

1. Introduction:

Voice-based interfaces are now ubiquitous, yet they still exclude users who cannot speak — for medical, situational, or cultural reasons. In Japan, speaking aloud on public transportation or in quiet public spaces is socially frowned upon. A silent speech interface captures and interprets tongue, lip, and internal oral muscle movement to generate spoken or written output without audible speech.

The thesis of this paper has not changed. What changed is the world around it. In 2025–2026 the field crossed from "promising lab demos" into "funded companies and usable accuracy." The argument below is no longer whether this is feasible — it is where the remaining leverage sits for someone not racing the large labs.

2. What Changed Since 2025:

The 2025 paper listed four constraints. Three of them now have research answers:

  • Dataset acquisition for silent mouthed speech (the hardest one). Stanford's MONA LISA (2024→2026) solved this sideways. Rather than collect a massive silent-mouthing corpus, it trains a shared latent space so that ordinary audio datasets (e.g. LibriSpeech) improve a silent decoder via cross-modal alignment. You no longer need an ocean of silent data; you borrow from the ocean of existing speech audio. This is the single most important shift: it cut open-vocabulary word error rate (WER) from 28.8% to 12.2%, the first time noninvasive silent speech passed the ~15% usability threshold.
  • Variability in oral anatomy / speaker-independence. Meta's ctrl-labs surface-EMG wristband demonstrated calibration-free operation across ~6,500 subjects, the first neuromotor interface that works on a brand-new person with zero individual training. An architecture that achieves cross-person generalization now exists and should, in principle, transfer to face/jaw/throat signals.
  • Latency and misinterpretation. The fix is the LLM layer. MONA LISA's "LISA" stage uses a large language model to re-score decoder hypotheses — the same trick that made phone dictation usable — driving vocal-EMG error as low as 3.7% WER. Sentence-level EMG/EEG sensor fusion with a language model now exists in wearable form.
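
The re-scoring idea in the last bullet can be sketched in a few lines. This is a minimal illustration, not MONA LISA's actual implementation: the toy unigram table stands in for a real LLM, and the names `TOY_LM`, `toy_lm_logprob`, `rescore`, and the `lm_weight` parameter are all hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for an LLM: per-word log-probabilities.
TOY_LM = {"i": -1.0, "want": -2.0, "water": -2.5, "wand": -6.0, "walter": -7.0}
UNKNOWN_LOGPROB = -10.0  # assumed penalty for out-of-vocabulary words


def toy_lm_logprob(sentence: str) -> float:
    """Sum per-word log-probabilities under the toy model."""
    return sum(TOY_LM.get(w, UNKNOWN_LOGPROB) for w in sentence.lower().split())


def rescore(
    nbest: List[Tuple[str, float]],
    lm_logprob: Callable[[str], float],
    lm_weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Combine decoder and language-model scores; return best hypothesis first."""
    scored = [(text, dec + lm_weight * lm_logprob(text)) for text, dec in nbest]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Decoder output: (hypothesis, decoder log-score). The decoder alone prefers "wand".
nbest = [("i wand water", -3.0), ("i want water", -3.4), ("i want walter", -3.2)]
best_text, best_score = rescore(nbest, toy_lm_logprob)[0]
print(best_text)  # i want water
```

The decoder's top choice is acoustically (here, articulatorily) plausible but linguistically unlikely; the language-model term promotes the fluent alternative. In a real system the weight would be tuned on held-out data.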

The fourth constraint — hygiene and long-term intraoral wear comfort — remains genuinely unsolved for the intraoral form factor specifically. That is no longer a footnote. It is now the main thing separating this proposal (palate/tongue sensing) from the ear-, jaw-, and neck-worn devices that captured the breakthroughs above.

What actually shipped or spun out

  • AlterEgo (MIT) spun out as a for-profit company in early 2025, led by original inventor Arnav Kapur. The new consumer unit is an ear-and-jawline wearable resembling a hearing aid; the bulky lab rig is gone. It was demonstrated decoding both mouthed words and words the user merely intended to mouth, and was presented at the Axios AI+ Summit in September 2025. This is the most direct embodiment of this paper's vision, and it is now funded.
  • Augmental's "MouthPad Whisper" — adding a microphone and sensors to capture near-silent and eventually fully silent speech — is in development, exactly the intraoral trajectory proposed here. Notably, Augmental's shipping product and public FAQ describe only cursor control (tongue-as-mouse); the silent-speech layer is the next cycle, not the current one. The wheelchair/cursor capability is the minimum viable accessibility product, not the ceiling.
  • A 2026 hardware result combined microneedle-array electrodes with mandible-coupled strain sensing to reach 8.5% WER on a 1,396-word vocabulary (>90% accuracy on common words).
  • The field is now large enough to have its own 2026 systematic review: Silent Speech Interfaces in the Era of Large Language Models.
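
Several results in this section are quoted as word error rate (WER). For readers unfamiliar with the metric, the sketch below computes it in the standard way, as word-level edit distance divided by reference length. The function names and the example sentences are illustrative assumptions; the 28.8% and 12.2% figures are the ones quoted earlier in this section.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which an error rate dropped, relative to its starting value."""
    return (before - after) / before


# One substitution ("the" -> "a") and one deletion ("now") over 5 reference words.
wer = word_error_rate("please send the message now", "please send a message")
print(wer)  # 0.4

# The 28.8% -> 12.2% change quoted above, as a relative reduction.
print(round(relative_reduction(28.8, 12.2), 3))  # 0.576
```

Read this way, the drop from 28.8% to 12.2% removes a little over half of the errors.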

Takeaway: the decoder problem is being solved by organizations with more compute than any independent effort can match (Stanford, Meta, the MIT spinout). Competing on "a better silent-speech model" is a losing game. The opportunity is elsewhere.

3. Technical Approach:

The proposed intraoral system would integrate:

  • Pressure-sensitive sensors on the palate and teeth
  • Inertial sensors (IMUs) to detect subtle jaw and tongue movement
  • Real-time sensor fusion to map articulatory gestures to phonemes
  • Cross-modal training (per MONA LISA) so that existing speech-audio corpora bootstrap the decoder, rather than requiring a large bespoke silent-mouthing dataset
  • An LLM re-scoring stage to convert noisy phoneme hypotheses into fluent text
  • A calibration-free model architecture (per Meta's sEMG work) so the device works on a new mouth without per-user training

Crucially, none of these six components needs to be invented here. The three modeling techniques are published, and intraoral sensing hardware already exists. The work is integration plus the one missing input.
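
As one illustration of the sensor-fusion item in the list above, the sketch below performs simple early fusion: it averages each pressure channel and each IMU axis over fixed windows and concatenates the results into one feature vector per window. It assumes both streams share a sample rate and are already time-aligned, which a real device would have to guarantee. The function names, channel counts, and sample values are hypothetical.

```python
from statistics import fmean
from typing import List, Sequence


def window_means(samples: Sequence[Sequence[float]], window: int) -> List[List[float]]:
    """Average each channel over non-overlapping windows of `window` samples."""
    if window <= 0:
        raise ValueError("window must be positive")
    frames = []
    for start in range(0, len(samples) - window + 1, window):
        chunk = samples[start:start + window]
        # zip(*chunk) regroups the chunk by channel.
        frames.append([fmean(channel) for channel in zip(*chunk)])
    return frames


def fuse(
    pressure: Sequence[Sequence[float]],
    imu: Sequence[Sequence[float]],
    window: int,
) -> List[List[float]]:
    """Early fusion: concatenate per-window pressure and IMU features."""
    p_frames = window_means(pressure, window)
    m_frames = window_means(imu, window)
    count = min(len(p_frames), len(m_frames))  # drop any unmatched tail
    return [p_frames[k] + m_frames[k] for k in range(count)]


# Four time steps: two palate pressure channels and three IMU axes (made-up values).
pressure = [[0.0, 1.0], [0.2, 1.0], [0.8, 0.0], [1.0, 0.0]]
imu = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0], [0.1, 0.0, 1.0], [0.3, 0.0, 1.0]]
frames = fuse(pressure, imu, window=2)
print(len(frames), len(frames[0]))  # 2 5
```

Each fused frame would then feed a phoneme classifier. Window means are the simplest possible feature; a practical system would likely add derivatives or learned features.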

4. Potential Applications:

  • Assistive Communication: for users with ALS, tracheostomies, or lung impairments (the well-funded, reimbursement-driven market that every competitor already targets)
  • Private Interaction: silent texting, dictation, or silent queries to an AI assistant in public spaces — the under-served market
  • Covert Use Cases: scenarios where soundless communication is required

5. The Three Open Doors:

Given that the algorithm race is lost to the large labs, three contributions remain genuinely available to an independent effort:

Door 1 — The intraoral data gap nobody is filling

Every breakthrough above used face/neck/jaw EMG, lip video, or wrist sEMG. Almost none used intraoral (tongue-to-palate contact) sensing. Because MONA LISA's cross-modal method means you need only a modest amount of in-domain data to bootstrap a decoder, a small public intraoral silent-speech dataset is now both achievable and missing from every taxonomy. A few hundred mouthed phrases captured on an off-the-shelf palatal-contact or tongue-position rig, released openly, would be cited — because it is the one modality nobody has published. This is the highest-leverage build contribution.
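
To make the dataset idea concrete, here is one possible record layout for such a corpus, serialized as JSON Lines. Everything in it is an assumption for illustration: the field names, the binary contact-frame representation, and the `MouthedPhraseRecord` class do not come from any existing dataset or device.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class MouthedPhraseRecord:
    """One silently mouthed phrase captured on a palatal-contact rig (hypothetical schema)."""
    speaker_id: str
    phrase: str
    sample_rate_hz: int
    contact_frames: List[List[int]]  # one row per time step, one 0/1 value per electrode
    language: str = "en"

    def validate(self) -> None:
        if self.sample_rate_hz <= 0:
            raise ValueError("sample_rate_hz must be positive")
        if len({len(frame) for frame in self.contact_frames}) > 1:
            raise ValueError("all frames must have the same electrode count")


def to_jsonl(records: List[MouthedPhraseRecord]) -> str:
    """Serialize validated records, one JSON object per line."""
    lines = []
    for record in records:
        record.validate()
        lines.append(json.dumps(asdict(record), sort_keys=True))
    return "\n".join(lines)


def from_jsonl(text: str) -> List[MouthedPhraseRecord]:
    """Parse records back from JSON Lines text."""
    return [MouthedPhraseRecord(**json.loads(line)) for line in text.splitlines() if line.strip()]


record = MouthedPhraseRecord(
    speaker_id="s01",
    phrase="next station please",
    sample_rate_hz=100,
    contact_frames=[[0, 1, 1, 0], [1, 1, 0, 0]],
)
restored = from_jsonl(to_jsonl([record]))
print(restored[0].phrase)  # next station please
```

A flat, line-oriented format keeps a few hundred phrases easy to inspect, diff, and load without special tooling, which matters for a small openly released corpus.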

Door 2 — The "Japan problem" is the actual wedge, and it is unclaimed

This idea did not originate from disability. It originated from social-context silence — Japanese public transit. Every company in this space markets to the medical/ALS case, because that is where reimbursement lives. Nobody owns the positioning of discreet input for able-bodied users in quiet, public, or social contexts — silent texting on a train, silent AI queries in a meeting, hands-and-voice-free dictation. AlterEgo gestures at "a crowded cafe" but leads with the medical story. The cultural-ergonomics framing — why silence-as-default matters, what the social-acceptability thresholds are, which contexts demand it — is white space. It is writing and positioning work, not a compute race.

Door 3 — Integration, not invention

The pieces now exist as separate published parts: cross-modal training (Stanford), calibration-free architecture (Meta), LLM re-scoring (MONA LISA), intraoral hardware (Augmental). Nobody has assembled the full stack for the intraoral case. A credible integration spec — "here is how to combine these four results into a working intraoral silent-speech system, and here is the single missing dataset" — is a legitimate research-design contribution requiring no novel component.

6. Feasibility and Constraints (2026):

  • Solved-enough: open-vocabulary decoding, speaker-independence, latency/error — via cross-modal training, calibration-free architectures, and LLM re-scoring.
  • Still open and intraoral-specific: hygiene and long-term wear comfort; variability of oral anatomy as it affects sensor seating (distinct from the model-level speaker-independence now solved); and the absence of any public intraoral dataset.
  • Strategic: the defensible position is the missing modality and the unclaimed market — not a better decoder.

7. Social and Cultural Impact:

If successful, such a system could normalize silent, mouth-only interaction with machines — useful not only for people with disabilities but for anyone seeking discreet, non-disruptive communication. The idea originated from living in Japan, where voice input feels socially intrusive in certain settings. That origin is now the strategic point, not a biographical aside: the social-silence use case is the part of this field that remains genuinely open.

8. Next Steps (revised):

  1. Pick one door. Door 2 (the cultural-ergonomics positioning essay) costs nothing but writing and is unclaimed. Door 1 (a tiny open intraoral dataset) has the highest impact if the goal is to build.
  2. Collect a small, openly released intraoral silent-speech dataset using off-the-shelf palate-sensor hardware.
  3. Draft the integration spec combining cross-modal training + calibration-free architecture + LLM re-scoring for the intraoral case.
  4. Collaborate with linguists and speech therapists to map articulatory gestures to semantic output — still valuable, now in service of the dataset rather than from scratch.

Conclusion:

The 2025 thesis was right, which is why the field overtook it. Silent speech is no longer waiting on a decoder breakthrough — that arrived. It is waiting on the intraoral modality and on someone willing to articulate why quiet is a feature for everyone, not just a workaround for the voiceless. Those two gaps are the contribution now.

Acknowledgments:

We acknowledge the work of Tomás Vega and the Augmental team for pioneering intraoral interaction; Arnav Kapur and the AlterEgo team; the Stanford HAI authors of MONA LISA; and the communities of silent speech decoding, wearable HCI, and neuro-linguistics for laying — and in 2025–2026, dramatically advancing — the groundwork for this vision.

Future Research Directions:

  • A public intraoral silent-speech corpus (the missing modality)
  • Cross-modal bootstrapping of intraoral decoders from speech-audio datasets
  • Calibration-free intraoral models that work on a new mouth without per-user training
  • Sublingual EMG or photoplethysmography for enhanced articulation capture
  • Multilingual mouthing models and AI-driven contextual inference
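
The cross-modal bootstrapping direction listed above rests on pulling matched silent and audio embeddings together in a shared space. The sketch below shows a generic InfoNCE-style contrastive loss that rewards this. It is an illustration of the general technique, not a reproduction of MONA LISA's training objective, and the two-dimensional embeddings are made-up values.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(
    silent: List[Sequence[float]],
    audio: List[Sequence[float]],
    temperature: float = 0.1,
) -> float:
    """Mean contrastive loss where silent[i] and audio[i] are a matched pair."""
    total = 0.0
    for i, s in enumerate(silent):
        logits = [cosine(s, a) / temperature for a in audio]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(silent)


silent = [[1.0, 0.0], [0.0, 1.0]]
audio_aligned = [[0.9, 0.1], [0.1, 0.9]]   # each audio vector near its silent partner
audio_swapped = [[0.1, 0.9], [0.9, 0.1]]   # partners deliberately mismatched

aligned_loss = info_nce(silent, audio_aligned)
swapped_loss = info_nce(silent, audio_swapped)
print(aligned_loss < swapped_loss)  # True
```

Minimizing a loss of this kind during training is what lets abundant audio data shape the space a silent decoder operates in.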

See also

Sources (2026 update)