
NEO 2: AI Dubbing and Lip Sync

Cross-lingual lip-sync that matches mouth movements to any spoken language. A single forward pass handles phoneme mapping, jaw dynamics, and facial muscle coordination.


Side-by-side: original English speech (left) and AI-dubbed Mandarin with matched lip-sync (right).

Abstract

Dubbing video into other languages has historically required either re-recording with a native speaker or accepting the uncanny mismatch between audio and lip movements that comes with simple voice-over. Neither approach scales. The first destroys the original performance. The second looks wrong and audiences notice immediately.

NEO 2 takes a different approach. Given a video of a person speaking and a target language audio track, the system maps phonemes from the target language onto the speaker's existing facial movements. It modifies jaw position, lip shape, and tongue visibility frame by frame while preserving the speaker's identity, skin texture, and surrounding scene. The output is a photorealistic video where the person appears to natively speak the target language.

The entire pipeline runs in a single forward pass. There is no iterative refinement, no manual correction, no post-processing. The model handles the phoneme-to-viseme mapping, facial deformation, and temporal smoothing in one step.
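
To make that contract concrete, here is a minimal sketch of what a single-call interface could look like. Every name is hypothetical (NEO 2's API has not been published), and the stage bodies are identity stubs so the sketch runs; the four stages correspond to the pipeline described under "How it works" below.

```python
import numpy as np

# Hypothetical single-call interface: none of these names come from NEO 2's
# unpublished API, and the stage bodies are stubs standing in for the model.
def analyze_audio(audio, sr):
    return [("stub", 0.0, 0.1)]                  # timed phonemes (symbol, start, end)

def map_to_visemes(phonemes):
    return [np.zeros(4) for _ in phonemes]       # articulatory target per phoneme

def render_mouth(frames, visemes):
    return frames                                # stand-in for the diffusion renderer

def smooth_temporal(frames):
    return frames                                # stand-in for the consistency module

def dub(frames: np.ndarray, audio: np.ndarray, sr: int) -> np.ndarray:
    """One call in, one rendered video out: no iteration, no cleanup passes."""
    phonemes = analyze_audio(audio, sr)          # stage 1: audio analysis
    visemes = map_to_visemes(phonemes)           # stage 2: facial mapping
    rendered = render_mouth(frames, visemes)     # stage 3: neural rendering
    return smooth_temporal(rendered)             # stage 4: temporal smoothing

video = dub(np.zeros((24, 128, 128, 3)), np.zeros(16_000), 16_000)
```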

30+ languages tested
< 2 s per-segment inference
4.2 MOS (mean opinion score)

How it works

The pipeline has four stages. They run sequentially, but the full chain executes as a single inference call.

1. Audio analysis

The target language audio is decomposed into a phoneme sequence with timing information. We extract prosodic features (pitch, rhythm, emphasis) to preserve the emotional tone of the original delivery.
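
A minimal sketch of this stage, with two stated assumptions: the timed phoneme tuples would come from a forced aligner (hard-coded stand-ins below), and pitch and energy are computed with plain autocorrelation and RMS rather than whatever feature extractor the model actually uses.

```python
import numpy as np

def prosodic_features(wave: np.ndarray, sr: int, hop: int = 512, win: int = 1024):
    """Frame-level pitch (crude autocorrelation) and RMS energy."""
    pitches, energies = [], []
    for start in range(0, len(wave) - win, hop):
        frame = wave[start:start + win]
        energies.append(float(np.sqrt(np.mean(frame ** 2))))
        # Pick the strongest autocorrelation lag in the 50-400 Hz range.
        ac = np.correlate(frame, frame, mode="full")[win - 1:]
        lo, hi = sr // 400, sr // 50
        lag = lo + int(np.argmax(ac[lo:hi]))
        pitches.append(sr / lag if ac[lag] > 0 else 0.0)
    return np.array(pitches), np.array(energies)

# Timed phonemes (symbol, start_s, end_s) would come from a forced aligner;
# these are hard-coded stand-ins.
phonemes = [("HH", 0.00, 0.08), ("EH", 0.08, 0.21), ("L", 0.21, 0.30), ("OW", 0.30, 0.52)]

sr = 16_000
wave = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)  # one second of a toy 220 Hz tone
pitch, energy = prosodic_features(wave, sr)           # ~220 Hz in every voiced frame
```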

2. Facial mapping

Each phoneme maps to a set of viseme targets: specific positions for the jaw, lips, tongue, and cheeks. The mapping accounts for coarticulation, where the shape of one sound is influenced by the sounds before and after it.
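
One way to picture this stage is a lookup from phoneme to a small articulatory parameter vector, with coarticulation as a blend toward neighboring targets. The parameter set, values, and blend weight below are all illustrative, not the model's actual representation.

```python
import numpy as np

# Toy viseme targets: (jaw_open, lip_round, lip_spread, tongue_visible), each in [0, 1].
VISEMES = {
    "AA": np.array([0.9, 0.1, 0.3, 0.0]),  # open vowel
    "UW": np.array([0.3, 0.9, 0.0, 0.0]),  # rounded vowel
    "M":  np.array([0.0, 0.2, 0.1, 0.0]),  # bilabial closure
    "TH": np.array([0.2, 0.0, 0.3, 0.8]),  # tongue between teeth
}

def coarticulated_targets(phonemes: list[str], spread: float = 0.25) -> np.ndarray:
    """Blend each viseme with its neighbors so adjacent sounds shape each other."""
    raw = np.stack([VISEMES[p] for p in phonemes])
    out = raw.copy()
    for i in range(len(raw)):
        if i > 0:
            out[i] += spread * (raw[i - 1] - raw[i])   # carryover from previous sound
        if i < len(raw) - 1:
            out[i] += spread * (raw[i + 1] - raw[i])   # anticipation of next sound
    return out

targets = coarticulated_targets(["M", "AA", "UW", "TH"])
```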

3. Neural rendering

A conditional diffusion model generates new mouth-region pixels for each frame. The model is conditioned on the original face, the target viseme, and the surrounding context. It preserves skin texture, lighting, and the speaker's identity while modifying only the articulatory region.
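
In diffusion terms, the mouth region is sampled by iteratively denoising from noise while the viseme target and the surrounding face pixels act as conditioning at every step. A schematic DDPM-style loop; eps_model is a stand-in for the trained denoiser (here a trivial function so the snippet runs), and nothing about the real network's conditioning mechanism is implied.

```python
import numpy as np

rng = np.random.default_rng(0)

def eps_model(x_t, t, viseme, context):
    """Stand-in for the trained denoiser: predicts the noise in x_t.
    The real network would attend over `context` (surrounding face pixels)
    and `viseme` (articulatory target); this stub only mimics the shapes."""
    return 0.1 * x_t + 0.01 * viseme.mean() + 0.01 * context.mean()

def sample_mouth_region(viseme, context, shape=(64, 64, 3), steps=50):
    betas = np.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)                      # start from pure noise
    for t in reversed(range(steps)):
        eps = eps_model(x, t, viseme, context)          # conditioned noise estimate
        # Standard DDPM posterior-mean update.
        x = (x - betas[t] / np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

viseme = np.array([0.9, 0.1, 0.3, 0.0])                 # e.g. an open-vowel target
context = rng.standard_normal((128, 128, 3))            # surrounding face crop
mouth = sample_mouth_region(viseme, context)
```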

4. Temporal smoothing

A lightweight temporal network enforces frame-to-frame consistency. It removes flickering, smooths transitions between visemes, and ensures the modified region blends seamlessly with the unmodified face across the full video sequence.
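
The interface of this stage can be approximated with a plain temporal filter over the generated mouth crops. The real module is learned; a centered moving average is used below purely to show the shape of the operation.

```python
import numpy as np

def smooth_frames(mouth_crops: np.ndarray, window: int = 3) -> np.ndarray:
    """Suppress frame-to-frame flicker with a centered moving average.
    mouth_crops: (T, H, W, 3) generated mouth regions for T frames."""
    T = mouth_crops.shape[0]
    out = np.empty_like(mouth_crops, dtype=np.float64)
    half = window // 2
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        out[t] = mouth_crops[lo:hi].mean(axis=0)     # average over the local window
    return out

crops = np.random.default_rng(1).random((30, 64, 64, 3))  # 30 toy frames
stable = smooth_frames(crops)
```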

The challenge of cross-lingual lip-sync

Languages differ not just in sound but in how they use the face. Mandarin relies on tonal variation with relatively constrained lip movement. Arabic uses pharyngeal consonants that engage the throat and back of the mouth in ways English never does. Spanish vowels require more pronounced lip rounding than their English equivalents.

Previous approaches to AI dubbing fell into two categories. The first modified the audio but left the face untouched, producing a visible mismatch between what you hear and what you see. The second warped the entire face to match new audio, often destroying the speaker's identity or producing artifacts around the jawline.

NEO 2 takes a third path. It operates only on the articulatory region (lips, jaw, tongue, cheeks) while leaving everything else pixel-identical to the original. The key insight is that viseme targets can be predicted directly from phoneme sequences without needing an intermediate 3D face model.
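
The pixel-identical guarantee outside the mouth falls naturally out of masked compositing: generated pixels are written back only where an articulatory-region mask is nonzero, with a feathered edge to hide the seam. A sketch using a hand-built elliptical mask as a stand-in for the model's learned spatial attention.

```python
import numpy as np

def composite(original: np.ndarray, generated: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Blend generated pixels into the original only where mask > 0.
    mask is float in [0, 1]; feathered edges avoid a visible seam."""
    m = mask[..., None]                     # broadcast over the RGB channels
    return (1 - m) * original + m * generated

def elliptical_mask(h, w, cy, cx, ry, rx, feather=6.0):
    ys, xs = np.mgrid[0:h, 0:w]
    d = np.sqrt(((ys - cy) / ry) ** 2 + ((xs - cx) / rx) ** 2)
    return np.clip((1.0 - d) * feather, 0.0, 1.0)   # 1 inside, soft falloff at edge

rng = np.random.default_rng(2)
frame = rng.random((256, 256, 3))
new_mouth = rng.random((256, 256, 3))
mask = elliptical_mask(256, 256, cy=190, cx=128, ry=30, rx=48)
result = composite(frame, new_mouth, mask)
# Wherever mask == 0, result is identical to the original frame.
```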


Viseme comparison across four language families. The same speaker's mouth region adapts to language-specific phoneme requirements while preserving identity.

Architecture

The system is built around a conditional diffusion model trained on paired multilingual speech data. Given a source frame and a target phoneme, the model generates a modified mouth region that blends seamlessly with the untouched portions of the face.

Three components work together. A phoneme encoder converts the target language audio into a dense sequence of articulatory features. A spatial attention module identifies which pixels in the source frame correspond to the articulatory region. And a diffusion decoder generates the modified pixels conditioned on both the articulatory features and the surrounding facial context.
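
A shape-level sketch of how the three components could compose, written in PyTorch. The layer sizes are invented, the diffusion decoder is collapsed into a two-layer conv stack for brevity, and none of the module names come from the research itself.

```python
import torch
import torch.nn as nn

class PhonemeEncoder(nn.Module):
    """Phoneme IDs -> dense articulatory feature sequence."""
    def __init__(self, n_phonemes=64, dim=128):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
    def forward(self, phoneme_ids):                      # (B, T)
        h, _ = self.rnn(self.embed(phoneme_ids))
        return h                                         # (B, T, dim)

class SpatialAttention(nn.Module):
    """Scores each pixel's relevance to the articulatory region."""
    def __init__(self, dim=128):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Conv2d(3, dim, 1)
    def forward(self, feats, frame):                     # (B, T, dim), (B, 3, H, W)
        q = self.to_q(feats.mean(dim=1))                 # (B, dim) pooled query
        k = self.to_k(frame)                             # (B, dim, H, W)
        attn = torch.einsum("bd,bdhw->bhw", q, k)
        return torch.sigmoid(attn).unsqueeze(1)          # (B, 1, H, W) soft mask

class MouthDecoder(nn.Module):
    """Stand-in for the diffusion decoder: frame + mask -> mouth pixels."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(32, 3, 3, padding=1))
    def forward(self, frame, mask):
        return self.net(torch.cat([frame, mask], dim=1))

enc, attn, dec = PhonemeEncoder(), SpatialAttention(), MouthDecoder()
frame = torch.rand(1, 3, 128, 128)
feats = enc(torch.randint(0, 64, (1, 12)))               # a 12-phoneme sequence
mask = attn(feats, frame)
mouth = dec(frame, mask)                                  # (1, 3, 128, 128)
```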

The temporal consistency module sits on top. It takes a window of generated frames and enforces smooth transitions, removing the flickering artifacts that plague frame-by-frame generation approaches. It runs as a lightweight final stage inside the same forward pass, adding minimal latency.

Additional technical details will be added here from the research team's brief. This section will expand to cover training data, loss functions, ablation studies, and comparison with prior work.

Try it yourself

Interactive demo coming soon

Upload a video clip and dub it into any language with lip-sync. The playground will run NEO 2 inference in real time.
Visit playground.colossyan.com →

Sample results

English → Mandarin

Mandarin is a tonal language with relatively constrained lip movement compared to English. The challenge is adjusting lip shapes for Mandarin phonemes while preserving the tonal expressiveness in the face. Notice how the jaw remains more closed throughout, matching the tighter articulation of native Mandarin.

English → Spanish

Spanish has a five-vowel system with more pronounced lip rounding than English. The model adapts lip protrusion for /o/ and /u/ sounds and widens the mouth for the open /a/. Jaw dynamics shift to accommodate the syllable-timed rhythm of Spanish, which differs from English's stress-timed pattern.

English → Arabic

Arabic pharyngeal and uvular consonants require throat and back-of-mouth articulations that have no English equivalent. The model generates visible throat tension for pharyngealized sounds and adjusts tongue root position for emphatic consonants, movements that are subtle but immediately noticeable to native speakers.

English → Hindi

Hindi distinguishes between aspirated and unaspirated stops, and between dental and retroflex consonants. The model generates the tongue-tip curling visible for retroflex sounds (/ʈ/, /ɖ/) and the brief lip parting for aspirated releases. These are among the most challenging viseme targets because the differences are small but phonemically contrastive.
