Beyond Fisher: High Fidelity, Full-Duplex Multilingual Conversational AI Datasets
Every major full-duplex voice AI model still traces its training data back to Fisher English — two-party telephone calls recorded at 8 kHz in 2004. Ocular Full-Duplex Hi-Fi is the high-fidelity, multilingual conversational corpus the field has been asking for.


Every major full-duplex voice AI model traces its training data back to the same source: the Fisher English corpus[1], a catalog of two-party telephone calls recorded at 8 kHz in 2004.
Meta's dGSLM,[2] the foundational full-duplex model, trained on 2,000 hours of Fisher. SyncLLM[3] used 1,927 hours of Fisher as its real-data anchor; the remaining 99% of its 215,000-hour training set was synthesized by TTS. SALM-Duplex[4] and PersonaPlex[5] (the two openly available end-to-end duplex speech-to-speech models) are both grounded on the same telephony-grade audio. PersonaPlex draws speaker voice samples from Fisher and adjacent audiobook corpora, then synthesizes the conversations themselves.
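As a quick check on the SyncLLM proportions, the arithmetic can be worked directly from the two hour counts quoted above. This is only a sketch of that calculation; the variable names are ours.

```python
# Hour counts as quoted in the paragraph above for SyncLLM's training mix.
FISHER_HOURS = 1_927
TOTAL_HOURS = 215_000

real_share = FISHER_HOURS / TOTAL_HOURS
synthetic_hours = TOTAL_HOURS - FISHER_HOURS

print(f"real audio:      {real_share:.1%}")        # 0.9%
print(f"synthetic audio: {1 - real_share:.1%}")    # 99.1%
print(f"synthetic hours: {synthetic_hours:,}")     # 213,073
```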
The recent survey From Turn-Taking to Synchronous Dialogue[6] names data scarcity as the primary bottleneck to progress, more limiting than architecture. The Full-Duplex-Bench evaluation[7] finds that no open-source model achieves both natural backchanneling and appropriate interruption behavior simultaneously, and attributes the gap to training data, not model design.
The field has asked for better data. The Ocular Full-Duplex Hi-Fi corpus is our answer.
How Labs Have Patched the Gap
Rather than collect better data, labs have patched the Fisher gap with increasingly elaborate simulations.
Synthetic TTS data does not contain real prosody, real disfluency, or real overlaps. SyncLLM is illustrative: the model learns the structural pattern of conversation from synthesized dialogue, then anchors to telephony-grade audio for acoustic grounding. The structure is fake; only the acoustics (and only at 8 kHz) are real.
SALM-Duplex and PersonaPlex go further. Each layers its TTS dialogue on top of a different real-audio backbone: SALM-Duplex mixes VoxPopuli's[12] European Parliament recordings (16 kHz) with 8 kHz Fisher, while PersonaPlex draws speaker voices from Libriheavy[13] audiobook readings and synthesizes the conversational structure on top of them.
For barge-in (the moment a speaker interrupts mid-utterance), SALM-Duplex inserts a programmatic 0.64-second silence followed by the interrupting voice. PersonaPlex uses "negative-duration silence" to stitch the interrupting voice on top of the interrupted speaker's audio.
These are engineering approximations of a real human behavior. A barge-in involves overlapping prosody, pitch alignment between both speakers, and a floor negotiation (one speaker yields) that unfolds over hundreds of milliseconds with acoustic cues at every step. The simulation captures none of that.
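To make the two stitching strategies concrete, here is a minimal sketch of what each might look like on raw sample lists. It is not the code from either paper: the function names, the 16 kHz rate, and the reading of "negative-duration silence" as a summed overlap are all assumptions made for illustration.

```python
SAMPLE_RATE = 16_000  # assumed rate, for illustration only


def stitch_with_silence(first, second, gap_s=0.64, sr=SAMPLE_RATE):
    """Gap-style barge-in: end `first`, insert fixed silence, then start `second`."""
    gap = [0.0] * round(gap_s * sr)
    return first + gap + second


def stitch_with_overlap(first, second, overlap_s, sr=SAMPLE_RATE):
    """Overlap-style barge-in: start `second` before `first` has finished.

    The last `overlap_s` seconds of `first` are summed with the start of
    `second`, which is one way to read a "negative-duration silence".
    """
    n = min(round(overlap_s * sr), len(first), len(second))
    head = first[:len(first) - n]
    mixed = [a + b for a, b in zip(first[len(first) - n:], second[:n])]
    return head + mixed + second[n:]


# Two one-second placeholder "utterances" (constant values, not real speech).
speaker_a = [0.1] * SAMPLE_RATE
speaker_b = [0.2] * SAMPLE_RATE

gapped = stitch_with_silence(speaker_a, speaker_b)
overlapped = stitch_with_overlap(speaker_a, speaker_b, overlap_s=0.5)
```

In both cases the two waveforms are produced independently and only positioned relative to each other afterwards. Neither speaker's pitch, level, or timing responds to the other, which is exactly the part of a real barge-in that the paragraph above describes.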
That this is a data problem and not an architecture problem is visible directly in the benchmark numbers. The Thinking Machines Interaction Model[11] scores 77.8 on Full-Duplex-Bench against ~50 for the best openly available models, and the researchers attribute the gap not to architectural superiority but to proprietary access to high-quality real conversational data.
What the Field Has Explicitly Asked For
Three recent papers read as direct calls for better full-duplex training data:
From Turn-Taking to Synchronous Dialogue:[6] A survey of full-duplex spoken language models that classifies architectures and unifies evaluation across temporal dynamics, behavioral arbitration, semantic coherence, and acoustic performance. Its headline finding: the open problems are "synchronous data scarcity, architectural divergence, and evaluation gaps," with data scarcity listed first. Solving the first one unblocks the other two.
Full-Duplex-Bench:[7] Evaluates full-duplex models across pause handling, backchanneling, turn-taking, and user interruption using 727 samples drawn from CANDOR,[9] ICC,[10] and GPT-4o / ChatTTS synthetic audio. The authors explicitly note the scarcity of high-quality annotated full-duplex data as a limitation of the benchmark itself: a corpus with labeled backchannel timing, barge-in types, and turn-mechanics would let the benchmark be run at scale with ground truth rather than proxy metrics.
FLEXI:[8] Introduces six interaction scenario types (including emergency and emotional support) and finds significant open-source/commercial gaps in every scenario. The authors identify per-scenario training data coverage as a key missing ingredient. A corpus annotated with scenario types and role dynamics directly addresses this.
The Ocular Full-Duplex Hi-Fi Corpus
Ocular Full-Duplex Hi-Fi is a studio-grade conversation corpus built deliberately against the constraints Fisher imposed. Every parameter (sample rate, channel topology, capture device, acoustic environment, speech mode) is set against what Fisher allowed.
Ocular Full-Duplex Hi-Fi at a glance
The corpus, parameter-by-parameter, against the data the field has been training on for two decades.
| Parameter | Ocular Full-Duplex Hi-Fi | Fisher English |
|---|---|---|
| Sample rate | 48 kHz | 8 kHz |
| Bit depth | 24-bit | 8-bit µ-law |
| Effective bandwidth | DC – 24 kHz | DC – 4 kHz (PSTN-capped) |
| Channels per session | One isolated channel per speaker | One per speaker (telephony mix) |
| Channel correlation | Pearson r ≈ 0.001 (no bleed) | Crosstalk inherent to PSTN |
| Capture device | Studio-grade wired mics | PSTN handset microphone |
| Capture environment | Acoustically gated rooms, pass/fail SNR pre-check | Whatever environment the caller was in |
| Speech mode | Unscripted, naturalistic | Topic-prompted, telephone |
| Per-file format | Mono FLAC (lossless) | µ-law .sph |
| Coverage | Worldwide, with diverse accents, ages, and topics | US English, telephony-grade |
Ocular Full-Duplex Hi-Fi values reflect the v1.0 corpus shipped at 48 kHz / 24-bit. Fisher English values cited from the LDC Fisher English Training Speech Part 1 catalog entry (LDC2004S13).
Per-session quality gate. Before recording begins, each session must pass a quality check: spectral energy verified above 2 kHz, noise floor below threshold, peak levels within headroom, SNR confirmed. Sessions that fail the gate don't record. Fisher had no equivalent gate; it accepted whatever the telephone network delivered.
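A gate of this kind can be sketched in a few functions. The version below is a minimal illustration, not the production gate: every threshold value, the function names, and the use of a short naive DFT for the spectral check are assumptions chosen to keep the example self-contained.

```python
import cmath
import math


def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def dbfs(value):
    """Level in dB relative to full scale (1.0), floored to avoid log(0)."""
    return 20 * math.log10(max(value, 1e-12))


def energy_fraction_above(frame, sr, cutoff_hz):
    """Share of spectral energy at or above cutoff_hz (naive DFT, short frames only)."""
    n = len(frame)
    total = above = 0.0
    for k in range(1, n // 2):
        coeff = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power = abs(coeff) ** 2
        total += power
        if k * sr / n >= cutoff_hz:
            above += power
    return above / total if total else 0.0


def quality_gate(room_tone, speech, sr=48_000, frame_len=480,
                 max_noise_dbfs=-60.0,   # hypothetical threshold
                 max_peak_dbfs=-3.0,     # hypothetical headroom
                 min_snr_db=40.0,        # hypothetical threshold
                 min_hf_fraction=0.01):  # hypothetical threshold
    """Return (passed, per-check results) for a pre-recording test capture."""
    noise_db = dbfs(rms(room_tone))
    peak_db = dbfs(max(abs(s) for s in speech))
    snr_db = dbfs(rms(speech)) - noise_db
    hf_fraction = energy_fraction_above(speech[:frame_len], sr, 2_000)
    checks = {
        "noise_floor": noise_db <= max_noise_dbfs,
        "peak_headroom": peak_db <= max_peak_dbfs,
        "snr": snr_db >= min_snr_db,
        "energy_above_2khz": hf_fraction >= min_hf_fraction,
    }
    return all(checks.values()), checks
```

The gate takes two short captures: room tone with nobody speaking, and a speech test. Returning the per-check dictionary alongside the pass/fail flag means a failed session can report which condition to fix before retrying.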
Unscripted conversation. Speakers are not given prompts, do not read from a script, and are not asked to demonstrate specific behaviors. Conversations are collected worldwide across diverse accents, ages, and topics. Natural barge-ins, real backchannels, and genuine hesitation occur because the conditions for them are present.
Why fidelity matters above 4 kHz. Fisher captured what telephone infrastructure allowed: 8 kHz audio, capped at 4 kHz by the network, through handset microphones, in whatever acoustic environment callers happened to be in at the time. The acoustic signals that separate human speech from TTS output (the audible breath before a response, the prosodic fall of a completed clause, the 2–5 kHz presence region that carries vocal warmth, the overlap of two voices at full fidelity) all depend at least in part on energy above 4 kHz. Fisher cannot capture that part of them. Every model trained primarily on Fisher is learning from data that is acoustically blind to the layer it most needs.
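The 4 kHz cap is a property of sampling itself: a signal sampled at 8 kHz cannot represent anything above half that rate. The sketch below shows this with two pure tones. The helper names are ours, and the tones are synthetic stand-ins rather than corpus audio.

```python
import math


def nyquist_hz(sample_rate_hz):
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2


def sample_tone(freq_hz, sample_rate_hz, n_samples):
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]


# A 5 kHz tone sits above the 4 kHz limit of 8 kHz sampling.
tone_5k = sample_tone(5_000, 8_000, 64)
tone_3k = sample_tone(3_000, 8_000, 64)

# Sampled at 8 kHz, the 5 kHz tone yields the same values as an inverted
# 3 kHz tone, so the two cannot be told apart after sampling.
indistinguishable = all(
    math.isclose(a, -b, abs_tol=1e-9) for a, b in zip(tone_5k, tone_3k)
)

print(nyquist_hz(8_000), nyquist_hz(48_000))  # 4000.0 24000.0
print(indistinguishable)                      # True
```

Telephone channels filter out content above the limit before sampling rather than letting it fold back like this, but the consequence for training data is the same: nothing above 4 kHz survives in a form a model can learn from.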
You can hear the difference in the samples below: the breaths, the 2–5 kHz presence band, and the unmixed overlap, all preserved.
Samples
Two unscripted conversations, captured per-speaker on isolated channels. Speaker A and Speaker B recorded simultaneously on separate studio-grade wired microphones, time-synchronized, 48 kHz / 24-bit lossless.
Conversation 7
Channel isolation, side by side
Same conversation, two simultaneous wired microphones in different rooms. The pair below is the raw per-speaker capture, what the model would actually train on.
Speaker A · Studio-grade mic · Room 1
Speaker B · Studio-grade mic · Room 2
Each speaker is captured on an isolated channel at 48 kHz / 24-bit / mono FLAC, in a separate room on a separate, time-synchronized microphone. The two spectrograms above are the raw per-microphone recordings: no bleed, no mixing, no re-encoding.
Conversation 8
Swapped rooms, same isolation
The same two speakers swap microphone and room assignments. Cross-correlation between channels comes back at Pearson r ≈ 0.001, so what you're seeing is two independent captures, not one mic bleeding into the other.
Speaker A · Studio-grade mic · Room 2
Speaker B · Studio-grade mic · Room 1
Same pair of speakers, microphone and room assignments swapped, still captured on isolated channels at 48 kHz / 24-bit / mono FLAC.
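A channel-correlation figure like the one quoted above can be reproduced on any pair of time-aligned channels. The sketch below computes a zero-lag Pearson r in plain Python. The function name is ours, and the sine waves stand in for real audio; how the reported value was actually measured is not specified here.

```python
import math


def pearson_r(x, y):
    """Zero-lag Pearson correlation between two equally sampled channels."""
    n = min(len(x), len(y))
    x, y = x[:n], y[:n]
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant channel")
    return cov / math.sqrt(var_x * var_y)


# Synthetic stand-ins: two unrelated "voices", then a channel with 50% bleed.
N = 1_000
voice_a = [math.sin(2 * math.pi * 3 * i / N) for i in range(N)]
voice_b = [math.sin(2 * math.pi * 5 * i / N) for i in range(N)]
mic_b_with_bleed = [b + 0.5 * a for a, b in zip(voice_a, voice_b)]

isolated_r = pearson_r(voice_a, voice_b)          # close to 0
bleed_r = pearson_r(voice_a, mic_b_with_bleed)    # clearly above 0
```

A zero-lag check is the simplest case. Bleed that arrives with a delay (for example through a headphone leak) would need the same calculation repeated across a range of time offsets.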
What Annotation of This Data Enables
The samples above are raw capture: 48 kHz / 24-bit, channel-separated, unprocessed. What makes a corpus training-ready is the annotation layer on top of that capture. Because the audio exists at full fidelity with independent channels, it can support annotation depth that lower-quality or merged captures cannot: turn-construction unit boundaries and transition-relevance places (TRPs),[14] backchannel timing, and barge-in type all require prosodic detail that 8 kHz telephony erases.
The table below maps each annotation type to the specific research gap it closes and the benchmark metric it directly improves. This is the layer that every paper in this space has identified as missing.
Annotation Layers
Each annotation type, the research gap it closes, and the benchmark metric it directly improves.
| Annotation type | Research gap it closes | Benchmark metric |
|---|---|---|
| Backchannel timing ("mm-hmm", "yeah", "right" labeled with timestamp, speaker, and whether it occurred during the other speaker's turn) | No public corpus annotates backchannels at broadband quality | JSD-D (Full-Duplex-Bench[7]) |
| Barge-in type (aggressive interruption vs. collaborative overlap, with overlap duration and floor outcome) | Every open model simulates barge-in; none trained on labeled real barge-in | TOR (Full-Duplex-Bench[7]) |
| Silence classification (thinking pause vs. turn yield vs. floor hold) | Models can't distinguish "wait" from "speak now"; causes false interruptions | Response latency, TOR-D (Full-Duplex-Bench[7]) |
| Turn-construction unit boundaries (where a syntactically complete turn ends, with prosodic signals) | "When is it safe to speak?", the hardest open problem in full-duplex | TRP detection accuracy (Sacks et al.[14]) |
| Filled pauses and disfluencies ("um", "uh", false-starts, repetitions, verbatim) | Standard TTS doesn't produce filler tokens naturally; they're effectively absent from every training corpus | Naturalness MOS |
| Paralinguistic events (breath, laughter, sigh, clear-throat) | Voice cloning and expressive TTS both require labeled breath and laugh events | Naturalness MOS, voice cloning quality |
| Scenario type (QA, emotional support, procedural, casual, emergency) | FLEXI finds open/commercial gaps in every scenario; coverage requires labeled data | CSC (FLEXI[8]) |
Research gaps cited from the SyncLLM, SALM-Duplex, and PersonaPlex papers. Benchmark metric names from the Full-Duplex-Bench (FD-Bench) evaluation suite: JSD-D (Jensen-Shannon divergence of pause distributions), TOR (Take-Over Rate), and BPC (Backchannel Prediction Consistency).
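For readers unfamiliar with it, a Jensen-Shannon divergence compares two distributions, such as a histogram of event timings in human conversation against the same histogram from a model. The sketch below is a generic implementation, not the benchmark's own scoring code; the bin counts are made up for illustration.

```python
import math


def normalize(counts):
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram is empty")
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p_counts, q_counts):
    """Symmetric divergence between two histograms: 0 (identical) to 1 (disjoint)."""
    p, q = normalize(p_counts), normalize(q_counts)
    midpoint = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, midpoint) + 0.5 * kl_divergence(q, midpoint)


# Hypothetical timing histograms over the same five bins (counts are invented).
human_counts = [2, 10, 25, 8, 1]
model_counts = [15, 12, 8, 6, 5]

score = jensen_shannon_divergence(human_counts, model_counts)
```

Lower is better: a model whose timing histogram matches the human one scores 0. Labeled timestamps in real conversation are what supply the human-side histogram.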
Each annotation layer is a direct training signal for a metric the field already uses to evaluate models. The annotation protocol covering all seven layers (plus prosodic markers, EQ trajectory, and semantic role dynamics) is available on request.
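To give a sense of how such layers might be represented in code, here is a small sketch covering two of them. It is a hypothetical shape only: the class names, field names, and enum values are assumptions for illustration and are not the schema in the annotation protocol.

```python
from dataclasses import dataclass, field
from enum import Enum


class BargeInType(Enum):
    AGGRESSIVE_INTERRUPTION = "aggressive_interruption"
    COLLABORATIVE_OVERLAP = "collaborative_overlap"


@dataclass(frozen=True)
class Backchannel:
    speaker: str              # who produced the backchannel
    start_s: float            # timestamp in seconds
    token: str                # e.g. "mm-hmm", "yeah", "right"
    during_other_turn: bool   # True if the other speaker held the floor


@dataclass(frozen=True)
class BargeIn:
    interrupter: str
    start_s: float
    overlap_s: float          # how long both voices were active
    kind: BargeInType
    floor_winner: str         # who held the floor afterwards


@dataclass
class SessionAnnotations:
    session_id: str
    scenario: str             # e.g. "casual", "emotional support"
    backchannels: list[Backchannel] = field(default_factory=list)
    barge_ins: list[BargeIn] = field(default_factory=list)

    def total_overlap_s(self) -> float:
        return sum(b.overlap_s for b in self.barge_ins)


# Invented example values, not corpus data.
session = SessionAnnotations(session_id="demo-001", scenario="casual")
session.backchannels.append(Backchannel("B", 4.2, "mm-hmm", True))
session.barge_ins.append(
    BargeIn("B", 12.4, 0.35, BargeInType.COLLABORATIVE_OVERLAP, "A")
)
```

Keeping each layer as its own record type, keyed by speaker and timestamp, lets a training pipeline pull only the layers it needs against the per-speaker audio channels.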
How This Corpus Compares
The table below lines up Ocular Full-Duplex Hi-Fi against Fisher and the three open simulated training sets across the axes every recent full-duplex paper names as a training bottleneck.
Corpus Comparison
How Ocular Full-Duplex Hi-Fi stacks up against Fisher English and the simulated training sets that followed it.
| Corpus | Sample rate / audio source | Channels | Backchannels | Real barge-in | Paralinguistics | Scenario labels |
|---|---|---|---|---|---|---|
| **Ocular Full-Duplex Hi-Fi** | **48 kHz / 24-bit** | **Separated** | **Labeled** | **Real** | **Labeled** | **Labeled** |
| Fisher English | 8 kHz / 8-bit µ-law | Separated | None | Real (unlabeled) | None | None |
| SyncLLM training | ~2k h Fisher + ~213k h TTS | Merged | None | Simulated | None | Text-only |
| SALM-Duplex training | TTS + 8 kHz Fisher | Merged | None | Simulated (0.64 s silence) | None | None |
| PersonaPlex training | TTS + audiobook (Fisher voices) | Merged | None | Simulated (neg. duration) | None | Role prompts |
Ocular Full-Duplex Hi-Fi values shown in bold are the per-row reference. Comparator values cited from the SyncLLM, SALM-Duplex, and PersonaPlex papers and the Fisher English LDC catalog entry.
Every "None" or "Simulated" cell in this table is a gap the evaluation literature has named as a bottleneck. The interactive comparison below picks each of those corpora apart on the audio itself.
Audio Comparison
The same corpora, on the audio itself.
[Interactive audio comparison. Reference panel: Ocular Full-Duplex Hi-Fi, Speaker A (24 kHz) and Speaker B (24 kHz). Comparison panel: Fisher English, real LDC corpus recordings, 8 kHz telephone, Sample A and Sample B.]
Attribute Comparison
Ocular Full-Duplex Hi-Fi against Fisher English, axis by axis.
One row per parameter every recent full-duplex paper names as a training bottleneck. The Ocular reference column stays bolded so the row-by-row deltas are easy to read at a glance.
| Parameter | Ocular Full-Duplex Hi-Fi (reference) | Fisher English |
|---|---|---|
| Sample rate | **48 kHz / 24-bit** | 8 kHz telephony |
| Channels | **Per-speaker isolated** | Per-speaker |
| Backchannels labeled | **Labeled** | None |
| Real barge-in | **Real** | Real (unlabeled) |
| Paralinguistics labeled | **Labeled** | None |
| Scenario labels | **Labeled** | None |
Ocular Full-Duplex Hi-Fi values shown in bold are the per-row reference. Comparator values for Fisher English cited from the corresponding corpus paper or catalog entry referenced in the article body.
The spectral ceiling of 8 kHz sampling is visible in the Fisher spectrograms: energy drops abruptly at 4 kHz, where PSTN encoding cuts it off. The breath, the upper part of the 2–5 kHz presence region, and the unmixed overlap that Ocular Full-Duplex Hi-Fi captures all extend above that line, into a band Fisher cannot reach.
Why This Matters
Five points from the argument above, in the order they bear on a fine-tuning decision.
Key takeaways
Future Work
We are running LoRA fine-tunes of SALM-Duplex and comparable open architectures on the Ocular corpus and measuring the delta on Full-Duplex-Bench. We expect the signal to be strongest on TOR (Take-Over Rate) and backchannel timing, the two axes where the gap between simulated and real training data is most direct, and where real barge-in labels are the annotation type most absent from every prior training set.
Until those numbers ship as a follow-up post, the warrant for the claim is the corpus itself: verified capture, channel-separated audio, and the annotation protocol above. The samples make the rest of the argument.
Request Access
We're sharing sample packages with research teams working on full-duplex speech systems, evaluation benchmarks, and TTS training. A sample package includes:
- Channel-separated FLAC files: one mono track per speaker, 48 kHz / 24-bit lossless.
- Spectrograms: full-band, per-channel, for every sample.
- Conformance reports: quality-gate output (SNR, peak headroom, spectral coverage, channel correlation) for every session.
- Annotation protocol specification: full schema for the seven annotation layers above plus prosodic markers, EQ trajectory, and semantic role dynamics.
The corpus ships across American English, French, Mandarin Chinese, Spanish, Vietnamese, Bahasa Indonesia, Japanese, Thai, German, Arabic, Hindi, Korean, Polish, and Russian, with regional dialect coverage and per-speaker dialect tags inside each language. Every language is captured against the same studio-grade pipeline (paired native speakers, isolated channels, 48 kHz / 24-bit, the full annotation layer above), so cross-lingual transfer experiments train on like-for-like data. Browse the full multilingual catalogue and request samples on the marketplace.
Request a sample package, or email founders@useocular.com directly. Mention the model or benchmark you're targeting and we'll size the package to fit. To be notified when the fine-tuning results post ships, include "follow-up" in your subject line.
For more on the corpus, see the Hi-Fi dataset page.
Author

Louis Murerwa
Co-founder & CTO

