April 20, 2026 · Company

Ocular AI Manifesto

We're an AI data applied research company encoding human expertise into frontier models, starting with voice.

Michael Moyo

Ocular AI is an Applied Data Research Lab focused on encoding human expertise into frontier AI models and enabling them to perform economically valuable real-world tasks.

I want to lay out, in plain language, why I think this is the most important thing anyone can be working on right now, and what we've actually built.

I'll be direct about the thesis. The last decade of AI was won by compute. The next will be won by compute and the right data, scaled together. More GPUs without better data plateaus. Better data without the compute to train against it and serve it at inference plateaus too. The frontier moves when both inputs grow in lockstep, and the data input has to come from real-world expert work, not another scrape of the open web.

A Technological Renaissance, and the Fourth Industrial Revolution

We're living through what historians will probably look back on as a Technological Renaissance. Klaus Schwab named the moment a decade ago. He called it the Fourth Industrial Revolution,^[23] and the framing has only gotten sharper since.

Three revolutions came before this one. Each followed the same pattern, and I think it's worth dwelling on the pattern because it tells you what kind of company you need to build for this moment.

The first revolution mechanised muscle. Steam, then internal combustion, took the work that human bodies and animal teams used to do, and condensed it into machines that scaled past biology.

The second revolution condensed expert craft. In 1913, Henry Ford's Highland Park line cut the time to assemble a Model T from over twelve hours down to roughly ninety minutes.^[24] He did it by breaking the car into eighty-four discrete steps, each one repeatable by someone who didn't need to hold the whole vehicle in their head. The craft of automotive assembly, which had previously required a small team of skilled mechanics building a car as a single unit, got absorbed into a piece of infrastructure that could do the same job at a different scale. Brynjolfsson and McAfee called this the "Second Machine Age" pattern in their 2014 book,^[26] and Carlota Perez's longer-arc analysis of how technological revolutions actually settle says exactly the same thing.^[27]

The new infrastructure absorbs the old craft, then expands what's possible.

The third revolution digitised information work. The keystroke, the spreadsheet, the database, the search query. Knowledge work got abstracted onto chips and connected over networks, and the productivity gains compounded for forty years.

The fourth is condensing cognition itself. The radiologist who can read a chest film in five seconds and tell you something a less experienced colleague would have missed. The litigator who can build an argument that wins the case because she's spotted the precedent nobody else had. The customer-service rep who hears the frustration in someone's voice and decides to skip the script. The engineer who looks at a pull request late on a Friday and sees the bug before he reads the diff. None of that is being replaced. It's being absorbed.

That last word matters to me. I don't think AI is going to replace experts. The assembly line didn't replace craftsmen. It absorbed their craft into something that could scale past them. The chip didn't replace clerks. It gave them tools that made each clerk responsible for ten times more output. The model is doing the same thing for cognitive work.

The expert is the unit of input. The model is the new unit of scale.

The catch is what we're trying to condense this time. The Highland Park line was decomposable because automotive assembly was largely visible. A foreman could watch a worker, write down each motion, hand the card to the next person, and the process replicated. The expertise we're trying to condense now is mostly invisible. Michael Polanyi gave it a name in 1966. Tacit knowledge.^[25]

"We know more than we can tell."
— Michael Polanyi, The Tacit Dimension, 1966

The cadence with which a senior clinician asks a follow-up question. The moment a negotiator decides to pause. The millisecond a fluent speaker chooses to breathe instead of speak. None of that lives in any written-down spec. It lives in the doing.

Capturing that tacit expertise at the scale and quality frontier training demands is the work the field hasn't figured out yet. It's the work we built Ocular for.

Compute Is Still Necessary. It Is Not Enough.

I want to be careful with this part because the consensus in the field is moving fast and a lot of the simpler "compute is dead" takes get the dynamics wrong. Compute is not the bottleneck and it is also still indispensable. Both can be true at the same time.

Start with what compute has done. Compute used to train frontier models has roughly doubled every six months for the past decade.^[1] That's about four times the doubling pace of Moore's law, which is usually quoted at roughly two years per doubling. Architectures have moved up the stack from convolutional networks^[2] to transformers^[3] to mixture-of-experts^[4] to reasoning-time compute.^[5] Post-training has graduated from supervised fine-tuning to RLHF^[6] and constitutional methods.^[7] Every generation, the curves go up and to the right.
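To make the doubling-time comparison concrete, here is a minimal sketch of the arithmetic. The six-month figure comes from the text above; the two-year Moore's law doubling time is an assumption (the commonly quoted phrasing), and the function name is purely illustrative.

```python
def growth_factor(years: float, doubling_time_years: float) -> float:
    """Multiplicative growth after `years` at a fixed doubling time."""
    return 2.0 ** (years / doubling_time_years)


TRAINING_COMPUTE_DOUBLING_YEARS = 0.5  # six months, as stated in the text
MOORES_LAW_DOUBLING_YEARS = 2.0        # assumption: the usual two-year phrasing

decade_training_compute = growth_factor(10, TRAINING_COMPUTE_DOUBLING_YEARS)
decade_moores_law = growth_factor(10, MOORES_LAW_DOUBLING_YEARS)

print(f"Training compute over a decade: {decade_training_compute:,.0f}x")
print(f"Moore's law over a decade:      {decade_moores_law:,.0f}x")
```

Under these assumptions the gap compounds quickly: twenty doublings in a decade against five.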

The results are genuinely impressive. Frontier models pass the US Medical Licensing Examination.^[8] They score in the top decile of the Uniform Bar Exam.^[9] They reach silver-medal performance on International Mathematical Olympiad problems.^[10] They resolve a meaningful fraction of real GitHub issues end-to-end.^[11]

The position I take is closest to Rich Sutton's in The Bitter Lesson:^[28] general methods that scale with computation are the ones that win. I think that thesis is still right. What changed is what computation is now being applied to. For the deep-learning era it was applied to whatever raw public text could be scraped at internet scale. That's the lever that's saturating, not Sutton's bet on compute. Public text is finite, and the high-quality slice is closer to exhausted than the field publicly admits.^[12] Villalobos et al. project that pretraining demand will outpace the supply of public human-generated text well before the end of the decade.

So the next gain is not "compute or data." It's compute paired with a different data input than the open web has on offer. Investments in GPUs, clusters, and inference infrastructure should keep growing. They should grow alongside investment in the data layer that fills the next training run with real expert work, captured deliberately, with provenance and consent built in. Either input without the other plateaus. Both together is where the curve keeps going up.

The Scaling Law Plateau

The literature has been telling us this for a while. The clean power-law gains the field came to associate with raw compute^[17] were always a story about paired growth. Parameters, data, and compute had to scale in lockstep. The Chinchilla result quietly shifted the binding constraint.^[18] Hoffmann et al. showed that almost every frontier model from that era had been trained on too little data for its parameter count. Compute-optimal training, it turned out, requires substantially more tokens per parameter than the field had been spending. The implication, restated plainly. At frontier compute budgets, data is the next limit to push. Not parameters. Not FLOPs. And critically, "more data" only helps if the data is the right kind.
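A rough sketch of what "more tokens per parameter" means in practice. It assumes two widely used approximations that are not stated in this post: training cost of about 6 FLOPs per parameter per token, and a rule of thumb of about 20 tokens per parameter for compute-optimal training. Both constants are simplifications, not the fitted scaling law itself, and the function names are hypothetical.

```python
import math

FLOPS_PER_PARAM_PER_TOKEN = 6.0  # assumption: the common C ~ 6 * N * D approximation
TOKENS_PER_PARAM = 20.0          # assumption: rough Chinchilla-style rule of thumb


def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute for a dense model."""
    return FLOPS_PER_PARAM_PER_TOKEN * params * tokens


def compute_optimal_split(compute_flops: float) -> tuple[float, float]:
    """Split a FLOP budget into (params, tokens) at a fixed tokens-per-parameter ratio."""
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_PER_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


# Sanity check against the published Chinchilla configuration (70B params, 1.4T tokens).
budget = training_flops(70e9, 1.4e12)
params, tokens = compute_optimal_split(budget)
print(f"budget ~ {budget:.2e} FLOPs -> {params / 1e9:.0f}B params, {tokens / 1e12:.1f}T tokens")
```

The point of the sketch is the square-root relationship: under these assumptions, every 4x increase in compute asks for roughly 2x more parameters and 2x more tokens, so the data requirement never stops growing.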

The follow-on work hammered the point. Sorscher et al. demonstrated that careful data pruning can beat neural scaling laws.^[19] A smaller, higher-quality training set produces a better model than a larger, noisier one at matched compute. Quality, not quantity, is the lever the field has been undershooting on the data side of the pair.

Andrej Karpathy framed the same shift earlier from a different angle. In Software 2.0 he argued that neural networks are a new programming paradigm where the dataset is the source code.^[29] The compute is what compiles that source into a model. Better code and more compilation both still matter. What the field is rediscovering is that you cannot keep pouring compute into the same scraped, commoditised source code and expect the model to keep improving. The new gains come from writing better source.

You can see the consequences in plain sight. Reasoning-time compute models^[5] shift spend to inference precisely because the marginal return on additional pretraining compute against the same commoditised text is shrinking. The frontier labs that are still pulling ahead are spending heavily on both sides of the pair. They are still buying compute aggressively. They are also sourcing data nobody else has. Licensed studio audio. Private code repos. Expert-authored reasoning traces. Instrumented agent conversations. The two investments compound. Compute without the new data stalls. New data without the compute to train against it sits in cold storage.

The chart below is the qualitative shape of what's happening. Compute paired with commodity web data and compute on its own are both real curves, and both have already bent. The remaining headroom, the layer the next generation of frontier models will compete on, comes from compute paired with data that's deliberately captured, expert-grounded, and rich in the behaviours models actually need to learn.

Where the next decade of gain comes from

Compute paired with frontier expert data is the only band with meaningful headroom left.

Conceptual schematic. Diminishing returns on compute alone and on commodity web data are documented across the scaling-law literature (Kaplan 2020, Hoffmann 2022, Sorscher 2022). Each band depicts the marginal capability gain from adding more of that input while holding the others fixed. The topmost band is the pair that hasn't bent yet — compute scaled in lockstep with deliberately captured expert data.

That's the bet behind every major frontier-model investment over the last eighteen months. The labs winning are scaling compute and scaling a deliberate data investment alongside it. Neither lever on its own gets them there. The two paired is what does.

What's Actually Missing

The benchmarks are passing. The deployments aren't.

I see this every week. Models that score in the 90th percentile on the bar exam still hallucinate citations in real briefs. Stanford's RegLab quantified the gap precisely. Leading LLMs hallucinate legal information between 69% and 88% of the time on jurisdiction-specific queries.^[13] Models that ace MedQA still miss diagnoses experienced clinicians catch, because clinical reasoning isn't multiple-choice. It's incomplete information, judgment under uncertainty, and longitudinal patient context.^[14] Voice models that pass intelligibility benchmarks fall apart on accents the training data underrepresents.^[15] They miss the backchannels and barge-ins that make conversation feel natural.^[16] They freeze when a real human interrupts.

The pattern is the same across every domain that matters.

Benchmarks vs. practice

Where frontier models pass the test and where real practice still requires a human.

Domain	Benchmark frontier models clear	What real practice still requires
Medicine	USMLE pass-level performance^[8]	Differential diagnosis under incomplete information, longitudinal context, patient communication^[14]
Law	UBE 90th percentile^[9]	Jurisdictional nuance and citation accuracy. Leading LLMs hallucinate legal cases 69 to 88% of the time^[13]
Software engineering	High pass@1 on isolated coding tasks	End-to-end resolution of real GitHub issues. Frontier models clear only a fraction of SWE-bench^[11]
Mathematics	Silver-medal IMO problem solving^[10]	Open-ended research-grade reasoning, theorem development, novel proof strategy
Conversational voice	Word error rate on read speech	Accent coverage,^[15] full-duplex turn-taking. Best open models score around 50 on Full-Duplex-Bench against 77.8 for frontier closed systems^[16]

Benchmark performance reflects frontier closed and open models on the cited evaluations as of early 2026. Deployment shortcomings cite peer-reviewed and pre-print evidence, not anecdote.

The gap in every row is the same gap. The benchmarks measure what's easy to standardise. The deployments require what's hard to capture, namely judgment, context, and the texture of how an expert actually does the work.

The bottleneck isn't compute. It isn't architecture. It's expertise.

Models have run out of easy internet to learn from, and what's left doesn't teach them how the real world actually works. How an experienced radiologist reads a film. How a litigator builds an argument. How a fluent speaker hesitates, breathes, and rephrases mid-sentence. That expertise lives in people. The question is whether you can capture it at the depth and scale frontier training demands.

Benchmarks That Reflect the Real Economy

The benchmarks that defined the last era of AI, including MMLU, MedQA, the bar exam, and IMO, measure what's standardisable. They reward test-taking. They do not measure the work most economic value actually sits in. Long-horizon, judgment-laden, multi-step jobs that fold in domain context, partial information, and live consequences.

The field is finally starting to correct that. OpenAI's GDPval evaluates frontier models against 1,320 real-world, economically valuable tasks drawn from 44 occupations across the nine highest-GDP US industries.^[20] Every task is authored by industry professionals, with an average of 14 years of experience, and every model output is graded by industry experts on whether it would pass as production-ready work product. METR's long-horizon study finds that the length of real coding and reasoning tasks a frontier model can complete at 50% success rate has been doubling roughly every seven months.^[21] That's a measure of autonomous work capacity, not of test scores. Sierra's τ-bench evaluates agents the way customer-service teams evaluate their own staff. End-to-end multi-turn interactions in airline and retail domains, with real tool use, a simulated user with real preferences, and a binary success criterion on whether the full job got done.^[22]
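The seven-month doubling figure is easier to reason about with the arithmetic written out. This is a minimal sketch that simply extrapolates a fixed doubling time; the 60-minute starting horizon is a hypothetical placeholder, not a number from the METR study, and nothing here claims the trend will hold.

```python
import math

DOUBLING_MONTHS = 7.0  # doubling time quoted in the text


def projected_horizon(current_minutes: float, months_ahead: float,
                      doubling_months: float = DOUBLING_MONTHS) -> float:
    """Task length completable at 50% success after `months_ahead`, if the trend held."""
    return current_minutes * 2.0 ** (months_ahead / doubling_months)


def months_until(current_minutes: float, target_minutes: float,
                 doubling_months: float = DOUBLING_MONTHS) -> float:
    """Months needed to reach `target_minutes` at a fixed doubling time."""
    return doubling_months * math.log2(target_minutes / current_minutes)


start = 60.0  # hypothetical: a one-hour task horizon today
print(f"After 28 months: {projected_horizon(start, 28):.0f} minutes")
print(f"Months to an 8-hour horizon: {months_until(start, 480):.0f}")
```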

The breadth of GDPval is the part I want to flag. The nine industries below cover the work that actually moves the US economy, and the occupations selected inside each one (five per sector, four in retail trade) are the day-to-day jobs the models will need to do if frontier AI is going to graduate from "test-taking" to "actually employed." The selection isn't arbitrary either. OpenAI chose industries that contribute over 5% of US GDP using sectoral GDP data from the Federal Reserve Bank of St. Louis,^[32] then picked up to five of the highest-compensation occupations inside each sector from the May 2024 Bureau of Labor Statistics occupational employment release,^[31] and filtered the list to predominantly knowledge-work jobs using the O*NET task database from the US Department of Labor.^[33] The full methodology and grading rubric are in the GDPval paper itself.^[30]

GDPval coverage

01. Real estate and rental and leasing (5 roles)
- Concierges
- Property, real estate, and community association managers
- Real estate sales agents
- Real estate brokers
- Counter and rental clerks

02. Government (5 roles)
- Recreation workers
- Compliance officers
- First-line supervisors of police and detectives
- Administrative services managers
- Child, family, and school social workers

03. Manufacturing (5 roles)
- Mechanical engineers
- Industrial engineers
- Buyers and purchasing agents
- Shipping, receiving, and inventory clerks
- First-line supervisors of production and operating workers

04. Professional, scientific, and technical services (5 roles)
- Software developers
- Lawyers
- Accountants and auditors
- Computer and information systems managers
- Project management specialists

05. Health care and social assistance (5 roles)
- Registered nurses
- Nurse practitioners
- Medical and health services managers
- First-line supervisors of office and administrative support workers
- Medical secretaries and administrative assistants

06. Finance and insurance (5 roles)
- Customer service representatives
- Financial and investment analysts
- Financial managers
- Personal financial advisors
- Securities, commodities and financial services sales agents

07. Retail trade (4 roles)
- Pharmacists
- First-line supervisors of retail sales workers
- General and operations managers
- Private detectives and investigators

08. Wholesale trade (5 roles)
- Sales managers
- Order clerks
- First-line supervisors of non-retail sales workers
- Sales representatives, wholesale and manufacturing, except technical and scientific products
- Sales representatives, wholesale and manufacturing, technical and scientific products

09. Information (5 roles)
- Audio and video technicians
- Producers and directors
- News analysts, reporters, and journalists
- Film and video editors
- Editors

The nine industries and 44 occupations OpenAI's GDPval evaluates against. Sectors selected for contributing over 5% of US GDP (FRED), with up to five of the highest-compensation knowledge-work occupations chosen inside each one (BLS Occupational Employment Statistics, filtered against O*NET task labels).

These evaluations are harder to top, harder to game, and harder to over-fit. They are also where the next decade of AI will actually be judged. A model that saturates MMLU and the USMLE can still fail to complete a single hour-long economically valuable task end-to-end. Closing that gap is not a compute problem. It is an evaluation problem first (you cannot optimise what you do not measure), and a data problem immediately after, because you cannot train against the shape of real expert work if you do not have the data that captures it.

That's the lane we're explicitly built for. Every dataset we ship is designed to support evaluation against real workflows. Full multi-turn calls. Full case files. Full code-review rounds. Full procedural sessions. Not isolated standardised-test items. Every annotation layer in the Foundry is one a domain expert would actually grade in their own practice. The goal is not to push the leaderboard on yesterday's benchmark. It is to move models into the work that matters. Deployed jobs, with real users, in real conditions, on real long-horizon tasks.

The Labor Market of the AI Economy

The natural next question, once you've read through the GDPval coverage list above and seen the 44 occupations the field is preparing to evaluate models against, is what work looks like for the people who currently hold those jobs. I get asked this constantly. My honest answer is that the labor market does not shrink. It reshapes. Three things change at once.

Before getting into them, it's worth grounding the discussion in the empirical work. BCG's Henderson Institute published a microeconomic model in April 2026, run over 165 million US jobs distributed across roughly 1,500 distinct roles, and the headline finding is the one to anchor on.

"Over the next two to three years, 50% to 55% of US jobs are projected to be reshaped by AI, and 10% to 15% are vulnerable to elimination over the following four to five."
— BCG Henderson Institute, AI Will Reshape More Jobs Than It Replaces, 2026

Reshaped is a much bigger number than eliminated. It means most workers keep the same or a similar role but face new expectations for how the work gets done and what they produce. The mechanism underwriting that asymmetry is old, and it has a name. Jevons Paradox, first articulated by the British economist William Stanley Jevons in 1865 to describe how steam-engine efficiency improvements led to more coal consumption rather than less,^[37] generalises cleanly to labor. When AI reduces the cost or cycle time of a unit of work, total demand for that work often expands rather than contracts. The BCG paper makes the connection explicit,^[36] and the worked example they lean on is software engineering. Headcount in the role has kept rising in the years since ChatGPT, even as per-engineer output has surged, because the demand for software was never bounded in the first place. The right mental model is not "AI replaces the engineer." It is "AI lets organisations build the software they always wanted to build but could not staff for."
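The Jevons mechanism can be made concrete with a toy model. This is a sketch under strong assumptions that are not in the BCG paper: demand has constant price elasticity, and productivity gains pass through fully to the price of a unit of work. Under those assumptions, if each worker produces `g` times more output and the price elasticity of demand is `e`, quantity demanded scales by `g ** e` while labour per unit scales by `1 / g`, so total labour demand scales by `g ** (e - 1)`.

```python
def labour_demand_ratio(productivity_gain: float, demand_elasticity: float) -> float:
    """Ratio of labour demanded after vs. before a productivity gain.

    Toy model: constant-elasticity demand, with the cost saving fully
    passed through to price. Both are illustrative assumptions.
    """
    return productivity_gain ** (demand_elasticity - 1.0)


# Hypothetical elasticities, chosen only to show the three regimes.
for label, elasticity in [("expandable demand", 1.5),
                          ("unit elasticity", 1.0),
                          ("bounded demand", 0.5)]:
    ratio = labour_demand_ratio(productivity_gain=2.0, demand_elasticity=elasticity)
    print(f"{label:>17}: labour demand x{ratio:.2f}")
```

Elasticity above one is the expandable-demand case where employment grows with productivity; elasticity below one is the bounded-demand case where the same gain shrinks headcount.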

Two pieces of nuance matter before going further. First, Jevons does not apply everywhere. BCG's segmentation explicitly separates roles where demand is expandable (and Jevons amplifies employment) from roles where demand is bounded (and productivity gains translate into headcount reductions). Call-centre representatives are their canonical bounded-demand example, with routine inquiry volume tied to the customer base size rather than to the cost of handling each inquiry. Even there the picture is messier than the headline. The same paper notes that surviving call-centre roles transition into higher-value relationship management, proactive risk mitigation, and escalation handling on the complex cases the AI defers up. The middle of the routine-call distribution thins. The high end thickens. Second, BCG flags a constraint that is itself a labor-market signal worth dwelling on. Scaling agentic AI in enterprises requires specialised integration talent — forward-deployed engineers, systems integrators, project managers who tailor the systems to enterprise context — and supply of that talent is running well behind demand.^[36] A new category of work is being created by the diffusion of AI itself, not just preserved through demand expansion in an existing one.

First, the most valuable form of expert work shifts from doing the task to teaching the model how to do the task, and then verifying that the model did the task correctly. A senior clinician's hour spent annotating a hundred differential-diagnosis traces and grading the model's reasoning against them is worth more, measured in deployable model capability, than that same hour spent seeing one more patient. The clinician still sees patients. But a new tier of paid, structured work opens up alongside the clinical day job, and the people who can articulate why they decide become more valuable than those who only execute. That is the Polanyi observation^[25] made operational. The experts who learn to externalise their tacit knowledge are the ones who get paid twice.

Second, the labor market that emerges around the model is barbell-shaped. David Autor's work on labor-market polarisation already predicted exactly this dynamic. Technology compresses the middle and expands both tails.^[34] The BCG segmentation gives the barbell its texture. Roughly 5% of US jobs are amplified (AI augments + demand expands, headcount grows, e.g. software engineering and parts of law), 14% are rebalanced (AI augments + demand bounded, headcount steady but the role is redesigned), 12% are divergent (AI substitutes + demand expandable, uneven outcomes with junior positions exposed and senior roles persisting), 12% are substituted (AI substitutes + demand bounded, real losses on the routine end), 23% are enabled (AI embedded but the core role is unchanged), and 34% have limited exposure in the near term.^[36] Autor's high end is the amplified + rebalanced + senior-divergent set. The leveraged tail is the enabled + limited-exposure set where AI assistants multiply individual output without replacing the worker. The compressed middle is the substituted + junior-divergent share where the entry-level rung of structured work is the most exposed. Autor's more recent NBER essay argues that the right product and policy response is to push AI toward rebuilding middle-class work rather than displacing it,^[35] and the data infrastructure question is upstream of every version of that response. You cannot reskill knowledge workers into AI oversight roles if the models they are supposed to oversee were never trained on the work itself.
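The six segments are easier to hold in mind as a small table keyed on the two axes the paragraph describes. The sketch below only restates the percentages quoted above and checks that they add up; the field names and the `share` helper are illustrative, not BCG's own schema.

```python
# (AI mode, demand type, share of US jobs in %), as quoted in the text.
SEGMENTS = {
    "amplified":        ("augments",    "expandable", 5),
    "rebalanced":       ("augments",    "bounded",    14),
    "divergent":        ("substitutes", "expandable", 12),
    "substituted":      ("substitutes", "bounded",    12),
    "enabled":          ("embedded",    None,         23),
    "limited exposure": (None,          None,         34),
}


def share(mode=None, demand=None):
    """Total share (%) of segments matching the given AI mode and/or demand type."""
    return sum(pct for m, d, pct in SEGMENTS.values()
               if (mode is None or m == mode) and (demand is None or d == demand))


print("all segments:      ", share())
print("AI augments:       ", share(mode="augments"))
print("AI substitutes:    ", share(mode="substitutes"))
print("expandable demand: ", share(demand="expandable"))
print("bounded demand:    ", share(demand="bounded"))
```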

Third, geography stops being a moat. The current AI economy is grossly biased toward the languages, accents, dialects, and cultural contexts the open web happens to over-represent. The Koenecke et al. PNAS paper^[15] already showed what that bias looks like when models hit production. Word error rates that systematically disadvantage entire demographic groups, deployed at billion-user scale. The flip side of that is the opportunity. Once expert work itself becomes a tradeable input to model training, fluent native speakers in under-represented languages, clinicians in health systems the field has never trained on, lawyers who actually practice in jurisdictions that are not in the LexisNexis corpus, all of them can plug into the AI economy as expert contributors without having to leave their existing professions.

The labor market for AI training data is global by construction.

It does not have to recapitulate the geographic concentration of the cloud or the chip industry, and that is the part of this transition I am most personally invested in seeing get built correctly.

What we are building at Ocular is one piece of the infrastructure that makes all three of those shifts possible. The Network that connects the experts. The Foundry that turns their contributions into training data. The rates and the credentialing layer that recognise their work as the strategic input it actually is. The same supply gap BCG flags for integration talent shows up in spades for expert AI training contributors, and it is the gap we are explicitly built to close. The labor market of the AI economy needs an employer of record for expert AI training contributions. We are building toward being one.

Our Mission

Ocular AI encodes human expertise into frontier models.

We're an AI data applied research company with one mission. To bring frontier AI into the real world.

Not by scaling compute further. By going to the source of every behaviour the next generation of models needs to learn, the experts who actually do the work, and building the infrastructure to encode what they know into training data, alignment signals, and evaluations.

Beyond Labels and Bounding Boxes

The last era of AI data was image classification and basic annotation. A million annotators in a million tabs drawing boxes on a million stop signs. That worked when models were learning what things looked like. It does not work when models need to learn what experts do.

Frontier models need to learn from the full fabric of human expertise. Voice, vision, reasoning, taste, judgment, and the subtle ways experts think through hard problems. They need real conversations, not transcripts. Real diagnoses, not multiple-choice exams. Real legal reasoning, not bar-exam answers. Real code review on real codebases, not isolated snippets.

That's what we capture. We do it two ways.

- A curated Expert Network. Doctors, lawyers, engineers, linguists, researchers, creatives, and native speakers, matched to the tasks where their expertise actually matters. Credentialed, vetted, paid commensurate with their skill, and tracked through every contribution.
- A Data Foundry. Purpose-built data infrastructure that ingests, processes, structures, and evaluates expert contributions at scale, turning raw human effort into frontier-grade datasets with quality, provenance, consent, and evaluation built in.

Neither half works without the other. A network without infrastructure produces inconsistent one-off contributions, which is the failure mode every annotation marketplace eventually hits. Infrastructure without the right network produces clean pipelines around weak signal, which is the failure mode pure synthetic-data shops eventually hit. The two together produce datasets that move benchmark numbers, and more importantly, move deployment outcomes.

We're deliberate about how each half is built. The Expert Network is identity-verified, credential-checked, and weighted toward people whose day job is the work we're capturing. Not adjacent skills. Not entry-level approximations. The Foundry is purpose-built infrastructure rather than a workflow stitched on top of generic annotation tooling. Capture stations tuned to the modality. Structured task UIs co-designed with domain leads. Automated quality gates that run before a contribution ever reaches a reviewer. Evaluation suites that mirror the production tasks the data will train models for.

This is what "absorbing tacit expertise into infrastructure" actually looks like in practice. The Network is where the expertise lives. The Foundry is the modern equivalent of the Highland Park line. The place where the practice of a craft is decomposed, structured, captured, and made repeatable at machine scale. Except the craft this time isn't bolting a chassis. It's clinical reasoning, full-duplex conversation, code-review judgment, or multilingual cadence.

The stack composes bottom-up. Domain experts, linguists, researchers, and a global workforce feed the Network. The Foundry sits in the middle, capturing, structuring, evaluating, and packaging. It emits four classes of frontier-grade output at the top. Datasets. Alignments. Evals. Benchmarks. Tasks, tools, RL environments, and rubrics are shipped alongside so the data is directly usable in a modern training pipeline.

How the stack composes

Expert Network plus Data Foundry, one stack producing frontier-grade training data.

Top tier, Expert-Level Training Data: Datasets, Alignments, Evals, Benchmarks

Middle tier, Data Foundry: Tasks, Tools, RL Environments, and Rubrics

Base tier, Elite Expert Network: Domain Experts, Linguists, Researchers, Global Workforce

The Network at the base captures expertise. The Foundry processes, structures, and evaluates it. The top tier ships the four output classes a modern training pipeline actually consumes.

The diagram is deliberately three tiers because every reduction beyond that breaks the value chain. Skip the Network and the inputs are wrong. Skip the Foundry and the inputs never become outputs that pass a domain expert's review. Skip the structured output tier and the data sits in cold storage instead of moving model performance. The whole stack has to run for any single piece to matter.

Starting With Voice

We're starting with voice data.

AI only becomes meaningful when you can naturally interact with it. Voice is the most human interface we have, and the one where today's models still fall short across accents, languages, and the ways people actually talk. The open evaluation literature is unambiguous. No open full-duplex model currently achieves both natural backchanneling and appropriate interruption behaviour simultaneously, and researchers attribute the gap to training data, not model design.^[16]

The data shape of the problem is clear in the literature too. Every major open full-duplex model since 2022, including dGSLM, SyncLLM, SALM-Duplex, and PersonaPlex, traces its real-data anchor back to a single corpus. Fisher English. Telephone conversations recorded in 2004 and sampled at 8 kHz, which caps the representable bandwidth at 4 kHz. Above that ceiling live the breath, the prosodic fall, the upper part of the 2 to 5 kHz presence band that carries vocal warmth, and the unmixed overlap of two voices in real conversation. Fisher is acoustically blind to all of it. We make the full case in a companion post on the Ocular Hi-Fi corpus, and you can browse the studio-grade dataset page for the full spec.
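The 4 kHz ceiling follows from the sampling theorem: audio sampled at a given rate can only represent frequencies up to half that rate. A minimal sketch of the arithmetic follows. The 48 kHz comparison rate is an assumption standing in for "studio-grade", not a published spec, and the real telephone channel is narrower still, so the Nyquist figure is an upper bound on what survives.

```python
def nyquist_hz(sample_rate_hz: float) -> float:
    """Highest frequency representable at a given sample rate."""
    return sample_rate_hz / 2.0


def band_coverage(band_low_hz: float, band_high_hz: float, sample_rate_hz: float) -> float:
    """Fraction of a frequency band that lies at or below the Nyquist limit."""
    ceiling = nyquist_hz(sample_rate_hz)
    covered = max(0.0, min(band_high_hz, ceiling) - band_low_hz)
    return covered / (band_high_hz - band_low_hz)


PRESENCE_BAND_HZ = (2_000.0, 5_000.0)  # the 2 to 5 kHz band named in the text

for rate in (8_000.0, 48_000.0):  # 48 kHz is an assumed studio-style rate
    coverage = band_coverage(*PRESENCE_BAND_HZ, rate)
    print(f"{rate / 1000:.0f} kHz sampling: ceiling {nyquist_hz(rate) / 1000:.0f} kHz, "
          f"presence band coverage {coverage:.0%}")
```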

We've spent the last year working hand-in-hand with some of the leading and fastest-growing AI labs building voice models. Capturing voice datasets at scale, then making that audio AI-ready. Transcribing, diarising, annotating disfluencies and prosody, labelling backchannels and barge-ins, and shaping it into formats their training pipelines can actually use. They trust us to operate as an extension of their team.

This week we're releasing our first open-source dataset. A Multi-Accent English ASR Dataset spanning 11 countries, produced end-to-end by our Expert Network and Data Foundry. It's live now. Accent and dialect coverage in particular is one of the most consequential gaps in the field. The foundational PNAS study by Koenecke et al.^[15] quantified word-error-rate disparities of major commercial ASR systems across speaker demographics. The supply of high-quality, accent-balanced training data has remained the most direct lever for closing them.

Voice Agents That Can Hold a Real Job

The same shift from academic benchmarks to economic outcomes that GDPval is forcing on text models is overdue in voice. The numbers the field publishes today are mostly word error rate on read speech and mean opinion scores (MOS) on TTS samples. Those say very little about whether a voice agent can hold a real customer-service call. The 12-minute exchanges. The interrupted transfers. The negotiated escalations. The moments where the user is upset and the agent has to read the room before responding. The companies actually running voice agents in production, like call centres, telephony platforms, and conversational commerce, care about handle time, first-contact resolution, deflection rate, and customer satisfaction. None of those show up on a WER table.

This is where the τ-bench style of evaluation starts to bite for voice.^[22] The task isn't "transcribe this clean utterance." It's "complete this end-to-end customer-service interaction, including using the right tools at the right moment, handling user objections, escalating when the policy says to, and recovering when something goes wrong." That requires training data with the same shape. Real conversations. Full-duplex audio. Labelled tool calls. Labelled escalations. Labelled emotional states. Labelled outcomes. It also requires the model to operate over long horizons. A typical service call has 50+ turns, hundreds of seconds of audio, and a session state that drifts across the call.
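One way to picture data "with the same shape" is as a record type. The sketch below is a hypothetical schema, not the Foundry's actual format: every class and field name is an assumption made for illustration. It carries per-turn timing (so overlapping speech is preserved), labelled tool calls, emotional state, escalation, and outcome.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToolCall:
    name: str
    arguments: dict
    succeeded: bool


@dataclass
class Turn:
    speaker: str                      # "agent" or "user"
    start_s: float                    # turns may overlap in time (full-duplex)
    end_s: float
    text: str
    emotion: Optional[str] = None     # labelled emotional state, if annotated
    tool_call: Optional[ToolCall] = None


@dataclass
class CallRecord:
    call_id: str
    turns: list[Turn] = field(default_factory=list)
    escalated: bool = False
    outcome: Optional[str] = None     # e.g. "resolved", "transferred"


def overlap_seconds(call: CallRecord) -> float:
    """Total time during which turns from different speakers overlap."""
    total = 0.0
    for i, a in enumerate(call.turns):
        for b in call.turns[i + 1:]:
            if a.speaker != b.speaker:
                total += max(0.0, min(a.end_s, b.end_s) - max(a.start_s, b.start_s))
    return total


# Hypothetical two-turn excerpt with a barge-in and one tool call.
example_call = CallRecord(
    call_id="demo-001",
    turns=[
        Turn("user", 0.0, 4.0, "I was charged twice for my order", emotion="frustrated"),
        Turn("agent", 3.5, 6.0, "Let me check that for you",
             tool_call=ToolCall("lookup_order", {"order_id": "A1"}, succeeded=True)),
    ],
    outcome="resolved",
)
print(overlap_seconds(example_call))
```

Keeping start and end times on every turn, rather than a strictly alternating transcript, is what lets barge-ins and backchannels survive into training data.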

Sierra has been pushing this evaluation discipline further. Their newer τ-knowledge benchmark, released in February 2026, pairs the τ-bench agentic structure with a realistic 698-document fintech knowledge base spanning roughly 195K tokens and 21 product categories, and the headline result is the one that should focus every team building voice agents for production.^[38] The best frontier model evaluated, GPT-5.2 with high reasoning, completes only around 26% of tasks end-to-end at pass^1. Sierra's own framing is unsparing.

"Current frontier models still fail at retrieving, interpreting, and acting on messy real-world documentation."
— Sierra, τ-knowledge, February 2026
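For readers unfamiliar with the pass^k notation used in that result: as the τ-bench paper describes it, pass^k estimates the chance that an agent solves the same task in all k of k independent attempts, so pass^1 is simply the average single-attempt success rate. Here is a minimal sketch of that estimator. The function name and the trial counts are made up for illustration and are not benchmark results.

```python
from math import comb


def pass_hat_k(successes_per_task: list[int], n_trials: int, k: int) -> float:
    """Average over tasks of C(c, k) / C(n, k), where c is the number of
    successful trials out of n for that task."""
    if not 1 <= k <= n_trials:
        raise ValueError("k must be between 1 and n_trials")
    denominator = comb(n_trials, k)
    # math.comb(c, k) is 0 when k > c, so unreliable tasks contribute nothing.
    total = sum(comb(c, k) for c in successes_per_task)
    return total / (denominator * len(successes_per_task))


# Hypothetical run: four tasks, four trials each.
successes = [4, 2, 0, 1]
for k in (1, 2, 4):
    print(f"pass^{k} = {pass_hat_k(successes, n_trials=4, k=k):.3f}")
```

The metric drops quickly as k grows, which is the point: an agent that succeeds only some of the time on a task is not one a production team can rely on.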

The four failure modes their qualitative analysis catalogues^[38] map almost one-to-one onto what breaks voice agents in production. Agents miss policy interdependencies and follow the user's stated order even when policy requires a different sequence (filing the dispute first, for example, so that the credit-limit increase is auto-rejected because of the pending dispute). They take user assertions at face value without verifying against system state (the user claims "my dispute was approved" and the agent applies the credit, when the dispute is still under review). They skip tool-discovery steps and then hallucinate a success response when the call has in fact come back as an error. They make unwarranted assumptions instead of asking a clarifying question. Every one of those is a real customer-service-call failure mode, not a benchmark abstraction. A voice agent that ships any of those into a real call loses the customer.
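The second failure mode, trusting the user's claim over system state, is the easiest to show in code. The sketch below is a toy illustration under stated assumptions: the status values, action names, and dispute ID are all hypothetical and do not come from the τ-knowledge benchmark or any real system.

```python
from enum import Enum


class DisputeStatus(Enum):
    UNDER_REVIEW = "under_review"
    APPROVED = "approved"
    REJECTED = "rejected"


def next_action(dispute_id: str, system_state: dict[str, DisputeStatus]) -> str:
    """Choose the agent's next step from system state, not from the user's claim."""
    actual = system_state.get(dispute_id)
    if actual is None:
        # No record found: ask rather than assume.
        return "ask_clarifying_question"
    if actual is DisputeStatus.APPROVED:
        return "apply_credit"
    # The user may say "approved", but the record says otherwise.
    return "explain_status"


# The user claims the dispute was approved; the record shows it is still under review.
state = {"D-1001": DisputeStatus.UNDER_REVIEW}
print(next_action("D-1001", state))
```

The user's assertion never appears as an input to the decision, which is the behaviour a labelled verification turn is meant to teach.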

The implication for the data layer is direct. Training data for production voice agents has to include the texture τ-knowledge captures. Multi-hop reasoning over an actual knowledge base. Conditional policy application. Tool-discovery flows where the right next call isn't given to the agent but has to be located inside a document first. Verification turns where the agent checks system state against what the user just claimed. Solution-efficiency signal, not just final-state success, because a customer-support call that takes thirty turns instead of ten is itself a failed deployment. That texture is not in the open data the field has been collecting, and it is exactly the shape our Foundry is built to capture.

What we want to see, and what we're building the data layer to make possible, is voice agents that can run a real job. A claims call, end to end. A customer onboarding, end to end. A multilingual support session that hands off to a human cleanly when it should. A two-week sales engagement carried across multiple voice and text touchpoints. These are the deployments the next generation of voice models needs to win at. The only way to train for them is on data that looks like them. Real call shape. Real durations. Real tool use. Real consequences.

Voice is the wedge. The same infrastructure (Expert Network plus Data Foundry) generalises to every modality and every domain where the bottleneck is the same. Rich, expert-grounded data, captured in the shape of the actual work models are being asked to do.

What We're Building Toward

Pillars

01
Expert-grounded, never crowdsourced. Every dataset is traceable to a credentialed expert. No second-hand annotation. No anonymous Mechanical Turk labels for tasks that require judgment.
02
Multimodal by design. Voice today, vision and reasoning next. The same Foundry pipeline (capture, structure, evaluate) applied to every modality the next generation of models needs.
03
Native at the source. Languages, accents, and dialects covered by people who actually speak them, in the environments where the speech naturally occurs.
04
Provenance, consent, and licensing built in. Every clip and every label tied to a verified contributor, an explicit consent record, and a clear license trail. The data is safe to train on, defensible to ship behind, and auditable on request.
05
Evaluations that match practice. Benchmarks co-designed by domain experts to mirror real workflows. Differential diagnosis, full-duplex dialogue, end-to-end code review. Not standardised tests that frontier models already saturate.

Where We're Headed

The next decade of AI won't be won by whoever has the most GPUs alone. It will be won by whoever pairs serious compute with the most faithfully encoded human expertise, and then deploys that model against work that actually matters in the real economy. Compute keeps scaling. The data input scales alongside it. Whichever lab moves both levers in lockstep is the lab that opens the next gap.

The labs already building that future are the ones treating data not as an annotation budget line item but as a strategic input on equal footing with compute. Sourced from the right people. Structured by purpose-built infrastructure. Evaluated against the workflows the models will actually be deployed into, not the standardised tests of the last era. That's where the frontier moves next.

Our work is to make that data layer real, at scale, across modalities and domains. Voice today, and the long-horizon agentic deployments voice unlocks. Customer service that runs end-to-end. Multilingual support that hands off cleanly. Conversational commerce that closes the loop. Vision and reasoning next, under the same playbook. Expert-grounded capture. Purpose-built infrastructure. Evaluation against the real job.

That's the work. That's Ocular AI.

If you want models that work in the real world, talk to us. If you're an expert who wants to shape the next generation of AI, join the network. And if you want to see what we've built so far, start with our open-source datasets and the Hi-Fi corpus.

Author