Multimodal Systems

Half your data is trapped in images, scans, and audio.

We build systems that read documents, see images, and hear audio, then turn that mess into structured data you can use.

Live extraction

Raw input goes in. Typed fields come out. No rekeying.

The problem

The real world is not plain text.

The data that matters rarely arrives clean. It is a photographed receipt, a scanned contract, a voicemail, a screenshot, a PDF that is really just an image of a table. People can read all of it instantly. Software, for years, could not. So a person had to sit in the middle and retype it.

Modern models changed that. They see images, read documents, and understand speech, not as separate bolt-on tools but as one system. You can hand a model a photo of an invoice and get back the vendor, the amount, and the due date as clean fields. You can hand it a support call and get back the issue and the next action.

The hard part is turning that capability into something reliable. A demo that reads one clean invoice is easy. A system that reads ten thousand messy ones, knows when it is unsure, and never quietly invents a number that was not there, is engineering. That gap is where most multimodal projects stall.

Picture a finance team that wired a vision model straight to their inbox to read supplier invoices. It worked beautifully in the demo on three crisp PDFs. Then the real mail arrived: a faxed invoice rotated ninety degrees, a phone photo with a thumb over the total, a scanned page where the comma in 1,500 was misread as a digit and the model returned 11500. It did not flag any of these. It returned a number every time, with the same calm confidence, and the wrong ones sailed through to the ledger because nothing told anyone to look. The model was not broken. The system around it never asked whether the answer could be trusted.

What nobody tells you

The model is not the hard part. Getting it the right pixels is.

Here is the truth most teams find out the expensive way: on real documents, what you feed the model decides accuracy more than which model you pick. "Just send it to a vision model" works on a clean screenshot and falls over on a dense scan. Two teams running the exact same model get wildly different results, and the gap is almost always preprocessing: resolution, cropping, orientation, and which frames you bother to look at.

Resolution is the one nobody budgets for. A vision model does not see your image the way you do. It chops the page into patches and turns each one into tokens, and a full-size page can balloon into thousands of them, which is slower and more expensive on every single call. So the obvious move is to shrink the image first. That is also how you destroy the answer. Downsample a tax form and the model still reads the headings fine, but the eight-point digits in the table, the handwritten signature, and the stamped date all blur into mush. It does not tell you it stopped being able to read them. It just starts guessing, and the guess looks exactly like a real answer.
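The trade-off between token cost and legibility can be put into rough numbers. What follows is a minimal sketch, not a description of any particular model. It assumes the image is tiled into square patches of a fixed size (16 pixels here, a hypothetical value) with one token per patch, and that text stays readable only above a minimum glyph height (12 pixels here, also an assumption to tune against your own documents). All function names are made up for this example.

```python
import math

PATCH_PX = 16        # assumed patch edge in pixels; real models differ
MIN_GLYPH_PX = 12    # assumed smallest legible glyph height; tune on real data


def patch_tokens(width_px, height_px, patch_px=PATCH_PX):
    """Approximate token count if each patch_px x patch_px tile becomes one token."""
    return math.ceil(width_px / patch_px) * math.ceil(height_px / patch_px)


def glyph_height_px(point_size, dpi):
    """Nominal pixel height of text set at point_size on a scan at dpi (72 points per inch)."""
    return point_size * dpi / 72


def safe_scale(smallest_glyph_px, min_glyph_px=MIN_GLYPH_PX):
    """Smallest resize factor that keeps the smallest text at min_glyph_px; never upscales."""
    return min(1.0, min_glyph_px / smallest_glyph_px)


def plan_resize(width_px, height_px, smallest_glyph_px):
    """Return (new_width, new_height, estimated_tokens) for the cheapest legible size."""
    scale = safe_scale(smallest_glyph_px)
    new_w = max(1, round(width_px * scale))
    new_h = max(1, round(height_px * scale))
    return new_w, new_h, patch_tokens(new_w, new_h)


if __name__ == "__main__":
    # A 1600 x 1600 px crop whose smallest digits are 24 px tall.
    print(patch_tokens(1600, 1600))     # 10000 at full size
    print(plan_resize(1600, 1600, 24))  # (800, 800, 2500): halved, digits still 12 px tall
```

The point of planning from the smallest text on the page, not from a fixed target width, is that the shrink stops before the digits do. Shrinking further would keep cutting tokens, and that is where the guessing starts.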

Video makes this brutal. You cannot send every frame, so the standard trick is to sample, often as little as one or two frames a second on long footage. Anything that happens between the frames you grabbed never existed as far as the model is concerned. The label flashes for half a second, the forklift crosses the aisle, the defect appears on one part and is gone: miss the frame, miss the event. Uniform sampling feels safe and quietly throws away exactly the moments that mattered.
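A toy comparison shows why uniform sampling misses short events. This is a sketch under heavy assumptions: frames are represented as flat lists of grayscale values (a real pipeline would decode video with a library), the difference measure is a plain mean absolute difference, and the threshold of 10 is arbitrary. The function names are hypothetical.

```python
def mean_abs_diff(a, b):
    """Average per-pixel absolute difference between two equal-sized grayscale frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def uniform_sample(n_frames, step):
    """Baseline: keep every step-th frame index."""
    return list(range(0, n_frames, step))


def change_sample(frames, threshold=10.0):
    """Keep a frame when it differs enough from the last frame that was kept."""
    kept = []
    last = None
    for i, frame in enumerate(frames):
        if last is None or mean_abs_diff(frame, last) > threshold:
            kept.append(i)
            last = frame
    return kept


# Six tiny 2x2 frames (flattened). A label "flashes" only in frame 3.
dark = [0, 0, 0, 0]
flash = [200, 200, 0, 0]
frames = [dark, dark, dark, flash, dark, dark]

if __name__ == "__main__":
    print(uniform_sample(len(frames), 2))  # [0, 2, 4]: the flash in frame 3 is never seen
    print(change_sample(frames))           # [0, 3, 4]: the flash and the return to dark are kept
```

Both strategies keep three frames here, so the cost is the same. Only the change-based one spends its budget on the frame where something happened.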

This is also why the old debate, OCR pipeline versus end-to-end vision model, has a non-obvious answer. Traditional OCR runs in stages: detect the text, recognize the characters, then reconstruct the layout. Each stage throws away context the next one needed, so a borderline table or a multi-column page tends to come out scrambled. A vision model reads the whole page in one pass and keeps the layout in its head, which is genuinely better, until the page is too dense and you have downsized it to save money, and now it is confidently inventing rows. Neither approach is right by default. The win is in matching the preprocessing to the document, then choosing the model, not the other way around.

How we build it

We build for the messy input, not the clean one.

We design around the documents and images you actually get, with checks for when the model is guessing. These are the patterns we work with.

Built around your real inputs

A model that handles a clean PDF falls apart on a phone photo taken at an angle in bad light. We build and test against the inputs you actually receive, not the ideal ones, so quality holds in production. That means we collect the rotated scans, the glare, the coffee stains, the second page someone forgot to include, and we make those part of the test set from day one. If your worst input is a faxed receipt photographed under a desk lamp, that is the bar we tune to, because that is the one that breaks systems built only on the demo file.

Preprocessing before model choice

Before we argue about which model to use, we fix what the model actually sees. We straighten skewed pages, crop to the region that holds the answer, and pick a resolution high enough to keep small digits and handwriting legible without paying for tokens you do not need. For video we sample frames where things change instead of blindly grabbing one a second. We tune these knobs against your documents until the model gets clean pixels to work with, because the cleanest input is what makes the rest of the pipeline trustworthy.

Structured output you can trust

Free text is hard to use downstream. We make the model return typed, validated fields: dates as dates, amounts as numbers, with a defined schema, so the output drops straight into your systems. We constrain the model to that schema instead of hoping it formats things nicely, and we validate every field afterward. The checks catch a date that is not a real date, an amount that does not match the line items, a required field left blank. Anything that fails gets caught at the door, not three steps downstream in your database.
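The post-extraction checks described above might look roughly like this. It is a minimal sketch, assuming the model has already returned a dictionary of strings. The `Invoice` shape, the field names, and the error messages are all hypothetical, and constraining the model to the schema in the first place is not shown.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal, InvalidOperation


@dataclass(frozen=True)
class Invoice:
    vendor: str
    due_date: date
    total: Decimal
    line_items: tuple


def _amount(value, field, errors):
    """Parse a money string into Decimal, recording an error instead of raising."""
    try:
        amount = Decimal(str(value))
    except InvalidOperation:
        errors.append(f"{field}: not a number")
        return None
    if not amount.is_finite():
        errors.append(f"{field}: not a number")
        return None
    return amount


def validate_invoice(raw):
    """Return (Invoice, []) when every check passes, else (None, list_of_errors)."""
    errors = []

    vendor = str(raw.get("vendor") or "").strip()
    if not vendor:
        errors.append("vendor: required field is blank")

    due = None
    try:
        due = date.fromisoformat(str(raw.get("due_date") or ""))
    except ValueError:
        errors.append("due_date: not a real date")

    total = _amount(raw.get("total"), "total", errors)
    items = [_amount(v, "line_items", errors) for v in raw.get("line_items") or []]

    # Cross-field check: the stated total must equal the sum of its parts.
    if total is not None and items and None not in items and sum(items) != total:
        errors.append("total: does not match the sum of line items")

    if errors:
        return None, errors
    return Invoice(vendor, due, total, tuple(items)), []
```

Using `Decimal` instead of `float` keeps the sum comparison exact, and returning the full list of errors instead of stopping at the first gives a reviewer everything that is wrong with a document in one look.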

Honest about uncertainty

The dangerous failure is a confident wrong answer. We build confidence scoring and validation so the system flags what it is unsure about and routes it to a human, instead of guessing and moving on. The trick is that a model's own stated confidence is not enough: it will say it is sure about a number it read off a blurred cell. So we cross-check: does the total equal the sum of the parts, does the date fall in a sane range, did two passes agree. Disagreement is the signal that the system does not know, and that is exactly when it should stop and ask.
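The agreement and range checks can be sketched in a few lines. This is an illustration under assumptions, not a fixed recipe: each extraction pass is taken to be a plain dictionary, the date window (two years back, one year ahead) is arbitrary, and the names `decide` and `date_in_sane_range` are invented here. The sum check is left out to keep the example short.

```python
from datetime import date, timedelta


def disagreeing_fields(pass_a, pass_b):
    """Names of fields whose values differ between two independent extraction passes."""
    return sorted(k for k in set(pass_a) | set(pass_b) if pass_a.get(k) != pass_b.get(k))


def date_in_sane_range(value, today, max_age_days=730, max_future_days=365):
    """True when a document date is neither implausibly old nor implausibly far ahead."""
    earliest = today - timedelta(days=max_age_days)
    latest = today + timedelta(days=max_future_days)
    return earliest <= value <= latest


def decide(pass_a, pass_b, today):
    """Return ('auto', []) or ('human', reasons). Any doubt routes to a person."""
    reasons = [f"passes disagree on {name}" for name in disagreeing_fields(pass_a, pass_b)]
    due = pass_a.get("due_date")
    if isinstance(due, date) and not date_in_sane_range(due, today):
        reasons.append("due_date outside sane range")
    return ("human", reasons) if reasons else ("auto", [])


if __name__ == "__main__":
    first = {"total": "1500.00", "due_date": date(2024, 3, 15)}
    second = {"total": "11500", "due_date": date(2024, 3, 15)}
    print(decide(first, second, today=date(2024, 3, 1)))
    # ('human', ['passes disagree on total'])
```

Note that neither pass reports a confidence score here. The decision rests entirely on evidence outside the model's own opinion of itself.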

The right model for each part

Vision, speech, and language are different jobs. We combine the right model for each step instead of forcing one model to do everything, which keeps accuracy up and cost down. A small, fast model can sort and triage the easy pages, transcription handles the audio, and the heavy reasoning model is saved for the genuinely hard extractions. Routing each page to the cheapest model that can still get it right, instead of sending every page to the most expensive option by default, is how processing a large volume stays affordable.
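Cheapest-capable routing can be sketched as a small lookup. Everything specific here is assumed for illustration: the tier names, the relative costs, and the idea that an earlier triage step has already given each page a difficulty score from 0 to 10.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    cost_per_page: float   # made-up relative cost units
    max_difficulty: int    # hardest page (0 to 10) this tier is trusted with


# Hypothetical tiers; real ones would come from evaluation on your own documents.
TIERS = (
    Tier("small-fast", 1.0, 3),
    Tier("mid", 5.0, 7),
    Tier("heavy-reasoning", 25.0, 10),
)


def route_page(difficulty, tiers=TIERS):
    """Pick the cheapest tier trusted with this difficulty; fall back to the most capable."""
    capable = [t for t in tiers if t.max_difficulty >= difficulty]
    if not capable:
        return max(tiers, key=lambda t: t.max_difficulty)
    return min(capable, key=lambda t: t.cost_per_page)


def batch_cost(difficulties, tiers=TIERS):
    """Total cost of a batch when every page goes to its cheapest capable tier."""
    return sum(route_page(d, tiers).cost_per_page for d in difficulties)


if __name__ == "__main__":
    pages = [1, 2, 3, 9]                 # three easy pages, one hard one
    print(batch_cost(pages))             # 28.0 with routing
    print(25.0 * len(pages))             # 100.0 if every page went to the heavy tier
```

The `max_difficulty` values are the part that has to be earned: they should be set from measured accuracy per tier, not guessed.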

A human in the loop where it counts

For the cases that matter, the system does not decide alone. We add review steps for low-confidence or high-stakes extractions, so accuracy stays high without a person checking every single one. The reviewer sees the original image and the extracted fields side by side, with the uncertain ones highlighted, so a correction takes seconds instead of a full re-read. And every correction feeds back: the cases people fix become the examples that sharpen the next version, so the share of work that needs a human shrinks over time instead of staying flat.

"A model that confidently reads a number that was never on the page is worse than no model at all. The work is in knowing when it does not know."
Inferzo · Bending binaries to behave

What you get

Messy inputs in, clean data out.

Not a demo that reads one perfect sample. A system that handles your real volume and tells you when it is unsure.

An ingestion pipeline for your input types: images, scans, PDFs, audio

Extraction into a typed, validated schema that fits your systems

Confidence scoring so low-certainty results get flagged, not buried

A human review step for the cases that need it

Evaluation on your real data, with accuracy you can actually measure

Documentation so your team can add new input types and fields

Not sure if your documents or images are clean enough for this to work? Send us a few real samples and we will tell you straight.

Invoke us

Is this the right call

When this fits.

Good fit

Your important data arrives as images, scans, PDFs, or audio
People currently retype or transcribe that data by hand
You need structured fields out of unstructured input
The volume is high enough that manual entry is a real cost or bottleneck

Wrong call

Your data is already clean, structured text. You do not need a vision or speech model to read it
Every extraction must be 100 percent correct with no human ever checking. No model clears that bar alone
You have a handful of documents a month. A person reading them is cheaper than building this

Deployment and scale

Handles one document or a million.

The system runs where your documents already live. If they are sensitive and cannot leave your environment, the whole pipeline runs inside it. If they are already in the cloud, we process them there. Your compliance rules set the boundary, not us.

Everything is containerized. The same pipeline reads one uploaded file or a nightly batch of fifty thousand, with the heavy model work scaled up only when there is work to do, so you are not paying for idle GPUs.

Costs scale with the right model in the right place. We route simple pages to cheap, fast models and reserve the expensive ones for the hard cases, so processing a large volume does not mean a large bill.

What we settle before we begin: which fields must be exact, what confidence level triggers a human, and where the documents are allowed to be processed. Everything else follows from those three.

Ready to start

Show us what you are stuck retyping.

Send us the documents, images, or recordings your team processes by hand, and what you need out of them. We will tell you whether a multimodal system can read them reliably, and what it should return.

Invoke us