May 22, 2026 · Technical

The five PO formats that break every OCR pipeline

We ran 270 real industrial-distributor POs through three off-the-shelf OCR services. About 30% came back unusable. The failures cluster into five shapes, and each one needs a different fix.

If you've watched a generic OCR product confidently extract "$5.OO" as the unit price, this post is for you. We'll cover the five PO shapes that cause those failures, what causes the failure mechanically, and what we ended up building.

1. Handwritten POs on multi-part carbon forms

Yes, these still exist. A guy named Dale at a plumbing supply house in Ohio sends 18 of them to one of our test customers every week. He writes part numbers in pencil on a triplicate form, scans the top copy, attaches the JPG to an email.

What breaks: handwriting recognition. Tesseract reads "BR-ELB-1/2" as "8R-EL8-1/Z". A "B" with a sloppy belly becomes an "8", and a "2" with a flat top becomes a "Z". A hand-drawn "0" in "10" comes back as "O", "Q", or "0" depending on the OCR engine.

How we handle it: when the deterministic extractor returns zero structured lines from a page, we send the page as a base64-encoded PNG to Claude's vision API as a fallback, with a prompt that says: "you're reading an industrial parts PO, return structured line items." Vision routinely catches what Tesseract loses on handwriting, because it reads the page in context rather than character by character. The customer pays roughly a penny per page in API cost. Worth it on Dale's eighteen POs a week.
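A minimal sketch of that fallback decision, under some loud assumptions: `deterministic_extract` and `vision_extract` are hypothetical callables standing in for the real extractor and the real API client, and the request dictionary is an illustrative shape, not the actual vision API payload.

```python
import base64

# Prompt text quoted from the description above.
PROMPT = "you're reading an industrial parts PO, return structured line items."


def build_vision_request(png_bytes, prompt=PROMPT):
    """Package a page image as base64 PNG plus the prompt.

    The dict layout is hypothetical; adapt it to the client you actually use.
    """
    return {
        "prompt": prompt,
        "image": {
            "media_type": "image/png",
            "data": base64.b64encode(png_bytes).decode("ascii"),
        },
    }


def extract_lines(page_png, deterministic_extract, vision_extract):
    """Try the deterministic extractor first; fall back to vision on zero lines.

    Returns (lines, source) so callers can track which path produced the result.
    """
    lines = deterministic_extract(page_png)
    if len(lines) == 0:
        return vision_extract(build_vision_request(page_png)), "vision"
    return lines, "deterministic"
```

Returning the source alongside the lines makes it easy to count how many pages hit the paid fallback.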

2. Scanned PDFs that pretend to be searchable

The buyer printed an email to PDF, scanned it back in, ran some low-grade OCR on the scan, and attached the resulting "searchable PDF" to a new email. The text layer in that PDF is a hallucination: the OCR that produced it interpreted noise as text. pdfplumber dutifully extracts a paragraph of nonsense and never notices.

How we handle it: we don't trust the text layer when it looks implausible. Using the per-character positions pdfplumber extracts, we check two signals: very low character density on the page, and extracted characters that don't line up with the glyph boxes detected on the raster. If we see either signal, we ignore the text layer and OCR the raster.
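Here is a rough sketch of those two signals. It assumes each character is a dict with `x0`, `x1`, `top`, and `bottom` keys (the shape pdfplumber uses for entries in `page.chars`), and that `glyph_boxes` comes from some separate, hypothetical detector run on the raster. The thresholds are placeholders, not tuned values.

```python
def char_density(chars, page_width, page_height):
    """Characters per 1,000 square points of page area."""
    area = page_width * page_height
    return 1000.0 * len(chars) / area if area else 0.0


def aligned_fraction(chars, glyph_boxes):
    """Fraction of text-layer characters whose centre falls inside a glyph box.

    glyph_boxes: iterable of (x0, top, x1, bottom) tuples from the raster.
    """
    if not chars:
        return 0.0
    hits = 0
    for c in chars:
        cx = (c["x0"] + c["x1"]) / 2
        cy = (c["top"] + c["bottom"]) / 2
        if any(x0 <= cx <= x1 and top <= cy <= bottom
               for x0, top, x1, bottom in glyph_boxes):
            hits += 1
    return hits / len(chars)


def trust_text_layer(chars, page_width, page_height, glyph_boxes,
                     min_density=0.5, min_aligned=0.8):
    """False if either signal fires; the caller should then OCR the raster."""
    dense_enough = char_density(chars, page_width, page_height) >= min_density
    lines_up = aligned_fraction(chars, glyph_boxes) >= min_aligned
    return dense_enough and lines_up
```

With pdfplumber, `chars` would be `page.chars` and the page size would come from `page.width` and `page.height`.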

3. Tables with merged cells across rows

Buyer's MRP system emits POs with a single description cell that spans three rows ("BRASS ELBOW / SCH 40 / 1/2 IN MIP"). Each of the three lines has its own quantity and price. Naive table extraction sees three lines with the same description and prices that don't seem to belong to anything.

How we handle it: we identify merged cells by checking the bounding boxes of detected text against the table's row grid. A description that spans multiple rows gets associated with the line that owns the smallest qty/price pair on each row. We also keep the raw cell positions so you can sanity-check what the parser saw.
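The bounding-box check can be sketched as follows. This is a simplified illustration, not our parser: it assumes the row grid is already known as a list of `(top, bottom)` pairs, treats a cell as spanning a row when it covers at least half the row's height (an arbitrary cutoff), and simply copies the description onto every spanned row while keeping the raw box.

```python
def rows_spanned(bbox, rows, min_overlap=0.5):
    """Indices of grid rows that a text bounding box covers vertically.

    bbox: (x0, top, x1, bottom); rows: list of (top, bottom) pairs.
    """
    _, top, _, bottom = bbox
    spanned = []
    for i, (row_top, row_bottom) in enumerate(rows):
        overlap = min(bottom, row_bottom) - max(top, row_top)
        if overlap >= min_overlap * (row_bottom - row_top):
            spanned.append(i)
    return spanned


def attach_descriptions(descriptions, row_values, rows):
    """Give each row its qty/price plus the description cell that spans it.

    descriptions: list of (text, bbox); row_values: one dict per grid row.
    The raw bbox is kept on each line for sanity-checking.
    """
    lines = [dict(values, description=None, raw_bbox=None) for values in row_values]
    for text, bbox in descriptions:
        for i in rows_spanned(bbox, rows):
            lines[i]["description"] = text
            lines[i]["raw_bbox"] = bbox
    return lines
```

A description whose box spans rows 0 through 2 ends up on all three lines, each with its own quantity and price.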

4. Buyer part numbers that don't match anything on earth

Some distributors' customers maintain their own part-number registry that has zero overlap with the distributor's QuickBooks SKUs. An "ACME-EL34" maps to your "BR-ELB-075-NPT" and there's no algorithmic way to know that.

This isn't an OCR problem per se — OCR extracts the part number correctly. It's a matching problem that masquerades as an OCR problem because operators say "OCR is broken" when really the matcher returned UNMATCHED.

How we handle it: cross-reference tables. You can pre-upload a CSV mapping the customer's part numbers to your internal SKUs. Or — and this is the better answer — the connector learns. The first time a rep manually resolves an ACME-EL34, SideQuest writes the mapping to your cross-reference CSV automatically. The next PO from Acme with the same part auto-matches. Read more in the v0.8.0 changelog.
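As a sketch of how a learning cross-reference can work, here is a small CSV-backed lookup. The column names and function names are assumptions made for illustration; they are not the connector's actual file format.

```python
import csv
import os

# Hypothetical column layout for the cross-reference file.
FIELDS = ["customer", "customer_part", "internal_sku"]


def load_xref(path):
    """Read the CSV into a {(customer, customer_part): internal_sku} dict."""
    if not os.path.exists(path):
        return {}
    with open(path, newline="") as f:
        return {(row["customer"], row["customer_part"]): row["internal_sku"]
                for row in csv.DictReader(f)}


def record_mapping(path, customer, customer_part, internal_sku):
    """Append a mapping after a rep resolves a part manually."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({"customer": customer,
                         "customer_part": customer_part,
                         "internal_sku": internal_sku})


def match_part(xref, customer, customer_part):
    """Return the internal SKU, or UNMATCHED when no mapping exists yet."""
    return xref.get((customer, customer_part), "UNMATCHED")
```

Keying on the customer as well as the part number avoids collisions when two customers use the same code for different parts.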

5. Emails where the PO is in the body, not the attachment

"Hi Marcia, please ship the following: 10x 1/2 brass elbows, 5x 3/4 ball valves, 2x roll of PTFE tape. Thanks, Joe." No attachment. No structure. No DocNumber.

How we handle it: when there's no PDF attachment, the connector parses the email body as text and runs the same line-extraction logic over it. The structural cues for "lines" change: instead of looking for a table, we look for quantity markers (10x, (10), qty 10), inch markings, the word "each", etc. The match-confidence threshold is stricter for body-text POs, which means more lines get flagged for human review, which is the right behaviour.
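A rough sketch of the quantity-marker cue, using only the three marker styles named above. The regex and the rule for where a description ends (the next marker, a sentence-ending period, or a newline) are illustrative assumptions; a real extractor would also need the inch-marking and "each" cues.

```python
import re

# Matches "10x", "(10)", or "qty 10"; exactly one group captures the number.
QTY_MARKER = re.compile(r"\b(\d+)\s*x\b|\((\d+)\)|\bqty\.?\s*(\d+)\b",
                        re.IGNORECASE)

# A description stops at a period followed by whitespace/end, or at a newline.
SENTENCE_END = re.compile(r"\.(?:\s|$)|\n")


def extract_body_lines(body):
    """Split free-text email content into {qty, description} candidates."""
    matches = list(QTY_MARKER.finditer(body))
    lines = []
    for i, m in enumerate(matches):
        qty = int(next(g for g in m.groups() if g))
        # The description runs from this marker to the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(body)
        text = SENTENCE_END.split(body[m.end():end], maxsplit=1)[0]
        lines.append({"qty": qty, "description": text.strip(" ,;:-")})
    return lines
```

Run on Joe's email above, this yields three candidates: 10 of "1/2 brass elbows", 5 of "3/4 ball valves", and 2 of "roll of PTFE tape". Each candidate still has to go through matching and review.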

The meta-lesson

If you're building an AP or PO automation tool that claims to handle "all PO formats," the only honest version of that claim has five fallbacks behind it. Off-the-shelf OCR services don't have those fallbacks because their target market is invoices, which are far more standardized than POs. Distributors get the long tail. The long tail eats generic OCR for breakfast.

See the full pipeline in action on a real scanned PO at our demo, or check the changelog for what we ship next.