
Automated Data Extraction from Documents — OCR with AI

An employee spends 3 hours daily re-typing data from invoices, contracts and forms. At 17 EUR/hour that's about 13,600 EUR per year for one process. I show how to build an OCR+AI pipeline with 90%+ accuracy and who's accountable when the model makes mistakes.

One document. One supplier invoice.

An employee opens the PDF. Types in: invoice number, date, seller VAT number, line items (name, quantity, unit price, VAT, gross value), bank account, payment terms. Checks if the total adds up. Clicks "approve." Done.

Time: 8 minutes. Rate: 17 EUR/hour. Cost: 2.27 EUR per invoice.

The company processes 500 invoices per month.

Annual cost: 2.27 EUR x 500 x 12 = 13,600 EUR per year.
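
The same arithmetic as a small sketch, so you can plug in your own numbers. The function name and parameters are illustrative; the inputs are the figures used in this post.

```python
def annual_manual_cost(minutes_per_doc: float, hourly_rate: float,
                       docs_per_month: int) -> float:
    """Yearly labour cost of typing documents in by hand."""
    cost_per_doc = minutes_per_doc / 60 * hourly_rate
    return cost_per_doc * docs_per_month * 12


# Figures from the post: 8 minutes per invoice, 17 EUR/hour, 500 invoices/month
per_invoice = 8 / 60 * 17
yearly = annual_manual_cost(minutes_per_doc=8, hourly_rate=17, docs_per_month=500)

print(f"Per invoice: {per_invoice:.2f} EUR")
print(f"Per year:    {yearly:,.0f} EUR")
```

Error correction and payment delays are not modeled here, so treat the result as a lower bound.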

That's without counting errors (average 1-3% with manual entry), the cost of correcting them, and payment delays from the backlog.

Automating this process with OCR+AI costs 4,200-7,000 EUR up front plus 840-1,680 EUR per year to run. ROI: 100-200% in year 1, 600%+ from year 2 onward.

This post is a technical guide. It shows how the pipeline works, which document types are difficult, and which solution fits which scale.

/// OCR+AI PIPELINE: 6 STEPS FROM SCAN TO ERP

01 INTAKE: email, folder, scanner, form → PDF/image
02 PRE-PROCESS: DPI check, rotation, standardization → clean PDF
03 OCR: Azure / AWS / Tesseract → text + confidence
04 EXTRACTION: LLM pulls the fields defined in the schema → JSON with data
05 VALIDATION: VAT number (NIP), totals, duplicates, rules → OK / flag
06 ERP IMPORT: API or flat file into the system → posted

85% of documents auto-imported
90%+ accuracy on critical fields
6-8x faster than manual entry

Four document types — a different approach to each

There is no single "OCR pipeline." The approach depends on document structure.

Type 1: Structured documents (invoices, forms, standard prints)

Fixed structure — the "VAT number" field is always in the same place. OCR reads the text, the model extracts values from defined positions. Accuracy 90-98% for good quality scans.

Tools: Azure Document Intelligence, AWS Textract, Google Document AI. All have pre-built models for invoices — no training from scratch needed.

Challenges: invoices from small businesses often have non-standard layouts, phone scans have low resolution, and handwritten invoices are often illegible.

Type 2: Semi-structured documents (contracts, offers, correspondence)

Known semantic structure, but variable layout. "Payment terms" always appear in a contract, but can be in different locations, different font, different phrasing.

Here you need an LLM for interpretation, not just OCR for reading. Pipeline: OCR → text → LLM that understands document semantics → JSON with fields.

Accuracy for a well-configured model: 80-92% depending on document variability.

Type 3: Unstructured documents (reports, emails, notes)

Text with no predefined structure. The goal is extracting specific information (e.g., from a customer email: order amount, deadline, delivery address).

Requires a strong LLM with a good prompt. Accuracy: 70-85% — highly dependent on prompt quality and document clarity.

Type 4: Images and scans with handwritten content

This is the most commonly cited reason for "OCR won't work for us." In reality, handwriting recognition models (Azure, Google) achieve 85-94% accuracy on legible handwriting and 60-75% on illegible handwriting.

Strategy: automate legible ones, flag illegible ones for human review. Don't try to automate 100% — aim for 80% auto + 20% human review.

Pipeline: scan → OCR → extraction → validation → ERP

The architecture I deploy, step by step:

Step 1: Intake. A document arrives via one of several channels: email attachment, SharePoint/Google Drive folder, network scanner output, web form upload. n8n/Make listens and triggers the pipeline.

Step 2: Pre-processing. Check that the file is readable (DPI > 150), that it is not empty, and what format it is. Convert it to a standardized PDF or image. Correct the rotation if the scan is tilted.

Step 3: OCR. Send the file to the OCR engine (Azure Document Intelligence, AWS Textract, Tesseract for on-premise). Receive text with a confidence score for each field.

Step 4: Extraction. The LLM (GPT-4o-mini or GPT-4o depending on complexity) receives the raw OCR text plus a JSON schema with the fields to extract. It returns JSON with values and confidence.

Step 5: Validation. Business rules: is the VAT number valid (checksum), does the line item sum equal the gross total, is the date in a sensible range, does the vendor exist in the database. If a rule is violated, flag the document for human review.

Step 6: Human review queue. Documents with confidence below the threshold or with rule violations go to the review queue. The employee sees the document and the pre-filled fields and only needs to confirm or correct them.

Step 7: ERP/system import. After review: automatic import to the ERP via API or direct database write.
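
The Step 2 checks can be sketched with the standard library alone. This is a minimal illustration, not the deployed code. `preflight`, `PreflightResult` and `estimate_dpi` are hypothetical names. The DPI estimate assumes the scan covers a full A4 page width, and `width_px` is assumed to come from whatever image library you already use. Skipping the DPI check for PDFs is also an assumption, since a PDF may be born-digital.

```python
from dataclasses import dataclass
from typing import Optional

# Leading bytes ("magic numbers") of the formats this sketch accepts
MAGIC_BYTES = {
    b"%PDF-": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"II*\x00": "tiff",  # little-endian TIFF
    b"MM\x00*": "tiff",  # big-endian TIFF
}

MIN_DPI = 150            # threshold from Step 2: readable means DPI > 150
A4_WIDTH_INCHES = 8.27   # assumption: the scan covers a full A4 page width


@dataclass
class PreflightResult:
    ok: bool
    file_format: Optional[str]
    reason: str = ""


def detect_format(data: bytes) -> Optional[str]:
    """Identify the file type from its first bytes, not from its extension."""
    for magic, name in MAGIC_BYTES.items():
        if data.startswith(magic):
            return name
    return None


def estimate_dpi(width_px: int, page_width_in: float = A4_WIDTH_INCHES) -> float:
    """Effective resolution: pixels across divided by physical page width."""
    return width_px / page_width_in


def preflight(data: bytes, width_px: int) -> PreflightResult:
    if not data:
        return PreflightResult(False, None, "empty file")
    file_format = detect_format(data)
    if file_format is None:
        return PreflightResult(False, None, "unsupported format")
    # Assumption: PDFs skip the DPI check because they may not be raster scans
    if file_format != "pdf" and estimate_dpi(width_px) <= MIN_DPI:
        return PreflightResult(False, file_format, "resolution too low")
    return PreflightResult(True, file_format)
```

A rejected file would go back to the sender or to the review queue with the `reason` attached, rather than into OCR where it would produce low-confidence output.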

document_extraction_pipeline.py
```python
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Fields to extract; the descriptions are passed to the model in the prompt
INVOICE_SCHEMA = {
    "invoice_number": "string - invoice number",
    "invoice_date": "string - issue date YYYY-MM-DD",
    "due_date": "string - payment due date YYYY-MM-DD",
    "seller_name": "string - seller company name",
    "seller_vat": "string - seller VAT number",
    "buyer_name": "string - buyer company name",
    "buyer_vat": "string - buyer VAT number",
    "net_total": "number - net amount",
    "vat_total": "number - VAT amount",
    "gross_total": "number - gross total",
    "bank_account": "string - bank account number",
    "currency": "string - currency (USD/EUR/GBP)",
}


def extract_invoice_data(ocr_text: str) -> dict:
    """Ask the LLM to map raw OCR text onto INVOICE_SCHEMA."""
    schema_str = json.dumps(INVOICE_SCHEMA, indent=2)

    prompt = f"""Extract data from this invoice.
OCR text (may contain errors):
---
{ocr_text}
---
Return JSON with these fields:
{schema_str}

Rules:
- If a field cannot be found, return null
- Amounts: numbers with period as decimal separator
- Dates: YYYY-MM-DD format
- Do not guess - if uncertain, return null"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # force a JSON answer
        temperature=0,  # keep extraction as repeatable as possible
    )

    return json.loads(response.choices[0].message.content)


def validate_invoice(data: dict) -> list:
    """Return a list of business-rule violations (empty list = valid)."""
    errors = []

    # Validate totals: net + VAT must equal gross within a small tolerance
    net = data.get("net_total")
    vat = data.get("vat_total")
    gross = data.get("gross_total")
    if all(x is not None for x in [net, vat, gross]):
        calculated = round(net + vat, 2)
        if abs(calculated - gross) > 0.05:
            errors.append(f"Total mismatch: {net} + {vat} = {calculated}, gross = {gross}")

    # Basic date validation: checks the YYYY-MM-DD shape only
    date = data.get("invoice_date")
    if date and (len(date) != 10 or date[4] != "-" or date[7] != "-"):
        errors.append(f"Invalid date format: {date}")

    return errors


if __name__ == "__main__":
    ocr_text = """Invoice No: INV-2026-001
Date: 15 May 2026, Due: 29 May 2026
From: Tech Solutions Ltd, VAT: GB123456789
To: Client Corp LLC, VAT: US987654321
Service: AI System Implementation - 4500.00 USD net
VAT 20%: 900.00 USD
Total: 5400.00 USD
Bank: GB29 NWBK 6016 1331 9268 19"""

    extracted = extract_invoice_data(ocr_text)
    errors = validate_invoice(extracted)

    print("Extracted:", json.dumps(extracted, indent=2))
    if errors:
        print("Validation errors:", errors)
    else:
        print("Valid - ready to import to ERP")
```
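
Once a document is valid, the last hop is the import itself. Where the ERP has no API, a flat file is the usual fallback. Below is a minimal sketch of that export using the extracted fields from the script above. The column order, the semicolon delimiter and the name `to_flat_file` are assumptions, since every ERP dictates its own import layout.

```python
import csv
import io

# Assumed column order; a real ERP import spec defines its own layout
ERP_COLUMNS = [
    "invoice_number", "invoice_date", "due_date", "seller_vat",
    "net_total", "vat_total", "gross_total", "currency",
]


def to_flat_file(invoices: list) -> str:
    """Serialize extracted invoices as semicolon-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=ERP_COLUMNS,
        delimiter=";",
        extrasaction="ignore",  # drop extracted fields the ERP does not need
        lineterminator="\n",
    )
    writer.writeheader()
    for invoice in invoices:
        # Fields the model returned as null become empty cells
        row = {col: "" if invoice.get(col) is None else invoice[col]
               for col in ERP_COLUMNS}
        writer.writerow(row)
    return buffer.getvalue()


sample = {
    "invoice_number": "INV-2026-001", "invoice_date": "2026-05-15",
    "due_date": "2026-05-29", "seller_vat": "GB123456789",
    "net_total": 4500.0, "vat_total": 900.0, "gross_total": 5400.0,
    "currency": "USD", "bank_account": "GB29 NWBK 6016 1331 9268 19",
}
print(to_flat_file([sample]))
```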

OCR+AI solution comparison

| Solution | Accuracy (typical invoices) | Cost per 1000 pages | On-premise? | Pre-built models | Best for |
|---|---|---|---|---|---|
| Azure Document Intelligence | 93-97% | ~$15 | yes (containers) | yes (invoices, receipts) | Enterprise + SMB |
| AWS Textract | 90-95% | $7-15 | no | yes (tables, forms) | AWS stack |
| Google Document AI | 91-96% | $10-20 | no | yes (specialized) | Google stack |
| Tesseract + GPT | 75-88% | <$1 (Tesseract) + LLM | yes | no (custom prompt) | On-premise/GDPR |
| Nanonets | 88-94% | from $499/month | no | yes + autoML | SMB without devs |
| Custom fine-tune | 94-98% (narrow domain) | GPU server | yes | no | High volume |

Accuracy vs cost — finding the right point

Many companies make the mistake of targeting 100% automation and 99% accuracy. That's unnecessary and expensive.

A more pragmatic approach:

90% automation + 10% human review is often better than 95% automation with higher error risk. Why? Because the cost of a human reviewing 10% of documents is low, while the payoff in peace of mind (and data security) is high.

My production standards:
- Accuracy >= 92% for critical fields (VAT numbers, amounts, bank accounts)
- Confidence threshold: documents below 85% confidence automatically go to human review
- Automation target: 80-85% of documents with zero human intervention
- Remaining 15-20%: human review with pre-filled fields (not re-typing from scratch)
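
A rough sketch of how the 85% threshold can drive routing. It makes one assumption the standards above do not spell out: a document's confidence is taken as the lowest confidence among its critical fields. `route_document`, `RoutingDecision` and the field names are illustrative.

```python
from dataclasses import dataclass, field

# Assumed mapping of "critical fields" onto the invoice schema keys
CRITICAL_FIELDS = ("seller_vat", "gross_total", "bank_account")
CONFIDENCE_THRESHOLD = 0.85  # below this, a human looks at the document


@dataclass
class RoutingDecision:
    auto_import: bool
    reasons: list = field(default_factory=list)


def route_document(confidences: dict, validation_errors: list) -> RoutingDecision:
    """Auto-import only if every critical field is confident and no rule failed."""
    reasons = list(validation_errors)
    for name in CRITICAL_FIELDS:
        score = confidences.get(name)
        if score is None:
            reasons.append(f"{name}: missing")
        elif score < CONFIDENCE_THRESHOLD:
            reasons.append(f"{name}: confidence {score:.2f} below threshold")
    return RoutingDecision(auto_import=not reasons, reasons=reasons)


decision = route_document(
    {"seller_vat": 0.97, "gross_total": 0.62, "bank_account": 0.91},
    validation_errors=[],
)
print(decision)
```

The `reasons` list is what the reviewer sees next to the pre-filled form, so they know which field to look at first.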

Three OCR deployment traps

Trap 1: "OCR won't work because our invoices are different"

I hear this on every project. The truth: Azure Document Intelligence and similar tools are trained on millions of documents and handle significant variability. Exceptions: handwritten invoices, very low quality scans, documents in exotic languages.

Before rejecting OCR — run a proof of concept on 50 real documents. Results are almost always better than expected.
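
One way to score such a proof of concept is per-field accuracy against values a person has typed in by hand. The sketch below assumes exact-match comparison and two parallel lists of dicts (extracted vs. hand-labelled); `field_accuracy` is a hypothetical helper, and the two sample documents are made up for illustration.

```python
from collections import defaultdict


def field_accuracy(predictions: list, ground_truth: list) -> dict:
    """Share of documents in which each field exactly matches the labelled value."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for predicted, expected in zip(predictions, ground_truth):
        for name, expected_value in expected.items():
            total[name] += 1
            if predicted.get(name) == expected_value:
                correct[name] += 1
    return {name: correct[name] / total[name] for name in total}


# Illustrative data only: two documents, two fields each
labelled = [
    {"invoice_number": "INV-1", "gross_total": 5400.0},
    {"invoice_number": "INV-2", "gross_total": 120.5},
]
extracted = [
    {"invoice_number": "INV-1", "gross_total": 5400.0},
    {"invoice_number": "INV-2", "gross_total": 129.5},  # OCR misread a digit
]

for name, score in field_accuracy(extracted, labelled).items():
    print(f"{name}: {score:.0%}")
```

Reporting accuracy per field rather than per document shows whether the misses are concentrated in the fields that matter, such as amounts and VAT numbers.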

Trap 2: No business validation = errors in ERP

System extracted data from OCR. Imported to ERP. Nobody checked. Invoice with invalid VAT number went to tax filing, rounded amount is off by a penny, duplicate invoice imported twice.

Business validation (VAT number checksum, total verification, duplicate check) is mandatory. Not optional.
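
Two of these checks fit in a few lines. The sketch below uses the Polish NIP as the example VAT format (weights 6, 5, 7, 2, 3, 4, 5, 6, 7, modulo 11); other countries use different algorithms. The duplicate key (seller VAT number plus invoice number) and the in-memory `seen` set are assumptions. In production the lookup would run against the ERP or a database.

```python
NIP_WEIGHTS = (6, 5, 7, 2, 3, 4, 5, 6, 7)


def is_valid_nip(nip: str) -> bool:
    """Checksum test for a Polish NIP: 10 digits, the last one is the check digit."""
    cleaned = nip.replace("-", "").replace(" ", "")
    if len(cleaned) != 10 or not (cleaned.isascii() and cleaned.isdigit()):
        return False
    digits = [int(ch) for ch in cleaned]
    checksum = sum(w * d for w, d in zip(NIP_WEIGHTS, digits)) % 11
    # A remainder of 10 can never equal a single digit, so such numbers fail
    return checksum == digits[9]


def is_duplicate(invoice: dict, seen: set) -> bool:
    """Flag an invoice whose (seller VAT, invoice number) pair was already imported."""
    key = (invoice.get("seller_vat"), invoice.get("invoice_number"))
    if key in seen:
        return True
    seen.add(key)
    return False


seen_invoices = set()
invoice = {"seller_vat": "1234563218", "invoice_number": "INV-2026-001"}

print(is_valid_nip(invoice["seller_vat"]))   # checksum passes for this number
print(is_duplicate(invoice, seen_invoices))  # first import
print(is_duplicate(invoice, seen_invoices))  # same invoice again
```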

Trap 3: Ignoring GDPR when processing invoices and contracts

Financial documents and contracts contain personal data. Sending them to external APIs (Azure, AWS, OpenAI) requires a Data Processing Agreement (DPA). Azure and AWS have standard DPAs — sign before processing begins.

Alternative: Tesseract on-premise + Ollama (local LLM) for full data control. Accuracy is lower, but no document leaves your infrastructure, so no third-party processor is involved.
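
For the local route, the extraction call can go to Ollama's HTTP API instead of a cloud endpoint. This is a sketch under assumptions: Ollama is running on its default port 11434, and `llama3.1` stands in for whichever model you have pulled locally. `build_request` and `extract_locally` are illustrative names, and the prompt would be the same one used for the cloud model.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_NAME = "llama3.1"  # assumption: replace with a model you have pulled


def build_request(prompt: str) -> urllib.request.Request:
    """Prepare a non-streaming, JSON-mode generation request."""
    payload = {
        "model": MODEL_NAME,
        "prompt": prompt,
        "format": "json",  # ask the model for a JSON-only answer
        "stream": False,   # one response object instead of a token stream
        "options": {"temperature": 0},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_locally(prompt: str, timeout: float = 120.0) -> dict:
    """Send the prompt to the local model and parse the JSON it returns."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    # The generated text sits in the "response" field of the reply
    return json.loads(body["response"])
```

The same business validation applies to the local model's output. With lower accuracy, expect a larger share of documents to land in the review queue.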

---

I build OCR+AI pipelines tailored to your document types and systems — from simple invoice automation to complex contract processing with validation and approval workflows. Get in touch — I start with a free accuracy test on 30 of your documents. Before we deploy anything, you'll see real numbers.

/// AUTHOR
Paweł Wiszniewski – AI & Web Engineer

Senior Full-Stack Engineer & AI Architect

8+ years building AI systems, automations, and scalable web applications that reduce costs and improve operational efficiency.
