Automated Data Extraction from Documents — OCR with AI
An employee spends 3 hours daily re-typing data from invoices, contracts and forms. At 17 EUR/hour that's roughly 13,600 EUR per year, for one process. I show how to build an OCR+AI pipeline with 90%+ accuracy and who's accountable when the model makes mistakes.
One document. One supplier invoice.
An employee opens the PDF. Types in: invoice number, date, seller VAT number, line items (name, quantity, unit price, VAT, gross value), bank account, payment terms. Checks if the total adds up. Clicks "approve." Done.
Time: 8 minutes. Rate: 17 EUR/hour. Cost: 2.27 EUR per invoice.
The company processes 500 invoices per month.
Annual cost: 2.27 EUR x 500 x 12 ≈ 13,600 EUR per year.
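The arithmetic above can be reproduced in a few lines, which makes it easy to plug in your own numbers. This is a minimal sketch; the function and constant names are illustrative, and the inputs are the figures quoted in the text.

```python
# Inputs quoted in the text; replace them with your own figures.
MINUTES_PER_INVOICE = 8
HOURLY_RATE_EUR = 17
INVOICES_PER_MONTH = 500


def cost_per_invoice(minutes: float, hourly_rate: float) -> float:
    """Labour cost of entering one invoice manually."""
    return minutes / 60 * hourly_rate


def annual_cost(minutes: float, hourly_rate: float, invoices_per_month: int) -> float:
    """Yearly labour cost for the whole invoice volume."""
    return cost_per_invoice(minutes, hourly_rate) * invoices_per_month * 12


per_invoice = cost_per_invoice(MINUTES_PER_INVOICE, HOURLY_RATE_EUR)
per_year = annual_cost(MINUTES_PER_INVOICE, HOURLY_RATE_EUR, INVOICES_PER_MONTH)
print(f"Per invoice: {per_invoice:.2f} EUR")
print(f"Per year:    {per_year:,.0f} EUR")
```

The annual figure is computed from the unrounded per-invoice cost. Multiplying the rounded 2.27 EUR by 6,000 invoices gives 13,620 EUR, which is why the text rounds to 13,600.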
That's without counting errors (average 1-3% with manual entry), the cost of correcting them, and payment delays from the backlog.
Automating this process with OCR+AI costs 4,200-7,000 EUR one-time + 840-1,680 EUR/year operational. ROI: year 1 = 100-200%, year 2+ = 600%+.
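A complementary way to read these numbers is the payback period. The sketch below is a rough estimate under two stated assumptions: automation removes the full 13,600 EUR of manual cost (in practice some review work remains), and the low and high ends of the one-time and operational ranges are paired with each other. The helper name `payback_months` is illustrative.

```python
def payback_months(one_time_cost: float, annual_saving: float, annual_opex: float) -> float:
    """Months until cumulative net savings cover the one-time cost."""
    net_monthly = (annual_saving - annual_opex) / 12
    if net_monthly <= 0:
        raise ValueError("operating cost exceeds savings; no payback")
    return one_time_cost / net_monthly


ANNUAL_MANUAL_COST = 13_600  # EUR, the annual manual cost computed earlier

# Assumption: the full manual cost disappears after automation.
best = payback_months(4_200, ANNUAL_MANUAL_COST, 840)
worst = payback_months(7_000, ANNUAL_MANUAL_COST, 1_680)
print(f"Payback: {best:.1f} to {worst:.1f} months")
```

Under these assumptions the one-time cost is recovered in roughly 4 to 7 months. If part of the manual effort stays in place for human review, the period lengthens accordingly.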
This post is a technical guide. It shows how the pipeline works, which document types are difficult, and which solution fits which scale.
/// OCR+AI PIPELINE: 6 STEPS FROM SCAN TO ERP
1. Email, folder, scanner, form
2. DPI check, rotation, standardization
3. Azure / AWS / Tesseract
4. LLM extracts the fields defined in the schema
5. Tax ID (NIP), totals, duplicates, rules
6. API or flat file into the system
Four document types — a different approach to each
There is no single "OCR pipeline." The approach depends on document structure.
Type 1: Structured documents (invoices, forms, standard prints)
Fixed structure — the "VAT number" field is always in the same place. OCR reads the text, the model extracts values from defined positions. Accuracy 90-98% for good quality scans.
Tools: Azure Document Intelligence, AWS Textract, Google Document AI. All have pre-built models for invoices — no training from scratch needed.
Challenges: invoices from small businesses often have non-standard layouts, phone scans have low resolution, and handwritten invoices are often hard to read.
Type 2: Semi-structured documents (contracts, offers, correspondence)
Known semantic structure, but variable layout. "Payment terms" always appear in a contract, but can be in different locations, different font, different phrasing.
Here you need an LLM for interpretation, not just OCR for reading. Pipeline: OCR → text → LLM that understands document semantics → JSON with fields.
Accuracy for a well-configured model: 80-92% depending on document variability.
Type 3: Unstructured documents (reports, emails, notes)
Text with no predefined structure. The goal is extracting specific information (e.g., from a customer email: order amount, deadline, delivery address).
Requires a strong LLM with a good prompt. Accuracy: 70-85% — highly dependent on prompt quality and document clarity.
Type 4: Images and scans with handwritten content
This is the most commonly cited reason for "OCR won't work for us." In reality, handwriting recognition models (Azure, Google) achieve 85-94% accuracy for legible handwriting and 60-75% for illegible handwriting.
Strategy: automate legible ones, flag illegible ones for human review. Don't try to automate 100% — aim for 80% auto + 20% human review.
Pipeline: scan → OCR → extraction → validation → ERP
Step-by-step of the architecture I deploy:
Step 1: Intake. A document arrives via one of several channels: email attachment, SharePoint/Google Drive folder, network scanner output, web form upload. n8n/Make listens and triggers the pipeline.
Step 2: Pre-processing. Check that the file is readable (DPI > 150) and not empty, and identify its format. Convert it to a standardized PDF or image. Correct the rotation if the scan is tilted.
Step 3: OCR. Send the file to the OCR engine (Azure Document Intelligence, AWS Textract, or Tesseract for on-premise). Receive text with a confidence score for each field.
Step 4: Extraction. An LLM (GPT-4o-mini or GPT-4o, depending on complexity) receives the raw OCR text plus a JSON schema with the fields to extract. It returns JSON with values and confidence.
Step 5: Validation. Apply business rules: is the VAT number valid (checksum), does the line item sum equal the gross total, is the date in a sensible range, does the vendor exist in the database? If a rule is violated, flag the document for human review.
Step 6: Human review queue. Documents with confidence below the threshold or with rule violations go to the review queue. The employee sees the document and the pre-filled fields, and only needs to confirm or correct them.
Step 7: ERP/system import. After review: automatic import to the ERP via API or direct database write.
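The pre-processing checks from Step 2 can be sketched as a small gate that runs before any OCR call is paid for. This is a minimal illustration, not the deployed implementation: `ScanInfo` is a hypothetical metadata record assumed to come from the intake step, the list of accepted formats is an assumption, and reading the actual DPI from a file (for example with an image library) is left out.

```python
from dataclasses import dataclass
from typing import Optional

MIN_DPI = 150  # readability threshold from Step 2 (DPI must exceed it)
ACCEPTED_FORMATS = {"pdf", "png", "jpg", "jpeg", "tiff"}  # assumed list


@dataclass
class ScanInfo:
    """Hypothetical metadata record produced by the intake step."""
    filename: str
    size_bytes: int
    dpi: Optional[int]  # None when the file carries no resolution metadata


def preprocessing_issues(scan: ScanInfo) -> list[str]:
    """Return reasons to stop before OCR; an empty list means the file is OK."""
    issues = []
    if scan.size_bytes == 0:
        issues.append("file is empty")
    extension = scan.filename.rsplit(".", 1)[-1].lower() if "." in scan.filename else ""
    if extension not in ACCEPTED_FORMATS:
        issues.append(f"unsupported format: {extension or 'none'}")
    if scan.dpi is None:
        issues.append("resolution unknown")
    elif scan.dpi <= MIN_DPI:
        issues.append(f"resolution too low: {scan.dpi} DPI")
    return issues


print(preprocessing_issues(ScanInfo("invoice.pdf", 120_000, 300)))
print(preprocessing_issues(ScanInfo("photo.jpg", 80_000, 96)))
```

Returning a list of reasons rather than a boolean lets the review queue show the employee why a file was stopped.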
```python
import json
from datetime import datetime

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Field name -> type and description, passed to the model as the target schema
INVOICE_SCHEMA = {
    "invoice_number": "string - invoice number",
    "invoice_date": "string - issue date YYYY-MM-DD",
    "due_date": "string - payment due date YYYY-MM-DD",
    "seller_name": "string - seller company name",
    "seller_vat": "string - seller VAT number",
    "buyer_name": "string - buyer company name",
    "buyer_vat": "string - buyer VAT number",
    "net_total": "number - net amount",
    "vat_total": "number - VAT amount",
    "gross_total": "number - gross total",
    "bank_account": "string - bank account number",
    "currency": "string - currency (USD/EUR/GBP)",
}


def extract_invoice_data(ocr_text: str) -> dict:
    """Ask the LLM to map raw OCR text onto INVOICE_SCHEMA."""
    schema_str = json.dumps(INVOICE_SCHEMA, indent=2)

    prompt = f"""Extract data from this invoice.

OCR text (may contain errors):
---
{ocr_text}
---

Return JSON with these fields:
{schema_str}

Rules:
- If a field cannot be found, return null
- Amounts: numbers with period as decimal separator
- Dates: YYYY-MM-DD format
- Do not guess - if uncertain, return null"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # force a JSON reply
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)


def validate_invoice(data: dict) -> list:
    """Return a list of rule violations; an empty list means the invoice passed."""
    errors = []

    # Totals: net + VAT must match gross within a small rounding tolerance
    net = data.get("net_total")
    vat = data.get("vat_total")
    gross = data.get("gross_total")
    if all(x is not None for x in [net, vat, gross]):
        calculated = round(net + vat, 2)
        if abs(calculated - gross) > 0.05:
            errors.append(f"Total mismatch: {net} + {vat} = {calculated}, gross = {gross}")

    # Date: must parse as a real calendar date in YYYY-MM-DD format
    date = data.get("invoice_date")
    if date:
        try:
            datetime.strptime(date, "%Y-%m-%d")
        except (TypeError, ValueError):
            errors.append(f"Invalid date format: {date}")

    return errors


if __name__ == "__main__":
    ocr_text = """Invoice No: INV-2026-001
Date: 15 May 2026, Due: 29 May 2026
From: Tech Solutions Ltd, VAT: GB123456789
To: Client Corp LLC, VAT: US987654321
Service: AI System Implementation - 4500.00 USD net
VAT 20%: 900.00 USD
Total: 5400.00 USD
Bank: GB29 NWBK 6016 1331 9268 19"""

    extracted = extract_invoice_data(ocr_text)
    errors = validate_invoice(extracted)

    print("Extracted:", json.dumps(extracted, indent=2))
    if errors:
        print("Validation errors:", errors)
    else:
        print("Valid - ready to import to ERP")
```
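The code above validates totals and the date, but not the VAT number checksum listed in Step 5. Every country defines its own scheme, so the sketch below picks one as an example: the Polish NIP, a 10-digit tax ID whose last digit is a control digit. Choosing NIP is an assumption made for illustration; the GB number in the sample invoice would need its own rule, and `is_valid_nip` is not part of the original code.

```python
# Weights applied to the first nine digits of a Polish NIP.
NIP_WEIGHTS = (6, 5, 7, 2, 3, 4, 5, 6, 7)


def is_valid_nip(raw: str) -> bool:
    """Check the control digit of a Polish NIP (10 digits)."""
    # Drop common separators and an optional "PL" country prefix.
    cleaned = raw.upper().replace("-", "").replace(" ", "")
    if cleaned.startswith("PL"):
        cleaned = cleaned[2:]
    if len(cleaned) != 10 or not (cleaned.isascii() and cleaned.isdigit()):
        return False
    digits = [int(ch) for ch in cleaned]
    control = sum(w * d for w, d in zip(NIP_WEIGHTS, digits)) % 11
    # A remainder of 10 cannot be a single digit, so such numbers are invalid.
    return control != 10 and control == digits[9]


print(is_valid_nip("123-456-32-18"))
print(is_valid_nip("1234563217"))
```

A failed checksum usually means an OCR misread of one digit, so it is a good trigger for the human review queue rather than an outright rejection.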
OCR+AI solution comparison
| Solution | Accuracy (typical invoices) | Cost per 1000 pages | On-premise? | Pre-built models | Best for |
|---|---|---|---|---|---|
| Azure Document Intelligence | 93-97% | ~$15 | yes (containers) | yes (invoices, receipts) | Enterprise + SMB |
| AWS Textract | 90-95% | $7-15 | no | yes (tables, forms) | AWS stack |
| Google Document AI | 91-96% | $10-20 | no | yes (specialized) | Google stack |
| Tesseract + GPT | 75-88% | <$1 (Tesseract) + LLM | yes | no (custom prompt) | On-premise/GDPR |
| Nanonets | 88-94% | from $499/month | no | yes + autoML | SMB without devs |
| Custom fine-tune | 94-98% (narrow domain) | GPU server | yes | no | High volume |
Accuracy vs cost — finding the right point
Many companies make the mistake of targeting 100% automation and 99% accuracy. That's unnecessary and expensive.
A more pragmatic approach:
90% automation + 10% human review is often better than 95% automation with higher error risk. Why? Because the cost of a human reviewing 10% of documents is low, while the gain in peace of mind (and data security) is high.
My production standards:
- Accuracy >= 92% for critical fields (VAT numbers, amounts, bank accounts)
- Confidence threshold: documents below 85% confidence automatically go to human review
- Automation target: 80-85% of documents with zero human intervention
- Remaining 15-20%: human review with pre-filled fields (not re-typing from scratch)
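The 85% confidence threshold translates into a simple routing rule. The sketch below assumes each extracted field arrives with a confidence between 0 and 1, and that a document is only as trustworthy as its weakest field; both the aggregation by minimum and the name `route_document` are assumptions for illustration, not something the tools prescribe.

```python
CONFIDENCE_THRESHOLD = 0.85  # documents below this go to human review


def route_document(field_confidences: dict[str, float], validation_errors: list[str]) -> str:
    """Return "auto_import" or "human_review" for one extracted document."""
    if validation_errors:
        return "human_review"  # a broken business rule always needs a person
    if not field_confidences:
        return "human_review"  # nothing was extracted, nothing to trust
    # Assumption: the weakest field decides for the whole document.
    if min(field_confidences.values()) < CONFIDENCE_THRESHOLD:
        return "human_review"
    return "auto_import"


print(route_document({"gross_total": 0.97, "seller_vat": 0.91}, []))
print(route_document({"gross_total": 0.97, "seller_vat": 0.62}, []))
```

Tracking the share of documents that end up in each branch shows whether the 80-85% automation target is being met, and whether the threshold needs tuning.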
Three OCR deployment traps
Trap 1: "OCR won't work because our invoices are different"
I hear this on every project. The truth: Azure Document Intelligence and similar tools are trained on millions of documents and handle significant variability. Exceptions: handwritten invoices, very low quality scans, documents in exotic languages.
Before rejecting OCR — run a proof of concept on 50 real documents. Results are almost always better than expected.
Trap 2: No business validation = errors in ERP
The system extracted data from OCR and imported it to the ERP. Nobody checked. An invoice with an invalid VAT number went into a tax filing, a rounded amount was off by a penny, and the same invoice was imported twice.
Business validation (VAT number checksum, total verification, duplicate check) is mandatory. Not optional.
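Of the three checks, duplicate detection is the one most often skipped. A minimal sketch follows, assuming that seller VAT number plus invoice number identifies an invoice; that key choice and the function names are assumptions. In production the lookup would run against invoices already stored in the ERP, not an in-memory list.

```python
def duplicate_key(invoice: dict) -> tuple[str, str]:
    """Normalise seller VAT and invoice number into a comparison key."""
    vat = "".join(str(invoice.get("seller_vat") or "").split()).upper()
    number = "".join(str(invoice.get("invoice_number") or "").split()).upper()
    return vat, number


def find_duplicates(invoices: list[dict]) -> list[dict]:
    """Return invoices whose key was already seen earlier in the list."""
    seen: set[tuple[str, str]] = set()
    duplicates = []
    for invoice in invoices:
        key = duplicate_key(invoice)
        if not all(key):
            continue  # incomplete key: cannot judge, leave it to human review
        if key in seen:
            duplicates.append(invoice)
        else:
            seen.add(key)
    return duplicates


batch = [
    {"seller_vat": "GB123456789", "invoice_number": "INV-2026-001"},
    {"seller_vat": "gb 123456789", "invoice_number": "inv-2026-001"},
    {"seller_vat": "GB123456789", "invoice_number": "INV-2026-002"},
]
print(find_duplicates(batch))
```

Normalising case and whitespace before comparing matters because OCR output for the same invoice can differ slightly between two scans.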
Trap 3: Ignoring GDPR when processing invoices and contracts
Financial documents and contracts contain personal data. Sending them to external APIs (Azure, AWS, OpenAI) requires a Data Processing Agreement (DPA). Azure and AWS have standard DPAs — sign before processing begins.
Alternative: Tesseract on-premise + Ollama (local LLM) for full data control. Accuracy is lower, but no document leaves your infrastructure, so no external processor is involved.
I build OCR+AI pipelines tailored to your document types and systems — from simple invoice automation to complex contract processing with validation and approval workflows. Get in touch — I start with a free accuracy test on 30 of your documents. Before we deploy anything, you'll see real numbers.
