RETURN_TO_BLOG
Updated: AI & Automation 12 min

Talk to Your Database — Text-to-SQL, the AI That Turns Questions into SQL Queries

Text-to-SQL (also called NL2SQL or conversational analytics) is a technique where you ask a question in natural language — plain English, an ordinary sentence — and an AI system translates it into a SQL query, runs it against your database, and returns the answer. Instead of waiting two days for an analyst to write a query and build a report, you ask "how many customers haven't placed an order in three months?" and get the list in seconds. This isn't ChatGPT hard-wired to a database — it's an architecture where three things are critical: context (the model has to understand your schema and business terms), safety (read-only, never write), and verification (the query is checked before it runs). A well-built system reaches over 90% accuracy. A naive one — wiring the model straight to the database — fails spectacularly.

Your data sits in a database, ERP or warehouse — but to answer a simple business question, someone has to write SQL or build a report. Text-to-SQL flips that around: you ask in plain English, the AI generates the query, runs it read-only and returns the answer with a chart. I explain how it works, why the naive approach fails, how a semantic layer pushes accuracy from 50% to over 90%, and how to build it safely.

The most common scene on a consultation: a business owner shows me a dashboard in their ERP and says "all my data is here, but when I want to answer a specific question I have to ask IT, or export to Excel and calculate by hand". The data exists. The problem is access to it. Every non-standard question requires either a pre-built report (which doesn't exist) or someone who knows SQL (who is busy). Text-to-SQL solves exactly that gap.

In this article I explain how the technology works under the hood, where it genuinely delivers, what its limits are, and how to build it so it doesn't end in a data disaster. No sugar-coating — including when it's NOT worth it.

The problem it actually solves

A company's operational data lives in databases: PostgreSQL, MySQL, MS SQL Server, a warehouse like BigQuery or Snowflake, or the database under your ERP (Optima, Subiekt, SAP). The classic path to an answer looks like this: a business user has a question → they raise it with IT or an analyst → someone writes a SQL query → builds a report → sends back the result. This loop takes hours to days, and for unusual questions it often dies at "we don't have a report for that".

The result: 80% of a company's data is practically inaccessible to the people making decisions. Not because the data doesn't exist — because the technical barrier (knowing SQL and the database structure) cuts everyone off from it except a narrow group.

Text-to-SQL removes that barrier. A question asked in natural language becomes a database query. Someone in sales, finance, or management asks directly and gets an answer — no intermediary.

How it works under the hood — the question-to-answer loop

/// TEXT-TO-SQL: THE QUESTION-TO-ANSWER LOOP

01
Question in plain English
"How much did we earn from client X in Q1?"
02
Context + schema
Schema RAG, semantic layer, few-shot examples
03
LLM generates SQL
Model writes a query in your DB dialect
04
Validation
SELECT-only, row limit, syntax check
05
Read-only execution
On a replica — zero write risk
06
Answer + chart
Result translated back to natural language

* On execution error, step 04→03 closes a self-correction loop: the model sees the error message and fixes the query.

Contrary to appearances, this is not "send the question to GPT and run whatever it returns". A production text-to-SQL system is a pipeline with several layers:

1. Understanding the question and pulling context. The system doesn't send the whole database to the model — that would be too expensive and ineffective. Instead it uses schema RAG: based on the question, it selects only the tables and columns that matter. It adds examples of similar queries from history (few-shot) and definitions of business terms from a semantic layer.

2. Generating SQL. The language model (GPT-4o, Claude, or an open-source model) writes the query in your database dialect — because SQL in PostgreSQL differs from SQL in MS SQL Server. This is the step where the context from point one decides everything.

3. Validation before execution. Before anything touches the database, the query is checked: is it definitely SELECT-only (no UPDATE/DELETE/DROP)? Does it have a row limit? Is the syntax valid? Does it reference columns the user isn't allowed to see?

4. Read-only execution. The query runs against a database replica with read-only privileges. Even if the model hallucinated a destructive command, the database physically won't let it run.

5. Translating the result. The raw result (a table of numbers) goes back to the model, which phrases the answer in natural language and — if useful — suggests a chart.

The self-correction loop. If a query returns an execution error, the system shows the model the error message and asks for a fix. That single loop can raise accuracy by double-digit percentage points — because the model learns from its own mistake in real time.

Why the naive approach fails — and what actually drives accuracy

/// WHAT DRIVES TEXT-TO-SQL ACCURACY

Naive: LLM + raw schema~50%
Model guesses column meaning and relations
+ Dynamic few-shot~70%
Examples of similar queries from history
+ Validation & self-correction~80%
Fix loop after an execution error
+ Semantic layer & glossary90%+
Business metrics & terms defined once

* Indicative figures. For reference: on the BIRD benchmark top models (GPT-4o) reach ~82%, human experts ~93%. Real company databases are often harder than the benchmark.

Here's the crux that most demos hide. Wiring the model straight to the database ("here's the schema, here's the question, write SQL") works great on a simple database with three tables and readable names. On a real company database — with a table named tdok, a column knt_gidnumer, and three different fields that could each mean "amount" — the naive model guesses. And when it guesses, it returns an answer that looks correct but is wrong. That's the most dangerous outcome: not an error, but plausible-looking falsehood.

What genuinely drives accuracy, from cheapest to most valuable:

  • Dynamic few-shot — attaching examples of similar, proven queries from history to the prompt. The model sees the pattern "this is how we calculate revenue here" and imitates it.
  • Validation and self-correction — a loop where a faulty query is fixed based on the database's error message.
  • Semantic layer — this is the game-changer. Instead of letting the model guess every time what an "active customer" or "margin" is, you define those terms ONCE, in a business glossary. The model no longer translates the raw schema — it translates the question into defined metrics. Research shows up to a 3x accuracy gain when grounding the model in a business glossary rather than schema alone.

For the record — benchmarks: on the BIRD dataset (realistic, hard databases) top models like GPT-4o reach around 82% accuracy, while a human expert reaches ~93%. That shows two things at once: the technology is already very useful, but not infallible. Which is why the architecture around the model (context, validation, semantic layer) decides whether in your company this is 55% or 92%.

Safety — the most important chapter everyone skips

Letting a language model touch a production database introduces two distinct risks that must be closed with architecture, not a promise.

Write risk. A hallucinated UPDATE or DELETE could corrupt live data. The fix isn't soft ("we'll ask the model not to do that") — it's hard: the system connects to the database through an account with read-only privileges, ideally on a separate replica. The database will physically reject any command other than SELECT. It's a layer the model can't bypass, because it doesn't depend on the model.

Data exposure risk. A query could return rows or columns a given user shouldn't see — salaries, sensitive data, other departments' data. The fix: row-level and column-level security at the database layer, plus mapping the app user's permissions onto database permissions. A salesperson asking about "revenue" sees their regions, not the whole company.

On top of that: a full audit trail. Every generated query and every result is logged. Without enforced read-only and without auditing, most NL2SQL prototypes fail a compliance review — and rightly so.

Security layerWhat it doesWhy it's essential
Read-only account + replicaPhysically blocks writes and offloads productionA hallucinated UPDATE/DELETE has no way to execute
Query validationAllows only SELECT, enforces a row limitStops destructive and overly costly queries
Row/column-level securityFilters data to the user's permissionsA salesperson won't see salaries or another department's data
Audit logLogs the question, SQL and resultCompliance, accountability, debugging wrong answers
Human-in-the-loopShows the generated SQL before runningBuilds trust and catches errors on sensitive queries

What it looks like in practice — an example

Imagine a question from someone in sales: "show the ten customers with the biggest year-over-year drop in orders". The system pulls the schema of the orders and customers tables, adds the definition of "year-over-year drop" from the semantic layer, and generates the query:

generated-query.sql
-- Question: "show the 10 customers with the biggest year-over-year drop in orders"SELECT c.name AS customer,       SUM(CASE WHEN o.year = 2025 THEN o.value ELSE 0 END) AS revenue_2025,       SUM(CASE WHEN o.year = 2026 THEN o.value ELSE 0 END) AS revenue_2026,       SUM(CASE WHEN o.year = 2026 THEN o.value ELSE 0 END)         - SUM(CASE WHEN o.year = 2025 THEN o.value ELSE 0 END) AS changeFROM orders oJOIN customers c ON c.id = o.customer_idWHERE o.year IN (2025, 2026)GROUP BY c.nameHAVING SUM(CASE WHEN o.year = 2026 THEN o.value ELSE 0 END)     < SUM(CASE WHEN o.year = 2025 THEN o.value ELSE 0 END)ORDER BY change ASCLIMIT 10;

Validation confirms: it's a SELECT, it has a LIMIT, it only references allowed tables. The query runs on the read-only replica. The result goes back to the model, and the user sees: "Here are the 10 customers with the biggest drop. The largest: Kowalski Ltd — down PLN 142,000 (−38%). Want to see the causes broken down by product category?".

The employee wrote no SQL. Didn't wait for an analyst. And — crucially — can immediately ask a follow-up.

Where text-to-SQL genuinely delivers

  • Ad-hoc business analytics. Questions that come up once and have no pre-built report: "which rep has the most dead leads?", "what's the average margin on category X this quarter?".
  • Self-service for management and non-technical teams. Leadership, sales, finance, marketing — they ask directly, no IT queue.
  • A layer over ERP/CRM. A conversational front-end on the database under Optima, Subiekt or SAP — asks for data the standard reports don't show.
  • First line of data work. Before anyone engages an analyst, they self-check 80% of questions in seconds.

What text-to-SQL still doesn't do well — honestly

  • Very complex analytical queries. Multi-level CTEs, time windows, advanced statistics — accuracy drops here and a human or a tested, pre-built report is still better.
  • A dirty, inconsistent schema. If the database has cryptic names, no foreign keys, and data in three places at once, the model will make mistakes. A semantic layer helps, but that's work to do, not magic.
  • Ambiguous questions. "Show the best customers" — best by what? Revenue, margin, frequency? A good system asks back; a weak one guesses.
  • Critical decisions without verification. A plausible-looking but wrong answer is a real risk. For high-stakes decisions, always show the generated SQL and result for approval.

These aren't reasons to give up — they're reasons to build with architecture, not a naive prompt.

What it costs

ComponentEstimateNotes
System build (one-off)PLN 8,000–30,000Depends on table count, schema quality and semantic-layer scope
Semantic layer / glossaryPart of the buildThe bulk of the work on a complex, inconsistent schema
LLM API cost (monthly)PLN 100–500Depends on question volume; open-source models cut this
Infrastructure (read-only replica)PLN 0–300/moAn existing database replica is often enough
MaintenanceLowUpdates on schema changes and new metrics

For comparison: a full-time analyst writing ad-hoc reports costs tens of thousands of zlotys a month. Text-to-SQL doesn't replace an analyst on complex projects — but it takes 80% of the repetitive, simple questions off their plate, the ones that block their time today.

My approach to deployment

When I build a text-to-SQL system for a company, the order is always the same and it starts with safety, not the model:

  1. 1.Read-only replica and permissions — first I cut off any possibility of writes. That's the foundation, not an add-on.
  2. 2.Semantic layer — I define the key business terms with the client: what an "active customer" is, how we calculate "margin", what "order fulfilled" means. This is where accuracy is born.
  3. 3.Few-shot from real queries — I build a library of proven examples from the company's history.
  4. 4.Validation and self-correction — a loop that catches errors before they reach the user.
  5. 5.Audit and human-in-the-loop — log everything, plus an option to preview the SQL on sensitive questions.

The result: management and non-technical teams ask for data directly, and IT stops being the bottleneck on every question. If you want to check whether your database is fit for text-to-SQL and what it would actually give your company — book a consultation: I review the schema, measure accuracy on your real questions, and show what can be built.

FAQ — Text-to-SQL

/// AUTHOR
Paweł Wiszniewski – AI & Web Engineer

Paweł Wiszniewski

SEO & GEO Specialist & AI Engineer

SEO/GEO specialist (10 years) and AI engineer (3 years). I build search visibility, AI systems and automations that reduce costs and improve operational efficiency.

Signal received?

Terminate
Silence

Initiate protocol. Establish connection. Let's build something loud.

> WAITING_FOR_INPUT...