What is Text-to-SQL? Definition, Limits, and Alternatives

Text-to-SQL converts natural language into SQL queries using an LLM. Learn why it fails in production and how semantic layers provide a governed alternative.

10 min read

Text-to-SQL (also called Text2SQL, NL2SQL, or natural language to SQL) converts natural language questions into SQL queries using a large language model. You ask "what was revenue last quarter?" and the model generates SELECT SUM(amount) FROM orders WHERE created_at >= '2025-10-01'. The system executes that query against your database and returns the result.

The appeal is obvious: anyone can query data without learning SQL. The problem is equally obvious: the model is guessing. It guesses which tables to join, which columns to aggregate, which filters to apply. Sometimes it's right. Often enough, it's wrong. And in production, "often enough" is a deal-breaker.

Text-to-SQL is a useful prototyping tool. It's a dangerous production system. Understanding the gap between those two is the point of this page.

How does text-to-SQL work?

The model receives your database schema as context: table names, column types, relationships, and sometimes sample data or descriptions. When a user asks a question, the model generates SQL that it believes answers the question. The system executes that SQL against the database and returns the result.

The pipeline typically looks like this:

User question → Schema context → LLM generates SQL → Execute against DB → Return results

Some implementations add refinements:

Few-shot examples. Include example question-to-SQL pairs in the prompt so the model learns your conventions. Helps with common queries but doesn't cover edge cases.
Schema filtering. Only pass relevant tables to reduce the context window and improve accuracy. Requires knowing which tables are relevant before the model starts, which is itself a hard problem.
RAG over documentation. Retrieve business logic from docs and pass it alongside the schema. Better than nothing, but the model still writes raw SQL from scratch on every query.
Query validation. Run the generated SQL through a validator before executing. Catches syntax errors but not semantic errors (the SQL runs, but the answer is wrong).
Self-correction loops. If the query fails, pass the error back to the model and ask it to fix the SQL. Helps with syntax. Doesn't help with logic.

Even with all of these, the fundamental architecture is the same: the model generates SQL from scratch on every query. There's no reuse of known-good definitions. The same question asked twice might produce different SQL. The model is interpreting your schema, not referencing a shared source of truth.

Why does text-to-SQL fail in production?

Text-to-SQL works well in demos and benchmarks. It breaks in production for specific, predictable reasons.

Hallucinated business logic

Ask "what was revenue last quarter?" The model generates:

SELECT SUM(amount) FROM orders WHERE created_at >= '2026-01-01'

Looks reasonable. But your finance team defines revenue as SUM(amount) WHERE status != 'refunded' AND type != 'trial'. The model doesn't know this. It generates plausible SQL from column names and returns a number that's close enough to seem right but different enough to cause problems.

Ask a different model the same question and you might get:

SELECT SUM(total_price) FROM orders WHERE status = 'completed' AND quarter = 'Q1'

Different table, different column, different filter. Also plausible. Also wrong.

This isn't a model quality problem. GPT-4, Claude, and Gemini all exhibit this behavior. The model is doing what it's designed to do: generate plausible SQL from schema context. But "plausible" and "correct" are different things when your CFO is looking at the number.

Inconsistent results

The same user, the same question, different phrasing:

"What was Q1 revenue?" → SUM(amount) WHERE created_at BETWEEN '2026-01-01' AND '2026-03-31'
"Show me revenue for the first quarter" → SUM(total_price) WHERE quarter = 1
"How much did we make in Jan-Mar?" → SUM(amount) WHERE MONTH(created_at) IN (1, 2, 3)

Three queries, three different SQL statements, potentially three different numbers. Text-to-SQL has no mechanism to guarantee that semantically identical questions produce identical results. Every query is a fresh interpretation.

No access control

Text-to-SQL gives the model access to raw tables. If the model can see a table, it can query any row in that table. There's no built-in:

Row-level security. Customer A's agent can generate SQL that returns Customer B's data.
Tenant isolation. In multi-tenant B2B products, there's no structural way to scope queries to a specific tenant.
RBAC. No role-based restrictions on which metrics or dimensions different users can access.

You can patch this by adding WHERE tenant_id = ? to every generated query. But the model doesn't add this reliably. You end up building a query rewriting layer that inspects and modifies every generated SQL statement. At that point, you're building a semantic layer with extra steps.

No audit trail

When the number looks wrong, you need to trace it. With text-to-SQL, the trail is: "an LLM generated some SQL based on a prompt that included the schema and some context." You can't point to a versioned metric definition that was used. You can't reproduce the exact result because the model's output is non-deterministic. You can't prove the number was correct because correctness was never defined.

Metric governance requires a fixed, versioned definition that every consumer references. Text-to-SQL doesn't have one.

Scale problems

Text-to-SQL sends a full schema (or filtered subset) to the LLM on every query. For large databases with hundreds of tables:

Context windows fill up. You can't fit the whole schema in the prompt. Schema filtering helps but adds complexity and failure modes.
Latency increases. LLM inference takes 1-5 seconds per query. For interactive dashboards or agent workflows that make multiple queries, this compounds.
Cost scales linearly. Every query is an LLM call. At agent scale (hundreds of queries per minute across tenants), the API costs add up.
No caching. Every query generates fresh SQL and hits the database. There's no pre-aggregation layer to serve cached results for common questions.

Text-to-SQL vs semantic layer queries

	Text-to-SQL	Semantic layer
How it works	LLM generates SQL from schema context	Agent selects from governed metric definitions
Metric definitions	None (model guesses)	Centralized in YAML, version-controlled
Consistency	Varies per query and phrasing	Same result every time
Multi-tenancy	Not built in	Row-level security per tenant
Access control	None	RBAC with governance groups
Audit trail	None (SQL generated on the fly)	Full query logging with metric lineage
Performance	LLM inference + warehouse query	Pre-aggregated cache (milliseconds)
Cost at scale	LLM API cost per query + warehouse cost	Warehouse cost only (cached)
AI agent support	Direct SQL generation	Governed tools via MCP
Setup effort	Low (connect LLM to schema)	Medium (define metrics in YAML)

The tradeoff is real: text-to-SQL is faster to set up. A semantic layer is more work upfront but reliable in production. The question is where you are on the spectrum from prototype to production.

When is text-to-SQL the right choice?

Text-to-SQL isn't always wrong. It's useful when:

Internal exploration. A data analyst wants to quickly explore a new dataset. The queries are throwaway. Consistency doesn't matter. No customers see the results.
Prototyping. You're validating whether an AI analytics feature is worth building. Text-to-SQL gets you a demo in hours. Just don't ship the demo.
Ad-hoc questions against well-documented schemas. If your schema is small, well-named, and well-documented, the model generates better SQL. Works best with fewer than 20 tables.
One-off analysis. A PM asks "how many users signed up last week?" The analyst uses text-to-SQL to get a quick answer. Nobody builds a dashboard for it.

Text-to-SQL is wrong when:

Customers see the results. Inconsistent numbers erode trust.
Multiple consumers query the same metrics. Dashboards, APIs, and agents all need the same revenue number.
Multi-tenancy is required. B2B products need structural tenant isolation.
Auditability matters. Finance, compliance, or customer-facing use cases where you need to trace where a number came from.
Scale. Hundreds of queries per minute across tenants. LLM inference latency and cost become blockers.

The semantic layer alternative

A semantic layer replaces the "LLM generates SQL" step with "agent queries governed definitions." Instead of interpreting column names, the agent selects from pre-defined metrics that have fixed calculations, access controls, and caching.

cubes:
  - name: orders
    sql_table: public.orders
    measures:
      - name: total_revenue
        sql: "CASE WHEN status != 'refunded' AND type != 'trial' THEN amount ELSE 0 END"
        type: sum
    dimensions:
      - name: status
        sql: status
        type: string
      - name: created_at
        sql: created_at
        type: time

The agent calls explore_schema to discover available metrics, then query to fetch data. It never writes SQL. The semantic layer generates the SQL, which is deterministic and consistent. Read the full architecture in What Is an Agentic Semantic Layer?.

For a step-by-step tutorial replacing text-to-SQL with governed MCP queries, see How to Connect an AI Agent to Your Data Warehouse.

Frequently asked questions

What is text-to-SQL in simple terms?

Text-to-SQL is a technology that lets you ask questions about data in plain English and get SQL queries back. An AI model reads your database structure, understands your question, and writes the SQL to answer it. It's like having a junior analyst who can write SQL but doesn't know your business rules.

Is text-to-SQL accurate?

On benchmarks like Spider and BIRD, top models achieve 70-85% execution accuracy. In production with real business schemas, accuracy is lower because benchmarks don't test for business logic correctness. The SQL might execute successfully but return the wrong number because the model used the wrong revenue definition. "Runs without errors" and "returns the right answer" are different bars.

What are the best text-to-SQL tools?

Popular tools include Text2SQL.ai, Vanna.ai, Defog, and the text-to-SQL capabilities built into ChatGPT, Claude, and Gemini. Enterprise platforms like ThoughtSpot and Databricks AI/BI include text-to-SQL as a feature. For governed production use, a semantic layer approach (Cube, Bonnard) provides the natural language interface with metric governance underneath.

Can text-to-SQL replace a data analyst?

No. Text-to-SQL generates queries but doesn't understand business context. It can't tell you which metric to look at, why a number changed, or what action to take. It's a query tool, not an analysis tool. Data analysts bring domain knowledge, judgment, and the ability to ask the right questions. Text-to-SQL helps them write SQL faster.

How is text-to-SQL different from a semantic layer?

Text-to-SQL generates SQL from scratch on every query. A semantic layer pre-defines metrics and lets consumers (including AI agents) query those definitions. Text-to-SQL is interpretation. A semantic layer is definition. The practical difference: text-to-SQL gives you plausible answers, a semantic layer gives you correct, consistent, auditable answers.

What is text-to-SQL AI?

Text-to-SQL AI refers to using large language models (GPT-4, Claude, Gemini, open-source models) to convert natural language questions into SQL queries. The AI model takes your database schema as context and generates SQL that it believes answers the question. The term is used interchangeably with "natural language to SQL," "NL2SQL," and "conversational SQL."

Does text-to-SQL work with any database?

Most text-to-SQL implementations work with any SQL database: PostgreSQL, MySQL, Snowflake, BigQuery, Databricks, SQL Server, and others. The model generates SQL in the appropriate dialect based on the database type. Some tools support multiple dialects from the same interface.

Related terms

Governed metrics for every surface.

Define your semantic layer in YAML. Query it from any AI agent, dashboard, or SDK. Ship analytics your customers can trust.