AI Data Analysis: Why Governed Metrics Beat Raw SQL Generation

AI data analysis tools generate plausible SQL but unreliable numbers. Here's why governed metrics through a semantic layer produce trustworthy results at scale.

11 min read

AI data analysis tools are everywhere. Upload a spreadsheet to Julius AI and get a chart. Ask ChatGPT's Data Analyst to find trends. Connect Databricks AI/BI to your warehouse and let it generate queries. The pitch: anyone can analyze data without writing SQL.

The pitch works for one-off exploration. It breaks when you need trustworthy numbers at scale: consistent results across tools, governed access per customer, and an audit trail for every answer. That's where most AI analytics tools stop and where a semantic layer starts.

What is AI data analysis?

AI data analysis uses large language models and machine learning to automate parts of the data analysis workflow: generating queries, identifying patterns, creating visualizations, and summarizing findings in natural language. Instead of writing SQL by hand, you describe what you want and the AI produces it.

The tools fall into three categories:

Spreadsheet AI. Upload a CSV and ask questions. Julius AI, Formula Bot, ChatGPT Data Analyst. Good for ad-hoc exploration of small datasets. No governance, no multi-tenancy, no audit trail. Your data leaves your infrastructure.

BI tool AI. Built into an existing BI platform. Tableau's AI analytics, Power BI Copilot, Domo AI, ThoughtSpot's natural language search. Better than spreadsheet tools because they query your warehouse. Still limited by the BI tool's ecosystem and licensing model.

Text-to-SQL agents. An LLM generates SQL against your warehouse schema. DataGPT, custom GPT agents, Databricks AI/BI. Most flexible, most dangerous. The agent writes arbitrary SQL. No guardrails on what it queries or how it calculates metrics.

All three share the same fundamental problem: the AI interprets your data on every query. There's no shared definition of what "revenue" means. Different tools, different phrasings, different numbers.

The accuracy problem nobody talks about

AI data analysis demos are impressive. The tool generates a chart from a natural language question in seconds. But run the same question through two different tools and compare the numbers. They don't match.

Why? Because each tool independently interprets your schema. "Revenue" might be SUM(amount) in one tool and SUM(total_price) WHERE status = 'completed' in another. Both are plausible. Neither knows that your finance team excludes refunds and trial conversions.

This isn't a model quality issue. GPT-4, Claude, and Gemini all exhibit this behavior. The model does what it's designed to do: generate plausible SQL from schema context. But "plausible" and "correct" are different things.

The accuracy problem compounds at scale:

One analyst, one question, one tool — accuracy is "good enough." You eyeball the result and know if it's in the right ballpark.
Multiple analysts, same question, different tools — the numbers diverge. Debates about "which dashboard is right" start.
Customers querying their own data — you can't tell customers "the number is approximately right." They expect precision. One wrong number and trust is gone.
AI agents querying autonomously — no human in the loop to sanity-check. Wrong numbers propagate through workflows without anyone noticing.

Text-to-SQL gives you speed. It doesn't give you trust. For production analytics, you need both.

What makes AI analytics production-grade?

The gap between a demo and a production system is five things. Most AI data analysis tools have one or two. None of the spreadsheet or text-to-SQL tools have all five.

1. Governed metric definitions

Every metric has one definition. "Revenue" is SUM(amount) WHERE status != 'refunded' AND type != 'trial'. This definition is fixed, versioned in Git, and referenced by every consumer. The AI doesn't interpret what revenue means. It queries the governed definition.

cubes:
  - name: orders
    sql_table: public.orders
    measures:
      - name: total_revenue
        sql: "CASE WHEN status != 'refunded' AND type != 'trial' THEN amount ELSE 0 END"
        type: sum
        description: "Total revenue excluding refunds and trials"
      - name: order_count
        type: count
      - name: average_order_value
        sql: amount
        type: avg
    dimensions:
      - name: status
        sql: status
        type: string
      - name: category
        sql: category
        type: string
      - name: created_at
        sql: created_at
        type: time

This is a semantic layer: a metadata layer between your warehouse and every consumer. The AI agent calls explore_schema to discover available metrics, then query to fetch governed data. It never generates SQL from scratch.

2. Consistency across every surface

The same metric definition serves dashboards, React components in your product, REST APIs, AI agents, and markdown reports. Ask for "revenue" from any surface and you get the same number. Not because the AI happened to generate the same SQL, but because the definition is fixed.

This matters for AI specifically because AI tools generate different SQL on every call. Even the same tool, same question, different session can produce different queries. Governed definitions eliminate this variance.

3. Multi-tenant access control

For B2B products, Customer A's AI agent should never see Customer B's data. This can't be a prompt instruction ("only query data for customer A"). It needs to be structural: enforced on every query regardless of what the AI does.

    security_context:
      - name: tenant_filter
        sql: "{SECURITY_CONTEXT.tenant_id} = customer_id"

Every query through a tenant's key automatically includes this filter. The AI agent can't bypass it. See RBAC for more on how this works.

4. Performance at AI scale

AI agents make more queries than humans. A human asks one question. An agent might make 10 queries to answer it: exploring the schema, trying different dimensions, following up on anomalies. Multiply by hundreds of customers' agents querying simultaneously.

Pre-aggregation caching pre-computes common rollups. Hot queries return in single-digit milliseconds. Without it, AI-scale query volumes overwhelm most warehouses.

5. Audit trail

When the number is wrong, you need to trace it. With a semantic layer, every result traces back to a versioned metric definition in Git. You can point to the exact commit that defined the calculation, see when it changed, and who approved the change. With raw text-to-SQL, the trail is "an LLM generated some SQL" with no reproducibility.

AI data analysis tools compared

Tool	Approach	Governed metrics	Multi-tenant	Audit trail	Best for
Julius AI	Spreadsheet upload + LLM	No	No	No	Ad-hoc exploration of CSVs
ChatGPT Data Analyst	File upload + code interpreter	No	No	No	One-off analysis, prototyping
Databricks AI/BI	Text-to-SQL + Unity Catalog	Partial (table-level)	Partial	Partial	Teams already on Databricks
Power BI Copilot	Natural language in Power BI	DAX measures (locked to PBI)	Complex setup	Within PBI	Microsoft ecosystem
Tableau AI	Natural language in Tableau	Tableau semantic model	Enterprise	Within Tableau	Salesforce ecosystem
ThoughtSpot	Natural language BI	Proprietary semantic layer	Enterprise	Yes	Enterprise AI-powered BI
DataGPT	Conversational analytics agent	No	No	No	Autonomous data analysis
Domo AI	BI platform with AI features	Within Domo	Within Domo	Within Domo	End-to-end data platform
Semantic layer (Bonnard)	Governed metric serving	Yes (YAML, versioned)	Structural	Full (Git-backed)	B2B analytics, AI agents, embedded

The tools in the top half are AI-first but governance-second. The ones in the bottom half (enterprise BI) have governance but lock it to their ecosystem. A standalone semantic layer provides governance without ecosystem lock-in.

When each approach makes sense

Use spreadsheet AI (Julius, ChatGPT) when:

You're exploring a dataset for the first time
Nobody else needs to reproduce the result
The data isn't sensitive
You want speed over accuracy

Use BI tool AI (Power BI Copilot, Tableau AI) when:

You're already in that BI ecosystem
Internal teams are the primary consumer
You don't need to serve AI agents or embed in products
Enterprise licensing fits your budget

Use a semantic layer when:

Multiple consumers need the same metrics (dashboards + API + agents)
Customers see the results (B2B embedded analytics)
AI agents query data autonomously
You need auditability (which definition produced this number?)
Metrics must be consistent across every tool and surface

How a semantic layer works with AI agents

The semantic layer doesn't replace AI. It governs what AI can query.

An AI agent connects via MCP or REST API. It calls explore_schema to discover available metrics and their descriptions. It calls query to fetch governed data with measures, dimensions, and filters. The semantic layer generates the SQL (correctly, for your specific warehouse dialect), executes it, and returns structured results.

Five tools are available to agents:

explore_schema: Discover cubes, measures, dimensions, segments
query: Fetch aggregated data using governed definitions
sql_query: Custom SQL for edge cases (still governed by access controls)
describe_field: Get metadata about a specific measure or dimension
visualize: Render charts directly in the conversation

The agent doesn't write SQL. It selects from governed definitions. The result is deterministic: same question, same metric definition, same answer. Always.

For a full tutorial on connecting AI agents to governed data: How to Connect an AI Agent to Your Data Warehouse.

The shift from AI-generated SQL to AI-queried metrics

The first generation of AI analytics gave agents raw warehouse access and hoped for the best. It worked in demos. It failed in production because the agent interprets business logic on every query.

The next generation separates the concerns. Data teams define governed metrics in a semantic layer. AI agents query those definitions instead of raw tables. The data team's expertise is encoded in the schema. The AI's role shifts from "guess what revenue means" to "pick the right governed metric and present the result."

This is a better division of labor. The data team defines what's true. The AI handles the interface. Trustworthy numbers with a natural language frontend.

Getting started

Cloud:

npm install -g @bonnard/cli
bon init
bon deploy

Self-hosted:

npm install -g @bonnard/cli
npx @bonnard/cli init --self-hosted
docker compose up -d
bon deploy

bon init generates agent configs for Claude Code, Cursor, and Codex so AI assistants understand your semantic layer from the first prompt. Define your metrics, run bon deploy, and bon mcp to configure AI agent connections.

Full tutorial: How to Connect an AI Agent to Your Data Warehouse. For background: What Is a Semantic Layer?.

Self-host free under Apache 2.0, or use Bonnard Cloud for managed infrastructure.

Frequently asked questions

What is AI data analysis?

AI data analysis uses large language models and machine learning to automate data querying, pattern detection, and insight generation. Instead of writing SQL manually, you ask questions in natural language and the AI produces queries, charts, and summaries. Tools range from spreadsheet analyzers (Julius AI) to enterprise BI platforms with AI features (Power BI Copilot, Tableau AI).

Is AI data analysis accurate?

For ad-hoc exploration, accuracy is "good enough." For production analytics where customers see the numbers, accuracy varies per query because the AI generates SQL from scratch each time. Two tools generating SQL for "revenue" may produce different numbers because they interpret the calculation differently. A semantic layer solves this by defining each metric once and letting AI query the definition instead of generating its own.

What is the best AI tool for data analysis?

It depends on the use case. For one-off CSV exploration: Julius AI or ChatGPT Data Analyst. For enterprise BI: Power BI Copilot or Tableau AI. For governed, multi-tenant analytics served to AI agents and B2B customers: a semantic layer approach. See the comparison table above.

How is AI analytics different from traditional BI?

Traditional BI requires analysts to write queries and build dashboards. AI analytics automates the query generation. The risk: AI-generated queries are inconsistent. Traditional BI with a semantic layer gives you consistent definitions. AI analytics with a semantic layer gives you both: consistent definitions with a natural language interface.

Can AI replace data analysts?

No. AI can generate queries and create visualizations faster than humans. It can't define what metrics should mean, decide which questions to ask, or judge whether an insight is actionable. Data analysts bring domain knowledge and judgment. AI handles the mechanical parts. A semantic layer encodes the analyst's expertise so AI can query it reliably.

What is the difference between AI analytics and text-to-SQL?

Text-to-SQL is a technique where an LLM generates SQL from natural language. AI analytics is broader: it includes text-to-SQL plus pattern detection, anomaly identification, forecasting, and visualization generation. Both share the same governance gap: the AI interprets your data on every query. A semantic layer addresses this for both.

What warehouses work with AI data analysis?

Most AI analytics tools work with common warehouses: Snowflake, BigQuery, Databricks, PostgreSQL (including Supabase, Neon, and RDS), Redshift, and DuckDB (including MotherDuck). The semantic layer generates the appropriate SQL dialect for each warehouse.

Ready to ship a customer-ready MCP?

Turn your semantic layer, dbt, or warehouse into a governed, per-customer MCP for your customers' agents.