Teams that live in insurance data usually have the same problem: business users can’t write
SQL, and analysts don’t have time for constant ad-hoc query requests.
Ask My Data bridges that gap. It connects to Databricks, discovers the
available tables/columns, translates your question into valid Databricks SQL, executes the
query, and returns results with a clear natural-language answer. When your question is about
files, it can also retrieve content from Databricks Volumes (PDFs/CSVs) and extract what
matters.
The problem
- SQL is the bottleneck: business users need analytics answers, but most
don’t have SQL expertise, so data teams get pulled into simple queries.
- Tribal knowledge: writing correct SQL requires knowing table names,
schemas, and how to join claims ↔ policies ↔ customers in your domain model.
- Ad-hoc work has no queue: the “quick question” becomes a time sink,
especially during busy release cycles.
- Risk of incorrect results: manual SQL attempts can produce wrong
numbers, which is worse than slow work because it can lead to wrong decisions.
The workflow: from question to answer
Ask My Data runs a tight, repeatable sequence for every question:
1. Discover schema & volumes: reads available tables/columns in your configured
catalog/schema and lists files in volumes.
(Databricks API · schema introspection)
2. Generate SQL or a file retrieval plan: interprets the question, maps it to the
right tables, and produces safe, read-only SQL (or a plan to fetch the right file).
(natural language interpretation · SQL generation)
3. Execute / retrieve: runs the query against Databricks, or downloads the file via
REST API and extracts its content.
(databricks-sql-connector · query execution)
4. Format results: shows the generated SQL for transparency, presents data in a clean
table view, and generates a short answer summary.
(result formatting · summary generation)
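To make steps 1 and 3 concrete, here is a minimal sketch of what discovery and execution could look like. It is not Ask My Data's actual code. It assumes the schema is read from Unity Catalog's information_schema.columns view and that queries run through a DB-API style cursor, which is the interface databricks-sql-connector exposes. The function names are hypothetical.

```python
from collections import defaultdict


def build_discovery_query(catalog: str, schema: str) -> str:
    """Build SQL that lists every column in one schema (assumed approach)."""
    for name in (catalog, schema):
        # Only plain identifiers are accepted, since the names are
        # interpolated directly into the SQL text.
        if not name.replace("_", "").isalnum():
            raise ValueError(f"unexpected identifier: {name!r}")
    return (
        "SELECT table_name, column_name, data_type "
        f"FROM {catalog}.information_schema.columns "
        f"WHERE table_schema = '{schema}' "
        "ORDER BY table_name, ordinal_position"
    )


def group_columns(rows):
    """Turn (table, column, type) rows into {table: [(column, type), ...]}."""
    tables = defaultdict(list)
    for table_name, column_name, data_type in rows:
        tables[table_name].append((column_name, data_type))
    return dict(tables)


def run_query(cursor, sql: str):
    """Execute SQL on any DB-API cursor and return (column names, rows)."""
    cursor.execute(sql)
    columns = [description[0] for description in cursor.description]
    return columns, cursor.fetchall()
```

With a real connection, the cursor would come from the connector; the grouped result is the kind of table and column map the SQL generation step can be grounded on.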
Example questions (copy/paste)
"How many claims are filed for each policy type?"
"Show all policies where coverage limits exceed $5 million"
"Which broker has the most claims in the last 90 days?"
"List open claims with payout greater than $50,000"
What the agent returns
- Generated SQL (or retrieval plan) so users can audit logic.
- Query Results as a readable table (or extracted file content).
- Answer Summary in plain English with key takeaways.
- Row count & notable observations to avoid “looks right” ambiguity.
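One way to picture that response is as a small record plus a plain-text table renderer. The sketch below is illustrative only: the AgentResponse class, its field names, and render_table are assumptions, not the product's real data model.

```python
from dataclasses import dataclass


@dataclass
class AgentResponse:
    """Hypothetical container for the four parts of an answer."""
    generated_sql: str
    columns: list
    rows: list
    summary: str

    @property
    def row_count(self) -> int:
        # Reported alongside the table so readers can sanity-check the result.
        return len(self.rows)


def render_table(columns, rows) -> str:
    """Render columns and rows as an aligned plain-text table."""
    cells = [[str(c) for c in columns]] + [[str(v) for v in row] for row in rows]
    widths = [max(len(line[i]) for line in cells) for i in range(len(columns))]
    lines = [
        " | ".join(cell.ljust(width) for cell, width in zip(line, widths)).rstrip()
        for line in cells
    ]
    # Separator between the header row and the data rows.
    lines.insert(1, "-+-".join("-" * width for width in widths))
    return "\n".join(lines)
```

Keeping the generated SQL and the row count in the same object as the summary means the answer never travels without the evidence behind it.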
Transparency by design
Ask My Data doesn’t treat SQL as a black box. It displays the SQL it generated so users
understand what was run (and analysts can quickly spot when a question mapped to the wrong
table relationship).
SELECT policy_type, COUNT(*) AS claim_count
FROM <catalog>.<schema>.claims
GROUP BY policy_type
ORDER BY claim_count DESC
LIMIT 50;
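The <catalog> and <schema> placeholders in the query above have to be filled in before it runs. Here is a hedged sketch of that step in Python, assuming identifiers are validated and backtick-quoted before being placed in the SQL text. The helper names qualify and claims_by_policy_type_sql are made up for illustration.

```python
import re

# Plain identifiers only: letters, digits, underscores, not starting with a digit.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def qualify(catalog: str, schema: str, table: str) -> str:
    """Return a backtick-quoted, fully qualified table name."""
    for part in (catalog, schema, table):
        if not _IDENTIFIER.match(part):
            raise ValueError(f"unsafe identifier: {part!r}")
    return f"`{catalog}`.`{schema}`.`{table}`"


def claims_by_policy_type_sql(catalog: str, schema: str, limit: int = 50) -> str:
    """Fill the placeholders of the example query shown above."""
    return (
        "SELECT policy_type, COUNT(*) AS claim_count\n"
        f"FROM {qualify(catalog, schema, 'claims')}\n"
        "GROUP BY policy_type\n"
        "ORDER BY claim_count DESC\n"
        f"LIMIT {int(limit)};"
    )
```

Calling claims_by_policy_type_sql("verticalserve", "insurance") yields the same query with the table written as `verticalserve`.`insurance`.`claims`.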
Key capabilities (and limits)
- Read-only access: only SELECT queries are generated (no writes/DDL).
- Auto-discovery: tables and columns come from the live Databricks
schema.
- Smart JOIN mapping: relationships across claims, policies, customers,
reinsurance, and submissions.
- Error handling: connection/query failures handled gracefully with
optional retry.
- Query limits: LIMIT clauses cap the rows returned so results stay manageable.
- Volume file support: can retrieve and extract PDFs/CSVs from Databricks
Volumes.
Out of scope (v1): multi-turn conversational context, chart/graph
generation, cross-schema joins, and write operations (INSERT/UPDATE/DELETE or DDL).
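The "optional retry" item in the capabilities list can be sketched as a small wrapper with exponential backoff. This is an illustration, not the product's implementation: with_retry is a hypothetical helper, and the exception types worth retrying depend on the connector, so the default tuple here is only a placeholder.

```python
import time


def with_retry(operation, attempts=3, base_delay=1.0,
               retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(), retrying transient failures with exponential backoff."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on as exc:
            last_error = exc
            if attempt < attempts:
                # Wait 1x, 2x, 4x ... the base delay between attempts.
                sleep(base_delay * 2 ** (attempt - 1))
    # Surface one clear failure instead of a raw connector traceback.
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

Errors outside retry_on, such as a malformed query, propagate immediately, since repeating them would not help.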
Tables vs volumes: two paths in one workflow
Before it generates an answer, the workflow discovers what’s available in your Databricks
catalog/schema, then takes one of two paths:
- Path A (table data): translate your question into a read-only
Databricks SQL query using the correct fully-qualified table names and joins.
- Path B (volume files): identify the file you’re asking for, download it
via Databricks REST APIs, and extract content (PDF text/sections, CSV rows/summary, or
text content).
This matters because “Ask My Data” isn’t just “chat + SQL”. It decides whether you want rows
from the claims, customers, or policy tables or content from volume files (PDFs/CSVs),
and then uses the matching execution strategy.
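A deliberately naive sketch of that routing decision follows. It assumes the question is routed to the volume path only when it names a file found during discovery, and that downloads use the Databricks Files API route /api/2.0/fs/files followed by the /Volumes/... path. The real workflow presumably interprets the question with a language model; choose_path and files_api_url are hypothetical names, and the host and volume in the usage note are placeholders.

```python
from typing import Optional, Tuple
from urllib.parse import quote


def choose_path(question: str, volume_files: list) -> Tuple[str, Optional[str]]:
    """Return ("volume", filename) if the question names a known file,
    otherwise ("table", None)."""
    lowered = question.lower()
    for name in volume_files:
        if name.lower() in lowered:
            return ("volume", name)
    return ("table", None)


def files_api_url(host: str, catalog: str, schema: str,
                  volume: str, filename: str) -> str:
    """Build the download URL for a file stored in a volume (assumed route)."""
    path = f"/Volumes/{catalog}/{schema}/{volume}/{filename}"
    # quote() keeps the slashes and percent-encodes spaces and similar characters.
    return f"https://{host}/api/2.0/fs/files{quote(path)}"
```

For example, choose_path("Summarize policy_terms.pdf", ["policy_terms.pdf"]) selects the volume path, while the claims-per-policy-type question falls through to the table path.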
SQL safety and transparency
To keep results trustworthy, the workflow is designed to be auditable:
- Generated SQL is shown: you can verify the logic before trusting the
numbers.
- Read-only constraint: only SELECT queries are allowed.
- Row limiting: sensible LIMIT defaults cap the rows returned and keep result sets small.
- Clear escalation: if the question is ambiguous or requires data outside
the configured catalog/schema, the workflow escalates for human clarification.
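The read-only and row-limiting rules above can be approximated with a conservative text check, sketched here. This assumes a keyword-based guard rather than a real SQL parser, so it may reject harmless queries, for example one whose string literal contains the word "update". The function names and the default of 50 rows are illustrative.

```python
import re

# Statements that modify data or schema; any match rejects the query.
_FORBIDDEN = re.compile(
    r"\b(insert|update|delete|merge|drop|alter|create|truncate|grant|revoke)\b",
    re.IGNORECASE,
)


def is_read_only(sql: str) -> bool:
    """Accept a single SELECT (or WITH ... SELECT) statement, nothing else."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:  # more than one statement in the text
        return False
    if not re.match(r"(?i)^(select|with)\b", statement):
        return False
    return not _FORBIDDEN.search(statement)


def ensure_limit(sql: str, default_limit: int = 50) -> str:
    """Append a LIMIT clause unless the statement already ends with one."""
    statement = sql.strip().rstrip(";").strip()
    if re.search(r"(?i)\blimit\s+\d+\s*$", statement):
        return statement + ";"
    return f"{statement}\nLIMIT {default_limit};"
```

A check like this would sit between SQL generation and execution, so a rejected statement never reaches Databricks.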
Where it fits
- Self-serve analytics: business users ask questions without needing SQL
skills.
- Faster ad-hoc answers: reduce turnaround time for “can you query this
for me?” requests.
- Transparency for analysts: the workflow shows the generated SQL so
logic can be audited.
- File Q&A: retrieve and extract relevant content from PDFs/CSVs
stored in Databricks Volumes.
Starting prompt
user_question: "How many claims are filed for each policy type?"
catalog: verticalserve
schema: insurance
For the step-by-step screenshots (question input → generated SQL → results → answer
summary), see the full use-case walkthrough on the Ask My Data use case page.
Try Ask My Data
Run the workflow in InsightStudio and ask anything about your Databricks insurance data.