Rami Krispin posted very nice slides and notebooks last week on the topic of using an LLM to query a SQL database.
(1/2) I had the pleasure of presenting today at the ODSC East 2025 conference about querying data with LLMs.
Creating an AI agent (which is a fancy name for three Python functions) is easier than it sounds. At its core, all you need: 👇🏼
#python #sql #llm #ai — Rami Krispin (@ramikrispin.bsky.social) May 16, 2025 at 3:23 AM
Krispin showed how to query a DuckDB table in three steps:
- Formulate a question about the data, in plain English.
- Use an LLM to translate the question into a SQL query.
- Execute the SQL query.
I love this because Krispin (a) shows us the code and the prompts and (b) does not try to chase the hype and go full agentic™. As developers, we stay in the loop, debugging and building intuition about how the LLM behaves. Once we have confidence in the solution, we can build evals and automate, but that all comes later.
Krispin’s notebooks work with either OpenAI’s GPT-4.1 or Google’s Gemini 2.0 Flash. My initial thought was to do a quick translation of Krispin’s code to work with Anthropic’s Claude Sonnet 3.7, but along the way I realized I could extend his work in a few relatively easy ways.
- I use a non-English dataset to experiment with Claude’s ability to translate my English questions implicitly.
- I validate the LLM’s output as proper SQL with sqlglot.
- I use prompt caching to reduce the cost of asking multiple questions about the same dataset.
- I use Pandas to get the table schema in JSON format instead of custom Python.
What this post does not do is show the basics of calling an LLM API or how to iterate your way to a good prompt; for that, read Krispin’s work first.
Setup
The data
We’ll use a dataset of street trees in the city of Barcelona, from the Barcelona Open Data Service. Trees are a bit more uplifting than the crime dataset Krispin used and, more importantly, this dataset’s text column names are in Catalan. We’ll see that Claude Sonnet 3.7 is able to implicitly bridge between the Catalan table schema and my English questions.
Prior to running the code below, I downloaded the CSV for the 2025 Q1 dataset and loaded it into a persistent DuckDB database with a single table called `barcelona_trees`. Our first step is to connect to that DB and look at a couple of rows of data.
Each row represents a tree in the city of Barcelona. A tree is described by its location (address, latitude/longitude, park name, neighborhood, district), species name (in Latin, Spanish, and Catalan), date of planting, watering details, etc.
Code
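Here is a minimal sketch of that step; the database file name barcelona_trees.duckdb is my assumption.

import duckdb

# Connect to the persistent DuckDB database (file name is hypothetical).
con = duckdb.connect("barcelona_trees.duckdb")

# Preview two rows, transposed so the many columns read vertically.
con.execute("SELECT * FROM barcelona_trees LIMIT 2").df().T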
|  | 0 | 1 |
|---|---|---|
codi | 0028076AR | 0031530AR |
x_etrs89 | 432003.084 | 432715.622 |
y_etrs89 | 4582669.092 | 4584894.08 |
latitud | 41.392682 | 41.412782 |
longitud | 2.186647 | 2.194922 |
tipus_element | ARBRE VIARI | ARBRE VIARI |
espai_verd | None | None |
adreca | Av Meridiana, 13 | C Concili de Trento, 76 |
cat_especie_id | 126 | 126 |
cat_nom_cientific | Platanus × acerifolia | Platanus × acerifolia |
cat_nom_castella | Plátano | Plátano |
cat_nom_catala | Plàtan | Plàtan |
categoria_arbrat | EXEMPLAR | EXEMPLAR |
data_plantacio | NaT | NaT |
tipus_aigua | None | None |
tipus_reg | SENSE INFORMAR | SENSE INFORMAR |
geom | POINT (432003.083645579 4582669.09160047) | POINT (432715.621632449 4584894.07964375) |
catalogacio | None | None |
codi_barri | 66 | 72 |
nom_barri | EL PARC I LA LLACUNA DEL POBLENOU | SANT MARTÍ DE PROVENÇALS |
codi_districte | 10 | 10 |
nom_districte | SANT MARTÍ | SANT MARTÍ |
Get the table schema
For an LLM to give us SQL code that we can run directly against our data table, the LLM needs to know the schema of the table. It’s not obvious what format is best for this, but Claude told me to use stringified JSON, so let’s go with that. 🤷‍♂️
Right off the bat, we’ll encapsulate the code in a function so we can re-use it easily later. After we define the function, we use it and print the first five entries of the schema, for illustration.
Code
def get_table_schema(con, table: str) -> str:
    """Return the schema for a database table in stringified JSON."""
    sql = f"""
        SELECT
            column_name,
            data_type,
            is_nullable,
            column_default AS default_value,
            numeric_scale
        FROM information_schema.columns
        WHERE table_name = '{table}'
    """
    schema = con.execute(sql).df()
    schema = schema.to_json(orient="records", indent=2)
    return schema

schema = get_table_schema(con, table='barcelona_trees')
print(schema[:690])
[
{
"column_name":"codi",
"data_type":"VARCHAR",
"is_nullable":"YES",
"default_value":null,
"numeric_scale":null
},
{
"column_name":"x_etrs89",
"data_type":"DOUBLE",
"is_nullable":"YES",
"default_value":null,
"numeric_scale":0
},
{
"column_name":"y_etrs89",
"data_type":"DOUBLE",
"is_nullable":"YES",
"default_value":null,
"numeric_scale":0
},
{
"column_name":"latitud",
"data_type":"DOUBLE",
"is_nullable":"YES",
"default_value":null,
"numeric_scale":0
},
{
"column_name":"longitud",
"data_type":"DOUBLE",
"is_nullable":"YES",
"default_value":null,
"numeric_scale":0
},
Call the LLM
This is where things start to get interesting.
First, let’s talk about the model and the hyperparameters (lines 4-6 in the code block below). As I mentioned above, we’ll use Anthropic’s Claude Sonnet 3.7 model. We limit it to 1,000 tokens of output, which is far more than we need for our simple database but still good practice. Finally, we set the temperature to 0 because our task requires the single best answer from the LLM, i.e., the answer built from the highest-probability chain of next tokens.
The next bit is the system prompt, where the rubber really meets the road. First, we prime the LLM with its general task, including the details that we’re using a DuckDB database and that we want only clean SQL (lines 8-17). In his demo, Krispin struggled to get Gemini to refrain from adding markdown backticks; Claude Sonnet 3.7 also includes commentary and markdown by default but simple prompting seems to be enough to get Claude to behave as we want.
In the second part of the system prompt, we pass the name of the table and the schema (lines 18-26).
At the end of the system prompt, we enable prompt caching to save money on repeated questions about the same table. Generally, Anthropic charges $3 per million tokens of input to Claude Sonnet 3.7, but a cache read is 10x cheaper at $0.30 per million tokens. Everything in the input up to the `cache_control` bit (line 24) will be cached. Note that Anthropic does not cache inputs with fewer than 1,024 tokens, and it skips caching silently (which is confusing). For simple tables, the schema may not reach this threshold, although with this example it already does.
Finally, we pass the user’s question as the `user_prompt` (lines 27-37).
We return the LLM’s output, of course, but also the usage statistics for tracking and debugging.
Code
def generate_llm_sql(client, table: str, schema: str, user_prompt: str):
    """Translate a natural language question to a SQL query."""
    response = client.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=1000,
        temperature=0,
        system=[
            {
                "type": "text",
                "text": (
                    "You are an AI assistant tasked with generating SQL queries "
                    "to answer questions about the data in a given DuckDB table. "
                    "Return only clean SQL code; do not include additional "
                    "commentary or even markdown formatting."
                ),
            },
            {
                "type": "text",
                "text": (
                    f"Here is the schema for the DuckDB table '{table}'."
                    f"\n\n{schema}"
                ),
                "cache_control": {"type": "ephemeral"},
            },
        ],
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": user_prompt,
                    }
                ],
            },
        ],
    )
    return response.content[0].text, response.usage
Now let’s spin up the Anthropic SDK client and call our new function with an easy question:
How many trees are there in Barcelona?
Code
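A minimal sketch of that call; `anthropic.Anthropic()` picks up the API key from the `ANTHROPIC_API_KEY` environment variable, and the names `llm_sql` and `llm_usage` match how they are used later in the post.

import anthropic

# The SDK client reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

llm_sql, llm_usage = generate_llm_sql(
    client,
    table="barcelona_trees",
    schema=schema,
    user_prompt="How many trees are there in Barcelona?",
)
print(llm_sql)
print(llm_usage)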
SELECT COUNT(*) FROM barcelona_trees
Usage(cache_creation_input_tokens=1233, cache_read_input_tokens=0, input_tokens=15, output_tokens=11, server_tool_use=None)
Sure, it’s an easy question but that looks like pretty good SQL! In the usage info, note that most of the input indeed comes from writing or reading the cache, as we hoped.1 There are only 15 non-cached input tokens, which keeps the cost down.
Estimate the cost
💰 Speaking of cost…
The usage data about tokens alone don’t tell us how much we’re spending but it’s easy to check. Anthropic’s prompt caching page lists the current costs per million tokens of input, output, cache write, and cache read.
Code
from anthropic.types import Usage

def estimate_llm_cost(llm_usage: Usage) -> float:
    """Estimate the cost of a call to the Claude Sonnet 3.7 API.

    Prices from https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching#pricing on May 19, 2025.
    """
    input_cost = 3 * llm_usage.input_tokens
    output_cost = 15 * llm_usage.output_tokens
    cache_write_cost = 3.75 * llm_usage.cache_creation_input_tokens
    cache_read_cost = 0.3 * llm_usage.cache_read_input_tokens
    total_cost_dollars = (
        input_cost + output_cost + cache_write_cost + cache_read_cost
    ) / 1e6
    return total_cost_dollars
print(f"Estimated LLM call cost: ${estimate_llm_cost(llm_usage):.4f}")
Estimated LLM call cost: $0.0048
Validate the LLM output
The next step in our pipeline is to validate the LLM’s output. In this post, we’re only going to check that the output is proper DuckDB SQL but in the wild this would be a good place for all the standard sanitization checks you would do when anybody wants to access your DB. In principle, there’s nothing stopping our favorite LLM from suddenly turning into Little Bobby Tables and deleting all of our data.
To check that Claude’s output is valid DuckDB SQL, we’ll use the sqlglot package. In the output below, we see that the parsed query is unchanged, which is a success.
Code
import sqlglot
from sqlglot.errors import ParseError

def validate_sql(sql: str) -> str:
    """Validate that a SQL query string parses as proper (DuckDB) SQL."""
    try:
        parsed_sql = sqlglot.parse_one(sql, dialect="duckdb")
    except ParseError as e:
        # Re-raise so an invalid query never reaches the database.
        raise ValueError(f"Invalid SQL: {e}") from e
    return parsed_sql.sql(dialect="duckdb")
parsed_sql = validate_sql(llm_sql)
print(parsed_sql)
SELECT COUNT(*) FROM barcelona_trees
Execute the SQL query
The last step in the pipeline is to run the validated SQL query on our database. Looks like Barcelona has almost 150,000 trees!
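For completeness, a sketch of that final step:

# Execute the validated SQL and fetch the result as a DataFrame.
result = con.execute(parsed_sql).df()
result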
Put it all together
Let’s wrap all of the above logic into a single function that we can call with a question about our data.
First, we define a dataclass to hold the return objects. We don’t just want the answer; for debugging and building intuition, we need to see the LLM’s output, the validated SQL, the token counts, and estimated cost.
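The dataclass itself isn’t shown above, so here is a sketch consistent with how `llm_sql_query` constructs it below:

from dataclasses import dataclass

import pandas as pd
from anthropic.types import Usage

@dataclass
class LlmQueryOutput:
    """Everything we want back from a single question."""
    result: pd.DataFrame     # the query result
    llm_sql: str             # raw LLM output
    parsed_sql: str          # validated SQL
    llm_usage: Usage         # token counts
    llm_cost_dollars: float  # estimated cost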
Next, we package all of the previous steps into a single function, `llm_sql_query`.
Code
def llm_sql_query(client, con, table: str, question: str) -> LlmQueryOutput:
    """Answer a user's natural language question about a database table."""
    # Get the table schema.
    schema = get_table_schema(con, table)
    # Call the LLM.
    llm_sql, llm_usage = generate_llm_sql(
        client, table, schema, user_prompt=question
    )
    llm_cost_dollars = estimate_llm_cost(llm_usage)
    # Validate the LLM output as proper SQL.
    parsed_sql = validate_sql(llm_sql)
    # Run the query on the DB.
    result = con.execute(parsed_sql).df()
    # Package and return the result.
    out = LlmQueryOutput(result, llm_sql, parsed_sql, llm_usage, llm_cost_dollars)
    return out
This function is good but still a little annoying because it has three parameters that are always the same as long as we’re working with the same database. To simplify even further, we can use Python’s `functools.partial` tool to freeze those first three parameters. The `query_trees` function below takes only a question, which is about as simple as we can get.
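That wrapper is one line; a sketch, assuming `client` and `con` are still in scope:

from functools import partial

# Freeze the client, connection, and table; only the question varies.
query_trees = partial(llm_sql_query, client, con, "barcelona_trees")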
Kick the tires
Let’s make our dev question a little harder by asking Claude to group the tree count by district (`nom_districte`).
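The call itself is a one-liner; the exact question wording here is my guess:

out = query_trees("How many trees are in each district?")
print(out.parsed_sql)
out.result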
SELECT nom_districte, COUNT(*) as tree_count
FROM barcelona_trees
GROUP BY nom_districte
ORDER BY tree_count DESC;
|  | nom_districte | tree_count |
|---|---|---|
0 | SANT MARTÍ | 29586 |
1 | EIXAMPLE | 21445 |
2 | SANTS - MONTJUÏC | 15286 |
3 | SANT ANDREU | 15167 |
4 | NOU BARRIS | 13760 |
5 | SARRIÀ - SANT GERVASI | 13369 |
6 | LES CORTS | 12415 |
7 | HORTA - GUINARDÓ | 12384 |
8 | GRÀCIA | 6995 |
9 | CIUTAT VELLA | 6570 |
10 | None | 3 |
Looks pretty good! Claude even put the counts in descending order, which is a nice touch. Let’s up the ante even further by asking Claude to group the counts by year, which must be extracted from the planting date column (`data_plantacio`).
Code
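Again, the question wording is my reconstruction, echoing the phrase quoted later in the post:

out = query_trees("How many trees were planted in each year?")
print(out.parsed_sql)
out.result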
SELECT EXTRACT(YEAR FROM data_plantacio) AS year,
COUNT(*) AS number_of_trees
FROM barcelona_trees
WHERE data_plantacio IS NOT NULL
GROUP BY year
ORDER BY year;
|  | year | number_of_trees |
|---|---|---|
0 | 13 | 2 |
1 | 16 | 1 |
2 | 19 | 1 |
3 | 22 | 1 |
4 | 23 | 1 |
5 | 32 | 2 |
6 | 36 | 1 |
7 | 2001 | 3 |
8 | 2006 | 1 |
9 | 2007 | 387 |
10 | 2008 | 758 |
11 | 2009 | 1787 |
12 | 2010 | 2130 |
13 | 2011 | 1802 |
14 | 2012 | 1401 |
15 | 2013 | 1499 |
16 | 2014 | 1491 |
17 | 2015 | 905 |
18 | 2016 | 1549 |
19 | 2017 | 1356 |
20 | 2018 | 1462 |
21 | 2019 | 1589 |
22 | 2020 | 990 |
23 | 2021 | 1965 |
24 | 2022 | 2585 |
25 | 2023 | 1415 |
26 | 2024 | 3090 |
27 | 2025 | 1759 |
OK, not bad, but we’ve discovered a limitation. It seems some of the rows have malformed planting dates, yielding nonsensical year values (13, 16, etc.). A stronger query would include more defensive filters against bogus data. Alternatively, we could make our tool more agentic by sending the results back to Claude and asking it to iterate until the results make sense.
On the other hand, Claude has managed to implicitly translate our English query (“planted in each year”) into SQL using a schema with Catalan column names (`data_plantacio`), without any explicit hints that such a translation would be necessary. It’s so easy to be jaded by the AI hype, but it still just blows my mind when I see this type of capability.
Now let’s raise the complexity another couple of notches by getting the rolling average of tree counts by year, which will require extracting the year, grouping, and a SQL window function.
Code
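One more hypothetical question wording that yields the query below:

out = query_trees("What is the 5-year rolling average of trees planted per year?")
print(out.parsed_sql)
out.result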
SELECT
year_planted,
AVG(tree_count) OVER (ORDER BY year_planted ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS rolling_5yr_avg
FROM (
SELECT
EXTRACT(YEAR FROM data_plantacio) AS year_planted,
COUNT(*) AS tree_count
FROM barcelona_trees
WHERE data_plantacio IS NOT NULL
GROUP BY year_planted
) AS yearly_counts
WHERE year_planted >= 2007
ORDER BY year_planted;
|  | year_planted | rolling_5yr_avg |
|---|---|---|
0 | 2007 | 387.000000 |
1 | 2008 | 572.500000 |
2 | 2009 | 977.333333 |
3 | 2010 | 1265.500000 |
4 | 2011 | 1372.800000 |
5 | 2012 | 1575.600000 |
6 | 2013 | 1723.800000 |
7 | 2014 | 1664.600000 |
8 | 2015 | 1419.600000 |
9 | 2016 | 1369.000000 |
10 | 2017 | 1360.000000 |
11 | 2018 | 1352.600000 |
12 | 2019 | 1372.200000 |
13 | 2020 | 1389.200000 |
14 | 2021 | 1472.400000 |
15 | 2022 | 1718.200000 |
16 | 2023 | 1708.800000 |
17 | 2024 | 2009.000000 |
18 | 2025 | 2162.800000 |
Claude suggests a nested query to extract the year and get grouped counts, then a correct window function to compute the rolling average. Impressive!
Wrap up
To recap, Rami Krispin showed last week how to use GPT-4.1 and Gemini 2.0 Flash to translate natural language questions into SQL queries that run against a DuckDB table.
In this post, I extended Krispin’s work: swapping in Claude Sonnet 3.7, asking English questions about a non-English dataset, using Pandas to get the table schema, validating the LLM’s output with sqlglot, and adding prompt caching to avoid paying full price for the schema on each LLM call.
Of course, this demo uses a simple database with a single table. The next step up in complexity to test will be to use a database with two tables and queries that require joins of various kinds. Still, it’s an impressive result and we didn’t even have to resort to prompt engineering hacks.
For non-technical data consumers, this kind of thing is a game-changer. Even for engineers fluent in SQL, it’s useful for exploratory analysis and as a building block for a more automated system.
Cover image by Bri Weldon, from https://www.flickr.com/photos/briweldon/5170241518.
Footnotes
The cache expires after only five minutes, which makes it hard to know, as I write this, whether a given call will be a cache write or a cache read.↩︎