Rami Krispin posted very nice slides and notebooks last week on the topic of using an LLM to query a SQL database.

Krispin showed how to query a DuckDB table in 3 steps.

Formulate a question about the data, in plain English.
Use an LLM to translate the question into a SQL query.
Execute the SQL query.

I love this because Krispin A. shows us the code and the prompts and B. does not try to chase the hype and go full agentic™. As developers, we stay in the loop, debugging and building intuition about how the LLM behaves. Once we have confidence in the solution, we can build evals and automate, but that all comes later.

Krispin’s notebooks work with either OpenAI GPT-4.1 or Google’s Gemini 2.0 Flash. My initial thought was to do a quick translation of Krispin’s code to work with Anthropic’s Claude Sonnet 3.7 but along the way I realized I could extend his work in a few relatively easy ways.

I use a non-English dataset to experiment with Claude’s ability to translate my English questions implicitly.
I validate the LLM’s ouptut as proper SQL with sqlglot.
I use prompt caching to reduce the cost of asking multiple questions about the same dataset.
I use Pandas to get the table schema in JSON format instead of custom Python.

What this post does not do is show the basics of calling an LLM API or how to iterate your way to good prompt; for that, read Krispin’s work first.

Set up

Code

import os
from dataclasses import dataclass
from functools import partial

import anthropic
from anthropic.types.usage import Usage
import duckdb
from dotenv import load_dotenv
import pandas as pd
import sqlglot
from sqlglot.errors import ParseError

The data

We’ll use a dataset of street trees in the city of Barcelona, from the Barcelona Open Data Service. Trees are a bit more uplifting than the crime dataset Krispin used and, more importantly, this dataset’s text column names are in Catalan. We’ll see that Claude Sonnet 3.7 is able to implicitly bridge between the Catalan table schema and my English questions.

Prior to running the code below, I downloaded the CSV for the 2025 Q1 dataset and loaded it into a persistent DuckDB database with a single table called barcelona_trees. Our first step is to connect to that DB and look at a couple rows of data.

Each row representes a tree in the city of Barcelona. A tree is described by its location (address, latitude/longitude, park name, neighborhood, district), species name (in Latin, Spanish, and Catalan), date of planting, watering details, etc.

Code

con = duckdb.connect("barcelona_trees.duckdb", read_only=True)

sql = "SELECT * FROM barcelona_trees USING SAMPLE 2"
rows = con.sql(sql).df()

rows.transpose()

	0	1
codi	0028076AR	0031530AR
x_etrs89	432003.084	432715.622
y_etrs89	4582669.092	4584894.08
latitud	41.392682	41.412782
longitud	2.186647	2.194922
tipus_element	ARBRE VIARI	ARBRE VIARI
espai_verd	None	None
adreca	Av Meridiana, 13	C Concili de Trento, 76
cat_especie_id	126	126
cat_nom_cientific	Platanus × acerifolia	Platanus × acerifolia
cat_nom_castella	Plátano	Plátano
cat_nom_catala	Plàtan	Plàtan
categoria_arbrat	EXEMPLAR	EXEMPLAR
data_plantacio	NaT	NaT
tipus_aigua	None	None
tipus_reg	SENSE INFORMAR	SENSE INFORMAR
geom	POINT (432003.083645579 4582669.09160047)	POINT (432715.621632449 4584894.07964375)
catalogacio	None	None
codi_barri	66	72
nom_barri	EL PARC I LA LLACUNA DEL POBLENOU	SANT MARTÍ DE PROVENÇALS
codi_districte	10	10
nom_districte	SANT MARTÍ	SANT MARTÍ

Get the table schema

For an LLM to give us SQL code that we can run directly against our data table, the LLM needs to know the schema of the table. It’s not obvious what format is best this, but Claude told me to use stringified JSON, so let’s go with that. 🤷‍♂️

Right off the bat, we’ll encapsulate the code in a function so we can re-use it easily later. After we define the function, we use it and print the first five entries of the schema, for illustration.

Code

def get_table_schema(con, table: str) -> str:
    """Return the schema for a database table in stringified JSON."""

    sql = f"""
        SELECT
            column_name,
            data_type,
            is_nullable,
            column_default AS default_value,
            numeric_scale
        FROM information_schema.columns
        WHERE table_name = '{table}'
    """

    schema = con.execute(sql).df()
    schema = schema.to_json(orient="records", indent=2)
    return schema

schema = get_table_schema(con, table='barcelona_trees')
print(schema[:690])

[
  {
    "column_name":"codi",
    "data_type":"VARCHAR",
    "is_nullable":"YES",
    "default_value":null,
    "numeric_scale":null
  },
  {
    "column_name":"x_etrs89",
    "data_type":"DOUBLE",
    "is_nullable":"YES",
    "default_value":null,
    "numeric_scale":0
  },
  {
    "column_name":"y_etrs89",
    "data_type":"DOUBLE",
    "is_nullable":"YES",
    "default_value":null,
    "numeric_scale":0
  },
  {
    "column_name":"latitud",
    "data_type":"DOUBLE",
    "is_nullable":"YES",
    "default_value":null,
    "numeric_scale":0
  },
  {
    "column_name":"longitud",
    "data_type":"DOUBLE",
    "is_nullable":"YES",
    "default_value":null,
    "numeric_scale":0
  },

Call the LLM

This is where things start to get interesting.

First, let’s talk about the model and the hyperparameters (lines 4-6 in the code block below). As I mentioned above, we’ll use Anthropic’s Claude Sonnet 3.78 model. We limit it to 1,000 tokens of output, which is way more than we need for our simple database but still good practice. Finally, we set the temperature to 0 because our task requires the single best answer from the LLM, i.e. the answer with the highest chain of next-token probabilities.

The next bit is the system prompt, where the rubber really meets the road. First, we prime the LLM with its general task, including the details that we’re using a DuckDB database and that we want only clean SQL (lines 8-17). In his demo, Krispin struggled to get Gemini to refrain from adding markdown backticks; Claude Sonnet 3.7 also includes commentary and markdown by default but simple prompting seems to be enough to get Claude to behave as we want.

In the second part of the system prompt, we pass the name of the table and the schema (lines 18-26).

At the end of the system prompt, we enable prompt caching, to save money with repeated questions about the same table. Generally, Anthropic charges $3 per million tokens of input to Claude Sonnet 3.7, but a cache read is 10X cheaper at $0.30/million tokens. Everything in the input up to the cache_control bit (line 24) will be cached. Note that Anthropic does not cache inputs with fewer than 1,024 tokens but it does so silently (which is confusing). For simple tables, the schema may not reach this threshold, although with this example it already does.

Finally, we pass the user’s question as the user_prompt (lines 27-37).

We return the LLM’s output, of course, but also the usage statistics for tracking and debugging.

Code

def generate_llm_sql(client, table: str, schema: str, user_prompt: str):
    """Translate a natural langauge question to a SQL query."""

    response = client.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=1000,
        temperature=0,
        system=[
            {
                "type": "text",
                "text": (
                    "You are an AI assistant tasked with generating SQL queries "
                    "to answer questions about the data in a given DuckDB table. "
                    "Return only clean SQL code; do not include additional "
                    "commentary or even markdown formatting."
                ),
            },
            {
                "type": "text",
                "text": (
                    f"Here is the schema for the DuckDB table '{table}'."
                    f"\n\n{schema}"
                ),
                "cache_control": {"type": "ephemeral"},
            },
        ],
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": user_prompt,
                    }
                ],
            },
        ],
    )

    return response.content[0].text, response.usage

Now let’s spin up the Anthropic SDK client and call our new function with an easy question:

How many trees are there in Barcelona?

Code

load_dotenv()
api_key = os.environ.get("ANTHROPIC_API_KEY")
client = anthropic.Anthropic(api_key=api_key)

llm_sql, llm_usage = generate_llm_sql(
    client=client,
    table='barcelona_trees',
    schema=schema,
    user_prompt="How many trees are there in Barcelona?"
)

print(llm_sql)
print()
print(llm_usage)

SELECT COUNT(*) FROM barcelona_trees

Usage(cache_creation_input_tokens=1233, cache_read_input_tokens=0, input_tokens=15, output_tokens=11, server_tool_use=None)

Sure, it’s an easy question but that looks like pretty good SQL! In the usage info, note that most of the input indeed comes from writing or reading the cache, as we hoped.¹ There are only 15 non-cached input tokens, which keeps the cost down.

Estimate the cost

💰 Speaking of cost….

The usage data about tokens alone don’t tell us how much we’re spending but it’s easy to check. Anthropic’s prompt caching page lists the current costs per million tokens of input, output, cache write, and cache read.

Code

def estimate_llm_cost(llm_usage: Usage) -> float:
    """Estimate the cost of a call to the Claude Sonnet 3.7 API.
    
    Prices from https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching#pricing on May 19, 2025.
    """

    input_cost = 3 * llm_usage.input_tokens
    output_cost = 15 * llm_usage.output_tokens
    cache_write_cost = 3.75 * llm_usage.cache_creation_input_tokens
    cache_read_cost = 0.3 * llm_usage.cache_read_input_tokens

    total_cost_dollars = (
        input_cost + output_cost + cache_write_cost + cache_read_cost
    ) / 1e6
    return total_cost_dollars


print(f"Estimated LLM call cost: ${estimate_llm_cost(llm_usage):.4f}")

Estimated LLM call cost: $0.0048

Validate the LLM output

The next step in our pipeline is to validate the LLM’s output. In this post, we’re only going to check that the output is proper DuckDB SQL but in the wild this would be a good place for all the standard sanitization checks you would do when anybody wants to access your DB. In principle, there’s nothing stopping our favorite LLM from suddenly turning into Little Bobby Tables and deleting all of our data.

To check that the Claude’s output is valid DuckDB SQL, we’ll use the sqlglot package. In the output below, we see that the parsed query is unchanged, which is a success.

Code

def validate_sql(sql: str) -> str:
    """Validate that a SQL query string parses as proper (DuckDB) SQL."""

    try:
        parsed_sql = sqlglot.parse_one(sql, dialect="duckdb")
    except ParseError as e:
        print(f"Invalid SQL: {e}")
    except Exception as e:
        print(f"Error during validation: {e}")

    return parsed_sql.sql()


parsed_sql = validate_sql(llm_sql)
print(parsed_sql)

SELECT COUNT(*) FROM barcelona_trees

Execute the SQL query

The last step in the pipeline is to run the validated SQL query on our database. Looks like Barcelona has almost 150,000 trees!

Code

result = con.execute(parsed_sql).df()
result

	count_star()
0	146980

Put it all together

Let’s wrap all of the above logic into a single function that we can call with a question about our data.

First, we define a dataclass to hold the return objects. We don’t just want the answer; for debugging and building intuition, we need to see the LLM’s output, the validated SQL, the token counts, and estimated cost.

Code

@dataclass
class LlmQueryOutput:
    result: pd.DataFrame
    llm_sql: str
    parsed_sql: str
    llm_usage: Usage
    llm_cost_dollars: float

Next, we package all of the previous steps into a single function, llm_sql_query.

Code

def llm_sql_query(client, con, table: str, question: str) -> LlmQueryOutput:
    """Answer a user's natural language question about a database table."""

    # Get the table schema
    schema = get_table_schema(con, table)

    # Call the LLM
    llm_sql, llm_usage = generate_llm_sql(
        client, table, schema, user_prompt=question
    )
    llm_cost_dollars = estimate_llm_cost(llm_usage)

    # Validate the LLM output as proper sql.
    parsed_sql = validate_sql(llm_sql)

    # Run the query on the DB.
    result = con.execute(parsed_sql).df()

    # Package and return the result
    out = LlmQueryOutput(result, llm_sql, parsed_sql, llm_usage, llm_cost_dollars)
    return out

This function is good but still a little annoying because it has three parameters that are always the same as long as we’re working with the same database. To simplify even further, we can use Python’s partial tool to freeze those first three parameters. The query_trees function below takes only a question, which is about as simple as we can get.

Code

query_trees = partial(llm_sql_query, client, con, 'barcelona_trees')

Kick the tires

Let’s make our dev question a little harder, by asking Claude to group the tree count by district (nom_districte).

Code

out = query_trees("How many trees are in each district?")

print(out.llm_sql)
out.result

SELECT nom_districte, COUNT(*) as tree_count
FROM barcelona_trees
GROUP BY nom_districte
ORDER BY tree_count DESC;

	nom_districte	tree_count
0	SANT MARTÍ	29586
1	EIXAMPLE	21445
2	SANTS - MONTJUÏC	15286
3	SANT ANDREU	15167
4	NOU BARRIS	13760
5	SARRIÀ - SANT GERVASI	13369
6	LES CORTS	12415
7	HORTA - GUINARDÓ	12384
8	GRÀCIA	6995
9	CIUTAT VELLA	6570
10	None	3

Looks pretty good! Claude even put the counts in descending order, which is a nice touch. Let’s up the ante even further by asking Claude to group the counts by year, which must be extracted from the planting date column (data_plantacio).

Code

out = query_trees("How many trees have been planted in each year?")

print(out.llm_sql)
out.result

SELECT EXTRACT(YEAR FROM data_plantacio) AS year, 
       COUNT(*) AS number_of_trees
FROM barcelona_trees
WHERE data_plantacio IS NOT NULL
GROUP BY year
ORDER BY year;

	year	number_of_trees
0	13	2
1	16	1
2	19	1
3	22	1
4	23	1
5	32	2
6	36	1
7	2001	3
8	2006	1
9	2007	387
10	2008	758
11	2009	1787
12	2010	2130
13	2011	1802
14	2012	1401
15	2013	1499
16	2014	1491
17	2015	905
18	2016	1549
19	2017	1356
20	2018	1462
21	2019	1589
22	2020	990
23	2021	1965
24	2022	2585
25	2023	1415
26	2024	3090
27	2025	1759

OK, not bad, but we’ve discovered a limitation. It seems some of the rows have malformed planting dates, yielding non-sensical year values (13, 16, etc.) A stronger query would include more defensive filters against bogus data. Or, alternatively, we could make our tool more agentic by sending the results back to Claude and asking it to iterate until the results make sense.

On the other hand, Claude has managed to implicitly translate our English query (“planted in each year”) into SQL using a schema with Catalan column names (data_plantacio), without any explicit hints that such a translation would be necessary. It’s so easy to be jaded by the AI hype but it still just blows my mind when I see this type of capability.

Now let’s raise the complexity another couple notches by getting the rolling average of tree counts by year, which will require extracting the year, grouping, and a SQL window function.

Code

out = query_trees(
    question=(
        "Give me the rolling 5 year average of the number of trees planted, "
        "starting in 2007."
    )
)

print(out.llm_sql)
out.result

SELECT 
    year_planted,
    AVG(tree_count) OVER (ORDER BY year_planted ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS rolling_5yr_avg
FROM (
    SELECT 
        EXTRACT(YEAR FROM data_plantacio) AS year_planted,
        COUNT(*) AS tree_count
    FROM barcelona_trees
    WHERE data_plantacio IS NOT NULL
    GROUP BY year_planted
) AS yearly_counts
WHERE year_planted >= 2007
ORDER BY year_planted;

	year_planted	rolling_5yr_avg
0	2007	387.000000
1	2008	572.500000
2	2009	977.333333
3	2010	1265.500000
4	2011	1372.800000
5	2012	1575.600000
6	2013	1723.800000
7	2014	1664.600000
8	2015	1419.600000
9	2016	1369.000000
10	2017	1360.000000
11	2018	1352.600000
12	2019	1372.200000
13	2020	1389.200000
14	2021	1472.400000
15	2022	1718.200000
16	2023	1708.800000
17	2024	2009.000000
18	2025	2162.800000

Claude suggests a nested query to extract the year and get grouped counts, then a correct window function to compute the rolling average. Impressive!

Wrap up

To recap, Rami Krispin showed last week how to use GPT-4.1 and Gemini 2.0 Flash to translate natural language questions into SQL queries that run against a DuckDB table.

In this post, I extended Krispin’s work by showing how to do it with Claude Sonnet 3.7, by asking English questions about a non-English dataset, using Pandas to get the table schema, validating the output of the LLM with sqlglot, and adding prompt caching to save money and avoid sending the full schema on each LLM call.

Of course, this demo uses a simple database with a single table. The next step up in complexity to test will be to use a database with two tables and queries that require joins of various kinds. Still, it’s an impressive result and we didn’t even have to resort to prompt engineering hacks.

For non-technical data consumers, this kind of thing is a game-changer. Even for engineers fluent in SQL, it’s useful for exploratory analysis and as a building block for a more automated system.

Cover image by Bri Weldon, from https://www.flickr.com/photos/briweldon/5170241518.

Footnotes

The cache invalidates after only five minutes which makes it hard to know as I write this whether it will be a cache write or read.↩︎