How to use an LLM to generate SQL queries

Replicating and extending Rami Krispin’s excellent ODSC East talk.

AI Engineering
SQL
Python
Author

Brian Kent

Published

May 19, 2025

Rami Krispin posted very nice slides and notebooks last week on the topic of using an LLM to query a SQL database.

(1/2) I had the pleasure of presenting today at the ODSC East 2025 conference about querying data with LLMs.

Creating an AI agent (which is a fancy name for three Python functions) is easier than it sounds. At its core, all you need: 👇🏼

#python #sql #llm #ai

— Rami Krispin (@ramikrispin.bsky.social) May 16, 2025 at 3:23 AM

Krispin showed how to query a DuckDB table in 3 steps.

  1. Formulate a question about the data, in plain English.
  2. Use an LLM to translate the question into a SQL query.
  3. Execute the SQL query.

I love this because Krispin (a) shows us the code and the prompts and (b) does not try to chase the hype and go full agentic™. As developers, we stay in the loop, debugging and building intuition about how the LLM behaves. Once we have confidence in the solution, we can build evals and automate, but that all comes later.

Krispin’s notebooks work with either OpenAI GPT-4.1 or Google’s Gemini 2.0 Flash. My initial thought was to do a quick translation of Krispin’s code to work with Anthropic’s Claude Sonnet 3.7, but along the way I realized I could extend his work in a few relatively easy ways.

What this post does not do is show the basics of calling an LLM API or how to iterate your way to a good prompt; for that, read Krispin’s work first.

Set up

Code
import os
from dataclasses import dataclass
from functools import partial

import anthropic
from anthropic.types.usage import Usage
import duckdb
from dotenv import load_dotenv
import pandas as pd
import sqlglot
from sqlglot.errors import ParseError

The data

We’ll use a dataset of street trees in the city of Barcelona, from the Barcelona Open Data Service. Trees are a bit more uplifting than the crime dataset Krispin used and, more importantly, this dataset’s text column names are in Catalan. We’ll see that Claude Sonnet 3.7 is able to implicitly bridge between the Catalan table schema and my English questions.

Prior to running the code below, I downloaded the CSV for the 2025 Q1 dataset and loaded it into a persistent DuckDB database with a single table called barcelona_trees. The one-time load looked roughly like the sketch below; the CSV filename there is my guess at the download’s name, so adjust it to match your file.
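
Code
import duckdb

# One-time load (a sketch): create the persistent database file and ingest
# the CSV. The filename is an assumption, not the portal's exact name.
con = duckdb.connect("barcelona_trees.duckdb")
con.execute("""
    CREATE TABLE barcelona_trees AS
    SELECT * FROM read_csv_auto('2025_1T_OD_Arbrat_Viari_BCN.csv')
""")
con.close()

With the table in place, our first step is to connect to the database (read-only this time) and look at a couple of rows of data.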

Each row represents a tree in the city of Barcelona. A tree is described by its location (address, latitude/longitude, park name, neighborhood, district), species name (in Latin, Spanish, and Catalan), date of planting, watering details, etc.

Code
con = duckdb.connect("barcelona_trees.duckdb", read_only=True)

sql = "SELECT * FROM barcelona_trees USING SAMPLE 2"
rows = con.sql(sql).df()

rows.transpose()
0 1
codi 0028076AR 0031530AR
x_etrs89 432003.084 432715.622
y_etrs89 4582669.092 4584894.08
latitud 41.392682 41.412782
longitud 2.186647 2.194922
tipus_element ARBRE VIARI ARBRE VIARI
espai_verd None None
adreca Av Meridiana, 13 C Concili de Trento, 76
cat_especie_id 126 126
cat_nom_cientific Platanus × acerifolia Platanus × acerifolia
cat_nom_castella Plátano Plátano
cat_nom_catala Plàtan Plàtan
categoria_arbrat EXEMPLAR EXEMPLAR
data_plantacio NaT NaT
tipus_aigua None None
tipus_reg SENSE INFORMAR SENSE INFORMAR
geom POINT (432003.083645579 4582669.09160047) POINT (432715.621632449 4584894.07964375)
catalogacio None None
codi_barri 66 72
nom_barri EL PARC I LA LLACUNA DEL POBLENOU SANT MARTÍ DE PROVENÇALS
codi_districte 10 10
nom_districte SANT MARTÍ SANT MARTÍ

Get the table schema

For an LLM to give us SQL code that we can run directly against our data table, the LLM needs to know the schema of the table. It’s not obvious what format is best for this, but Claude told me to use stringified JSON, so let’s go with that. 🤷‍♂️

Right off the bat, we’ll encapsulate the code in a function so we can re-use it easily later. After we define the function, we use it and print the first five entries of the schema, for illustration.

Code
def get_table_schema(con, table: str) -> str:
    """Return the schema for a database table in stringified JSON."""

    sql = f"""
        SELECT
            column_name,
            data_type,
            is_nullable,
            column_default AS default_value,
            numeric_scale
        FROM information_schema.columns
        WHERE table_name = '{table}'
    """

    schema = con.execute(sql).df()
    schema = schema.to_json(orient="records", indent=2)
    return schema

schema = get_table_schema(con, table='barcelona_trees')
print(schema[:690])
[
  {
    "column_name":"codi",
    "data_type":"VARCHAR",
    "is_nullable":"YES",
    "default_value":null,
    "numeric_scale":null
  },
  {
    "column_name":"x_etrs89",
    "data_type":"DOUBLE",
    "is_nullable":"YES",
    "default_value":null,
    "numeric_scale":0
  },
  {
    "column_name":"y_etrs89",
    "data_type":"DOUBLE",
    "is_nullable":"YES",
    "default_value":null,
    "numeric_scale":0
  },
  {
    "column_name":"latitud",
    "data_type":"DOUBLE",
    "is_nullable":"YES",
    "default_value":null,
    "numeric_scale":0
  },
  {
    "column_name":"longitud",
    "data_type":"DOUBLE",
    "is_nullable":"YES",
    "default_value":null,
    "numeric_scale":0
  },

Call the LLM

This is where things start to get interesting.

First, let’s talk about the model and the hyperparameters (the model, max_tokens, and temperature arguments near the top of the code block below). As I mentioned above, we’ll use Anthropic’s Claude Sonnet 3.7 model. We limit it to 1,000 tokens of output, which is far more than we need for our simple database but still good practice. Finally, we set the temperature to 0 because our task calls for the single best answer from the LLM: at temperature 0, the model greedily picks the most likely next token at each step, making the output (nearly) deterministic.

The next bit is the system prompt, where the rubber really meets the road. In the first system block, we prime the LLM with its general task, including the details that we’re using a DuckDB database and that we want only clean SQL. In his demo, Krispin struggled to get Gemini to refrain from adding markdown backticks; Claude Sonnet 3.7 also includes commentary and markdown by default, but simple prompting seems to be enough to get it to behave as we want.

In the second system block, we pass the name of the table and its schema.

At the end of the system prompt, we enable prompt caching to save money on repeated questions about the same table. Anthropic generally charges $3 per million input tokens for Claude Sonnet 3.7, but a cache read is 10x cheaper at $0.30 per million tokens. Everything in the input up to and including the block marked with cache_control will be cached. Note that Anthropic won’t cache prompts with fewer than 1,024 tokens, and it skips them silently (which is confusing). For simple tables, the schema may not reach this threshold, although in this example it already does.

Finally, we pass the user’s question as the user_prompt, inside the messages argument.

We return the LLM’s output, of course, but also the usage statistics for tracking and debugging.

Code
def generate_llm_sql(client, table: str, schema: str, user_prompt: str):
    """Translate a natural langauge question to a SQL query."""

    response = client.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=1000,
        temperature=0,
        system=[
            {
                "type": "text",
                "text": (
                    "You are an AI assistant tasked with generating SQL queries "
                    "to answer questions about the data in a given DuckDB table. "
                    "Return only clean SQL code; do not include additional "
                    "commentary or even markdown formatting."
                ),
            },
            {
                "type": "text",
                "text": (
                    f"Here is the schema for the DuckDB table '{table}'."
                    f"\n\n{schema}"
                ),
                "cache_control": {"type": "ephemeral"},
            },
        ],
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": user_prompt,
                    }
                ],
            },
        ],
    )

    return response.content[0].text, response.usage

Now let’s spin up the Anthropic SDK client and call our new function with an easy question:

How many trees are there in Barcelona?

Code
load_dotenv()
api_key = os.environ.get("ANTHROPIC_API_KEY")
client = anthropic.Anthropic(api_key=api_key)

llm_sql, llm_usage = generate_llm_sql(
    client=client,
    table='barcelona_trees',
    schema=schema,
    user_prompt="How many trees are there in Barcelona?"
)

print(llm_sql)
print()
print(llm_usage)
SELECT COUNT(*) FROM barcelona_trees

Usage(cache_creation_input_tokens=1233, cache_read_input_tokens=0, input_tokens=15, output_tokens=11, server_tool_use=None)

Sure, it’s an easy question but that looks like pretty good SQL! In the usage info, note that most of the input indeed comes from writing or reading the cache, as we hoped.1 There are only 15 non-cached input tokens, which keeps the cost down.
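
If we repeat the call within the cache window, those schema tokens should show up as a cache read instead. Here’s a quick way to check; the follow-up question is just a hypothetical example, not run in this post.

Code
_, usage2 = generate_llm_sql(
    client=client,
    table='barcelona_trees',
    schema=schema,
    user_prompt="How many distinct tree species are in the table?",
)

# On a cache hit, we'd expect cache_read_input_tokens to be about 1233 and
# cache_creation_input_tokens to be 0.
print(usage2.cache_read_input_tokens, usage2.cache_creation_input_tokens)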

Estimate the cost

💰 Speaking of cost…

The token counts alone don’t tell us how much we’re spending, but it’s easy to estimate. Anthropic’s prompt caching page lists the current costs per million tokens of input, output, cache write, and cache read.

Code
def estimate_llm_cost(llm_usage: Usage) -> float:
    """Estimate the cost of a call to the Claude Sonnet 3.7 API.
    
    Prices from https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching#pricing on May 19, 2025.
    """

    input_cost = 3 * llm_usage.input_tokens
    output_cost = 15 * llm_usage.output_tokens
    cache_write_cost = 3.75 * llm_usage.cache_creation_input_tokens
    cache_read_cost = 0.3 * llm_usage.cache_read_input_tokens

    total_cost_dollars = (
        input_cost + output_cost + cache_write_cost + cache_read_cost
    ) / 1e6
    return total_cost_dollars


print(f"Estimated LLM call cost: ${estimate_llm_cost(llm_usage):.4f}")
Estimated LLM call cost: $0.0048

Validate the LLM output

The next step in our pipeline is to validate the LLM’s output. In this post, we’re only going to check that the output is proper DuckDB SQL but in the wild this would be a good place for all the standard sanitization checks you would do when anybody wants to access your DB. In principle, there’s nothing stopping our favorite LLM from suddenly turning into Little Bobby Tables and deleting all of our data.
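
As one cheap guard beyond syntax checking, we could refuse to run anything that isn’t a plain SELECT statement. Here’s a minimal sketch (assert_read_only is a hypothetical helper, and our read_only=True DuckDB connection is a second line of defense):

Code
from sqlglot import exp

def assert_read_only(sql: str) -> None:
    """Raise if the statement is anything other than a plain SELECT."""
    parsed = sqlglot.parse_one(sql, dialect="duckdb")
    if not isinstance(parsed, exp.Select):
        raise ValueError(f"Refusing to run non-SELECT statement: {sql!r}")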

To check that Claude’s output is valid DuckDB SQL, we’ll use the sqlglot package. In the output below, we see that the parsed query is unchanged, which is a success.

Code
def validate_sql(sql: str) -> str:
    """Validate that a SQL query string parses as proper (DuckDB) SQL."""

    try:
        parsed_sql = sqlglot.parse_one(sql, dialect="duckdb")
    except ParseError as e:
        # Raise rather than just print, so invalid SQL never reaches the DB
        # (and parsed_sql is never left unbound).
        raise ValueError(f"Invalid SQL: {e}") from e
    except Exception as e:
        raise ValueError(f"Error during validation: {e}") from e

    return parsed_sql.sql()


parsed_sql = validate_sql(llm_sql)
print(parsed_sql)
SELECT COUNT(*) FROM barcelona_trees

Execute the SQL query

The last step in the pipeline is to run the validated SQL query on our database. Looks like Barcelona has almost 150,000 trees!

Code
result = con.execute(parsed_sql).df()
result
count_star()
0 146980

Put it all together

Let’s wrap all of the above logic into a single function that we can call with a question about our data.

First, we define a dataclass to hold the return objects. We don’t just want the answer; for debugging and building intuition, we need to see the LLM’s output, the validated SQL, the token counts, and estimated cost.

Code
@dataclass
class LlmQueryOutput:
    result: pd.DataFrame
    llm_sql: str
    parsed_sql: str
    llm_usage: Usage
    llm_cost_dollars: float

Next, we package all of the previous steps into a single function, llm_sql_query.

Code
def llm_sql_query(client, con, table: str, question: str) -> LlmQueryOutput:
    """Answer a user's natural language question about a database table."""

    # Get the table schema
    schema = get_table_schema(con, table)

    # Call the LLM
    llm_sql, llm_usage = generate_llm_sql(
        client, table, schema, user_prompt=question
    )
    llm_cost_dollars = estimate_llm_cost(llm_usage)

    # Validate the LLM output as proper SQL.
    parsed_sql = validate_sql(llm_sql)

    # Run the query on the DB.
    result = con.execute(parsed_sql).df()

    # Package and return the result
    out = LlmQueryOutput(result, llm_sql, parsed_sql, llm_usage, llm_cost_dollars)
    return out

This function is good but still a little annoying because it has three parameters that are always the same as long as we’re working with the same database. To simplify even further, we can use Python’s functools.partial to freeze those first three parameters. The query_trees function below takes only a question, which is about as simple as we can get.

Code
query_trees = partial(llm_sql_query, client, con, 'barcelona_trees')

Kick the tires

Let’s make our dev question a little harder by asking Claude to group the tree count by district (nom_districte).

Code
out = query_trees("How many trees are in each district?")

print(out.llm_sql)
out.result
SELECT nom_districte, COUNT(*) as tree_count
FROM barcelona_trees
GROUP BY nom_districte
ORDER BY tree_count DESC;
nom_districte tree_count
0 SANT MARTÍ 29586
1 EIXAMPLE 21445
2 SANTS - MONTJUÏC 15286
3 SANT ANDREU 15167
4 NOU BARRIS 13760
5 SARRIÀ - SANT GERVASI 13369
6 LES CORTS 12415
7 HORTA - GUINARDÓ 12384
8 GRÀCIA 6995
9 CIUTAT VELLA 6570
10 None 3

Looks pretty good! Claude even put the counts in descending order, which is a nice touch. Let’s up the ante even further by asking Claude to group the counts by year, which must be extracted from the planting date column (data_plantacio).

Code
out = query_trees("How many trees have been planted in each year?")

print(out.llm_sql)
out.result
SELECT EXTRACT(YEAR FROM data_plantacio) AS year, 
       COUNT(*) AS number_of_trees
FROM barcelona_trees
WHERE data_plantacio IS NOT NULL
GROUP BY year
ORDER BY year;
year number_of_trees
0 13 2
1 16 1
2 19 1
3 22 1
4 23 1
5 32 2
6 36 1
7 2001 3
8 2006 1
9 2007 387
10 2008 758
11 2009 1787
12 2010 2130
13 2011 1802
14 2012 1401
15 2013 1499
16 2014 1491
17 2015 905
18 2016 1549
19 2017 1356
20 2018 1462
21 2019 1589
22 2020 990
23 2021 1965
24 2022 2585
25 2023 1415
26 2024 3090
27 2025 1759

OK, not bad, but we’ve discovered a limitation. It seems some of the rows have malformed planting dates, yielding nonsensical year values (13, 16, and so on). A stronger query would include more defensive filters against bogus data. Or, alternatively, we could make our tool more agentic by sending the results back to Claude and asking it to iterate until the results make sense, as sketched below.
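
Here’s a rough sketch of what one feedback round could look like, reusing query_trees to hand Claude its own query and the suspicious output (refine_once is a hypothetical helper, not part of the pipeline above):

Code
def refine_once(question: str, first_try: LlmQueryOutput) -> LlmQueryOutput:
    """One self-correction round: show Claude its query and the odd output."""
    followup = (
        f"Original question: {question}\n\n"
        f"Your previous query:\n{first_try.parsed_sql}\n\n"
        f"First rows of its result:\n{first_try.result.head(8).to_string()}\n\n"
        "Some of the year values look malformed. Revise the query to filter "
        "out bogus dates and answer the original question."
    )
    return query_trees(followup)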

On the other hand, Claude has managed to implicitly translate our English query (“planted in each year”) into SQL using a schema with Catalan column names (data_plantacio), without any explicit hints that such a translation would be necessary. It’s so easy to be jaded by the AI hype but it still just blows my mind when I see this type of capability.

Now let’s raise the complexity another couple notches by getting the rolling average of tree counts by year, which will require extracting the year, grouping, and a SQL window function.

Code
out = query_trees(
    question=(
        "Give me the rolling 5 year average of the number of trees planted, "
        "starting in 2007."
    )
)

print(out.llm_sql)
out.result
SELECT 
    year_planted,
    AVG(tree_count) OVER (ORDER BY year_planted ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS rolling_5yr_avg
FROM (
    SELECT 
        EXTRACT(YEAR FROM data_plantacio) AS year_planted,
        COUNT(*) AS tree_count
    FROM barcelona_trees
    WHERE data_plantacio IS NOT NULL
    GROUP BY year_planted
) AS yearly_counts
WHERE year_planted >= 2007
ORDER BY year_planted;
year_planted rolling_5yr_avg
0 2007 387.000000
1 2008 572.500000
2 2009 977.333333
3 2010 1265.500000
4 2011 1372.800000
5 2012 1575.600000
6 2013 1723.800000
7 2014 1664.600000
8 2015 1419.600000
9 2016 1369.000000
10 2017 1360.000000
11 2018 1352.600000
12 2019 1372.200000
13 2020 1389.200000
14 2021 1472.400000
15 2022 1718.200000
16 2023 1708.800000
17 2024 2009.000000
18 2025 2162.800000

Claude suggests a nested query to extract the year and get grouped counts, then a correct window function to compute the rolling average. Impressive!

Wrap up

To recap, Rami Krispin showed last week how to use GPT-4.1 and Gemini 2.0 Flash to translate natural language questions into SQL queries that run against a DuckDB table.

In this post, I extended Krispin’s work by showing how to do it with Claude Sonnet 3.7, by asking English questions about a non-English dataset, using Pandas to get the table schema, validating the output of the LLM with sqlglot, and adding prompt caching to save money and avoid sending the full schema on each LLM call.

Of course, this demo uses a simple database with a single table; still, it’s an impressive result, and we didn’t even have to resort to prompt-engineering hacks. The next step up in complexity to test will be a database with two tables and questions that require various kinds of joins. The schema step, at least, generalizes easily; see the sketch below.
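
For example, the schema helper could be extended like this (untested here; get_all_schemas is a hypothetical wrapper around the get_table_schema function defined above):

Code
def get_all_schemas(con, tables: list[str]) -> str:
    """Concatenate the JSON schemas for several tables (a sketch)."""
    return "\n\n".join(
        f"Schema for table '{t}':\n{get_table_schema(con, t)}" for t in tables
    )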

For non-technical data consumers, this kind of thing is a game-changer. Even for engineers fluent in SQL, it’s useful for exploratory analysis and as a building block for a more automated system.


Cover image by Bri Weldon, from https://www.flickr.com/photos/briweldon/5170241518.

Footnotes

  1. The cache expires after only five minutes, which makes it hard to know, as I write this, whether a given call will be a cache write or a cache read.↩︎