A Claude Sonnet 4 regression

Claude Sonnet 4 seems to be a clear step backward in its ability to follow instructions regarding the format of code output.

AI Engineering
Author

Brian Kent

Published

June 5, 2025

It seems Claude Sonnet 4 has had a noticeable regression in its ability to follow text-to-SQL instructions about formatting. It’s a small thing, but it’s the kind of thing that makes building fully automated agentic systems so hard.

The BarcelonaDataTrivia bot is a (work-in-progress) fully-automated, LLM-powered tool to post a data-based factoid about the city of Barcelona every day.

One step in the bot’s flow is to generate a SQL query that will run against one of the tables in the Barcelona Open Data portal, given a natural language question and metadata about the table. A part of this is making sure the text-to-SQL response from the LLM doesn’t contain extraneous commentary or formatting. It seems small, but can be tricky to enforce.

Claude Sonnet 3.7 seems to follow the “no-formatting” instruction very well but Claude Sonnet 4 just can’t seem to comply. Here’s an example.

Set up

First, the setup. I’ve copied the data from the Open Data BCN table on monthly average temperatures to a local DuckDB table, and the metadata to a local JSON file.

Some boilerplate you can ignore.

Code
import os

from anthropic import Anthropic
from dotenv import load_dotenv
import duckdb
from openai import OpenAI
import orjson

load_dotenv()

dataset = "temperatures-hist-bcn"
table = "temperaturesbcndesde1780_2024"

Then we compile the metadata about the table.

Code
# Connect to the DB and load the table metadata.
db = duckdb.connect(f"{dataset}.db", read_only=True)

with open(f"{dataset}_metadata.json", "rb") as f:
    meta = orjson.loads(f.read())["result"]

# Get an example row of data
sql = f"""SELECT * FROM "{table}" LIMIT 1"""

result = db.sql(sql)
field_names = [x[0] for x in result.description]
example = dict(zip(field_names, result.fetchone()))

# Format the table metadata as an LLM-readable string.
table_info = [
    f"Table name: {table}",
    f"Dataset name: {meta['title_translated']['ca']}",
    f"Dataset source: {meta['fuente']}",
    f"Dataset description: {meta['notes_translated']['ca']}",
    f"First row of data:\n{example}",
    f"Field descriptions:\n{meta['extras']}",
]

table_info_str = "\n\n".join(table_info)
print(table_info_str)
Table name: temperaturesbcndesde1780_2024

Dataset name: Temperatures mitjanes mensuals de l’aire de la ciutat de Barcelona des de 1780

Dataset source: Servei Meteorològic de Catalunya

Dataset description: Temperatures mitjanes mensuals de l’aire de la ciutat de Barcelona des de 1780. La temperatura es proporciona en graus centígrads (ºC).

First row of data:
{'Any': 1780, 'Temp_Mitjana_Gener': 6.7, 'Temp_Mitjana_Febrer': 7.3, 'Temp_Mitjana_Marc': 11.6, 'Temp_Mitjana_Abril': 11.4, 'Temp_Mitjana_Maig': 16.3, 'Temp_Mitjana_Juny': 19.1, 'Temp_Mitjana_Juliol': 21.4, 'Temp_Mitjana_Agost': 22.0, 'Temp_Mitjana_Setembre': 20.3, 'Temp_Mitjana_Octubre': 16.3, 'Temp_Mitjana_Novembre': 9.2, 'Temp_Mitjana_Desembre': 7.3}

Field descriptions:
[{'key': '01.Any', 'value': 'Any de la medició'}, {'key': '02.Temp_Mitjana_Mes', 'value': 'Mitjana mensual corresponent en graus centígrads'}]

Task and context

Let’s ask a very simple question about this table:

Code
question = (
    "What is the average June temperature in Barcelona over the past 30 years?"
)

Here’s the key part: the system prompt. The second-to-last sentence explicitly tells the LLM not to include any additional reasoning, commentary, or formatting. As we’ll see, for Claude Sonnet 3.7 and GPT-4.1, this is sufficient. For Claude Sonnet 4, it’s not.

Code
system_prompt = """
    You are an AI assistant tasked with generating SQL queries to answer questions
    about a DuckDB table. The user will first ask their question then provide
    information about the table, including the table name, description, an
    example row of data, and field descriptions. Return only clean, executable
    SQL code; do not output any additional reasoning, commentary, or formatting.
    Remember to put quotes around table and field names.
    """

# Get rid of newlines and whitespace (HELP! This cannot be the way!)
system_prompt = " ".join(system_prompt.split())
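As an aside on that last line: the standard library has a gentler option. inspect.cleandoc strips the common indentation and the surrounding blank lines from a triple-quoted string while keeping the line breaks, which LLMs handle just fine. A sketch:

```python
import inspect

# cleandoc removes the leading blank line and the common indentation,
# so the triple-quoted prompt can stay nicely indented in the source.
prompt = inspect.cleandoc("""
    You are an AI assistant tasked with generating SQL queries.
    Return only clean, executable SQL code; do not output any
    additional reasoning, commentary, or formatting.
    """)

print(prompt)  # starts at "You", no indentation, line breaks intact
```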

Experiments

First, a couple of functions to call the Anthropic and OpenAI APIs. The only thing to note here is how the system prompt is passed in as either system (Anthropic) or instructions (OpenAI), and then the question and table info are passed to the LLM in the “user” role.

Code
anthopic_client = Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))
openai_client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def generate_claude_sql(
    model: str, system_prompt: str, question: str, table_info: str
):
    message = anthopic_client.messages.create(
        model=model,
        max_tokens=2000,
        system=system_prompt,
        messages=[
            {"role": "user", "content": f"Question: {question}"},
            {"role": "user", "content": f"Table information:\n{table_info}"},
        ],
    )
    return message.content[0].text

def generate_gpt_sql(
    model: str, system_prompt: str, question: str, table_info: str
):
    response = openai_client.responses.create(
        model=model,
        instructions=system_prompt,
        input=[
            {"role": "user", "content": f"Question: {question}"},
            {"role": "user", "content": f"Table information:\n{table_info}"},
        ],
    )

    return response.output_text

Claude Sonnet 3.7

Let’s take a look first at Claude Sonnet 3.7.

Code
response = generate_claude_sql(
    model="claude-3-7-sonnet-20250219",
    system_prompt=system_prompt,
    question=question,
    table_info=table_info_str,
)
print(response)
SELECT AVG("Temp_Mitjana_Juny") AS Avg_June_Temp
FROM "temperaturesbcndesde1780_2024"
WHERE "Any" >= YEAR(CURRENT_DATE) - 30

Looks good, let’s double-check by running it.

Code
print(db.sql(response))
┌───────────────────┐
│   Avg_June_Temp   │
│      double       │
├───────────────────┤
│ 21.82333333333333 │
└───────────────────┘

Excellent. Now let’s see Claude Sonnet 4.

Claude Sonnet 4

Code
response = generate_claude_sql(
    model="claude-sonnet-4-20250514",
    system_prompt=system_prompt,
    question=question,
    table_info=table_info_str,
)
print(response)
```sql
SELECT AVG("Temp_Mitjana_Juny") as average_june_temperature
FROM "temperaturesbcndesde1780_2024"
WHERE "Any" >= 1994 AND "Any" <= 2024;
```

Uh oh, what are those markdown backticks doing in there? Also, while it’s not the point of this post, Claude Sonnet 4’s SQL is not quite as good as the previous output: the filter clause WHERE "Any" >= 1994 AND "Any" <= 2024 hard-codes the year range (which, counted inclusively, spans 31 years, not 30), so it’s brittle and not actually a correct answer to the specific question we posed.
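If you’d rather repair the output than the prompt, stripping the fences after the fact only takes a few lines. This strip_sql_fences helper is a sketch of mine, not something from the bot:

```python
import re

def strip_sql_fences(text: str) -> str:
    """Remove a surrounding ```sql ... ``` fence, if present."""
    match = re.match(r"^```(?:sql)?\s*\n(.*?)\n?```\s*$", text.strip(), re.DOTALL)
    return match.group(1).strip() if match else text.strip()

fenced = '```sql\nSELECT 1;\n```'
strip_sql_fences(fenced)       # 'SELECT 1;'
strip_sql_fences('SELECT 1;')  # unfenced input passes through unchanged
```

It passes unfenced responses through untouched, so it’s safe to run on every model’s output.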

Just for comparison, let’s try GPT-4.1 too.

GPT 4.1

Code
response = generate_gpt_sql(
    model="gpt-4.1",
    system_prompt=system_prompt,
    question=question,
    table_info=table_info_str,
)
print(response)
SELECT AVG("Temp_Mitjana_Juny") AS avg_june_temperature
FROM "temperaturesbcndesde1780_2024"
WHERE "Any" >= (SELECT MAX("Any") FROM "temperaturesbcndesde1780_2024") - 29;

Not bad, although the SQL has the same issue Claude Sonnet 4’s did: it’s not actually what our question asked. Still, it’s pure SQL which we can run without further processing.

Code
print(db.sql(response))
┌──────────────────────┐
│ avg_june_temperature │
│        double        │
├──────────────────────┤
│    21.82333333333333 │
└──────────────────────┘

Claude 4 with pre-filled output

OK, so how do we fix Claude 4? One thing we can do is to “seed” its response with the beginning of the SQL query, so it doesn’t have a chance to interject the markdown formatting. The final message below has the role “assistant” and “SELECT” as its content. From there, Claude essentially has to continue the query.

Code
message = anthopic_client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2000,
    system=system_prompt,
    messages=[
        {"role": "user", "content": f"Question: {question}"},
        {"role": "user", "content": f"Table information:\n{table_info}"},
        {"role": "assistant", "content": "SELECT"},
    ],
)

print(message.content[0].text)
 AVG("Temp_Mitjana_Juny") 
FROM "temperaturesbcndesde1780_2024" 
WHERE "Any" >= 1995 AND "Any" <= 2024;

Well, we’ve indeed solved the markdown backticks problem, but we still can’t execute the output directly. We have to go back and prepend “SELECT” to the result. This is not a general solution either because not all SQL queries start with “SELECT”.
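If you do go the pre-fill route, it’s worth wrapping the bookkeeping so the seeded prefix always gets glued back on. A tiny hypothetical helper (the name is mine):

```python
def assemble_prefilled(prefix: str, continuation: str) -> str:
    """Re-attach a pre-filled assistant prefix to the model's continuation."""
    # The continuation keeps whatever leading whitespace the model emitted,
    # so plain concatenation is enough; strip only the outer edges.
    return (prefix + continuation).strip()

assemble_prefilled("SELECT", ' AVG("Temp_Mitjana_Juny") FROM "t"')
# → 'SELECT AVG("Temp_Mitjana_Juny") FROM "t"'
```

This only patches the reassembly step; the not-all-queries-start-with-SELECT caveat still stands.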

Claude 4 with 1-shot context

Let’s show Claude Sonnet 4 an example of how we want the output to look, given a question. First we prepend the example to the system prompt, and then call the API.

Code
example = (
    "Question: Which neighborhood of Barcelona has the most trees?\n"
    "Answer: SELECT \"Nom_Barri\", COUNT(*) AS \"num_trees\" FROM "
    "\"ArbratViari\" GROUP BY 1 ORDER BY 2 DESC LIMIT 1"
)

system_prompt += "\n\n" + example
print(system_prompt)
You are an AI assistant tasked with generating SQL queries to answer questions about a DuckDB table. The user will first ask their question then provide information about the table, including the table name, description, an example row of data, and field descriptions. Return only clean, executable SQL code; do not output any additional reasoning, commentary, or formatting. Remember to put quotes around table and field names.

Question: Which neighborhood of Barcelona has the most trees?
Answer: SELECT "Nom_Barri", COUNT(*) AS "num_trees" FROM "ArbratViari" GROUP BY 1 ORDER BY 2 DESC LIMIT 1
Code
response = generate_claude_sql(
    model="claude-sonnet-4-20250514",
    system_prompt=system_prompt,
    question=question,
    table_info=table_info_str,
)
print(response)
SELECT AVG("Temp_Mitjana_Juny") AS "average_june_temperature"
FROM "temperaturesbcndesde1780_2024"
WHERE "Any" >= 1994 AND "Any" <= 2024

Brittle and slightly incorrect SQL aside, problem solved. Kinda.

Wrap up

Yes, this is a single example in a single application, but I’ve seen very little variation in this pattern over many text-to-SQL requests. Claude Sonnet 4 consistently ignores the commentary and formatting instruction until an example is added to the prompt.

It’s a small thing—extracting the SQL from markdown formatting isn’t hard—but each of these small things makes it harder to use Claude Sonnet 4’s output, especially in an automated system.

What’s your experience with Claude Sonnet 4’s instruction-following ability? Drop me a line in the Bluesky or LinkedIn replies.