It seems Claude Sonnet 4 has had a noticeable regression in its ability to follow text-to-SQL instructions about formatting. It’s a small thing, but it’s the kind of thing that makes building fully automated agentic systems so hard.
The BarcelonaDataTrivia bot is a (work-in-progress) fully-automated, LLM-powered tool to post a data-based factoid about the city of Barcelona every day.
One step in the bot’s flow is to generate a SQL query that will run against one of the tables in the Barcelona Open Data portal, given a natural language question and metadata about the table. A part of this is making sure the text-to-SQL response from the LLM doesn’t contain extraneous commentary or formatting. It seems small, but can be tricky to enforce.
Claude Sonnet 3.7 seems to follow the “no-formatting” instruction very well but Claude Sonnet 4 just can’t seem to comply. Here’s an example.
Set up
First, the setup. I’ve copied the data from the Open Data BCN table on monthly average temperature to a local DuckDB table and the metadata to a local JSON file.
First some boilerplate you can ignore.
Then we compile the metadata about the table.
Code
# Connect to the DB and load the table metadata.
db = duckdb.connect(f"{dataset}.db", read_only=True)
with open(f"{dataset}_metadata.json", "rb") as f:
    meta = orjson.loads(f.read())["result"]

# Get an example row of data.
sql = f"""SELECT * FROM "{table}" LIMIT 1"""
result = db.sql(sql)
field_names = [x[0] for x in result.description]
example = dict(zip(field_names, result.fetchone()))

# Format the table metadata as an LLM-readable string.
table_info = [
    f"Table name: {table}",
    f"Dataset name: {meta['title_translated']['ca']}",
    f"Dataset source: {meta['fuente']}",
    f"Dataset description: {meta['notes_translated']['ca']}",
    f"First row of data:\n{example}",
    f"Field descriptions:\n{meta['extras']}",
]
table_info_str = "\n\n".join(table_info)
print(table_info_str)
Table name: temperaturesbcndesde1780_2024
Dataset name: Temperatures mitjanes mensuals de l’aire de la ciutat de Barcelona des de 1780
Dataset source: Servei Meteorològic de Catalunya
Dataset description: Temperatures mitjanes mensuals de l’aire de la ciutat de Barcelona des de 1780. La temperatura es proporciona en graus centígrads (ºC).
First row of data:
{'Any': 1780, 'Temp_Mitjana_Gener': 6.7, 'Temp_Mitjana_Febrer': 7.3, 'Temp_Mitjana_Marc': 11.6, 'Temp_Mitjana_Abril': 11.4, 'Temp_Mitjana_Maig': 16.3, 'Temp_Mitjana_Juny': 19.1, 'Temp_Mitjana_Juliol': 21.4, 'Temp_Mitjana_Agost': 22.0, 'Temp_Mitjana_Setembre': 20.3, 'Temp_Mitjana_Octubre': 16.3, 'Temp_Mitjana_Novembre': 9.2, 'Temp_Mitjana_Desembre': 7.3}
Field descriptions:
[{'key': '01.Any', 'value': 'Any de la medició'}, {'key': '02.Temp_Mitjana_Mes', 'value': 'Mitjana mensual corresponent en graus centígrads'}]
Task and context
Let’s ask a very simple question about this table:
Here’s the key part: the system prompt. Lines 5 and 6 explicitly tell the LLM not to include any additional commentary or formatting. As we’ll see, this is sufficient for Claude Sonnet 3.7 and GPT-4.1. For Claude Sonnet 4, it’s not.
Code
system_prompt = """
You are an AI assistant tasked with generating SQL queries to answer questions
about a DuckDB table. The user will first ask their question then provide
information about the table, including the table name, description, an
example row of data, and field descriptions. Return only clean, executable
SQL code; do not output any additional reasoning, commentary, or formatting.
Remember to put quotes around table and field names.
"""
# Get rid of newlines and whitespace (HELP! This cannot be the way!)
system_prompt = " ".join(system_prompt.split())
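If collapsing all the whitespace feels wrong, one standard-library alternative is `inspect.cleandoc` (a sketch, assuming the model doesn’t care about line breaks inside the prompt, and with the prompt abbreviated here): it strips the common leading indentation and surrounding blank lines while keeping intentional newlines.

```python
import inspect

# cleandoc removes leading/trailing blank lines and the indentation
# common to all lines, while preserving intentional newlines.
prompt = inspect.cleandoc("""
    You are an AI assistant tasked with generating SQL queries
    about a DuckDB table. Return only clean, executable SQL code.
    """)
```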
Experiments
First, a couple of functions to call the Anthropic and OpenAI APIs. The only thing to note here is how the system prompt is passed in as either system (Anthropic) or instructions (OpenAI), and then the question and table info are passed to the LLM in the “user” role.
Code
anthopic_client = Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))
openai_client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))


def generate_claude_sql(
    model: str, system_prompt: str, question: str, table_info: str
):
    message = anthopic_client.messages.create(
        model=model,
        max_tokens=2000,
        system=system_prompt,
        messages=[
            {"role": "user", "content": f"Question: {question}"},
            {"role": "user", "content": f"Table information:\n{table_info}"},
        ],
    )
    return message.content[0].text


def generate_gpt_sql(
    model: str, system_prompt: str, question: str, table_info: str
):
    response = openai_client.responses.create(
        model=model,
        instructions=system_prompt,
        input=[
            {"role": "user", "content": f"Question: {question}"},
            {"role": "user", "content": f"Table information:\n{table_info}"},
        ],
    )
    return response.output_text
Claude Sonnet 3.7
Let’s take a look first at Claude Sonnet 3.7.
Code
SELECT AVG("Temp_Mitjana_Juny") AS Avg_June_Temp
FROM "temperaturesbcndesde1780_2024"
WHERE "Any" >= YEAR(CURRENT_DATE) - 30
Looks good, let’s double check by running it.
┌───────────────────┐
│ Avg_June_Temp │
│ double │
├───────────────────┤
│ 21.82333333333333 │
└───────────────────┘
Excellent. Now let’s see Claude Sonnet 4.
Claude Sonnet 4
Code
```sql
SELECT AVG("Temp_Mitjana_Juny") as average_june_temperature
FROM "temperaturesbcndesde1780_2024"
WHERE "Any" >= 1994 AND "Any" <= 2024;
```
Uh oh, what are those markdown backticks doing in there? Also, while it’s not the point of this post, Claude Sonnet 4’s SQL is not quite as good as the previous output: the filter clause WHERE "Any" >= 1994 AND "Any" <= 2024
is a bit brittle and doesn’t actually answer the specific question we posed.
Just for comparison, let’s try GPT-4.1 too.
GPT 4.1
Code
SELECT AVG("Temp_Mitjana_Juny") AS avg_june_temperature
FROM "temperaturesbcndesde1780_2024"
WHERE "Any" >= (SELECT MAX("Any") FROM "temperaturesbcndesde1780_2024") - 29;
Not bad, although the SQL has the same issue Claude Sonnet 4’s did: it’s not actually what our question asked. Still, it’s pure SQL, which we can run without further processing.
Claude 4 with pre-filled output
OK, so how do we fix Claude 4? One thing we can do is to “seed” its response with the beginning of the SQL query, so it doesn’t have a chance to interject the markdown formatting. Line 8 below is a message with the role “assistant” and “SELECT” as the content. From here, Claude essentially has to continue the query.
Code
message = anthopic_client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2000,
    system=system_prompt,
    messages=[
        {"role": "user", "content": f"Question: {question}"},
        {"role": "user", "content": f"Table information:\n{table_info}"},
        {"role": "assistant", "content": "SELECT"},
    ],
)
print(message.content[0].text)
AVG("Temp_Mitjana_Juny")
FROM "temperaturesbcndesde1780_2024"
WHERE "Any" >= 1995 AND "Any" <= 2024;
Well, we’ve indeed solved the markdown-backticks problem, but we still can’t execute the output directly: we have to go back and prepend “SELECT” to the result. This isn’t a general solution either, because not all SQL queries start with “SELECT”.
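For completeness, the reassembly step is trivial; a hypothetical helper (the function name is mine, not part of the Anthropic API) might look like:

```python
def reassemble(prefill: str, continuation: str) -> str:
    """Glue the prefilled assistant text back onto Claude's continuation."""
    return f"{prefill} {continuation.strip()}"

sql = reassemble("SELECT", ' AVG("Temp_Mitjana_Juny") FROM "temperaturesbcndesde1780_2024"')
```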
Claude 4 with 1-shot context
Let’s show Claude Sonnet 4 an example of how we want the output to look, given a question. First we append the example to the system prompt, and then call the API.
Code
You are an AI assistant tasked with generating SQL queries to answer questions about a DuckDB table. The user will first ask their question then provide information about the table, including the table name, description, an example row of data, and field descriptions. Return only clean, executable SQL code; do not output any additional reasoning, commentary, or formatting. Remember to put quotes around table and field names.
Question: Which neighborhood of Barcelona has the most trees?
Answer: SELECT "Nom_Barri", COUNT(*) AS "num_trees" FROM "ArbratViari" GROUP BY 1 ORDER BY 2 DESC LIMIT 1
Code
SELECT AVG("Temp_Mitjana_Juny") AS "average_june_temperature"
FROM "temperaturesbcndesde1780_2024"
WHERE "Any" >= 1994 AND "Any" <= 2024
Brittle and slightly incorrect SQL aside, problem solved. Kinda.
Wrap up
Yes, this is a single example in a single application, but I’ve seen very little variation in this pattern over many text-to-SQL requests. Claude Sonnet 4 consistently ignores the commentary and formatting instruction until an example is added to the prompt.
It’s a small thing—extracting the SQL from markdown formatting isn’t hard—but each of these small things makes it harder to use Claude Sonnet 4’s output, especially in an automated system.
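For what it’s worth, a small post-processing guard can strip the fences when they appear (a sketch; the function name is mine):

```python
import re

def strip_sql_fences(text: str) -> str:
    """Remove a surrounding markdown code fence (```sql ... ```), if any."""
    match = re.match(r"^```(?:sql)?\s*\n(.*?)\n?```\s*$", text.strip(), re.DOTALL)
    return match.group(1).strip() if match else text.strip()

strip_sql_fences('```sql\nSELECT 1;\n```')  # "SELECT 1;"
```

But every such guard is another bit of glue code that shouldn’t be necessary when the instruction is already explicit.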
What’s your experience with Claude Sonnet 4’s instruction-following ability? Drop me a line in the BlueSky or LinkedIn replies.