DSPy seems to be having a bit of a moment. Apparently, it’s a must-have for serious AI engineers.
So I started exploring and found a potential gotcha right away.
First things first, let’s get our setup boilerplate out of the way. We’ll use Claude 3.7 Sonnet, but it doesn’t matter much for the purposes of this post.
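Something like this (the exact model identifier is an assumption):

Code
import dspy

# Point DSPy at Claude 3.7 Sonnet; the model id here is a guess
lm = dspy.LM("anthropic/claude-3-7-sonnet-20250219")
dspy.configure(lm=lm)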
By design, DSPy hides all details about prompt construction. We define a task in code instead, with a Signature.
Writing signatures is far more modular, adaptive, and reproducible than hacking at prompts or finetunes. The DSPy compiler will figure out how to build a highly-optimized prompt for your LM (or finetune your small LM) for your signature, on your data.
The simplest version of a signature is a string. It’s a bit weird that we can include type hints inside a string, but it seems clear enough that all of this information will be passed to our LLM in the prompt somehow.
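Something like this (the question text here is my own stand-in; the 1/36 answer below suggests a dice-probability question like the one in the DSPy docs):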
Code
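math = dspy.Predict("question -> answer: float")
math(question="Two dice are tossed. What is the probability that the sum equals two?")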
Prediction(
answer=0.027777777777777776
)
I’m generally uncomfortable with this kind of DSL because I don’t know exactly how it’s going to be parsed. To be more explicit, we can define our own class that inherits from Signature. For example, to classify the sentiment and spamminess of a book review:
Code
from typing import Literal
from dspy import Signature, InputField, OutputField, Predict

class Classifier(Signature):
    """Classify a given sentence."""

    sentence: str = InputField()
    sentiment: Literal['positive', 'negative', 'neutral'] = OutputField()
    spam_score: float = OutputField()

classifier = Predict(Classifier)
output = classifier(
    sentence="This book was super fun to read, though not the last chapter."
)
output
Prediction(
sentiment='positive',
spam_score=0.1
)
Looks good: it’s clearly a positive review and not spammy. But we want to build intuition for debugging and iterating, so we need to see the prompt that DSPy constructed and sent to the LLM. Let’s check out the system message.
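One way to pull it up (a sketch; dspy.inspect_history prints the most recent LM calls, including the system message):

Code
# Show the prompt and completion from the most recent LM call
dspy.inspect_history(n=1)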
Your input fields are:
1. `sentence` (str)
Your output fields are:
1. `sentiment` (Literal['positive', 'negative', 'neutral'])
2. `spam_score` (float)
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## sentence ## ]]
{sentence}

[[ ## sentiment ## ]]
{sentiment}        # note: the value you produce must exactly match (no extra characters) one of: positive; negative; neutral

[[ ## spam_score ## ]]
{spam_score}        # note: the value you produce must be a single float value

[[ ## completed ## ]]

In adhering to this structure, your objective is:
        Classify a given sentence.
Wow, there’s a lot going on in there. What really jumps out is that the variable names sentence, sentiment, and spam_score are passed to the LLM in the prompt! This is a huge mental shift from traditional code, where variable names are meant only for other humans to read.
The Learn DSPy documentation does say this, but it’s buried and unclear, when it should be shouted from the rooftops.
While typical function signatures just describe things, DSPy Signatures declare and initialize the behavior of modules. Moreover, the field names matter in DSPy Signatures. You express semantic roles in plain English: a question is different from an answer, a sql_query is different from python_code.
There is nothing in the API docs about Signature that indicates the field names must have semantic meaning because the LLM is going to see them. Worse, in the Learn DSPy docs, it also says:
start simple [with field names] and don’t prematurely optimize keywords! Leave that kind of hacking to the DSPy compiler. For example, for summarization, it’s probably fine to say “document -> summary”, “text -> gist”, or “long_context -> tldr”.
It’s probably fine? That’s super wishy-washy. What counts as starting simple? It seems like we’re really not all that much further along than hacking on long string prompts.
Maybe it’s not a big deal, but imagine a junior developer (or a vibe coding PM using Cursor/Windsurf/Bolt.new/etc.) missed the bit about the importance of variable names and used x, y1, and y2 in place of sentence, sentiment, and spam_score.
Code
class Classifier(Signature):
    """Classify a given sentence."""

    x: str = InputField()
    y1: Literal['positive', 'negative', 'neutral'] = OutputField()
    y2: float = OutputField()

classifier = Predict(Classifier)
output = classifier(
    x="This book was super fun to read, though not the last chapter."
)
output
Prediction(
y1='positive',
y2=0.8
)
Oops, our second output is now 0.8, when it’s supposed to be 0.1. There’s no mechanism to say “Hey, don’t be a moron—name your variables”. We just get a wrong answer, silently.
The simple answer is that the InputField and OutputField classes can take descriptions that are passed to the LLM, but this isn’t documented well. We could also use the docstring to provide more detailed instructions to the LLM, but again, we have to discover that through trial and error.
In sum, DSPy has a lot going for it, but the obfuscation of prompt construction creates problems. Beware the footguns!