How to track LLM costs with LiteLLM

A more complete code example for LLM cost visibility with LiteLLM.

Author: Brian Kent

Published: June 29, 2025

Note

I’m planning a multi-part series about the Barcelona Data Trivia bot, a minimal AI Engineering project. The content of this post may change as I pull the full series together.

-Brian

Cost visibility and tracking are essential for LLM-powered applications running at scale. LiteLLM does a great job maintaining a lookup table of metered pricing across many providers and models, but its documentation about how to use that info is a bit thin. So, here’s how I set it up for my minimum viable AI Engineering project, the Barcelona Data Trivia bot.
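As a quick taste of that pricing table, here’s a sketch of peeking at it directly via litellm.model_cost. The key names here are my assumption and may vary by LiteLLM version.

Code
import litellm

# LiteLLM ships a pricing table keyed by model name. Look up the per-token
# prices for one model; key names may differ across LiteLLM versions.
pricing = litellm.model_cost.get("claude-sonnet-4-20250514", {})
print(pricing.get("input_cost_per_token"))   # dollars per input token
print(pricing.get("output_cost_per_token"))  # dollars per output token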

The Barcelona Data Trivia bot’s purpose is to post an interesting factoid about Barcelona each day on BlueSky, based on a query against data in the Barcelona Open Data portal. It uses LLMs to generate interesting questions, convert those questions into SQL queries, validate and format the SQL, and write the factoid based on the results. For brevity, in this post we’ll only look at the SQL generation and formatting steps.

Setup

Code: set env vars with dotenv.
import os
from dotenv import load_dotenv

load_dotenv()

We’ll start with the question

What hour of the week had the most vehicular accidents in 2024?

and we’ll use this table as the source for a truthful answer.

Code
dataset = "accidents-gu-bcn"
table = "2024_accidents_gu_bcn"
question = "What hour of the week had the most vehicular accidents in 2024?"

For reliability, I’ve downloaded the data and metadata to my local machine. To give LLMs context about the table, we query the first row of data and stringify the metadata.

Code
import duckdb
import json

# Connect to the DB.
db = duckdb.connect(f"{dataset}.db", read_only=True)

# Load the table metadata from file.
with open(f"{dataset}.json", "rb") as f:
    meta = json.loads(f.read())

# Get an example row of data.
sql = f"""SELECT * FROM "{table}" LIMIT 1"""

result = db.sql(sql)
field_names = [x[0] for x in result.description]
example = dict(zip(field_names, result.fetchone()))

# Format the table metadata as an LLM-readable string.
table_info = [
    f"Table name: {table}",
    f"Dataset name: {meta['title_translated']['ca']}",
    f"Dataset source: {meta['fuente']}",
    f"Dataset description: {meta['notes_translated']['ca']}",
    f"First row of data:\n{example}",
    f"Field descriptions:\n{meta['extras']}",
]

table_info_str = "\n\n".join(table_info)
print(table_info_str)
Table name: 2024_accidents_gu_bcn

Dataset name: Accidents gestionats per la Guàrdia Urbana a la ciutat de Barcelona

Dataset source: Aplicació Accidents

Dataset description: Llistat dels accidents gestionats per la Guàrdia Urbana a la ciutat de Barcelona. Incorpora el número de lesionats segons gravetat, el número de vehicles implicats i el punt d'impacte.  

First row of data:
{'Numero_expedient': '2024S000001', 'Codi_districte': 9, 'Nom_districte': 'Sant Andreu', 'Codi_barri': 57, 'Nom_barri': 'la Trinitat Vella', 'Codi_carrer': '701268', 'Nom_carrer': 'Dalt (Llobregat)', 'Num_postal': 'K165', 'Descripcio_dia_setmana': 'Dilluns', 'NK_Any': 2024, 'Mes_any': 1, 'Nom_mes': 'Gener', 'Dia_mes': 1, 'Hora_dia': 2, 'Descripcio_torn': 'Nit', 'Descripcio_causa_vianant': None, 'Numero_morts': None, 'Numero_lesionats_lleus': 2, 'Numero_lesionats_greus': None, 'Numero_victimes': 2, 'Numero_vehicles_implicats': 1, 'Coordenada_UTM_Y_ED50': 432839.501, 'Coordenada_UTM_X_ED50': 4589349.921, 'Longitud_WGS84': 2.19480599, 'Latitud_WGS84': 41.45107592}

Field descriptions:
[{'key': '01.Numero_expedient', 'value': "Número identificatiu de l'expedient"}, {'key': '02.Codi_districte', 'value': 'Codi del districte'}, {'key': '03.Nom_districte', 'value': 'Nom del districte'}, {'key': '04.Codi_barri', 'value': 'Codi del barri'}, {'key': '05.Nom_barri', 'value': 'Nom del barri'}, {'key': '06.Codi_carrer', 'value': 'Codi del carrer'}, {'key': '07.Nom_carrer', 'value': 'Nom del carrer'}, {'key': '08.Num_postal', 'value': 'Número postal'}, {'key': '09.Descripcio_dia_setmana', 'value': 'Nom del dia de la setmana'}, {'key': '10.Dia_setmana', 'value': "Diminutiu del dia de la setmana (Fins l'any 2020)"}, {'key': '11.Descripcio_tipus_dia', 'value': "Tipus de dia (Fins l'any 2020)"}, {'key': '12.NK_Any', 'value': 'Any'}, {'key': '13.Mes_any', 'value': "Mes de l'any"}, {'key': '14.Nom_mes', 'value': 'Nom del mes'}, {'key': '15.Dia_mes', 'value': 'Dia del mes'}, {'key': '16.Hora_dia', 'value': 'Hora del dia'}, {'key': '17.Descripcio_torn', 'value': 'Tipus del torn (matí, tarda, nit)'}, {'key': '18.Descripcio_causa_vianant', 'value': "Tipus de causalitat de l'accident"}, {'key': '19.Numero_morts', 'value': 'Nombre de morts'}, {'key': '20.Numero_lesionats_lleus', 'value': 'Número de lesionats lleus'}, {'key': '21.Numero_lesionats_greus', 'value': 'Número de lesionats greus'}, {'key': '22.Numero_victimes', 'value': 'Número de víctimes implicades'}, {'key': '23.Numero_vehicles_implicats', 'value': 'Número de vehicles implicats'}, {'key': '24.Coordenada_UTM_X_ED50', 'value': 'Coordenada X en format UTM (ED50)'}, {'key': '25.Coordenada_UTM_Y_ED50', 'value': 'Coordenada Y en format UTM (ED50)'}, {'key': '26.Longitud_WGS84', 'value': 'Longitud'}, {'key': '27.Latitud_WGS84', 'value': 'Latitud'}]

Now we need to set up our LiteLLM config. First, the Router, which indicates which LLMs we want to use as primary and fallback options. In our router, we’ll set GPT-4.1 as the fallback for Claude Sonnet 4.

Code
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "claude",
            "litellm_params": {"model": "anthropic/claude-sonnet-4-20250514"},
        },
        {"model_name": "gpt4", "litellm_params": {"model": "openai/gpt-4.1"}},
    ],
    fallbacks=[{"claude": ["gpt4"]}],
    set_verbose=False,
    debug_level="INFO",
)
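With this config, a failed call to the "claude" deployment gets retried against "gpt4" with the same messages. Here’s a quick smoke test; just a sketch, not part of the bot itself.

Code
# Exercise the router once. If the Anthropic call fails, the router retries
# the same request against the "gpt4" fallback deployment.
resp = router.completion(
    model="claude",
    messages=[{"role": "user", "content": "Reply with one word: hello."}],
)
print(resp.model)  # shows which underlying model actually answered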

Usage and cost tracking

OK, here’s the part you’ve been waiting for. We are going to track LLM usage, cost, and timing metrics with an instance of the UsageTracker class, which inherits from LiteLLM’s CustomLogger. A CustomLogger has several hook methods that can be overridden to implement custom behavior on every LiteLLM call; in this example, we use log_success_event to access the already-parsed model response object. We use LiteLLM’s completion_cost utility to compute the cost, and we also record the number of input and output tokens, the response time, and which model was used.

The LiteLLM documentation kinda emphasizes custom callback functions, so why use a class to track usage and cost? Two reasons. First, because we want to track metrics over the course of multiple LLM calls, aggregating the data as we go. And second, because the class is a nice place to park some simple methods that we almost always want to call on the data once it’s collected. Namely, the to_df method converts the collected data from a list of dictionaries to a Pandas DataFrame, and the summarize method sums the data, grouped by the model used.

Code
from datetime import datetime, timezone
from litellm import completion_cost
from litellm.integrations.custom_logger import CustomLogger
import pandas as pd

class UsageTracker(CustomLogger):
    def __init__(self):
        self.data = []
        super().__init__()

    def log_success_event(self, kwargs, response_obj, start_time, end_time):
        usage = response_obj.usage
        cost = completion_cost(completion_response=response_obj)

        self.data.append(
            {
                "timestamp": datetime.now(timezone.utc),
                "model": response_obj.model,
                "tokens_in": usage.prompt_tokens,
                "tokens_out": usage.completion_tokens,
                "cost_estimate": cost,
                "duration": (end_time - start_time).total_seconds(),
            }
        )

    def to_df(self) -> pd.DataFrame:
        if self.data:
            out = pd.DataFrame(self.data)
        else:
            out = pd.DataFrame()

        return out

    def summarize(self) -> pd.DataFrame:
        if self.data:
            df = self.to_df()
            grp = df.groupby("model")
            out = grp[["tokens_in", "tokens_out", "cost_estimate"]].sum()
            return out
        else:
            return pd.DataFrame()
1. This signature comes from the LiteLLM CustomLogger class.
2. completion_cost is a function that takes the response object.
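CustomLogger has hooks for failures too. If we also wanted to record failed calls, a minimal sketch might look like this; I’m assuming the raised exception lands in kwargs, per LiteLLM’s callback docs.

Code
class FailureTracker(CustomLogger):
    """Sketch: record failed LLM calls, reusing the imports from above."""

    def __init__(self):
        self.failures = []
        super().__init__()

    def log_failure_event(self, kwargs, response_obj, start_time, end_time):
        # Assumption: LiteLLM puts the raised exception in kwargs["exception"].
        self.failures.append(
            {
                "timestamp": datetime.now(timezone.utc),
                "model": kwargs.get("model"),
                "error": str(kwargs.get("exception")),
            }
        )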

Once we create an instance of the UsageTracker, we just add it to the global litellm.callbacks list.

Code
import litellm

usage = UsageTracker()
litellm.callbacks = [usage]
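One gotcha: log_success_event fires for synchronous calls like router.completion. If I later switch to router.acompletion, I’d need the async hook instead. A sketch, assuming async_log_success_event shares the same signature:

Code
class AsyncUsageTracker(UsageTracker):
    # For async calls (e.g., router.acompletion), LiteLLM invokes the async
    # hook. Delegating to the sync logic keeps a single code path.
    async def async_log_success_event(self, kwargs, response_obj, start_time, end_time):
        self.log_success_event(kwargs, response_obj, start_time, end_time)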

Run it

And that’s about it for setting up the tracking. To see it in action, let’s define and run a few LLM calls. The basic functions we need for this example are:

  • generate a SQL query, given our initial question and info about the DB table.
  • strip formatting like markdown backticks from the output of the first function.

We’ll use Claude Sonnet 4 to generate the initial SQL and GPT-4.1 to do the formatting.

Code
def generate_sql(router, question: str, table_info: str):
    """Translate a natural langauge question to a SQL query."""

    response = router.completion(
        model="claude",
        temperature=0,
        messages=[
            {
                "role": "developer",
                "content": (
                    "You are an AI assistant tasked with generating SQL queries "
                    "to answer questions about a DuckDB table, given metadata "
                    "about the table, an example row of data, descriptions of "
                    "the fields, and optional hints from the user about the "
                    "table schema. Return only clean SQL code; do not use any "
                    "reasoning or markdown formatting. Remember to put quotes "
                    "around table and field names."
                ),
            },
            {"role": "user", "content": f"Question: {question}"},
            {"role": "user", "content": table_info}
        ],
    )

    return response.choices[0]["message"]["content"]

def strip_formatting(router, llm_sql: str):
    """Use an LLM to strip everything except clean SQL code from another LLM's
    response."""

    response = router.completion(
        model="gpt4",
        temperature=0,
        messages=[
            {
                "role": "developer",
                "content": (
                    "You are an AI assistant tasked with extracting completely "
                    "clean SQL code from input text. "
                    "Your response should be pure, executable SQL code, free "
                    "of any commentary or markdown formatting.\n\n"
                    "Input: ```sql\nSELECT * FROM vehicles LIMIT 10```"
                    "Output: SELECT * FROM vehicles LIMIT 10"
                ),
            },
            {"role": "user", "content": f"Input:\n{llm_sql}"},
        ],
    )

    return response.choices[0]["message"]["content"]
1. Setting the temperature to 0 isn’t necessarily a good idea in production but is helpful for writing blog posts (and testing).

We generate the SQL query once, then run the formatting step on the result.

Code
llm_sql = generate_sql(router, question, table_info_str)
sql = strip_formatting(router, llm_sql)
print(sql)
SELECT 
    "Descripcio_dia_setmana",
    "Hora_dia",
    COUNT(*) as accident_count
FROM "2024_accidents_gu_bcn"
WHERE "Numero_vehicles_implicats" >= 1
GROUP BY "Descripcio_dia_setmana", "Hora_dia"
ORDER BY accident_count DESC
LIMIT 1

Looks like decent, runnable SQL. Excellent. Now let’s check the result from the DB, to confirm.

Code
result = db.sql(sql)
result
┌────────────────────────┬──────────┬────────────────┐
│ Descripcio_dia_setmana │ Hora_dia │ accident_count │
│        varchar         │  int64   │     int64      │
├────────────────────────┼──────────┼────────────────┤
│ Divendres              │       15 │             99 │
└────────────────────────┴──────────┴────────────────┘

Looks like Friday at 3pm had the most car accidents in 2024 (attended to by the local police).

Check the results

Now let’s see how many tokens we used, the estimated cost according to LiteLLM’s metering table, and the response time of each call.

Code
usage.to_df()
                          timestamp                     model  tokens_in  tokens_out  cost_estimate  duration
0  2025-06-29 14:02:57.555700+00:00  claude-sonnet-4-20250514       1354         103       0.005607  3.758616
1  2025-06-29 14:02:58.804912+00:00        gpt-4.1-2025-04-14        147          73       0.000878  1.246519

Our summarize method sums the results by model. In this case, we only have one call to each model, so the aggregation doesn’t add much.

Code
usage.summarize()
                          tokens_in  tokens_out  cost_estimate
model
claude-sonnet-4-20250514       1354         103       0.005607
gpt-4.1-2025-04-14              147          73       0.000878

Finally, a simple one-liner on our usage dataframe to get the total estimated cost.

Code
print(f"Total estimated cost: ${usage.to_df()['cost_estimate'].sum():.6f}")
Total estimated cost: $0.006485

We spent about 0.6 cents on this run, according to LiteLLM, most of which went to Anthropic.
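Because the tracker lives in the process, it’s easy to bolt on a simple budget guard. A sketch, with a made-up threshold:

Code
# Abort if a single run blows past a (hypothetical) budget.
MAX_RUN_COST = 0.05  # dollars; pick whatever threshold makes sense

total_cost = usage.to_df()["cost_estimate"].sum()
if total_cost > MAX_RUN_COST:
    raise RuntimeError(
        f"Run cost ${total_cost:.4f} exceeded budget ${MAX_RUN_COST:.2f}"
    )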

Wrap up

There are other ways to track cost. Your LLM provider accounts have the ground truth, of course, but it’s a pain to check them on every run of a script, and most don’t have as much granularity as they should. Some enterprise monitoring tools have also added LLM cost visibility to their platforms. But for me, having a quick and easy utility that I can work with in the Python code itself is the sweet spot.
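If you want the numbers to survive across runs, one lightweight option is to append each run’s usage to a local CSV. A sketch; the file name is hypothetical and this isn’t what the actual repo does:

Code
from pathlib import Path

# Append this run's usage so costs accumulate across script runs.
log_path = Path("usage_log.csv")  # hypothetical path
usage.to_df().to_csv(
    log_path, mode="a", header=not log_path.exists(), index=False
)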

One additional note: this post implies all the code is jammed into a single file, because I don’t know how to write a Quarto blog post any other way. The actual Barcelona Data Trivia repo has code split across multiple files:

  • cli.py
  • main.py
  • llm_utils.py
  • utils.py
  • etc.

What do you think? How do you track LLM spending? Let me know in the replies to the BlueSky or LinkedIn posts about this article.