Welcome to the first lesson of the course, where we begin our journey into behavioral benchmarking of large language models (LLMs). In this course, you will learn how to evaluate LLMs beyond just accuracy, focusing on how they use resources, how their outputs change with different settings, and how to spot unusual or incorrect responses.
In this lesson, we focus on token usage. Tokens are the basic units of text that LLMs process — think of them as pieces of words or characters. Every time you send a prompt to an LLM, it counts the number of tokens in your input and its output.
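If you want to see tokens for yourself, you can split a string into tokens locally with OpenAI's `tiktoken` library. This is a small optional sketch (it assumes `tiktoken` is installed; it is not needed for the rest of the lesson):

```python
import tiktoken

# Load the tokenizer that gpt-3.5-turbo uses
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

text = "What is photosynthesis?"
token_ids = encoding.encode(text)

print(token_ids)                                  # integer ID of each token
print([encoding.decode([t]) for t in token_ids])  # the piece of text each token covers
print(f"{len(token_ids)} tokens")
```

Even a short sentence usually breaks into several tokens, and common words often map to a single token while rarer words split into multiple pieces.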
In benchmarking, efficiency matters as much as accuracy. Tracking token usage helps you understand not only how "smart" a model is, but how resource-hungry it is — something crucial for real-world applications and cost management.
By the end of this lesson, you will know how to measure token usage for different prompts and interpret what those numbers mean.
When working with LLMs, it’s important to understand the concept of a context window. The context window is the maximum number of tokens (including both your input prompt and the model’s output) that the model can process in a single request. For example, the `gpt-3.5-turbo` model has a context window of 4096 tokens.
The context window determines how much information the model can "see" at once. If your prompt and the expected completion together exceed this limit, the model will either truncate the input or cut off the output. This can affect the quality and completeness of the responses.
When benchmarking or designing prompts, always keep the context window in mind:
- If your prompt is very long, you may need to shorten it to leave enough room for the model’s response.
- If you expect a long answer, make sure your prompt is concise enough to fit within the context window.
- Exceeding the context window can lead to incomplete outputs or errors.
Monitoring token usage helps you stay within the context window and ensures that your prompts and completions are processed as intended.
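A simple way to check this before sending a request is to count the prompt’s tokens locally and see how much of the context window remains. The sketch below again assumes the `tiktoken` library, and the prompt and budget numbers are just illustrative; note that chat formatting adds a handful of extra tokens on top of the raw count:

```python
import tiktoken

CONTEXT_WINDOW = 4096        # context window for gpt-3.5-turbo
DESIRED_COMPLETION = 500     # room we want to leave for the model's answer

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
prompt = "Summarize the main stages of photosynthesis for a high-school student."

# Rough count of prompt tokens; chat formatting adds a few more in practice
prompt_tokens = len(encoding.encode(prompt))
remaining = CONTEXT_WINDOW - prompt_tokens

print(f"Prompt uses about {prompt_tokens} tokens; roughly {remaining} remain for the completion.")
if remaining < DESIRED_COMPLETION:
    print("Warning: shorten the prompt or expect a shorter (possibly cut-off) answer.")
```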
To measure token usage, we will use the OpenAI Python client library. This library allows you to interact with OpenAI’s models using Python code. If you are working on your own computer, you would usually install the library using a command like `pip install openai`. However, on CodeSignal, the OpenAI library is already installed for you, so you can start coding right away.
The OpenAI client lets you send prompts to a model and receive responses. It also provides useful information about each request, including how many tokens were used for the prompt, the completion (the model’s response), and the total. In the next section, we will look at a practical example of how to use this client to measure token usage.
Let’s look at a code example that measures token usage for several different prompts. Here is the code you will use:
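A minimal version of the script looks like this (it assumes the v1.x OpenAI Python client and an `OPENAI_API_KEY` environment variable; the example prompts are placeholders you can replace with your own):

```python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# A few example prompts of varying length (placeholders -- use your own)
prompts = [
    "What is photosynthesis?",
    "Explain photosynthesis in one sentence.",
    "Explain photosynthesis in detail, including the role of chlorophyll.",
]

for prompt in prompts:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # temperature 0 makes the output more deterministic
        messages=[{"role": "user", "content": prompt}],
    )
    usage = response.usage  # token counts reported by the API
    print(f"Prompt: {prompt}")
    print(f"  Prompt tokens:     {usage.prompt_tokens}")
    print(f"  Completion tokens: {usage.completion_tokens}")
    print(f"  Total tokens:      {usage.total_tokens}")
```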
In this example, you first import the OpenAI client and create a list of prompts. For each prompt, you send a request to the `gpt-3.5-turbo` model with a temperature of `0` (which makes the output more deterministic). The response from the model includes a `usage` object, which contains the number of tokens used for the prompt, the completion, and the total. The code then prints out these numbers for each prompt.
When you run this code, you will see output similar to the following (the exact numbers may vary):
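The counts below are invented placeholders shown only to illustrate the shape of the output; your numbers will depend on the prompts you use and the responses the model generates:

```
Prompt: What is photosynthesis?
  Prompt tokens:     13
  Completion tokens: 62
  Total tokens:      75
Prompt: Explain photosynthesis in one sentence.
  Prompt tokens:     15
  Completion tokens: 34
  Total tokens:      49
Prompt: Explain photosynthesis in detail, including the role of chlorophyll.
  Prompt tokens:     21
  Completion tokens: 310
  Total tokens:      331
```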
This output shows you, for each prompt, how many tokens were used in the input, how many in the output, and the total for the request.
Now, let’s break down what these numbers mean. The `prompt tokens` value tells you how many tokens were in your input prompt. The `completion tokens` value shows how many tokens the model used to generate its response. The `total tokens` value is simply the sum of the two.
But what do these numbers actually mean in practice?
- Prompt tokens: This number reflects the length and complexity of your input. For example, a short question like "What is photosynthesis?" might use only a few tokens, while a detailed instruction or a multi-part question will use more. If you see a high prompt token count, it means your input is long or complex.
- Completion tokens: This number tells you how much the model "said" in response. A low number means a short answer; a high number means a longer, more detailed answer. If you ask for a summary in one sentence, you’ll see fewer completion tokens than if you ask for a detailed explanation.
- Total tokens: This is the sum of prompt and completion tokens. It represents the total resources used for the request. Most LLM providers, including OpenAI, charge based on this total token count.
- Cost: Since providers charge per token, higher total token counts mean higher costs. Monitoring token usage helps you manage expenses, especially at scale.
- Efficiency: If you notice that similar prompts produce very different token counts, it may indicate that some prompts are less efficient or that the model is being unnecessarily verbose.
- Model Limits: LLMs have maximum token limits per request (the context window, for example, 4096 tokens for some models). If your prompt and expected completion together approach this limit, you may need to shorten your input or expect shorter outputs.
- Benchmarking: By comparing token usage across prompts and models, you can benchmark not just accuracy, but also efficiency. For example, if two models give similar answers but one uses fewer tokens, it may be more efficient for your use case.
Understanding these numbers helps you see how much information you are sending and receiving, estimate costs, and optimize your prompts for both quality and efficiency.
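As a quick illustration of cost estimation, you can convert the usage numbers into an approximate dollar figure. The per-token prices below are placeholders rather than real OpenAI rates, so check your provider’s pricing page before relying on the result:

```python
# Hypothetical per-1K-token prices in USD -- placeholders, not actual OpenAI rates
PRICE_PER_1K_PROMPT_TOKENS = 0.0005
PRICE_PER_1K_COMPLETION_TOKENS = 0.0015

def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Approximate cost of one request, given the token counts from response.usage."""
    return (
        prompt_tokens / 1000 * PRICE_PER_1K_PROMPT_TOKENS
        + completion_tokens / 1000 * PRICE_PER_1K_COMPLETION_TOKENS
    )

# Example with the kind of numbers you might read from a usage object
print(f"Estimated cost: ${estimate_cost(120, 350):.6f}")
```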
In this lesson, you learned why token usage is important when working with LLMs and how to measure it using the OpenAI Python client. You saw a practical example of sending multiple prompts to a model and reading the token usage from the response. You also learned how to interpret the output, so you can understand how your prompts and the model’s responses affect token counts.
You also learned about the context window, which is the maximum number of tokens a model can process at once. Keeping the context window in mind is essential for ensuring your prompts and completions fit within the model’s limits and for optimizing both efficiency and output quality.
This knowledge will be important as you move on to the practice exercises, where you will get hands-on experience measuring and comparing token usage for different prompts. Understanding token usage and the context window is the first step in effective LLM benchmarking, and it will help you make better decisions about how to use these models efficiently.
