Welcome back to the "Scoring LLM Outputs with Logprobs and Perplexity" course. In the previous lesson, you explored how to compare sentence likelihoods using log probabilities. In this lesson, you’ll learn about perplexity—a metric used to evaluate how well a language model predicts text.
Perplexity gives us a sense of how “surprised” a model is by a sequence. A lower value means the model finds the sentence more natural; a higher value suggests the model finds it awkward or unexpected.
By the end of this lesson, you'll understand what perplexity means, how it's related to log probabilities, and how to approximate it using OpenAI’s API.
In traditional NLP, perplexity is computed as the exponential of the average negative log-likelihood of a token sequence:
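For a sequence of tokens $x_1, \dots, x_N$:

$$\text{Perplexity}(x) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})\right)$$

Because the exponent is the *average* negative log-likelihood, perplexity is comparable across sequences of different lengths.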
We’ll compare two sentences—one fluent, one awkward—and use the log probability of the generated token to approximate perplexity.
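A minimal sketch of this computation (the helper name is ours, and the model string, client setup, and sample logprob values in the demo are assumptions; any chat model that returns `logprobs` will do):

```python
import math

def perplexity_from_logprobs(logprobs: list[float]) -> float:
    """Perplexity: exp of the average negative log-probability over the tokens."""
    return math.exp(-sum(logprobs) / len(logprobs))

# Obtaining a token logprob with the openai client (sketch; requires an API key):
#
#   from openai import OpenAI
#   client = OpenAI()
#   resp = client.chat.completions.create(
#       model="gpt-4o-mini",   # assumed model name
#       messages=[{"role": "user", "content": "Cats sleep on the windowsill."}],
#       max_tokens=1,          # one generated token is enough for this approximation
#       logprobs=True,         # return the log probability of each generated token
#   )
#   lp = resp.choices[0].logprobs.content[0].logprob
#
# With a single token, the average reduces to that one logprob, so
# perplexity = exp(-logprob). Illustrative values (not real API output):
print(round(perplexity_from_logprobs([-3.82])))  # less negative logprob -> lower perplexity
print(round(perplexity_from_logprobs([-5.74])))  # more negative logprob -> higher perplexity
```

Note that a less negative logprob maps to a lower perplexity, which is exactly the ordering we expect between the fluent and awkward sentences.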
Example output:
Explanation of the output:
- For "Cats sleep on the windowsill.", the log probability is higher (less negative) and the perplexity is much lower (45.80): the model finds this sentence natural and likely.
- For "Cats windowsill the on sleep.", the log probability is lower (more negative) and the perplexity is much higher (312.27): the model finds this sentence far less likely, i.e. more surprising.
- The absolute perplexity values will vary with the model and API version, but the key point is that the fluent sentence has a significantly lower perplexity than the awkward one. This shows that perplexity can be used to compare the relative fluency or naturalness of sentences according to the language model.
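Since the single-token approximation is just exp(−logprob), the two reported perplexities can be mapped back to the token logprobs that produced them, which makes the "less negative vs. more negative" contrast concrete:

```python
import math

# Perplexity from a single token logprob is exp(-logprob), so
# logprob = -ln(perplexity). Recover the logprobs implied by the run above:
fluent_lp = -math.log(45.80)    # approx. -3.82 (less negative)
awkward_lp = -math.log(312.27)  # approx. -5.74 (more negative)
print(round(fluent_lp, 2), round(awkward_lp, 2))
```

The gap of roughly 1.9 nats between the two logprobs is what the perplexity ratio of about 6.8x amplifies into an easy-to-read difference.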
- Lower perplexity means the sentence is more natural to the model.
- Higher perplexity indicates the sentence is confusing or unlikely.
- We’re approximating perplexity with one token—not a full sequence—so results are directional, not absolute.
- Make sure `max_tokens` is set to 1 or more so that a generated token (and its log probability) is returned.
In this lesson, you learned how to approximate perplexity using log probabilities from OpenAI’s API. While this single-token approximation is not the traditional full-sequence measure, it provides a useful proxy for evaluating sentence fluency. In the next unit, you’ll compare multiple models by computing perplexity for the same sentence under each one.
