Introduction to Sentence Likelihoods

Welcome back to the course "Scoring LLM Outputs with Logprobs and Perplexity." In the previous lesson, we explored how log probabilities provide insights into a language model’s confidence when generating tokens. Now, we’ll build on that foundation by comparing sentence likelihoods using log probabilities.

Evaluating sentence likelihoods helps us understand how models judge different formulations of language. In this lesson, you’ll learn how to use the OpenAI API to score sentences based on the log probability of the model’s next token prediction.

The Importance and Applications of Likelihoods

Likelihoods are a fundamental concept in language modeling and natural language processing. They measure how probable a sequence of words is according to a model, allowing us to quantify how “natural” or “expected” a sentence is. This is crucial for a variety of tasks:

  • Model Evaluation: Likelihoods are widely used to compare different language models and assess their performance on tasks like text generation, translation, and summarization.
  • Error Detection: By identifying sentences or tokens with unusually low likelihoods, we can spot errors, anomalies, or unnatural phrasing in generated text.
  • Data Filtering: Likelihood scores help filter out low-quality or irrelevant data when building datasets for training or evaluation.
  • Downstream Applications: Many applications—such as speech recognition, machine translation, and autocomplete—rely on likelihoods to rank candidate outputs and select the most plausible one.

Because of their versatility and interpretability, likelihoods (and their log-transformed versions, log probabilities) are a standard tool for both researchers and practitioners working with language models.

Understanding the Code Structure

Let’s break down the code you’ll use in this unit. We begin by initializing the OpenAI client and defining a list of candidate sentences. For each sentence, we’ll pass it to the model and extract the log probability of the first predicted token, which gives us a proxy for how likely or “natural” the sentence feels to the model.

We use:

  • logprobs=True to return log probability data.
  • top_logprobs=5 to retrieve scores for the top 5 candidate tokens.
  • max_tokens=1 to generate exactly one token prediction.
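Putting these options together, a minimal sketch of the request might look like the following. The model name and prompt here are illustrative assumptions, not fixed by the course, and the actual API call is guarded so the extraction helper can be reused on its own:

```python
import os

def extract_first_logprob(response):
    """Pull out the log probability of the first generated token."""
    return response.choices[0].logprobs.content[0].logprob

if __name__ == "__main__":
    from openai import OpenAI  # requires the openai package and an API key
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    response = client.chat.completions.create(
        model="gpt-4o-mini",                # illustrative model choice
        messages=[{"role": "user", "content": "The sky is"}],
        logprobs=True,                      # return log probability data
        top_logprobs=5,                     # scores for the top 5 candidate tokens
        max_tokens=1,                       # generate exactly one token prediction
    )
    print(extract_first_logprob(response))
```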

Extracting and Interpreting Log Probabilities

When you request logprobs=True from the OpenAI API, the response includes a logprobs object for each generated token. This object contains the log probability assigned to that token, as well as the top alternative tokens and their logprobs.
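As an illustration, one token's entry in that logprobs object can be pictured as a plain dictionary like this (the token strings and numeric values below are made up):

```python
# Made-up values illustrating the shape of one token's logprob entry.
token_entry = {
    "token": " star",           # the generated token
    "logprob": -0.12,           # log probability assigned to that token
    "top_logprobs": [           # the top alternative candidates and their scores
        {"token": " star", "logprob": -0.12},
        {"token": " ball", "logprob": -2.85},
        {"token": " very", "logprob": -3.41},
    ],
}
print(token_entry["token"], token_entry["logprob"])
```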

A log probability closer to 0 means the model is more confident in that token, while large negative values signal low confidence. (Log probabilities are always at or below 0, because probabilities lie between 0 and 1.)
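Because a log probability is just the natural logarithm of the token's probability, exponentiating it recovers the probability itself, which makes the confidence relationship concrete:

```python
import math

# Exponentiating a log probability recovers the underlying probability,
# so values nearer 0 correspond to higher model confidence.
for logprob in [-0.01, -0.5, -2.0, -5.0]:
    prob = math.exp(logprob)
    print(f"logprob {logprob:>6} -> probability {prob:.4f}")
```

For example, a logprob of -0.01 corresponds to a probability of about 0.99, while -5.0 corresponds to well under 0.01.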

By comparing logprob values for different sentences, you can infer which one the model finds more plausible.

Example: Comparing Sentence Fluency

Let’s look at two example sentences:

  • "The sun is a star."
  • "The sun is a sandwich."

While both are syntactically valid, one is clearly more semantically coherent. We’ll use logprobs to see how the model scores them.
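A sketch of that comparison might look like this, assuming the openai Python client, an API key in the environment, and an illustrative model name:

```python
import os

def first_token_logprob(client, sentence, model="gpt-4o-mini"):
    """Score a sentence by the log probability of the model's first predicted token."""
    response = client.chat.completions.create(
        model=model,                       # illustrative model choice
        messages=[{"role": "user", "content": sentence}],
        logprobs=True,                     # return log probability data
        top_logprobs=5,                    # scores for the top 5 candidate tokens
        max_tokens=1,                      # generate exactly one token prediction
    )
    return response.choices[0].logprobs.content[0].logprob

if __name__ == "__main__":
    from openai import OpenAI  # requires the openai package and OPENAI_API_KEY
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    for sentence in ["The sun is a star.", "The sun is a sandwich."]:
        print(f"{sentence!r}: {first_token_logprob(client, sentence):.3f}")
```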

When you run the comparison, the model typically assigns a much higher log probability (closer to 0) to the first sentence, indicating that it considers it more likely.

Summary and Next Steps

In this lesson, you learned how to compare sentence plausibility by examining token-level log probabilities. This method allows you to go beyond just generating responses—now you can measure how confident the model is in its next move.

In the next unit, you’ll use this idea to calculate perplexity, a popular metric that quantifies overall sentence fluency using log probability averages. You're getting closer to evaluating language models like a pro—let’s keep going!
