Welcome to the third lesson of the "Evaluation in DSPy" course. In this lesson, we will focus on creating metrics, a crucial aspect of evaluating the quality of system outputs in DSPy. Metrics allow us to quantify how well a system performs, providing a basis for improvement and optimization. Building on the data handling skills you acquired in the previous lesson, you will now learn how to define and implement metrics to assess output quality effectively. By the end of this lesson, you will be equipped with the knowledge to create and use metrics in DSPy, setting the stage for practical applications and further exploration.
To begin, let's explore some basic metric functions that are foundational in evaluating system responses. One such function is `validate_answer`, which checks whether the predicted answer matches the expected answer. Here's how you can use it in DSPy:
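A minimal sketch of this metric, following DSPy's standard metric interface of an example, a prediction, and an optional trace argument:

```python
def validate_answer(example, pred, trace=None):
    # Compare the predicted answer to the gold answer, ignoring case differences.
    return example.answer.lower() == pred.answer.lower()
```

A function with this shape can be passed directly to `dspy.Evaluate` or to an optimizer as the `metric` argument.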
This function compares the predicted answer with the example's answer, ignoring case differences. It returns `True` if they match and `False` otherwise. This basic validation is useful for tasks where exact matches are required.
Next, let's look at the built-in `answer_exact_match` and `answer_passage_match` functions. These provide more flexibility by allowing partial matches and by checking whether the answer appears in a retrieved passage. Here's how you can use them:
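A sketch of how these can be called, assuming the prediction carries both an `answer` field and a `context` field with retrieved passages (the question and answer below are purely illustrative):

```python
import dspy
from dspy.evaluate import answer_exact_match, answer_passage_match

# A gold example; mark "question" as the input field.
example = dspy.Example(question="What is the capital of France?", answer="Paris").with_inputs("question")

# A prediction as a retrieval-augmented program might return it.
pred = dspy.Prediction(
    answer="Paris",
    context=["Paris is the capital and most populous city of France."],
)

print(answer_exact_match(example, pred))            # True: exact (case-insensitive) match
print(answer_exact_match(example, pred, frac=0.8))  # True if token-level F1 is at least 0.8
print(answer_passage_match(example, pred))          # True: "Paris" appears in the retrieved context
```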
Here is the underlying implementation of these functions, starting with `answer_exact_match`:
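The following is a simplified, self-contained sketch that mirrors the structure described here; the library version delegates to its own exact-match and F1 helpers rather than the `_normalize` and `_f1` functions defined below:

```python
import re
from collections import Counter


def _normalize(text):
    # Lowercase, strip punctuation, and collapse whitespace (simplified SQuAD-style normalization).
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", text.lower()).split())


def _f1(prediction, answer):
    # Token-level F1 between the prediction and a single gold answer.
    pred_tokens, ans_tokens = _normalize(prediction).split(), _normalize(answer).split()
    overlap = sum((Counter(pred_tokens) & Counter(ans_tokens)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_tokens), overlap / len(ans_tokens)
    return 2 * precision * recall / (precision + recall)


def _answer_match(prediction, answers, frac=1.0):
    # Exact match when frac >= 1.0; otherwise accept any answer whose token-level F1 clears the threshold.
    if frac >= 1.0:
        return any(_normalize(prediction) == _normalize(ans) for ans in answers)
    return any(_f1(prediction, ans) >= frac for ans in answers)


def answer_exact_match(example, pred, trace=None, frac=1.0):
    # The gold answer may be a single string or a list of acceptable strings.
    answers = [example.answer] if isinstance(example.answer, str) else example.answer
    return _answer_match(pred.answer, answers, frac=frac)
```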
This function uses a helper, `_answer_match`, to determine whether the prediction matches any of the expected answers, allowing for partial matches based on the `frac` parameter. The `answer_passage_match` function is implemented in a similar way:
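Continuing the same simplified sketch, and assuming the retrieved passages live on the prediction's `context` field:

```python
import re


def _normalize(text):
    # Same simplified normalization helper as in the sketch above.
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", text.lower()).split())


def _passage_match(passages, answers):
    # True if any expected answer appears, after normalization, inside any retrieved passage.
    return any(_normalize(ans) in _normalize(passage) for passage in passages for ans in answers)


def answer_passage_match(example, pred, trace=None):
    # The gold answer may be a single string or a list of acceptable strings.
    answers = [example.answer] if isinstance(example.answer, str) else example.answer
    return _passage_match(pred.context, answers)
```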
The `answer_passage_match` function is designed to evaluate whether the expected answer is present within the retrieved passages. It works by checking whether any of the expected answers appear in the context attached to the prediction, and it uses the helper function `_passage_match` to perform the actual matching.
These functions are essential for evaluating the accuracy of system responses, especially in tasks involving text passages.
In DSPy, evaluating the completeness and groundedness of system responses is crucial for understanding their quality. The built-in `CompleteAndGrounded` class provides a structured way to perform this evaluation. Here's how you can use the `CompleteAndGrounded` class:
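A usage sketch follows. Because this metric is itself an LM-based judge, a judge model must be configured first; the model name below and the field layout (a `question` and reference `response` on the example, a `response` plus retrieved `context` on the prediction) are assumptions you should adapt to your own program:

```python
import dspy
from dspy.evaluate import CompleteAndGrounded

# The metric calls an LM to judge the response, so configure one first.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

metric = CompleteAndGrounded(threshold=0.66)

example = dspy.Example(
    question="What is DSPy?",
    response="DSPy is a framework for programming language models with modules and optimizers.",
).with_inputs("question")

pred = dspy.Prediction(
    response="DSPy is a framework for programming, rather than prompting, language models.",
    context=["DSPy lets you build LM programs from modules and optimize their prompts and weights."],
)

score = metric(example, pred)  # F1 of completeness and groundedness, between 0 and 1
print(score)
```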
This class calculates a score based on the completeness and groundedness of the response, using an F1 score to combine these aspects.
Here's how the `CompleteAndGrounded` class is implemented:
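Below is a condensed sketch of the class, with docstrings and field descriptions abridged; the source in your installed DSPy version may differ in minor details:

```python
import dspy


class AnswerCompleteness(dspy.Signature):
    """
    Estimate the completeness of a system's response against the ground truth.
    First enumerate the key ideas in each, then discuss their overlap, then report completeness.
    """

    question: str = dspy.InputField()
    ground_truth: str = dspy.InputField()
    system_response: str = dspy.InputField()
    ground_truth_key_ideas: str = dspy.OutputField(desc="enumeration of key ideas in the ground truth")
    system_response_key_ideas: str = dspy.OutputField(desc="enumeration of key ideas in the system response")
    discussion: str = dspy.OutputField(desc="discussion of the overlap between the two")
    completeness: float = dspy.OutputField(desc="fraction (out of 1.0) of ground truth covered by the response")


class AnswerGroundedness(dspy.Signature):
    """
    Estimate how grounded a system's response is in the retrieved context.
    First enumerate the response's check-worthy claims, then discuss how far the context
    and basic commonsense support them, then report groundedness.
    """

    question: str = dspy.InputField()
    retrieved_context: str = dspy.InputField()
    system_response: str = dspy.InputField()
    system_response_claims: str = dspy.OutputField(desc="enumeration of check-worthy claims in the response")
    discussion: str = dspy.OutputField(desc="discussion of how well the claims are supported")
    groundedness: float = dspy.OutputField(desc="fraction (out of 1.0) of claims supported by the context")


def f1_score(precision, recall):
    # Clamp the LM-produced numbers to [0, 1] before combining them harmonically.
    precision, recall = max(0.0, min(1.0, precision)), max(0.0, min(1.0, recall))
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)


class CompleteAndGrounded(dspy.Module):
    def __init__(self, threshold=0.66):
        super().__init__()
        self.threshold = threshold
        self.completeness_module = dspy.ChainOfThought(AnswerCompleteness)
        self.groundedness_module = dspy.ChainOfThought(AnswerGroundedness)

    def forward(self, example, pred, trace=None):
        completeness = self.completeness_module(
            question=example.question, ground_truth=example.response, system_response=pred.response
        )
        groundedness = self.groundedness_module(
            question=example.question, retrieved_context=pred.context, system_response=pred.response
        )
        score = f1_score(groundedness.groundedness, completeness.completeness)
        # During optimization (trace is not None), return pass/fail against the threshold.
        return score if trace is None else score >= self.threshold
```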
It consists of two main components: `AnswerCompleteness` and `AnswerGroundedness`.
The `AnswerCompleteness` component estimates how well a system's response covers the ground truth: it enumerates the key ideas in both the ground truth and the system response, discusses their overlap, and reports a completeness score. Similarly, `AnswerGroundedness` assesses the extent to which a system's response is supported by the retrieved documents and commonsense reasoning.
Semantic evaluation involves assessing the quality of system responses based on their semantic content. The built-in `SemanticF1` class provides a way to perform this evaluation using recall, precision, and the F1 score. Here's how you can use it:
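A usage sketch, again assuming a configured judge LM and an example and prediction that each carry a `response` field (adapt the field names and model to your setup):

```python
import dspy
from dspy.evaluate import SemanticF1

# SemanticF1 is an LM-based judge, so configure a judge model first.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

metric = SemanticF1(decompositional=False)

example = dspy.Example(
    question="What does DSPy optimize?",
    response="DSPy optimizes the prompts and weights of language model programs.",
).with_inputs("question")

pred = dspy.Prediction(response="It tunes the prompts used by a language model program.")

score = metric(example, pred)  # harmonic mean of semantic precision and recall, in [0, 1]
print(score)
```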
Recall measures the fraction of ground truth covered by the system response, while precision measures the fraction of the system response covered by the ground truth. The F1 score combines these two metrics to provide a balanced evaluation.
Here's the implementation of the `SemanticF1` class:
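The following is a condensed sketch covering the standard (non-decompositional) path; field descriptions are abridged and the `f1_score` helper repeats the one shown in the `CompleteAndGrounded` sketch above:

```python
import dspy


class SemanticRecallPrecision(dspy.Signature):
    """Compare a system's response to the ground truth to compute recall and precision of key ideas."""

    question: str = dspy.InputField()
    ground_truth: str = dspy.InputField()
    system_response: str = dspy.InputField()
    recall: float = dspy.OutputField(desc="fraction (out of 1.0) of ground truth covered by the system response")
    precision: float = dspy.OutputField(desc="fraction (out of 1.0) of system response covered by the ground truth")


def f1_score(precision, recall):
    # Same helper as in the CompleteAndGrounded sketch above.
    precision, recall = max(0.0, min(1.0, precision)), max(0.0, min(1.0, recall))
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)


class SemanticF1(dspy.Module):
    def __init__(self, threshold=0.66):
        # The library version also accepts decompositional=True, which swaps in a signature
        # that first enumerates key ideas in both texts before judging their overlap.
        super().__init__()
        self.threshold = threshold
        self.module = dspy.ChainOfThought(SemanticRecallPrecision)

    def forward(self, example, pred, trace=None):
        scores = self.module(
            question=example.question, ground_truth=example.response, system_response=pred.response
        )
        score = f1_score(scores.precision, scores.recall)
        # During optimization (trace is not None), return pass/fail against the threshold.
        return score if trace is None else score >= self.threshold
```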
This class allows for both standard and decompositional semantic evaluation, providing flexibility in assessing the quality of responses. The `forward` method calculates the F1 score from precision and recall, offering a comprehensive evaluation metric.
To illustrate the creation of custom metrics, let's consider a practical example of evaluating a tweet. The goal is to assess whether a generated tweet answers a given question correctly, is engaging, and adheres to the character limit. Here's how you can implement such a metric:
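One way to write this, closely following the well-known example from the DSPy documentation; the tweet is assumed to live in `pred.output` (the output field of your tweet-generation program), and the 280-character limit and assessment questions are choices you can adapt:

```python
import dspy


class Assess(dspy.Signature):
    """Assess the quality of a tweet along the specified dimension."""

    assessed_text: str = dspy.InputField()
    assessment_question: str = dspy.InputField()
    assessment_answer: bool = dspy.OutputField()


def metric(gold, pred, trace=None):
    question, answer, tweet = gold.question, gold.answer, pred.output

    engaging_q = "Does the assessed text make for a self-contained, engaging tweet?"
    correct_q = f"The text should answer `{question}` with `{answer}`. Does the assessed text contain this answer?"

    # Use the LM itself as a judge for each dimension.
    correct = dspy.Predict(Assess)(assessed_text=tweet, assessment_question=correct_q)
    engaging = dspy.Predict(Assess)(assessed_text=tweet, assessment_question=engaging_q)
    correct, engaging = correct.assessment_answer, engaging.assessment_answer

    # One point per satisfied dimension, but only if the tweet is correct and within 280 characters.
    score = (correct + engaging) if correct and (len(tweet) <= 280) else 0

    # During optimization, return a strict pass/fail; during evaluation, return a fractional score.
    if trace is not None:
        return score >= 2
    return score / 2.0
```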
In this example, the `Assess` class defines the signature for the automatic assessments, and the `metric` function evaluates the tweet based on correctness, engagement, and length. The function returns a score that reflects the quality of the tweet, providing a practical application of custom metrics.
In this lesson, you learned how to create and use metrics in DSPy to evaluate the quality of system outputs. We covered basic metric functions, explored completeness and groundedness evaluation, and introduced semantic evaluation with F1 scores. Additionally, we walked through a practical example of evaluating a tweet using custom metrics. These skills are essential for assessing the performance of DSPy systems and will serve as a foundation for more advanced topics in the course. As you move on to the practice exercises, I encourage you to apply what you've learned and experiment with creating your own metrics in the CodeSignal IDE. This hands-on practice will reinforce your understanding and prepare you for the next steps in your DSPy journey.
