Welcome to the third lesson of the "Evaluation in DSPy" course. In this lesson, we will focus on creating metrics, a crucial aspect of evaluating the quality of system outputs in DSPy. Metrics allow us to quantify how well a system performs, providing a basis for improvement and optimization. Building on the data handling skills you acquired in the previous lesson, you will now learn how to define and implement metrics to assess output quality effectively. By the end of this lesson, you will be equipped with the knowledge to create and use metrics in DSPy, setting the stage for practical applications and further exploration.
To begin, let's explore some basic metric functions that are foundational in evaluating system responses. One such function is `validate_answer`, which checks if the predicted answer matches the expected answer. Here's how you can use it in DSPy:
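A minimal version of this metric can be written in a few lines of plain Python. Here, `types.SimpleNamespace` stands in for DSPy's `Example` and `Prediction` objects, and the `answer` field name follows the lesson's convention:

```python
from types import SimpleNamespace

def validate_answer(example, pred, trace=None):
    # Case-insensitive exact match between the gold and predicted answers.
    return example.answer.lower() == pred.answer.lower()

# Stand-ins for dspy.Example and dspy.Prediction, for a quick check:
example = SimpleNamespace(answer="Paris")
pred = SimpleNamespace(answer="paris")
print(validate_answer(example, pred))  # True
```

The `trace` parameter is part of DSPy's standard metric signature: it is `None` during plain evaluation and populated during optimization.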
This function compares the predicted answer with the example's answer, ignoring case differences. It returns `True` if they match and `False` otherwise. This basic validation is useful for tasks where exact matches are required.
Next, let's look at the built-in `answer_exact_match` and `answer_passage_match` functions. These provide more flexibility: the first allows partial matches, and the second checks whether the answer appears in a retrieved passage. Both take an example and a prediction, just like `validate_answer`.
Here is the underlying implementation of these functions:
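DSPy's actual implementation lives in `dspy.evaluate`; a simplified, self-contained sketch of the same logic looks roughly like this (the normalization details are approximations, not DSPy's exact code):

```python
import re
import string
from collections import Counter
from types import SimpleNamespace

def normalize_text(s):
    # Lowercase, drop articles and punctuation, collapse whitespace.
    s = re.sub(r"\b(a|an|the)\b", " ", s.lower())
    s = "".join(ch for ch in s if ch not in string.punctuation)
    return " ".join(s.split())

def token_f1(prediction, answer):
    # Token-level F1 between the prediction and one gold answer.
    pred_tokens = normalize_text(prediction).split()
    gold_tokens = normalize_text(answer).split()
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def _answer_match(prediction, answers, frac=1.0):
    # Exact (normalized) match when frac >= 1.0; otherwise accept any
    # gold answer whose token-level F1 meets the frac threshold.
    if frac >= 1.0:
        return any(normalize_text(prediction) == normalize_text(a) for a in answers)
    return any(token_f1(prediction, a) >= frac for a in answers)

def answer_exact_match(example, pred, trace=None, frac=1.0):
    # Gold answers may be a single string or a list of acceptable strings.
    answers = [example.answer] if isinstance(example.answer, str) else example.answer
    return _answer_match(pred.answer, answers, frac=frac)

example = SimpleNamespace(answer="The Eiffel Tower")
pred = SimpleNamespace(answer="eiffel tower")
# Articles and case are normalized away, so this counts as an exact match.
print(answer_exact_match(example, pred))  # True
```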
This function uses a helper, `_answer_match`, to determine whether the prediction matches any of the gold answers, allowing partial matches controlled by the `frac` parameter. The `answer_passage_match` function is implemented in a similar way:
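A simplified sketch of the passage-match idea: check whether any gold answer occurs inside any retrieved passage. The `context` field name on the prediction is an assumption for this illustration, and the substring check approximates DSPy's normalization:

```python
from types import SimpleNamespace

def _normalize(s):
    # Lowercase and collapse whitespace for a forgiving comparison.
    return " ".join(s.lower().split())

def answer_passage_match(example, pred, trace=None):
    # True if any gold answer appears, case-insensitively, in any passage.
    answers = [example.answer] if isinstance(example.answer, str) else example.answer
    passages = pred.context if isinstance(pred.context, list) else [pred.context]
    return any(_normalize(a) in _normalize(p) for a in answers for p in passages)

example = SimpleNamespace(answer="Paris")
pred = SimpleNamespace(context=["The Eiffel Tower is in Paris, France."])
print(answer_passage_match(example, pred))  # True
```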
In DSPy, evaluating the completeness and groundedness of system responses is crucial for understanding their quality. The built-in `CompleteAndGrounded` class provides a structured way to perform this evaluation: like other DSPy metrics, you instantiate it and call it with an example and a prediction.
This class calculates a score based on the completeness and groundedness of the response, using an F1 score to combine these aspects.
At a high level, the `CompleteAndGrounded` class asks a language model to judge completeness (does the response cover the key ideas of the reference answer?) and groundedness (are the response's claims supported by the retrieved context?), then combines the two judgments.
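The combination step can be sketched as a plain harmonic mean. The judge scores below are placeholders for what the language-model assessors would return; the function itself is the standard F1 formula:

```python
def f1_score(precision_like, recall_like):
    # Harmonic mean of two scores in [0, 1]; defined as 0 when both are 0.
    if precision_like + recall_like == 0:
        return 0.0
    return 2 * precision_like * recall_like / (precision_like + recall_like)

# Hypothetical judge outputs: how complete and how grounded the response is.
completeness = 0.8
groundedness = 0.6
score = f1_score(completeness, groundedness)
```

Because the harmonic mean is dragged down by its smaller input, a response that is complete but poorly grounded (or vice versa) still scores low, which is exactly the behavior you want from a combined metric.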
Semantic evaluation assesses the quality of system responses based on their meaning rather than their exact wording. The built-in `SemanticF1` class performs this evaluation using recall, precision, and their F1 combination, and is used like any other DSPy metric: instantiate it and call it with an example and a prediction.
Recall measures the fraction of ground truth covered by the system response, while precision measures the fraction of the system response covered by the ground truth. The F1 score combines these two metrics to provide a balanced evaluation.
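These three quantities can be illustrated with simple token overlap. Note that `SemanticF1` itself uses a language model to judge overlap of *ideas*, not tokens; the sketch below is only an analogy for how recall, precision, and F1 relate:

```python
def recall_precision_f1(gold, pred):
    # Token-overlap analogue of SemanticF1's idea-level scores.
    gold_tokens, pred_tokens = set(gold.lower().split()), set(pred.lower().split())
    overlap = gold_tokens & pred_tokens
    recall = len(overlap) / len(gold_tokens)     # gold covered by the response
    precision = len(overlap) / len(pred_tokens)  # response covered by the gold
    f1 = 0.0 if not overlap else 2 * precision * recall / (precision + recall)
    return recall, precision, f1

r, p, f1 = recall_precision_f1(
    "paris is the capital of france",
    "paris is in france",
)
print(r, p, f1)  # 0.5 0.75 0.6
```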
Under the hood, `SemanticF1` uses a language model to estimate how much of the ground truth the response covers (recall) and how much of the response is supported by the ground truth (precision), then returns their harmonic mean as the final score.
To illustrate the creation of custom metrics, let's consider a practical example of evaluating a tweet. The goal is to assess whether a generated tweet answers a given question correctly, is engaging, and adheres to the character limit. Here's how you can implement such a metric:
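Below is a minimal, self-contained sketch of such a metric. The LM judge is abstracted into a plain callable so the logic is runnable as-is; in DSPy the judge would typically be a `dspy.Predict` module over an `Assess`-style signature, and all names here are illustrative rather than DSPy's exact code:

```python
# In DSPy, the judge would be built from a signature roughly like:
#
# class Assess(dspy.Signature):
#     """Assess the quality of a tweet along the specified dimension."""
#     assessed_text = dspy.InputField()
#     assessment_question = dspy.InputField()
#     assessment_answer = dspy.OutputField(desc="Yes or No")

def tweet_metric(gold_question, gold_answer, tweet, judge, trace=None):
    """Score a tweet on correctness, engagement, and length.

    `judge(question, text)` stands in for an LM assessor and returns a bool.
    """
    correct = judge(
        f"The text should answer `{gold_question}` with `{gold_answer}`. "
        "Does the assessed text contain this answer?",
        tweet,
    )
    engaging = judge(
        "Does the assessed text make for a self-contained, engaging tweet?",
        tweet,
    )
    within_limit = len(tweet) <= 280  # hard character-limit constraint

    if trace is not None:
        # During optimization, demand a strict pass on every check.
        return correct and engaging and within_limit
    # During evaluation, return a graded score in [0, 1]; a tweet that is
    # wrong or too long scores 0 regardless of engagement.
    return (int(correct) + int(engaging)) / 2.0 if correct and within_limit else 0.0

# A stub judge that always answers "yes", for demonstration:
always_yes = lambda question, text: True
score = tweet_metric("Who wrote Hamlet?", "Shakespeare",
                     "Shakespeare wrote Hamlet!", always_yes)
print(score)  # 1.0
```

Plugging in a real LM judge changes only the `judge` argument; the scoring logic stays the same, which is what makes metrics like this easy to test.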
In this example, the `Assess` class defines the signature for automatic assessments, and the `metric` function evaluates the tweet for correctness, engagement, and length. It returns a score reflecting the overall quality of the tweet, providing a practical application of custom metrics.
In this lesson, you learned how to create and use metrics in DSPy to evaluate the quality of system outputs. We covered basic metric functions, explored completeness and groundedness evaluation, and introduced semantic evaluation with F1 scores. Additionally, we walked through a practical example of evaluating a tweet using custom metrics. These skills are essential for assessing the performance of DSPy systems and will serve as a foundation for more advanced topics in the course. As you move on to the practice exercises, I encourage you to apply what you've learned and experiment with creating your own metrics in the CodeSignal IDE. This hands-on practice will reinforce your understanding and prepare you for the next steps in your DSPy journey.
