Welcome to the third lesson of the "Evaluation in DSPy" course. In this lesson, we will focus on creating metrics, a crucial aspect of evaluating the quality of system outputs in DSPy. Metrics allow us to quantify how well a system performs, providing a basis for improvement and optimization. Building on the data handling skills you acquired in the previous lesson, you will now learn how to define and implement metrics to assess output quality effectively. By the end of this lesson, you will be equipped with the knowledge to create and use metrics in DSPy, setting the stage for practical applications and further exploration.
To begin, let's explore some basic metric functions that are foundational in evaluating system responses. One such function is `validate_answer`, which checks whether the predicted answer matches the expected answer. Here's how you can use it in DSPy:
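A minimal sketch of this metric, following DSPy's standard metric interface of an example, a prediction, and an optional trace argument:

```python
def validate_answer(example, pred, trace=None):
    # Compare the predicted answer to the gold answer, ignoring case differences.
    return example.answer.lower() == pred.answer.lower()
```

A function with this shape can be passed directly to `dspy.Evaluate` or to an optimizer as the `metric` argument.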
This function compares the predicted answer with the example's answer, ignoring case differences. It returns `True` if they match and `False` otherwise. This basic validation is useful for tasks where exact matches are required.
Next, let's look at the built-in `answer_exact_match` and `answer_passage_match` functions. These provide more flexibility by allowing partial matches and by checking whether the answer appears in a retrieved passage. Here's how you can use them:
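A sketch of how these can be called, assuming the prediction carries both an `answer` field and a `context` field with retrieved passages (the question and answer below are purely illustrative):

```python
import dspy
from dspy.evaluate import answer_exact_match, answer_passage_match

# A gold example; mark "question" as the input field.
example = dspy.Example(question="What is the capital of France?", answer="Paris").with_inputs("question")

# A prediction as a retrieval-augmented program might return it.
pred = dspy.Prediction(
    answer="Paris",
    context=["Paris is the capital and most populous city of France."],
)

print(answer_exact_match(example, pred))            # True: exact (case-insensitive) match
print(answer_exact_match(example, pred, frac=0.8))  # True if token-level F1 is at least 0.8
print(answer_passage_match(example, pred))          # True: "Paris" appears in the retrieved context
```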
Here is the underlying implementation of these functions, starting with `answer_exact_match`:
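The following is a simplified, self-contained sketch that mirrors the structure described here; the library version delegates to its own exact-match and F1 helpers rather than the `_normalize` and `_f1` functions defined below:

```python
import re
from collections import Counter


def _normalize(text):
    # Lowercase, strip punctuation, and collapse whitespace (simplified SQuAD-style normalization).
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", text.lower()).split())


def _f1(prediction, answer):
    # Token-level F1 between the prediction and a single gold answer.
    pred_tokens, ans_tokens = _normalize(prediction).split(), _normalize(answer).split()
    overlap = sum((Counter(pred_tokens) & Counter(ans_tokens)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_tokens), overlap / len(ans_tokens)
    return 2 * precision * recall / (precision + recall)


def _answer_match(prediction, answers, frac=1.0):
    # Exact match when frac >= 1.0; otherwise accept any answer whose token-level F1 clears the threshold.
    if frac >= 1.0:
        return any(_normalize(prediction) == _normalize(ans) for ans in answers)
    return any(_f1(prediction, ans) >= frac for ans in answers)


def answer_exact_match(example, pred, trace=None, frac=1.0):
    # The gold answer may be a single string or a list of acceptable strings.
    answers = [example.answer] if isinstance(example.answer, str) else example.answer
    return _answer_match(pred.answer, answers, frac=frac)
```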
This function uses a helper, `_answer_match`, to determine whether the prediction matches any of the expected answers, allowing for partial matches based on the `frac` parameter. The `answer_passage_match` function is implemented in a similar way:
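Continuing the same simplified sketch, and assuming the retrieved passages live on the prediction's `context` field:

```python
import re


def _normalize(text):
    # Same simplified normalization helper as in the sketch above.
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", text.lower()).split())


def _passage_match(passages, answers):
    # True if any expected answer appears, after normalization, inside any retrieved passage.
    return any(_normalize(ans) in _normalize(passage) for passage in passages for ans in answers)


def answer_passage_match(example, pred, trace=None):
    # The gold answer may be a single string or a list of acceptable strings.
    answers = [example.answer] if isinstance(example.answer, str) else example.answer
    return _passage_match(pred.context, answers)
```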
The `answer_passage_match` function is designed to evaluate whether the expected answer is present within the retrieved passages. It works by checking whether any of the expected answers appear in the context attached to the prediction, and it uses the helper function `_passage_match` to perform the actual matching.
These functions are essential for evaluating the accuracy of system responses, especially in tasks involving text passages.
In DSPy, evaluating the completeness and groundedness of system responses is crucial for understanding their quality. The built-in `CompleteAndGrounded` class provides a structured way to perform this evaluation. Here's how you can use the `CompleteAndGrounded` class:
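A usage sketch follows. Because this metric is itself an LM-based judge, a judge model must be configured first; the model name below and the field layout (a `question` and reference `response` on the example, a `response` plus retrieved `context` on the prediction) are assumptions you should adapt to your own program:

```python
import dspy
from dspy.evaluate import CompleteAndGrounded

# The metric calls an LM to judge the response, so configure one first.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

metric = CompleteAndGrounded(threshold=0.66)

example = dspy.Example(
    question="What is DSPy?",
    response="DSPy is a framework for programming language models with modules and optimizers.",
).with_inputs("question")

pred = dspy.Prediction(
    response="DSPy is a framework for programming, rather than prompting, language models.",
    context=["DSPy lets you build LM programs from modules and optimize their prompts and weights."],
)

score = metric(example, pred)  # F1 of completeness and groundedness, between 0 and 1
print(score)
```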
This class calculates a score based on the completeness and groundedness of the response, using an F1 score to combine these aspects.
Here's how the `CompleteAndGrounded` class is implemented:
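Below is a condensed sketch of the class, with docstrings and field descriptions abridged; the source in your installed DSPy version may differ in minor details:

```python
import dspy


class AnswerCompleteness(dspy.Signature):
    """
    Estimate the completeness of a system's response against the ground truth.
    First enumerate the key ideas in each, then discuss their overlap, then report completeness.
    """

    question: str = dspy.InputField()
    ground_truth: str = dspy.InputField()
    system_response: str = dspy.InputField()
    ground_truth_key_ideas: str = dspy.OutputField(desc="enumeration of key ideas in the ground truth")
    system_response_key_ideas: str = dspy.OutputField(desc="enumeration of key ideas in the system response")
    discussion: str = dspy.OutputField(desc="discussion of the overlap between the two")
    completeness: float = dspy.OutputField(desc="fraction (out of 1.0) of ground truth covered by the response")


class AnswerGroundedness(dspy.Signature):
    """
    Estimate how grounded a system's response is in the retrieved context.
    First enumerate the response's check-worthy claims, then discuss how far the context
    and basic commonsense support them, then report groundedness.
    """

    question: str = dspy.InputField()
    retrieved_context: str = dspy.InputField()
    system_response: str = dspy.InputField()
    system_response_claims: str = dspy.OutputField(desc="enumeration of check-worthy claims in the response")
    discussion: str = dspy.OutputField(desc="discussion of how well the claims are supported")
    groundedness: float = dspy.OutputField(desc="fraction (out of 1.0) of claims supported by the context")


def f1_score(precision, recall):
    # Clamp the LM-produced numbers to [0, 1] before combining them harmonically.
    precision, recall = max(0.0, min(1.0, precision)), max(0.0, min(1.0, recall))
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)


class CompleteAndGrounded(dspy.Module):
    def __init__(self, threshold=0.66):
        super().__init__()
        self.threshold = threshold
        self.completeness_module = dspy.ChainOfThought(AnswerCompleteness)
        self.groundedness_module = dspy.ChainOfThought(AnswerGroundedness)

    def forward(self, example, pred, trace=None):
        completeness = self.completeness_module(
            question=example.question, ground_truth=example.response, system_response=pred.response
        )
        groundedness = self.groundedness_module(
            question=example.question, retrieved_context=pred.context, system_response=pred.response
        )
        score = f1_score(groundedness.groundedness, completeness.completeness)
        # During optimization (trace is not None), return pass/fail against the threshold.
        return score if trace is None else score >= self.threshold
```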
It consists of two main components: `AnswerCompleteness` and `AnswerGroundedness`.
The `AnswerCompleteness` component estimates how well a system's response covers the ground truth: it enumerates the key ideas in both the ground truth and the system response, discusses their overlap, and reports a completeness score. Similarly, `AnswerGroundedness` assesses the extent to which a system's response is supported by the retrieved documents and commonsense reasoning.
Semantic evaluation involves assessing the quality of system responses based on their semantic content. The built-in `SemanticF1` class provides a way to perform this evaluation using recall, precision, and the F1 score. Here's how you can use it:
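A usage sketch, again assuming a configured judge LM and an example and prediction that each carry a `response` field (adapt the field names and model to your setup):

```python
import dspy
from dspy.evaluate import SemanticF1

# SemanticF1 is an LM-based judge, so configure a judge model first.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

metric = SemanticF1(decompositional=False)

example = dspy.Example(
    question="What does DSPy optimize?",
    response="DSPy optimizes the prompts and weights of language model programs.",
).with_inputs("question")

pred = dspy.Prediction(response="It tunes the prompts used by a language model program.")

score = metric(example, pred)  # harmonic mean of semantic precision and recall, in [0, 1]
print(score)
```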
Recall measures the fraction of ground truth covered by the system response, while precision measures the fraction of the system response covered by the ground truth. The F1 score combines these two metrics to provide a balanced evaluation.
Here's the implementation of the `SemanticF1` class:
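The following is a condensed sketch covering the standard (non-decompositional) path; field descriptions are abridged and the `f1_score` helper repeats the one shown in the `CompleteAndGrounded` sketch above:

```python
import dspy


class SemanticRecallPrecision(dspy.Signature):
    """Compare a system's response to the ground truth to compute recall and precision of key ideas."""

    question: str = dspy.InputField()
    ground_truth: str = dspy.InputField()
    system_response: str = dspy.InputField()
    recall: float = dspy.OutputField(desc="fraction (out of 1.0) of ground truth covered by the system response")
    precision: float = dspy.OutputField(desc="fraction (out of 1.0) of system response covered by the ground truth")


def f1_score(precision, recall):
    # Same helper as in the CompleteAndGrounded sketch above.
    precision, recall = max(0.0, min(1.0, precision)), max(0.0, min(1.0, recall))
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)


class SemanticF1(dspy.Module):
    def __init__(self, threshold=0.66):
        # The library version also accepts decompositional=True, which swaps in a signature
        # that first enumerates key ideas in both texts before judging their overlap.
        super().__init__()
        self.threshold = threshold
        self.module = dspy.ChainOfThought(SemanticRecallPrecision)

    def forward(self, example, pred, trace=None):
        scores = self.module(
            question=example.question, ground_truth=example.response, system_response=pred.response
        )
        score = f1_score(scores.precision, scores.recall)
        # During optimization (trace is not None), return pass/fail against the threshold.
        return score if trace is None else score >= self.threshold
```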
This class allows for both standard and decompositional semantic evaluation, providing flexibility in assessing the quality of responses. The `forward` method calculates the F1 score from precision and recall, offering a comprehensive evaluation metric.
To illustrate the creation of custom metrics, let's consider a practical example of evaluating a tweet. The goal is to assess whether a generated tweet answers a given question correctly, is engaging, and adheres to the character limit. Here's how you can implement such a metric:
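One way to write this, closely following the well-known example from the DSPy documentation; the tweet is assumed to live in `pred.output` (the output field of your tweet-generation program), and the 280-character limit and assessment questions are choices you can adapt:

```python
import dspy


class Assess(dspy.Signature):
    """Assess the quality of a tweet along the specified dimension."""

    assessed_text: str = dspy.InputField()
    assessment_question: str = dspy.InputField()
    assessment_answer: bool = dspy.OutputField()


def metric(gold, pred, trace=None):
    question, answer, tweet = gold.question, gold.answer, pred.output

    engaging_q = "Does the assessed text make for a self-contained, engaging tweet?"
    correct_q = f"The text should answer `{question}` with `{answer}`. Does the assessed text contain this answer?"

    # Use the LM itself as a judge for each dimension.
    correct = dspy.Predict(Assess)(assessed_text=tweet, assessment_question=correct_q)
    engaging = dspy.Predict(Assess)(assessed_text=tweet, assessment_question=engaging_q)
    correct, engaging = correct.assessment_answer, engaging.assessment_answer

    # One point per satisfied dimension, but only if the tweet is correct and within 280 characters.
    score = (correct + engaging) if correct and (len(tweet) <= 280) else 0

    # During optimization, return a strict pass/fail; during evaluation, return a fractional score.
    if trace is not None:
        return score >= 2
    return score / 2.0
```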
In this example, the `Assess` class defines the signature for the automatic assessments, and the `metric` function evaluates the tweet based on correctness, engagement, and length. The function returns a score that reflects the quality of the tweet, providing a practical application of custom metrics.
In this lesson, you learned how to create and use metrics in DSPy to evaluate the quality of system outputs. We covered basic metric functions, explored completeness and groundedness evaluation, and introduced semantic evaluation with F1 scores. Additionally, we walked through a practical example of evaluating a tweet using custom metrics. These skills are essential for assessing the performance of DSPy systems and will serve as a foundation for more advanced topics in the course. As you move on to the practice exercises, I encourage you to apply what you've learned and experiment with creating your own metrics in the CodeSignal IDE. This hands-on practice will reinforce your understanding and prepare you for the next steps in your DSPy journey.
