Welcome back! In the last lesson, you learned how to measure the consistency of large language models (LLMs) by checking if they give the same answer to the same prompt every time. This is a key part of benchmarking, but there is another important aspect to consider: factual accuracy. Sometimes, LLMs generate answers that sound convincing but are actually incorrect or even completely made up. These are called hallucinations. Detecting hallucinations is crucial, especially if you want to use LLMs in settings where accuracy matters, such as education, research, or customer support.
In this lesson, you will learn how to use one LLM to check the factual accuracy of another LLM’s answers. This approach is becoming more common because it allows you to automate the process of fact-checking at scale. By the end of this lesson, you will know how to set up a simple pipeline in which one model generates answers and another model acts as a judge, telling you if those answers are correct or hallucinated.
Let’s start by creating a list of prompts that we want to fact-check. For this example, we will use two prompts: one that is straightforward and factual, and another that is intentionally impossible. Here is the code to generate answers using GPT-3.5-Turbo:
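The snippet below is a minimal sketch using the `openai` Python client (v1.x); the exact wording of the two prompt strings is an assumption chosen to match the description that follows:

```python
from openai import OpenAI

# The client reads the OPENAI_API_KEY environment variable by default.
client = OpenAI()

# Two prompts: one straightforward factual question and one intentionally
# impossible question (the exact wording here is an assumed example).
prompts = [
    "What is the capital of France?",
    "In which year did Atlantis become a member of the United Nations?",
]

answers = []
for prompt in prompts:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # pick the most likely tokens for more consistent answers
    )
    answer = response.choices[0].message.content
    answers.append(answer)
    print(f"Prompt: {prompt}\nAnswer: {answer}\n")
```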
In this code, you first define a list of prompts. The first prompt asks a factual question, while the second prompt is a trick question about Atlantis, which is a mythical place and not a real UN member. You then loop through each prompt, send it to GPT-3.5-Turbo, and collect the answers. The `temperature=0` setting makes the model pick its most likely answer, which helps with consistency.
After running this code, you might get output like the following:
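With the assumed prompts above, the printed output might resemble this (the exact wording will vary from run to run):

```text
Prompt: What is the capital of France?
Answer: The capital of France is Paris.

Prompt: In which year did Atlantis become a member of the United Nations?
Answer: Atlantis is a legendary island from Greek mythology, not a real country, so it has never been a member of the United Nations.
```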
As you can see, the model gives a factual answer to the first prompt and correctly identifies the second as a myth. However, sometimes models can hallucinate and provide made-up details, especially for less obvious questions.
Now that you have answers from GPT-3.5-Turbo, you can use GPT-4 to judge whether those answers are factually correct. To do this, you will create a special prompt for GPT-4 that asks it to act as a fact-checker. Here is how you can do it:
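Here is a minimal sketch of the judging step; it reuses `client`, `prompts`, and `answers` from the earlier snippet, and the exact wording of the fact-checking instructions is an assumption:

```python
# Reuses `client`, `prompts`, and `answers` from the previous snippet.
for prompt, answer in zip(prompts, answers):
    # The exact fact-checking instructions below are an assumed wording.
    judge_prompt = (
        "You are a careful fact-checker. Review the question and the answer below.\n"
        "Respond with 'Correct' if the answer is factually accurate, or "
        "'Hallucination' if it is not, followed by a brief explanation.\n\n"
        f"Question: {prompt}\n"
        f"Answer: {answer}"
    )
    verdict = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,  # keep the judge's verdicts consistent
    )
    print(f"Prompt: {prompt}")
    print(f"Answer: {answer}")
    print(f"Verdict: {verdict.choices[0].message.content}\n")
```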
In this code, you loop through each prompt and answer pair. For each one, you build a new prompt that asks GPT-4 to act as a fact-checker. The prompt is clear and direct: it asks GPT-4 to respond with "Correct" if the answer is accurate, or "Hallucination" if it is not, and to provide a brief explanation. You then send this prompt to GPT-4 and print out the results.
A sample output might look like this:
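Continuing with the assumed prompts, the judge's output might resemble the following (again, the exact wording will vary):

```text
Prompt: What is the capital of France?
Answer: The capital of France is Paris.
Verdict: Correct. Paris is indeed the capital of France.

Prompt: In which year did Atlantis become a member of the United Nations?
Answer: Atlantis is a legendary island from Greek mythology, not a real country, so it has never been a member of the United Nations.
Verdict: Correct. The answer rightly identifies Atlantis as mythical and not a UN member.
```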
If GPT-3.5-Turbo had hallucinated and given a made-up answer, GPT-4 would respond with "Hallucination" and explain why.
In this lesson, you learned how to use one LLM to check the factual accuracy of another LLM’s answers. You started by generating answers to a set of prompts using GPT-3.5-Turbo, then used GPT-4 as a fact-checker to judge whether those answers were correct or hallucinated. This approach is a powerful way to automate the detection of hallucinations in LLM outputs, which is important for building reliable applications.
Now that you have seen how to set up and run this fact-checking pipeline, you will get a chance to practice these steps yourself. In the next exercises, you will try out different prompts, see how the models respond, and use GPT-4 to judge the results. This hands-on practice will help you build confidence in evaluating and improving the factual accuracy of LLM outputs.
