Compare generated answers across models by providing the retrieved context and expected facts inside each test case.
What to evaluate
Put the question, retrieved context, and expected facts into the dataset row so every model sees the same evidence.
Flag answers that miss required facts or ignore important parts of the supplied context.
Use expected text, keywords, or regex checks when your prompt asks the model to include source labels.
Use side-by-side outputs to decide which model follows your RAG prompt most reliably.
PromptLens works best when each model is judged against the same dataset rows and pass criteria.
Treat each dataset row as one retrieval scenario: question, context snippet, expected facts, and known failure modes.
The comparison report makes missed expected facts visible without asking reviewers to read every answer from scratch.
When model quality is close, use the failure reasons to decide whether a candidate model is good enough for your RAG prompt workflow.
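The expected-fact and source-label checks described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how PromptLens implements its checks: the function name `check_answer`, the `require_source_label` flag, and the `[source: ...]` label format are hypothetical, and `must_include` simply mirrors the field name used in the example dataset row.

```python
import re

# Hypothetical source-label format, e.g. "[source: refund-policy]".
# Adjust the pattern to whatever label format your prompt asks for.
SOURCE_LABEL = re.compile(r"\[source:\s*[^\]]+\]", re.IGNORECASE)


def check_answer(answer, must_include, require_source_label=False):
    """Return a list of failure reasons; an empty list means the answer passes."""
    failures = []
    lowered = answer.lower()

    # Keyword check: every expected fact must appear (case-insensitive).
    for fact in must_include:
        if fact.lower() not in lowered:
            failures.append(f"missing expected fact: {fact}")

    # Regex check: only applied when the prompt asks for source labels.
    if require_source_label and not SOURCE_LABEL.search(answer):
        failures.append("missing source label")

    return failures
```

Returning the reasons rather than a bare pass/fail keeps the output useful when two models score similarly, since the reasons show what each one missed.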
Example dataset row
{
  question: "What is the refund window for annual plans?",
  context: "Annual plans can be refunded within 14 days...",
  must_include: ["14 days", "annual plans"],
  fail_if: [
    "mentions a different refund window",
    "omits plan type",
    "adds unsupported policy"
  ]
}
Compare the model outputs, score the failures, and share the decision record with the team.
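To make the compare-and-score step concrete, here is a rough Python sketch that runs every model against the same rows and tallies missed facts. It is an illustration only: `missing_facts`, `compare_models`, the model names, and the sample answers are all hypothetical, and the row is the example above written as a Python dict. The `fail_if` entries are natural-language criteria, so this sketch leaves them to a reviewer or judge step and covers only `must_include`.

```python
from collections import Counter


def missing_facts(answer, must_include):
    """Return the expected facts that do not appear in the answer."""
    lowered = answer.lower()
    return [fact for fact in must_include if fact.lower() not in lowered]


def compare_models(rows, outputs):
    """outputs maps a model name to its answers, one per row, in row order."""
    report = {}
    for model, answers in outputs.items():
        missed = Counter()
        passed = 0
        for row, answer in zip(rows, answers):
            gaps = missing_facts(answer, row["must_include"])
            if gaps:
                missed.update(gaps)  # count how often each fact is missed
            else:
                passed += 1
        pass_rate = passed / len(rows) if rows else 0.0
        report[model] = {"pass_rate": pass_rate, "missed": dict(missed)}
    return report


# Illustrative data: the row mirrors the example above; the answers are made up.
rows = [{
    "question": "What is the refund window for annual plans?",
    "context": "Annual plans can be refunded within 14 days...",
    "must_include": ["14 days", "annual plans"],
}]
outputs = {
    "model_a": ["Annual plans can be refunded within 14 days."],
    "model_b": ["You can get a refund within 30 days."],
}
report = compare_models(rows, outputs)
print(report)
```

Because every model is scored on the same rows with the same criteria, the per-model `missed` counts can go straight into the decision record.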