Use case

RAG Answer Evaluation

Compare generated answers across models by supplying the retrieved context and expected facts inside each test case.

What to evaluate

Supplied context rows

Put the question, retrieved context, and expected facts into the dataset row so every model sees the same evidence.

Coverage checks

Flag answers that miss required facts or ignore important parts of the supplied context.
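A coverage check can be as simple as a substring test. The sketch below is illustrative and not PromptLens's implementation; `missing_facts` is a hypothetical helper that assumes required facts are given as plain strings and matched case-insensitively.

```python
def missing_facts(answer: str, must_include: list[str]) -> list[str]:
    """Return the required facts that do not appear in the answer.

    Matching is a case-insensitive substring test, which is an assumption
    made for this sketch; paraphrased facts would not be detected.
    """
    lowered = answer.lower()
    return [fact for fact in must_include if fact.lower() not in lowered]


answer = "You can request a refund within 14 days."
print(missing_facts(answer, ["14 days", "annual plans"]))  # ['annual plans']
```

Substring matching is strict by design: it surfaces an answer that states the right number but drops the plan type, at the cost of missing paraphrases.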

Source-label checks

Use expected text, keywords, or regex checks when your prompt asks the model to include source labels.
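As a rough illustration of a regex-based source-label check, the sketch below assumes your prompt asks for labels in the form `[source: name]`. That label format and the helper names are hypothetical; adapt the pattern to whatever your prompt actually requests.

```python
import re

# Hypothetical label format "[source: <name>]"; an assumption for this sketch.
SOURCE_LABEL = re.compile(r"\[source:\s*([^\]]+)\]", re.IGNORECASE)


def source_labels(answer: str) -> set[str]:
    """Collect every source label found in the answer, normalized to lowercase."""
    return {match.strip().lower() for match in SOURCE_LABEL.findall(answer)}


def missing_labels(answer: str, expected: list[str]) -> list[str]:
    """Return the expected labels that the answer never cites, sorted."""
    found = source_labels(answer)
    return sorted(label for label in expected if label.lower() not in found)


answer = "Refunds take 14 days [source: refund-policy]."
print(missing_labels(answer, ["refund-policy", "billing-faq"]))  # ['billing-faq']
```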

Model comparison

Use side-by-side outputs to decide which model follows your RAG prompt most reliably.
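One way to summarize a side-by-side comparison is a pass rate per model over the same dataset rows. The sketch below is a minimal example under that assumption; the model names, row IDs, and results are made up for illustration and do not come from a real run.

```python
from collections import defaultdict


def pass_rates(results: list[tuple[str, str, bool]]) -> dict[str, float]:
    """Compute the fraction of rows each model passed.

    `results` holds (model_name, row_id, passed) tuples, one per model per row.
    """
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for model, _row_id, passed in results:
        totals[model] += 1
        passes[model] += int(passed)
    return {model: passes[model] / totals[model] for model in totals}


# Illustrative data only: two hypothetical models judged on the same two rows.
results = [
    ("model-a", "row-1", True),
    ("model-a", "row-2", False),
    ("model-b", "row-1", True),
    ("model-b", "row-2", True),
]
print(pass_rates(results))  # {'model-a': 0.5, 'model-b': 1.0}
```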

Checks

Build the comparison around observable failures.

PromptLens works best when each model is judged against the same dataset rows and pass criteria.

Includes supplied facts
Covers required facts
Avoids claims outside expected facts
Includes expected source labels
States uncertainty when context is missing

Evaluation example

A practical RAG comparison

Treat each dataset row as one retrieval scenario: question, context snippet, expected facts, and known failure modes.

The comparison report makes missed expected facts visible without asking reviewers to read every answer from scratch.

When model quality is close, use the failure reasons to decide whether a candidate model is good enough for your RAG prompt workflow.

Example dataset row

{
  "question": "What is the refund window for annual plans?",
  "context": "Annual plans can be refunded within 14 days...",
  "must_include": ["14 days", "annual plans"],
  "fail_if": [
    "mentions a different refund window",
    "omits plan type",
    "adds unsupported policy"
  ]
}
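To show how a row like this might translate into failure reasons, here is a minimal sketch. It is not how PromptLens scores answers: `failure_reasons` is a hypothetical helper, the `fail_if` entries are natural-language descriptions that generally need a human or model judge, and only one of them ("mentions a different refund window") is approximated here with a regex. The `context` field is left out because the source shows it truncated.

```python
import re

# The row from the example above, reduced to the fields this sketch uses.
row = {
    "question": "What is the refund window for annual plans?",
    "must_include": ["14 days", "annual plans"],
}


def failure_reasons(answer: str, row: dict) -> list[str]:
    """Return human-readable reasons why an answer fails the row's checks."""
    reasons = []
    lowered = answer.lower()

    # Required facts: case-insensitive substring match (an assumption of this sketch).
    for fact in row["must_include"]:
        if fact.lower() not in lowered:
            reasons.append(f"missing required fact: {fact}")

    # Rough proxy for "mentions a different refund window":
    # any "<N> days" figure other than the expected 14.
    windows = set(re.findall(r"(\d+)\s*days?", lowered))
    if windows - {"14"}:
        reasons.append("mentions a different refund window")

    return reasons


print(failure_reasons("Annual plans can be refunded within 30 days.", row))
# ['missing required fact: 14 days', 'mentions a different refund window']
```

An answer that passes returns an empty list, so reviewers only need to read the rows where a reason was recorded.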

Turn this workflow into a report.

Compare the model outputs, score the failures, and share the decision record with the team.