Compare chatbot responses across models with the same user messages, expected answers, and text-based checks.
What to evaluate
Use real support, onboarding, and FAQ messages as dataset rows so every model sees the same input.
Use expected outputs, keyword checks, and visible failure reasons to review whether each answer is specific enough.
Check whether responses include required escalation or missing-context language for high-risk cases.
Compare a candidate prompt against a known-good run before the change reaches users.
PromptLens works best when each model is judged against the same dataset rows and pass criteria.
Start with the conversations that are expensive to get wrong: billing disputes, account access, angry customers, and unclear technical issues.
Run each row across the models you are considering. The report should show the raw answer, score, pass/fail status, and failure reason in one place.
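As a rough illustration of that report shape, here is a minimal sketch in plain Python. It does not use PromptLens's actual API: `check_answer`, `run_row`, the required-phrase list, and the two stand-in models are all hypothetical names chosen for this example, and the check is a simple case-insensitive keyword match.

```python
from typing import Callable, Dict, List


def check_answer(answer: str, required_phrases: List[str]) -> dict:
    """Score an answer by the share of required phrases it contains (hypothetical check)."""
    text = answer.lower()
    missing = [p for p in required_phrases if p.lower() not in text]
    total = len(required_phrases)
    score = 1.0 if total == 0 else (total - len(missing)) / total
    return {
        "score": score,
        "passed": not missing,
        "failure_reason": "" if not missing else "missing: " + ", ".join(missing),
    }


def run_row(
    user_message: str,
    required_phrases: List[str],
    models: Dict[str, Callable[[str], str]],
) -> List[dict]:
    """Send the same message to every model and collect one report entry per model."""
    report = []
    for name, ask in models.items():
        answer = ask(user_message)
        report.append({"model": name, "answer": answer, **check_answer(answer, required_phrases)})
    return report


# Stand-in models: real ones would call a model provider instead of returning fixed text.
models = {
    "model_a": lambda msg: "Sorry about the double charge. Please send your invoice number and I will escalate this.",
    "model_b": lambda msg: "Charges can take a few days to settle.",
}

report = run_row("I was charged twice and nobody replied.", ["invoice", "escalate"], models)
for entry in report:
    print(entry["model"], entry["score"], entry["passed"], entry["failure_reason"])
```

Keeping the raw answer next to the score and failure reason means a reviewer can see why a row failed without re-running the model.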
If a candidate model or prompt clears the same threshold as your baseline, you have evidence to trial it for that workflow. If it drops below the baseline, block the change.
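The baseline comparison can be sketched as a small gate function. This is an assumption-laden example, not PromptLens behavior: `RunSummary`, `gate`, the 0.9 default threshold, and the extra "review" outcome for candidates that match the baseline but miss the threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    """Pass counts for one run over the shared dataset (hypothetical structure)."""
    name: str
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total if self.total else 0.0


def gate(baseline: RunSummary, candidate: RunSummary, threshold: float = 0.9) -> str:
    """Return 'block', 'trial', or 'review' for a candidate run."""
    # A candidate that scores below the known-good run is blocked outright.
    if candidate.pass_rate < baseline.pass_rate:
        return "block"
    # A candidate that holds the baseline and clears the threshold can be trialed.
    if candidate.pass_rate >= threshold:
        return "trial"
    # Assumed middle case: no regression, but the threshold is not met.
    return "review"


baseline = RunSummary("baseline", passed=18, total=20)
print(gate(baseline, RunSummary("candidate_ok", passed=19, total=20)))   # trial
print(gate(baseline, RunSummary("candidate_bad", passed=17, total=20)))  # block
```

Checking for regression first means a candidate that clears the threshold but still scores below the baseline is blocked, which is the more conservative reading of the rule.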
Example dataset row
{
  input: "I was charged twice and nobody replied.",
  expected_behavior: [
    "acknowledge frustration",
    "ask for account or invoice detail",
    "explain the next support step"
  ],
  fail_if: [
    "dismisses the complaint",
    "promises a refund without policy context",
    "does not offer escalation"
  ]
}
Compare the model outputs, score the failures, and share the decision record with the team.
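Before comparing models, it can help to confirm that every dataset row is complete, so that each model is judged against the same criteria. The sketch below assumes rows are loaded as Python dictionaries with the three fields from the example row; `validate_row` and `REQUIRED_FIELDS` are hypothetical helpers, not part of PromptLens.

```python
REQUIRED_FIELDS = ("input", "expected_behavior", "fail_if")


def validate_row(row: dict) -> list:
    """Return a list of problems found in one dataset row (empty list means valid)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in row:
            problems.append(f"missing field: {field}")

    # The user message must be real text, not an empty placeholder.
    if "input" in row and (not isinstance(row["input"], str) or not row["input"].strip()):
        problems.append("input must be a non-empty string")

    # Both criteria lists need at least one non-empty string entry.
    for field in ("expected_behavior", "fail_if"):
        if field not in row:
            continue
        value = row[field]
        is_valid = (
            isinstance(value, list)
            and len(value) > 0
            and all(isinstance(v, str) and v.strip() for v in value)
        )
        if not is_valid:
            problems.append(f"{field} must be a non-empty list of strings")
    return problems


row = {
    "input": "I was charged twice and nobody replied.",
    "expected_behavior": [
        "acknowledge frustration",
        "ask for account or invoice detail",
        "explain the next support step",
    ],
    "fail_if": [
        "dismisses the complaint",
        "promises a refund without policy context",
        "does not offer escalation",
    ],
}

print(validate_row(row))
```

A row that fails validation is better fixed or dropped before the run, since a missing `fail_if` list would let a weak answer pass by default.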