Compare support prompt outputs across models and catch weak text patterns before customers see them.
What to evaluate
Check whether the response includes the expected next step for the customer's issue.
Use expected phrasing, keyword checks, or reviewer-visible failures to catch vague or dismissive responses.
Check whether the model includes escalation language for cases you mark as high risk.
Keep policy-sensitive answers visible to reviewers before they ship into the support flow.
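The checks above can be approximated with plain keyword matching. The following is a minimal sketch, not PromptLens's own implementation: the function name `check_response`, the hint lists, and the failure labels are all assumptions chosen for illustration, and real keyword lists would need tuning against your own tickets.

```python
import re

# Hypothetical keyword lists (assumptions); tune them to your own support policies.
NEXT_STEP_HINTS = ("please try", "can you send", "next step", "here is how")
ESCALATION_HINTS = ("escalate", "specialist", "senior support")
# A response that is nothing but a stock apology counts as dismissive.
DISMISSIVE_PATTERNS = (
    r"^\s*(we apologi[sz]e|sorry) for (the|any) inconvenience\.?\s*$",
)


def check_response(text, high_risk=False):
    """Return a list of failure reasons; an empty list means the response passed."""
    lowered = text.lower()
    failures = []
    # Expected next step: at least one action-oriented phrase must appear.
    if not any(hint in lowered for hint in NEXT_STEP_HINTS):
        failures.append("no next step")
    # Escalation language is only required for cases marked high risk.
    if high_risk and not any(hint in lowered for hint in ESCALATION_HINTS):
        failures.append("missing escalation language")
    # Vague or dismissive: the whole response matches a stock apology.
    if any(re.match(pattern, lowered) for pattern in DISMISSIVE_PATTERNS):
        failures.append("generic apology only")
    return failures
```

Returning the reasons as a list, rather than a single pass/fail flag, keeps each failure visible to a reviewer. Keyword matching will miss paraphrases, so treat a pass as a signal worth spot-checking rather than a guarantee.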
PromptLens works best when each model is judged against the same dataset rows and pass criteria.
Build a small dataset from real tickets that cover billing, account access, technical errors, and angry customers.
Compare the outputs side by side so the team can see exactly which model resolved the issue and which one failed the configured checks.
Use the failure reasons to decide whether to ship the prompt, revise it, or test another candidate model against the same case set.
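One way to picture the comparison is a report keyed by model and dataset row, where every model is held to the same pass criteria. This is a hedged sketch under stated assumptions: the row fields `id` and `required_phrases`, the model names, and the functions `compare_models` and `decide` are hypothetical and do not describe PromptLens's actual data model.

```python
def failed_checks(output, required_phrases):
    """Return the required phrases that are missing from the output."""
    lowered = output.lower()
    return [phrase for phrase in required_phrases if phrase.lower() not in lowered]


def compare_models(rows, outputs_by_model):
    """Judge every model against the same rows and the same pass criteria.

    rows: list of dicts with hypothetical keys 'id' and 'required_phrases'.
    outputs_by_model: {model_name: {row_id: output_text}}.
    Returns {model_name: {row_id: [missing phrases]}}.
    """
    report = {}
    for model, outputs in outputs_by_model.items():
        report[model] = {
            # A missing output is scored as an empty string, so it fails every check.
            row["id"]: failed_checks(outputs.get(row["id"], ""), row["required_phrases"])
            for row in rows
        }
    return report


def decide(report, model):
    """Suggest 'ship' only when the model failed no row; otherwise 'revise'."""
    has_failures = any(missing for missing in report[model].values())
    return "revise" if has_failures else "ship"


# Example data (illustrative only).
rows = [{"id": "export-1", "required_phrases": ["report", "today"]}]
outputs_by_model = {
    "model-a": {"export-1": "Which report failed? I'll get this fixed today."},
    "model-b": {"export-1": "Sorry about that."},
}
report = compare_models(rows, outputs_by_model)
for model in report:
    print(model, report[model], decide(report, model))
```

Because the report stores the missing phrases per row, the same structure serves both the side-by-side view and the ship-or-revise decision.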
Example dataset row
{
  input: "This export failed again and I need it today.",
  expected_behavior: [
    "acknowledge urgency",
    "ask for report or account details",
    "give a concrete troubleshooting step"
  ],
  fail_if: ["generic apology only", "no next step", "wrong tone"]
}
Compare the model outputs, score the failures, and share the decision record with the team.
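A decision record for the example row could be assembled as follows. This sketch assumes that reviewers tag each model's output with labels from the row's `fail_if` list; the function `build_decision_record`, the record fields, and the model names are hypothetical and not part of any documented PromptLens format.

```python
import json

# The example dataset row, written as a Python dict.
row = {
    "input": "This export failed again and I need it today.",
    "expected_behavior": [
        "acknowledge urgency",
        "ask for report or account details",
        "give a concrete troubleshooting step",
    ],
    "fail_if": ["generic apology only", "no next step", "wrong tone"],
}


def build_decision_record(row, tagged_failures):
    """Summarise reviewer tags per model into a shareable record.

    tagged_failures: {model_name: [fail_if labels a reviewer applied]}.
    """
    allowed = set(row["fail_if"])
    failures = {}
    for model, tags in tagged_failures.items():
        unknown = [tag for tag in tags if tag not in allowed]
        if unknown:
            # Reject labels the row does not define, so scores stay comparable.
            raise ValueError(f"unknown failure labels for {model}: {unknown}")
        failures[model] = sorted(set(tags))
    passing = sorted(model for model, tags in failures.items() if not tags)
    return {"input": row["input"], "failures": failures, "passing_models": passing}


record = build_decision_record(
    row, {"model-a": [], "model-b": ["no next step", "wrong tone"]}
)
print(json.dumps(record, indent=2))
```

Serialising the record as JSON keeps it easy to paste into a ticket or attach to a review thread.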