Use case

Chatbot Prompt Testing

Compare chatbot responses across models using the same user messages, expected answers, and text-based checks.


What to evaluate

Representative conversations

Use real support, onboarding, and FAQ messages as dataset rows so every model sees the same input.

Expected-answer checks

Use expected outputs, keyword checks, and visible failure reasons to review whether each answer is specific enough.
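A keyword check with a visible failure reason could be sketched as follows. This is a minimal illustration, not a PromptLens API: the `KeywordCheck` and `CheckResult` shapes, the `runKeywordCheck` function, and the idea of matching phrases case-insensitively are all assumptions made for the example.

```typescript
// Hypothetical shapes for illustration; not taken from PromptLens.
interface KeywordCheck {
  name: string;
  mustInclude: string[];    // passes if at least one of these phrases appears
  mustNotInclude: string[]; // fails if any of these phrases appears
}

interface CheckResult {
  name: string;
  passed: boolean;
  reason: string; // empty string when the check passes
}

function runKeywordCheck(answer: string, check: KeywordCheck): CheckResult {
  const text = answer.toLowerCase();

  // Forbidden phrases are checked first so the reason names the exact phrase.
  const forbidden = check.mustNotInclude.find((p) => text.includes(p.toLowerCase()));
  if (forbidden !== undefined) {
    return { name: check.name, passed: false, reason: `contains forbidden phrase "${forbidden}"` };
  }

  // An empty mustInclude list means there is nothing required.
  const hasRequired =
    check.mustInclude.length === 0 ||
    check.mustInclude.some((p) => text.includes(p.toLowerCase()));
  if (!hasRequired) {
    return { name: check.name, passed: false, reason: `missing any of: ${check.mustInclude.join(", ")}` };
  }

  return { name: check.name, passed: true, reason: "" };
}
```

Returning the reason alongside the pass/fail flag keeps each failure reviewable without rereading the full answer. Plain substring matching is deliberately crude; it will miss paraphrases, so treat it as a first filter rather than a judgment of answer quality.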

Escalation language

Check whether responses include required escalation or missing-context language for high-risk cases.

Baseline comparison

Compare a candidate prompt against a known-good run before the change reaches users.

Checks

Build the comparison around observable failures.

PromptLens works best when each model is judged against the same dataset rows and pass criteria.

Answers the user question

Includes required support language

Asks for missing context

Includes escalation language

Avoids unsupported promises

Evaluation example

A practical chatbot comparison

Start with the conversations that are expensive to get wrong: billing disputes, account access, angry customers, and unclear technical issues.

Run each row across the models you are considering. The report should show the raw answer, score, pass/fail status, and failure reason in one place.
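One way to picture a report row that holds the raw answer, score, pass/fail status, and failure reasons together is sketched below. All names here (`ReportRow`, `Check`, `scoreAnswer`, `buildReport`) are hypothetical, and scoring as the fraction of checks passed is an assumption; the source does not specify how PromptLens computes a score.

```typescript
// Hypothetical types for illustration; not a PromptLens schema.
// A check returns a failure reason, or null when the answer passes.
type Check = (answer: string) => string | null;

interface ModelAnswer {
  model: string;
  rawAnswer: string;
}

interface ReportRow {
  model: string;
  input: string;
  rawAnswer: string;
  score: number; // assumed: fraction of checks passed, from 0 to 1
  passed: boolean;
  failureReasons: string[];
}

function scoreAnswer(input: string, answer: ModelAnswer, checks: Check[], threshold = 1): ReportRow {
  const failureReasons: string[] = [];
  for (const check of checks) {
    const reason = check(answer.rawAnswer);
    if (reason !== null) failureReasons.push(reason);
  }
  // With no checks configured there is nothing to fail, so the score is 1.
  const score = checks.length === 0 ? 1 : (checks.length - failureReasons.length) / checks.length;
  return {
    model: answer.model,
    input,
    rawAnswer: answer.rawAnswer,
    score,
    passed: score >= threshold,
    failureReasons,
  };
}

// Every model's answer to the same input is scored with the same checks.
function buildReport(input: string, answers: ModelAnswer[], checks: Check[]): ReportRow[] {
  return answers.map((answer) => scoreAnswer(input, answer, checks));
}
```

The answers are passed in rather than fetched, so the sketch stays independent of any particular model provider. The default threshold of 1 means a row passes only when every check passes; lower it if partial credit is acceptable for a workflow.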

If a candidate model or prompt clears the same threshold as your baseline, you have evidence to trial it for that workflow. If it drops below baseline, block the change.
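The trial-or-block decision can be sketched as a small gate over per-row pass/fail results. This is an illustration under stated assumptions: `gateAgainstBaseline` and `passRate` are hypothetical names, and "clears the same threshold" is read here as "the candidate's pass rate is at least the baseline's pass rate" on the same dataset rows.

```typescript
// Hypothetical helper names; the comparison rule is an assumption.
type Decision = "trial" | "block";

interface GateResult {
  decision: Decision;
  baselineRate: number;
  candidateRate: number;
  regressions: number[]; // row indexes that passed on baseline but fail on candidate
}

function passRate(results: boolean[]): number {
  if (results.length === 0) return 0;
  return results.filter((passed) => passed).length / results.length;
}

// Both arrays hold one pass/fail value per dataset row, in the same row order.
function gateAgainstBaseline(baseline: boolean[], candidate: boolean[]): GateResult {
  if (baseline.length !== candidate.length) {
    throw new Error("baseline and candidate must cover the same dataset rows");
  }
  const baselineRate = passRate(baseline);
  const candidateRate = passRate(candidate);

  const regressions: number[] = [];
  baseline.forEach((passed, i) => {
    if (passed && !candidate[i]) regressions.push(i);
  });

  return {
    decision: candidateRate >= baselineRate ? "trial" : "block",
    baselineRate,
    candidateRate,
    regressions,
  };
}
```

Listing regressions separately matters because a candidate can match the baseline's overall pass rate while failing rows the baseline handled. A stricter gate might block on any regression in a high-risk row, whatever the aggregate rate.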

Example dataset row

{
  input: "I was charged twice and nobody replied.",
  expected_behavior: [
    "acknowledge frustration",
    "ask for account or invoice detail",
    "explain the next support step"
  ],
  fail_if: [
    "dismisses the complaint",
    "promises a refund without policy context",
    "does not offer escalation"
  ]
}

Turn this workflow into a report.

Compare the model outputs, score the failures, and share the decision record with the team.
