Use case

Code Assistant Prompt Evaluation

Compare code assistant outputs across models, and review the generated code, its explanation, and any policy-sensitive patterns side by side.

What to evaluate

Same coding task

Run each model against the same prompt, language constraints, and expected behavior.

Reviewable text output

Keep code, explanation, score, and failure reason together so reviewers can make a quick call.
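One way to keep those four pieces together is a single record per model output. The sketch below is illustrative only: `ReviewRecord`, its field names, and the 0 to 1 score range are assumptions, not a PromptLens schema.

```typescript
// Hypothetical record shape; field names are assumptions for illustration.
interface ReviewRecord {
  model: string;
  code: string;
  explanation: string;
  score: number; // assumed to range from 0 to 1
  failureReason: string | null; // null when every check passed
}

// Build the one-line summary a reviewer scans before opening the full output.
function summarize(record: ReviewRecord): string {
  const verdict =
    record.failureReason === null ? "PASS" : `FAIL: ${record.failureReason}`;
  return `${record.model} | score ${record.score.toFixed(2)} | ${verdict}`;
}

const exampleRecord: ReviewRecord = {
  model: "model-a", // placeholder model name
  code: "export function isEmail(s: string): boolean { /* ... */ return false; }",
  explanation: "Checks for a single @ and a non-empty domain.",
  score: 0.5,
  failureReason: "uses eval",
};

console.log(summarize(exampleRecord));
```

Storing the failure reason next to the score means a reviewer never has to look up why a row failed.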

Security and style checks

Use keyword, regex, and expected-output checks for unsafe patterns, unclear assumptions, or style conventions.
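As a rough sketch of what keyword, regex, and expected-output checks can look like, each check below takes the model's text output and returns a named result with a reason. The helper names (`forbidKeyword`, `forbidPattern`, `requirePattern`, `runChecks`) are hypothetical and are not part of any PromptLens API.

```typescript
// Hypothetical check helpers; names and shapes are assumptions for illustration.
type CheckResult = { name: string; passed: boolean; reason?: string };
type TextCheck = (output: string) => CheckResult;

// Keyword check: fail when a forbidden substring appears in the output.
function forbidKeyword(name: string, keyword: string): TextCheck {
  return (output) =>
    output.includes(keyword)
      ? { name, passed: false, reason: `found forbidden keyword "${keyword}"` }
      : { name, passed: true };
}

// Regex check: fail when an unsafe pattern matches anywhere in the output.
function forbidPattern(name: string, pattern: RegExp): TextCheck {
  return (output) => {
    const index = output.search(pattern);
    return index >= 0
      ? { name, passed: false, reason: `/${pattern.source}/ matched at index ${index}` }
      : { name, passed: true };
  };
}

// Expected-output check: fail when a required pattern is missing.
function requirePattern(name: string, pattern: RegExp): TextCheck {
  return (output) =>
    output.search(pattern) >= 0
      ? { name, passed: true }
      : { name, passed: false, reason: `missing required pattern /${pattern.source}/` };
}

// Run every check so the reviewer sees all failures, not just the first.
function runChecks(output: string, checks: TextCheck[]): CheckResult[] {
  return checks.map((check) => check(output));
}

const sampleOutput = "export function run(code: string) { return eval(code); }";
const results = runChecks(sampleOutput, [
  forbidKeyword("no-any-type", ": any"),
  forbidPattern("no-eval", /\beval\s*\(/),
  requirePattern("explicit-return-type", /\)\s*:\s*\w+/),
]);

console.log(results.filter((r) => !r.passed));
```

Text checks like these are heuristics: they flag patterns for a reviewer and do not prove that the code is safe or correct.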

Model downgrade evidence

Use the report to decide whether a candidate model, such as a smaller or cheaper one, is acceptable for a narrow coding prompt workflow.

Checks

Build the comparison around observable failures.

PromptLens works best when each model is judged against the same dataset rows and pass criteria.

Matches expected output shape
Covers expected cases
Avoids unsafe patterns
Explains assumptions
Follows requested style

Evaluation example

A practical code assistant comparison

Start with small coding tasks where quality differences are easy to inspect: validation helpers, data transforms, API handlers, and refactors.

Score the outputs with concrete text checks that name the failure mode. The useful artifact is the comparison, not a vague winner label.

When a model change regresses a language or framework pattern, the report gives the team a concrete reason to block it.
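A regression can be stated as a check that the current model passes and the candidate fails on the same dataset row. The sketch below assumes per-row results have already been collected; `RowResult`, `Regression`, and `findRegressions` are hypothetical names for illustration.

```typescript
// Hypothetical result shapes; not a PromptLens export format.
type RowResult = { rowId: string; failedChecks: string[] };
type Regression = { rowId: string; newFailures: string[] };

// List the checks that fail for the candidate but did not fail for the baseline.
function findRegressions(
  baseline: RowResult[],
  candidate: RowResult[],
): Regression[] {
  const baselineFailures = new Map<string, Set<string>>();
  for (const row of baseline) {
    baselineFailures.set(row.rowId, new Set(row.failedChecks));
  }

  const regressions: Regression[] = [];
  for (const row of candidate) {
    // Rows missing from the baseline are treated as having no prior failures.
    const known = baselineFailures.get(row.rowId) ?? new Set<string>();
    const newFailures = row.failedChecks.filter((check) => !known.has(check));
    if (newFailures.length > 0) {
      regressions.push({ rowId: row.rowId, newFailures });
    }
  }
  return regressions;
}

const baselineRun: RowResult[] = [
  { rowId: "email-validator", failedChecks: ["no explanation"] },
  { rowId: "slugify-helper", failedChecks: [] },
];
const candidateRun: RowResult[] = [
  { rowId: "email-validator", failedChecks: ["no explanation", "uses eval"] },
  { rowId: "slugify-helper", failedChecks: [] },
];

console.log(findRegressions(baselineRun, candidateRun));
```

Reporting only the new failures keeps the blocking reason specific: the candidate introduced `uses eval` on one row, while the shared `no explanation` failure is not counted against it.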

Example dataset row

{
  task: "Write a TypeScript email validator.",
  constraints: ["no external package", "clear return type"],
  must_handle: ["empty string", "missing domain", "valid email"],
  fail_if: ["uses eval", "throws on normal input", "no explanation"]
}
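To give every model an identical prompt, a row like this can be typed and rendered by one function. This is a minimal sketch: the `DatasetRow` interface mirrors the fields above, while `buildPrompt` and its layout are assumptions, not how PromptLens renders prompts.

```typescript
// Mirrors the example row's fields; the interface name is hypothetical.
interface DatasetRow {
  task: string;
  constraints: string[];
  must_handle: string[];
  fail_if: string[];
}

// Render the prompt once so each model receives exactly the same text.
// fail_if is deliberately left out: in this sketch it is grading criteria,
// not part of the instructions shown to the model.
function buildPrompt(row: DatasetRow): string {
  return [
    row.task,
    "Constraints:",
    ...row.constraints.map((item) => `- ${item}`),
    "Must handle:",
    ...row.must_handle.map((item) => `- ${item}`),
  ].join("\n");
}

const exampleRow: DatasetRow = {
  task: "Write a TypeScript email validator.",
  constraints: ["no external package", "clear return type"],
  must_handle: ["empty string", "missing domain", "valid email"],
  fail_if: ["uses eval", "throws on normal input", "no explanation"],
};

console.log(buildPrompt(exampleRow));
```

Whether to show the `fail_if` entries to the model is a design choice. Keeping them hidden tests what the model does by default, while including them tests instruction following.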

Turn this workflow into a report.

Compare the model outputs, score the failures, and share the decision record with the team.