Compare code assistant outputs across models and review code text, explanations, and policy-sensitive patterns.
What to evaluate
Run each model against the same prompt, language constraints, and expected behavior.
Keep code, explanation, score, and failure reason together so reviewers can make a quick call.
Use keyword, regex, and expected-output checks to flag unsafe patterns, unclear assumptions, or departures from style conventions.
Use the report to test whether a candidate model is acceptable for a narrow coding prompt workflow.
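As a rough illustration of keyword and regex checks that name their failure mode, here is a minimal TypeScript sketch. The `TextCheck` and `CheckResult` shapes, the `runChecks` helper, and the specific patterns are assumptions made for this example; they are not PromptLens's actual API or configuration format.

```typescript
// Hypothetical check shape: each check names the failure mode it detects.
interface TextCheck {
  failureReason: string;
  // Returns true when the model output exhibits the failure.
  fails: (output: string) => boolean;
}

interface CheckResult {
  passed: boolean;
  failureReasons: string[];
}

// Built at runtime so this example never contains a literal fence marker.
const FENCE = "`".repeat(3);

// Drops fenced code so only the prose explanation remains.
// After splitting on the fence marker, odd-indexed segments are code.
function stripFencedCode(output: string): string {
  return output
    .split(FENCE)
    .filter((_, index) => index % 2 === 0)
    .join("");
}

// Illustrative patterns only; a real dataset row would supply its own.
const checks: TextCheck[] = [
  {
    failureReason: "uses eval",
    fails: (output) => /\beval\s*\(/.test(output),
  },
  {
    failureReason: "uses external package",
    // Flags import lines whose specifier is not a relative or absolute path.
    fails: (output) => /^\s*import\s.+\sfrom\s+["'][^./]/m.test(output),
  },
  {
    failureReason: "no explanation",
    fails: (output) => stripFencedCode(output).trim().length === 0,
  },
];

// Runs every check and collects the named reasons for any failure.
function runChecks(output: string, textChecks: TextCheck[]): CheckResult {
  const failureReasons = textChecks
    .filter((check) => check.fails(output))
    .map((check) => check.failureReason);
  return { passed: failureReasons.length === 0, failureReasons };
}
```

Because every check carries its own `failureReason`, the result can sit next to the code and explanation in the report without a reviewer having to infer why a row failed.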
PromptLens works best when each model is judged against the same dataset rows and pass criteria.
Start with small coding tasks where quality differences are easy to inspect: validation helpers, data transforms, API handlers, and refactors.
Score the outputs with concrete text checks that name the failure mode. The useful artifact is the comparison, not a vague winner label.
When a model change regresses a language or framework pattern, the report gives the team a concrete reason to block it.
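One way to turn per-row results into a blocking reason is to list the rows that passed for the current model but fail for the candidate. The sketch below assumes a hypothetical `RowResult` shape and a `findRegressions` helper; neither is taken from PromptLens itself.

```typescript
// Hypothetical per-row result, keyed by a dataset row identifier.
interface RowResult {
  rowId: string;
  passed: boolean;
  failureReasons: string[];
}

interface Regression {
  rowId: string;
  failureReasons: string[];
}

// A regression is a row the baseline model passed and the candidate fails.
// Rows missing from the baseline are ignored, since there is nothing to compare.
function findRegressions(
  baseline: RowResult[],
  candidate: RowResult[],
): Regression[] {
  const baselineById = new Map<string, RowResult>(
    baseline.map((result): [string, RowResult] => [result.rowId, result]),
  );
  return candidate
    .filter(
      (result) =>
        !result.passed && baselineById.get(result.rowId)?.passed === true,
    )
    .map((result) => ({
      rowId: result.rowId,
      failureReasons: result.failureReasons,
    }));
}
```

This only works when both models ran against the same dataset rows and pass criteria, which is why the comparison is keyed on `rowId`.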
Example dataset row
{
  task: "Write a TypeScript email validator.",
  constraints: ["no external package", "clear return type"],
  must_handle: ["empty string", "missing domain", "valid email"],
  fail_if: ["uses eval", "throws on normal input", "no explanation"]
}
Compare the model outputs, score the failures, and share the decision record with the team.
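To make the example dataset row concrete, the sketch below types it and pairs each `must_handle` label with an input and an expected result. The `DatasetRow` interface, the `isValidEmail` function, and the test inputs are illustrative assumptions: the validator stands in for one answer a model might produce and is deliberately loose, not a full RFC 5322 implementation.

```typescript
// Field names follow the example row; the interface itself is an assumption.
interface DatasetRow {
  task: string;
  constraints: string[];
  must_handle: string[];
  fail_if: string[];
}

const row: DatasetRow = {
  task: "Write a TypeScript email validator.",
  constraints: ["no external package", "clear return type"],
  must_handle: ["empty string", "missing domain", "valid email"],
  fail_if: ["uses eval", "throws on normal input", "no explanation"],
};

// Stand-in for a model's answer: no dependencies, explicit boolean return,
// and it returns false instead of throwing on malformed input.
function isValidEmail(input: string): boolean {
  const trimmed = input.trim();
  if (trimmed.length === 0) {
    return false;
  }
  // Requires a local part, a single "@", and a domain containing a dot.
  return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(trimmed);
}

// Hypothetical expected-output cases, one per must_handle label.
const cases: Array<{ label: string; input: string; expected: boolean }> = [
  { label: "empty string", input: "", expected: false },
  { label: "missing domain", input: "user@", expected: false },
  { label: "valid email", input: "user@example.com", expected: true },
];

// Labels in the row that no case covers.
const unhandled = row.must_handle.filter(
  (label) => !cases.some((testCase) => testCase.label === label),
);

// Labels whose case produced the wrong result.
const failedCases = cases
  .filter((testCase) => isValidEmail(testCase.input) !== testCase.expected)
  .map((testCase) => testCase.label);
```

Keeping the case labels identical to the `must_handle` entries means a failed case reports the same wording a reviewer sees in the dataset row.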