Model Evals
Evaluate prompts, flows, and agents with tournament brackets, pairwise comparisons, and rubric scoring.
Workflow
- Create a tournament in the Evals app and define the task (prompt, agent, or conversation).
- Upload contenders from GitHub, Builder exports, or manual JSON (a sample manifest is sketched after this list).
- Pick scoring rules (auto metrics, human rubric, or hybrid).
- Run matches and watch live leaderboards update.
- Export results back into Builder or CI pipelines.
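For the manual JSON path, a contender file might look like the sketch below. The field names (tournament, contenders, name, source, prompt, exportId) are illustrative assumptions, not a documented schema; check the Evals app for the exact format it expects.

```json
{
  "tournament": "builder-weekly",
  "contenders": [
    {
      "name": "baseline-prompt",
      "source": "manual",
      "prompt": "You are a support agent. Answer concisely and cite the docs."
    },
    {
      "name": "branch-build",
      "source": "builder-export",
      "exportId": "<builder-export-id>"
    }
  ]
}
```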
Integrations
- Builder sessions can be promoted directly into an eval.
- API responses stream into the eval viewer so you can trace reasoning steps.
- Webhooks fire on regressions so you can block deployments (see the handler sketch after this list).
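The webhook payload and endpoint below are assumptions for illustration; the actual event schema isn't specified here. The sketch shows one way a small Node service could receive a regression event and surface it so a CI gate can block the deployment.

```typescript
import { createServer } from "node:http";

// Assumed payload shape -- the real webhook schema may differ.
interface RegressionEvent {
  tournament: string;
  variant: string;
  metric: string;
  delta: number; // negative means the variant scored worse than the baseline
}

const server = createServer((req, res) => {
  // Hypothetical path for the regression webhook.
  if (req.method !== "POST" || req.url !== "/evals/regression") {
    res.writeHead(404).end();
    return;
  }

  let body = "";
  req.on("data", (chunk) => (body += chunk));
  req.on("end", () => {
    const event = JSON.parse(body) as RegressionEvent;

    // Log the regression so a later CI step can refuse to promote the build.
    console.error(
      `Regression in ${event.tournament}: ${event.variant} dropped ${event.metric} by ${event.delta}`
    );

    res.writeHead(204).end();
  });
});

server.listen(8787);
```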
CLI support
```bash
pnpm dlx @playbasis/evals submit \
  --tournament builder-weekly \
  --variant branch-build \
  --token $PLAYBASIS_EVALS_TOKEN
```
Access
The hosted UI is not yet publicly available. Email helloplaybasis@gmail.com to request a guided walkthrough or early-access credentials.