Model Evals
Evaluate prompts, flows, and agents with tournament brackets, pairwise comparisons, and rubric scoring.
Workflow
- Create a tournament in the Evals app and define the task (prompt, agent, or conversation).
- Upload contenders from GitHub, Builder exports, or manual JSON (a sample manifest is sketched after this list).
- Pick scoring rules (auto metrics, human rubric, or hybrid).
- Run matches and watch live leaderboards update.
- Export results back into Builder or CI pipelines.
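For the manual JSON path, a contender file might look like the sketch below. The field names (tournament, contenders, name, source, prompt, exportId) are illustrative assumptions, not a documented schema; check the Evals app for the exact format it expects.

```json
{
  "tournament": "builder-weekly",
  "contenders": [
    {
      "name": "baseline-prompt",
      "source": "manual",
      "prompt": "You are a support agent. Answer concisely and cite the docs."
    },
    {
      "name": "branch-build",
      "source": "builder-export",
      "exportId": "<builder-export-id>"
    }
  ]
}
```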
Integrations
- Builder sessions can be promoted directly into an eval.
- API responses stream into the eval viewer so you can trace reasoning steps.
- Webhooks fire on regressions so you can block deployments (see the handler sketch after this list).
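The webhook payload and endpoint below are assumptions for illustration; the actual event schema isn't specified here. The sketch shows one way a small Node service could receive a regression event and surface it so a CI gate can block the deployment.

```typescript
import { createServer } from "node:http";

// Assumed payload shape -- the real webhook schema may differ.
interface RegressionEvent {
  tournament: string;
  variant: string;
  metric: string;
  delta: number; // negative means the variant scored worse than the baseline
}

const server = createServer((req, res) => {
  // Hypothetical path for the regression webhook.
  if (req.method !== "POST" || req.url !== "/evals/regression") {
    res.writeHead(404).end();
    return;
  }

  let body = "";
  req.on("data", (chunk) => (body += chunk));
  req.on("end", () => {
    const event = JSON.parse(body) as RegressionEvent;

    // Log the regression so a later CI step can refuse to promote the build.
    console.error(
      `Regression in ${event.tournament}: ${event.variant} dropped ${event.metric} by ${event.delta}`
    );

    res.writeHead(204).end();
  });
});

server.listen(8787);
```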
CLI support
```bash
pnpm dlx @playbasis/evals submit \
  --tournament builder-weekly \
  --variant branch-build \
  --token $PLAYBASIS_EVALS_TOKEN
```
Access
The hosted UI is not yet publicly available. Email helloplaybasis@gmail.com to request a guided walkthrough or early-access credentials.