AI Judge
AI Judge adds LLM-powered semantic evaluation to your Flow gates. Define criteria in plain English — the AI evaluates each payload against them and returns structured pass/fail verdicts with confidence scores and reasoning.
Key insight: Deterministic business rules catch 70-80% of issues. AI Judge handles the remaining 20-30% that rules can't express — architectural quality, content tone, semantic compliance, contract clause analysis.
How It Works
AI Judge runs as an optional layer after deterministic validation passes:
Schema validation → Business rules → AI Judge → Approval → Delivery
If schema or business rules fail, AI Judge is skipped entirely — no LLM cost for payloads that would fail basic validation anyway.
Three-Layer Validation
| Layer | Type | Speed | Cost |
|---|---|---|---|
| Schema + Rules | Deterministic | ~10ms | Free (standard run) |
| AI Judge | LLM-based | 2-7s | 5x run credit |
| Human Review | Manual | Minutes-hours | — |
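The short-circuit order described above can be sketched as follows. The validator callbacks and the return shape here are illustrative assumptions, not Rynko's actual API:

```javascript
// Sketch of the three-layer short-circuit. The callbacks and return
// shape are illustrative, not the real Rynko API.
function evaluateGate(payload, { validateSchema, checkRules, runAiJudge }) {
  // Layer 1: schema validation (deterministic, ~10ms)
  if (!validateSchema(payload)) return { status: 'rejected', layer: 'schema' };
  // Layer 2: business rules (deterministic, ~10ms)
  if (!checkRules(payload)) return { status: 'rejected', layer: 'rules' };
  // Layer 3: AI Judge runs only when both deterministic layers pass,
  // so payloads that fail basic validation never incur LLM cost
  return runAiJudge(payload);
}
```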
Configuring AI Judge
Enable AI Judge when creating or updating a gate:
```json
{
  "aiJudgeEnabled": true,
  "aiJudgeCriteria": [
    "Functions should follow the Single Responsibility Principle",
    "Error handling should use specific error types, not generic catch-all blocks",
    "Public API endpoints should validate input and return appropriate error codes"
  ],
  "aiJudgeThreshold": 1.0,
  "aiJudgeConfidenceThreshold": 0.7,
  "aiJudgeOnTimeout": "fail",
  "aiJudgeOnFail": "reject",
  "aiJudgePersona": "senior code reviewer specializing in security and maintainability",
  "aiJudgeSystemInstruction": "This is a Node.js/TypeScript codebase using NestJS with Prisma. Focus on code quality patterns visible in the diff."
}
```
Configuration Fields
| Field | Type | Default | Description |
|---|---|---|---|
| aiJudgeEnabled | boolean | false | Enable AI Judge evaluation |
| aiJudgeCriteria | string[] | [] | Natural language criteria (max 50 criteria, 500 chars each) |
| aiJudgeThreshold | float | 1.0 | Minimum pass rate (1.0 = all must pass, 0.8 = 80%) |
| aiJudgeConfidenceThreshold | float | 0.7 | Below this confidence → route to human review |
| aiJudgeOnTimeout | string | 'fail' | On LLM timeout: 'fail', 'pass', or 'review' |
| aiJudgeOnFail | string | 'reject' | On failure: 'reject' (hard fail) or 'review' (route to approval) |
| aiJudgePersona | string | '' | LLM persona for evaluation (max 200 chars) |
| aiJudgeSystemInstruction | string | '' | Domain context for evaluation (max 1000 chars) |
Writing Good Criteria
Criteria should be specific, observable, and evaluable from the payload data:
| Good | Avoid |
|---|---|
| "Functions should follow the Single Responsibility Principle" | "Code should be good" |
| "Database queries must use parameterized inputs" | "Be secure" |
| "Contract must include an indemnification clause" | "Contract should be legal" |
| "Content should be appropriate for a business audience" | "Content should be nice" |
If a criterion cannot be evaluated from the payload (e.g., "database queries should use parameterized inputs" but the payload has no database code), AI Judge marks it as pass with reasoning "Not applicable." This prevents false failures.
Persona and Domain Context
- Persona sets the LLM's role — e.g., "senior code reviewer specializing in security" or "trade compliance auditor with 10 years experience"
- Domain context provides factual background — e.g., "This codebase uses NestJS with Prisma. All database access must go through repository services."
Domain context is NOT an instruction — it's factual context the LLM uses when evaluating criteria. The evaluation rules and security constraints are hardened in the system prompt and cannot be overridden.
Verdict Format
Each evaluation returns a structured verdict:
```json
{
  "criteria": [
    {
      "criterion": "Functions should follow SRP",
      "verdict": "pass",
      "confidence": 0.95,
      "reasoning": "All 3 functions in the diff handle a single responsibility"
    },
    {
      "criterion": "Error handling should use specific types",
      "verdict": "fail",
      "confidence": 0.88,
      "reasoning": "Line 45 uses a generic catch-all: catch(err) { throw err }"
    }
  ],
  "overall": "fail",
  "summary": "1 of 2 criteria failed",
  "model": "gemini-2.5-flash",
  "tokensUsed": 1250,
  "durationMs": 1800
}
```
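Under the default all-must-pass threshold, the overall and summary fields follow directly from the per-criterion verdicts. A minimal sketch of that relationship (the function name is illustrative):

```javascript
// Sketch: deriving overall/summary from per-criterion verdicts under
// the default all-must-pass behavior (other thresholds are covered in
// Threshold Logic below).
function summarize(criteria) {
  const failed = criteria.filter(c => c.verdict === 'fail').length;
  return {
    overall: failed === 0 ? 'pass' : 'fail',
    summary: `${failed} of ${criteria.length} criteria failed`,
  };
}
```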
Threshold Logic
The aiJudgeThreshold determines what pass rate is required:
- 1.0 (default) — all criteria must pass
- 0.8 — 80% of criteria must pass
- 0.5 — a majority must pass
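The pass-rate check can be sketched as a simple fraction comparison (the empty-criteria behavior here is an assumption):

```javascript
// Sketch of the aiJudgeThreshold check: fraction of passing criteria
// compared against the configured threshold.
function meetsThreshold(criteria, threshold = 1.0) {
  if (criteria.length === 0) return true; // assumption: vacuously passes
  const passed = criteria.filter(c => c.verdict === 'pass').length;
  return passed / criteria.length >= threshold;
}
```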
Confidence Routing
Each criterion includes a confidence score (0.0 to 1.0). If any criterion's confidence falls below aiJudgeConfidenceThreshold:
- The run is automatically routed to human approval
- This happens regardless of the gate's normal approval configuration
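The routing rule amounts to a single predicate over the verdict, sketched here with an assumed helper name:

```javascript
// Sketch of confidence routing: any criterion below the gate's
// aiJudgeConfidenceThreshold forces human review, regardless of verdict.
function needsHumanReview(verdict, confidenceThreshold = 0.7) {
  return verdict.criteria.some(c => c.confidence < confidenceThreshold);
}
```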
AI Judge → Approval Routing
AI Judge integrates with the approval system in three ways:
1. Low Confidence → Auto-Review
When any criterion has confidence below aiJudgeConfidenceThreshold, the run is routed to human approval automatically.
2. On Fail → Review
Set aiJudgeOnFail: "review" to route failures to human approval instead of hard-rejecting:
```json
{
  "aiJudgeOnFail": "review"
}
```
This is useful when AI Judge catches potential issues but you want a human to make the final call.
3. Verdict in Approval Expressions
The AI Judge verdict is available in approval condition expressions as aiJudge:
```javascript
// Route to approval when AI Judge fails
aiJudge.overall === 'fail'

// Route when any criterion has low confidence
aiJudge.criteria.some(c => c.confidence < 0.8)

// Combine with payload business logic
amount > 10000 || aiJudge.overall === 'fail'

// Route only when a specific criterion fails
aiJudge.criteria.some(c => c.criterion.includes('security') && c.verdict === 'fail')
```
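These expressions are plain predicates over the verdict object. Evaluated against a sample verdict (the criteria text below is illustrative; the field names follow the Verdict Format section):

```javascript
// Illustrative verdict; field names match the Verdict Format section.
const aiJudge = {
  overall: 'fail',
  criteria: [
    { criterion: 'security: queries must use parameterized inputs', verdict: 'fail', confidence: 0.9 },
    { criterion: 'Functions should follow SRP', verdict: 'pass', confidence: 0.95 },
  ],
};

// A failing security criterion (or an overall fail) routes to approval
const routeToApproval =
  aiJudge.overall === 'fail' ||
  aiJudge.criteria.some(c => c.criterion.includes('security') && c.verdict === 'fail');
```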
Sync vs Async Execution
Default: Synchronous. The caller submits a payload and gets the full verdict (schema + rules + AI Judge) in a single HTTP response. Typical latency: 2-7 seconds total.
Opt-in: Asynchronous. Add ?async=true to return immediately:
```bash
# Sync (default) — waits for full verdict
POST /api/flow/gates/:id/runs

# Async — returns immediately with status: 'ai_judging'
POST /api/flow/gates/:id/runs?async=true
```
When async, poll GET /api/flow/runs/:runId for the verdict, or configure a webhook delivery channel to receive results automatically.
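A polling loop might look like the sketch below; getRun is an assumed client helper wrapping GET /api/flow/runs/:runId, and the retry cadence is illustrative:

```javascript
// Sketch of async polling. getRun is an assumed helper that wraps
// GET /api/flow/runs/:runId; 'ai_judging' is the in-progress status
// returned by the async endpoint. Cadence and limits are illustrative.
async function waitForVerdict(getRun, runId, { delayMs = 1000, maxAttempts = 30 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const run = await getRun(runId);
    if (run.status !== 'ai_judging') return run; // verdict is ready
    await new Promise(resolve => setTimeout(resolve, delayMs));
  }
  throw new Error(`run ${runId} still ai_judging after ${maxAttempts} polls`);
}
```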
Pricing
AI Judge is available on paid tiers. Each evaluation consumes run credits at a 5x multiplier.
| Tier | AI Judge Evaluations/month | Max Criteria per Gate |
|---|---|---|
| Free | — | — |
| Starter ($29/mo) | 500 | 10 |
| Growth ($99/mo) | 5,000 | 20 |
| Scale ($349/mo) | 25,000 | 50 |
AI Judge runs server-side on Rynko's infrastructure using optimized models. You don't need to configure any LLM API keys.
Use Cases
Code Review
Combine deterministic tools (ESLint, tests, coverage) with AI Judge for architectural quality:
Deterministic rules: lint.errors === 0, tests.failed === 0, coverage >= 80
AI Judge criteria: "Functions should follow SRP", "Error handling should use specific types"
Content Moderation
Validate AI-generated content before publishing:
AI Judge criteria:
- "Content should be professional and appropriate for a business audience"
- "No personally identifiable information in public-facing text"
- "Tone should be helpful and constructive"
Contract Analysis
Check AI-extracted contract terms against legal requirements:
AI Judge criteria:
- "Contract must include an indemnification clause"
- "Payment terms should not exceed 90 days"
- "Non-compete scope must be limited to 12 months"
Security
AI Judge includes defense-in-depth against prompt injection:
- Prompt hardening — evaluation rules and security constraints are non-negotiable system instructions
- Input sanitization — persona, domain context, and criteria are stripped of common injection patterns
- Output sanitization — only the expected JSON structure is extracted from the LLM response
- Provider constraints — the LLM has no tools, no browsing, no code execution
The AI Judge is an evaluator only — it cannot take actions, access external systems, or modify data.