
AI Judge

AI Judge adds LLM-powered semantic evaluation to your Flow gates. Define criteria in plain English — the AI evaluates each payload against them and returns structured pass/fail verdicts with confidence scores and reasoning.

Key insight: Deterministic business rules catch 70-80% of issues. AI Judge handles the remaining 20-30% that rules can't express — architectural quality, content tone, semantic compliance, contract clause analysis.

How It Works

AI Judge runs as an optional layer after deterministic validation passes:

Schema validation → Business rules → AI Judge → Approval → Delivery

If schema or business rules fail, AI Judge is skipped entirely — no LLM cost for payloads that would fail basic validation anyway.
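
A minimal sketch of that short-circuit ordering (the helper names below are hypothetical stand-ins, not Rynko's API):

```typescript
// Illustrative sketch of the layered evaluation order, not Rynko's source.
type Verdict = 'pass' | 'fail';

const schemaValid = (payload: unknown): boolean =>
  typeof payload === 'object' && payload !== null;                    // stub schema check
const rulesPass = (_payload: object): boolean => true;                // stub business rules
const aiJudge = async (_payload: object): Promise<Verdict> => 'pass'; // stub LLM call

async function evaluate(payload: unknown): Promise<Verdict> {
  if (!schemaValid(payload)) return 'fail'; // cheap deterministic layers run first...
  if (!rulesPass(payload as object)) return 'fail';
  return aiJudge(payload as object);        // ...so no LLM cost when they fail
}
```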

Three-Layer Validation

| Layer | Type | Speed | Cost |
| --- | --- | --- | --- |
| Schema + Rules | Deterministic | ~10ms | Free (standard run) |
| AI Judge | LLM-based | 2-7s | 5x run credit |
| Human Review | Manual | Minutes-hours | |

Configuring AI Judge

Enable AI Judge when creating or updating a gate:

```json
{
  "aiJudgeEnabled": true,
  "aiJudgeCriteria": [
    "Functions should follow the Single Responsibility Principle",
    "Error handling should use specific error types, not generic catch-all blocks",
    "Public API endpoints should validate input and return appropriate error codes"
  ],
  "aiJudgeThreshold": 1.0,
  "aiJudgeConfidenceThreshold": 0.7,
  "aiJudgeOnTimeout": "fail",
  "aiJudgeOnFail": "reject",
  "aiJudgePersona": "senior code reviewer specializing in security and maintainability",
  "aiJudgeSystemInstruction": "This is a Node.js/TypeScript codebase using NestJS with Prisma. Focus on code quality patterns visible in the diff."
}
```

Configuration Fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| aiJudgeEnabled | boolean | false | Enable AI Judge evaluation |
| aiJudgeCriteria | string[] | [] | Natural language criteria (max 50 criteria, 500 chars each) |
| aiJudgeThreshold | float | 1.0 | Minimum pass rate (1.0 = all must pass, 0.8 = 80%) |
| aiJudgeConfidenceThreshold | float | 0.7 | Below this confidence, route to human review |
| aiJudgeOnTimeout | string | 'fail' | On LLM timeout: 'fail', 'pass', or 'review' |
| aiJudgeOnFail | string | 'reject' | On failure: 'reject' (hard fail) or 'review' (route to approval) |
| aiJudgePersona | string | '' | LLM persona for evaluation (max 200 chars) |
| aiJudgeSystemInstruction | string | '' | Domain context for evaluation (max 1000 chars) |

Writing Good Criteria

Criteria should be specific, observable, and evaluable from the payload data:

| Good | Avoid |
| --- | --- |
| "Functions should follow the Single Responsibility Principle" | "Code should be good" |
| "Database queries must use parameterized inputs" | "Be secure" |
| "Contract must include an indemnification clause" | "Contract should be legal" |
| "Content should be appropriate for a business audience" | "Content should be nice" |

Not-applicable criteria pass

If a criterion cannot be evaluated from the payload (e.g., "database queries should use parameterized inputs" but the payload has no database code), AI Judge marks it as pass with reasoning "Not applicable." This prevents false failures.

Persona and Domain Context

  • Persona sets the LLM's role — e.g., "senior code reviewer specializing in security" or "trade compliance auditor with 10 years experience"
  • Domain context provides factual background — e.g., "This codebase uses NestJS with Prisma. All database access must go through repository services."

Domain context is NOT an instruction — it's factual context the LLM uses when evaluating criteria. The evaluation rules and security constraints are hardened in the system prompt and cannot be overridden.


Verdict Format

Each evaluation returns a structured verdict:

```json
{
  "criteria": [
    {
      "criterion": "Functions should follow SRP",
      "verdict": "pass",
      "confidence": 0.95,
      "reasoning": "All 3 functions in the diff handle a single responsibility"
    },
    {
      "criterion": "Error handling should use specific types",
      "verdict": "fail",
      "confidence": 0.88,
      "reasoning": "Line 45 uses a generic catch-all: catch(err) { throw err }"
    }
  ],
  "overall": "fail",
  "summary": "1 of 2 criteria failed",
  "model": "gemini-2.5-flash",
  "tokensUsed": 1250,
  "durationMs": 1800
}
```
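
Downstream code (a webhook handler, say) can consume this structure directly. A small sketch, with illustrative types mirroring the fields above:

```typescript
// Types mirror the verdict fields shown above (illustrative, not an official SDK).
interface CriterionResult {
  criterion: string;
  verdict: 'pass' | 'fail';
  confidence: number;
  reasoning: string;
}

interface AiJudgeVerdict {
  criteria: CriterionResult[];
  overall: 'pass' | 'fail';
  summary: string;
}

// Collect a "criterion: reasoning" line for every failed criterion.
function failedCriteria(verdict: AiJudgeVerdict): string[] {
  return verdict.criteria
    .filter(c => c.verdict === 'fail')
    .map(c => `${c.criterion}: ${c.reasoning}`);
}
```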

Threshold Logic

The aiJudgeThreshold determines what pass rate is required:

  • 1.0 (default) — all criteria must pass
  • 0.8 — 80% of criteria must pass
  • 0.5 — majority must pass
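
In other words, it is a simple pass-rate comparison; a sketch for illustration (not Rynko's implementation):

```typescript
// Overall verdict passes when the fraction of passing criteria meets the threshold.
function meetsThreshold(verdicts: ('pass' | 'fail')[], threshold: number): boolean {
  const passed = verdicts.filter(v => v === 'pass').length;
  return passed / verdicts.length >= threshold;
}
```

With the default of 1.0, a single failing criterion fails the run (1/2 = 0.5); at 0.8, four passes out of five (a rate of exactly 0.8) still clear the gate.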

Confidence Routing

Each criterion includes a confidence score (0.0 to 1.0). If any criterion's confidence falls below aiJudgeConfidenceThreshold:

  • The run is automatically routed to human approval
  • This happens regardless of the gate's normal approval configuration
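
Combined with the verdict, the routing amounts to something like this sketch (simplified: it assumes the defaults aiJudgeThreshold: 1.0 and aiJudgeOnFail: 'reject'):

```typescript
type Route = 'deliver' | 'reject' | 'review';

// Any low-confidence criterion forces human review, even if every criterion passed.
function routeRun(
  criteria: { verdict: 'pass' | 'fail'; confidence: number }[],
  confidenceThreshold: number,
): Route {
  if (criteria.some(c => c.confidence < confidenceThreshold)) return 'review';
  const allPass = criteria.every(c => c.verdict === 'pass'); // threshold 1.0 assumed
  return allPass ? 'deliver' : 'reject';                     // 'reject' per aiJudgeOnFail default
}
```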

AI Judge → Approval Routing

AI Judge integrates with the approval system in three ways:

1. Low Confidence → Auto-Review

When any criterion has confidence below aiJudgeConfidenceThreshold, the run is routed to human approval automatically.

2. On Fail → Review

Set aiJudgeOnFail: "review" to route failures to human approval instead of hard-rejecting:

```json
{
  "aiJudgeOnFail": "review"
}
```

This is useful when AI Judge catches potential issues but you want a human to make the final call.

3. Verdict in Approval Expressions

The AI Judge verdict is available in approval condition expressions as aiJudge:

```javascript
// Route to approval when AI Judge fails
aiJudge.overall === 'fail'

// Route when any criterion has low confidence
aiJudge.criteria.some(c => c.confidence < 0.8)

// Combine with payload business logic
amount > 10000 || aiJudge.overall === 'fail'

// Route only when a specific criterion fails
aiJudge.criteria.some(c => c.criterion.includes('security') && c.verdict === 'fail')
```

Sync vs Async Execution

Default: Synchronous. The caller submits a payload and gets the full verdict (schema + rules + AI Judge) in a single HTTP response. Typical latency: 2-7 seconds total.

Opt-in: Asynchronous. Add ?async=true to return immediately:

```shell
# Sync (default) — waits for full verdict
POST /api/flow/gates/:id/runs

# Async — returns immediately with status: 'ai_judging'
POST /api/flow/gates/:id/runs?async=true
```

When async, poll GET /api/flow/runs/:runId for the verdict, or configure a webhook delivery channel to receive results automatically.
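
A polling loop for async mode can be sketched as follows. The fetcher is injected so you can wrap GET /api/flow/runs/:runId however you like; only the 'ai_judging' status value comes from this page, and the run's other fields are assumptions:

```typescript
type Run = { status: string };

// Poll until the run leaves 'ai_judging'. The getRun callback wraps
// GET /api/flow/runs/:runId (a webhook channel makes polling unnecessary).
async function waitForVerdict(
  runId: string,
  getRun: (id: string) => Promise<Run>,
  pollMs = 2000,    // AI Judge typically takes 2-7s
  maxAttempts = 30,
): Promise<Run> {
  for (let i = 0; i < maxAttempts; i++) {
    const run = await getRun(runId);
    if (run.status !== 'ai_judging') return run;
    await new Promise(resolve => setTimeout(resolve, pollMs));
  }
  throw new Error(`run ${runId} still evaluating after ${maxAttempts} polls`);
}
```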


Pricing

AI Judge is available on paid tiers. Each evaluation consumes run credits at a 5x multiplier.

| Tier | AI Judge Evaluations/month | Max Criteria per Gate |
| --- | --- | --- |
| Free | Not available | Not available |
| Starter ($29/mo) | 500 | 10 |
| Growth ($99/mo) | 5,000 | 20 |
| Scale ($349/mo) | 25,000 | 50 |
No LLM API keys needed

AI Judge runs server-side on Rynko's infrastructure using optimized models. You don't need to configure any LLM API keys.


Use Cases

Code Review

Combine deterministic tools (ESLint, tests, coverage) with AI Judge for architectural quality:

Deterministic rules: lint.errors === 0, tests.failed === 0, coverage >= 80
AI Judge criteria: "Functions should follow SRP", "Error handling should use specific types"
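
Put together, such a gate could look like the sketch below. Only the aiJudge* fields are documented on this page; the "rules" field is a hypothetical stand-in for however your gate expresses its deterministic checks:

```json
{
  "rules": [
    "lint.errors === 0",
    "tests.failed === 0",
    "coverage >= 80"
  ],
  "aiJudgeEnabled": true,
  "aiJudgeCriteria": [
    "Functions should follow the Single Responsibility Principle",
    "Error handling should use specific error types, not generic catch-all blocks"
  ],
  "aiJudgeOnFail": "review"
}
```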

Content Moderation

Validate AI-generated content before publishing:

AI Judge criteria:
- "Content should be professional and appropriate for a business audience"
- "No personally identifiable information in public-facing text"
- "Tone should be helpful and constructive"

Contract Analysis

Check AI-extracted contract terms against legal requirements:

AI Judge criteria:
- "Contract must include an indemnification clause"
- "Payment terms should not exceed 90 days"
- "Non-compete scope must be limited to 12 months"

Security

AI Judge includes defense-in-depth against prompt injection:

  1. Prompt hardening — evaluation rules and security constraints are non-negotiable system instructions
  2. Input sanitization — persona, domain context, and criteria are stripped of common injection patterns
  3. Output sanitization — only the expected JSON structure is extracted from the LLM response
  4. Provider constraints — the LLM has no tools, no browsing, no code execution

The AI Judge is an evaluator only — it cannot take actions, access external systems, or modify data.