Submit an image generated anywhere for automated quality evaluation. Get a pass verdict, a 0-1 score, and structured issues back. No Runflow run required.
An evaluation scores a generated image against a task description and optional reference images. Submit an image you generated anywhere (a Runflow run, nano-banana, Replicate, your own ComfyUI), and Runflow returns a structured judgment: did it pass, a weighted score, and the specific issues found.
Evaluations are run-less by design. Your organization API key is the unit of access; a Runflow run_id is an optional association, not a requirement. Runflow can also auto-evaluate eligible run outputs on the platform; those platform evaluations are separate from the ones you submit here (and from any you attach to a run with run_id). This API is for images generated outside a run, or for re-checking a post-processed export.
Evaluation is asynchronous. You submit, get back an evaluation id immediately, and the verdict lands a little later (the pipeline runs several judges, so expect tens of seconds). Read the result by polling or by receiving a callback.
POST /v1/evaluations          you submit an image + task
  -> 202 { id, status_code: "pending" }

status: pending -> running -> completed
                           \-> failed

GET /v1/evaluations/{id}      you poll, or receive a callback_url POST
  -> { overall_passed, weighted_pass_rate, top_issues, ... }
| Status | Meaning |
| --- | --- |
| pending | Accepted by Runflow, not yet picked up for processing. |
| running | Being evaluated. |
| completed | Terminal. A verdict was produced (overall_passed / weighted_pass_rate are set). |
| failed | Terminal. See failure_code for why. |
The status reference is also available at GET /v1/evaluations/statuses.
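The lifecycle above can be sketched as a small polling helper. This is a minimal illustration, not Runflow client code: `poll_until_terminal` and its `fetch` argument are hypothetical names, `fetch` stands for whatever function performs your authenticated GET /v1/evaluations/{id} and returns the parsed JSON, and the status key (`status_code` or `status`) is an assumption drawn from the 202 response and the callback body.

```python
import time
from typing import Callable

# The two terminal statuses from the status table.
TERMINAL_STATUSES = {"completed", "failed"}


def poll_until_terminal(
    fetch: Callable[[], dict],
    interval_s: float = 3.0,
    timeout_s: float = 180.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call fetch() repeatedly until the evaluation is completed or failed."""
    deadline = clock() + timeout_s
    while True:
        evaluation = fetch()
        # Assumption: the status is under "status_code" (as in the 202
        # response) or under "status" (as in the callback body).
        status = evaluation.get("status_code") or evaluation.get("status")
        if status in TERMINAL_STATUSES:
            return evaluation
        if clock() >= deadline:
            raise TimeoutError(f"evaluation still {status!r} after {timeout_s}s")
        sleep(interval_s)
```

Since the verdict is expected within tens of seconds, an interval of a few seconds keeps request volume low; the timeout is a client-side guard and does not cancel the evaluation.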
Credits charged, as a decimal string. Set only once a billable terminal state is reached.
| Field | Type | Notes |
| --- | --- | --- |
| client_ref | string \| null | Your correlation label, echoed back unchanged. |
| run_id | string (uuid) \| null | The associated Runflow run, if you sent one. |
| eval_duration_ms | number \| null | How long the evaluation took, in milliseconds. |
| submitted_at / completed_at | string \| null | Lifecycle timestamps. |
The full reasoning tree (per-judge findings, gate failures, and the action detail) is available on GET /v1/evaluations/{id} through embed (for example ?embed=judges,action,gate_failures). Each embedded issue carries a category, a subcategory, and a detail string; the top-level top_issues above is just the summary label list. The flat fields are enough for most integrations.
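As a rough sketch, the embed query string can be assembled with the standard library. The helper name `evaluation_path` is hypothetical, and only the embed values named above (judges, action, gate_failures) come from this page; the function returns a path only, so prepend your API base URL yourself.

```python
from urllib.parse import urlencode


def evaluation_path(evaluation_id: str, embed: tuple = ()) -> str:
    """Build the GET path for one evaluation, optionally with ?embed=..."""
    path = f"/v1/evaluations/{evaluation_id}"
    if not embed:
        return path
    # safe="," keeps the commas literal, matching ?embed=judges,action,gate_failures
    query = urlencode({"embed": ",".join(embed)}, safe=",")
    return f"{path}?{query}"
```

Calling `evaluation_path("abc", ("judges", "action", "gate_failures"))` yields `/v1/evaluations/abc?embed=judges,action,gate_failures`.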
Issue categories are discoverable per resource. For Runflow models, GET /v1/models/{owner}/{slug}/evaluation-issue-categories returns the distinct (category, subcategory) pairs seen across that model’s evaluations, which is handy for building filters.
A submit-and-poll integration needs both evaluations:create (to submit) and evaluations:read (to read the result back). Create a key with both from the API keys settings. The Submit an image for evaluation guide walks through a full submit-and-read cycle.
is_positive must be present in the request body, but its value may be true (👍), false (👎), or null (which clears an existing rating). An optional reason string can explain the rating. Feedback needs the evaluations:edit scope, and both org API keys and users holding it can rate, so an API-first integration can submit feedback without a logged-in user. The rating is scoped to the evaluation’s organization and works for run-less and run-scoped evaluations alike.
Existing API keys are not granted evaluations:edit retroactively (least privilege); add the scope to a key, or mint a new one, from the API keys settings.
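The "present but possibly null" rule is easy to get wrong when a client drops null fields before serializing. A minimal sketch of a body builder follows; `feedback_body` is a hypothetical helper, and only the is_positive and reason field names come from this page (the feedback endpoint itself is not shown here, so the sketch stops at the JSON body).

```python
import json
from typing import Optional

_MISSING = object()  # sentinel: distinguishes "not passed" from an explicit None


def feedback_body(is_positive=_MISSING, reason: Optional[str] = None) -> str:
    """Serialize a feedback body; is_positive is always emitted, even as null."""
    if is_positive is _MISSING:
        raise ValueError("is_positive must be present (true, false, or null)")
    if is_positive is not None and not isinstance(is_positive, bool):
        raise TypeError("is_positive must be a bool or None")
    body = {"is_positive": is_positive}
    if reason is not None:
        body["reason"] = reason
    return json.dumps(body)
```

Passing `None` explicitly produces `{"is_positive": null}`, which is the form that clears an existing rating.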
Each evaluation runs under a job class that sets its price. Discover the active classes and their prices at runtime rather than reading them from docs:
The price is read once, at submission, and frozen on the evaluation. A later price change never alters what an already-submitted evaluation costs. Send the class on submit with the optional job_class field; omit it to use the default. Today standard is the only active class.
You are charged once, at a terminal state, for the frozen price:
| Terminal outcome | Billed? |
| --- | --- |
| completed | Yes |
| failed with failure_code: processing_failed | Yes (the evaluation ran and incurred cost) |
| failed with dispatch_failed, timed_out, or invalid_media | No |
Submission runs a balance pre-flight against the frozen price and returns 402 if you cannot cover it. There is no hold or reservation; the charge is applied at the terminal write.
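For reconciling your own ledger against these rules, the billing table can be mirrored in a few lines. This is an illustrative client-side sketch only; `is_billed` and `expected_charge` are hypothetical helpers, and the actual charge is whatever Runflow reports on the evaluation.

```python
from decimal import Decimal
from typing import Optional

BILLED_FAILURE_CODES = {"processing_failed"}
UNBILLED_FAILURE_CODES = {"dispatch_failed", "timed_out", "invalid_media"}


def is_billed(status: str, failure_code: Optional[str] = None) -> bool:
    """Return whether a terminal evaluation is charged, per the table above."""
    if status == "completed":
        return True
    if status == "failed":
        if failure_code in BILLED_FAILURE_CODES:
            return True
        if failure_code in UNBILLED_FAILURE_CODES:
            return False
        raise ValueError(f"unknown failure_code: {failure_code!r}")
    raise ValueError(f"not a terminal status: {status!r}")


def expected_charge(status: str, failure_code: Optional[str], frozen_price: str) -> Decimal:
    """Credits expected for a terminal evaluation; prices are decimal strings."""
    return Decimal(frozen_price) if is_billed(status, failure_code) else Decimal("0")
```

Parsing the price with `Decimal` rather than `float` avoids rounding drift when summing many charges.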
An inline data URI. Stored as a hosted asset on submission.
Up to 4 reference images are allowed (for example a source face or a target garment). task_type is required; generation_prompt is optional but improves prompt-adherence judging.
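These constraints can be checked client-side before submitting. In the sketch below, task_type and generation_prompt are field names from this page, while `image`, `reference_images`, and the helper `build_submission` are hypothetical placeholders; check the request schema for the real image field names.

```python
from typing import Optional, Sequence

MAX_REFERENCE_IMAGES = 4


def build_submission(
    image: str,
    task_type: str,
    reference_images: Sequence[str] = (),
    generation_prompt: Optional[str] = None,
) -> dict:
    """Assemble a submission body, enforcing the limits described above."""
    if not task_type:
        raise ValueError("task_type is required")
    if len(reference_images) > MAX_REFERENCE_IMAGES:
        raise ValueError(f"at most {MAX_REFERENCE_IMAGES} reference images are allowed")
    # "image" and "reference_images" are assumed key names, not confirmed ones.
    body = {"image": image, "task_type": task_type}
    if reference_images:
        body["reference_images"] = list(reference_images)
    if generation_prompt is not None:
        body["generation_prompt"] = generation_prompt
    return body
```

Failing fast here saves a round trip, but the server remains the authority on what it accepts.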
Pass callback_url on submit to receive a signed POST when the evaluation reaches a terminal state, instead of polling. The signing and retry mechanics are identical to run callbacks (HMAC Runflow-Signature, return 2xx fast, be idempotent), but the body is evaluation-specific:
| Field | Type | Notes |
| --- | --- | --- |
| event | string | evaluation.completed or evaluation.failed. |
| evaluation_id | string (uuid) | The evaluation. |
| status | string | completed or failed. |
| client_ref | string \| null | Your correlation label from the submission. |
| run_id | string (uuid) \| null | Associated run, if you attached one. |
| overall_passed | bool \| null | Final verdict. |
| weighted_pass_rate | number \| null | Score, 0.0 to 1.0. |
| top_issues / top_strengths | string[] \| null | Summary labels. |
| primary_action_code | string \| null | Recommended action, when one is needed. |
| failure_code | string \| null | Set when status is failed. |
| completed_at | string | ISO 8601 terminal timestamp (+00:00, not Z). |
The callback carries the verdict summary plus correlation handles; fetch the full reasoning tree with GET /v1/evaluations/{id}. The guide shows a worked receiver.
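A receiver for this body might look roughly like the following. The signature scheme here is an assumption: the sketch treats Runflow-Signature as a hex-encoded HMAC-SHA256 of the raw request body, which may not match the real format, so confirm it against the run callback documentation. `signature_for`, `handle_callback`, and the in-memory `verdicts` store are hypothetical names.

```python
import hashlib
import hmac
import json
from datetime import datetime

verdicts: dict = {}  # evaluation_id -> summary; stands in for a real datastore


def signature_for(secret: bytes, raw_body: bytes) -> str:
    # ASSUMPTION: hex-encoded HMAC-SHA256 over the raw body. Verify the real scheme.
    return hmac.new(secret, raw_body, hashlib.sha256).hexdigest()


def handle_callback(secret: bytes, raw_body: bytes, signature_header: str) -> int:
    """Return the HTTP status code to send back for one callback delivery."""
    expected = signature_for(secret, raw_body)
    # Constant-time comparison avoids leaking how much of the signature matched.
    if not hmac.compare_digest(expected.encode(), signature_header.encode()):
        return 401
    payload = json.loads(raw_body)
    evaluation_id = payload["evaluation_id"]
    if evaluation_id in verdicts:
        return 200  # duplicate delivery: acknowledge without reprocessing
    verdicts[evaluation_id] = {
        "status": payload["status"],
        "overall_passed": payload.get("overall_passed"),
        "failure_code": payload.get("failure_code"),
        # fromisoformat accepts the +00:00 offset used by completed_at.
        "completed_at": datetime.fromisoformat(payload["completed_at"]),
    }
    return 200
```

Keying on evaluation_id makes redelivery harmless, and the handler does no slow work before returning, in line with the "return 2xx fast, be idempotent" guidance.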
POST /v1/evaluations honors the Idempotency-Key header. Send a unique key per logical submission so a retried request does not create a second evaluation (and a second charge). client_ref is a correlation label echoed back in responses and callbacks; it is not an idempotency key.
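One way to apply this is to generate the key once per logical submission and reuse it on every retry. The sketch below assumes a caller-supplied `send(body, headers)` function that performs the actual POST and raises ConnectionError on transport failure; `submit_with_retries` and `send` are hypothetical names, and only the Idempotency-Key header comes from this page.

```python
import uuid
from typing import Callable


def submit_with_retries(
    send: Callable[[dict, dict], dict],
    body: dict,
    max_attempts: int = 3,
) -> dict:
    """Retry a submission while reusing one Idempotency-Key across attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # Generated once, outside the loop, so every retry carries the same key.
    headers = {"Idempotency-Key": str(uuid.uuid4())}
    last_error = None
    for _ in range(max_attempts):
        try:
            return send(body, headers)
        except ConnectionError as exc:
            last_error = exc
    raise last_error
```

Each call to `submit_with_retries` is a new logical submission and gets a fresh key; client_ref can still travel in the body as a correlation label, but it plays no part in deduplication.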