Skip to content

Enhancement/evaluation fast#870

Open
AkhileshNegi wants to merge 11 commits into
mainfrom
enhancement/evaluation-fast
Open

Enhancement/evaluation fast#870
AkhileshNegi wants to merge 11 commits into
mainfrom
enhancement/evaluation-fast

Conversation

@AkhileshNegi
Copy link
Copy Markdown
Collaborator

@AkhileshNegi AkhileshNegi commented May 20, 2026

Summary

Target issue is #741

The current evaluation pipeline runs every text evaluation through the OpenAI
Batch API, which can take up to 24 hours to complete. That turnaround is fine
for large golden sets but makes prompt/config iteration painfully slow.

This PR adds a second execution mode — run_mode="fast" — to POST /evaluations.
For small datasets, the run is executed synchronously through the OpenAI
Responses API
instead of a batch job, returning generated outputs, cosine
similarity, and Langfuse traces in seconds-to-minutes. The Batch API path is
unchanged and remains the default for larger datasets.

What's included

API

  • POST /evaluations accepts a new run_mode field ("batch" default, or "fast").
    Fast requests are validated and dispatched to a dedicated orchestrator task.
  • GET /evaluations/datasets gains an eligible_for=fast filter and surfaces an
    eligible_for_fast flag on each dataset.
  • GET /evaluations/{id} now returns run_mode so clients can tell the two paths apart.

Fast-eval orchestrator (crud/evaluations/fast.py, services/evaluations/fast.py)

  • A single Celery task (run_evaluation_fast) on a new evaluations queue (priority 6)
    drives the run through stages:
  • Stage 1 — Responses: parallel OpenAI Responses calls (bounded concurrency),
    persisted as one JSON unit to S3.
  • Stage 2 — Embeddings: parallel Embeddings calls for each (output, ground_truth)
    pair, persisted to S3.
  • Stage 3 — Score/trace/cost: cosine similarity, Langfuse traces, and per-stage
    cost attribution.
  • Stage 4 — Mark completed.
  • Each stage's completion is recorded by the presence of a batch_job row
    (job_type="evaluation_fast" / "embedding_fast"), so an orchestrator retry skips
    any stage that already succeeded and never re-calls OpenAI for finished work.
    OpenAI errors; permanent errors fail the item immediately.
  • A configurable failure threshold marks the whole run failed if too many items fail.
  • Fast eval calls the OpenAI client directly — it does not route through
    execute_llm_call() and writes no llm_call rows; per-item details live in the
    S3 unit and per-run cost lives on batch_job.

Schema / config

  • Migration 064 adds evaluation_run.run_mode (NOT NULL, default 'batch',
    existing rows backfilled) and a UNIQUE (organization_id, project_id, run_name)
    constraint (guards double-click/retry races → 409). The unique index is built
    CONCURRENTLY. No new tables.
  • New settings: EVAL_FAST_MAX_UNIQUE_ROWS (10), EVAL_FAST_FAILURE_THRESHOLD (0.5),
    EVAL_FAST_API_CONCURRENCY (10).

Scope / constraints (Phase 1)

  • Text evaluations only (STT/TTS unchanged).
  • Fast mode is restricted to datasets with ≤10 unique rows (≤50 total items at the
    default duplication factor). Larger datasets must continue to use run_mode="batch".
  • Responses/Embeddings billed at standard (non-batch) OpenAI rates.

Checklist

Before submitting a pull request, please ensure that you mark these task.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and test.
  • If you've fixed a bug or added code that is tested and has test cases.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 20, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a synchronous "fast" evaluation mode with run_mode persistence, dataset eligibility/filtering, API/body changes, Celery enqueue/queue, a four-stage fast pipeline (responses → embeddings → scoring), a DB migration, and tests.

Changes

Fast Evaluation Pipeline

Layer / File(s) Summary
Run mode schema, uniqueness constraint, and persistence
backend/app/models/evaluation.py, backend/app/alembic/versions/064_add_run_mode_to_evaluation_run.py, backend/app/crud/evaluations/core.py, backend/app/crud/evaluations/__init__.py, backend/app/models/__init__.py
Adds RunModeEnum, run_mode column on EvaluationRun (default batch), eligible_for_fast on dataset responses, a uniqueness constraint on (organization_id, project_id, run_name), migration with deduplication, and re-exports fast CRUD symbols.
Config, service re-exports, and docs
backend/app/core/config.py, backend/app/services/evaluations/__init__.py, backend/app/api/docs/evaluation/create_evaluation.md, backend/app/api/docs/evaluation/list_datasets.md
Adds EVAL_FAST_* settings, re-exports fast service helpers, and updates docs for run_mode, eligible_for_fast, and the eligible_for=fast query parameter.
API routes, request param, and dataset eligibility
backend/app/api/routes/evaluations/evaluation.py, backend/app/api/routes/evaluations/dataset.py
Adds run_mode request field to evaluation creation, wires validate_and_start_fast_evaluation with correlation_id trace, computes eligible_for_fast in dataset responses, and implements eligible_for=fast server-side filter.
Evaluation run conflict handling
backend/app/services/evaluations/evaluation.py
Adds create_evaluation_run_or_409 to catch DB IntegrityError on duplicate run_name, rollback, and return HTTP 409 with a run-name-collision message; used by evaluation start.
Fast-mode service and worker entry
backend/app/services/evaluations/fast.py
Implements is_dataset_fast_eligible, validate_and_start_fast_evaluation (validate dataset/config/type/eligibility, create fast run, enqueue task), and execute_fast_evaluation worker entry that constructs clients and calls the CRUD fast pipeline with failure handling.
Celery task registration, queue, and enqueue helpers
backend/app/celery/celery_app.py, backend/app/celery/tasks/evaluation_fast.py, backend/app/celery/utils.py
Registers app.celery.tasks.evaluation_fast, adds an evaluations priority queue, defines run_evaluation_fast Celery task (trace propagation + call execute_fast_evaluation), and start_fast_evaluation enqueue helper.
Fast evaluation four-stage processing pipeline
backend/app/crud/evaluations/fast.py
Implements four-stage fast pipeline: (1) per-item OpenAI Responses with retry/backoff and S3 persistence, (2) per-item embeddings for successful items, (3) cosine similarity scoring + Langfuse traces and cost attachment, (4) completion; includes idempotent stage skipping, failure-threshold checks, and end-to-end orchestration run_fast_evaluation.
Comprehensive fast evaluation test coverage
backend/app/tests/api/routes/test_evaluation_fast.py
Adds unit and integration-style tests for retry logic, failure-threshold behavior, POST/GET route validation for fast mode (including duplicate-run rejection), dataset eligible filtering, stage skipping on retry, mocked end-to-end pipeline assertions, and failure-threshold short-circuit.

Sequence Diagram

sequenceDiagram
  participant Client
  participant EvalRoute as POST /evaluations
  participant ValidateService as validate_and_start_fast_evaluation
  participant CeleryEnqueue as start_fast_evaluation
  participant CeleryWorker as run_evaluation_fast
  participant ExecuteService as execute_fast_evaluation
  participant CRUDPipeline as run_fast_evaluation
  participant OpenAI as OpenAI APIs
  participant Langfuse as Langfuse
  Client->>EvalRoute: POST with run_mode="fast"
  EvalRoute->>ValidateService: validate config & dataset, check eligibility
  ValidateService->>ValidateService: create EvaluationRun (run_mode=fast)
  ValidateService->>CeleryEnqueue: enqueue task (trace_id)
  CeleryEnqueue-->>EvalRoute: task enqueued
  EvalRoute-->>Client: 201 with run_mode="fast"
  CeleryEnqueue->>CeleryWorker: task queued on evaluations queue
  CeleryWorker->>ExecuteService: execute_fast_evaluation(eval_run_id)
  ExecuteService->>CRUDPipeline: run_fast_evaluation orchestration
  CRUDPipeline->>OpenAI: Stage1 per-item Responses
  OpenAI-->>CRUDPipeline: responses + usage
  CRUDPipeline->>OpenAI: Stage2 per-item Embeddings
  OpenAI-->>CRUDPipeline: embeddings + usage
  CRUDPipeline->>Langfuse: create traces and attach scores
  CRUDPipeline->>CRUDPipeline: aggregate scores, attach costs, mark completed
  ExecuteService-->>CeleryWorker: run completed (or failed)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested labels

enhancement

Suggested reviewers

  • Prajna1999
  • vprashrex
  • nishika26

Poem

🐰 I hopped through code with a carrot and key,
Fast runs now sprint where the batch used to be.
Responses then embeddings, a quick little dance,
Cosine scores gleam — give fast mode a chance.
The rabbit cheers softly — dispatch and advance.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 72.88% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'Enhancement/evaluation fast' is vague and uses non-descriptive formatting (slash convention) that does not clearly convey the specific change being made. Revise to a more descriptive single sentence that clearly explains the main feature, e.g., 'Add fast execution mode for text evaluations' or 'Introduce synchronous fast evaluation pipeline with run_mode parameter'.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch enhancement/evaluation-fast

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@sentry
Copy link
Copy Markdown

sentry Bot commented May 20, 2026

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 3, 2026

OpenAPI changes   🟢 8 non-breaking changes

Tip

Safe to merge from an API-contract perspective.

Full changelog  ·  8
Method Path Change
🟢 GET /api/v1/evaluations added the required property data/anyOf[subschema #1]/items/run_mode to the response with the 200 status
🟢 POST /api/v1/evaluations added the new optional request property run_mode
🟢 POST /api/v1/evaluations added the required property data/anyOf[subschema #1: EvaluationRunPublic]/run_mode to the response with the 200 status
🟢 GET /api/v1/evaluations/datasets added the new optional query request parameter eligible_for
🟢 GET /api/v1/evaluations/datasets added the optional property data/anyOf[subschema #1]/items/eligible_for_fast to the response with the 200 status
🟢 POST /api/v1/evaluations/datasets added the optional property data/anyOf[subschema #1: DatasetUploadResponse]/eligible_for_fast to the response with the 200 status
🟢 GET /api/v1/evaluations/datasets/{dataset_id} added the optional property data/anyOf[subschema #1: DatasetUploadResponse]/eligible_for_fast to the response with the 200 status
🟢 GET /api/v1/evaluations/{evaluation_id} added the required property data/anyOf[subschema #1: EvaluationRunPublic]/run_mode to the response with the 200 status

main26094ffc · generated by oasdiff

@AkhileshNegi AkhileshNegi self-assigned this Jun 3, 2026
@AkhileshNegi AkhileshNegi marked this pull request as ready for review June 3, 2026 06:02
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
backend/app/crud/evaluations/core.py (1)

65-92: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Narrow and centralize run_mode now.

run_mode: str = "batch" lets arbitrary values reach evaluation_run, and this CRUD helper is now part of the shared fast/batch creation path. Use a shared RunModeEnum/Literal from the model layer instead of a free-form string so invalid rows cannot be persisted.

Proposed fix
-from app.models import EvaluationRun, EvaluationRunUpdate
+from app.models import EvaluationRun, EvaluationRunUpdate, RunModeEnum
...
 def create_evaluation_run(
     session: Session,
     run_name: str,
     dataset_name: str,
     dataset_id: int,
     config_id: UUID,
     config_version: int,
     organization_id: int,
     project_id: int,
-    run_mode: str = "batch",
+    run_mode: RunModeEnum = RunModeEnum.BATCH,
 ) -> EvaluationRun:
...
         status="pending",
-        run_mode=run_mode,
+        run_mode=run_mode.value,
         organization_id=organization_id,
         project_id=project_id,

As per coding guidelines, "Do not use magic values in code — extract repeated literals to constants, Enum, or settings" and "Add type hints on every parameter and return value in Python code."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/crud/evaluations/core.py` around lines 65 - 92, The run_mode
parameter on the function that creates EvaluationRun must be typed and
constrained to the shared RunMode enum/literal from the model layer instead of a
free-form str to prevent invalid values persisting; update the function
signature to use the model's RunModeEnum/RunModeLiteral type (replace run_mode:
str = "batch"), validate/convert any incoming value to that enum before
constructing EvaluationRun, and set EvaluationRun.run_mode using the enum value
(referencing the EvaluationRun constructor and the run_mode parameter) so only
allowed enum members are stored; also add the appropriate return type hint if
missing.
backend/app/api/routes/evaluations/dataset.py (1)

121-133: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Apply the fast-eligibility filter before pagination.

list_evaluation_datasets(...) already consumed limit and offset before this in-memory filter runs, so /evaluations/datasets?eligible_for=fast can return short pages and miss eligible datasets that are pushed onto later pages by ineligible rows. This needs to move into the CRUD query so paging stays correct.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/api/routes/evaluations/dataset.py` around lines 121 - 133, The
current in-memory filtering of eligible_for == "fast" after calling
list_evaluation_datasets causes incorrect pagination; instead, extend the
list_evaluation_datasets call and underlying CRUD/query function to accept an
eligibility filter (e.g., a parameter like eligible_for_fast or eligible_for)
and apply that condition in the DB query (not in Python after fetching); update
the route to pass eligible_for=="fast" into list_evaluation_datasets (instead of
filtering responses afterwards) and remove the post-query list comprehension
that filters by r.eligible_for_fast so paging is driven by the DB-level WHERE
clause.
🧹 Nitpick comments (5)
backend/app/models/evaluation.py (1)

56-59: ⚡ Quick win

Tighten the eligible_for_fast description.

This description hardcodes 10 and reads as if the flag means the dataset is fully runnable in fast mode, but the fast-start flow still checks other prerequisites like a Langfuse dataset id. Please describe it as the dataset-size predicate and source the threshold from settings so the OpenAPI text cannot drift. As per coding guidelines, "Do not use magic values in code — extract repeated literals to constants, Enum, or settings".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/models/evaluation.py` around lines 56 - 59, The Field description
for eligible_for_fast currently hardcodes "10" and implies readiness for fast
run mode; update the description to state this flag represents only the
dataset-size predicate (e.g., "True if dataset has a number of unique rows at or
below the fast-start threshold") and reference the threshold from settings
instead of a magic literal — for example mention settings.FAST_START_MAX_ROWS
(or the project's equivalent constant) so the OpenAPI text stays accurate and
doesn't promise other prerequisites like Langfuse dataset ID; change the text on
the eligible_for_fast Field in evaluation.py accordingly.
backend/app/services/evaluations/fast.py (1)

70-74: ⚡ Quick win

Normalize the new log lines to the repo format.

These messages use key=value; the repo convention requires | key: value after the prefix. Keeping the format consistent matters if downstream parsing or alerting expects the documented shape.

Example cleanup
-    logger.info(
-        f"[validate_and_start_fast_evaluation] Starting fast eval | "
-        f"run_name={run_name} | dataset_id={dataset_id} | "
-        f"org_id={organization_id} | project_id={project_id}"
-    )
+    logger.info(
+        f"[validate_and_start_fast_evaluation] Starting fast eval | "
+        f"run_name: {run_name} | dataset_id: {dataset_id} | "
+        f"org_id: {organization_id} | project_id: {project_id}"
+    )

As per coding guidelines, "**/*.py: Every log line must start with the function name in square brackets, following the format: logger.info(f\"[function_name] Message | key: {value}\")"

Also applies to: 152-155, 179-182, 212-275

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/services/evaluations/fast.py` around lines 70 - 74, The log lines
in validate_and_start_fast_evaluation (and the other occurrences flagged at
ranges 152-155, 179-182, 212-275) use "key=value" but must follow the repo
convention of starting with the function name in square brackets and using " |
key: {value}" pairs; update the logger.info/logger.debug calls (e.g., the logger
call shown and the other flagged calls) to use the prefix
"[validate_and_start_fast_evaluation]" and format each metadata item as " | key:
{variable}" (for example " | run_name: {run_name} | dataset_id: {dataset_id} |
org_id: {organization_id} | project_id: {project_id}"), and apply the same
normalization to all other logged messages in that function and the other
flagged blocks so they match the documented log shape.
backend/app/api/routes/evaluations/evaluation.py (1)

56-61: ⚡ Quick win

Use the standard structured log format here as well.

This new route log uses key=value, which diverges from the repo’s required | key: value pattern.

As per coding guidelines, "**/*.py: Every log line must start with the function name in square brackets, following the format: logger.info(f\"[function_name] Message | key: {value}\")"

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/api/routes/evaluations/evaluation.py` around lines 56 - 61, The
log call in the evaluation route should follow the project's structured format;
update the logger.info invocation (the one currently using f"[evaluate] Starting
evaluation | run_mode={run_mode.value} | ...") to start with the function name
in square brackets and use " | key: {value}" pairs (e.g., " | run_mode:
{run_mode.value}", " | experiment_name: {experiment_name}", " | dataset_id:
{dataset_id}", " | org_id: {auth_context.organization_.id}", " | project_id:
{auth_context.project_.id}") so the message matches the required pattern; keep
the existing logger.info and f-string but change the separators and keys to the
"| key: {value}" format.
backend/app/celery/utils.py (1)

210-213: ⚡ Quick win

Use the repo log format for the enqueue log too.

This helper follows the right prefix, but the payload still uses key=value instead of | key: value.

As per coding guidelines, "**/*.py: Every log line must start with the function name in square brackets, following the format: logger.info(f\"[function_name] Message | key: {value}\")"

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/celery/utils.py` around lines 210 - 213, Update the log line in
start_fast_evaluation to follow the repo log format: keep the
"[start_fast_evaluation]" prefix but change the payload from
"eval_run_id={eval_run_id} | task_id={task_id}" to use the pipe-colon style like
"| eval_run_id: {eval_run_id} | task_id: {task_id}" so the logger.info call
(logger.info(...)) matches the required "[function_name] Message | key: {value}"
pattern.
backend/app/celery/tasks/evaluation_fast.py (1)

43-46: ⚡ Quick win

Match the repository log format in the task log.

This log line still uses key=value instead of the documented | key: value format.

As per coding guidelines, "**/*.py: Every log line must start with the function name in square brackets, following the format: logger.info(f\"[function_name] Message | key: {value}\")"

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/celery/tasks/evaluation_fast.py` around lines 43 - 46, Update the
logger.info call in run_evaluation_fast to follow repository log format: start
the message with the function name in square brackets (e.g.,
"[run_evaluation_fast]") and use " | key: value" pairs instead of "key=value".
Concretely, modify the logger.info in evaluation_fast.py that currently logs
eval_run_id and current_task.request.id to use the pattern
f"[run_evaluation_fast] Starting fast evaluation task | eval_run_id:
{eval_run_id} | task_id: {current_task.request.id}" so the function name and
keys match the documented style.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.claude/agents/convention-reviewer.md:
- Around line 125-137: The fenced code block for the example (the
triple-backtick block that begins before "## Summary") is missing a language
identifier and triggers markdownlint MD040; update the opening fence to include
a language (e.g., change the opening "```" to "```markdown") so the block is
explicitly marked as Markdown and CI/pre-commit linting will pass; locate the
block in .claude/agents/convention-reviewer.md around the example header and
replace the fence accordingly.

In @.claude/agents/migration-writer.md:
- Line 12: Remove the hardcoded "060 → 061" baseline from the migration rule
text that follows the "ls backend/app/alembic/versions/" guidance; instead state
that the next revision id is "the highest existing NNN from NNN_*.py plus one
(zero-padded to 3 digits)" and instruct authors to derive it dynamically from
the current files; move any concrete numeric example into a clearly labeled
"historical example" section (if kept) so the main rule never contains hardcoded
numbers.

In `@backend/app/alembic/versions/063_add_run_mode_to_evaluation_run.py`:
- Around line 69-86: The current migration (in
063_add_run_mode_to_evaluation_run.py) directly deletes duplicate rows via
op.execute inside the autocommit_block, which drops historical runs; instead
replace this destructive DELETE with a safe approach: detect duplicates and
either abort the migration with a clear precheck (raise/exit and log the
duplicate ids from evaluation_run) or move/rename duplicates into an archive
(create evaluation_run_archive and INSERT duplicate rows there or UPDATE
run_name to append a unique suffix), then enforce the uniqueness constraint;
locate the SQL executed in the op.execute call and change it to a
non-destructive SELECT/INSERT/UPDATE precheck or archival operation, or fail
fast listing the offending (organization_id, project_id, run_name, id) so
cleanup is explicit.

In `@backend/app/api/docs/evaluation/create_evaluation.md`:
- Around line 42-48: The "Fast-mode error responses" section incorrectly implies
the 409 run_name_already_exists is fast-only; update the docs so only the 422
entries are labeled as fast-specific — either rename the header (e.g., to "Error
responses" and move 409 into a general errors table) or split into two tables
(one "Fast-mode error responses" containing only the 422 rows
config_type_unsupported and dataset_too_large_for_fast, and a separate "General
error responses" that includes 409 run_name_already_exists). Ensure you update
the table rows and header text in create_evaluation.md so the 409 remains
documented but not limited to fast-mode.

In `@backend/app/celery/tasks/evaluation_fast.py`:
- Around line 26-28: run_evaluation_fast currently uses untyped self and an
unused **kwargs and a magic default "N/A" for trace_id; fix by typing self as
celery.app.task.Task (or from celery import Task and use self: Task), either
remove **kwargs if it's not used or type it explicitly (e.g. **kwargs: Any or
kwargs: Mapping[str, Any]), and replace the magic literal with a constant like
UNKNOWN_TRACE_ID = "unknown" referenced in the function signature (trace_id: str
= UNKNOWN_TRACE_ID) so the function signature becomes e.g. def
run_evaluation_fast(self: Task, eval_run_id: int, trace_id: str =
UNKNOWN_TRACE_ID) -> None (or include typed kwargs if needed).

In `@backend/app/crud/evaluations/fast.py`:
- Around line 410-505: The current check-then-create flow around
eval_run.batch_job_id (used with get_batch_job, create_batch_job,
update_evaluation_run) is racy; before calling external OpenAI work
(_responses_call_for_item loop, _upload_unit_to_s3) acquire an atomic claim on
the stage by doing a compare-and-set or row-lock on the EvaluationRun (e.g.,
issue a DB UPDATE that sets a stage marker / provisional batch_job_id or
stage_claim where batch_job_id IS NULL and return the row, or SELECT FOR UPDATE
then re-check and set) so only one worker proceeds to call OpenAI and write the
final batch_job; if the atomic claim fails, reload the existing batch job via
get_batch_job and return early. Ensure you update the same fields via
update_evaluation_run (or the DB transaction) so the marker and final
batch_job_id are consistent.

In `@backend/app/models/evaluation.py`:
- Around line 303-308: The run_mode SQLField is currently free-form; enforce
allowed values ("batch" | "fast") at the model/persistence layer: change the
run_mode declaration in the Evaluation model to use a constrained enum/SQL CHECK
(or SQLAlchemy Enum) rather than a plain string and update related code paths
(e.g., backend/app/crud/evaluations/core.py:create_evaluation_run and any
callers) to accept/validate the same enum type so invalid values are rejected
before persistence and discovered consistently (this keeps
execute_fast_evaluation from receiving unexpected values).

In `@backend/app/services/evaluations/fast.py`:
- Around line 124-133: The code treats a missing
dataset.dataset_metadata["original_items_count"] as 0, letting malformed/old
datasets bypass the fast-mode size check; change the logic in the block that
computes original_items_count (and before calling is_dataset_fast_eligible) to
detect a missing/None value from dataset.dataset_metadata and reject the request
with an HTTPException (422) referencing ERR_DATASET_TOO_LARGE_FOR_FAST or a
clear message instead of defaulting to 0; ensure you still call
is_dataset_fast_eligible only when original_items_count is present and use
settings.EVAL_FAST_MAX_UNIQUE_ROWS in the error detail as currently done.
- Around line 138-164: The current except block around
create_evaluation_run(session, ...) catches every IntegrityError and always maps
it to the run-name uniqueness 409; change it to catch IntegrityError as e and
inspect the underlying DB error (e.orig / str(e) / e.orig.diag/pgcode) to detect
the specific unique constraint for run_name (the constraint name or unique
violation text), only convert that case to the HTTP 409 with
ERR_RUN_NAME_ALREADY_EXISTS; for any other IntegrityError (e.g., FK violations
from dataset/config deletions) re-raise or return an HTTP 500/propagate the
original error so the real failure isn’t masked. Ensure the change is applied in
validate_and_start_fast_evaluation around the create_evaluation_run call.

In `@backend/app/tests/api/routes/test_evaluation_fast.py`:
- Around line 41-145: Add explicit return type annotations for all fixtures,
test methods, and helper builders in this module: annotate pytest fixtures that
yield (e.g. _seeded_random, _patch_dispatch) with a generator/iterator type
(e.g. -> Iterator[None] or the appropriate Iterator[MagicMock]) and annotate
plain fixtures/tests (all test_* methods in TestFailureThreshold and
TestCallWithRetry) with -> None; for helper builders (e.g.
_make_fast_eligible_dataset and _make_text_openai_config) add the concrete
return types that match what create_test_evaluation_dataset and
create_test_config return (use the actual model/class names from the codebase
rather than Any), and ensure all function parameters keep their existing
annotations while adding these return annotations.

In `@CLAUDE.md`:
- Line 24: Replace the hardcoded Alembic revision ID in the example command so
contributors don’t copy a stale value; update the example that currently shows
`--rev-id 061` to use a placeholder like `<next_rev_id>` (or remove the flag)
and add a short note that the rev-id should be computed as the latest existing
revision ID + 1 at runtime before running `alembic revision --autogenerate -m
"Description"`, ensuring the docs no longer hardcode a numeric revision.

---

Outside diff comments:
In `@backend/app/api/routes/evaluations/dataset.py`:
- Around line 121-133: The current in-memory filtering of eligible_for == "fast"
after calling list_evaluation_datasets causes incorrect pagination; instead,
extend the list_evaluation_datasets call and underlying CRUD/query function to
accept an eligibility filter (e.g., a parameter like eligible_for_fast or
eligible_for) and apply that condition in the DB query (not in Python after
fetching); update the route to pass eligible_for=="fast" into
list_evaluation_datasets (instead of filtering responses afterwards) and remove
the post-query list comprehension that filters by r.eligible_for_fast so paging
is driven by the DB-level WHERE clause.

In `@backend/app/crud/evaluations/core.py`:
- Around line 65-92: The run_mode parameter on the function that creates
EvaluationRun must be typed and constrained to the shared RunMode enum/literal
from the model layer instead of a free-form str to prevent invalid values
persisting; update the function signature to use the model's
RunModeEnum/RunModeLiteral type (replace run_mode: str = "batch"),
validate/convert any incoming value to that enum before constructing
EvaluationRun, and set EvaluationRun.run_mode using the enum value (referencing
the EvaluationRun constructor and the run_mode parameter) so only allowed enum
members are stored; also add the appropriate return type hint if missing.

---

Nitpick comments:
In `@backend/app/api/routes/evaluations/evaluation.py`:
- Around line 56-61: The log call in the evaluation route should follow the
project's structured format; update the logger.info invocation (the one
currently using f"[evaluate] Starting evaluation | run_mode={run_mode.value} |
...") to start with the function name in square brackets and use " | key:
{value}" pairs (e.g., " | run_mode: {run_mode.value}", " | experiment_name:
{experiment_name}", " | dataset_id: {dataset_id}", " | org_id:
{auth_context.organization_.id}", " | project_id: {auth_context.project_.id}")
so the message matches the required pattern; keep the existing logger.info and
f-string but change the separators and keys to the "| key: {value}" format.

In `@backend/app/celery/tasks/evaluation_fast.py`:
- Around line 43-46: Update the logger.info call in run_evaluation_fast to
follow repository log format: start the message with the function name in square
brackets (e.g., "[run_evaluation_fast]") and use " | key: value" pairs instead
of "key=value". Concretely, modify the logger.info in evaluation_fast.py that
currently logs eval_run_id and current_task.request.id to use the pattern
f"[run_evaluation_fast] Starting fast evaluation task | eval_run_id:
{eval_run_id} | task_id: {current_task.request.id}" so the function name and
keys match the documented style.

In `@backend/app/celery/utils.py`:
- Around line 210-213: Update the log line in start_fast_evaluation to follow
the repo log format: keep the "[start_fast_evaluation]" prefix but change the
payload from "eval_run_id={eval_run_id} | task_id={task_id}" to use the
pipe-colon style like "| eval_run_id: {eval_run_id} | task_id: {task_id}" so the
logger.info call (logger.info(...)) matches the required "[function_name]
Message | key: {value}" pattern.

In `@backend/app/models/evaluation.py`:
- Around line 56-59: The Field description for eligible_for_fast currently
hardcodes "10" and implies readiness for fast run mode; update the description
to state this flag represents only the dataset-size predicate (e.g., "True if
dataset has a number of unique rows at or below the fast-start threshold") and
reference the threshold from settings instead of a magic literal — for example
mention settings.FAST_START_MAX_ROWS (or the project's equivalent constant) so
the OpenAPI text stays accurate and doesn't promise other prerequisites like
Langfuse dataset ID; change the text on the eligible_for_fast Field in
evaluation.py accordingly.

In `@backend/app/services/evaluations/fast.py`:
- Around line 70-74: The log lines in validate_and_start_fast_evaluation (and
the other occurrences flagged at ranges 152-155, 179-182, 212-275) use
"key=value" but must follow the repo convention of starting with the function
name in square brackets and using " | key: {value}" pairs; update the
logger.info/logger.debug calls (e.g., the logger call shown and the other
flagged calls) to use the prefix "[validate_and_start_fast_evaluation]" and
format each metadata item as " | key: {variable}" (for example " | run_name:
{run_name} | dataset_id: {dataset_id} | org_id: {organization_id} | project_id:
{project_id}"), and apply the same normalization to all other logged messages in
that function and the other flagged blocks so they match the documented log
shape.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 38aa7de9-77d7-4ccc-a875-9b5036f7cbf9

📥 Commits

Reviewing files that changed from the base of the PR and between 901517d and d13ab1c.

📒 Files selected for processing (27)
  • .claude/agents/celery-task-writer.md
  • .claude/agents/convention-reviewer.md
  • .claude/agents/crud-writer.md
  • .claude/agents/migration-writer.md
  • .claude/agents/model-writer.md
  • .claude/agents/route-writer.md
  • .claude/agents/service-writer.md
  • .claude/agents/test-writer.md
  • CLAUDE.md
  • backend/app/alembic/versions/063_add_run_mode_to_evaluation_run.py
  • backend/app/api/docs/evaluation/create_evaluation.md
  • backend/app/api/docs/evaluation/list_datasets.md
  • backend/app/api/routes/evaluations/dataset.py
  • backend/app/api/routes/evaluations/evaluation.py
  • backend/app/celery/celery_app.py
  • backend/app/celery/tasks/evaluation_fast.py
  • backend/app/celery/utils.py
  • backend/app/core/config.py
  • backend/app/crud/evaluations/__init__.py
  • backend/app/crud/evaluations/core.py
  • backend/app/crud/evaluations/fast.py
  • backend/app/models/__init__.py
  • backend/app/models/evaluation.py
  • backend/app/services/evaluations/__init__.py
  • backend/app/services/evaluations/evaluation.py
  • backend/app/services/evaluations/fast.py
  • backend/app/tests/api/routes/test_evaluation_fast.py

Comment thread .claude/agents/convention-reviewer.md Outdated
Comment thread .claude/agents/migration-writer.md Outdated
Comment thread backend/app/alembic/versions/064_add_run_mode_to_evaluation_run.py
Comment thread backend/app/api/docs/evaluation/create_evaluation.md
Comment thread backend/app/celery/tasks/evaluation_fast.py
Comment thread backend/app/models/evaluation.py
Comment thread backend/app/services/evaluations/fast.py Outdated
Comment thread backend/app/services/evaluations/fast.py Outdated
Comment thread backend/app/tests/api/routes/test_evaluation_fast.py
Comment thread CLAUDE.md Outdated
@AkhileshNegi AkhileshNegi linked an issue Jun 3, 2026 that may be closed by this pull request
The fast-evaluation PR inadvertently checked in the layer-specialist
agent definitions under .claude/agents/ and trimmed CLAUDE.md to delegate
to them. Revert that out of this PR: delete the agent files and restore
CLAUDE.md to its origin/main state so the two stay consistent.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/alembic/versions/064_add_run_mode_to_evaluation_run.py (1)

49-49: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add return type hints to migration entrypoints.

upgrade and downgrade should explicitly declare -> None to comply with the Python typing standard used in this repo.

Suggested patch
-def upgrade():
+def upgrade() -> None:
@@
-def downgrade():
+def downgrade() -> None:

As per coding guidelines, **/*.py: Always add type hints to all function parameters and return values in Python code.

Also applies to: 104-104

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/alembic/versions/064_add_run_mode_to_evaluation_run.py` at line
49, The migration entrypoints upgrade and downgrade are missing return type
annotations; update the function signatures for both upgrade() and downgrade()
to explicitly declare "-> None" (e.g., def upgrade() -> None: and def
downgrade() -> None:) so they comply with the repo's typing standard and the
guideline to annotate all Python function return types.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@backend/app/alembic/versions/064_add_run_mode_to_evaluation_run.py`:
- Line 49: The migration entrypoints upgrade and downgrade are missing return
type annotations; update the function signatures for both upgrade() and
downgrade() to explicitly declare "-> None" (e.g., def upgrade() -> None: and
def downgrade() -> None:) so they comply with the repo's typing standard and the
guideline to annotate all Python function return types.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b44f762e-68ae-4c6f-9fd7-85e86a6677b0

📥 Commits

Reviewing files that changed from the base of the PR and between d13ab1c and eb5b0a6.

📒 Files selected for processing (1)
  • backend/app/alembic/versions/064_add_run_mode_to_evaluation_run.py

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
backend/app/crud/evaluations/fast.py (1)

510-511: 💤 Low value

Redundant fallback in max_workers calculation.

The or 1 is unnecessary since max(1, ...) already guarantees the result is at least 1.

♻️ Suggested simplification
-    max_workers = max(
-        1, min(settings.EVAL_FAST_API_CONCURRENCY, len(embed_candidates) or 1)
-    )
+    max_workers = max(1, min(settings.EVAL_FAST_API_CONCURRENCY, len(embed_candidates)))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/crud/evaluations/fast.py` around lines 510 - 511, The max_workers
assignment redundantly uses "or 1" inside the min call; simplify the expression
by removing the "or 1" fallback and rely on the outer max(1, ...) to enforce a
minimum of 1—update the line that sets max_workers (which references
settings.EVAL_FAST_API_CONCURRENCY and embed_candidates) to use max(1,
min(settings.EVAL_FAST_API_CONCURRENCY, len(embed_candidates))) instead.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@backend/app/crud/evaluations/fast.py`:
- Around line 510-511: The max_workers assignment redundantly uses "or 1" inside
the min call; simplify the expression by removing the "or 1" fallback and rely
on the outer max(1, ...) to enforce a minimum of 1—update the line that sets
max_workers (which references settings.EVAL_FAST_API_CONCURRENCY and
embed_candidates) to use max(1, min(settings.EVAL_FAST_API_CONCURRENCY,
len(embed_candidates))) instead.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b0d2cbf4-d4dd-4de5-b362-d7ad943bea92

📥 Commits

Reviewing files that changed from the base of the PR and between 324777f and 8da52fe.

📒 Files selected for processing (2)
  • backend/app/crud/evaluations/fast.py
  • backend/app/tests/api/routes/test_evaluation_fast.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/app/tests/api/routes/test_evaluation_fast.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Evaluation: Speed up smaller runs

1 participant