Enhancement/evaluation fast#870
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a synchronous "fast" evaluation mode with run_mode persistence, dataset eligibility/filtering, API/body changes, Celery enqueue/queue, a four-stage fast pipeline (responses → embeddings → scoring), a DB migration, and tests. ChangesFast Evaluation Pipeline
Sequence DiagramsequenceDiagram
participant Client
participant EvalRoute as POST /evaluations
participant ValidateService as validate_and_start_fast_evaluation
participant CeleryEnqueue as start_fast_evaluation
participant CeleryWorker as run_evaluation_fast
participant ExecuteService as execute_fast_evaluation
participant CRUDPipeline as run_fast_evaluation
participant OpenAI as OpenAI APIs
participant Langfuse as Langfuse
Client->>EvalRoute: POST with run_mode="fast"
EvalRoute->>ValidateService: validate config & dataset, check eligibility
ValidateService->>ValidateService: create EvaluationRun (run_mode=fast)
ValidateService->>CeleryEnqueue: enqueue task (trace_id)
CeleryEnqueue-->>EvalRoute: task enqueued
EvalRoute-->>Client: 201 with run_mode="fast"
CeleryEnqueue->>CeleryWorker: task queued on evaluations queue
CeleryWorker->>ExecuteService: execute_fast_evaluation(eval_run_id)
ExecuteService->>CRUDPipeline: run_fast_evaluation orchestration
CRUDPipeline->>OpenAI: Stage1 per-item Responses
OpenAI-->>CRUDPipeline: responses + usage
CRUDPipeline->>OpenAI: Stage2 per-item Embeddings
OpenAI-->>CRUDPipeline: embeddings + usage
CRUDPipeline->>Langfuse: create traces and attach scores
CRUDPipeline->>CRUDPipeline: aggregate scores, attach costs, mark completed
ExecuteService-->>CeleryWorker: run completed (or failed)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
OpenAPI changes 🟢 8 non-breaking changesTip Safe to merge from an API-contract perspective. Full changelog ·
|
| Method | Path | Change | |
|---|---|---|---|
| 🟢 | GET |
/api/v1/evaluations |
added the required property data/anyOf[subschema #1]/items/run_mode to the response with the 200 status |
| 🟢 | POST |
/api/v1/evaluations |
added the new optional request property run_mode |
| 🟢 | POST |
/api/v1/evaluations |
added the required property data/anyOf[subschema #1: EvaluationRunPublic]/run_mode to the response with the 200 status |
| 🟢 | GET |
/api/v1/evaluations/datasets |
added the new optional query request parameter eligible_for |
| 🟢 | GET |
/api/v1/evaluations/datasets |
added the optional property data/anyOf[subschema #1]/items/eligible_for_fast to the response with the 200 status |
| 🟢 | POST |
/api/v1/evaluations/datasets |
added the optional property data/anyOf[subschema #1: DatasetUploadResponse]/eligible_for_fast to the response with the 200 status |
| 🟢 | GET |
/api/v1/evaluations/datasets/{dataset_id} |
added the optional property data/anyOf[subschema #1: DatasetUploadResponse]/eligible_for_fast to the response with the 200 status |
| 🟢 | GET |
/api/v1/evaluations/{evaluation_id} |
added the required property data/anyOf[subschema #1: EvaluationRunPublic]/run_mode to the response with the 200 status |
main ↔ 26094ffc · generated by oasdiff
There was a problem hiding this comment.
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
backend/app/crud/evaluations/core.py (1)
65-92:⚠️ Potential issue | 🟠 Major | ⚡ Quick winNarrow and centralize
run_modenow.
run_mode: str = "batch"lets arbitrary values reachevaluation_run, and this CRUD helper is now part of the shared fast/batch creation path. Use a sharedRunModeEnum/Literalfrom the model layer instead of a free-form string so invalid rows cannot be persisted.Proposed fix
-from app.models import EvaluationRun, EvaluationRunUpdate +from app.models import EvaluationRun, EvaluationRunUpdate, RunModeEnum ... def create_evaluation_run( session: Session, run_name: str, dataset_name: str, dataset_id: int, config_id: UUID, config_version: int, organization_id: int, project_id: int, - run_mode: str = "batch", + run_mode: RunModeEnum = RunModeEnum.BATCH, ) -> EvaluationRun: ... status="pending", - run_mode=run_mode, + run_mode=run_mode.value, organization_id=organization_id, project_id=project_id,As per coding guidelines, "Do not use magic values in code — extract repeated literals to constants,
Enum, or settings" and "Add type hints on every parameter and return value in Python code."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/crud/evaluations/core.py` around lines 65 - 92, The run_mode parameter on the function that creates EvaluationRun must be typed and constrained to the shared RunMode enum/literal from the model layer instead of a free-form str to prevent invalid values persisting; update the function signature to use the model's RunModeEnum/RunModeLiteral type (replace run_mode: str = "batch"), validate/convert any incoming value to that enum before constructing EvaluationRun, and set EvaluationRun.run_mode using the enum value (referencing the EvaluationRun constructor and the run_mode parameter) so only allowed enum members are stored; also add the appropriate return type hint if missing.backend/app/api/routes/evaluations/dataset.py (1)
121-133:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftApply the fast-eligibility filter before pagination.
list_evaluation_datasets(...)already consumedlimitandoffsetbefore this in-memory filter runs, so/evaluations/datasets?eligible_for=fastcan return short pages and miss eligible datasets that are pushed onto later pages by ineligible rows. This needs to move into the CRUD query so paging stays correct.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/api/routes/evaluations/dataset.py` around lines 121 - 133, The current in-memory filtering of eligible_for == "fast" after calling list_evaluation_datasets causes incorrect pagination; instead, extend the list_evaluation_datasets call and underlying CRUD/query function to accept an eligibility filter (e.g., a parameter like eligible_for_fast or eligible_for) and apply that condition in the DB query (not in Python after fetching); update the route to pass eligible_for=="fast" into list_evaluation_datasets (instead of filtering responses afterwards) and remove the post-query list comprehension that filters by r.eligible_for_fast so paging is driven by the DB-level WHERE clause.
🧹 Nitpick comments (5)
backend/app/models/evaluation.py (1)
56-59: ⚡ Quick winTighten the
eligible_for_fastdescription.This description hardcodes
10and reads as if the flag means the dataset is fully runnable in fast mode, but the fast-start flow still checks other prerequisites like a Langfuse dataset id. Please describe it as the dataset-size predicate and source the threshold from settings so the OpenAPI text cannot drift. As per coding guidelines, "Do not use magic values in code — extract repeated literals to constants,Enum, or settings".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/models/evaluation.py` around lines 56 - 59, The Field description for eligible_for_fast currently hardcodes "10" and implies readiness for fast run mode; update the description to state this flag represents only the dataset-size predicate (e.g., "True if dataset has a number of unique rows at or below the fast-start threshold") and reference the threshold from settings instead of a magic literal — for example mention settings.FAST_START_MAX_ROWS (or the project's equivalent constant) so the OpenAPI text stays accurate and doesn't promise other prerequisites like Langfuse dataset ID; change the text on the eligible_for_fast Field in evaluation.py accordingly.backend/app/services/evaluations/fast.py (1)
70-74: ⚡ Quick winNormalize the new log lines to the repo format.
These messages use
key=value; the repo convention requires| key: valueafter the prefix. Keeping the format consistent matters if downstream parsing or alerting expects the documented shape.Example cleanup
- logger.info( - f"[validate_and_start_fast_evaluation] Starting fast eval | " - f"run_name={run_name} | dataset_id={dataset_id} | " - f"org_id={organization_id} | project_id={project_id}" - ) + logger.info( + f"[validate_and_start_fast_evaluation] Starting fast eval | " + f"run_name: {run_name} | dataset_id: {dataset_id} | " + f"org_id: {organization_id} | project_id: {project_id}" + )As per coding guidelines, "
**/*.py: Every log line must start with the function name in square brackets, following the format:logger.info(f\"[function_name] Message | key: {value}\")"Also applies to: 152-155, 179-182, 212-275
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/services/evaluations/fast.py` around lines 70 - 74, The log lines in validate_and_start_fast_evaluation (and the other occurrences flagged at ranges 152-155, 179-182, 212-275) use "key=value" but must follow the repo convention of starting with the function name in square brackets and using " | key: {value}" pairs; update the logger.info/logger.debug calls (e.g., the logger call shown and the other flagged calls) to use the prefix "[validate_and_start_fast_evaluation]" and format each metadata item as " | key: {variable}" (for example " | run_name: {run_name} | dataset_id: {dataset_id} | org_id: {organization_id} | project_id: {project_id}"), and apply the same normalization to all other logged messages in that function and the other flagged blocks so they match the documented log shape.backend/app/api/routes/evaluations/evaluation.py (1)
56-61: ⚡ Quick winUse the standard structured log format here as well.
This new route log uses
key=value, which diverges from the repo’s required| key: valuepattern.As per coding guidelines, "
**/*.py: Every log line must start with the function name in square brackets, following the format:logger.info(f\"[function_name] Message | key: {value}\")"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/api/routes/evaluations/evaluation.py` around lines 56 - 61, The log call in the evaluation route should follow the project's structured format; update the logger.info invocation (the one currently using f"[evaluate] Starting evaluation | run_mode={run_mode.value} | ...") to start with the function name in square brackets and use " | key: {value}" pairs (e.g., " | run_mode: {run_mode.value}", " | experiment_name: {experiment_name}", " | dataset_id: {dataset_id}", " | org_id: {auth_context.organization_.id}", " | project_id: {auth_context.project_.id}") so the message matches the required pattern; keep the existing logger.info and f-string but change the separators and keys to the "| key: {value}" format.backend/app/celery/utils.py (1)
210-213: ⚡ Quick winUse the repo log format for the enqueue log too.
This helper follows the right prefix, but the payload still uses
key=valueinstead of| key: value.As per coding guidelines, "
**/*.py: Every log line must start with the function name in square brackets, following the format:logger.info(f\"[function_name] Message | key: {value}\")"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/celery/utils.py` around lines 210 - 213, Update the log line in start_fast_evaluation to follow the repo log format: keep the "[start_fast_evaluation]" prefix but change the payload from "eval_run_id={eval_run_id} | task_id={task_id}" to use the pipe-colon style like "| eval_run_id: {eval_run_id} | task_id: {task_id}" so the logger.info call (logger.info(...)) matches the required "[function_name] Message | key: {value}" pattern.backend/app/celery/tasks/evaluation_fast.py (1)
43-46: ⚡ Quick winMatch the repository log format in the task log.
This log line still uses
key=valueinstead of the documented| key: valueformat.As per coding guidelines, "
**/*.py: Every log line must start with the function name in square brackets, following the format:logger.info(f\"[function_name] Message | key: {value}\")"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/celery/tasks/evaluation_fast.py` around lines 43 - 46, Update the logger.info call in run_evaluation_fast to follow repository log format: start the message with the function name in square brackets (e.g., "[run_evaluation_fast]") and use " | key: value" pairs instead of "key=value". Concretely, modify the logger.info in evaluation_fast.py that currently logs eval_run_id and current_task.request.id to use the pattern f"[run_evaluation_fast] Starting fast evaluation task | eval_run_id: {eval_run_id} | task_id: {current_task.request.id}" so the function name and keys match the documented style.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.claude/agents/convention-reviewer.md:
- Around line 125-137: The fenced code block for the example (the
triple-backtick block that begins before "## Summary") is missing a language
identifier and triggers markdownlint MD040; update the opening fence to include
a language (e.g., change the opening "```" to "```markdown") so the block is
explicitly marked as Markdown and CI/pre-commit linting will pass; locate the
block in .claude/agents/convention-reviewer.md around the example header and
replace the fence accordingly.
In @.claude/agents/migration-writer.md:
- Line 12: Remove the hardcoded "060 → 061" baseline from the migration rule
text that follows the "ls backend/app/alembic/versions/" guidance; instead state
that the next revision id is "the highest existing NNN from NNN_*.py plus one
(zero-padded to 3 digits)" and instruct authors to derive it dynamically from
the current files; move any concrete numeric example into a clearly labeled
"historical example" section (if kept) so the main rule never contains hardcoded
numbers.
In `@backend/app/alembic/versions/063_add_run_mode_to_evaluation_run.py`:
- Around line 69-86: The current migration (in
063_add_run_mode_to_evaluation_run.py) directly deletes duplicate rows via
op.execute inside the autocommit_block, which drops historical runs; instead
replace this destructive DELETE with a safe approach: detect duplicates and
either abort the migration with a clear precheck (raise/exit and log the
duplicate ids from evaluation_run) or move/rename duplicates into an archive
(create evaluation_run_archive and INSERT duplicate rows there or UPDATE
run_name to append a unique suffix), then enforce the uniqueness constraint;
locate the SQL executed in the op.execute call and change it to a
non-destructive SELECT/INSERT/UPDATE precheck or archival operation, or fail
fast listing the offending (organization_id, project_id, run_name, id) so
cleanup is explicit.
In `@backend/app/api/docs/evaluation/create_evaluation.md`:
- Around line 42-48: The "Fast-mode error responses" section incorrectly implies
the 409 run_name_already_exists is fast-only; update the docs so only the 422
entries are labeled as fast-specific — either rename the header (e.g., to "Error
responses" and move 409 into a general errors table) or split into two tables
(one "Fast-mode error responses" containing only the 422 rows
config_type_unsupported and dataset_too_large_for_fast, and a separate "General
error responses" that includes 409 run_name_already_exists). Ensure you update
the table rows and header text in create_evaluation.md so the 409 remains
documented but not limited to fast-mode.
In `@backend/app/celery/tasks/evaluation_fast.py`:
- Around line 26-28: run_evaluation_fast currently uses untyped self and an
unused **kwargs and a magic default "N/A" for trace_id; fix by typing self as
celery.app.task.Task (or from celery import Task and use self: Task), either
remove **kwargs if it's not used or type it explicitly (e.g. **kwargs: Any or
kwargs: Mapping[str, Any]), and replace the magic literal with a constant like
UNKNOWN_TRACE_ID = "unknown" referenced in the function signature (trace_id: str
= UNKNOWN_TRACE_ID) so the function signature becomes e.g. def
run_evaluation_fast(self: Task, eval_run_id: int, trace_id: str =
UNKNOWN_TRACE_ID) -> None (or include typed kwargs if needed).
In `@backend/app/crud/evaluations/fast.py`:
- Around line 410-505: The current check-then-create flow around
eval_run.batch_job_id (used with get_batch_job, create_batch_job,
update_evaluation_run) is racy; before calling external OpenAI work
(_responses_call_for_item loop, _upload_unit_to_s3) acquire an atomic claim on
the stage by doing a compare-and-set or row-lock on the EvaluationRun (e.g.,
issue a DB UPDATE that sets a stage marker / provisional batch_job_id or
stage_claim where batch_job_id IS NULL and return the row, or SELECT FOR UPDATE
then re-check and set) so only one worker proceeds to call OpenAI and write the
final batch_job; if the atomic claim fails, reload the existing batch job via
get_batch_job and return early. Ensure you update the same fields via
update_evaluation_run (or the DB transaction) so the marker and final
batch_job_id are consistent.
In `@backend/app/models/evaluation.py`:
- Around line 303-308: The run_mode SQLField is currently free-form; enforce
allowed values ("batch" | "fast") at the model/persistence layer: change the
run_mode declaration in the Evaluation model to use a constrained enum/SQL CHECK
(or SQLAlchemy Enum) rather than a plain string and update related code paths
(e.g., backend/app/crud/evaluations/core.py:create_evaluation_run and any
callers) to accept/validate the same enum type so invalid values are rejected
before persistence and discovered consistently (this keeps
execute_fast_evaluation from receiving unexpected values).
In `@backend/app/services/evaluations/fast.py`:
- Around line 124-133: The code treats a missing
dataset.dataset_metadata["original_items_count"] as 0, letting malformed/old
datasets bypass the fast-mode size check; change the logic in the block that
computes original_items_count (and before calling is_dataset_fast_eligible) to
detect a missing/None value from dataset.dataset_metadata and reject the request
with an HTTPException (422) referencing ERR_DATASET_TOO_LARGE_FOR_FAST or a
clear message instead of defaulting to 0; ensure you still call
is_dataset_fast_eligible only when original_items_count is present and use
settings.EVAL_FAST_MAX_UNIQUE_ROWS in the error detail as currently done.
- Around line 138-164: The current except block around
create_evaluation_run(session, ...) catches every IntegrityError and always maps
it to the run-name uniqueness 409; change it to catch IntegrityError as e and
inspect the underlying DB error (e.orig / str(e) / e.orig.diag/pgcode) to detect
the specific unique constraint for run_name (the constraint name or unique
violation text), only convert that case to the HTTP 409 with
ERR_RUN_NAME_ALREADY_EXISTS; for any other IntegrityError (e.g., FK violations
from dataset/config deletions) re-raise or return an HTTP 500/propagate the
original error so the real failure isn’t masked. Ensure the change is applied in
validate_and_start_fast_evaluation around the create_evaluation_run call.
In `@backend/app/tests/api/routes/test_evaluation_fast.py`:
- Around line 41-145: Add explicit return type annotations for all fixtures,
test methods, and helper builders in this module: annotate pytest fixtures that
yield (e.g. _seeded_random, _patch_dispatch) with a generator/iterator type
(e.g. -> Iterator[None] or the appropriate Iterator[MagicMock]) and annotate
plain fixtures/tests (all test_* methods in TestFailureThreshold and
TestCallWithRetry) with -> None; for helper builders (e.g.
_make_fast_eligible_dataset and _make_text_openai_config) add the concrete
return types that match what create_test_evaluation_dataset and
create_test_config return (use the actual model/class names from the codebase
rather than Any), and ensure all function parameters keep their existing
annotations while adding these return annotations.
In `@CLAUDE.md`:
- Line 24: Replace the hardcoded Alembic revision ID in the example command so
contributors don’t copy a stale value; update the example that currently shows
`--rev-id 061` to use a placeholder like `<next_rev_id>` (or remove the flag)
and add a short note that the rev-id should be computed as the latest existing
revision ID + 1 at runtime before running `alembic revision --autogenerate -m
"Description"`, ensuring the docs no longer hardcode a numeric revision.
---
Outside diff comments:
In `@backend/app/api/routes/evaluations/dataset.py`:
- Around line 121-133: The current in-memory filtering of eligible_for == "fast"
after calling list_evaluation_datasets causes incorrect pagination; instead,
extend the list_evaluation_datasets call and underlying CRUD/query function to
accept an eligibility filter (e.g., a parameter like eligible_for_fast or
eligible_for) and apply that condition in the DB query (not in Python after
fetching); update the route to pass eligible_for=="fast" into
list_evaluation_datasets (instead of filtering responses afterwards) and remove
the post-query list comprehension that filters by r.eligible_for_fast so paging
is driven by the DB-level WHERE clause.
In `@backend/app/crud/evaluations/core.py`:
- Around line 65-92: The run_mode parameter on the function that creates
EvaluationRun must be typed and constrained to the shared RunMode enum/literal
from the model layer instead of a free-form str to prevent invalid values
persisting; update the function signature to use the model's
RunModeEnum/RunModeLiteral type (replace run_mode: str = "batch"),
validate/convert any incoming value to that enum before constructing
EvaluationRun, and set EvaluationRun.run_mode using the enum value (referencing
the EvaluationRun constructor and the run_mode parameter) so only allowed enum
members are stored; also add the appropriate return type hint if missing.
---
Nitpick comments:
In `@backend/app/api/routes/evaluations/evaluation.py`:
- Around line 56-61: The log call in the evaluation route should follow the
project's structured format; update the logger.info invocation (the one
currently using f"[evaluate] Starting evaluation | run_mode={run_mode.value} |
...") to start with the function name in square brackets and use " | key:
{value}" pairs (e.g., " | run_mode: {run_mode.value}", " | experiment_name:
{experiment_name}", " | dataset_id: {dataset_id}", " | org_id:
{auth_context.organization_.id}", " | project_id: {auth_context.project_.id}")
so the message matches the required pattern; keep the existing logger.info and
f-string but change the separators and keys to the "| key: {value}" format.
In `@backend/app/celery/tasks/evaluation_fast.py`:
- Around line 43-46: Update the logger.info call in run_evaluation_fast to
follow repository log format: start the message with the function name in square
brackets (e.g., "[run_evaluation_fast]") and use " | key: value" pairs instead
of "key=value". Concretely, modify the logger.info in evaluation_fast.py that
currently logs eval_run_id and current_task.request.id to use the pattern
f"[run_evaluation_fast] Starting fast evaluation task | eval_run_id:
{eval_run_id} | task_id: {current_task.request.id}" so the function name and
keys match the documented style.
In `@backend/app/celery/utils.py`:
- Around line 210-213: Update the log line in start_fast_evaluation to follow
the repo log format: keep the "[start_fast_evaluation]" prefix but change the
payload from "eval_run_id={eval_run_id} | task_id={task_id}" to use the
pipe-colon style like "| eval_run_id: {eval_run_id} | task_id: {task_id}" so the
logger.info call (logger.info(...)) matches the required "[function_name]
Message | key: {value}" pattern.
In `@backend/app/models/evaluation.py`:
- Around line 56-59: The Field description for eligible_for_fast currently
hardcodes "10" and implies readiness for fast run mode; update the description
to state this flag represents only the dataset-size predicate (e.g., "True if
dataset has a number of unique rows at or below the fast-start threshold") and
reference the threshold from settings instead of a magic literal — for example
mention settings.FAST_START_MAX_ROWS (or the project's equivalent constant) so
the OpenAPI text stays accurate and doesn't promise other prerequisites like
Langfuse dataset ID; change the text on the eligible_for_fast Field in
evaluation.py accordingly.
In `@backend/app/services/evaluations/fast.py`:
- Around line 70-74: The log lines in validate_and_start_fast_evaluation (and
the other occurrences flagged at ranges 152-155, 179-182, 212-275) use
"key=value" but must follow the repo convention of starting with the function
name in square brackets and using " | key: {value}" pairs; update the
logger.info/logger.debug calls (e.g., the logger call shown and the other
flagged calls) to use the prefix "[validate_and_start_fast_evaluation]" and
format each metadata item as " | key: {variable}" (for example " | run_name:
{run_name} | dataset_id: {dataset_id} | org_id: {organization_id} | project_id:
{project_id}"), and apply the same normalization to all other logged messages in
that function and the other flagged blocks so they match the documented log
shape.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 38aa7de9-77d7-4ccc-a875-9b5036f7cbf9
📒 Files selected for processing (27)
.claude/agents/celery-task-writer.md.claude/agents/convention-reviewer.md.claude/agents/crud-writer.md.claude/agents/migration-writer.md.claude/agents/model-writer.md.claude/agents/route-writer.md.claude/agents/service-writer.md.claude/agents/test-writer.mdCLAUDE.mdbackend/app/alembic/versions/063_add_run_mode_to_evaluation_run.pybackend/app/api/docs/evaluation/create_evaluation.mdbackend/app/api/docs/evaluation/list_datasets.mdbackend/app/api/routes/evaluations/dataset.pybackend/app/api/routes/evaluations/evaluation.pybackend/app/celery/celery_app.pybackend/app/celery/tasks/evaluation_fast.pybackend/app/celery/utils.pybackend/app/core/config.pybackend/app/crud/evaluations/__init__.pybackend/app/crud/evaluations/core.pybackend/app/crud/evaluations/fast.pybackend/app/models/__init__.pybackend/app/models/evaluation.pybackend/app/services/evaluations/__init__.pybackend/app/services/evaluations/evaluation.pybackend/app/services/evaluations/fast.pybackend/app/tests/api/routes/test_evaluation_fast.py
The fast-evaluation PR inadvertently checked in the layer-specialist agent definitions under .claude/agents/ and trimmed CLAUDE.md to delegate to them. Revert that out of this PR: delete the agent files and restore CLAUDE.md to its origin/main state so the two stay consistent. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/alembic/versions/064_add_run_mode_to_evaluation_run.py (1)
49-49:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAdd return type hints to migration entrypoints.
upgradeanddowngradeshould explicitly declare-> Noneto comply with the Python typing standard used in this repo.Suggested patch
-def upgrade(): +def upgrade() -> None: @@ -def downgrade(): +def downgrade() -> None:As per coding guidelines,
**/*.py: Always add type hints to all function parameters and return values in Python code.Also applies to: 104-104
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/alembic/versions/064_add_run_mode_to_evaluation_run.py` at line 49, The migration entrypoints upgrade and downgrade are missing return type annotations; update the function signatures for both upgrade() and downgrade() to explicitly declare "-> None" (e.g., def upgrade() -> None: and def downgrade() -> None:) so they comply with the repo's typing standard and the guideline to annotate all Python function return types.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@backend/app/alembic/versions/064_add_run_mode_to_evaluation_run.py`:
- Line 49: The migration entrypoints upgrade and downgrade are missing return
type annotations; update the function signatures for both upgrade() and
downgrade() to explicitly declare "-> None" (e.g., def upgrade() -> None: and
def downgrade() -> None:) so they comply with the repo's typing standard and the
guideline to annotate all Python function return types.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b44f762e-68ae-4c6f-9fd7-85e86a6677b0
📒 Files selected for processing (1)
backend/app/alembic/versions/064_add_run_mode_to_evaluation_run.py
There was a problem hiding this comment.
🧹 Nitpick comments (1)
backend/app/crud/evaluations/fast.py (1)
510-511: 💤 Low valueRedundant fallback in
max_workerscalculation.The
or 1is unnecessary sincemax(1, ...)already guarantees the result is at least 1.♻️ Suggested simplification
- max_workers = max( - 1, min(settings.EVAL_FAST_API_CONCURRENCY, len(embed_candidates) or 1) - ) + max_workers = max(1, min(settings.EVAL_FAST_API_CONCURRENCY, len(embed_candidates)))🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/crud/evaluations/fast.py` around lines 510 - 511, The max_workers assignment redundantly uses "or 1" inside the min call; simplify the expression by removing the "or 1" fallback and rely on the outer max(1, ...) to enforce a minimum of 1—update the line that sets max_workers (which references settings.EVAL_FAST_API_CONCURRENCY and embed_candidates) to use max(1, min(settings.EVAL_FAST_API_CONCURRENCY, len(embed_candidates))) instead.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@backend/app/crud/evaluations/fast.py`:
- Around line 510-511: The max_workers assignment redundantly uses "or 1" inside
the min call; simplify the expression by removing the "or 1" fallback and rely
on the outer max(1, ...) to enforce a minimum of 1—update the line that sets
max_workers (which references settings.EVAL_FAST_API_CONCURRENCY and
embed_candidates) to use max(1, min(settings.EVAL_FAST_API_CONCURRENCY,
len(embed_candidates))) instead.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b0d2cbf4-d4dd-4de5-b362-d7ad943bea92
📒 Files selected for processing (2)
backend/app/crud/evaluations/fast.pybackend/app/tests/api/routes/test_evaluation_fast.py
🚧 Files skipped from review as they are similar to previous changes (1)
- backend/app/tests/api/routes/test_evaluation_fast.py
…o enhancement/evaluation-fast
Summary
Target issue is #741
The current evaluation pipeline runs every text evaluation through the OpenAI
Batch API, which can take up to 24 hours to complete. That turnaround is fine
for large golden sets but makes prompt/config iteration painfully slow.
This PR adds a second execution mode —
run_mode="fast"— toPOST /evaluations.For small datasets, the run is executed synchronously through the OpenAI
Responses API instead of a batch job, returning generated outputs, cosine
similarity, and Langfuse traces in seconds-to-minutes. The Batch API path is
unchanged and remains the default for larger datasets.
What's included
API
POST /evaluationsaccepts a newrun_modefield ("batch"default, or"fast").Fast requests are validated and dispatched to a dedicated orchestrator task.
GET /evaluations/datasetsgains aneligible_for=fastfilter and surfaces aneligible_for_fastflag on each dataset.GET /evaluations/{id}now returnsrun_modeso clients can tell the two paths apart.Fast-eval orchestrator (
crud/evaluations/fast.py,services/evaluations/fast.py)run_evaluation_fast) on a newevaluationsqueue (priority 6)drives the run through stages:
persisted as one JSON unit to S3.
pair, persisted to S3.
cost attribution.
batch_jobrow(
job_type="evaluation_fast"/"embedding_fast"), so an orchestrator retry skipsany stage that already succeeded and never re-calls OpenAI for finished work.
OpenAI errors; permanent errors fail the item immediately.
failedif too many items fail.execute_llm_call()and writes nollm_callrows; per-item details live in theS3 unit and per-run cost lives on
batch_job.Schema / config
064addsevaluation_run.run_mode(NOT NULL, default'batch',existing rows backfilled) and a
UNIQUE (organization_id, project_id, run_name)constraint (guards double-click/retry races →
409). The unique index is builtCONCURRENTLY. No new tables.EVAL_FAST_MAX_UNIQUE_ROWS(10),EVAL_FAST_FAILURE_THRESHOLD(0.5),EVAL_FAST_API_CONCURRENCY(10).Scope / constraints (Phase 1)
default duplication factor). Larger datasets must continue to use
run_mode="batch".Checklist
Before submitting a pull request, please ensure that you mark these task.
fastapi run --reload app/main.pyordocker compose upin the repository root and test.