[OPIK-5269] [SDK] feat: add evaluate_resume to continue interrupted evaluations by alexkuzmik · Pull Request #6941 · comet-ml/opik

alexkuzmik · 2026-06-01T14:09:59Z

Details

Adds opik.evaluate_resume(experiment_id, task, ...) so a long-running evaluation that was interrupted (Ctrl-C, OOM, network blip, crash mid-scoring) can be continued from where it stopped — replaying only the trials that did not complete, and merging them with the trials that did.

Public API

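import opik
from opik.evaluation.metrics import Equals

result = opik.evaluate_resume(
    experiment_id="...",  # id of the interrupted experiment
    task=my_task,  # task callable to run for the pending trials
    scoring_metrics=[Equals()],
)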

Returns the same EvaluationResult shape as evaluate(), but test_results spans both the reconstructed prior trials and the trials this call executed. The original evaluate(...) call stores a JSON resume blob in the experiment's experiment_config; evaluate_resume reads it back to reproduce the exact iteration (pinned dataset version, filter, nb_samples, per-item trial counts).

If the original evaluate(...) used a custom dataset_sampler or explicit dataset_item_ids, the SDK also writes a local checkpoint of the resolved item ids, keyed by the experiment id, because those cases cannot be reproduced from server-side state alone.
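The local checkpoint can be pictured as a small JSON file keyed by experiment id. The sketch below is a guess at the shape, not the SDK's code: the helper names, the JSON layout, and the stand-in exception body are assumptions made for illustration, and the file is written into a caller-supplied directory rather than any real SDK location.

```python
import json
import tempfile
from pathlib import Path
from typing import List


class LocalCheckpointMissing(Exception):
    """Illustrative stand-in; the real exception is defined by the SDK."""


def write_checkpoint(root: Path, experiment_id: str, item_ids: List[str]) -> Path:
    """Persist the resolved dataset item ids for one experiment."""
    root.mkdir(parents=True, exist_ok=True)
    path = root / f"{experiment_id}.json"
    path.write_text(json.dumps({"dataset_item_ids": item_ids}))
    return path


def read_checkpoint(root: Path, experiment_id: str) -> List[str]:
    """Load the ids back, failing loudly when the file is not on this machine."""
    path = root / f"{experiment_id}.json"
    if not path.exists():
        raise LocalCheckpointMissing(f"no checkpoint for experiment {experiment_id}")
    return json.loads(path.read_text())["dataset_item_ids"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "resume"
        write_checkpoint(root, "exp-123", ["id-a", "id-b"])
        print(read_checkpoint(root, "exp-123"))
```

Raising on a missing file, instead of falling back to re-sampling, keeps a resume on a different machine from silently iterating over a different set of items.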

Design highlights

Completion detection: output as marker. A trial counts as complete if and only if trace.output is set. The engine sets trace.output on a line that is reached only on the happy path (task, scoring, and score logging all returned cleanly). Any failure that prevents reaching that line (a synchronous exception, a BaseException escaping a metric, a KeyboardInterrupt between task and score logging) leaves the trace with output = None, and resume replays those trials. This closes a gap in the old design, where scoring crashes landed with output set but no scores and were indistinguishable from full completion.
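A minimal sketch of the output-as-marker rule, under stated assumptions: TrialRecord and split_trials are hypothetical names, and the predicate body is an illustration of the rule described above rather than the SDK's actual is_trial_fully_completed.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class TrialRecord:
    """Hypothetical view of one stored trial: its item id and the trace output."""

    dataset_item_id: str
    trace_output: Optional[Dict[str, Any]]


def is_trial_fully_completed(trial: TrialRecord) -> bool:
    # Output is written only after task, scoring, and score logging all succeed,
    # so its presence is the completion marker.
    return trial.trace_output is not None


def split_trials(trials: List[TrialRecord]) -> Tuple[List[TrialRecord], List[TrialRecord]]:
    """Partition trials into (completed, to_replay)."""
    done = [t for t in trials if is_trial_fully_completed(t)]
    pending = [t for t in trials if not is_trial_fully_completed(t)]
    return done, pending


if __name__ == "__main__":
    trials = [
        TrialRecord("item-a", {"answer": "yes"}),
        TrialRecord("item-b", None),  # crashed before the happy-path line
    ]
    done, pending = split_trials(trials)
    print(len(done), len(pending))
```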

Resume state lives in experiment_config. Serialized as a single JSON string under one key so the Configuration UI doesn't flatten every field as a separate row. Includes schema version, default runs_per_item, dataset_filter_string, pinned dataset_version_name, nb_samples, and a requires_local_checkpoint flag.
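To illustrate the single-key encoding, here is a sketch that packs the resume fields into one JSON string and reads them back. The key name _opik_resume comes from a commit message in this PR; the helper names and the schema_version field name are assumptions, and the real persistence code may differ.

```python
import json
from typing import Any, Dict, Optional

RESUME_KEY = "_opik_resume"


def embed_resume_state(experiment_config: Dict[str, Any], state: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the config with the resume state stored as one JSON string."""
    merged = dict(experiment_config)
    merged[RESUME_KEY] = json.dumps(state, sort_keys=True)
    return merged


def read_resume_state(experiment_config: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Decode the blob; anything other than a JSON-object string counts as no state."""
    raw = experiment_config.get(RESUME_KEY)
    if not isinstance(raw, str):
        return None
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


if __name__ == "__main__":
    state = {
        "schema_version": 1,  # hypothetical field name
        "default_runs_per_item": 2,
        "dataset_filter_string": None,
        "dataset_version_name": "v3",
        "nb_samples": 10,
        "requires_local_checkpoint": False,
    }
    config = embed_resume_state({"model": "demo"}, state)
    print(type(config[RESUME_KEY]).__name__)
    print(read_resume_state(config) == state)
```

Because the value is a plain string, a UI that flattens nested dicts into rows sees one entry instead of one per field.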

Dataset iteration stays lazy. No-sampler / no-explicit-ids evaluate() calls stream items into the engine as they arrive from the backend, same as main. The local-checkpoint path takes resolved_ids directly without consuming the iterator. Sampler path still materializes (samplers can't operate on a stream) — unchanged from main.

Backward-compatible state reads. Pydantic models in resume/state.py use extra="allow" so a newer experiment blob doesn't break an older SDK reader. A missing dataset_version_name downgrades the experiment to non-resumable rather than silently iterating against a moving HEAD.
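The tolerant-read and downgrade behavior can be approximated with the standard library alone. This is not the Pydantic code in resume/state.py: the dataclass fields, parse_state, and the reason string are assumptions, and unknown keys are kept in an extra dict to mimic what extra="allow" achieves.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Union


@dataclass
class ResumableState:
    dataset_version_name: str
    default_runs_per_item: int = 1
    extra: Dict[str, Any] = field(default_factory=dict)  # unknown keys are kept, not rejected


@dataclass
class NonResumableState:
    reason: str


KNOWN_FIELDS = {"dataset_version_name", "default_runs_per_item"}


def parse_state(blob: Dict[str, Any]) -> Union[ResumableState, NonResumableState]:
    """Read a decoded resume blob, tolerating fields written by a newer SDK."""
    version = blob.get("dataset_version_name")
    if not version:
        # Without a pinned version, iteration would follow a moving HEAD.
        return NonResumableState(reason="dataset_version_name is missing")
    extra = {k: v for k, v in blob.items() if k not in KNOWN_FIELDS}
    return ResumableState(
        dataset_version_name=version,
        default_runs_per_item=int(blob.get("default_runs_per_item", 1)),
        extra=extra,
    )


if __name__ == "__main__":
    print(parse_state({"dataset_version_name": "v3", "future_field": 42}))
    print(parse_state({"default_runs_per_item": 2}))
```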

What this PR does NOT do

It does not deduplicate or repair trials that landed on the backend in a partial state: the merge covers fully completed trials only, and partially completed items get all of their trials redone. The all-or-nothing rationale is documented inline at resume/iteration.py::remaining_runs_for_item.
It does not add UI surface for resume. Resume is SDK-only in this PR.
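The all-or-nothing rule is simple enough to state as a function. The sketch below borrows the name remaining_runs_for_item from the PR, but the signature and body are assumptions that only illustrate the rule described above.

```python
def remaining_runs_for_item(expected_runs: int, completed_runs: int) -> int:
    """Return how many trials to execute for one item on resume.

    An item is either fully done (nothing to run) or redone from scratch;
    partially completed items never keep their earlier trials.
    """
    if completed_runs >= expected_runs:
        return 0
    return expected_runs


if __name__ == "__main__":
    for completed in (0, 1, 3):
        print(completed, remaining_runs_for_item(expected_runs=3, completed_runs=completed))
```

Redoing every trial of a partial item means the merged result for that item comes from a single version of the task, never a mix of the original and the resumed one.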

Change checklist

User facing — opik.evaluate_resume is a new public entrypoint.
Documentation update — follow-up once the API stabilizes.

Issues

OPIK-5269

AI-WATERMARK

AI-WATERMARK: yes

Tools: Claude Code
Model(s): Claude Opus 4.7
Scope: assisted (design exploration, engine/resume wiring, e2e verification, tests)
Human verification: manual resume_start.py + resume_continue.py flows on a local backend, REST round-trip + UI inspection, full unit + e2e suite green.

Testing

Unit (`tests/unit/evaluation/`)

Resume orchestration: resume/test_state.py, resume/test_context.py, resume/test_iteration.py, resume/test_merge.py, resume/test_integration.py.
Top-level test_evaluate_resume.py exercises happy-path, no-pending, and item-resolution branches against a mocked engine.
All 3605 unit tests pass (pytest tests/unit/ --ignore=tests/unit/evaluation/metrics — metrics excluded only due to an unrelated optional rouge-score dep on this machine).

End-to-end (`tests/e2e/evaluation/`)

Resume e2e tests cover happy path, task-side failures, scoring-side BaseException failures (the case output-as-marker exists for), and mixed-failure scenarios.
tests/e2e/evaluation/test_evaluate_streaming.py passes against a local backend (it was red on the branch before the streaming-restore commit in this PR).
Verifier verify_experiment_items_completed uses the same is_trial_fully_completed predicate the SDK uses.

Manual flows

resume_start.py produces a 10-item experiment with planned crashes on items 3 (task-side) and 7 (scoring-side).
resume_continue.py calls evaluate_resume(...) against the failed experiment — yields 12 rows total (10 original + 2 fresh replays), 10 with output set and 2 kept as failure history.
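The row counts in the manual flow follow from simple bookkeeping, reproduced here as a toy simulation. The row dictionaries are a made-up representation and do not mirror the backend's experiment-item schema.

```python
# First run: 10 items, with items 3 and 7 failing before output is written.
first_run = [
    {"item": i, "output": None if i in (3, 7) else {"ok": True}}
    for i in range(10)
]

# Resume replays only the items whose trial has no output.
pending_items = [row["item"] for row in first_run if row["output"] is None]
replays = [{"item": i, "output": {"ok": True}} for i in pending_items]

# Failed rows stay as history; fresh replays are appended.
all_rows = first_run + replays
with_output = [row for row in all_rows if row["output"] is not None]
failure_history = [row for row in all_rows if row["output"] is None]

print(len(all_rows), len(with_output), len(failure_history))
```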

Documentation

Inline rationale at engine/helpers.py::EvaluationContextState, resume/context.py::is_trial_fully_completed, resume/iteration.py::remaining_runs_for_item, and helpers.py::resolve_dataset_items.
Trace UI semantics: a failed-scoring trial has output = None on the trace level, but the task's @opik.track span still preserves what the function returned, so per-span debugging is unaffected.

…valuations

Adds opik.evaluate_resume(experiment_id, task, ...) so an interrupted evaluate() run can be picked up where it left off without re-processing items that already completed.

Key behaviors:
- Resume state (default_runs_per_item, dataset_filter_string, nb_samples, pinned dataset_version_name) is embedded in experiment_config at evaluate() time and read back on resume. Persistence stores only small reproducible configs, never resolved data lists.
- Iteration always runs against the pinned DatasetVersion the original call saw, never a moving Dataset HEAD. Experiments created without a pinned version are marked non-resumable at write time and refused at read time.
- Sampler / explicit-ids cases snapshot the resolved item ids to a local checkpoint (~/.opik/resume/<experiment_id>.json). Resume from a machine without the checkpoint raises LocalCheckpointMissing.
- Trial bookkeeping is all-or-nothing: an item with completed < expected trials gets every trial redone, so the merged result never mixes outputs from the buggy original task and the fixed resume task.
- The returned EvaluationResult is the full experiment: fully-completed items are reconstructed (read-only) from their stored feedback scores and concatenated with the freshly-executed slice. experiment_scoring_functions run over the union.

Modular structure under sdks/python/src/opik/evaluation/resume/:
state.py        ResumableState / NonResumableState sum type + persistence
checkpoint.py   local ~/.opik/resume/<id>.json file I/O
iteration.py    expected_runs / remaining_runs / pending iterator
context.py      ResumeContext + prepare_resume_context orchestrator
integration.py  evaluator-facing glue (state embedding + checkpoint write)
merge.py        reconstruct_previous_test_results (read-only)

Exceptions ExperimentNotResumable + LocalCheckpointMissing live in opik.exceptions (subclass OpikException).

Includes a runnable example at sdks/python/examples/resume_evaluation.py that demonstrates an interrupted run followed by a resume against a 20-item sentiment dataset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Conflict in evaluator.py::_evaluate_test_suite_task: our side hoisted item resolution out of the function, main side added local-emulator activation + scoring_tool_strategy override. Both kept: function still takes a resolved `items: List[DatasetItem]` (ours) and wraps the engine call in the emulator/strategy machinery from main.

Also: dropped reserved `id` field from the resume_evaluation.py example; crash is now triggered off the (unique) review text instead. The local helper that returned a set of completed dataset_item_ids became ``completed_count`` (count is what the script actually reports).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… in e2e

- state.py: _opik_resume value is a single JSON-encoded string under one metadata key. Stops the experiment Configuration UI from listing seven rows (_opik_resume.dataset_filter_string, .dataset_version_name, etc.).
- state._read_raw_resume_state: dict-form value is no longer accepted; the schema introduced in this PR has always been the string form, so there is no legacy to support.
- tests: route blob metadata through _metadata_with_blob(dict) helpers that JSON-encode the input. Added a test pinning that a raw-dict value is treated as no resume state.
- e2e: switch 5 tests off literal string ids ("item-0", "the-item") to UUID ids paired with stable labels carried in input.text. The backend requires UUIDs for dataset_item.id.
- e2e: fix mixed_partial_and_fully_completed_items assumption: the engine drains every submitted trial before re-raising, so a crash on item-1's 2nd trial does NOT stop item-2's trials. Test now asserts the actual final shape (item-1 partial; item-0 and item-2 fully completed).

All 547 unit tests + 10 resume e2e tests pass against production.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replaces the ``evaluation_task_output is not None`` predicate that ``evaluate_resume`` used to decide which trials are fully completed. The old rule misses every failure mode where the task succeeded (output written to the trace) but scoring did not finish: synchronous exception escaping the per-metric handler, ``log_test_result_feedback_scores`` raising, ``KeyboardInterrupt`` mid-scoring, or any other ``BaseException`` that lets the outer ``finally`` write the trace.

The engine now seeds ``trace.metadata['_opik_evaluation_pending'] = True`` when the trace is built, and flips it to ``False`` on a happy-path-only line after ``_compute_test_result_for_test_case`` returns. Any failure that prevents reaching that line leaves the marker at its default. Resume counts only trials whose persisted trace metadata carries the cleared marker.

``ExperimentItemContent`` carries the new ``trace_metadata`` field that the backend exposes on the experiment-item comparison join. Resume reads the marker from there through one round trip, with no per-trial trace fetch.

Includes:
* engine + resume marker plumbing and a new ``is_trial_fully_completed`` predicate used by both ``context.py`` and ``merge.py``.
* Default ``metadata=ANY`` on the test ``TraceModel`` so existing trace-shape assertions don't break against the seeded marker.
* Unit coverage for the four marker states (cleared / pending / absent / unrelated key) and the count function.
* ``examples/resume_evaluation_interactive.py`` rewritten so it can simulate task-side and scoring-side failures (the scoring case uses ``SystemExit`` to escape the per-metric ``except Exception`` handler). Mode picker via ``OPIK_DEMO_FAILURE_MODE``.
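The metadata-marker rule from this commit message can be sketched as a predicate over the four marker states it lists. The function signature and the treatment of an absent marker as "not completed" are assumptions inferred from the text ("only trials whose persisted trace metadata carries the cleared marker"), not a copy of the SDK code.

```python
from typing import Any, Mapping, Optional

PENDING_KEY = "_opik_evaluation_pending"


def is_trial_fully_completed(trace_metadata: Optional[Mapping[str, Any]]) -> bool:
    """A trial counts as complete only when the pending marker was explicitly cleared."""
    if not trace_metadata:
        return False
    # `is False` rejects a missing key (None) and a still-pending marker (True).
    return trace_metadata.get(PENDING_KEY) is False


if __name__ == "__main__":
    states = {
        "cleared": {PENDING_KEY: False},
        "pending": {PENDING_KEY: True},
        "absent": {},
        "unrelated key": {"other": 1},
    }
    for name, metadata in states.items():
        print(name, is_trial_fully_completed(metadata))
```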

github-actions · 2026-06-01T14:17:01Z