[OPIK-5269] [SDK] feat: add evaluate_resume to continue interrupted evaluations#6941
…valuations

Adds opik.evaluate_resume(experiment_id, task, ...) so an interrupted evaluate() run can be picked up where it left off without re-processing items that already completed.

Key behaviors:
- Resume state (default_runs_per_item, dataset_filter_string, nb_samples, pinned dataset_version_name) is embedded in experiment_config at evaluate() time and read back on resume. Persistence stores only small reproducible configs, never resolved data lists.
- Iteration always runs against the pinned DatasetVersion the original call saw, never a moving Dataset HEAD. Experiments created without a pinned version are marked non-resumable at write time and refused at read time.
- Sampler / explicit-ids cases snapshot the resolved item ids to a local checkpoint (~/.opik/resume/<experiment_id>.json). Resume from a machine without the checkpoint raises LocalCheckpointMissing.
- Trial bookkeeping is all-or-nothing: an item with completed < expected trials gets every trial redone, so the merged result never mixes outputs from the buggy original task and the fixed resume task.
- The returned EvaluationResult is the full experiment: fully-completed items are reconstructed (read-only) from their stored feedback scores and concatenated with the freshly-executed slice. experiment_scoring_functions run over the union.

Modular structure under sdks/python/src/opik/evaluation/resume/:
  state.py        ResumableState / NonResumableState sum type + persistence
  checkpoint.py   local ~/.opik/resume/<id>.json file I/O
  iteration.py    expected_runs / remaining_runs / pending iterator
  context.py      ResumeContext + prepare_resume_context orchestrator
  integration.py  evaluator-facing glue (state embedding + checkpoint write)
  merge.py        reconstruct_previous_test_results (read-only)

Exceptions ExperimentNotResumable + LocalCheckpointMissing live in opik.exceptions (subclass OpikException).

Includes a runnable example at sdks/python/examples/resume_evaluation.py that demonstrates an interrupted run followed by a resume against a 20-item sentiment dataset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
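The local-checkpoint behavior described in this commit can be sketched roughly as below. This is an illustration only, not the contents of checkpoint.py: the helper names `write_checkpoint` / `read_checkpoint`, the `base_dir` parameter, and the JSON layout are assumptions. Only the `<experiment_id>.json` file naming and the `LocalCheckpointMissing` exception name come from the commit message.

```python
import json
import tempfile
from pathlib import Path
from typing import List


class LocalCheckpointMissing(Exception):
    """Stand-in for the SDK exception of the same name (body is hypothetical)."""


def checkpoint_path(experiment_id: str, base_dir: Path) -> Path:
    # The commit message places checkpoints under ~/.opik/resume/; base_dir is a
    # parameter here so the sketch can run against a throwaway directory.
    return base_dir / f"{experiment_id}.json"


def write_checkpoint(experiment_id: str, item_ids: List[str], base_dir: Path) -> Path:
    # Snapshot the resolved item ids (hypothetical JSON layout).
    base_dir.mkdir(parents=True, exist_ok=True)
    path = checkpoint_path(experiment_id, base_dir)
    path.write_text(json.dumps({"dataset_item_ids": item_ids}))
    return path


def read_checkpoint(experiment_id: str, base_dir: Path) -> List[str]:
    # Resuming on a machine that never wrote the file fails loudly.
    path = checkpoint_path(experiment_id, base_dir)
    if not path.exists():
        raise LocalCheckpointMissing(f"no local checkpoint at {path}")
    return json.loads(path.read_text())["dataset_item_ids"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        write_checkpoint("exp-123", ["item-a", "item-b"], Path(tmp))
        print(read_checkpoint("exp-123", Path(tmp)))
```

Failing with a dedicated exception, rather than silently falling back to the full dataset, keeps a resume from iterating a different item set than the original run.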
Conflict in evaluator.py::_evaluate_test_suite_task — our side hoisted item resolution out of the function, main side added local-emulator activation + scoring_tool_strategy override. Both kept: function still takes a resolved `items: List[DatasetItem]` (ours) and wraps the engine call in the emulator/strategy machinery from main. Also: dropped reserved `id` field from the resume_evaluation.py example — crash is now triggered off the (unique) review text instead. The local helper that returned a set of completed dataset_item_ids became ``completed_count`` (count is what the script actually reports). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… in e2e
- state.py: _opik_resume value is a single JSON-encoded string under one
metadata key. Stops the experiment Configuration UI from listing seven
rows (_opik_resume.dataset_filter_string, .dataset_version_name, etc.).
- state._read_raw_resume_state: dict-form value is no longer accepted;
the schema introduced in this PR has always been the string form, so
there is no legacy to support.
- tests: route blob metadata through _metadata_with_blob(dict) helpers
that JSON-encode the input. Added a test pinning that a raw-dict value
is treated as no resume state.
- e2e: switch 5 tests off literal string ids ("item-0", "the-item") to
UUID ids paired with stable labels carried in input.text. The backend
requires UUIDs for dataset_item.id.
- e2e: fix mixed_partial_and_fully_completed_items assumption — the
engine drains every submitted trial before re-raising, so a crash on
item-1's 2nd trial does NOT stop item-2's trials. Test now asserts the
actual final shape (item-1 partial; item-0 and item-2 fully completed).
All 547 unit tests + 10 resume e2e tests pass against production.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
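A rough sketch of the single-key encoding this commit describes follows. The `_opik_resume` key and the rule that a raw-dict value counts as no resume state come from the commit message; the function names and the handling of malformed JSON are assumptions made for illustration.

```python
import json
from typing import Any, Dict, Optional

RESUME_KEY = "_opik_resume"  # key name taken from the commit message


def embed_resume_state(
    experiment_config: Dict[str, Any], state: Dict[str, Any]
) -> Dict[str, Any]:
    # Store the whole state as one JSON string so a UI that flattens nested
    # metadata shows a single row instead of one row per field.
    merged = dict(experiment_config)
    merged[RESUME_KEY] = json.dumps(state, sort_keys=True)
    return merged


def read_resume_state(
    experiment_config: Optional[Dict[str, Any]]
) -> Optional[Dict[str, Any]]:
    raw = (experiment_config or {}).get(RESUME_KEY)
    if not isinstance(raw, str):
        # Dict-form (or missing) values are treated as "no resume state".
        return None
    try:
        decoded = json.loads(raw)
    except json.JSONDecodeError:
        # Assumption: malformed JSON is also treated as "no resume state".
        return None
    return decoded if isinstance(decoded, dict) else None


if __name__ == "__main__":
    config = embed_resume_state({"model": "demo"}, {"nb_samples": 5})
    print(config[RESUME_KEY])
    print(read_resume_state(config))
```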
Replaces the ``evaluation_task_output is not None`` predicate that ``evaluate_resume`` used to decide which trials are fully completed. The old rule misses every failure mode where the task succeeded (output written to the trace) but scoring did not finish: a synchronous exception escaping the per-metric handler, ``log_test_result_feedback_scores`` raising, ``KeyboardInterrupt`` mid-scoring, or any other ``BaseException`` that lets the outer ``finally`` write the trace.

The engine now seeds ``trace.metadata['_opik_evaluation_pending'] = True`` when the trace is built, and flips it to ``False`` on a happy-path-only line after ``_compute_test_result_for_test_case`` returns. Any failure that prevents reaching that line leaves the marker at its default. Resume counts only trials whose persisted trace metadata carries the cleared marker.

``ExperimentItemContent`` carries the new ``trace_metadata`` field that the backend exposes on the experiment-item comparison join. Resume reads the marker from there through one round trip, with no per-trial trace fetch.

Includes:
* engine + resume marker plumbing and a new ``is_trial_fully_completed`` predicate used by both ``context.py`` and ``merge.py``.
* Default ``metadata=ANY`` on the test ``TraceModel`` so existing trace-shape assertions don't break against the seeded marker.
* Unit coverage for the four marker states (cleared / pending / absent / unrelated key) and the count function.
* ``examples/resume_evaluation_interactive.py`` rewritten so it can simulate task-side and scoring-side failures (the scoring case uses ``SystemExit`` to escape the per-metric ``except Exception`` handler). Mode picker via ``OPIK_DEMO_FAILURE_MODE``.
The interactive resume demo (sdks/python/examples/resume_evaluation_interactive.py) was useful for local end-to-end verification but doesn't belong in the reviewable surface of this PR.
Python SDK Compatibility V1 E2E Tests Results (Python 3.11): 92 tests, 92 ✅, 2m 21s ⏱️. Results for commit d5207e9.
Python SDK Compatibility V1 E2E Tests Results (Python 3.12): 92 tests, 92 ✅, 2m 27s ⏱️. Results for commit d5207e9.
Python SDK Compatibility V1 E2E Tests Results (Python 3.13): 92 tests, 92 ✅, 2m 6s ⏱️. Results for commit d5207e9.
Python SDK E2E Tests Results (Python 3.12): 280 tests, 277 ✅, 5m 19s ⏱️. Results for commit d5207e9.
Python SDK E2E Tests Results (Python 3.13): 281 tests, 279 ✅, 4m 30s ⏱️. Results for commit 8c6f7fc.
Python SDK E2E Tests Results (Python 3.11): 0 tests, 0 ✅, 0s ⏱️. Results for commit 9e8ca26.
Three new e2e tests for the marker-based completion predicate:
- ``test_evaluate_resume__scoring_crash_after_task_success__trial_replayed``: the case the marker exists for. The task succeeds (output set), a metric raises ``BaseException`` mid-scoring (escaping the per-metric ``except Exception``), the trial is recorded with marker=True, and resume replays it.
- ``test_evaluate_resume__metric_scoring_failed_inside_loop__not_replayed``: regression guard for the inverse. When a metric raises a regular ``Exception`` (caught by the engine and converted to ``scoring_failed=True``), the scoring loop still finishes and the marker is flipped to False. Resume must NOT replay even though the stored feedback score is failed.
- ``test_evaluate_resume__mixed_task_and_scoring_failures__only_failed_items_replayed``: combined coverage with one task-failed item, one scoring-failed item, and one all-good item. Verifies the marker alone distinguishes the three.

Also updates ``verifiers.verify_experiment_items_completed`` to use the marker (via ``resume.context.is_trial_fully_completed``) as the source of truth instead of ``evaluation_task_output is not None``. The old predicate happened to agree on task-side failures but disagreed on scoring-side failures; the new helper matches what ``evaluate_resume`` actually does.
…ace_metadata

Adds a ``requires_completion_marker`` field to the persisted ``ResumableState`` blob. ``evaluate`` always writes ``True`` (this SDK seeds the marker on every trace). ``evaluate_resume`` reads it back; when set, it verifies the connected backend actually surfaces ``trace_metadata`` on the experiment-item compare response. If none of the items carry any trace metadata, it raises the new ``BackendTooOldForResume`` exception with an actionable message pointing at the OPIK-5269 BE projection.

Without the check, an old backend would return ``trace_metadata=None`` on every item, the marker predicate would treat all trials as incomplete, and resume would silently replay everything.

Tests cover:
* round-trip of the new field
* raise on (marker required + empty trace_metadata across items)
* no raise when at least one item carries metadata
* no raise when the experiment is empty or the persisted blob doesn't require the marker
Knowledge of *how* the trial-completion marker is stored (key name, sentinel values, the BE-too-old check) was leaking across engine, resume, tests and scripts. Centralized into a single private module ``opik.evaluation._completion_marker`` exposing four entry points:
- ``initial_metadata()``: marker seed for ``TraceData(metadata=...)``
- ``completed_metadata()``: happy-path-only mutation
- ``is_trial_fully_completed()``: predicate read at resume time
- ``ensure_backend_supports_marker()``: raise if BE doesn't project trace_metadata

Callers (``engine/engine.py``, ``resume/context.py``, ``resume/merge.py``, tests, verifier script) now go through this module and never reference the underlying key directly.

Also drops ``requires_completion_marker`` from ``ResumableState``: it was always ``True`` for this SDK version, so the explicit boolean was YAGNI. Any resumable experiment was created by an SDK that writes the marker; ``ensure_backend_supports_marker`` runs unconditionally on resume.
Replaces the trace-metadata marker with a cleaner contract: the engine sets ``trace.output`` only when the trial's happy-path-only line ran (task + scoring + score-logging all returned). The context-manager ``finally`` strips ``output`` back to ``None`` when the line didn't run, so a persisted trace's output presence is the resume signal: no separate marker, no BE projection needed.

Engine changes:
- ``helpers.evaluate_llm_task_context`` now yields a small ``EvaluationContextState``; its ``finally`` strips ``trace_data.output`` if ``state.evaluation_completed`` stayed ``False``.
- ``engine.engine`` flips that flag on the happy line right after ``_compute_test_result_for_test_case`` returns. The ``update_current_trace(output=...)`` call stays where it is so the agentic judge keeps seeing the output during scoring; the strip happens after scoring.

Resume changes:
- ``resume.context.is_trial_fully_completed`` is now ``item.evaluation_task_output is not None``.
- BE-too-old detection (``BackendTooOldForResume`` + ``ensure_backend_supports_marker``) deleted, because the predicate uses a field that's universally projected.
- ``trace_metadata`` dropped from ``ExperimentItemContent``; the BE projection itself will be handled separately.

Other call sites audited; ``evaluate_experiment``'s log message for items with no stored output is now a debug-level message describing expected behavior, not an "Unexpected error" alarm. Tests + verifier + manual scripts updated to reflect the new contract.
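The output-as-marker contract can be sketched as follows. This is a simplified model, not the engine code: `TraceData` here is a hypothetical stand-in with a single field, and `run_trial` is an invented helper. The names `evaluate_llm_task_context`, `EvaluationContextState` and `evaluation_completed` are taken from the commit message.

```python
import contextlib
import dataclasses
from typing import Any, Callable, Dict, Iterator, Optional


@dataclasses.dataclass
class TraceData:
    """Hypothetical stand-in for the SDK's trace record."""
    output: Optional[Dict[str, Any]] = None


@dataclasses.dataclass
class EvaluationContextState:
    evaluation_completed: bool = False


@contextlib.contextmanager
def evaluate_llm_task_context(trace_data: TraceData) -> Iterator[EvaluationContextState]:
    state = EvaluationContextState()
    try:
        yield state
    finally:
        # Runs for Exception and BaseException alike: if the happy-path line
        # was never reached, the persisted trace must not carry an output.
        if not state.evaluation_completed:
            trace_data.output = None


def run_trial(
    trace_data: TraceData,
    task: Callable[[], Dict[str, Any]],
    score: Callable[[Dict[str, Any]], None],
) -> None:
    with evaluate_llm_task_context(trace_data) as state:
        trace_data.output = task()   # output is visible while scoring runs
        score(trace_data.output)     # may raise, including SystemExit
        state.evaluation_completed = True  # happy-path-only line


def is_trial_fully_completed(trace_data: TraceData) -> bool:
    return trace_data.output is not None
```

Under this model a trial whose scoring step raises ends up with `output` set back to `None`, so the resume predicate treats it as pending without any extra metadata field.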
``evaluate()`` writes ``_opik_resume`` into the experiment config so ``evaluate_resume`` can recover the iteration. The verifier already filters out ``prompt`` and ``prompts``; ``_opik_resume`` belongs in the same SDK-managed bucket. Empty-after-strip metadata is also coerced back to ``None`` so tests that didn't supply ``experiment_config`` keep their existing expectations. Pre-existing failure surfaced when running the full e2e eval suite; unrelated to the marker / output design.
def compute_experiment_scores(
    experiment_scoring_functions: List["ExperimentScoreFunction"],
    test_results: List[test_result.TestResult],
) -> List[score_result.ScoreResult]:
Should we rename compute_experiment_scores and merge_resume_results to private helpers (_...) since they’re only used internally and not re-exported, to avoid widening the public SDK API?
Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
sdks/python/src/opik/evaluation/evaluation_result.py around lines 18-21, the
module-level function `compute_experiment_scores` is currently a public symbol even
though it’s only used internally by `opik.evaluation.evaluator`. Rename it to
`_compute_experiment_scores` and update all in-repo call sites/imports accordingly,
ensuring there’s no re-export from `opik.evaluation.__init__` (and remove any
`__all__` entries if present). Similarly, in the same file around lines 45-51, rename
`merge_resume_results` to `_merge_resume_results` and update internal references so the
SDK public API isn’t widened.
Skipping in this PR — compute_experiment_scores already exists pre-resume (in evaluator.py) and was just extracted to its own module; renaming it now would be a wider rename across the SDK. merge_resume_results is new and only consumed internally, so the rename is fair but would be a drive-by here. Happy to follow up with a name-cleanup PR.
🤖 Reply posted via /address-github-pr-comments
Thanks, that makes sense — I’ll treat extraction-only moves differently from new internal APIs when suggesting renames. I’ll save this to memory once the PR is merged.
merged_test_results = previous_test_results + list(new_result.test_results)

merged_experiment_scores = compute_experiment_scores(
    experiment_scoring_functions=experiment_scoring_functions,
    test_results=merged_test_results,
)
if merged_experiment_scores:
    context.experiment.log_experiment_scores(score_results=merged_experiment_scores)
merge_resume_results repeats the compute-and-log flow from _evaluate_task, should we extract a shared helper so compute_experiment_scores / experiment.log_experiment_scores stay centralized?
Skipping — the "duplication" is three lines (compute → if truthy → log); a shared helper would be more code than the duplication and would obscure the per-call control flow (resume reasons about merged-vs-pending scores; _evaluate_task just logs the fresh-eval result). Per the SDK code-style rule we avoid helpers for single-use operations.
🤖 Reply posted via /address-github-pr-comments
Commit 5b33aca addressed this comment by extracting compute_experiment_scores and having merge_resume_results call it before logging via experiment.log_experiment_scores. That centralizes the compute-and-log flow instead of duplicating it inline.
Reverts the eager `list(...)` materialization that the resume feature added to `helpers.resolve_dataset_items` and re-introduces the `Tuple[Iterator, Optional[int]]` shape main used before this branch. Resume's local-checkpoint requirement is preserved by passing `resolved_ids` to `write_checkpoint_if_needed` directly: the explicit-ids path already knows the ids up front, and the sampler path materializes once internally anyway (samplers operate on the full list). The streaming path (no sampler / no explicit ids) is now lazy end-to-end again, so `test_evaluate_streaming.py` is green. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…, narrower test catches

- evaluator._materialize_for_checkpoint: when both ``dataset_sampler`` and ``dataset_item_ids`` are passed, the checkpoint must reflect what the engine actually iterated (the post-sampler subset), not the raw input ids. Sampler now takes precedence; otherwise resume would replay a different item set than the original eval.
- evaluation_result.merge_resume_results: drop the ``new_result.experiment_scores`` fallback when merged scoring returns no results. Returning slice-only aggregates while advertising whole-experiment coverage was misleading.
- tests/e2e/.../test_evaluate_resume.py: narrow ``except BaseException`` to ``except SystemExit`` so unrelated ``KeyboardInterrupt`` / ``GeneratorExit`` aren't silently swallowed.
- tests/unit/.../test_evaluate_resume.py: convert E731 lambda-assignment to a ``def`` so the linter passes on the file end-to-end.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
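The sampler-precedence rule for the checkpoint might look roughly like the sketch below. The function name `ids_for_checkpoint`, the `all_ids` argument, and the callable-based `Sampler` type are assumptions for illustration; the real ``_materialize_for_checkpoint`` signature is not shown in this conversation.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical sampler shape: takes candidate ids, returns the chosen subset.
Sampler = Callable[[List[str]], List[str]]


def ids_for_checkpoint(
    all_ids: Sequence[str],
    dataset_item_ids: Optional[Sequence[str]] = None,
    dataset_sampler: Optional[Sampler] = None,
) -> Optional[List[str]]:
    if dataset_item_ids is None and dataset_sampler is None:
        # Streaming path: nothing to snapshot, iteration is reproducible
        # from server-side state alone.
        return None
    if dataset_item_ids is not None:
        wanted = set(dataset_item_ids)
        candidates = [item_id for item_id in all_ids if item_id in wanted]
    else:
        candidates = list(all_ids)
    if dataset_sampler is not None:
        # Sampler output wins: the checkpoint records what was iterated,
        # not the raw ids the caller passed in.
        return dataset_sampler(candidates)
    return candidates
```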
`reconstruct_previous_test_results` called `is_trial_fully_completed` as a gate, but mypy could not see through the function call to narrow `evaluation_task_output` from `Optional[Dict]` to `Dict` for the `TestCase(task_output=...)` construction below it. Replaces the helper call with the inline `output is None` test against a local variable; the local then carries the narrowed type into `TestCase(...)`. Semantics are identical — the helper's body is the same one-line predicate. The helper is still used elsewhere (`context.py` / `rest_operations.py`). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
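The narrowing issue can be reproduced in isolation. In this sketch the two dataclasses are trimmed, hypothetical stand-ins for the SDK types; the point is only that a type checker narrows `Optional[Dict]` through an inline `is None` test on a local variable, but not through a call to a plain `bool`-returning helper.

```python
import dataclasses
from typing import Any, Dict, List, Optional


@dataclasses.dataclass
class ExperimentItemContent:
    """Trimmed, hypothetical stand-in for the SDK type."""
    dataset_item_id: str
    evaluation_task_output: Optional[Dict[str, Any]]


@dataclasses.dataclass
class TestCase:
    """Trimmed, hypothetical stand-in; task_output is non-optional."""
    dataset_item_id: str
    task_output: Dict[str, Any]


def is_trial_fully_completed(item: ExperimentItemContent) -> bool:
    # Same one-line predicate, but a plain bool return type tells the type
    # checker nothing about the attribute at the call site.
    return item.evaluation_task_output is not None


def reconstruct(items: List[ExperimentItemContent]) -> List[TestCase]:
    test_cases: List[TestCase] = []
    for item in items:
        task_output = item.evaluation_task_output
        if task_output is None:
            continue
        # Here task_output is narrowed to Dict[str, Any], so the TestCase
        # constructor type-checks without a cast.
        test_cases.append(
            TestCase(dataset_item_id=item.dataset_item_id, task_output=task_output)
        )
    return test_cases
```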
task_output = experiment_item_content.evaluation_task_output
if task_output is None:
    continue
if (
    experiment_item_content.dataset_item_id
This block duplicates the dataset-item lookup/skip logic from get_experiment_test_cases(), should we extract a shared TestCase helper/factory so missing items and absent evaluation_task_output stay handled consistently?
Trials of the same item are independent: when an item had N expected runs and only K completed, resume now replays the missing (N - K) runs instead of redoing all N. The completed K runs are reconstructed from the backend alongside fully-completed items, so the merged ``EvaluationResult`` carries the original outputs untouched.

Mechanics:
- ``iteration.remaining_runs_for_item`` returns ``max(0, expected - completed)`` instead of returning the full ``expected`` for any partial item.
- ``merge.reconstruct_previous_test_results`` no longer gates on ``fully_completed_dataset_item_ids``. Every backend experiment item whose ``evaluation_task_output`` is set reconstructs as a ``TestResult``, including completed runs of items that still have missing runs to replay.
- ``evaluator.evaluate_resume`` snapshots ``reconstruct_previous_test_results`` **before** ``_evaluate_task`` writes new experiment items; otherwise the resume's own freshly written trials would be reconstructed back into the merge and double-counted. ``merge_resume_results`` now takes the snapshot as a parameter instead of computing it.

Tests:
- Updated ``test_iteration``, ``test_merge`` and ``test_evaluate_resume`` (unit) to the new semantics.
- Updated the two affected e2e scenarios (``test_evaluate_resume__trial_count__partial_item_replays_only_missing_runs`` and ``test_evaluate_resume__mixed_partial_and_fully_completed_items``) to assert "only missing runs replay" instead of "redo all".

All 552 evaluation unit tests pass; all 13 e2e resume tests pass against a local backend.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
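A minimal sketch of the missing-runs-only bookkeeping is given below. ``remaining_runs_for_item`` and its ``max(0, expected - completed)`` body come from the commit message; the `pending_runs` iterator and its arguments are hypothetical and only illustrate how the per-item counts could be derived from the list of completed runs.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def remaining_runs_for_item(expected: int, completed: int) -> int:
    # Replay only what is missing; never go negative if the backend holds
    # more completed runs than expected.
    return max(0, expected - completed)


def pending_runs(
    item_ids: Iterable[str],
    runs_per_item: int,
    completed_item_ids: Iterable[str],
) -> Iterator[Tuple[str, int]]:
    # completed_item_ids has one entry per completed run, so an item that
    # finished twice appears twice.
    completed = Counter(completed_item_ids)
    for item_id in item_ids:
        missing = remaining_runs_for_item(runs_per_item, completed[item_id])
        if missing:
            yield item_id, missing


if __name__ == "__main__":
    done = ["a", "a", "a", "b"]  # item a finished 3 of 3, item b 1 of 3
    print(list(pending_runs(["a", "b", "c"], 3, done)))
```

With three expected runs per item, item `a` is skipped, item `b` gets two replays, and item `c` gets all three.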
…ments; reflect missing-runs-only semantics
- Terminology: replace SDK-internal 'trial' with the user-facing 'run'
vocabulary used in evaluation/concepts.mdx ("each item is run N
times", "a run passes if...").
- "When you can resume" section: drop the JSON-blob / experiment_config
internals; keep the two requirements the user actually needs to know
(resume-aware SDK + versioned dataset) and point at the
ExperimentNotResumable error for everything else.
- "What gets replayed vs reconstructed" multi-run paragraph: replace
the old "redo all if any trial is missing" rule with the new
"replay only the missing runs" semantics that ship with #6941.
- "Wrong tool" bullets: tighten wording (existing-run → existing
experiment; original run → original evaluation) so 'run' consistently
means a single execution.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Ruff flagged these as F401 in CI; the imports were leftover from an earlier draft of the file and are never referenced. No behavior change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… against ingestion lag These tests assert that each experiment item carries the full set of feedback scores (regular metrics + task_span metrics). The verifier ``verifiers.verify_experiment`` already polls the experiment-level aggregate to convergence, but per-item scores can land a beat later, especially when task-span scoring writes its second batch via ``client.log_traces_feedback_scores`` after the trace has already been emitted. Direct read-back without polling produced occasional ``len(item.feedback_scores) == K`` failures with K - 1 observed — half the expected count, classic ingestion-lag race. Adds a small ``_wait_for_per_item_feedback_scores`` helper that re-fetches the experiment items until every item reaches the expected count, bounded by ``max_try_seconds`` so a real regression still fails. Routes the four affected test bodies through it. No SDK code change; this is a CI hardening fix only. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
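A bounded polling helper of the kind described here could be sketched as below. The `fetch_score_counts` callable, the `poll_interval_seconds` parameter, and the use of `TimeoutError` are assumptions; the commit message names only the helper and its ``max_try_seconds`` bound.

```python
import time
from typing import Callable, List, Sequence


def wait_for_per_item_feedback_scores(
    fetch_score_counts: Callable[[], Sequence[int]],
    expected_count: int,
    max_try_seconds: float = 10.0,
    poll_interval_seconds: float = 0.1,
) -> List[int]:
    """Re-fetch per-item score counts until every item has expected_count."""
    deadline = time.monotonic() + max_try_seconds
    while True:
        counts = list(fetch_score_counts())
        if counts and all(count == expected_count for count in counts):
            return counts
        if time.monotonic() >= deadline:
            # A real regression still fails, just later than a direct read.
            raise TimeoutError(
                f"per-item feedback scores did not converge: {counts}"
            )
        time.sleep(poll_interval_seconds)
```

Bounding the loop by a deadline rather than a retry count keeps the worst-case test duration predictable regardless of how slow each fetch is.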
``vertex_ai/gemini-2.0-flash`` returns ``NotFoundError`` from the test GCP project — the model has been retired or removed from the project's allow-list. The rest of the test suite (ADK e2e, ``llm_constants``) already standardized on ``gemini-2.5-flash`` via ``tests/llm_constants.GEMINI_FLASH``; pin this one outlier the same way. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
#4 verifiers.verify_experiment_items_completed: the initial ``get_experiment_by_id`` was called once outside the poll, so a 404 from eventual consistency (or a transient ApiError) aborted on the first hit and the cached handle was reused across the polling loop. Move the lookup inside the polling lambda and pass ``allow_errors=True``, which matches the pattern used by every other polling verifier in this file.

#5 New ``test_engine_helpers.py`` unit-tests the resume-completion marker on ``evaluate_llm_task_context``: happy path preserves ``output``; flag-never-set strips it to ``None``; exception path strips AND captures ``error_info``; the yielded state object is the ``EvaluationContextState`` dataclass.

#6-#10 Restore concrete ``experiment_config`` assertions across five ``evaluate_prompt`` tests (``test_evaluate.py`` × 3, ``test_evaluate_experiment_name.py`` × 2). The prior assertions pinned ``{'prompt_template': [...], 'model': 'gpt-3.5-turbo'}``; this branch had collapsed them to ``mock.ANY``, so the prompt-template / model auto-population contract was no longer covered anywhere. Drilling into the captured kwargs rather than asserting whole-dict equality keeps the resume blob (also in the dict) out of the way.

#11 ``test_reconstructed_test_case_carries_stored_output_and_dataset_content``: pin ``trial_id == 0``. The hard-code in ``reconstruct_previous_test_results`` is intentional (REST payload doesn't carry trial index) but was unverified by tests.

#12 ``examples/resume_evaluation.py``: ``opik_client.flush()`` after the caught crash. ``_evaluate_task`` re-raises before reaching its own flush, so the demo's stage-3 ``completed_count`` could under-count the partial state. Sibling ``resume_evaluation_interactive.py`` already does this.

#13 ``examples/resume_evaluation.py``: unique-per-process ``EXPERIMENT_NAME`` suffix (uuid4 fragment) + tighten the lookup assertion to ``len(experiments) == 1`` and pick ``[0]`` instead of ``[-1]``. ``get_experiments_by_name`` is a case-insensitive substring search, so prior demo runs (or any same-prefix experiment) used to silently match and ``[-1]`` could select the wrong one.

All 557 unit tests pass; pre-commit clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CRA finding #2: ``_evaluate_task`` computes the experiment-level aggregate from the freshly-replayed slice and immediately calls ``experiment.log_experiment_scores(...)`` (line 617). On the resume path, ``merge_resume_results`` then recomputes the aggregate from ``previous + new`` and overwrites the same field. Between the two writes the backend advertises the slice-only mean as the whole-experiment score; a concurrent reader / crash / 429 in between leaves the backend stuck on the slice-only view. Fix: ``evaluate_resume`` passes ``experiment_scoring_functions=[]`` into ``_evaluate_task`` so the inner compute-and-log is skipped on the resume path. ``merge_resume_results`` does the only write, with the real merged aggregate. Also drops the ``if not previous_test_results: return new_result`` short-circuit in ``merge_resume_results`` — that path used to return ``new_result`` unchanged, but now ``new_result.experiment_scores`` is ``[]`` by construction. The merge-time compute has to run even when there are no prior runs, so the user's ``experiment_scoring_functions`` is still applied (over the new-only test results, which happen to be the whole experiment in that case). Non-resume paths (``evaluate``, ``evaluate_prompt``, ``evaluate_optimization_trial``) keep their original ``_evaluate_task`` behavior unchanged. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ment-score compute+log to evaluate_resume

``merge_resume_results`` is now responsible for one thing: folding ``previous_test_results`` into ``new_result`` and returning a merged ``EvaluationResult``. No backend calls, no experiment-score recomputation, no ``context`` parameter. ``experiment_scores`` on the returned object is empty by construction.

``evaluate_resume`` now:
- forwards the user's ``experiment_scoring_functions`` to ``_evaluate_task`` (so the inner slice-only compute-and-log runs again, as on non-resume paths);
- calls the pure ``merge_resume_results``;
- recomputes the experiment-level aggregate over the merged test_results and logs it, overwriting the slice-only write.

This means the backend transiently holds the slice-only aggregate between the two writes. That is explicitly accepted: rate-limit / concurrent-read risk is negligible, and the separation of concerns is more valuable than avoiding the redundant write.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
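The difference between the slice-only aggregate and the merged aggregate can be shown with a small worked example. Test results are reduced to plain floats here and both functions are simplified stand-ins that only share names with the SDK helpers; the real ones operate on ``TestResult`` and ``ScoreResult`` objects.

```python
from statistics import mean
from typing import Callable, List, Sequence

# Hypothetical shape: an experiment-level scoring function maps all per-run
# scores to one aggregate value.
ExperimentScoreFunction = Callable[[Sequence[float]], float]


def merge_resume_results(previous: Sequence[float], new: Sequence[float]) -> List[float]:
    # Pure merge: no backend calls, no score recomputation.
    return list(previous) + list(new)


def compute_experiment_scores(
    functions: Sequence[ExperimentScoreFunction],
    test_results: Sequence[float],
) -> List[float]:
    if not test_results:
        return []
    return [fn(test_results) for fn in functions]


previous_scores = [1.0, 1.0, 1.0, 0.0]  # runs reconstructed from the backend
new_scores = [0.0, 0.0]                 # runs replayed by the resume call

# First write: what _evaluate_task would log from the replayed slice alone.
slice_only = compute_experiment_scores([mean], new_scores)
# Second write: the aggregate over the whole experiment, which overwrites it.
merged = compute_experiment_scores(
    [mean], merge_resume_results(previous_scores, new_scores)
)
```

In this example the slice-only mean is 0.0 while the whole-experiment mean is 0.5, which is the transient discrepancy the commit message accepts between the two writes.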
…etrics Existing ``resume_start.py`` / ``resume_continue.py`` scripts use a hand-rolled ``_FlakyMetric`` / ``_HealthyAccuracyMetric`` pair (custom ``BaseMetric`` subclasses) — useful for testing the marker's failure modes but not a clean demo of how a real user would invoke the feature. The new ``resume_demo_start.py`` + ``resume_demo_continue.py`` use the built-in heuristic metrics (``metrics.Equals`` + ``metrics.Contains``). The dataset is a small sentiment-classification toy with a deterministic classifier (no LLM needed). The start script crashes on item #6 to leave a partial state; the continue script resumes from the printed experiment id with the **same** metric set and a healthy task. Why the metric list must match: ``evaluate_resume`` persists the iteration knobs (dataset version, filter, nb_samples, default trial count) in the experiment record, but it cannot persist live Python metric objects — the caller has to re-supply them. Mismatched metrics between phase 1 and phase 2 would leave the merged result advertising feedback scores under one name on the reconstructed runs and under another name on the freshly replayed runs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ommitted These two scripts (added in 0053301) are meant to stay local to the author's checkout for ad-hoc demoing. They were committed by mistake. The repo already has the ``resume_start.py`` / ``resume_continue.py`` pair for the test-harness flow and ``examples/resume_evaluation.py`` for the canonical example.
…-evaluate-resume # Conflicts: # sdks/python/tests/library_integration/openai/agents_tests/test_opik_tracing_processor.py
…#6950)

* [OPIK-5269] [DOCS] docs: add page on resuming interrupted evaluations

Documents ``opik.evaluate_resume(experiment_id, ...)``: when it applies, the replayed-vs-reconstructed contract, what happens with custom samplers / explicit ``dataset_item_ids``, and which existing tools to reach for when resume isn't the right fit (``evaluate_experiment`` for re-scoring; a fresh ``evaluate()`` for new items). The page lands under Evaluation → Advanced, between Datasets & Experiments and Manage datasets, a placement that matches when a user is most likely to need it. Pairs with the SDK feature shipping in #6941.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(resume): narrow scope note to evaluate() only

Drop the parenthetical mention of evaluate_prompt and evaluate_optimization_trial; this keeps the note focused on the entrypoint users actually reach for.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(resume): use canonical 'run' vocabulary; simplify resume requirements; reflect missing-runs-only semantics

- Terminology: replace SDK-internal 'trial' with the user-facing 'run' vocabulary used in evaluation/concepts.mdx ("each item is run N times", "a run passes if...").
- "When you can resume" section: drop the JSON-blob / experiment_config internals; keep the two requirements the user actually needs to know (resume-aware SDK + versioned dataset) and point at the ExperimentNotResumable error for everything else.
- "What gets replayed vs reconstructed" multi-run paragraph: replace the old "redo all if any trial is missing" rule with the new "replay only the missing runs" semantics that ship with #6941.
- "Wrong tool" bullets: tighten wording (existing-run → existing experiment; original run → original evaluation) so 'run' consistently means a single execution.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(resume): make the page scannable; cut implementation chatter

Reorders the page around what a user wants to scan in order: what it is → quick start → what it preserves vs replays → requirements → same-machine caveat → when it's the wrong tool → reference.

Drops:
- The detailed "outcome of the original run → what resume does" table (replaced with three short bullets).
- The "engine writes the trace's `output` only at the end of the happy path" implementation note; users don't need to know the marker mechanism.
- The "JSON blob in experiment_config / resume reads it back" plumbing.
- The two-script "Putting it together" section (the quick start already shows the call).
- The local-checkpoint code block and the long explanation of why a sampler needs one, replaced with two short sentences.

Result: 80 lines vs 164. Same content, faster to scan.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(resume): move same-machine caveat below 'when it's the wrong tool'

It's a niche caveat (only the sampler / explicit-ids paths hit it), so it belongs after the broader "wrong tool" decision points rather than in the main reading flow.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(resume): widen task-change caveat to metrics; link evaluate_experiment; drop Reference section

- Third 'wrong tool' bullet now also covers metrics: providing the same ``task`` and ``scoring_metrics`` between calls is the caller's responsibility. The same already-completed-runs-keep-their-original-outputs warning applies to both.
- Link to the Python SDK reference page for evaluate_experiment so users have a single click to learn the alternative.
- Drop the Reference section at the bottom; the function and exception names already appear inline where users see them.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(resume): reorder so guidance comes first, requirements + same-machine caveat at the bottom

New flow: intro → quick start → what resume does → when it's the wrong tool → requirements (now including the same-machine sampler caveat as a follow-up paragraph in the same section). Decision-help moves up, gotchas move down.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Details
Adds `opik.evaluate_resume(experiment_id, task, ...)` so a long-running evaluation that was interrupted (Ctrl-C, OOM, network blip, crash mid-scoring) can be continued from where it stopped, replaying only the trials that did not complete and merging them with the trials that did.

Public API
Returns the same `EvaluationResult` shape as `evaluate()`, but `test_results` spans both reconstructed prior trials and the trials this call executed. The original `evaluate(...)` call signs the experiment with a JSON resume blob in `experiment_config`; `evaluate_resume` reads it back to reproduce the exact iteration (pinned dataset version, filter, `nb_samples`, per-item trial counts).

If the original `evaluate(...)` used a custom `dataset_sampler` or explicit `dataset_item_ids`, the SDK also writes a local checkpoint of the resolved item ids next to the experiment id, because those cases cannot be reproduced from server-side state alone.

Design highlights
Completion detection: output as marker. A trial counts as complete iff `trace.output` is set. The engine sets `trace.output` only after the happy-path-only line runs (task + scoring + score-logging all returned cleanly); any failure mode that prevents reaching it (sync exception, `BaseException` escaping a metric, `KeyboardInterrupt` between task and score-log) leaves the trace with `output = None`. Resume replays those. Catches a gap the old design had: scoring crashes used to land with `output` set but no scores, indistinguishable from full completion.

Resume state lives in `experiment_config`. Serialized as a single JSON string under one key so the Configuration UI doesn't flatten every field as a separate row. Includes schema version, default `runs_per_item`, `dataset_filter_string`, pinned `dataset_version_name`, `nb_samples`, and a `requires_local_checkpoint` flag.

Dataset iteration stays lazy. No-sampler / no-explicit-ids `evaluate()` calls stream items into the engine as they arrive from the backend, same as `main`. The local-checkpoint path takes `resolved_ids` directly without consuming the iterator. Sampler path still materializes (samplers can't operate on a stream), unchanged from `main`.

Backward-compatible state reads. Pydantic models in `resume/state.py` use `extra="allow"` so a newer experiment blob doesn't break an older SDK reader. Missing `dataset_version_name` → downgrade to non-resumable rather than silently iterate against a moving HEAD.

What this PR does NOT do
`resume/iteration.py::remaining_runs_for_item`.

Change checklist
`opik.evaluate_resume` is a new public entrypoint.

Issues
AI-WATERMARK
AI-WATERMARK: yes
`resume_start.py` + `resume_continue.py` flows on a local backend, REST round-trip + UI inspection, full unit + e2e suite green.

Testing
Unit (`tests/unit/evaluation/`)
- `resume/test_state.py`, `resume/test_context.py`, `resume/test_iteration.py`, `resume/test_merge.py`, `resume/test_integration.py`.
- `test_evaluate_resume.py` exercises happy-path, no-pending, and item-resolution branches against a mocked engine.
- `pytest tests/unit/ --ignore=tests/unit/evaluation/metrics` (metrics excluded only due to an unrelated optional `rouge-score` dep on this machine).

End-to-end (`tests/e2e/evaluation/`)
- `BaseException` failures (the case output-as-marker exists for), and mixed-failure scenarios.
- `tests/e2e/evaluation/test_evaluate_streaming.py` passes against a local backend (it was red on the branch before the streaming-restore commit in this PR).
- `verify_experiment_items_completed` uses the same `is_trial_fully_completed` predicate the SDK uses.

Manual flows
- `resume_start.py` produces a 10-item experiment with planned crashes on items 3 (task-side) and 7 (scoring-side).
- `resume_continue.py` calls `evaluate_resume(...)` against the failed experiment; it yields 12 rows total (10 original + 2 fresh replays), 10 with `output` set and 2 kept as failure history.

Documentation
`engine/helpers.py::EvaluationContextState`, `resume/context.py::is_trial_fully_completed`, `resume/iteration.py::remaining_runs_for_item`, and `helpers.py::resolve_dataset_items`.
`output = None` on the trace level, but the task's `@opik.track` span still preserves what the function returned, so per-span debugging is unaffected.
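
To make the output-as-marker rule and the per-item bookkeeping concrete, here is a minimal standalone sketch. It borrows the names `is_trial_fully_completed` and `remaining_runs_for_item` from this PR, but the signatures, the `StoredTrial` record, and the `pending_runs` helper are assumptions made for illustration; the actual SDK code may differ. It assumes the "replay only the trials that did not complete" behavior described above and mirrors the manual flow (10 items, crashes on items 3 and 7).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List, Optional


@dataclass(frozen=True)
class StoredTrial:
    """Hypothetical stand-in for one stored trial trace; not an Opik type."""

    dataset_item_id: str
    output: Optional[Dict[str, Any]]  # None means the happy path never finished


def is_trial_fully_completed(trial: StoredTrial) -> bool:
    # Output-as-marker: a trial is complete only if its trace output was written.
    return trial.output is not None


def remaining_runs_for_item(expected_runs: int, trials: Iterable[StoredTrial]) -> int:
    # Assumed semantics: only the missing runs are replayed, never fewer than zero.
    completed = sum(1 for trial in trials if is_trial_fully_completed(trial))
    return max(expected_runs - completed, 0)


def pending_runs(
    item_ids: List[str], trials: List[StoredTrial], runs_per_item: int = 1
) -> Dict[str, int]:
    # Group stored trials by dataset item, then keep items that still owe runs.
    by_item: Dict[str, List[StoredTrial]] = defaultdict(list)
    for trial in trials:
        by_item[trial.dataset_item_id].append(trial)

    pending: Dict[str, int] = {}
    for item_id in item_ids:
        missing = remaining_runs_for_item(runs_per_item, by_item[item_id])
        if missing:
            pending[item_id] = missing
    return pending


# Mirror of the manual flow: 10 items, with items 3 and 7 left without output.
item_ids = [f"item-{n}" for n in range(1, 11)]
trials = [
    StoredTrial(item_id, None if item_id in {"item-3", "item-7"} else {"label": "ok"})
    for item_id in item_ids
]

print(pending_runs(item_ids, trials))  # {'item-3': 1, 'item-7': 1}
```

In this sketch the two failed trials stay in storage as failure history and only two fresh runs are scheduled, which matches the 12-row total (10 original + 2 replays) reported for the manual flow.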