
[OPIK-5269] [SDK] feat: add evaluate_resume to continue interrupted evaluations #6941

Merged
alexkuzmik merged 29 commits into main from aliaksandrk/OPIK-5269-evaluate-resume
Jun 9, 2026


alexkuzmik commented Jun 1, 2026

Collaborator

Details

Adds opik.evaluate_resume(experiment_id, task, ...) so a long-running evaluation that was interrupted (Ctrl-C, OOM, network blip, crash mid-scoring) can be continued from where it stopped — replaying only the trials that did not complete, and merging them with the trials that did.

Public API

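import opik
from opik.evaluation.metrics import Equals

# my_task: the task callable to run for the trials that still need replaying.
result = opik.evaluate_resume(
    experiment_id="...",
    task=my_task,
    scoring_metrics=[Equals()],
)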

Returns the same EvaluationResult shape as evaluate() — but test_results spans both reconstructed prior trials and the trials this call executed. The original evaluate(...) call signs the experiment with a JSON resume blob in experiment_config; evaluate_resume reads it back to reproduce the exact iteration (pinned dataset version, filter, nb_samples, per-item trial counts).

If the original evaluate(...) used a custom dataset_sampler or explicit dataset_item_ids, the SDK also writes a local checkpoint of the resolved item ids next to the experiment id — those cases cannot be reproduced from server-side state alone.

Design highlights

Completion detection: output as marker. A trial counts as complete iff trace.output is set. The engine sets trace.output only after the happy-path-only line runs (task, scoring, and score-logging all returned cleanly). Any failure that prevents reaching that line (a sync exception, a BaseException escaping a metric, a KeyboardInterrupt between task and score-log) leaves the trace with output = None, and resume replays those trials. This closes a gap in the earlier design, where a scoring crash landed with output set but no scores and was indistinguishable from full completion.

Resume state lives in experiment_config. Serialized as a single JSON string under one key so the Configuration UI doesn't flatten every field as a separate row. Includes schema version, default runs_per_item, dataset_filter_string, pinned dataset_version_name, nb_samples, and a requires_local_checkpoint flag.
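
A minimal sketch of the single-key encoding, for illustration only. The `_opik_resume` key name comes from the commit notes in this PR; the helper names `embed_resume_state` and `read_resume_blob`, and all field values, are assumptions rather than the SDK's actual code.

```python
import json
from typing import Any, Dict, Optional

RESUME_KEY = "_opik_resume"  # key name taken from the commit notes in this PR


def embed_resume_state(experiment_config: Dict[str, Any], state: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the config with the state stored as ONE JSON string."""
    merged = dict(experiment_config)
    merged[RESUME_KEY] = json.dumps(state, sort_keys=True)
    return merged


def read_resume_blob(experiment_config: Optional[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Decode the blob; anything that is not a JSON-object string counts as 'no state'."""
    raw = (experiment_config or {}).get(RESUME_KEY)
    if not isinstance(raw, str):  # a raw dict value is deliberately not accepted
        return None
    try:
        decoded = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return decoded if isinstance(decoded, dict) else None


# Hypothetical field values, for illustration only.
config = embed_resume_state(
    {"model": "my-model"},
    {
        "schema_version": 1,
        "default_runs_per_item": 2,
        "dataset_filter_string": None,
        "dataset_version_name": "v3",
        "nb_samples": None,
        "requires_local_checkpoint": False,
    },
)
```

Because the whole state sits under one string-valued key, a UI that lists one row per top-level config key would show a single extra row here rather than one row per field.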

Dataset iteration stays lazy. No-sampler / no-explicit-ids evaluate() calls stream items into the engine as they arrive from the backend, same as main. The local-checkpoint path takes resolved_ids directly without consuming the iterator. Sampler path still materializes (samplers can't operate on a stream) — unchanged from main.

Backward-compatible state reads. Pydantic models in resume/state.py use extra="allow" so a newer experiment blob doesn't break an older SDK reader. Missing dataset_version_name → downgrade to non-resumable rather than silently iterate against a moving HEAD.
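
To illustrate the read-side behaviour, here is a sketch that swaps Pydantic for stdlib dataclasses so it runs without dependencies. Unlike `extra="allow"`, which keeps unknown fields, this version simply drops them. `parse_state`, the `reason` field, and the field defaults are hypothetical; only the `ResumableState` / `NonResumableState` names come from the PR.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional, Union


@dataclass(frozen=True)
class ResumableState:
    dataset_version_name: str
    default_runs_per_item: int = 1
    nb_samples: Optional[int] = None
    dataset_filter_string: Optional[str] = None
    requires_local_checkpoint: bool = False


@dataclass(frozen=True)
class NonResumableState:
    reason: str


def parse_state(blob: Dict[str, Any]) -> Union[ResumableState, NonResumableState]:
    # No pinned version: refuse to resume rather than iterate a moving HEAD.
    if not blob.get("dataset_version_name"):
        return NonResumableState(reason="dataset_version_name is missing")
    known = {f.name for f in fields(ResumableState)}
    # Keys written by a newer SDK are ignored instead of raising.
    return ResumableState(**{k: v for k, v in blob.items() if k in known})


newer_blob = {"dataset_version_name": "v3", "default_runs_per_item": 2, "future_field": "x"}
state = parse_state(newer_blob)
unpinned = parse_state({"nb_samples": 5})
```

Returning a distinct `NonResumableState` value, rather than raising during parsing, lets the caller decide where and how to report the refusal.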

What this PR does NOT do

  • It does not deduplicate or repair trial state that landed in a partial state on the backend — the merge is "fully-completed trials only", and partial items get all trials redone. The all-or-nothing rationale is documented inline at resume/iteration.py::remaining_runs_for_item.
  • It does not add UI surface for resume. Resume is SDK-only in this PR.

Change checklist

  • User facing — opik.evaluate_resume is a new public entrypoint.
  • Documentation update — follow-up once the API stabilizes.

Issues

  • OPIK-5269

AI-WATERMARK

AI-WATERMARK: yes

  • Tools: Claude Code
  • Model(s): Claude Opus 4.7
  • Scope: assisted (design exploration, engine/resume wiring, e2e verification, tests)
  • Human verification: manual resume_start.py + resume_continue.py flows on a local backend, REST round-trip + UI inspection, full unit + e2e suite green.

Testing

Unit (tests/unit/evaluation/)

  • Resume orchestration: resume/test_state.py, resume/test_context.py, resume/test_iteration.py, resume/test_merge.py, resume/test_integration.py.
  • Top-level test_evaluate_resume.py exercises happy-path, no-pending, and item-resolution branches against a mocked engine.
  • All 3605 unit tests pass (pytest tests/unit/ --ignore=tests/unit/evaluation/metrics — metrics excluded only due to an unrelated optional rouge-score dep on this machine).

End-to-end (tests/e2e/evaluation/)

  • Resume e2e tests cover happy path, task-side failures, scoring-side BaseException failures (the case output-as-marker exists for), and mixed-failure scenarios.
  • tests/e2e/evaluation/test_evaluate_streaming.py passes against a local backend (it was red on the branch before the streaming-restore commit in this PR).
  • Verifier verify_experiment_items_completed uses the same is_trial_fully_completed predicate the SDK uses.

Manual flows

  • resume_start.py produces a 10-item experiment with planned crashes on items 3 (task-side) and 7 (scoring-side).
  • resume_continue.py calls evaluate_resume(...) against the failed experiment — yields 12 rows total (10 original + 2 fresh replays), 10 with output set and 2 kept as failure history.

Documentation

  • Inline rationale at engine/helpers.py::EvaluationContextState, resume/context.py::is_trial_fully_completed, resume/iteration.py::remaining_runs_for_item, and helpers.py::resolve_dataset_items.
  • Trace UI semantics: a failed-scoring trial has output = None on the trace level, but the task's @opik.track span still preserves what the function returned, so per-span debugging is unaffected.

alexkuzmik and others added 4 commits May 29, 2026 15:43
…valuations

Adds opik.evaluate_resume(experiment_id, task, ...) so an interrupted
evaluate() run can be picked up where it left off without re-processing
items that already completed.

Key behaviors:

- Resume state (default_runs_per_item, dataset_filter_string, nb_samples,
  pinned dataset_version_name) is embedded in experiment_config at evaluate()
  time and read back on resume. Persistence stores only small reproducible
  configs — never resolved data lists.
- Iteration always runs against the pinned DatasetVersion the original call
  saw, never a moving Dataset HEAD. Experiments created without a pinned
  version are marked non-resumable at write time and refused at read time.
- Sampler / explicit-ids cases snapshot the resolved item ids to a local
  checkpoint (~/.opik/resume/<experiment_id>.json). Resume from a machine
  without the checkpoint raises LocalCheckpointMissing.
- Trial bookkeeping is all-or-nothing: an item with completed < expected
  trials gets every trial redone, so the merged result never mixes outputs
  from the buggy original task and the fixed resume task.
- The returned EvaluationResult is the full experiment: fully-completed
  items are reconstructed (read-only) from their stored feedback scores and
  concatenated with the freshly-executed slice. experiment_scoring_functions
  run over the union.
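
A rough sketch of the local-checkpoint behaviour from the list above. The on-disk layout (a `dataset_item_ids` key) and the function names are assumptions; the exception class is a stand-in for the SDK's `LocalCheckpointMissing`, and a temporary directory replaces `~/.opik/resume/` so the example has no side effects on a real home directory.

```python
import json
import tempfile
from pathlib import Path
from typing import List


class LocalCheckpointMissing(Exception):
    """Stand-in for the SDK exception of the same name."""


def checkpoint_path(root: Path, experiment_id: str) -> Path:
    return root / f"{experiment_id}.json"


def write_checkpoint(root: Path, experiment_id: str, item_ids: List[str]) -> Path:
    """Persist the resolved item ids, preserving their order."""
    root.mkdir(parents=True, exist_ok=True)
    path = checkpoint_path(root, experiment_id)
    path.write_text(json.dumps({"dataset_item_ids": item_ids}))
    return path


def read_checkpoint(root: Path, experiment_id: str) -> List[str]:
    path = checkpoint_path(root, experiment_id)
    if not path.exists():
        # Mirrors "resume from a machine without the checkpoint".
        raise LocalCheckpointMissing(f"no checkpoint at {path}")
    return json.loads(path.read_text())["dataset_item_ids"]


demo_root = Path(tempfile.mkdtemp()) / "resume"
write_checkpoint(demo_root, "exp-123", ["id-b", "id-a"])
restored = read_checkpoint(demo_root, "exp-123")
```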

Modular structure under sdks/python/src/opik/evaluation/resume/:
  state.py        ResumableState / NonResumableState sum type + persistence
  checkpoint.py   local ~/.opik/resume/<id>.json file I/O
  iteration.py    expected_runs / remaining_runs / pending iterator
  context.py      ResumeContext + prepare_resume_context orchestrator
  integration.py  evaluator-facing glue (state embedding + checkpoint write)
  merge.py        reconstruct_previous_test_results (read-only)

Exceptions ExperimentNotResumable + LocalCheckpointMissing live in
opik.exceptions (subclass OpikException).

Includes a runnable example at sdks/python/examples/resume_evaluation.py
that demonstrates an interrupted run followed by a resume against a
20-item sentiment dataset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Conflict in evaluator.py::_evaluate_test_suite_task — our side hoisted item
resolution out of the function, main side added local-emulator activation
+ scoring_tool_strategy override. Both kept: function still takes a
resolved `items: List[DatasetItem]` (ours) and wraps the engine call in
the emulator/strategy machinery from main.

Also: dropped reserved `id` field from the resume_evaluation.py example —
crash is now triggered off the (unique) review text instead. The local
helper that returned a set of completed dataset_item_ids became
``completed_count`` (count is what the script actually reports).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… in e2e

- state.py: _opik_resume value is a single JSON-encoded string under one
  metadata key. Stops the experiment Configuration UI from listing seven
  rows (_opik_resume.dataset_filter_string, .dataset_version_name, etc.).
- state._read_raw_resume_state: dict-form value is no longer accepted;
  the schema introduced in this PR has always been the string form, so
  there is no legacy to support.
- tests: route blob metadata through _metadata_with_blob(dict) helpers
  that JSON-encode the input. Added a test pinning that a raw-dict value
  is treated as no resume state.
- e2e: switch 5 tests off literal string ids ("item-0", "the-item") to
  UUID ids paired with stable labels carried in input.text. The backend
  requires UUIDs for dataset_item.id.
- e2e: fix mixed_partial_and_fully_completed_items assumption — the
  engine drains every submitted trial before re-raising, so a crash on
  item-1's 2nd trial does NOT stop item-2's trials. Test now asserts the
  actual final shape (item-1 partial; item-0 and item-2 fully completed).

All 547 unit tests + 10 resume e2e tests pass against production.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the ``evaluation_task_output is not None`` predicate that
``evaluate_resume`` used to decide which trials are fully completed.
The old rule misses every failure mode where the task succeeded
(output written to the trace) but scoring did not finish — synchronous
exception escaping the per-metric handler, ``log_test_result_feedback_scores``
raising, ``KeyboardInterrupt`` mid-scoring, or any other ``BaseException``
that lets the outer ``finally`` write the trace.

The engine now seeds ``trace.metadata['_opik_evaluation_pending'] = True``
when the trace is built, and flips it to ``False`` on a happy-path-only
line after ``_compute_test_result_for_test_case`` returns. Any failure
that prevents reaching that line leaves the marker at its default. Resume
counts only trials whose persisted trace metadata carries the cleared
marker.

``ExperimentItemContent`` carries the new ``trace_metadata`` field that
the backend exposes on the experiment-item comparison join. Resume reads
the marker from there through one round trip — no per-trial trace fetch.

Includes:
* engine + resume marker plumbing and a new ``is_trial_fully_completed``
  predicate used by both ``context.py`` and ``merge.py``.
* Default ``metadata=ANY`` on the test ``TraceModel`` so existing trace-
  shape assertions don't break against the seeded marker.
* Unit coverage for the four marker states (cleared / pending / absent /
  unrelated key) and the count function.
* ``examples/resume_evaluation_interactive.py`` rewritten so it can
  simulate task-side and scoring-side failures (the scoring case uses
  ``SystemExit`` to escape the per-metric ``except Exception`` handler).
  Mode picker via ``OPIK_DEMO_FAILURE_MODE``.
@alexkuzmik alexkuzmik requested a review from a team as a code owner June 1, 2026 14:10
github-actions Bot added the python, tests, and Python SDK labels Jun 1, 2026
Comment thread sdks/python/src/opik/evaluation/resume/context.py Outdated
Comment thread sdks/python/src/opik/evaluation/resume/merge.py
Comment thread sdks/python/src/opik/evaluation/helpers.py Outdated
github-actions Bot commented Jun 1, 2026

Contributor

Python SDK Compatibility V1 E2E Tests Results (Python 3.14)

92 tests  ±0   92 ✅ ±0   2m 20s ⏱️ -2s
 1 suites ±0    0 💤 ±0 
 1 files   ±0    0 ❌ ±0 

Results for commit d5207e9. ± Comparison against base commit 73c1d15.

♻️ This comment has been updated with latest results.

The interactive resume demo (sdks/python/examples/resume_evaluation_interactive.py) was useful for local end-to-end verification but doesn't belong in the reviewable surface of this PR.
github-actions Bot commented Jun 1, 2026

Contributor

Python SDK Compatibility V1 E2E Tests Results (Python 3.11)

92 tests   92 ✅  2m 21s ⏱️
 1 suites   0 💤
 1 files     0 ❌

Results for commit d5207e9.

♻️ This comment has been updated with latest results.

github-actions Bot commented Jun 1, 2026

Contributor

Python SDK Compatibility V1 E2E Tests Results (Python 3.12)

92 tests   92 ✅  2m 27s ⏱️
 1 suites   0 💤
 1 files     0 ❌

Results for commit d5207e9.

♻️ This comment has been updated with latest results.

github-actions Bot commented Jun 1, 2026

Contributor

Python SDK Compatibility V1 E2E Tests Results (Python 3.13)

92 tests   92 ✅  2m 6s ⏱️
 1 suites   0 💤
 1 files     0 ❌

Results for commit d5207e9.

♻️ This comment has been updated with latest results.

github-actions Bot commented Jun 1, 2026

Contributor

Python SDK Compatibility V1 E2E Tests Results (Python 3.10)

92 tests  ±0   92 ✅ ±0   2m 24s ⏱️ +8s
 1 suites ±0    0 💤 ±0 
 1 files   ±0    0 ❌ ±0 

Results for commit d5207e9. ± Comparison against base commit 73c1d15.

♻️ This comment has been updated with latest results.

github-actions Bot commented Jun 1, 2026

Contributor

Python SDK E2E Tests Results (Python 3.12)

280 tests   277 ✅  5m 19s ⏱️
  1 suites    3 💤
  1 files      0 ❌

Results for commit d5207e9.

♻️ This comment has been updated with latest results.

github-actions Bot commented Jun 1, 2026

Contributor

Python SDK E2E Tests Results (Python 3.10)

280 tests  +13   277 ✅ +13   4m 13s ⏱️ -2s
  1 suites ± 0     3 💤 ± 0 
  1 files   ± 0     0 ❌ ± 0 

Results for commit 4be27ac. ± Comparison against base commit 73c1d15.

♻️ This comment has been updated with latest results.

github-actions Bot commented Jun 1, 2026

Contributor

Python SDK E2E Tests Results (Python 3.13)

281 tests   279 ✅  4m 30s ⏱️
  1 suites    2 💤
  1 files      0 ❌

Results for commit 8c6f7fc.

♻️ This comment has been updated with latest results.

github-actions Bot commented Jun 1, 2026

Contributor

Python SDK E2E Tests Results (Python 3.11)

0 tests   0 ✅  0s ⏱️
0 suites  0 💤
0 files    0 ❌

Results for commit 9e8ca26.

♻️ This comment has been updated with latest results.

github-actions Bot commented Jun 1, 2026

Contributor

Python SDK E2E Tests Results (Python 3.14)

281 tests  +14   279 ✅ +15   5m 40s ⏱️ + 1m 20s
  1 suites ± 0     2 💤  -  1 
  1 files   ± 0     0 ❌ ± 0 

Results for commit 0a3507b. ± Comparison against base commit 73c1d15.

♻️ This comment has been updated with latest results.

Three new e2e tests for the marker-based completion predicate:

- ``test_evaluate_resume__scoring_crash_after_task_success__trial_replayed``
  — the case the marker exists for: task succeeds (output set), metric
  raises ``BaseException`` mid-scoring (escapes the per-metric
  ``except Exception``), trial is recorded with marker=True and resume
  replays it.
- ``test_evaluate_resume__metric_scoring_failed_inside_loop__not_replayed``
  — regression-guard the inverse: when a metric raises a regular
  ``Exception`` (caught by the engine and converted to
  ``scoring_failed=True``), the scoring loop still finishes and the
  marker is flipped to False. Resume must NOT replay even though the
  stored feedback score is failed.
- ``test_evaluate_resume__mixed_task_and_scoring_failures__only_failed_items_replayed``
  — combined coverage: one task-failed item, one scoring-failed item,
  one all-good item. Verifies the marker alone distinguishes the three.

Also updates ``verifiers.verify_experiment_items_completed`` to use the
marker (via ``resume.context.is_trial_fully_completed``) as the source
of truth instead of ``evaluation_task_output is not None``. The old
predicate happened to agree on task-side failures but disagreed on
scoring-side failures — the new helper matches what
``evaluate_resume`` actually does.
Comment thread sdks/python/tests/e2e/evaluation/test_evaluate_resume.py
…ace_metadata

Adds a ``requires_completion_marker`` field to the persisted
``ResumableState`` blob. ``evaluate`` always writes ``True`` (this SDK
seeds the marker on every trace). ``evaluate_resume`` reads it back;
when set, it verifies the connected backend actually surfaces
``trace_metadata`` on the experiment-item compare response — if none
of the items carry any trace metadata, it raises the new
``BackendTooOldForResume`` exception with an actionable message
pointing at the OPIK-5269 BE projection.

Without the check, an old backend would return ``trace_metadata=None``
on every item, the marker predicate would treat all trials as
incomplete, and resume would silently replay everything.

Tests cover:
* round-trip of the new field
* raise on (marker required + empty trace_metadata across items)
* no raise when at least one item carries metadata
* no raise when the experiment is empty or the persisted blob doesn't
  require the marker
Knowledge of *how* the trial-completion marker is stored (key name,
sentinel values, the BE-too-old check) was leaking across engine,
resume, tests and scripts. Centralized into a single private module
``opik.evaluation._completion_marker`` exposing four entry points:

  - ``initial_metadata()``           — marker seed for ``TraceData(metadata=...)``
  - ``completed_metadata()``         — happy-path-only mutation
  - ``is_trial_fully_completed()``   — predicate read at resume time
  - ``ensure_backend_supports_marker()`` — raise if BE doesn't project trace_metadata

Callers (``engine/engine.py``, ``resume/context.py``,
``resume/merge.py``, tests, verifier script) now go through this
module and never reference the underlying key directly.

Also drops ``requires_completion_marker`` from ``ResumableState``: it
was always ``True`` for this SDK version, so the explicit boolean was
YAGNI. Any resumable experiment was created by an SDK that writes the
marker; ``ensure_backend_supports_marker`` runs unconditionally on
resume.
Replaces the trace-metadata marker with a cleaner contract: the
engine sets ``trace.output`` only when the trial's happy-path-only
line ran (task + scoring + score-logging all returned). The
context-manager ``finally`` strips ``output`` back to ``None`` when
the line didn't run, so a persisted trace's output presence is the
resume signal — no separate marker, no BE projection needed.

Engine changes:
- ``helpers.evaluate_llm_task_context`` now yields a small
  ``EvaluationContextState``; its ``finally`` strips
  ``trace_data.output`` if ``state.evaluation_completed`` stayed
  ``False``.
- ``engine.engine`` flips that flag on the happy line right after
  ``_compute_test_result_for_test_case`` returns. The
  ``update_current_trace(output=...)`` call stays where it is so the
  agentic judge keeps seeing the output during scoring; the strip
  happens after scoring.

Resume changes:
- ``resume.context.is_trial_fully_completed`` is now
  ``item.evaluation_task_output is not None``.
- BE-too-old detection (``BackendTooOldForResume`` +
  ``ensure_backend_supports_marker``) deleted — the predicate uses a
  field that's universally projected.
- ``trace_metadata`` dropped from ``ExperimentItemContent``; the BE
  projection itself will be handled separately.

Other call sites audited; ``evaluate_experiment``'s log message for
items with no stored output is now a debug-level message describing
expected behavior, not an "Unexpected error" alarm.

Tests + verifier + manual scripts updated to reflect the new
contract.
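
The output-as-marker contract from this commit can be sketched roughly as follows. `TraceData` and `run_trial` are simplified stand-ins invented for the example; only the `EvaluationContextState` name and its `evaluation_completed` flag come from the commit message, and the real engine wiring is more involved.

```python
import contextlib
from dataclasses import dataclass
from typing import Any, Dict, Iterator, Optional


@dataclass
class TraceData:  # simplified stand-in for the SDK's trace object
    output: Optional[Dict[str, Any]] = None


@dataclass
class EvaluationContextState:
    evaluation_completed: bool = False


@contextlib.contextmanager
def evaluation_context(trace: TraceData) -> Iterator[EvaluationContextState]:
    state = EvaluationContextState()
    try:
        yield state
    finally:
        # Runs on success, Exception, and BaseException alike.
        if not state.evaluation_completed:
            trace.output = None


def run_trial(trace: TraceData, scoring_crashes: bool) -> None:
    with evaluation_context(trace) as state:
        trace.output = {"answer": "42"}  # task finished; scoring can see the output
        if scoring_crashes:
            raise SystemExit("scoring died")  # escapes `except Exception` handlers
        state.evaluation_completed = True  # happy-path-only line


ok, crashed = TraceData(), TraceData()
run_trial(ok, scoring_crashes=False)
try:
    run_trial(crashed, scoring_crashes=True)
except SystemExit:
    pass
```

After both calls, `ok.output` is still set while `crashed.output` has been stripped back to `None`, which is the signal resume reads.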
Comment thread sdks/python/src/opik/evaluation/rest_operations.py
``evaluate()`` writes ``_opik_resume`` into the experiment config so
``evaluate_resume`` can recover the iteration. The verifier already
filters out ``prompt`` and ``prompts``; ``_opik_resume`` belongs in
the same SDK-managed bucket. Empty-after-strip metadata is also
coerced back to ``None`` so tests that didn't supply
``experiment_config`` keep their existing expectations.

Pre-existing failure surfaced when running the full e2e eval suite;
unrelated to the marker / output design.
Comment on lines +18 to +21
def compute_experiment_scores(
    experiment_scoring_functions: List["ExperimentScoreFunction"],
    test_results: List[test_result.TestResult],
) -> List[score_result.ScoreResult]:

Contributor

Should we rename compute_experiment_scores and merge_resume_results to private helpers (_...) since they’re only used internally and not re-exported, to avoid widening the public SDK API?

Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
sdks/python/src/opik/evaluation/evaluation_result.py around lines 18-21, the
module-level function `compute_experiment_scores` is currently a public symbol even
though it’s only used internally by `opik.evaluation.evaluator`. Rename it to
`_compute_experiment_scores` and update all in-repo call sites/imports accordingly,
ensuring there’s no re-export from `opik.evaluation.__init__` (and remove any
`__all__` entries if present). Similarly, in the same file around lines 45-51, rename
`merge_resume_results` to `_merge_resume_results` and update internal references so the
SDK public API isn’t widened.

Collaborator Author

Skipping in this PR — compute_experiment_scores already exists pre-resume (in evaluator.py) and was just extracted to its own module; renaming it now would be a wider rename across the SDK. merge_resume_results is new and only consumed internally, so the rename is fair but would be a drive-by here. Happy to follow up with a name-cleanup PR.

🤖 Reply posted via /address-github-pr-comments

Contributor

Thanks, that makes sense — I’ll treat extraction-only moves differently from new internal APIs when suggesting renames. I’ll save this to memory once the PR is merged.

Comment thread sdks/python/src/opik/evaluation/evaluation_result.py
Comment on lines +69 to +77
merged_test_results = previous_test_results + list(new_result.test_results)

merged_experiment_scores = compute_experiment_scores(
    experiment_scoring_functions=experiment_scoring_functions,
    test_results=merged_test_results,
)
if merged_experiment_scores:
    context.experiment.log_experiment_scores(score_results=merged_experiment_scores)

Contributor

merge_resume_results repeats the compute-and-log flow from _evaluate_task, should we extract a shared helper so compute_experiment_scores / experiment.log_experiment_scores stay centralized?

Collaborator Author

Skipping — the "duplication" is three lines (compute → if truthy → log); a shared helper would be more code than the duplication and would obscure the per-call control flow (resume reasons about merged-vs-pending scores; _evaluate_task just logs the fresh-eval result). Per the SDK code-style rule we avoid helpers for single-use operations.

🤖 Reply posted via /address-github-pr-comments

Contributor

Commit 5b33aca addressed this comment by extracting compute_experiment_scores and having merge_resume_results call it before logging via experiment.log_experiment_scores. That centralizes the compute-and-log flow instead of duplicating it inline.

Comment thread sdks/python/src/opik/evaluation/evaluation_result.py Outdated
Reverts the eager `list(...)` materialization that the resume feature
added to `helpers.resolve_dataset_items` and re-introduces the
`Tuple[Iterator, Optional[int]]` shape main used before this branch.

Resume's local-checkpoint requirement is preserved by passing
`resolved_ids` to `write_checkpoint_if_needed` directly: the explicit-ids
path already knows the ids up front, and the sampler path materializes
once internally anyway (samplers operate on the full list). The
streaming path (no sampler / no explicit ids) is now lazy end-to-end
again, so `test_evaluate_streaming.py` is green.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
alexkuzmik changed the title from "[OPIK-5269] [SDK] feat: detect incomplete evaluation trials via trace output presence on resume" to "[OPIK-5269] [SDK] feat: add evaluate_resume to continue interrupted evaluations" Jun 2, 2026
Comment thread sdks/python/src/opik/evaluation/evaluator.py
Comment thread sdks/python/src/opik/evaluation/evaluator.py Outdated
Comment thread sdks/python/src/opik/evaluation/evaluator.py
alexkuzmik and others added 2 commits June 2, 2026 16:26
…, narrower test catches

- evaluator._materialize_for_checkpoint: when both ``dataset_sampler``
  and ``dataset_item_ids`` are passed, the checkpoint must reflect what
  the engine actually iterated (the post-sampler subset), not the raw
  input ids. Sampler now takes precedence — otherwise resume would
  replay a different item set than the original eval.
- evaluation_result.merge_resume_results: drop the
  ``new_result.experiment_scores`` fallback when merged scoring returns
  no results. Returning slice-only aggregates while advertising
  whole-experiment coverage was misleading.
- tests/e2e/.../test_evaluate_resume.py: narrow ``except BaseException``
  to ``except SystemExit`` so unrelated ``KeyboardInterrupt`` /
  ``GeneratorExit`` aren't silently swallowed.
- tests/unit/.../test_evaluate_resume.py: convert E731 lambda-assignment
  to a ``def`` so the linter passes on the file end-to-end.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`reconstruct_previous_test_results` called `is_trial_fully_completed`
as a gate, but mypy could not see through the function call to narrow
`evaluation_task_output` from `Optional[Dict]` to `Dict` for the
`TestCase(task_output=...)` construction below it. Replaces the helper
call with the inline `output is None` test against a local variable;
the local then carries the narrowed type into `TestCase(...)`.

Semantics are identical — the helper's body is the same one-line
predicate. The helper is still used elsewhere (`context.py` /
`rest_operations.py`).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
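
The narrowing issue described in this commit can be reproduced in miniature. `ItemContent` and `collect_outputs` are hypothetical stand-ins; the point is only that a type checker narrows a local variable tested inline, but not an attribute checked inside a helper that returns `bool`.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ItemContent:  # hypothetical stand-in for ExperimentItemContent
    evaluation_task_output: Optional[Dict[str, Any]]


def is_trial_fully_completed(item: ItemContent) -> bool:
    # Same one-line predicate, but opaque to the type checker at call sites.
    return item.evaluation_task_output is not None


def collect_outputs(items: List[ItemContent]) -> List[Dict[str, Any]]:
    outputs: List[Dict[str, Any]] = []
    for item in items:
        # Bind a local and test it inline so the narrowed type carries forward.
        task_output = item.evaluation_task_output
        if task_output is None:
            continue
        outputs.append(task_output)  # seen as Dict[str, Any] here
    return outputs


sample = [ItemContent({"a": 1}), ItemContent(None)]
outputs = collect_outputs(sample)
```

At runtime the inline test and the helper agree on every item; the difference is purely in what the type checker can prove.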
Comment on lines +62 to +66
task_output = experiment_item_content.evaluation_task_output
if task_output is None:
    continue
if (
    experiment_item_content.dataset_item_id

Contributor

This block duplicates the dataset-item lookup/skip logic from get_experiment_test_cases(), should we extract a shared TestCase helper/factory so missing items and absent evaluation_task_output stay handled consistently?


Trials of the same item are independent: when an item had N expected
runs and only K completed, resume now replays the missing (N - K) runs
instead of redoing all N. The completed K runs are reconstructed from
the backend alongside fully-completed items, so the merged
``EvaluationResult`` carries the original outputs untouched.

Mechanics:

- ``iteration.remaining_runs_for_item`` returns ``max(0, expected -
  completed)`` instead of returning the full ``expected`` for any
  partial item.
- ``merge.reconstruct_previous_test_results`` no longer gates on
  ``fully_completed_dataset_item_ids``. Every backend experiment item
  whose ``evaluation_task_output`` is set reconstructs as a
  ``TestResult`` — including completed runs of items that still have
  missing runs to replay.
- ``evaluator.evaluate_resume`` snapshots
  ``reconstruct_previous_test_results`` **before** ``_evaluate_task``
  writes new experiment items; otherwise the resume's own freshly
  written trials would be reconstructed back into the merge and
  double-counted. ``merge_resume_results`` now takes the snapshot as a
  parameter instead of computing it.
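
As a small illustration of the missing-runs-only rule: `remaining_runs_for_item` follows the `max(0, expected - completed)` formula quoted above, while `pending_runs` and its arguments are hypothetical helpers written for this sketch.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def remaining_runs_for_item(expected: int, completed: int) -> int:
    # Never negative, even if the backend holds more completed runs than expected.
    return max(0, expected - completed)


def pending_runs(
    item_ids: Iterable[str],
    completed_item_ids: Iterable[str],
    runs_per_item: int,
) -> Iterator[Tuple[str, int]]:
    """Yield (item_id, runs_to_replay) for items that still have missing runs."""
    completed = Counter(completed_item_ids)  # one entry per completed run
    for item_id in item_ids:
        remaining = remaining_runs_for_item(runs_per_item, completed[item_id])
        if remaining:
            yield item_id, remaining


# "a" has 3 completed runs, "b" has 1, "c" has none; each item expects 2.
plan = list(pending_runs(["a", "b", "c"], ["a", "a", "b", "a"], runs_per_item=2))
```

Here `plan` contains one missing run for "b" and two for "c"; "a" is skipped because it already has at least the expected number of completed runs.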

Tests:

- Updated ``test_iteration``, ``test_merge`` and ``test_evaluate_resume``
  (unit) to the new semantics.
- Updated the two affected e2e scenarios
  (``test_evaluate_resume__trial_count__partial_item_replays_only_missing_runs``
  and ``test_evaluate_resume__mixed_partial_and_fully_completed_items``)
  to assert "only missing runs replay" instead of "redo all".

All 552 evaluation unit tests pass; all 13 e2e resume tests pass against
a local backend.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
alexkuzmik added a commit that referenced this pull request Jun 2, 2026
…ments; reflect missing-runs-only semantics

- Terminology: replace SDK-internal 'trial' with the user-facing 'run'
  vocabulary used in evaluation/concepts.mdx ("each item is run N
  times", "a run passes if...").
- "When you can resume" section: drop the JSON-blob / experiment_config
  internals; keep the two requirements the user actually needs to know
  (resume-aware SDK + versioned dataset) and point at the
  ExperimentNotResumable error for everything else.
- "What gets replayed vs reconstructed" multi-run paragraph: replace
  the old "redo all if any trial is missing" rule with the new
  "replay only the missing runs" semantics that ship with #6941.
- "Wrong tool" bullets: tighten wording (existing-run → existing
  experiment; original run → original evaluation) so 'run' consistently
  means a single execution.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
alexkuzmik and others added 8 commits June 3, 2026 00:52
Ruff flagged these as F401 in CI; the imports were leftover from an
earlier draft of the file and are never referenced. No behavior change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… against ingestion lag

These tests assert that each experiment item carries the full set of
feedback scores (regular metrics + task_span metrics). The verifier
``verifiers.verify_experiment`` already polls the experiment-level
aggregate to convergence, but per-item scores can land a beat later,
especially when task-span scoring writes its second batch via
``client.log_traces_feedback_scores`` after the trace has already been
emitted. Direct read-back without polling produced occasional
``len(item.feedback_scores) == K`` failures with K / 2 observed, i.e. half
the expected count: a classic ingestion-lag race.

Adds a small ``_wait_for_per_item_feedback_scores`` helper that
re-fetches the experiment items until every item reaches the expected
count, bounded by ``max_try_seconds`` so a real regression still fails.
Routes the four affected test bodies through it.
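
A bounded polling helper of this kind might look like the sketch below. `wait_until`, its parameters other than `max_try_seconds`, and the simulated backend reads are all assumptions for illustration; the actual `_wait_for_per_item_feedback_scores` helper is not shown in this conversation.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_until(
    fetch: Callable[[], T],
    is_ready: Callable[[T], bool],
    max_try_seconds: float = 10.0,
    interval_seconds: float = 0.05,
) -> T:
    """Re-fetch until is_ready(value) holds or the time budget runs out."""
    deadline = time.monotonic() + max_try_seconds
    while True:
        value = fetch()
        if is_ready(value):
            return value
        if time.monotonic() >= deadline:
            # A real regression still fails, just later and with context.
            raise AssertionError(
                f"condition not met within {max_try_seconds}s; last value: {value!r}"
            )
        time.sleep(interval_seconds)


# Simulated backend: per-item score counts converge on the third read.
reads = iter([[1, 2], [2, 1], [2, 2]])
final_counts = wait_until(
    fetch=lambda: next(reads),
    is_ready=lambda counts: all(count == 2 for count in counts),
)
```

Using `time.monotonic()` for the deadline keeps the bound unaffected by wall-clock adjustments during the test run.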

No SDK code change; this is a CI hardening fix only.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
``vertex_ai/gemini-2.0-flash`` returns ``NotFoundError`` from the test
GCP project — the model has been retired or removed from the project's
allow-list. The rest of the test suite (ADK e2e, ``llm_constants``)
already standardized on ``gemini-2.5-flash`` via
``tests/llm_constants.GEMINI_FLASH``; pin this one outlier the same way.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
#4 verifiers.verify_experiment_items_completed: the initial
   ``get_experiment_by_id`` was called once outside the poll, so a 404
   from eventual consistency (or a transient ApiError) aborted on the
   first hit and the cached handle was reused across the polling loop.
   Move the lookup inside the polling lambda and pass
   ``allow_errors=True`` — matches the pattern used by every other
   polling verifier in this file.

#5 New ``test_engine_helpers.py`` unit-tests the resume-completion
   marker on ``evaluate_llm_task_context``: happy path preserves
   ``output``; flag-never-set strips it to ``None``; exception path
   strips AND captures ``error_info``; the yielded state object is the
   ``EvaluationContextState`` dataclass.
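
   For readers unfamiliar with the marker, the three behaviours under test can be sketched with a small context manager. This is an illustration only: ``ContextState``, ``FakeTrace``, and ``completion_marker`` are hypothetical stand-ins, not the real ``evaluate_llm_task_context`` or ``EvaluationContextState``.

```python
import contextlib
import dataclasses
from typing import Any, Dict, Iterator, Optional


@dataclasses.dataclass
class ContextState:  # hypothetical stand-in for the yielded state object
    completed: bool = False


@dataclasses.dataclass
class FakeTrace:  # hypothetical stand-in for a trace record
    output: Optional[Dict[str, Any]] = None
    error_info: Optional[str] = None


@contextlib.contextmanager
def completion_marker(trace: FakeTrace) -> Iterator[ContextState]:
    """Keep `trace.output` only if the body flags itself as completed."""
    state = ContextState()
    try:
        yield state
    except BaseException as exc:
        # Exception path: record the error, then let it propagate.
        trace.error_info = f"{type(exc).__name__}: {exc}"
        raise
    finally:
        # Flag never set (or an exception escaped): strip the output so
        # the run reads as incomplete.
        if not state.completed:
            trace.output = None


trace = FakeTrace()
with completion_marker(trace) as state:
    trace.output = {"answer": "4"}  # task + scoring would happen here
    state.completed = True          # last line of the happy path
```

   Setting the flag as the final statement of the body is what makes a non-``None`` output usable as the completion signal.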

#6-#10 Restore concrete ``experiment_config`` assertions across five
   ``evaluate_prompt`` tests (``test_evaluate.py`` × 3,
   ``test_evaluate_experiment_name.py`` × 2). The prior assertions
   pinned ``{'prompt_template': [...], 'model': 'gpt-3.5-turbo'}``;
   this branch had collapsed them to ``mock.ANY``, so the
   prompt-template / model auto-population contract was no longer
   covered anywhere. Drilling into the captured kwargs rather than
   asserting whole-dict equality keeps the resume blob (also in the
   dict) out of the way.

#11 ``test_reconstructed_test_case_carries_stored_output_and_dataset_content``:
   pin ``trial_id == 0``. The hard-code in ``reconstruct_previous_test_results``
   is intentional (REST payload doesn't carry trial index) but was
   unverified by tests.

#12 ``examples/resume_evaluation.py``: ``opik_client.flush()`` after
    the caught crash. ``_evaluate_task`` re-raises before reaching its
    own flush, so the demo's stage-3 ``completed_count`` could
    under-count the partial state. Sibling
    ``resume_evaluation_interactive.py`` already does this.

#13 ``examples/resume_evaluation.py``: unique-per-process
    ``EXPERIMENT_NAME`` suffix (uuid4 fragment) + tighten the lookup
    assertion to ``len(experiments) == 1`` and pick ``[0]`` instead of
    ``[-1]``. ``get_experiments_by_name`` is a case-insensitive
    substring search, so prior demo runs (or any same-prefix
    experiment) used to silently match and ``[-1]`` could select the
    wrong one.
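
    The failure mode and the fix can be reproduced without a backend. In the sketch below, ``search_by_name`` is a hypothetical stand-in that mimics a case-insensitive substring search; it is not the client's ``get_experiments_by_name``.

```python
import uuid
from typing import List


def search_by_name(all_names: List[str], query: str) -> List[str]:
    """Hypothetical stand-in: case-insensitive substring match."""
    return [name for name in all_names if query.lower() in name.lower()]


def unique_experiment_name(prefix: str) -> str:
    """Append a uuid4 fragment so each process gets its own name."""
    return f"{prefix}-{uuid.uuid4().hex[:8]}"


# Names left behind by earlier demo runs, plus an unrelated experiment.
existing = ["resume-demo", "Resume-Demo-old", "unrelated"]

name = unique_experiment_name("resume-demo")
existing.append(name)

# Searching by the bare prefix also matches the stale experiments.
loose_matches = search_by_name(existing, "resume-demo")

# Searching by the unique name matches exactly one experiment.
exact_matches = search_by_name(existing, name)
assert len(exact_matches) == 1
experiment_name = exact_matches[0]
```

    Asserting on the match count turns a silent wrong pick into a loud failure if the uniqueness assumption ever breaks.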

All 557 unit tests pass; pre-commit clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CRA finding #2: ``_evaluate_task`` computes the experiment-level
aggregate from the freshly-replayed slice and immediately calls
``experiment.log_experiment_scores(...)`` (line 617). On the resume
path, ``merge_resume_results`` then recomputes the aggregate from
``previous + new`` and overwrites the same field. Between the two
writes the backend advertises the slice-only mean as the
whole-experiment score; a concurrent reader / crash / 429 in between
leaves the backend stuck on the slice-only view.

Fix: ``evaluate_resume`` passes ``experiment_scoring_functions=[]``
into ``_evaluate_task`` so the inner compute-and-log is skipped on the
resume path. ``merge_resume_results`` does the only write, with the
real merged aggregate.

Also drops the ``if not previous_test_results: return new_result``
short-circuit in ``merge_resume_results`` — that path used to return
``new_result`` unchanged, but now ``new_result.experiment_scores`` is
``[]`` by construction. The merge-time compute has to run even when
there are no prior runs, so the user's ``experiment_scoring_functions``
is still applied (over the new-only test results, which happen to be
the whole experiment in that case).

Non-resume paths (``evaluate``, ``evaluate_prompt``,
``evaluate_optimization_trial``) keep their original
``_evaluate_task`` behavior unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Comment thread sdks/python/src/opik/evaluation/evaluation_result.py Outdated
alexkuzmik and others added 2 commits June 8, 2026 13:11
…ment-score compute+log to evaluate_resume

``merge_resume_results`` is now responsible for one thing — folding
``previous_test_results`` into ``new_result`` and returning a merged
``EvaluationResult``. No backend calls, no experiment-score
recomputation, no ``context`` parameter. ``experiment_scores`` on the
returned object is empty by construction.

``evaluate_resume`` now:
- forwards the user's ``experiment_scoring_functions`` to
  ``_evaluate_task`` (so the inner slice-only compute-and-log runs
  again, as on non-resume paths);
- calls the pure ``merge_resume_results``;
- recomputes the experiment-level aggregate over the merged
  test_results and logs it, overwriting the slice-only write.

This means the backend transiently holds the slice-only aggregate
between the two writes — explicitly accepted: rate-limit / concurrent-
read risk is negligible, and the separation of concerns is more
valuable than avoiding the redundant write.
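
A rough sketch of that split, assuming a mean-based aggregate. ``RunResult``, ``merge_results``, and ``mean_score`` are hypothetical names used for illustration and do not mirror the SDK's ``EvaluationResult`` or ``merge_resume_results`` signatures:

```python
import dataclasses
from typing import Dict, List


@dataclasses.dataclass(frozen=True)
class RunResult:  # hypothetical stand-in for one test result
    item_id: str
    run_index: int
    scores: Dict[str, float]


def merge_results(previous: List[RunResult], new: List[RunResult]) -> List[RunResult]:
    """Pure fold: no backend calls and no aggregate computation."""
    return sorted(previous + new, key=lambda r: (r.item_id, r.run_index))


def mean_score(results: List[RunResult], name: str) -> float:
    """Experiment-level aggregate, computed by the caller after merging."""
    values = [r.scores[name] for r in results if name in r.scores]
    if not values:
        raise ValueError(f"no results carry a score named {name!r}")
    return sum(values) / len(values)


previous = [RunResult("a", 0, {"equals": 1.0}),
            RunResult("b", 0, {"equals": 1.0}),
            RunResult("c", 0, {"equals": 1.0})]
new = [RunResult("d", 0, {"equals": 0.0})]

slice_only = mean_score(new, "equals")      # the transient first write
merged = merge_results(previous, new)
whole = mean_score(merged, "equals")        # the overwriting second write
```

The gap between ``slice_only`` and ``whole`` is the transient state that the commit accepts in exchange for a merge step with no side effects.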

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…etrics

Existing ``resume_start.py`` / ``resume_continue.py`` scripts use a
hand-rolled ``_FlakyMetric`` / ``_HealthyAccuracyMetric`` pair (custom
``BaseMetric`` subclasses) — useful for testing the marker's failure
modes but not a clean demo of how a real user would invoke the
feature.

The new ``resume_demo_start.py`` + ``resume_demo_continue.py`` use the
built-in heuristic metrics (``metrics.Equals`` + ``metrics.Contains``).
The dataset is a small sentiment-classification toy with a
deterministic classifier (no LLM needed). The start script crashes on
item #6 to leave a partial state; the continue script resumes from
the printed experiment id with the **same** metric set and a healthy
task.

Why the metric list must match: the original ``evaluate`` call persists
the iteration knobs (dataset version, filter, nb_samples, default trial
count) in the experiment record and ``evaluate_resume`` reads them back,
but live Python metric objects cannot be persisted, so the caller has to
re-supply them. Mismatched metrics between phase 1 and phase 2 would
leave the merged result advertising feedback scores under one name on
the reconstructed runs and under another name on the freshly replayed
runs.
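
A caller who wants to catch such a mismatch could compare the score names on the two halves of the merged result. The sketch below is a hypothetical check, not part of the SDK; each run is modelled as a plain dict of score name to value, and the score names are made up.

```python
from typing import Dict, List, Set


def score_names(runs: List[Dict[str, float]]) -> Set[str]:
    """Collect every feedback score name that appears on any run."""
    names: Set[str] = set()
    for scores in runs:
        names.update(scores)
    return names


def mismatched_score_names(
    reconstructed: List[Dict[str, float]],
    replayed: List[Dict[str, float]],
) -> Set[str]:
    """Names present on one half of the merged result but not the other."""
    return score_names(reconstructed) ^ score_names(replayed)


# Phase 1 scored with one metric, phase 2 with a different one.
reconstructed_runs = [{"metric_a": 1.0}, {"metric_a": 0.0}]
replayed_runs = [{"metric_b": 1.0}]
mismatch = mismatched_score_names(reconstructed_runs, replayed_runs)
```

An empty result means both halves carry the same score names; anything else points at a metric list that changed between the two calls.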

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Comment thread scripts/resume_demo_start.py Outdated
Comment thread scripts/resume_demo_continue.py Outdated
…ommitted

These two scripts (added in 0053301) are meant to stay local to the
author's checkout for ad-hoc demoing. They were committed by mistake.
The repo already has the ``resume_start.py`` / ``resume_continue.py``
pair for the test-harness flow and ``examples/resume_evaluation.py``
for the canonical example.
…-evaluate-resume

# Conflicts:
#	sdks/python/tests/library_integration/openai/agents_tests/test_opik_tracing_processor.py
@alexkuzmik alexkuzmik merged commit b74f3bc into main Jun 9, 2026
134 checks passed
@alexkuzmik alexkuzmik deleted the aliaksandrk/OPIK-5269-evaluate-resume branch June 9, 2026 10:34
alexkuzmik added a commit that referenced this pull request Jun 9, 2026
…#6950)

* [OPIK-5269] [DOCS] docs: add page on resuming interrupted evaluations

Documents ``opik.evaluate_resume(experiment_id, ...)``: when it applies,
the replayed-vs-reconstructed contract, what happens with custom
samplers / explicit ``dataset_item_ids``, and which existing tools to
reach for when resume isn't the right fit (``evaluate_experiment`` for
re-scoring; a fresh ``evaluate()`` for new items).

The page lands under Evaluation → Advanced, between Datasets &
Experiments and Manage datasets — placement that matches when a user
is most likely to need it.

Pairs with the SDK feature shipping in #6941.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(resume): narrow scope note to evaluate() only

Drop the parenthetical mention of evaluate_prompt and
evaluate_optimization_trial — keeps the note focused on the entrypoint
users actually reach for.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(resume): use canonical 'run' vocabulary; simplify resume requirements; reflect missing-runs-only semantics

- Terminology: replace SDK-internal 'trial' with the user-facing 'run'
  vocabulary used in evaluation/concepts.mdx ("each item is run N
  times", "a run passes if...").
- "When you can resume" section: drop the JSON-blob / experiment_config
  internals; keep the two requirements the user actually needs to know
  (resume-aware SDK + versioned dataset) and point at the
  ExperimentNotResumable error for everything else.
- "What gets replayed vs reconstructed" multi-run paragraph: replace
  the old "redo all if any trial is missing" rule with the new
  "replay only the missing runs" semantics that ship with #6941.
- "Wrong tool" bullets: tighten wording (existing-run → existing
  experiment; original run → original evaluation) so 'run' consistently
  means a single execution.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(resume): make the page scannable; cut implementation chatter

Reorders the page around what a user wants to scan in order: what it
is → quick start → what it preserves vs replays → requirements →
same-machine caveat → when it's the wrong tool → reference.

Drops:
- The detailed "outcome of the original run → what resume does" table
  (replaced with three short bullets).
- The "engine writes the trace's `output` only at the end of the happy
  path" implementation note — users don't need to know the marker
  mechanism.
- The "JSON blob in experiment_config / resume reads it back" plumbing.
- The two-script "Putting it together" section (the quick start already
  shows the call).
- The local-checkpoint code block and the long explanation of why a
  sampler needs one — replaced with two short sentences.

Result: 80 lines vs 164. Same content, faster to scan.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(resume): move same-machine caveat below 'when it's the wrong tool'

It's a niche caveat (only the sampler / explicit-ids paths hit it), so
it belongs after the broader "wrong tool" decision points rather than
in the main reading flow.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(resume): widen task-change caveat to metrics; link evaluate_experiment; drop Reference section

- Third 'wrong tool' bullet now also covers metrics: providing the same
  ``task`` and ``scoring_metrics`` between calls is the caller's
  responsibility. The same already-completed-runs-keep-their-original-
  outputs warning applies to both.
- Link to the Python SDK reference page for evaluate_experiment so
  users have a single click to learn the alternative.
- Drop the Reference section at the bottom — the function and
  exception names already appear inline where users see them.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(resume): reorder — guidance first, requirements + same-machine caveat at the bottom

New flow: intro → quick start → what resume does → when it's the wrong
tool → requirements (now including the same-machine sampler caveat as
a follow-up paragraph in the same section). Decision-help moves up,
gotchas move down.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

Labels

Python SDK python Pull requests that update Python code tests Including test files, or tests related like configuration.
