Self-heal Amazee credentials provisioned without resolved model names#17

Merged: jeremyandrews merged 1 commit into main from fix/amazee-model-resolution-self-heal on Jun 14, 2026

@jeremyandrews

The bug

In the scolta-node slack-archive demo, the AI summarize endpoint returns HTTP 200 with an empty body {} and query expansion silently runs unexpanded. The same defect exists in scolta-python and scolta-php — they share the provisioning design (scolta-php is canonical; this is the python port).

Root cause

Auto-provisioning persists credentials and resolves model names as two non-atomic steps: AmazeeTrialProvisioner.provision() calls storage.store(token, url, region), then calls /model/info. When the model-info call fails, AmazeeClient.get_available_models() swallows the error and returns [], so AmazeeModelResolver.resolve() returns {"ai_model": None, "ai_expansion_model": None} and the on_models_resolved gate never fires, so no model name is persisted. But ConfigStorage.load() requires only token+url, so it reports the half-provisioned credentials as valid, and before this fix ensure_ai_available() short-circuited on stored credentials on every later request, never re-resolving. The caller then built its client with the dated config default claude-sonnet-4-5-20250929, which the Amazee LiteLLM gateway rejects with HTTP 400 "Invalid model name". The endpoint swallows the 400 (summarize returns {}, expand returns an unexpanded 200). A credentials record that load() considers complete but that lacks resolved models is worse than no record: it bypasses the graceful degrade path and sends a model name the gateway always rejects.
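
To make the failure sequence concrete, here is a minimal sketch that reproduces the half-provisioned state with in-memory stand-ins. `InMemoryStorage`, `failing_model_info`, the "first model wins" rule in `resolve`, and the example URL are assumptions for illustration only; they are not the real scolta-python classes or signatures.

```python
from dataclasses import dataclass
from typing import Callable, Optional

DATED_DEFAULT = "claude-sonnet-4-5-20250929"  # config default named in this PR


@dataclass
class StoredCredentials:
    token: str
    url: str
    region: str


class InMemoryStorage:
    """Stand-in for ConfigStorage: persists token/url/region only."""

    def __init__(self) -> None:
        self._creds: Optional[StoredCredentials] = None

    def store(self, token: str, url: str, region: str) -> None:
        self._creds = StoredCredentials(token, url, region)

    def load(self) -> Optional[StoredCredentials]:
        # Only token+url are required, so a record without models reads as valid.
        if self._creds and self._creds.token and self._creds.url:
            return self._creds
        return None


def get_available_models(fetch: Callable[[], list[str]]) -> list[str]:
    """Mimics the swallow-and-return-[] behaviour described above."""
    try:
        return fetch()
    except Exception:
        return []


def resolve(models: list[str]) -> dict[str, Optional[str]]:
    # Hypothetical selection rule: the first model wins for both roles.
    first = models[0] if models else None
    return {"ai_model": first, "ai_expansion_model": first}


def failing_model_info() -> list[str]:
    raise ConnectionError("/model/info unreachable")


storage = InMemoryStorage()
config: dict[str, str] = {"ai_model": DATED_DEFAULT}

# Step 1: credentials are persisted.
storage.store("trial-token", "https://gateway.example.invalid", "us")

# Step 2: model resolution fails silently, so the gate below never fires.
resolved = resolve(get_available_models(failing_model_info))
if resolved["ai_model"] is not None:  # stand-in for the on_models_resolved gate
    config.update({k: v for k, v in resolved.items() if v})

# Credentials look valid, yet the client would still be built with the dated default.
half_provisioned = storage.load() is not None and config["ai_model"] == DATED_DEFAULT
```

In this sketch nothing raises and nothing is logged, which is the point: the only trace of the failure is that `config` still holds the dated default.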

This is outside KeyExpiryRecovery's remit, which recovers from auth-class (401/403) failures and explicitly excludes budget — a 400 "Invalid model name" is neither.
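
The boundary described here can be sketched as a small classifier. The function name and the `is_budget_error` flag are hypothetical; how KeyExpiryRecovery actually detects a budget failure is not shown in this PR.

```python
def key_expiry_recovery_applies(status: int, is_budget_error: bool = False) -> bool:
    """Hypothetical classifier: True only for auth-class failures (401/403)."""
    if is_budget_error:
        return False  # budget exhaustion is explicitly excluded
    return status in (401, 403)


# A 400 "Invalid model name" is neither auth-class nor budget, so recovery never triggers.
invalid_model_handled = key_expiry_recovery_applies(400)
```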

The fix (implementation choice)

I chose the incomplete-provision self-heal: treat "credentials present but models unresolved" as an incomplete provision and re-attempt model resolution against the already-stored key (never a fresh trial — trial keys are server-side-limited). ensure_ai_available() gains an optional has_resolved_models predicate; when stored credentials exist and it reports models are still unresolved, resolution re-runs and on_models_resolved fires with the result. Without the predicate the historical no-op is unchanged (back-compat).
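
A minimal sketch of the control flow this describes, under stated assumptions: the keyword parameters, the returned path labels, and the `ai_model is None` gate are illustrative and do not reproduce the real `ensure_ai_available()` signature.

```python
from typing import Callable, Optional

Models = dict[str, Optional[str]]


def ensure_ai_available(
    *,
    has_explicit_api_key: bool,
    has_stored_credentials: bool,
    provision_trial: Callable[[], None],
    resolve_models: Callable[[], Models],
    on_models_resolved: Callable[[Models], None],
    has_resolved_models: Optional[Callable[[], bool]] = None,
) -> str:
    """Return a label for the path taken (labels exist for illustration only)."""
    if has_explicit_api_key:
        return "explicit-key"  # explicit-key path is checked first and left untouched

    if has_stored_credentials:
        if has_resolved_models is None or has_resolved_models():
            return "no-op"  # historical behaviour, also used when no predicate is given
        models = resolve_models()  # re-resolve against the stored key, no new trial
        if models.get("ai_model") is None:
            return "degraded"  # nothing persisted; the no-AI path is taken
        on_models_resolved(models)
        return "self-healed"

    # No stored credentials: first-time provisioning.
    provision_trial()
    models = resolve_models()
    if models.get("ai_model") is not None:
        on_models_resolved(models)
    return "provisioned"
```

Note that `provision_trial` is unreachable once credentials are stored, which is how this sketch avoids spending a second trial key during self-heal.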

Why a new predicate rather than the §3 sketch's stored_models() check: in scolta-python (as in scolta-php), ConfigStorage persists only token/url/region — resolved models flow to each consumer's own config via on_models_resolved, so storage cannot report whether models are resolved (unlike scolta-node, whose FilesystemConfigStorage does persist them). The caller (which knows its config) supplies the signal. This is the documented structural difference between the bindings; the python change mirrors the canonical php one exactly.
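
One way a consumer might supply the signal, assuming it keeps resolved models in its own config dict. The factory functions, the config shape, and the model names are placeholders, not part of the scolta-python API.

```python
from typing import Callable, Optional

Config = dict[str, Optional[str]]


def make_has_resolved_models(config: Config) -> Callable[[], bool]:
    """Build a predicate that reads the consumer's live config on every call."""

    def has_resolved_models() -> bool:
        return bool(config.get("ai_model")) and bool(config.get("ai_expansion_model"))

    return has_resolved_models


def make_on_models_resolved(config: Config) -> Callable[[Config], None]:
    """Build the callback that writes resolved models into the same config."""

    def on_models_resolved(models: Config) -> None:
        config.update(models)

    return on_models_resolved


consumer_config: Config = {}
has_resolved_models = make_has_resolved_models(consumer_config)
on_models_resolved = make_on_models_resolved(consumer_config)

before = has_resolved_models()  # nothing resolved yet
on_models_resolved({"ai_model": "example-alias", "ai_expansion_model": "example-alias-small"})
after = has_resolved_models()  # the same predicate now sees the resolved names
```

Because both closures share one config object, the predicate flips as soon as the callback fires, so storage never needs to know about models.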

Contract coverage (§3): (1) self-heals on the next lazy-init pass; (2) never sends the dated default — that fallback lives in the consuming adapter/demo client construction, which adopts the predicate when it re-vendors (out of scope here, noted below); (3) degrades gracefully — when resolution genuinely fails, no model is persisted and the existing no-AI degrade path is taken; (4) does not waste trial keys — re-resolves model-info only, against the stored key; (5) explicit-key path untouched — guarded by has_explicit_api_key first.

Out of scope (flagged, not fixed): whether claude-sonnet-4-5-20250929 is a current valid Anthropic model name for the explicit-key path. The dated default is only wrong on the Amazee gateway path; left unchanged.

Test

test_auto_provisioner_self_heals_half_provisioned_state drives the real sequence: pass 1 provisions a trial whose /model/info returns no models (credentials stored, models unresolved); pass 2 self-heals by re-resolving against the stored key, asserting exactly one trial was ever provisioned (no second trial) and that the resolved model is a real undated alias, never the dated default. Confirmed failing on the pre-fix logic (re-resolution never happens) and passing after. Plus guards: no re-resolution when models are already resolved, and the no-predicate back-compat no-op.
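
The shape of that regression test can be sketched with a fake gateway whose first /model/info call returns nothing. `FakeGateway`, `lazy_init_pass`, the `state` dict, and the model names are simplified stand-ins and assumptions, not the actual test code or fixtures.

```python
DATED_DEFAULT = "claude-sonnet-4-5-20250929"


class FakeGateway:
    """Hypothetical fake: /model/info is empty on the first call, then healthy."""

    def __init__(self) -> None:
        self.trials_provisioned = 0
        self.model_info_calls = 0

    def provision_trial(self) -> str:
        self.trials_provisioned += 1
        return f"trial-key-{self.trials_provisioned}"

    def model_info(self, key: str) -> list[str]:
        self.model_info_calls += 1
        if self.model_info_calls == 1:
            return []
        return ["example-undated-alias", "example-small-alias"]


def lazy_init_pass(gateway: FakeGateway, state: dict) -> None:
    """One lazy-init pass with the self-heal behaviour (simplified stand-in)."""
    if state.get("key") is None:
        state["key"] = gateway.provision_trial()  # only ever happens on the first pass
    elif state.get("ai_model") is not None:
        return  # already resolved: nothing to do
    models = gateway.model_info(state["key"])
    if models:
        state["ai_model"] = models[0]


def test_self_heals_half_provisioned_state() -> None:
    gateway, state = FakeGateway(), {}

    lazy_init_pass(gateway, state)  # pass 1: key stored, models unresolved
    assert state["key"] == "trial-key-1"
    assert state.get("ai_model") is None

    lazy_init_pass(gateway, state)  # pass 2: re-resolve against the stored key
    assert gateway.trials_provisioned == 1  # no second trial
    assert state["ai_model"] == "example-undated-alias"
    assert state["ai_model"] != DATED_DEFAULT

    lazy_init_pass(gateway, state)  # pass 3: guard, no further resolution
    assert gateway.model_info_calls == 2


test_self_heals_half_provisioned_state()
```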

Local results

  • pytest — 740 tests pass
  • ruff check . — all checks passed
  • ruff format --check — already formatted

Sequencing

Independent of the scolta-php PR — the model-resolution logic is local to this binding. The canonical php change lands ahead of the scolta-php 1.0.4 tag; python re-vendors browser assets, not the AI subsystem, at its own release points.

A provision whose /model/info call fails stores token+url but no model
names: get_available_models() swallows the error and returns [], so the
on_models_resolved gate never fires. ConfigStorage.load() requires only
token+url, so the half-provisioned credentials read as valid and
ensure_ai_available() short-circuited on every later request, never
re-resolving. The caller then fell back to the dated config default the
Amazee LiteLLM gateway rejects with HTTP 400, breaking AI silently with no
self-recovery (outside KeyExpiryRecovery's auth-only remit).

ensure_ai_available() now takes an optional has_resolved_models predicate:
when stored credentials exist but the caller reports models are still
unresolved, model resolution re-runs against the already-stored key (never
a fresh trial) and on_models_resolved fires with the result, healing the
incomplete-provision state on the next lazy-init. Without the predicate the
historical no-op is unchanged.

Regression test drives the full provision -> failed-resolution -> store ->
re-resolve sequence.
@jeremyandrews force-pushed the fix/amazee-model-resolution-self-heal branch from 50a1761 to deddb99 on June 14, 2026 07:51
@jeremyandrews merged commit d4b68cb into main on Jun 14, 2026
6 checks passed
@jeremyandrews deleted the fix/amazee-model-resolution-self-heal branch on June 14, 2026 08:04