Self-heal Amazee credentials provisioned without resolved model names #17
Merged
A provision whose /model/info call fails stores token+url but no model names: get_available_models() swallows the error and returns [], so the on_models_resolved gate never fires. ConfigStorage.load() requires only token+url, so the half-provisioned credentials read as valid and ensure_ai_available() short-circuited on every later request, never re-resolving. The caller then fell back to the dated config default the Amazee LiteLLM gateway rejects with HTTP 400, breaking AI silently with no self-recovery (outside KeyExpiryRecovery's auth-only remit). ensure_ai_available() now takes an optional has_resolved_models predicate: when stored credentials exist but the caller reports models are still unresolved, model resolution re-runs against the already-stored key (never a fresh trial) and on_models_resolved fires with the result, healing the incomplete-provision state on the next lazy-init. Without the predicate the historical no-op is unchanged. Regression test drives the full provision -> failed-resolution -> store -> re-resolve sequence.
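The failure mode described above can be reproduced in a few lines. This is a minimal sketch, not the scolta-python source: `SketchStorage`, `get_available_models_sketch`, `resolve_sketch`, the token, the gateway URL, and the alias `example-undated-alias` are hypothetical stand-ins. Only the dated default and the "token+url is enough to load" rule come from the description.

```python
from dataclasses import dataclass
from typing import Optional

DATED_DEFAULT = "claude-sonnet-4-5-20250929"  # config default named in this PR


@dataclass
class StoredCredentials:
    token: str
    url: str
    region: Optional[str] = None


class SketchStorage:
    """Hypothetical stand-in for ConfigStorage: keeps token/url/region only."""

    def __init__(self) -> None:
        self._record: Optional[StoredCredentials] = None

    def store(self, token: str, url: str, region: Optional[str] = None) -> None:
        self._record = StoredCredentials(token, url, region)

    def load(self) -> Optional[StoredCredentials]:
        # Validity requires only token + url; model names are never consulted.
        rec = self._record
        if rec is not None and rec.token and rec.url:
            return rec
        return None


def get_available_models_sketch(fail: bool) -> list:
    """Hypothetical client call that swallows the error and returns []."""
    try:
        if fail:
            raise ConnectionError("/model/info unavailable")
        return ["example-undated-alias"]  # placeholder, not a real model name
    except ConnectionError:
        return []


def resolve_sketch(available: list) -> dict:
    model = available[0] if available else None
    return {"ai_model": model, "ai_expansion_model": model}


storage = SketchStorage()
config: dict = {}  # the consumer's own config, where resolved models would land

# Step 1: credentials are persisted before any model name is known.
storage.store("trial-token", "https://gateway.example")

# Step 2: model resolution fails quietly, so the gate below never fires.
resolved = resolve_sketch(get_available_models_sketch(fail=True))
if resolved["ai_model"]:  # stands in for the on_models_resolved gate
    config.update(resolved)

# The record loads as valid even though no model was ever resolved ...
half_provisioned = storage.load() is not None and "ai_model" not in config
# ... so the caller falls back to the dated default the gateway rejects.
model_sent = config.get("ai_model", DATED_DEFAULT)
print(half_provisioned, model_sent)
```

The point of the sketch is that nothing in `load()` can distinguish this state from a complete provision, which is why the signal has to come from the caller.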
The bug
In the scolta-node slack-archive demo, the AI summarize endpoint returns HTTP 200 with an empty body `{}`, and query expansion silently runs unexpanded. The same defect exists in scolta-python and scolta-php, which share the provisioning design (scolta-php is canonical; this is the Python port).

Root cause
Auto-provisioning persists credentials and resolves model names as two non-atomic steps: `AmazeeTrialProvisioner.provision()` calls `storage.store(token, url, region)`, then calls `/model/info`. When the model-info call fails, `AmazeeClient.get_available_models()` swallows the error and returns `[]`, so `AmazeeModelResolver.resolve()` returns `{"ai_model": None, "ai_expansion_model": None}` and the `on_models_resolved` gate never fires: no model name is persisted. But `ConfigStorage.load()` requires only token+url, so it reports the half-provisioned credentials as valid, and `ensure_ai_available()` short-circuited on stored credentials on every later request, never re-resolving. The caller then built its client with the dated config default `claude-sonnet-4-5-20250929`, which the Amazee LiteLLM gateway rejects with HTTP 400 "Invalid model name". The endpoint swallows the 400 (summarize → `{}`, expand → unexpanded 200). A creds record that `load()` considers complete but that lacks resolved models is worse than no record: it bypasses graceful degrade and sends a model the gateway always rejects.

This is outside `KeyExpiryRecovery`'s remit, which recovers from auth-class (401/403) failures and explicitly excludes budget; a 400 "Invalid model name" is neither.

The fix (implementation choice)
I chose the incomplete-provision self-heal: treat "credentials present but models unresolved" as an incomplete provision and re-attempt model resolution against the already-stored key (never a fresh trial: trial keys are server-side-limited). `ensure_ai_available()` gains an optional `has_resolved_models` predicate; when stored credentials exist and it reports models are still unresolved, resolution re-runs and `on_models_resolved` fires with the result. Without the predicate the historical no-op is unchanged (back-compat).

Why a new predicate rather than the §3 sketch's `stored_models()` check: in scolta-python (as in scolta-php), `ConfigStorage` persists only token/url/region. Resolved models flow to each consumer's own config via `on_models_resolved`, so storage cannot report whether models are resolved (unlike scolta-node, whose `FilesystemConfigStorage` does persist them). The caller (which knows its config) supplies the signal. This is the documented structural difference between the bindings; the Python change mirrors the canonical PHP one exactly.

Contract coverage (§3):
(1) self-heals on the next lazy-init pass;
(2) never sends the dated default: that fallback lives in the consuming adapter/demo client construction, which adopts the predicate when it re-vendors (out of scope here, noted below);
(3) degrades gracefully: when resolution genuinely fails, no model is persisted and the existing no-AI degrade path is taken;
(4) does not waste trial keys: re-resolves model-info only, against the stored key;
(5) explicit-key path untouched: guarded by `has_explicit_api_key` first.

Out of scope (flagged, not fixed): whether `claude-sonnet-4-5-20250929` is a current valid Anthropic model name for the explicit-key path. The dated default is only wrong on the Amazee gateway path; left unchanged.

Test
`test_auto_provisioner_self_heals_half_provisioned_state` drives the real sequence: pass 1 provisions a trial whose `/model/info` returns no models (credentials stored, models unresolved); pass 2 self-heals by re-resolving against the stored key, asserting exactly one trial was ever provisioned (no second trial) and that the resolved model is a real undated alias, never the dated default. Confirmed failing on the pre-fix logic (re-resolution never happens) and passing after. Plus guards: no re-resolution when models are already resolved, and the no-predicate back-compat no-op.

Local results
- `pytest`: 740 tests pass
- `ruff check .`: all checks passed
- `ruff format --check`: already formatted

Sequencing
Independent of the scolta-php PR — the model-resolution logic is local to this binding. The canonical php change lands ahead of the scolta-php 1.0.4 tag; python re-vendors browser assets, not the AI subsystem, at its own release points.
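
For readers who want the self-heal from "The fix" in concrete form, here is a rough sketch of the control flow and the two-pass sequence the regression test exercises. It is an illustration under assumptions, not the merged code: the keyword-only signature of `ensure_ai_available_sketch`, the callable-based wiring, the in-memory `state` dict, and the alias `example-undated-alias` are all hypothetical. Only the ordering of the guards and the role of `has_resolved_models` follow the description.

```python
from typing import Callable, Optional


def ensure_ai_available_sketch(
    *,
    has_explicit_api_key: bool,
    load_credentials: Callable[[], Optional[dict]],
    provision_trial: Callable[[], dict],
    resolve_models: Callable[[dict], dict],
    on_models_resolved: Callable[[dict], None],
    has_resolved_models: Optional[Callable[[], bool]] = None,
) -> None:
    # Explicit-key path is untouched: it is guarded first.
    if has_explicit_api_key:
        return
    creds = load_credentials()
    if creds is not None:
        # Without the predicate, stored credentials keep the historical no-op.
        if has_resolved_models is None or has_resolved_models():
            return
        # Otherwise fall through: re-resolve against the stored key only.
    else:
        creds = provision_trial()  # assumed to persist the credentials itself
    models = resolve_models(creds)
    if models.get("ai_model"):  # gate: nothing is persisted on failure
        on_models_resolved(models)


# Hypothetical in-memory harness standing in for storage, gateway, and config.
state = {"creds": None, "trials": 0, "info_ok": False}
config: dict = {}


def load_credentials() -> Optional[dict]:
    return state["creds"]


def provision_trial() -> dict:
    state["trials"] += 1
    state["creds"] = {"token": "trial-token", "url": "https://gateway.example"}
    return state["creds"]


def resolve_models(creds: dict) -> dict:
    # Returns no models while the simulated /model/info call is failing.
    name = "example-undated-alias" if state["info_ok"] else None
    return {"ai_model": name, "ai_expansion_model": name}


def run_pass() -> None:
    ensure_ai_available_sketch(
        has_explicit_api_key=False,
        load_credentials=load_credentials,
        provision_trial=provision_trial,
        resolve_models=resolve_models,
        on_models_resolved=config.update,
        has_resolved_models=lambda: bool(config.get("ai_model")),
    )


run_pass()  # pass 1: trial provisioned, resolution fails, models unresolved
after_pass_1 = dict(config)
state["info_ok"] = True
run_pass()  # pass 2: self-heals by re-resolving against the stored key
run_pass()  # pass 3: models already resolved, so this is a no-op
```

Passing the predicate as a callable rather than a boolean keeps the check lazy, so a caller that omits it pays nothing and sees the old behaviour.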