Moltbot

Author	SHA1	Message	Date
Altay	6e962d8b9e	fix(agents): handle overloaded failover separately (#38301) * fix(agents): skip auth-profile failure on overload * fix(agents): note overload auth-profile fallback fix * fix(agents): classify overloaded failures separately * fix(agents): back off before overload failover * fix(agents): tighten overload probe and backoff state * fix(agents): persist overloaded cooldown across runs * fix(agents): tighten overloaded status handling * test(agents): add overload regression coverage * fix(agents): restore runner imports after rebase * test(agents): add overload fallback integration coverage * fix(agents): harden overloaded failover abort handling * test(agents): tighten overload classifier coverage * test(agents): cover all-overloaded fallback exhaustion * fix(cron): retry overloaded fallback summaries * fix(cron): treat HTTP 529 as overloaded retry	2026-03-07 01:42:11 +03:00
Altay	f014e255df	refactor(agents): share failover HTTP status classification (#36615) * fix(agents): classify transient failover statuses consistently * fix(agents): preserve legacy failover status mapping	2026-03-05 23:50:36 +03:00
Peter Steinberger	6472e03412	refactor(agents): share failover error matchers	2026-03-03 02:51:00 +00:00
AI南柯(KingMo)	30ab9b2068	fix(agents): recognize connection errors as retryable timeout failures (#31697) * fix(agents): recognize connection errors as retryable timeout failures ## Problem When a model endpoint becomes unreachable (e.g., local proxy down, relay server offline), the failover system fails to switch to the next candidate model. Errors like "Connection error." are not classified as retryable, causing the session to hang on a broken endpoint instead of falling back to healthy alternatives. ## Root Cause Connection/network errors are not recognized by the current failover classifier: - Text patterns like "Connection error.", "fetch failed", "network error" - Error codes like ECONNREFUSED, ENOTFOUND, EAI_AGAIN (in message text) While `failover-error.ts` handles these as error codes (err.code), it misses them when they appear as plain text in error messages. ## Solution Extend timeout error patterns to include connection/network failures: In `errors.ts` (ERROR_PATTERNS.timeout): - Text: "connection error", "network error", "fetch failed", etc. - Regex: /\beconn(?:refused|reset|aborted)\b/i, /\benotfound\b/i, /\beai_again\b/i In `failover-error.ts` (TIMEOUT_HINT_RE): - Same patterns for non-assistant error paths ## Testing Added test cases covering: - "Connection error." - "fetch failed" - "network error: ECONNREFUSED" - "ENOTFOUND" / "EAI_AGAIN" in message text ## Impact - Compatibility: High - only expands retryable error detection - Behavior: Connection failures now trigger automatic fallback - Risk: Low - changes are additive and well-tested * style: fix code formatting for test file	2026-03-03 02:37:23 +00:00
Peter Steinberger	1bd20dbdb6	fix(failover): treat stop reason error as timeout	2026-03-03 01:05:24 +00:00
Peter Steinberger	a2fdc3415f	fix(failover): handle unhandled stop reason error	2026-03-03 01:05:24 +00:00
Peter Steinberger	9617ac9dd5	refactor: dedupe agent and reply runtimes	2026-03-02 19:57:33 +00:00
Saurabh	1ef9a2a8ea	fix: handle HTTP 529 (Anthropic overloaded) in failover error classification Classify Anthropic's 529 status code as "rate_limit" so model fallback triggers reliably without depending on fragile message-based detection. Closes #28502	2026-03-02 18:59:10 +00:00
Ayane	76ed274aad	fix(agents): trigger model failover on connection-refused and network-unreachable errors Previously, only ETIMEDOUT / ESOCKETTIMEDOUT / ECONNRESET / ECONNABORTED were recognised as failover-worthy network errors. Connection-level failures such as ECONNREFUSED (server down), ENETUNREACH / EHOSTUNREACH (network disconnected), ENETRESET, and EAI_AGAIN (DNS failure) were treated as unknown errors and did not advance the fallback chain. This is particularly impactful when a local fallback model (e.g. Ollama) is configured: if the remote provider is unreachable due to a network outage, the gateway should fall back to the local model instead of returning an error to the user. Add the missing error codes to resolveFailoverReasonFromError() and corresponding e2e tests. Closes #18868	2026-03-02 02:08:27 +00:00
Frank Yang	ed86252aa5	fix: handle CLI session expired errors gracefully instead of crashing gateway (#31090) * fix: handle CLI session expired errors gracefully - Add session_expired to FailoverReason type - Add isCliSessionExpiredErrorMessage to detect expired CLI sessions - Modify runCliAgent to retry with new session when session expires - Update agentCommand to clear expired session IDs from session store - Add proper error handling to prevent gateway crashes on expired sessions Fixes #30986 * fix: add session_expired to AuthProfileFailureReason and missing log import * fix: type cli-runner usage field to match EmbeddedPiAgentMeta * fix: harden CLI session-expiry recovery handling * build: regenerate host env security policy swift --------- Co-authored-by: Peter Steinberger <steipete@gmail.com>	2026-03-02 01:11:05 +00:00
Aleksandrs Tihenko	c0026274d9	fix(auth): distinguish revoked API keys from transient auth errors (#25754) Merged via /review-pr -> /prepare-pr -> /merge-pr. Prepared head SHA: 8f9c07a200644284e11adae76368adab40c5fa4e Co-authored-by: rrenamed <87486610+rrenamed@users.noreply.github.com> Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com> Reviewed-by: @gumadeiras	2026-02-25 19:47:16 -05:00
taw0002	3c57bf4c85	fix: treat HTTP 502/503/504 as failover-eligible (timeout reason) (#21017) * fix: treat HTTP 502/503/504 as failover-eligible (timeout reason) When a model API returns 502 Bad Gateway, 503 Service Unavailable, or 504 Gateway Timeout, the error object carries the status code directly. resolveFailoverReasonFromError() only checked 402/429/401/403/408/400, so 5xx server errors fell through to message-based classification which requires the status code to appear at the start of the error message. Many API SDKs (Google, Anthropic) set err.status = 503 without prefixing the message with '503', so the message classifier never matched and failover never triggered; the run retried the same broken model. Add 502/503/504 to the status-code branch, returning 'timeout' (matching the existing behavior of isTransientHttpError in the message classifier). Fixes #20999 * Changelog: add failover 502/503/504 note with credits * Failover: classify HTTP 504 as transient in message parser * Changelog: credit taw0002 and vincentkoc for failover fix --------- Co-authored-by: Vincent Koc <vincentkoc@ieee.org>	2026-02-23 03:01:57 -05:00
mudrii	7ecfc1d93c	fix(auth): bidirectional mode/type compat + sync OAuth to all agents (#12692) Merged via /review-pr -> /prepare-pr -> /merge-pr. Prepared head SHA: 2dee8e1174e637e50d10bf7020f1de2990b804dc Co-authored-by: mudrii <220262+mudrii@users.noreply.github.com> Co-authored-by: obviyus <22031114+obviyus@users.noreply.github.com> Reviewed-by: @obviyus	2026-02-20 16:01:09 +05:30
Protocol Zero	2af3415fac	fix: treat HTTP 503 as failover-eligible for LLM provider errors (#21086) * fix: treat HTTP 503 as failover-eligible for LLM provider errors When LLM SDKs wrap 503 responses, the leading "503" prefix is lost (e.g. Google Gemini returns "high demand" / "UNAVAILABLE" without a numeric prefix). The existing isTransientHttpError only matches messages starting with "503 ...", so these wrapped errors silently skip failover: no profile rotation, no model fallback. This patch closes that gap: - resolveFailoverReasonFromError: map HTTP status 503 → rate_limit (covers structured error objects with a status field) - ERROR_PATTERNS.overloaded: add /\b503\b/, "service unavailable", "high demand" (covers message-only classification when the leading status prefix is absent) Existing isTransientHttpError behavior is unchanged; these additions are complementary and only fire for errors that previously fell through unclassified. * fix: address review feedback: drop /\b503\b/ pattern, add test coverage - Remove `/\b503\b/` from ERROR_PATTERNS.overloaded to resolve the semantic inconsistency noted by reviewers: `isTransientHttpError` already handles messages prefixed with "503" (→ "timeout"), so a redundant overloaded pattern would classify the same class of errors differently depending on message formatting. - Keep "service unavailable" and "high demand" patterns; these are the real gap-fillers for SDK-rewritten messages that lack a numeric prefix. - Add test case for JSON-wrapped 503 error body containing "overloaded" to strengthen coverage. * fix: unify 503 classification: status 503 → timeout (consistent with isTransientHttpError) resolveFailoverReasonFromError previously mapped status 503 → "rate_limit", while the string-based isTransientHttpError mapped "503 ..." → "timeout". Align both paths: structured {status: 503} now also returns "timeout", matching the existing transient-error convention. Both reasons are failover-eligible, so runtime behavior is unchanged. --------- Co-authored-by: Vincent Koc <vincentkoc@ieee.org>	2026-02-19 12:45:09 -08:00
Sebastian	fbda9a93fd	fix(failover): align abort timeout detection and regressions	2026-02-16 21:00:27 -05:00
Daniel Sauer	12ce358da5	fix(failover): recognize 'abort' stop reason as timeout for model fallback When streaming providers (GLM, OpenRouter, etc.) return 'stop reason: abort' due to stream interruption, OpenClaw's failover mechanism did not recognize this as a timeout condition. This prevented fallback models from being triggered, leaving users with failed requests instead of graceful failover. Changes: - Add abort patterns to ERROR_PATTERNS.timeout in pi-embedded-helpers/errors.ts - Extend TIMEOUT_HINT_RE regex to include abort patterns in failover-error.ts Fixes #18453 Co-authored-by: James <james@openclaw.ai>	2026-02-16 23:49:51 +01:00
Oren	71b4be8799	fix: handle 400 status in failover to enable model fallback (#1879)	2026-02-08 23:12:06 -08:00
cpojer	5ceff756e1	chore: Enable "curly" rule to avoid single-statement if confusion/errors.	2026-01-31 16:19:20 +09:00
Luke	be1cdc9370	fix(agents): treat provider request-aborted as timeout for fallback (#1576) * fix(agents): treat request-aborted as timeout for fallback * test(e2e): add provider timeout fallback	2026-01-24 11:27:24 +00:00
Peter Steinberger	ec27c813cc	fix(fallback): handle timeout aborts Co-authored-by: Mykyta Bozhenko <21245729+cheeeee@users.noreply.github.com>	2026-01-18 07:52:44 +00:00
Peter Steinberger	c379191f80	chore: migrate to oxlint and oxfmt Co-authored-by: Christoph Nakazawa <christoph.pojer@gmail.com>	2026-01-14 15:02:19 +00:00
Peter Steinberger	53ec8e36cb	refactor: centralize failover error parsing	2026-01-10 01:26:06 +01:00
Peter Steinberger	402c35b91c	refactor(agents): centralize failover normalization	2026-01-09 22:15:06 +01:00
Peter Steinberger	c27b1441f7	fix(auth): billing backoff + cooldown UX	2026-01-09 22:00:14 +01:00

24 Commits
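
The classification rules described across the commit messages above can be sketched in TypeScript as follows. This is a minimal illustration under stated assumptions, not the repository's code: `classifyFailoverReason`, `ErrorLike`, and the table names are hypothetical, mapping network error codes to "timeout" is an assumption drawn from the commit titles, and the 529 mapping reflects commit 1ef9a2a8ea only (commit 6e962d8b9e later classifies overloaded failures separately).

```typescript
// Hypothetical sketch: names and structure are illustrative only.
type FailoverReason = "timeout" | "rate_limit" | "overloaded";

interface ErrorLike {
  status?: number;
  code?: string;
  message?: string;
}

// Status mappings stated in the commit messages (3c57bf4c85, 2af3415fac, 1ef9a2a8ea).
const STATUS_REASONS = new Map<number, FailoverReason>([
  [502, "timeout"],
  [503, "timeout"],
  [504, "timeout"],
  [529, "rate_limit"],
]);

// Error codes listed in 76ed274aad; treating them as "timeout" is an assumption.
const NETWORK_CODES = new Set<string>([
  "ETIMEDOUT", "ESOCKETTIMEDOUT", "ECONNRESET", "ECONNABORTED",
  "ECONNREFUSED", "ENETUNREACH", "EHOSTUNREACH", "ENETRESET", "EAI_AGAIN",
]);

// Message patterns quoted in 30ab9b2068 (timeout) and 2af3415fac (overloaded).
const TIMEOUT_TEXT = ["connection error", "network error", "fetch failed"];
const TIMEOUT_REGEXES = [
  /\beconn(?:refused|reset|aborted)\b/i,
  /\benotfound\b/i,
  /\beai_again\b/i,
];
const OVERLOADED_TEXT = ["service unavailable", "high demand"];

function classifyFailoverReason(err: ErrorLike): FailoverReason | null {
  // 1. Structured status code on the error object.
  if (err.status !== undefined) {
    const byStatus = STATUS_REASONS.get(err.status);
    if (byStatus !== undefined) {
      return byStatus;
    }
  }
  // 2. Structured network error code (err.code).
  if (err.code !== undefined && NETWORK_CODES.has(err.code.toUpperCase())) {
    return "timeout";
  }
  // 3. Fall back to plain-text matching on the message.
  const text = (err.message ?? "").toLowerCase();
  if (text.length === 0) {
    return null;
  }
  if (TIMEOUT_TEXT.some((t) => text.includes(t)) || TIMEOUT_REGEXES.some((re) => re.test(text))) {
    return "timeout";
  }
  if (OVERLOADED_TEXT.some((t) => text.includes(t))) {
    return "overloaded";
  }
  return null; // unclassified: no failover reason in this sketch
}
```

Checking the structured fields before the message text mirrors the gap described in 3c57bf4c85, where SDKs set `err.status` without repeating the status code in the message.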