* fix: ensure CLI exits after command completion
The CLI process would hang indefinitely after commands like
`openclaw gateway restart` completed successfully. Two root causes:
1. `runCli()` returned without calling `process.exit()` after
`program.parseAsync()` resolved, and Commander.js does not
force-exit the process.
2. `daemon-cli/register.ts` eagerly called `createDefaultDeps()`
which imported all messaging-provider modules, creating persistent
event-loop handles that prevented natural Node exit.
Changes:
- Add `flushAndExit()` helper that drains stdout/stderr before calling
`process.exit()`, preventing truncated piped output in CI/scripts.
- Call `flushAndExit()` after both `tryRouteCli()` and
`program.parseAsync()` resolve.
- Remove unnecessary `void createDefaultDeps()` from daemon-cli
registration — daemon lifecycle commands never use messaging deps.
- Make `serveAcpGateway()` return a promise that resolves on
intentional shutdown (SIGINT/SIGTERM), so `openclaw acp` blocks
`parseAsync` for the bridge lifetime and exits cleanly on signal.
- Handle the returned promise in the standalone main-module entry
point to avoid unhandled rejections.
Fixes#12904
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: refactor CLI lifecycle and lazy outbound deps (#12906) (thanks @DrCrinkle)
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Peter Steinberger <steipete@gmail.com>
* Browser/Security: constrain trace and download output paths to temp roots
* Changelog: remove advisory ID from pre-public security note
* Browser/Security: constrain trace and download output paths to temp roots
* Changelog: remove advisory ID from pre-public security note
* test(bluebubbles): align timeout status expectation to 408
* test(discord): remove unused race-condition counter in threading test
* test(bluebubbles): align timeout status expectation to 408
* initial commit
* removes assesment from docs
* resolves automated review comments
* resolves lint , type , tests , refactors , and submits
* solves : why do we have to lint the tests xD
* adds greptile fixes
* solves a type error
* solves a ci error
* refactors auths
* solves a failing test after i pulled from main lol
* solves a failing test after i pulled from main lol
* resolves token naming issue to comply with better practices when using hf / huggingface
* fixes curly lints !
* fixes failing tests for google api from main
* solve merge conflicts
* solve failing tests with a defensive check 'undefined' openrouterapi key
* fix: preserve Hugging Face auth-choice intent and token behavior (#13472) (thanks @Josephrp)
* test: resolve auth-choice cherry-pick conflict cleanup (#13472)
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Peter Steinberger <steipete@gmail.com>
The initial fix using snapshot.parsed broke configs with $include directives.
This commit adds a new 'resolved' field to ConfigFileSnapshot that contains
the config after $include and ${ENV} substitution but BEFORE runtime defaults
are applied. This is now used by config set/unset to avoid:
1. Breaking configs with $include directives
2. Leaking runtime defaults into the written config file
Also removes applyModelDefaults from writeConfigFile since runtime defaults
should only be applied when loading, not when writing.
Fixes#6070
The config set/unset commands were using snapshot.config (which contains
runtime-merged defaults) instead of snapshot.parsed (the raw user config).
This caused runtime defaults like agents.defaults to leak into the written
config file when any value was set or unset.
Changed both set and unset commands to use structuredClone(snapshot.parsed)
to preserve only user-specified config values.
* fix(gateway): drain active turns before restart to prevent message loss
On SIGUSR1 restart, the gateway now waits up to 30s for in-flight agent
turns to complete before tearing down the server. This prevents buffered
messages from being dropped when config.patch or update triggers a restart
while agents are mid-turn.
Changes:
- command-queue.ts: add getActiveTaskCount() and waitForActiveTasks()
helpers to track and wait on active lane tasks
- run-loop.ts: on restart signal, drain active tasks before server.close()
with a 30s timeout; extend force-exit timer accordingly
- command-queue.test.ts: update imports for new exports
Fixes#13883
* fix(queue): snapshot active tasks for restart drain
---------
Co-authored-by: Elonito <0xRaini@users.noreply.github.com>
Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
When the gateway is installed as a macOS launch agent and no token is
configured, the service enters an infinite restart loop because launchd
does not inherit shell environment variables. Auto-generate a token
during `gateway install` when auth mode is `token` and no token exists,
matching the existing pattern in doctor.ts and configure.gateway.ts.
The token is persisted to the config file and embedded in the plist
EnvironmentVariables for belt-and-suspenders reliability.
Relates-to: #5103, #2433, #1690, #7749
* feat: add LiteLLM provider types, env var, credentials, and auth choice
Add litellm-api-key auth choice, LITELLM_API_KEY env var mapping,
setLitellmApiKey() credential storage, and LITELLM_DEFAULT_MODEL_REF.
* feat: add LiteLLM onboarding handler and provider config
Add applyLitellmProviderConfig which properly registers
models.providers.litellm with baseUrl, api type, and model definitions.
This fixes the critical bug from PR #6488 where the provider entry was
never created, causing model resolution to fail at runtime.
* docs: add LiteLLM provider documentation
Add setup guide covering onboarding, manual config, virtual keys,
model routing, and usage tracking. Link from provider index.
* docs: add LiteLLM to sidebar navigation in docs.json
Add providers/litellm to both English and Chinese provider page lists
so the docs page appears in the sidebar navigation.
* test: add LiteLLM non-interactive onboarding test
Wire up litellmApiKey flag inference and auth-choice handler for the
non-interactive onboarding path, and add an integration test covering
profile, model default, and credential storage.
* fix: register --litellm-api-key CLI flag and add preferred provider mapping
Wire up the missing Commander CLI option, action handler mapping, and
help text for --litellm-api-key. Add litellm-api-key to the preferred
provider map for consistency with other providers.
* fix: remove zh-CN sidebar entry for litellm (no localized page yet)
* style: format buildLitellmModelDefinition return type
* fix(onboarding): harden LiteLLM provider setup (#12823)
* refactor(onboarding): keep auth-choice provider dispatcher under size limit
---------
Co-authored-by: Peter Steinberger <steipete@gmail.com>
* fix: prune stale session entries, cap entry count, and rotate sessions.json
The sessions.json file grows unbounded over time. Every heartbeat tick (default: 30m)
triggers multiple full rewrites, and session keys from groups, threads, and DMs
accumulate indefinitely with large embedded objects (skillsSnapshot,
systemPromptReport). At >50MB the synchronous JSON parse blocks the event loop,
causing Telegram webhook timeouts and effectively taking the bot down.
Three mitigations, all running inside saveSessionStoreUnlocked() on every write:
1. Prune stale entries: remove entries with updatedAt older than 30 days
(configurable via session.maintenance.pruneDays in openclaw.json)
2. Cap entry count: keep only the 500 most recently updated entries
(configurable via session.maintenance.maxEntries). Entries without updatedAt
are evicted first.
3. File rotation: if the existing sessions.json exceeds 10MB before a write,
rename it to sessions.json.bak.{timestamp} and keep only the 3 most recent
backups (configurable via session.maintenance.rotateBytes).
All three thresholds are configurable under session.maintenance in openclaw.json
with Zod validation. No env vars.
Existing tests updated to use Date.now() instead of epoch-relative timestamps
(1, 2, 3) that would be incorrectly pruned as stale.
27 new tests covering pruning, capping, rotation, and integration scenarios.
* feat: auto-prune expired cron run sessions (#12289)
Add TTL-based reaper for isolated cron run sessions that accumulate
indefinitely in sessions.json.
New config option:
cron.sessionRetention: string | false (default: '24h')
The reaper runs piggy-backed on the cron timer tick, self-throttled
to sweep at most every 5 minutes. It removes session entries matching
the pattern cron:<jobId>:run:<uuid> whose updatedAt + retention < now.
Design follows the Kubernetes ttlSecondsAfterFinished pattern:
- Sessions are persisted normally (observability/debugging)
- A periodic reaper prunes expired entries
- Configurable retention with sensible default
- Set to false to disable pruning entirely
Files changed:
- src/config/types.cron.ts: Add sessionRetention to CronConfig
- src/config/zod-schema.ts: Add Zod validation for sessionRetention
- src/cron/session-reaper.ts: New reaper module (sweepCronRunSessions)
- src/cron/session-reaper.test.ts: 12 tests covering all paths
- src/cron/service/state.ts: Add cronConfig/sessionStorePath to deps
- src/cron/service/timer.ts: Wire reaper into onTimer tick
- src/gateway/server-cron.ts: Pass config and session store path to deps
Closes#12289
* fix: sweep cron session stores per agent
* docs: add changelog for session maintenance (#13083) (thanks @skyfallsin, @Glucksberg)
* fix: add warn-only session maintenance mode
* fix: warn-only maintenance defaults to active session
* fix: deliver maintenance warnings to active session
* docs: add session maintenance examples
* fix: accept duration and size maintenance thresholds
* refactor: share cron run session key check
* fix: format issues and replace defaultRuntime.warn with console.warn
---------
Co-authored-by: Pradeep Elankumaran <pradeepe@gmail.com>
Co-authored-by: Glucksberg <markuscontasul@gmail.com>
Co-authored-by: max <40643627+quotentiroler@users.noreply.github.com>
Co-authored-by: quotentiroler <max.nussbaumer@maxhealth.tech>