When a model API call hangs indefinitely (e.g. Anthropic quota exceeded
mid-call), the gateway acquires a session .jsonl.lock but the promise
never resolves, so the try/finally block never reaches release(). Since
the owning PID is the gateway itself, stale detection cannot help —
isPidAlive() always returns true.
This commit adds four layers of defense:
1. **In-process lock watchdog** (session-write-lock.ts)
- Track acquiredAt timestamp on each held lock
- 60-second interval timer checks all held locks
- Auto-releases any lock held longer than maxHoldMs (default 5 min)
- Catches the hung-API-call case that try/finally cannot
2. **Gateway startup cleanup** (server-startup.ts)
- On boot, scan all agent session directories for *.jsonl.lock files
- Remove locks with dead PIDs or older than staleMs (30 min)
- Log each cleaned lock for diagnostics
3. **openclaw doctor stale lock detection** (doctor-session-locks.ts)
- New health check scans for .jsonl.lock files
- Reports PID status and age of each lock found
- In --fix mode, removes stale locks automatically
4. **Transcript error entry on API failure** (attempt.ts)
- When promptError is set, write an error marker to the session
transcript before releasing the lock
- Preserves conversation history even on model API failures
Closes#18060
The gateway unconditionally scheduled a SIGUSR1 restart after every
update.run call, even when the update itself failed (broken deps,
build errors, etc.). This left the process restarting into a broken
state — corrupted node_modules, partial builds — causing a crash loop
that required manual intervention.
Three fixes:
1. Only restart on success: scheduleGatewaySigusr1Restart is now
gated on result.status === "ok". Failed or skipped updates still
write the restart sentinel (so the status can be reported back to
the user) but the running gateway stays alive.
2. Early bail on step failure: deps install, build, and ui:build now
check exit codes immediately (matching the preflight section) so a
failed deps install no longer cascades into a broken build and
ui:build.
3. Auto-repair config during update: the doctor step now runs with
--fix alongside --non-interactive, so unknown config keys left over
from schema changes between versions are stripped automatically
instead of causing a startup validation crash.
When the gateway connection fails due to device token mismatch (e.g., after
re-pairing the device), clear the stored device-auth token so that
subsequent connection attempts can obtain a fresh token.
This fixes the cron tool failing with 'device token mismatch' error
after running 'openclaw configure' to re-pair the device.
Fixes#18175
Follow-up to #18066 — three session file write sites were missed:
- auto-reply/reply/session.ts: forked session transcript header
- pi-embedded-runner/session-manager-init.ts: session file reset
- gateway/server-methods/sessions.ts: compacted transcript rewrite
All now use mode 0o600 consistent with transcript.ts and chat.ts.