Worker retries infinitely on 400 API_KEY_INVALID — no circuit breaker for fatal errors #1481

@jaydenbot369-ctrl

Bug Description

When the configured Gemini API key expires (or is invalid), the worker enters an infinite retry loop with no circuit breaker or backoff. This burns through API quota with 400 errors indefinitely until the worker is manually killed.

Impact

Over 6 days with an expired key, my worker generated ~77,000 failed requests (0.02% success rate according to the Gemini API dashboard). The pending_messages queue accumulated 25K+ failed entries, and the SQLite DB bloated to 564 MB.

Reproduction

  1. Configure CLAUDE_MEM_PROVIDER=gemini with a valid API key
  2. Let the worker run and accumulate some pending messages
  3. Revoke the API key or let it expire
  4. Observe: worker retries every 4-14 seconds per session, indefinitely

Log Evidence

[ERROR] [SDK] [session-603] Session generator failed {provider=Fg} Gemini API error: 400 - {
  "error": {
    "code": 400,
    "message": "API key expired. Please renew the API key.",
    "status": "INVALID_ARGUMENT",
    "details": [{ "reason": "API_KEY_INVALID" }]
  }
}
[INFO] [SYSTEM] [session-603] Pending work remains after generator exit, restarting with fresh AbortController {pendingCount=172}
[INFO] [SYSTEM] [session-603] Starting generator (pending-work-restart) using Fg

This cycle repeats every ~5-14 seconds per session, with 13 active sessions running concurrently.

Expected Behavior

  • 400 API_KEY_INVALID should be treated as a fatal, non-retryable error
  • Worker should stop retrying after detecting this error class, log a clear message, and either:
    • Pause processing until the key is updated (preferred), or
    • Mark all pending messages as failed and shut down gracefully
  • At minimum, implement exponential backoff with a max retry cap for all 4xx errors
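The backoff-with-cap behavior in the last bullet could look roughly like the sketch below. Everything in it is hypothetical: `BackoffOptions`, `nextDelayMs`, and the default values are illustrative names and numbers, not anything taken from the claude-mem codebase.

```typescript
// Hypothetical helper: names and defaults are assumptions for illustration only.
interface BackoffOptions {
  baseMs: number;     // delay before the first retry
  maxMs: number;      // upper bound on any single delay
  maxRetries: number; // give up once this many retries have been used
}

const DEFAULT_BACKOFF: BackoffOptions = { baseMs: 1000, maxMs: 60000, maxRetries: 5 };

// Returns the delay before retry number `attempt` (0-based),
// or null once the retry cap has been reached.
function nextDelayMs(attempt: number, opts: BackoffOptions = DEFAULT_BACKOFF): number | null {
  if (attempt >= opts.maxRetries) return null;
  // Double the delay on every attempt, but never exceed maxMs.
  return Math.min(opts.baseMs * 2 ** attempt, opts.maxMs);
}
```

A caller would stop retrying (and mark the message as failed) as soon as `nextDelayMs` returns `null`, instead of restarting the generator unconditionally.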

Current Behavior

  • No circuit breaker exists
  • No distinction between retryable (429, 503) and non-retryable (400, 401, 403) errors
  • retry_count field on pending_messages is never incremented (always 0)
  • Sessions retry indefinitely via pending-work-restart generator pattern
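For illustration, a circuit breaker in front of the pending-work restart might be shaped like the sketch below. I don't know how the restart logic is structured internally, so `FatalErrorBreaker` and `shouldRestartGenerator` are hypothetical names; only the `pendingCount` value is taken from the log above.

```typescript
// Hypothetical sketch: class and function names are illustrative, not from claude-mem.
class FatalErrorBreaker {
  private trippedReason: string | null = null;

  // Call when a non-retryable error (e.g. API_KEY_INVALID) is observed.
  trip(reason: string): void {
    this.trippedReason = reason;
  }

  // Call once the operator has updated the key or configuration.
  reset(): void {
    this.trippedReason = null;
  }

  // True while processing should stay paused.
  isOpen(): boolean {
    return this.trippedReason !== null;
  }

  // The reason recorded by the last trip(), or null if the breaker is closed.
  reason(): string | null {
    return this.trippedReason;
  }
}

// Restart only when there is pending work AND no fatal error has been recorded.
function shouldRestartGenerator(pendingCount: number, breaker: FatalErrorBreaker): boolean {
  return pendingCount > 0 && !breaker.isOpen();
}
```

With a check like this, the "Pending work remains after generator exit" path would pause after the first API_KEY_INVALID response rather than restarting every few seconds.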

Environment

  • claude-mem version: 10.5.5
  • Provider: gemini (gemini-2.5-flash-lite)
  • OS: macOS

Suggested Fix

Classify HTTP status codes into retryable vs non-retryable:

  • Retryable: 429 (rate limit), 500, 502, 503, 504 → exponential backoff with max retries
  • Non-retryable: 400 (bad request), 401 (unauthorized), 403 (forbidden) → fail immediately, log error, stop session processing
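A minimal sketch of that classification, assuming the error body has the shape shown in the log above. The function and type names (`classifyStatus`, `isInvalidApiKey`, `GeminiErrorBody`) are hypothetical, and treating unlisted status codes as "unknown" is my own assumption; the project may prefer a different default.

```typescript
// Hypothetical sketch: names are illustrative, not from the claude-mem codebase.
type ErrorClass = "retryable" | "fatal" | "unknown";

const RETRYABLE_STATUS = new Set<number>([429, 500, 502, 503, 504]);
const FATAL_STATUS = new Set<number>([400, 401, 403]);

// Maps an HTTP status code onto the two classes proposed above.
// Codes outside both lists are reported as "unknown" (an assumption).
function classifyStatus(status: number): ErrorClass {
  if (RETRYABLE_STATUS.has(status)) return "retryable";
  if (FATAL_STATUS.has(status)) return "fatal";
  return "unknown";
}

// Shape of the error body as it appears in the log evidence.
interface GeminiErrorBody {
  error?: {
    code?: number;
    message?: string;
    status?: string;
    details?: Array<{ reason?: string }>;
  };
}

// Detects the specific API_KEY_INVALID reason inside error.details.
function isInvalidApiKey(body: GeminiErrorBody): boolean {
  const details = body.error?.details ?? [];
  return details.some((d) => d.reason === "API_KEY_INVALID");
}
```

Checking the `reason` field in addition to the status code lets the worker log a specific "update your API key" message instead of a generic 400 failure.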
