[Bug]: file-embeddings queue retry storm exhausts Redis connections when Ollama is not installed #351

@skyam25

Issue Category

Docker/Container Issues / System Performance/Resources

Bug Description

When Project N.O.M.A.D. is deployed without Ollama (data-only setup for offline knowledge), the file-embeddings background queue worker causes a retry storm that eventually exhausts Redis client connections, producing cascading EPIPE, ECONNRESET, and ERR max number of clients reached errors.

The root cause is that entrypoint.sh unconditionally starts workers for all queues (node ace queue:work --all), including file-embeddings. Each downloaded ZIM file triggers embedding jobs that immediately fail with "Ollama service is not installed or running", then retry 30 times with 60-second backoff. With many ZIM files downloading in parallel, dozens of jobs accumulate and cycle through retries simultaneously, overwhelming Redis.
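
For a rough sense of scale, the retry arithmetic described above can be sketched as follows. The job count is a hypothetical input (the report only says "dozens"), and the sketch treats 30 as the number of failed attempts per job at a fixed 60-second interval. It counts failed attempts only and does not model how Redis connections are opened or released.

```shell
#!/bin/sh
# Back-of-envelope estimate of failed embedding attempts during a retry storm.
# Assumptions (not measured): 30 attempts per job, fixed 60 s delay between them.

RETRIES=30        # attempts per job, as stated in the report
BACKOFF_SECS=60   # delay between attempts, as stated in the report

# total_attempts JOBS -> failed attempts before every job gives up
total_attempts() {
  echo $(( $1 * RETRIES ))
}

# retry_window_minutes -> how long a single job keeps cycling
retry_window_minutes() {
  echo $(( RETRIES * BACKOFF_SECS / 60 ))
}

# attempts_per_minute JOBS -> failure rate while all jobs are still retrying
attempts_per_minute() {
  echo $(( $1 * 60 / BACKOFF_SECS ))
}

JOBS="${JOBS:-48}"   # hypothetical number of queued embed-file jobs
echo "jobs=$JOBS attempts=$(total_attempts "$JOBS") window_min=$(retry_window_minutes) per_min=$(attempts_per_minute "$JOBS")"
```

Under these assumptions, every job that enters the queue keeps failing for about half an hour, so jobs from ZIM files that finish downloading during that window stack on top of the ones already retrying.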

Steps to Reproduce

  1. Install Project N.O.M.A.D. without Ollama (skip AI setup)
  2. Use the Easy Setup wizard to select curated collections for download
  3. Wait for ZIM files to start downloading
  4. Run docker logs nomad_admin and watch the output: embedding errors begin immediately and compound as more ZIM files complete

Expected Behavior

The file-embeddings queue should not run (or should gracefully no-op) when Ollama is not installed. Downloading ZIM files for offline knowledge should not require an AI/embedding service.

Actual Behavior

The admin container logs fill with repeating errors:

[ error ] [file-embeddings] Job failed: <job_id>, Error: Ollama service is not installed or running.

After enough jobs accumulate, Redis connection errors begin:

Error: write EPIPE
Error: read ECONNRESET
ERR max number of clients reached

This can degrade downloads and any other healthy queue that shares the same Redis instance.
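
One way to confirm that clients are piling up is to compare Redis's connected_clients counter against its maxclients limit. The sketch below parses the output of redis-cli INFO clients. The container name nomad_redis is an assumption (this report only names nomad_admin), so substitute whatever the Redis container is called in your deployment.

```shell
#!/bin/sh
# Extract one numeric field from `redis-cli INFO` output read on stdin.
# INFO lines look like "connected_clients:42" and are CRLF-terminated,
# so carriage returns are stripped before matching.
info_field() {
  tr -d '\r' | awk -F: -v key="$1" '$1 == key { print $2 }'
}

# Hypothetical usage against a running container (container name is an assumption):
#   docker exec nomad_redis redis-cli INFO clients | info_field connected_clients
#   docker exec nomad_redis redis-cli CONFIG GET maxclients
```

If connected_clients climbs steadily toward maxclients while the embedding errors repeat, that would be consistent with the failure sequence described here.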

N.O.M.A.D. Version

v1.29.1

Operating System

Other Debian-based

Docker Version

Docker version 24.0 (Synology DSM Container Manager, but the bug is platform-agnostic — it affects any deployment without Ollama)

Do you have a dedicated GPU?

No

System Specifications

  • The bug is not hardware-dependent
  • Reproduced on x86_64 with 20GB RAM and 57TB storage
  • Redis is default configuration (redis:7-alpine)

Relevant Logs

[ info ] [file-embeddings] Processing job: 8cf56dfc5718f379 of type: embed-file
[ error ] [file-embeddings] Job failed: 8cf56dfc5718f379, Error: Ollama service is not installed or running.
[ info ] [file-embeddings] Processing job: db86870d385f276a of type: embed-file
[ error ] [file-embeddings] Job failed: db86870d385f276a, Error: Ollama service is not installed or running.
... (repeats for every queued job, every 60 seconds, up to 30 retries each)

Error: write EPIPE
    at WriteWrap.onWriteComplete [as oncomplete] (node:internal/stream_base_commons:94:16)
    at Socket._write (node:net:978:8)
    ...
    errno: -32,
    code: 'EPIPE',
    syscall: 'write'

Error: read ECONNRESET

Additional Context

Workaround: Replace node ace queue:work --all in entrypoint.sh with individual queue workers, excluding file-embeddings:

node ace queue:work --queue downloads &
node ace queue:work --queue model-downloads &
node ace queue:work --queue benchmarks &
node ace queue:work --queue system &
node ace queue:work --queue service-updates &

Suggested fixes (any of these would resolve it):

  1. Don't start file-embeddings worker when Ollama is not installed — check Ollama service status before launching the queue worker in entrypoint.sh or via an env var toggle
  2. Don't dispatch embedding jobs when Ollama is absent — check in rag_service.ts / rag_controller.ts before calling EmbedFileJob.dispatch()
  3. Fail fast without retries for "service not installed" — in embed_file_job.ts, distinguish between transient errors (network timeout → retry) and permanent errors (service not installed → don't retry). Currently all failures retry 30 times with 60s backoff
  4. Add an env var like DISABLE_EMBEDDINGS=true — skip starting the file-embeddings worker entirely for data-only deployments
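
Options 1 and 4 could be combined in entrypoint.sh roughly as follows. This is a sketch, not the project's actual entrypoint: DISABLE_EMBEDDINGS is only the variable name proposed above, the queue names are the ones listed in the workaround, and START_WORKERS is a hypothetical guard added so the functions can be sourced without launching anything.

```shell
#!/bin/sh
# Queues the workaround above keeps running (names taken from this report).
BASE_QUEUES="downloads model-downloads benchmarks system service-updates"

# Print the queues to start, one per line.
# file-embeddings is included unless DISABLE_EMBEDDINGS=true (proposed variable).
queues_to_start() {
  for q in $BASE_QUEUES; do
    echo "$q"
  done
  if [ "${DISABLE_EMBEDDINGS:-false}" != "true" ]; then
    echo "file-embeddings"
  fi
}

# Launch one background worker per selected queue,
# using the same command form as the workaround.
start_workers() {
  for q in $(queues_to_start); do
    node ace queue:work --queue "$q" &
  done
}

# START_WORKERS is a hypothetical guard; a real entrypoint would
# simply call start_workers and then continue with the rest of its startup.
if [ "${START_WORKERS:-false}" = "true" ]; then
  start_workers
fi
```

Defaulting to the current behavior (embeddings on) keeps existing Ollama deployments unchanged, while a data-only install would set DISABLE_EMBEDDINGS=true in its container environment. This only stops the worker; jobs would still be dispatched and sit in the queue unless option 2 is also applied.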
