2026-03-26 · 14 min read

Workers and queues: how we moved heavy processes out of the main app

Practical case of how we delegated AI content generation, video, and image processing to dedicated workers with Redis and BullMQ so the main app doesn't block under load.

Workers · BullMQ · Redis · Node.js · Architecture · Microservices · Async

The real problem: async isn't enough

My business partner and I are building an app with several time-consuming processes: LLM content generation, video generation, image processing, and deep research with multiple chained calls. At first, we assumed async/await solved the problem: Node.js is asynchronous, it doesn't block, all good.

Reality was different. Under real load, the app became unresponsive. Simple endpoints like login or GET profile started timing out. Heavy processes didn't just take time themselves; they degraded everything else running in the same process.

The diagnosis: the Node.js process was exhausting every available resource, from open connections to memory to CPU cycles. It wasn't a problem of badly written async code. It was an architecture problem.

  • LLM content generation: between 10 and 60 seconds per request, depending on the model and output length.
  • Video generation: between 2 and 10 minutes, with heavy file downloads at the end.
  • Image processing: CPU-bound, directly blocks the event loop.
  • Deep research: 3 to 5 chained LLM calls with intermediate processing, up to several minutes total.

App handling everything in-process: the saturation problem


When multiple users trigger heavy processes at the same time, the Node.js process consumes all its resources and simple endpoints start failing.

Why async doesn't mean free

The most common misconception: if I use await, the thread is free. That's partially true for lightweight I/O such as database queries. When the database driver responds in 5ms, the event loop does other things while waiting. But that's not what happens with our heavy processes.

An LLM call that takes 45 seconds keeps an HTTP connection open for 45 seconds. That connection consumes one socket from the pool. If we have 10 concurrent LLM requests, we have 10 occupied sockets, memory buffers for partial streaming responses, and when the full response arrives there is JSON processing that can become CPU-bound. The event loop isn't free; it's busy handling all of that.

Image processing is worse: it's directly CPU-bound. While that work runs, the event loop is blocked and no other callback can execute. For Node.js, it's as if the whole server pauses.

  • Connection pool sockets: each open HTTP request occupies one until it ends.
  • Memory for buffers: large responses (LLMs, files) accumulate in memory during transfer.
  • DB connection slots: if the worker needs to save intermediate results, it occupies additional connections.
  • CPU time: parsing large JSON, image manipulation, and text processing directly block the event loop.
| Type of operation | Blocks event loop? | Consumes sockets? | Impact under load |
| --- | --- | --- | --- |
| DB query (5 ms) | No | Yes, for 5 ms | Almost none |
| LLM call (45 s) | No (but occupies a socket for 45 s) | Yes, for 45 seconds | High: exhausts the connection pool |
| Image processing | Yes (CPU-bound) | No | Critical: pauses the entire event loop |
| Deep research (5 chained LLMs) | No (but 5 sockets in series) | Yes, in series for minutes | Very high: monopolizes resources for minutes |
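The CPU-bound row of the table is easy to demonstrate in a few lines of Node. Here a synchronous loop stands in for image processing: a 10 ms timer that is long overdue still cannot fire, because the event loop never gets control back until the loop returns.

```javascript
// A synchronous CPU-bound loop, standing in for image processing work.
function busyWork(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // burning CPU; the event loop cannot run any other callback
  }
}

let timerFired = false;
setTimeout(() => { timerFired = true; }, 10); // due in 10 ms

busyWork(100); // blocks the event loop for 100 ms

// The 10 ms timer is long overdue, but it still has not fired: its
// callback can only run once we yield back to the event loop.
console.log('timer fired during busy work?', timerFired); // false
```

In a server, that `busyWork` call is every other user's login request waiting.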

Lightweight I/O vs heavy operations in the event loop


Not all async operations are the same. Lightweight I/O frees the event loop in milliseconds. Heavy operations consume resources for seconds or minutes.

Our concrete case: which processes block us

Our app has four categories of heavy processes. Each one blocks in a different way and has a different resource profile. Understanding that was key to deciding how to split it.

LLM content generation is the most frequent. Each request generates long-form text: articles, analyses, reports. The most capable models can take between 30 and 60 seconds to complete, and the response can be several KB of JSON. Deep research chains 3 to 5 of those calls with intermediate processing between each one, which means a single research operation can keep a socket busy for 3 to 5 minutes total.

Video generation is the slowest. We make calls to external APIs that take between 2 and 10 minutes to respond. Then we need to download the resulting file, which can weigh several hundred MB. Image processing, on the other hand, is fast but CPU-bound: resizing, compressing, converting formats. It doesn't last long, but it blocks the event loop while it runs.

  • LLM generation: 10-60s per call, responses of several KB, high frequency of use.
  • Deep research: 3-5 chained LLM calls + intermediate processing, up to 5 minutes per operation.
  • Video generation: 2-10 minutes of waiting + heavy file downloads, low frequency but highly blocking.
  • Image processing: seconds long but CPU-bound, directly blocks the event loop.

Deep research blocking the app while other requests wait


A single deep research request consumes app resources for minutes. Simple requests from other users wait in line.

The worker/queue pattern: delegate and free up

The solution is conceptually simple: the main app stops executing heavy work. When a video generation request arrives, the app enqueues a job in Redis and responds to the client immediately with a job ID. The client can check the status later. Dedicated workers in separate processes consume the queue and perform the real work.

The paradigm shift matters: instead of synchronous request-response, the flow becomes asynchronous from the client's point of view. The app says 'I got your request, I'll let you know when it's ready' instead of 'wait while I process it now'. This completely frees the main process.

  • Main app stays free: it never waits for heavy-process results anymore, it only enqueues and responds.
  • Isolated workers: if the video worker crashes, the main app and the other workers keep running.
  • Contained failures: processing errors don't affect app availability.
  • Independent scaling: if video is the bottleneck, we add more video workers without touching anything else.

New architecture: lightweight app + dedicated workers


The main app only manages the queue. Each process type has its own dedicated worker that scales and fails independently.

Architecture with Redis + BullMQ

BullMQ is a queue library for Node.js built on top of Redis. It has two main actors: Queue and Worker. The Queue is the producer: the main app creates a Queue instance and calls queue.add() with the job payload. BullMQ persists the job in Redis with a unique ID and places it in the 'waiting' state.

The Worker is the consumer: it runs in a separate process, connects to the same Redis queue, and processes jobs one by one (or N at a time with configured concurrency). When it finishes, it marks the job as 'completed' and emits an event. If it fails, it marks it as 'failed' and BullMQ can retry it automatically.

The job state lives in Redis at all times. That means the main app can check the status of any job at any moment using only the job ID, without needing direct coordination with the worker.

  • Queue: instance in the main app, only adds jobs. It processes nothing.
  • Worker: separate process, consumes jobs from the queue and runs the real logic.
  • Job: unit of work with ID, payload, state, and result.
  • States: waiting, active, completed, failed, delayed.
  • Events: completed, failed, progress; the app can listen to them via Redis.

Full BullMQ job lifecycle


The job state lives in Redis during the whole process. The app can query it at any time with the job ID without direct coordination with the worker.

Scaling workers independently

One of the clearest advantages of this pattern is that scaling becomes surgical. If the video queue has 50 accumulated jobs but AI and Research are fine, we spin up more video workers. The main app doesn't change, the other workers don't change. We only add more consumers for that specific queue.

With Docker Compose it's almost trivial: each worker type is a service, and scaling is done with replicas. The video worker can run on instances with more memory and better network for downloads. The image worker can run on instances with more CPU. BullMQ's internal concurrency also allows a single worker process to handle N jobs in parallel.

  • Horizontal scaling: add instances of the specific worker that is saturated.
  • Concurrency per worker: BullMQ lets you configure how many jobs each instance processes in parallel.
  • Different resources: video workers with more network, image workers with more CPU.
  • Docker Compose replicas: scaling is changing a number, not redeploying the main app.
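A compose file along these lines is what makes scaling "changing a number"; the service names, images, and replica counts below are illustrative, not our exact configuration.

```yaml
services:
  api:
    image: our-app:latest          # main app: only enqueues jobs
  worker-video:
    image: our-worker:latest
    command: ["node", "workers/video.js"]
    deploy:
      replicas: 4                  # video is the bottleneck today
  worker-ai:
    image: our-worker:latest
    command: ["node", "workers/ai.js"]
    deploy:
      replicas: 2
  redis:
    image: redis:7
```

Bumping `replicas` on one worker service adds consumers to that queue without touching `api` at all.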

Independent scaling by worker type


Video is the bottleneck: we run 4 instances. AI and Research run with 2 each. Images with 1. The main app doesn't even notice the scaling.

How the frontend knows the job is done

The client launched a heavy process and received a job ID. Now it needs to know when it finished. There are three approaches with different tradeoffs, and we went through all three at different times.

We started with polling: the frontend calls GET /jobs/:id every N seconds and the main app checks the state in Redis. Simple to implement, works, but it generates unnecessary server load and notification latency depends on the interval. The natural upgrade was SSE (Server-Sent Events): the app keeps an HTTP connection open and pushes the event when the job completes. Near-zero latency, much less load than polling. We use webhooks for service-to-service integrations where there is no frontend waiting.

  • Polling: periodic GET to the status endpoint. Simple but inefficient under load.
  • SSE: long HTTP connection, the server pushes when the job completes. More efficient, minimal latency.
  • Webhooks: the server calls a configured URL when the job finishes. Ideal for backend-to-backend integrations.
| Strategy | Complexity | Notification latency | Server load | Typical use |
| --- | --- | --- | --- | --- |
| Polling | Low | Depends on interval (1-30 s) | High (many requests) | MVP, simple cases |
| SSE | Medium | Almost instant | Medium (persistent connection) | Frontend waiting for result |
| Webhooks | Medium | Almost instant | Low (just 1 request on completion) | Backend-to-backend integrations |

Errors, retries, and dead letter queues

External APIs fail. A lot. Rate limits, timeouts, sporadic 500 errors, services down for maintenance. In our experience, video generation APIs are especially unstable. Without a solid retry strategy, jobs would fail permanently on any transient error.

BullMQ has built-in retry support: when a job fails, you can configure how many retries to make and which backoff strategy to use. We use exponential backoff for most workers: first retry after 30 seconds, second after 2 minutes, third after 10 minutes. This gives external APIs time to recover without hammering them.

Jobs that exhaust all retries go to the Dead Letter Queue: a separate queue where failed jobs stay for manual review or reprocessing. We have a monitoring worker that alerts us when something reaches the DLQ.

  • Configurable attempts per job type: AI with 3 retries, video with 5 (more unstable).
  • Exponential backoff: 30s, 2min, 10min, 1h. Gives APIs time to recover.
  • Timeout per job: video jobs with a 15-minute timeout so they don't hang forever.
  • Dead Letter Queue: failed jobs go to a separate queue for analysis and manual reprocessing.
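Our delay schedule (30 s, 2 min, 10 min, 1 h) is not a clean doubling, so it is easiest to express as an explicit table. BullMQ supports custom backoff strategies, and a function like this can be plugged into the worker settings; the function and constant names here are ours.

```javascript
// Delay before retry N (1-based). Attempts beyond the table reuse the last delay.
const RETRY_DELAYS_MS = [30_000, 120_000, 600_000, 3_600_000]; // 30s, 2min, 10min, 1h

function backoffDelay(attemptsMade) {
  const index = Math.min(attemptsMade - 1, RETRY_DELAYS_MS.length - 1);
  return RETRY_DELAYS_MS[index];
}

console.log(backoffDelay(1)); // 30000   (30 s before the first retry)
console.log(backoffDelay(4)); // 3600000 (1 h)
```

Capping at the last entry means a job configured with many attempts never waits longer than an hour between tries.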

Retry and dead letter queue flow


Exponential backoff protects external APIs from being hammered in a loop. The DLQ guarantees that no job is lost silently.

Connection with progressive migration

What we describe in this post is exactly the pattern from the previous post applied to a real case. The app started as a monolith doing everything: product endpoints, authentication, content generation, file processing. Traffic monitoring (storing metrics in the DB) quickly showed that heavy processes were the bottleneck: they concentrated most of the response time and degraded everything else.

Instead of redesigning the entire architecture at once, we first extracted the highest-impact processes: AI generation and deep research. We stabilized them, monitored them. Then video. Then images. Each extraction was a small, controlled cycle. The monolith kept shrinking, each worker kept stabilizing, and today the main app is significantly lighter and more reliable.

  • Monitor: identify which processes concentrated the highest response time.
  • Detect: heavy processes degraded all endpoints, not just their own.
  • Extract: first AI (highest frequency), then video (highest blocking), then images.
  • Stabilize: each worker in production before starting the next extraction cycle.

Architecture evolution: before and after


The migration wasn't a single event but a repeated cycle: monitor, detect the highest-impact bottleneck, extract, stabilize. Same logic as the previous post, real case.
