Using a Key-Value Store as a Job Queue (and When It Bites)
Why a naive BRPOP list queue silently loses jobs when workers crash, the reliable BLMOVE two-list and Redis Streams consumer-group patterns that fix it, and when to stop hand-rolling.
A key-value store makes a tempting job queue. You already run one for caching, it does LPUSH and BRPOP, and in twenty minutes you have a worker pulling jobs off a list. For a lot of background work that is genuinely the right call. The trouble is that the twenty-minute version quietly drops jobs, and the failure mode does not show up in development or in the demo. It shows up the first time a worker is killed mid-job in production, and the job it was holding is simply gone. This is a walk through why a naive KV queue loses work, the two patterns that fix it, and the honest line where you should stop and reach for something built for the job.
The naive queue and where it loses jobs
The textbook version is one list per queue. Producers LPUSH a serialized job onto the left, workers BRPOP it off the right, blocking until something arrives. It is fast, it is simple, and it works right up until a worker dies.
The problem is BRPOP. The moment it returns, the job is gone from Redis and exists only in the worker's memory. If the worker crashes, is OOM-killed, gets a SIGKILL during a deploy, or loses the network after popping but before finishing, that job is lost forever. Nothing in Redis remembers it was ever being worked on. This is not a bug in Redis; it is what BRPOP is defined to do. Sidekiq's own reliability docs are blunt about it: open-source Sidekiq uses BRPOP, and jobs being processed when a worker crashes are lost. The only way to guarantee durability is to not remove a job from the store until it is actually complete.
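To make the loss window concrete, here is a minimal sketch of the naive queue. It assumes `r` is a redis-py style client (for example `redis.Redis()`); the key name `jobs:queue` and the helper names `enqueue` and `naive_worker_step` are made up for illustration.

```python
import json

QUEUE = "jobs:queue"  # hypothetical key name


def enqueue(r, job: dict) -> None:
    """Producer: push a serialized job onto the left of the list."""
    r.lpush(QUEUE, json.dumps(job))


def naive_worker_step(r, handle, timeout: int = 5) -> bool:
    """Pop one job and run it. Returns False if the pop timed out."""
    popped = r.brpop(QUEUE, timeout=timeout)
    if popped is None:
        return False
    _key, payload = popped
    # From here on the job exists only in this process's memory.
    # If the process dies inside handle(), Redis has no record of it.
    handle(json.loads(payload))
    return True
```

Everything between the `brpop` call returning and `handle` finishing is the window in which a crash loses the job.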
So a single-list queue gives you at-most-once delivery. For sending a "your export is ready" email that is fine; a missed one is an annoyance. For charging a card or provisioning a server, at-most-once is a silent data-loss bug waiting for your next bad deploy.
Pattern one: the reliable queue with two lists
The fix is to never let a job leave the store while it is in flight. Instead of popping it into the worker, you move it atomically to a second list, the processing list, and only delete it from there once the work succeeds.
Redis has a single command for the move. RPOPLPUSH did this for years and is still everywhere in old code, but it has been deprecated since Redis 6.2 in favor of LMOVE (and the blocking BLMOVE), which adds explicit direction control; LMOVE src dst RIGHT LEFT is the equivalent. The worker does:
BLMOVE jobs:queue jobs:processing RIGHT LEFT 5
This pops a job off the work queue and pushes it onto the processing list in one atomic operation. If the worker crashes now, the job is still sitting in jobs:processing, not lost. On success the worker runs LREM jobs:processing 1 <job> to remove exactly that entry. The whole pattern is documented by Redis as the reliable queue, and antirez's original "Reliable queue" pattern note describes the same idea.
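A worker loop step for this pattern might look like the following sketch. It assumes a redis-py style client `r` whose `blmove(first_list, second_list, timeout, src, dest)` and `lrem(name, count, value)` methods map onto the commands above; `reliable_worker_step` is a hypothetical helper name.

```python
import json

QUEUE = "jobs:queue"
PROCESSING = "jobs:processing"


def reliable_worker_step(r, handle, timeout: int = 5) -> bool:
    """Move one job to the processing list, run it, then remove it.

    Returns False if no job arrived within the timeout.
    """
    # Atomic move: the job is never outside the store.
    payload = r.blmove(QUEUE, PROCESSING, timeout, src="RIGHT", dest="LEFT")
    if payload is None:
        return False
    # If handle() raises or the process dies here, the payload is
    # still in PROCESSING and can be recovered later.
    handle(json.loads(payload))
    # Only after success: remove one matching entry from the processing list.
    r.lrem(PROCESSING, 1, payload)
    return True
```

Note that `LREM` matches by value and scans the list, so this sketch is best suited to processing lists that stay short.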
That leaves one gap: a job stuck in jobs:processing because its worker died. Nothing pops it back automatically. You handle this with a side hash that records a claim timestamp per in-flight job, and a recovery worker that periodically scans the processing list and re-queues anything whose claim is older than a visibility timeout. This is the same dead-holder problem you face with distributed locks: a timeout is the only way to tell "still working" from "crashed and never coming back," and any timeout you pick can be wrong. Set it too short and you re-run jobs that were merely slow; too long and a crash leaves work stalled. There is no value that is right for both.
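One way the claim hash and recovery sweep could fit together is sketched below. The key `jobs:claims`, the helper names, and the choice to key the hash by the raw payload are all assumptions for illustration, and `r` is again a redis-py style client. The sketch re-queues before removing, so a sweeper crash or two overlapping sweeps can produce a duplicate rather than a loss.

```python
import time

QUEUE = "jobs:queue"
PROCESSING = "jobs:processing"
CLAIMS = "jobs:claims"  # hypothetical hash: payload -> claim timestamp


def record_claim(r, payload, now=None) -> None:
    """Worker calls this right after BLMOVE returns a payload."""
    r.hset(CLAIMS, payload, time.time() if now is None else now)


def release_claim(r, payload) -> None:
    """Worker calls this after the job succeeds."""
    r.lrem(PROCESSING, 1, payload)
    r.hdel(CLAIMS, payload)


def recover_stalled(r, visibility_timeout: float, now=None) -> int:
    """Re-queue in-flight jobs whose claim is older than the timeout."""
    now = time.time() if now is None else now
    requeued = 0
    for payload in r.lrange(PROCESSING, 0, -1):
        claimed_at = r.hget(CLAIMS, payload)
        if claimed_at is None:
            # Worker died between BLMOVE and recording its claim:
            # start the clock now and revisit on a later sweep.
            r.hset(CLAIMS, payload, now)
            continue
        if now - float(claimed_at) < visibility_timeout:
            continue  # still within its allowance
        # Push to the right so the stalled job is the next one popped.
        r.rpush(QUEUE, payload)
        r.lrem(PROCESSING, 1, payload)
        r.hdel(CLAIMS, payload)
        requeued += 1
    return requeued
```

Keying claims by payload only works if payloads are unique, for example because each one carries a job ID. Wrapping the three writes in a MULTI/EXEC transaction or a Lua script would narrow the duplicate window further.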
Pattern two: streams with consumer groups
If your store speaks the Redis Streams API, you get most of the reliable-queue bookkeeping built in. A stream consumer group tracks delivery for you. XREADGROUP hands a message to a consumer and records it in the group's Pending Entries List (PEL), the set of messages delivered but not yet acknowledged. The worker calls XACK only after it finishes, which removes the message from the PEL. A crash between read and ack leaves the message in the PEL, not lost.
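As a rough sketch of that read, work, ack cycle: the stream key `jobs:stream`, the group name `workers`, and the single `payload` field are illustrative choices, and `r` is assumed to be a redis-py style client created with `decode_responses=True` so field names come back as strings.

```python
import json

STREAM = "jobs:stream"  # hypothetical stream key
GROUP = "workers"       # hypothetical consumer group name


def ensure_group(r) -> None:
    """Create the consumer group (and the stream) if it does not exist."""
    try:
        r.xgroup_create(STREAM, GROUP, id="0", mkstream=True)
    except Exception as exc:
        # Redis replies with a BUSYGROUP error when the group already exists.
        if "BUSYGROUP" not in str(exc):
            raise


def enqueue(r, job: dict):
    """Append a job to the stream; returns the generated entry ID."""
    return r.xadd(STREAM, {"payload": json.dumps(job)})


def stream_worker_step(r, consumer: str, handle, block_ms: int = 5000) -> int:
    """Read up to one new message, run it, and ack it. Returns jobs finished."""
    # ">" asks for messages never delivered to any consumer in the group.
    response = r.xreadgroup(GROUP, consumer, {STREAM: ">"}, count=1, block=block_ms)
    finished = 0
    for _stream, messages in response or []:
        for msg_id, fields in messages:
            # A crash here leaves msg_id in the group's pending list.
            handle(json.loads(fields["payload"]))
            r.xack(STREAM, GROUP, msg_id)  # ack only after success
            finished += 1
    return finished
```

Each worker process would pass its own stable `consumer` name, since the pending list is tracked per consumer.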
Recovery is built in too. XPENDING shows you what is stuck and for how long, and XCLAIM (or XAUTOCLAIM, which is the scan-friendly "find pending older than N ms and reassign" helper) lets another consumer take over messages whose original owner has gone quiet past an idle threshold. This is at-least-once delivery: a message can be redelivered after a crash, so your handlers still need to be idempotent, which is exactly the idempotency-key problem again. Streams give you the queue mechanics; they do not give you exactly-once. Nothing does, for free.
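A recovery pass built on XAUTOCLAIM could be sketched like this. It assumes a redis-py style client with `decode_responses=True`, a client version whose `xautoclaim` returns a list of the form `[next_start_id, messages, ...]`, and the same hypothetical `jobs:stream`, `workers`, and `payload` names used for illustration; `reclaim_stalled` is a made-up helper.

```python
import json

STREAM = "jobs:stream"  # hypothetical stream key
GROUP = "workers"       # hypothetical consumer group name


def reclaim_stalled(r, consumer: str, handle,
                    min_idle_ms: int = 60_000, count: int = 10) -> int:
    """Take over messages idle longer than min_idle_ms and finish them."""
    result = r.xautoclaim(STREAM, GROUP, consumer,
                          min_idle_time=min_idle_ms, start_id="0-0", count=count)
    messages = result[1]  # result[0] is the cursor for the next scan
    finished = 0
    for msg_id, fields in messages:
        if msg_id is None or not fields:
            # Entry was deleted from the stream while still pending.
            continue
        # This may be the second time the job runs, so handle()
        # has to be idempotent.
        handle(json.loads(fields["payload"]))
        r.xack(STREAM, GROUP, msg_id)
        finished += 1
    return finished
```

The `min_idle_ms` threshold plays the same role as the visibility timeout in the two-list pattern, with the same risk of reclaiming work that was merely slow.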
Streams are the better default for a KV-backed queue today. The two-list pattern is what you build when your store has lists but not streams.
The durability gap underneath both patterns
Both patterns assume the store still has your queue after a restart. With in-memory Redis, that assumption depends entirely on your persistence config, and the default is not what you want for a queue: out of the box the append-only file is off and Redis relies on periodic RDB snapshots, so a crash can lose every job enqueued since the last snapshot. Redis also acknowledges a write before it hits disk. Even with the append-only file enabled, the default appendfsync everysec flushes once a second, so a crash can lose up to a second of writes: jobs that were accepted and confirmed but never made it to disk. appendfsync always closes that window at a real throughput cost. This is the same trade-off covered in Redis persistence configuration, and for a queue it is not academic: that one-second window is jobs your producers think were enqueued and your workers will never see.
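A startup check along these lines can surface the gap before it costs jobs. This is a sketch only: it assumes a redis-py style client whose `config_get` returns a dict of string settings, and that the server permits CONFIG GET, which some managed services disable. `durability_warnings` is a hypothetical helper name.

```python
def durability_warnings(r) -> list:
    """Return human-readable warnings about queue-relevant persistence settings."""
    cfg = r.config_get("append*")  # includes appendonly and appendfsync
    warnings = []
    if cfg.get("appendonly") != "yes":
        warnings.append(
            "appendonly is not 'yes': enqueued jobs survive a crash "
            "only if a snapshot happened to capture them"
        )
    elif cfg.get("appendfsync") != "always":
        warnings.append(
            "appendfsync is %r: acknowledged enqueues can be lost "
            "if the server crashes before the next fsync" % cfg.get("appendfsync")
        )
    return warnings
```

A producer service might log these at boot, or refuse to start for queues where losing an acknowledged job is unacceptable.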
A disk-based KV store changes the shape of this. Because the data already lives in a file rather than RAM, a durable write is the normal path instead of an opt-in flag. BaseKV, for instance, fsyncs each write before acknowledging by default (its durability and write-loop design covers the single-writer model), so an enqueue that returned has hit disk. That removes the "lost the last second of jobs" failure mode, though it does nothing about the at-most-once BRPOP problem: you still have to use the reliable-queue or streams pattern on top. Durability of the store and durability of the queue protocol are two separate things, and you need both.
When NOT to use a KV store as your queue
A KV store is a fine job queue when jobs are independent, redeliveries are tolerable because handlers are idempotent, volume is moderate, and the work is genuinely background. Most "send an email," "resize an image," "rebuild a cache" workloads fit comfortably.
Stop and pick a real queue or workflow engine when you need any of these: fan-out to many independent subscribers (that is pub/sub or a log, not a work queue), strict ordering across a partition under concurrency, multi-step workflows with retries and compensation, scheduled or delayed jobs at scale, or an auditable history of every job's outcome. Tools built for this, whether managed queues, Postgres-backed job runners using SELECT ... FOR UPDATE SKIP LOCKED, or durable-execution frameworks, give you those guarantees instead of asking you to reconstruct them out of lists and timeouts. The recurring "Celery just stops running tasks" and "my webhook queue silently dropped orders" threads are usually about a queue that outgrew the simple pattern it was built on.
The takeaway
A key-value store can be a solid job queue, but a single list with BRPOP is not one; it loses every job a crashing worker was holding. Use the reliable two-list pattern with BLMOVE and a visibility-timeout recovery sweep, or use streams with consumer groups and XACK/XAUTOCLAIM, and make handlers idempotent because both give you at-least-once, not exactly-once. Underneath, make sure the store actually persists an acknowledged enqueue, because a one-second fsync window is invisible until the day it eats a batch of jobs. And when you need ordering, scheduling, fan-out, or workflows, that is the signal to stop hand-rolling and use a tool built for queues.
Related: Distributed Locks with a Key-Value Store, Idempotency Keys with a Key-Value Store, Redis Persistence Configuration, BaseKV Internals: Durability, the Write Loop, and TTL, Key-Value vs Redis.