Idempotency Keys with a Key-Value Store: The Atomic Claim Everyone Gets Wrong
How to implement idempotency keys on a KV store: the record you actually store, the read-then-write race that double-charges cards, and why the atomic claim (Redis SET NX) is the whole game. Plus where Cloudflare Workers KV's eventual consistency breaks the guarantee.
A client sends a POST to charge a card. The request succeeds on the server, but the response gets lost on the way back. The client times out, retries, and now you have charged the card twice. Idempotency keys are the standard fix, and a key-value store is the natural place to keep them. This is a walkthrough of the pattern: the data you actually need to store, the race condition almost every first implementation gets wrong, and how the storage choice changes the guarantees you can make.
What an idempotency key actually is
The client attaches a unique token to a request, usually in an Idempotency-Key header. The server records the outcome of the first request under that token. If a request arrives with a token the server has already seen, the server skips the work and returns the stored outcome instead of executing again.
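On the client side, the important detail is that the token is generated once per logical operation and reused unchanged on every retry. The sketch below illustrates that under stated assumptions: `send` is a hypothetical transport callable (not a real library API) that returns an HTTP status code or raises `TimeoutError` when the response is lost.

```python
import uuid
from typing import Callable


def post_with_retries(
    send: Callable[[dict, dict], int],
    body: dict,
    max_attempts: int = 3,
) -> int:
    """Send `body` up to `max_attempts` times under a single idempotency key.

    `send(headers, body)` is a hypothetical transport used for illustration.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # Generate the key once, outside the retry loop. A fresh key per attempt
    # would make every retry look like a brand-new request to the server.
    headers = {"Idempotency-Key": str(uuid.uuid4())}
    for attempt in range(max_attempts):
        try:
            return send(headers, body)
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the timeout to the caller
    raise AssertionError("unreachable")
```

A real client would normally sleep with backoff between attempts; that is left out here to keep the key handling in focus.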
Stripe's API is the reference implementation. It accepts an Idempotency-Key header on POST requests, suggests a V4 UUID or any random string with enough entropy to avoid collisions, and allows keys up to 255 characters. When a request replays, Stripe returns the saved status code and body of the original request, including a 500 if the first attempt failed that way. If you reuse a key with different request parameters, Stripe rejects it rather than silently returning the wrong stored result.
There is also an IETF Internet-Draft, draft-ietf-httpapi-idempotency-key-header (httpapi working group, draft 07, which expired in October 2025), that formalizes the header. It is worth reading for the response-code conventions even though it never reached RFC status. The Idempotency-Key header itself is documented on MDN.
Why KV is the right shape
The access pattern is pure point lookup by a single opaque key, with a write that should only land if the key is new. There are no joins, no range scans, no secondary indexes. That is the textbook key-value workload. The two operations you need are:
- Atomically claim a key if it does not already exist.
- Read back the stored result for a key that does exist.
The atomic claim is the whole game. If two retries of the same request hit two server instances at the same moment, they must not both proceed to do the work. Exactly one should win the claim and run; the other should wait for or read the winner's result.
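Those two operations fit in a very small interface. The following is a toy, single-process sketch of it: the class name `InMemoryClaimStore` and its methods are made up for illustration, and a lock stands in for the atomicity a real store would provide across processes.

```python
import threading
from typing import Optional


class InMemoryClaimStore:
    """Toy store exposing only the two operations the pattern needs."""

    def __init__(self) -> None:
        self._data = {}
        self._lock = threading.Lock()

    def claim(self, key: str, value: str) -> bool:
        """Write `value` only if `key` is absent. True means this caller won."""
        with self._lock:  # the check and the write happen under one lock
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def get(self, key: str) -> Optional[str]:
        """Read back whatever is stored for `key`, or None if nothing is."""
        with self._lock:
            return self._data.get(key)
```

However many threads call `claim` with the same key, exactly one of them gets `True`.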
The record you store
A common mistake is storing only the key. You need enough to answer a replay correctly:
- The key itself (the lookup).
- A status: in_progress, completed, or failed.
- The saved response: HTTP status code and serialized body, written once the work finishes.
- A request fingerprint: a hash of the method, path, and body. This lets you detect when the same key is reused for a different payload.
- A created-at timestamp, so you can expire old records.
The fingerprint matters more than it looks. Without it, a client that recycles a key for an unrelated request gets back the wrong stored response, which is worse than a duplicate execution. With it, you can reject the mismatch. The IETF draft recommends a 409 Conflict when a request with the same key is still being processed, and a 422 when the key is reused with a different request fingerprint. Those two codes cover the cases a naive implementation drops.
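As a rough illustration of the record and the fingerprint check, here is one possible shape. The field names, the choice of SHA-256, and the helper `replay_status` are assumptions made for this sketch, not something the draft or Stripe prescribes.

```python
import hashlib
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Optional


def fingerprint(method: str, path: str, body: bytes) -> str:
    """SHA-256 over method, path, and raw body.

    Each part is length-prefixed so that ("/a", b"b") and ("/ab", b"")
    cannot produce the same digest.
    """
    h = hashlib.sha256()
    for part in (method.upper().encode(), path.encode(), body):
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.hexdigest()


@dataclass
class IdempotencyRecord:
    key: str
    fingerprint: str
    status: str = "in_progress"            # in_progress | completed | failed
    response_status: Optional[int] = None  # filled in once the work finishes
    response_body: Optional[str] = None
    created_at: float = field(default_factory=time.time)

    def to_json(self) -> str:
        return json.dumps(asdict(self))


def replay_status(record: IdempotencyRecord, request_fp: str) -> Optional[int]:
    """HTTP status code to return for a request whose key already exists."""
    if record.fingerprint != request_fp:
        return 422  # same key, different payload
    if record.status == "in_progress":
        return 409  # the original request is still running
    return record.response_status  # completed or failed: replay verbatim
```

Storing the fingerprint from the moment the record is created, rather than only on completion, lets the mismatch check work while the first request is still in progress.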
The race condition
Here is the version that looks correct and is not:
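val existing = kv.get(key)                       // read
if (existing != null) return existing.response   // branch in application code
doTheWork()                                      // every request that read null gets here
kv.set(key, completedRecord)                     // write, too late to stop the others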
Two concurrent retries both run get, both see null, and both call doTheWork(). The check and the write are separate steps, so the gap between them is a window where a second request slips through. You charged the card twice anyway.
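The interleaving is easy to reproduce on purpose. In the sketch below, a `threading.Barrier` holds every request between its read and its write, which forces the timing that production traffic hits by chance. `NaiveStore` and `run_race` are illustrative names, and appending to a list stands in for charging the card.

```python
import threading


class NaiveStore:
    """Plain get/set store with no conditional write."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def run_race(n_requests: int = 2) -> int:
    """Return how many times the work ran for ONE idempotency key."""
    kv = NaiveStore()
    charges = []
    # Every thread waits here after reading, so all reads finish before
    # any write happens.
    after_read = threading.Barrier(n_requests)

    def handle(key: str) -> None:
        existing = kv.get(key)
        after_read.wait()
        if existing is not None:
            return
        charges.append(key)  # stand-in for doTheWork()
        kv.set(key, "completed")

    threads = [
        threading.Thread(target=handle, args=("key-1",))
        for _ in range(n_requests)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(charges)
```

`run_race()` returns 2: both requests saw an empty store and both did the work.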
The fix is a single atomic claim. In Redis that is one command:
SET <key> "in_progress" NX EX 86400
NX means set only if the key does not exist. EX 86400 sets a one-day expiry in the same operation. The reply is OK if you won the claim and nil if the key already existed. There is no window between a read and a write because there is no read; the decision and the write are the same operation. The client that got OK runs the work and overwrites the record with the final response, passing EX again or KEEPTTL (Redis 6.0 and later), because a plain SET clears the expiry. The client that got nil reads the existing record and either returns the stored response or, if the status is still in_progress, returns 409 so the caller retries shortly. Since Redis 7.0 you can combine SET NX with the GET option to claim and read the previous value in one round trip. With GET the reply changes meaning: nil says the key was absent and you won the claim, and any other reply is the value that was already there.
The general rule: the claim has to be one atomic operation that both decides ownership and records it. Any design that reads, branches in your application code, and then writes has the race.
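In application code, the claim might look like the sketch below. It assumes `client` is a redis-py `redis.Redis` instance or anything with a compatible `set` method; the helper names `try_claim` and `store_result` are made up here. The block imports nothing from Redis, so it can be exercised with a stub.

```python
from typing import Any

CLAIM_TTL_SECONDS = 86400  # one day, as in the SET ... EX 86400 command


def try_claim(client: Any, key: str, ttl: int = CLAIM_TTL_SECONDS) -> bool:
    """Issue SET key in_progress NX EX ttl. True means this caller owns the work.

    Assumption: `client.set(..., nx=True)` behaves like redis-py, returning a
    truthy value when the key was written and None when it already existed.
    """
    return bool(client.set(key, "in_progress", nx=True, ex=ttl))


def store_result(client: Any, key: str, serialized_record: str) -> None:
    """Overwrite the in_progress marker with the final record.

    keepttl=True maps to the KEEPTTL option, so the expiry set by the claim
    survives the overwrite. A plain SET would clear it.
    """
    client.set(key, serialized_record, keepttl=True)
```

Only the caller that gets `True` from `try_claim` should run the work and call `store_result`; every other caller reads the record instead.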
Where storage choice changes the guarantees
The pattern is identical across stores, but the consistency model is not, and that determines whether the guarantee actually holds.
A store with a real conditional or compare-and-set primitive, exposed to your code, gives you a true atomic claim. Redis SET NX is the simplest example. Anything with a documented "put if absent" or conditional-write API works the same way.
Cloudflare Workers KV is the case to watch. Its put() takes an expirationTtl (minimum 60 seconds) which handles expiry cleanly, but the public API has no conditional-write or check-and-set primitive, and reads are eventually consistent: a write is visible immediately at the location that made it but can take up to 60 seconds to propagate to other locations. That means two retries hitting two different points of presence can both read "no key" well after one has written it. Workers KV alone cannot enforce a single-claim guarantee. Cloudflare's own guidance is to route writes for a given key through a Durable Object when you need write-after-write consistency, which is exactly the situation here. Workers KV is fine as the durable store of the final response; it is not fine as the thing that arbitrates the claim.
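The failure can be shown in a toy model. The class below is not the Workers KV API; it is a simplified simulation in which each location sees its own writes immediately and other locations see them only after an explicit `propagate()` call, standing in for the real time-based delay. The location names are arbitrary.

```python
class EventuallyConsistentKV:
    """Toy model of a store with per-location views and delayed propagation."""

    def __init__(self, locations):
        self._views = {loc: {} for loc in locations}
        self._pending = []

    def get(self, location, key):
        return self._views[location].get(key)

    def put(self, location, key, value):
        self._views[location][key] = value  # visible locally at once
        self._pending.append((key, value))  # visible elsewhere later

    def propagate(self):
        """Deliver every pending write to every location."""
        for key, value in self._pending:
            for view in self._views.values():
                view[key] = value
        self._pending.clear()


def handle(kv, location, key, executed):
    """Read-then-write dedup, the only option without a conditional write."""
    if kv.get(location, key) is not None:
        return "replayed"
    kv.put(location, key, "in_progress")
    executed.append(location)  # stand-in for the real side effect
    return "executed"


kv = EventuallyConsistentKV(["fra", "sjc"])
executed = []
first = handle(kv, "fra", "key-1", executed)
second = handle(kv, "sjc", "key-1", executed)  # the write has not arrived yet
kv.propagate()
third = handle(kv, "sjc", "key-1", executed)
```

Here `first` and `second` are both "executed" and only `third` is "replayed": the second location did the work again because the first location's write had not reached it.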
The practical decision: if your idempotency layer protects something with real consequences, like a payment or an outbound webhook, the claim must run against a strongly consistent store with an atomic conditional write. If you are only deduplicating idempotent-ish background work where a rare double-execution is harmless, eventual consistency is tolerable.
TTL and cleanup
Idempotency records are not permanent. Stripe retains keys for at least 24 hours and may prune them after that; a request that replays past the window is treated as new. Twenty-four hours is a sensible default because it comfortably outlasts any reasonable client retry loop while keeping the keyspace from growing without bound.
Set the expiry in the same write that creates the record, the way SET ... EX or expirationTtl do, so cleanup is automatic and you never run a sweep job. Tie the TTL to your retry policy: if a client can retry for up to an hour, a one-hour minimum is the floor, and 24 hours gives you margin.
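One way to tie the TTL to the retry policy is to compute an upper bound on the client's retry window and never go below it. The sketch assumes exponential backoff with a cap and a fixed per-request timeout; the parameter names and default values are illustrative, not taken from any particular client library.

```python
import math


def retry_window_seconds(
    max_attempts: int,
    base_delay: float,
    factor: float = 2.0,
    max_delay: float = 60.0,
    request_timeout: float = 30.0,
) -> float:
    """Upper bound on the time between a client's first and last attempt."""
    # Delays sit between attempts, so there are max_attempts - 1 of them.
    delays = [
        min(base_delay * factor ** i, max_delay)
        for i in range(max_attempts - 1)
    ]
    # Every attempt except the last can also use up a full timeout before
    # the next retry is scheduled.
    return sum(delays) + request_timeout * (max_attempts - 1)


def idempotency_ttl(retry_window: float, floor: int = 86400) -> int:
    """TTL in seconds: at least the retry window, and at least `floor`."""
    return max(math.ceil(retry_window), floor)
```

With five attempts and a one-second base delay, the delays are 1, 2, 4, and 8 seconds plus four 30-second timeouts, so `retry_window_seconds(5, 1.0)` is 135 seconds and the 24-hour floor applies. A client that can retry for longer than a day pushes the TTL above the floor.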
A minimal, correct flow
Putting it together, the request handler does this:
- Read the Idempotency-Key header. If absent and the endpoint requires it, reject with 400.
- Compute the request fingerprint (hash of method, path, body).
- Atomically claim the key with in_progress and a TTL. One operation, not a read then a write.
- If the claim succeeded, do the work, then overwrite the record with the final status code, body, and fingerprint. Return the response.
- If the claim failed, read the existing record.
  - If its fingerprint differs from this request, return 422.
  - If its status is in_progress, return 409 so the client retries shortly.
  - If it is completed or failed, return the stored status and body verbatim.
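The steps above can be sketched end to end. This is a minimal illustration rather than a production handler: `Store` is an in-memory stand-in for a strongly consistent KV store, TTL handling is left out, `handle` and `work` are hypothetical names, and the fingerprint is written at claim time so the mismatch check also works while the first request is still running.

```python
import hashlib
import threading
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

Response = Tuple[int, str]  # (HTTP status code, serialized body)


@dataclass
class Record:
    fingerprint: str
    status: str = "in_progress"
    response: Optional[Response] = None


class Store:
    """In-memory stand-in for a store with an atomic conditional write."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def claim(self, key, record) -> bool:
        with self._lock:  # check and write as one atomic step
            if key in self._data:
                return False
            self._data[key] = record
            return True

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def put(self, key, record):
        with self._lock:
            self._data[key] = record


def handle(store, headers, method, path, body: bytes,
           work: Callable[[], Response]) -> Response:
    key = headers.get("Idempotency-Key")
    if key is None:
        return (400, "missing Idempotency-Key")
    fp = hashlib.sha256(
        b"\n".join([method.encode(), path.encode(), body])
    ).hexdigest()
    if store.claim(key, Record(fingerprint=fp)):
        try:
            response, status = work(), "completed"
        except Exception:
            response, status = (500, "internal error"), "failed"
        store.put(key, Record(fp, status, response))
        return response
    # Claim lost. With a TTL, the record could expire before this read;
    # this sketch has no expiry, so it is always present.
    existing = store.get(key)
    if existing.fingerprint != fp:
        return (422, "key reused with a different request")
    if existing.status == "in_progress":
        return (409, "request in progress")
    return existing.response  # completed or failed: replay verbatim
```

Calling `handle` twice with the same key and body runs `work` once and returns the same response both times, including a stored 500 if the first attempt raised.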
That is the entire pattern. The hard part was never the data model; it was making the claim atomic and being honest about whether your store can do that.
The takeaway
Idempotency keys are a key-value problem with one sharp edge: the claim must be a single atomic conditional write, or concurrent retries defeat the whole thing. Pick a store that exposes that primitive and is consistent enough that the claim actually arbitrates. Store the status, the saved response, and a fingerprint, not just the key. Expire records on a TTL tied to your retry window. Get those right and "the client retried and we did it twice" stops being a class of bug.