What is the difference between an LSM-tree and a B-tree?

A B-tree updates data in place and keeps keys sorted in a shallow tree, so a read is one descent from root to leaf and point reads and short scans are cheap. An LSM-tree never edits data in place: writes are appended to a log and an in-memory buffer, flushed to immutable sorted files (SSTables), and merged later by compaction. That makes writes sequential and cheap but adds read work and background compaction. RocksDB and Cassandra are LSM-trees; LMDB, etcd, SQLite, and BaseKV are B-trees.

Is an LSM-tree faster than a B-tree?

Only for write-heavy workloads. LSM-trees turn every write into a sequential append, which is ideal for high write volume, but they pay with read amplification (a key may live in several files) and write amplification from compaction. B-trees win point reads and short range scans because there is one path to any key. The faster engine is the one whose strength matches your workload mix.

Which key-value databases use a B-tree?

LMDB, BoltDB/bbolt (used by etcd and BaseKV), SQLite, and Berkeley DB use B-trees (B+trees in practice). They favor predictable, low-amplification reads and operational simplicity, with no background compaction that can spike tail latency.

Does the LSM-tree advantage still hold on SSDs?

Less than it used to. The LSM-tree's case rested heavily on spinning disks, where random in-place writes were catastrophic. On modern NVMe SSDs random and sequential writes are far closer in cost, so the write-amplification penalty of in-place B-trees matters less, which strengthens the B-tree case for read-leaning workloads. USENIX has been actively revisiting the tradeoff on current hardware.

LSM-Tree vs B-Tree: How Key-Value Storage Engines Actually Differ (2026)

RocksDB and Cassandra are LSM-trees; LMDB, etcd, SQLite, and BaseKV are B-trees. The same GET/PUT interface hides two opposite machines. What each optimizes, the amplification tradeoffs, and which fits your workload on modern NVMe.

BaseKV Team•June 15, 2026•9 min read

b-treelsm-treestorage-enginerocksdbarchitecturekey-value

Ask two engineers what a "key-value store" is and they will describe the same interface: a key maps to a value, you can GET it, PUT it, and DELETE it. But underneath that identical interface sit two completely different machines, and which one a database picked decades ago still decides whether it is fast at the thing you are about to do with it. The two machines are the log-structured merge tree (LSM-tree) and the B-tree. RocksDB, LevelDB, Cassandra, ScyllaDB, and TiKV are LSM-trees. LMDB, BoltDB, etcd, SQLite, and BaseKV are B-trees. This is the most consequential fork in storage-engine design, and most "Redis vs X" comparisons skip right past it. Here is what actually differs, and how to tell which side of the fork your workload belongs on.

The B-tree: update in place

A B-tree (in practice almost always a B+tree, where values live only in the leaves) keeps your keys sorted on disk in a shallow, wide tree. To read a key you walk from the root down to a leaf, a handful of page reads even for a huge dataset, because the tree is wide and only a few levels deep. To write a key you find the leaf it belongs in and modify that page in place. If the leaf is full, it splits.

The defining property is that word, in place. The B-tree edits the same pages over and over as data changes. That has two happy consequences. Reads are cheap and predictable: there is exactly one path to any key, so a point lookup is O(log n) page reads with no ambiguity about where the current value lives. And short range scans are excellent, because sorted neighbors sit in physically adjacent leaves, so a cursor walks them sequentially.

BaseKV is built on this family. Its storage layer is bbolt, a single memory-mapped B+tree file, the same engine etcd trusts for cluster state. We chose it deliberately, and the reasoning is in the disk-based Redis design bet: the workloads people actually run behind an application are mostly point reads and short scans, the access pattern a B+tree is built for, and the operational story is calm because there is no background machinery that can suddenly stall.

The B-tree's costs are real and worth naming. In-place updates mean random writes scattered across the file, which were brutal on spinning disks and are merely suboptimal on SSDs. Copy-on-write B-trees like bbolt and LMDB avoid corrupting live data by writing new page versions rather than overwriting, which gives clean crash recovery and lock-free reads but means a single changed key can dirty a whole page. And B-trees tend to leave pages partly empty after splits, so they occupy more disk than the logical data strictly needs.

The LSM-tree: never go back and edit

The LSM-tree refuses to update in place at all. A write goes to two places: an append-only write-ahead log on disk for crash safety, and an in-memory sorted buffer called the MemTable. When the MemTable fills, it is flushed to disk as an immutable, sorted file called an SSTable, and a fresh MemTable takes over. Crucially the database never edits an SSTable after writing it. New values for an existing key simply land in a newer SSTable, and old values are reconciled later.

This makes writes extraordinarily cheap at the moment they happen. Every write is a sequential append, the friendliest possible pattern for both SSDs and spinning disks, which is why LSM-trees dominate write-heavy systems. MyRocks, the RocksDB storage engine for MySQL, was built precisely because Facebook's social-graph write volume overwhelmed in-place engines.

The bill comes due in two places. First, a read may have to check several SSTables across multiple levels before it finds the current value of a key, because that key might have been written many times into many files. Engines fight this with Bloom filters and block caches, but the read path is fundamentally more work than a B-tree's single descent. Second, all those immutable files have to be merged and de-duplicated in the background, a process called compaction that rewrites data again and again as it moves down the levels. Compaction is what reclaims the space and keeps reads from degrading, but it consumes disk bandwidth and CPU, and a poorly tuned compaction can stall foreground traffic. RocksDB exposes more than forty tuning parameters largely to manage this tradeoff. LMDB, the canonical B-tree, requires essentially none.

How the costs compare

The cleanest way to reason about the difference is through the three amplifications, which we cover in depth in read, write, and space amplification. The short version:

Write amplification (bytes written to disk per byte your app writes) is the LSM-tree's weak spot. Compaction rewrites the same data several times on its way down the levels. A B-tree writes a changed page once, though copy-on-write can double that.
Read amplification (bytes read per byte requested) is the B-tree's strength. One descent to a leaf versus an LSM-tree's possible walk across multiple files.
Space amplification (disk used per byte of logical data) favors the LSM-tree, because compaction plus block compression packs SSTables tightly, while B-trees leave partly filled pages.

You cannot win all three at once. This is the RUM conjecture: optimize for any two of read, update, and memory cost and you pay for it on the third. LSM-trees buy cheap writes and tight storage by spending read effort. B-trees buy cheap, predictable reads by spending some write and space efficiency. There is no free engine, only an engine whose tradeoffs match your workload or do not.

A useful twist for 2026: the hardware that made this choice obvious has changed. The classic argument that random writes are catastrophic was a spinning-disk argument. On modern NVMe SSDs, random and sequential writes are far closer in cost, which is why USENIX ;login: has been revisiting the B+tree-versus-LSM tradeoff on current storage hardware. The write-amplification penalty of in-place B-trees matters less on NVMe than it did on disk, which quietly strengthens the B-tree case for read-leaning workloads.

Which one is your workload?

Reach for an LSM-tree engine when:

Writes dominate and arrive fast (event logs, time-series, metrics, ingestion pipelines).
The dataset is large and storage cost matters, so you want compression and tight packing.
You can tolerate background compaction and have the operational maturity to tune it.

Reach for a B-tree engine when:

Reads dominate, especially point lookups and short range scans (sessions, config, feature flags, user records, secondary indexes).
You want predictable latency with no background process that can spike tail latency.
You value operational simplicity over squeezing maximum write throughput.

It is no accident that the systems that need bulletproof, low-drama reads of cluster-critical state, like etcd, sit on a B-tree, while the systems built to swallow firehose write volume, like Cassandra and RocksDB-backed stores, are LSM-trees. The interface is the same; the machine underneath is the decision.

Where BaseKV lands

BaseKV speaks the Redis, Memcached, and DynamoDB wire protocols but stores everything in a single B+tree file on disk. That is a deliberate position in the fork above. The disk-first, cost-sensitive lane it targets, durable data that is mostly read and mostly cold, is exactly where a B-tree's predictable reads and quiet operations beat an LSM-tree's write throughput, and where the modern-NVMe shift has narrowed the only metric an LSM-tree clearly won. If you want the full mechanics of how Redis data types are folded onto one ordered tree, the series starts at fitting Redis data types into a B+tree.