
Clément Renault

March 13, 2026

Patching LMDB: How We Made Meilisearch’s Vector Store 333% Faster

Note

This is one blog post in a series:

Let me start this article with a brief introduction to Meilisearch. Meilisearch is a Rust-based search engine that is (very) easy to integrate into your workflow. It features an optional hybrid search that mixes keyword-only results with semantic search. Under the hood, Meilisearch uses LMDB, a lightning-fast memory-mapped key-value store, for its indexes. To perform approximate nearest neighbor (ANN) search, we developed a homemade LMDB-based HNSW named hannoy.

If you are wondering what a Hierarchical Navigable Small World (HNSW) is, I recommend reading this Wikipedia article. If you feel lazy: it is basically a multi-layer graph, with more and more connections between nodes as you descend the layers, that speeds up finding the nearest neighbors of a given query embedding. An embedding, also commonly called a vector, is a list of floating-point numbers, called dimensions, usually between 512 and 3072 of them, that represents the meaning of a text, an image, a sound... Embeddings that are closer in space are considered semantically closer to each other by the embedder that generated them.
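As a toy illustration of "closer in space means semantically closer", here is a minimal Rust sketch computing cosine similarity between two embeddings. The vectors and their values are entirely made up for the example; real embeddings have hundreds or thousands of dimensions.

```rust
// Cosine similarity between two embeddings: values close to 1.0 mean the
// vectors point in similar directions, i.e., they are semantically closer.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    // Toy 4-dimensional vectors (real embeddings have 512-3072 dimensions).
    let cat = [0.9, 0.1, 0.0, 0.2];
    let kitten = [0.8, 0.2, 0.1, 0.3];
    let car = [0.0, 0.9, 0.8, 0.1];
    // "cat" is closer to "kitten" than to "car".
    assert!(cosine_similarity(&cat, &kitten) > cosine_similarity(&cat, &car));
    println!("ok");
}
```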

This article will not dive into the gears of an HNSW but rather into those of LMDB and our vector store. We will talk about a missing feature of our key-value store that we eventually implemented to unlock an even more powerful search engine, enabling indexation to scale linearly with the number of CPUs and to have lower time complexity.

Reminder of the Facts

Some time ago, I wrote about an unsafe-but-sound approach to reading the entries of an LMDB write transaction: one that is not yet committed and whose content is not visible to newly created transactions until it is committed. I'll summarize it here, but you can read the more detailed article describing this technique.

The trick is basically to do a full scan of all the entries we will need when inserting the newly arrived embeddings, and to collect the pointers and associated lengths in a very basic Rust HashMap. We confirmed the validity and safety of this approach with Howard Chu, author and maintainer of LMDB. This way, we can wrap this HashMap in a type and unsafe-ly implement Sync on it so we can use multiple threads to access those memory-mapped pointers (embeddings, HNSW graph nodes).

For people who, like me, prefer reading code to long paragraphs, here is the data structure I am talking about. You can clearly see the full scan: it's the only for loop in this code sample... Yup, that's where the problem lies, and we'll discuss the impact of this loop soon.

pub struct ImmutableItems<'t, D> {
    item_ids: RoaringBitmap,
    constant_length: Option<usize>,
    offsets: Vec<*const u8>,
    _marker: marker::PhantomData<(&'t (), D)>,
}

impl<'t, D: Distance> ImmutableItems<'t, D> {
    pub fn new(rtxn: &'t RoTxn, database: Database<D>, index: u16) -> heed::Result<Self> {
        let mut item_ids = RoaringBitmap::new();
        let mut constant_length = None;
        let mut offsets = Vec::new();

        for result in database.items()? {
            let (item_id, bytes) = result?;
            assert_eq!(*constant_length.get_or_insert(bytes.len()), bytes.len());
            item_ids.push(item_id);
            offsets.push(bytes.as_ptr());
        }

        Ok(ImmutableItems { item_ids, constant_length, offsets, _marker: marker::PhantomData })
    }

    pub fn get(&self, item_id: ItemId) -> heed::Result<Option<Leaf<'t, D>>> {
        let len = match self.constant_length {
            Some(len) => len,
            None => return Ok(None),
        };
        let ptr = match self.item_ids.rank(item_id).checked_sub(1).and_then(|offset| self.offsets.get(offset)) {
            Some(ptr) => *ptr,
            None => return Ok(None),
        };
        let bytes = unsafe { slice::from_raw_parts(ptr, len) };
        ItemCodec::bytes_decode(bytes).map(Some)
    }
}

unsafe impl<D> Sync for ImmutableItems<'_, D> {}
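The `get` method above relies on a rank trick: because `item_ids` is an ordered RoaringBitmap, `rank(item_id) - 1` is exactly the index of the matching pointer in `offsets`. Here is a minimal, std-only sketch of the same idea, using a sorted slice in place of the bitmap and plain integers in place of the raw pointers (names and values are illustrative):

```rust
// Sorted IDs play the role of the RoaringBitmap: for a present ID,
// `binary_search` returns its zero-based position, which equals
// rank(item_id) - 1, i.e., the index of its entry in `offsets`.
fn rank_lookup(sorted_ids: &[u32], offsets: &[usize], item_id: u32) -> Option<usize> {
    let idx = sorted_ids.binary_search(&item_id).ok()?;
    offsets.get(idx).copied()
}

fn main() {
    let sorted_ids = [3u32, 7, 42];
    let offsets = [100usize, 200, 300]; // stand-ins for the raw pointers
    assert_eq!(rank_lookup(&sorted_ids, &offsets, 7), Some(200));
    assert_eq!(rank_lookup(&sorted_ids, &offsets, 8), None); // absent ID
    println!("ok");
}
```

This is why the structure only needs a bitmap and a flat `Vec` of pointers rather than a full map from ID to pointer.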

Here is an extract of what happens in hannoy for a customer that has nearly 19M binary-quantized embeddings (1024 dimensions each) in its vector store and is inserting 7000 embeddings. As you may have noticed, a lot of time is spent retrieving the item IDs, fetching item and link pointers, and finally writing the items to disk.

"progressTrace": {
    // [..]
    "processing tasks > indexing > writing embeddings to database > retrieving the items ids": "559.56s",
    "processing tasks > indexing > writing embeddings to database > retrieve the updated items": "3.31ms",
    "processing tasks > indexing > writing embeddings to database > fetch item pointers": "5.54s",
    "processing tasks > indexing > writing embeddings to database > fetch links pointers": "134.15s",
    "processing tasks > indexing > writing embeddings to database > resolve graph entry points": "228.10µs",
    "processing tasks > indexing > writing embeddings to database > building the graph > inserting items": "57.39ms",
    "processing tasks > indexing > writing embeddings to database > building the graph": "57.43ms",
    "processing tasks > indexing > writing embeddings to database > patch old new deleted links": "21.40s",
    "processing tasks > indexing > writing embeddings to database > writing the items": "413.43s",
    "processing tasks > indexing > writing embeddings to database > deleting the links": "2.04s",
    "processing tasks > indexing > writing embeddings to database > write the metadata": "4.10s",
    "processing tasks > indexing > writing embeddings to database": "1140.28s",
    // [..]
    "processing tasks > indexing > finalizing > committing": "16.23s",
    "processing tasks > indexing > finalizing > computing stats": "22.37ms",
    "processing tasks > indexing > finalizing": "16.62s",
    "processing tasks > indexing": "2750.89s",
    "processing tasks": "2751.21s"
  },
  "writeChannelCongestion": {
    "attempts": 48317068,
    "blocking_attempts": 48126087,
    "blocking_ratio": 0.9960473775863647
  },
  "internalDatabaseSizes": {
    // [..]
    "vectorStore": "8.65 GiB (+132 KiB)",
    "documents": "60.8 GiB (+1.05 MiB)"
  }
}

Stop Computing Useless Stats

The part where we retrieve the item IDs corresponds to fetching the internal IDs of newly inserted or removed embeddings, and this part was also... computing the mean degree of the graph, a stat that helped us debug the vector store back then... So I recently removed it. As you can see, it was performing full scans of the vector store, which is an operation to avoid, especially as the total number of embeddings grows over time. Removing it brings this step from 559s down to about 2-3s and reduces its complexity to O(n), where n is the number of newly inserted or removed embeddings.

Full Scans Are Harmful

And here is where removing the ImmutableItems and ImmutableLinks data-structure tricks has the most effect: fetching the item/link pointers! These steps are no longer necessary; we can simply remove them from the Meilisearch progress trace. Removing these full scans is very beneficial, as we were polluting the cache with useless, unnecessary disk pages just by iterating over all of them to fetch pointers. Retrieving the item IDs (to compute the useless stats) had already warmed the cache with this pollution, which is why the two fetching steps didn't look that huge; once you let them do the full work themselves, you can clearly see them taking a long time.

But What's the New Trick?

That's the meat of the article, and it's very important to understand LMDB's semantics here. LMDB is a memory-mapped key-value store with MVCC and ACID transactions. Those are a lot of complex words, but what matters is that it allows opening a write transaction at the same time as any number of read transactions. The write transaction does not block the read transactions. However, LMDB doesn't support concurrent writes: opening a write transaction blocks any thread attempting to open another one until the open transaction commits or aborts.

Read transactions can only see entries committed by a write transaction before the read one was opened, which also means that opening a read transaction from the thread using the write transaction will only show entries without the changes currently being made in the write one. It's a very useful semantic that we already use in the new Meilisearch indexer, but it doesn't help in this particular situation. Note that a write transaction can read its own writes.

In Meilisearch, when we receive new documents to index, we want them to become available to the search atomically. We don't want the database to be half valid, with references to internal documents/embedding IDs pointing to not-yet-available data. Exposing half-valid internal data structures could break search requests that start in the middle of two valid states, or we would have to implement a way to ignore certain changes until they are marked as valid, e.g., tombstones for not-yet-valid document IDs. We therefore open a write transaction (i.e., RwTxn) at the beginning of task processing, pass a &mut RwTxn around, and let the task-processing function write to the database as needed. Once it is done, it exits, and we commit the transaction if everything went right (i.e., RwTxn::commit(self)), thereby consuming it.

The tricky part is that when we receive embeddings to insert, we put them into our vector store using a write transaction; they're therefore only visible to that transaction. But we need to unlock the power of parallelization: reading and modifying the HNSW graph in parallel. We previously stored all the entry pointers because we didn't know in advance which ones we would need to update the HNSW layers. LMDB scales linearly with the number of cores used to read a database, so I was wondering whether it would be possible to have multiple read transactions on top of an uncommitted write: that would let us read newly inserted embeddings without committing first. Committing first and opening regular read transactions could have done the trick, but it would mean exposing orphan embeddings, i.e., embeddings not yet connected to the graph, to other read transactions, complicating and uglifying the algorithm (track them, remove them if need be).

Designing a New API for LMDB

So, I went through the APIs LMDB exposes and stopped on a very interesting one: LMDB lets you create nested write transactions. If you provide a parent write transaction when creating a new one, you get a write transaction that applies its own set of changes and either commits or aborts, rolling back only the subset of changes it made. The child transaction can read the parent's changes, and the parent transaction is unusable while its single child transaction is alive. My idea was that I needed to create as many read transactions as I wanted from my write transaction and move them onto different threads to read my uncommitted changes. Howard (@hyc) and I briefly discussed what the API could look like, and he gave me permission to work on this very complex C codebase 😮‍💨

So I went on an adventure to implement this new API. It was mostly about modifying the behavior of the mdb_txn_begin function, which previously threw an error when called with both a parent transaction and the MDB_RDONLY flag. The change allows starting nested read-only transactions even though the parent transaction would normally be disabled, enabling multiple nested read transactions to live simultaneously. As you can see from the linked thread, I wasn't following C99 but rather C11: I had to implement an atomically reference-counted (ARC) system to ensure that only the last dropped nested read transaction frees the allocations, and that requires atomics. Unfortunately, the atomics APIs are only available from C11 onward and require enabling extensions... I initially hadn't disabled the parent transaction while child read transactions were still alive, because I was using my LMDB Rust wrapper, heed, which handles this issue via lifetimes. I finally implemented everything C users need so that I could propose a well-working version of the nested read transactions feature.

// Disclaimer: I am not a C developer

MDB_env *env;
MDB_dbi dbi;
MDB_val key, data;
MDB_txn *parent_txn, *txn;

mdb_env_create(&env);
mdb_env_set_maxreaders(env, 1);
mdb_env_set_mapsize(env, 10485760);
mdb_env_open(env, "./testdb", MDB_FIXEDMAP, 0664);

mdb_txn_begin(env, NULL, 0, &parent_txn);
// We can now create as many nested read transactions as we want from the parent
mdb_txn_begin(env, parent_txn, MDB_RDONLY, &txn);

Integrating the API into hannoy

So, it's time to replace our slow-to-initialize ImmutableItems and ImmutableLinks data structures with the following one: a wrapper around nested read transactions that lazily fetches items and graph links on demand. As you can see, we generate and collect enough nested read transactions from the current write transaction to serve all the threads in the current thread pool. Every time FrozenReader::item(&self, ItemId) or FrozenReader::links(&self, ItemId, usize) is invoked (by a rayon thread), we pull a transaction from the channel and store it in thread-local storage (TLS).

pub(crate) struct FrozenReader<'t, D> {
    rtxns_pool: crossbeam_channel::Receiver<RoTxn<'t, WithoutTls>>,
    rtxns: thread_local::ThreadLocal<RoTxn<'t, WithoutTls>>,
    index: u16,
    database: Database<D>,
}

impl<'t, D: Distance> FrozenReader<'t, D> {
    pub fn new(wtxn: &'t RwTxn<'_>, index: u16, database: Database<D>) -> Result<Self> {
        // We make sure to have one more thread so the current/main thread has a nested rtxn.
        let num_threads = rayon::current_num_threads() + 1;
        let (sender, rtxns_pool) = crossbeam_channel::bounded(num_threads);

        // Sequentially generate read transactions from the writer transaction
        for _ in 0..num_threads {
            let rtxn = wtxn.nested_read_txn()?;
            sender.try_send(rtxn).unwrap();
        }

        Ok(Self { rtxns_pool, rtxns: thread_local::ThreadLocal::new(), index, database })
    }

    pub fn item<'a>(&'a self, item_id: ItemId) -> Result<Item<'a, D>> {
        let rtxn = self.rtxns.get_or(|| self.rtxns_pool.try_recv().unwrap());
        let key = Key::item(self.index, item_id);
        // key is a `Key::item` so returned result must be a Node::Item
        self.database.get(rtxn, &key)?.and_then(|node| node.item()).ok_or(Error::missing_key(key))
    }

    pub fn links<'a>(&'a self, item_id: ItemId, level: usize) -> Result<Links<'a>> {
        let rtxn = self.rtxns.get_or(|| self.rtxns_pool.try_recv().unwrap());
        let key = Key::links(self.index, item_id, level as u8);
        // key is a `Key::links` so returned result must be a Node::Links
        self.database.get(rtxn, &key)?.and_then(|node| node.links()).ok_or(Error::missing_key(key))
    }
}

Note that the strange + 1 added to the number of nested read transactions has a purpose. When calling the item and links methods, the channel was sometimes empty because the threads from the rayon pool pulled all the transactions and stored them in their own thread-local storage. When we accessed this data structure from the main thread, which is not part of the rayon thread pool, one read transaction was missing, so we accounted for it. Thanks, @dureuill, for the help on this one.

The ThreadLocal data structure stores data in a thread's local storage, eliminating the need for inter-thread synchronization to access it. We only need a bit of synchronization when pulling a read transaction from the crossbeam channel, but once it's stored thread-locally, we can retrieve it freely. It may seem odd to use read transactions with the WithoutTls marker, i.e., MDB_NOTLS, which indicates that a transaction can be moved between threads safely because it does not use TLS. The reason is that we need to create the nested read transactions from the main thread: RwTxn is !Sync and therefore cannot be accessed concurrently.
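This pool-then-TLS pattern can be sketched with the standard library alone. In this hypothetical, illustrative version, a `Mutex<Vec<_>>` stands in for the crossbeam channel, `std::cell::OnceCell` inside `thread_local!` stands in for the `thread_local` crate, and a plain struct stands in for the nested `RoTxn`:

```rust
use std::cell::OnceCell;
use std::sync::Mutex;
use std::thread;

// Hypothetical stand-in for a nested read transaction.
struct FakeRtxn(u32);

thread_local! {
    // Each thread caches the transaction it pulled from the shared pool.
    static LOCAL_RTXN: OnceCell<FakeRtxn> = OnceCell::new();
}

// The first call on a thread pulls from the pool; later calls reuse the
// thread-local copy with no cross-thread synchronization at all.
fn with_rtxn<R>(pool: &Mutex<Vec<FakeRtxn>>, f: impl FnOnce(&FakeRtxn) -> R) -> R {
    LOCAL_RTXN.with(|cell| {
        let rtxn = cell.get_or_init(|| pool.lock().unwrap().pop().expect("pool exhausted"));
        f(rtxn)
    })
}

fn main() {
    let num_threads: u32 = 4;
    // Pre-fill the pool with one transaction per worker thread.
    let pool = Mutex::new((0..num_threads).map(FakeRtxn).collect::<Vec<_>>());
    thread::scope(|s| {
        for _ in 0..num_threads {
            s.spawn(|| {
                let first = with_rtxn(&pool, |r| r.0);
                let second = with_rtxn(&pool, |r| r.0);
                assert_eq!(first, second); // same cached transaction on this thread
            });
        }
    });
    println!("ok");
}
```

The real FrozenReader follows the same shape, with real LMDB transactions and a bounded crossbeam channel instead of the mutex-guarded vector.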

All those reasons led me to design this approach, which satisfies our needs. It's memory safe by design: there cannot be writes while an Item or a Links is borrowed, because the FrozenReader data structure borrows the write transaction, which means, in terms of Rust rules, you cannot use the RwTxn while that borrow is held. You must drop the FrozenReader to be able to write anything to the database, and that's exactly what we do: we collect the different graph modifications we want to perform in memory while reading the database in parallel, and apply them once we are done. After that, we write the changes synchronously using our original write transaction. Note that even without this safe, compile-time, lifetime-based Rust design, the parent transaction would throw an error if used while child transactions are still alive.
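The compile-time guarantee described above reduces to ordinary Rust borrow rules. A minimal sketch with simplified, illustrative types: the reader borrows the writer immutably, so any mutable use of the writer while the reader is alive is rejected by the compiler.

```rust
// Simplified stand-ins for the RwTxn and the FrozenReader that borrows it.
struct WriterLike { data: Vec<u8> }
struct ReaderLike<'t> { data: &'t [u8] }

// Borrowing the writer immutably "freezes" it for the reader's lifetime.
fn freeze(wtxn: &WriterLike) -> ReaderLike<'_> {
    ReaderLike { data: &wtxn.data }
}

fn main() {
    let mut wtxn = WriterLike { data: vec![1, 2, 3] };
    let reader = freeze(&wtxn);
    // wtxn.data.push(4); // <- would NOT compile: wtxn is immutably borrowed
    assert_eq!(reader.data, &[1u8, 2, 3][..]);
    drop(reader); // releasing the borrow...
    wtxn.data.push(4); // ...makes the writer usable again
    assert_eq!(wtxn.data, vec![1, 2, 3, 4]);
    println!("ok");
}
```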

We upgraded this customer to the brand-new version of Meilisearch, featuring our patched vector store. Let's take a look at a typical indexation now. As a reminder, we were inserting 7000 embeddings in about 1140s, which is about 6 embeddings per second. In the following progress trace, we index ~13300 embeddings in about 668s, i.e., 20 embeddings per second. More than three times faster! This is even better than what I expected.

"progressTrace": {
  // [..]
  "processing tasks > indexing > writing embeddings to database > retrieve the updated items": "3.80ms",
  "processing tasks > indexing > writing embeddings to database > resolve graph entry points": "173.33µs",
  "processing tasks > indexing > writing embeddings to database > building the graph > inserting items": "27.05s",
  "processing tasks > indexing > writing embeddings to database > building the graph": "27.05s",
  "processing tasks > indexing > writing embeddings to database > patch old new deleted links": "232.62s",
  "processing tasks > indexing > writing embeddings to database > writing the items": "402.52s",
  "processing tasks > indexing > writing embeddings to database > deleting the links": "2.13s",
  "processing tasks > indexing > writing embeddings to database > write the metadata": "3.89s",
  "processing tasks > indexing > writing embeddings to database": "668.22s",
  // [..]
  "processing tasks > indexing > finalizing > committing": "24.15s",
  "processing tasks > indexing > finalizing > computing stats": "18.27ms",
  "processing tasks > indexing > finalizing": "24.60s",
  "processing tasks > indexing": "1773.29s",
  "processing tasks": "1773.73s",
  "writing tasks to disk > task": "415.73ms",
  "writing tasks to disk": "415.82ms"
},
"writeChannelCongestion": {
  "attempts": 14819408,
  "blocking_attempts": 14435620,
  "blocking_ratio": 0.9741023182868958
},
"internalDatabaseSizes": {
  // [..]
  "vectorStore": "8.64 GiB (+412 KiB)",
  "documents": "61.01 GiB (+2.55 MiB)"
}

Note that the customer is using sharding to speed up embedding indexation, so they index not just 20 embeddings per second but many more. Note also that we use AWS EBS to ensure we don't lose data. This disk technology is great because it replicates your data across different NVMe drives, but communication between the machine and the disks goes over the network, which is "slow": any page fetch (set up to do read-ahead) takes an average of 1 millisecond. For a disk, that's slow; for a network, it's quite impressive.

So, the sinews of war in this kind of environment is not wasting a single bit of disk cache; the disk cache keeps you from triggering page faults too often and from fetching too many pages from disk. With this new patch we have just landed, we ensure that only the necessary pages are loaded into the cache, not all the pages in LMDB, and we avoid evicting important ones. As a result, you can see that building the graph and patch old new deleted links take longer: those tasks now lazily load the necessary pages that retrieving the items ids, fetch item pointers, and fetch links pointers were loading before. Better pages are in the disk page cache now!

In conclusion, this new LMDB feature is extremely useful and, as a bonus, does not require any API changes. We have been using it in production since December 10th, 2025. It would be helpful if the LMDB team could review the work I have done. I dedicated a lot of personal time to it and ensured that I met all the requirements to eventually get it upstream. Maintaining a fork and keeping it up to date with various upstream patches is not ideal. Thanks anyway for your time and the wonderful tools you provide.

This blog post has been handcrafted by using my own wording and thoughts 🌱
