Mistral AI details fix for vLLM memory leak traced to UCX hooks

Engineering deep dive published on 21 January 2026 outlines how disabling UCX mmap hooks stopped runaway RSS in disaggregated serving.

Thursday, January 22, 2026, 2 min read

Mistral AI has published an engineering deep dive explaining how its team tracked down and mitigated a hard-to-pin-down memory leak observed while serving large language models on vLLM in a prefill/decode disaggregated setup. The leak showed up as a steady rise in resident memory (RSS) under production-like traffic and was ultimately resolved by changing a UCX configuration setting, according to the company.

What triggered the investigation

The problem surfaced during pre-production tests with Mistral Medium 3.1 and graph compilation enabled. Engineers saw resident memory increasing by roughly 400 MB per minute on the decode side of a disaggregated deployment that used NIXL for KV cache transfer over high-speed networking. There were no crashes, only a linear climb that would end in an out-of-memory condition, the company said.
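To make the "linear climb" concrete, here is a minimal sketch of how a growth rate and a time-to-OOM estimate could be derived from periodic RSS samples. The sample values, the 64,000 MB memory limit, and the function names are illustrative assumptions, not figures or code from Mistral's write-up; only the roughly 400 MB per minute rate comes from the article.

```python
def growth_rate_mb_per_min(samples):
    """Least-squares slope of RSS (MB) over time (minutes).

    `samples` is a list of (minutes, rss_mb) pairs with at least two
    distinct timestamps.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    cov = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    return cov / var


def minutes_until_limit(current_rss_mb, limit_mb, rate_mb_per_min):
    """Minutes left before RSS reaches the limit; None if RSS is not growing."""
    if rate_mb_per_min <= 0:
        return None
    return (limit_mb - current_rss_mb) / rate_mb_per_min


# Hypothetical samples: one reading per minute, climbing 400 MB each time.
samples = [(0, 20000), (1, 20400), (2, 20800), (3, 21200), (4, 21600)]
rate = growth_rate_mb_per_min(samples)

# Hypothetical 64,000 MB container limit.
remaining = minutes_until_limit(samples[-1][1], 64000, rate)
print(f"growth: {rate:.0f} MB/min, limit reached in about {remaining:.0f} min")
```

A steady positive slope with no plateau is what separates this kind of leak from ordinary warm-up growth, which tends to flatten once caches fill.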

How the team traced the leak

Mistral first ruled out a traditional heap leak using Python profilers such as Memray and Guppy 3, then moved to Heaptrack, which hooks malloc and free. Heap metrics looked normal, yet peak RSS kept diverging, a signal that the allocations were occurring outside the heap. By inspecting the process's memory map (/proc/<pid>/maps) with pmap and then tracing system calls with bpftrace, the team saw anonymous mappings appear repeatedly, pointing to mmap activity that the standard allocator hooks were not catching.
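The mapping inspection described above can be approximated in a few lines. The sketch below tallies anonymous mappings (entries with no pathname) from the text of a Linux /proc/<pid>/maps file; comparing successive snapshots would show whether they keep accumulating. The function names and the sample data are hypothetical, and this is not the tooling Mistral used, which the article names as pmap and bpftrace.

```python
def anonymous_mappings(maps_text):
    """Return (count, total_bytes) of mappings that have no pathname.

    Each line of /proc/<pid>/maps has the form:
        start-end perms offset dev inode [pathname]
    Entries such as [heap] or [stack] carry a pseudo-pathname and are
    therefore not counted here.
    """
    count = 0
    total = 0
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # blank or malformed line
        if len(fields) == 6 and fields[5].strip():
            continue  # file-backed or named mapping
        start, end = (int(part, 16) for part in fields[0].split("-"))
        count += 1
        total += end - start
    return count, total


def snapshot(pid):
    """Read and tally the live map of a process (Linux only)."""
    with open(f"/proc/{pid}/maps") as handle:
        return anonymous_mappings(handle.read())


# Hypothetical excerpt used to exercise the parser.
sample = """\
55d0c0000000-55d0c0021000 rw-p 00000000 00:00 0                          [heap]
7f3a10000000-7f3a10200000 rw-p 00000000 00:00 0
7f3a20000000-7f3a20001000 r-xp 00000000 08:01 1234                       /usr/lib/libc.so.6
7f3a30000000-7f3a30100000 rw-p 00000000 00:00 0
"""
count, total = anonymous_mappings(sample)
print(f"{count} anonymous mappings, {total / 2**20:.1f} MiB")
```

Calling `snapshot(pid)` at intervals and watching the total climb while heap profilers stay flat would reproduce the signal described in the article.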

Because bpftrace showed the calls originating from a raw syscall wrapper, engineers automated GDB to break only when mmap was invoked from the suspicious address. The targeted traces revealed that UCX, used under NIXL, was intercepting memory operations and that mmap could be triggered even during munmap paths in UCX’s memory pool handling.

Root cause and mitigation

According to Mistral’s write-up, UCX improves RDMA performance by dynamically patching function pointers to intercept mmap and munmap for its registration cache. In this vLLM scenario, that behaviour led to growing anonymous mappings that were not reclaimed promptly. Disabling the hook by setting the environment variable UCX_MEM_MMAP_HOOK_MODE to none stopped the leak without measurable performance loss in this workload. As an alternative safety valve, setting UCX_RCACHE_MAX_UNRELEASED to a finite value forces cleanup.
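As a rough illustration of the mitigation, the sketch below sets the two environment variables named above from Python. It assumes the variables must be in place before any UCX-backed library (such as NIXL) is imported. The helper name `apply_ucx_mitigation` and the example cap of 1024 are hypothetical; the article does not specify a value for UCX_RCACHE_MAX_UNRELEASED.

```python
import os


def apply_ucx_mitigation(env=os.environ, max_unreleased=None):
    """Set the UCX variables described in the article, if not already set.

    setdefault is used so that a value chosen explicitly by an operator
    is never overwritten. Returns the UCX_* entries now present in `env`.
    """
    # Disable UCX's interception of mmap/munmap.
    env.setdefault("UCX_MEM_MMAP_HOOK_MODE", "none")

    # Optional alternative safety valve: cap unreleased cached regions.
    if max_unreleased is not None:
        env.setdefault("UCX_RCACHE_MAX_UNRELEASED", str(max_unreleased))

    return {key: value for key, value in env.items() if key.startswith("UCX_")}


# Exercise the helper on a plain dict so the real environment is untouched.
demo_env = {}
applied = apply_ucx_mitigation(demo_env, max_unreleased=1024)  # 1024 is a placeholder
print(applied)
```

In a real service the call would run at the very top of the entry point, with the default `env=os.environ`, before the first import that pulls in UCX. Exporting the variables in the shell or container spec achieves the same thing without code changes.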

Open-source coordination and timeline

Mistral raised the matter with the vLLM maintainers through a public issue, confirming reproducibility in the specific disaggregated configuration. A corresponding pull request that ensures the UCX setting is applied before NIXL loads was merged into vLLM’s main branch in mid-January, providing a practical guardrail for the wider community.
