
feat: support mla disagg pd pull mode via mooncake transfer on mlu device. #1167

Closed
phantomlei3 wants to merge 4 commits into jd-opensource:main from phantomlei3:feat/pd-support-mlu

Conversation

@phantomlei3 (Collaborator):

No description provided.

@gemini-code-assist (Contributor, bot) left a comment:

Code Review

This pull request enables Disaggregated Prefill/Decode (PD) support for the MLU backend by integrating and refactoring the Mooncake transfer engine. Key changes include the introduction of a descriptor-based memory registration system (using BufferDesc and MemInfo) and a new GetRegisteredMemory RPC to synchronize memory layouts between nodes, which is essential for supporting complex architectures like MLA. Additionally, the PR enhances validation logic in LLMEngine::pull_kv_blocks, automates device_ip resolution, and adds comprehensive unit tests for the transfer engine. Review feedback identifies several opportunities to align with the project's style guide, specifically regarding the use of emplace_back for vector operations and the annotation of constant arguments in function calls.

Comment threads (outdated): 4 on xllm/core/framework/kv_cache/mooncake_transfer_engine.cpp, 2 on xllm/core/framework/kv_cache/mooncake_transfer_engine_test.cpp
JimHsiung previously approved these changes Apr 3, 2026
@phantomlei3 phantomlei3 force-pushed the feat/pd-support-mlu branch from c8d119d to f024a9f Compare April 7, 2026 08:32
@XuZhang99 XuZhang99 changed the title feat: support mlu mla pd pull with mooncake transfer. feat: support mla disagg pd pull mode via mooncake transfer on mlu device. Apr 7, 2026
return os.str();
}

std::vector<proto::BufferDesc> to_proto_buffer_descs(
Collaborator:
The transfer_engine is designed to facilitate data transfer between specified source and destination addresses. It should remain agnostic to the higher-level semantics of the data being transferred, whether it's KV cache, model weights, or any other type.

size_per_block_ = size_per_block;
num_layers_ = reg_layers;
local_memory_info_ = std::move(mem_info);
core_.set_registered_memory_info(addrs, lens, size_per_block_, num_layers_);
Collaborator:
Have these engine core modifications been tested on NPU?

Collaborator (author):

Not tested, because NPU currently does not use Mooncake as its disagg PD backend. But it should be fine on NPU, since MLU and NPU share the same engine-side path through Mooncake. We can test it in the future once NPU switches to Mooncake.

}

Transport::SegmentHandle handle;
handle = engine_->openSegment(remote_addr);
Collaborator:
Previously, during open_session, the decode node would establish a connection with the P-node,
informing the P-node of its own address (addr_) to enable KV cache transfer, with cluster_id set to 0.
However, it now appears to directly open_segment. Where does the remote_addr come from in this new setup?

Collaborator (author):

remote_addr is not generated inside open_session(). It comes from the peer cache-info registration path: each worker exposes its Mooncake transfer address via get_cache_info(), the scheduler stores those addresses in instance metadata, and link_cluster() passes the registered peer address into open_session().

The reason for changing the flow is that the new MLU PD path needs more than just session bootstrap. After the session is established, we also need to fetch the peer's registered memory metadata and validate the memory layout before issuing block moves, especially for the new per-buffer descriptor path (key/value/index) and asymmetric capacity cases. That is why the local side now opens the peer segment directly and then calls GetRegisteredMemory().

We still keep the reverse OpenSession(addr_) RPC step when cluster_id != 0, so the peer can establish the reverse session as before. The new part here is the explicit remote registered-memory sync/validation, not removing the previous bootstrap mechanism.

@phantomlei3 phantomlei3 force-pushed the feat/pd-support-mlu branch from f024a9f to 609b8d6 Compare April 8, 2026 09:13
@phantomlei3 phantomlei3 force-pushed the feat/pd-support-mlu branch from 609b8d6 to 95427f6 Compare April 8, 2026 13:32
@phantomlei3 phantomlei3 closed this Apr 9, 2026
std::string hostname = host_ip_ + ":" + std::to_string(listen_port_);
#if defined(USE_NPU)
int32_t device_id = dev.index();
hostname += ":" + get_transfer_engine_device_name(device_id);
Collaborator:
NPU no longer needs the device name appended here.

    const std::vector<uint64_t>& src_blocks,
    const std::vector<uint64_t>& dst_blocks,
-   const std::vector<int64_t>& layer_ids);
+   const std::vector<int64_t>& group_ids);
Collaborator:
What are group_ids?

3 participants