[Draft] support mooncake barebone connectorV1 #1011
What this PR does / why we need it?
Does this PR introduce any user-facing change?
How was this patch tested?
Design and open issues
Currently AscendTransport cannot detect which NPU card the P/D instance is running on. After using ASCEND_RT_VISIBLE_DEVICES inside a container to pin the card vLLM runs on, aclrtGetDevice(&deviceId) always reports device 0 no matter which physical NPU vLLM is actually on, and aclrtSetDevice can likewise only be set to 0. This suggests that deviceId is a logical concept rather than a physical hardware index, though we are not sure this matches the actual behavior. (One option under evaluation: extend the external interface to accept the ASCEND_RT_VISIBLE_DEVICES environment variable as a parameter to distinguish cards.)
Proposed solutions (four options, to be reviewed):
1. Read the device id from the rank table, pass the id via an environment variable, and add a parameter to init.
2. Derive the rank id from the passed-in hostname (ip:port + NPU_rank_id) by subtracting a base port from the given port.
3. Change the hostname format to ip:port:NPU_rank_id and obtain rank_id by parsing it.
4. Investigate whether an existing API can query device_id or rank_id from within the process.
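Solutions 2 and 3 above could look roughly like the sketch below. These helpers are hypothetical, not existing connector code; the hostname format, function names, and base port are assumptions.

```python
def parse_hostname(hostname: str) -> tuple[str, int, int]:
    """Solution 3: parse the proposed ip:port:NPU_rank_id hostname format.

    Hypothetical helper -- the format is only a proposal in this PR.
    """
    ip, port, rank_id = hostname.rsplit(":", 2)
    return ip, int(port), int(rank_id)


def rank_from_port(port: int, base_port: int = 8100) -> int:
    """Solution 2: derive the rank id as an offset from a base port.

    Assumes consecutive ports per NPU rank, e.g. 8100 -> rank 0,
    8101 -> rank 1; base_port 8100 is a placeholder.
    """
    return port - base_port
```

For example, `parse_hostname("10.0.0.1:8102:2")` yields `("10.0.0.1", 8102, 2)`, and `rank_from_port(8102)` yields `2` under the assumed base port.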
Background: the V1 barebone connector and the transfer_engine share a common rank table. Its path can be read from the environment variable DISAGGREGATED_RPEFILL_RANK_TABLE_PATH, and device_ip is obtained by looking up the rank table with the passed-in rank_id.
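The rank-table lookup described above could be sketched as follows. The field names follow the rank_table.json keys described in the run guide below, but the flat-list file layout and the use of cluster_id as the rank are assumptions, not the actual schema.

```python
import json
import os


def get_device_ip(rank_id: int) -> str:
    """Look up device_ip for a given rank from the shared rank table.

    Sketch only: assumes the rank table is a flat JSON list of device
    entries and that cluster_id corresponds to the rank.
    """
    path = os.environ["DISAGGREGATED_RPEFILL_RANK_TABLE_PATH"]
    with open(path) as f:
        table = json.load(f)
    for entry in table:
        if int(entry["cluster_id"]) == rank_id:
            return entry["device_ip"]
    raise KeyError(f"rank {rank_id} not found in rank table {path}")
```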
Open issues:
1. In the multi-card case, how should mooncake.json be written? Currently there is one process per card and one transfer_engine per process, which would require a different mooncake.json per process.
2. KV cache send/receive currently transfers the full KV cache; we need to give users a way to request transferring only a specified subset of the KV cache.
Run guide
1. Start the metadata_server
Go into the Mooncake source directory.
Start the metadata_server with your own ip and port, making sure they match the configuration in mooncake.json.
2. Launch the producer and the consumer
Prerequisites:
Environment variable DISAGGREGATED_RPEFILL_RANK_TABLE_PATH
must point to your own rank_table.json, with fields configured as follows:
"super_pod_id": id of the super pod (not important),
"server_id": ip of the local host,
"device_id": card number of the current device,
"device_ip": ip address of the current card,
"super_device_id": device id within the super pod (unused),
"cluster_id": numbered sequentially from top to bottom
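Putting the keys above together, one device entry of rank_table.json might look like this. All values are placeholders, and the flat-list layout is a guess; only the key names come from the description above.

```json
[
    {
        "super_pod_id": "0",
        "server_id": "192.168.1.10",
        "device_id": "0",
        "device_ip": "29.10.0.1",
        "super_device_id": "0",
        "cluster_id": "1"
    }
]
```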
Environment variable MOONCAKE_CONFIG_PATH
must point to your mooncake.json, with fields configured as follows:
"prefill_url": ip and port where the prefill instance runs,
"decode_url": ip and port where the decode instance runs,
"metadata_server": must match the ip and port configured in step 1,
"metadata_backend": use http,
"protocol": use hccl for communication,
"device_name": ""
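A mooncake.json assembled from the keys above might look like this. The addresses are placeholders, and the exact URL format expected for "metadata_server" with the http backend is an assumption.

```json
{
    "prefill_url": "192.168.1.10:8100",
    "decode_url": "192.168.1.11:8200",
    "metadata_server": "http://192.168.1.12:8080/metadata",
    "metadata_backend": "http",
    "protocol": "hccl",
    "device_name": ""
}
```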
Launch
Use the launch scripts newly committed in vllm_ascend:
Launch the producer.
Launch the consumer.
3. Start the proxy_server
In proxy_server.py, at the places marked with the comments "Configure the IP and port to your own settings" and "Set the host configuration to your own IP", change all ips and ports to your actual values. Port 8000 is the port the proxy_server listens on, 8100 is the port configured in the producer's shell script, and 8200 is the port configured in the consumer's shell script.
4. Submit an inference request
Set the ip in the request to your own.
Set the model variable to the path of your own model, and make sure it matches the path used in the shell scripts.
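A request along these lines can be built as below. This is a sketch assuming the proxy forwards to a vLLM OpenAI-compatible /v1/completions endpoint; the ip, port 8000, and model path are placeholders to be replaced with your own values.

```python
import json
import urllib.request

# Placeholders -- replace with your own proxy ip and model path.
PROXY_URL = "http://127.0.0.1:8000/v1/completions"
MODEL_PATH = "/path/to/your/model"  # must match the path in the shell scripts

payload = {
    "model": MODEL_PATH,
    "prompt": "Hello, world",
    "max_tokens": 16,
}


def send_request(url: str = PROXY_URL) -> dict:
    """Post the completion request to the proxy_server (requires it running)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```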
5. Summary of the Mooncake TE integration