vLLM command:
vllm serve ERNIE-4.5-VL-28B-A3B-Thinking --trust-remote-code \
  --reasoning-parser ernie45 \
  --tool-call-parser ernie45 \
  --enable-auto-tool-choice --cpu-offload-gb 48
Tracing the prompt that the model actually receives gives:
<|begin_of_sentence|>You are a multimodal AI assistant called ERNIE developed by Baidu based on the PaddlePaddle framework.\nUser: <|IMAGE_START|><|image@placeholder|><|IMAGE_END|>\nFrom which era does the artifact in the image originate?\nAssistant: \n\n
Whereas rendering the chat_template from tokenizer_config.json directly gives:
<|begin_of_sentence|>You are a multimodal AI assistant called ERNIE developed by Baidu based on the PaddlePaddle framework.
User: Picture 1:<|IMAGE_START|><|image@placeholder|><|IMAGE_END|>From which era does the artifact in the image originate?
Assistant:
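The prompt rendered directly from the chat_template has an extra "Picture 1:" prefix before the image tokens. Likewise, for video input it has an extra "Video 1" prefix.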
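The model's output is almost the same for either of these two inputs.
Still, for the sake of rigor, I would like to ask: which of the two is the correct prompt format?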
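
For reference, here is a minimal sketch of how the two prompts can be compared token by token. It only uses the Python standard library. It assumes that the `\n` sequences in the traced prompt are real newlines and that the template output ends right after "Assistant:", as pasted above. The helper names (`split_tokens`, `diff_prompts`) are made up for this illustration and are not part of vLLM or the tokenizer.

```python
import difflib
import re

# Split a prompt into special tokens (<|...|>), whitespace runs, and words,
# so the diff works on meaningful units instead of single characters.
TOKEN_RE = re.compile(r"<\|[^|]*\|>|\s+|[^\s<]+|<")


def split_tokens(text):
    return TOKEN_RE.findall(text)


def diff_prompts(a, b):
    """Return (tag, text_in_a, text_in_b) for every span where a and b differ."""
    ta, tb = split_tokens(a), split_tokens(b)
    matcher = difflib.SequenceMatcher(None, ta, tb, autojunk=False)
    return [
        (tag, "".join(ta[i1:i2]), "".join(tb[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


SYSTEM = (
    "<|begin_of_sentence|>You are a multimodal AI assistant called ERNIE "
    "developed by Baidu based on the PaddlePaddle framework.\n"
)
IMAGE = "<|IMAGE_START|><|image@placeholder|><|IMAGE_END|>"
QUESTION = "From which era does the artifact in the image originate?"

# Prompt traced from the vLLM server (assumption: "\n" are real newlines).
served = SYSTEM + "User: " + IMAGE + "\n" + QUESTION + "\nAssistant: \n\n"
# Prompt rendered from chat_template in tokenizer_config.json, as pasted above.
templated = SYSTEM + "User: Picture 1:" + IMAGE + QUESTION + "\nAssistant:"

for tag, in_served, in_templated in diff_prompts(served, templated):
    print(tag, repr(in_served), repr(in_templated))
```

With the strings as pasted, the only non-whitespace difference this reports is the "Picture 1:" prefix. The remaining differences are newlines around the question and after "Assistant:".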