vLLM-Kunlun New Model Adaptation Manual#
Based on in-depth analysis of baidu/vLLM-Kunlun and vllm-project/vllm repositories.
Applicable Versions: vLLM v0.15.1+ / vLLM-Kunlun main branch
I. Understanding the Overall Architecture#
1.1 Plugin System#
vLLM-Kunlun uses the OOT (Out-of-Tree) Plugin approach to integrate with vLLM, primarily registered via entry_points in setup.py:
# setup.py
entry_points={
    'vllm.platform_plugins': ["kunlun = vllm_kunlun:register"],  # Platform Plugin
    'vllm.general_plugins': [
        "kunlun_model = vllm_kunlun:register_model",  # Model Registration
        "kunlun_quant = vllm_kunlun:register_quant_method"  # Quantization Method
    ],
    "console_scripts": [
        "vllm_kunlun = vllm_kunlun.entrypoints.main:main"
    ]
}
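To confirm that the plugin is visible after installation, the entry point groups named above can be listed with the standard library. This is a minimal sketch and not part of vLLM-Kunlun: `list_vllm_plugins` is a hypothetical helper, and it simply returns empty mappings when no installed package exposes these groups.

```python
from importlib.metadata import entry_points

GROUPS = ("vllm.platform_plugins", "vllm.general_plugins")


def list_vllm_plugins(groups=GROUPS):
    """Return {group: {entry_point_name: "module:attr"}} for installed packages."""
    found = {}
    for group in groups:
        try:
            eps = entry_points(group=group)  # selection API (Python 3.10+)
        except TypeError:
            eps = entry_points().get(group, [])  # older dict-style API
        found[group] = {ep.name: ep.value for ep in eps}
    return found


if __name__ == "__main__":
    for group, plugins in list_vllm_plugins().items():
        print(group, plugins)
```

With vLLM-Kunlun installed, the `kunlun`, `kunlun_model` and `kunlun_quant` names from setup.py would be expected to appear under their respective groups.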
1.2 Startup Process#
vllm Startup
├─ 1. Discover platform_plugin → Call vllm_kunlun:register()
│    ├─ Register KunlunPlatform (defines Attention Backend, Worker, etc.)
│    ├─ Apply import hook (module redirection)
│    └─ Register custom operators (custom_op)
├─ 2. Discover general_plugin → Call vllm_kunlun:register_model()
│    └─ Register all Kunlun-adapted models via ModelRegistry.register_model()
└─ 3. Model Loading → Match registered model classes based on the architectures field in config.json
1.3 Import Hook Mechanism#
vLLM-Kunlun uses a custom import hook to transparently replace certain vLLM modules with Kunlun-customized versions:
# vllm_kunlun/__init__.py
import importlib
import sys

def _custom_import(module_name, globals=None, locals=None, fromlist=(), level=0):
    try:
        module_mappings = {
            "vllm.compilation.wrapper": "vllm_kunlun.compilation.wrapper",
            "vllm.v1.worker.utils": "vllm_kunlun.v1.worker.utils",
            "vllm.model_executor.model_loader.bitsandbytes_loader": "vllm_kunlun.models.model_loader.bitsandbytes_loader",
            "vllm.v1.sample.ops.topk_topp_sampler": "vllm_kunlun.v1.sample.ops.topk_topp_sampler",
            "vllm.model_executor.layers.sampler": "vllm_kunlun.ops.sample.sampler",
            "vllm.v1.sample.rejection_sampler": "vllm_kunlun.v1.sample.rejection_sampler",
            "vllm.attention.ops.merge_attn_states": "vllm_kunlun.ops.attention.merge_attn_states",
        }
        if module_name in module_mappings:
            # Already redirected (or already imported): reuse the cached module.
            if module_name in sys.modules:
                return sys.modules[module_name]
            target_module = module_mappings[module_name]
            module = importlib.import_module(target_module)
            # Alias the Kunlun module under both the vLLM name and its own name.
            sys.modules[module_name] = module
            sys.modules[target_module] = module
    except Exception:
        pass
    return OLD_IMPORT_HOOK(module_name, globals=globals, locals=locals, fromlist=fromlist, level=level)
⚠️ Understanding this mechanism is crucial: even if you write from vllm.xxx import YYY in your model code, what you actually get might be vllm_kunlun.xxx.YYY.
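The redirection relies on ordinary `sys.modules` aliasing, which can be reproduced in isolation. The sketch below uses two made-up module names (`demo_native_sampler` and `demo_kunlun_sampler`, both hypothetical) to show that once a replacement is registered under the "native" name, every later import of that name returns the replacement.

```python
import importlib
import sys
import types

# Stand-in for a Kunlun-customized module (hypothetical name).
replacement = types.ModuleType("demo_kunlun_sampler")
replacement.BACKEND = "kunlun"

# Register it under its own name and under the name callers will import.
sys.modules["demo_kunlun_sampler"] = replacement
sys.modules["demo_native_sampler"] = replacement

# Any later import of the "native" name now resolves to the replacement.
native = importlib.import_module("demo_native_sampler")
from demo_native_sampler import BACKEND

print(native is replacement, BACKEND)  # True kunlun
```

This is also why debugging by module name alone can mislead: checking `module.__name__` or `module.__file__` on the imported object shows which implementation is really in use.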
1.4 Code Architecture#
vllm_kunlun/
├── __init__.py                      # Plugin Entry: register() + import_hook()
├── platforms/kunlun.py              # KunlunPlatform: Defines Attention Backend, Worker, etc.
├── models/                          # Model Implementation Directory (where you add files)
│   ├── __init__.py                  # Model Registration Entry
│   ├── deepseek_v2.py               # DeepSeek V2/V3 Reference Implementation
│   ├── deepseek_mtp.py              # DeepSeek MTP (Speculative Decoding)
│   ├── qwen3.py                     # Qwen3 Reference Implementation (Dense Model)
│   ├── qwen3_moe.py                 # Qwen3 MoE Reference Implementation
│   ├── qwen3_next.py                # Qwen3-Next (Hybrid Attention)
│   ├── qwen3_vl.py                  # Qwen3 VL (Multi-modal)
│   ├── qwen3_vl_moe.py              # Qwen3 VL MoE (Multi-modal + MoE)
│   ├── qwen2_vl.py                  # Qwen2 VL
│   ├── qwen2_5_vl.py                # Qwen2.5 VL
│   ├── internlm2.py                 # InternLM2 Reference Implementation
│   ├── internvl.py                  # InternVL (Multi-modal)
│   ├── interns1.py                  # InternS1
│   ├── seed_oss.py                  # SeedOss
│   ├── gpt_oss.py                   # GptOss
│   └── mimo_v2_flash.py             # MiMo-V2-Flash
├── ops/                             # Kunlun Custom Operators
│   ├── _kunlun_ops.py               # KunlunOps: paged_attention, rms_norm, silu...
│   ├── _custom_ops.py               # vllm custom_op registration
│   ├── activation.py                # Activation functions like SiluAndMul, GeluAndMul
│   ├── attention/                   # Attention Operators
│   │   ├── layer.py                 # Attention Layer Wrapper
│   │   └── backends/kunlun_attn.py  # KunlunAttentionBackend + KunlunAttentionImpl
│   ├── quantization/                # Quantization related: AWQ, GPTQ, CompressedTensors...
│   ├── vocab_parallel_embedding.py  # Custom Embedding
│   └── rotary_embedding.py          # Split_Norm_Rope (QKNorm + RoPE Fusion)
├── v1/attention/backends/           # Attention Backend for v1 Engine
│   ├── kunlun_attn.py               # Standard Attention
│   └── mla/                         # MLA (Multi-Latent Attention) Implementation
│       ├── flashmla.py
│       ├── flashmla_sparse.py
│       └── common.py
├── compilation/wrapper.py           # torch.compile Wrapper
├── config/                          # Model Configuration Overrides
│   └── model.py                     # Patch for attributes like is_deepseek_mla
├── distributed/                     # Communication related
│   └── kunlun_communicator.py       # Kunlun Device Communication
└── csrc/                            # C++ Extensions
    └── utils.cpp
II. New Model Adaptation Step-by-Step#
Step 0: Pre-assessment#
Before starting, confirm which scenario your model falls into:
| Scenario | Description | Effort |
|---|---|---|
| Case A: vLLM already supports the model | Only need to replace Attention / Activation with Kunlun versions | ⭐ Minimal |
| Case B: vLLM does not support, new architecture needed | Requires full implementation of model class + registration | ⭐⭐⭐ High |
| Case C: MoE variant of an existing model | Add MoE layer on top of the Dense version | ⭐⭐ Medium |
| Case D: Multi-modal model | Language Model + Vision Encoder + Projector | ⭐⭐⭐⭐ Maximum |
Recommended Workflow:
Check the vLLM Supported Models List to see whether the model is already there.
If yes → Copy the corresponding file from vllm/model_executor/models/ to vllm_kunlun/models/ and perform the replacements.
If no → Refer to the vLLM Adding a New Model Documentation to understand the principles first, then follow this manual.
Step 1: Implement Model Files#
Create a model file in the vllm_kunlun/models/ directory, e.g., my_new_model.py.
1.1 Key Replacement Comparison Table#
| Component | vLLM Native Import | vLLM-Kunlun Replacement Import | Required? |
|---|---|---|---|
| Attention Layer | | | ✅ Yes |
| SiluAndMul | | | ✅ Yes |
| GeluAndMul | | | ⚠️ As needed |
| QuickGELU | | | ⚠️ As needed |
| VocabParallelEmbedding | | | ⚠️ Some models |
| ParallelLMHead | Same as above | | ⚠️ Some models |
| RoPE (Special) | | | ⚠️ MoE+QKNorm |
| Linear / RMSNorm, etc. | Use vLLM native directly | No replacement needed | No |
💡 Core Principle: Any component that calls CUDA kernels (Attention, Activation, Sampling) must be replaced with the Kunlun version; pure PyTorch components (Linear, RMSNorm, RoPE, etc.) can use the vLLM native implementation directly.
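When a model file is copied from vLLM, leftover native imports are easy to miss. The following sketch scans source text for the two native imports that this manual explicitly says must be swapped. `find_native_imports` is a hypothetical helper, it only matches exact single-name import lines, and the mapping covers only the Attention and SiluAndMul cases named in this manual.

```python
# Native imports that this manual says must be swapped for Kunlun versions.
NATIVE_TO_KUNLUN = {
    "from vllm.attention import Attention":
        "from vllm_kunlun.ops.attention.layer import Attention",
    "from vllm.model_executor.layers.activation import SiluAndMul":
        "from vllm_kunlun.ops.activation import SiluAndMul",
}


def find_native_imports(source: str):
    """Return (line_number, suggested_replacement) for each native import found."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # ignore commented-out imports
        if stripped in NATIVE_TO_KUNLUN:
            hits.append((lineno, NATIVE_TO_KUNLUN[stripped]))
    return hits


sample = (
    "import torch\n"
    "from vllm.attention import Attention\n"
    "# from vllm.model_executor.layers.activation import SiluAndMul\n"
)
print(find_native_imports(sample))
```

Running it over the text of a new file in vllm_kunlun/models/ gives a quick first pass before the manual review in the checklist.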
1.2 Standard Dense Decoder-Only Model Template#
Refer to qwen3.py or internlm2.py:
"""Inference-only MyNewModel compatible with HuggingFace weights."""
from collections.abc import Iterable
from typing import Optional, Union
import torch
from torch import nn
from transformers import MyNewModelConfig # HuggingFace config
# ==========================================
# ✅ Key Replacement 1: Use Kunlun-customized Attention
# ==========================================
# Do not use: from vllm.attention import Attention
from vllm_kunlun.ops.attention.layer import Attention
# ==========================================
# ✅ Key Replacement 2: Use Kunlun-customized Activation
# ==========================================
# Do not use: from vllm.model_executor.layers.activation import SiluAndMul
from vllm_kunlun.ops.activation import SiluAndMul
# Other layers can use vLLM native directly
from vllm.compilation.decorators import support_torch_compile
from vllm.config import CacheConfig, VllmConfig
from vllm.distributed import get_pp_group, get_tensor_model_parallel_world_size
from vllm.model_executor.layers.layernorm import RMSNorm
from vllm.model_executor.layers.linear import (
    QKVParallelLinear, RowParallelLinear, MergedColumnParallelLinear
)
from vllm.model_executor.layers.logits_processor import LogitsProcessor
from vllm.model_executor.layers.quantization import QuantizationConfig
from vllm.model_executor.layers.rotary_embedding import get_rope
from vllm.model_executor.layers.vocab_parallel_embedding import ParallelLMHead
from vllm.model_executor.model_loader.weight_utils import default_weight_loader
from vllm.sequence import IntermediateTensors
from vllm.model_executor.models.interfaces import SupportsPP, SupportsLoRA
from vllm.model_executor.models.utils import (
    AutoWeightsLoader, PPMissingLayer, extract_layer_index,
    is_pp_missing_parameter, make_empty_intermediate_tensors_factory,
    make_layers, maybe_prefix
)
# ============================
# 1. MLP Layer
# ============================
class MyNewModelMLP(nn.Module):
    def __init__(self, hidden_size, intermediate_size, hidden_act,
                 quant_config=None, prefix=""):
        super().__init__()
        self.gate_up_proj = MergedColumnParallelLinear(
            hidden_size, [intermediate_size] * 2,
            bias=False, quant_config=quant_config,
            prefix=f"{prefix}.gate_up_proj",
        )
        self.down_proj = RowParallelLinear(
            intermediate_size, hidden_size,
            bias=False, quant_config=quant_config,
            prefix=f"{prefix}.down_proj",
        )
        self.act_fn = SiluAndMul()  # ✅ Use Kunlun version

    def forward(self, x):
        # vLLM parallel linear layers return (output, bias); the bias is unused here.
        gate_up, _ = self.gate_up_proj(x)
        x = self.act_fn(gate_up)
        x, _ = self.down_proj(x)
        return x
1.3 Key Implementation Requirements#
All modules must accept the prefix parameter, passed in __init__().
The @support_torch_compile decorator must be added to the main model class (e.g., MyNewModel).
The load_weights() method must correctly handle weight name mapping (stacked_params_mapping).
Pipeline Parallelism (PP) requires helpers such as PPMissingLayer and is_pp_missing_parameter.
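The weight name mapping requirement is easier to follow with a concrete case. The sketch below shows the renaming step only, assuming HuggingFace-style checkpoint names (`q_proj`, `k_proj`, `v_proj`, `gate_proj`, `up_proj`) that are fused into `qkv_proj` and `gate_up_proj` in the model. `remap_weight_name` is a hypothetical helper; a real load_weights() would additionally pass the shard id to the parameter's weight loader.

```python
# (fused parameter name, checkpoint shard name, shard id)
STACKED_PARAMS_MAPPING = [
    ("qkv_proj", "q_proj", "q"),
    ("qkv_proj", "k_proj", "k"),
    ("qkv_proj", "v_proj", "v"),
    ("gate_up_proj", "gate_proj", 0),
    ("gate_up_proj", "up_proj", 1),
]


def remap_weight_name(name: str):
    """Map a checkpoint weight name to (model parameter name, shard id or None)."""
    for param_name, shard_name, shard_id in STACKED_PARAMS_MAPPING:
        if shard_name in name:
            return name.replace(shard_name, param_name), shard_id
    return name, None  # not a stacked parameter: load as-is


print(remap_weight_name("model.layers.0.self_attn.k_proj.weight"))
print(remap_weight_name("model.layers.0.mlp.up_proj.weight"))
print(remap_weight_name("model.norm.weight"))
```

If a checkpoint uses different projection names, only the middle column of the mapping needs to change.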
Step 2: Register the Model#
Add registration code in vllm_kunlun/models/__init__.py:
# vllm_kunlun/models/__init__.py
from vllm import ModelRegistry

def register_model():
    # ... Existing model registrations ...
    # Add your new model (using lazy loading string format)
    ModelRegistry.register_model(
        "MyNewModelForCausalLM",  # Must match architectures in config.json
        "vllm_kunlun.models.my_new_model:MyNewModelForCausalLM"  # Module path:Class name
    )
⚠️ Key Considerations:
The first parameter of register_model() is the model's architecture identifier, which must exactly match the "architectures" field in the HuggingFace model's config.json.
Use the string format for the module path ("module:class") to implement lazy loading, avoiding CUDA initialization conflicts (RuntimeError: Cannot re-initialize CUDA in forked subprocess).
If the model already exists in vLLM (e.g., Qwen3ForCausalLM), the Kunlun version will overwrite the original vLLM version upon registration.
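The benefit of the "module:class" string form is that nothing is imported at registration time. The toy registry below illustrates that idea and the overwrite behavior. It is a simplified sketch, not vLLM's ModelRegistry; `LazyRegistry` and the `DemoArch` name are hypothetical, and standard-library classes stand in for model classes.

```python
import importlib


class LazyRegistry:
    """Toy registry that stores "module:Class" strings and imports on demand."""

    def __init__(self):
        self._targets = {}

    def register(self, arch: str, target: str) -> bool:
        """Store the target; return True if an existing entry was overwritten."""
        overwritten = arch in self._targets
        self._targets[arch] = target
        return overwritten

    def resolve(self, arch: str):
        """Import the module only now, when the class is actually needed."""
        module_name, _, class_name = self._targets[arch].partition(":")
        return getattr(importlib.import_module(module_name), class_name)


registry = LazyRegistry()
registry.register("DemoArch", "collections:OrderedDict")
overwritten = registry.register("DemoArch", "collections:Counter")
cls = registry.resolve("DemoArch")
print(overwritten, cls.__name__)  # True Counter
```

Because the import is deferred to `resolve`, a heavy model module is not loaded in the parent process during plugin registration.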
Step 3: Verify Registration#
Case A: Overwriting an Existing vLLM Model Architecture#
If your model architecture name (e.g., "Qwen3ForCausalLM") already exists in vLLM, vLLM will output the following log during registration:
WARNING [...] Model architecture Qwen3ForCausalLM is already registered,
and will be overwritten by the new model class
vllm_kunlun.models.qwen3:Qwen3ForCausalLM.
Seeing this log indicates a successful overwrite ✅.
Case B: Brand New Model Architecture#
If you are registering an architecture that does not exist in vLLM, there is no default log confirmation. It is recommended to verify manually during the debugging phase:
from vllm import ModelRegistry
assert "MyNewModelForCausalLM" in ModelRegistry.get_supported_archs()
print("✅ Model registration successful!")
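A mismatch between config.json and the registered name is the most common reason this check fails, so it can help to read the architectures field directly. The sketch below is self-contained and uses a temporary directory as a stand-in model path; `unregistered_architectures` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path


def unregistered_architectures(model_dir, registered):
    """Return architectures listed in config.json that are not in `registered`."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return [arch for arch in config.get("architectures", []) if arch not in registered]


# Self-contained demo with a throwaway model directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text(
        json.dumps({"architectures": ["MyNewModelForCausalLM"]})
    )
    missing = unregistered_architectures(tmp, {"Qwen3ForCausalLM"})
    ok = unregistered_architectures(tmp, {"MyNewModelForCausalLM"})

print(missing, ok)  # ['MyNewModelForCausalLM'] []
```

In a real environment, the second argument would be the result of ModelRegistry.get_supported_archs() and the first the actual model directory.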
Step 4: Testing#
4.1 Offline Inference Test#
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/MyNewModel",
    trust_remote_code=True,
    dtype="float16",
    tensor_parallel_size=1,  # Verify with single card first
)
outputs = llm.generate(
    ["Hello, please introduce yourself."],
    SamplingParams(temperature=0.7, max_tokens=256),
)
for output in outputs:
    print(output.outputs[0].text)
4.2 Online Service Test#
XPU_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 --port 8888 \
--model /path/to/MyNewModel \
--trust-remote-code \
--dtype float16 \
--max-model-len 4096 \
--block-size 64
4.3 Accuracy Verification#
It is recommended to compare results with HuggingFace Transformers CPU/GPU inference:
# Transformers reference output
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("/path/to/MyNewModel", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("/path/to/MyNewModel")
# ... Generate and compare output
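One way to make the comparison concrete is to compare generated token ids position by position and report where the two runs first diverge. The sketch below assumes both runs used the same prompt and deterministic (greedy) decoding, since sampled outputs are not expected to match. The function names are hypothetical.

```python
def common_prefix_len(reference_ids, candidate_ids):
    """Number of leading positions where both token id sequences agree."""
    count = 0
    for ref, cand in zip(reference_ids, candidate_ids):
        if ref != cand:
            break
        count += 1
    return count


def compare_generations(reference_ids, candidate_ids):
    """Summarize how far the candidate follows the reference."""
    prefix = common_prefix_len(reference_ids, candidate_ids)
    identical = prefix == len(reference_ids) == len(candidate_ids)
    longest = max(len(reference_ids), len(candidate_ids), 1)
    return {
        "identical": identical,
        "first_divergence": None if identical else prefix,
        "prefix_ratio": prefix / longest,
    }


report = compare_generations([1, 2, 3, 4, 5], [1, 2, 3, 9, 5])
print(report)
```

An early divergence (within the first few tokens) usually points to a weight mapping or operator problem, while a late divergence in float16 is more often accumulated numerical noise.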
III. Adaptation Guide for Special Model Types#
3.1 MoE Models (e.g., Qwen3-MoE, DeepSeek-V3)#
Reference Files:
vllm_kunlun/models/qwen3_moe.py
vllm_kunlun/models/deepseek_v2.py
Additional Points:
Use vllm.model_executor.layers.fused_moe.layer.FusedMoE; Kunlun has replaced the underlying kernel via import hook.
MoE's load_weights() is more complex, requiring expert parameter mapping:
expert_params_mapping = FusedMoE.make_expert_params_mapping(
    ckpt_gate_proj_name="gate_proj",
    ckpt_down_proj_name="down_proj",
    ckpt_up_proj_name="up_proj",
    num_experts=config.n_routed_experts,
)
Recommended environment variables:
export KUNLUN_USE_MOE_FFN_BLOCK=True
export XPU_USE_MOE_SORTED_THRES=120
3.2 MLA (Multi-Latent Attention) Models (e.g., DeepSeek-V3)#
Reference File: vllm_kunlun/models/deepseek_v2.py
MLA Special Handling:
KV compression dimensions: kv_lora_rank, qk_nope_head_dim, qk_rope_head_dim.
Platform layer automatically selects FlashMLABackend:
# vllm_kunlun/platforms/kunlun.py
if use_mla:
    if use_sparse:
        return "vllm_kunlun.v1.attention.backends.mla.flashmla_sparse.FlashMLASparseBackend"
    return "vllm_kunlun.v1.attention.backends.mla.flashmla.FlashMLABackend"
block_size usually needs to be set to 64.
Recommended setting: export USE_ORI_ROPE=1.
3.3 Multi-modal Models (e.g., Qwen2-VL, InternVL)#
Reference Files:
vllm_kunlun/models/qwen3_vl.py
vllm_kunlun/models/internvl.py
vllm_kunlun/models/interns1.py
Additional Components to Implement:
| Component | Description |
|---|---|
| | Declares that the model supports multi-modal input |
| Vision Encoder | Usually |
| Projector | Vision → Language mapping (e.g., MLP) |
| | Register multi-modal processor |
| | Handles multi-modal input |
| | Handles processing info |
| | Dummy inputs for the profiling phase |
3.4 Hybrid Attention Models (e.g., Qwen3-Next)#
Reference File: vllm_kunlun/models/qwen3_next.py
This model contains both Linear Attention and Full Attention layer types:
# Select different attention calculations based on layer_type
if self.layer_type == "linear_attention":
    self.linear_attn(hidden_states=hidden_states, output=self_attention_output)
elif self.layer_type == "full_attention":
    self.self_attn(hidden_states=hidden_states, output=self_attention_output, positions=positions)
Note:
Linear Attention uses GatedDeltaNet or similar implementations.
A custom_op (e.g., vllm.gdn_attention) must be registered and listed in splitting_ops for torch.compile.
IV. Quantized Model Adaptation#
4.1 Supported Quantization Methods#
| Quantization Method | Adaptation File | Status |
|---|---|---|
| INT8 Dynamic (W8A8) | | ⭐ Recommended |
| AWQ (INT4) | | ✅ Supported |
| GPTQ (INT4) | | ✅ Supported |
| CompressedTensors (INT8 MoE) | | ✅ Supported |
| FP8 | - | ⚠️ Partial Support |
| bfloat16 | - | ⚠️ Double VRAM bug |
4.2 Special Handling for Quantization#
Kunlun chips use the max value for scale calculation instead of vLLM's default absmax:
# ops/quantization/kernels/kunlun_scale_mm.py
class KunlunScaledMMLinearKernel(CutlassScaledMMLinearKernel):
    def process_weights_after_loading(self, layer):
        super().process_weights_after_loading(layer)
        # Key: Multiply scale by 127.0 to convert to max format
        with torch.no_grad():
            getattr(layer, self.w_s_name).mul_(127.0)
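The factor of 127 can be checked with a small worked example. The sketch below assumes the usual symmetric INT8 convention in which the stored scale equals max|w| / 127; under that assumption, multiplying by 127 recovers the largest absolute weight, which is the "max format" referred to above. The helper names are hypothetical and the exact convention of the Kunlun kernel is not confirmed here.

```python
import math


def absmax_scale(weights, qmax=127.0):
    """Symmetric INT8 scale under the assumed convention: max|w| / 127."""
    return max(abs(w) for w in weights) / qmax


def to_max_format(scale, qmax=127.0):
    """Convert an absmax-style scale back to the largest absolute value."""
    return scale * qmax


channel = [0.5, -2.54, 1.27]
scale = absmax_scale(channel)      # about 0.02
max_value = to_max_format(scale)   # about 2.54, the largest |w| in the channel
print(scale, max_value)
```

If outputs of a quantized model are off by a large constant factor, a missing or doubled application of this conversion is one thing to check.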
INT4 weights need to be repacked into the Kunlun layout order:
# AWQ repack example
AWQ_TO_KUNLUN_ORDER_NORMAL = [4, 0, 5, 1, 6, 2, 7, 3]
unpacked_kunlun = unpacked_awq[..., AWQ_TO_KUNLUN_ORDER_NORMAL]
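The effect of the permutation is easiest to see on a single packed word. The sketch below unpacks one 32-bit value into eight 4-bit values and applies the same index order with plain lists. Unpacking lowest nibble first is an assumption made for illustration, and `unpack_nibbles` and `reorder` are hypothetical helpers, not functions from the repository.

```python
AWQ_TO_KUNLUN_ORDER_NORMAL = [4, 0, 5, 1, 6, 2, 7, 3]


def unpack_nibbles(word: int):
    """Split a 32-bit word into eight 4-bit values, lowest nibble first."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


def reorder(values, order=AWQ_TO_KUNLUN_ORDER_NORMAL):
    """List equivalent of values[..., order] on the last dimension."""
    return [values[i] for i in order]


nibbles = unpack_nibbles(0x76543210)  # [0, 1, 2, 3, 4, 5, 6, 7]
kunlun = reorder(nibbles)             # [4, 0, 5, 1, 6, 2, 7, 3]
print(nibbles, kunlun)
```

With the identity input 0..7, the output equals the order table itself, which makes it a convenient sanity check for a repack routine.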
V. Custom Operators (if new low-level Ops are needed)#
If your model requires new low-level operators:
5.1 Wrap kunlun_ops calls in _kunlun_ops.py#
# vllm_kunlun/ops/_kunlun_ops.py
class KunlunOps:
    @staticmethod
    def my_new_op(input, weight, out):
        """Call underlying kunlun_ops implementation"""
        kunlun_ops.my_new_op(input, weight, out=out)
5.2 Register to vLLM in _custom_ops.py#
Follow the three-piece pattern:
# vllm_kunlun/ops/_custom_ops.py
# 1. Define the actual implementation of the op
def my_new_op_impl(input: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    output = torch.empty_like(input)
    KunlunOps.my_new_op(input, weight, output)
    return output

# 2. Define fake tensor implementation (for torch.compile)
def my_new_op_fake(input: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    return torch.empty_like(input)

# 3. Register
direct_register_custom_op(
    op_name="my_new_op",
    op_func=my_new_op_impl,
    mutates_args=[],
    fake_impl=my_new_op_fake,
)
VI. Common Pitfalls Checklist#
Before submitting a PR, please check each item:
[ ] Attention uses vllm_kunlun.ops.attention.layer.Attention?
[ ] Activation functions use vllm_kunlun.ops.activation.SiluAndMul, etc.?
[ ] All submodules in __init__() have the prefix parameter passed?
[ ] load_weights() correctly handles weight name mapping (stacked_params_mapping)?
[ ] @support_torch_compile decorator is added to the main model class?
[ ] The first parameter of ModelRegistry.register_model() exactly matches architectures in config.json?
[ ] No use of VLLM_USE_V1 environment variable for logic (deprecated, v0.15.1 is V1-only)?
[ ] Type annotations use Optional[T] instead of T | None (to avoid infer_schema failure)?
[ ] Quantized model scales are correctly multiplied by 127.0?
[ ] Supports Pipeline Parallelism (using PPMissingLayer, is_pp_missing_parameter)?
[ ] Ran pre-commit format checks?
[ ] Commits use -s signature (DCO compliance)?
VII. Reference Template Quick Look-up#
| Model Type | Best Reference File | Features |
|---|---|---|
| Standard Dense LLM | | Simplest, recommended for beginners |
| Dense LLM (Custom Embedding) | | Custom VocabParallelEmbedding |
| MoE LLM | | FusedMoE + EP + SharedExpert |
| MLA + MoE (DeepSeek) | | MLA attention + MoE + Indexer |
| Hybrid Attention | | Linear + Full attention |
| Multi-modal (VL) | | ViT + Projector + LLM |
| Speculative Decoding (MTP) | | Multi-Token Prediction |
VIII. Debugging Tips#
8.1 Startup Failure#
ModuleNotFoundError: Check if the import hook mapping table in __init__.py covers the corresponding module.
circular import: Check if your new code introduces heavy dependencies during the register() phase.
Model architecture XXX is not supported: Check if the first parameter of register_model() matches config.json.
8.2 Abnormal Output#
Garbage output: Compare with HF transformers output on CPU; likely an operator precision issue or weight loading mapping error.
Repeated tokens: Check if rotary_embedding is applied correctly and if the is_neox_style parameter is correct.
Truncated output: Check max_model_len settings and if KV cache is sufficient.
8.3 VRAM Issues#
Use --dtype float16 (avoid bfloat16 due to double VRAM bug).
Set VLLM_KUNLUN_ENABLE_INT8_BMM=1 (saves ~0.1GB).
Lower --gpu-memory-utilization (default is 0.9).
Use INT8 quantized models.
8.4 Weight Loading Failure#
# Debugging method: Print parameter names for comparison
params_dict = dict(self.named_parameters())
print("=== Model params ===")
for k in sorted(params_dict.keys()):
    print(f"  {k}: {params_dict[k].shape}")
# Print in load_weights
for name, loaded_weight in weights:
    if name not in params_dict:
        print(f"  ⚠️ Skipped: {name}")
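Printing every name gets noisy on large models, so a set difference is often quicker to read. The sketch below works on plain lists of names and needs neither the model nor the checkpoint in memory. `diff_param_names` is a hypothetical helper, and the names in the demo are made up.

```python
def diff_param_names(model_names, checkpoint_names):
    """Report names present on only one side of the model/checkpoint pair."""
    model_names = set(model_names)
    checkpoint_names = set(checkpoint_names)
    return {
        "missing_in_checkpoint": sorted(model_names - checkpoint_names),
        "unexpected_in_checkpoint": sorted(checkpoint_names - model_names),
    }


report = diff_param_names(
    model_names=["layers.0.qkv_proj.weight", "norm.weight"],
    checkpoint_names=["layers.0.q_proj.weight", "norm.weight"],
)
print(report)
```

Names that show up as unexpected in the checkpoint but differ from a model name only by a projection suffix usually indicate a missing entry in stacked_params_mapping.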
8.5 Kunlun Graph Failure#
Confirm that splitting_ops in compilation-config includes your attention op name:
{
  "splitting_ops": [
    "vllm.unified_attention",
    "vllm.unified_attention_with_output",
    "vllm.unified_attention_with_output_kunlun",
    "vllm.sparse_attn_indexer_vllm_kunlun"
  ],
  "cudagraph_mode": "PIECEWISE"
}
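Hand-editing this JSON on the command line is error-prone, so it can be generated instead. The sketch below builds the same structure and appends extra op names without duplicates. `build_compilation_config` is a hypothetical helper, and the example op name `vllm.gdn_attention` is the one mentioned in the hybrid attention section.

```python
import json

DEFAULT_SPLITTING_OPS = [
    "vllm.unified_attention",
    "vllm.unified_attention_with_output",
    "vllm.unified_attention_with_output_kunlun",
    "vllm.sparse_attn_indexer_vllm_kunlun",
]


def build_compilation_config(extra_ops=()):
    """Return the compilation config as a JSON string, adding ops not yet listed."""
    ops = list(DEFAULT_SPLITTING_OPS)
    for op in extra_ops:
        if op not in ops:
            ops.append(op)
    return json.dumps({"splitting_ops": ops, "cudagraph_mode": "PIECEWISE"})


config_json = build_compilation_config(["vllm.gdn_attention", "vllm.unified_attention"])
print(config_json)
```

The resulting string is intended to be passed as the compilation-config value when starting the server.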
IX. Environment Variables Cheat Sheet#
# === Required ===
export XPU_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 # Specify Kunlun cards to use
export VLLM_HOST_IP=$(hostname -i) # IP for distributed communication
# === Recommended ===
export XMLIR_FORCE_USE_XPU_GRAPH=1 # Enable XPU Graph acceleration
export XMLIR_ENABLE_MOCK_TORCH_COMPILE=false # Disable mock compile
export XMLIR_CUDNN_ENABLED=1 # Enable cuDNN equivalent acceleration
export XPU_USE_DEFAULT_CTX=1 # Default context
export BKCL_FORCE_SYNC=1 # BKCL forced sync (multi-card stability)
# === Model Specific ===
export USE_ORI_ROPE=1 # DeepSeek series uses original RoPE
export XFT_USE_FAST_SWIGLU=1 # Fast SwiGLU activation
export XPU_USE_FAST_SWIGLU=1 # Same as above (some versions)
export XPU_USE_MOE_SORTED_THRES=120 # MoE sorting threshold
export KUNLUN_USE_MOE_FFN_BLOCK=True # MoE FFN block optimization
# === Optional Tuning ===
export VLLM_KUNLUN_ENABLE_INT8_BMM=1 # Enable INT8 BMM (saves ~0.1GB)
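A launcher script can fail fast when a required variable is missing rather than surfacing the problem later during device initialization. The sketch below checks only the two variables listed as required above; `missing_env_vars` is a hypothetical helper.

```python
import os

REQUIRED_VARS = ("XPU_VISIBLE_DEVICES", "VLLM_HOST_IP")


def missing_env_vars(required=REQUIRED_VARS, environ=None):
    """Return the required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


# Demo with an explicit mapping so the result does not depend on the host.
demo_env = {"XPU_VISIBLE_DEVICES": "0,1", "VLLM_HOST_IP": ""}
print(missing_env_vars(environ=demo_env))  # ['VLLM_HOST_IP']
```

Calling `missing_env_vars()` with no arguments checks the real process environment.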
X. PR Submission Standards#
10.1 Branch Naming#
feature/add-my-new-model
bugfix/fix-attention-output
10.2 Commit Message Prefix#
| Prefix | Description |
|---|---|
| | New functionality / New model |
| | Bug fix |
| | CI / Build related |
| | Documentation update |
| | Others |
10.3 Before Submission#
# 1. Install pre-commit
pre-commit install
# 2. Run checks
pre-commit run --all-files
# 3. Signed commit (DCO compliance)
git commit -s -m "[Feature] Add MyNewModel support for Kunlun"
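The two conventions in the commit command above (a bracketed prefix in the subject and the Signed-off-by trailer that `git commit -s` appends) can be checked mechanically before pushing. The sketch below is a hypothetical helper: it accepts any bracketed word as a prefix rather than validating against the project's actual prefix list, and the name and address in the demo are placeholders.

```python
import re

PREFIX_RE = re.compile(r"^\[[A-Za-z]+\] \S")
SIGNOFF_RE = re.compile(r"^Signed-off-by: .+ <.+@.+>$", re.MULTILINE)


def check_commit_message(message: str):
    """Return a list of problems found in a commit message (empty if none)."""
    problems = []
    lines = message.strip().splitlines()
    subject = lines[0] if lines else ""
    if not PREFIX_RE.match(subject):
        problems.append("subject lacks a [Prefix] tag")
    if not SIGNOFF_RE.search(message):
        problems.append("missing Signed-off-by trailer")
    return problems


good = (
    "[Feature] Add MyNewModel support for Kunlun\n\n"
    "Signed-off-by: Dev Example <dev@example.com>\n"
)
print(check_commit_message(good))         # []
print(check_commit_message("Add model"))  # both problems reported
```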
10.4 PR Checklist#
[ ] Code passes pre-commit checks.
[ ] Single-card offline inference test passed.
[ ] Multi-card TP test passed (if applicable).
[ ] Quantized model test passed (if applicable).
[ ] Updated vllm_kunlun/models/__init__.py registration.
[ ] Updated supported models list in README (if applicable).
Appendix: Standard Startup Command Templates#
A. Standard Dense Model (Single Card)#
XPU_VISIBLE_DEVICES=0 \
python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 --port 8888 \
--model /path/to/model \
--trust-remote-code \
--dtype float16 \
--max-model-len 8192 \
--block-size 64
B. MoE Model (8-card TP)#
XPU_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
XMLIR_FORCE_USE_XPU_GRAPH=1 \
KUNLUN_USE_MOE_FFN_BLOCK=True \
XPU_USE_MOE_SORTED_THRES=120 \
python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 --port 8888 \
--model /path/to/moe-model-int8 \
--trust-remote-code \
--dtype float16 \
--max-model-len 32768 \
--tensor-parallel-size 8 \
--max_num_seqs 4 \
--block-size 64 \
--no-enable-chunked-prefill \
--distributed-executor-backend mp \
--no-enable-prefix-caching
C. DeepSeek-V3 (MLA + MoE, W8A8)#
XMLIR_ENABLE_MOCK_TORCH_COMPILE=false \
USE_ORI_ROPE=1 \
XPU_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 --port 8806 \
--model /path/to/DeepSeek-V3-w8a8 \
--gpu-memory-utilization 0.98 \
--trust-remote-code \
--max-model-len 32768 \
--tensor-parallel-size 8 \
--dtype float16 \
--max_num_seqs 4 \
--block-size 64 \
--no-enable-chunked-prefill \
--distributed-executor-backend mp \
--no-enable-prefix-caching
Document Maintenance: If you have questions or suggestions, please provide feedback in GitHub Issues.