📖 vLLM-Kunlun New Model Adaptation Manual

Based on in-depth analysis of baidu/vLLM-Kunlun and vllm-project/vllm repositories.

Applicable Versions: vLLM v0.15.1+ / vLLM-Kunlun main branch

I. Understanding the Overall Architecture#

1.1 Plugin System#

vLLM-Kunlun uses the OOT (Out-of-Tree) Plugin approach to integrate with vLLM, primarily registered via entry_points in setup.py:

# setup.py
entry_points={
    'vllm.platform_plugins': ["kunlun = vllm_kunlun:register"],       # Platform Plugin
    'vllm.general_plugins': [
        "kunlun_model = vllm_kunlun:register_model",                   # Model Registration
        "kunlun_quant = vllm_kunlun:register_quant_method"             # Quantization Method
    ],
    "console_scripts": [
        "vllm_kunlun = vllm_kunlun.entrypoints.main:main"
    ]
}
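
To check what these entry points resolve to in an installed environment, the sketch below lists a plugin group using the standard-library importlib.metadata. It is a minimal illustration and not vLLM's own plugin loader; the helper name list_plugins is hypothetical, and the group names are the ones from setup.py above.

```python
from importlib.metadata import entry_points


def list_plugins(group: str) -> dict[str, str]:
    """Return {entry point name: "module:attr" target} for one group."""
    eps = entry_points()
    if hasattr(eps, "select"):      # Python 3.10+ selectable API
        selected = eps.select(group=group)
    else:                           # older dict-style API
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


if __name__ == "__main__":
    for group in ("vllm.platform_plugins", "vllm.general_plugins"):
        print(group, list_plugins(group))
```

If vllm_kunlun is installed, the kunlun, kunlun_model and kunlun_quant entries should appear; an empty result suggests the package metadata was not installed.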

1.2 Startup Process#

vllm Startup
  ├─ 1. Discover platform_plugin → Call vllm_kunlun:register()
  │      ├─ Register KunlunPlatform (defines Attention Backend, Worker, etc.)
  │      ├─ Apply import hook (module redirection)
  │      └─ Register custom operators (custom_op)
  ├─ 2. Discover general_plugin → Call vllm_kunlun:register_model()
  │      └─ Register all Kunlun-adapted models via ModelRegistry.register_model()
  └─ 3. Model Loading → Match registered model classes based on the architectures field in config.json

1.3 Import Hook Mechanism#

vLLM-Kunlun uses a custom import hook to transparently replace certain vLLM modules with Kunlun-customized versions:

# vllm_kunlun/__init__.py
def _custom_import(module_name, globals=None, locals=None, fromlist=(), level=0):
    try:
        module_mappings = {
            "vllm.compilation.wrapper":                        "vllm_kunlun.compilation.wrapper",
            "vllm.v1.worker.utils":                            "vllm_kunlun.v1.worker.utils",
            "vllm.model_executor.model_loader.bitsandbytes_loader": "vllm_kunlun.models.model_loader.bitsandbytes_loader",
            "vllm.v1.sample.ops.topk_topp_sampler":            "vllm_kunlun.v1.sample.ops.topk_topp_sampler",
            "vllm.model_executor.layers.sampler":              "vllm_kunlun.ops.sample.sampler",
            "vllm.v1.sample.rejection_sampler":                "vllm_kunlun.v1.sample.rejection_sampler",
            "vllm.attention.ops.merge_attn_states":            "vllm_kunlun.ops.attention.merge_attn_states",
        }

        if module_name in module_mappings:
            if module_name in sys.modules:
                return sys.modules[module_name]
            target_module = module_mappings[module_name]
            module = importlib.import_module(target_module)
            sys.modules[module_name] = module
            sys.modules[target_module] = module
    except Exception:
        pass

    return OLD_IMPORT_HOOK(module_name, globals=globals, locals=locals, fromlist=fromlist, level=level)

⚠️ Understanding this mechanism is crucial: even if your model code writes from vllm.xxx import YYY, what you actually get is the Kunlun replacement whenever vllm.xxx is listed in module_mappings.
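
The redirection relies on sys.modules aliasing: once two names point at the same module object, importing either name returns that object. The following self-contained sketch shows the effect with in-memory stand-in modules; the module names are hypothetical and no vLLM code is involved.

```python
import importlib
import sys
import types

# Hypothetical module names, used only for this demonstration.
ORIGINAL_NAME = "demo_native_sampler"
REPLACEMENT_NAME = "demo_kunlun_sampler"

# Build a stand-in "Kunlun" module in memory.
replacement = types.ModuleType(REPLACEMENT_NAME)
replacement.BACKEND = "kunlun"

# Same aliasing the hook performs: both names point at one module object.
sys.modules[REPLACEMENT_NAME] = replacement
sys.modules[ORIGINAL_NAME] = replacement

# Importing the original name now yields the replacement module.
resolved = importlib.import_module(ORIGINAL_NAME)
print(resolved is replacement, resolved.BACKEND)
```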

1.4 Code Architecture#

vllm_kunlun/
├── __init__.py                    # Plugin Entry: register() + import_hook()
├── platforms/kunlun.py            # KunlunPlatform: Defines Attention Backend, Worker, etc.
├── models/                        # ⭐ Model Implementation Directory (where you add files)
│   ├── __init__.py                # ⭐ Model Registration Entry
│   ├── deepseek_v2.py             # DeepSeek V2/V3 Reference Implementation
│   ├── deepseek_mtp.py            # DeepSeek MTP (Speculative Decoding)
│   ├── qwen3.py                   # Qwen3 Reference Implementation (Dense Model)
│   ├── qwen3_moe.py               # Qwen3 MoE Reference Implementation
│   ├── qwen3_next.py              # Qwen3-Next (Hybrid Attention)
│   ├── qwen3_vl.py                # Qwen3 VL (Multi-modal)
│   ├── qwen3_vl_moe.py            # Qwen3 VL MoE (Multi-modal + MoE)
│   ├── qwen2_vl.py                # Qwen2 VL
│   ├── qwen2_5_vl.py              # Qwen2.5 VL
│   ├── internlm2.py               # InternLM2 Reference Implementation
│   ├── internvl.py                # InternVL (Multi-modal)
│   ├── interns1.py                # InternS1
│   ├── seed_oss.py                # SeedOss
│   ├── gpt_oss.py                 # GptOss
│   └── mimo_v2_flash.py           # MiMo-V2-Flash
├── ops/                           # Kunlun Custom Operators
│   ├── _kunlun_ops.py             # KunlunOps: paged_attention, rms_norm, silu...
│   ├── _custom_ops.py             # vllm custom_op registration
│   ├── activation.py              # Activation functions like SiluAndMul, GeluAndMul
│   ├── attention/                 # Attention Operators
│   │   ├── layer.py               # Attention Layer Wrapper
│   │   └── backends/kunlun_attn.py # KunlunAttentionBackend + KunlunAttentionImpl
│   ├── quantization/              # Quantization related: AWQ, GPTQ, CompressedTensors...
│   ├── vocab_parallel_embedding.py # Custom Embedding
│   └── rotary_embedding.py        # Split_Norm_Rope (QKNorm + RoPE Fusion)
├── v1/attention/backends/         # Attention Backend for v1 Engine
│   ├── kunlun_attn.py             # Standard Attention
│   └── mla/                       # MLA (Multi-head Latent Attention) Implementation
│       ├── flashmla.py
│       ├── flashmla_sparse.py
│       └── common.py
├── compilation/wrapper.py         # torch.compile Wrapper
├── config/                        # Model Configuration Overrides
│   └── model.py                   # Patch for attributes like is_deepseek_mla
├── distributed/                   # Communication related
│   └── kunlun_communicator.py     # Kunlun Device Communication
└── csrc/                          # C++ Extensions
    └── utils.cpp

II. New Model Adaptation Step-by-Step#

Step 0: Pre-assessment#

Before starting, confirm which scenario your model falls into:

Scenario	Description	Effort
Case A: vLLM already supports the model	Only need to replace Attention / Activation with Kunlun versions	⭐ Minimal
Case B: vLLM does not support, new architecture needed	Requires full implementation of model class + registration	⭐⭐⭐ High
Case C: MoE variant of an existing model	Add MoE layer on top of the Dense version	⭐⭐ Medium
Case D: Multi-modal model	Language Model + Vision Encoder + Projector	⭐⭐⭐⭐ Maximum

Recommended Workflow:

Check the vLLM Supported Models List to see if the model is already there.
If yes → Copy the corresponding file from vllm/model_executor/models/ to vllm_kunlun/models/ and perform replacements.
If no → Refer to the vLLM Adding a New Model Documentation to understand the principles first, then follow this manual.

Step 1: Implement Model Files#

Create a model file in the vllm_kunlun/models/ directory, e.g., my_new_model.py.

1.1 Key Replacement Comparison Table#

Component	vLLM Native Import	vLLM-Kunlun Replacement Import	Required?
Attention Layer	`from vllm.attention import Attention`	`from vllm_kunlun.ops.attention.layer import Attention`	✅ Yes
SiluAndMul	`from vllm.model_executor.layers.activation import SiluAndMul`	`from vllm_kunlun.ops.activation import SiluAndMul`	✅ Yes
GeluAndMul	`...activation import GeluAndMul`	`from vllm_kunlun.ops.activation import GeluAndMul`	⚠️ As needed
QuickGELU	`...activation import QuickGELU`	`from vllm_kunlun.ops.activation import QuickGELU`	⚠️ As needed
VocabParallelEmbedding	`from vllm...vocab_parallel_embedding import VocabParallelEmbedding`	`from vllm_kunlun.ops.vocab_parallel_embedding import VocabParallelEmbedding`	⚠️ Some models
ParallelLMHead	Same as above	`from vllm_kunlun.ops.vocab_parallel_embedding import ParallelLMHead`	⚠️ Some models
RoPE (Special)	`from vllm...rotary_embedding import get_rope`	`from vllm_kunlun.ops.rotary_embedding import Split_Norm_Rope`	⚠️ MoE+QKNorm
Linear / RMSNorm, etc.	Use vLLM native directly	No replacement needed	—

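💡 Core Principle: Any component that calls CUDA kernels (Attention, Activation, Sampling) must be replaced with its Kunlun version; pure PyTorch components (Linear, RMSNorm, standard RoPE via get_rope, etc.) can be used from vLLM directly. The one RoPE exception is the MoE+QKNorm case in the table above, which uses Split_Norm_Rope.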
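
When porting a file copied from vLLM, it is easy to miss one of the required replacements. The sketch below scans source text for the two imports marked "Yes" in the table and suggests the Kunlun module. It is a hypothetical helper (find_native_imports is not part of either repository) and only covers those two rows.

```python
import ast

# Native module -> Kunlun module, taken from the required rows of the table.
REPLACEMENTS = {
    "vllm.attention": "vllm_kunlun.ops.attention.layer",
    "vllm.model_executor.layers.activation": "vllm_kunlun.ops.activation",
}


def find_native_imports(source: str) -> list[tuple[int, str, str]]:
    """Return (line number, native module, suggested module) per hit."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module in REPLACEMENTS:
            hits.append((node.lineno, node.module, REPLACEMENTS[node.module]))
    return sorted(hits)


sample = (
    "from vllm.attention import Attention\n"
    "from vllm.model_executor.layers.linear import RowParallelLinear\n"
    "from vllm.model_executor.layers.activation import SiluAndMul\n"
)
for lineno, native, kunlun in find_native_imports(sample):
    print(f"line {lineno}: replace {native} with {kunlun}")
```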

1.2 Standard Dense Decoder-Only Model Template#

Refer to qwen3.py or internlm2.py:

"""Inference-only MyNewModel compatible with HuggingFace weights."""
from collections.abc import Iterable
from typing import Optional, Union

import torch
from torch import nn
from transformers import MyNewModelConfig  # HuggingFace config

# ==========================================
# ⭐ Key Replacement 1: Use Kunlun-customized Attention
# ==========================================
# Do not use from vllm.attention import Attention
from vllm_kunlun.ops.attention.layer import Attention

# ==========================================
# ⭐ Key Replacement 2: Use Kunlun-customized Activation
# ==========================================
# Do not use from vllm.model_executor.layers.activation import SiluAndMul
from vllm_kunlun.ops.activation import SiluAndMul

# Other layers can use vLLM native directly
from vllm.compilation.decorators import support_torch_compile
from vllm.config import CacheConfig, VllmConfig
from vllm.distributed import get_pp_group, get_tensor_model_parallel_world_size
from vllm.model_executor.layers.layernorm import RMSNorm
from vllm.model_executor.layers.linear import (
    QKVParallelLinear, RowParallelLinear, MergedColumnParallelLinear
)
from vllm.model_executor.layers.logits_processor import LogitsProcessor
from vllm.model_executor.layers.quantization import QuantizationConfig
from vllm.model_executor.layers.rotary_embedding import get_rope
from vllm.model_executor.layers.vocab_parallel_embedding import ParallelLMHead
from vllm.model_executor.model_loader.weight_utils import default_weight_loader
from vllm.sequence import IntermediateTensors
from vllm.model_executor.models.interfaces import SupportsPP, SupportsLoRA
from vllm.model_executor.models.utils import (
    AutoWeightsLoader, PPMissingLayer, extract_layer_index,
    is_pp_missing_parameter, make_empty_intermediate_tensors_factory,
    make_layers, maybe_prefix
)


# ============================
# 1. MLP Layer
# ============================
class MyNewModelMLP(nn.Module):
    def __init__(self, hidden_size, intermediate_size, hidden_act,
                 quant_config=None, prefix=""):
        super().__init__()
        self.gate_up_proj = MergedColumnParallelLinear(
            hidden_size, [intermediate_size] * 2,
            bias=False, quant_config=quant_config,
            prefix=f"{prefix}.gate_up_proj",
        )
        self.down_proj = RowParallelLinear(
            intermediate_size, hidden_size,
            bias=False, quant_config=quant_config,
            prefix=f"{prefix}.down_proj",
        )
        self.act_fn = SiluAndMul()  # ⭐ Use Kunlun version

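    def forward(self, x):
        # Parallel linear layers return (output, bias); the bias is unused here.
        gate_up, _ = self.gate_up_proj(x)
        x = self.act_fn(gate_up)  # SiLU(gate) * up
        x, _ = self.down_proj(x)
        return x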

1.3 Key Implementation Requirements#

Every module must accept a prefix parameter in __init__() and pass it on to its submodules.
The @support_torch_compile decorator must be added to the main model class (e.g., MyNewModel).
The load_weights() method must correctly handle weight name mapping (stacked_params_mapping).
Pipeline Parallelism (PP) requires helpers such as PPMissingLayer and is_pp_missing_parameter.
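
The stacked_params_mapping requirement is easier to see with a small example. The sketch below maps per-projection checkpoint names onto fused parameters such as qkv_proj and gate_up_proj. The mapping tuples follow the layout commonly seen in vLLM model files and are reproduced here as an assumption; map_weight_name is a hypothetical helper, and a real load_weights() would go on to call the parameter's weight_loader with the shard id.

```python
from typing import Optional, Union

# (fused parameter, checkpoint shard name, shard id); assumed layout.
STACKED_PARAMS_MAPPING = [
    ("qkv_proj", "q_proj", "q"),
    ("qkv_proj", "k_proj", "k"),
    ("qkv_proj", "v_proj", "v"),
    ("gate_up_proj", "gate_proj", 0),
    ("gate_up_proj", "up_proj", 1),
]


def map_weight_name(ckpt_name: str) -> tuple[str, Optional[Union[str, int]]]:
    """Map a checkpoint weight name to (model parameter name, shard id)."""
    for fused_name, shard_name, shard_id in STACKED_PARAMS_MAPPING:
        if shard_name in ckpt_name:
            return ckpt_name.replace(shard_name, fused_name), shard_id
    return ckpt_name, None  # not stacked: load under its own name


print(map_weight_name("model.layers.0.self_attn.k_proj.weight"))
print(map_weight_name("model.layers.0.mlp.up_proj.weight"))
print(map_weight_name("model.layers.0.mlp.down_proj.weight"))
```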

Step 2: Register the Model#

Add registration code in vllm_kunlun/models/__init__.py:

# vllm_kunlun/models/__init__.py

from vllm import ModelRegistry

def register_model():
    # ... Existing model registrations ...

    # ⭐ Add your new model (using lazy loading string format)
    ModelRegistry.register_model(
        "MyNewModelForCausalLM",                                    # ← Must match architectures in config.json
        "vllm_kunlun.models.my_new_model:MyNewModelForCausalLM"    # ← Module path:Class name
    )

⚠️ Key Considerations:

The first parameter of register_model() is the model’s architecture identifier, which must exactly match the "architectures" field in the HuggingFace model’s config.json.
Use the string format for the module path ("module:class") to implement lazy loading, avoiding CUDA initialization conflicts (RuntimeError: Cannot re-initialize CUDA in forked subprocess).
If the model already exists in vLLM (e.g., Qwen3ForCausalLM), the Kunlun version will overwrite the original vLLM version upon registration.

Step 3: Verify Registration#

Case A: Overwriting an Existing vLLM Model Architecture#

If your model architecture name (e.g., "Qwen3ForCausalLM") already exists in vLLM, vLLM will output the following log during registration:

WARNING [...] Model architecture Qwen3ForCausalLM is already registered,
and will be overwritten by the new model class
vllm_kunlun.models.qwen3:Qwen3ForCausalLM.

Seeing this log indicates a successful overwrite ✅.

Case B: Brand New Model Architecture#

If you are registering an architecture that does not exist in vLLM, there is no default log confirmation. It is recommended to verify manually during the debugging phase:

from vllm import ModelRegistry
assert "MyNewModelForCausalLM" in ModelRegistry.get_supported_archs()
print("✅ Model registration successful!")

Step 4: Testing#

4.1 Offline Inference Test#

from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/MyNewModel",
    trust_remote_code=True,
    dtype="float16",
    tensor_parallel_size=1,  # Verify with single card first
)

outputs = llm.generate(
    ["Hello, please introduce yourself."],
    SamplingParams(temperature=0.7, max_tokens=256),
)
for output in outputs:
    print(output.outputs[0].text)

4.2 Online Service Test#

XPU_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 8888 \
    --model /path/to/MyNewModel \
    --trust-remote-code \
    --dtype float16 \
    --max-model-len 4096 \
    --block-size 64

4.3 Accuracy Verification#

It is recommended to compare results with HuggingFace Transformers CPU/GPU inference:

# Transformers reference output
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("/path/to/MyNewModel", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("/path/to/MyNewModel")
# ... Generate and compare output
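
One simple way to compare the two runs is to locate the first position where the generated token ids differ. The sketch below assumes both sides decode greedily so the outputs are deterministic; first_divergence is a hypothetical helper and the token ids are made up.

```python
from typing import Optional, Sequence


def first_divergence(ref: Sequence[int], test: Sequence[int]) -> Optional[int]:
    """Index of the first differing token id, or None if identical."""
    for i, (a, b) in enumerate(zip(ref, test)):
        if a != b:
            return i
    if len(ref) != len(test):
        return min(len(ref), len(test))  # one sequence is a prefix of the other
    return None


# Made-up token ids for illustration only.
hf_ids = [101, 2023, 2003, 1037, 3231]
kunlun_ids = [101, 2023, 2003, 1996, 3231]
print(first_divergence(hf_ids, kunlun_ids))
```

A divergence after many matching tokens often points to accumulated numerical differences, while a divergence at the very first token is more likely a weight loading or operator problem.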

III. Adaptation Guide for Special Model Types#

3.1 MoE Models (e.g., Qwen3-MoE, DeepSeek-V3)#

Reference Files:

vllm_kunlun/models/qwen3_moe.py
vllm_kunlun/models/deepseek_v2.py

Additional Points:

Use vllm.model_executor.layers.fused_moe.layer.FusedMoE; Kunlun has replaced the underlying kernel via import hook.
MoE’s load_weights() is more complex, requiring expert parameter mapping:

expert_params_mapping = FusedMoE.make_expert_params_mapping(
    ckpt_gate_proj_name="gate_proj",
    ckpt_down_proj_name="down_proj",
    ckpt_up_proj_name="up_proj",
    num_experts=config.n_routed_experts,
)

Recommended environment variables:

export KUNLUN_USE_MOE_FFN_BLOCK=True
export XPU_USE_MOE_SORTED_THRES=120

3.2 MLA (Multi-head Latent Attention) Models (e.g., DeepSeek-V3)#

Reference File: vllm_kunlun/models/deepseek_v2.py

MLA Special Handling:

KV compression dimensions: kv_lora_rank, qk_nope_head_dim, qk_rope_head_dim.
Platform layer automatically selects FlashMLABackend:

# vllm_kunlun/platforms/kunlun.py
if use_mla:
    if use_sparse:
        return "vllm_kunlun.v1.attention.backends.mla.flashmla_sparse.FlashMLASparseBackend"
    return "vllm_kunlun.v1.attention.backends.mla.flashmla.FlashMLABackend"

block_size usually needs to be set to 64.
Recommended setting: export USE_ORI_ROPE=1.
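
To give a feel for what the compression dimensions mean, the sketch below compares the number of cached elements per token per layer for MLA (latent KV plus the shared RoPE key) with a layout that stores full K and V for every head. The numeric values are assumed from the public DeepSeek-V3 config.json and should be checked against your own model; the function names are hypothetical, and the actual Kunlun cache layout may add padding.

```python
def mla_cache_elems_per_token(kv_lora_rank: int, qk_rope_head_dim: int) -> int:
    """Elements cached per token per layer: latent KV plus shared RoPE key."""
    return kv_lora_rank + qk_rope_head_dim


def mha_cache_elems_per_token(num_heads: int, qk_head_dim: int, v_head_dim: int) -> int:
    """Elements cached per token per layer when full K and V are stored."""
    return num_heads * (qk_head_dim + v_head_dim)


# Values assumed from the public DeepSeek-V3 config.json.
kv_lora_rank, qk_nope_head_dim, qk_rope_head_dim = 512, 128, 64
num_heads, v_head_dim = 128, 128

mla = mla_cache_elems_per_token(kv_lora_rank, qk_rope_head_dim)
mha = mha_cache_elems_per_token(num_heads, qk_nope_head_dim + qk_rope_head_dim, v_head_dim)
print(mla, mha, round(mha / mla, 1))
```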

3.3 Multi-modal Models (e.g., Qwen2-VL, InternVL)#

Reference Files:

vllm_kunlun/models/qwen3_vl.py
vllm_kunlun/models/internvl.py
vllm_kunlun/models/interns1.py

Additional Components to Implement:

Component	Description
`SupportsMultiModal` Interface	Declares that the model supports multi-modal input
Vision Encoder	Usually `InternVisionModel` or custom ViT
Projector	Vision → Language mapping (e.g., MLP)
`@MULTIMODAL_REGISTRY.register_processor(...)`	Register multi-modal processor
`BaseMultiModalProcessor`	Handles multi-modal input
`BaseProcessingInfo`	Handles processing info
`BaseDummyInputsBuilder`	Dummy inputs for the profiling phase

3.4 Hybrid Attention Models (e.g., Qwen3-Next)#

Reference File: vllm_kunlun/models/qwen3_next.py

This model contains both Linear Attention and Full Attention layer types:

# Select different attention calculations based on layer_type
if self.layer_type == "linear_attention":
    self.linear_attn(hidden_states=hidden_states, output=self_attention_output)
elif self.layer_type == "full_attention":
    self.self_attn(hidden_states=hidden_states, output=self_attention_output, positions=positions)

Note:

Linear Attention uses GatedDeltaNet or similar implementations.
Register the linear-attention op as a custom_op (e.g., vllm.gdn_attention) and add it to splitting_ops so that torch.compile splits the graph at that op.
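
The per-layer dispatch can be sketched without any model code. The example below builds a layer_type list and picks the attention path for each layer. The interval-based pattern and the parameter name full_attention_interval are assumptions for illustration; read the real pattern from the model's config.

```python
def build_layer_types(num_layers: int, full_attention_interval: int) -> list[str]:
    """Every full_attention_interval-th layer is full attention; the rest are linear."""
    return [
        "full_attention" if (i + 1) % full_attention_interval == 0 else "linear_attention"
        for i in range(num_layers)
    ]


def pick_attention(layer_type: str) -> str:
    """Stand-in for the dispatch shown above; returns which path would run."""
    if layer_type == "linear_attention":
        return "linear_attn"
    if layer_type == "full_attention":
        return "self_attn"
    raise ValueError(f"unknown layer_type: {layer_type}")


layer_types = build_layer_types(num_layers=8, full_attention_interval=4)
print([pick_attention(t) for t in layer_types])
```

Raising on an unknown layer_type, rather than silently skipping attention, makes config mismatches visible early.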

IV. Quantized Model Adaptation#

4.1 Supported Quantization Methods#

Quantization Method	Adaptation File	Status
INT8 Dynamic (W8A8)	`ops/quantization/kernels/kunlun_scale_mm.py`	✅ Recommended
AWQ (INT4)	`ops/quantization/awq.py`	✅ Supported
GPTQ (INT4)	`ops/quantization/gptq.py`	✅ Supported
CompressedTensors (INT8 MoE)	`ops/quantization/compressed_tensors/`	✅ Supported
FP8	—	⚠️ Partial Support
bfloat16	—	⚠️ Double VRAM bug

4.2 Special Handling for Quantization#

Kunlun chips expect the weight scale in "max" format rather than vLLM's default absmax-derived scale, so the loaded scale is multiplied by 127.0 after weight loading:

# ops/quantization/kernels/kunlun_scale_mm.py
class KunlunScaledMMLinearKernel(CutlassScaledMMLinearKernel):
    def process_weights_after_loading(self, layer):
        super().process_weights_after_loading(layer)
        # ⭐ Key: Multiply scale by 127.0 to convert to max format
        with torch.no_grad():
            getattr(layer, self.w_s_name).mul_(127.0)
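
The arithmetic behind the factor of 127.0 can be checked in plain Python. The sketch assumes symmetric INT8 quantization where the loaded scale equals absmax / 127, so multiplying by 127 recovers the maximum absolute value; the weights are made up and no Kunlun kernel is involved.

```python
INT8_MAX = 127.0

# Made-up weights for one output channel.
weights = [0.50, -1.27, 0.31, 0.08]

absmax = max(abs(w) for w in weights)
scale_absmax_format = absmax / INT8_MAX              # w is roughly q * scale
scale_max_format = scale_absmax_format * INT8_MAX    # what mul_(127.0) produces

# Quantizing with either representation yields the same int8 values.
q_from_absmax = [round(w / scale_absmax_format) for w in weights]
q_from_max = [round(w * INT8_MAX / scale_max_format) for w in weights]
print(q_from_absmax, q_from_max)
```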

INT4 weights need to be repacked into the Kunlun layout order:

# AWQ repack example
AWQ_TO_KUNLUN_ORDER_NORMAL = [4, 0, 5, 1, 6, 2, 7, 3]
unpacked_kunlun = unpacked_awq[..., AWQ_TO_KUNLUN_ORDER_NORMAL]
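
The index table reads as "output position i takes input value order[i]". The sketch below applies it to the eight 4-bit values packed in one 32-bit word. Lowest-nibble-first packing is an assumption made for the illustration, and the helper names are hypothetical; the real repack operates on tensors.

```python
AWQ_TO_KUNLUN_ORDER_NORMAL = [4, 0, 5, 1, 6, 2, 7, 3]


def unpack_nibbles(word: int) -> list[int]:
    """Split a 32-bit word into eight 4-bit values, lowest nibble first."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


def pack_nibbles(nibbles: list[int]) -> int:
    """Inverse of unpack_nibbles."""
    word = 0
    for i, n in enumerate(nibbles):
        word |= (n & 0xF) << (4 * i)
    return word


def repack(word: int, order: list[int]) -> int:
    """Reorder nibbles: output position i takes input nibble order[i]."""
    nibbles = unpack_nibbles(word)
    return pack_nibbles([nibbles[j] for j in order])


word = 0x76543210  # nibble i holds the value i
print(unpack_nibbles(repack(word, AWQ_TO_KUNLUN_ORDER_NORMAL)))
```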

V. Custom Operators (if new low-level Ops are needed)#

If your model requires new low-level operators:

5.1 Wrap kunlun_ops calls in `_kunlun_ops.py`#

# vllm_kunlun/ops/_kunlun_ops.py
class KunlunOps:
    @staticmethod
    def my_new_op(input, weight, out):
        """Call underlying kunlun_ops implementation"""
        kunlun_ops.my_new_op(input, weight, out=out)

5.2 Register to vLLM in `_custom_ops.py`#

Follow the three-part pattern (implementation, fake implementation, registration):

# vllm_kunlun/ops/_custom_ops.py

# 1. Define the actual implementation of the op
def my_new_op_impl(input: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    output = torch.empty_like(input)
    KunlunOps.my_new_op(input, weight, output)
    return output

# 2. Define fake tensor implementation (for torch.compile)
def my_new_op_fake(input: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    return torch.empty_like(input)

# 3. Register
direct_register_custom_op(
    op_name="my_new_op",
    op_func=my_new_op_impl,
    mutates_args=[],
    fake_impl=my_new_op_fake,
)

VI. Common Pitfalls Checklist#

Before submitting a PR, please check each item:

[ ] Attention uses vllm_kunlun.ops.attention.layer.Attention?
[ ] Activation functions use vllm_kunlun.ops.activation.SiluAndMul, etc.?
[ ] All submodules in __init__() have the prefix parameter passed?
[ ] load_weights() correctly handles weight name mapping (stacked_params_mapping)?
[ ] @support_torch_compile decorator is added to the main model class?
[ ] The first parameter of ModelRegistry.register_model() exactly matches architectures in config.json?
[ ] No use of VLLM_USE_V1 environment variable for logic (deprecated, v0.15.1 is V1-only)?
[ ] Type annotations use Optional[T] instead of T | None (to avoid infer_schema failure)?
[ ] Quantized model scales are correctly multiplied by 127.0?
[ ] Supports Pipeline Parallelism (using PPMissingLayer, is_pp_missing_parameter)?
[ ] Ran pre-commit format checks?
[ ] Commits use -s signature (DCO compliance)?

VII. Reference Template Quick Look-up#

Model Type	Best Reference File	Features
Standard Dense LLM	`qwen3.py`	Simplest, recommended for beginners
Dense LLM (Custom Embedding)	`seed_oss.py`, `internlm2.py`	Custom VocabParallelEmbedding
MoE LLM	`qwen3_moe.py`	FusedMoE + EP + SharedExpert
MLA + MoE (DeepSeek)	`deepseek_v2.py`	MLA attention + MoE + Indexer
Hybrid Attention	`qwen3_next.py`	Linear + Full attention
Multi-modal (VL)	`qwen3_vl.py`, `internvl.py`	ViT + Projector + LLM
Speculative Decoding (MTP)	`deepseek_mtp.py`	Multi-Token Prediction

VIII. Debugging Tips#

8.1 Startup Failure#

ModuleNotFoundError: Check if the import hook mapping table in __init__.py covers the corresponding module.
circular import: Check if your new code introduces heavy dependencies during the register() phase.
Model architecture XXX is not supported: Check if the first parameter of register_model() matches config.json.

8.2 Abnormal Output#

Garbage output: Compare with HF transformers output on CPU; likely an operator precision issue or weight loading mapping error.
Repeated tokens: Check if rotary_embedding is applied correctly and if the is_neox_style parameter is correct.
Truncated output: Check max_model_len settings and if KV cache is sufficient.

8.3 VRAM Issues#

Use --dtype float16 (avoid bfloat16 due to double VRAM bug).
Set VLLM_KUNLUN_ENABLE_INT8_BMM=1 (saves ~0.1GB).
Lower --gpu-memory-utilization (default is 0.9).
Use INT8 quantized models.

8.4 Weight Loading Failure#

# Debugging method: Print parameter names for comparison
params_dict = dict(self.named_parameters())
print("=== Model params ===")
for k in sorted(params_dict.keys()):
    print(f"  {k}: {params_dict[k].shape}")

# Print in load_weights
for name, loaded_weight in weights:
    if name not in params_dict:
        print(f"  ⚠️ Skipped: {name}")
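
Instead of reading the two printed lists by eye, the names can be compared with set operations. In this sketch diff_names is a hypothetical helper and the names are made up; in practice, collect them from named_parameters() and from the remapped names seen inside load_weights().

```python
def diff_names(model_names, ckpt_names):
    """Return (params never loaded, checkpoint entries with no matching param)."""
    model_set, ckpt_set = set(model_names), set(ckpt_names)
    return sorted(model_set - ckpt_set), sorted(ckpt_set - model_set)


# Made-up names for illustration.
model_names = ["layers.0.qkv_proj.weight", "layers.0.o_proj.weight", "norm.weight"]
ckpt_names = ["layers.0.qkv_proj.weight", "layers.0.out_proj.weight", "norm.weight"]

missing, unexpected = diff_names(model_names, ckpt_names)
print("not loaded:", missing)
print("unexpected:", unexpected)
```

A pair such as o_proj versus out_proj appearing on opposite sides usually means a rename rule is missing from the weight name mapping.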

8.5 Kunlun Graph Failure#

Confirm that splitting_ops in compilation-config includes your attention op name:

{
  "splitting_ops": [
    "vllm.unified_attention",
    "vllm.unified_attention_with_output",
    "vllm.unified_attention_with_output_kunlun",
    "vllm.sparse_attn_indexer_vllm_kunlun"
  ],
  "cudagraph_mode": "PIECEWISE"
}
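
If a model adds its own attention op, the JSON can be generated rather than edited by hand. The sketch below appends extra op names to the list shown above and serializes the result; build_compilation_config is a hypothetical helper, and the output is meant to be passed as the compilation-config value.

```python
import json

# Op names copied from the JSON above.
DEFAULT_SPLITTING_OPS = [
    "vllm.unified_attention",
    "vllm.unified_attention_with_output",
    "vllm.unified_attention_with_output_kunlun",
    "vllm.sparse_attn_indexer_vllm_kunlun",
]


def build_compilation_config(extra_ops=()):
    """Serialize a compilation config with extra splitting ops appended once."""
    ops = list(DEFAULT_SPLITTING_OPS)
    for op in extra_ops:
        if op not in ops:
            ops.append(op)
    return json.dumps({"splitting_ops": ops, "cudagraph_mode": "PIECEWISE"})


# "vllm.gdn_attention" is the example op name from section 3.4.
print(build_compilation_config(["vllm.gdn_attention"]))
```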

IX. Environment Variables Cheat Sheet#

# === Required ===
export XPU_VISIBLE_DEVICES=0,1,2,3,4,5,6,7   # Specify Kunlun cards to use
export VLLM_HOST_IP=$(hostname -i)             # IP for distributed communication

# === Recommended ===
export XMLIR_FORCE_USE_XPU_GRAPH=1             # Enable XPU Graph acceleration
export XMLIR_ENABLE_MOCK_TORCH_COMPILE=false   # Disable mock compile
export XMLIR_CUDNN_ENABLED=1                   # Enable cuDNN equivalent acceleration
export XPU_USE_DEFAULT_CTX=1                   # Default context
export BKCL_FORCE_SYNC=1                       # BKCL forced sync (multi-card stability)

# === Model Specific ===
export USE_ORI_ROPE=1                          # DeepSeek series uses original RoPE
export XFT_USE_FAST_SWIGLU=1                   # Fast SwiGLU activation
export XPU_USE_FAST_SWIGLU=1                   # Same as above (some versions)
export XPU_USE_MOE_SORTED_THRES=120            # MoE sorting threshold
export KUNLUN_USE_MOE_FFN_BLOCK=True           # MoE FFN block optimization

# === Optional Tuning ===
export VLLM_KUNLUN_ENABLE_INT8_BMM=1           # Enable INT8 BMM (saves ~0.1GB)
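
A short pre-flight check can catch a missing required variable before the server starts. This is a minimal sketch with hypothetical helper names; it only inspects the two variables listed as required above.

```python
import os

REQUIRED = ("XPU_VISIBLE_DEVICES", "VLLM_HOST_IP")


def missing_required(env=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]


def visible_device_count(env=None):
    """Count the cards listed in XPU_VISIBLE_DEVICES (0 if unset)."""
    env = os.environ if env is None else env
    value = env.get("XPU_VISIBLE_DEVICES", "")
    return len([d for d in value.split(",") if d.strip()])


demo_env = {"XPU_VISIBLE_DEVICES": "0,1,2,3"}
print(missing_required(demo_env), visible_device_count(demo_env))
```

Comparing visible_device_count() with the intended tensor parallel size is a quick way to spot a mismatch between the card list and the launch command.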

X. PR Submission Standards#

10.1 Branch Naming#

feature/add-my-new-model
bugfix/fix-attention-output

10.2 Commit Message Prefix#

Prefix	Description
`[Feature]`	New functionality / New model
`[Bugfix]`	Bug fix
`[CI/Build]`	CI / Build related
`[Doc]`	Documentation update
`[Misc]`	Others

10.3 Before Submission#

# 1. Install pre-commit
pre-commit install

# 2. Run checks
pre-commit run --all-files

# 3. Signed commit (DCO compliance)
git commit -s -m "[Feature] Add MyNewModel support for Kunlun"

10.4 PR Checklist#

[ ] Code passes pre-commit checks.
[ ] Single-card offline inference test passed.
[ ] Multi-card TP test passed (if applicable).
[ ] Quantized model test passed (if applicable).
[ ] Updated vllm_kunlun/models/__init__.py registration.
[ ] Updated supported models list in README (if applicable).

Appendix: Standard Startup Command Templates#

A. Standard Dense Model (Single Card)#

XPU_VISIBLE_DEVICES=0 \
python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 8888 \
    --model /path/to/model \
    --trust-remote-code \
    --dtype float16 \
    --max-model-len 8192 \
    --block-size 64

B. MoE Model (8-card TP)#

XPU_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
XMLIR_FORCE_USE_XPU_GRAPH=1 \
KUNLUN_USE_MOE_FFN_BLOCK=True \
XPU_USE_MOE_SORTED_THRES=120 \
python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 8888 \
    --model /path/to/moe-model-int8 \
    --trust-remote-code \
    --dtype float16 \
    --max-model-len 32768 \
    --tensor-parallel-size 8 \
    --max-num-seqs 4 \
    --block-size 64 \
    --no-enable-chunked-prefill \
    --distributed-executor-backend mp \
    --no-enable-prefix-caching

C. DeepSeek-V3 (MLA + MoE, W8A8)#

XMLIR_ENABLE_MOCK_TORCH_COMPILE=false \
USE_ORI_ROPE=1 \
XPU_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 8806 \
    --model /path/to/DeepSeek-V3-w8a8 \
    --gpu-memory-utilization 0.98 \
    --trust-remote-code \
    --max-model-len 32768 \
    --tensor-parallel-size 8 \
    --dtype float16 \
    --max-num-seqs 4 \
    --block-size 64 \
    --no-enable-chunked-prefill \
    --distributed-executor-backend mp \
    --no-enable-prefix-caching

📝 Document Maintenance: If you have questions or suggestions, please provide feedback in GitHub Issues.