FAQs

Version-Specific FAQs

[[v0.15.1] FAQ & Feedback]

General FAQs

1. What devices are currently supported?

Currently, only the Kunlun3 series (P800) is supported.

The following series are not supported yet:

Kunlun4 series (M100 and M300)
Kunlun2 series (R200)
Kunlun1 series

Support for the Kunlun4 M100 platform is planned for early 2026.

2. How do I get the Docker containers?

Base image: docker pull wjie520/vllm_kunlun:v0.0.1

3. How does vllm-kunlun work with vLLM?

vllm-kunlun is a hardware plugin for vLLM. As a rule, the vllm-kunlun version matches the vLLM version. For example, if you use vLLM 0.15.1, you should also use vllm-kunlun 0.15.1. On the main branch, we keep vllm-kunlun and vLLM compatible at every commit.
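The version-matching rule can be checked before starting a server. The following is a minimal sketch, not part of vllm-kunlun itself: the helper names are hypothetical, the distribution name `vllm-kunlun` is an assumption about how the plugin is packaged, and treating suffixes such as `+local` or `rc1` as irrelevant to the comparison is also an assumption.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def release_of(ver: str) -> str:
    """Return the leading numeric release of a version string.

    Assumption: suffixes such as '+local', '.dev3', or 'rc1' are ignored.
    """
    match = re.match(r"\d+(?:\.\d+)*", ver)
    return match.group(0) if match else ver


def versions_match(vllm_version: str, plugin_version: str) -> bool:
    """True when both packages report the same numeric release."""
    return release_of(vllm_version) == release_of(plugin_version)


def check_installed(vllm_dist: str = "vllm", plugin_dist: str = "vllm-kunlun") -> str:
    """Compare the installed versions and return a human-readable status.

    The distribution names are assumptions; adjust them to your environment.
    """
    try:
        vllm_version = version(vllm_dist)
        plugin_version = version(plugin_dist)
    except PackageNotFoundError as exc:
        return f"missing package: {exc}"
    if versions_match(vllm_version, plugin_version):
        return f"ok: both at {release_of(vllm_version)}"
    return f"mismatch: {vllm_dist} {vllm_version} vs {plugin_dist} {plugin_version}"


if __name__ == "__main__":
    print(check_installed())
```

Running the script in the target environment prints a single status line, which can be enough to catch a mismatched install before launching a server.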

4. How do I handle out-of-memory (OOM) issues?

OOM errors typically occur when the model exceeds the memory capacity of a single XPU. For general guidance, refer to the vLLM OOM troubleshooting documentation.

When XPUs have limited high-bandwidth memory (HBM) capacity, dynamic memory allocation and deallocation during inference can worsen memory fragmentation and lead to OOM. To address this:

Limit --max-model-len: A smaller value reduces the HBM reserved during the KV cache initialization step.
Adjust --gpu-memory-utilization: If unspecified, the default value is 0.9. Decreasing this value leaves more memory in reserve, which reduces the risk of fragmentation. For details, see vLLM - Inference and Serving - Engine Arguments.
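The two settings above can be combined on one command line. The sketch below is illustrative only: `build_serve_command` is a hypothetical helper, the model name is a placeholder, and the values 8192 and 0.8 are example starting points, not tuned recommendations for Kunlun hardware.

```python
import shlex
from typing import List


def build_serve_command(
    model: str,
    max_model_len: int = 8192,            # example value, not a recommendation
    gpu_memory_utilization: float = 0.8,  # below the 0.9 default to keep headroom
) -> List[str]:
    """Build a `vllm serve` argument list with both memory-related flags set."""
    if max_model_len <= 0:
        raise ValueError("max_model_len must be a positive integer")
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return [
        "vllm", "serve", model,
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


if __name__ == "__main__":
    # "your-org/your-model" is a placeholder; substitute the model you serve.
    print(shlex.join(build_serve_command("your-org/your-model")))
```

If OOM persists, lowering either value further and retrying is a reasonable next step; the right numbers depend on the model and the workload.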