vLLM server performance

vLLM benchmark CLI

You can use vLLM’s built-in benchmark CLI directly. For more details, refer to the Benchmark Suites section of the vLLM Developer Guide.

1. Online testing

1.1 Start the vLLM server

Reference startup script for the server:

python -m vllm.entrypoints.openai.api_server \
      --host 0.0.0.0 \
      --port 8000 \
      --model /xxxx/xxxx/model \
      --gpu-memory-utilization 0.9 \
      --trust-remote-code \
      --max-model-len 32768 \
      --tensor-parallel-size 1 \
      --dtype float16 \
      --no-enable-prefix-caching \
      --no-enable-chunked-prefill \
      --distributed-executor-backend mp \
      --served-model-name modelname
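Loading the model can take a while, so it helps to confirm that the server answers before starting a benchmark. The following is a minimal sketch, not part of vLLM: `http_ok` and `wait_for_server` are hypothetical helpers, and the probe URL assumes the host and port from the startup script above together with the OpenAI-compatible `/v1/models` route.

```python
import time
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False


def wait_for_server(url, deadline_s=600.0, interval_s=5.0,
                    probe=http_ok, sleep=time.sleep):
    """Poll `url` until `probe` succeeds or about `deadline_s` seconds pass."""
    waited = 0.0
    while True:
        if probe(url):
            return True
        if waited >= deadline_s:
            return False
        sleep(interval_s)
        waited += interval_s


# Example call (assumed address, adjust to your own server):
# wait_for_server("http://127.0.0.1:8000/v1/models")
```

The `probe` and `sleep` parameters exist only so the polling loop can be exercised without a running server.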

1.2 Execute the test

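Use the script below as a reference for running the test. Set --port to the port the server listens on (8000 in the startup script above) and --model to the name passed to --served-model-name.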

#!/bin/bash
# Run benchmark tests
python -m vllm.entrypoints.cli.main bench serve \
    --host 127.0.0.1 \
    --port xxxx \
    --backend vllm \
    --model modelname \
    --dataset-name random \
    --num-prompts 500 \
    --random-input-len 1024 \
    --random-output-len 1024 \
    --tokenizer /xxxx/xxxx/model \
    --ignore-eos 2>&1 | tee benchmark.log

1.3 Result

When the run finishes, the benchmark prints a summary like the following.

========== Serving Benchmark Result ==========
Successful requests:                          500
Benchmark duration (s):                       144.89
Total input tokens:                           510414
Total generated tokens:                       512000
Request throughput (req/s):                   3.45
Output token throughput (tok/s):              3533.68
Total Token throughput (tok/s):               7056.42
----------Time to First Token----------
Mean TTFT (ms):                               57959.61
Median TTFT (ms):                             43551.93
P99 TTFT (ms):                                116202.52
----------Time per Output Token (excl. 1st token)----------
Mean TPOT (ms):                               33.30
Median TPOT (ms):                             34.15
P99 TPOT (ms):                                35.59
----------Inter-token Latency----------
Mean ITL (ms):                                33.30
Median ITL (ms):                              29.05
P99 ITL (ms):                                 46.14
============================================
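The throughput lines in this report follow from the counts and the duration, which makes a saved log easy to sanity-check. The sketch below is an illustration rather than a vLLM utility: `parse_report` is a hypothetical helper, and `SAMPLE` is an excerpt of the report above standing in for the contents of benchmark.log.

```python
import re

# Excerpt of the report above; in practice, read this text from benchmark.log.
SAMPLE = """\
Successful requests:                          500
Benchmark duration (s):                       144.89
Total input tokens:                           510414
Total generated tokens:                       512000
Request throughput (req/s):                   3.45
Output token throughput (tok/s):              3533.68
Total Token throughput (tok/s):               7056.42
"""

# Matches "<label>: <number>" where the number ends the line.
LINE = re.compile(r"^(?P<label>[^:]+):\s+(?P<value>-?\d+(?:\.\d+)?)\s*$")


def parse_report(text):
    """Collect every "label: number" line of the report into a dict."""
    metrics = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            metrics[match.group("label").strip()] = float(match.group("value"))
    return metrics


m = parse_report(SAMPLE)
duration = m["Benchmark duration (s)"]
req_tput = m["Successful requests"] / duration
out_tput = m["Total generated tokens"] / duration
total_tput = (m["Total input tokens"] + m["Total generated tokens"]) / duration
print(f"{req_tput:.2f} req/s, {out_tput:.2f} tok/s out, {total_tput:.2f} tok/s total")
```

Because the printed duration is rounded to two decimals, the recomputed values can differ from the reported ones in the last digits.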

Key metrics:

Metric	Meaning	Optimization objective
Output token throughput	Output tokens generated per second	↑ Higher is better
Mean TTFT	Mean time to first token (TTFT)	↓ Lower is better
P99 TTFT	99th-percentile TTFT: 99% of requests receive their first token within this time	↓ Lower is better
Mean TPOT	Mean time per output token, excluding the first token	↓ Lower is better
P99 TPOT	99th-percentile time per output token	↓ Lower is better
ITL	Inter-token latency: delay between adjacent output tokens	↓ Lower is better
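To make the latency metrics concrete, the sketch below derives TTFT, TPOT, and ITL for a single streamed request from token arrival times. It uses the common definitions and is an assumption about the computation, not a copy of the benchmark's code; `request_metrics` and the timestamps are hypothetical.

```python
from statistics import mean


def request_metrics(sent_at, token_times):
    """Latency metrics for one streamed request.

    sent_at: time the request was sent, in seconds.
    token_times: arrival time of each output token, in seconds, in order.
    """
    ttft = token_times[0] - sent_at
    latency = token_times[-1] - sent_at
    n = len(token_times)
    # TPOT spreads the time after the first token over the remaining tokens.
    tpot = (latency - ttft) / (n - 1) if n > 1 else 0.0
    # ITL is the list of gaps between consecutive tokens.
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    return {"ttft": ttft, "tpot": tpot, "itl": itl, "latency": latency}


# Hypothetical timestamps: first token after 80 ms, then one token every 16 ms.
times = [0.080 + 0.016 * i for i in range(5)]
r = request_metrics(0.0, times)
print(r["ttft"], r["tpot"], mean(r["itl"]))
```

Within one request the mean ITL equals TPOT. Across requests, TPOT is averaged per request while ITL pools every gap, so their medians and percentiles can differ even when the means agree, as in the report above.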

2. Offline testing

Coming soon…

EvalScope

EvalScope is a comprehensive model evaluation tool that tests both model accuracy and performance. For more information, see the EvalScope website. A brief introduction follows.

1. Download and install

EvalScope runs in a Python environment and can be installed with pip or from source. Both methods are shown below:

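# Option 1: install with pip (the quotes keep shells such as zsh from expanding the brackets)
pip install 'evalscope[perf]' -U
# Option 2: install from source
git clone https://github.com/modelscope/evalscope.git
cd evalscope
pip install -e '.[perf]'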

After installation, some dependencies may still be missing and cause the program to fail at startup. Install them as the error messages indicate.

2. Start using

The following example runs a performance test of Qwen3-8B on a single card.

2.1 Start the server

The first step is to start the server. The example script is shown below.

python -m vllm.entrypoints.openai.api_server \
      --host 0.0.0.0 \
      --port 8000 \
      --model /models/Qwen3-8B \
      --gpu-memory-utilization 0.9 \
      --trust-remote-code \
      --max-model-len 32768 \
      --tensor-parallel-size 1 \
      --dtype float16 \
      --no-enable-prefix-caching \
      --no-enable-chunked-prefill \
      --distributed-executor-backend mp \
      --served-model-name Qwen3-8B

2.2 Start EvalScope

Start EvalScope to begin performance testing.

# --parallel: concurrency levels to test in one invocation, separated by spaces.
# --number: total number of requests for each concurrency level, separated by
#           spaces and matched position by position with --parallel.
evalscope perf \
  --parallel 1 10 \
  --number 10 20 \
  --model Qwen3-8B \
  --url http://127.0.0.1:xxxx/v1/chat/completions \
  --api openai \
  --dataset random \
  --max-tokens 1024 \
  --min-tokens 1024 \
  --prefix-length 0 \
  --min-prompt-length 1024 \
  --max-prompt-length 1024 \
  --tokenizer-path /xxxx/xxxx/Qwen3-8B \
  --extra-args '{"ignore_eos": true}'
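The pairing of --parallel and --number can be sketched as follows. This is an illustration of how the two lists line up in the command above, not EvalScope code; the variable names are hypothetical.

```python
parallel = [1, 10]   # values passed to --parallel
number = [10, 20]    # values passed to --number
max_tokens = 1024    # --max-tokens equals --min-tokens, so every reply has this length

if len(parallel) != len(number):
    raise ValueError("--parallel and --number need the same number of values")

# Each concurrency level is paired with the request count in the same position.
runs = list(zip(parallel, number))
for conc, total in runs:
    print(f"concurrency {conc}: {total} requests")

# With a fixed output length, the total number of generated tokens is predictable.
expected_tokens = sum(number) * max_tokens
print(f"expected generated tokens: {expected_tokens}")
```

For the values above this gives 30 requests and 30,720 generated tokens, which matches the total in the summary report.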

2.3 Results analysis

The summary report below shows the results. Additional data for each individual run is available in the logs. For the meaning of each metric, refer to the key metrics explained in the vLLM benchmark CLI section.

Performance Test Summary Report

Basic Information:
+-------------------+------------------------+
| Model             | Qwen3-8B               |
| Total Generated   | 30,720.0 tokens        |
| Total Test Time   | 199.79 seconds         |
| Avg Output Rate   | 153.76 tokens/sec      |
+-------------------+------------------------+

Detailed Performance Metrics
+-------+------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+--------------+
| Conc. | RPS  | Avg Lat.(s) | P99 Lat.(s) | Gen. Toks/s | Avg TTFT(s) | P99 TTFT(s) | Avg TPOT(s) | P99 TPOT(s) | Success Rate |
+-------+------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+--------------+
| 1     | 0.07 | 16.191      | 16.475      | 70.40       | 0.080       | 0.085       | 0.016       | 0.016       | 100.0%       |
| 10    | 0.53 | 18.927      | 19.461      | 540.87      | 0.503       | 0.562       | 0.018       | 0.019       | 100.0%       |
+-------+------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+--------------+

Best Performance Configuration
Highest RPS:      Concurrency 10 (0.53 req/sec)
Lowest Latency:   Concurrency 1 (16.191 seconds)

Performance Recommendations:
* The system seems not to have reached its performance bottleneck, try higher concurrency
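One way to read the table is to compare how throughput scales with concurrency and to check that latency is roughly TTFT plus the decode time. The sketch below does both; `scaling` is a hypothetical helper, the rows are copied from the table above, and the latency estimate assumes 1024 output tokens per request as configured in the test command.

```python
# (concurrency, generated tokens/s, avg TTFT in s, avg TPOT in s) from the table above.
ROWS = [
    (1, 70.40, 0.080, 0.016),
    (10, 540.87, 0.503, 0.018),
]


def scaling(rows, output_tokens=1024):
    """Compare each row with the first one and estimate per-request latency."""
    base_conc, base_tput = rows[0][0], rows[0][1]
    report = []
    for conc, tput, ttft, tpot in rows:
        speedup = tput / base_tput
        report.append({
            "conc": conc,
            "speedup": speedup,
            # 1.0 would mean throughput grows in step with concurrency.
            "efficiency": speedup / (conc / base_conc),
            # First token, then one TPOT for each remaining token.
            "est_latency": ttft + tpot * (output_tokens - 1),
        })
    return report


for row in scaling(ROWS):
    print(f"conc={row['conc']}: speedup {row['speedup']:.2f}x, "
          f"efficiency {row['efficiency']:.0%}, "
          f"est. latency {row['est_latency']:.2f}s")
```

TPOT is printed with only three decimals, so the estimated latency is approximate. Throughput still growing strongly between the two levels is consistent with the report's suggestion to try higher concurrency.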