Skip to main content
vLLM is a high-throughput inference server for open-weight models. OpenClaw connects to vLLM via its OpenAI-compatible API.

Prerequisites

  • A running vLLM server with its OpenAI-compatible API enabled
  • The server must be reachable from the OpenClaw gateway host

Activation

vLLM auto-activates when VLLM_API_KEY is set in the environment. The default base URL is http://127.0.0.1:8000/v1.
Run openclaw onboard to configure vLLM interactively — it generates the correct models.providers.vllm config block for you.

Configuration

models:
  providers:
    vllm:
      baseUrl: "http://127.0.0.1:8000/v1"
      api: "openai-completions"
      apiKey: "VLLM_API_KEY"   # env var name, not the literal key
      models:
        - id: "meta-llama/Llama-3.1-8B-Instruct"
          name: "Llama 3.1 8B"
          contextWindow: 128000
          maxTokens: 4096
          input: 0
          cost: 0
KeyTypeDescription
baseUrlstringBase URL of the vLLM OpenAI-compatible endpoint
apistringAPI style — openai-completions for vLLM
apiKeystringName of the env var holding the API key
models[].idstringModel ID as loaded by your vLLM server
models[].contextWindowintegerContext window size in tokens
models[].maxTokensintegerMax output tokens
models[].input / costnumberCost per token (use 0 for self-hosted)

Running vLLM

docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct

Verify the connection

openclaw models status --probe-provider vllm
vLLM does not expose a model list endpoint by default. The --probe flag sends a test completion to verify connectivity.