
vLLM High-Performance Inference

Production-grade LLM serving with vLLM


🎯 What is vLLM?

vLLM is a high-throughput and memory-efficient inference and serving engine for Large Language Models (LLMs). It's designed for production deployments with:

  • High Throughput: Serve multiple requests efficiently with continuous batching
  • Memory Efficiency: PagedAttention algorithm reduces memory usage
  • Production Ready: Built for scale and reliability
  • OpenAI Compatible: Drop-in replacement for OpenAI API

✅ Advantages

  • 3-10x faster than HuggingFace Transformers
  • Up to 24x higher throughput than HuggingFace Transformers
  • Reduced memory footprint
  • Continuous batching
  • OpenAI API compatible

🎯 Best For

  • Production deployments
  • High-volume serving
  • Multi-user applications
  • Enterprise workloads
  • API-based serving

🚀 Installation

Option 1: pip

# Install vLLM (default wheels are built for CUDA 12.1)
pip install vllm

# For CUDA 11.8
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu118
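
To confirm the install before starting a server, a quick version check:

# Verify that vLLM imports and print the installed version
python -c "import vllm; print(vllm.__version__)"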

Option 2: Docker

# Pull vLLM Docker image
docker pull vllm/vllm-openai:latest

# Run vLLM server
docker run --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model meta-llama/Llama-3-8B-Instruct

🔧 Configuration

Start vLLM Server

# Basic vLLM server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3-8B-Instruct \
    --port 8000

# With advanced settings
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3-8B-Instruct \
    --port 8000 \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 4096
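
Once the server is up, you can smoke-test the OpenAI-compatible endpoint directly with curl. This is a minimal sketch, assuming the server above is running on localhost:8000 with the same model name:

# Send a test chat completion to the OpenAI-compatible endpoint
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
    }'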

SuperOptiX Configuration

spec:
  language_model:
    provider: openai  # vLLM is OpenAI API compatible
    model: meta-llama/Llama-3-8B-Instruct
    api_base: http://localhost:8000/v1
    api_key: "dummy"  # vLLM doesn't require real API key
    temperature: 0.7
    max_tokens: 1000

📊 Performance Comparison

Engine      | Throughput    | Memory | Latency
vLLM        | 24x ⭐        | Low ✅  | Fast ✅
HuggingFace | 1x (baseline) | High   | Moderate
Ollama      | 5x            | Medium | Fast

🔬 Advanced Configuration

Tensor Parallelism (Multi-GPU)

# Use 4 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3-70B-Instruct \
    --tensor-parallel-size 4 \
    --port 8000

Quantization for Memory Efficiency

# AWQ 4-bit quantization
python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Llama-2-70B-AWQ \
    --quantization awq \
    --port 8000

# GPTQ quantization
python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Llama-2-70B-GPTQ \
    --quantization gptq \
    --port 8000

Custom Sampling Parameters

# These flags tune server capacity and context length;
# sampling itself is set per request (see the example below)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3-8B-Instruct \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.95 \
    --max-num-seqs 256 \
    --port 8000
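
Sampling parameters (temperature, top_p, max_tokens, stop sequences) are passed per request through the OpenAI-compatible API rather than as server flags. A minimal sketch, assuming the server above:

# Per-request sampling parameters via the chat completions endpoint
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": "List three vLLM features."}],
        "temperature": 0.2,
        "top_p": 0.9,
        "max_tokens": 256
    }'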

📝 SuperOptiX Integration Example

apiVersion: agent/v1
kind: AgentSpec
metadata:
  name: vllm_production_agent
  id: vllm_production_agent
  namespace: production
  version: 1.0.0
  level: genies

spec:
  target_framework: dspy

  language_model:
    provider: openai  # vLLM is OpenAI compatible
    model: meta-llama/Llama-3-8B-Instruct
    api_base: http://localhost:8000/v1
    api_key: "dummy"
    temperature: 0.7
    max_tokens: 2000

  persona:
    role: Production AI Assistant
    goal: Provide fast, reliable responses at scale

  input_fields:
    - name: query
      type: str

  output_fields:
    - name: response
      type: str

  feature_specifications:
    scenarios:
      - name: Performance test
        input:
          query: "Explain vLLM benefits"
        expected_output:
          response: "vLLM explanation"

🔄 Usage with SuperOptiX CLI

# Initialize project
super init vllm_project
cd vllm_project

# Start vLLM server (in background)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3-8B-Instruct \
    --port 8000 &

# Create agent with vLLM
super agent pull assistant_openai

# Update playbook to use vLLM endpoint
# Edit agents/demo/assistant_openai_playbook.yaml:
#   language_model:
#     provider: openai
#     api_base: http://localhost:8000/v1

# Compile and run
super agent compile assistant_openai
super agent run assistant_openai --goal "Hello from vLLM!"

📈 Benchmarks

Model Size | vLLM Tokens/sec | HuggingFace Tokens/sec | Speedup
7B         | ~1500           | ~150                   | 10x ⚡
13B        | ~1000           | ~100                   | 10x ⚡
70B        | ~300            | ~15                    | 20x 🚀

🎯 Use Cases

  • 🏢 Enterprise APIs: Serve thousands of users simultaneously with high throughput
  • 🤖 Multi-Agent Systems: Run multiple agents efficiently with batch processing
  • Real-Time Applications: Low-latency inference for interactive applications


🔬 Advanced Features

PagedAttention

vLLM's PagedAttention memory manager splits the KV cache into fixed-size blocks, reducing fragmentation and wasted GPU memory:

# PagedAttention is enabled by default and manages the KV cache automatically
# No configuration needed

Continuous Batching

# vLLM automatically batches incoming requests
# Optimal throughput without manual tuning
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3-8B-Instruct \
    --max-num-seqs 256  # Max concurrent sequences
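
To see continuous batching at work, fire several requests concurrently; the server interleaves them on the GPU with no client-side coordination. A rough sketch using curl and xargs (assumes the server above and an xargs that supports -P):

# Fire 8 concurrent requests; vLLM batches them automatically
seq 1 8 | xargs -P 8 -I{} curl -s -o /dev/null -w "request {}: HTTP %{http_code}\n" \
    http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "Give one short fact about GPUs."}], "max_tokens": 32}'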

Speculative Decoding

# Use smaller model for speculation, larger for verification
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3-70B-Instruct \
    --speculative-model meta-llama/Llama-3-8B-Instruct \
    --num-speculative-tokens 5

🌐 Deployment

Production Deployment

# Production settings
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 \
    --max-num-seqs 256 \
    --disable-log-requests

Docker Compose

version: '3.8'
services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: >
      --model meta-llama/Llama-3-8B-Instruct
      --tensor-parallel-size 2
      --gpu-memory-utilization 0.9
