
⚡ SGLang Production Inference


SGLang High-Performance Inference

Structured generation and fast inference with SGLang


🎯 What is SGLang?

SGLang (Structured Generation Language) is a fast serving framework for large language models and vision language models with:

  • Structured Generation: Native support for constrained output formats (JSON, regex, etc.)
  • RadixAttention: Advanced KV cache reuse across requests
  • High Performance: Competitive with or faster than vLLM on many workloads
  • OpenAI Compatible: Drop-in replacement for OpenAI API

✅ Unique Features

  • Structured output generation (JSON, regex)
  • RadixAttention for cache reuse
  • Often faster than vLLM on structured-output tasks
  • Multi-modal support (Vision + Language)
  • OpenAI API compatible

🎯 Best For

  • Structured data extraction
  • JSON output requirements
  • Agentic workflows with tools
  • Multi-modal applications
  • High cache-hit workloads

🚀 Installation

Option 1: pip

# Install SGLang
pip install "sglang[all]"

# Or minimal install
pip install sglang

Option 2: Docker

# Pull SGLang Docker image
docker pull lmsysorg/sglang:latest

# Run SGLang server
docker run --gpus all \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
    --model meta-llama/Llama-3-8B-Instruct \
    --port 30000

🔧 Configuration

Start SGLang Server

# Basic SGLang server
python -m sglang.launch_server \
    --model meta-llama/Llama-3-8B-Instruct \
    --port 30000

# With advanced settings
python -m sglang.launch_server \
    --model meta-llama/Llama-3-8B-Instruct \
    --port 30000 \
    --tp 2 \
    --mem-fraction-static 0.9 \
    --max-running-requests 256

SuperOptiX Configuration

spec:
  language_model:
    provider: openai  # SGLang is OpenAI API compatible
    model: meta-llama/Llama-3-8B-Instruct
    api_base: http://localhost:30000/v1
    api_key: "dummy"  # SGLang doesn't require a real API key
    temperature: 0.7
    max_tokens: 1000

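Because SGLang exposes an OpenAI-compatible endpoint, the configuration above translates to a standard chat-completions request. A minimal sketch of the request body it implies (field names follow the OpenAI API; the values mirror the YAML above):

```python
import json

# Endpoint from api_base in the config above
api_base = "http://localhost:30000/v1"

# The chat-completions payload SGLang's OpenAI-compatible server expects
payload = {
    "model": "meta-llama/Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Say hello"}],
    "temperature": 0.7,
    "max_tokens": 1000,
}

body = json.dumps(payload)
print(body)
```

Any OpenAI client pointed at `api_base` with the dummy key will produce an equivalent request.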
🎯 Structured Generation

SGLang's killer feature is structured output:

JSON Schema Enforcement

from sglang import function, gen

@function
def extract_info(s, text):
    s += f"Extract structured info from: {text}\n"
    # Constrain decoding so the output must match the JSON pattern
    s += gen("output", max_tokens=200,
             regex=r'\{"name": "[^"]+", "age": \d+, "email": "[^"]+"\}')

# Guaranteed JSON output!
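You can sanity-check the constraint pattern offline with Python's `re` module before wiring it into a program (the sample string below is illustrative):

```python
import re

# The same pattern passed to gen() above
pattern = r'\{"name": "[^"]+", "age": \d+, "email": "[^"]+"\}'

# A conforming output string matches the full pattern
sample = '{"name": "John Doe", "age": 30, "email": "john@example.com"}'
match = re.fullmatch(pattern, sample)
print(bool(match))
```

Anything the constrained decoder emits is guaranteed to match this pattern, so downstream JSON parsing cannot fail on shape.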

With SuperOptiX

spec:
  language_model:
    provider: openai
    model: meta-llama/Llama-3-8B-Instruct
    api_base: http://localhost:30000/v1
    response_format: json_object  # Structured output

  output_fields:
    - name: response
      type: str
      format: json  # Enforce JSON

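With `format: json` set, the agent's `response` field should always parse as JSON. A simplified sketch of the kind of validation you might apply to a response (the raw payload here is made up for illustration):

```python
import json

# Hypothetical raw response from the agent's json-formatted output field
raw = '{"name": "John Doe", "age": 30, "email": "john@example.com"}'

record = json.loads(raw)  # raises ValueError if the output is not valid JSON
assert isinstance(record["age"], int)
print(record["name"])
```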
📊 vLLM vs SGLang Comparison

| Feature           | vLLM          | SGLang               |
|-------------------|---------------|----------------------|
| Throughput        | Excellent     | Excellent ✅          |
| Structured Output | Basic         | Advanced 🏆          |
| Cache Reuse       | Good          | RadixAttention 🚀    |
| Multi-Modal       | Limited       | Vision + Language ✅  |
| Maturity          | Production ✅  | Emerging ⚡           |

📝 SuperOptiX Integration Example

apiVersion: agent/v1
kind: AgentSpec
metadata:
  name: sglang_structured_agent
  id: sglang_structured_agent
  namespace: production
  version: 1.0.0
  level: genies

spec:
  target_framework: dspy

  language_model:
    provider: openai  # SGLang is OpenAI compatible
    model: meta-llama/Llama-3-8B-Instruct
    api_base: http://localhost:30000/v1
    api_key: "dummy"
    temperature: 0.7
    max_tokens: 2000
    response_format: json_object  # Structured output!

  persona:
    role: Data Extraction Specialist
    goal: Extract structured information accurately

  input_fields:
    - name: text
      type: str

  output_fields:
    - name: response
      type: str
      format: json

  feature_specifications:
    scenarios:
      - name: JSON extraction
        input:
          text: "John Doe is 30 years old, email: john@example.com"
        expected_output:
          response: '{"name": "John Doe", "age": 30}'
          expected_keywords:
            - John Doe
            - "30"

🔄 Usage with SuperOptiX CLI

# Start SGLang server
python -m sglang.launch_server \
    --model meta-llama/Llama-3-8B-Instruct \
    --port 30000 &

# Create agent with SGLang
super agent pull assistant_openai

# Update playbook to use SGLang endpoint
# Edit agents/demo/assistant_openai_playbook.yaml:
#   language_model:
#     provider: openai
#     api_base: http://localhost:30000/v1

# Compile and run
super agent compile assistant_openai
super agent run assistant_openai --goal "Extract info: Alice, age 25, alice@email.com"
