# ⚡ SGLang High-Performance Inference

*Structured generation and fast inference with SGLang*
## 🎯 What is SGLang?

SGLang (Structured Generation Language) is a fast serving framework for large language models and vision-language models, featuring:

- **Structured Generation**: Native support for constrained output formats (JSON, regex, etc.)
- **RadixAttention**: Automatic KV-cache reuse across requests that share prefixes
- **High Performance**: Competitive with or faster than vLLM on many workloads
- **OpenAI Compatible**: Drop-in replacement for the OpenAI API
## 🚀 Installation

### Option 1: pip

```bash
# Install SGLang with all extras
pip install "sglang[all]"

# Or minimal install
pip install sglang
```
### Option 2: Docker

```bash
# Pull the SGLang Docker image
docker pull lmsysorg/sglang:latest

# Run the SGLang server (bind to 0.0.0.0 so the mapped port is reachable)
docker run --gpus all \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3-8B-Instruct \
    --host 0.0.0.0 \
    --port 30000
```
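Once the container is up, you can poll the server's health endpoint before sending traffic. A minimal stdlib-only sketch (the `/health` route is exposed by recent SGLang releases; treat the path as an assumption if your version differs):

```python
import urllib.request

def server_ready(base: str = "http://localhost:30000") -> bool:
    """Return True if the SGLang server answers its /health endpoint."""
    try:
        with urllib.request.urlopen(f"{base}/health", timeout=2) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, or HTTP error
        return False

if server_ready():
    print("SGLang server is up")
else:
    print("server not reachable yet")
```

Model loading can take a while, so looping on this check with a short sleep is a common pattern before pointing clients at the server.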
## 🔧 Configuration

### Start SGLang Server

```bash
# Basic SGLang server
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3-8B-Instruct \
    --port 30000

# With advanced settings: 2-way tensor parallelism, 90% of GPU memory
# reserved for model weights and the KV-cache pool, and up to 256
# concurrently running requests
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3-8B-Instruct \
    --port 30000 \
    --tp 2 \
    --mem-fraction-static 0.9 \
    --max-running-requests 256
```
### SuperOptiX Configuration

```yaml
spec:
  language_model:
    provider: openai  # SGLang is OpenAI API compatible
    model: meta-llama/Meta-Llama-3-8B-Instruct
    api_base: http://localhost:30000/v1
    api_key: "dummy"  # SGLang doesn't require a real API key
    temperature: 0.7
    max_tokens: 1000
```
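Because the provider is plain `openai`, any OpenAI-style client works against this endpoint. As a stdlib-only sketch of the request shape such a client sends under this config (the port, model name, and dummy key mirror the YAML above; the helper name is ours):

```python
import json
import urllib.request

BASE = "http://localhost:30000/v1"  # api_base from the config above

def build_chat_request(model, prompt, temperature=0.7, max_tokens=1000):
    """Assemble the OpenAI-style chat-completions request SGLang expects."""
    url = f"{BASE}/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer dummy",  # SGLang ignores the key by default
    }
    return url, payload, headers

url, payload, headers = build_chat_request(
    "meta-llama/Meta-Llama-3-8B-Instruct", "Hello!"
)
print(url)  # http://localhost:30000/v1/chat/completions

# To actually send it (requires a running server):
# req = urllib.request.Request(url, json.dumps(payload).encode(), headers)
# print(json.load(urllib.request.urlopen(req)))
```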
## 🎯 Structured Generation

SGLang's killer feature is structured output:

### JSON Schema Enforcement

```python
from sglang import function, gen

@function
def extract_info(s, text):
    s += f"Extract structured info from: {text}\n"
    s += gen("output", max_tokens=200,
             regex=r'\{"name": "[^"]+", "age": \d+, "email": "[^"]+"\}')
    # The regex constraint guarantees well-formed JSON output
```
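A constraint pattern like this can be sanity-checked locally with plain `re` before you serve it; the sample strings below are hypothetical model outputs:

```python
import re

# The same pattern passed to gen(..., regex=...) above
PATTERN = r'\{"name": "[^"]+", "age": \d+, "email": "[^"]+"\}'

sample = '{"name": "Ada Lovelace", "age": 36, "email": "ada@example.com"}'
assert re.fullmatch(PATTERN, sample)                  # full schema matches
assert not re.fullmatch(PATTERN, '{"name": "Ada"}')   # partial JSON rejected
print("pattern accepts only the full schema")
```

If `re.fullmatch` rejects outputs you consider valid (extra whitespace, reordered keys), the constraint will too, so it pays to test edge cases here first.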
### With SuperOptiX

```yaml
spec:
  language_model:
    provider: openai
    model: meta-llama/Meta-Llama-3-8B-Instruct
    api_base: http://localhost:30000/v1
    response_format: json_object  # Structured output
  output_fields:
    - name: response
      type: str
      format: json  # Enforce JSON
```
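The payoff comes downstream: when the server enforces JSON, the `response` field can be parsed directly without defensive retry logic. A small sketch with a hypothetical response string:

```python
import json

# A response produced under the JSON constraint above (hypothetical text)
raw = '{"name": "John Doe", "age": 30, "email": "john@example.com"}'

# With enforced JSON, json.loads cannot fail on malformed output,
# so downstream code can index fields directly.
record = json.loads(raw)
print(record["name"], record["age"])  # John Doe 30
```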
## 📊 vLLM vs SGLang Comparison

| Feature | vLLM | SGLang |
|---------|------|--------|
| Throughput | Excellent | Excellent ✅ |
| Structured Output | Basic | Advanced 🏆 |
| Cache Reuse | Good | RadixAttention 🚀 |
| Multi-Modal | Limited | Vision + Language ✅ |
| Maturity | Production ✅ | Emerging ⚡ |
## 📝 SuperOptiX Integration Example

```yaml
apiVersion: agent/v1
kind: AgentSpec
metadata:
  name: sglang_structured_agent
  id: sglang_structured_agent
  namespace: production
  version: 1.0.0
  level: genies
spec:
  target_framework: dspy
  language_model:
    provider: openai  # SGLang is OpenAI compatible
    model: meta-llama/Meta-Llama-3-8B-Instruct
    api_base: http://localhost:30000/v1
    api_key: "dummy"
    temperature: 0.7
    max_tokens: 2000
    response_format: json_object  # Structured output!
  persona:
    role: Data Extraction Specialist
    goal: Extract structured information accurately
  input_fields:
    - name: text
      type: str
  output_fields:
    - name: response
      type: str
      format: json
  feature_specifications:
    scenarios:
      - name: JSON extraction
        input:
          text: "John Doe is 30 years old, email: john@example.com"
        expected_output:
          response: '{"name": "John Doe", "age": 30}'
        expected_keywords:
          - John Doe
          - "30"
```
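For intuition, an `expected_keywords` check like the one in the scenario above amounts to a substring test over the agent's response. Roughly, in a simplified stand-in (not SuperOptiX's actual evaluator):

```python
def keywords_pass(response: str, expected_keywords: list) -> bool:
    """Scenario-style check: every expected keyword appears in the response."""
    return all(kw in response for kw in expected_keywords)

response = '{"name": "John Doe", "age": 30}'
print(keywords_pass(response, ["John Doe", "30"]))  # True
print(keywords_pass(response, ["Jane Doe"]))        # False
```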
## 🔄 Usage with SuperOptiX CLI

```bash
# Start the SGLang server in the background
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3-8B-Instruct \
    --port 30000 &

# Create an agent with SGLang
super agent pull assistant_openai

# Update the playbook to use the SGLang endpoint
# Edit agents/demo/assistant_openai_playbook.yaml:
#   language_model:
#     provider: openai
#     api_base: http://localhost:30000/v1

# Compile and run
super agent compile assistant_openai
super agent run assistant_openai --goal "Extract info: Alice, age 25, alice@email.com"
```
## 📚 Resources

- **Official Documentation**: https://sglang.readthedocs.io
- **GitHub**: https://github.com/sgl-project/sglang
- **Paper**: "SGLang: Efficient Execution of Structured Language Model Programs"