Dataset Import Guide
Import external datasets for agent training and evaluation, scaling beyond YAML scenarios to thousands of examples.
🎯 Overview
SuperOptiX now supports importing external datasets in addition to BDD scenarios. This allows you to:

- ✅ Use existing datasets (CSV, JSON, Parquet, HuggingFace)
- ✅ Scale to 10,000+ examples (vs. 5-10 YAML scenarios)
- ✅ Leverage standard ML workflows
- ✅ Access HuggingFace's 100,000+ datasets
- ✅ Mix datasets with BDD scenarios for best results
🚀 Quick Start
1. Create a Dataset
# Create CSV file
cat << 'EOF' > data/sentiment_train.csv
text,label
"This is amazing!",positive
"This is terrible",negative
"It's okay",neutral
EOF
2. Configure in Playbook
# agent_playbook.yaml
spec:
  datasets:
    - name: training_data
      source: ./data/sentiment_train.csv
      format: csv
      mapping:
        input: text
        output: label
        input_field_name: text
        output_field_name: sentiment
      limit: 1000
      shuffle: true
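Conceptually, the mapping renames dataset columns into your agent's field names. Here is a minimal pandas sketch of that same transformation (illustrative only, not SuperOptiX's internal loader):

import pandas as pd

df = pd.read_csv("data/sentiment_train.csv")

# mapping.input ("text") feeds input_field_name ("text");
# mapping.output ("label") feeds output_field_name ("sentiment")
examples = [
    {"text": row["text"], "sentiment": row["label"]}
    for _, row in df.iterrows()
]
print(examples[0])  # {'text': 'This is amazing!', 'sentiment': 'positive'}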
3. Preview Dataset
super agent dataset preview my_agent --limit 5
4. Use It
super agent compile my_agent
# → 📊 Loaded 1000 examples from dataset!
super agent evaluate my_agent
# → Uses all 1000 examples!
super agent optimize my_agent --auto light
# → GEPA trains on all 1000 examples!
📋 Supported Formats
CSV Files
datasets:
  - name: csv_training
    source: ./data/train.csv
    format: csv
    mapping:
      input: text_column
      output: label_column
      input_field_name: text
      output_field_name: sentiment
    limit: 5000
    shuffle: true
Requirements: pandas (already installed)
JSON Files
datasets:
  - name: json_training
    source: ./data/train.json
    format: json
    mapping:
      input: question_field
      output: answer_field
JSON Format:
[
  {"question": "What is AI?", "answer": "Artificial Intelligence"},
  {"question": "What is ML?", "answer": "Machine Learning"}
]
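Before wiring the file in, a quick standard-library check confirms it matches this shape (the path is illustrative):

import json

with open("data/train.json") as f:
    records = json.load(f)

# The file must be a top-level array of flat objects
assert isinstance(records, list), "expected a top-level JSON array"
print(f"{len(records)} records; first keys: {list(records[0])}")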
JSONL Files (Recommended for Large Datasets)
datasets:
  - name: jsonl_training
    source: ./data/train.jsonl
    format: jsonl
    mapping:
      input: text
      output: label
    limit: 10000
JSONL Format (one JSON object per line):
{"text": "Great product!", "label": "positive"}
{"text": "Poor quality", "label": "negative"}
{"text": "It's okay", "label": "neutral"}
Parquet Files (Best for Big Data)
datasets:
  - name: parquet_training
    source: ./data/train.parquet
    format: parquet
    mapping:
      input: text
      output: label
Requirements: pandas, pyarrow (already installed)
Benefits:

- Compressed (smaller files)
- Fast loading
- Preserves data types
- Industry standard for big data
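If your data currently lives in a CSV, a one-off pandas conversion produces the Parquet file (pandas and pyarrow assumed available):

import pandas as pd

# Read the CSV once, write a compressed, typed Parquet file
df = pd.read_csv("data/train.csv")
df.to_parquet("data/train.parquet", index=False)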
HuggingFace Datasets 🔥
datasets:
  - name: imdb_sentiment
    source: huggingface:imdb
    format: huggingface
    mapping:
      input: text
      output: label
    split: train
    limit: 10000
    shuffle: true
Popular Datasets:
- huggingface:imdb - Movie reviews (50K)
- huggingface:ag_news - News classification (120K)
- huggingface:sst2 - Sentiment (67K)
- huggingface:squad - Q&A (87K)
- huggingface:glue:sst2 - GLUE with the sst2 subset
- ... 100,000+ more!
Requirements: pip install datasets
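To inspect a HuggingFace dataset before adding it to your playbook, you can load it directly with the datasets library:

from datasets import load_dataset

ds = load_dataset("imdb", split="train")
print(len(ds))  # 25000
print(ds[0])    # e.g. {'text': '...', 'label': 0} (labels are integers here)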
🎯 Advanced Usage
Multi-Column Mapping
datasets:
  - name: qa_dataset
    source: ./data/qa.csv
    format: csv
    mapping:
      input:
        question: question_column
        context: context_column
      output:
        answer: answer_column
        confidence: confidence_column
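Conceptually, each agent field name maps to one dataset column, producing one multi-field example per row. A pandas sketch (illustrative, not SuperOptiX internals):

import pandas as pd

df = pd.read_csv("data/qa.csv")

# Each agent field name maps to one dataset column
examples = [
    {
        "question": row["question_column"],
        "context": row["context_column"],
        "answer": row["answer_column"],
        "confidence": row["confidence_column"],
    }
    for _, row in df.iterrows()
]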
Multiple Datasets
datasets:
  # Training data
  - name: train_set
    source: ./data/train.csv
    format: csv
    mapping: {input: text, output: label}
    limit: 10000

  # Test data
  - name: test_set
    source: ./data/test.csv
    format: csv
    mapping: {input: text, output: label}
    limit: 1000

  # HuggingFace data
  - name: hf_data
    source: huggingface:imdb
    format: huggingface
    mapping: {input: text, output: label}
    split: train
    limit: 5000
Result: 16,000 total examples!
Mix Datasets + BDD Scenarios (Recommended!)
# Bulk training data from datasets
datasets:
  - name: training_data
    source: ./data/large_dataset.csv
    format: csv
    mapping: {input: text, output: label}
    limit: 10000

# Specific edge cases from BDD scenarios
feature_specifications:
  scenarios:
    - name: sarcasm_test
      input: {text: "Oh great, another bug"}
      expected_output: {sentiment: negative}
    - name: mixed_sentiment
      input: {text: "Good product but poor service"}
      expected_output: {sentiment: neutral}
Benefits:

- 10,000 examples for robust training
- Specific edge cases for testing
- Best of both worlds!
🛠️ CLI Commands
Preview Dataset
super agent dataset preview my_agent --limit 10
Output:
Preview: training_data (showing 10 of 5000)
╭──────┬────────────────────────────┬───────────────╮
│  #   │ Input: text                │ Output: label │
├──────┼────────────────────────────┼───────────────┤
│ 1    │ This is amazing!           │ positive      │
│ 2    │ Terrible experience        │ negative      │
│ ...  │ ...                        │ ...           │
╰──────┴────────────────────────────┴───────────────╯
Dataset Info
super agent dataset info my_agent
Output:
╭──────────────────── 📊 training_data ───────────────────────╮
│ Name: training_data                                          │
│ Source: ./data/train.csv                                     │
│ Format: csv                                                  │
│ Total Examples: 5000                                         │
│ Split: train                                                 │
│ Shuffled: True                                               │
│ Input Fields: text                                           │
│ Output Fields: sentiment                                     │
╰──────────────────────────────────────────────────────────────╯
Total Examples Across All Datasets: 5000
Validate Dataset
super agent dataset validate my_agent
Output:
Validating 1 dataset(s)...
✅ training_data: Valid
✅ All datasets valid!
📊 Examples
Example 1: Sentiment Analysis with CSV
apiVersion: agent/v1
kind: AgentSpec
metadata:
  name: Sentiment Analyzer
  id: sentiment_analyzer
spec:
  language_model:
    model: llama3.1:8b

  input_fields:
    - name: text
      type: string

  output_fields:
    - name: sentiment
      type: string

  datasets:
    - name: reviews
      source: ./data/customer_reviews.csv
      format: csv
      mapping:
        input: review_text
        output: sentiment_label
        input_field_name: text
        output_field_name: sentiment
      limit: 5000
      shuffle: true

  tasks:
    - name: classify
      instruction: Classify sentiment as positive, negative, or neutral
Example 2: Q&A with HuggingFace
spec:
  datasets:
    - name: squad_qa
      source: huggingface:squad
      format: huggingface
      mapping:
        input:
          question: question
          context: context
        output:
          answer: answer
      split: train
      limit: 10000
Example 3: Multi-Format Mix
datasets:
  # Main training data (CSV)
  - name: main_training
    source: ./data/train.csv
    format: csv
    mapping: {input: text, output: label}
    limit: 8000

  # Validation data (JSON)
  - name: validation
    source: ./data/val.json
    format: json
    mapping: {input: text, output: label}
    limit: 1000

  # External data (HuggingFace)
  - name: external
    source: huggingface:sst2
    format: huggingface
    mapping: {input: sentence, output: label}
    split: train
    limit: 5000
Total: 14,000 examples from 3 sources!
🎓 Best Practices
1. Start Small, Scale Up
# Development
datasets:
  - source: ./data/train.csv
    limit: 100      # Small for fast iteration

# Production
datasets:
  - source: ./data/train.csv
    limit: 10000    # Full dataset
2. Always Shuffle
datasets:
  - shuffle: true   # Prevents ordering bias
3. Use Limits
datasets:
  - limit: 5000     # Control training time/cost
4. Validate First
super agent dataset validate my_agent
super agent dataset preview my_agent
super agent dataset info my_agent
5. Mix Datasets + BDD
datasets:                  # Bulk data
  - source: ./data/train.csv
    limit: 5000

feature_specifications:    # Edge cases
  scenarios:
    - name: edge_case_1
    - name: edge_case_2
🔧 Troubleshooting
Issue: File Not Found
❌ Failed to load dataset: No such file or directory
Fix: Use an absolute path, or check the relative path from the playbook's location:
# Use an absolute path
source: /full/path/to/data.csv

# Or a path relative to the playbook location
source: ../../data/data.csv
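To see where a relative source actually resolves, a quick pathlib check helps. The playbook path below is hypothetical, and this assumes sources resolve relative to the playbook file, so verify against your project layout:

from pathlib import Path

playbook = Path("agents/my_agent/agent_playbook.yaml")  # hypothetical path
source = "../../data/data.csv"

# Resolve the source relative to the playbook's directory
print((playbook.parent / source).resolve())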
Issue: Column Not Found
❌ Error: Column 'text_column' not found
Fix: Check that your CSV/JSON column names match the mapping:
# Check the column names
head -1 data.csv

# Update the mapping to match
mapping:
  input: actual_column_name   # Must match the CSV header
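pandas can run the same check without loading any rows, which helps with large files:

import pandas as pd

# nrows=0 parses only the header line
print(list(pd.read_csv("data.csv", nrows=0).columns))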
Issue: Import Error
❌ Dataset import feature not available
Fix: Reinstall SuperOptiX
cd /path/to/SuperOptiX
pip install -e .
Issue: HuggingFace Download Slow
# Use limit for a faster download
datasets:
  - source: huggingface:imdb
    limit: 1000   # Downloads only 1000 examples
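If you fetch data with the HuggingFace datasets library yourself, streaming mode avoids materializing the full dataset up front:

from itertools import islice
from datasets import load_dataset

# streaming=True yields examples lazily instead of downloading everything first
stream = load_dataset("imdb", split="train", streaming=True)
sample = list(islice(stream, 1000))   # take just the first 1000 examples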
📚 Complete Example: Production Sentiment Analyzer
apiVersion: agent/v1
kind: AgentSpec
metadata:
  name: Production Sentiment Analyzer
  id: sentiment_prod
  namespace: analysis
  version: 2.0.0
spec:
  language_model:
    provider: ollama
    model: llama3.1:8b
    temperature: 0.3

  input_fields:
    - name: text
      type: string

  output_fields:
    - name: sentiment
      type: string

  # Import 15,000 examples from 3 sources!
  datasets:
    # Custom CSV data (8,000 examples)
    - name: customer_reviews
      source: ./data/reviews_2024.csv
      format: csv
      mapping:
        input: review_text
        output: sentiment_label
        input_field_name: text
        output_field_name: sentiment
      limit: 8000
      shuffle: true

    # Validation data (2,000 examples)
    - name: validation_set
      source: ./data/validation.jsonl
      format: jsonl
      mapping:
        input: text
        output: label
        input_field_name: text
        output_field_name: sentiment
      limit: 2000

    # HuggingFace IMDB dataset (5,000 examples)
    - name: imdb_reviews
      source: huggingface:imdb
      format: huggingface
      mapping:
        input: text
        output: label
        input_field_name: text
        output_field_name: sentiment
      split: train
      limit: 5000

  # Keep BDD scenarios for edge cases
  feature_specifications:
    scenarios:
      - name: sarcasm_detection
        input: {text: "Oh great, another delay"}
        expected_output: {sentiment: negative}
      - name: mixed_sentiment
        input: {text: "Good food but slow service"}
        expected_output: {sentiment: neutral}

  tasks:
    - name: analyze
      instruction: Classify sentiment as positive, negative, or neutral

  optimization:
    optimizer:
      name: GEPA
      params:
        auto: medium   # More data = use a medium budget
        reflection_lm: llama3.1:8b
Result: 15,002 total examples (15,000 from datasets + 2 BDD scenarios)!
🎬 Demo Workflow
# 1. Preview your data
super agent dataset preview sentiment_prod
# 2. Validate configuration
super agent dataset validate sentiment_prod
# 3. See dataset stats
super agent dataset info sentiment_prod
# 4. Compile
super agent compile sentiment_prod
# → "📊 Loaded 15,000 examples from datasets!"
# 5. Evaluate
super agent evaluate sentiment_prod
# → Uses all 15,002 examples
# 6. Optimize
super agent optimize sentiment_prod --auto medium --fresh
# → GEPA trains on 15,000 examples (better results!)
# 7. Re-evaluate
super agent evaluate sentiment_prod
# → See improvement from massive dataset!
💡 Tips & Tricks
Tip 1: Start with HuggingFace
Don't have data? Use HuggingFace!
datasets:
  - name: quick_start
    source: huggingface:imdb
    format: huggingface
    mapping: {input: text, output: label}
    limit: 1000   # Quick download
Browse datasets: https://huggingface.co/datasets
Tip 2: Use JSONL for Large Files
# Bad: a single JSON file (loads the whole file into memory)
format: json

# Good: JSONL (streams line by line)
format: jsonl
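The streaming benefit is easy to see in plain Python: a line-by-line pass keeps memory flat regardless of file size (here counting the label distribution as a bonus sanity check):

import json
from collections import Counter

counts = Counter()
with open("data/train.jsonl") as f:
    for line in f:   # one record in memory at a time
        counts[json.loads(line)["label"]] += 1
print(counts)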
Tip 3: Shuffle for Better Training
datasets:
  - shuffle: true   # ✅ Prevents ordering bias
    limit: 5000
Tip 4: Validate Before Training
# Always validate first!
super agent dataset validate my_agent
# Then preview
super agent dataset preview my_agent
# Then use
super agent compile my_agent
📖 API Reference
Dataset Configuration Schema
datasets:
  - name: string                 # Required
    source: string               # Required (file path or huggingface:name)
    format: enum                 # csv|json|jsonl|parquet|huggingface
    mapping:                     # Required
      input: string|object
      output: string|object
      input_field_name: string   # Optional
      output_field_name: string  # Optional
    split: enum                  # train|test|validation|all
    limit: integer               # Optional (max examples)
    shuffle: boolean             # Optional (default: true)
Mapping Formats
Simple Mapping (single field):
mapping:
  input: column_name
  output: column_name
  input_field_name: text    # Agent field name
  output_field_name: label  # Agent field name
Complex Mapping (multiple fields):
mapping:
  input:
    question: question_col
    context: context_col
  output:
    answer: answer_col
    score: score_col
🎉 Summary
Dataset Import enables:

- ✅ Import from 5 formats (CSV, JSON, JSONL, Parquet, HuggingFace)
- ✅ Scale to 10,000+ examples
- ✅ Use existing datasets
- ✅ Standard ML workflows
- ✅ Better GEPA optimization
- ✅ Mix with BDD scenarios
Get Started:
# Add datasets to your playbook
# Run: super agent dataset preview my_agent
# Compile and use!
Supported Formats: CSV, JSON, JSONL, Parquet, HuggingFace