Dataset Import Guide

Import external datasets for agent training and evaluation, scaling beyond YAML scenarios to thousands of examples.


🎯 Overview

SuperOptiX now supports importing external datasets in addition to BDD scenarios. This allows you to:

- Use existing datasets (CSV, JSON, Parquet, HuggingFace)
- Scale to 10,000+ examples (vs. 5-10 YAML scenarios)
- Leverage standard ML workflows
- Access HuggingFace's 100,000+ datasets
- Mix datasets with BDD scenarios for best results


🚀 Quick Start

1. Create a Dataset

# Create CSV file
cat << 'EOF' > data/sentiment_train.csv
text,label
"This is amazing!",positive
"This is terrible",negative
"It's okay",neutral
EOF

2. Configure in Playbook

# agent_playbook.yaml
spec:
  datasets:
    - name: training_data
      source: ./data/sentiment_train.csv
      format: csv
      mapping:
        input: text
        output: label
        input_field_name: text
        output_field_name: sentiment
      limit: 1000
      shuffle: true
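
Conceptually, this config amounts to the pandas snippet below. This is an illustrative sketch of the mapping semantics only, not the actual SuperOptiX loader:

import pandas as pd

# Load the CSV and rename source columns to the agent's field names
# (mapping plus input_field_name/output_field_name from the config above)
df = pd.read_csv("./data/sentiment_train.csv")
df = df.rename(columns={"label": "sentiment"})

# shuffle: true, limit: 1000
df = df.sample(frac=1, random_state=0).head(1000)

# Each row becomes one training example
examples = df[["text", "sentiment"]].to_dict(orient="records")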

3. Preview Dataset

super agent dataset preview my_agent --limit 5

4. Use It

super agent compile my_agent
# → 📊 Loaded 1000 examples from dataset!

super agent evaluate my_agent
# → Uses all 1000 examples!

super agent optimize my_agent --auto light
# → GEPA trains on all 1000 examples!

📋 Supported Formats

CSV Files

datasets:
  - name: csv_training
    source: ./data/train.csv
    format: csv
    mapping:
      input: text_column
      output: label_column
      input_field_name: text
      output_field_name: sentiment
    limit: 5000
    shuffle: true

Requirements: pandas (already installed)


JSON Files

datasets:
  - name: json_training
    source: ./data/train.json
    format: json
    mapping:
      input: question_field
      output: answer_field

JSON Format:

[
  {"question": "What is AI?", "answer": "Artificial Intelligence"},
  {"question": "What is ML?", "answer": "Machine Learning"}
]


JSONL Files

datasets:
  - name: jsonl_training
    source: ./data/train.jsonl
    format: jsonl
    mapping:
      input: text
      output: label
    limit: 10000

JSONL Format (one JSON object per line):

{"text": "Great product!", "label": "positive"}
{"text": "Poor quality", "label": "negative"}
{"text": "It's okay", "label": "neutral"}


Parquet Files (Best for Big Data)

datasets:
  - name: parquet_training
    source: ./data/train.parquet
    format: parquet
    mapping:
      input: text
      output: label

Requirements: pandas, pyarrow (already installed)

Benefits:

- Compressed (smaller files)
- Fast loading
- Preserves data types
- Industry standard for big data
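
To sanity-check a Parquet file before wiring it into a playbook, a quick pandas read works (illustrative; pyarrow must be installed):

import pandas as pd

# Parquet preserves dtypes and is compact on disk
df = pd.read_parquet("./data/train.parquet")
print(df.dtypes)
print(df[["text", "label"]].head())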


HuggingFace Datasets 🔥

datasets:
  - name: imdb_sentiment
    source: huggingface:imdb
    format: huggingface
    mapping:
      input: text
      output: label
    split: train
    limit: 10000
    shuffle: true

Popular Datasets:

- huggingface:imdb - Movie reviews (50K)
- huggingface:ag_news - News classification (120K)
- huggingface:sst2 - Sentiment (67K)
- huggingface:squad - Q&A (87K)
- huggingface:glue:sst2 - GLUE with the sst2 subset
- ...and 100,000+ more!

Requirements: pip install datasets
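
The huggingface: source roughly corresponds to the datasets library's load_dataset call. How SuperOptiX invokes it internally is an assumption; this sketch shows the equivalent direct usage:

from datasets import load_dataset

# Split slicing keeps only the first 10,000 examples of the train split
ds = load_dataset("imdb", split="train[:10000]")
ds = ds.shuffle(seed=0)
print(ds[0])  # e.g. {"text": "...", "label": 0}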


🎯 Advanced Usage

Multi-Column Mapping

datasets:
  - name: qa_dataset
    source: ./data/qa.csv
    format: csv
    mapping:
      input:
        question: question_column
        context: context_column
      output:
        answer: answer_column
        confidence: confidence_column
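
With a multi-column mapping, each agent field is filled from its mapped source column. Conceptually it works like the sketch below (column names are taken from the config above; the loader logic itself is an assumption):

import pandas as pd

df = pd.read_csv("./data/qa.csv")

# Every row yields one example with two inputs and two outputs
examples = [
    {
        "question": row["question_column"],
        "context": row["context_column"],
        "answer": row["answer_column"],
        "confidence": row["confidence_column"],
    }
    for _, row in df.iterrows()
]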

Multiple Datasets

datasets:
  # Training data
  - name: train_set
    source: ./data/train.csv
    format: csv
    mapping: {input: text, output: label}
    limit: 10000

  # Test data
  - name: test_set
    source: ./data/test.csv
    format: csv
    mapping: {input: text, output: label}
    limit: 1000

  # HuggingFace data
  - name: hf_data
    source: huggingface:imdb
    format: huggingface
    mapping: {input: text, output: label}
    split: train
    limit: 5000

Result: 16,000 total examples!


Mix Datasets with BDD Scenarios

# Bulk training data from datasets
datasets:
  - name: training_data
    source: ./data/large_dataset.csv
    format: csv
    mapping: {input: text, output: label}
    limit: 10000

# Specific edge cases from BDD scenarios
feature_specifications:
  scenarios:
  - name: sarcasm_test
    input: {text: "Oh great, another bug"}
    expected_output: {sentiment: negative}

  - name: mixed_sentiment
    input: {text: "Good product but poor service"}
    expected_output: {sentiment: neutral}

Benefits:

- 10,000 examples for robust training
- Specific edge cases for testing
- Best of both worlds!


🛠️ CLI Commands

Preview Dataset

super agent dataset preview my_agent --limit 10

Output:

Preview: training_data (showing 10 of 5000)
╭──────┬────────────────────────────┬───────────────╮
│ #    │ Input: text                │ Output: label │
├──────┼────────────────────────────┼───────────────┤
│ 1    │ This is amazing!           │ positive      │
│ 2    │ Terrible experience        │ negative      │
│ ...  │ ...                        │ ...           │
╰──────┴────────────────────────────┴───────────────╯


Dataset Info

super agent dataset info my_agent

Output:

╭──────────────────── 📊 training_data ─────────────────────╮
│ Name: training_data                                      │
│ Source: ./data/train.csv                                 │
│ Format: csv                                              │
│ Total Examples: 5000                                     │
│ Split: train                                             │
│ Shuffled: True                                           │
│ Input Fields: text                                       │
│ Output Fields: sentiment                                 │
╰──────────────────────────────────────────────────────────╯

Total Examples Across All Datasets: 5000


Validate Dataset

super agent dataset validate my_agent

Output:

Validating 1 dataset(s)...
training_data: Valid
All datasets valid!


📊 Examples

Example 1: Sentiment Analysis with CSV

apiVersion: agent/v1
kind: AgentSpec
metadata:
  name: Sentiment Analyzer
  id: sentiment_analyzer
spec:
  language_model:
    model: llama3.1:8b

  input_fields:
  - name: text
    type: string

  output_fields:
  - name: sentiment
    type: string

  datasets:
  - name: reviews
    source: ./data/customer_reviews.csv
    format: csv
    mapping:
      input: review_text
      output: sentiment_label
      input_field_name: text
      output_field_name: sentiment
    limit: 5000
    shuffle: true

  tasks:
  - name: classify
    instruction: Classify sentiment as positive, negative, or neutral

Example 2: Q&A with HuggingFace

spec:
  datasets:
  - name: squad_qa
    source: huggingface:squad
    format: huggingface
    mapping:
      input:
        question: question
        context: context
      output:
        answer: answer
    split: train
    limit: 10000

Example 3: Multi-Format Mix

datasets:
  # Main training data (CSV)
  - name: main_training
    source: ./data/train.csv
    format: csv
    mapping: {input: text, output: label}
    limit: 8000

  # Validation data (JSON)
  - name: validation
    source: ./data/val.json
    format: json
    mapping: {input: text, output: label}
    limit: 1000

  # External data (HuggingFace)
  - name: external
    source: huggingface:sst2
    format: huggingface
    mapping: {input: sentence, output: label}
    split: train
    limit: 5000

Total: 14,000 examples from 3 sources!


🎓 Best Practices

1. Start Small, Scale Up

# Development
datasets:
  - source: ./data/train.csv
    limit: 100  # Small for fast iteration

# Production
datasets:
  - source: ./data/train.csv
    limit: 10000  # Full dataset

2. Always Shuffle

datasets:
  - shuffle: true  # Prevents ordering bias

3. Use Limits

datasets:
  - limit: 5000  # Control training time/cost

4. Validate First

super agent dataset validate my_agent
super agent dataset preview my_agent
super agent dataset info my_agent

5. Mix Datasets + BDD

datasets:        # Bulk data
  - source: ./data/train.csv
    limit: 5000

feature_specifications:  # Edge cases
  scenarios:
  - name: edge_case_1
  - name: edge_case_2

🔧 Troubleshooting

Issue: File Not Found

Failed to load dataset: No such file or directory

Fix: Use an absolute path, or check the relative path from the playbook location

# Use absolute path
source: /full/path/to/data.csv

# Or relative from playbook location
source: ../../data/data.csv

Issue: Column Not Found

Error: Column 'text_column' not found

Fix: Check that your CSV/JSON column names match the mapping

# Check column names
head -1 data.csv

# Update mapping to match
mapping:
  input: actual_column_name  # Must match CSV header

Issue: Import Error

Dataset import feature not available

Fix: Reinstall SuperOptiX

cd /path/to/SuperOptiX
pip install -e .

Issue: HuggingFace Download Slow

# Use limit for faster download
datasets:
  - source: huggingface:imdb
    limit: 1000  # Downloads only 1000 examples
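
If you are pulling data directly with the datasets library, streaming mode avoids downloading the full archive at all. Whether SuperOptiX streams internally is not documented here, so treat this as a general-purpose sketch:

from datasets import load_dataset
from itertools import islice

# streaming=True yields examples lazily instead of downloading everything
stream = load_dataset("imdb", split="train", streaming=True)
examples = list(islice(stream, 1000))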

📚 Complete Example: Production Sentiment Analyzer

apiVersion: agent/v1
kind: AgentSpec
metadata:
  name: Production Sentiment Analyzer
  id: sentiment_prod
  namespace: analysis
  version: 2.0.0
spec:
  language_model:
    provider: ollama
    model: llama3.1:8b
    temperature: 0.3

  input_fields:
  - name: text
    type: string

  output_fields:
  - name: sentiment
    type: string

  # Import 15,000 examples from 3 sources!
  datasets:
  # Custom CSV data (8,000 examples)
  - name: customer_reviews
    source: ./data/reviews_2024.csv
    format: csv
    mapping:
      input: review_text
      output: sentiment_label
      input_field_name: text
      output_field_name: sentiment
    limit: 8000
    shuffle: true

  # Validation data (2,000 examples)
  - name: validation_set
    source: ./data/validation.jsonl
    format: jsonl
    mapping:
      input: text
      output: label
      input_field_name: text
      output_field_name: sentiment
    limit: 2000

  # HuggingFace IMDB dataset (5,000 examples)
  - name: imdb_reviews
    source: huggingface:imdb
    format: huggingface
    mapping:
      input: text
      output: label
      input_field_name: text
      output_field_name: sentiment
    split: train
    limit: 5000

  # Keep BDD scenarios for edge cases
  feature_specifications:
    scenarios:
    - name: sarcasm_detection
      input: {text: "Oh great, another delay"}
      expected_output: {sentiment: negative}

    - name: mixed_sentiment
      input: {text: "Good food but slow service"}
      expected_output: {sentiment: neutral}

  tasks:
  - name: analyze
    instruction: Classify sentiment as positive, negative, or neutral

  optimization:
    optimizer:
      name: GEPA
      params:
        auto: medium  # More data = use medium budget
        reflection_lm: llama3.1:8b

Result: 15,002 total examples (15,000 from datasets + 2 BDD scenarios)!


🎬 Demo Workflow

# Preview your data
super agent dataset preview sentiment_prod

# Validate configuration
super agent dataset validate sentiment_prod

# See dataset stats
super agent dataset info sentiment_prod

# Compile
super agent compile sentiment_prod
# → "📊 Loaded 15,000 examples from datasets!"

# Evaluate
super agent evaluate sentiment_prod
# → Uses all 15,002 examples

# Optimize
super agent optimize sentiment_prod --auto medium --fresh
# → GEPA trains on 15,000 examples (better results!)

# Re-evaluate
super agent evaluate sentiment_prod
# → See improvement from massive dataset!

💡 Tips & Tricks

Tip 1: Start with HuggingFace

Don't have data? Use HuggingFace!

datasets:
  - name: quick_start
    source: huggingface:imdb
    format: huggingface
    mapping: {input: text, output: label}
    limit: 1000  # Quick download

Browse datasets: https://huggingface.co/datasets


Tip 2: Use JSONL for Large Files

# Bad: Single JSON file (loads all in memory)
format: json

# Good: JSONL (streams line by line)
format: jsonl

Tip 3: Shuffle for Better Training

datasets:
  - shuffle: true  # Prevents ordering bias
    limit: 5000

Tip 4: Validate Before Training

# Always validate first!
super agent dataset validate my_agent

# Then preview
super agent dataset preview my_agent

# Then use
super agent compile my_agent

📖 API Reference

Dataset Configuration Schema

datasets:
  - name: string              # Required
    source: string            # Required (file path or huggingface:name)
    format: enum              # csv|json|jsonl|parquet|huggingface
    mapping:                  # Required
      input: string|object
      output: string|object
      input_field_name: string   # Optional
      output_field_name: string  # Optional
    split: enum               # train|test|validation|all
    limit: integer            # Optional (max examples)
    shuffle: boolean          # Optional (default: true)
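
If you generate playbooks programmatically, it can help to model this schema in code. A hypothetical Python equivalent (field names mirror the schema above; this class is not part of SuperOptiX):

from dataclasses import dataclass
from typing import Optional

@dataclass
class DatasetConfig:
    name: str                    # Required
    source: str                  # File path or "huggingface:name"
    format: str                  # csv | json | jsonl | parquet | huggingface
    mapping: dict                # {input: ..., output: ...}, simple or nested
    split: str = "train"         # train | test | validation | all
    limit: Optional[int] = None  # Max examples to load
    shuffle: bool = True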

Mapping Formats

Simple Mapping (single field):

mapping:
  input: column_name
  output: column_name
  input_field_name: text    # Agent field name
  output_field_name: label  # Agent field name

Complex Mapping (multiple fields):

mapping:
  input:
    question: question_col
    context: context_col
  output:
    answer: answer_col
    score: score_col


🎉 Summary

Dataset Import enables:

- Import from 5 formats (CSV, JSON, JSONL, Parquet, HuggingFace)
- Scale to 10,000+ examples
- Use existing datasets
- Standard ML workflows
- Better GEPA optimization
- Mix with BDD scenarios

Get Started:

# Add datasets to your playbook
# Run: super agent dataset preview my_agent
# Compile and use!


Supported Formats: CSV, JSON, JSONL, Parquet, HuggingFace