Creating Gold Example Data
Complete guide to creating high-quality training data for DSPy optimization using DSPy Code.
What is Gold Example Data?
Gold example data (also called training data) consists of input-output pairs that represent the correct behavior of your DSPy program.
Example for sentiment analysis:
{"text": "I love this product!", "sentiment": "positive"}
Why it's called "gold":
- Represents the "gold standard" of correct outputs
- Used to train and optimize your DSPy programs
- Quality of gold data directly impacts optimization results
Data Requirements for GEPA
GEPA (Genetic Pareto) requires:
Minimum:
- 10-20 examples for simple tasks
- 50-100 examples for complex tasks

Recommended:
- 50-200 examples for production use
- Diverse examples covering edge cases
- Balanced across categories (for classification)

Format:
- JSON or JSONL (JSON Lines)
- Consistent field names
- All examples share the same input fields
- Clear, unambiguous outputs
Methods to Create Gold Data
DSPy Code provides three methods:
- AI-Generated - Let AI create synthetic examples
- Interactive - Manually enter examples
- Import - Load from existing files
Method 1: AI-Generated Data (Recommended)
Use the LLM to generate diverse, high-quality examples:
What happens:
- DSPy Code analyzes your task
- Generates diverse input examples
- Creates correct outputs
- Ensures variety and coverage
Output:
Generating 50 diverse examples for sentiment analysis...
✓ Generated 50 examples!
Sample Examples:
Input: "The movie was absolutely fantastic, a real masterpiece!"
Output: "positive"

Input: "I'm not sure how I feel about this, it's neither good nor bad."
Output: "neutral"

Input: "What a terrible experience, I regret every moment."
Output: "negative"
Distribution:
positive: 17 examples (34%)
negative: 16 examples (32%)
neutral: 17 examples (34%)
Next Steps:
• Type /save-data <filename> to save as JSONL
• Ask me to generate more examples
• Use these for GEPA optimization
• Request different types of examples
Advanced generation:
Method 2: Interactive Data Collection
Manually enter examples through guided prompts:
Interactive flow:
Training Data Collection
Let's collect training examples for optimization.
You'll need at least 10 examples.
Example 1:
Enter inputs (or 'done' to finish):
Field name [done]: text
text value: I love this product!
Add another input field? [y/N]: n
Expected output: positive
✓ Example 1 added
Example 2:
Enter inputs (or 'done' to finish):
Field name [done]: text
text value: This is terrible
Add another input field? [y/N]: n
Expected output: negative
✓ Example 2 added
...
Add more examples? (have 10) [Y/n]: n
✓ Collected 10 examples
Training Data Summary:
Total examples: 10
Input fields: text
Sample Examples:
┌──────────────────────┬──────────┐
│ text                 │ Output   │
├──────────────────────┼──────────┤
│ I love this product! │ positive │
│ This is terrible     │ negative │
│ It's okay            │ neutral  │
└──────────────────────┴──────────┘
Save to file? [Y/n]: y
Filename [training_data.jsonl]: sentiment_train.jsonl
✓ Saved to data/sentiment_train.jsonl
Method 3: Import from Files
Load existing data from JSON or JSONL files:
Supported formats:
JSONL (JSON Lines) - Recommended:
{"text": "I love this!", "sentiment": "positive"}
{"text": "Terrible product", "sentiment": "negative"}
{"text": "It's okay", "sentiment": "neutral"}
JSON Array:
[
{"text": "I love this!", "sentiment": "positive"},
{"text": "Terrible product", "sentiment": "negative"},
{"text": "It's okay", "sentiment": "neutral"}
]
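If your data arrives as a JSON array, converting it to JSONL takes only the standard library. A minimal sketch (`json_array_to_jsonl` is a hypothetical helper, not a DSPy Code function):

```python
import json

def json_array_to_jsonl(records):
    """Serialize a list of dicts as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

records = [
    {"text": "I love this!", "sentiment": "positive"},
    {"text": "Terrible product", "sentiment": "negative"},
]
jsonl = json_array_to_jsonl(records)
# Each line round-trips back to the original record
assert [json.loads(line) for line in jsonl.splitlines()] == records
```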
CSV (auto-converted):
Import process:
/data load sentiment_examples.csv
Loading data from sentiment_examples.csv...
✓ Detected CSV format
✓ Converting to JSONL
✓ Loaded 50 examples
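A CSV-to-JSONL conversion like the one above can be reproduced with the standard library; this is a sketch under the assumption that the CSV has a header row, not DSPy Code's actual implementation:

```python
import csv
import io
import json

def csv_to_jsonl(csv_text):
    """Convert CSV text (with a header row) into a JSONL string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(json.dumps(dict(row), ensure_ascii=False) for row in reader)

csv_text = "text,sentiment\nI love this!,positive\nTerrible product,negative\n"
print(csv_to_jsonl(csv_text))
```

Note that `csv.DictReader` reads every value as a string; if an output field should be numeric or a list, convert it before saving.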
Validation:
✓ All examples have consistent fields
✓ No empty values
✓ Field names match: text, sentiment
Data Summary:
Total: 50 examples
Fields: text (input), sentiment (output)
Distribution:
positive: 17 (34%)
negative: 16 (32%)
neutral: 17 (34%)
✓ Data ready for optimization!
Data Quality Guidelines
1. Diversity
Good - Diverse examples:
{"text": "I absolutely love this!", "sentiment": "positive"}
{"text": "Best purchase ever!", "sentiment": "positive"}
{"text": "Exceeded my expectations", "sentiment": "positive"}
{"text": "Terrible quality", "sentiment": "negative"}
{"text": "Very disappointed", "sentiment": "negative"}
{"text": "Waste of money", "sentiment": "negative"}
{"text": "It's okay", "sentiment": "neutral"}
{"text": "Nothing special", "sentiment": "neutral"}
{"text": "Average product", "sentiment": "neutral"}
Bad - Repetitive examples:
{"text": "I love it", "sentiment": "positive"}
{"text": "I love this", "sentiment": "positive"}
{"text": "I love that", "sentiment": "positive"}
2. Balance
Good - Balanced distribution:
Bad - Imbalanced:
3. Clarity
Good - Clear, unambiguous:
{"text": "This product is amazing!", "sentiment": "positive"}
{"text": "Worst purchase ever", "sentiment": "negative"}
Bad - Ambiguous:
{"text": "Well...", "sentiment": "positive"} // Unclear
{"text": "It's something", "sentiment": "neutral"} // Vague
4. Realistic
Good - Real-world examples:
{"text": "The app crashes when I click save. Very frustrating.", "sentiment": "negative"}
{"text": "Fast shipping and great customer service!", "sentiment": "positive"}
Bad - Artificial examples:
{"text": "positive sentiment example", "sentiment": "positive"}
{"text": "test negative", "sentiment": "negative"}
5. Completeness
Good - All fields present:
{
"question": "What is Python?",
"context": "Python is a programming language created by Guido van Rossum.",
"answer": "Python is a programming language"
}
Bad - Missing fields:
{
"question": "What is Python?",
"answer": "Python is a programming language"
// Missing context field!
}
Data Validation
Automatic Validation
DSPy Code automatically validates your data:
Checks performed:
- Consistent fields - All examples have same inputs
- No empty values - All fields have content
- Correct types - Values match expected types
- Sufficient quantity - Enough examples for optimization
- Distribution - Balanced across categories
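The first two checks are easy to reproduce yourself before handing data to the tool. A sketch assuming records are plain dicts (`validate_examples` is a hypothetical helper):

```python
def validate_examples(examples):
    """Report field-consistency and empty-value issues in a list of dict records."""
    if not examples:
        return ["no examples"]
    issues = []
    fields = set(examples[0])
    for i, ex in enumerate(examples):
        if set(ex) != fields:
            issues.append(f"example {i}: fields {sorted(ex)} != {sorted(fields)}")
        for key, value in ex.items():
            if value is None or (isinstance(value, str) and not value.strip()):
                issues.append(f"example {i}: empty value for '{key}'")
    return issues

good = [{"text": "hi", "sentiment": "positive"}]
bad = [{"text": "hi", "sentiment": "positive"}, {"text": ""}]
assert validate_examples(good) == []
assert validate_examples(bad)  # inconsistent fields plus an empty value
```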
Validation report:
Data Validation Report:
✓ Field Consistency
  All 50 examples have fields: text, sentiment
✓ No Empty Values
  All fields contain data
✓ Type Checking
  text: string (50/50)
  sentiment: string (50/50)
✓ Quantity
  50 examples (minimum 10 required)
⚠ Distribution
  positive: 30 (60%) ⚠ Overrepresented
  negative: 15 (30%)
  neutral: 5 (10%) ⚠ Underrepresented
Recommendation: Add more neutral and negative examples
Quality Score: 85/100
Issues: 1 warning
Errors: 0
Manual Validation
Review examples manually:
Display options:
/data show --limit 10 # Show first 10
/data show --random 5 # Show 5 random
/data show --filter positive # Show only positive
/data show --stats # Show statistics
Data Augmentation
Expand Existing Data
Generate variations of existing examples:
What happens:
- Analyzes existing examples
- Identifies patterns
- Generates similar but different examples
- Maintains distribution
Example:
Original:
Augmented:
{"text": "This product is fantastic!", "sentiment": "positive"}
{"text": "Absolutely love it!", "sentiment": "positive"}
{"text": "Best product I've bought!", "sentiment": "positive"}
Paraphrase Examples
Create paraphrases for more variety:
Add Edge Cases
Request specific edge cases:
Generate 20 edge case examples for sentiment analysis including sarcasm, mixed emotions, and ambiguous statements
Generated:
{"text": "Oh great, another bug. Just what I needed.", "sentiment": "negative"}
{"text": "It's good but could be better", "sentiment": "neutral"}
{"text": "I hate to love this", "sentiment": "positive"}
Data Organization
File Naming
By task:
data/
├── sentiment_train.jsonl
├── qa_train.jsonl
└── summarization_train.jsonl
By version:
data/
├── sentiment_v1_train.jsonl
├── sentiment_v2_train.jsonl
└── sentiment_v3_train.jsonl
By source:
data/
├── sentiment_ai_generated.jsonl
├── sentiment_manual.jsonl
└── sentiment_real_users.jsonl
Train/Val/Test Split
Recommended split:
- Training: 70-80%
- Validation: 10-15%
- Test: 10-15%
Split existing data:
/data split sentiment_all.jsonl
Output:
Splitting sentiment_all.jsonl...
✓ Created sentiment_train.jsonl (70 examples, 70%)
✓ Created sentiment_val.jsonl (15 examples, 15%)
✓ Created sentiment_test.jsonl (15 examples, 15%)
Total: 100 examples
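If you'd rather do the split in plain Python, a shuffled 70/15/15 split looks like this (a sketch; the actual `/data split` behavior may differ):

```python
import random

def split_examples(examples, train=0.7, val=0.15, seed=0):
    """Shuffle and split into (train, val, test); test gets the remainder."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train, val, test = split_examples(list(range(100)))
assert (len(train), len(val), len(test)) == (70, 15, 15)
```

For classification data, a stratified split (splitting within each label separately) keeps the label distribution consistent across the three sets.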
Merge Data Files
Combine multiple data files:
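Merging JSONL data is essentially concatenation plus de-duplication; a sketch assuming exact-duplicate records should be dropped (`merge_jsonl` is a hypothetical helper, not the `/data merge` implementation):

```python
import json

def merge_jsonl(jsonl_texts):
    """Merge several JSONL strings, dropping exact-duplicate lines."""
    seen, merged = set(), []
    for text in jsonl_texts:
        for line in text.splitlines():
            if line and line not in seen:
                seen.add(line)
                merged.append(json.loads(line))
    return merged

a = '{"text": "hi", "sentiment": "positive"}'
b = '{"text": "hi", "sentiment": "positive"}\n{"text": "meh", "sentiment": "neutral"}'
assert len(merge_jsonl([a, b])) == 2  # the duplicate record is dropped
```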
Data for Different Tasks
Classification
Structure:
Example - Sentiment:
Example - Email:
Question Answering
Structure:
Example:
{
"question": "What is the capital of France?",
"context": "France is a country in Europe. Its capital is Paris.",
"answer": "Paris"
}
Text Generation
Structure:
Example - Summarization:
Extraction
Structure:
Example - Named Entity Recognition:
{
"text": "Apple Inc. was founded by Steve Jobs in California.",
"entities": ["Apple Inc.", "Steve Jobs", "California"]
}
RAG (Retrieval-Augmented Generation)
Structure:
{
"query": "user question",
"retrieved_context": "relevant documents",
"answer": "generated answer"
}
Example:
{
"query": "How do I install DSPy?",
"retrieved_context": "DSPy can be installed using pip: pip install dspy",
"answer": "You can install DSPy by running 'pip install dspy' in your terminal."
}
Using Gold Data for Optimization
Save Data
/save-data sentiment_train.jsonl
Saved to: data/sentiment_train.jsonl
Load Data for Optimization
Specify Data in GEPA Script
Generated GEPA scripts include data loading:
import json
import dspy

def load_training_data(filepath):
    """Load training examples from JSONL."""
    examples = []
    with open(filepath, 'r') as f:
        for line in f:
            data = json.loads(line)
            example = dspy.Example(**data).with_inputs('text')
            examples.append(example)
    return examples

# Load data
trainset = load_training_data('data/sentiment_train.jsonl')
Data Best Practices
1. Start with AI Generation
Quickest way to get started:
2. Review and Refine
Check generated examples:
Remove bad examples, add edge cases.
3. Augment with Real Data
Add real user data when available:
4. Validate Before Optimization
Always validate:
Fix issues before running GEPA.
5. Keep Multiple Versions
Save versions as you improve:
data/
├── sentiment_v1.jsonl   # Initial AI-generated
├── sentiment_v2.jsonl   # After manual review
└── sentiment_v3.jsonl   # With real user data
6. Document Your Data
Add metadata file:
# data/sentiment_metadata.yaml
dataset: sentiment_v3
created: 2025-01-15
total_examples: 150
source: AI-generated + manual review + user feedback
distribution:
positive: 50
negative: 50
neutral: 50
quality_score: 95
notes: |
- Includes edge cases for sarcasm
- Balanced across all categories
- Validated for consistency
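The `total_examples` and `distribution` fields of such a metadata file can be computed directly from the dataset. A sketch assuming one label field named `sentiment` (`dataset_stats` is a hypothetical helper):

```python
import json
from collections import Counter

def dataset_stats(jsonl_text, label_field="sentiment"):
    """Compute the example count and label distribution of a JSONL string."""
    records = [json.loads(line) for line in jsonl_text.splitlines() if line]
    return {"total_examples": len(records),
            "distribution": dict(Counter(r[label_field] for r in records))}

data = "\n".join(['{"text": "a", "sentiment": "positive"}',
                  '{"text": "b", "sentiment": "negative"}',
                  '{"text": "c", "sentiment": "positive"}'])
assert dataset_stats(data) == {"total_examples": 3,
                               "distribution": {"positive": 2, "negative": 1}}
```

Regenerating these numbers each time you save a new version keeps the metadata file from drifting out of sync with the data.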
Troubleshooting
Not Enough Examples
Solution:
Imbalanced Data
Solution:
Inconsistent Fields
Solution:
Review and fix manually, or regenerate:
Empty Values
Solution:
Remove or fix the example:
Or regenerate:
Summary
Creating gold example data:
- ✓ AI-generated (fastest)
- ✓ Interactive collection (most control)
- ✓ Import from files (use existing data)
- ✓ Validation and quality checks
- ✓ Data augmentation
- ✓ Train/val/test splits
- ✓ Task-specific formats
Key commands:
- Generate N examples for [task] - AI generation
- /data collect - Interactive collection
- /data load <file> - Import data
- /data validate - Check quality
- /save-data <file> - Save data
- /data split - Create train/val/test
- /data merge - Combine datasets
Best practices:
- Start with 50-100 examples
- Ensure diversity and balance
- Use clear, realistic examples
- Validate before optimization
- Keep multiple versions
- Document your data