Benchmarking¶
CodexOpt computes a 0.0 to 1.0 score per file and records detailed feedback.
What Gets Scored¶
AGENTS.md¶
Signals include:
- too short or too long
- token heaviness
- empty files
- contradictory guidance
- duplicate lines
- missing workflow guidance
- missing verification guidance
- missing output-format guidance
- weak repo or task alignment
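A few of the simpler signals above (empty files, length, duplicate lines) can be approximated with plain string checks. The sketch below is illustrative only: the function name `agents_md_signals` and the thresholds `MIN_WORDS` and `MAX_WORDS` are assumptions made for this example, not CodexOpt's actual implementation or limits.

```python
from collections import Counter

# Hypothetical thresholds, chosen only for illustration.
MIN_WORDS = 50
MAX_WORDS = 2000


def agents_md_signals(text: str) -> list:
    """Return the names of the simple signals triggered by an AGENTS.md body."""
    if not text.strip():
        return ["empty file"]

    signals = []
    word_count = len(text.split())
    if word_count < MIN_WORDS:
        signals.append("too short")
    elif word_count > MAX_WORDS:
        signals.append("too long")

    # Compare non-blank lines after trimming whitespace; any repeat counts.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if any(count > 1 for count in Counter(lines).values()):
        signals.append("duplicate lines")

    return signals
```

Signals such as contradictory guidance or weak repo alignment need more than string matching and are left out of this sketch.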
SKILL.md¶
Signals include:
- missing frontmatter
- missing name or description
- overly long frontmatter fields
- weak trigger clarity
- weak workflow guidance
- weak verification guidance
- weak repo or task alignment
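The frontmatter-related signals can be sketched with the standard library alone. This is a minimal illustration, assuming frontmatter is a leading block delimited by `---` lines containing simple `key: value` pairs; the helper names and the `MAX_FIELD_CHARS` limit are hypothetical and not taken from CodexOpt.

```python
from typing import Dict, List, Optional

MAX_FIELD_CHARS = 200  # hypothetical limit, for illustration only


def parse_frontmatter(text: str) -> Optional[Dict[str, str]]:
    """Parse a leading '---' delimited block of simple 'key: value' lines.

    Returns None when the block is absent or never closed.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    fields: Dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return None  # opening delimiter without a closing one


def skill_md_signals(text: str) -> List[str]:
    """Return the frontmatter signals triggered by a SKILL.md body."""
    fields = parse_frontmatter(text)
    if fields is None:
        return ["missing frontmatter"]
    signals = []
    for required in ("name", "description"):
        if not fields.get(required):
            signals.append(f"missing {required}")
    if any(len(value) > MAX_FIELD_CHARS for value in fields.values()):
        signals.append("overly long frontmatter fields")
    return signals
```

A real checker would likely use a YAML parser; the line-based parsing here keeps the example dependency-free.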
Evidence-aware Feedback¶
If task_files or issue_files are configured, benchmark output includes:
- task file count
- issue file count
- criterion-level sub-scores
- natural-language feedback
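One way to picture these fields together is as a small record type. The sketch below is an assumption made for illustration: the class name `BenchmarkResult`, its attribute names, and the unweighted mean used in `overall_score` are all hypothetical, and CodexOpt may structure and aggregate its output differently.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BenchmarkResult:
    """Hypothetical container mirroring the fields listed above."""

    task_file_count: int
    issue_file_count: int
    sub_scores: Dict[str, float] = field(default_factory=dict)
    feedback: List[str] = field(default_factory=list)

    def overall_score(self) -> float:
        """Unweighted mean of the sub-scores, clamped to [0.0, 1.0]."""
        if not self.sub_scores:
            return 0.0
        mean = sum(self.sub_scores.values()) / len(self.sub_scores)
        return max(0.0, min(1.0, mean))


result = BenchmarkResult(
    task_file_count=2,
    issue_file_count=1,
    sub_scores={"workflow": 1.0, "verification": 0.5},
    feedback=["Verification steps do not mention the test command."],
)
```

The criterion names and feedback string in the usage lines are made up for the example.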
Example¶
Typical output: