Evaluation Engine

The evaluation engine provides a framework for testing SuperQode's QE capabilities through structured scenarios and behavior verification. It is used for benchmarking, regression testing, and validating the effectiveness of QE roles.


Overview

The evaluation engine (evaluation/) enables systematic testing of:

  • QE role effectiveness
  • Finding accuracy
  • Fix quality
  • False positive rates
  • Performance benchmarks

┌──────────────────────────────────────────────────────────────┐
│                      EVALUATION ENGINE                       │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  Scenarios                                                   │
│       │                                                      │
│       ▼                                                      │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐       │
│  │  Behaviors  │ ── │   Engine    │ ── │  Adapters   │       │
│  │ (Expected)  │    │  (Runner)   │    │  (Output)   │       │
│  └─────────────┘    └─────────────┘    └─────────────┘       │
│                            │                                 │
│                            ▼                                 │
│                     ┌─────────────┐                          │
│                     │   Results   │                          │
│                     │  (Metrics)  │                          │
│                     └─────────────┘                          │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Components

Scenarios

Scenarios (evaluation/scenarios.py) define test cases:

@dataclass
class Scenario:
    id: str                      # Unique identifier
    name: str                    # Human-readable name
    description: str             # What this tests
    setup: SetupConfig           # How to prepare
    target_path: str             # What to analyze
    roles: list[str]             # Roles to run
    expected_behaviors: list[Behavior]  # What should happen
    timeout: int                 # Max duration in seconds

Example Scenario:

scenarios:
  - id: sql-injection-basic
    name: "SQL Injection Detection"
    description: "Verify security_tester finds SQL injection"
    setup:
      template: vulnerable-flask-app
      inject_vulnerabilities:
        - type: sql_injection
          location: src/api.py
          line: 45
    target_path: src/
    roles:
      - security_tester
    expected_behaviors:
      - type: finding
        severity: critical
        pattern: "SQL injection"
        file: src/api.py
    timeout: 120
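
Scenario files are plain YAML, so loading them is straightforward. A minimal sketch using PyYAML (the file path is hypothetical, and the engine's own loader may differ):

import yaml
from pathlib import Path

# Parse a scenarios file; each entry mirrors the Scenario fields above.
data = yaml.safe_load(Path(".superqode/eval-scenarios/security.yaml").read_text())
for raw in data["scenarios"]:
    print(raw["id"], "->", raw["roles"])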

Behaviors

Behaviors (evaluation/behaviors.py) define expected outcomes:

Behavior Type     Description                                Validation
Finding           Should detect specific issue               Pattern match in findings
No Finding        Should NOT detect (false positive test)    Absence check
Fix               Should suggest valid fix                   Fix verification
Test Generation   Should generate test                       File existence
Performance       Should complete within time                Duration check

@dataclass
class FindingBehavior(Behavior):
    severity: str              # critical, high, medium, low
    pattern: str               # Regex to match in the finding
    file: str | None = None    # Expected file
    line: int | None = None    # Expected line
    type: str = "finding"      # Behavior discriminator

@dataclass
class PerformanceBehavior(Behavior):
    max_duration_seconds: int
    max_token_usage: int | None = None
    type: str = "performance"
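
Verifying a finding behavior reduces to a filtered pattern match over the session's findings. A minimal sketch, assuming findings are dicts with severity, file, and message keys (the real field names may differ):

import re

def finding_matches(behavior: FindingBehavior, findings: list[dict]) -> bool:
    # True if any finding has the expected severity, file, and message pattern.
    for f in findings:
        if f["severity"] != behavior.severity:
            continue
        if behavior.file is not None and f.get("file") != behavior.file:
            continue
        if re.search(behavior.pattern, f["message"]):
            return True
    return False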

Engine

The engine (evaluation/engine.py) runs evaluations:

class EvaluationEngine:
    async def run_scenario(self, scenario: Scenario) -> ScenarioResult:
        # Setup test environment
        workspace = await self.setup_workspace(scenario.setup)

        # Run QE session
        qe_result = await self.run_qe(
            workspace=workspace,
            roles=scenario.roles,
            timeout=scenario.timeout
        )

        # Verify behaviors
        behavior_results = []
        for behavior in scenario.expected_behaviors:
            result = await self.verify_behavior(behavior, qe_result)
            behavior_results.append(result)

        # Cleanup
        await workspace.cleanup()

        return ScenarioResult(
            scenario_id=scenario.id,
            passed=all(r.passed for r in behavior_results),
            behavior_results=behavior_results,
            qe_result=qe_result
        )
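
End to end, a run is one awaited call per scenario. A minimal driver sketch (the constructor arguments and the load_scenario helper are assumptions, not the documented API):

import asyncio

async def main() -> None:
    engine = EvaluationEngine()                      # constructor args assumed
    scenario = load_scenario("sql-injection-basic")  # hypothetical loader
    result = await engine.run_scenario(scenario)
    print(f"{result.scenario_id}: {'PASS' if result.passed else 'FAIL'}")

asyncio.run(main())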

Adapters

Adapters (evaluation/adapters.py) format output for different systems:

Adapter     Output Format      Use Case
JSON        Raw JSON           Programmatic access
JUnit XML   JUnit format       CI/CD integration
Markdown    Readable report    Documentation
Console     Colored terminal   Local development
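
The JSON adapter, for example, only needs to serialize the result objects. A minimal sketch, assuming ScenarioResult is a dataclass so dataclasses.asdict can do the work (the class name follows the Extending section below; the real implementation may differ):

import json
from dataclasses import asdict

class JSONAdapter(OutputAdapter):
    def format(self, results: list[ScenarioResult]) -> str:
        # Recursively convert each dataclass result to a dict, then dump.
        return json.dumps([asdict(r) for r in results], indent=2)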

Running Evaluations

Run All Scenarios

superqode eval run --all

Run Specific Scenario

superqode eval run --scenario sql-injection-basic

Run Category

superqode eval run --category security

Output Formats

# Console output (default)
superqode eval run --all

# JSON output
superqode eval run --all --format json > results.json

# JUnit XML (for CI)
superqode eval run --all --format junit > results.xml

Scenario Templates

Pre-built vulnerable code templates for testing:

Template               Vulnerabilities                      Roles Tested
vulnerable-flask-app   SQL injection, XSS, auth bypass      security_tester
buggy-api              Missing validation, error handling   api_tester
untested-module        Low coverage, edge cases             unit_tester
slow-queries           N+1, missing indexes                 performance_tester

Creating Custom Templates

# .superqode/eval-templates/my-template.yaml
template:
  name: my-custom-template
  base: python-fastapi
  files:
    - path: src/vulnerable.py
      content: |
        def unsafe_query(user_id):
            # SQL injection vulnerability
            return db.execute(f"SELECT * FROM users WHERE id = {user_id}")
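
Before a run, the engine writes the template's declared files into a fresh workspace. An illustrative sketch of that step (not the engine's actual code):

import yaml
from pathlib import Path

def materialize_template(template_path: str, workspace: str) -> None:
    # Create each declared file inside the workspace, including parent dirs.
    spec = yaml.safe_load(Path(template_path).read_text())["template"]
    for entry in spec.get("files", []):
        target = Path(workspace) / entry["path"]
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(entry["content"])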

Metrics

Evaluation produces these metrics:

Metric                Description                Target
Detection Rate        % of issues found          > 90%
False Positive Rate   % of incorrect findings    < 10%
Fix Accuracy          % of valid fixes           > 80%
Time to Detection     Seconds to first finding   < 30s
Token Efficiency      Findings per 1K tokens     > 0.5

Metric Calculation

metrics = EvaluationMetrics(
    total_scenarios=len(scenarios),
    passed_scenarios=len([s for s in results if s.passed]),
    detection_rate=true_positives / total_expected,
    false_positive_rate=false_positives / total_findings,
    avg_duration=sum(r.duration for r in results) / len(results),
    total_tokens=sum(r.tokens_used for r in results)
)
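
For example, 18 true positives against 20 expected issues gives a detection rate of 0.9, exactly at the default threshold. A sketch of the resulting pass/fail gate (illustrative; the threshold keys match the configuration shown below):

def meets_thresholds(metrics: EvaluationMetrics, thresholds: dict) -> bool:
    # Higher is better for detection; lower is better for false positives.
    return (
        metrics.detection_rate >= thresholds["detection_rate"]
        and metrics.false_positive_rate <= thresholds["false_positive_rate"]
    )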

CI/CD Integration

GitHub Actions

- name: Run SuperQode Evaluation
  run: |
    superqode eval run --all --format junit > eval-results.xml

- name: Upload Results
  uses: actions/upload-artifact@v3
  with:
    name: evaluation-results
    path: eval-results.xml

- name: Publish Test Results
  uses: mikepenz/action-junit-report@v3
  with:
    report_paths: eval-results.xml

Benchmark Comparison

# Save baseline
superqode eval run --all --format json > baseline.json

# Compare after changes
superqode eval compare baseline.json current.json
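
Conceptually, the comparison diffs the headline metrics of the two JSON files. An illustrative sketch (it assumes the JSON output exposes top-level metric keys, which may not match the real schema):

import json
from pathlib import Path

def compare(baseline_path: str, current_path: str) -> None:
    baseline = json.loads(Path(baseline_path).read_text())
    current = json.loads(Path(current_path).read_text())
    for key in ("detection_rate", "false_positive_rate"):
        delta = current[key] - baseline[key]
        print(f"{key}: {baseline[key]:.2f} -> {current[key]:.2f} ({delta:+.2f})")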

Configuration

evaluation:
  # Scenarios directory
  scenarios_dir: .superqode/eval-scenarios

  # Templates directory  
  templates_dir: .superqode/eval-templates

  # Parallelism
  max_parallel: 4

  # Timeout multiplier (for slow machines)
  timeout_multiplier: 1.5

  # Results storage
  results_dir: .superqode/eval-results

  # Thresholds for pass/fail
  thresholds:
    detection_rate: 0.9
    false_positive_rate: 0.1
    fix_accuracy: 0.8
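
The timeout multiplier scales each scenario's timeout before the run. A one-line worked example (illustrative):

timeout_multiplier = 1.5   # from the configuration above
scenario_timeout = 120     # from the example scenario
effective_timeout = int(scenario_timeout * timeout_multiplier)  # -> 180 seconds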

Extending the Engine

Adding Custom Behaviors

# evaluation/behaviors.py

@dataclass
class CustomBehavior(Behavior):
    custom_check: str        # Name of the check to run
    type: str = "custom"

    def verify(self, qe_result: QEResult) -> BehaviorResult:
        # run_custom_check is your extension point; implement it on the subclass.
        passed = self.run_custom_check(qe_result)
        return BehaviorResult(passed=passed)

Adding Custom Adapters

# evaluation/adapters.py

class CustomAdapter(OutputAdapter):
    def format(self, results: list[ScenarioResult]) -> str:
        # Custom formatting logic: here, one status line per scenario.
        lines = [
            f"{r.scenario_id}: {'PASS' if r.passed else 'FAIL'}"
            for r in results
        ]
        return "\n".join(lines)