Code Audit Report: Submission #197

"Survival of the Useful: Evolutionary Boids as a Sandbox for Agent Societies"

Audit Date: 2024
Auditor: Claude Code Autonomous Auditing System

---

EXECUTIVE SUMMARY

This submission presents a multi-agent system that uses Boids-inspired rules (Separation, Alignment, Cohesion) to enable emergent tool development. The code is functionally complete, well-structured, and appears capable of producing results. However, there are critical limitations regarding reproducibility due to dependency on proprietary API access and missing experimental data.

Overall Assessment: MEDIUM RISK - Code appears functional but has reproducibility barriers and missing verification data.

---

DETAILED FINDINGS

1. COMPLETENESS & STRUCTURAL INTEGRITY ✅ GOOD

Verdict: Code structure is complete and production-ready. The missing experimental outputs are concerning but do not compromise code integrity.

---

2. RESULTS AUTHENTICITY RED FLAGS ⚠️ MEDIUM CONCERN

Paper Claims vs Code:

| Paper Claim | Code Support | Verification Status |
|-------------|--------------|---------------------|
| TCI = C_code + C_iface + C_comp (0-10 scale) | Lines 69-70 in complexity_analyzer.py | ✅ Matches |
| Separation uses TF-IDF, threshold 0.3 | Lines 88-147 in boids_rules.py | ✅ Matches |
| Alignment identifies complexity/quality leaders | Lines 38-53 in boids_rules.py | ✅ Matches |
| Evolution removes bottom 20% by TCI | Lines 145-149 in evolutionary_algorithm.py | ✅ Matches |
| Median LOC: Boids 34.0 vs Baseline 38.0 | No actual output data to verify | ⚠️ Cannot verify |
| Test pass rate metrics | Lines 829-831 in run_experiment.py calculate rates | ✅ Computed, not hardcoded |

Verdict: Code appears to compute results legitimately, but lack of actual experimental output data prevents verification of paper claims.

---

3. IMPLEMENTATION-PAPER CONSISTENCY ✅ STRONG

Tool Complexity Index (TCI):

Code implementation (lines 69-70, complexity_analyzer.py):

    tci_raw = code_score + iface_score + comp_score  # Matches paper formula
    tci_final = quality_gate * tci_raw

Paper formula: TCI = C_code + C_iface + C_comp (total range [0,10])

- Code Complexity (0-3): 3.0 * min(1.0, LOC/300)
- Interface Complexity (0-2): param_score + return_score
- Compositional Complexity (0-5): min(4.0, tools * 0.5) + min(1.0, imports * 0.1)

Perfect match with paper equations.
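To make the formula above concrete, here is a minimal sketch of the TCI computation as described. The function and parameter names are illustrative, not the authors' code; the exact weightings inside the interface score (0.25 per parameter, 0.5 per return value) are assumptions, since only the component ranges are given.

```python
# Hypothetical sketch of the TCI formula described above; names and the
# per-parameter / per-return weightings are assumptions, not the authors' code.
def tool_complexity_index(loc, n_params, n_returns, n_tools_used, n_imports,
                          quality_gate=1.0):
    # Code complexity (0-3): scales with lines of code, saturating at 300 LOC
    code_score = 3.0 * min(1.0, loc / 300)
    # Interface complexity (0-2): parameter score plus return score
    param_score = min(1.0, n_params * 0.25)   # assumed weighting
    return_score = min(1.0, n_returns * 0.5)  # assumed weighting
    iface_score = param_score + return_score
    # Compositional complexity (0-5): composed tools plus imports
    comp_score = min(4.0, n_tools_used * 0.5) + min(1.0, n_imports * 0.1)
    tci_raw = code_score + iface_score + comp_score  # range [0, 10]
    return quality_gate * tci_raw
```

A maximally complex tool (300+ LOC, many parameters, many composed tools and imports) saturates at 10.0, matching the paper's stated range.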

Evolutionary Algorithm hyperparameters match the paper:

| Parameter | Paper | Code | Match |
|-----------|-------|------|-------|
| k neighbors | 2 | Default 2 (line 75, run_experiment.py) | ✅ |
| Separation threshold | 0.45 | Default 0.45 (line 276, run_real_experiment.py) | ✅ |
| Evolution frequency | "every few rounds" | Configurable, default 5 | ✅ |
| Selection rate | 20% | 0.2 (line 24, evolutionary_algorithm.py) | ✅ |

Verdict: Implementation strongly matches paper methodology.
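The selection mechanism audited above (remove the bottom 20% of agents by TCI) can be sketched as follows. This is an illustrative reconstruction, not the code from evolutionary_algorithm.py; the `(agent_id, tci)` pair representation is an assumption.

```python
# Illustrative sketch of complexity-based selection: cull the bottom
# fraction of agents by TCI, mirroring selection_rate = 0.2 above.
def cull_bottom_by_tci(agents, selection_rate=0.2):
    """agents: list of (agent_id, tci) pairs; returns the survivors."""
    n_remove = int(len(agents) * selection_rate)
    if n_remove == 0:
        return list(agents)
    ranked = sorted(agents, key=lambda a: a[1])  # ascending by TCI
    return ranked[n_remove:]  # drop the lowest-complexity agents
```

With a population of 5 and a 20% rate, exactly one agent is removed per evolution step, consistent with the "initial population 5" claim audited later.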

---

4. CODE QUALITY SIGNALS ✅ GOOD

Verdict: Code quality is production-grade with minor polish issues.

---

5. FUNCTIONALITY INDICATORS ✅ APPEARS FUNCTIONAL

Verdict: Code appears fully functional, but it cannot run without Azure OpenAI API access.

---

6. DEPENDENCY & ENVIRONMENT ISSUES ⚠️ HIGH BARRIER

Verdict: SEVERE REPRODUCIBILITY BARRIER. The system cannot be verified without proprietary API access.

---

SEVERITY ASSESSMENT

CRITICAL Issues: 🔴 1 FOUND

  1. 🔴 CRITICAL: Proprietary API Dependency Prevents Independent Verification
    • System is completely non-functional without Azure OpenAI API credentials
    • No alternative implementation or mock mode provided
    • Results cannot be independently verified without incurring API costs
    • This is a fundamental reproducibility barrier for research validation

HIGH Issues: 🟠 2 FOUND

  1. 🟠 HIGH: Missing Experimental Output Data
    • "Raw experiment data" folder contains HTML file browser output, not actual results
    • Cannot verify paper claims (LOC statistics, TCI values, test pass rates) against actual experiment outputs
    • No baseline results provided for comparison
  2. 🟠 HIGH: No Random Seed Management
    • No explicit random seed setting in code
    • Results may not be deterministically reproducible even with same API
    • Stochastic elements (LLM temperature, selection randomness) not controlled
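The seed-management gap flagged above could be closed with a small helper along these lines. This is a hedged sketch, not code from the submission: it pins Python's built-in RNG (and NumPy's, if present), which controls local stochastic elements such as selection randomness; LLM sampling remains only partially controllable even with a fixed API-side seed.

```python
# Sketch of the missing seed management (not the authors' code): pin the
# local RNGs so selection randomness becomes deterministic across runs.
import os
import random

def set_global_seeds(seed=42):
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Seed numpy too if the pipeline uses it anywhere.
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
```

Calling `set_global_seeds` at the top of the experiment entry point would at least make the non-LLM stochastic elements reproducible.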

MEDIUM Issues: 🟡 3 FOUND

  1. 🟡 MEDIUM: Missing Template Tools Directory
    • shared_tools_template referenced but not included in submission
    • System may fail on initialization (though code has fallback handling)
  2. 🟡 MEDIUM: No Framework Unit Tests
    • Code generates tests for agent-built tools but has no self-tests
    • Cannot verify framework components work correctly without running full experiments
  3. 🟡 MEDIUM: Incomplete Documentation of Cost
    • No estimate of API costs for reproduction
    • Potential financial barrier not disclosed

LOW Issues: 🟢 2 FOUND

  1. 🟢 LOW: Debug Statements in Production Code
    • Print statements left in (e.g., complexity_analyzer debug output)
    • Minor code cleanliness issue
  2. 🟢 LOW: Duplicate Requirements Entry
    • python-dotenv listed twice in requirements.txt
    • Does not affect functionality

---

AGENT REPRODUCIBILITY ASSESSMENT

Question: Does the directory log the researchers' use of AI in the research process, showing prompts used to generate code for analysis?

Answer: AGENT REPRODUCIBLE: FALSE

Reasoning: The paper is about AI agents that use LLMs, but it does not document whether the framework itself was AI-assisted in development.

---

PAPER CLAIMS VERIFICATION

Verifiable Claims ✅

  1. Boids Rules Implementation: Code correctly implements Separation (TF-IDF), Alignment (leader identification), and Cohesion (global summary)
  2. TCI Formula: Matches paper exactly (C_code + C_iface + C_comp, 0-10 scale)
  3. Evolution Mechanism: Complexity-based selection of bottom 20% implemented
  4. Ablation Study Support: Code can run all 7 configurations mentioned
  5. Tool Composition Tracking: Adoption counts computed from code analysis
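The Separation rule verified in item 1 (TF-IDF similarity against a threshold) can be illustrated with a self-contained sketch. This is not the boids_rules.py implementation; the tokenization, the smoothed IDF, and the function names are all assumptions made for illustration.

```python
# Illustrative TF-IDF separation check (not the authors' code): two tool
# descriptions are "too similar" when the cosine similarity of their
# TF-IDF vectors exceeds a threshold, signalling an agent to diverge.
import math
from collections import Counter

def tfidf_vectors(docs):
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter()  # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF so terms shared by all docs keep nonzero weight
        vecs.append({t: (tf[t] / len(toks)) * (math.log((1 + n) / (1 + df[t])) + 1)
                     for t in tf})
    return vecs

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def too_similar(desc_a, desc_b, threshold=0.45):
    va, vb = tfidf_vectors([desc_a, desc_b])
    return cosine(va, vb) > threshold
```

Identical descriptions score cosine 1.0 and trigger separation; descriptions with no shared tokens score 0.0 and do not.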

Unverifiable Claims ⚠️

  1. ⚠️ Specific Numeric Results: Cannot verify "Boids 34.0 vs Baseline 38.0" without actual output files
  2. ⚠️ Cross-Domain Consistency: No results data for Creative Writing, Data Science, Research Assistant domains
  3. ⚠️ Evolution Statistics: Claims like "Initial population 5 → Post-evolution 6+" not verifiable
  4. ⚠️ Test Pass Rates: Specific percentages (76.29%, 74.49%, etc.) cannot be traced to outputs

---

RECOMMENDATIONS

For Reviewers:

  1. Request actual experimental output files (results.json from experiments directory)
  2. Verify API cost disclosures - full reproduction could cost hundreds of dollars
  3. Consider providing test API credentials or mock mode for verification
  4. Request deterministic reproduction protocol with random seed management

For Authors:

  1. Include actual results.json files from all experiments reported in paper
  2. Provide mock/simulation mode that doesn't require API access for code verification
  3. Add random seed management for deterministic reproducibility
  4. Document API costs and provide resource estimates for reproduction
  5. Include shared_tools_template directory or clarify initialization
  6. Add framework unit tests to verify components independently
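Recommendation 2 could be satisfied with a small stub along these lines. The class and method names here are invented for illustration; the point is a drop-in object that replays canned completions wherever the real Azure OpenAI client would be constructed, so the framework's control flow can be exercised without credentials.

```python
# Hypothetical mock client for recommendation 2 (names invented): replays
# canned completions instead of calling Azure OpenAI, letting reviewers
# run the framework end to end without API credentials or costs.
class MockChatClient:
    def __init__(self, canned_responses):
        self._responses = list(canned_responses)
        self._calls = 0

    def chat(self, messages, **kwargs):
        # Cycle through recorded responses rather than hitting the API.
        response = self._responses[self._calls % len(self._responses)]
        self._calls += 1
        return {"choices": [{"message": {"content": response}}]}
```

Pairing such a stub with cached real responses from the authors' own runs would also let reviewers replay the reported experiments deterministically.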

---

CONCLUSION

Code Authenticity: ✅ LIKELY AUTHENTIC

The code appears to be a genuine, functional implementation.

Reproducibility Status: ⚠️ LIMITED

Major Barriers:
  1. 🚨 Proprietary API dependency (Azure OpenAI) prevents independent verification
  2. ⚠️ Missing experimental outputs prevent result validation
  3. ⚠️ No deterministic reproduction protocol

Overall Risk Level: MEDIUM 🟡

Summary: This is a well-engineered system whose code appears capable of producing the claimed results. However, reproducibility is severely limited by the proprietary API dependency and the missing experimental data. The code itself shows no red flags for fraud or manipulation, but independent verification is practically impossible without substantial resources (API access plus costs).

Recommendation: Code can be conditionally accepted pending:
  1. Provision of actual experimental output files
  2. Clear documentation of API costs and access requirements
  3. Ideally, a mock mode or cached results for verification

---

AUDIT METADATA

Auditor Notes: This represents a sophisticated multi-agent system with impressive engineering. The primary concern is not code quality or authenticity, but practical reproducibility for peer review validation.