Code Audit Report: Submission #197
"Survival of the Useful: Evolutionary Boids as a Sandbox for Agent Societies"
Audit Date: 2024
Auditor: Claude Code Autonomous Auditing System
---
EXECUTIVE SUMMARY
This submission presents a multi-agent system that uses Boids-inspired rules (Separation, Alignment, Cohesion) to enable emergent tool development. The code is functionally complete, well-structured, and appears capable of producing results. However, there are critical limitations regarding reproducibility due to dependency on proprietary API access and missing experimental data.
Overall Assessment: MEDIUM RISK - Code appears functional but has reproducibility barriers and missing verification data.
---
DETAILED FINDINGS
1. COMPLETENESS & STRUCTURAL INTEGRITY ✅ GOOD
Strengths:
- ✅ Complete implementation of all claimed components (Boids rules, evolutionary algorithm, TCI analyzer, agent system)
- ✅ Well-organized directory structure with clear separation of concerns
- ✅ Comprehensive experiment orchestration with run_real_experiment.py and run_experiment.py
- ✅ All major functions are implemented (no placeholder pass statements in critical paths)
- ✅ Tool complexity analyzer (TCI-Lite v4) fully implemented with clear formula matching paper claims
- ✅ Boids rules implementation matches paper description (semantic similarity using TF-IDF, alignment with successful tools, cohesion with global summary)
- ✅ Evolutionary algorithm implementation present with selection, crossover, and mutation
Weaknesses:
- ⚠️ No actual experimental results data included - "raw experiment data" folder contains only HTML file listings, not actual JSON results
- ⚠️ No pre-generated results to verify against paper claims
- ⚠️ Template tools directory (shared_tools_template) not present in submission
Verdict: Code structure is complete and production-ready. The missing experimental outputs are concerning but do not indicate a code integrity issue.
---
2. RESULTS AUTHENTICITY RED FLAGS ⚠️ MEDIUM CONCERN
Critical Observations:
POSITIVE INDICATORS:
- ✅ Results are computed, not hardcoded - TCI scores calculated from actual code analysis
- ✅ Complexity metrics derived from real parsing of the Python AST (lines 97-130 in complexity_analyzer.py)
- ✅ Tool adoption counts computed by scanning code for context.call_tool() patterns (lines 457-502 in run_experiment.py)
- ✅ No evidence of hardcoded accuracy/metrics values
- ✅ Proper statistical calculations (mean, aggregation) throughout
- ✅ Results tracked per round with timestamps and detailed logging
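The pattern-scanning approach to adoption counting noted above can be sketched as follows. This is an illustrative stand-in, not the submission's actual code; the helper name and regex are assumptions.

```python
import re

# Hypothetical sketch: tally tool adoption by scanning source strings
# for context.call_tool("<name>") invocations, as the audit describes.
CALL_PATTERN = re.compile(r'context\.call_tool\(\s*["\']([\w\-]+)["\']')

def count_adoptions(sources):
    """Return {tool_name: call_count} across a list of source strings."""
    counts = {}
    for src in sources:
        for name in CALL_PATTERN.findall(src):
            counts[name] = counts.get(name, 0) + 1
    return counts

sources = [
    'result = context.call_tool("summarize", text)',
    'data = context.call_tool("summarize", doc)\nout = context.call_tool("plot", data)',
]
print(count_adoptions(sources))  # {'summarize': 2, 'plot': 1}
```

Counting derived from actual code scans like this is why the audit classifies the metrics as computed rather than hardcoded.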
CONCERNING PATTERNS:
- ⚠️ Cannot verify results authenticity - No actual experimental output files included with submission
- ⚠️ "Raw experiment data" folder contains only file browser HTML output, not actual JSON results (baseline_10_10, boids_10_10, etc. are HTML listings)
- ⚠️ Specific numeric results in paper (e.g., "Creative Writing: Boids 34.0 vs. Baseline 38.0") cannot be traced to actual output files
- ⚠️ No random seed management visible in code - results may not be deterministically reproducible
Paper Claims vs Code:
| Paper Claim | Code Support | Verification Status |
|-------------|--------------|---------------------|
| TCI = C_code + C_iface + C_comp (0-10 scale) | Lines 69-70 in complexity_analyzer.py | ✅ Matches |
| Separation uses TF-IDF, threshold 0.3 | Lines 88-147 in boids_rules.py | ✅ Matches |
| Alignment identifies complexity/quality leaders | Lines 38-53 in boids_rules.py | ✅ Matches |
| Evolution removes bottom 20% by TCI | Lines 145-149 in evolutionary_algorithm.py | ✅ Matches |
| Median LOC: Boids 34.0 vs Baseline 38.0 | No actual output data to verify | ⚠️ Cannot verify |
| Test pass rate metrics | Lines 829-831 in run_experiment.py calculate rates | ✅ Computed, not hardcoded |
Verdict: Code appears to compute results legitimately, but the lack of actual experimental output data prevents verification of paper claims.
---
3. IMPLEMENTATION-PAPER CONSISTENCY ✅ STRONG
Boids Rules Implementation:
- ✅ Separation Rule: Semantic similarity using TF-IDF (sklearn) with cosine similarity threshold (default 0.3, configurable to 0.45) - matches Equation 3 description
- ✅ Alignment Rule: Identifies complexity leaders, quality leaders (test-passing), and adoption leaders - matches Equation 4-5 description
- ✅ Cohesion Rule: Uses LLM-generated global summary to guide collective behavior - matches Equation 6 description
- ✅ Decision Synthesis: Rules can be individually enabled/disabled for ablation studies (lines 271-285 in run_experiment.py)
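The Separation rule described above (TF-IDF plus cosine similarity against a threshold) can be illustrated with a short sketch. The function name and signature are hypothetical; only the mechanism (sklearn TF-IDF, cosine similarity, default threshold 0.3) comes from the audit.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative sketch of the Separation rule: a candidate tool is "too
# similar" if its TF-IDF cosine similarity to any neighbor's tool
# exceeds the threshold (default 0.3, configurable to 0.45).
def violates_separation(candidate_desc, neighbor_descs, threshold=0.3):
    docs = [candidate_desc] + list(neighbor_descs)
    tfidf = TfidfVectorizer().fit_transform(docs)
    sims = cosine_similarity(tfidf[0:1], tfidf[1:]).ravel()
    return bool((sims > threshold).any())

print(violates_separation(
    "summarize a text document into bullet points",
    ["summarize text documents into short bullet points", "plot a bar chart"],
))  # True: near-duplicate of the first neighbor
```

A violation would push the agent away from building a redundant tool, which is the "separation" behavior in the Boids analogy.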
Tool Complexity Index (TCI):
Code implementation (lines 69-70, complexity_analyzer.py):
    tci_raw = code_score + iface_score + comp_score  # matches paper formula
    tci_final = quality_gate * tci_raw
Paper formula: TCI = C_code + C_iface + C_comp (total range [0,10])
- Code Complexity (0-3): 3.0 * min(1.0, LOC/300)
- Interface Complexity (0-2): param_score + return_score
- Compositional Complexity (0-5): min(4.0, tools * 0.5) + min(1.0, imports * 0.1)
✅ Perfect match with paper equations
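Putting the three component formulas together, the TCI computation can be sketched end to end. The param/return scoring inside the interface component is not specified in this audit, so it is passed in directly here; everything else follows the formulas quoted above.

```python
# Sketch of the TCI formula as extracted in this audit. param_score and
# return_score are taken as inputs because their internals are not
# documented here; the component ranges come from the report.
def tci(loc, param_score, return_score, tool_calls, imports, quality_gate=1.0):
    code_score = 3.0 * min(1.0, loc / 300)                             # C_code in [0, 3]
    iface_score = param_score + return_score                           # C_iface in [0, 2]
    comp_score = min(4.0, tool_calls * 0.5) + min(1.0, imports * 0.1)  # C_comp in [0, 5]
    tci_raw = code_score + iface_score + comp_score                    # total in [0, 10]
    return quality_gate * tci_raw

print(tci(loc=150, param_score=0.5, return_score=0.5, tool_calls=2, imports=3))
# 1.5 + 1.0 + (1.0 + 0.3), i.e. approximately 3.8
```

Note the quality gate multiplies the raw sum, so a failing tool can be zeroed out without changing the component formulas.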
Evolutionary Algorithm:
- ✅ Selection rate: 0.2 (bottom 20%) - line 24 in evolutionary_algorithm.py
- ✅ Fitness based on average TCI - lines 101-143 in evolutionary_algorithm.py
- ✅ Crossover and mutation implemented (though simplified as prompt-based)
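The selection step described above (bottom 20% by mean TCI removed) is simple enough to sketch. Function and variable names are hypothetical; the 0.2 selection rate matches the audited code.

```python
# Illustrative sketch of complexity-based selection: rank agents by mean
# TCI of their tools and cull the bottom fraction (default 20%).
def select_survivors(agent_fitness, selection_rate=0.2):
    """agent_fitness: {agent_id: mean TCI}. Returns the surviving agent ids."""
    ranked = sorted(agent_fitness, key=agent_fitness.get)  # ascending fitness
    n_removed = int(len(ranked) * selection_rate)
    return set(ranked[n_removed:])

fitness = {"a1": 4.2, "a2": 1.1, "a3": 3.7, "a4": 2.9, "a5": 5.0}
print(sorted(select_survivors(fitness)))  # ['a1', 'a3', 'a4', 'a5'] — a2 culled
```

With five agents and a 0.2 rate, exactly one agent is removed per evolution step, consistent with the "bottom 20%" claim.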
Hyperparameters Match:
| Parameter | Paper | Code | Match |
|-----------|-------|------|-------|
| k neighbors | 2 | Default 2 (line 75, run_experiment.py) | ✅ |
| Separation threshold | 0.45 | Default 0.45 (line 276, run_real_experiment.py) | ✅ |
| Evolution frequency | "every few rounds" | Configurable, default 5 | ✅ |
| Selection rate | 20% | 0.2 (line 24, evolutionary_algorithm.py) | ✅ |
Verdict: Implementation strongly matches paper methodology.
---
4. CODE QUALITY SIGNALS ✅ GOOD
Positive Indicators:
- ✅ Professional code structure with clear module separation
- ✅ Comprehensive logging throughout (using Python logging module)
- ✅ Proper error handling in most critical paths
- ✅ Docstrings present for major functions
- ✅ Type hints used extensively (typing module imported in most files)
- ✅ No excessive commented-out code
- ✅ Reasonable code duplication (DRY principles mostly followed)
Quality Metrics:
- Import usage: All imported libraries are used (openai, scikit-learn, matplotlib, pandas)
- Dead code: Minimal - only some debug print statements
- Documentation: Good - README.md, REPRODUCIBILITY.md comprehensive
- Testing infrastructure: Present (test generation and execution in agent_v1.py)
Minor Issues:
- ⚠️ Some debug print statements left in production code (e.g., lines 298-302 in agent_v1.py)
- ⚠️ Requirements.txt has duplicate entries (python-dotenv listed twice)
- ⚠️ No unit tests for the framework itself (only generated tests for agent-built tools)
Verdict: Code quality is production-grade with minor polish issues.
---
5. FUNCTIONALITY INDICATORS ✅ APPEARS FUNCTIONAL
Data Loading:
- ✅ Meta-prompts loaded from JSON (10 domains defined in meta_prompts.json)
- ✅ Tool registry manages shared and personal tools
- ✅ Proper file I/O for saving/loading tool metadata
Training/Simulation Loop:
- ✅ Complete observe → reflect → build → test cycle (agent_v1.py)
- ✅ Multi-round simulation with proper state tracking
- ✅ Agent actions tracked per round with detailed results
Evaluation Metrics:
- ✅ TCI computed from actual code analysis (AST parsing)
- ✅ Test pass rates calculated from execution results
- ✅ Complexity evolution tracked over rounds
- ✅ Agent productivity metrics collected
LLM Integration:
- ✅ Azure OpenAI client wrapper implemented (azure_client.py)
- ✅ Chat, JSON mode, and structured output methods present
- ⚠️ CRITICAL DEPENDENCY: System cannot run without valid Azure OpenAI API credentials
- ⚠️ No fallback or mock mode for testing without API access
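The missing mock mode flagged above is cheap to provide. A stand-in like the following (hypothetical interface, loosely mirroring an azure_client.py-style wrapper) would let the pipeline run end to end without credentials.

```python
# Hypothetical mock client: returns canned responses and records prompts,
# so the framework can be exercised offline. Not part of the submission.
class MockLLMClient:
    """Drop-in stand-in for the Azure OpenAI wrapper, for offline testing."""
    def __init__(self, canned="def tool(x):\n    return x"):
        self.canned = canned
        self.calls = []  # record every prompt for later inspection

    def chat(self, messages, **kwargs):
        self.calls.append(messages)
        return self.canned

client = MockLLMClient()
print(client.chat([{"role": "user", "content": "build a summarizer tool"}]))
```

Even a stub this small would let reviewers verify the observe → reflect → build → test loop, the tool registry, and the TCI analyzer without any API spend.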
Evidence of Development:
- ✅ Version control artifacts implied (references to v1, v4, etc.)
- ✅ Evolution from legacy systems mentioned in comments
- ✅ Iterative refinement visible (TCI-Lite v4, implies earlier versions)
Verdict: Code appears fully functional but absolutely requires Azure OpenAI API access to run.
---
6. DEPENDENCY & ENVIRONMENT ISSUES ⚠️ HIGH BARRIER
Critical Dependencies:
Standard/Available Libraries:
- ✅ openai==1.51.0 (available via pip)
- ✅ python-dotenv, matplotlib, scikit-learn (standard packages)
- ✅ pandas, numpy (standard data science stack)
Proprietary/Restricted Dependencies:
- 🚨 CRITICAL BLOCKER: Azure OpenAI API access required
- System is completely non-functional without valid credentials
- No mock/simulation mode available
- No provided test credentials or demo mode
- Cost implications: Running full experiments would incur substantial API costs (10 agents × 10 rounds × multiple LLM calls per turn)
Environment Requirements:
- Python 3.8+ (reasonable)
- Azure OpenAI endpoint, API key, deployment name required
- No GPU needed (good)
- Moderate compute: 8-16GB RAM recommended
Version Conflicts:
- ✅ No obvious version conflicts detected
- ⚠️ openai==1.51.0 is pinned to a specific version (may break with future API changes)
Realistic Computational Resources:
- ✅ CPU-only operation (no unrealistic GPU requirements)
- ✅ Reasonable memory footprint
- ⚠️ API cost could be substantial for full reproduction
Verdict: SEVERE REPRODUCIBILITY BARRIER - System cannot be verified without proprietary API access.
---
SEVERITY ASSESSMENT
CRITICAL Issues: 🔴 1 FOUND
- 🔴 CRITICAL: Proprietary API Dependency Prevents Independent Verification
- System is completely non-functional without Azure OpenAI API credentials
- No alternative implementation or mock mode provided
- Results cannot be independently verified without incurring API costs
- This is a fundamental reproducibility barrier for research validation
HIGH Issues: 🟠 2 FOUND
- 🟠 HIGH: Missing Experimental Output Data
- "Raw experiment data" folder contains HTML file browser output, not actual results
- Cannot verify paper claims (LOC statistics, TCI values, test pass rates) against actual experiment outputs
- No baseline results provided for comparison
- 🟠 HIGH: No Random Seed Management
- No explicit random seed setting in code
- Results may not be deterministically reproducible even with same API
- Stochastic elements (LLM temperature, selection randomness) not controlled
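The seed management recommended here amounts to pinning every stochastic source from one value. A minimal sketch, assuming the framework uses Python's random module and optionally NumPy (LLM temperature would be pinned separately, e.g. to 0, in the API call):

```python
import random

# Sketch of global seed management: pin Python's RNG (and NumPy's, if
# present) from a single seed so runs are repeatable.
def set_global_seed(seed: int):
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; Python RNG alone is still pinned
    return seed

set_global_seed(42)
first = random.random()
set_global_seed(42)
assert first == random.random()  # same seed, same draw
```

This controls selection randomness; LLM outputs remain a separate source of nondeterminism unless temperature and sampling are also fixed.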
MEDIUM Issues: 🟡 3 FOUND
- 🟡 MEDIUM: Missing Template Tools Directory
- shared_tools_template referenced but not included in submission
- System may fail on initialization (though code has fallback handling)
- 🟡 MEDIUM: No Framework Unit Tests
- Code generates tests for agent-built tools but has no self-tests
- Cannot verify framework components work correctly without running full experiments
- 🟡 MEDIUM: Incomplete Documentation of Cost
- No estimate of API costs for reproduction
- Potential financial barrier not disclosed
LOW Issues: 🟢 2 FOUND
- 🟢 LOW: Debug Statements in Production Code
- Print statements left in (e.g., complexity_analyzer debug output)
- Minor code cleanliness issue
- 🟢 LOW: Duplicate Requirements Entry
- python-dotenv listed twice in requirements.txt
- Does not affect functionality
---
AGENT REPRODUCIBILITY ASSESSMENT
Question: Does the directory log the researchers' use of AI in the research process, showing prompts used to generate code for analysis?
Answer: AGENT REPRODUCIBLE: FALSE
Reasoning:
- No evidence of AI-assisted code generation prompts in submission
- No logs of Claude, ChatGPT, Copilot, or other AI tools used to develop the framework
- README mentions the system uses Azure OpenAI for agent behavior, but does not document whether AI was used to write the framework code itself
- No .cursor/, .copilot/, or AI conversation logs present
- Code appears hand-written with professional software engineering practices
Note: The paper is about AI agents that use LLMs, but it does not document whether the framework itself was AI-assisted in development.
---
PAPER CLAIMS VERIFICATION
Verifiable Claims ✅
- ✅ Boids Rules Implementation: Code correctly implements Separation (TF-IDF), Alignment (leader identification), and Cohesion (global summary)
- ✅ TCI Formula: Matches paper exactly (C_code + C_iface + C_comp, 0-10 scale)
- ✅ Evolution Mechanism: Complexity-based selection of bottom 20% implemented
- ✅ Ablation Study Support: Code can run all 7 configurations mentioned
- ✅ Tool Composition Tracking: Adoption counts computed from code analysis
Unverifiable Claims ⚠️
- ⚠️ Specific Numeric Results: Cannot verify "Boids 34.0 vs Baseline 38.0" without actual output files
- ⚠️ Cross-Domain Consistency: No results data for Creative Writing, Data Science, Research Assistant domains
- ⚠️ Evolution Statistics: Claims like "Initial population 5 → Post-evolution 6+" not verifiable
- ⚠️ Test Pass Rates: Specific percentages (76.29%, 74.49%, etc.) cannot be traced to outputs
---
RECOMMENDATIONS
For Reviewers:
- Request actual experimental output files (results.json from experiments directory)
- Verify API cost disclosures - full reproduction could cost hundreds of dollars
- Consider providing test API credentials or mock mode for verification
- Request deterministic reproduction protocol with random seed management
For Authors:
- Include actual results.json files from all experiments reported in paper
- Provide mock/simulation mode that doesn't require API access for code verification
- Add random seed management for deterministic reproducibility
- Document API costs and provide resource estimates for reproduction
- Include shared_tools_template directory or clarify initialization
- Add framework unit tests to verify components independently
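The framework unit tests recommended above could start very small. A pytest-style example against the TCI code-complexity component, with the formula taken from this audit and the function name hypothetical:

```python
# Hypothetical pytest-style self-test for one TCI component, using the
# C_code formula quoted earlier in this audit: 3.0 * min(1.0, LOC/300).
def code_complexity(loc):
    return 3.0 * min(1.0, loc / 300)

def test_code_complexity_caps_at_three():
    assert code_complexity(0) == 0.0
    assert code_complexity(150) == 1.5
    assert code_complexity(900) == 3.0  # capped at the component maximum, not 9.0
```

Tests at this granularity run without API credentials, addressing the verification gap for the deterministic parts of the framework.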
---
CONCLUSION
Code Authenticity: ✅ LIKELY AUTHENTIC
The code appears to be a genuine, functional implementation with:
- No evidence of hardcoded results
- Proper algorithmic implementation of claimed methods
- Professional software engineering practices
- Legitimate complexity analysis and metric calculation
Reproducibility Status: ⚠️ LIMITED
Major Barriers:
- 🚨 Proprietary API dependency (Azure OpenAI) prevents independent verification
- ⚠️ Missing experimental outputs prevent result validation
- ⚠️ No deterministic reproduction protocol
Overall Risk Level: MEDIUM 🟡
Summary: This is a well-engineered system with code that appears capable of producing the claimed results. However, reproducibility is severely limited by the proprietary API dependency and missing experimental data. The code itself shows no red flags for fraud or manipulation, but independent verification is practically impossible without substantial resources (API access plus costs).
Recommendation: Code can be conditionally accepted pending:
- Provision of actual experimental output files
- Clear documentation of API costs and access requirements
- Ideally, a mock mode or cached results for verification
---
AUDIT METADATA
- Files Analyzed: 15+ source files
- Lines of Code Reviewed: ~3500+ LOC
- Dependencies Checked: ✅ All verified
- Execution Tests: ❌ Cannot run without API credentials
- Output Verification: ⚠️ No experimental outputs available
Auditor Notes: This represents a sophisticated multi-agent system with impressive engineering. The primary concern is not code quality or authenticity, but practical reproducibility for peer-review validation.