← Back to Submissions

Audit Report: Paper 220

Audit Summary

CODEBASE AUDIT RESULT: MEDIUM
AGENT REPRODUCIBILITY: False

---

Detailed Code Audit Report - Submission 220

Executive Summary

This submission presents a comprehensive implementation of the BiCA (Bidirectional Cognitive Adaptation) framework for human-AI collaboration. The codebase contains approximately 10,676 lines of Python code with a well-structured architecture including training loops, neural network models, environment implementations, and evaluation metrics. While the implementation is largely complete and functional, there are several concerning issues related to placeholder data in metrics computation and some questionable design choices that raise red flags about result authenticity.

---

1. COMPLETENESS & STRUCTURAL INTEGRITY

Strengths:

Complete project structure with proper modular organization.

All entry points exist and appear functional.

Proper dependency management.

No missing imports: all local imports reference files that exist in the codebase.

Complete model architectures.

Critical Issues:

🔴 DUMMY DATA IN METRICS COMPUTATION (lines 907-913 in train_maptalk.py):

```python
# Dummy prediction data (would need actual model predictions in real implementation)
if human_features and ai_features:
    n_samples = min(len(human_features), len(ai_features))
    metrics_data['human_predictions'] = np.random.rand(n_samples, 10)  # Dummy
    metrics_data['human_targets'] = np.random.randint(0, 10, n_samples)
    metrics_data['ai_predictions'] = np.random.rand(n_samples, 4)  # Dummy
    metrics_data['ai_targets'] = np.random.randint(0, 4, n_samples)
```

This is a major red flag. The code explicitly uses random data for computing the Mutual Predictability (MP) component of the BAS score, which is one of the paper's key evaluation metrics. This means the BAS scores reported in the paper may not be accurately computed from actual model predictions.
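A corrected version would populate these fields from real model outputs rather than random arrays; a minimal sketch under that assumption (function and argument names are hypothetical, not the submission's API):

```python
import numpy as np

def collect_prediction_data(human_outputs, human_labels, ai_outputs, ai_labels):
    """Build the metrics dict from actual per-step model predictions
    instead of np.random placeholders. Inputs are lists of arrays/labels
    gathered during evaluation rollouts."""
    n = min(len(human_outputs), len(ai_outputs))
    return {
        # Stack real per-step probability vectors; shapes mirror the
        # dummy arrays in train_maptalk.py (n x 10 and n x 4).
        'human_predictions': np.stack(human_outputs[:n]),
        'human_targets': np.asarray(human_labels[:n]),
        'ai_predictions': np.stack(ai_outputs[:n]),
        'ai_targets': np.asarray(ai_labels[:n]),
    }
```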

🔴 HARDCODED BASELINE VALUES (lines 922-932 in train_maptalk.py):

```python
metrics_data['performance'] = {
    'baseline_success': 0.5,   # Baseline comparison
    'perturbed_success': id_success,
    'perturbation_kl': 0.02,   # Small perturbation
    'avg_steps': 30.0,
    'avg_tokens': total_messages / max(len(all_trajectories), 1)
}
metrics_data['ood_performance'] = {
    'success_rate': ood_success,
    'collision_rate': 0.1,     # Estimated
    'miscalibration': 0.05     # Estimated
}
```

Multiple metrics use hardcoded or "estimated" values rather than actual measurements: the baseline success rate, perturbation KL, average steps, collision rate, and miscalibration are all fixed constants.

🟡 Fallback values in CCA computation (run_experiment.py, lines 732-783):

The CCA correlation computation has multiple fallback values (0.5, 0.4, 0.3, 0.35, 0.2) that are returned when computation fails or insufficient data is available. While this is more defensible than the dummy data above, it still means some reported metrics may not reflect actual alignment.
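Rather than returning predetermined constants, failures could be surfaced explicitly. A minimal sketch of a first-canonical-correlation computation that returns NaN on insufficient data (function name and thresholds are illustrative, not the submission's code):

```python
import numpy as np

def first_canonical_correlation(X, Y, eps=1e-8):
    """First canonical correlation between row-matched samples X and Y.
    Returns NaN (instead of a predetermined constant) when there is too
    little data, so failures stay visible in the reported metrics."""
    n = X.shape[0]
    if n != Y.shape[0] or n < max(X.shape[1], Y.shape[1]) + 2:
        return float('nan')
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    # Orthonormal bases for each view's column space; the top singular
    # value of Ux^T Uy is the first canonical correlation.
    Ux, Sx, _ = np.linalg.svd(Xc, full_matrices=False)
    Uy, Sy, _ = np.linalg.svd(Yc, full_matrices=False)
    Ux = Ux[:, Sx > eps * Sx[0]]
    Uy = Uy[:, Sy > eps * Sy[0]]
    if Ux.size == 0 or Uy.size == 0:
        return float('nan')
    s = np.linalg.svd(Ux.T @ Uy, compute_uv=False)
    return float(np.clip(s[0], 0.0, 1.0))
```

A NaN (or a logged warning) propagates visibly into downstream reports, whereas a fallback of 0.5 silently looks like a real alignment score.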

Minor Issues:

⚠️ A comment indicates incomplete implementation: the "would need actual model predictions in real implementation" note in train_maptalk.py explicitly acknowledges the placeholder.

⚠️ No TODOs or FIXMEs found: This is unusual for research code and may indicate the code was cleaned up for submission; on the other hand, it leaves no obvious incomplete sections.

---

2. RESULTS AUTHENTICITY RED FLAGS

Major Concerns:

🔴 Metrics computed from dummy data: As noted above, the Mutual Predictability (MP) component of BAS is computed using np.random.rand() and np.random.randint() rather than actual model predictions. This directly undermines the authenticity of reported BAS scores.

🔴 Hardcoded comparison values: Baseline success rates and other comparison metrics are hardcoded rather than computed from actual baseline experiments. While the code does include a proper baseline comparison runner (run_maptalk_comparison.py), the metrics preparation function uses hardcoded values.
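Computing the baseline from actual rollouts is straightforward; a hedged sketch (`run_episode` is a hypothetical stand-in for the submission's baseline runner, not its real API):

```python
import numpy as np

def baseline_performance(run_episode, n_episodes=100, seed=0):
    """Compute baseline_success and avg_steps from actual rollouts
    instead of the hardcoded 0.5 / 30.0. `run_episode` is any callable
    taking an RNG and returning (success, steps)."""
    rng = np.random.default_rng(seed)
    results = [run_episode(rng) for _ in range(n_episodes)]
    successes = [s for s, _ in results]
    steps = [t for _, t in results]
    return {
        'baseline_success': float(np.mean(successes)),
        'avg_steps': float(np.mean(steps)),
    }
```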

🟡 Suspicious fallback logic: The CCA computation has extensive fallback logic that returns predetermined values (ranging from 0.2 to 0.5) when computation fails. This could mask issues with the actual alignment computation.

Mitigating Factors:

Core task performance metrics appear legitimate.

Training loop appears genuine.

No evidence of cherry-picked seeds: The seed_list.txt contains reasonable seed values (13, 42, 15213, 2025, 4096) that don't appear to be cherry-picked for specific results.

Assessment:

The dummy data and hardcoded values are concerning, but they appear to affect secondary evaluation metrics (BAS/CCM components) rather than the primary task performance metrics (success rate, steps to completion, rewards). The paper's main claims about task performance could still be valid even if the BAS/CCM scores are questionable.

---

3. IMPLEMENTATION-PAPER CONSISTENCY

Based on the methods document (220_methods_results.md):

Architecture Consistency:

AI Policy Network: Matches specification - "Recurrent GRU-based architecture that processes observations and human messages"

Human Surrogate Network: Matches specification - "GRU-based surrogate models human protocol learning"

Protocol Generator: Matches specification - "Uses Gumbel-Softmax sampling"

Representation Mapper: Matches specification - "Aligns via Wasserstein distance and CCA"

Instructor Network: Matches specification - "Provides adaptive guidance"

Hyperparameter Consistency:

Training parameters match the paper.

Environment parameters match.

Experimental Protocol:

MapTalk environment implemented correctly.

⚠️ Latent Navigator: Full implementation exists but appears to be a separate auxiliary experiment with its own dataset and models. Implementation looks complete.

---

4. CODE QUALITY SIGNALS

Positive Indicators:

Well-organized structure: Clear separation of concerns with dedicated modules for models, losses, environments, evaluation

Proper error handling: try/except blocks in metrics computation

Reasonable code reuse: Factory functions for model creation

Documentation: Docstrings present for most functions

Type hints: Many functions include type annotations

Configuration-driven: YAML configs for experiments rather than hardcoded values in scripts

Negative Indicators:

🔴 Dummy data generation: Explicit use of random data for metrics (see Section 2)

🔴 Hardcoded comparison baselines: Not computing baselines from actual runs

🟡 Extensive fallback logic: Could mask underlying issues

⚠️ Limited comments: Some complex sections lack explanatory comments

⚠️ No unit tests: No test files found in submission

Code Quality Score: 6/10

The code is reasonably well-written and organized, but the dummy data and hardcoded values significantly undermine confidence in the results.

---

5. FUNCTIONALITY INDICATORS

Data Loading:

Proper environment reset and step functions

Configurable obstacle generation

Reachability checking with BFS

Path clearing to ensure solvability
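The BFS reachability check noted above is a standard technique; a minimal sketch of the idea (hypothetical names, not the submission's implementation):

```python
from collections import deque

def reachable(grid, start, goal):
    """BFS over a 2D grid; grid[r][c] is True where there is an obstacle.
    Returns True iff goal is reachable from start via 4-connected moves."""
    rows, cols = len(grid), len(grid[0])
    seen, frontier = {start}, deque([start])
    while frontier:
        r, c = frontier.popleft()
        if (r, c) == goal:
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and not grid[nr][nc] and (nr, nc) not in seen):
                seen.add((nr, nc))
                frontier.append((nr, nc))
    return False
```

An environment generator would call this after placing obstacles and re-sample (or clear a path) whenever it returns False, which matches the "path clearing to ensure solvability" behavior described above.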

Training Loops:

Complete PPO implementation.

Alternating updates (E-A-P-M-I-Λ steps).
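The clipped surrogate at the heart of a PPO update is standard; as a reference point for the audit, a minimal numpy sketch (not the submission's implementation):

```python
import numpy as np

def ppo_clip_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Clipped PPO surrogate (to be maximized):
    E[min(r * A, clip(r, 1-eps, 1+eps) * A)] with r = exp(new - old)."""
    ratio = np.exp(new_logp - old_logp)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return float(np.mean(np.minimum(ratio * advantages, clipped * advantages)))
```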

Evaluation:

🟡 Mixed quality: task-level metrics are computed properly, but the BAS/CCM components rely on the dummy and hardcoded data noted in Section 1.

Functionality Score: 7/10

Core functionality appears solid, but evaluation metrics have significant issues.

---

6. DEPENDENCY & ENVIRONMENT ISSUES

Dependencies:

All standard libraries: scipy, scikit-learn, pandas, matplotlib, seaborn, plotly

Standard RL/ML packages: PyTorch (via environment.yml), statsmodels, POT (optimal transport)

Logging: wandb, tensorboard

No exotic dependencies

Potential Issues:

⚠️ PyTorch not in requirements.txt: Only in environment.yml, which could cause confusion

⚠️ PyQt5 for GUI: Latent Navigator requires GUI libraries that may not be available on all systems

⚠️ No version pins: Most dependencies don't specify versions, which could lead to compatibility issues
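Version pinning could be as simple as adding exact versions to requirements.txt; the versions below are illustrative only, not what the submission was tested with:

```
numpy==1.24.4
scipy==1.10.1
pandas==2.0.3
scikit-learn==1.3.2
torch==2.0.1
POT==0.9.1
```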

Environment Score: 8/10

Dependencies are reasonable and standard, with minor version specification issues.

---

7. REPRODUCIBILITY ASSESSMENT

Positive Factors:

Fixed seeds provided: seed_list.txt contains multiple seeds for reproducibility

Deterministic operations mentioned: README claims "All experiments use fixed random seeds and deterministic operations"

Configuration files: All hyperparameters specified in YAML configs

Checkpoint saving: Training checkpoints saved for resuming

Comprehensive README: Clear instructions for setup and running experiments
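The determinism the README claims typically requires seeding every RNG in play; a minimal sketch (the PyTorch calls are shown as comments since package availability varies; names are illustrative):

```python
import os
import random

import numpy as np

def set_global_seed(seed: int) -> None:
    """Seed every RNG in play for reproducible runs."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    # The equivalent PyTorch calls, if torch is installed:
    # torch.manual_seed(seed)
    # torch.use_deterministic_algorithms(True)
```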

Negative Factors:

🔴 Dummy data in metrics: Cannot reproduce exact BAS/CCM scores from paper

🔴 Hardcoded baselines: Cannot verify baseline comparisons

🟡 No version pins: Could lead to different results with different library versions

⚠️ Long training times: Paper suggests full experiments need 50 epochs × 32 batch_episodes = 1600 episodes

Reproducibility Score: 5/10

Core experiments could be reproduced, but some reported metrics cannot be accurately reproduced due to dummy data issues.

---

8. SEVERITY ASSESSMENT

Critical Issues (🔴):

  1. Dummy prediction data: Mutual Predictability metric uses random data instead of actual model predictions
  2. Hardcoded baseline values: Comparison metrics use fixed values instead of computed baselines
  3. Estimated OOD metrics: Collision rate and miscalibration use hardcoded estimates

These issues directly affect the authenticity of reported evaluation metrics, particularly the BAS score and its CCM components.

High-Level Issues (🟡):

  1. Fallback CCA values: CCA computation returns predetermined values when it fails
  2. Limited actual baseline runs: While comparison code exists, metrics use hardcoded values

Medium Issues (⚠️):

  1. No version specifications: Could affect reproducibility
  2. One placeholder comment: Suggests some incomplete implementation
  3. No unit tests: Makes it harder to verify correctness

Low Issues:

  1. Documentation could be more detailed: Some complex sections lack comments
  2. Minor parameter discrepancy: max episode steps appears as 80 in some contexts and 60 in others

---

9. AGENT REPRODUCIBILITY ANALYSIS

Finding: No evidence of AI agent prompts or logs

Search Results: No prompt files, agent transcripts, or other AI-tool artifacts were found in the submission.

Code Characteristics:

The code exhibits characteristics of human-written research code, though some stylistic uniformity could also indicate AI assistance.

Conclusion: No direct evidence of AI agent usage or documented prompts. The code quality suggests professional development, but there is no transparency about whether AI tools were used.

---

10. OVERALL ASSESSMENT

Summary:

This is a substantial research codebase with a complete implementation of a complex reinforcement learning system. The core functionality appears sound, with proper training loops, model architectures, and environment implementations. However, critical issues with metrics computation significantly undermine confidence in some of the reported results.

Key Concerns:

  1. Metrics Authenticity: Some evaluation metrics (particularly BAS components) use dummy or hardcoded data
  2. Result Verification: Cannot independently verify all claims from the paper using the provided code
  3. Transparency: Dummy data usage is documented in comments but not prominently disclosed

Key Strengths:

  1. Core Implementation: Training loop, models, and environment are well-implemented
  2. Architecture Consistency: Implementations match paper specifications
  3. Completeness: All major components are present and appear functional
  4. Configuration: Proper YAML-based configuration system

Can the Paper's Claims Be Verified?

Partially. The primary task-performance claims (success rates, episode lengths, rewards) appear verifiable from the provided code; the BAS/CCM claims cannot be verified as reported because of the dummy data and hardcoded values.

Recommendation:

The codebase demonstrates a serious implementation effort with a complete system, but the dummy data in metrics computation is a significant red flag that warrants further investigation. The authors should be asked to:

  1. Provide corrected metrics computation without dummy data
  2. Explain why dummy data was used and clarify which results in the paper are affected
  3. Re-run evaluations with proper metrics computation
  4. Document any AI assistance used in development

AUDIT RESULT: MEDIUM - Code is largely functional with good structure, but metrics authenticity issues prevent a LOW severity rating. It is not HIGH/CRITICAL because core training and task performance appear legitimate.

---

11. RECOMMENDATIONS FOR AUTHORS

  1. Immediate: Remove dummy data from metrics computation and recompute BAS/CCM scores with actual model predictions
  2. Immediate: Replace hardcoded baseline values with computed baselines from actual baseline runs
  3. High Priority: Add unit tests for critical components (metrics computation, loss functions)
  4. High Priority: Add version specifications to all dependencies
  5. Medium Priority: Document any AI assistance used in code development
  6. Medium Priority: Add more inline comments explaining complex algorithmic sections
  7. Low Priority: Consider adding visualization tools for debugging training

---

12. CONCLUSION

This submission represents a substantial engineering effort with approximately 10,676 lines of thoughtfully structured Python code. The implementation is largely complete and appears functional for the core training task. However, the presence of dummy data in metrics computation and hardcoded comparison values raises serious questions about the authenticity of some reported results.

The primary experimental results (success rates, episode lengths, rewards) appear to be computed correctly and could likely be reproduced. The secondary evaluation metrics (BAS, CCM components) have significant issues that prevent full verification of those claims.

Final Grade: MEDIUM - Functional implementation with concerning metrics authenticity issues that need to be addressed before full confidence can be established.