← Back to Submissions

Audit Report: Paper 220

Audit Summary

CODEBASE AUDIT RESULT: MEDIUM
AGENT REPRODUCIBILITY: False

---

Detailed Code Audit Report - Submission 220

Executive Summary

This submission presents a comprehensive implementation of the BiCA (Bidirectional Cognitive Adaptation) framework for human-AI collaboration. The codebase contains approximately 10,676 lines of Python code with a well-structured architecture including training loops, neural network models, environment implementations, and evaluation metrics. While the implementation is largely complete and functional, there are several concerning issues related to placeholder data in metrics computation and some questionable design choices that raise red flags about result authenticity.

---

1. COMPLETENESS & STRUCTURAL INTEGRITY

Strengths:

Complete project structure with proper modular organization.

All entry points exist and appear functional.

Proper dependency management.

No missing imports: all local imports reference files that exist in the codebase.

Complete model architectures.

Critical Issues:

🔴 DUMMY DATA IN METRICS COMPUTATION (lines 907-913 in train_maptalk.py):

```python
# Dummy prediction data (would need actual model predictions in real implementation)
if human_features and ai_features:
    n_samples = min(len(human_features), len(ai_features))
    metrics_data['human_predictions'] = np.random.rand(n_samples, 10)  # Dummy
    metrics_data['human_targets'] = np.random.randint(0, 10, n_samples)
    metrics_data['ai_predictions'] = np.random.rand(n_samples, 4)  # Dummy
    metrics_data['ai_targets'] = np.random.randint(0, 4, n_samples)
```

This is a major red flag. The code explicitly uses random data for computing the Mutual Predictability (MP) component of the BAS score, which is one of the paper's key evaluation metrics. This means the BAS scores reported in the paper may not be accurately computed from actual model predictions.
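A corrected version would populate these fields from real model outputs rather than random arrays; a minimal sketch under that assumption (function and argument names are hypothetical, not the submission's API):

```python
import numpy as np

def collect_prediction_data(human_outputs, human_labels, ai_outputs, ai_labels):
    """Build the metrics dict from actual per-step model predictions
    instead of np.random placeholders. Inputs are lists of arrays/labels
    gathered during evaluation rollouts."""
    n = min(len(human_outputs), len(ai_outputs))
    return {
        # Stack real per-step probability vectors; shapes mirror the
        # dummy arrays in train_maptalk.py (n x 10 and n x 4).
        'human_predictions': np.stack(human_outputs[:n]),
        'human_targets': np.asarray(human_labels[:n]),
        'ai_predictions': np.stack(ai_outputs[:n]),
        'ai_targets': np.asarray(ai_labels[:n]),
    }
```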

🔴 HARDCODED BASELINE VALUES (lines 922-932 in train_maptalk.py):

```python
metrics_data['performance'] = {
    'baseline_success': 0.5,   # Baseline comparison
    'perturbed_success': id_success,
    'perturbation_kl': 0.02,   # Small perturbation
    'avg_steps': 30.0,
    'avg_tokens': total_messages / max(len(all_trajectories), 1)
}
metrics_data['ood_performance'] = {
    'success_rate': ood_success,
    'collision_rate': 0.1,     # Estimated
    'miscalibration': 0.05     # Estimated
}
```

Multiple metrics use hardcoded or "estimated" values rather than actual measurements: the baseline success rate, perturbation KL, average steps, collision rate, and miscalibration are all fixed constants.

🟡 Fallback values in CCA computation (run_experiment.py, lines 732-783):

The CCA correlation computation has multiple fallback values (0.5, 0.4, 0.3, 0.35, 0.2) that are returned when computation fails or insufficient data is available. While this is more defensible than the dummy data above, it still means some reported metrics may not reflect actual alignment.
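Rather than returning predetermined constants, failures could be surfaced explicitly. A minimal sketch of a first-canonical-correlation computation that returns NaN on insufficient data (function name and thresholds are illustrative, not the submission's code):

```python
import numpy as np

def first_canonical_correlation(X, Y, eps=1e-8):
    """First canonical correlation between row-matched samples X and Y.
    Returns NaN (instead of a predetermined constant) when there is too
    little data, so failures stay visible in the reported metrics."""
    n = X.shape[0]
    if n != Y.shape[0] or n < max(X.shape[1], Y.shape[1]) + 2:
        return float('nan')
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    # Orthonormal bases for each view's column space; the top singular
    # value of Ux^T Uy is the first canonical correlation.
    Ux, Sx, _ = np.linalg.svd(Xc, full_matrices=False)
    Uy, Sy, _ = np.linalg.svd(Yc, full_matrices=False)
    Ux = Ux[:, Sx > eps * Sx[0]]
    Uy = Uy[:, Sy > eps * Sy[0]]
    if Ux.size == 0 or Uy.size == 0:
        return float('nan')
    s = np.linalg.svd(Ux.T @ Uy, compute_uv=False)
    return float(np.clip(s[0], 0.0, 1.0))
```

A NaN (or a logged warning) propagates visibly into downstream reports, whereas a fallback of 0.5 silently looks like a real alignment score.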

Minor Issues:

⚠️ A comment indicates incomplete implementation: the "would need actual model predictions in real implementation" note in train_maptalk.py explicitly acknowledges the placeholder.

⚠️ No TODOs or FIXMEs found: This is unusual for research code and may indicate the code was cleaned up for submission; on the other hand, it leaves no obvious incomplete sections.

---

2. RESULTS AUTHENTICITY RED FLAGS

Major Concerns:

🔴 Metrics computed from dummy data: As noted above, the Mutual Predictability (MP) component of BAS is computed using np.random.rand() and np.random.randint() rather than actual model predictions. This directly undermines the authenticity of reported BAS scores.

🔴 Hardcoded comparison values: Baseline success rates and other comparison metrics are hardcoded rather than computed from actual baseline experiments. While the code does include a proper baseline comparison runner (run_maptalk_comparison.py), the metrics preparation function uses hardcoded values.
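Computing the baseline from actual rollouts is straightforward; a hedged sketch (`run_episode` is a hypothetical stand-in for the submission's baseline runner, not its real API):

```python
import numpy as np

def baseline_performance(run_episode, n_episodes=100, seed=0):
    """Compute baseline_success and avg_steps from actual rollouts
    instead of the hardcoded 0.5 / 30.0. `run_episode` is any callable
    taking an RNG and returning (success, steps)."""
    rng = np.random.default_rng(seed)
    results = [run_episode(rng) for _ in range(n_episodes)]
    successes = [s for s, _ in results]
    steps = [t for _, t in results]
    return {
        'baseline_success': float(np.mean(successes)),
        'avg_steps': float(np.mean(steps)),
    }
```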

🟡 Suspicious fallback logic: The CCA computation has extensive fallback logic that returns predetermined values (ranging from 0.2 to 0.5) when computation fails. This could mask issues with the actual alignment computation.

Mitigating Factors:

Core task performance metrics appear legitimate.

Training loop appears genuine.

No evidence of cherry-picked seeds: The seed_list.txt contains reasonable seed values (13, 42, 15213, 2025, 4096) that don't appear to be cherry-picked for specific results.

Assessment:

The dummy data and hardcoded values are concerning, but they appear to affect secondary evaluation metrics (BAS/CCM components) rather than the primary task performance metrics (success rate, steps to completion, rewards). The paper's main claims about task performance could still be valid even if the BAS/CCM scores are questionable.

---

3. IMPLEMENTATION-PAPER CONSISTENCY

Based on the methods document (220_methods_results.md):

Architecture Consistency:

AI Policy Network: Matches specification - "Recurrent GRU-based architecture that processes observations and human messages"

Human Surrogate Network: Matches specification - "GRU-based surrogate models human protocol learning"

Protocol Generator: Matches specification - "Uses Gumbel-Softmax sampling"

Representation Mapper: Matches specification - "Aligns via Wasserstein distance and CCA"

Instructor Network: Matches specification - "Provides adaptive guidance"

Hyperparameter Consistency:

Training parameters match the paper.

Environment parameters match.

Experimental Protocol:

MapTalk environment implemented correctly.

⚠️ Latent Navigator: Full implementation exists but appears to be a separate auxiliary experiment with its own dataset and models. Implementation looks complete.

---

4. CODE QUALITY SIGNALS

Positive Indicators:

Well-organized structure: Clear separation of concerns with dedicated modules for models, losses, environments, evaluation

Proper error handling: try/except blocks in metrics computation

Reasonable code reuse: Factory functions for model creation

Documentation: Docstrings present for most functions

Type hints: Many functions include type annotations

Configuration-driven: YAML configs for experiments rather than hardcoded values in scripts

Negative Indicators:

🔴 Dummy data generation: Explicit use of random data for metrics (see Section 2)

🔴 Hardcoded comparison baselines: Not computing baselines from actual runs

🟡 Extensive fallback logic: Could mask underlying issues

⚠️ Limited comments: Some complex sections lack explanatory comments

⚠️ No unit tests: No test files found in submission

Code Quality Score: 6/10

The code is reasonably well-written and organized, but the dummy data and hardcoded values significantly undermine confidence in the results.

---

5. FUNCTIONALITY INDICATORS

Data Loading:

Proper environment reset and step functions

Configurable obstacle generation

Reachability checking with BFS

Path clearing to ensure solvability
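The BFS reachability check noted above is a standard technique; a minimal sketch of the idea (hypothetical names, not the submission's implementation):

```python
from collections import deque

def reachable(grid, start, goal):
    """BFS over a 2D grid; grid[r][c] is True where there is an obstacle.
    Returns True iff goal is reachable from start via 4-connected moves."""
    rows, cols = len(grid), len(grid[0])
    seen, frontier = {start}, deque([start])
    while frontier:
        r, c = frontier.popleft()
        if (r, c) == goal:
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and not grid[nr][nc] and (nr, nc) not in seen):
                seen.add((nr, nc))
                frontier.append((nr, nc))
    return False
```

An environment generator would call this after placing obstacles and re-sample (or clear a path) whenever it returns False, which matches the "path clearing to ensure solvability" behavior described above.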

Training Loops:

Complete PPO implementation.

Alternating updates (E-A-P-M-I-Λ steps).
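The clipped surrogate at the heart of a PPO update is standard; as a reference point for the audit, a minimal numpy sketch (not the submission's implementation):

```python
import numpy as np

def ppo_clip_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Clipped PPO surrogate (to be maximized):
    E[min(r * A, clip(r, 1-eps, 1+eps) * A)] with r = exp(new - old)."""
    ratio = np.exp(new_logp - old_logp)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return float(np.mean(np.minimum(ratio * advantages, clipped * advantages)))
```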

Evaluation:

🟡 Mixed quality: task-level metrics are computed properly, but the BAS/CCM components rely on the dummy and hardcoded data noted in Section 1.

Functionality Score: 7/10

Core functionality appears solid, but evaluation metrics have significant issues.

---

6. DEPENDENCY & ENVIRONMENT ISSUES

Dependencies:

All standard libraries: scipy, scikit-learn, pandas, matplotlib, seaborn, plotly

Standard RL/ML packages: PyTorch (via environment.yml), statsmodels, POT (optimal transport)

Logging: wandb, tensorboard

No exotic dependencies

Potential Issues:

⚠️ PyTorch not in requirements.txt: Only in environment.yml, which could cause confusion

⚠️ PyQt5 for GUI: Latent Navigator requires GUI libraries that may not be available on all systems

⚠️ No version pins: Most dependencies don't specify versions, which could lead to compatibility issues
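Version pinning could be as simple as adding exact versions to requirements.txt; the versions below are illustrative only, not what the submission was tested with:

```
numpy==1.24.4
scipy==1.10.1
pandas==2.0.3
scikit-learn==1.3.2
torch==2.0.1
POT==0.9.1
```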

Environment Score: 8/10

Dependencies are reasonable and standard, with minor version specification issues.

---

7. REPRODUCIBILITY ASSESSMENT

Positive Factors:

Fixed seeds provided: seed_list.txt contains multiple seeds for reproducibility

Deterministic operations mentioned: README claims "All experiments use fixed random seeds and deterministic operations"

Configuration files: All hyperparameters specified in YAML configs

Checkpoint saving: Training checkpoints saved for resuming

Comprehensive README: Clear instructions for setup and running experiments
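The determinism the README claims typically requires seeding every RNG in play; a minimal sketch (the PyTorch calls are shown as comments since package availability varies; names are illustrative):

```python
import os
import random

import numpy as np

def set_global_seed(seed: int) -> None:
    """Seed every RNG in play for reproducible runs."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    # The equivalent PyTorch calls, if torch is installed:
    # torch.manual_seed(seed)
    # torch.use_deterministic_algorithms(True)
```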

Negative Factors:

🔴 Dummy data in metrics: Cannot reproduce exact BAS/CCM scores from paper

🔴 Hardcoded baselines: Cannot verify baseline comparisons

🟡 No version pins: Could lead to different results with different library versions

⚠️ Long training times: Paper suggests full experiments need 50 epochs × 32 batch_episodes = 1600 episodes

Reproducibility Score: 5/10

Core experiments could be reproduced, but some reported metrics cannot be accurately reproduced due to dummy data issues.

---

8. SEVERITY ASSESSMENT

Critical Issues (🔴):

  1. Dummy prediction data: Mutual Predictability metric uses random data instead of actual model predictions
  2. Hardcoded baseline values: Comparison metrics use fixed values instead of computed baselines
  3. Estimated OOD metrics: Collision rate and miscalibration use hardcoded estimates

These issues directly affect the authenticity of reported evaluation metrics, particularly the BAS score and its CCM components.

High-Level Issues (🟡):

  1. Fallback CCA values: CCA computation returns predetermined values when it fails
  2. Limited actual baseline runs: While comparison code exists, metrics use hardcoded values

Medium Issues (⚠️):

  1. No version specifications: Could affect reproducibility
  2. One placeholder comment: Suggests some incomplete implementation
  3. No unit tests: Makes it harder to verify correctness

Low Issues:

  1. Documentation could be more detailed: Some complex sections lack comments
  2. Minor parameter discrepancy: max episode steps appears as 80 in some contexts and 60 in others

---

9. AGENT REPRODUCIBILITY ANALYSIS

Finding: No evidence of AI agent prompts or logs

Search Results: No prompt files, agent transcripts, or other AI-tool artifacts were found in the submission.

Code Characteristics:

The code exhibits characteristics of human-written research code, though some stylistic uniformity could also indicate AI assistance.

Conclusion: No direct evidence of AI agent usage or documented prompts. The code quality suggests professional development, but there is no transparency about whether AI tools were used.

---

10. OVERALL ASSESSMENT

Summary:

This is a substantial research codebase with a complete implementation of a complex reinforcement learning system. The core functionality appears sound, with proper training loops, model architectures, and environment implementations. However, critical issues with metrics computation significantly undermine confidence in some of the reported results.

Key Concerns:

  1. Metrics Authenticity: Some evaluation metrics (particularly BAS components) use dummy or hardcoded data
  2. Result Verification: Cannot independently verify all claims from the paper using the provided code
  3. Transparency: Dummy data usage is documented in comments but not prominently disclosed

Key Strengths:

  1. Core Implementation: Training loop, models, and environment are well-implemented
  2. Architecture Consistency: Implementations match paper specifications
  3. Completeness: All major components are present and appear functional
  4. Configuration: Proper YAML-based configuration system

Can the Paper's Claims Be Verified?

Partially. The primary task-performance claims (success rates, episode lengths, rewards) appear verifiable from the provided code; the BAS/CCM claims cannot be verified as reported because of the dummy data and hardcoded values.

Recommendation:

The codebase demonstrates a serious implementation effort with a complete system, but the dummy data in metrics computation is a significant red flag that warrants further investigation. The authors should be asked to:

  1. Provide corrected metrics computation without dummy data
  2. Explain why dummy data was used and clarify which results in the paper are affected
  3. Re-run evaluations with proper metrics computation
  4. Document any AI assistance used in development

AUDIT RESULT: MEDIUM - Code is largely functional with good structure, but metrics authenticity issues prevent a LOW severity rating. It is not HIGH/CRITICAL because core training and task performance appear legitimate.

---

11. RECOMMENDATIONS FOR AUTHORS

  1. Immediate: Remove dummy data from metrics computation and recompute BAS/CCM scores with actual model predictions
  2. Immediate: Replace hardcoded baseline values with computed baselines from actual baseline runs
  3. High Priority: Add unit tests for critical components (metrics computation, loss functions)
  4. High Priority: Add version specifications to all dependencies
  5. Medium Priority: Document any AI assistance used in code development
  6. Medium Priority: Add more inline comments explaining complex algorithmic sections
  7. Low Priority: Consider adding visualization tools for debugging training

---

12. CONCLUSION

This submission represents a substantial engineering effort with approximately 10,676 lines of thoughtfully structured Python code. The implementation is largely complete and appears functional for the core training task. However, the presence of dummy data in metrics computation and hardcoded comparison values raises serious questions about the authenticity of some reported results.

The primary experimental results (success rates, episode lengths, rewards) appear to be computed correctly and could likely be reproduced. The secondary evaluation metrics (BAS, CCM components) have significant issues that prevent full verification of those claims.

Final Grade: MEDIUM - Functional implementation with concerning metrics authenticity issues that need to be addressed before full confidence can be established.