---
Audit Summary
CODEBASE AUDIT RESULT: CRITICAL
AGENT REPRODUCIBILITY: False
---
Detailed Code Audit Report: Submission 291
Executive Summary
This submission contains NO EXECUTABLE CODE WHATSOEVER. The submission consists entirely of pre-generated data files (JSON/JSONL input specifications and CSV output results) with no implementation code, scripts, or computational infrastructure to reproduce the claimed results. This represents a complete failure to provide a functional codebase and directly contradicts the reproducibility statement's claims.
---
1. COMPLETENESS & STRUCTURAL INTEGRITY: CRITICAL
1.1 Missing Core Implementation Files
Finding: The submission contains ZERO executable code files.
Evidence:
- No Python scripts (`.py` files)
- No Jupyter notebooks (`.ipynb` files)
- No R scripts (`.R` or `.r` files)
- No shell scripts (`.sh` files)
- No code in any programming language
Files Present:
```
/sub_291/
├── 291_methods_results.md (paper description)
└── Mentor-Mind/
    ├── A1 Input Specifications/
    │   ├── advisor_graph_outputs.jsonl (162 pre-generated outputs)
    │   ├── baseline_cot.jsonl (162 pre-generated outputs)
    │   ├── baseline_memo.jsonl (162 pre-generated outputs)
    │   ├── baseline_sc5.jsonl (162 pre-generated outputs)
    │   ├── mentors_oracle.json (configuration data)
    │   ├── mentors_text.json (configuration data)
    │   └── scenarios.json (54 scenario specifications)
    ├── A2 Generated Evaluation Artifacts/
    │   ├── all_rows.csv (648 result rows + header)
    │   ├── by_difficulty.csv
    │   ├── by_domain.csv
    │   ├── by_mentor.csv
    │   ├── config_used.json
    │   ├── fairness_summary.csv
    │   └── overall.csv
    └── Reproducibility Statement.pdf
```
1.2 Contradiction with Reproducibility Statement
The reproducibility statement (page 1) explicitly claims:
> "We provide an anonymized artifact containing: (i) the scenario generator and the 54 scenario files... Scripts to regenerate all tables and figures from logs are included"
Reality: NO scripts of any kind are present in the submission.
The statement lists components that should be present but are entirely missing:
- ❌ Scenario generator code
- ❌ Prompt templates (mentioned but not provided)
- ❌ Monte Carlo simulation code ("Python-based external simulator" - paper section on LLM-External Computation Hybrid)
- ❌ Scripts to regenerate tables and figures
- ❌ LLM inference code
- ❌ Utility computation code
- ❌ CVaR calculation code
- ❌ Constraint checking code
- ❌ Oracle computation code
- ❌ Weight learning code (3-fold cross-validation mentioned)
1.3 No Dependencies or Environment Specification
Missing:
- No `requirements.txt`
- No `environment.yml`
- No `setup.py` or `pyproject.toml`
- No Docker configuration
- No indication of which libraries/frameworks were used
---
2. RESULTS AUTHENTICITY RED FLAGS: CRITICAL
2.1 All Results Are Pre-Generated
Critical Finding: All experimental results exist only as CSV files with no code to generate them.
Evidence:
- `all_rows.csv`: 648 rows of results (162 scenarios × 4 methods)
- All `.jsonl` files contain only final outputs with no generation code
- No Monte Carlo simulation code despite claims of 400 samples per action
- No LLM API calls or inference code
- No computational traces showing how results were produced
2.2 Major Numerical Discrepancy Between Paper and Data
Critical Inconsistency:
| Metric | Paper Claims (Table 1) | Actual CSV Data | Discrepancy |
|--------|------------------------|-----------------|-------------|
| Mentor-Mind Alignment | 97.5% | 89.5% | -8.0 pp |
| CoT Alignment | 83.3% | 75.3% | -8.0 pp |
| SC-5 Alignment | 78.4% | 71.6% | -6.8 pp |
| Memo Alignment | 82.7% | 75.3% | -7.4 pp |
From `overall.csv`:
```
method,align,regret,hcvr,n
graph_grounded_v3,0.8950617283950617,-0.007700617283950617,0.0,162
cot_vanilla,0.7530864197530864,-0.0019753086419753113,0.11728395061728394,162
```
Analysis:
- The paper claims 97.5% alignment (97.53% ± 2.4%) for Mentor-Mind (`graph_grounded_v3`)
- The actual data shows 89.5% alignment (0.8950617…, exactly 145/162)
- This is an 8.0 percentage point discrepancy, far beyond any rounding error or statistical variance
- The same systematic ~7-8 percentage point inflation appears across ALL methods
- This suggests either:
- Results were manually inflated in the paper, OR
- Different data was used than what was submitted, OR
- Post-hoc filtering/manipulation of results
2.3 Regret Values Show Opposite Sign
Paper Claims:
- Mentor-Mind ΔEU = +0.0005 (only positive value)
- Baselines ΔEU = -0.0045
Actual CSV Data:
- graph_grounded_v3: regret = -0.007700617... (NEGATIVE, not positive)
- cot_vanilla: regret = -0.0019753... (less negative than Mentor-Mind!)
The sign flip and magnitude differences suggest fundamental inconsistencies in how metrics were calculated or reported.
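Both the alignment gap and the regret sign flip can be checked mechanically from the two `overall.csv` rows quoted in Section 2.2, using the paper's Table 1 figures as the claimed values:

```python
import csv
import io

# The two rows quoted from overall.csv (Section 2.2).
overall_csv = """method,align,regret,hcvr,n
graph_grounded_v3,0.8950617283950617,-0.007700617283950617,0.0,162
cot_vanilla,0.7530864197530864,-0.0019753086419753113,0.11728395061728394,162
"""

# Alignment figures the paper reports in Table 1.
paper_alignment = {"graph_grounded_v3": 0.975, "cot_vanilla": 0.833}

rows = {r["method"]: r for r in csv.DictReader(io.StringIO(overall_csv))}

for method, claimed in paper_alignment.items():
    actual = float(rows[method]["align"])
    gap_pp = (claimed - actual) * 100  # discrepancy in percentage points
    print(f"{method}: paper {claimed:.1%} vs data {actual:.1%} ({gap_pp:+.1f} pp)")

# The paper claims Mentor-Mind's regret (ΔEU) is positive; the data disagrees.
assert float(rows["graph_grounded_v3"]["regret"]) < 0
```

Running this reproduces the ~8 pp gap for both methods and confirms the negative regret for `graph_grounded_v3`.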
2.4 No Evidence of Actual Computation
Missing Computational Artifacts:
- No random seed usage in any code (statement claims seeds: {20250913, 101, 2024, 7, 11})
- No Monte Carlo sampling code (claims 400 samples/action)
- No CVaR computation (claims α=0.10)
- No utility weight learning code (claims 3-fold CV with grid search)
- No LLM API logs or responses
- No intermediate computation states
- No timing logs (statement claims wall-clock times are logged)
---
3. IMPLEMENTATION-PAPER CONSISTENCY: NOT ASSESSABLE
Status: Cannot assess implementation consistency because NO IMPLEMENTATION EXISTS.
The paper describes:
- Influence diagram structure with decision/chance/utility nodes
- Monte Carlo sampling (S=400)
- CVaR optimization (α=0.10)
- Mean-CVaR mixing (λ=0.30)
- LLM prompting with GPT-3.5-class model (temperature=0.2, top_p=1.0)
- Python-based external simulator
- 3-fold cross-validation for weight learning
None of this can be verified as no code is present.
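For reference only, the mean–CVaR mixing the paper describes (S = 400 Monte Carlo samples, α = 0.10, λ = 0.30) would amount to something like the sketch below. No such code exists in the submission; the function names and the sampling distribution are hypothetical, and only the parameter values come from the paper:

```python
import random

def cvar(utilities, alpha=0.10):
    """Conditional Value-at-Risk: mean of the worst alpha fraction of outcomes."""
    worst = sorted(utilities)[: max(1, int(len(utilities) * alpha))]
    return sum(worst) / len(worst)

def mean_cvar_score(utilities, alpha=0.10, lam=0.30):
    """Mix expected utility with its lower tail, as the paper describes."""
    mean = sum(utilities) / len(utilities)
    return (1 - lam) * mean + lam * cvar(utilities, alpha)

# Hypothetical use: score one action from S = 400 simulated utility draws.
random.seed(20250913)  # one of the seeds the statement claims were used
samples = [random.gauss(0.5, 0.2) for _ in range(400)]
print(round(mean_cvar_score(samples), 4))
```

Even a stub of this size would have let the claimed CVaR optimization be inspected; its complete absence is what makes Section 3 unassessable.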
---
4. CODE QUALITY SIGNALS: NOT APPLICABLE
Status: No code to assess.
---
5. FUNCTIONALITY INDICATORS: CRITICAL FAILURE
5.1 No Functional Components
The submission provides:
- ✅ Scenario specifications (JSON) - data only
- ✅ Mentor specifications (JSON) - data only
- ✅ Pre-computed results (CSV/JSONL) - outputs only
- ❌ No code to go from inputs to outputs
5.2 Impossibility of Reproduction
Without implementation code, it is impossible to:
- Generate new scenarios
- Run the Mentor-Mind method
- Run baseline methods (CoT, Self-Consistency, MemoPrompt)
- Compute oracle decisions
- Evaluate alignment, regret, or HCVR metrics
- Reproduce any table or figure from the paper
- Verify any computational claim
- Test the method on new data
5.3 Data-Only Submission
This submission is equivalent to providing a spreadsheet of results with scenario descriptions. While the data is organized, it proves nothing about:
- Whether the method actually works
- Whether results were computed as described
- Whether the implementation is correct
- Whether the claims are reproducible
---
6. DEPENDENCY & ENVIRONMENT ISSUES: NOT ASSESSABLE
Status: No code means no dependencies to assess.
Note: The reproducibility statement claims "runs do not require specialized hardware (we used a single GPU for prompting and CPU for simulation)" but provides no code that could be run on any hardware.
---
7. ADDITIONAL RED FLAGS
7.1 Inconsistent Naming Convention
The method is called:
- "Mentor-Mind" in the paper and methods document
- "graph_grounded_v3" in the data files
The "v3" suffix suggests multiple iterations, raising questions about whether earlier versions produced different (worse) results that were not reported.
7.2 Trace Information is Empty
In baseline files (e.g., `baseline_cot.jsonl`), all entries show:
```
"trace": {"top_factors": []}
```
This suggests:
- Either the reasoning traces were stripped out
- Or the baselines were run without actually capturing their reasoning
- Or the outputs were fabricated without running actual LLM inference
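A check of the following form (a sketch, assuming each JSONL record carries the `trace` field quoted above) is all it takes to flag the uniformly empty traces:

```python
import json

# One record shaped like the entries in baseline_cot.jsonl (Section 7.2).
line = '{"trace": {"top_factors": []}}'

def trace_is_empty(jsonl_line):
    """Return True when a record's reasoning trace carries no factors."""
    record = json.loads(jsonl_line)
    return not record.get("trace", {}).get("top_factors")

print(trace_is_empty(line))  # True: no captured reasoning
```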
7.3 Perfect Patterns in Results
Looking at all_rows.csv, the graph_grounded_v3 method shows:
- Perfect alignment (1.0) and zero regret for all easy scenarios
- Suspiciously clean patterns across domains
- HCVR of exactly 0.0 across all 162 decisions
While this could be legitimate, without code to verify, it could also indicate:
- Results were hand-selected or filtered
- Unsuccessful runs were excluded
- Data was constructed to match desired outcomes
7.4 No Version Control Artifacts
- No `.git` directory
- No commit history
- No development artifacts
- No evidence of iterative development
---
8. AGENT REPRODUCIBILITY ASSESSMENT
Finding: AGENT REPRODUCIBILITY = False
Rationale:
The submission contains NO documentation of:
- AI tools used to generate code
- Prompts used for code generation
- LLM assistance in implementation
- Agent-based development processes
However, this is a moot point because there is no code at all to assess for AI generation.
---
9. SEVERITY CLASSIFICATION
Overall Assessment: CRITICAL
Justification:
- ✅ Complete absence of executable code - Most severe possible issue
- ✅ Impossible to reproduce any results - Violates basic scientific standards
- ✅ Major numerical discrepancies (8 percentage points) between paper and submitted data
- ✅ False claims in reproducibility statement - States scripts are included when they are not
- ✅ No computational infrastructure - No way to verify any claim
- ✅ Pre-generated results only - Suggests results may have been manually created
Red Flag Density: Maximum possible. Every critical criterion for a functional submission fails.
---
10. SPECIFIC VIOLATIONS OF REPRODUCIBILITY CLAIMS
The Reproducibility Statement makes specific claims that are demonstrably false:
| Claim | Reality | Violation |
|-------|---------|-----------|
| "Scripts to regenerate all tables and figures from logs are included" | NO scripts present | FALSE |
| "the scenario generator" provided | No generator code | FALSE |
| "prompt templates for all methods" provided | No templates present | FALSE |
| Results can be "regenerate[d]" | Impossible without code | FALSE |
| "runs do not require specialized hardware" | Cannot run at all | MOOT |
---
11. CONCLUSIONS
11.1 Summary of Findings
This submission completely fails to provide a reproducible codebase. It consists entirely of:
- Input data specifications (JSON)
- Pre-generated output results (CSV/JSONL)
- Documentation (Markdown, PDF)
With ZERO implementation code.
11.2 Impact on Scientific Claims
The absence of code means:
- ❌ Cannot verify the method works as described
- ❌ Cannot reproduce any results
- ❌ Cannot validate computational claims (Monte Carlo, CVaR, etc.)
- ❌ Cannot assess implementation correctness
- ❌ Cannot extend or build upon the work
- ❌ Cannot debug or understand failure modes
11.3 Evidence Quality
The submitted data shows major inconsistencies with the paper:
- 8 percentage point discrepancy in key metric (alignment)
- Sign flip in regret values
- Magnitude differences across all metrics
This raises serious questions about data authenticity and whether the paper's claims are based on the submitted results at all.
11.4 Reproducibility Status
Reproducibility Level: ZERO
This submission cannot be used to reproduce anything. It is equivalent to submitting a screenshot of results without any supporting code or methodology.
11.5 Recommendation
This submission should be classified as incomplete and non-functional. The reproducibility statement contains false claims about the contents of the submission. The numerical discrepancies between paper and data suggest potential issues with result reporting integrity.
Required for acceptance:
- Complete implementation code for all methods
- Scripts to generate all results from scratch
- Environment specifications and dependencies
- Explanation of numerical discrepancies
- Honest reproducibility statement reflecting actual submission contents
---
12. VERIFICATION CHECKLIST
- [x] Searched exhaustively for code files (Python, R, Julia, shell scripts, notebooks)
- [x] Checked for compressed archives
- [x] Examined all directories and subdirectories
- [x] Read reproducibility statement
- [x] Compared paper claims to actual data
- [x] Verified file contents and formats
- [x] Checked for configuration/dependency files
- [x] Assessed numerical consistency
- [x] Documented all findings with evidence
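The code-file search in the checklist can be reproduced with a short scan. The sketch below builds a mock of the submitted layout (file names taken from Section 1.1) and confirms that no file carries a code extension:

```python
import pathlib
import tempfile

CODE_EXTENSIONS = {".py", ".ipynb", ".r", ".sh", ".jl"}

def find_code_files(root):
    """Return every file under root whose suffix marks executable code."""
    return [p for p in pathlib.Path(root).rglob("*")
            if p.is_file() and p.suffix.lower() in CODE_EXTENSIONS]

# Mock the submitted tree: data and documentation files only.
with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp) / "sub_291"
    spec = root / "Mentor-Mind" / "A1 Input Specifications"
    spec.mkdir(parents=True)
    for name in ("scenarios.json", "baseline_cot.jsonl", "mentors_text.json"):
        (spec / name).touch()
    (root / "291_methods_results.md").touch()

    print(len(find_code_files(root)))  # 0: no executable code anywhere
```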
Final Verification: This audit is based on exhaustive examination of all files in `/sub_291/`. The finding of zero executable code is definitive and incontrovertible.