
Audit Report: Paper 213

Audit Summary

CODEBASE AUDIT RESULT: HIGH

AGENT REPRODUCIBILITY: True

---

Detailed Code Audit Report - Submission #213

Executive Summary

This submission presents a reinforcement learning (RL) framework for sustainable investment decision-making in office buildings, integrating Deep Q-Networks (DQN) with Large Language Models (LLMs). The codebase demonstrates serious implementation issues that warrant a HIGH severity rating, including:

  1. Critical implementation bug in the greedy evaluation logic (train.py line 15, sensitivity.py line 13, robustness.py line 13)
  2. Incomplete data in the data_sources.csv file (only 11 entries vs. 30+ parameters required)
  3. Missing experimental outputs (no training trajectories, sensitivity results, or robustness results included)
  4. Indentation errors in compute_runtimes.py that prevent execution
  5. Path inconsistencies in the run.sh script

However, the code also shows evidence of a legitimate research effort, as detailed in Section 9.2.

---

1. COMPLETENESS & STRUCTURAL INTEGRITY

1.1 Critical Implementation Bug - Greedy Action Selection

Location: train.py line 15, sensitivity.py line 13, robustness.py line 13

Issue: The greedy action selection uses an incorrect and fragile approach:

train.py line 15:

a = np.argmax(agent.q_target.forward.__self__.net[-1].weight.detach().numpy() @ s +
              agent.q_target.forward.__self__.net[-1].bias.detach().numpy())

Problems:
  1. Incorrect logic: This only uses the final linear layer's weights and biases, completely bypassing the two hidden layers (with 64 neurons each) and their ReLU activations. This is NOT a greedy action selection - it's computing a completely wrong Q-value.
  2. Architectural mismatch: The DQN has 3 layers: Linear(state_dim→64) → ReLU → Linear(64→64) → ReLU → Linear(64→action_dim). The code ignores the first two layers entirely.
  3. Fragile access pattern: Using forward.__self__.net[-1] is an undocumented internal access pattern.
  4. Will produce wrong results: The evaluation trajectories, sensitivity analysis, and robustness tests ALL use this broken logic.
A correct implementation would be:

with torch.no_grad():
    s_tensor = torch.tensor(s, dtype=torch.float32).unsqueeze(0)
    a = int(torch.argmax(agent.q(s_tensor), dim=1).item())

Impact: This bug means that all reported "greedy" evaluation results are NOT actually using the trained policy correctly. The results in the paper may not reflect what the trained DQN actually learned.
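To make the difference concrete, here is a minimal NumPy sketch of the 14 → 64 → 64 → 3 architecture described in this report. The weights are random stand-ins (an assumption; the authors' trained weights are not used) and all function names are hypothetical. It contrasts a full forward pass with the last-layer-only pattern. Under these assumed dimensions the final layer expects a 64-dimensional input, so applying it directly to a raw 14-dimensional state is not even shape-compatible.

```python
import numpy as np

STATE_DIM, HIDDEN, N_ACTIONS = 14, 64, 3  # dimensions as described in this report

rng = np.random.default_rng(0)
# Random stand-in weights (assumption). PyTorch stores Linear weights as (out, in).
W1, b1 = rng.normal(size=(HIDDEN, STATE_DIM)), rng.normal(size=HIDDEN)
W2, b2 = rng.normal(size=(HIDDEN, HIDDEN)), rng.normal(size=HIDDEN)
W3, b3 = rng.normal(size=(N_ACTIONS, HIDDEN)), rng.normal(size=N_ACTIONS)


def q_values(state: np.ndarray) -> np.ndarray:
    """Full forward pass: Linear -> ReLU -> Linear -> ReLU -> Linear."""
    h1 = np.maximum(W1 @ state + b1, 0.0)
    h2 = np.maximum(W2 @ h1 + b2, 0.0)
    return W3 @ h2 + b3


def greedy_action(state: np.ndarray) -> int:
    """Greedy policy: index of the largest Q-value from the full network."""
    return int(np.argmax(q_values(state)))


def last_layer_only(state: np.ndarray) -> np.ndarray:
    """The criticised pattern: final layer applied straight to the raw state."""
    return W3 @ state + b3


state = rng.normal(size=STATE_DIM)
action = greedy_action(state)

try:
    last_layer_only(state)
    shape_compatible = True
except ValueError:
    # A (3, 64) matrix cannot multiply a length-14 vector.
    shape_compatible = False
```

Whether the submitted scripts fail outright or return misleading values depends on what `s` holds at that point in the authors' code, which this sketch does not reproduce.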

1.2 Indentation Error in compute_runtimes.py

Location: compute_runtimes.py lines 134-269

Issue: Lines 134-269 are indented with 8 spaces instead of 4, causing a syntax error that prevents execution.

Line 134 has 8 spaces when it should have 4:

        root = Path(__file__).resolve().parent  # WRONG
    root = Path(__file__).resolve().parent  # CORRECT

Impact: The script will fail immediately with IndentationError when executed. This means the authors could not have run this script as-is to generate the compute_runtimes output.
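A check of this kind can be automated without executing the script. The sketch below parses source text with the standard-library `ast` module and reports any syntax error. The helper name and the two sample snippets are hypothetical; they only mimic an over-indented line following a 4-space block and are not the contents of compute_runtimes.py.

```python
import ast


def syntax_error_of(source: str):
    """Return the SyntaxError raised when parsing `source`, or None if it parses."""
    try:
        ast.parse(source)
    except SyntaxError as exc:  # IndentationError is a subclass of SyntaxError
        return exc
    return None


# Hypothetical snippets: the third line jumps to 8 spaces without opening a block.
over_indented = "def main():\n    x = 1\n        root = 2\n"
well_formed = "def main():\n    x = 1\n    root = 2\n"

bad = syntax_error_of(over_indented)
good = syntax_error_of(well_formed)
```

To apply it to a real file, one could pass `Path("compute_runtimes.py").read_text()` as the source; `python -m py_compile compute_runtimes.py` performs an equivalent check from the command line.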

1.3 Incomplete data_sources.csv

Issue: The provided data_sources.csv contains only 11 parameter entries (4 for UK, 7 for US), but the environment requires 30+ parameters per case, as documented in the code comments and reproducibility statement.

Missing critical parameters:

Workaround: The compute_runtimes.py script (lines 82-116) includes a fallback that generates a minimal CSV with hardcoded values matching paper Table 1 if the file doesn't exist or is incomplete. This suggests the authors were aware of the incomplete data file.

Impact: Without running the fallback generator, the main scripts (train.py, sensitivity.py, robustness.py) will crash with KeyError exceptions.
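A completeness check run before training would surface these gaps as a readable report instead of a KeyError deep inside the environment. The following sketch assumes a long-format CSV with `case` and `parameter` columns. Those column names, the parameter names, and the values are placeholders, not the actual schema of data_sources.csv.

```python
import csv
import io


def missing_parameters(csv_text, required, case_col="case", name_col="parameter"):
    """Map each case to the sorted list of required parameters it lacks."""
    present = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        present.setdefault(row[case_col], set()).add(row[name_col])
    return {case: sorted(set(required) - names) for case, names in present.items()}


# Placeholder data: names and values are illustrative only.
sample = """case,parameter,value
UK,param_a,1.0
UK,param_b,1.0
US,param_a,1.0
"""
required = ["param_a", "param_b", "param_c"]

gaps = missing_parameters(sample, required)
```

Here `gaps` reports `param_c` as missing for UK and both `param_b` and `param_c` as missing for US, which is the kind of summary a reviewer could compare against the 30+ parameters the environment expects.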

1.4 Path Inconsistencies

Location: run.sh line 15

Issue: The script references datasets/llm_eval_reference.csv and datasets/llm_eval_outputs.csv, but these files are in the root directory, not in a datasets/ subdirectory.

run.sh line 15 (incorrect)

python llm_eval.py --ref datasets/llm_eval_reference.csv --pred datasets/llm_eval_outputs.csv ...

Correct paths

python llm_eval.py --ref llm_eval_reference.csv --pred llm_eval_outputs.csv ...

Impact: The run.sh script will fail when executing the LLM evaluation step.
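A small preflight check can catch this class of error before any pipeline step runs. The sketch below recreates the layout described above in a temporary directory (the two CSVs placed in the root, written empty here) and tests both sets of paths. The helper name is hypothetical.

```python
import tempfile
from pathlib import Path


def missing_files(root: Path, relative_paths):
    """Return the relative paths that do not exist as files under `root`."""
    return [p for p in relative_paths if not (root / p).is_file()]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Mimic the reported layout: both CSVs sit in the repository root.
    (root / "llm_eval_reference.csv").write_text("")
    (root / "llm_eval_outputs.csv").write_text("")

    as_written = missing_files(
        root, ["datasets/llm_eval_reference.csv", "datasets/llm_eval_outputs.csv"]
    )
    corrected = missing_files(
        root, ["llm_eval_reference.csv", "llm_eval_outputs.csv"]
    )
```

With this layout, `as_written` lists both `datasets/...` paths as missing while `corrected` is empty.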

1.5 Missing Expected Outputs

According to the Reproducibility_Statement.md, the following intermediate outputs should be included but are missing:

Present outputs:

The README.txt in results/ acknowledges: "training not executed in this environment" due to missing PyTorch dependencies.

1.6 Positive Completeness Indicators

Complete core implementations:

No TODOs or placeholder functions in the main code

No hardcoded results - all metrics are computed from environment dynamics

Proper entry points with argparse-based command-line interfaces

---

2. RESULTS AUTHENTICITY RED FLAGS

2.1 Suspicious Perfect LLM Scores

Location: results/llm_eval_scores.csv

Observation: All three test cases achieve perfect scores (1.000) across all four metrics:

Analysis: While technically possible if the outputs exactly match references, this is unusual and warrants scrutiny. However, examining the files:

Verdict: The perfect scores are explainable because the output file contains trivial exact matches to the reference. This appears to be a minimal test case rather than real LLM output. The actual LLM prompts and responses are documented in llm_prompts_responses.json, which contains more realistic examples.
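The effect is easy to reproduce with any overlap-based metric. The sketch below uses exact match and token-level F1 as stand-ins (an assumption; the four metrics in llm_eval.py are not listed in this report) and made-up strings. It shows that a prediction identical to its reference scores 1.0 on both, while a one-word paraphrase does not.

```python
from collections import Counter


def exact_match(pred: str, ref: str) -> float:
    """1.0 if the stripped strings are identical, else 0.0."""
    return float(pred.strip() == ref.strip())


def token_f1(pred: str, ref: str) -> float:
    """Harmonic mean of token precision and recall (whitespace tokens)."""
    pred_tokens, ref_tokens = pred.split(), ref.split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up strings for illustration only.
reference = "green design lowers energy use"
copied = "green design lowers energy use"
paraphrase = "green design reduces energy use"

copied_scores = (exact_match(copied, reference), token_f1(copied, reference))
paraphrase_scores = (exact_match(paraphrase, reference), token_f1(paraphrase, reference))
```

Genuine model output would usually resemble the paraphrase case, which is why uniform 1.000 scores point to copied references rather than generated text.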

2.2 No Cherry-Picked Seeds

✅ The code consistently uses seed=42 throughout, which is standard practice. No evidence of multiple seed values or cherry-picking in the code.

2.3 No Hardcoded Results

✅ All reward components are computed dynamically:

---

3. IMPLEMENTATION-PAPER CONSISTENCY

3.1 Model Architecture Consistency ✅

Paper claims (from 213_methods_results.md):

Code implementation (dqn_agent.py lines 30-42):

self.net = nn.Sequential(
    nn.Linear(state_dim, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, action_dim)
)

Verification: The layer structure matches exactly. However, the state dimension is 4 (stage) + 3 (design) + 3 (construction) + 4 (continuous) = 14 features, not 20 as claimed in the paper.

INCONSISTENCY: The paper claims 20 features, but the code uses 14. This is a documentation error, not a code error.
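The feature count can be checked by building a state vector directly. The sketch below assumes the three categorical parts are one-hot encoded and concatenated with four continuous values. The encoding scheme and function names are assumptions made for illustration; only the 4 + 3 + 3 + 4 breakdown comes from the audit.

```python
def one_hot(index: int, size: int) -> list[float]:
    """Vector of `size` zeros with a 1.0 at position `index`."""
    if not 0 <= index < size:
        raise ValueError(f"index {index} out of range for size {size}")
    return [1.0 if i == index else 0.0 for i in range(size)]


def encode_state(stage: int, design: int, construction: int, continuous) -> list[float]:
    """Concatenate stage (4), design (3), construction (3) and continuous (4) parts."""
    continuous = list(continuous)
    if len(continuous) != 4:
        raise ValueError("expected 4 continuous features")
    return one_hot(stage, 4) + one_hot(design, 3) + one_hot(construction, 3) + continuous


state = encode_state(stage=0, design=1, construction=2, continuous=[0.0, 0.0, 0.0, 0.0])
state_dim = len(state)
```

Under this encoding `state_dim` is 14, so a paper figure of 20 would need six further features that the audit did not find in the code.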

3.2 Hyperparameters Consistency ✅

| Hyperparameter | Paper Claim | Code Implementation (dqn_agent.py) |
|----------------|-------------|-------------------------------------|
| Learning rate | 0.001 | ✅ lr: float = 1e-3 (line 20) |
| Batch size | 64 | ✅ batch_size: int = 64 (line 22) |
| Replay buffer | 10,000 | ✅ buffer_size: int = 10000 (line 23) |
| ε-greedy | 1 → 0.1 | ✅ epsilon_start: 1.0, epsilon_end: 0.1 (lines 24-25) |
| Target update | Not specified | 250 steps (line 27) |
| Optimizer | Adam | ✅ optim.Adam (line 73) |

Verdict: Excellent consistency between paper and code.

3.3 Reward Function Consistency ✅

Paper description: R(s, a) = −NPV_cost + Σ_i α_i·f_i(s)

Code implementation (rl_env.py):

Verdict: The reward structure matches the paper's multi-objective formulation with weighted scalarization.
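As a worked instance of the weighted scalarization, the sketch below evaluates the formula for one state. The function name, the weights, and the objective values are placeholders chosen for easy arithmetic and are not taken from rl_env.py.

```python
def scalarized_reward(npv_cost: float, weights, objectives) -> float:
    """R = -NPV_cost + sum_i alpha_i * f_i(s) for one state-action pair."""
    if len(weights) != len(objectives):
        raise ValueError("weights and objectives must have the same length")
    return -npv_cost + sum(alpha * f for alpha, f in zip(weights, objectives))


# Placeholder numbers: -100 + (0.5 * 10) + (2.0 * 20) = -55
reward = scalarized_reward(npv_cost=100.0, weights=[0.5, 2.0], objectives=[10.0, 20.0])
```

The cost term enters with a negative sign while each objective is scaled by its weight, so changing a weight α_i shifts the trade-off between cost and that objective.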

3.4 MDP Structure Consistency ✅

Paper claims: Sequential decision-making across three phases (design, construction, operation)

Code: Environment properly models 3 decision stages (lines 128-250):
  1. Stage 1 (Design): 3 actions (conventional/green/ultra)
  2. Stage 2 (Construction): 3 actions (standard/green materials/social)
  3. Stage 3 (Operation): 3 actions (standard/smart/wellness)
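With three stages and three actions each, there are only 3 × 3 × 3 = 27 action sequences, so an exhaustive search gives an independent baseline against which a repaired greedy policy could be compared. The sketch below enumerates them. The scoring function is a placeholder (an assumption), and an exhaustive comparison is only meaningful if the environment's returns are deterministic or averaged over repeated runs.

```python
from itertools import product

# Action labels as listed in this report.
DESIGN = ("conventional", "green", "ultra")
CONSTRUCTION = ("standard", "green materials", "social")
OPERATION = ("standard", "smart", "wellness")
STAGES = (DESIGN, CONSTRUCTION, OPERATION)


def best_trajectory(total_return):
    """Exhaustively search all action sequences for the highest return."""
    return max(product(*STAGES), key=total_return)


def toy_return(trajectory) -> int:
    """Placeholder scoring (assumption): later-listed options score higher."""
    return sum(options.index(action) for options, action in zip(STAGES, trajectory))


all_trajectories = list(product(*STAGES))
best = best_trajectory(toy_return)
```

In practice `toy_return` would be replaced by a function that rolls the sequence through the environment and sums the rewards.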

3.5 Parameter Values - Partial Consistency ⚠️

Issue: The incomplete data_sources.csv prevents full verification. The fallback values in compute_runtimes.py match some paper claims (Table 1 from methods_results.md), but we cannot verify if the clustered data in data_sources_clustered.xlsx is accurate without openpyxl.

---

4. CODE QUALITY SIGNALS

4.1 Dead Code - Low ✅

No significant dead or commented-out code. The codebase is clean and focused.

4.2 Unused Imports - None ✅

All imports are used appropriately:

4.3 Code Duplication - Moderate ⚠️

The eval_greedy function is duplicated in sensitivity.py and robustness.py with identical implementations (including the same bug). This suggests copy-paste without refactoring, but the duplication is minimal and the functions are short.

4.4 Error Handling - Minimal ⚠️

4.5 Development Evidence ✅

Positive indicators: Negative indicators:

---

5. FUNCTIONALITY INDICATORS

5.1 Data Loading - Functional ✅

The _load_params method (rl_env.py lines 71-84) properly loads CSV data with error handling. The environment correctly reads case-specific parameters from the data file.

5.2 Training Loop - Functional ✅

train.py implements a complete training loop. Key components:
  1. ✅ Proper loss computation using MSE of TD error (dqn_agent.py line 108)
  2. ✅ Gradient computation and backpropagation (lines 109-111)
  3. ✅ Target network synchronization every 250 steps (lines 113-114)
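The components above can be sketched in plain Python for a toy batch. This illustrates the standard DQN update and is not the authors' code: the function names are hypothetical, the discount factor `gamma` is a placeholder because the report does not state it, and the sync count assumes one update per environment step.

```python
def td_targets(rewards, next_q_values, dones, gamma):
    """Targets: r for terminal transitions, else r + gamma * max_a Q_target(s', a)."""
    return [
        r if done else r + gamma * max(q_next)
        for r, q_next, done in zip(rewards, next_q_values, dones)
    ]


def mse_loss(predictions, targets):
    """Mean squared TD error over the batch."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)


def should_sync_target(step, every=250):
    """True on steps where the target network would be refreshed."""
    return step % every == 0


# Toy batch of two transitions; gamma=0.5 is a placeholder, not the authors' value.
targets = td_targets(
    rewards=[1.0, 2.0],
    next_q_values=[[0.5, 1.5], [3.0, 0.0]],
    dones=[False, True],
    gamma=0.5,
)
loss = mse_loss(predictions=[1.25, 2.0], targets=targets)

# Assuming one update per environment step: 5000 episodes x 3 steps = 15000 steps.
n_syncs = sum(should_sync_target(step) for step in range(1, 15001))
```

The first target is 1.0 + 0.5 × 1.5 = 1.75 and the second is the bare reward 2.0 because the transition is terminal, giving a loss of 0.125. Under the stated assumption the target network would be synchronized 60 times over a full training run.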

5.3 Evaluation Metrics - Computed ✅

All metrics are actually computed from environment dynamics:

No print-only or fake metrics detected.

5.4 LLM Integration - Documented but Not Automated ⚠️

Documented: llm_prompts_responses.json contains:

Issue: The LLM calls are NOT automated in the code. The JSON file appears to be manually curated examples, not programmatically generated outputs. The paper claims "ChatGPT-5 used with engineered prompts" but there's no API integration code.

LLM evaluation: The llm_eval.py script only evaluates pre-existing outputs; it does not call any LLM.

---

6. DEPENDENCY & ENVIRONMENT ISSUES

6.1 Dependencies - Standard ✅

requirements.txt:

pandas>=2.0.0
numpy>=1.24.0
torch>=2.0.0

All dependencies are standard, widely-available packages. No exotic or unmaintained libraries.

6.2 Missing Optional Dependencies ⚠️

6.3 Computational Resources - Reasonable ✅

Training: 5000 episodes of a simple 3-step MDP with a 14-dimensional state space

Estimated runtime: Seconds to minutes on a modern CPU, which matches the paper's claim of "feasible CPU-only reproduction."

6.4 Python Version - Compatible ✅

Code uses standard Python 3.10+ features. No incompatibilities detected.

---

7. AGENT REPRODUCIBILITY ASSESSMENT

7.1 Evidence of LLM Usage ✅ TRUE

File: llm_prompts_responses.json

This file explicitly documents:

  1. Parameter extraction prompts used to query LLMs for numeric parameters from guidance documents
  2. Explanation generation prompts with specific instructions and expected outputs
  3. Evaluation protocol for automatic scoring
Example prompt:

> "According to UKGBC (United Kingdom Green Building Council) guidance on green offices, what is a typical range of energy savings versus a code-minimum baseline for high‑performance design? Reply with a single number (%) you recommend for modelling and cite the guidance succinctly."

Response documented:

> "45% — High‑performance office refurbishments often demonstrate 30–50% lower energy intensity versus code‑minimum; we adopt the midpoint for modelling consistency (UKGBC)."

7.2 AI Agent Usage in Research Process ✅

The submission demonstrates:

VERDICT: The researchers documented their LLM usage transparently, including exact prompts and sample responses. This satisfies the criterion for AGENT REPRODUCIBILITY: True.

---

8. SPECIFIC RED FLAGS CHECKLIST

| Red Flag | Present? | Details |
|----------|----------|---------|
| TODOs in critical paths | ❌ No | No TODOs found |
| Placeholder functions | ❌ No | All functions are implemented |
| Hardcoded results | ❌ No | Results computed dynamically |
| Missing critical files | ⚠️ Partial | Data file incomplete, but fallback exists |
| Cherry-picked seeds | ❌ No | Consistent use of seed=42 |
| Broken imports | ❌ No | All imports are valid |
| Unrealistic resources | ❌ No | CPU-feasible training |
| Code duplication | ⚠️ Minor | eval_greedy duplicated |
| Critical bugs | ✅ YES | Greedy evaluation bug affects all results |
| Indentation errors | ✅ YES | compute_runtimes.py cannot run |
| Path inconsistencies | ✅ YES | run.sh has wrong paths |

---

9. CONCLUSIONS

9.1 Severity Assessment: HIGH

The codebase contains critical implementation bugs that invalidate the reported results:

  1. Greedy evaluation bug: The most serious issue. All evaluation, sensitivity, and robustness results use incorrect Q-values because the code bypasses two of the network's three linear layers. This means the reported performance numbers may not reflect what the DQN actually learned.
  2. Cannot reproduce as-is: Multiple execution-blocking issues (indentation error, path inconsistencies, incomplete data) prevent running the code without modifications.
  3. Incomplete submission: Missing all main experimental outputs (trajectories, trained models, sensitivity/robustness results).

9.2 Evidence of Legitimate Research Effort

Despite the critical bugs, there is substantial evidence this is real research:

Complete implementations: All core components (environment, agent, training) are fully implemented

Correct architecture: DQN implementation matches modern standards

Proper methodology: Reward function, MDP structure, and hyperparameters align with paper

Comprehensive documentation: README, reproducibility statement, and code comments

LLM transparency: Documented prompts and evaluation methodology

9.3 Likely Explanation

This appears to be a hastily prepared submission.

The core research is likely valid, but the submission artifacts are not reproducible without significant fixes.

9.4 Required Fixes for Reproducibility

To make this code reproducible, the authors must:

  1. FIX CRITICAL BUG: Replace line 15 of train.py and line 13 of both sensitivity.py and robustness.py with proper greedy selection
  2. Fix indentation: Correct lines 134-269 in compute_runtimes.py
  3. Complete data file: Either include full data_sources.csv or add openpyxl to requirements.txt and document the build process
  4. Fix paths: Update run.sh line 15 to use correct file paths
  5. Include outputs: Run experiments and include all trajectory, sensitivity, and robustness CSVs
  6. Include models: Provide trained model files (dqn_US.pt, dqn_UK.pt)

9.5 Recommendations for Reviewers

  1. Request corrected submission: The bugs are fixable, but results must be regenerated
  2. Verify re-run results: Once fixed, results should be reproduced independently
  3. Check feature count: Clarify why paper claims 20 features but code has 14
  4. Validate LLM claims: Verify that LLM was actually used for parameter extraction (not just documented as an example)

---

10. AGENT REPRODUCIBILITY

AGENT REPRODUCIBILITY: True

Justification: The submission includes llm_prompts_responses.json which explicitly documents:

While the LLM integration is not automated in code (manual prompts rather than API calls), the researchers have transparently documented their AI agent usage in the research process, including prompts that generated parameters used in the code/analysis.

---

FINAL VERDICT

CODEBASE AUDIT RESULT: HIGH

The submission contains critical bugs that invalidate reported results, multiple execution-blocking errors, and missing experimental outputs. However, the core implementation demonstrates legitimate research effort with proper RL methodology. The code requires substantial fixes before it can be considered reproducible.

AGENT REPRODUCIBILITY: True

The researchers documented their LLM usage with exact prompts and sample responses, satisfying the transparency requirement for AI agent usage in the research process.