Audit Summary

CODEBASE AUDIT RESULT: HIGH
AGENT REPRODUCIBILITY: True

---

Detailed Code Audit Report - Submission #213

Executive Summary

This submission presents a reinforcement learning (RL) framework for sustainable investment decision-making in office buildings, integrating Deep Q-Networks (DQN) with Large Language Models (LLMs). The codebase demonstrates serious implementation issues that warrant a HIGH severity rating, including:

Critical implementation bug in the greedy evaluation logic (train.py line 15, sensitivity.py line 13, robustness.py line 13)
Incomplete data in the data_sources.csv file (only 11 entries vs. 30+ parameters required)
Missing experimental outputs (no training trajectories, sensitivity results, or robustness results included)
Indentation errors in compute_runtimes.py that prevent execution
Path inconsistencies in the run.sh script

However, the code shows evidence of a legitimate research effort with:

Complete RL environment implementation with proper reward computation
Functional DQN agent with standard architecture matching paper claims
Comprehensive documentation and reproducibility statement
Evidence of LLM usage documented in llm_prompts_responses.json (AGENT REPRODUCIBILITY = True)

---

1. COMPLETENESS & STRUCTURAL INTEGRITY

1.1 Critical Implementation Bug - Greedy Action Selection

Location: train.py line 15, sensitivity.py line 13, robustness.py line 13
Issue: The greedy action selection uses an incorrect and fragile approach:

train.py line 15
a = np.argmax(agent.q_target.forward.__self__.net[-1].weight.detach().numpy() @ s +
              agent.q_target.forward.__self__.net[-1].bias.detach().numpy())

Problems:

Incorrect logic: This only uses the final linear layer's weights and biases, completely bypassing the two hidden layers (with 64 neurons each) and their ReLU activations. This is NOT a greedy action selection - it's computing a completely wrong Q-value.
Architectural mismatch: The DQN has 3 layers: Linear(state_dim→64) → ReLU → Linear(64→64) → ReLU → Linear(64→action_dim). The code ignores the first two layers entirely.
Fragile access pattern: Using forward.__self__.net[-1] is an undocumented internal access pattern.
Will produce wrong results: The evaluation trajectories, sensitivity analysis, and robustness tests ALL use this broken logic.

Correct implementation should be:

with torch.no_grad():
    s_tensor = torch.tensor(s, dtype=torch.float32).unsqueeze(0)
    a = int(torch.argmax(agent.q(s_tensor), dim=1).item())

Impact: This bug means that all reported "greedy" evaluation results are NOT actually using the trained policy correctly. The results in the paper may not reflect what the trained DQN actually learned.
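To make the discrepancy concrete, the following dependency-free sketch rebuilds the layer structure described above with plain Python lists instead of PyTorch. The weights are random, `N_ACTIONS = 3` is an assumption, and all helper names are hypothetical. It computes a greedy action through the full forward pass and then tries to apply only the final layer to the raw state. If the state really is 14-dimensional (as Section 3.1 reports) while the last layer takes 64 inputs, the product in the quoted line would be shape-incompatible rather than merely inaccurate, and the sketch surfaces that mismatch.

```python
import random

def linear(weights, bias, x):
    """Apply y = W x + b, refusing inputs whose size does not match W."""
    if any(len(row) != len(x) for row in weights):
        raise ValueError(f"expected {len(weights[0])} inputs, got {len(x)}")
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def make_layer(rng, n_in, n_out):
    """Random weights and biases; stands in for a trained layer."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    bias = [rng.uniform(-1, 1) for _ in range(n_out)]
    return weights, bias

def q_values(layers, state):
    """Full forward pass: Linear -> ReLU -> Linear -> ReLU -> Linear."""
    h = state
    for i, (w, b) in enumerate(layers):
        h = linear(w, b, h)
        if i < len(layers) - 1:  # no activation after the output layer
            h = relu(h)
    return h

def greedy_action(layers, state):
    q = q_values(layers, state)
    return max(range(len(q)), key=q.__getitem__)

rng = random.Random(0)
STATE_DIM, HIDDEN, N_ACTIONS = 14, 64, 3  # N_ACTIONS is an assumption
layers = [make_layer(rng, STATE_DIM, HIDDEN),
          make_layer(rng, HIDDEN, HIDDEN),
          make_layer(rng, HIDDEN, N_ACTIONS)]
state = [rng.random() for _ in range(STATE_DIM)]

action = greedy_action(layers, state)

# Applying only the final layer to the raw state: it expects HIDDEN inputs.
try:
    linear(*layers[-1], state)
    last_layer_only_failed = False
except ValueError:
    last_layer_only_failed = True
```

Whether the submitted code raises or silently produces numbers depends on the actual shape of `s`, which the audit does not state; either way the value is not the network's Q-function.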

1.2 Indentation Error in compute_runtimes.py

Location: compute_runtimes.py lines 134-269
Issue: These lines are indented with 8 spaces instead of 4, which raises an IndentationError (a subclass of SyntaxError) and prevents execution.

Line 134 has 8 spaces when it should have 4
        root = Path(__file__).resolve().parent  # WRONG
    root = Path(__file__).resolve().parent      # CORRECT

Impact: The script will fail immediately with IndentationError when executed. This means the authors could not have run this script as-is to generate the compute_runtimes output.
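A check of this kind can be automated without running the script. The sketch below uses the built-in `compile()` on two short, made-up snippets that mirror the reported pattern; the function name `syntax_problem` is hypothetical and the snippets are not taken from compute_runtimes.py.

```python
def syntax_problem(source, filename="<string>"):
    """Return a short description of the first syntax problem, or None if the source compiles."""
    try:
        compile(source, filename, "exec")
    except SyntaxError as exc:  # IndentationError is a subclass of SyntaxError
        return f"{type(exc).__name__} at line {exc.lineno}: {exc.msg}"
    return None

# Made-up snippet: the third line is over-indented (8 spaces instead of 4).
broken = (
    "def main():\n"
    "    x = 1\n"
    "        root = 2\n"
)

# Same snippet with consistent 4-space indentation.
fixed = (
    "def main():\n"
    "    x = 1\n"
    "    root = 2\n"
)

print(syntax_problem(broken))
print(syntax_problem(fixed))
```

From the command line, `python -m py_compile compute_runtimes.py` performs the same compile-only check on the real file.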

1.3 Incomplete data_sources.csv

Issue: The provided data_sources.csv contains only 11 parameter entries (4 for UK, 7 for US), but the environment requires 30+ parameters per case as documented in the code comments and reproducibility statement. Missing critical parameters:

construction_cost
design_premium_green, design_premium_ultra
discount_rate, gamma, energy_noise
electricity_price
embodied_carbon_baseline, embodied_carbon_reduction_green
grid_carbon_intensity
high_perf_eui
job_creation_baseline, job_creation_enhanced
occupants_per_10000m2
operation_years
productivity_gain_hq_ieq
scc (Social Cost of Carbon)
value_of_1pct_productivity

Workaround: The compute_runtimes.py script (lines 82-116) includes a fallback to generate a minimal CSV with hardcoded values matching paper Table 1 if the file doesn't exist or is incomplete. This suggests the authors were aware of the incomplete data file.
Impact: Without running the fallback generator, the main scripts (train.py, sensitivity.py, robustness.py) will crash with KeyError exceptions.
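A pre-flight check would turn the late KeyError into an explicit list of missing parameters. The sketch below is illustrative only: the column names `case`, `parameter` and `value` are assumptions about the CSV layout, the numeric values in `sample` are made up, and `REQUIRED` is a short excerpt of the list above rather than the full set.

```python
import csv
import io

# Excerpt of the parameters listed as missing; extend to the full set in practice.
REQUIRED = {"construction_cost", "discount_rate", "electricity_price", "scc"}

def missing_parameters(csv_text, case, required=REQUIRED):
    """Return the required parameter names that have no row for the given case."""
    reader = csv.DictReader(io.StringIO(csv_text))
    present = {row["parameter"] for row in reader if row["case"] == case}
    return sorted(set(required) - present)

# Made-up rows, for illustration only.
sample = (
    "case,parameter,value\n"
    "UK,discount_rate,0.035\n"
    "UK,scc,100\n"
    "US,scc,51\n"
)

print(missing_parameters(sample, "UK"))
```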

1.4 Path Inconsistencies

Location: run.sh line 15
Issue: The script references datasets/llm_eval_reference.csv and datasets/llm_eval_outputs.csv, but these files are in the root directory, not in a datasets/ subdirectory.

run.sh line 15 (incorrect)
python llm_eval.py --ref datasets/llm_eval_reference.csv --pred datasets/llm_eval_outputs.csv ...

Correct paths
python llm_eval.py --ref llm_eval_reference.csv --pred llm_eval_outputs.csv ...

Impact: The run.sh script will fail when executing the LLM evaluation step.

1.5 Missing Expected Outputs

According to the Reproducibility_Statement.md, the following intermediate outputs should be included but are missing:

results/US_trajectory.csv
results/UK_trajectory.csv
results/sensitivity_US.csv
results/sensitivity_UK.csv
results/robustness_US.csv
results/robustness_UK.csv
Training model files: models/dqn_US.pt, models/dqn_UK.pt

Present outputs:

results/llm_eval_scores.csv (minimal, 3 test cases with perfect scores)
results/compute_runtimes_times.json (placeholder with null values and note about missing dependencies)

The README.txt in results/ acknowledges: "training not executed in this environment" due to missing PyTorch dependencies.
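The presence check itself is easy to script. A minimal sketch, assuming the paths listed above are relative to the repository root (the function name `missing_outputs` is hypothetical):

```python
from pathlib import Path

# File list copied from the missing-outputs list above.
EXPECTED_OUTPUTS = [
    "results/US_trajectory.csv",
    "results/UK_trajectory.csv",
    "results/sensitivity_US.csv",
    "results/sensitivity_UK.csv",
    "results/robustness_US.csv",
    "results/robustness_UK.csv",
    "models/dqn_US.pt",
    "models/dqn_UK.pt",
]

def missing_outputs(root, expected=EXPECTED_OUTPUTS):
    """Return the expected relative paths that are not regular files under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_file()]
```

Calling `missing_outputs(".")` from the repository root returns the files that still need to be generated or shipped.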

1.6 Positive Completeness Indicators

✅ Complete core implementations:

rl_env.py: Full 265-line environment with proper MDP structure
dqn_agent.py: Complete 127-line DQN implementation with replay buffer
train.py: Training loop with experience replay and target network updates
sensitivity.py: Sensitivity analysis across 5 scenarios
robustness.py: Robustness testing with noise and climate scenarios
llm_eval.py: Automatic evaluation metrics implementation
build_parameters_from_clusters.py: 166-line parameter aggregation utility

✅ No TODOs or placeholder functions in the main code

✅ No hardcoded results - all metrics are computed from environment dynamics

✅ Proper entry points with argparse-based command-line interfaces

---

2. RESULTS AUTHENTICITY RED FLAGS

2.1 Suspicious Perfect LLM Scores

Location: results/llm_eval_scores.csv
Observation: All three test cases achieve perfect scores (1.000) across all four metrics:

exact_match: 1.000
token_f1: 1.000
numeric_consistency: 1.000
coverage: 1.000

Analysis: While technically possible if the outputs exactly match references, this is unusual and warrants scrutiny. However, examining the files:

llm_eval_reference.csv: Contains 3 questions with reference answers
llm_eval_outputs.csv: Contains 3 answers that appear to exactly match references

Verdict: The perfect scores are explainable because the output file contains trivial exact matches to the reference. This appears to be a minimal test case rather than real LLM output. The actual LLM prompts and responses are documented in llm_prompts_responses.json, which contains more realistic examples.
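To illustrate why identical strings necessarily score 1.000, here is a sketch of two of the metrics. The definitions are the common whitespace-token versions and are an assumption; the audit does not quote how llm_eval.py implements them.

```python
from collections import Counter

def exact_match(pred, ref):
    """1.0 if the strings are equal after trimming and lowercasing, else 0.0."""
    return float(pred.strip().lower() == ref.strip().lower())

def token_f1(pred, ref):
    """Harmonic mean of token precision and recall over whitespace tokens."""
    p, r = pred.lower().split(), ref.lower().split()
    if not p or not r:
        return float(p == r)
    overlap = sum((Counter(p) & Counter(r)).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)
```

Under these definitions any output copied verbatim from the reference scores 1.0 on both metrics, so perfect scores on three copied answers say nothing about LLM quality.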

2.2 No Cherry-Picked Seeds

✅ The code consistently uses seed=42 throughout, which is standard practice. No evidence of multiple seed values or cherry-picking in the code.

2.3 No Hardcoded Results

✅ All reward components are computed dynamically:

Energy costs calculated from EUI × price × NPV factor (rl_env.py lines 207-212)
Carbon costs computed from emissions × SCC (lines 214-216)
Productivity values calculated from occupants × gains × NPV factor (lines 228-231)
No magic numbers or hardcoded performance metrics
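For readers unfamiliar with the term, an "NPV factor" is typically the sum of discount factors over the operating period. The sketch below assumes a standard end-of-year annuity factor; the audit does not quote the exact formula used in rl_env.py, so both functions and their names are hypothetical.

```python
def npv_factor(discount_rate, years):
    """Sum of 1 / (1 + r)^t for t = 1..years (end-of-year cash flows assumed)."""
    return sum(1.0 / (1.0 + discount_rate) ** t for t in range(1, years + 1))

def energy_cost_npv(eui, price, discount_rate, years):
    """Discounted energy cost per unit floor area: EUI x price x NPV factor."""
    return eui * price * npv_factor(discount_rate, years)
```

With a zero discount rate the factor reduces to the number of years, which is a quick sanity check on any implementation.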

---

3. IMPLEMENTATION-PAPER CONSISTENCY

3.1 Model Architecture Consistency ✅

Paper claims (from 213_methods_results.md):

Input layer: 20 features after one-hot encoding
Two hidden layers: 64 neurons each, ReLU activation
Output layer: one node per action

Code implementation (dqn_agent.py lines 30-42):

self.net = nn.Sequential(
    nn.Linear(state_dim, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, action_dim)
)

Verification: The layer structure matches exactly (two 64-unit hidden layers with ReLU, followed by a linear output layer). The input dimension does not: the code builds a state of 4 (stage) + 3 (design) + 3 (construction) + 4 (continuous) = 14 features, whereas the paper claims 20. INCONSISTENCY: Paper claims 20 features, but code uses 14 features. This is a documentation error, not a code error.
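The feature count can be reproduced by assembling the state the way the breakdown above describes. This is a sketch only: the function names and the ordering of the blocks are assumptions, and the four continuous features are left unnamed because the audit does not list them.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a 1.0 at position `index`."""
    if not 0 <= index < size:
        raise ValueError(f"index {index} out of range for size {size}")
    return [1.0 if i == index else 0.0 for i in range(size)]

def encode_state(stage, design, construction, continuous):
    """Concatenate 4 + 3 + 3 one-hot entries with 4 continuous features."""
    if len(continuous) != 4:
        raise ValueError("expected 4 continuous features")
    return (one_hot(stage, 4) + one_hot(design, 3)
            + one_hot(construction, 3) + list(continuous))

state = encode_state(stage=0, design=1, construction=2,
                     continuous=[0.1, 0.2, 0.3, 0.4])
print(len(state))
```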

3.2 Hyperparameters Consistency ✅

| Hyperparameter | Paper Claim | Code Implementation (dqn_agent.py) |
|----------------|-------------|-------------------------------------|
| Learning rate | 0.001 | ✅ lr: float = 1e-3 (line 20) |
| Batch size | 64 | ✅ batch_size: int = 64 (line 22) |
| Replay buffer | 10,000 | ✅ buffer_size: int = 10000 (line 23) |
| ε-greedy | 1 → 0.1 | ✅ epsilon_start: 1.0, epsilon_end: 0.1 (lines 24-25) |
| Target update | Not specified | 250 steps (line 27) |
| Optimizer | Adam | ✅ optim.Adam (line 73) |

Verdict: Excellent consistency between paper and code.
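As an illustration of the ε-greedy row, the sketch below anneals ε from 1.0 to 0.1 and uses it for action selection. The linear schedule and the `decay_steps` value are assumptions; the audit records only the start and end values, not the decay form used in dqn_agent.py.

```python
import random

def epsilon_at(step, start=1.0, end=0.1, decay_steps=1000):
    """Linearly anneal epsilon from start to end over decay_steps, then hold at end."""
    frac = min(max(step, 0) / decay_steps, 1.0)
    return start + frac * (end - start)

def select_action(q_values, epsilon, rng=random):
    """Explore uniformly with probability epsilon, otherwise take the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)
```

Greedy evaluation corresponds to calling `select_action` with `epsilon=0.0`, which always returns the argmax.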

3.3 Reward Function Consistency ✅

Paper description: $R(s, a) = -\mathrm{NPV}_{\mathrm{cost}} + \sum_i \alpha_i f_i(s)$
Code implementation (rl_env.py):

Design stage: -capex (line 150)
Construction: -carbon_cost * alpha_carbon + jobs_value (lines 175, 181)
Operation: -total_npv_costs + total_benefits (line 236)

Verdict: The reward structure matches the paper's multi-objective formulation with weighted scalarization.
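Weighted scalarization as described by the paper's formula can be sketched in a few lines. The objective names and numbers below are made up for illustration, and the function name is hypothetical.

```python
def scalarized_reward(npv_cost, objectives, weights):
    """Return -NPV_cost + sum_i alpha_i * f_i, matching objectives to weights by key."""
    if objectives.keys() != weights.keys():
        raise ValueError("objectives and weights must have the same keys")
    return -npv_cost + sum(weights[k] * objectives[k] for k in objectives)

# Made-up values: a carbon penalty (negative benefit) and a jobs benefit.
reward = scalarized_reward(
    npv_cost=100.0,
    objectives={"carbon": -20.0, "jobs": 5.0},
    weights={"carbon": 0.5, "jobs": 2.0},
)
print(reward)
```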

3.4 MDP Structure Consistency ✅

Paper claims: Sequential decision-making across three phases (design, construction, operation)
Code: Environment properly models 3 decision stages (lines 128-250):

Stage 1 (Design): 3 actions (conventional/green/ultra)
Stage 2 (Construction): 3 actions (standard/green materials/social)
Stage 3 (Operation): 3 actions (standard/smart/wellness)

3.5 Parameter Values - Partial Consistency ⚠️

Issue: The incomplete data_sources.csv prevents full verification. The fallback values in compute_runtimes.py match some paper claims (Table 1 from methods_results.md), but we cannot verify if the clustered data in data_sources_clustered.xlsx is accurate without openpyxl.

---

4. CODE QUALITY SIGNALS

4.1 Dead Code - Low ✅

No significant dead or commented-out code. The codebase is clean and focused.

4.2 Unused Imports - None ✅

All imports are used appropriately:

torch, numpy, pandas: used throughout
argparse, csv, json, os: used for CLI and I/O
collections.deque: used for replay buffer
dataclasses: used for configuration

4.3 Code Duplication - Moderate ⚠️

The eval_greedy function is duplicated in sensitivity.py and robustness.py with identical implementations (including the same bug). This suggests copy-paste without refactoring, but the duplication is minimal and the functions are short.
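One way to remove the duplication is a single shared helper that both scripts import. The sketch below is framework-free and rests on assumptions: the environment is taken to expose `reset()` returning a state and `step(action)` returning `(state, reward, done)`, and `q_fn` is any callable that maps a state to a list of Q-values from the full network. `ToyEnv` is a stand-in for testing, not the submission's environment.

```python
def eval_greedy(env, q_fn, max_steps=100):
    """Roll out one episode taking argmax_a q_fn(state)[a]; return (total_reward, actions)."""
    state = env.reset()
    total, actions = 0.0, []
    for _ in range(max_steps):
        q = q_fn(state)
        action = max(range(len(q)), key=q.__getitem__)
        state, reward, done = env.step(action)
        total += reward
        actions.append(action)
        if done:
            break
    return total, actions

class ToyEnv:
    """Three-step stand-in environment; the reward equals the chosen action index."""
    def reset(self):
        self.t = 0
        return [0.0]

    def step(self, action):
        self.t += 1
        return [float(self.t)], float(action), self.t >= 3

total, actions = eval_greedy(ToyEnv(), lambda s: [0.1, 0.9, 0.3])
```

With one definition, a fix to the greedy logic only has to be made once.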

4.4 Error Handling - Minimal ⚠️

Environment loading has error handling for missing parameters (rl_env.py line 87)
PyTorch import has try-catch with helpful error message (dqn_agent.py lines 9-14)
Build script has robust error handling (build_parameters_from_clusters.py)
Missing: No validation that loaded model files match environment dimensions
Missing: No checks for NaN/Inf values in training
Missing: No validation of CSV file formats
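The first two missing checks are small. A minimal sketch, with hypothetical function names, of what such guards could look like:

```python
import math

def assert_finite(values, label="values"):
    """Raise ValueError if any entry is NaN or infinite."""
    bad = [i for i, v in enumerate(values) if not math.isfinite(v)]
    if bad:
        raise ValueError(f"{label} contains non-finite entries at indices {bad}")

def assert_same_dim(env_state_dim, model_input_dim):
    """Raise ValueError if a loaded model expects a different state size than the environment emits."""
    if env_state_dim != model_input_dim:
        raise ValueError(
            f"environment state has {env_state_dim} features, "
            f"model expects {model_input_dim}"
        )
```

Calling `assert_finite` on each batch of losses or Q-values, and `assert_same_dim` right after loading a checkpoint, would surface both failure modes early.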

4.5 Development Evidence ✅

Positive indicators:

Consistent coding style throughout
Proper use of dataclasses for configuration
Type hints in function signatures (build_parameters_from_clusters.py)
Comprehensive docstrings in key files
Version control artifacts not present (cleaned for submission)

Negative indicators:

The critical bug in greedy evaluation suggests limited testing
Indentation error suggests code was not run as-is
Path inconsistencies suggest last-minute changes

---

5. FUNCTIONALITY INDICATORS

5.1 Data Loading - Functional ✅

The _load_params method (rl_env.py lines 71-84) properly loads CSV data with error handling. The environment correctly reads case-specific parameters from the data file.

5.2 Training Loop - Functional ✅

train.py implements a complete training loop with:

Experience replay (line 22: agent.remember)
Batch updates (line 23: agent.update)
Epsilon decay (handled in DQNAgent.select_action)
Target network updates (dqn_agent.py lines 113-114)
Model checkpointing (line 52: agent.save)

Key components:

✅ Proper loss computation using MSE of TD error (dqn_agent.py line 108)
✅ Gradient computation and backpropagation (lines 109-111)
✅ Target network synchronization every 250 steps (lines 113-114)
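For reference, the standard DQN update that these components implement can be written without any framework. This is the textbook form, shown with plain lists; it is not copied from dqn_agent.py, and the function names are hypothetical.

```python
def td_targets(rewards, next_q_values, dones, gamma):
    """y = r + gamma * max_a' Q_target(s', a'), with no bootstrap on terminal transitions."""
    return [r + (0.0 if done else gamma * max(q))
            for r, q, done in zip(rewards, next_q_values, dones)]

def mse(predictions, targets):
    """Mean squared TD error over a batch."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

targets = td_targets(
    rewards=[1.0, 2.0],
    next_q_values=[[0.5, 1.5], [3.0, 0.0]],  # target-network outputs for s'
    dones=[False, True],
    gamma=0.9,
)
```

In the second transition the episode has ended, so the target is the reward alone.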

5.3 Evaluation Metrics - Computed ✅

All metrics are actually computed from environment dynamics:

Energy use: calculated from state.eui (line 208)
Energy costs: NPV calculated with proper discount factors (lines 211-212)
Carbon emissions: computed from grid carbon intensity (line 215)
Productivity value: computed from occupant count and gains (line 231)

No print-only or fake metrics detected.

5.4 LLM Integration - Documented but Not Automated ⚠️

Documented: llm_prompts_responses.json contains:

2 parameter extraction prompts with responses
2 explanation generation prompts (US and UK cases)
Evaluation protocol description

Issue: The LLM calls are NOT automated in the code. The JSON file appears to be manually curated examples, not programmatically generated outputs. The paper claims "ChatGPT-5 used with engineered prompts" but there's no API integration code.
LLM evaluation: The llm_eval.py script only evaluates pre-existing outputs; it doesn't call any LLM.

---

6. DEPENDENCY & ENVIRONMENT ISSUES

6.1 Dependencies - Standard ✅

requirements.txt:

pandas>=2.0.0
numpy>=1.24.0
torch>=2.0.0

All dependencies are standard, widely-available packages. No exotic or unmaintained libraries.

6.2 Missing Optional Dependencies ⚠️

openpyxl: Required to read data_sources_clustered.xlsx but not listed in requirements.txt
psutil: Used in compute_runtimes.py for hardware detection but handled gracefully if missing

6.3 Computational Resources - Reasonable ✅

Training: 5000 episodes of a simple 3-step MDP with 14-dimensional state space

Network: roughly 5K to 6K parameters for the stated layer sizes (small)
No GPU required (CPU sufficient)
Memory: Replay buffer holds 10,000 transitions of 14 floats (~560 KB for the states alone at 4 bytes per float)
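These size estimates can be checked with a few lines of arithmetic. The sketch assumes 3 output actions, 4-byte floats, and a transition layout of state, action, reward, next state and done flag; none of these is confirmed by the audit, and the function names are hypothetical.

```python
def mlp_param_count(sizes):
    """Weights plus biases for a fully connected network with the given layer sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

def buffer_bytes(capacity, state_dim, bytes_per_float=4):
    """Approximate replay-buffer size: s, s' plus action, reward, done per transition."""
    floats_per_transition = 2 * state_dim + 3
    return capacity * floats_per_transition * bytes_per_float

params = mlp_param_count([14, 64, 64, 3])   # 3 actions assumed
states_only = 10_000 * 14 * 4               # bytes for the states alone
full_buffer = buffer_bytes(10_000, 14)      # bytes including s', a, r, done
```

Under these assumptions the network has 5,315 parameters, the states alone occupy 560,000 bytes, and full transitions occupy about 1.24 MB, all comfortably within CPU-only budgets.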

Estimated runtime: Seconds to minutes on modern CPU, which matches the paper's claim of "feasible CPU-only reproduction."

6.4 Python Version - Compatible ✅

Code uses standard Python 3.10+ features. No incompatibilities detected.

---

7. AGENT REPRODUCIBILITY ASSESSMENT

7.1 Evidence of LLM Usage ✅ TRUE

File: llm_prompts_responses.json

This file explicitly documents:

Parameter extraction prompts used to query LLMs for numeric parameters from guidance documents
Explanation generation prompts with specific instructions and expected outputs
Evaluation protocol for automatic scoring

Example prompt:

> "According to UKGBC (United Kingdom Green Building Council) guidance on green offices, what is a typical range of energy savings versus a code-minimum baseline for high‑performance design? Reply with a single number (%) you recommend for modelling and cite the guidance succinctly."

Response documented:

> "45% — High‑performance office refurbishments often demonstrate 30–50% lower energy intensity versus code‑minimum; we adopt the midpoint for modelling consistency (UKGBC)."

7.2 AI Agent Usage in Research Process ✅

The submission demonstrates:

Prompt engineering: Carefully crafted prompts for parameter extraction
Output formatting: Structured prompts requesting specific formats (e.g., "120-150 words")
Quality control: Automatic evaluation metrics (exact_match, token_f1, numeric_consistency)

VERDICT: The researchers documented their LLM usage transparently, including exact prompts and sample responses. This satisfies the criterion for AGENT REPRODUCIBILITY: True.

---

8. SPECIFIC RED FLAGS CHECKLIST

| Red Flag | Present? | Details |
|----------|----------|---------|
| TODOs in critical paths | ❌ No | No TODOs found |
| Placeholder functions | ❌ No | All functions are implemented |
| Hardcoded results | ❌ No | Results computed dynamically |
| Missing critical files | ⚠️ Partial | Data file incomplete, but fallback exists |
| Cherry-picked seeds | ❌ No | Consistent use of seed=42 |
| Broken imports | ❌ No | All imports are valid |
| Unrealistic resources | ❌ No | CPU-feasible training |
| Code duplication | ⚠️ Minor | eval_greedy duplicated |
| Critical bugs | ✅ YES | Greedy evaluation bug affects all results |
| Indentation errors | ✅ YES | compute_runtimes.py cannot run |
| Path inconsistencies | ✅ YES | run.sh has wrong paths |

---

9. CONCLUSIONS

9.1 Severity Assessment: HIGH

The codebase contains critical implementation bugs that invalidate the reported results:

Greedy evaluation bug: The most serious issue. All evaluation, sensitivity, and robustness results use incorrect Q-values by bypassing 2/3 of the neural network. This means the reported performance numbers may not reflect what the DQN actually learned.

Cannot reproduce as-is: Multiple execution-blocking issues (indentation error, path inconsistencies, incomplete data) prevent running the code without modifications.

Incomplete submission: Missing all main experimental outputs (trajectories, trained models, sensitivity/robustness results).

9.2 Evidence of Legitimate Research Effort

Despite the critical bugs, there is substantial evidence this is real research:

✅ Complete implementations: All core components (environment, agent, training) are fully implemented

✅ Correct architecture: DQN implementation matches modern standards

✅ Proper methodology: Reward function, MDP structure, and hyperparameters align with paper

✅ Comprehensive documentation: README, reproducibility statement, and code comments

✅ LLM transparency: Documented prompts and evaluation methodology

9.3 Likely Explanation

This appears to be a hastily prepared submission with:

Working research code that was modified for submission
Last-minute changes that introduced bugs (greedy evaluation refactoring?)
Copy-paste errors (indentation in compute_runtimes.py)
Incomplete data export process (missing full CSV, Excel dependency)
Inability to run full experiments in submission environment (no PyTorch available)

The core research is likely valid, but the submission artifacts are not reproducible without significant fixes.

9.4 Required Fixes for Reproducibility

To make this code reproducible, the authors must:

FIX CRITICAL BUG: Replace line 15 of train.py and line 13 of both sensitivity.py and robustness.py with proper greedy selection through the full network
Fix indentation: Correct lines 134-269 in compute_runtimes.py
Complete data file: Either include full data_sources.csv or add openpyxl to requirements.txt and document the build process
Fix paths: Update run.sh line 15 to use correct file paths
Include outputs: Run experiments and include all trajectory, sensitivity, and robustness CSVs
Include models: Provide trained model files (dqn_US.pt, dqn_UK.pt)

9.5 Recommendations for Reviewers

Request corrected submission: The bugs are fixable, but results must be regenerated
Verify re-run results: Once fixed, results should be reproduced independently
Check feature count: Clarify why paper claims 20 features but code has 14
Validate LLM claims: Verify that LLM was actually used for parameter extraction (not just documented as an example)

---

10. AGENT REPRODUCIBILITY

AGENT REPRODUCIBILITY: True
Justification: The submission includes llm_prompts_responses.json which explicitly documents:

Exact prompts used to query LLMs for parameter extraction
Sample LLM responses with numeric values
Explanation generation prompts with formatting requirements
Evaluation protocol for automatic scoring of LLM outputs

While the LLM integration is not automated in code (manual prompts rather than API calls), the researchers have transparently documented their AI agent usage in the research process, including prompts that generated parameters used in the code/analysis.

---

FINAL VERDICT

CODEBASE AUDIT RESULT: HIGH

The submission contains critical bugs that invalidate reported results, multiple execution-blocking errors, and missing experimental outputs. However, the core implementation demonstrates legitimate research effort with proper RL methodology. The code requires substantial fixes before it can be considered reproducible.

AGENT REPRODUCIBILITY: True

The researchers documented their LLM usage with exact prompts and sample responses, satisfying the transparency requirement for AI agent usage in the research process.
