
Code Audit Report: sub_295 (Paper 295)

Date: 2025-10-14
Auditor: Claude Code

---

Executive Summary

This submission presents a framework for LLM-driven catalyst discovery but contains critical red flags indicating that the reported experimental results were not generated by the provided code. The code contains hardcoded mock results, uses random number generators for DFT validation, and includes pre-computed CSV files with results that cannot be reproduced by running the pipeline.

---

Risk Level: CRITICAL

---

Key Red Flags

  1. Hardcoded Mock DFT Results with Random Number Generation
    • Evidence: catalyst_discovery_pipeline.py:429-438

   # Mock DFT results for demonstration
   dft_results = {
       "formation_energy": -0.5 + np.random.uniform(-0.3, 0.3),
       "energy_above_hull": max(0, np.random.uniform(-0.05, 0.15)),
       "band_gap": max(0, np.random.uniform(0, 2.0)),
       "adsorption_energies": {
           "CO_top": -0.6 + np.random.uniform(-0.3, 0.3),
           "H_top": -0.3 + np.random.uniform(-0.2, 0.2)
       }
   }
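The non-reproducibility is easy to demonstrate. The sketch below re-creates the mock generator (names are illustrative, not the submission's own) and shows that two independent runs disagree, so no fixed value in the paper can be traced back to this code path:

```python
import numpy as np

def mock_dft_results(rng):
    """Replica of the submission's mock generator (pipeline lines 429-438)."""
    return {
        "formation_energy": -0.5 + rng.uniform(-0.3, 0.3),
        "energy_above_hull": max(0, rng.uniform(-0.05, 0.15)),
        "band_gap": max(0, rng.uniform(0, 2.0)),
    }

# Two independent runs produce different "results": a fixed paper value
# such as Ehull = 32 meV/atom cannot originate from this generator.
run_a = mock_dft_results(np.random.default_rng(0))
run_b = mock_dft_results(np.random.default_rng(1))
print(run_a["formation_energy"], run_b["formation_energy"])
```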

  2. Pre-computed Results Files Inconsistent with Code Capabilities
    • Evidence:
    • fig1_catalyst_data.csv contains 50 "Known" catalysts with specific mixing enthalpies and d-band centers
    • candidate_selection_data.csv contains 50+ "LLM_Generated_HEA" catalysts with precise limiting potentials (e.g., 0.7229095437628683 V)
    • No code path exists to generate these precise values from the mock data
    • Impact: The figures in the paper are based on pre-computed data that cannot be reproduced by running the provided code. The precision (16+ decimal places) suggests these are placeholder data, not real computational results.
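The precision red flag can be checked mechanically. The sketch below is a hypothetical helper (the column names are assumptions about the CSV layout; the first value is the one quoted verbatim above) that flags fields carrying full double precision:

```python
import csv
import io

def decimal_places(value: str) -> int:
    """Count digits after the decimal point in a raw CSV field."""
    return len(value.split(".")[1]) if "." in value else 0

# Hypothetical two-row sample shaped like candidate_selection_data.csv.
sample = io.StringIO(
    "catalyst,limiting_potential_V\n"
    "LLM_Generated_HEA_1,0.7229095437628683\n"
    "LLM_Generated_HEA_2,0.41\n"
)

# Fields with >10 decimal places are characteristic of raw float64 dumps,
# not of DFT-derived quantities reported at physical precision.
suspicious = [
    row["catalyst"]
    for row in csv.DictReader(sample)
    if decimal_places(row["limiting_potential_V"]) > 10
]
print(suspicious)
```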
  3. Mock Candidate Generation Without Real LLM Integration
    • Evidence: catalyst_discovery_pipeline.py:296-363

   def _llm_generate(self, prompt: str) -> List[Dict]:
       if not self.config["openai_api_key"]:
           # Return mock candidates for testing
           return self._mock_candidates()

Three hardcoded mock candidates are returned whenever the API key is missing, so any run without credentials silently produces fabricated candidates.

  4. Stability Screening Uses Random Values Instead of Calculations
    • Evidence: novelty_screening.py:337-351

   def _estimate_hull_distance(self, composition, phase_diagram):
       # This is a simplified estimation
       # In practice, would need actual formation energy from DFT
       # Estimate based on typical metastable materials
       # Random offset for demonstration
       estimated_offset = np.random.uniform(0.0, 0.2)
       return estimated_offset

  5. No Actual DFT Calculations, Only Input File Generation
    • Evidence: dft_automation.py:378-436
    • The code can generate VASP/QE/GPAW input files but only runs test calculations with EMT (Effective Medium Theory), a toy model
    • Comment on line 417: "Run locally with EMT for testing"
    • Impact: All claimed DFT results (formation energies, binding energies, band structures) in the paper could not have been produced by this code.
  6. Empty Data Aggregation Without API Keys
    • Evidence: data_aggregation.py:67-69, 213-220

   if not mp_config["api_key"]:
       print("Warning: No Materials Project API key found. Skipping MP data.")
       return materials  # Returns empty list

Similar silent-skip patterns exist for the NOMAD, OC20, and literature sources; without API keys the aggregation stage produces no data at all.
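A more defensible pattern would fail loudly instead of returning an empty list that downstream steps mistake for data. A minimal sketch (the function name and signature are hypothetical, not the submission's):

```python
from typing import List, Optional

def fetch_mp_materials(api_key: Optional[str]) -> List[dict]:
    """Hypothetical rewrite of the data_aggregation.py pattern (lines 67-69):
    abort when the Materials Project API key is absent rather than silently
    skipping the source."""
    if not api_key:
        raise RuntimeError(
            "No Materials Project API key; aborting aggregation instead of "
            "returning an empty list."
        )
    return []  # the real API query would go here

# Without a key, the pipeline halts with a clear error rather than
# quietly continuing on no data.
try:
    fetch_mp_materials(None)
except RuntimeError as exc:
    print(f"pipeline halted: {exc}")
```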

  7. Missing NumPy Import in Main Pipeline
    • Evidence: catalyst_discovery_pipeline.py:654-655

   if __name__ == "__main__":
       # Add numpy import for mock DFT results
       import numpy as np

NumPy is imported only at the bottom of the file, inside the `__main__` guard, yet it is used in a method at line 431; importing the pipeline as a module and calling that method would raise a NameError.
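The fix is to hoist the import to module level. A minimal sketch (the function is a stand-in for the pipeline method, not its actual signature):

```python
import numpy as np  # module-level import: visible to every method

def run_dft_validation():
    # Stand-in for the pipeline method around line 431 that uses np.
    # With the import at the top of the module, this works both under
    # `import catalyst_discovery_pipeline` and when run as a script.
    return -0.5 + np.random.uniform(-0.3, 0.3)

value = run_dft_validation()
```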

  8. Inconsistent Result Reporting
    • Evidence: Paper reports specific values like:
    • "Fe0.2Co0.2Ni0.2Ir0.1Ru0.3 with ηOER = 0.285 V"
    • "Ehull = 32 meV/atom"
    • "82% stability rate"
    • The code's mock data generates random values that would never produce such specific, consistent results
    • Impact: The paper's results were obtained through other means (manual calculation, different code, or fabrication) not represented in this submission.

---

Confidence Assessment

Can this code reproduce the paper's results? No.

Reasoning: every quantitative step (DFT validation, hull-distance screening, candidate generation) is mocked or randomized, so no run of this code can regenerate the specific values reported in the paper.

---

Detailed Findings

Completeness & Structural Integrity

Positive Aspects:
Critical Issues:
Placeholder Functions:

Results Authenticity Red Flags

Major Concerns:
  1. Random Number Generation for Results:
    • Formation energy: -0.5 + np.random.uniform(-0.3, 0.3)
    • Hull distance: np.random.uniform(0.0, 0.2)
    • Band gap: np.random.uniform(0, 2.0)
    • Binding energies: random offsets added
  2. Pre-computed CSV Files:
    • fig1_catalyst_data.csv: Contains 50 known catalysts + 100 LLM-generated catalysts
    • candidate_selection_data.csv: Contains 50 candidates with 16-decimal-place precision
    • These files contain the data used for paper figures but cannot be generated by the code
  3. Execution Evidence:
    • No output directories with actual results
    • No DFT calculation directories with completed runs
    • No vector database indexes
    • No validation history files
  4. Statistical Claims:
    • Paper: "82% met thermodynamic stability"
    • Code: returns random stability values between 0.0 and 0.2 eV
    • Paper: "75% of LLM-HEAs achieved ηOER <0.40 V"
    • Code: no pathway exists to calculate OER overpotentials
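The 82% figure cannot even emerge from the code's own random draw. Assuming the common Ehull ≤ 50 meV/atom metastability criterion (the submission never defines the threshold behind "82% stability"), a uniform(0, 0.2 eV) hull distance passes only about a quarter of the time:

```python
import numpy as np

rng = np.random.default_rng(42)
# Replicate the mock hull-distance draw from novelty_screening.py:337-351
hull_distances = rng.uniform(0.0, 0.2, size=100_000)  # eV/atom

# Assumed criterion: Ehull <= 50 meV/atom (a common metastability cutoff;
# this threshold is an assumption, not stated in the code or paper).
stability_rate = np.mean(hull_distances <= 0.05)
print(f"{stability_rate:.1%}")  # ~25%, nowhere near the claimed 82%
```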

Implementation-Paper Consistency

Claimed in Paper:
Found in Code:
DFT Parameters:

Paper claims "PBE+U (U: Fe=3.3, Co=3.4, Ni=3.5...)" but the code only sets generic INCAR parameters, with no element-specific U values (dft_automation.py:213-245).
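For reference, element-specific Hubbard U values of the kind the paper claims would appear in a VASP INCAR roughly as follows. This is a sketch using the paper's quoted values; the LDAUL/LDAUU entries follow the POSCAR species order, here assumed to be Fe Co Ni. Nothing like this exists in the submission:

```
LDAU     = .TRUE.
LDAUTYPE = 2           # Dudarev scheme
LDAUL    = 2 2 2       # apply U to the d orbitals of Fe, Co, Ni
LDAUU    = 3.3 3.4 3.5
LDAUJ    = 0 0 0
```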

Code Quality

Strengths:
Weaknesses:
Warning Signs:

Functionality

What Works:
What Doesn't Work:
Cannot Reproduce:

Dependencies & Environment

Dependencies: Generally appropriate and well-specified in requirements.txt.
Issues:
Computational Resources:

Paper claims "~200 CPUs + 8 GPUs" but code has no GPU utilization and only uses 4 parallel workers by default.

---

Recommended Actions

  1. Request the actual code that generated the paper's results. The provided code is a demonstration framework, not the implementation used for the study.
  2. Request DFT calculation output files for the top-performing catalysts mentioned in the paper (e.g., Fe0.2Co0.2Ni0.2Ir0.1Ru0.3) to verify computational results.
  3. Ask authors to explain the discrepancy between:
    • Mock/random results in code vs. precise values in paper
    • Pre-computed CSV files vs. claimed computational pipeline
    • 3 mock candidates in code vs. 250+ claimed in paper
  4. Request the complete RAG database with 50,000+ materials entries claimed in the paper.
  5. Ask how the specific statistical results were obtained (82% stability, 75% activity threshold, etc.) given the random number generation in the code.
  6. Request proof of LLM usage (API logs, token usage records) to verify that GPT-4 was actually used to generate candidates.
  7. Verify figure generation: Ask authors to regenerate figures from scratch using only the provided code to demonstrate reproducibility.
  8. Request git history if available to understand when mock implementations were added and whether functional code existed previously.

---

Files Reviewed

---

Conclusion

This submission represents a well-structured framework for catalyst discovery but is essentially a proof-of-concept or demonstration code rather than the actual implementation used to obtain the paper's results. The presence of mock data generators, random number usage for critical calculations, and pre-computed result files strongly suggests that the paper's results were obtained through other means not represented in this code submission.

The code cannot reproduce any specific numerical result from the paper and would require substantial additional implementation to become functional. This represents a critical failure in computational reproducibility.