
Audit Report: Paper 284

Code Audit Report: sub_284

Date: 2025-10-14
Auditor: Claude Code

---

Executive Summary

This submission provides a functioning implementation of an LLM-based system (BadScientist) designed to generate fabricated research papers and evaluate them with LLM reviewers. The code is structurally complete and appears capable of running the experiments described in the paper. However, NO EXPERIMENTAL RESULTS are included in the submission - all results are hosted externally on an anonymous repository, making independent verification impossible. The code requires paid Azure OpenAI API access and specific model names ("gpt-5", "o3", "o4-mini", "gpt-4.1") that do not exist in OpenAI's public API, raising serious questions about reproducibility.

---

Risk Level: HIGH

---

Key Red Flags

  1. All Experimental Results Are External and Unverifiable
    • Evidence: BadScientist/experiment/cs/results/README.md:1 states "ALL EXPERIMENT RESULTS ARE IN https://anonymous.4open.science/r/BadScientist"
    • Impact: CRITICAL - The submission contains zero actual experimental outputs. All paper claims about acceptance rates, integrity concern rates, and detection performance cannot be verified from the submitted code alone. The external link is on an anonymous hosting service that could disappear or be modified at any time.
  2. API Configuration Requires Non-Existent Model Names
    • Evidence: BadScientist/local_review_config.py:5 specifies AZURE_OPENAI_MODEL = "gpt-5" and line 11 specifies REVIEW_AZURE_OPENAI_MODEL = "gpt-4.1, o3, o4-mini"
    • Impact: CRITICAL - These model names could not be matched to OpenAI's public API using information current as of January 2025, which is well before the audit date (2025-10-14), so this finding should be re-verified against current OpenAI and Azure OpenAI model listings. Unless the names correspond to publicly available deployments, the code is non-reproducible without access to private/internal Azure endpoints that are not documented.
  3. Empty API Keys in Configuration
    • Evidence: BadScientist/local_review_config.py:2, 8 contain empty strings: AZURE_OPENAI_API_KEY = "" and REVIEW_AZURE_OPENAI_API_KEY = ""
    • Impact: HIGH - The code will fail immediately when run without valid credentials. While this is expected for submitted code, combined with the non-existent model names, it suggests these experiments may have been run on private infrastructure that cannot be replicated.
  4. No Mechanism to Verify Paper Claims About Specific Metrics
    • Evidence: Paper claims specific metrics like "TooGoodGains (S1): 67.0% / 82.0%" acceptance rates, but no analysis code exists to compute these from raw results
    • Impact: HIGH - Even if experimental outputs were provided, there's no code to verify how the paper's reported statistics were calculated. The code only generates papers and reviews, but doesn't aggregate or analyze results.
  5. Hardcoded Manipulation Strategies in Prompt Engineering
    • Evidence: BadScientist/experiment/cs/strategy.json contains explicit instructions to fabricate: "Too-Good Gains: Emphasize extraordinary performance improvements", "Baseline Selection: Carefully curate the baselines... sidelining any particularly competitive alternatives", "Formal Proof w/ Subtle Gap: Subtly embed a minor oversight"
    • Impact: MEDIUM - While this is the point of the research (adversarial paper generation), it confirms the code is designed to generate deceptive content. The strategies align exactly with paper claims, showing consistency.
  6. Results Generation is Entirely Synthetic
    • Evidence: BadScientist/core.py:71-73 states figures include "code that implements the proposed methods" and "Data/visualization synthesis: sample pseudo-data D ~ q(· | s, t, θ)"
    • Impact: MEDIUM - All experimental results in generated papers are fabricated by LLMs, not computed from actual implementations. This is expected given the paper's thesis but means no actual ML experiments are conducted.
  7. No Validation or Ground Truth for Review Quality
    • Evidence: BadScientist/perform_review.py:162-704 implements review generation but has no validation against human reviews or calibration metrics
    • Impact: MEDIUM - While the paper describes calibration on ICLR 2025 submissions, the code doesn't include this calibration dataset or validation logic, making it impossible to verify reviewer accuracy claims.
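
The API-key and model-name flags above could be caught mechanically before any run. The following is a minimal sketch, assuming the configuration exposes the four settings quoted above as plain strings and that the review model setting is a comma-separated list. The function name `check_config` and the `known_deployments` argument are hypothetical and are not part of the submission.

```python
def check_config(settings, known_deployments):
    """Return a list of human-readable problems found in a settings dict.

    `settings` mirrors the names quoted from local_review_config.py.
    `known_deployments` is a hypothetical set of deployment names that the
    person running the audit has confirmed on their own Azure resource.
    """
    problems = []
    for key in ("AZURE_OPENAI_API_KEY", "REVIEW_AZURE_OPENAI_API_KEY"):
        if not settings.get(key, "").strip():
            problems.append(f"{key} is empty")
    for key in ("AZURE_OPENAI_MODEL", "REVIEW_AZURE_OPENAI_MODEL"):
        # Assumption: several models are given as one comma-separated string.
        names = [n.strip() for n in settings.get(key, "").split(",") if n.strip()]
        if not names:
            problems.append(f"{key} is empty")
        for name in names:
            if name not in known_deployments:
                problems.append(f"{key}: deployment '{name}' not confirmed")
    return problems


# Values as quoted in the red flags above.
submitted = {
    "AZURE_OPENAI_API_KEY": "",
    "AZURE_OPENAI_MODEL": "gpt-5",
    "REVIEW_AZURE_OPENAI_API_KEY": "",
    "REVIEW_AZURE_OPENAI_MODEL": "gpt-4.1, o3, o4-mini",
}
for problem in check_config(submitted, known_deployments=set()):
    print(problem)
```

Passing the set of deployments actually available to the person reproducing the work turns the model-name concern into a concrete, checkable list rather than a judgment call.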

---

Confidence Assessment

Can this code reproduce the paper's results? Unlikely

Reasoning:

---

Detailed Findings

Completeness & Structural Integrity

POSITIVE:

NEGATIVE:

Results Authenticity

MAJOR CONCERN:

VERIFICATION IMPOSSIBLE:

Implementation-Paper Consistency

STRONG ALIGNMENT:

INCONSISTENCIES:

Code Quality

STRENGTHS:

WEAKNESSES:

CODE DUPLICATION:

Functionality

PAPER GENERATION PIPELINE (FUNCTIONAL):

REVIEW PIPELINE (FUNCTIONAL):

MISSING COMPONENTS:

Dependencies & Environment

DEPENDENCIES (requirements.txt):
backoff
openai
requests
numpy
matplotlib
pypdf
pymupdf
pymupdf4llm

ISSUES:

ENVIRONMENT ASSUMPTIONS:

---

Recommended Actions

For Reviewers:

  1. Request the complete external repository immediately - Verify the anonymous.4open.science link provides all raw results, not just summaries
  2. Demand clarification on model names - Ask authors to explain what "gpt-5", "o3", "o4-mini", and "gpt-4.1" actually refer to. Are these internal codenames? Private models? Typos?
  3. Request analysis/aggregation scripts - Ask for code that computes all statistics reported in Tables 1-3 from raw review outputs
  4. Request calibration dataset or methodology - Code should include ICLR 2025 calibration data or at minimum the sampling logic
  5. Verify the experiment actually ran - Request timestamps, logs, or other provenance showing when experiments were conducted
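
The analysis/aggregation script requested above could be quite small. The following is a minimal sketch, assuming each raw review record carries a strategy label, a paper identifier, a numeric score, and a boolean integrity-concern flag, and that a paper counts as accepted when its mean reviewer score meets a threshold. The record layout, field names, and both decision rules are assumptions for illustration, not the authors' method.

```python
from collections import defaultdict
from statistics import mean


def aggregate(reviews, threshold):
    """Compute per-strategy acceptance and integrity-concern rates.

    `reviews` is a list of dicts with hypothetical keys:
    strategy, paper_id, score (number), integrity_concern (bool).
    """
    # Group individual reviews by the paper they belong to.
    by_paper = defaultdict(list)
    for r in reviews:
        by_paper[(r["strategy"], r["paper_id"])].append(r)

    per_strategy = defaultdict(lambda: {"papers": 0, "accepted": 0, "flagged": 0})
    for (strategy, _paper_id), rs in by_paper.items():
        stats = per_strategy[strategy]
        stats["papers"] += 1
        # Assumed rule: accept when the mean reviewer score meets the threshold.
        if mean(r["score"] for r in rs) >= threshold:
            stats["accepted"] += 1
        # Assumed rule: a paper is flagged if any reviewer raises a concern.
        if any(r["integrity_concern"] for r in rs):
            stats["flagged"] += 1

    return {
        s: {
            "acceptance_rate": v["accepted"] / v["papers"],
            "integrity_concern_rate": v["flagged"] / v["papers"],
        }
        for s, v in per_strategy.items()
    }


# Made-up records purely to exercise the function; not results from the paper.
demo = [
    {"strategy": "S1", "paper_id": 0, "score": 8, "integrity_concern": False},
    {"strategy": "S1", "paper_id": 0, "score": 6, "integrity_concern": True},
    {"strategy": "S1", "paper_id": 1, "score": 5, "integrity_concern": False},
    {"strategy": "S1", "paper_id": 1, "score": 6, "integrity_concern": False},
]
print(aggregate(demo, threshold=7))
```

A script of this shape, run over the raw review outputs, would let a reviewer check each reported percentage directly instead of relying on the tables in the paper.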

Questions to Ask Authors:

  1. What Azure OpenAI model names correspond to "gpt-5", "o3", "o4-mini", and "gpt-4.1"? Are these publicly available?
  2. Can you provide the complete contents of the anonymous.4open.science repository as part of the submission?
  3. Where is the code that computes acceptance rates, ICR metrics, and detection performance from raw reviews?
  4. How were the calibration thresholds τ_rate=7 and τ_0.5=6.667 derived? Please provide this calculation code.
  5. The code generates entirely synthetic results via LLM - were any actual ML experiments conducted, or is everything fabricated as claimed?
  6. Can you provide at least a sample of generated papers and their reviews to verify the pipeline works as described?
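
To make the calibration-threshold question concrete, here is one plausible shape the requested calculation could take. This is a sketch under stated assumptions only: it guesses from the subscripts that one threshold matches a target acceptance rate and the other marks where the empirical acceptance probability reaches 0.5. Both function names, both rules, and the demo data are hypothetical; the submission does not document how the thresholds were actually derived.

```python
from collections import defaultdict


def rate_matching_threshold(scores, target_rate):
    """Smallest observed score t with share(scores >= t) <= target_rate."""
    n = len(scores)
    for t in sorted(set(scores)):
        if sum(s >= t for s in scores) / n <= target_rate:
            return t
    return None  # no observed score is strict enough


def half_probability_threshold(scores, accepted):
    """Smallest observed score at which at least half the papers were accepted."""
    by_score = defaultdict(list)
    for s, a in zip(scores, accepted):
        by_score[s].append(a)
    for t in sorted(by_score):
        if sum(by_score[t]) / len(by_score[t]) >= 0.5:
            return t
    return None  # acceptance never reaches 0.5 at any observed score


# Made-up calibration data purely to exercise the functions.
demo_scores = [4, 5, 5, 6, 6, 7, 7, 8]
demo_accepted = [False, False, False, False, True, True, True, True]
print(rate_matching_threshold(demo_scores, target_rate=0.4))
print(half_probability_threshold(demo_scores, demo_accepted))
```

If the authors' procedure differs from these rules, their own calculation code on the ICLR 2025 calibration set is what would settle it.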

Additional Material to Request:

  1. Raw experimental outputs: At minimum 10-20 complete generated papers + reviews from different strategies
  2. Calibration dataset: ICLR 2025 submissions used for threshold setting (or citation if publicly available)
  3. Analysis notebooks: Jupyter notebooks showing how paper statistics were computed
  4. API cost logs: To verify the scale of experiments (25 seeds × 6 strategies × 4 papers = 600 papers + reviews)
  5. Configuration template: The mentioned local_review_config.py.example file
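
The experiment count quoted above (25 seeds × 6 strategies × 4 papers) can be turned into a completeness check on whatever raw outputs are delivered. This is a minimal sketch, assuming each delivered output can be keyed by a (seed, strategy, paper index) triple. The strategy labels S1 to S6 and the key layout are hypothetical; the report itself only quotes the label S1.

```python
from itertools import product

SEEDS = range(25)
# Hypothetical labels: only "S1" is quoted in this report.
STRATEGIES = [f"S{i}" for i in range(1, 7)]
PAPERS_PER_CELL = range(4)


def missing_runs(delivered):
    """Return the expected (seed, strategy, paper) keys absent from `delivered`."""
    expected = set(product(SEEDS, STRATEGIES, PAPERS_PER_CELL))
    return sorted(expected - set(delivered))


expected_total = len(SEEDS) * len(STRATEGIES) * len(PAPERS_PER_CELL)
print(expected_total)  # 25 * 6 * 4 = 600

# Example: a delivery containing a single run leaves 599 keys missing.
print(len(missing_runs([(0, "S1", 0)])))
```

Any non-empty result from `missing_runs` on the full delivery would indicate that the provided outputs do not cover the experiment grid the paper describes.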

---

Files Reviewed

Core Implementation:

Configuration & Data:

Supporting Tools:

---

Final Assessment

This is a high-risk submission due to the complete absence of experimental results and the use of non-reproducible API configurations. The code itself appears to be a legitimate implementation of the described system and could likely generate papers and reviews if given access to the appropriate LLM APIs. However, without the external results repository or analysis scripts, the paper's quantitative claims cannot be verified.

The submission falls into a gray area: it provides the "code" as promised, but in a form that cannot be independently validated against the paper's claims. This could be acceptable if the authors provide the external repository and analysis scripts upon request, or it could indicate the results were generated differently than described.

Recommendation: Request complete experimental outputs and analysis code before making a final determination. The structural quality of the implementation suggests legitimate research, but the missing components prevent full verification.

---