Code Audit Report: sub_284

Date: 2025-10-14
Auditor: Claude Code

---

Executive Summary

This submission provides a functioning implementation of BadScientist, an LLM-based system that generates fabricated research papers and evaluates them with LLM reviewers. The code is structurally complete and appears capable of running the experiments described in the paper. However, no experimental results are included in the submission: all results are hosted externally on an anonymous repository, so they cannot be verified from the submitted materials alone. The code also requires paid Azure OpenAI API access to deployments named "gpt-5", "o3", "o4-mini", and "gpt-4.1". The audit could not match these names to OpenAI's public API as of January 2025, but all four models were released publicly later in 2025 (before the date of this report), so the open reproducibility question is which Azure deployments and model versions the authors used, not whether the models exist.

---

Risk Level: HIGH

---

Key Red Flags

All Experimental Results Are External and Unverifiable

Evidence: BadScientist/experiment/cs/results/README.md:1 states "ALL EXPERIMENT RESULTS ARE IN https://anonymous.4open.science/r/BadScientist"
Impact: CRITICAL - The submission contains zero actual experimental outputs. All paper claims about acceptance rates, integrity concern rates, and detection performance cannot be verified from the submitted code alone. The external link is on an anonymous hosting service that could disappear or be modified at any time.

API Configuration Depends on Undocumented Model Deployments

Evidence: BadScientist/local_review_config.py:5 specifies AZURE_OPENAI_MODEL = "gpt-5" and line 11 specifies REVIEW_AZURE_OPENAI_MODEL = "gpt-4.1, o3, o4-mini"
Impact: HIGH - The audit could not match these names to OpenAI's public API as of January 2025. That check is out of date for a report dated 2025-10-14: GPT-4.1, o3, and o4-mini were released in April 2025 and GPT-5 in August 2025. The remaining risk is that the configuration records only bare model names, and the Azure endpoints and model versions behind them are not documented, so a rerun is not guaranteed to reach the same models.

Empty API Keys in Configuration

Evidence: BadScientist/local_review_config.py:2, 8 contain empty strings: AZURE_OPENAI_API_KEY = "" and REVIEW_AZURE_OPENAI_API_KEY = ""
Impact: HIGH - The code will fail immediately when run without valid credentials. Blank keys are expected in submitted code, so this alone is not a defect. Together with the undocumented Azure deployments, however, it means the experiments depend on infrastructure that a third party cannot replicate from the submission.
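
As a minimal sketch of how a rerun could fail fast with a clear message rather than on the first API call, the check below reads the two key names quoted above from the environment. Reading them from environment variables, and the helper names missing_credentials and require_credentials, are assumptions of this sketch, not part of the submission.

```python
import os

# Variable names mirror those quoted from local_review_config.py; sourcing them
# from the environment is this sketch's assumption, not the submission's design.
REQUIRED_VARS = ("AZURE_OPENAI_API_KEY", "REVIEW_AZURE_OPENAI_API_KEY")


def missing_credentials(env=None):
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


def require_credentials(env=None):
    """Raise a clear error up front instead of failing on the first API call."""
    missing = missing_credentials(env)
    if missing:
        raise RuntimeError("Missing credentials: " + ", ".join(missing))
```

Calling require_credentials() at the top of an entry point would surface an empty key before any paper generation starts.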

No Mechanism to Verify Paper Claims About Specific Metrics

Evidence: Paper claims specific metrics like "TooGoodGains (S1): 67.0% / 82.0%" acceptance rates, but no analysis code exists to compute these from raw results
Impact: HIGH - Even if experimental outputs were provided, there's no code to verify how the paper's reported statistics were calculated. The code only generates papers and reviews, but doesn't aggregate or analyze results.
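
To illustrate the kind of aggregation script that is missing, here is a hedged sketch that computes a per-strategy acceptance rate from aggregated review scores. The (strategy, score) record layout and the "score >= threshold" acceptance rule are assumptions; the submission ships no raw outputs or analysis code to confirm either.

```python
from collections import defaultdict


def acceptance_rates(records, threshold):
    """Fraction of papers per strategy whose aggregated score meets the threshold.

    `records` is an iterable of (strategy, score) pairs. This schema is
    hypothetical, since the submission includes no raw review outputs.
    """
    totals = defaultdict(int)
    accepted = defaultdict(int)
    for strategy, score in records:
        totals[strategy] += 1
        if score >= threshold:  # assumed rule: meeting the threshold counts as accept
            accepted[strategy] += 1
    return {s: accepted[s] / totals[s] for s in totals}
```

Running the same records through two different thresholds yields two rates per strategy, which is one way a pair of figures per strategy could arise; whether the paper computed its figures this way is not established by the submitted code.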

Hardcoded Manipulation Strategies in Prompt Engineering

Evidence: BadScientist/experiment/cs/strategy.json contains explicit instructions to fabricate: "Too-Good Gains: Emphasize extraordinary performance improvements", "Baseline Selection: Carefully curate the baselines... sidelining any particularly competitive alternatives", "Formal Proof w/ Subtle Gap: Subtly embed a minor oversight"
Impact: MEDIUM - While this is the point of the research (adversarial paper generation), it confirms the code is designed to generate deceptive content. The strategies align exactly with paper claims, showing consistency.

Results Generation is Entirely Synthetic

Evidence: BadScientist/core.py:71-73 states figures include "code that implements the proposed methods" and "Data/visualization synthesis: sample pseudo-data D ~ q(· | s, t, θ)"
Impact: MEDIUM - All experimental results in generated papers are fabricated by LLMs, not computed from actual implementations. This is expected given the paper's thesis but means no actual ML experiments are conducted.

No Validation or Ground Truth for Review Quality

Evidence: BadScientist/perform_review.py:162-704 implements review generation but has no validation against human reviews or calibration metrics
Impact: MEDIUM - While the paper describes calibration on ICLR 2025 submissions, the code doesn't include this calibration dataset or validation logic, making it impossible to verify reviewer accuracy claims.

---

Confidence Assessment

Can this code reproduce the paper's results? Unlikely.

Reasoning:

Missing critical components: All experimental results are external; no analysis code exists to compute paper statistics
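Hard-to-reproduce API requirements: The configured models ("gpt-5", "o3", "o4-mini", "gpt-4.1") were all released during 2025, but the code requires paid Azure OpenAI deployments whose endpoints and model versions are undocumented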
Structural completeness: The core pipeline (paper generation + review) appears complete and could run given proper API access
Consistency with claims: Code structure and strategies align perfectly with paper descriptions, suggesting the external results likely came from this code

---

Detailed Findings

Completeness & Structural Integrity

POSITIVE:

All Python entry points exist and are complete (launch.py:1-52, run_experiments.py:1-374, core.py:1-1490)
No TODO comments, placeholder functions, or hardcoded result values in the generation pipeline
Imports reference valid packages (openai, backoff, numpy, matplotlib, pymupdf, etc.)
LaTeX compilation logic includes error handling and repair attempts (core.py:323-417)
Concurrent execution properly implemented with ThreadPoolExecutor (core.py:1452-1480, run_experiments.py:238-307)

NEGATIVE:

Results directory is completely empty except README pointing to external hosting (experiment/cs/results/)
No analysis scripts to compute acceptance rates, ICR, or other metrics reported in paper
Configuration file template (local_review_config.py.example) mentioned in README doesn't exist
No calibration dataset (ICLR 2025 submissions) included despite being central to methodology
No detection evaluation scripts despite paper reporting TPR/FPR/Acc/F1 metrics
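
For reference, the detection metrics named above follow directly from confusion-matrix counts. The sketch below uses the standard definitions and assumes the positive class is "fabricated paper"; the function name and the zero-denominator convention are choices of this sketch, not taken from the submission.

```python
def detection_metrics(tp, fp, tn, fn):
    """TPR, FPR, accuracy, and F1 from confusion-matrix counts."""

    def ratio(num, den):
        # Convention for this sketch: an empty denominator yields 0.0.
        return num / den if den else 0.0

    tpr = ratio(tp, tp + fn)  # recall on the positive (assumed: fabricated) class
    fpr = ratio(fp, fp + tn)
    acc = ratio(tp + tn, tp + fp + tn + fn)
    precision = ratio(tp, tp + fp)
    f1 = ratio(2 * precision * tpr, precision + tpr)
    return {"TPR": tpr, "FPR": fpr, "Acc": acc, "F1": f1}
```

A script of this size, plus the labeled detection dataset it would consume, is what the submission would need to include for the reported TPR/FPR/Acc/F1 values to be checkable.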

Results Authenticity

MAJOR CONCERN:

All results are externally hosted on anonymous.4open.science with no local copies
No raw review outputs, generated papers, or intermediate data included in submission
No provenance tracking or git history to verify when experiments were run
Code generates entirely synthetic data via LLM (core.py:71-73: "sample pseudo-data D ~ q(· | s, t, θ)")

VERIFICATION IMPOSSIBLE:

Cannot verify claims like "TooGoodGains: 67.0% / 82.0% acceptance rates"
Cannot verify "o3: 50.6% ICR, o4-mini: 5.7%, GPT-4.1: 8.0%"
Cannot verify detection performance (TPR/FPR metrics)
External repository could be modified or deleted at any time
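
One way to reduce the risk of silent modification is to record a checksum manifest of the external repository at review time. The following is a minimal sketch under the assumption that the repository has been downloaded to a local directory; build_manifest and changed_files are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def changed_files(old, new):
    """Paths added, removed, or modified between two manifests."""
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))
```

Archiving the manifest alongside the review lets a later download be compared against the snapshot that was actually audited.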

Implementation-Paper Consistency

STRONG ALIGNMENT:

Strategy definitions (strategy.json) match paper descriptions exactly:
S1 (TooGoodGains): "emphasize extraordinary performance improvements"
S2 (BaselineSelect): "curate baselines... sidelining competitive alternatives"
S3 (StatTheater): "precise p-values, confidence intervals... forthcoming repository"
S4 (CoherencePolish): "seamless flow... standardize terminology"
S5 (ProofGap): "Subtly embed a minor oversight in reasoning"

Review system implementation matches paper description:
Multi-model panel support (perform_review.py:279-405)
Score aggregation by averaging (perform_review.py:217-264)
Decision voting with tie-breaking (perform_review.py:251-263)
1-10 scoring scale (perform_review.py:115, 238)
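
The aggregation scheme described in this list can be sketched as follows. This is an illustration only, not the code in perform_review.py: the "score" and "decision" keys, the accept/reject labels, and the tie-break rule (falling back to the averaged score against a hypothetical tie_cutoff) are all assumptions.

```python
from collections import Counter
from statistics import mean


def aggregate_reviews(reviews, tie_cutoff=6.0):
    """Average the scores and take a majority vote over the decisions.

    Each review is a dict with "score" (1-10) and "decision" keys. The key
    names, labels, and tie-break rule are assumptions of this sketch.
    """
    avg_score = mean(r["score"] for r in reviews)
    votes = Counter(r["decision"] for r in reviews)
    accept, reject = votes.get("accept", 0), votes.get("reject", 0)
    if accept != reject:
        decision = "accept" if accept > reject else "reject"
    else:
        # Assumed tie-break: compare the averaged score with tie_cutoff.
        decision = "accept" if avg_score >= tie_cutoff else "reject"
    return {"score": avg_score, "decision": decision}
```

With a three-model panel a tie cannot occur on binary decisions, so the tie-break only matters for even-sized panels or when a reviewer call fails.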

Experiment structure matches paper:
25 seed ideas (seed_ideas.json has 25 entries)
6 strategy setups (S1-S5 + combined)
4 papers per seed per strategy (run_experiments.py:159: default=4)

INCONSISTENCIES:

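Paper names "GPT-5" as the planning agent (284_methods_results.md:10); GPT-5 was released in August 2025, but the submission does not document the model version or Azure deployment used
Configuration lists "gpt-4.1" as a reviewer model (local_review_config.py:11); GPT-4.1 was released in April 2025, but only the bare name is recorded, with no version or snapshot identifier
Model names "o3" and "o4-mini" refer to OpenAI reasoning models released in April 2025; as above, only the bare names are recorded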
No code for "stratified calibration set" construction described in paper
No code for "probability-consistent threshold τ_0.5" calculation

Code Quality

STRENGTHS:

Well-structured with clear separation of concerns (core.py for generation, perform_review.py for review)
Comprehensive error handling with LLM-based auto-repair for figure code (core.py:351-416)
Extensive LaTeX sanitization to handle special characters (core.py:470-747)
Retry logic with exponential backoff (llm_client.py:42, perform_review.py:445-498)
Proper Unicode to LaTeX conversion (core.py:493-639)

WEAKNESSES:

Global variable for review details (perform_review.py:14: LAST_REVIEW_DETAILS = None) is not thread-safe
Commented-out config loading (perform_review.py:68: # _load_local_review_config())
No logging of API costs despite potentially expensive multi-model calls
Hard timeout of 120 seconds for figure generation (core.py:333) may be too short for complex plots
No validation that generated papers actually compile successfully
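
On the first weakness, one hedged alternative to a module-level global is per-thread storage. The sketch below uses threading.local so that concurrent workers cannot overwrite each other's review details; the function names are hypothetical, and review_paper is a stand-in for the real LLM call.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Per-thread storage: each worker sees only the details it wrote itself.
_review_state = threading.local()


def record_review_details(details):
    """Store details for the current thread only."""
    _review_state.last = details


def last_review_details():
    """Return the current thread's details, or None if it never recorded any."""
    return getattr(_review_state, "last", None)


def review_paper(paper_id):
    """Stand-in for a review call; the real one would query an LLM."""
    record_review_details({"paper": paper_id})
    return last_review_details()["paper"]
```

Returning the details from the review function instead of storing them anywhere would be simpler still, and avoids hidden state altogether.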

CODE DUPLICATION:

Config loading logic duplicated in llm_client.py and perform_review.py
API credential retrieval repeated in multiple functions (perform_review.py:301-316, 422-437, 829-845)

Functionality

PAPER GENERATION PIPELINE (FUNCTIONAL):

Idea generation with deduplication (core.py:174-245)
Paper package creation with sections, abstracts, figures (core.py:248-270)
Figure code execution with repair attempts (core.py:323-417)
LaTeX template filling and compilation (core.py:1039-1185, 1188-1281)
Reference generation via LLM (core.py:749-905) - Semantic Scholar integration disabled
Parallel processing of multiple ideas (core.py:1452-1480)

REVIEW PIPELINE (FUNCTIONAL):

Multi-model ensemble reviews (perform_review.py:287-407)
Score aggregation and decision voting (perform_review.py:217-264)
Few-shot examples loading (perform_review.py:790-819)
PDF text extraction with multiple fallbacks (perform_review.py:734-765)
Reflection-based review refinement (perform_review.py:669-692)

MISSING COMPONENTS:

No analysis/aggregation scripts for computing paper metrics
No calibration logic for setting thresholds τ_rate and τ_0.5
No detection dataset construction code
No integrity concern extraction via "LLM judge (GPT-5)" as described in paper

Dependencies & Environment

DEPENDENCIES (requirements.txt):

backoff
openai
requests
numpy
matplotlib
pypdf
pymupdf
pymupdf4llm

ISSUES:

No version pinning for any package (installs may break or behave differently as dependencies change)
Requires a LaTeX installation (pdflatex, bibtex), a system dependency outside requirements.txt
Python 3.12 specified in README but not enforced anywhere in the code
openai SDK version not specified, although the Azure client code relies on features of recent releases
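
If the authors still have the environment used for the experiments, pinned versions can be recovered from it. A minimal sketch using the standard library's importlib.metadata follows; pinned_lines is a hypothetical helper, and the package list is copied from the requirements.txt listing above.

```python
from importlib import metadata

# Package names copied from the requirements.txt listing in this report.
PACKAGES = ["backoff", "openai", "requests", "numpy",
            "matplotlib", "pypdf", "pymupdf", "pymupdf4llm"]


def pinned_lines(packages, lookup=metadata.version):
    """Return 'name==version' for installed packages, a comment otherwise."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={lookup(name)}")
        except metadata.PackageNotFoundError:
            lines.append(f"# {name}: not installed in this environment")
    return lines
```

Printing "\n".join(pinned_lines(PACKAGES)) in the original environment produces a pinned requirements file; this is roughly what pip freeze reports, restricted to the declared dependencies.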

ENVIRONMENT ASSUMPTIONS:

Requires paid Azure OpenAI API access with specific models
No indication of computational requirements (likely minimal locally, since the LLM APIs do the heavy computation)
Assumes macOS or Linux for LaTeX installation commands

---

Recommended Actions

For Reviewers:

Request the complete external repository immediately - Verify the anonymous.4open.science link provides all raw results, not just summaries

Request clarification on model deployments - Ask authors which model versions, snapshot identifiers, and Azure deployments "gpt-5", "o3", "o4-mini", and "gpt-4.1" refer to, and when the experiments were run relative to each model's public release.

Request analysis/aggregation scripts - Ask for code that computes all statistics reported in Tables 1-3 from raw review outputs

Request calibration dataset or methodology - Code should include ICLR 2025 calibration data or at minimum the sampling logic

Verify the experiment actually ran - Request timestamps, logs, or other provenance showing when experiments were conducted

Questions to Ask Authors:

Which Azure OpenAI deployments and model versions correspond to "gpt-5", "o3", "o4-mini", and "gpt-4.1"? Can a third party provision the same versions?

Can you provide the complete contents of the anonymous.4open.science repository as part of the submission?

Where is the code that computes acceptance rates, ICR metrics, and detection performance from raw reviews?

How were the calibration thresholds τ_rate=7 and τ_0.5=6.667 derived? Please provide this calculation code.

The code generates entirely synthetic results via LLM - were any actual ML experiments conducted, or are all results in the generated papers fabricated by the LLM, as the paper describes?

Can you provide at least a sample of generated papers and their reviews to verify the pipeline works as described?

Additional Material to Request:

Raw experimental outputs: At minimum 10-20 complete generated papers + reviews from different strategies
Calibration dataset: ICLR 2025 submissions used for threshold setting (or citation if publicly available)
Analysis notebooks: Jupyter notebooks showing how paper statistics were computed
API cost logs: To verify the scale of experiments (25 seeds × 6 strategies × 4 papers = 600 papers + reviews)
Configuration template: The mentioned local_review_config.py.example file

---

Files Reviewed

Core Implementation:

BadScientist/core.py (1,490 lines) - Main paper generation pipeline
BadScientist/perform_review.py (896 lines) - LLM review system
BadScientist/llm_client.py (159 lines) - Azure OpenAI client wrapper
BadScientist/launch.py (52 lines) - Entry point script
BadScientist/run_experiments.py (374 lines) - Batch experiment runner

Configuration & Data:

BadScientist/local_review_config.py (16 lines) - API configuration
BadScientist/experiment/cs/seed_ideas.json (77 lines) - 25 research seed ideas
BadScientist/experiment/cs/strategy.json (7 lines) - 5 manipulation strategies
BadScientist/requirements.txt (9 lines) - Dependencies
BadScientist/README.md (74 lines) - Documentation
BadScientist/experiment/cs/results/README.md (2 lines) - External results pointer

Supporting Tools:

BadScientist/tools/compile_latex.py (referenced but not read in detail)
BadScientist/tools/figure_safety.py (referenced but not read in detail)
BadScientist/semantic_scholar.py (referenced but not read in detail)

---

Final Assessment

This is a high-risk submission due to the complete absence of experimental results and the use of non-reproducible API configurations. The code itself appears to be a legitimate implementation of the described system and could likely generate papers and reviews if given access to the appropriate LLM APIs. However, without the external results repository or analysis scripts, the paper's quantitative claims cannot be verified.

The submission falls into a gray area: it provides the "code" as promised, but in a form that cannot be independently validated against the paper's claims. This could be acceptable if the authors provide the external repository and analysis scripts upon request, or it could indicate the results were generated differently than described.

Recommendation: Request complete experimental outputs and analysis code before making a final determination. The structural quality of the implementation suggests legitimate research, but the missing components prevent full verification.
