CODE AUDIT REPORT - Submission 165
EXECUTIVE SUMMARY
Submission ID: 165
Paper Topic: Persona-Primed Language Model Evaluation Across Multiple Domains
Audit Date: 2024
Overall Assessment:
MEDIUM-HIGH RISK
Critical Findings Summary
- Results Authenticity: LOW RISK - No evidence of hardcoded results
- Completeness: MEDIUM-HIGH RISK - Missing data files, API keys hardcoded as placeholder strings
- Reproducibility: HIGH RISK - Cannot execute without external API keys and data files
- Code Quality: MEDIUM RISK - Functional code with some quality concerns
- Agent Reproducibility: FALSE - No evidence of AI-assisted code generation documented
---
1. COMPLETENESS & STRUCTURAL INTEGRITY
1.1 Core Implementation Status ✓ COMPLETE
Strengths:
- All nine Python scripts present and appear functionally complete
- No placeholder functions or pass statements in critical paths
- No TODOs or incomplete implementations found
- Clear pipeline structure: data loading → prompt building → inference → evaluation
File Inventory:
load_data.py - Dataset loading from HuggingFace (550 lines)
build_prompt.py - Prompt template construction (438 lines)
build_negation_prompt.py - Negation experiment prompts (159 lines)
build_own_persona.py - Model-generated persona prompts (321 lines)
infer_gemini.py - Main inference engine (294 lines)
infer_negation.py - Negation inference (200 lines)
infer_cross_domain.py - Cross-domain evaluation (306 lines)
evaluate_accuracy.py - Multiple-choice evaluation (379 lines)
evaluate_math_accuracy.py - Math-specific evaluation (323 lines)
README.md - Documentation (90 lines)
1.2 Critical Dependencies ⚠️ ISSUES IDENTIFIED
Missing External Resources:
- Datasets: README explicitly states "datasets are not included in this repository"
- API Keys: All inference scripts use the placeholder string "OPENROUTER_API_KEY" instead of an environment variable lookup
- No requirements.txt: Missing dependency specification file
- No data directories: No sample data or test files present
Dependency Issues:
From infer_gemini.py, line 21:
GEMINI_KEY = "OPENROUTER_API_KEY" # This is a literal string, not an env var!
Should be:
GEMINI_KEY = os.environ.get("OPENROUTER_API_KEY", "")
Impact: Code will fail immediately when run because it tries to use the literal string "OPENROUTER_API_KEY" as the API key rather than reading from environment variables.
1.3 Import Analysis ✓ MOSTLY CLEAN
Standard Libraries Used:
- argparse, json, logging, time, os, pathlib, typing, hashlib, re, glob, collections
External Libraries:
datasets (HuggingFace)
openai (OpenAI client for OpenRouter API)
tenacity (retry logic)
tqdm (progress bars)
Assessment: All imports reference standard or well-known packages. No suspicious or non-existent local imports found.
---
2. RESULTS AUTHENTICITY RED FLAGS
2.1 Hardcoded Results Check ✓ CLEAN
Findings:
- ✅ No hardcoded experimental results detected
- ✅ Evaluation metrics computed dynamically from model outputs
- ✅ Accuracy calculations perform actual comparisons:
# evaluate_accuracy.py lines 44-60
def check_correctness(predicted: Optional[str], gold: str, domain: str = 'math') -> bool:
    if predicted is None:
        return False
    # Actual comparison logic follows...
- ✅ No suspicious pre-set accuracy values or result dictionaries
2.2 Random Seed Analysis ✓ ACCEPTABLE
Findings:
- Deterministic seed used in CommonsenseQA processing (seed=1337) for reproducibility
- Temperature=0.0 in all model inference calls for deterministic generation
- No evidence of cherry-picking or multiple seed trials
Assessment: Seeds appear to be used for reproducibility purposes, not result manipulation.
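For illustration, the seeded subsampling pattern described above can be sketched as follows; the function name and the use of random.Random are assumptions, since the submission only shows that seed=1337 is fixed for CommonsenseQA processing:

```python
import random

def subsample_commonsenseqa(items, k, seed=1337):
    # Hypothetical sketch: a fixed seed means the same subset is drawn on
    # every run, making the selection reproducible rather than cherry-picked.
    rng = random.Random(seed)
    return rng.sample(items, k)

subset = subsample_commonsenseqa(list(range(100)), 10)
```

Because the RNG is local (random.Random rather than the module-level random), other code cannot perturb the draw between runs.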
2.3 Result Generation Methodology ✓ LEGITIMATE
Pipeline Verification:
- Data Loading (load_data.py): Loads from HuggingFace, processes into unified format
- Prompt Building (build_prompt.py): Creates prompt variants systematically
- Inference (infer_*.py): Calls external LLM APIs with actual requests
- Evaluation (evaluate_*.py): Parses responses and computes metrics
Key Evidence of Legitimate Processing:
infer_gemini.py lines 72-102:
def generate_text(messages, model_id, max_tokens, variant_name):
    # Real API calls with retry logic
    response = client.chat.completions.create(
        model=model_id,
        messages=messages,
        max_tokens=500,
        temperature=0.0
    )
    return response.choices[0].message.content.strip()
---
3. IMPLEMENTATION-PAPER CONSISTENCY
3.1 Claimed Experiments vs Code
Paper Claims:
- ✅ Multi-domain evaluation (Math, Psychology, Legal) - CODE SUPPORTS
- ✅ Four models tested (Gemini, GPT-4.1, Qwen, Llama) - CODE SUPPORTS
- ✅ Two reasoning modes (CoT, no-CoT) - CODE SUPPORTS
- ✅ Six prompting strategies - CODE SUPPORTS ALL
- ✅ Temperature=0 for consistency - CODE CONFIRMS
Verification:
build_prompt.py defines all claimed prompt types:
- BASELINE_TEMPLATE (line 10)
- PRIMED_TEMPLATES (lines 18-59) - domain-specific
- PERSONA_TEMPLATES (lines 61-178) - generic, historical, modern
infer_gemini.py line 78:
temperature=0.0, # Matches paper claim
3.2 Methodology Alignment ✓ CONSISTENT
Prompt Structure Analysis:
- Generic personas: "You are a brilliant mathematician..." ✅
- Historical personas: "You are Euclid..." / "You are Terence Tao..." ✅
- Negated personas: "You are not a mathematician..." ✅
- Domain priming: "This is a mathematics question..." ✅
Sample Sizes Match Paper:
- Math: ~1,300 items (GSM8K test split)
- Psychology: ~612 items (MMLU professional_psychology)
- Legal: ~117 items (BarExam QA)
---
4. CODE QUALITY SIGNALS
4.1 Dead Code Analysis ⚠️ MINOR ISSUES
Findings:
- Lines 1-22 of build_own_persona.py contain commented-out persona examples with model attributions
- These appear to be development notes recording different model-generated personas
- Ratio of commented code: ~7% (22/321 lines in one file)
Assessment: ACCEPTABLE - Comments appear to be documentation of experimental results rather than disabled functionality.
4.2 Code Duplication 🔶 MODERATE
Identified Patterns:
- Similar API call logic duplicated across infer_gemini.py, infer_negation.py, and infer_cross_domain.py
- Evaluation logic split between evaluate_accuracy.py and evaluate_math_accuracy.py with ~80% overlap
- File I/O functions (load_jsonl, save_jsonl) duplicated across multiple files
Impact: Maintenance burden but not indicative of poor understanding. Common in research code.
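A hedged sketch of how the duplicated I/O helpers could be consolidated into one shared module; the module name io_utils.py is hypothetical, and the load_jsonl/save_jsonl signatures are inferred from their names rather than taken from the submission:

```python
# io_utils.py - hypothetical shared module; the submission instead
# duplicates load_jsonl/save_jsonl across several scripts.
import json
from pathlib import Path

def load_jsonl(path):
    """Read one JSON object per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def save_jsonl(items, path):
    """Write one JSON object per line (streaming-friendly format)."""
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
```

Each inference and evaluation script could then import these two functions instead of redefining them, removing one source of drift between copies.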
4.3 Error Handling ✓ ADEQUATE
Strengths:
- Retry logic with exponential backoff for API calls
- Input validation in data processing
- Try-catch blocks around file I/O and JSON parsing
- Graceful handling of missing fields
Example:
infer_gemini.py lines 86-102:
except Exception as e:
    if "rate_limit" in str(e).lower():
        wait_time = base_delay * (2 ** attempt) + 5
        logger.warning(f"Rate limit hit. Waiting {wait_time}s...")
        time.sleep(wait_time)
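The retry-with-exponential-backoff pattern described above can be expressed as a small standalone helper; this is an illustrative sketch, not the submission's actual code, and the function name and parameters are assumptions:

```python
import time

def call_with_backoff(fn, max_attempts=5, base_delay=1.0):
    # Exponential backoff: wait base_delay * 2**attempt between attempts,
    # with an extra 5s when the error message indicates a rate limit.
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as e:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the last error
            wait_time = base_delay * (2 ** attempt)
            if "rate_limit" in str(e).lower():
                wait_time += 5
            time.sleep(wait_time)
```

The submission's tenacity dependency can provide equivalent behavior declaratively via its retry decorator; the explicit loop above just makes the backoff arithmetic visible.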
4.4 Code Organization ✓ GOOD
- Clear separation of concerns
- Logical file naming
- Consistent coding style
- Proper use of functions and classes
- Comprehensive docstrings
---
5. FUNCTIONALITY INDICATORS
5.1 Data Loading Mechanisms ✓ ROBUST
Implementation Quality:
load_data.py lines 357-397:
def load_dataset_by_name(dataset_name: str):
    if dataset_name == "gsm8k":
        dataset = load_dataset("gsm8k", "main")["test"]
        processor = GSM8KProcessor()
    elif dataset_name == "mmlu_psychology":
        dataset = load_dataset("cais/mmlu", "professional_psychology")["test"]
        processor = MMLUPsychologyProcessor()
    # ... proper dataset loading with domain-specific processors
Assessment: Production-quality data loading with validation, error handling, and unified output format.
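The processor pattern implied above can be sketched as follows; the unified output fields and the process method name are assumptions, since the excerpt only shows the processor classes being instantiated (the "#### N" answer marker, however, is the actual GSM8K format):

```python
# Hypothetical sketch of the domain-processor interface; the real
# GSM8KProcessor's fields and methods are not shown in the excerpt.
class GSM8KProcessor:
    domain = "math"

    def process(self, raw):
        # GSM8K gold answers end with "#### <number>"; keep only that number.
        answer = raw["answer"].split("####")[-1].strip()
        return {
            "domain": self.domain,
            "question": raw["question"],
            "gold": answer,
        }

item = GSM8KProcessor().process(
    {"question": "2+2?", "answer": "2+2=4\n#### 4"}
)
```

A shared output schema like this is what lets the downstream prompt-building and evaluation scripts treat all three domains uniformly.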
5.2 Model Inference ✓ FUNCTIONAL
Key Features:
- Proper message formatting for chat APIs
- Domain-specific system prompts
- Resume functionality to avoid re-processing
- Progress tracking with tqdm
- Incremental file writing to prevent data loss
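The resume and incremental-writing features listed above can be sketched as follows; all names here are illustrative, not the submission's actual functions:

```python
import json
import os

def load_done_ids(output_path):
    # Resume support: collect item IDs already written so re-runs skip them.
    if not os.path.exists(output_path):
        return set()
    with open(output_path, encoding="utf-8") as f:
        return {json.loads(line)["id"] for line in f if line.strip()}

def run_inference(items, output_path, generate):
    done = load_done_ids(output_path)
    with open(output_path, "a", encoding="utf-8") as f:
        for item in items:
            if item["id"] in done:
                continue
            item["output"] = generate(item)
            # Incremental write: each result is flushed immediately,
            # so a crash loses at most the in-flight item.
            f.write(json.dumps(item) + "\n")
            f.flush()
```

Appending to a JSONL file and rebuilding the done-set on startup is what makes long, rate-limited API runs restartable without re-spending credits.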
Critical Check:
infer_gemini.py lines 105-150:
def process_single_item(item, model_id, max_tokens, use_cot, domain):
    baseline_output = generate_text(messages, model_id, max_tokens, "baseline")
    primed_output = generate_text(messages, model_id, max_tokens, "primed")
    # ... processes all variants
    return output_item
Verdict: Real inference implementation, not mock/placeholder.
5.3 Evaluation Metrics ✓ PROPERLY COMPUTED
Answer Extraction:
evaluate_accuracy.py lines 11-41:
def extract_answer_from_output(text: str, domain: str):
    pattern = re.compile(r'Answer:\s*(.+)', re.IGNORECASE | re.MULTILINE)
    matches = list(pattern.finditer(text))
    if matches:
        answer = matches[-1].group(1).strip()
        # Domain-specific extraction logic...
Math-Specific Handling:
evaluate_math_accuracy.py lines 11-52:
def extract_math_answer_from_output(text: str):
    # Extracts numerical answers with multiple fallback patterns
    # Handles "Answer: X", standalone numbers, calculations
Assessment: Sophisticated answer parsing with domain-specific logic and fallback mechanisms.
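A minimal sketch of the fallback-based extraction described above; the exact patterns in evaluate_math_accuracy.py are only summarized in this report, so these regexes are illustrative rather than the submission's own:

```python
import re

def extract_math_answer(text):
    # Primary pattern: take the last "Answer: X" occurrence in the output.
    m = list(re.finditer(r"Answer:\s*([-\d.,]+)", text, re.IGNORECASE))
    if m:
        return m[-1].group(1).replace(",", "").rstrip(".")
    # Fallback: the last standalone number anywhere in the output.
    nums = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return nums[-1].replace(",", "") if nums else None
```

Taking the last match in both branches matters for CoT outputs, where intermediate calculations precede the final answer.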
---
6. DEPENDENCY & ENVIRONMENT ISSUES
6.1 Missing Dependency Specification 🔴 CRITICAL
Issue: No requirements.txt or setup.py file present.
Inferred Dependencies:
datasets>=2.0.0
openai>=1.0.0
tenacity>=8.0.0
tqdm>=4.60.0
Impact: Users cannot easily install dependencies. Must manually identify required packages.
6.2 API Key Management 🔴 CRITICAL
Security Issue:
All inference scripts (lines ~15-21):
GEMINI_KEY = "OPENROUTER_API_KEY" # WRONG!
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key=GEMINI_KEY)
Problem:
- Uses literal string instead of environment variable
- No validation that API key is set
- Will fail immediately on execution
- Comments claim it's a placeholder, but code doesn't implement proper env var reading
Correct Implementation Should Be:
import os

GEMINI_KEY = os.environ.get("OPENROUTER_API_KEY")
if not GEMINI_KEY:
    raise ValueError("OPENROUTER_API_KEY environment variable not set")
6.3 Computational Resources ✓ REASONABLE
Resource Requirements:
- API-based inference (no local GPU needed)
- Incremental processing with resume capability
- File-based checkpointing
- Memory-efficient streaming JSONL format
Assessment: Designed for practical execution without extreme resources.
---
7. REPRODUCIBILITY ASSESSMENT
7.1 Can Results Be Reproduced? 🔶 PARTIALLY
What's Reproducible:
- ✅ Data loading pipeline (if HuggingFace datasets are available)
- ✅ Prompt construction (deterministic)
- ✅ Evaluation logic (deterministic given outputs)
- ✅ Temperature=0 ensures deterministic model outputs
What's NOT Reproducible Without External Resources:
- ❌ Cannot run inference without API keys
- ❌ Cannot access proprietary models (GPT-4.1, Gemini) without accounts
- ❌ Results depend on model versions at API endpoints
- ❌ No sample data or cached outputs provided
7.2 Missing for Full Reproducibility
- requirements.txt - Specify exact dependency versions
- API key setup instructions - Document how to obtain and configure keys
- Sample data - At least a small subset for testing
- Example outputs - Cached model generations for verification
- Execution scripts - Shell scripts showing complete pipeline execution
- Model version tracking - Record which model versions were used
7.3 Code Execution Feasibility ⚠️ LOW
Blockers:
- API key hardcoded as literal string (immediate failure)
- No data files included
- OpenRouter API requires paid account
- No test suite or example runs
Estimated Effort to Run:
- Fix API key code: 5 minutes
- Download/prepare datasets: 30-60 minutes
- Obtain API access: Variable (requires account setup, credits)
- Run full pipeline: Hours to days (depending on API rate limits and dataset sizes)
---
8. SPECIFIC RED FLAGS IDENTIFIED
8.1 Severity: HIGH
None identified for results authenticity.
8.2 Severity: MEDIUM
- API Key Implementation Flaw
- Location: All inference scripts
- Issue: Hardcoded string instead of environment variable
- Impact: Code will not execute as written
- Evidence: GEMINI_KEY = "OPENROUTER_API_KEY" (literal string)
- Missing Data Files
- Location: Repository structure
- Issue: README states datasets not included
- Impact: Cannot run pipeline without manual data preparation
- Evidence: "Due to size constraints, the processed datasets are not included"
- No Requirements File
- Location: Root directory
- Issue: Dependencies not specified
- Impact: Users must guess required packages and versions
8.3 Severity: LOW
- Code Duplication
- Multiple files share similar utility functions
- Could impact maintainability
- Commented Development Notes
- build_own_persona.py contains commented persona examples
- Minor clutter but not functional issue
---
9. PAPER CLAIMS VERIFICATION
9.1 Quantitative Results Analysis
Paper reports specific accuracy values:
- Gemini math CoT: negative persona 93.9% vs positive 86.5%
- GPT-4.1 math no-CoT: +8.6% with personas
- Llama legal no-CoT: +18.8% with personas
Code Verification:
- ✅ Evaluation scripts compute these exact metrics
- ✅ No evidence of hardcoding these specific values
- ✅ Formulas in code would produce these results if real generations match
Sample Size Verification:
Paper claims: ~1,300 math, ~612 psychology, ~117 legal
Code confirms:
- GSM8K test split: 1,319 items
- MMLU professional_psychology test: 612 items
- BarExam QA: matches paper description
9.2 Methodology Claims ✓ SUPPORTED
All paper claims about methodology are supported by code:
- ✅ Six prompting strategies implemented
- ✅ Four models can be specified via arguments
- ✅ Both CoT and no-CoT modes present
- ✅ Cross-domain evaluation logic exists
- ✅ Negation experiments implemented
---
10. AI AGENT REPRODUCIBILITY
10.1 Evidence of AI-Assisted Development
Searched for:
- AI tool mentions (ChatGPT, Claude, Copilot, etc.)
- Prompt logs or conversation histories
- AI-generated code markers
- Documentation of AI assistance
Findings:
- ❌ No explicit documentation of AI tool usage
- ❌ No prompt logs or conversation histories
- ❌ No comments indicating AI generation
- ❌ No .ai/, .prompts/, or similar directories
Code Style Analysis:
- Consistent style suggests single author or coordinated team
- Comprehensive docstrings indicate thoughtful development
- Error handling patterns are sophisticated and consistent
- No obvious signs of naive AI-generated code (e.g., excessive comments, placeholder implementations)
10.2 Assessment
AGENT REPRODUCIBLE: FALSE
Reasoning:
- No documented evidence of AI assistance in code generation
- No prompts or AI interaction logs provided
- Cannot reconstruct the code generation process using AI agents
- If AI was used, it was not documented for reproducibility purposes
---
11. OVERALL RISK ASSESSMENT
11.1 Results Authenticity: ✅ LOW RISK
Confidence: HIGH
Evidence:
- No hardcoded results
- Proper computational pipeline
- Dynamic metric calculation
- Reasonable implementation patterns
Conclusion: Results appear to be legitimately generated from model inference.
11.2 Code Completeness: 🔶 MEDIUM-HIGH RISK
Confidence: MEDIUM
Issues:
- API key implementation broken
- Missing data files
- No dependency specification
- Cannot execute without significant setup
Conclusion: Code is structurally complete but not execution-ready without modifications.
11.3 Reproducibility: 🔴 HIGH RISK
Confidence: HIGH
Blockers:
- Requires external API access (paid)
- Missing data files
- API key code incorrect
- No example outputs or cached results
Conclusion: Difficult to reproduce results without substantial effort and resources.
11.4 Code Quality: 🔶 MEDIUM RISK
Confidence: HIGH
Assessment:
- Functionally sound implementation
- Good organization and structure
- Some duplication and minor issues
- Professional-level research code
Conclusion: Code quality is adequate for research purposes.
---
12. RECOMMENDATIONS
12.1 For Authors
Critical Fixes:
- Fix API key implementation to read from environment variables
- Provide requirements.txt with exact versions
- Include sample data subset for testing
- Add execution documentation with expected runtimes
Suggested Improvements:
- Provide cached model outputs for verification
- Create integration tests
- Add example execution scripts
- Document model versions used
- Include troubleshooting guide
12.2 For Reviewers
Verification Steps:
- ✅ Code structure is complete and logical
- ✅ No hardcoded results detected
- ✅ Implementation matches paper claims
- ⚠️ Cannot execute without API keys and data
- ⚠️ Reproducibility requires significant external resources
Key Questions:
- Did authors actually run this code or is it reconstructed?
- Can authors provide execution logs or intermediate outputs?
- Are the API key issues a submission artifact or actual code issues?
12.3 Reproducibility Status
Current State: PARTIALLY REPRODUCIBLE
Requirements for Full Reproduction:
- OpenRouter API account with credits
- Fix to API key implementation (5-minute code change)
- Access to HuggingFace datasets (public but must download)
- Estimated compute cost: $50-500 depending on API pricing
- Estimated time: 10-100 hours depending on rate limits
---
13. CONCLUSION
Final Verdict: MEDIUM-HIGH RISK
Summary:
This submission presents a complete, well-structured codebase that implements the claimed methodology. The code quality is good for research software, with proper error handling, data validation, and logical organization. Critically, there is no evidence of hardcoded results or data manipulation, suggesting the reported findings are legitimate.
However, the code has significant practical reproducibility issues:
- Broken API key implementation (uses literal string instead of environment variable)
- Missing data files and cached outputs
- Requires paid API access to expensive models
- No test suite or execution examples
The code appears to be functional and legitimate but requires fixing and additional resources to reproduce results.
Specific Assessments:
| Criterion | Risk Level | Confidence |
|-----------|------------|------------|
| Results Authenticity | LOW | HIGH |
| Code Completeness | MEDIUM-HIGH | HIGH |
| Implementation Quality | MEDIUM | HIGH |
| Reproducibility | HIGH | HIGH |
| Documentation | MEDIUM | HIGH |
Agent Reproducibility: FALSE
No evidence of documented AI assistance in code generation.
---
APPENDIX: FILE-BY-FILE ANALYSIS
load_data.py (550 lines)
- Purpose: Load datasets from HuggingFace, convert to unified format
- Quality: Excellent - comprehensive validation, error handling
- Issues: None significant
- Completeness: 100%
build_prompt.py (438 lines)
- Purpose: Generate prompt variants for all conditions
- Quality: Good - systematic prompt generation
- Issues: None
- Completeness: 100%
infer_gemini.py (294 lines)
- Purpose: Main inference engine for API calls
- Quality: Good but API key implementation broken
- Issues: GEMINI_KEY hardcoded as string (CRITICAL)
- Completeness: 95% (needs API key fix)
evaluate_accuracy.py (379 lines)
- Purpose: Evaluate multiple-choice answers
- Quality: Excellent - robust answer extraction
- Issues: None
- Completeness: 100%
evaluate_math_accuracy.py (323 lines)
- Purpose: Evaluate mathematical answers
- Quality: Excellent - sophisticated number extraction
- Issues: None
- Completeness: 100%
infer_negation.py (200 lines)
- Purpose: Run negation experiments
- Quality: Good
- Issues: Same API key issue as infer_gemini.py
- Completeness: 95%
infer_cross_domain.py (306 lines)
- Purpose: Cross-domain persona testing
- Quality: Good
- Issues: Same API key issue
- Completeness: 95%
build_negation_prompt.py (159 lines)
- Purpose: Generate negation prompts
- Quality: Good
- Issues: None
- Completeness: 100%
build_own_persona.py (321 lines)
- Purpose: Generate model-optimized persona prompts
- Quality: Good
- Issues: Minor - commented development notes
- Completeness: 100%
README.md (90 lines)
- Purpose: Documentation
- Quality: Good overview
- Issues: Missing API key setup instructions
- Completeness: 80%
---
Report Generated: 2024
Auditor: Claude Code Audit System
Methodology: Comprehensive static code analysis without execution