CODE AUDIT REPORT - Submission 165

EXECUTIVE SUMMARY

Submission ID: 165
Paper Topic: Persona-Primed Language Model Evaluation Across Multiple Domains
Audit Date: 2024
Overall Assessment: MEDIUM-HIGH RISK

Critical Findings Summary

---

1. COMPLETENESS & STRUCTURAL INTEGRITY

1.1 Core Implementation Status ✓ COMPLETE

Strengths:
File Inventory:
  1. load_data.py - Dataset loading from HuggingFace (550 lines)
  2. build_prompt.py - Prompt template construction (438 lines)
  3. build_negation_prompt.py - Negation experiment prompts (159 lines)
  4. build_own_persona.py - Model-generated persona prompts (321 lines)
  5. infer_gemini.py - Main inference engine (294 lines)
  6. infer_negation.py - Negation inference (200 lines)
  7. infer_cross_domain.py - Cross-domain evaluation (306 lines)
  8. evaluate_accuracy.py - Multiple-choice evaluation (379 lines)
  9. evaluate_math_accuracy.py - Math-specific evaluation (323 lines)
  10. README.md - Documentation (90 lines)

1.2 Critical Dependencies ⚠️ ISSUES IDENTIFIED

Missing External Resources:
Dependency Issues:

From infer_gemini.py line 21:

    GEMINI_KEY = "OPENROUTER_API_KEY"  # This is a literal string, not an env var!

Should be:

    GEMINI_KEY = os.environ.get("OPENROUTER_API_KEY", "")

Impact: Code will fail immediately when run because it tries to use the literal string "OPENROUTER_API_KEY" as the API key rather than reading from environment variables.
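For reference, the fix can be wrapped in a small fail-fast helper. A minimal sketch, assuming the same environment variable name (the helper name is ours, not the submission's):

```python
import os

def load_api_key(var: str = "OPENROUTER_API_KEY") -> str:
    """Read the API key from the environment, failing fast with a clear error."""
    key = os.environ.get(var, "")
    if not key:
        raise ValueError(f"{var} environment variable not set")
    return key
```

Failing at startup with an explicit message is preferable to the current behaviour, where the literal string is silently passed to the API client and rejected at request time.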

1.3 Import Analysis ✓ MOSTLY CLEAN

Standard Libraries Used:
External Libraries:
Assessment: All imports reference standard or well-known packages. No suspicious or non-existent local imports found.

---

2. RESULTS AUTHENTICITY RED FLAGS

2.1 Hardcoded Results Check ✓ CLEAN

Findings:

    # evaluate_accuracy.py lines 44-60
    def check_correctness(predicted: Optional[str], gold: str, domain: str = 'math') -> bool:
        if predicted is None:
            return False
        # Actual comparison logic follows...
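The comparison logic itself is elided in the excerpt above. A hypothetical continuation consistent with the signature (ours, not the submission's) might look like:

```python
def check_correctness(predicted, gold, domain='math'):
    # Guard clause as quoted from the submission.
    if predicted is None:
        return False
    if domain == 'math':
        # Hypothetical: compare numerically, tolerating thousands separators.
        try:
            return float(predicted.replace(',', '')) == float(gold.replace(',', ''))
        except ValueError:
            return False
    # Hypothetical: multiple-choice domains compare answer letters case-insensitively.
    return predicted.strip().upper() == gold.strip().upper()
```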

2.2 Random Seed Analysis ✓ ACCEPTABLE

Findings:
Assessment: Seeds appear to be used for reproducibility purposes, not result manipulation.

2.3 Result Generation Methodology ✓ LEGITIMATE

Pipeline Verification:
  1. Data Loading (load_data.py): Loads from HuggingFace, processes into unified format
  2. Prompt Building (build_prompt.py): Creates prompt variants systematically
  3. Inference (infer_*.py): Calls external LLM APIs with actual requests
  4. Evaluation (evaluate_*.py): Parses responses and computes metrics
Key Evidence of Legitimate Processing:

    # infer_gemini.py lines 72-102
    def generate_text(messages, model_id, max_tokens, variant_name):
        # Real API calls with retry logic
        response = client.chat.completions.create(
            model=model_id,
            messages=messages,
            max_tokens=500,
            temperature=0.0
        )
        return response.choices[0].message.content.strip()

---

3. IMPLEMENTATION-PAPER CONSISTENCY

3.1 Claimed Experiments vs Code

Paper Claims:
  1. ✅ Multi-domain evaluation (Math, Psychology, Legal) - CODE SUPPORTS
  2. ✅ Four models tested (Gemini, GPT-4.1, Qwen, Llama) - CODE SUPPORTS
  3. ✅ Two reasoning modes (CoT, no-CoT) - CODE SUPPORTS
  4. ✅ Six prompting strategies - CODE SUPPORTS ALL
  5. ✅ Temperature=0 for consistency - CODE CONFIRMS
Verification:

build_prompt.py defines all claimed prompt types:

  • BASELINE_TEMPLATE (line 10)
  • PRIMED_TEMPLATES (line 18-59) - domain-specific
  • PERSONA_TEMPLATES (line 61-178) - generic, historical, modern

    # infer_gemini.py line 78
    temperature=0.0,  # Matches paper claim

3.2 Methodology Alignment ✓ CONSISTENT

Prompt Structure Analysis:
Sample Sizes Match Paper:

---

4. CODE QUALITY SIGNALS

4.1 Dead Code Analysis ⚠️ MINOR ISSUES

Findings:
Assessment: ACCEPTABLE. Comments appear to document experimental results rather than disabled functionality.

4.2 Code Duplication 🔶 MODERATE

Identified Patterns:
Impact: Maintenance burden, but not indicative of poor understanding; common in research code.

4.3 Error Handling ✓ ADEQUATE

Strengths:
Example:

    # infer_gemini.py lines 86-102
    except Exception as e:
        if "rate_limit" in str(e).lower():
            wait_time = base_delay * (2 ** attempt) + 5
            logger.warning(f"Rate limit hit. Waiting {wait_time}s...")
            time.sleep(wait_time)
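The retry wait appears to be exponential backoff with a flat five-second cushion. A sketch of the resulting schedule (the base_delay value below is our assumption, not taken from the submission):

```python
def backoff_schedule(base_delay: float, attempts: int) -> list:
    """Wait times for successive retries: base_delay * 2**attempt, plus 5s."""
    return [base_delay * (2 ** attempt) + 5 for attempt in range(attempts)]

print(backoff_schedule(2, 4))  # → [7, 9, 13, 21]
```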

4.4 Code Organization ✓ GOOD

---

5. FUNCTIONALITY INDICATORS

5.1 Data Loading Mechanisms ✓ ROBUST

Implementation Quality:

    # load_data.py lines 357-397
    def load_dataset_by_name(dataset_name: str):
        if dataset_name == "gsm8k":
            dataset = load_dataset("gsm8k", "main")["test"]
            processor = GSM8KProcessor()
        elif dataset_name == "mmlu_psychology":
            dataset = load_dataset("cais/mmlu", "professional_psychology")["test"]
            processor = MMLUPsychologyProcessor()
        # ... proper dataset loading with domain-specific processors

Assessment: Production-quality data loading with validation, error handling, and unified output format.
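The name-to-processor dispatch can be sketched without the HuggingFace dependency; the processor classes below are stand-ins for the submission's, which carry the full parsing logic:

```python
# Stand-in processor classes for illustration only.
class GSM8KProcessor:
    domain = "math"

class MMLUPsychologyProcessor:
    domain = "psychology"

PROCESSORS = {
    "gsm8k": GSM8KProcessor,
    "mmlu_psychology": MMLUPsychologyProcessor,
}

def get_processor(dataset_name: str):
    """Map a dataset name to its domain-specific processor, failing loudly on unknown names."""
    if dataset_name not in PROCESSORS:
        raise ValueError(f"Unknown dataset: {dataset_name}")
    return PROCESSORS[dataset_name]()
```

A table-driven dispatch like this is easier to extend than the if/elif chain in the submission, though both are functionally equivalent.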

5.2 Model Inference ✓ FUNCTIONAL

Key Features:
Critical Check:

    # infer_gemini.py lines 105-150
    def process_single_item(item, model_id, max_tokens, use_cot, domain):
        baseline_output = generate_text(messages, model_id, max_tokens, "baseline")
        primed_output = generate_text(messages, model_id, max_tokens, "primed")
        # ... processes all variants
        return output_item

Verdict: Real inference implementation, not mock/placeholder.

5.3 Evaluation Metrics ✓ PROPERLY COMPUTED

Answer Extraction:

    # evaluate_accuracy.py lines 11-41
    def extract_answer_from_output(text: str, domain: str):
        pattern = re.compile(r'Answer:\s*(.+)', re.IGNORECASE | re.MULTILINE)
        matches = list(pattern.finditer(text))
        if matches:
            answer = matches[-1].group(1).strip()
            # Domain-specific extraction logic...

Math-Specific Handling:

    # evaluate_math_accuracy.py lines 11-52
    def extract_math_answer_from_output(text: str):
        # Extracts numerical answers with multiple fallback patterns
        # Handles "Answer: X", standalone numbers, calculations

Assessment: Sophisticated answer parsing with domain-specific logic and fallback mechanisms.
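A simplified sketch of the last-match strategy with a numeric fallback (the fallback regex is our assumption, not the submission's exact pattern):

```python
import re
from typing import Optional

def extract_answer(text: str) -> Optional[str]:
    """Prefer the last explicit 'Answer: X' line; otherwise take the last number."""
    matches = list(re.finditer(r'Answer:\s*(.+)', text, re.IGNORECASE | re.MULTILINE))
    if matches:
        return matches[-1].group(1).strip()
    numbers = re.findall(r'-?\d+(?:\.\d+)?', text)
    return numbers[-1] if numbers else None

print(extract_answer("Step 1...\nAnswer: 42"))  # → 42
```

Taking the last match rather than the first is the right call for chain-of-thought outputs, where intermediate reasoning may contain spurious "Answer:" strings.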

---

6. DEPENDENCY & ENVIRONMENT ISSUES

6.1 Missing Dependency Specification 🔴 CRITICAL

Issue: No requirements.txt or setup.py file present.
Inferred Dependencies:

    datasets>=2.0.0
    openai>=1.0.0
    tenacity>=8.0.0
    tqdm>=4.60.0

Impact: Users cannot easily install dependencies. Must manually identify required packages.

6.2 API Key Management 🔴 CRITICAL

Security Issue:

    # All inference scripts (lines ~15-21)
    GEMINI_KEY = "OPENROUTER_API_KEY"  # WRONG!
    client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key=GEMINI_KEY)

Problem:
  1. Uses literal string instead of environment variable
  2. No validation that API key is set
  3. Will fail immediately on execution
  4. Comments claim it's a placeholder, but code doesn't implement proper env var reading
Correct Implementation Should Be:

    GEMINI_KEY = os.environ.get("OPENROUTER_API_KEY")
    if not GEMINI_KEY:
        raise ValueError("OPENROUTER_API_KEY environment variable not set")

6.3 Computational Resources ✓ REASONABLE

Resource Requirements:
Assessment: Designed for practical execution without extreme resources.

---

7. REPRODUCIBILITY ASSESSMENT

7.1 Can Results Be Reproduced? 🔶 PARTIALLY

What's Reproducible:
What's NOT Reproducible Without External Resources:

7.2 Missing for Full Reproducibility

  1. requirements.txt - Specify exact dependency versions
  2. API key setup instructions - Document how to obtain and configure keys
  3. Sample data - At least a small subset for testing
  4. Example outputs - Cached model generations for verification
  5. Execution scripts - Shell scripts showing complete pipeline execution
  6. Model version tracking - Record which model versions were used

7.3 Code Execution Feasibility ⚠️ LOW

Blockers:
  1. API key hardcoded as literal string (immediate failure)
  2. No data files included
  3. OpenRouter API requires paid account
  4. No test suite or example runs
Estimated Effort to Run:

---

8. SPECIFIC RED FLAGS IDENTIFIED

8.1 Severity: HIGH

None identified for results authenticity.

8.2 Severity: MEDIUM

  1. API Key Implementation Flaw
    • Location: All inference scripts
    • Issue: Hardcoded string instead of environment variable
    • Impact: Code will not execute as written
    • Evidence: GEMINI_KEY = "OPENROUTER_API_KEY" (literal string)
  2. Missing Data Files
    • Location: Repository structure
    • Issue: README states datasets not included
    • Impact: Cannot run pipeline without manual data preparation
    • Evidence: "Due to size constraints, the processed datasets are not included"
  3. No Requirements File
    • Location: Root directory
    • Issue: Dependencies not specified
    • Impact: Users must guess required packages and versions

8.3 Severity: LOW

  1. Code Duplication
    • Multiple files share similar utility functions
    • Could impact maintainability
  2. Commented Development Notes
    • build_own_persona.py contains commented persona examples
    • Minor clutter, but not a functional issue

---

9. PAPER CLAIMS VERIFICATION

9.1 Quantitative Results Analysis

Paper reports specific accuracy values:
Code Verification:
Sample Size Verification:

Paper claims: ~1,300 math, ~612 psychology, ~117 legal
Code confirms:
  • GSM8K test split: 1,319 items
  • MMLU professional_psychology test: 612 items
  • BarExam QA: matches paper description

9.2 Methodology Claims ✓ SUPPORTED

All paper claims about methodology are supported by code:

---

10. AI AGENT REPRODUCIBILITY

10.1 Evidence of AI-Assisted Development

Searched for:
Findings:
Code Style Analysis:

10.2 Assessment

AGENT REPRODUCIBLE: FALSE
Reasoning:

---

11. OVERALL RISK ASSESSMENT

11.1 Results Authenticity: ✅ LOW RISK

Confidence: HIGH
Evidence:
Conclusion: Results appear to be legitimately generated from model inference.

11.2 Code Completeness: 🔶 MEDIUM-HIGH RISK

Confidence: MEDIUM
Issues:
Conclusion: Code is structurally complete but not execution-ready without modifications.

11.3 Reproducibility: 🔴 HIGH RISK

Confidence: HIGH
Blockers:
Conclusion: Difficult to reproduce results without substantial effort and resources.

11.4 Code Quality: 🔶 MEDIUM RISK

Confidence: HIGH
Assessment:
Conclusion: Code quality is adequate for research purposes.

---

12. RECOMMENDATIONS

12.1 For Authors

Critical Fixes:
  1. Fix API key implementation to read from environment variables
  2. Provide requirements.txt with exact versions
  3. Include sample data subset for testing
  4. Add execution documentation with expected runtimes
Suggested Improvements:
  1. Provide cached model outputs for verification
  2. Create integration tests
  3. Add example execution scripts
  4. Document model versions used
  5. Include troubleshooting guide

12.2 For Reviewers

Verification Steps:
  1. ✅ Code structure is complete and logical
  2. ✅ No hardcoded results detected
  3. ✅ Implementation matches paper claims
  4. ⚠️ Cannot execute without API keys and data
  5. ⚠️ Reproducibility requires significant external resources
Key Questions:
  1. Did authors actually run this code or is it reconstructed?
  2. Can authors provide execution logs or intermediate outputs?
  3. Are the API key issues a submission artifact or actual code issues?

12.3 Reproducibility Status

Current State: PARTIALLY REPRODUCIBLE
Requirements for Full Reproduction:

---

13. CONCLUSION

Final Verdict: MEDIUM-HIGH RISK

Summary:

This submission presents a complete, well-structured codebase that implements the claimed methodology. The code quality is good for research software, with proper error handling, data validation, and logical organization. Critically, there is no evidence of hardcoded results or data manipulation, suggesting the reported findings are legitimate.

However, the code has significant practical reproducibility issues:
  1. Broken API key implementation (uses literal string instead of environment variable)
  2. Missing data files and cached outputs
  3. Requires paid API access to expensive models
  4. No test suite or execution examples
The code appears functional and legitimate, but reproducing the results requires these fixes plus additional resources.

Specific Assessments:

| Criterion | Risk Level | Confidence |
|-----------|------------|------------|
| Results Authenticity | LOW | HIGH |
| Code Completeness | MEDIUM-HIGH | HIGH |
| Implementation Quality | MEDIUM | HIGH |
| Reproducibility | HIGH | HIGH |
| Documentation | MEDIUM | HIGH |

Agent Reproducibility: FALSE

No evidence of documented AI assistance in code generation.

---

APPENDIX: FILE-BY-FILE ANALYSIS

load_data.py (550 lines)

build_prompt.py (438 lines)

infer_gemini.py (294 lines)

evaluate_accuracy.py (379 lines)

evaluate_math_accuracy.py (323 lines)

infer_negation.py (200 lines)

infer_cross_domain.py (306 lines)

build_negation_prompt.py (159 lines)

build_own_persona.py (321 lines)

README.md (90 lines)

---

Report Generated: 2024
Auditor: Claude Code Audit System
Methodology: Comprehensive static code analysis without execution