CODE AUDIT REPORT - Submission 165
EXECUTIVE SUMMARY
Submission ID: 165
Paper Topic: Persona-Primed Language Model Evaluation Across Multiple Domains
Audit Date: 2024
Overall Assessment:
MEDIUM-HIGH RISK
Critical Findings Summary
- Results Authenticity: LOW RISK - No evidence of hardcoded results
- Completeness: MEDIUM-HIGH RISK - Missing data files, API keys hardcoded as placeholder strings
- Reproducibility: HIGH RISK - Cannot execute without external API keys and data files
- Code Quality: MEDIUM RISK - Functional code with some quality concerns
- Agent Reproducibility: FALSE - No evidence of AI-assisted code generation documented
---
1. COMPLETENESS & STRUCTURAL INTEGRITY
1.1 Core Implementation Status ✓ COMPLETE
Strengths:
- All nine Python scripts present and appear functionally complete
- No placeholder functions or pass statements in critical paths
- No TODOs or incomplete implementations found
- Clear pipeline structure: data loading → prompt building → inference → evaluation
File Inventory:
load_data.py - Dataset loading from HuggingFace (550 lines)
build_prompt.py - Prompt template construction (438 lines)
build_negation_prompt.py - Negation experiment prompts (159 lines)
build_own_persona.py - Model-generated persona prompts (321 lines)
infer_gemini.py - Main inference engine (294 lines)
infer_negation.py - Negation inference (200 lines)
infer_cross_domain.py - Cross-domain evaluation (306 lines)
evaluate_accuracy.py - Multiple-choice evaluation (379 lines)
evaluate_math_accuracy.py - Math-specific evaluation (323 lines)
README.md - Documentation (90 lines)
1.2 Critical Dependencies ⚠️ ISSUES IDENTIFIED
Missing External Resources:
- Datasets: README explicitly states "datasets are not included in this repository"
- API Keys: All inference scripts use the placeholder string "OPENROUTER_API_KEY" instead of an environment variable lookup
- No requirements.txt: Missing dependency specification file
- No data directories: No sample data or test files present
Dependency Issues:
From infer_gemini.py, line 21:
GEMINI_KEY = "OPENROUTER_API_KEY" # This is a literal string, not an env var!
Should be:
GEMINI_KEY = os.environ.get("OPENROUTER_API_KEY", "")
Impact: Code will fail immediately when run because it tries to use the literal string "OPENROUTER_API_KEY" as the API key rather than reading from environment variables.
1.3 Import Analysis ✓ MOSTLY CLEAN
Standard Libraries Used:
- argparse, json, logging, time, os, pathlib, typing, hashlib, re, glob, collections
External Libraries:
datasets (HuggingFace)
openai (OpenAI client for OpenRouter API)
tenacity (retry logic)
tqdm (progress bars)
Assessment: All imports reference standard or well-known packages. No suspicious or non-existent local imports found.
---
2. RESULTS AUTHENTICITY RED FLAGS
2.1 Hardcoded Results Check ✓ CLEAN
Findings:
- ✅ No hardcoded experimental results detected
- ✅ Evaluation metrics computed dynamically from model outputs
- ✅ Accuracy calculations perform actual comparisons:
# evaluate_accuracy.py lines 44-60
def check_correctness(predicted: Optional[str], gold: str, domain: str = 'math') -> bool:
    if predicted is None:
        return False
    # Actual comparison logic follows...
- ✅ No suspicious pre-set accuracy values or result dictionaries
2.2 Random Seed Analysis ✓ ACCEPTABLE
Findings:
- Deterministic seed used in CommonsenseQA processing (seed=1337) for reproducibility
- Temperature=0.0 in all model inference calls for deterministic generation
- No evidence of cherry-picking or multiple seed trials
Assessment: Seeds appear to be used for reproducibility purposes, not result manipulation.
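For illustration, the seeded subsampling pattern described above can be sketched as follows; the function name and the use of random.Random are assumptions, since the submission only shows that seed=1337 is fixed for CommonsenseQA processing:

```python
import random

def subsample_commonsenseqa(items, k, seed=1337):
    # Hypothetical sketch: a fixed seed means the same subset is drawn on
    # every run, making the selection reproducible rather than cherry-picked.
    rng = random.Random(seed)
    return rng.sample(items, k)

subset = subsample_commonsenseqa(list(range(100)), 10)
```

Because the RNG is local (random.Random rather than the module-level random), other code cannot perturb the draw between runs.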
2.3 Result Generation Methodology ✓ LEGITIMATE
Pipeline Verification:
- Data Loading (load_data.py): Loads from HuggingFace, processes into unified format
- Prompt Building (build_prompt.py): Creates prompt variants systematically
- Inference (infer_*.py): Calls external LLM APIs with actual requests
- Evaluation (evaluate_*.py): Parses responses and computes metrics
Key Evidence of Legitimate Processing:
infer_gemini.py lines 72-102:
def generate_text(messages, model_id, max_tokens, variant_name):
    # Real API calls with retry logic
    response = client.chat.completions.create(
        model=model_id,
        messages=messages,
        max_tokens=500,
        temperature=0.0
    )
    return response.choices[0].message.content.strip()
---
3. IMPLEMENTATION-PAPER CONSISTENCY
3.1 Claimed Experiments vs Code
Paper Claims:
- ✅ Multi-domain evaluation (Math, Psychology, Legal) - CODE SUPPORTS
- ✅ Four models tested (Gemini, GPT-4.1, Qwen, Llama) - CODE SUPPORTS
- ✅ Two reasoning modes (CoT, no-CoT) - CODE SUPPORTS
- ✅ Six prompting strategies - CODE SUPPORTS ALL
- ✅ Temperature=0 for consistency - CODE CONFIRMS
Verification:
build_prompt.py defines all claimed prompt types:
- BASELINE_TEMPLATE (line 10)
- PRIMED_TEMPLATES (lines 18-59) - domain-specific
- PERSONA_TEMPLATES (lines 61-178) - generic, historical, modern
infer_gemini.py line 78:
temperature=0.0, # Matches paper claim
3.2 Methodology Alignment ✓ CONSISTENT
Prompt Structure Analysis:
- Generic personas: "You are a brilliant mathematician..." ✅
- Historical personas: "You are Euclid..." / "You are Terence Tao..." ✅
- Negated personas: "You are not a mathematician..." ✅
- Domain priming: "This is a mathematics question..." ✅
Sample Sizes Match Paper:
- Math: ~1,300 items (GSM8K test split)
- Psychology: ~612 items (MMLU professional_psychology)
- Legal: ~117 items (BarExam QA)
---
4. CODE QUALITY SIGNALS
4.1 Dead Code Analysis ⚠️ MINOR ISSUES
Findings:
- Lines 1-22 of build_own_persona.py contain commented-out persona examples with model attributions
- These appear to be development notes recording different model-generated personas
- Ratio of commented code: ~7% (22/321 lines in one file)
Assessment: ACCEPTABLE - Comments appear to be documentation of experimental results rather than disabled functionality.
4.2 Code Duplication 🔶 MODERATE
Identified Patterns:
- Similar API call logic duplicated across infer_gemini.py, infer_negation.py, and infer_cross_domain.py
- Evaluation logic split between evaluate_accuracy.py and evaluate_math_accuracy.py with ~80% overlap
- File I/O functions (load_jsonl, save_jsonl) duplicated across multiple files
Impact: Maintenance burden but not indicative of poor understanding. Common in research code.
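A hedged sketch of how the duplicated I/O helpers could be consolidated into one shared module; the module name io_utils.py is hypothetical, and the load_jsonl/save_jsonl signatures are inferred from their names rather than taken from the submission:

```python
# io_utils.py - hypothetical shared module; the submission instead
# duplicates load_jsonl/save_jsonl across several scripts.
import json
from pathlib import Path

def load_jsonl(path):
    """Read one JSON object per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def save_jsonl(items, path):
    """Write one JSON object per line (streaming-friendly format)."""
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
```

Each inference and evaluation script could then import these two functions instead of redefining them, removing one source of drift between copies.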
4.3 Error Handling ✓ ADEQUATE
Strengths:
- Retry logic with exponential backoff for API calls
- Input validation in data processing
- Try-catch blocks around file I/O and JSON parsing
- Graceful handling of missing fields
Example:
infer_gemini.py lines 86-102:
except Exception as e:
    if "rate_limit" in str(e).lower():
        wait_time = base_delay * (2 ** attempt) + 5
        logger.warning(f"Rate limit hit. Waiting {wait_time}s...")
        time.sleep(wait_time)
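The retry-with-exponential-backoff pattern described above can be expressed as a small standalone helper; this is an illustrative sketch, not the submission's actual code, and the function name and parameters are assumptions:

```python
import time

def call_with_backoff(fn, max_attempts=5, base_delay=1.0):
    # Exponential backoff: wait base_delay * 2**attempt between attempts,
    # with an extra 5s when the error message indicates a rate limit.
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as e:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the last error
            wait_time = base_delay * (2 ** attempt)
            if "rate_limit" in str(e).lower():
                wait_time += 5
            time.sleep(wait_time)
```

The submission's tenacity dependency can provide equivalent behavior declaratively via its retry decorator; the explicit loop above just makes the backoff arithmetic visible.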
4.4 Code Organization ✓ GOOD
- Clear separation of concerns
- Logical file naming
- Consistent coding style
- Proper use of functions and classes
- Comprehensive docstrings
---
5. FUNCTIONALITY INDICATORS
5.1 Data Loading Mechanisms ✓ ROBUST
Implementation Quality:
load_data.py lines 357-397:
def load_dataset_by_name(dataset_name: str):
    if dataset_name == "gsm8k":
        dataset = load_dataset("gsm8k", "main")["test"]
        processor = GSM8KProcessor()
    elif dataset_name == "mmlu_psychology":
        dataset = load_dataset("cais/mmlu", "professional_psychology")["test"]
        processor = MMLUPsychologyProcessor()
    # ... proper dataset loading with domain-specific processors
Assessment: Production-quality data loading with validation, error handling, and unified output format.
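The processor pattern implied above can be sketched as follows; the unified output fields and the process method name are assumptions, since the excerpt only shows the processor classes being instantiated (the "#### N" answer marker, however, is the actual GSM8K format):

```python
# Hypothetical sketch of the domain-processor interface; the real
# GSM8KProcessor's fields and methods are not shown in the excerpt.
class GSM8KProcessor:
    domain = "math"

    def process(self, raw):
        # GSM8K gold answers end with "#### <number>"; keep only that number.
        answer = raw["answer"].split("####")[-1].strip()
        return {
            "domain": self.domain,
            "question": raw["question"],
            "gold": answer,
        }

item = GSM8KProcessor().process(
    {"question": "2+2?", "answer": "2+2=4\n#### 4"}
)
```

A shared output schema like this is what lets the downstream prompt-building and evaluation scripts treat all three domains uniformly.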
5.2 Model Inference ✓ FUNCTIONAL
Key Features:
- Proper message formatting for chat APIs
- Domain-specific system prompts
- Resume functionality to avoid re-processing
- Progress tracking with tqdm
- Incremental file writing to prevent data loss
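The resume and incremental-writing features listed above can be sketched as follows; all names here are illustrative, not the submission's actual functions:

```python
import json
import os

def load_done_ids(output_path):
    # Resume support: collect item IDs already written so re-runs skip them.
    if not os.path.exists(output_path):
        return set()
    with open(output_path, encoding="utf-8") as f:
        return {json.loads(line)["id"] for line in f if line.strip()}

def run_inference(items, output_path, generate):
    done = load_done_ids(output_path)
    with open(output_path, "a", encoding="utf-8") as f:
        for item in items:
            if item["id"] in done:
                continue
            item["output"] = generate(item)
            # Incremental write: each result is flushed immediately,
            # so a crash loses at most the in-flight item.
            f.write(json.dumps(item) + "\n")
            f.flush()
```

Appending to a JSONL file and rebuilding the done-set on startup is what makes long, rate-limited API runs restartable without re-spending credits.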
Critical Check:
infer_gemini.py lines 105-150:
def process_single_item(item, model_id, max_tokens, use_cot, domain):
    baseline_output = generate_text(messages, model_id, max_tokens, "baseline")
    primed_output = generate_text(messages, model_id, max_tokens, "primed")
    # ... processes all variants
    return output_item
Verdict: Real inference implementation, not mock/placeholder.
5.3 Evaluation Metrics ✓ PROPERLY COMPUTED
Answer Extraction:
evaluate_accuracy.py lines 11-41:
def extract_answer_from_output(text: str, domain: str):
    pattern = re.compile(r'Answer:\s*(.+)', re.IGNORECASE | re.MULTILINE)
    matches = list(pattern.finditer(text))
    if matches:
        answer = matches[-1].group(1).strip()
        # Domain-specific extraction logic...
Math-Specific Handling:
evaluate_math_accuracy.py lines 11-52:
def extract_math_answer_from_output(text: str):
    # Extracts numerical answers with multiple fallback patterns
    # Handles "Answer: X", standalone numbers, calculations
Assessment: Sophisticated answer parsing with domain-specific logic and fallback mechanisms.
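A minimal sketch of the fallback-based extraction described above; the exact patterns in evaluate_math_accuracy.py are only summarized in this report, so these regexes are illustrative rather than the submission's own:

```python
import re

def extract_math_answer(text):
    # Primary pattern: take the last "Answer: X" occurrence in the output.
    m = list(re.finditer(r"Answer:\s*([-\d.,]+)", text, re.IGNORECASE))
    if m:
        return m[-1].group(1).replace(",", "").rstrip(".")
    # Fallback: the last standalone number anywhere in the output.
    nums = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return nums[-1].replace(",", "") if nums else None
```

Taking the last match in both branches matters for CoT outputs, where intermediate calculations precede the final answer.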
---
6. DEPENDENCY & ENVIRONMENT ISSUES
6.1 Missing Dependency Specification 🔴 CRITICAL
Issue: No requirements.txt or setup.py file present.
Inferred Dependencies:
datasets>=2.0.0
openai>=1.0.0
tenacity>=8.0.0
tqdm>=4.60.0
Impact: Users cannot easily install dependencies. Must manually identify required packages.
6.2 API Key Management 🔴 CRITICAL
Security Issue:
All inference scripts (lines ~15-21):
GEMINI_KEY = "OPENROUTER_API_KEY" # WRONG!
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key=GEMINI_KEY)
Problem:
- Uses literal string instead of environment variable
- No validation that API key is set
- Will fail immediately on execution
- Comments claim it's a placeholder, but code doesn't implement proper env var reading
Correct Implementation Should Be:
import os

GEMINI_KEY = os.environ.get("OPENROUTER_API_KEY")
if not GEMINI_KEY:
    raise ValueError("OPENROUTER_API_KEY environment variable not set")
6.3 Computational Resources ✓ REASONABLE
Resource Requirements:
- API-based inference (no local GPU needed)
- Incremental processing with resume capability
- File-based checkpointing
- Memory-efficient streaming JSONL format
Assessment: Designed for practical execution without extreme resources.
---
7. REPRODUCIBILITY ASSESSMENT
7.1 Can Results Be Reproduced? 🔶 PARTIALLY
What's Reproducible:
- ✅ Data loading pipeline (if HuggingFace datasets are available)
- ✅ Prompt construction (deterministic)
- ✅ Evaluation logic (deterministic given outputs)
- ✅ Temperature=0 ensures deterministic model outputs
What's NOT Reproducible Without External Resources:
- ❌ Cannot run inference without API keys
- ❌ Cannot access proprietary models (GPT-4.1, Gemini) without accounts
- ❌ Results depend on model versions at API endpoints
- ❌ No sample data or cached outputs provided
7.2 Missing for Full Reproducibility
- requirements.txt - Specify exact dependency versions
- API key setup instructions - Document how to obtain and configure keys
- Sample data - At least a small subset for testing
- Example outputs - Cached model generations for verification
- Execution scripts - Shell scripts showing complete pipeline execution
- Model version tracking - Record which model versions were used
7.3 Code Execution Feasibility ⚠️ LOW
Blockers:
- API key hardcoded as literal string (immediate failure)
- No data files included
- OpenRouter API requires paid account
- No test suite or example runs
Estimated Effort to Run:
- Fix API key code: 5 minutes
- Download/prepare datasets: 30-60 minutes
- Obtain API access: Variable (requires account setup, credits)
- Run full pipeline: Hours to days (depending on API rate limits and dataset sizes)
---
8. SPECIFIC RED FLAGS IDENTIFIED
8.1 Severity: HIGH
None identified for results authenticity.
8.2 Severity: MEDIUM
- API Key Implementation Flaw
- Location: All inference scripts
- Issue: Hardcoded string instead of environment variable
- Impact: Code will not execute as written
- Evidence: GEMINI_KEY = "OPENROUTER_API_KEY" (literal string)
- Missing Data Files
- Location: Repository structure
- Issue: README states datasets not included
- Impact: Cannot run pipeline without manual data preparation
- Evidence: "Due to size constraints, the processed datasets are not included"
- No Requirements File
- Location: Root directory
- Issue: Dependencies not specified
- Impact: Users must guess required packages and versions
8.3 Severity: LOW
- Code Duplication
- Multiple files share similar utility functions
- Could impact maintainability
- Commented Development Notes
- build_own_persona.py contains commented persona examples
- Minor clutter but not functional issue
---
9. PAPER CLAIMS VERIFICATION
9.1 Quantitative Results Analysis
Paper reports specific accuracy values:
- Gemini math CoT: negative persona 93.9% vs positive 86.5%
- GPT-4.1 math no-CoT: +8.6% with personas
- Llama legal no-CoT: +18.8% with personas
Code Verification:
- ✅ Evaluation scripts compute these exact metrics
- ✅ No evidence of hardcoding these specific values
- ✅ Formulas in code would produce these results if real generations match
Sample Size Verification:
Paper claims: ~1,300 math, ~612 psychology, ~117 legal
Code confirms:
- GSM8K test split: 1,319 items
- MMLU professional_psychology test: 612 items
- BarExam QA: matches paper description
9.2 Methodology Claims ✓ SUPPORTED
All paper claims about methodology are supported by code:
- ✅ Six prompting strategies implemented
- ✅ Four models can be specified via arguments
- ✅ Both CoT and no-CoT modes present
- ✅ Cross-domain evaluation logic exists
- ✅ Negation experiments implemented
---
10. AI AGENT REPRODUCIBILITY
10.1 Evidence of AI-Assisted Development
Searched for:
- AI tool mentions (ChatGPT, Claude, Copilot, etc.)
- Prompt logs or conversation histories
- AI-generated code markers
- Documentation of AI assistance
Findings:
- ❌ No explicit documentation of AI tool usage
- ❌ No prompt logs or conversation histories
- ❌ No comments indicating AI generation
- ❌ No .ai/, .prompts/, or similar directories
Code Style Analysis:
- Consistent style suggests single author or coordinated team
- Comprehensive docstrings indicate thoughtful development
- Error handling patterns are sophisticated and consistent
- No obvious signs of naive AI-generated code (e.g., excessive comments, placeholder implementations)
10.2 Assessment
AGENT REPRODUCIBLE: FALSE
Reasoning:
- No documented evidence of AI assistance in code generation
- No prompts or AI interaction logs provided
- Cannot reconstruct the code generation process using AI agents
- If AI was used, it was not documented for reproducibility purposes
---
11. OVERALL RISK ASSESSMENT
11.1 Results Authenticity: ✅ LOW RISK
Confidence: HIGH
Evidence:
- No hardcoded results
- Proper computational pipeline
- Dynamic metric calculation
- Reasonable implementation patterns
Conclusion: Results appear to be legitimately generated from model inference.
11.2 Code Completeness: 🔶 MEDIUM-HIGH RISK
Confidence: MEDIUM
Issues:
- API key implementation broken
- Missing data files
- No dependency specification
- Cannot execute without significant setup
Conclusion: Code is structurally complete but not execution-ready without modifications.
11.3 Reproducibility: 🔴 HIGH RISK
Confidence: HIGH
Blockers:
- Requires external API access (paid)
- Missing data files
- API key code incorrect
- No example outputs or cached results
Conclusion: Difficult to reproduce results without substantial effort and resources.
11.4 Code Quality: 🔶 MEDIUM RISK
Confidence: HIGH
Assessment:
- Functionally sound implementation
- Good organization and structure
- Some duplication and minor issues
- Professional-level research code
Conclusion: Code quality is adequate for research purposes.
---
12. RECOMMENDATIONS
12.1 For Authors
Critical Fixes:
- Fix API key implementation to read from environment variables
- Provide requirements.txt with exact versions
- Include sample data subset for testing
- Add execution documentation with expected runtimes
Suggested Improvements:
- Provide cached model outputs for verification
- Create integration tests
- Add example execution scripts
- Document model versions used
- Include troubleshooting guide
12.2 For Reviewers
Verification Steps:
- ✅ Code structure is complete and logical
- ✅ No hardcoded results detected
- ✅ Implementation matches paper claims
- ⚠️ Cannot execute without API keys and data
- ⚠️ Reproducibility requires significant external resources
Key Questions:
- Did authors actually run this code or is it reconstructed?
- Can authors provide execution logs or intermediate outputs?
- Are the API key issues a submission artifact or actual code issues?
12.3 Reproducibility Status
Current State: PARTIALLY REPRODUCIBLE
Requirements for Full Reproduction:
- OpenRouter API account with credits
- Fix to API key implementation (5-minute code change)
- Access to HuggingFace datasets (public but must download)
- Estimated compute cost: $50-500 depending on API pricing
- Estimated time: 10-100 hours depending on rate limits
---
13. CONCLUSION
Final Verdict: MEDIUM-HIGH RISK
Summary:
This submission presents a complete, well-structured codebase that implements the claimed methodology. The code quality is good for research software, with proper error handling, data validation, and logical organization. Critically, there is no evidence of hardcoded results or data manipulation, suggesting the reported findings are legitimate.
However, the code has significant practical reproducibility issues:
- Broken API key implementation (uses literal string instead of environment variable)
- Missing data files and cached outputs
- Requires paid API access to expensive models
- No test suite or execution examples
The code appears to be functional and legitimate but requires fixing and additional resources to reproduce results.
Specific Assessments:
| Criterion | Risk Level | Confidence |
|-----------|------------|------------|
| Results Authenticity | LOW | HIGH |
| Code Completeness | MEDIUM-HIGH | HIGH |
| Implementation Quality | MEDIUM | HIGH |
| Reproducibility | HIGH | HIGH |
| Documentation | MEDIUM | HIGH |
Agent Reproducibility: FALSE
No evidence of documented AI assistance in code generation.
---
APPENDIX: FILE-BY-FILE ANALYSIS
load_data.py (550 lines)
- Purpose: Load datasets from HuggingFace, convert to unified format
- Quality: Excellent - comprehensive validation, error handling
- Issues: None significant
- Completeness: 100%
build_prompt.py (438 lines)
- Purpose: Generate prompt variants for all conditions
- Quality: Good - systematic prompt generation
- Issues: None
- Completeness: 100%
infer_gemini.py (294 lines)
- Purpose: Main inference engine for API calls
- Quality: Good but API key implementation broken
- Issues: GEMINI_KEY hardcoded as string (CRITICAL)
- Completeness: 95% (needs API key fix)
evaluate_accuracy.py (379 lines)
- Purpose: Evaluate multiple-choice answers
- Quality: Excellent - robust answer extraction
- Issues: None
- Completeness: 100%
evaluate_math_accuracy.py (323 lines)
- Purpose: Evaluate mathematical answers
- Quality: Excellent - sophisticated number extraction
- Issues: None
- Completeness: 100%
infer_negation.py (200 lines)
- Purpose: Run negation experiments
- Quality: Good
- Issues: Same API key issue as infer_gemini.py
- Completeness: 95%
infer_cross_domain.py (306 lines)
- Purpose: Cross-domain persona testing
- Quality: Good
- Issues: Same API key issue
- Completeness: 95%
build_negation_prompt.py (159 lines)
- Purpose: Generate negation prompts
- Quality: Good
- Issues: None
- Completeness: 100%
build_own_persona.py (321 lines)
- Purpose: Generate model-optimized persona prompts
- Quality: Good
- Issues: Minor - commented development notes
- Completeness: 100%
README.md (90 lines)
- Purpose: Documentation
- Quality: Good overview
- Issues: Missing API key setup instructions
- Completeness: 80%
---
Report Generated: 2024
Auditor: Claude Code Audit System
Methodology: Comprehensive static code analysis without execution