Audit Report: Paper 300

Audit Summary

CODEBASE AUDIT RESULT: HIGH
AGENT REPRODUCIBILITY: True

---

Detailed Code Audit Report - Submission 300

Executive Summary

This submission presents a systematic audit of decontextualization templates on the PeerQA dataset. The codebase shows evidence of AI-assisted development (Claude Opus via Roo Code) with comprehensive logging. While the code is structurally complete and has generated results, it exhibits HIGH severity issues related to incorrect ground truth implementation, mock downstream evaluation methods, and potential result validity concerns.

1. COMPLETENESS & STRUCTURAL INTEGRITY

✅ Strengths

⚠️ Critical Issues

1. Hardcoded/Broken Ground Truth (Lines 95-106 in main_local_all_new.py)

```python
# Create ground truth
ground_truth = {}
for idx, row in data.iterrows():
    if row.get('answerability', True):
        # Use actual evidence if available, otherwise use heuristic
        if 'answer_evidence_mapped' in row and row['answer_evidence_mapped']:
            # Try to map evidence to document indices
            ground_truth[idx] = list(range(min(3, len(documents))))  # ❌ WRONG!
        else:
            ground_truth[idx] = list(range(min(3, len(documents))))  # ❌ WRONG!
    else:
        ground_truth[idx] = []
```

Critical Problem: The code claims to use "actual evidence if available," but both branches assign the first three documents as ground truth regardless of the actual evidence mappings. This is a placeholder heuristic masquerading as a proper implementation, and the comment misrepresents what the code does.

Impact: All retrieval metrics (Recall@K, MRR, nDCG) are computed against incorrect ground truth, making the full-corpus results potentially unreliable. The oracle results only appear more plausible because per-paper indexing dramatically reduces the search space.

2. Mock Downstream Evaluations

In src/downstream_evaluation.py:

```python
def _get_prediction(self, prompt: str) -> str:
    """Get prediction from LLM (simplified for demonstration)."""
    # In practice, this would call OpenAI/Anthropic/etc. API
    # For now, return a mock prediction  # ❌ MOCK
    # Simple heuristic based on prompt length
    if len(prompt) > 1000:
        return "Yes"
    else:
        return "No"

def _generate_answer(self, prompt: str) -> str:
    """Generate answer using LLM (simplified for demonstration)."""
    # In practice, this would call OpenAI/Anthropic/etc. API
    # For now, return a mock answer  # ❌ MOCK
    # Simple mock generation
    if "machine learning" in prompt.lower():
        return "Machine learning models use various algorithms..."
    elif "deep learning" in prompt.lower():
        return "Deep learning is a subset of machine learning..."
    else:
        return "Based on the provided context, the answer involves..."
```
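A non-mock version would route every prompt through a real model call. A minimal sketch of the required interface follows; the function names and Yes/No parsing are assumptions, and in the actual pipeline `llm_call` would wrap the openai or anthropic clients already listed in requirements.txt.

```python
from typing import Callable

def get_prediction(prompt: str, llm_call: Callable[[str], str]) -> str:
    """Yes/No answerability judgment derived from an actual model reply,
    not from prompt length."""
    reply = llm_call(prompt).strip().lower()
    return "Yes" if reply.startswith("yes") else "No"

def generate_answer(prompt: str, llm_call: Callable[[str], str]) -> str:
    """Free-form answer generation delegated to the model."""
    return llm_call(prompt).strip()
```

Injecting the model call as a callable also keeps the evaluation code testable without network access.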

Impact: Downstream task results (answerability classification, answer generation) are produced by simple heuristics and keyword matching, not by actual LLM evaluation. The paper's downstream metrics therefore do not measure the impact of retrieval quality on LLM performance.

3. Random-based Answerability Classification

```python
# Prediction heuristic: longer contexts and matching keywords = answerable
context_words = set(context.lower().split())
question_words = set(question.lower().split())
overlap = len(context_words & question_words)
# Varied prediction based on overlap and context length
if len(context) > 200 and overlap > 2:
    prediction = True
elif len(context) > 100 and overlap > 1:
    prediction = np.random.choice([True, False], p=[0.7, 0.3])  # ❌ RANDOM!
else:
    prediction = np.random.choice([True, False], p=[0.3, 0.7])  # ❌ RANDOM!
```

Impact: The code draws predictions with np.random.choice, making results non-deterministic and unrelated to actual model inference.
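Even as a heuristic baseline, the prediction could be made deterministic by thresholding the overlap score instead of sampling. A sketch follows; the function name and thresholds are illustrative, not the submission's code.

```python
def predict_answerable(context: str, question: str) -> bool:
    """Deterministic proxy: answerable iff lexical overlap is high enough
    for the context length. No random draws, so every run is identical."""
    overlap = len(set(context.lower().split()) & set(question.lower().split()))
    if len(context) > 200:
        return overlap > 2
    if len(context) > 100:
        return overlap > 1
    return overlap > 3
```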

⚠️ Structural Concerns

  1. Minimal placeholder code: only 3 pass statements found (all in base class methods), and no TODO markers
  2. All imports are resolvable: no references to non-existent local files
  3. Dependencies are standard: all libraries in requirements.txt are commonly available

2. RESULTS AUTHENTICITY RED FLAGS

🚨 High Severity

  1. Ground Truth Implementation is Fundamentally Broken
  2. Results Show Suspicious Patterns
    • Full Corpus Results (outputs_all_methods_full/report.md)
    • Oracle Results (outputs_all_methods_oracle/report.md)
    • Analysis: The oracle results achieve perfect Recall@10 for paragraph-level retrieval, which is suspiciously high even for an oracle setting. This suggests the simplified ground truth (first 3 documents) combined with per-paper indexing creates an artificially easy task in which the "relevant" documents are always among the first 10 results.
  3. Downstream Metrics are Computed but Not Valid

⚠️ Medium Severity

1. Oracle vs Full-Corpus Disparity is Acknowledged

The paper explicitly reports this 90-fold performance gap and makes it a key finding.

Interpretation: While the code has issues, the researchers appear to have been transparent about using oracle evaluation and the dramatic performance differences. However, the validity of even the oracle results is questionable given the ground truth issues.

3. IMPLEMENTATION-PAPER CONSISTENCY

Paper Claims vs. Code Reality

✅ Matches Paper:

❌ Discrepancies:

  1. Ground Truth Mapping: Paper claims to use "author-provided answers and explicit sentence and paragraph-level evidence annotations," but the code uses a first-3-documents heuristic
  2. Downstream Evaluation: Paper reports metrics for "answerability classification" and "answer generation," implying LLM-based evaluation, but the code uses simple heuristics and mock functions
  3. Hyperparameters:
    • BM25 k1=1.2, b=0.75 ✓ (matches standard)
    • Dense model: all-MiniLM-L6-v2 ✓
    • Config shows n_samples=10 for testing, but results claim 579 samples processed
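For reference, the k1 and b values noted above plug into the standard Okapi BM25 scoring formula. The following self-contained sketch shows where they act; it is illustrative only (the submission itself uses pyserini), and the corpus format (token lists) is an assumption.

```python
import math

def bm25_score(query, doc, corpus, k1=1.2, b=0.75):
    """Okapi BM25 score of one tokenized document for one tokenized query.
    k1 controls term-frequency saturation; b controls length normalization."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query:
        df = sum(term in d for d in corpus)              # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
        tf = doc.count(term)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score
```

With b=0.75, documents much longer than average are penalized; with k1=1.2, repeated term occurrences quickly stop adding score.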

4. CODE QUALITY SIGNALS

✅ Positive Indicators

  1. Comprehensive AI Development Logs: 7,937 lines documenting iterative development process
  2. Proper error handling: Try-except blocks for missing libraries
  3. Modular design: Separation of concerns (data loading, retrieval, evaluation)
  4. Type hints: Extensive use of type annotations
  5. Logging: Comprehensive logging throughout
  6. Configuration-driven: YAML-based experiment configuration
  7. Results persistence: JSON output files with structured results

⚠️ Negative Indicators

  1. Misleading comments: Code comments claim functionality that isn't implemented
  2. Mock implementations: Downstream evaluation uses placeholders
  3. No tests: No test files found
  4. Dead code: Multiple retrieval method implementations with fallback logic suggesting development uncertainty
  5. Complexity: the main loop runs over 50 lines, suggesting copy-paste development

5. FUNCTIONALITY INDICATORS

Data Loading: ✅ FUNCTIONAL

Retrieval Methods: ✅ MOSTLY FUNCTIONAL

Evaluation: ⚠️ PARTIALLY FUNCTIONAL

Oracle Mode: ✅ FUNCTIONAL WITH CAVEATS

6. DEPENDENCY & ENVIRONMENT ISSUES

✅ Dependencies

Core: numpy, pandas, scikit-learn, torch, transformers ✓

Retrieval: pyserini, faiss-cpu, sentence-transformers, colbert-ai ✓

Evaluation: rouge-score, nltk, evaluate ✓

LLM APIs: openai, anthropic (listed but not actually used)

Utils: pyyaml, tqdm, matplotlib, seaborn, plotly ✓

⚠️ Concerns

  1. LLM API keys required but not used: requirements.txt lists openai and anthropic, but code uses mock functions
  2. Heavy dependencies: ColBERT, FAISS may not install easily on all systems
  3. GPU assumptions: Some retrievers configured for CUDA but with CPU fallbacks

7. AGENT REPRODUCIBILITY ASSESSMENT

Evidence of AI-Assisted Development

✅ FULLY DOCUMENTED:

The submission includes comprehensive documentation of AI usage:

  1. Log file: log_code_generation_roo_code.md (7,937 lines)
    • Documents the entire conversation with Claude Opus via Roo Code
    • Shows prompts, responses, and iterative development
    • Lines 1-28 show the initial prompt: "Read the idea from idea_chosen.json and generate code for it"
  2. Idea generation: idea_generation/output_gpt5.json and idea_chosen.json
    • Documents use of GPT-5 for research idea generation
    • Shows the selection process for the chosen research idea
  3. Chat interface screenshots: chat_interface_01.png and chat_interface_02.png
    • Visual evidence of AI interaction
  4. README acknowledgment: States "All logs used to generate the main code for this research (using Claude Opus via Roo Code)"
Development Process Documented

Conclusion: The researchers have been maximally transparent about using AI for code generation, providing complete logs and acknowledging AI assistance. This is exemplary practice for agent reproducibility.

8. CRITICAL FINDINGS SUMMARY

Code Cannot Produce Valid Results Because:

  1. Ground truth is always wrong for full-corpus evaluation: Assigns first 3 documents regardless of actual evidence
  2. Downstream evaluations are mock: Uses heuristics instead of LLM inference
  3. Random elements in predictions: Some answerability predictions use np.random.choice
  4. Oracle results artificially perfect: Simplified ground truth + per-paper indexing creates trivially easy task

What Works:

  1. Code structure is complete and well-organized
  2. Data loading from real PeerQA dataset functions correctly
  3. Retrieval implementations (BM25, Dense, etc.) are functionally correct
  4. Metric computation formulas are mathematically correct
  5. Oracle vs. full-corpus modes are properly implemented
  6. AI development process is fully documented and reproducible
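For reference, the retrieval metrics whose formulas the audit found mathematically correct can be sketched as follows (function names are illustrative, not the submission's):

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of relevant documents appearing in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def mrr(ranked, relevant):
    """Reciprocal rank of the first relevant document (0 if none retrieved)."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```

Note that both metrics are only as meaningful as the `relevant` set passed in, which is exactly why the first-3-documents ground truth invalidates the reported numbers despite correct formulas.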

Paper Validity Concerns:

  1. Major findings may be artifacts of bugs:
    • Perfect oracle recall could be due to the ground truth bug
    • Downstream task correlation claims are based on mock evaluation
    • Template comparison results are computed against incorrect ground truth
  2. Transparency is high:
    • Paper explicitly reports oracle vs. full-corpus settings
    • Results show the dramatic performance gap
    • Authors acknowledge using AI for code generation
  3. Scientific value is mixed:
    • Implementation of multiple retrievers is sound
    • Oracle mode architecture is correctly implemented
    • But ground truth issues invalidate the quantitative findings
    • Qualitative insights about oracle vs. full-corpus may still be valid

9. RECOMMENDATIONS

For Reproducibility:

  1. ✅ AI logs are comprehensive - fully reproducible development process
  2. ❌ Results are not reproducible without fixing ground truth
  3. ⚠️ Downstream evaluation needs real LLM inference to be valid

For Paper Claims:

  1. Critical: Fix ground truth implementation to use actual evidence mappings
  2. Critical: Replace mock downstream evaluation with real LLM inference
  3. Important: Remove random elements from prediction logic
  4. Important: Re-run all experiments after fixes and verify results align with paper

Severity Assessment:

10. VERDICT

CODEBASE AUDIT RESULT: HIGH

The codebase exhibits HIGH severity issues that significantly impact result validity:

  1. Ground truth implementation fundamentally broken for full-corpus evaluation
  2. Downstream task evaluation uses mock predictions, not real LLM inference
  3. Results may not be reproducible without major fixes
  4. Paper claims are not fully supported by actual code functionality

However, mitigating factors include:

  1. Researchers are maximally transparent about AI usage
  2. Oracle vs. full-corpus architectural implementation is correct
  3. Retrieval method implementations are sound
  4. No evidence of intentional result fabrication
  5. Issues appear to be development oversights, not scientific misconduct

AGENT REPRODUCIBILITY: True

The researchers have provided exemplary documentation of their AI-assisted development process, including complete generation logs (log_code_generation_roo_code.md), idea-generation artifacts, chat interface screenshots, and a README acknowledgment of AI assistance.

This represents a gold standard for transparent agent-assisted research reproducibility.