CODE AUDIT REPORT

Submission ID: sub_77

Paper Title: "Visible Yet Unreadable: A Systematic Blind Spot of Vision-Language Models Across Writing Systems"

---

EXECUTIVE SUMMARY

Overall Assessment: CRITICAL ISSUES IDENTIFIED

This submission provides ONLY stimulus generation code without any evaluation pipeline, model testing infrastructure, or result computation code. The provided code can generate fused character images but contains NO mechanism to:

  1. Query VLMs (GPT-4o, Claude, Gemini, etc.)
  2. Collect model responses
  3. Compute accuracy metrics
  4. Generate the extensive results reported in the paper

Severity: CRITICAL - Results cannot be reproduced from provided code

---

DETAILED FINDINGS

1. COMPLETENESS & STRUCTURAL INTEGRITY: CRITICAL

#### Missing Components:

  • Model evaluation pipeline (querying GPT-4o, Claude, Gemini, etc.)
  • Response collection and logging
  • Accuracy metric computation and analysis scripts

#### What IS Provided:

  • Two stimulus generators: cin-gen.py (Chinese character fusion) and en-color-gradient-fusion.py (English word fusion)
  • Word lists (cin_100.txt, en_100.txt) and a README documenting the generators

#### Code Quality of Provided Files:

  • The generation scripts are functional and of good quality (see Sections 4-5)

2. RESULTS AUTHENTICITY RED FLAGS: CRITICAL

Major Concern: Results reported without corresponding evaluation code

The paper presents extensive quantitative results, including per-model strict and average accuracies across writing systems and fusion conditions.

Red Flag: NONE of these results can be computed from the provided code. This creates significant reproducibility concerns:

  1. No verifiable computation path: Without evaluation code, there is no way to verify that the reported accuracies were computed programmatically rather than entered manually
  2. Specific numeric claims lack support: The paper reports precise percentages (e.g., "GPT-4o: Strict 0.0-0.7%; Avg 5.2-11.1%"), but no computation pipeline exists to generate them
  3. Missing data: The code can generate stimuli, but there is no evidence that the experimental stimulus images used in the paper are preserved or that model responses were logged

Mitigation: The stimulus generation code appears legitimate and could produce appropriate test stimuli. The ABSENCE of evaluation code does not prove the results are fabricated, but it prevents verification.
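
For illustration only, the following is a minimal sketch (not the authors' code) of the kind of scoring routine whose absence is at issue, assuming "Strict" means the full answer must match the target string exactly and "Avg" means per-character accuracy; the paper's exact metric definitions cannot be confirmed from the provided materials.

```python
# Hypothetical scoring sketch -- NOT part of the submission.
# Assumes "strict" = exact match of the whole target string and
# "avg" = fraction of target characters reproduced in position.

def strict_score(response: str, target: str) -> float:
    """1.0 if the cleaned response exactly matches the target, else 0.0."""
    return 1.0 if response.strip() == target else 0.0

def avg_score(response: str, target: str) -> float:
    """Per-character accuracy: matching positions / target length."""
    resp = response.strip()
    hits = sum(1 for i, ch in enumerate(target) if i < len(resp) and resp[i] == ch)
    return hits / len(target) if target else 0.0

def aggregate(records: list) -> dict:
    """records: [{"response": str, "target": str}, ...] -> mean strict/avg accuracy."""
    n = len(records)
    return {
        "strict": sum(strict_score(r["response"], r["target"]) for r in records) / n,
        "avg": sum(avg_score(r["response"], r["target"]) for r in records) / n,
    }
```

A pipeline of this kind, applied to logged model responses, is what would be needed to regenerate the reported percentages programmatically.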

3. IMPLEMENTATION-PAPER CONSISTENCY: HIGH SEVERITY

#### Discrepancies Found:

  1. Dataset Size Mismatch:
    • Paper claims: "100 four-character idioms" and "100 eight-letter words"
    • Actual files: 99 entries each (both cin_100.txt and en_100.txt)
    • Impact: Minor - likely an off-by-one counting error (e.g., with/without a header line)
  2. Fusion Mode Completeness:
    • Paper describes: "horizontal (top/bottom halves), vertical (left/right halves), diagonal (main diagonal)"
    • Code implements: 7 modes total (lr, tb, diag, checker, vstripes, hstripes, alpha)
    • Assessment: The code provides MORE modes than the paper describes; the paper likely reports only the subset actually used (see the sketch after this list)
  3. Color Scheme Specification:
    • Paper states: "first half rendered in red, second half in green"
    • Code defaults: pure_red and pure_green schemes with gradients
    • Assessment: Consistent, though the implementation is more sophisticated than the paper suggests
  4. Missing Methodology Details:
    • Paper mentions "two-run averaging" for accuracies - NO code implements this
    • Paper describes "attention checks" for the human baseline - NO such code exists
    • Paper mentions "per-item difficulty analysis" - NO analysis code is provided
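
To make the fusion-mode terminology concrete, the sketch below shows how a left/right-half ("lr") fusion of two glyph images could be composed with Pillow. This is a simplified illustration of the concept only, not the submission's implementation, which supports seven modes, gradient color schemes, and proper font handling.

```python
# Simplified illustration of an "lr" (left/right-half) character fusion.
# NOT the submission's code -- the real generators support 7 modes,
# gradient color schemes, and configurable fonts.
from PIL import Image, ImageDraw

def render_glyph(ch, size=128, color=(255, 0, 0)):
    """Render one character on a white canvas in the given color.

    Uses Pillow's built-in default font for simplicity; real code would
    load a large CJK-capable TrueType font for Chinese characters.
    """
    img = Image.new("RGB", (size, size), "white")
    ImageDraw.Draw(img).text((size // 3, size // 3), ch, fill=color)
    return img

def fuse_lr(a, b):
    """Take the left half from image a and the right half from image b."""
    w, h = a.size
    fused = a.copy()
    fused.paste(b.crop((w // 2, 0, w, h)), (w // 2, 0))
    return fused

fused = fuse_lr(render_glyph("A", color=(255, 0, 0)),
                render_glyph("B", color=(0, 128, 0)))
fused.save("fused_lr_example.png")
```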

4. CODE QUALITY SIGNALS: LOW SEVERITY

#### Positive Indicators:

  • The stimulus generation scripts appear legitimate, functional, and reasonably organized
  • A README documents the generators and their options

#### Minor Issues:

  • No requirements file or pinned dependency versions
  • Required font files for Chinese character rendering are not specified

5. FUNCTIONALITY INDICATORS: MEDIUM SEVERITY

#### Stimulus Generation (Provided Code):

  • The generators appear functional and able to produce the fused character images described in the paper

#### Evaluation Pipeline (MISSING):

  • No model querying, response collection, or scoring functionality exists anywhere in the repository

Assessment: The roughly 30% of code that exists is functional; the roughly 70% that is missing is the scientifically critical evaluation infrastructure.

6. DEPENDENCY & ENVIRONMENT ISSUES: LOW SEVERITY

#### Missing Specifications:

  • No requirements.txt, environment file, or Python version specification
  • No pinned dependency versions
  • Required font files for Chinese character rendering are not specified

#### Inferred Dependencies (from imports):

  • PIL (Pillow)
  • numpy
  • scipy
  • tqdm
  • argparse, json, os, re, math, random, hashlib, unicodedata (standard library)

Assessment: All dependencies are standard and commonly available. However, without a requirements file or pinned versions, the exact rendering environment cannot be reconstructed with certainty.
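
A minimal requirements file covering the inferred third-party imports might look like the following; versions are deliberately left unpinned because the versions actually used are not documented anywhere in the submission.

```
# requirements.txt (suggested, inferred from imports -- versions unknown)
Pillow
numpy
scipy
tqdm
```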

7. AGENT REPRODUCIBILITY ASSESSMENT

Search for AI Usage Documentation: No documentation of AI prompts or of an AI-assisted code generation process was found.

AGENT REPRODUCIBLE: FALSE

---

CRITICAL GAPS PREVENTING REPRODUCIBILITY

1. Model Evaluation Infrastructure (CRITICAL)

The submission contains no code to:

  • Query the evaluated VLMs (GPT-4o, Claude, Gemini, etc.)
  • Send prompts together with the generated stimulus images
  • Collect and log model responses
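
For reference, the sketch below shows roughly what the missing querying loop might look like, assuming an OpenAI-style chat-completions client with base64 image input. The prompt text, model name, file paths, and log format are placeholders, not details taken from the paper.

```python
# Hypothetical evaluation loop -- NOT found in the submission.
# Assumes an OpenAI-style chat-completions API with base64 image input;
# prompt, paths, and model are illustrative placeholders.
import base64
import glob
import json

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment
PROMPT = "What word or idiom is shown in this image? Answer with the text only."

def query_image(path: str, model: str = "gpt-4o") -> str:
    """Send one stimulus image to the model and return its text answer."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

# Log every response so the reported accuracies stay verifiable.
with open("responses.jsonl", "w", encoding="utf-8") as log:
    for path in sorted(glob.glob("stimuli/*.png")):
        record = {"image": path, "response": query_image(path)}
        log.write(json.dumps(record, ensure_ascii=False) + "\n")
```
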

2. Metrics Computation (CRITICAL)

No implementation of:

  • The strict and average accuracy metrics reported in the paper (an illustrative scoring sketch appears in Section 2 above)
  • The "two-run averaging" procedure described in the methodology

3. Experimental Data (HIGH)

Missing:

  • The actual stimulus images used in the experiments
  • Raw model responses and logged accuracy scores

4. Analysis Pipeline (HIGH)

No code for:

  • Per-item difficulty analysis
  • Human-baseline processing, including the described "attention checks"
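
As a point of reference, a per-item difficulty analysis could be as simple as the sketch below, which assumes a hypothetical responses_scored.jsonl log with an "item_id" and a "strict" score per response; no such log or script exists in the submission.

```python
# Hypothetical per-item difficulty analysis -- NOT provided in the submission.
# Assumes a responses_scored.jsonl with {"item_id": ..., "strict": 0 or 1} per record.
import json
from collections import defaultdict

per_item = defaultdict(list)
with open("responses_scored.jsonl", encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        per_item[rec["item_id"]].append(rec["strict"])

# Mean strict accuracy per item, hardest items first.
difficulty = sorted((sum(v) / len(v), k) for k, v in per_item.items())
for acc, item in difficulty[:10]:
    print(f"{item}\tstrict acc = {acc:.2f}")
```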

---

RECOMMENDATIONS

For Authors:

  1. IMMEDIATELY RELEASE: Complete evaluation pipeline including model querying, prompt templates, and metrics computation
  2. PROVIDE: Raw experimental data (stimulus images, model responses, accuracy scores)
  3. INCLUDE: Environment specification with exact dependency versions
  4. ADD: End-to-end reproduction script that goes from data files to paper results
  5. DOCUMENT: API keys setup, model checkpoint locations, computational requirements
  6. SPECIFY: Which font files were used for Chinese character rendering

For Reviewers:

  1. REQUEST: Complete code release before accepting paper
  2. VERIFY: That reported results can be recomputed from provided code
  3. CHECK: Availability of intermediate experimental data
  4. ASSESS: Whether "code release" actually enables reproduction or just shows partial implementation

---

CONCLUSION

Final Verdict: INCOMPLETE SUBMISSION - CRITICAL REPRODUCIBILITY ISSUES

The provided code represents approximately 30% of what would be needed for full reproducibility: stimulus generation is covered, while model evaluation, metrics computation, and analysis are entirely absent.

The code quality of what IS provided is good, but the ABSENCE of the evaluation pipeline means the reported results cannot be recomputed or independently verified from the released materials.

This is a CRITICAL issue for a research submission. While there's no evidence of fabricated results or malicious code, the submission fails basic reproducibility standards by omitting the core scientific infrastructure needed to generate the reported findings.

Recommendation: REQUIRE complete code release including evaluation pipeline before publication acceptance.

---

APPENDIX: FILES ANALYZED

sub_77/
├── 77_methods_results.md (5,664 bytes) - Paper methodology and results summary
├── 77_Visible_Yet_Unreadable_A_Sy.pdf (2.97 MB) - Full paper
└── character-fusion-generator-main/
    ├── cin-gen.py (15,167 bytes) - Chinese character fusion generator ✓
    ├── en-color-gradient-fusion.py (14,360 bytes) - English word fusion generator ✓
    ├── cin_100.txt (1,299 bytes) - 99 Chinese idioms ✓
    ├── en_100.txt (899 bytes) - 99 English words ✓
    └── README.md (6,496 bytes) - Documentation for generators ✓

Total Python code: 2 files, 692 lines
Missing: Evaluation pipeline (estimated 1,000-2,000 lines)
Missing: Analysis scripts (estimated 500-1,000 lines)
Missing: Experimental data artifacts

---

Audit completed: 2024
Methodology: Static code analysis, structural assessment, cross-referencing with paper claims
Tools: Manual code review, grep searches, structural analysis