CODE AUDIT REPORT
Submission ID: sub_77
Paper Title: "Visible Yet Unreadable: A Systematic Blind Spot of Vision-Language Models Across Writing Systems"
---
EXECUTIVE SUMMARY
Overall Assessment: CRITICAL ISSUES IDENTIFIED
This submission provides ONLY stimulus generation code without any evaluation pipeline, model testing infrastructure, or result computation code. The provided code can generate fused character images but contains NO mechanism to:
- Query VLMs (GPT-4o, Claude, Gemini, etc.)
- Collect model responses
- Compute accuracy metrics
- Generate the extensive results reported in the paper
Severity: CRITICAL - Results cannot be reproduced from provided code
---
DETAILED FINDINGS
1. COMPLETENESS & STRUCTURAL INTEGRITY: CRITICAL
#### Missing Components:
- NO model evaluation code: The paper reports comprehensive results for 10 different VLMs (including GPT-4o, GPT-5, Claude Opus 4.1, Claude Sonnet 4, Gemini 1.5 Pro/Flash, Qwen2-VL-7B, LLaVA-Mistral-7B, and LLaVA-Next-Vicuna-7B), but there is NO code to:
- Load these models
- Query them with the generated stimuli
- Handle API calls to proprietary models
- Process open-source models locally
- NO prompting infrastructure: Paper describes 3 prompting strategies for Chinese (Basic, Detailed, Contextual) and 2 for English (Basic, Detailed), but NO code implements these prompts or manages prompt variations (a hypothetical template module is sketched after this list)
- NO evaluation metrics: Paper reports:
- Strict Match accuracy
- Average Similarity (using difflib.SequenceMatcher)
- Exact Match accuracy
- Per-item difficulty analysis
- Cross-prompt comparisons
- NONE of this evaluation logic exists in the code
- NO experimental orchestration: Missing:
- Scripts to run experiments across all models
- Code to manage the "two-run averaging" mentioned in the paper
- Result aggregation and analysis pipelines
- Scripts to generate tables/figures from raw results
- NO human baseline code: Paper mentions "10 native readers per script" with "randomized order" and "attention checks" - no code for this experiment
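To make this gap concrete, the missing prompting layer would be small; a minimal sketch of what such a template module might look like, with hypothetical wording (the paper's actual prompt text appears nowhere in the code):

```python
# Hypothetical prompt templates. The paper names the strategies
# (Basic / Detailed / Contextual for Chinese; Basic / Detailed for
# English) but ships no wording, so the text below is illustrative only.
PROMPTS = {
    "zh": {
        "basic": "What four-character idiom is shown in this image?",
        "detailed": "The image shows four fused Chinese characters. "
                    "Identify each character and output the idiom.",
        "contextual": "This is a common Chinese chengyu rendered with "
                      "overlapping strokes. What is the idiom?",
    },
    "en": {
        "basic": "What eight-letter word is shown in this image?",
        "detailed": "The image shows an English word whose letters are "
                    "fused with a red/green color gradient. Spell it out.",
    },
}
```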
#### What IS Provided:
- cin-gen.py: Chinese character fusion generator (348 lines) - COMPLETE and FUNCTIONAL for its scope
- en-color-gradient-fusion.py: English word color gradient fusion generator (344 lines) - COMPLETE and FUNCTIONAL for its scope
- cin_100.txt: 99 Chinese idioms (one entry short of the stated 100)
- en_100.txt: 99 English words (one entry short of the stated 100)
- README.md: Documentation for stimulus generation only
#### Code Quality of Provided Files:
- Both Python files are well-structured with proper error handling
- Appropriate use of PIL, NumPy for image generation
- Command-line interfaces are well-designed
- Functions are modular and documented
- NO placeholder functions or TODOs
- NO hardcoded results
- Code appears genuinely functional for stimulus generation
2. RESULTS AUTHENTICITY RED FLAGS: CRITICAL
Major Concern: Results reported without corresponding evaluation code
The paper presents extensive quantitative results:
- Chinese idioms: Strict match 0.0-5.2%, Average similarity 0.5-24.4% across 10 models × 3 prompts = 30 experimental conditions
- English words: Exact match up to 20%, with specific per-item analysis (e.g., "hardware" at 0%, "keyboard" at 16.7-55.6%)
- Cross-model comparisons, prompt sensitivity analysis, per-item difficulty rankings
Red Flag: NONE of these results can be computed from the provided code. This creates significant reproducibility concerns:
- No verifiable computation path: Without evaluation code, there is no way to confirm that reported accuracies were computed programmatically rather than entered by hand
- Specific numeric claims lack support: The paper reports precise percentages (e.g., "GPT-4o: Strict 0.0-0.7%; Avg 5.2-11.1%") but no computation pipeline exists to generate them
- Missing data: The code can generate stimuli, but there is no evidence that the actual stimulus images used in the experiments were preserved or that model responses were logged
Mitigation: The stimulus generation code appears legitimate and could produce appropriate test stimuli. The ABSENCE of evaluation code doesn't prove results are fake, but it prevents verification.
3. IMPLEMENTATION-PAPER CONSISTENCY: HIGH SEVERITY
#### Discrepancies Found:
- Dataset Size Mismatch:
- Paper claims: "100 four-character idioms" and "100 eight-letter words"
- Actual files: 99 entries each (both cin_100.txt and en_100.txt)
- Impact: Minor - most likely an off-by-one counting error (e.g., counting with or without a header line)
- Fusion Mode Completeness:
- Paper describes: "horizontal (top/bottom halves), vertical (left/right halves), diagonal (main diagonal)"
- Code implements: 7 modes total (lr, tb, diag, checker, vstripes, hstripes, alpha)
- Assessment: Code provides MORE modes than the paper describes; the paper likely reports only the subset actually used
- Color Scheme Specification:
- Paper states: "first half rendered in red, second half in green"
- Code defaults: pure_red and pure_green schemes with gradients
- Assessment: Consistent, though the implementation is more sophisticated than the paper suggests
- Missing Methodology Details:
- Paper mentions "two-run averaging" for accuracies - NO code implements this (a sketch follows this list)
- Paper describes "attention checks" for human baseline - NO code exists
- Paper mentions "per-item difficulty analysis" - NO analysis code provided
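For reference, the averaging step itself is trivial; a sketch under the assumption that "two-run averaging" simply means the mean of the two per-condition accuracies (the paper does not define the procedure in code):

```python
import statistics

def two_run_average(run_a: dict, run_b: dict) -> dict:
    # Assumed interpretation of the paper's "two-run averaging":
    # run_a / run_b map a condition name to an accuracy in [0, 1],
    # and the reported number is their mean.
    return {cond: statistics.mean((run_a[cond], run_b[cond])) for cond in run_a}
```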
4. CODE QUALITY SIGNALS: LOW SEVERITY
Positive Indicators:
- Clean, well-organized code structure
- Proper imports, all from widely used libraries (PIL, NumPy, SciPy, tqdm) or the standard library (argparse)
- Good function decomposition and naming
- Appropriate error handling (e.g., checking word length, file existence)
- Comprehensive CLI with argparse
- JSONL output format for metadata tracking
- No dead code or excessive commenting
- No unused imports
Minor Issues:
- No version control artifacts (.git) visible
- No requirements.txt or environment specification
- Some magic numbers (thresholds, padding ratios) are left unexplained
- Chinese text processing relies on regex and Unicode range checks - functional, but it could be more robust (see the sketch below)
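As one possible hardening, the character checks could lean on unicodedata instead of hand-maintained regex ranges; a minimal sketch:

```python
import unicodedata

def is_cjk(ch: str) -> bool:
    # Use the Unicode character name rather than regex code-point ranges;
    # unicodedata.name raises ValueError for unnamed code points.
    try:
        return unicodedata.name(ch).startswith("CJK UNIFIED IDEOGRAPH")
    except ValueError:
        return False
```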
5. FUNCTIONALITY INDICATORS: MEDIUM SEVERITY
Stimulus Generation (Provided Code):
- ✓ Proper character rendering with dynamic font sizing
- ✓ Multiple fusion modes implemented
- ✓ Grid composition for multi-character layouts
- ✓ Metadata logging in JSONL format
- ✓ Batch processing with progress bars (tqdm)
- ✓ Configurable parameters via CLI
Evaluation Pipeline (MISSING):
- ✗ NO model loading code
- ✗ NO API integration for proprietary models
- ✗ NO prompt templates
- ✗ NO response collection
- ✗ NO accuracy computation
- ✗ NO result aggregation
- ✗ NO statistical analysis
Assessment: The roughly 30% of the needed code that exists is functional; the missing 70% is the scientifically critical evaluation infrastructure.
6. DEPENDENCY & ENVIRONMENT ISSUES: LOW SEVERITY
Missing Specifications:
- No requirements.txt or environment.yml
- No Python version specified (code uses modern type hints suggesting Python 3.8+)
Inferred Dependencies (from imports):
- PIL (Pillow)
- numpy
- scipy
- tqdm
- argparse (standard library)
- json (standard library)
- os, re, math, random, hashlib, unicodedata (standard library)
Assessment: All dependencies are standard and commonly available. However:
- The lack of version pinning could lead to reproducibility issues (a pinned requirements.txt is sketched below)
- No specification of the Chinese font files required for rendering
- Missing VLM-specific dependencies (transformers, torch, API clients)
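A pinned requirements.txt along the following lines would close the environment gap; the versions shown are illustrative, not taken from the submission:

```
Pillow==10.3.0
numpy==1.26.4
scipy==1.13.0
tqdm==4.66.4
```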
7. AGENT REPRODUCIBILITY ASSESSMENT
Search for AI Usage Documentation:
- Searched for: AI, GPT, Claude, ChatGPT, Copilot, "generated by", "prompt" in code comments
- Result: NO evidence of documented AI assistance or prompt logs in the code
- The README contains "Generated with ❤️ for text processing" but this refers to the tool's output, not how the code was created
AGENT REPRODUCIBLE: FALSE
No documentation of AI prompts or AI-assisted code generation process was found.
---
CRITICAL GAPS PREVENTING REPRODUCIBILITY
1. Model Evaluation Infrastructure (CRITICAL)
The complete absence of code to:
- Interface with 10 different VLMs
- Implement prompt variations
- Collect and store responses
- Handle API rate limiting, errors, and costs (a minimal querying layer is sketched after this list)
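For scale, the minimal shape of such a querying layer is small. A sketch for one provider using the official OpenAI Python client; the model name and prompt handling are placeholders, and the other models would need analogous adapters:

```python
import base64
from openai import OpenAI  # assumes the official openai>=1.0 client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query_vlm(image_path: str, prompt: str, model: str = "gpt-4o") -> str:
    # Encode the stimulus image as a base64 data URL, the format the
    # Chat Completions vision API accepts for inline images.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```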
2. Metrics Computation (CRITICAL)
No implementation of any of the following (the core metric functions are sketched after this list):
- Strict match accuracy for Chinese
- SequenceMatcher similarity scoring
- Exact match for English
- Per-item, per-model, per-prompt aggregations
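None of these metrics is complex; a sketch of the missing functions, assuming "strict match" means full-string equality after whitespace stripping and "exact match" is case-insensitive (neither is defined in code):

```python
from difflib import SequenceMatcher

def strict_match(pred: str, target: str) -> bool:
    # Assumed definition: exact equality after stripping whitespace.
    return pred.strip() == target.strip()

def avg_similarity(pred: str, target: str) -> float:
    # The paper names difflib.SequenceMatcher as its similarity measure.
    return SequenceMatcher(None, pred.strip(), target.strip()).ratio()

def exact_match(pred: str, target: str) -> bool:
    # Assumed English-word criterion: case-insensitive equality.
    return pred.strip().lower() == target.strip().lower()
```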
3. Experimental Data (HIGH)
Missing:
- Actual generated stimulus images used in experiments
- Model response logs
- Raw accuracy data before aggregation
- Human baseline experimental data
4. Analysis Pipeline (HIGH)
No code for:
- Statistical analysis
- Cross-model comparisons
- Prompt sensitivity analysis
- Per-item difficulty rankings (a sketch follows this list)
- Figure/table generation
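Likewise, per-item difficulty ranking is a few lines once response logs exist; a sketch over a hypothetical log schema (no such logs ship with the submission):

```python
from collections import defaultdict

def per_item_difficulty(records):
    # records: iterable of dicts like {"item": str, "correct": bool},
    # a hypothetical response-log schema.
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        totals[r["item"]] += 1
        hits[r["item"]] += int(r["correct"])
    # Sort by accuracy ascending, i.e., hardest items first.
    ranked = sorted(totals, key=lambda item: hits[item] / totals[item])
    return [(item, hits[item] / totals[item]) for item in ranked]
```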
---
RECOMMENDATIONS
For Authors:
- IMMEDIATELY RELEASE: Complete evaluation pipeline including model querying, prompt templates, and metrics computation
- PROVIDE: Raw experimental data (stimulus images, model responses, accuracy scores)
- INCLUDE: Environment specification with exact dependency versions
- ADD: End-to-end reproduction script that goes from data files to paper results (see the sketch after this list)
- DOCUMENT: API key setup, model checkpoint locations, and computational requirements
- SPECIFY: Which font files were used for Chinese character rendering
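As a concrete target for the reproduction script recommended above, a sketch of an end-to-end runner; query_vlm, PROMPTS, strict_match, and avg_similarity refer to the earlier sketches in this report, and the stimulus/answer layout is hypothetical:

```python
import glob
import json

def run_condition(model: str, lang: str, strategy: str, answers: dict):
    # One (model, language, prompt-strategy) condition: query every
    # stimulus image, score the response, and log one JSON record per item.
    out_path = f"results_{model}_{lang}_{strategy}.jsonl"
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(glob.glob(f"stimuli/{lang}/*.png")):
            pred = query_vlm(path, PROMPTS[lang][strategy], model=model)
            gold = answers[path]  # answers: hypothetical path -> label map
            out.write(json.dumps({
                "item": path,
                "pred": pred,
                "strict": strict_match(pred, gold),
                "similarity": avg_similarity(pred, gold),
            }, ensure_ascii=False) + "\n")
```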
For Reviewers:
- REQUEST: Complete code release before accepting paper
- VERIFY: That reported results can be recomputed from provided code
- CHECK: Availability of intermediate experimental data
- ASSESS: Whether the "code release" actually enables reproduction or merely shows a partial implementation
---
CONCLUSION
Final Verdict: INCOMPLETE SUBMISSION - CRITICAL REPRODUCIBILITY ISSUES
The provided code represents approximately 30% of what would be needed for full reproducibility:
- ✓ Stimulus generation: COMPLETE and FUNCTIONAL
- ✗ Model evaluation: COMPLETELY MISSING
- ✗ Metrics computation: COMPLETELY MISSING
- ✗ Results analysis: COMPLETELY MISSING
- ✗ Experimental data: MISSING
The code quality of what IS provided is good, but the ABSENCE of the evaluation pipeline means:
- Results cannot be verified programmatically
- Experiments cannot be reproduced
- Claims cannot be independently validated
This is a CRITICAL issue for a research submission. While there's no evidence of fabricated results or malicious code, the submission fails basic reproducibility standards by omitting the core scientific infrastructure needed to generate the reported findings.
Recommendation: REQUIRE complete code release including evaluation pipeline before publication acceptance.
---
APPENDIX: FILES ANALYZED
sub_77/
├── 77_methods_results.md (5,664 bytes) - Paper methodology and results summary
├── 77_Visible_Yet_Unreadable_A_Sy.pdf (2.97 MB) - Full paper
└── character-fusion-generator-main/
├── cin-gen.py (15,167 bytes) - Chinese character fusion generator ✓
├── en-color-gradient-fusion.py (14,360 bytes) - English word fusion generator ✓
├── cin_100.txt (1,299 bytes) - 99 Chinese idioms ✓
├── en_100.txt (899 bytes) - 99 English words ✓
└── README.md (6,496 bytes) - Documentation for generators ✓
Total Python code: 2 files, 692 lines
Missing: Evaluation pipeline (estimated 1,000-2,000 lines)
Missing: Analysis scripts (estimated 500-1,000 lines)
Missing: Experimental data artifacts
---
Audit completed: 2024
Methodology: Static code analysis, structural assessment, cross-referencing with paper claims
Tools: Manual code review, grep searches, structural analysis