CODE AUDIT REPORT
Submission ID: sub_77
Paper Title: "Visible Yet Unreadable: A Systematic Blind Spot of Vision-Language Models Across Writing Systems"
---
EXECUTIVE SUMMARY
Overall Assessment: CRITICAL ISSUES IDENTIFIED
This submission provides ONLY stimulus generation code without any evaluation pipeline, model testing infrastructure, or result computation code. The provided code can generate fused character images but contains NO mechanism to:
- Query VLMs (GPT-4o, Claude, Gemini, etc.)
- Collect model responses
- Compute accuracy metrics
- Generate the extensive results reported in the paper
Severity: CRITICAL - Results cannot be reproduced from provided code
---
DETAILED FINDINGS
1. COMPLETENESS & STRUCTURAL INTEGRITY: CRITICAL
#### Missing Components:
- NO model evaluation code: The paper reports comprehensive results for 10 different VLMs (including GPT-4o, GPT-5, Claude Opus 4.1, Claude Sonnet 4, Gemini 1.5 Pro/Flash, Qwen2-VL-7B, LLaVA-Mistral-7B, and LLaVA-Next-Vicuna-7B), but there is NO code to:
- Load these models
- Query them with the generated stimuli
- Handle API calls to proprietary models
- Process open-source models locally
- NO prompting infrastructure: Paper describes 3 prompting strategies for Chinese (Basic, Detailed, Contextual) and 2 for English (Basic, Detailed), but NO code implements these prompts or manages prompt variations (a hypothetical template module is sketched after this list)
- NO evaluation metrics: Paper reports:
- Strict Match accuracy
- Average Similarity (using difflib.SequenceMatcher)
- Exact Match accuracy
- Per-item difficulty analysis
- Cross-prompt comparisons
- NONE of this evaluation logic exists in the code
- NO experimental orchestration: Missing:
- Scripts to run experiments across all models
- Code to manage the "two-run averaging" mentioned in the paper
- Result aggregation and analysis pipelines
- Scripts to generate tables/figures from raw results
- NO human baseline code: Paper mentions "10 native readers per script" with "randomized order" and "attention checks" - no code for this experiment
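To make this gap concrete, the missing prompting layer would be small; a minimal sketch of what such a template module might look like, with hypothetical wording (the paper's actual prompt text appears nowhere in the code):

```python
# Hypothetical prompt templates. The paper names the strategies
# (Basic / Detailed / Contextual for Chinese; Basic / Detailed for
# English) but ships no wording, so the text below is illustrative only.
PROMPTS = {
    "zh": {
        "basic": "What four-character idiom is shown in this image?",
        "detailed": "The image shows four fused Chinese characters. "
                    "Identify each character and output the idiom.",
        "contextual": "This is a common Chinese chengyu rendered with "
                      "overlapping strokes. What is the idiom?",
    },
    "en": {
        "basic": "What eight-letter word is shown in this image?",
        "detailed": "The image shows an English word whose letters are "
                    "fused with a red/green color gradient. Spell it out.",
    },
}
```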
#### What IS Provided:
- cin-gen.py: Chinese character fusion generator (348 lines) - COMPLETE and FUNCTIONAL for its scope
- en-color-gradient-fusion.py: English word color gradient fusion generator (344 lines) - COMPLETE and FUNCTIONAL for its scope
- cin_100.txt: 99 Chinese idioms (one entry short of the stated 100)
- en_100.txt: 99 English words (one entry short of the stated 100)
- README.md: Documentation for stimulus generation only
#### Code Quality of Provided Files:
- Both Python files are well-structured with proper error handling
- Appropriate use of PIL, NumPy for image generation
- Command-line interfaces are well-designed
- Functions are modular and documented
- NO placeholder functions or TODOs
- NO hardcoded results
- Code appears genuinely functional for stimulus generation
2. RESULTS AUTHENTICITY RED FLAGS: CRITICAL
Major Concern: Results reported without corresponding evaluation code
The paper presents extensive quantitative results:
- Chinese idioms: Strict match 0.0-5.2%, Average similarity 0.5-24.4% across 10 models × 3 prompts = 30 experimental conditions
- English words: Exact match up to 20%, with specific per-item analysis (e.g., "hardware" at 0%, "keyboard" at 16.7-55.6%)
- Cross-model comparisons, prompt sensitivity analysis, per-item difficulty rankings
Red Flag: NONE of these results can be computed from the provided code. This creates significant reproducibility concerns:
- No verifiable computation path: Without evaluation code, there is no way to confirm that reported accuracies were computed programmatically rather than entered by hand
- Specific numeric claims lack support: The paper reports precise percentages (e.g., "GPT-4o: Strict 0.0-0.7%; Avg 5.2-11.1%") but no computation pipeline exists to generate them
- Missing data: The code can generate stimuli, but there is no evidence that the actual stimulus images used in the experiments were preserved or that model responses were logged
Mitigation: The stimulus generation code appears legitimate and could produce appropriate test stimuli. The ABSENCE of evaluation code doesn't prove results are fake, but it prevents verification.
3. IMPLEMENTATION-PAPER CONSISTENCY: HIGH SEVERITY
#### Discrepancies Found:
- Dataset Size Mismatch:
- Paper claims: "100 four-character idioms" and "100 eight-letter words"
- Actual files: 99 entries each (both cin_100.txt and en_100.txt)
- Impact: Minor - most likely an off-by-one counting error (e.g., counting with or without a header line)
- Fusion Mode Completeness:
- Paper describes: "horizontal (top/bottom halves), vertical (left/right halves), diagonal (main diagonal)"
- Code implements: 7 modes total (lr, tb, diag, checker, vstripes, hstripes, alpha)
- Assessment: Code provides MORE modes than the paper describes; the paper likely reports only the subset actually used
- Color Scheme Specification:
- Paper states: "first half rendered in red, second half in green"
- Code defaults: pure_red and pure_green schemes with gradients
- Assessment: Consistent, though the implementation is more sophisticated than the paper suggests
- Missing Methodology Details:
- Paper mentions "two-run averaging" for accuracies - NO code implements this (a sketch follows this list)
- Paper describes "attention checks" for human baseline - NO code exists
- Paper mentions "per-item difficulty analysis" - NO analysis code provided
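For reference, the averaging step itself is trivial; a sketch under the assumption that "two-run averaging" simply means the mean of the two per-condition accuracies (the paper does not define the procedure in code):

```python
import statistics

def two_run_average(run_a: dict, run_b: dict) -> dict:
    # Assumed interpretation of the paper's "two-run averaging":
    # run_a / run_b map a condition name to an accuracy in [0, 1],
    # and the reported number is their mean.
    return {cond: statistics.mean((run_a[cond], run_b[cond])) for cond in run_a}
```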
4. CODE QUALITY SIGNALS: LOW SEVERITY
Positive Indicators:
- Clean, well-organized code structure
- Proper imports, all from widely used libraries (PIL, NumPy, SciPy, tqdm) or the standard library (argparse)
- Good function decomposition and naming
- Appropriate error handling (e.g., checking word length, file existence)
- Comprehensive CLI with argparse
- JSONL output format for metadata tracking
- No dead code or excessive commenting
- No unused imports
Minor Issues:
- No version control artifacts (.git) visible
- No requirements.txt or environment specification
- Some magic numbers (thresholds, padding ratios) are left unexplained
- Chinese text processing relies on regex and Unicode range checks - functional, but it could be more robust (see the sketch below)
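As one possible hardening, the character checks could lean on unicodedata instead of hand-maintained regex ranges; a minimal sketch:

```python
import unicodedata

def is_cjk(ch: str) -> bool:
    # Use the Unicode character name rather than regex code-point ranges;
    # unicodedata.name raises ValueError for unnamed code points.
    try:
        return unicodedata.name(ch).startswith("CJK UNIFIED IDEOGRAPH")
    except ValueError:
        return False
```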
5. FUNCTIONALITY INDICATORS: MEDIUM SEVERITY
Stimulus Generation (Provided Code):
- ✓ Proper character rendering with dynamic font sizing
- ✓ Multiple fusion modes implemented
- ✓ Grid composition for multi-character layouts
- ✓ Metadata logging in JSONL format
- ✓ Batch processing with progress bars (tqdm)
- ✓ Configurable parameters via CLI
Evaluation Pipeline (MISSING):
- ✗ NO model loading code
- ✗ NO API integration for proprietary models
- ✗ NO prompt templates
- ✗ NO response collection
- ✗ NO accuracy computation
- ✗ NO result aggregation
- ✗ NO statistical analysis
Assessment: The roughly 30% of the needed code that exists is functional; the missing 70% is the scientifically critical evaluation infrastructure.
6. DEPENDENCY & ENVIRONMENT ISSUES: LOW SEVERITY
Missing Specifications:
- No requirements.txt or environment.yml
- No Python version specified (code uses modern type hints suggesting Python 3.8+)
Inferred Dependencies (from imports):
- PIL (Pillow)
- numpy
- scipy
- tqdm
- argparse (standard library)
- json (standard library)
- os, re, math, random, hashlib, unicodedata (standard library)
Assessment: All dependencies are standard and commonly available. However:
- The lack of version pinning could lead to reproducibility issues (a pinned requirements.txt is sketched below)
- No specification of the Chinese font files required for rendering
- Missing VLM-specific dependencies (transformers, torch, API clients)
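A pinned requirements.txt along the following lines would close the environment gap; the versions shown are illustrative, not taken from the submission:

```
Pillow==10.3.0
numpy==1.26.4
scipy==1.13.0
tqdm==4.66.4
```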
7. AGENT REPRODUCIBILITY ASSESSMENT
Search for AI Usage Documentation:
- Searched for: AI, GPT, Claude, ChatGPT, Copilot, "generated by", "prompt" in code comments
- Result: NO evidence of documented AI assistance or prompt logs in the code
- The README contains "Generated with ❤️ for text processing" but this refers to the tool's output, not how the code was created
AGENT REPRODUCIBLE: FALSE
No documentation of AI prompts or AI-assisted code generation process was found.
---
CRITICAL GAPS PREVENTING REPRODUCIBILITY
1. Model Evaluation Infrastructure (CRITICAL)
The complete absence of code to:
- Interface with 10 different VLMs
- Implement prompt variations
- Collect and store responses
- Handle API rate limiting, errors, and costs (a minimal querying layer is sketched after this list)
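For scale, the minimal shape of such a querying layer is small. A sketch for one provider using the official OpenAI Python client; the model name and prompt handling are placeholders, and the other models would need analogous adapters:

```python
import base64
from openai import OpenAI  # assumes the official openai>=1.0 client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query_vlm(image_path: str, prompt: str, model: str = "gpt-4o") -> str:
    # Encode the stimulus image as a base64 data URL, the format the
    # Chat Completions vision API accepts for inline images.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```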
2. Metrics Computation (CRITICAL)
No implementation of any of the following (the core metric functions are sketched after this list):
- Strict match accuracy for Chinese
- SequenceMatcher similarity scoring
- Exact match for English
- Per-item, per-model, per-prompt aggregations
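None of these metrics is complex; a sketch of the missing functions, assuming "strict match" means full-string equality after whitespace stripping and "exact match" is case-insensitive (neither is defined in code):

```python
from difflib import SequenceMatcher

def strict_match(pred: str, target: str) -> bool:
    # Assumed definition: exact equality after stripping whitespace.
    return pred.strip() == target.strip()

def avg_similarity(pred: str, target: str) -> float:
    # The paper names difflib.SequenceMatcher as its similarity measure.
    return SequenceMatcher(None, pred.strip(), target.strip()).ratio()

def exact_match(pred: str, target: str) -> bool:
    # Assumed English-word criterion: case-insensitive equality.
    return pred.strip().lower() == target.strip().lower()
```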
3. Experimental Data (HIGH)
Missing:
- Actual generated stimulus images used in experiments
- Model response logs
- Raw accuracy data before aggregation
- Human baseline experimental data
4. Analysis Pipeline (HIGH)
No code for:
- Statistical analysis
- Cross-model comparisons
- Prompt sensitivity analysis
- Per-item difficulty rankings (a sketch follows this list)
- Figure/table generation
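Likewise, per-item difficulty ranking is a few lines once response logs exist; a sketch over a hypothetical log schema (no such logs ship with the submission):

```python
from collections import defaultdict

def per_item_difficulty(records):
    # records: iterable of dicts like {"item": str, "correct": bool},
    # a hypothetical response-log schema.
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        totals[r["item"]] += 1
        hits[r["item"]] += int(r["correct"])
    # Sort by accuracy ascending, i.e., hardest items first.
    ranked = sorted(totals, key=lambda item: hits[item] / totals[item])
    return [(item, hits[item] / totals[item]) for item in ranked]
```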
---
RECOMMENDATIONS
For Authors:
- IMMEDIATELY RELEASE: Complete evaluation pipeline including model querying, prompt templates, and metrics computation
- PROVIDE: Raw experimental data (stimulus images, model responses, accuracy scores)
- INCLUDE: Environment specification with exact dependency versions
- ADD: End-to-end reproduction script that goes from data files to paper results (see the sketch after this list)
- DOCUMENT: API key setup, model checkpoint locations, and computational requirements
- SPECIFY: Which font files were used for Chinese character rendering
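As a concrete target for the reproduction script recommended above, a sketch of an end-to-end runner; query_vlm, PROMPTS, strict_match, and avg_similarity refer to the earlier sketches in this report, and the stimulus/answer layout is hypothetical:

```python
import glob
import json

def run_condition(model: str, lang: str, strategy: str, answers: dict):
    # One (model, language, prompt-strategy) condition: query every
    # stimulus image, score the response, and log one JSON record per item.
    out_path = f"results_{model}_{lang}_{strategy}.jsonl"
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(glob.glob(f"stimuli/{lang}/*.png")):
            pred = query_vlm(path, PROMPTS[lang][strategy], model=model)
            gold = answers[path]  # answers: hypothetical path -> label map
            out.write(json.dumps({
                "item": path,
                "pred": pred,
                "strict": strict_match(pred, gold),
                "similarity": avg_similarity(pred, gold),
            }, ensure_ascii=False) + "\n")
```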
For Reviewers:
- REQUEST: Complete code release before accepting paper
- VERIFY: That reported results can be recomputed from provided code
- CHECK: Availability of intermediate experimental data
- ASSESS: Whether the "code release" actually enables reproduction or merely shows a partial implementation
---
CONCLUSION
Final Verdict: INCOMPLETE SUBMISSION - CRITICAL REPRODUCIBILITY ISSUES
The provided code represents approximately 30% of what would be needed for full reproducibility:
- ✓ Stimulus generation: COMPLETE and FUNCTIONAL
- ✗ Model evaluation: COMPLETELY MISSING
- ✗ Metrics computation: COMPLETELY MISSING
- ✗ Results analysis: COMPLETELY MISSING
- ✗ Experimental data: MISSING
The code quality of what IS provided is good, but the ABSENCE of the evaluation pipeline means:
- Results cannot be verified programmatically
- Experiments cannot be reproduced
- Claims cannot be independently validated
This is a CRITICAL issue for a research submission. While there's no evidence of fabricated results or malicious code, the submission fails basic reproducibility standards by omitting the core scientific infrastructure needed to generate the reported findings.
Recommendation: REQUIRE complete code release including evaluation pipeline before publication acceptance.
---
APPENDIX: FILES ANALYZED
sub_77/
├── 77_methods_results.md (5,664 bytes) - Paper methodology and results summary
├── 77_Visible_Yet_Unreadable_A_Sy.pdf (2.97 MB) - Full paper
└── character-fusion-generator-main/
├── cin-gen.py (15,167 bytes) - Chinese character fusion generator ✓
├── en-color-gradient-fusion.py (14,360 bytes) - English word fusion generator ✓
├── cin_100.txt (1,299 bytes) - 99 Chinese idioms ✓
├── en_100.txt (899 bytes) - 99 English words ✓
└── README.md (6,496 bytes) - Documentation for generators ✓
Total Python code: 2 files, 692 lines
Missing: Evaluation pipeline (estimated 1,000-2,000 lines)
Missing: Analysis scripts (estimated 500-1,000 lines)
Missing: Experimental data artifacts
---
Audit completed: 2024
Methodology: Static code analysis, structural assessment, cross-referencing with paper claims
Tools: Manual code review, grep searches, structural analysis